学习笔记：ALTERing a Huge MySQL Table - 对一个超大表做alter调整

The Problem
Future Solutions
Overview of Solution
Shortcut
Assumptions/Restrictions/Complications
Database Layer
The Alter
One Iteration (Copy one Clump)
Layer's INSERTs
Helper Table: Migrate
Tuning on the fly
Deactivation
Alternative - Log table
Alternative - Dual Master
Postlog

Brought to you by Rick James

The Problem

You have a "huge" table, and you need to ALTER it. By "huge" I mean that the ALTER would take an unacceptable amount of time.

Things you might want to do. and for which this is well suited:
    ⚈  Add/drop columns
    ⚈  Modify datatypes (INT -> BIGINT, SIGNED -> UNSIGNED)
    ⚈  Modify character set / collation of column(s)
    ⚈  Add/drop indexes
    ⚈  Add/drop/modify PARTITIONing
    ⚈  Normalize/Denormalize
    ⚈  Change Engine (MyISAM to InnoDB)

⚈添加/删除列
    ⚈修改数据类型（INT - > BIGINT，SIGNED - > UNSIGNED）
    ⚈修改列的字符集/排序
    ⚈添加/删除索引
    ⚈添加/删除/修改PARTITIONing
    ⚈标准化/非规范化
    ⚈更改引擎（MyISAM到InnoDB）

Future Solutions

Percona's pt-online-schema-change
(aka pt-osc) can do an ALTER with very little downtime. It does, however, require adding a TRIGGER to the table.

gh-ost
is a new and promising competitor to pt-online-schema-change; it uses the binlog.

The ALTER in version 5.6.7 can do lots of ALTERs without blocking other operations. For a list of what 5.7 can and cannot do via ALGORITHM=INPLACE, see 5.7 ALTER TABLE
版本5.6.7中的ALTER可以执行大量ALTER而不会阻止其他操作。有关5.7可通过ALGORITHM = INPLACE执行和不可执行的操作的列表，请参阅 5.7更改表

If you don't have the latest version, or you can't use Percona's solution, read on.
如果您没有最新版本，或者您无法使用Percona的解决方案，请继续阅读。

Overview of Solution

1.  Build the 'Alter' script to copy rows into a new table (in clumps of rows)
    2.  Push code to add a Database Layer between client(s) and the database
    3.  Push code to augment the Database Layer to handle the effect of the Alter
    4.  Turn on the Alter
    5.  Push code to deactivate the Alter

The Layer may already be in place; it is 'best practice'. However, after seeing what is ahead (below), you may want to clean up the Layer some. One example is to change a BEGIN...COMMIT into a single API call (if practical). If this leads to a cleaner API, then that helps the API. Later, when you get to step 3, life will be simpler for other reasons.

Steps 2 and 3 are separate because I am assuming you also cannot afford screwups. Step 2 focuses on the API from Client to Layer; Step 3 focuses on tweaking the Layer for this Alter -- they are separable focuses, and is safer to think about only one at a time.

"Turning on the Alter" is essentially running the script (Step 1) to copy the table over. This may take hours or days.

The guiding principles...
    ⚈  If the existing table continues to be read/written by the identical code as before, all queries on it should continue to work correctly.
    ⚈  At all times, rows with id <= $left_off (a "highwater mark") will be correctly transformed, inserted, updated, etc.
    ⚈  At the end, when $left_off is at the end of the table, the transformation is complete.

Shortcut

Most of the rest of this discussion centers on the complex case of a table that is being modified -- any row could be UPDATEd, INSERTs could go anywhere into the table. And it assumes a single machine, or a single Master.

Some likely special cases are covered near the end ("Alternative...")

Assumptions/Restrictions/Complications

Assumptions
    ⚈  This discussion assumes you can walk through the original table using an explicit PRIMARY or UNIQUE key.
    ⚈  Single-column (not 'compound') key is used to walk through the table.
    ⚈  You have enough disk space to simultaneously hold both the original table and the new table(s).
    ⚈  There is enough 'wiggle room' in the performance of the databases so that the overhead of this process (100%?) can be handled.
    ⚈  INSERTs are single-row, UPDATEs are single-table, DELETEs are single-row
    ⚈  You can modify the Layer to change how INSERTs, etc, operate.
    ⚈  SELECTs need no modification. (Nice, eh?)
    ⚈  UPDATE statements do not modify the column being used to walk through the table.
    ⚈  INSERT..ON DUPLICATE KEY UPDATE.. is not used
    ⚈  INSERT IGNORE does not depend on a secondary UNIQUE key
    ⚈  There are no FOREIGN KEYs in the definition of this table, nor any in other tables referencing this table.
    ⚈  Only one table is involved
    ⚈  You have all the write operations (no ad hoc queries from users)
    ⚈  Self-joins

假设
    ⚈本讨论假定您可以使用显式PRIMARY或UNIQUE键遍历原始表。
    ⚈单列（非'复合'）键用于遍历表格。
    ⚈您有足够的磁盘空间来同时保存原始表和新表。
    ⚈数据库的性能有足够的“摆动空间”，这样就可以处理这个过程（100％？）的开销。
    ⚈INSERT是单行，UPDATE是单表，DELETE是单行
    ⚈您可以修改Layer以更改INSERT等的操作方式。
    ⚈选择不需要修改。（很好，嗯？）
    ⚈UPDATE语句不会修改用于遍历表的列。
    ⚈   INSERT..ON DUPLICATE密钥更新..不使用
    ⚈   INSERT IGNORE不依赖于二次UNIQUE键
    ⚈没有外键在该表的定义，也没有任何其它表中引用此表。
    ⚈只涉及一个表
    ⚈您拥有所有写操作（没有来自用户的临时查询）
    ⚈自连接

Applicability
    ⚈  Engine -- The outline given here should work equally well for MyISAM or InnoDB.
    ⚈  Replication -- You can perform this on the read/write Master and have it propagate gracefully to the slaves.
    ⚈  Cluster (Galera, etc) -- It should work.

适用性
    ⚈引擎 - 此处给出的大纲应该对MyISAM或InnoDB同样有效。
    ⚈复制 - 您可以在读/写主服务器上执行此操作，并使其正常传播到从服务器。
    ⚈群集（Galera等） - 它应该工作。

What to do if assumptions are not met?
    ⚈  No PRIMARY/UNIQUE key -- A non-unique key can be used, but this may lead to arbitrarily large locks on the table.
    ⚈  No explicit keys at all -- Punt. (LIMIT+OFFSET is not viable!)
    ⚈  Inadequate disk space -- Punt. Get space first.
    ⚈  If the system gets bogged down, two dynamic tunables can be tweaked to slow down the Alter and make it less invasive.
    ⚈  Multi-row INSERT -- design the Layer's API to make it easy to determine which rows are below $left_off.
    ⚈  Multi-table UPDATE -- This is a challenge -- especially 'self-joins' that interrogate later parts of the table to decide on changes to earlier parts.
    ⚈  DELETE .. IN(...) -- pass the IN list in an array for easy handling
    ⚈  DELETE .. WHERE (multi-row) -- no problem
    ⚈  UPDATE that changes the key -- You must recognize the situation and write some messy code. (It would be handy if the API helped make this obvious.)
    ⚈  INSERT..ON DUPLICATE KEY UPDATE.. -- Don't know.
    ⚈  INSERT IGNORE -- The problem is that it might get INSERTed before the conflicting record is inserted. No workaround available.
    ⚈  FOREIGN KEYS -- have not investigated
    ⚈  Extra tables for normalization -- not a big deal
    ⚈  ad hoc write queries -- must disallow for the duration of the Alter.
    ⚈  Compound index -- The code is more complex, but doable. See
Iterating through a compound key

如果不符合假设该怎么办？
    ⚈无PRIMARY / UNIQUE键 - 可以使用非唯一键，但这可能会导致表上任意大的锁定。
    ⚈根本没有明确的钥匙 - Punt。（LIMIT + OFFSET不可行！）
    ⚈磁盘空间不足 - Punt。首先获得空间。
    ⚈如果系统陷入困境，可以调整两个动态可调参数来减慢Alter的速度并减少其侵入性。
    ⚈多行INSERT - 设计Layer的API，以便于确定哪些行低于$ left_off。
    ⚈多表更新 - 这是一个挑战 - 尤其是“自联接”，它会询问表格的后续部分，以决定对早期部分的更改。
    ⚈DELETE.. IN（...） - 传递数组中的IN列表以便于处理
    ⚈删除.. WHERE（多行） - 没问题
    ⚈更新密钥的更新 - 您必须认识到情况和写一些凌乱的代码。（如果API帮助使这明显这将是得心应手。）
    ⚈   INSERT..ON重复键UPDATE .. -不知道。
    ⚈   INSERT IGNORE -问题是，插入冲突的记录之前，它可能会插入。没有可用的解决方法。
    ⚈ 外国钥匙   - 尚未调查
    ⚈标准化的额外表格 - 不是什么大问题
    ⚈临时写入查询 - 必须在Alter期间不允许。
    ⚈复合索引 - 代码更复杂，但可行。请参阅
迭代复合键

Database Layer

It is "best practice" to have a "Layer" between your clients and your database. SQL statements are only known to the Layer, not to the clients. By having the Layer, you are segregating "business logic" from "database details". For the purposes here, the "database details" will be changing; it is better to have such code isolated.

The Layer would be called from the Client with calls like "Insert this stuff...". If the 'stuff' needs massaguing to be compatible with MySQL (eg, timestamps), then the Layer is the 'right' place to do the conversion. The Layer is also a good place to hide any "normalization" and "lookup" tables -- clients should no care whether a 'name' is stored in the table directly, or normalized into another table and only an id is put into the "Fact" table.

Keep the API clean and simple; hide any database complexity in the Layer. And, for this task, it will get complex.

For the task at hand, we will depend on the Layer to make it easy to migrate from one table to another, and to do that without any real knowledge of "business logic". This ignorance of the client side of things makes it easier and safer to write the code and have confidence that it is correct.

The Alter

This is a script that does the conversion by copying the data into a new table, which already has the 'new' schema elements (added columns, dropped indexes, etc).

⚈  CREATE TABLE -- the new table, with all new indexes/datatypes, etc.
    ⚈  Change the row in the helper table Migrate (running=1, etc) (below)
    ⚈  Loop (below)
    ⚈  Deactivate (running=0)
    ⚈  RENAME TABLE existing TO old, new TO existing;

Each iteration does:
    ⚈  Migrates a "clump" of, say, 100 rows from the old table to the new.
    ⚈  Any transforms (normalization, datatype conversion, etc) are done for those rows.
    ⚈  Locks out all 'write' operations while handling the clump. (More later)

The starting values for Migrate:
    ⚈  running = 1
    ⚈  clump_size = 100 (1000 if existing table is InnoDB with no secondary keys); <100 if lots of indexes)
    ⚈  delay = 1 (second)
    ⚈  left_off_* -- whatever key is less than first value (eg, 0 or '')

The Loop should watch for running out of rows. When that happens, it needs to do the Deactivate and RENAME _before_ the Unlock.

One Iteration (Copy one Clump)

⚈  Fetch the next row from Migrate
    ⚈  Find the key of the 100th row after where you 'left_off'.
    ⚈  Lock (exclusive)
    ⚈  INSERT INTO new SELECT * FROM existing WHERE id > $left_off AND id <= $hundredth_hence;
    ⚈  If no more rows, Deactivate and RENAME
    ⚈  Update row in Migrate
    ⚈  Unlock
    ⚈  Sleep 1 second (dynamically tunable)

Notes:
    ⚈  The "Find" step is outside the Lock/Unlock so as to minimize the time in lock state.
    ⚈  Risk: with the Find step outside, you could end up with more than 100 rows to move. (Probably not a big deal.)
    ⚈  Lock/Unlock -- an exclusive LOCK on the Migrate table. Note that all INSERTs/UPDATEs/etc must also Lock that table
    ⚈  The SELECT step should include any transforms needed, build normalization records, and whatever else is required.

Layer's INSERTs

The database Layer is vital for giving us a simple way to modify all INSERT/UPDATE/REPLACE/DELETE statements. All client write operations must be going through the Layer. (SELECTs should go through the Layer, but that does not matter for this discussion.)

I will discuss 'simple' write operations only.

Around every atomic operation, you need to add the following:
    ⚈  Lock (read-lock) on Migrate
    ⚈  Fetch the row from Migrate
    ⚈  If not running, then skip most of these steps. (Alter has not yet started, or has finished.)
    ⚈  Perform the SQL statement on old table
    ⚈  Modify statement to include AND id <= $left_off, and perform it on new table.
    ⚈  Unlock

For "transactions" in BEGIN..COMMIT, either make the whole transaction a single API call, or carefully coordinate the Lock/Unlock with BEGIN..COMMIT. BEGIN should map to the Lock and Fetch steps; COMMIT should map to the Unlock step.

REPLACE should be split into what it actually does -- a DELETE and an INSERT. Both of these sound be inside a single Lock/Unlock pair.

INSERT .. ON DUPLICATE KEY UPDATE .., especially if including a SELECT can probably be broken into two steps as discussed in [staging_table#normalization][High Speed Ingestion]]

As noted earlier, some multi-row and multi-table write operations get more complicated.

Helper Table: Migrate

This table has only 1 row.

      CREATE TABLE Migrate (

         running TINYINT UNSIGNED NOT NULL DEFAULT '0',

         clump_size INT UNSIGNED NOT NULL DEFAULT '100',

         delay FLOAT NOT NULL DEFAULT '1',

         left_off_1 ... (INT, VARCHAR, etc, matching first field in KEY being used)

         left_off_2 ... (more field(s), if needed)

         clumps_moved INT UNSIGNED NOT NULL DEFAULT '0',

         rows_moved BIGINT UNSIGNED NOT NULL DEFAULT '0',

         lock_time DOUBLE NOT NULL DEFAULT '0',

         move_time DOUBLE NOT NULL DEFAULT '0',

         sleep_time DOUBLE NOT NULL DEFAULT '0',

         last_move TIMESTAMP NOT NULL

      ) ENGINE=MyISAM;

Discussion:
    ⚈  running is 1 while Alter is running. INSERTs, etc, check this to know whether to INSERT into the new table.
    ⚈  clump_size says how many rows to move per iteration; "100" is used as an example. This can be dynamically tuned.
    ⚈  delay is how long to sleep(), in seconds, between iterations; 1 second is a good first guess. This is tunable.
    ⚈  left_off_* contain the value(s) of the last key used by the last clump. (Initialize to 0 or '', or whatever is less than first key.)
    ⚈  clumps_moved and rows_moved tally the progress.
    ⚈  the *_time fields keep track of the seconds consumed by the phases of the loop.
    ⚈  last_move is a timestamp, overwritten at the end of each iteration. At the end of Alter, it is the finish time. (If the value is too long ago, but not finished, maybe it is broken.)
    ⚈  Engine=MyISAM was chosen to avoid any risk of involving this in BEGIN..COMMIT transactions.
    ⚈  Replicating the table is optional; it has no function on Slaves.
    ⚈  It was not intended to JOIN this table with any other; I don't know if "left_off_*" is useful.

The table must be CREATEd before the Layer has the code in it that looks at the table. The imporant value (at that point) is running=0.

Tuning on the fly

What to watch...
    ⚈  In Slaves, keep an eye on Seconds_behind_master (or a heartbeat if you have such)
    ⚈  (lock_time / clumps_moved) -- This should be close to zero; if it is high, then Client processes are not getting enough time.
    ⚈  (move_time / clumps_moved) -- The average time that it takes to move a clump. This impacts latency of Client write operations.
    ⚈  sleep_time versus move_time -- which is getting more time, the Clients, or the Alter.

The following two values in the Migrate table can be changed at will; changes take effect at the start of the next clump.

clump_size (nominally 100) effectively controls how long INSERTs and other writes are blocked at a time. Decrease during busy times; increase during lulls.

delay (nominally 1 second; declared FLOAT), controls the gap between clumps. It is intended to let other operations get in. Also, it somewhat controls the overloading of the replication stream. Increase delay if replication gets behind.

Deactivation

After the Alter is finished, the code automatically "deactivates" (by setting running=0) itself. No further writes are needed to the new table (which has been RENAMEd out of existence anyway).

However, the Layer is still checking Migrate, but this is no longer necessary. To clean up the code, you need another code push. This code push should be no more complicated than turning off a flag (a code flag, not the dynamic 'running' flag). The flag should have been built in when Alter is originally written. It will control whether to look at the Migrate table, or simply follow the "not running" path and skip the lock and unlock.

I recommend leaving these vestiges of the code in place; you are likely to do another Alter in the future. Meanwhile, the code will be sufficiently efficient. (A few "if false" statements are insignifant.)

Alternative - Log table

If the table to be Altered is a simple, append-only, "log", then the job can be done a lot simpler. There is no need for the Database Layer (though that might be best practice). The Migrate script can probably be run faster.

⚈  Migrate 1K-10K records per iteration, no extra locking needed.
    ⚈  Key being used must be the one for which new rows are sent to the 'end' of the table.
    ⚈  The final step that involves the RENAME gets tricky.

The only step that needs careful design (and I don't have the details ironed out...)

Plan A. When the Migrate script discovers that there are no rows left to move, it LOCKs the existing and new tables, does the RENAME, and UNLOCKs. (Could some INSERT slip in? Especially for InnoDB?)

Plan B. When the Migrate script discovers that there are no rows left to move, it does the RENAME, then checks for any extra rows that snuck in. If there are extra rows, it moves them. (What happens to any INSERTs that are trying to run when the RENAME occurs? Maybe they wait? Maybe they die? Does your code automatically retry in that case?)

Alternative - Dual Master

If you are set up with a pair of Masters replicating from each other (for quick recovery from a crash), then this is a much simpler approach. Or it is simpler for the developers -- it can be done entirely by the SEs.

The idea is to do some ALTER on offline server(s), failover, then do the rest of the ALTER(s).

Caveat: This only applies if the replication stream will not be confused by a Master having the 'old' schema, and the a Slave having the 'new' schema. (Note: RBR will probably be messed up by COLUMN changes, but not INDEX changes?) If any INSERT or UPDATE that would be confused, then you need to use the main approach so that you can have two flavors of statement and know which to apply to the existing versus the new tables.

Caution: Replication will need to be stopped or replication will be stuck, perhaps for as long as the ALTER takes. If the ALTER takes a week, you need enough disk space for a week's worth of binlogs and/or relay logs.

Caution: A really big table could take _days_ to ALTER. There will be multiple (different machines), non-overlaping, ALTERs. Plan accordingly.

Plan A -- Allow the ALTER to replicate, but carefully staged: The idea is to run the ALTER on the Backup Master, and do the failover just when it is ready to start on the Live Master. You can spot that time by knowing when the ALTER finishes on the Backup Master. If the ALTER turns out to be "wrong", fail back to the original Live Master and lick your wounds. Now you have the equivalent of the Backup Master having crashed.

Plan B -- Manually do one machine at a time. This requires SET SESSION SQL_LOG_BIN=0; in front of the ALTER each time.
    1.  One at a time, take a Slave out of rotation and do the ALTER on it. Do multiple Slaves at a time, but not more than you can afford to have offline (Out Of Rotation) simultaneously.
    2.  Ditto for the Backup Master. Note: Any slaves connected to this server will become 'stale', so should be offline.
    3.  Failover to the Backup Master
    4.  ALTER what had been the Live Master. Again note that its slaves should be offline for the length of the ALTER.

Caveat: The Backup Master will be unavailable for failover during the ALTER. However, you can abort the ALTER if you need to failover.

Caveat: Any machine offline should probably have replication turned on (though stuck) so that the I/O thread is receiving updates. This should avoid having binlogs purged out from under you. However, it means that the receiving machine will have a pileup in the relay logs -- be sure to have enough disk space for such.

Postlog

Original writing -- Feb, 2011; Refreshed: Sep, 2017

-- Rick James