PG CREATEINDEX CONCURRENTLY

官方说法

根据9.1的文档

Creating an index can interfere with regular operation of a database. Normally PostgreSQL locks the table to be indexed against writes and performs the entire index build with a single scan of the table. Other transactions can still read the table, but if they try to insert, update, or delete rows in the table they will block until the index build is finished. This could have a severe effect if the system is a live production database. Very large tables can take many hours to be indexed, and even for smaller tables, an index build can lock out writers for periods that are unacceptably long for a production system.

PostgreSQL supports building indexes without locking out writes. This method is invoked by specifying the CONCURRENTLY option of CREATE INDEX. When this option is used, PostgreSQL must perform two scans of the table, and in addition it must wait for all existing transactions that could potentially use the index to terminate. Thus this method requires more total work than a standard index build and takes significantly longer to complete. However, since it allows normal operations to continue while the index is built, this method is useful for adding new indexes in a production environment. Of course, the extra CPU and I/O load imposed by the index creation might slow other operations.

正常的create index会锁表阻止所有写操作,但是变通的方法是 create index concurrently。当然这样做会慢。PG会启动两次表扫描并且等待现存的所有事务结束。

In a concurrent index build, the index is actually entered into the system catalogs in one transaction, then two table scans occur in two more transactions. Any transaction active when the second table scan starts can block concurrent index creation until it completes, even transactions that only reference the table after the second table scan starts. Concurrent index creation serially waits for each old transaction to complete using the method outlined in section Section 45.56.

在concurrent build的过程中,索引实际上进入system catalogs,然后另外启动两个table scan。任何在第二个table scan开始时活动的事务都会block concurrent index build,甚至事务只是reference 了那个表。Concurrently index creation等待所有上述事务结束,用的方法是45.56节中。(基本上依赖pg_locks:The view pg_locks provides access to information about the locks held by open transactions within the database server. )

If a problem arises while scanning the table, such as a uniqueness violation in a unique index, the CREATE INDEX command will fail but leave behind an "invalid" index. This index will be ignored for querying purposes because it might be incomplete; however it will still consume update overhead. The psql \d command will report such an index as INVALID:

postgres=# \d tab

Table "public.tab" Column | Type | Modifiers

--------+---------+----------- col | integer | Indexes:

"idx" btree (col) INVALID

The recommended recovery method in such cases is to drop the index and try again to perform CREATE INDEX CONCURRENTLY. (Another possibility is to rebuild the index with REINDEX. However, since REINDEX does not support concurrent builds, this option is unlikely to seem attractive.)

如果build失败了,就会有一个invalid的index,这个index查询的时候用不上,但是update的时候还要更新它。推荐做法是drop之后重新create index。

Another caveat when building a unique index concurrently is that the uniqueness constraint is already being enforced against other transactions when the second table scan begins. This means that constraint violations could be reported in other queries prior to the index becoming available for use, or even in cases where the index build eventually fails. Also, if a failure does occur in the second scan, the "invalid" index continues to enforce its uniqueness constraint afterwards.

有一个问题:如果是一个unique的index的话,unique约束已经对其他事务可见了。这就意味着其他事务中可能在索引可用之前就报constraint violation,哪怕索引建失败了。另外,如果在第二次扫描的时候出错,这个invalid的索引依然会强加一个unique的constraint。

Concurrent builds of expression indexes and partial indexes are supported. Errors occurring in the evaluation of these expressions could cause behavior similar to that described above for unique constraint violations.

expression和partial index同样支持,但是还是有上述的unique的问题。

Regular index builds permit other regular index builds on the same table to occur in parallel, but only one concurrent index build can occur on a table at a time. In both cases, no other types of schema modification on the table are allowed meanwhile. Another difference is that a regular CREATE INDEX command can be performed within a transaction block, but CREATE INDEX CONCURRENTLY cannot.

正常的索引容许在同一个表上同时建多个索引,currently build的index不行,同时只容许一个。另外一个区别是create index可以在事务中但是create index concurrently不行。

上述的缺陷本质上是因为,create index concurrently 包含了事务。

代码分析

一个正常的create index流程应该是,DefineIndex->index_create->index_build。

index_build

/*
* index_build - invoke access-method-specific index build procedure
*
* On entry, the index's catalog entries are valid, and its physical disk
* file has been created but is empty. We call the AM-specific build
* procedure to fill in the index contents. We then update the pg_class
* entries of the index and heap relation as needed, using statistics
* returned by ambuild as well as data passed by the caller.
*
* isprimary tells whether to mark the index as a primary-key index.
* isreindex indicates we are recreating a previously-existing index.
*
* Note: when reindexing an existing index, isprimary can be false even if
* the index is a PK; it's already properly marked and need not be re-marked.
*
* Note: before Postgres 8.2, the passed-in heap and index Relations
* were automatically closed by this routine. This is no longer the case.
* The caller opened 'em, and the caller should close 'em.
*/
void
index_build(Relation heapRelation,
Relation indexRelation,
IndexInfo *indexInfo,
bool isprimary,
bool isreindex)
```
这个函数的参数大概是本文里面最少的一个,比较单纯。重要的东西见注释。
### index_create
这个函数完成create index的工作。
1. 校验错误,
2. build元数据数据,并插入pg_class. index,constraint 等元数据表。
3. 如果是正常的索引:调用index_build真的建索引;否则只更新元数据表。
### DefineIndex
create index语句会辗转调到DefineIndex里面来,这个函数的参数很多,其中专门有个参数是bool concurrent。就是控制是不是concurrently create index的。实际上控制concurrently build的逻辑大部分都在这个函数里。
#### 各种校验
目前没有指的关注的。
#### index create
在各种校验之后,
```
/*
* Make the catalog entries for the index, including constraints. Then, if
* not skip_build || concurrent, actually build the index.
*/
indexRelationId =
index_create(rel, indexRelationName, indexRelationId,
indexInfo, indexColNames,
accessMethodId, tablespaceId,
collationObjectId, classObjectId,
coloptions, reloptions, primary,
isconstraint, deferrable, initdeferred,
allowSystemTableMods,
skip_build || concurrent,
concurrent);
```
注意倒数第二行,暗示index_create不要真的去建索引。
后面的流程,一遍看代码一遍解释吧
```
if (!concurrent)
{
/* Close the heap and we're done, in the non-concurrent case */
heap_close(rel, NoLock);
return;
}
```
正常直接退出。下面都是concurrently的流程。
#### index concurrently create
```
/* save lockrelid and locktag for below, then close rel */
heaprelid = rel->rd_lockInfo.lockRelId;
SET_LOCKTAG_RELATION(heaplocktag, heaprelid.dbId, heaprelid.relId);
heap_close(rel, NoLock);
```
这个时候,table的锁已经释放,其他事务不会被block住了。
```
/*
* For a concurrent build, it's important to make the catalog entries
* visible to other transactions before we start to build the index. That
* will prevent them from making incompatible HOT updates. The new index
* will be marked not indisready and not indisvalid, so that no one else
* tries to either insert into it or use it for queries.
*
* We must commit our current transaction so that the index becomes
* visible; then start another. Note that all the data structures we just
* built are lost in the commit. The only data we keep past here are the
* relation IDs.
*
* Before committing, get a session-level lock on the table, to ensure
* that neither it nor the index can be dropped before we finish. This
* cannot block, even if someone else is waiting for access, because we
* already have the same lock within our transaction.
*
* Note: we don't currently bother with a session lock on the index,
* because there are no operations that could change its state while we
* hold lock on the parent table. This might need to change later.
*/
```
关于上面这段注释,先解释两个词,它们不是一个英文单词,它们是pg_index的两个字段。
1. indisvalid bool If true, the index is currently valid for queries. False means the index is possibly incomplete: it must still be modified by INSERT/UPDATE operations, but it cannot safely be used for queries. If it is unique, the uniqueness property is not true either.
1. indisready bool If true, the index is currently ready for inserts. False means the index must be ignored by INSERT/UPDATE operations. 这里涉及到了session level lock的概念,这个锁不会block其他的insert update之类。
```
LockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock); PopActiveSnapshot();
CommitTransactionCommand();
StartTransactionCommand();
```
这里是第一次commit。注意,ShareUpdateExclusiveLock不阻止insert/update/delete
```
/*
* Phase 2 of concurrent index build (see comments for validate_index()
* for an overview of how this works)
*
* Now we must wait until no running transaction could have the table open
* with the old list of indexes. To do this, inquire which xacts
* currently would conflict with ShareLock on the table -- ie, which ones
* have a lock that permits writing the table. Then wait for each of
* these xacts to commit or abort. Note we do not need to worry about
* xacts that open the table for writing after this point; they will see
* the new index when they open it.
*
* Note: the reason we use actual lock acquisition here, rather than just
* checking the ProcArray and sleeping, is that deadlock is possible if
* one of the transactions in question is blocked trying to acquire an
* exclusive lock on our table. The lock code will detect deadlock and
* error out properly.
*
* Note: GetLockConflicts() never reports our own xid, hence we need not
* check for that. Also, prepared xacts are not reported, which is fine
* since they certainly aren't going to do anything more.
*/
old_lockholders = GetLockConflicts(&heaplocktag, ShareLock);
```
取得当前正在运行的事务
```
while (VirtualTransactionIdIsValid(*old_lockholders))
{
VirtualXactLockTableWait(*old_lockholders);
old_lockholders++;
} /*
* At this moment we are sure that there are no transactions with the
* table open for write that don't have this new index in their list of
* indexes. We have waited out all the existing transactions and any new
* transaction will have the new index in its list, but the index is still
* marked as "not-ready-for-inserts". The index is consulted while
* deciding HOT-safety though. This arrangement ensures that no new HOT
* chains can be created where the new tuple and the old tuple in the
* chain have different index keys.
*
* We now take a new snapshot, and build the index using all tuples that
* are visible in this snapshot. We can be sure that any HOT updates to
* these tuples will be compatible with the index, since any updates made
* by transactions that didn't know about the index are now committed or
* rolled back. Thus, each visible tuple is either the end of its
* HOT-chain or the extension of the chain is HOT-safe for this index.
*/
```
仔细读注释
```
/* Open and lock the parent heap relation */
rel = heap_openrv(heapRelation, ShareUpdateExclusiveLock); /* And the target index relation */
indexRelation = index_open(indexRelationId, RowExclusiveLock); /* Set ActiveSnapshot since functions in the indexes may need it */
PushActiveSnapshot(GetTransactionSnapshot()); /* We have to re-build the IndexInfo struct, since it was lost in commit */
indexInfo = BuildIndexInfo(indexRelation);
Assert(!indexInfo->ii_ReadyForInserts);
indexInfo->ii_Concurrent = true;
indexInfo->ii_BrokenHotChain = false; /* Now build the index */
index_build(rel, indexRelation, indexInfo, primary, false); /* Close both the relations, but keep the locks */
heap_close(rel, NoLock);
index_close(indexRelation, NoLock); /*
* Update the pg_index row to mark the index as ready for inserts. Once we
* commit this transaction, any new transactions that open the table must
* insert new entries into the index for insertions and non-HOT updates.
*/
pg_index = heap_open(IndexRelationId, RowExclusiveLock); indexTuple = SearchSysCacheCopy1(INDEXRELID,
ObjectIdGetDatum(indexRelationId));
if (!HeapTupleIsValid(indexTuple))
elog(ERROR, "cache lookup failed for index %u", indexRelationId);
indexForm = (Form_pg_index) GETSTRUCT(indexTuple); Assert(!indexForm->indisready);
Assert(!indexForm->indisvalid); indexForm->indisready = true; simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
CatalogUpdateIndexes(pg_index, indexTuple); heap_close(pg_index, RowExclusiveLock); /* we can do away with our snapshot */
PopActiveSnapshot(); /*
* Commit this transaction to make the indisready update visible.
*/
CommitTransactionCommand();
StartTransactionCommand();
```
第一次scan结束,这个时候index已经对上层可见了。
```
/*
* Phase 3 of concurrent index build
*
* We once again wait until no transaction can have the table open with
* the index marked as read-only for updates.
*/
old_lockholders = GetLockConflicts(&heaplocktag, ShareLock); while (VirtualTransactionIdIsValid(*old_lockholders))
{
VirtualXactLockTableWait(*old_lockholders);
old_lockholders++;
}
```
again,等所有不认识index的事务结束。
```
/*
* Now take the "reference snapshot" that will be used by validate_index()
* to filter candidate tuples. Beware! There might still be snapshots in
* use that treat some transaction as in-progress that our reference
* snapshot treats as committed. If such a recently-committed transaction
* deleted tuples in the table, we will not include them in the index; yet
* those transactions which see the deleting one as still-in-progress will
* expect such tuples to be there once we mark the index as valid.
*
* We solve this by waiting for all endangered transactions to exit before
* we mark the index as valid.
*
* We also set ActiveSnapshot to this snap, since functions in indexes may
* need a snapshot.
*/
snapshot = RegisterSnapshot(GetTransactionSnapshot());
PushActiveSnapshot(snapshot); /*
* Scan the index and the heap, insert any missing index entries.
*/
validate_index(relationId, indexRelationId, snapshot);
```
第二次扫描结束,这个时候索引可用了。
```
/*
* The index is now valid in the sense that it contains all currently
* interesting tuples. But since it might not contain tuples deleted just
* before the reference snap was taken, we have to wait out any
* transactions that might have older snapshots. Obtain a list of VXIDs
* of such transactions, and wait for them individually.
*
* We can exclude any running transactions that have xmin > the xmin of
* our reference snapshot; their oldest snapshot must be newer than ours.
* We can also exclude any transactions that have xmin = zero, since they
* evidently have no live snapshot at all (and any one they might be in
* process of taking is certainly newer than ours). Transactions in other
* DBs can be ignored too, since they'll never even be able to see this
* index.
*
* We can also exclude autovacuum processes and processes running manual
* lazy VACUUMs, because they won't be fazed by missing index entries
* either. (Manual ANALYZEs, however, can't be excluded because they
* might be within transactions that are going to do arbitrary operations
* later.)
*
* Also, GetCurrentVirtualXIDs never reports our own vxid, so we need not
* check for that.
*
* If a process goes idle-in-transaction with xmin zero, we do not need to
* wait for it anymore, per the above argument. We do not have the
* infrastructure right now to stop waiting if that happens, but we can at
* least avoid the folly of waiting when it is idle at the time we would
* begin to wait. We do this by repeatedly rechecking the output of
* GetCurrentVirtualXIDs. If, during any iteration, a particular vxid
* doesn't show up in the output, we know we can forget about it.
*/
old_snapshots = GetCurrentVirtualXIDs(snapshot->xmin, true, false,
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_old_snapshots); for (i = 0; i < n_old_snapshots; i++)
{
if (!VirtualTransactionIdIsValid(old_snapshots[i]))
continue; /* found uninteresting in previous cycle */ if (i > 0)
{
/* see if anything's changed ... */
VirtualTransactionId *newer_snapshots;
int n_newer_snapshots;
int j;
int k; newer_snapshots = GetCurrentVirtualXIDs(snapshot->xmin,
true, false,
PROC_IS_AUTOVACUUM | PROC_IN_VACUUM,
&n_newer_snapshots);
for (j = i; j < n_old_snapshots; j++)
{
if (!VirtualTransactionIdIsValid(old_snapshots[j]))
continue; /* found uninteresting in previous cycle */
for (k = 0; k < n_newer_snapshots; k++)
{
if (VirtualTransactionIdEquals(old_snapshots[j],
newer_snapshots[k]))
break;
}
if (k >= n_newer_snapshots) /* not there anymore */
SetInvalidVirtualTransactionId(old_snapshots[j]);
}
pfree(newer_snapshots);
} if (VirtualTransactionIdIsValid(old_snapshots[i]))
VirtualXactLockTableWait(old_snapshots[i]);
}
```
请读注释,基本意思是在做validate_index,这个时间可预见的会很长,so需要修改这段时间的数据,让它正确。
```
/*
* Index can now be marked valid -- update its pg_index entry
*/
pg_index = heap_open(IndexRelationId, RowExclusiveLock); indexTuple = SearchSysCacheCopy1(INDEXRELID,
ObjectIdGetDatum(indexRelationId));
if (!HeapTupleIsValid(indexTuple))
elog(ERROR, "cache lookup failed for index %u", indexRelationId);
indexForm = (Form_pg_index) GETSTRUCT(indexTuple); Assert(indexForm->indisready);
Assert(!indexForm->indisvalid); indexForm->indisvalid = true; simple_heap_update(pg_index, &indexTuple->t_self, indexTuple);
CatalogUpdateIndexes(pg_index, indexTuple); heap_close(pg_index, RowExclusiveLock); /*
* The pg_index update will cause backends (including this one) to update
* relcache entries for the index itself, but we should also send a
* relcache inval on the parent table to force replanning of cached plans.
* Otherwise existing sessions might fail to use the new index where it
* would be useful. (Note that our earlier commits did not create reasons
* to replan; relcache flush on the index itself was sufficient.)
*/
CacheInvalidateRelcacheByRelid(heaprelid.relId); /* we can now do away with our active snapshot */
PopActiveSnapshot(); /* And we can remove the validating snapshot too */
UnregisterSnapshot(snapshot); /*
* Last thing to do is release the session-level lock on the parent table.
*/
UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);
```
让索引对查询可见。
#### validate_index
```
* validate_index() works by first gathering all the TIDs currently in the
* index, using a bulkdelete callback that just stores the TIDs and doesn't
* ever say "delete it". (This should be faster than a plain indexscan;
* also, not all index AMs support full-index indexscan.) Then we sort the
* TIDs, and finally scan the table doing a "merge join" against the TID list
* to see which tuples are missing from the index. Thus we will ensure that
* all tuples valid according to the reference snapshot are in the index.
*
* Building a unique index this way is tricky: we might try to insert a
* tuple that is already dead or is in process of being deleted, and we
* mustn't have a uniqueness failure against an updated version of the same
* row. We could try to check the tuple to see if it's already dead and tell
* index_insert() not to do the uniqueness check, but that still leaves us
* with a race condition against an in-progress update. To handle that,
* we expect the index AM to recheck liveness of the to-be-inserted tuple
* before it declares a uniqueness error.
```
如注释所示,第二次扫面,实际上执行了一个tid的收集以及一个merge join操作。 ## 实验
本实验的目的在于测试,concurrent对性能的影响
### 机器配置
普通台式机,PG 8.4,默认配置。
### 数据格式
```sql
CREATE TABLE test2 (
did integer PRIMARY KEY,
val float,
name varchar(40)
);
insert into test2 select generate_series(1, 10000000), random()*100., md5(random()::text);
create index i2v on test2(val);
create index concurrently i2vc on test2(val);
```
### 结果
| 元组个数| val | concurrently on value | name | concurrently on name |
| :--: | :--:| :--: | :--: | :--: |
| 1,000,000 | 4597.145 ms | 8678.217 ms | 13179.993 ms | 17327.401 ms |
| 10,000,000 | 58315.700 ms | 102393.426 ms |164888.703 | 208067.063 ms |
| 20,000,000 | 117594.772 ms | 235908.396 ms | 167957.984 ms | 488424.852 ms | ### 小结
在一下条件下:
1. 上述数据的取得是在没有查询的情况下得出的。
1. 平均来看,使用concurrently create index的时间是不适用concurently build时间的2倍上下。

PG CREATEINDEX CONCURRENTLY的更多相关文章

  1. 简析服务端通过GT导入SHP至PG的方法

    文章版权由作者李晓晖和博客园共有,若转载请于明显处标明出处:http://www.cnblogs.com/naaoveGIS/ 1.背景 项目中需要在浏览器端直接上传SHP后服务端进行数据的自动入PG ...

  2. PG 中 JSON 字段的应用

    13 年发现 pg 有了 json 类型,便从 oracle 转 pg,几年下来也算比较熟稔了,总结几个有益的实践. 用途一:存储设计时无法预料的文档性的数据.比如,通常可以在人员表准备一个 json ...

  3. pg gem 安装(postgresql94)

    使用下面命令安装报错 gem install pg 错误: [root@AS-test middle_database]# gem install pgBuilding native extensio ...

  4. #pg学习#postgresql的安装

    1.按照官网给的步骤编译安装(Mac安装是比较容易的,相比Liunx) cd /Users/renlipeng/Desktop/postgresql-9.5.1 ./configure --prefi ...

  5. PG 函数的易变性(Function Volatility Categories)

    此概念的接触是在做分区表的时候碰到的,分区表按时间字段分区,在查询时当where条件中时间为now()或者current_time()等时是无法查询的,即使进行格式转换也不行,只有是时间格式如‘201 ...

  6. mysql 序列与pg序列的比较

    mysql序列(这里只谈innodb引擎): 在使用mysql的AUTO_INCREMENT时,使用AUTO_INCREMENT的字段必须建有索引,也可以为索引的一部分.当没有索引时会报错:      ...

  7. 使用zfs进行pg的pitr恢复测试

    前段时间做了一下zfs做pg的增量恢复测试,mark一下. 服务器信息: 主机:192.168.173.43 备机:192.168.173.41 主备使用流复制搭建,在备机上面进行了zfs快照备份. ...

  8. PG, Pool之间的一些数量关系

    先说一下我的环境: Ceph cluster中包含6台OSD节点 (osd.0 - 5), 一共有10个Pool (0 - 9), 这些Pool共享了144个PG (这个数字是所有Pool的PG_SI ...

  9. ruby on rails gem install pg时无法安装

    gem install pg -v '0.18.2' Building native extensions. This could take a while... ERROR: Error insta ...

随机推荐

  1. linux根文件系统制作

    在嵌入式中移植的内核下载到开发板上,是没有办法真正的启动Linux操作系统的,会出现无法加载文件系统的错误. 那么根文件系统在系统启动中到底是什么时候挂载的呢?先将/dev/ram0挂载,而后执行/l ...

  2. Squid 和 SNAT 上网

    方法1 , squid (具体配置如之前的记录) 代理机上安装squid软件 apt-get install squid service squid3 start 在被代理机上配置代理环境 vi /e ...

  3. Sql Server Analysis Service 转换为UnknownMember的正确设置 (转载)

    转载: http://www.cnblogs.com/OpenCoder/p/4754447.html#commentform 在SSAS中事实表数据被归类到为UnknownMember 的时候分为两 ...

  4. exiv2 如何改变时间戳

    有很多c/c++库可以实现jpeg的exif信息的读取,比如libexif或者CxImage等.但如果我们对jpeg进行有关处理后,想在写入文件时保留原有的exif信息(比如拍摄时间等)就有点棘手了. ...

  5. [2014.01.27]WFsoft.wfLibrary.wfVerifyImage 1.8

    全新开发的验证码图片控件--WFsoft.wfLibrary.wfVerifyImage,使用简单,功能强大.     完全支持.net 2.0编写,对下一代操作系统平稳过渡.     同时支持.ne ...

  6. Comparable和Comparator实现对象比较

    由文生义: 继承Comparable ,表示该类的实例是可以相互比较的; 继承Comparator,表示该类是一个比较器,里面设置了按什么属性比较,list需要按这个比较器里的规则来比较; 使用方法如 ...

  7. Angular中的Ajax

    //我们使用Ajax访问本地或者同域名下的数据或者文件module.controller('InTheatersController',['$scope','$http', function($sco ...

  8. TRIGGER_15.8.3BACKUP

    1 CREATE OR REPLACE TRIGGER "XMV502"."ADDAREA12" before insert on bd_areacl for ...

  9. delphi locate多字段查询

    简单格式: IF MSQ_NewBillQuantity.Locate('FStockID;FMarchID', VarArrayOf([FStockID, FMarchID]), []) = Fal ...

  10. 挖掘微信Web版通信的全过程 [转]

    昨天是周末,在家闲得无聊,于是去weiphone.com逛了一圈,偶然发现有人发了一帖叫<微信 for Mac>, 这勾起了我的好奇心,国内做Mac开发的人确实很少,对于那些能够独自开发一 ...