GUID vs INT Debate【转】
http://blogs.msdn.com/b/sqlserverfaq/archive/2010/05/27/guid-vs-int-debate.aspx
I recently read a blog post on what was better using GUIDs or Integer values. This is been an age long debate and there are advocates in both camps stressing on the disadvantages of the other. Well both implementations have their advantages and disadvantages. At the outset, I shall mention that the answer to this debate is: IT DEPENDS! J
It is highly dependent on your database design, migration needs and overall architecture. There is a good reason why SQL Server replication uses GUIDs to track the changes to the replicated articles. So, it is not that the usage to GUIDs is necessarily a bad practice. SQL Server Books Online lists the following disadvantages for uniqueidentifier data type:
- The values are long and obscure. This makes them difficult for users to type correctly, and more difficult for users to remember.
- The values are random and cannot accept any patterns that may make them more meaningful to users.
- There is no way to determine the sequence in which uniqueidentifier values were generated. They are not suited for existing applications that depend on incrementing key values serially.
- At 16 bytes, the uniqueidentifier data type is relatively larger than other data types, such as 4-byte integers. This means indexes that are built using uniqueidentifier keys might be relatively slower than indexes using an int key.
If you are using NEWID function in SQL Server, then this generates random UUIDs which have a huge domain but the chances of GUID collisions are always there though the probability is very slim in nature. If you are using NEWID function to generate uniqueidentifiers as row identifiers in your table, then you need to think again! Uniqueness of the row should be enforced using aUnique or Primary Key constraint on the table. NewSequentialID function uses identification number of the computer network card plus a unique number from the CPU clock to generate the uniqueidentifier (Reference article). So the chance of getting a globally unique value is practically guaranteed as long as the machine has a network card. Moreover, possibility of a GUID collision while using NewSequentialID is virtually impossible.
I shall use a simple example of how database operations can be affected when using the following as clustered indexes:
- Random GUIDs (using NEWID function)
- Sequential GUIDs (using NewSequentialGUID function)
- BIGINT identity value
- INT identity value
The last three would always introduce a hot-spot in your database table when concurrent inserts are occurring on the table as the INSERT operations would always insert into the last page of the table as the data for a clustered index is ordered in ascending order by default. This however is a discussion for a different post altogether. The tables that I have created are quite narrow and typically the average record size would be small:
create table tblGUID (RecID uniqueidentifier default newid(), Record varchar(255), ModifiedTime datetime, SessionID int)
create table tblSeqGUID (RecID uniqueidentifier default newsequentialid(), Record varchar(255), ModifiedTime datetime, SessionID int)
create table tblINT (RecID bigint identity(1,1), Record varchar(255),ModifiedTime datetime, SessionID int)
create table tblBigINT (RecID int identity(1,1), Record varchar(255),ModifiedTime datetime, SessionID int)
The clustered index for all the above tables are defined on the RecID column. The tables have no other indexes defined on them. I created a WHILE loop to insert 1 million records in each table and here are the times taken:
tblGUID: 22 seconds
tblSequential GUID: 16 seconds
tblBIGINT: 23 seconds
tblINT: 23 seconds
Given that you have a beefy server, the above time difference would not make much of a difference unless and until you only have a high number of concurrent INSERT workload on the server or during a Data Load operation which would cause a significant impact. What is interesting to note is that the fragmentation on the tables after the first batch of 1 million inserts.
|
Object Name |
Index Name |
Pages |
Average Record Size |
Extents |
Average Page Density |
Logical Fragmentation |
Extent Fragmentation |
|
tblGUID |
cidx_tblGUID |
9608 |
51.89 |
1209.00 |
69.27 |
99.14 |
0.25 |
|
tblSeqGUID |
cidx_tblSeqGUID |
6697 |
51.89 |
845.00 |
99.39 |
0.76 |
0.12 |
|
tblBigINT |
cidx_tblBigINT |
5671 |
43.89 |
714.00 |
99.95 |
0.48 |
0.14 |
|
tblINT |
cidx_tblINT |
5194 |
39.89 |
653.00 |
99.62 |
0.37 |
0.15 |
If you look at the above data, you will see that the random GUIDs have 99% logical fragmentation in the tables. This is due to the random nature of the GUIDs generated which end up causing high number of page splits in the database.
Since GUIDs require more storage than integers, you will notice the number of pages for the tables using GUIDs have higher page counts. After inserting another million records in the above tables, I inserted 10,000 records (in batches of 100) in the above tables and tracked the times for the inserts at a more granular level:
tblGuid : 23 seconds (10 ms to 1120 ms)
tblSeqGuid : 0 seconds (3 ms to 10 ms)
tblBigINT : 0 seconds (3 ms to 13 ms)
tblINT : 0 seconds (3 ms to 20 ms)
You will find that the minimum and maximum times mentioned above show a high degree of variation for the random GUID table. The rest of the times are more or less comparable. The time taken for an insert is dependent on the range that the newly generated GUID falls into. The above data clearly shows that using random GUIDs can not only create fragmentation and page splits during Insert/Update operations.
Since NewSequentialID function uses the network card identifier to generate a new value, it might not be preferred where privacy is a concern because it is possible to guess the value of the next generated GUID and, therefore, access data associated with that GUID.
Now when I run a simple SELECT query against the above tables (using a cold buffer cache), I find the a COUNT(*) for the entire table shows the following IO statistics:
Table 'tblGUID'. Scan count 9, logical reads 20942, physical reads 236, read-ahead reads 20564, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 375 ms, elapsed time = 6212 ms.
Table 'tblSeqGUID'. Scan count 9, logical reads 14519, physical reads 383, read-ahead reads 14216, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 466 ms, elapsed time = 1442 ms.
Table 'tblBigINT'. Scan count 9, logical reads 12229, physical reads 287, read-ahead reads 12007, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 374 ms, elapsed time = 1039 ms.
Table 'tblINT'. Scan count 9, logical reads 11144, physical reads 252, read-ahead reads 10983, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times: CPU time = 392 ms, elapsed time = 908 ms.
You will see that the number of pages read for the table using random GUIDs is the highest due to the high fragmentation on the tables. After I rebuilt the clustered index of the tblGUID table, the same query with a cold buffer cache only performed 14K+ reads which is comparable to the tblSeqGUID table.
Table 'tblGUID'. Scan count 9, logical reads 14376, physical reads 143, read-ahead reads 10642, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
However if you have frequent inserts in the table, you cannot keep the fragmentation down to a negligible value.
I have summarized what we discussed above in the table below:
|
Criteria |
GUIDs |
Seq. GUIDs |
BIGINT |
INT |
|
Storage |
16 bytes |
16 bytes |
8 bytes |
4 bytes |
|
Insert/Update performance |
Slowest |
Comparable but the index keys are larger in size. For wider tables, this would be slower than Integer values. |
Faster than sequential GUIDs |
Fastest |
|
Hotspot contention |
Very rare |
Yes |
Yes |
Highest, due to smaller size of RIDs |
|
Fragmentation/Page Splits |
High |
Minimal |
Minimal |
Minimal |
|
JOIN Performance/SORT operations |
Least performance (Rank 4 = Least performance) |
Better than random GUIDs due lesser fragmentation (Rank: 3) |
High performance (Rank: 2) |
High Performance (Rank: 1) |
|
Logical reads |
Rank 4=Highest |
Rank 3 |
Rank 2 |
Rank 1=Least |
|
Merging data across servers |
Beneficial |
Beneficial |
Difficult |
Difficult |
|
Uniqueness |
Rare chance of duplicates |
Globally unique. Virtually no chance of collisions |
Limited by range of BIGINT |
Limited by range of INT |
|
|
The above considerations make the use of GUIDs unfavorable for a clustered index in environments which have large number of queries performing JOIN operations in OLTP and when referential integrity is enforced on the database among multiple tables. Throw non-clustered indexes that you created on the table as covering indexes for the frequently run queries against the database, you can have a significant performance bottleneck.
So hopefully the above points will help you choose an appropriate column for the clustered index on your table. The other strategy is to use natural keys. With this strategy, you do not use any kind of manufactured key, but instead use a business key to uniquely identify records. For example, a table that is used to store customer information might use the social security number column as the primary key instead of an identity column. The drawback to this approach is that the primary key might become large if more than one column is required to uniquely identify a record. Furthermore, this compound key must be propagated to other tables to support one or more foreign key relationships. These relationships, in turn, adversely affect join performance.
Based on the above considerations, you should be able to take an educated decision whether to use an integer, business key or GUIDs for your primary key of your table.
As a parting note, I would like to say that there is no short cut to testing. You would need to test the design that you decide to implement with a workload which is as close to your production workload to make sure that your design is performing as expected.
System Specifications used for the test:
Operating System: Microsoft Windows Server 2008 R2 Enterprise (x64)
Machine: Hewlett-Packard HP Z800 Workstation
Processor: 2 Quad Core [Intel(R) Xeon(R) CPU E5506 @ 2.13GHz, 2128 Mhz, 4 Core(s), 4 Logical Processor(s)]
RAM: 16.0 GB
Hard Disk: Barracuda 7200.12 SATA 3Gb/s 500GB Hard Drive
SQL Server: SQL Server 2008 R2
Regards,
Amit Banerjee
SEE, SQL Support
GUID vs INT Debate【转】的更多相关文章
- 【转载】GUID vs INT Debate
I recently read a blog post on what was better using GUIDs or Integer values. This is been an age lo ...
- Guid和Int还有Double、Date的ToString方法的常见格式(转载)
Guid的常见格式: 1.Guid.NewGuid().ToString("N") 结果为: 38bddf48f43c48588e0d78761eaa1ce6 2.Gu ...
- Guid和Int还有Double、Date的ToString方法的常见格式
Guid的常见格式: 1.Guid.NewGuid().ToString("N") 结果为: 38bddf48f43c48588e0d78761eaa1ce6 2.Gu ...
- Guid vs Int
http://www.uml.org.cn/net/200512283.htm http://www.cnblogs.com/twodays/archive/2004/07/19/25562.aspx ...
- guid转int
如果你想生成一个数字序列,你将会获得一个19位长的序列. 下面的方法会把GUID转换为Int64的数字序列. private static long GenerateIntID() { ...
- 使用Guid做主键和int做主键性能比较
使用Guid做主键和int做主键性能比较 在数据库的设计中我们常常用Guid或int来做主键,根据所学的知识一直感觉int做主键效率要高,但没有做仔细的测试无法 说明道理.碰巧今天在数据库的优化过程中 ...
- Sql server 数据库 int 和guid 两者的比较
我们公司的数据库全部是使用GUID做主键的,很多人习惯使用int做主键.所以呢,这里总结一下,将两种数据类型做主键进行一个比较. 使用INT做主键的优点: 1.需要很小的数据存储空间,仅仅需要4 by ...
- Guid与id区别
一.产生Guid的方法 1.SqlServer中使用系统自带的NEWID函数: select NEWID() 2.C#中,使用Guid类型的NewGuid方法: Guid gid; ...
- SQL GUID和自增列做主键的优缺点
我们公司的数据库全部是使用GUID做主键的,很多人习惯使用int做主键.所以呢,这里总结一下,将两种数据类型做主键进行一个比较. 使用INT做主键的优点: 1.需要很小的数据存储空间,仅仅需要4 by ...
随机推荐
- list map vector set 常用函数列表
#include <stdio.h> #include <iostream>//cin,cout #include <sstream>//ss transfer. ...
- oracle 性能优化建议小结
原则一:注意WHERE子句中的连接顺序: ORACLE采用自下而上的顺序解析WHERE子句,根据这个原理,表之间的连接必须写在其他WHERE条件之前, 那些可以过滤掉最大数量记录的条件必须写在WHER ...
- caffe中的filler.hpp源码的作用:
filler.hpp文件:(它应该没有对应的.cpp文件,一切实现都是在头文件中定义的,可能是因为filler只分在网络初始化时用到那么一次吧) 1,首先定义了基类:Filler,它包括:一个纯虚函数 ...
- PHP SPL标准库之SplFixedArray使用实例
SplFixedArray主要是处理数组相关的主要功能,与普通php array不同的是,它是固定长度的,且以数字为键名的数组,优势就是比普通的数组处理更快. 看看我本机的Benchmark测试: i ...
- BigDecimal类型比较大小
这个类是java里精确计算的类 1 比较对象是否相等 一般的对象用equals,但是BigDecimal比较特殊,举个例子: BigDecimal a=BigDecimal.value ...
- 上传本地文件到HDFS
源代码: import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hado ...
- vbox下android分辨率设置
VBoxManage setextradata "android" "CustomVideoMode1" "1280x800x16" 1. ...
- Spring框架的反序列化远程代码执行漏洞分析(转)
欢迎和大家交流技术相关问题: 邮箱: jiangxinnju@163.com 博客园地址: http://www.cnblogs.com/jiangxinnju GitHub地址: https://g ...
- Windows日志查看工具合集
欢迎关注我的社交账号: 博客园地址: http://www.cnblogs.com/jiangxinnju GitHub地址: https://github.com/jiangxincode 知乎地址 ...
- J2EE相关总结
Java Commons The Java™ Tutorials: http://docs.oracle.com/javase/tutorial/index.html Java Platform, E ...