How Google Backs Up The Internet Along With Exabytes Of Other Data
Raymond Blum leads a team of Site Reliability Engineers charged with keeping Google's data secret and keeping it safe.
Of course Google would never say how much data this actually is, but from comments it seems that it is not yet a yottabyte, but is many exabytes in
size. GMail alone is approaching low exabytes of data.
Mr. Blum, in the video How Google Backs Up the Internet, explained common backup strategies don’t work for Google for a very googly
sounding reason: typically they scale effort with capacity. If backing up twice as much data requires twice as much stuff to do it, where stuff is time, energy, space, etc., it won’t work, it doesn’t scale. You have to find efficiencies so
that capacity can scale faster than the effort needed to support that capacity. A different plan is needed when making the jump from backing up one exabyte to backing up two exabytes. And the talk is largely about how Google makes that happen.
Some major themes of the talk:
- No data loss, ever. Even the infamous GMail outage did not lose data, but the story is more complicated than just a lot of tape backup. Data was retrieved from across the stack, which requires engineering at every level, including the human.
- Backups are useless. It's the restore you care about. It's a restore system, not a backup system. Backups are a tax you pay for the luxury of a restore. Shift work to backups and make them as complicated as needed to make restores so simple a cat could do it.
- You can't scale linearly. You can't have 100 times as much data require 100 times the people or machine resources. Look for force multipliers. Automation is the major way of improving utilization and efficiency.
- Redundancy in everything. Google stuff fails all the time. It's crap, the same way cells in our body die. Google doesn't dream that things don't die. It plans for it.
- Diversity in everything. If you are worried about site locality, put data in multiple sites. If you are worried about user error, have levels of isolation from user interaction. If you want protection from a software bug, put it on different software. Store stuff on different vendor gear to reduce the effect of large vendor bugs.
- Take humans out of the loop. How many copies of an email are kept by GMail? It's not something a human should care about. Some parameters are configured by GMail and the system takes care of it. This is a constant theme: high level policies are set and systems make it so. Only bother a human if something outside the norm occurs.
- Prove it. If you don't try it, it doesn't work. Backups and restores are continually tested to verify they work.
There’s a lot to learn here for any organization, big or small. Mr. Blum’s talk is entertaining, informative, and well worth watching.
He does really seem to love the challenge of his job.
Here’s my gloss on this very interesting talk where we learn many secrets from inside the beast:
Data availability must be 100%. No data loss ever.
Statistically, if you lose 200K of a 2GB file it sounds good, but the file is probably now useless. Think of an executable or a tax return.
Availability of data is more important than availability of access. If a system is down it's not the end of the world. If data is lost, it is.
Google guarantees you are covered for all of the following in every possible combination:
location isolation
isolation from application layer problems
isolation from storage layer problems
isolation from media failure
Consider the dimensions you can move the sliders around on. Put software on the vertical axis and location on the horizontal. If you want to cover everything you would need a copy of each layer of the software in different locations. You can do that with VMs in different locations.
Redundancy is not the same as recoverability.
Making many copies does not help meet the no loss guarantee.
Many copies is effective for certain kinds of outages. If an asteroid hits a datacenter and you have a copy far away you are covered.
If you have a bug in your storage stack, copying to N places doesn’t help because the bug corrupts all copies. See the GMail outage as an example.
There aren’t as many asteroids as there are bugs in code, user errors, or writes of a corrupt buffer.
Redundancy is good for locality of reference. Copying is good when you want all data references as close as possible to where the data is being used.
The entire system is incredibly robust because there’s so much of it.
Google stuff fails all the time. It’s crap. The same way cells in our body die. We don’t dream that things don’t die. We plan for it. Machines die all the time.
Redundancy is the answer. The result is more reliable on the aggregate than a single high quality machine. A single machine can be destroyed by an asteroid. Machines put in 50 different
locations are much harder to destroy.
Massively parallel systems have more opportunities for loss.
MapReduce running on 30K machines is great, until you have a bug. You have the same bug waiting to run everywhere at once which magnifies the effect.
Local copies do not protect against site outages.
If you have a flood in your server room RAID doesn’t help.
Google File System (GFS), used throughout Google until about a year ago, takes the concept of RAID up a notch. It uses coding techniques to write to multiple datacenters in different cities at once, so only N-1 fragments are needed to reconstruct the data. With three datacenters, one can die and you still have the data available.
Availability and integrity are an organization wide characteristic.
Google engineers working on BigTable, GFS, Colossus, and the rest all know data durability and integrity is job one. Lots of systems are in place to check and correct any lapses in data availability and integrity.
You want diversity in everything.
If you are worried about site locality put data in multiple sites.
If you are worried about user error have levels of isolation from user interaction.
If you want protection from a software bug put it on different software. Store stuff on different vendor gear to reduce large vendor bug effects.
Tape to back things up works really really well.
Tape is great because it’s not disk. If they could they would use punch cards.
Imagine if you have a bug in a device driver for SATA disks. Tapes save you from that. It increases your diversity because different media implies different software.
Tape capacity is following Moore’s law, so they are fairly happy with tape as a backup medium, though they are working on alternatives, won’t say what they are.
Tapes are encrypted implying that nefarious forces would have a hard time getting anything useful from them.
Backups are useless. It’s the restore you care about.
Find out if there’s a problem before someone needs the data. When you need a restore you really need it.
Run continuous restores. Constantly select at random 5% of backups and restore them to compare them. Why? To find out if backups work before data is lost. Catches a lot of problems.
Run automatic comparisons. You can't compare restored data to the originals because the originals have changed, so checksum everything and compare the checksums. Get the data back onto the source media, disk or flash or wherever it came from, and make sure it can do a round trip. This is done all the time.
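A minimal sketch of what that continuous verification loop might look like, assuming a catalog of checksums recorded at backup time (the names and the 5% knob are illustrative, not Google's actual system):

```python
import hashlib
import random

SAMPLE_RATE = 0.05  # restore a random 5% of backups, per the talk

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_backups(catalog: dict, restore) -> list:
    """Round-trip a random sample of backups and compare checksums.

    catalog maps backup_id -> checksum recorded when the backup was
    written (the originals have changed, so checksums stand in for them).
    restore is a callable backup_id -> bytes that reads the data back
    from tape. Returns the ids that failed, before anyone needs the data.
    """
    sample = random.sample(list(catalog), max(1, int(len(catalog) * SAMPLE_RATE)))
    return [bid for bid in sample if checksum(restore(bid)) != catalog[bid]]
```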
Alert on changes in rates of failure.
If something is different you probably want to know about it. If everything is running normally don’t tell me.
Expect some failures, but don’t alert on a file that doesn’t restore on the first attempt.
Let’s say the rate of failure on the first attempt is typically N. The rate of failure on the second attempt is Y. If there’s a change in the rates of failure then something has gone wrong.
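A sketch of that alerting rule, with made-up baseline numbers: individual failures are expected, so only a shift in the rate pages anyone.

```python
def should_alert(failures: int, attempts: int,
                 baseline_rate: float, tolerance: float = 2.0) -> bool:
    """Fire only when the failure rate drifts well outside its baseline.

    baseline_rate is the historical norm (N for first attempts, Y for
    second attempts); tolerance is an illustrative knob, not Google's.
    """
    observed = failures / attempts if attempts else 0.0
    return observed > baseline_rate * tolerance

# A file failing its first restore attempt is normal; a tripled rate is not.
assert not should_alert(failures=1, attempts=100, baseline_rate=0.01)
assert should_alert(failures=3, attempts=100, baseline_rate=0.01)
```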
Everything breaks.
Disk breaks all the time, but you know when it happens because you are monitoring it.
With tape you don't know it's broken until you try to use it. Tapes last a long time, but you want to test a tape before you need it.
RAID4 on tape.
Don’t write to just one tape. They are cartridges. A robot might drop them or some magnetic flux might occur. Don’t take a chance.
When writing to tape, the source is told to hold on to its data until the backup system says it's OK to change it. Changing the data before then would break the contract.
Build up 4 full tapes and then generate a 5th code tape by XORing everything together. You can lose any one of the 5 tapes and recover the data.
Now tell the writer they can change the source data because the data has made it to its final physical location and is now redundant.
Every bit of data Google backs up goes through this process.
Hundreds of tapes a month are lost, but there aren't hundreds of cases of data loss per month because of this process.
If one tape is lost, the continuous restores detect it, the sibling tapes are used to rebuild a replacement, and all is well. In the rare case where two tapes are corrupted, data is only lost if the same two spots on both tapes are damaged, so reconstruction is done at the sub-tape level. There is no data loss because of these techniques. It's expensive, but it's the cost of doing business.
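A toy sketch of the scheme described above; real tapes hold terabytes and reconstruction happens at the sub-tape level, but the XOR arithmetic is the same:

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Build up 4 full tapes, then generate the 5th code tape by XORing them.
tapes = [bytes([n] * 8) for n in (1, 2, 3, 4)]  # stand-ins for tape contents
parity = xor_blocks(tapes)
# Only now may the source change its data: it is redundant on tape.

# Lose any one of the 5 tapes; XORing the surviving 4 rebuilds it.
rebuilt = xor_blocks([tapes[0], tapes[1], tapes[3], parity])
assert rebuilt == tapes[2]
```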
Backups are a tax you pay for the luxury of a restore.
It’s a restore system not a backup system. Restores are a non-maskable interrupt. They trump everything. Get the backup restored.
Make backups as complicated as they need to be and let them take as long as they need. Make restores as quick and automatic as possible.
Recovery should be stupid, fast, and simple. Want a cat to be able to trigger a central restore.
Restores could happen when you are well rested or when you are dog tired, and you don't want the human element to determine the success of restoring a copy of your serving data while you are under stress. So do all the work and thinking when you have all the time in the world, which is when making the backup. A huge percentage of systems work this way.
Data sources may have to be able to store data for a period, perhaps days, before it can be promised that it is backed up. But once backed up, it can be restored and restored quickly.
Google no longer makes the most efficient use of media on backups, in order to make restores faster. Taking two hours to read a tape is bad. Write only half a tape and read the halves in parallel so you get the data back in half the time.
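A sketch of the trade: two half-full tapes read concurrently return the data in roughly half the wall-clock time of one full tape (the library dict is a stand-in for physical media):

```python
from concurrent.futures import ThreadPoolExecutor

tape_library = {"tape-a": b"first half ", "tape-b": b"second half"}  # stand-in

def read_tape(tape_id: str) -> bytes:
    # Stand-in for a physical read; a full tape takes ~2 hours.
    return tape_library[tape_id]

def restore_parallel(tape_ids):
    """Read sibling half-tapes concurrently and stitch the results."""
    with ThreadPoolExecutor(max_workers=len(tape_ids)) as pool:
        parts = pool.map(read_tape, tape_ids)  # map preserves input order
    return b"".join(parts)

assert restore_parallel(["tape-a", "tape-b"]) == b"first half second half"
```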
Scale is a problem.
When you have exabytes of data there are real world constraints. If you have to copy 10 exabytes, then backing up each day's data could take 10 weeks.
With datacenters around the world you have a few choices. Do you give near infinite backup capacity to every site? Do you cluster all the backup by region? What about the bandwidth to ship the data? Don't you need that bandwidth to serve money making traffic?
Look at the relevant costs. There are compromises. Not every site has backup facilities. You must balance available capacity on the network. Where do you get the most bang for the buck? For example, backups must happen at site X because it has the bandwidth.
You can’t scale linearly.
Can't just say you want more network bandwidth and more tape drives. Drives break, so if you have 10,000 times the number of drives you'll need 10,000 times the number of operators to replace them. Do you have 10,000 times the loading dock space to hold the tape drives until a truck picks them up? None of this can be linear.
Though the number of tape libraries has gone up a full order of magnitude, there aren't 10 times as many people involved. There are somewhat more, but far from a linear increase.
An example is the early prediction that as the number of telephones grew, 30% of the US population would be employed as telephone operators. What they didn't see coming was automated switching.
Automate everything.
Scheduling is automated. If you have a service you say: I have a datastore, I need a copy every N, and restores must happen within M. Internally, systems make this happen. Backups are scheduled, restore testing is run, integrity testing is run, etc. When a broken tape is detected it is automatically handled.
You as a human don't see any of this. You might someday ask how many tapes are breaking on average. Or an alert might go out if the rate of tape breakage changes from 100 tapes per day to 300 tapes per day. But until then, don't tell me that 100 tapes a day broke if that's within the norm.
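A sketch of what that declarative interface might look like; the names are assumptions for illustration, not Google's internal API:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class BackupPolicy:
    """All a service declares; everything below it is automated."""
    datastore: str
    copy_every: timedelta      # "I need a copy every N"
    restore_within: timedelta  # "restores must happen in M"

def register(policy: BackupPolicy) -> None:
    # A real system would hand this to schedulers that run backups,
    # restore tests, and integrity checks without human involvement.
    print(f"{policy.datastore}: copy every {policy.copy_every}, "
          f"restore within {policy.restore_within}")

register(BackupPolicy("mail-inboxes", timedelta(hours=6), timedelta(days=2)))
```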
Humans should not be involved in steady state operations.
Packing up and shipping drives is still a human activity. Automated interfaces prepare shipping labels, get RMA numbers, check that packages have gone out, and get acknowledgement of receipt; only if this breaks down does a human have to intervene.
Library software maintenance is likewise automated. If there's a firmware update, a human doesn't run to every system and perform the upgrade. Download it. Let it get pushed to a canary library. Let it be tested. Let the results be verified as accurate. Then let it be pushed out. A human isn't involved in normal operations.
Handle machine death automatically.
Machines are dying twice a minute. If a machine is dying during a MapReduce job that uses 30,000 machines don’t tell me about it, just handle it and move on. Find another machine, move the work, and restart.
If there are dependencies, then schedule a wait. If the wait goes on too long, then let me know. The system handles its own scheduling. This is a job for an algorithm, not a human.
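A sketch of that policy as code, with illustrative thresholds: absorb machine deaths silently, wait on dependencies, and page a human only when a wait runs too long:

```python
import time

MAX_WAIT_SECONDS = 3600  # illustrative threshold for escalating to a human

def run_resilient(task, pick_machine, dependencies_ready):
    """Run a task without paging anyone about routine machine deaths."""
    waited = 0
    while True:
        if not dependencies_ready():
            if waited >= MAX_WAIT_SECONDS:
                raise RuntimeError("waited too long on dependencies; tell a human")
            time.sleep(60)   # schedule a wait, don't alert
            waited += 60
            continue
        machine = pick_machine()
        try:
            return task(machine)
        except ConnectionError:
            continue  # machine died mid-task: find another, move the work, restart
```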
Keep efficiency improving with growth.
Improve utilization and efficiency a lot. Can’t have 100 times as much data require 100 times the people or machine resources.
The Great GMail Outage and Restoral of 2011. Story of how Google dropped data and got it back.
At 10:31AM on a Sunday he got a page that said “Holly Crap call xxx-xxxx”. More on the outage here.
Gmail is approaching low exabytes of data. That’s a lot of tapes.
100% recovery. Availability was not 100%. It wasn’t all there on the first or second day. But at the end of a period it was all there.
A whole series of bugs and mishaps occurred in the layer where replication happens. Yes, there were three identical files, but they were all empty. Even with unit tests, system tests, and integration tests, bugs get through.
Restored from tape. It was a massive job, and this is where restore time is relative to scale. Getting back a gigabyte can be done in milliseconds to seconds. Getting back 200,000 inboxes of several gigabytes each takes a while.
Woke up a couple of colleagues in Europe because they were fresher. An advantage of a distributed workforce.
Data was restored from many tapes and verified. It didn't take weeks or months; it took days, which they were happy with. Other companies in similar situations have taken a month just to realize they couldn't get the data back. Steps have been taken to make sure the process will be faster next time.
One tape drive takes 2 hours to read. The tapes were located all over the place; otherwise no single location would have had enough power to read all the tapes involved in the restoration process.
With compression and checksums they actually didn’t need to read 200K tapes.
The restoral process has been much improved since then.
Prioritize restores.
Archived data can be restored after more important data like your current inbox and sent email.
Accounts that have not been touched in a month can wait while more active users are restored first.
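A sketch of that prioritization with a plain heap; the 30-day threshold comes from the talk, the rest is illustrative:

```python
import heapq
from datetime import datetime, timedelta

def restore_priority(account: dict) -> int:
    """Lower value restores first: active users before dormant ones."""
    idle = datetime.now() - account["last_active"]
    return 0 if idle < timedelta(days=30) else 1

accounts = [
    {"name": "dormant-user", "last_active": datetime.now() - timedelta(days=90)},
    {"name": "active-user", "last_active": datetime.now() - timedelta(days=2)},
]
queue = [(restore_priority(a), i, a) for i, a in enumerate(accounts)]
heapq.heapify(queue)
while queue:
    _, _, account = heapq.heappop(queue)
    print("restoring", account["name"])  # active-user first, dormant-user second
```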
Backup system is viewed as a huge global organism.
Don’t want GMail backups just in New York for example, because if that datacenter grew or shrank the backups would need to scale appropriately.
Treat backup as one giant world spanning system. When a backup occurs it might be somewhere else entirely.
A restore from a tape has to happen where the tape is located. But until the data makes it to tape, it could be in New York while the backup happens in Oregon, because that's where there was capacity. Location isolation is handled automatically; no client is told where their data is backed up.
Capacity can be moved around. As long as there's global capacity and the network can support it, it doesn't matter where the tapes are located.
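A sketch of that placement decision; the site names, numbers, and scoring are assumptions for illustration:

```python
def pick_backup_site(sites: dict, petabytes_needed: float) -> str:
    """Back up wherever global capacity allows; clients never learn where."""
    candidates = {name: s for name, s in sites.items()
                  if s["free_pb"] >= petabytes_needed}
    # Prefer the site with the most capacity and network headroom.
    return max(candidates,
               key=lambda n: (candidates[n]["free_pb"], candidates[n]["gbps"]))

sites = {
    "new-york": {"free_pb": 1.0, "gbps": 10},
    "oregon":   {"free_pb": 8.0, "gbps": 40},
}
assert pick_backup_site(sites, petabytes_needed=2.0) == "oregon"
```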
The more data you have the more important it is to keep it.
The larger things are, the more important they are, as a rule. Google used to be just search. Now it is GMail, stuff held in Drive, Docs, etc. It's both larger now and more important.
Have a good infrastructure.
Really good to have general purpose swiss army knives at your disposal. When MapReduce was written they probably never thought of it being used for backups. But without already having MapReduce the idea
of using it for backups couldn’t have occurred.
Scaling is really important and you can’t have any piece of it--Software, Infrastructure, Hardware, Processes--that doesn’t scale.
You can't say you're going to deploy more tape drives without also scaling the operations staff. If you are going to hire twice as many people, do you have twice as many parking spots? Room in the cafeteria? Restrooms?
Everything has to scale up, or you'll hit a single bottleneck and it will stop you.
Prove it.
Don’t take anything for granted. Hope is not a strategy.
If you don’t try it it doesn’t work. A restore must happen to verify a backup. Until you get to the end you haven’t proven anything. This attitude has found a lot of failures.
DRT. Disaster recovery testing.
Every N months a disaster scenario is played out. Simulate the response at every level of the organization.
How will the company survive without whatever was taken away by the disaster? Must learn to adapt.
Finds enormous holes in infrastructure and physical security.
Imagine a datacenter with one road leading to it and trucks loaded with fuel for the backup generators on that road. What happens when the road is lost? Better have another road and another supplier for the diesel fuel.
They do have supply chain redundancy strategies.
Redundancy in different software stacks, at different locations, at different points in time.
Don't just let data migrate through the stack. Keep the data in different layers of the stack for a particular dwell period, so if you lose this and this, you still have the data somewhere in an overlap. Time, location, and software.
Consider the GMail outage example. If replication was corrupted, how could no data be lost? This was a question from the audience and he didn't really want to give details. Data is constantly being backed up. Let's say we have the data as of 9PM, and the corruption started at 8PM but hadn't made it to tape yet. The corruption was stopped. Software was rolled back to a working release. At some point in the stack all the data is still there. There's stuff on tape. There's stuff being replicated. There's stuff on the front-ends. There's stuff in logs. There was overlap from all these sources and it was possible to reconstruct all the data. Policy is to not take data out of one stack until N hours after it has been put in another stack, for just these cases.
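A sketch of that dwell-period policy; the talk only says "N hours", so the 24 here is an assumption:

```python
from datetime import datetime, timedelta

DWELL = timedelta(hours=24)  # the "N hours" from the talk; value assumed

def safe_to_evict(copied_to_next_stack_at, now=None) -> bool:
    """Data may leave one layer only after dwelling in the next layer,
    so corruption has time to surface while the overlap still exists."""
    now = now or datetime.now()
    if copied_to_next_stack_at is None:
        return False  # not replicated downstream yet
    return now - copied_to_next_stack_at >= DWELL

# Logs written at 8PM can't be purged at 9PM even though the data is
# "on tape"; that overlap is what made the GMail reconstruction possible.
assert not safe_to_evict(datetime.now() - timedelta(hours=1))
assert safe_to_evict(datetime.now() - timedelta(hours=30))
```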
Delete problem.
I want to delete this. Not going to rewrite tapes just to delete data. It’s just too expensive at scale.
One approach is to do something smart with encryption keys. He didn’t tell us what Google does. Perhaps the key is lost which effectively deletes the data?
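He didn't confirm the approach, but the standard technique is crypto-shredding: encrypt before writing, then destroy the key to delete. A sketch using the Fernet recipe from the cryptography package:

```python
# pip install cryptography
from cryptography.fernet import Fernet, InvalidToken

# Encrypt before the data ever reaches tape (tapes are encrypted anyway).
key = Fernet.generate_key()
ciphertext = Fernet(key).encrypt(b"user mailbox contents")

# "Deleting" = destroying the key. The tape is never rewritten; the
# ciphertext still on it is now permanently unreadable.
key = None

try:
    Fernet(Fernet.generate_key()).decrypt(ciphertext)  # any other key fails
except InvalidToken:
    print("no key, no plaintext: the data is effectively deleted")
```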
A giant organization can work when you trust your colleagues and shard responsibilities.
Trust that they understand their part.
Make sure organizational and software interfaces are well defined. Implement verification tests between layers.
Whitelisting and blacklisting.
Ensure data is in a guaranteed location and guaranteed not to be in a certain location, which goes against much of the rest of the philosophy which is location diversity and location independence.
This was not originally a feature of the stack; it had to be added to support government requirements.
Responsibility pushed down as low as possible in the stack. Fill out the right profile and it magically happens.
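A sketch of such a profile; the field names and regions are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class PlacementProfile:
    """Fill out the right profile and placement 'magically happens'."""
    required_regions: set = field(default_factory=set)   # whitelist
    forbidden_regions: set = field(default_factory=set)  # blacklist

def allowed(profile: PlacementProfile, region: str) -> bool:
    if region in profile.forbidden_regions:
        return False  # guaranteed NOT to be in a certain location
    # If a whitelist exists, data is guaranteed to stay inside it.
    return not profile.required_regions or region in profile.required_regions

gov_data = PlacementProfile(required_regions={"us-east", "us-west"})
assert allowed(gov_data, "us-east")
assert not allowed(gov_data, "asia-east")
```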
一般在对文件操作的时候可能出现这个问题,可能是打开文件的时候出错,也可能是对文件夹进行遍历的时候出问题. 出现这样的问题通常是在eclipse中执行hadoop的时候出现,直接切换到shell下发送命 ...