http://highscalability.com/blog/2014/7/21/stackoverflow-update-560m-pageviews-a-month-25-servers-and-i.html

The folks at Stack Overflow remain incredibly open about what they are doing and why. So it’s time for another update. What has Stack Overflow been up to?

The network of sites that make up StackExchange, which includes StackOverflow, is now ranked 54th for traffic in the world; they have 110 sites and are growing at a rate of 3 or 4 a month; 4 million users; 40 million answers; and 560 million pageviews a month.

This is with just 25 servers. For everything. That’s high availability, load balancing, caching, databases, searching, and utility functions. All with a relative handful of employees. Now that’s quality engineering.

This update is based on The architecture of StackOverflow (video) by Marco Cecconi andWhat it takes to run Stack Overflow (post) by Nick Craver. In addition, I’ve merged in comments from various sources. No doubt some of the details are out of date as I meant to write this article long ago, but it should still be representative.

Stack Overflow still uses Microsoft products. Microsoft infrastructure works and is cheap enough, so there’s no compelling reason to change. Yet SO is pragmatic. They use Linux where it makes sense. There’s no purity push to make everything Linux or keep everything Microsoft. That wouldn’t be efficient.

Stack Overflow still uses a scale-up strategy. No clouds in site. With their SQL Servers loaded with 384 GB of RAM and 2TB of SSD, AWS would cost a fortune. The cloud would also slow them down, making it harder to optimize and troubleshoot system issues. Plus, SO doesn’t need a horizontal scaling strategy. Large peak loads, where scaling out makes sense, hasn’t  been a problem because they’ve been quite successful at sizing their system correctly.

So it appears Jeff Atwood’s quote: "Hardware is Cheap, Programmers are Expensive", still seems to be living lore at the company.

Marco Ceccon in his talk says when talking about architecture you need to answer this question first: what kind of problem is being solved?

First the easy part. What does StackExchange do? It takes topics, creates communities around them, and creates awesome question and answer sites.

The second part relates to scale. As we’ll see next StackExchange is growing quite fast and handles a lot of traffic. How does it do that? Let’s take a look and see….

Stats

  • StackExchange network has 110 sites growing at a rate of 3 or 4 a month.

  • 4 million users

  • 8 million questions

  • 40 million answers

  • As a network #54 site for traffic in the world

  • 100% year over year growth

  • 560 million pageviews a month

  • Peak is more like 2600-3000 requests/sec on most weekdays. Programming, being a profession, means weekdays are significantly busier than weekends.

  • 25 servers

  • 2 TB of SQL data all stored on SSDs

  • Each web server has 2x 320GB SSDs in a RAID 1.

  • Each ElasticSearch box has 300 GB also using SSDs.

  • Stack Overflow has a 40:60 read-write ratio.

  • DB servers average 10% CPU utilization

  • 11 web servers, using IIS

  • 2 load balancers, 1 active, using HAProxy

  • 4 active database nodes, using MS SQL

  • 3 application servers implementing the tag engine, anything searching by tag hits

  • 3 machines doing search with ElasticSearch

  • 2 machines for distributed cache and messaging using Redis

  • 2 Networks (each a Nexus 5596 + Fabric Extenders)

  • 2 Cisco 5525-X ASAs (think Firewall)

  • 2 Cisco 3945 Routers

  • 2 read-only SQL Servers for used mainly for the Stack Exchange API

  • VMs also perform functions like deployments, domain controllers, monitoring, ops database for sysadmin goodies, etc.

Platform

  • ElasticSearch

  • Redis

  • HAProxy

  • MS SQL

  • Opserver

  • TeamCity

  • Jil - Fast .NET JSON Serializer, built on Sigil

  • Dapper - a micro ORM.

UI

  • The UI has message inbox that is sent a message when you get a new badge, receive a message, significant event, etc. Done using WebSockets and is powered by redis.

  • Search box is powered by ElasticSearch using a REST interface.

  • With so many questions on SO it was impossible to just show the newest questions, they would change too fast, a question every second. Developed an algorithm to look at your pattern of behaviour and show you which questions you would have the most interest in. It’s uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Server side templating is used to generate pages.

Servers

  • The 25 servers are not doing much, that is the CPU load is low. It’s calculated SO could run on only 5 servers.

  • The database server is at 10%, except when it bursts while performing a backups.

  • How so low? The databases servers have 384GB of RAM and the web servers are at 10%-15% CPU usage.

  • Scale-up is still working. Other scale-out sites with a similar number of pageviews tend to run on 100, 200, up to 300 servers.

  • Simple system. Built on .Net. Have only 9 projects, others systems have 100s. Reason to have so few projects is is so compilation is lightning fast, which requires planning at the beginning. Compilation takes 10 seconds on a single computer.

  • 110K lines of code. A small number given what it does.

  • This minimalist approach comes with some problems. One problem is not many tests. Tests aren’t needed because there’s a great community. Meta.stackoverflow is a discussion site for the community and where bugs are reported. Meta.stackoverflow is also a beta site for new software. If users find any problems with it they report the bugs that they’ve found, sometimes with solution/patches.

  • Windows 2012 is used in New York but are upgrading to 2012 R2  (Oregon is already on it). For Linux systems it’s Centos 6.4.

  • Load is really almost all over 9 servers, because 10 and 11 are only for meta.stackexchange.com, meta.stackoverflow.com, and the development tier. Those servers also run around 10-20% CPU which means we have quite a bit of headroom available.

SSDs

  • Intel 330 as the default (web tier, etc.)

  • Intel 520 for mid tier writes like Elastic Search

  • Intel 710 & S3700 for the database tier. S3700 is simply the successor to the high endurance 710 series.

  • Exclusively RAID 1 or RAID 10 (10 being any arrays with 4+ drives). Failures have not been a problem, even with hundreds of intel 2.5" SSDs in production, a single one hasn’t failed yet. One or more spare parts are kept for each model, but multiple drive failure hasn't been a concern.

  • ElasticSearch performs much better on SSDs, given SO writes/re-indexes very frequently.

  • SSD changes the use of search. Lucene.net couldn’t handle SO’s concurrent workloads due to locking issues, so they moved to ElasticSearch. It turns out locks around the binary readers really aren't necessary in an all SSD environment.

  • The only scale-up problems so far is SSD space on the SQL boxes due to the growth pattern of reliability vs. space in the non-consumer space, that isdrives that have capacitors for power loss and such.

High Availability

  • The main datacenter is in New York and the backup datacenter is in Oregon.

  • Redis has 2 slaves, SQL has 2 replicas, tag engine has 3 nodes, elastic has 3 nodes - any other service has high availability as well (and exists in both data centers).

  • Not everything is slaved between data centers (very temporary cache data that's not needed to eat bandwidth by syncing, etc.) but the big items are, so there is still a shared cache in case of a hard down in the active data center. A start without a cache is possible, but it isn't very graceful.

  • Nginx was used for SSL, but a transition has been made to using HAProxy to terminate SSL.

  • Total HTTP traffic sent is only about 77% of the total traffic sent. This is because replication is happening to the secondary data center in Oregon as well as other VPN traffic. The majority of this traffic is the data replication to SQL replicas and redis slaves in Oregon.

Databasing

  • MS SQL Server.

  • Stack Exchange has one database per-site, so Stack Overflow gets one, Super User gets one, Server Fault gets one, and so on. The schema for these is the same. This approach of having different database is effectively a form of partitioning and horizontal scaling.

  • In the primary data center (New York) there is usually 1 master and 1 read-only replica in each cluster. There's also 1 read-only replica (async) in the DR data center (Oregon). When running in Oregon then the primary is there and both of the New York replicas are read-only and async.

  • There are a few wrinkles. There is one "network wide" database which has things like login credentials, and aggregated data (mostly exposed through stackexchange.com user profiles, or APIs).

  • Careers Stack Overflow, stackexchange.com, and Area 51 all have their own unique database schema.

  • All the schema changes are applied to all site databases at the same time. They need to be backwards compatible so, for example, if you need to rename a column - a worst case scenario - it's a multiple steps process: add a new column, add code which works with both columns, back fill the new column, change code so it works with the new column only, remove the old column.

  • Partitioning is not required. Indexing takes care of everything and the data just is not large enough. If something warrants a filtered indexes, why not make it way more efficient? Indexing only on DeletionDate = Null and such is a common pattern, others are specific FK types from enums.

  • Votes are in 1 table per item, for example 1 table for post votes, 1 table for comment votes. Most pages we render real-time, caching only for anonymous users. Given that, there's no cache to update, it’s just a re-query.

  • Scores are denormalized, so querying is often needed. It's all IDs and dates, the post votes table just has 56,454,478 rows currently. Most queries are just a few milliseconds due to indexing.

  • The Tag Engine is entirely self-contained, which means not having to depend on an external service for very, very core functionality. It's a huge in-memory struct array structure that is optimized for SO use cases and precomputed results for heavily hit combinations. It's a simple windows service running on a few boxes working in a redundant team. CPU is about 2-5% almost always. Three boxes are not needed for load, just redundancy. If all those do fail at once, the local web servers will load the tag engine in memory and keep on going.

  • On Dapper's lack of a compiler checking queries compared to traditional ORM. The compiler is checking against what you told it the database looks like. This can help with lots of things, but still has the fundamental disconnect problem you'll get at runtime. A huge problem with the tradeoff is the generated SQL is nasty, and finding the original code it came from is often non-trivial. Lack of ability to hint queries, control parameterization, etc. is also a big issue when trying to optimize queries. For example. literal replacement was added to Dapper to help with query parameterization which allows the use of things like filtered indexes. Dapper also intercepts the SQL calls to dapper and add add exactly where it came from. It saves so much time tracking things down.

Coding

  • The process:

    • Most programmers work remotely. Programmers code in their own batcave.

    • Compilation is very fast.

    • Then the few test that they have are run.

    • Once compiled, code is moved to a development staging server.

    • New features are hidden via feature switches.

    • Runs on same hardware as the rest of the sites.

    • It’s then moved to Meta.stackoverflow for testing. 1000 users per day use the site, so its a good test.

    • If it passes it goes live on the network and is tested by the larger community.

  • Heavy usage of static classes and methods, for simplicity and better performance.

  • Code is simple because the complicated bits are packaged in a library and open sourced and maintained. The number of .Net projects stays low because community shared parts of the code are used.

  • Developers get two or three monitors. Screens are important, they help you be productive.

Caching

  • Cache all the things.

  • 5 levels of caches.

  • 1st: is the network level cache: caching in the browser, CDN, and proxies.

  • 2nd: given for free by the .Net framework and is called the HttpRuntime.Cache. An in-memory, per server cache.

  • 3rd: Redis. Distributed in-memory key-value store. Share cache elements across different servers that serve the same site. If StackOverflow has 9 servers then all servers will be able to find the same cached items.

  • 4th: SQL Server Cache. The entire database is cached in-memory. The entire thing.

  • 5th: SSD. Usually only hit when the SQL server cache is warming up.

  • For example, every help page is cached. Code to access a page is very terse:

    • Static methods and static classes re used. Really bad from an OOP perspective, but really fast and really friendly towards terse code. All code is directly addressed.

    • Caching is handled by a library layer of Redis and Dapper, a micro ORM.

  • To get around garbage collection problems, only one copy of a class used in templates are created and kept in a cache. Everything is measured, including GC operation, from statistics it is known that layers of indirection increase GC pressure to the point of noticeable slowness.

  • CDN hits vary, since the  query string hash is based on file content, it’s only re-fetched on a build. It's typically 30-50 million hits a day for 300 to 600 GB of bandwidth.

  • A CDN is not used for CPU or I/O load, but to help users find answers faster.

Deploying

  • Want to deploy 5 times a day. Don’t build grand gigantic things and then put then live. Important because:

    • Can measure performance directly.

    • Forced to build the smallest thing that can possibly work.

  • TeamCity builds then copies to each web tier via a powershell script. The steps for each server are:

    • Tell HAProxy to take the server out of rotation via a POST

    • Delay to let IIS finish current requests (~5 sec)

    • Stop the website (via the same PSSession for all the following)

    • Robocopy files

    • Start the website

    • Re-enable in HAProxy via another POST

  • Almost everything is deployed via puppet or DSC, so upgrading usually consist of just nuking the RAID array and installing from a PXE boot. it's very fast and you know it's done right/repeatable.

Teaming

  • Teams:

    • SRE (System Reliability Engineering): - 5 people

    • Core Dev (Q&A site) : ~6-7 people

    • Core Dev Mobile: 6 people

    • Careers team that does development solely for the SO Careers product: 7 people

  • Devops and developer teams are really close-knit.

  • There's a lot of movement between teams.

  • Most employees work remotely.

  • Offices are mostly sales, Denver and London exclusively so.

  • All else equal, it is slightly prefered to have people in NYC, because the in-person time is a plus for the casual interaction that happens in between "getting things done". But the set up makes it possible to do real work and official team collaboration works almost entirely online.

  • They’ve learned that the in-person benefit is more than outweighed by how much you get from being able to hire the best talent that loves the product anywhere, not just the ones willing to live in the city you happen to be in.

  • The most common reason for someone going remote is starting a family. New York's great, but spacious it is not.

  • Offices are in Manhattan and a lot of talent is there. The data center needs to not be a crazy distance away since it is always being improved. There’s also a slightly faster connection to many backbones in the NYC location - though we're talking only a few milliseconds (if that) of difference there.

  • Making an awesome team: Love geeks. Early Microsoft, for example, was full of geeks and they conquered the world.

  • Hire from Stack Overflow community. They looks for a passion for coding, a passion for helping others, and a passion for communicating.

Budgeting

  • Budgets are pretty much project based. Money is only spent as infrastructure is added for new projects. The web servers that have such low utilization are the same ones purchased 3 years ago when the data center was built.

Testing

  • Move fast and break things. Push it live.

  • Major changes are tested by pushing them. Development has an equally powerful SQL server and it runs on the same web tier, so performance testing isn't so bad.

  • Very few tests. Stack Overflow doesn't use many unit tests because of their active community and heavy usage of static code.

  • Infrastructure changes. There’s 2 of everything, so there’s a backup with the old configuration whenever possible, with a quick failback mechanism. For example, keepalived does failback quickly between load balancers.

  • Redundant systems fail over pretty often just to do regular maintenance. SQL backups are tested by having a dedicated server just for restoring them, constantly (that's a free license - do it). Plan to start full data center failovers every 2 months or so - the secondary data center is read-only at all other times.

  • Unit tests, integration tests and UI tests run on every push. All the tests must succeed before a production build run is even possible. So there’s some mixed messages going on about testing.

  • The things that obviously should have tests have tests. That means most of the things that touch money on the Careers product, and easily unit-testable features on the Core end (things with known inputs, e.g. flagging, our new top bar, etc), for most other things we just do a functionality test by hand and push it to our incubating site (formerly meta.stackoverflow, now meta.stackexchange).

Monitoring / Logging

  • Now considering using http://logstash.net/ for log management. Currently a dedicated service inserts the syslog UDP traffic into a SQL database. Web pages add headers for the timings on the way out which are captured with HAProxy and are included in the syslog traffic.

  • Opserver and Realog. are how many metrics are surfaced. Realog is a logging display system built by Kyle Brandt and Matt Jibson in Go

  • Logging is from the HAProxy load balancer via syslog instead of via IIS. This is a lot more versatile than IIS logs.

Clouding

  • Hardware is cheaper than developers and efficient code. You are only as fast as your slowest bottleneck and all the current cloud solutions have fundamental performance or capacity limits. 
  • Could you build SO well if building for the cloud from day one? Mostl likely. Could you consistency render all your pages performing several up to date queries and cache fetches across that cloud network you don't control and getting sub 50ms render times? That's another matter. Unless you're talking about substantially higher cost (at least 3-4x), the answer is no - it's still more economical for SO to host in their own servers.

Performance As A Feature

  • StackOverflow puts aheavy emphasis on performance. The goal for the main page  is to load in less than 50ms, but can be as low as 28ms.

  • Programmers are fanatic about reducing page load times and improving the user experience.

  • Timings for every single request to the network are recorded. With these kind of metrics you can make decisions on where to improve your system.

  • The primary reason their servers run at such low utilization is efficient code. Web servers average between 5-15% CPU, 15.5 GB of RAM used and 20-40 Mb/s network traffic.  The SQL servers average around 5-10% CPU, 365 GB of RAM used, and 100-200 Mb/s of network traffic. This has three major benefits: general room to grow before and upgrade is necessary; headroom to stay online for when things go crazy (bad query, bad code, attacks, whatever it may be); and the ability to clock back on power if needed.

Lessons Learned

  • Why use Redis if you use MS products? gabeech: It's not about OS evangelism. We run things on the platform they run best on. Period. C# runs best on a windows machine, we use IIS. Redis runs best on a *nix machine we use *nix.

  • Overkill as a strategy. Nick Craver on why their network is over provisioned: Is 20 Gb massive overkill? You bet your ass it is, the active SQL servers average around 100-200 Mb out of that 20 Gb pipe.  However, things like backups, rebuilds, etc. can completely saturate it due to how much memory and SSD storage is present, so it does serve a purpose.

  • SSDs Rock. The database nodes all use SSD and the average write time is 0 milliseconds.

  • Know your read/write workload.

  • Keeping things very efficient means new machines are not needed often. Only when a new project comes along that needs different hardware for some reason is new hardware added. Typically memory is added, but other than that efficient code and low utilization means it doesn't need replacing. So typically talking about adding a) SSDs for more space, or b) new hardware for new projects.

  • Don’t be afraid to specialize. SO uses complicated queries based on tags, which is why a specialized Tag Engine was developed.

  • Do only what needs to be done. Tests weren’t necessary because an active community did the acceptance testing for them. Add projects only when required. Add a line of code only when necessary. You Aint Gone Need It really works.

  • Reinvention is OK. Typical advice is don’t reinvent the wheel, you’ll just make it worse, by making it square, for example. At SO they don't worry about making a "Square Wheel". If developers can write something more lightweight than an already developed alternative, then go for it.

  • Go down to the bare metal. Go into the IL (assembly language of .Net). Some coding is in IL, not C#. Look at SQL query plans. Take memory dumps of the web servers to see what is actually going on. Discovered, for example, a split call generated 2GB of garbage.

  • No bureaucracy. There’s always some tools your team needs. For example, an editor, the most recent version of Visual Studio, etc. Just make it happen without a lot of process getting in the way.

  • Garbage collection driven programming. SO goes to great lengths to reduce garbage collection costs, skipping practices like TDD, avoiding layers of abstraction, and using static methods. While extreme, the result is highly performing code. When you're doing hundreds of millions of objects in a short window, you can actually measure pauses in the app domain while GC runs. These have a pretty decent impact on request performance.

  • The cost of inefficient code can be higher than you think.  Efficient code stretches hardware further, reduces power usage, makes code easier for programmers to understand.

Related Articles

StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It's All About Performance的更多相关文章

  1. [收藏]NET技术+25台服务器怎样支撑世界第54大网站

    .NET技术+25台服务器怎样支撑世界第54大网站   英文原文:StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It's ...

  2. .NET技术+25台服务器怎样支撑世界第54大网站(转)

    [编者按]StackOverflow是一个IT技术问答网站,用户可以在网站上 提交和回答问题.当下的StackOverflow已拥有400万个用户,4000万个回答,月PV5.6亿,世界排行第54.然 ...

  3. (转).NET技术+25台服务器怎样支撑世界第54大网站

    英文原文:StackOverflow Update: 560M Pageviews A Month, 25 Servers, And It's All About Performance StackO ...

  4. .NET技术+25台服务器怎样支撑世界第54大网站

    摘要:同时使用Linux和Windows平台产品,大量使用静态的方法和类,Stack Overflow是个重度性能控.同时,取代横向扩展,他们坚持着纵向扩展思路,因为“硬件永远比程序员便宜”. Sta ...

  5. Cheatsheet: 2014 07.01 ~ 07.31

    Web Maximize Compression with Zopfli Browser Detection with JavaScript Simple MySQL Master HA with m ...

  6. Fedora 25/24/23 nVidia Drivers Install Guide

    https://www.if-not-true-then-false.com/2015/fedora-nvidia-guide/ search Most Popular Featured Linux ...

  7. linux tcp调优

    Linux TCP Performance Tuning News Linux Performance Tuning Recommended Books Recommended Links Linux ...

  8. Intel CPU Microarchitecture

    http://en.wikipedia.org/wiki/Intel_Tick_Tock Atom Roadmap[16]   Fabrication process Microarchitectur ...

  9. CQRS, Task Based UIs, Event Sourcing agh!

    原文地址:CQRS, Task Based UIs, Event Sourcing agh! Many people have been getting confused over what CQRS ...

随机推荐

  1. Eplan 2D安装版布局,部件、端子竖放

    部件竖放,不是通过变量的选择实现,而是通过设置实现的,具体设置在: 选项-设置-用户-2D安装板布局: 部件方向-更改为 垂直 部件放置-旋转角度-更改为90° 这样在连续放置部件的时候就变为竖放了, ...

  2. Oracle数据库入门——pctfree和pctused详解

    一.建立表时候,注意PCTFREE参数的作用 PCTFREE:为一个块保留的空间百分比,表示数据块在什么情况下可以被insert,默认是10,表示当数据块的可用空间低于10%后,就不可以被insert ...

  3. 译:用InnoSetup模块化安装依赖项

    译文出处:http://www.codeproject.com/Articles/20868/NET-Framework-Installer-for-InnoSetup 源文件下载:http://fi ...

  4. #ifdef 和 #if defined 的区别 -- 转

    #ifdef 和 #if defined 的区别在于,后者可以组成复杂的预编译条件,比如 #if defined (AAA) && defined (BBB) xxxxxxxxx #e ...

  5. C# 通过WebService方式 IIS发布网站 上传文件到服务器

    应用场景:要将本地的文件 上传到服务器的虚拟机上 网络环境:公司局域网(如下图中第二种) 开发环境:VS2010 服务器环境:WinServer2008    虚拟机环境:WinServer2008 ...

  6. sp_executesql 两种写法

    写法1: AlTER PROCEDURE TryAgain @ReturnValue int output AS declare @aa nvarchar(), @ForumID int, @coun ...

  7. 剑指架构师系列-tomcat6通过IO复用实现connector

    由于tomcat6的配置文件如下: <Connector port="80" protocol="org.apache.coyote.http11.Http11Ni ...

  8. SQL查询语句去除重复行

    1.存在两条完全相同的纪录 这是最简单的一种情况,用关键字distinct就可以去掉 select distinct * from table(表名) where (条件) 2.存在部分字段相同的纪录 ...

  9. Velocity魔法堂系列一:入门示例

    一.前言 Velocity作为历史悠久的模板引擎不单单可以替代JSP作为Java Web的服务端网页模板引擎,而且可以作为普通文本的模板引擎来增强服务端程序文本处理能力.而且Velocity被移植到不 ...

  10. 汇编学习:float与double速度问题

    X86处理器包含两种类型的浮点数寄存器.第一种使用8个浮点寄存器组成浮点寄存器栈,另一种为向量寄存器(XMM,YMM),它们对于单双精度的处理是不同的.本文将讨论两种模式下的浮点数计算速度问题. 一. ...