How Instagram Feeds Work: Celery and RabbitMQ (Repost)
Original: http://blogs.vmware.com/vfabric/2013/04/how-instagram-feeds-work-celery-and-rabbitmq.html
Instagram is one of the poster children for social media success. Founded in 2010, the photo-sharing site now supports upwards of 90 million active users. As with every social media site, part of the fun is that photos and comments appear instantly so your friends can engage while the moment is hot. At PyCon 2013 last month, Instagram engineer Rick Branson shared how Instagram had to transform the way these photos and comments show up in feeds as it scaled from a few thousand tasks a day to hundreds of millions.
Rick started his talk by demonstrating how traditional database approaches break down, calling this the “naïve approach”. In this approach, to display a user's feed, the application directly fetches all the photos posted by the accounts that the user follows from a single, monolithic data store, sorts them by creation time and displays only the latest 10:
SELECT * FROM photos
WHERE author_id IN
(SELECT target_id FROM following
WHERE source_id = %(user_id)d)
ORDER BY creation_time DESC
LIMIT 10;
Instead, Instagram chose a modern distributed data strategy that would allow them to scale nearly linearly.
To start, they built a system in Redis that essentially stores the feed a user would fetch at any given time. Each user is assigned a media ID. In the accompanying diagram, this particular user's media ID is 943058139. From there, they rely on asynchronous tasks to populate individual feeds as photos are posted. Each time a photo is posted, the system looks up all of the user's followers (in this case, 3 followers are identified, with IDs 487, 3201, and 441) and assigns individual tasks to place the photo into each follower's feed. This data strategy is called a fanout-on-write approach, and it's very well suited for fast reads. Since reads in their system outweigh writes by 100:1, and most of these reads come from mobile devices, it was imperative to weight the design heavily towards minimizing read costs.
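The fanout-on-write idea can be sketched in a few lines. This is a minimal in-memory illustration, not Instagram's implementation: a dictionary of bounded deques stands in for the Redis feeds, and the author ID, the media IDs and the `FEED_LENGTH` cap are hypothetical values chosen for the example. Only the three follower IDs come from the article.

```python
from collections import defaultdict, deque

FEED_LENGTH = 10  # hypothetical cap on stored feed entries

# Follower graph: author id -> ids of the users who follow that author.
# Author id 1 is made up; the follower ids are the ones in the article.
followers = {1: [487, 3201, 441]}

# One bounded feed per user, newest media id first (stands in for Redis).
feeds = defaultdict(lambda: deque(maxlen=FEED_LENGTH))


def post_photo(author_id, media_id):
    """Fan out on write: copy the new media id into every follower's feed."""
    for follower_id in followers.get(author_id, []):
        feeds[follower_id].appendleft(media_id)


def read_feed(user_id):
    """Reading is a single lookup; no join or sort happens at read time."""
    return list(feeds[user_id])


post_photo(1, 1001)
post_photo(1, 1002)
print(read_feed(487))  # → [1002, 1001]
```

The work is moved from the read path to the write path: each post costs one write per follower, while each feed read is a single lookup of a precomputed list.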
The write cost of each post is essentially equal to the number of followers the posting user has. To do this reliably for every user posting from a mobile phone over a web request, including Justin Bieber, who has over 7 million followers, the fan-out needed to be handled asynchronously in the background.
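A rough sketch of why the fan-out is pushed to the background: the web request only enqueues a task and returns, while a separate worker pays the per-follower write cost. This uses Python's standard `queue` and `threading` modules as a stand-in for a real task manager and broker; the function names, the `"popular_user"` account and its 1,000 followers are all hypothetical.

```python
import queue
import threading
from collections import defaultdict

followers = {"popular_user": list(range(1000))}  # hypothetical follower ids
feeds = defaultdict(list)
tasks = queue.Queue()  # stands in for the message broker


def handle_post_request(author_id, media_id):
    """Runs in the web request: enqueue one task and return immediately."""
    tasks.put((author_id, media_id))
    return "accepted"


def worker():
    """Runs in the background: performs one feed write per follower."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel used in this demo to stop the worker
            tasks.task_done()
            break
        author_id, media_id = item
        for follower_id in followers.get(author_id, []):
            feeds[follower_id].insert(0, media_id)
        tasks.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

status = handle_post_request("popular_user", 42)
tasks.join()     # demo only: wait until the background work has finished
tasks.put(None)  # ask the worker to exit
thread.join()

writes = sum(len(feed) for feed in feeds.values())
print(status, writes)  # → accepted 1000
```

The request handler's cost stays constant regardless of follower count; the 1,000 feed writes happen after it has already returned.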
The posts are delivered using a task manager and a message broker. For the task manager, they chose Celery, an open source distributed task framework written in Python that is known for being highly extensible and feature-rich and for having great tooling.
With the task manager selected, the Instagram team needed a message broker to buffer the tasks and distribute them to the workers. Initially they looked at Redis, as they already had it in house. However, because it relied on polling, it would not scale as they needed, and replication would have to be built out manually, adding implementation work. Also, Redis is an in-memory solution: if the queues built up and the machines ran out of memory, there was a risk of losing tasks.
Next they considered Beanstalk, a purpose-built task queue that seemed ideal. It was fast, it pushed to consumers, and it spilled to disk when it ran out of memory. However, it did not support replication in any way, which was a deal breaker.
Finally, the team landed on RabbitMQ. It was reasonably fast and efficient, supported low-maintenance synchronous replication, and was highly compatible with Celery. Additionally, it was multi-purpose, which allowed them to use the message broker for other tasks, such as asynchronously cross-posting to other networks like Facebook and Twitter. (TIME- AND BATTERY-SAVER TIP: In my personal experience, at big community, sporting or music events where bandwidth is scarce and Facebook can therefore be difficult to reach, it is much faster to post to Instagram and let it post to Facebook in the background.)
The setup is fairly straightforward. A web request pushes the post to the RabbitMQ broker. Messages are distributed to workers in round-robin fashion. If a worker fails, the task is redelivered to the next worker. They run RabbitMQ 3.0 clustered over two mirrored broker nodes in Amazon’s EC2. The brokers are typically highly over-provisioned to account for spikes in traffic, and they can easily scale out by adding broker clusters.
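The dispatch behavior described here (round-robin delivery, with a failed task handed to the next worker) can be simulated in plain Python. This is an illustrative model only and does not use RabbitMQ's API; the `dispatch` function, the worker names and the task labels are hypothetical.

```python
from collections import deque
from itertools import cycle


def dispatch(tasks, workers):
    """Hand tasks to workers in round-robin order.

    `workers` maps a worker name to a callable. If the callable raises,
    the task is offered to the next worker in the rotation.
    Returns a dict of task -> name of the worker that completed it.
    """
    names = cycle(workers)
    pending = deque(tasks)
    completed = {}
    while pending:
        task = pending.popleft()
        for _ in range(len(workers)):
            name = next(names)
            try:
                workers[name](task)
            except Exception:
                continue  # this worker failed: try the next one
            completed[task] = name
            break
        else:
            raise RuntimeError(f"no worker could process {task!r}")
    return completed


def healthy(task):
    return task


def broken(task):
    raise ConnectionError("worker crashed")


result = dispatch(
    ["t1", "t2", "t3", "t4"],
    {"w1": healthy, "w2": broken, "w3": healthy},
)
print(result)  # → {'t1': 'w1', 't2': 'w3', 't3': 'w1', 't4': 'w3'}
```

In this run the tasks that would have gone to the failing worker `w2` end up on `w3`, so no task is lost even though one worker never succeeds.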
The result is that the Instagram application has about 25,000 application threads pushing about 4,000 tasks per second, with tasks completing in 5 to 10 milliseconds. The system has no problem with rolling restarts, it spans data centers well, and they’ve been able to bring new engineers on the team up to speed quickly. Most importantly, having hit a high of over 10,000 simultaneous connections from users posting pictures, they are confident it could scale even further.
To see Branson’s full presentation, including more detail on their configurations and on the different types of tasks, check out the video embedded in the original post. (Reposter's note: the video could not be accessed from mainland China; the link may be dead or blocked.)