why’s kafka so fast
As we all know that Kafka is very fast, much faster than most of its competitors. So what’s the reason here?
Avoid Random Disk Access
Kafka writes everything onto the disk in order and consumers fetch data in order too. So disk access always works sequentially instead of randomly. For traditional hard disks(HDD), sequential access is much faster than random access. Here is a comparison:
hardware | sequential writes | random writes |
---|---|---|
6 * 7200rpm SATA RAID-5 | 300MB/s | 50KB/s |
Kafka Writes Everything Onto The Disk Instead of Memory
Yes, you read that right. Kafka writes everything onto the disk instead of memory. But wait a moment, isn’t memory supposed to be faster than disks? Typically it’s the case, for Random Disk Access. But for sequential access, the difference is much smaller. Here is a comparison taken from https://queue.acm.org/detail.cfm?id=1563874.
As you can see, it’s not that different. But still, sequential memory access is faster than Sequential Disk Access, why not choose memory? Because Kafka runs on top of JVM, which gives us two disadvantages.
1.The memory overhead of objects is very high, often doubling the size of the data stored(or even higher).
2.Garbage Collection happens every now and then, so creating objects in memory is very expensive as in-heap data increases because we will need more time to collect unused data(which is garbage).
So writing to file systems may be better than writing to memory. Even better, we can utilize MMAP(memory mapped files) to make it faster.
Memory Mapped Files(MMAP)
Basically, MMAP(Memory Mapped Files) can map the file contents from the disk into memory. And when we write something into the mapped memory, the OS will flush the change onto the disk sometime later. So everything is faster because we are using memory actually, but in an indirect way. So here comes the question. Why would we use MMAP to write data onto disks, which later will be mapped into memory? It seems to be a roundabout route. Why not just write data into memory directly? As we have learned previously, Kafka runs on top of JVM, if we wrote data into memory directly, the memory overhead would be high and GC would happen frequently. So we use MMAP here to avoid the issue.
Zero Copy
Suppose that we are fetching data from the memory and sending them to the Internet. What is happening in the process is usually twofold.
1.To fetch data from the memory, we need to copy those data from the Kernel Context into the Application Context.
2.To send those data to the Internet, we need to copy the data from the Application Context into the Kernel Context.
As you can see, it’s redundant to copy data between the Kernel Context and the Application Context. Can we avoid it? Yes, using Zero Copy we can copy data directly from the Kernel Context to the Kernel Context.
Batch Data
Kafka only sends data when batch.size
is reached instead of one by one. Assuming the bandwidth is 10MB/s, sending 10MB data in one go is much faster than sending 10000 messages one by one(assuming each message takes 100 bytes).
"I would be fine and made no troubles."
why’s kafka so fast的更多相关文章
- Apache Kafka for Item Setup
At Walmart.com in the U.S. and at Walmart's 11 other websites around the world, we provide seamless ...
- Apache Kafka - Schema Registry
关于我们为什么需要Schema Registry? 参考, https://www.confluent.io/blog/how-i-learned-to-stop-worrying-and-love- ...
- Understanding, Operating and Monitoring Apache Kafka
Apache Kafka is an attractive service because it's conceptually simple and powerful. It's easy to un ...
- Build an ETL Pipeline With Kafka Connect via JDBC Connectors
This article is an in-depth tutorial for using Kafka to move data from PostgreSQL to Hadoop HDFS via ...
- How to choose the number of topics/partitions in a Kafka cluster?
This is a common question asked by many Kafka users. The goal of this post is to explain a few impor ...
- kafka producer源码
producer接口: /** * Licensed to the Apache Software Foundation (ASF) under one or more * contributor l ...
- Flume-ng+Kafka+storm的学习笔记
Flume-ng Flume是一个分布式.可靠.和高可用的海量日志采集.聚合和传输的系统. Flume的文档可以看http://flume.apache.org/FlumeUserGuide.html ...
- Apache Kafka: Next Generation Distributed Messaging System---reference
Introduction Apache Kafka is a distributed publish-subscribe messaging system. It was originally dev ...
- Exploring Message Brokers: RabbitMQ, Kafka, ActiveMQ, and Kestrel--reference
[This article was originally written by Yves Trudeau.] http://java.dzone.com/articles/exploring-mess ...
随机推荐
- JavaScript BOM Cookie 的用法
JavaScript Cookie Cookie是计算机上存储在小文本文件中的数据.当Web服务器将网页发送到浏览器时,连接将关闭,服务器将忘记用户的所有内容.发明Cookie是为了解决“如何记住用户 ...
- CTF必备技能丨Linux Pwn入门教程——栈溢出基础
这是一套Linux Pwn入门教程系列,作者依据i春秋Pwn入门课程中的技术分类,并结合近几年赛事中出现的一些题目和文章整理出一份相对完整的Linux Pwn教程. 课程回顾>>Linux ...
- Java基础回顾一
1.JDK和JRE的区别: JDK:java开发工具包,提供java的开发环境和运行环境 JRE:java运行环境,为java的运行提供所需要的环境 2. ==和qruals的区别: == 基本类型: ...
- idea加载springboot 项目热加载失效
需要打开 help -> find action ->registry ->其中的compiler.automake.allow.when.app.running勾上
- js数据类型详解
一.js数据类型分类 (1)原始数据类型(值类型) null 空类型,变量声明了并赋值为null.转化为数字是0 undefined 未定义,变量声明了但未赋值.转化为数字为NaN boolean 布 ...
- 《SQL Server 2008 R2》 收缩数据库日志文件
USE [master] GO /****** Object: StoredProcedure [dbo].[pro_Shrink_Log] Script Date: 2019/8/16 16:56: ...
- 第16节_BLE协议GAP层
学习资料:官方手册 Vol 3: Core System Package [Host volume] Part C: Generic Access Profile 下面这个图是BLE协议各层跟医院的各 ...
- 201871010123-吴丽丽《面向对象程序设计(Java)》第十一周学习总结
201871010123-吴丽丽<面向对象程序设计(Java)>第十一周学习总结 项目 内容 这个作业属于哪个课程 https://www.cnblogs.com/nwnu-daizh/ ...
- day38_8_22数据库(navicat操作)
补充: exist存在EXISTS关字键字表示存在.在使用EXISTS关键字时,内层查询语句不返回查询的记录,而是返回一个真假值,True或False. 当返回True时,外层查询语句将进行查询当返回 ...
- python-新建文件夹
import tensorflow as tf import os categories = ['folder1', 'folder2'] for folderName in categories: ...