Kafka is known for being very fast, much faster than most of its competitors. So what's the reason?

Avoid Random Disk Access

Kafka writes everything to disk in order, and consumers fetch data in order too. So disk access is always sequential rather than random. For traditional hard disks (HDD), sequential access is much faster than random access. Here is a comparison:

Hardware                | Sequential writes | Random writes
6 * 7200rpm SATA RAID-5 | 300MB/s           | 50KB/s
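
To make the access pattern concrete, here is a minimal sketch of an append-only log in Python. It is not Kafka's actual storage format: the `AppendOnlyLog` class, the 4-byte length prefix, and the file name are assumptions made up for this illustration. The point is only that writes always land at the end of the file and reads scan forward from a known position, so the disk never has to seek back and forth.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Toy append-only log (hypothetical, not Kafka's on-disk format)."""

    HEADER = struct.Struct(">I")  # 4-byte big-endian length prefix per record

    def __init__(self, path):
        self.path = path
        # "ab" opens the file for appending: every write goes to the end.
        self._writer = open(path, "ab")

    def append(self, payload: bytes) -> int:
        """Write one record at the end of the file; return its byte position."""
        position = self._writer.tell()
        self._writer.write(self.HEADER.pack(len(payload)) + payload)
        self._writer.flush()
        return position

    def read_from(self, position: int = 0):
        """Yield (position, payload) pairs, scanning forward from `position`."""
        with open(self.path, "rb") as reader:
            reader.seek(position)
            while True:
                header = reader.read(self.HEADER.size)
                if len(header) < self.HEADER.size:
                    return  # reached the end of the log
                (length,) = self.HEADER.unpack(header)
                yield position, reader.read(length)
                position += self.HEADER.size + length

    def close(self):
        self._writer.close()


with tempfile.TemporaryDirectory() as tmp:
    log = AppendOnlyLog(os.path.join(tmp, "segment.log"))
    positions = [log.append(m) for m in (b"first", b"second", b"third")]
    # A consumer that remembers a position resumes there and reads in order.
    records = [payload for _, payload in log.read_from(positions[1])]
    log.close()

print(positions)  # [0, 9, 19]
print(records)    # [b'second', b'third']
```

A consumer only needs to remember one number (the position it has reached) to continue reading, which is what keeps both sides of the log sequential.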

Kafka Writes Everything Onto The Disk Instead of Memory

Yes, you read that right. Kafka writes everything to disk rather than keeping it in memory. But wait a moment, isn't memory supposed to be faster than disk? For random access, it typically is. For sequential access, however, the difference is much smaller. Here is a comparison taken from https://queue.acm.org/detail.cfm?id=1563874.

The gap is not that large. Still, sequential memory access is faster than sequential disk access, so why not choose memory? Because Kafka runs on the JVM, which brings two disadvantages:

1. The memory overhead of objects is very high, often doubling the size of the data stored (or worse).

2. Garbage collection runs every now and then, and it gets more expensive as the amount of in-heap data grows, because the collector needs more time to find and reclaim unused data (the garbage).

So writing to the file system may be better than writing to memory. Even better, we can use memory-mapped files (MMAP) to make it faster.

Memory Mapped Files (MMAP)

Basically, a memory-mapped file (MMAP) maps the contents of a file on disk into the process's address space. When we write to the mapped memory, the OS flushes the change to disk some time later. Everything is faster because we are effectively writing to memory, just indirectly.

This raises a question: why write data to disk through MMAP, only to have it mapped back into memory? It looks like a roundabout route. Why not write the data into memory directly? As we saw above, Kafka runs on the JVM. If it kept the data in heap memory, the memory overhead would be high and GC would run frequently. Mapped pages live in the OS page cache, outside the JVM heap, so MMAP avoids both problems. (In Kafka itself, the index files are memory-mapped, while log segments are written with ordinary file writes that land in the same OS page cache.)
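
As a small illustration of the mechanism, the sketch below writes to a file through a memory mapping using Python's standard `mmap` module and then reads the bytes back with an ordinary file read. The helper name `write_via_mmap`, the 4096-byte file size, and the file name are assumptions chosen for this example, and it shows the OS feature only, not how Kafka uses it.

```python
import mmap
import os
import tempfile


def write_via_mmap(path: str, payload: bytes, size: int = 4096) -> None:
    """Write `payload` at the start of a file through a memory mapping."""
    if len(payload) > size:
        raise ValueError("payload does not fit into the mapped region")
    with open(path, "wb") as f:
        f.truncate(size)  # an empty file cannot be mapped, so pre-size it
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), size) as mapped:
        # This is a plain memory write; no write() system call is issued here.
        mapped[: len(payload)] = payload
        # Normally the OS writes dirty pages back on its own schedule;
        # flush() asks it to do so now.
        mapped.flush()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "mapped.bin")
    write_via_mmap(path, b"hello kafka")
    with open(path, "rb") as f:
        print(f.read(11))  # b'hello kafka'
```

The data written through the mapping is visible to a normal read of the same file, because both go through the same page cache.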

Zero Copy

Suppose that we are reading data from a file and sending it over the network. Normally the data is copied twice along the way:

1. To read the data, we copy it from the Kernel Context (the OS page cache) into the Application Context.

2. To send the data over the network, we copy it from the Application Context back into the Kernel Context (the socket buffer).

As you can see, copying the data back and forth between the Kernel Context and the Application Context is redundant. Can we avoid it? Yes. With Zero Copy, the data moves directly from one kernel buffer to another without ever passing through the application. On Linux this is the sendfile system call, which Java exposes as FileChannel.transferTo.
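
The idea can be tried out with Python's `socket.sendfile()`, which uses `os.sendfile()` on platforms that support it and falls back to an ordinary read/send loop elsewhere. This is only a sketch of the operating-system feature, not Kafka's code: the helper names, the payload, and the use of a local `socketpair()` in place of a real network connection are assumptions made for the example.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path: str, sock: socket.socket) -> int:
    """Send a whole file over `sock` and return the number of bytes sent."""
    with open(path, "rb") as f:
        # Where os.sendfile() is available, the kernel moves the bytes from
        # the page cache to the socket without copying them into this process.
        return sock.sendfile(f)


def recv_exactly(sock: socket.socket, count: int) -> bytes:
    """Read exactly `count` bytes from `sock` (or fewer if the peer closes)."""
    chunks = []
    while count > 0:
        chunk = sock.recv(min(count, 65536))
        if not chunk:
            break
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


payload = b"kafka-record\n" * 1000  # 13,000 bytes, small enough to buffer

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "segment.log")
    with open(path, "wb") as f:
        f.write(payload)

    # A connected pair of local sockets stands in for broker and consumer.
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sent = send_file_zero_copy(path, sender)
        received = recv_exactly(receiver, sent)

print(sent == len(payload), received == payload)
```

Note that the sending side never holds the file contents in a Python object; it only passes the file handle and the socket to the kernel.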

Batch Data

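The Kafka producer does not send records one by one. It groups records for the same partition into batches of up to batch.size bytes and sends a batch when it is full or when linger.ms has elapsed. Assuming the bandwidth is 10MB/s, sending 1MB of data in one go is much faster than sending 10,000 messages one by one (assuming each message takes 100 bytes), because every request carries its own round trip and protocol overhead.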
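
A rough back-of-the-envelope model shows where the saving comes from. The sketch below groups 10,000 messages of 100 bytes each (1MB in total) into batches and estimates the sending time as a fixed cost per request plus the raw transfer time. The 1 ms per-request overhead is an assumed figure chosen only for illustration, the function names are made up, and the batching ignores the per-record and per-batch headers that a real producer adds. 16384 bytes is Kafka's default batch.size.

```python
def split_into_batches(messages, batch_size):
    """Group messages into batches whose payloads total at most `batch_size` bytes."""
    batches, current, current_bytes = [], [], 0
    for message in messages:
        if current and current_bytes + len(message) > batch_size:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(message)
        current_bytes += len(message)
    if current:
        batches.append(current)
    return batches


def estimated_seconds(request_count, total_bytes,
                      bandwidth_bytes_per_s=10_000_000,   # 10MB/s
                      per_request_overhead_s=0.001):      # assumed, not measured
    """Fixed cost per request plus the time to push the bytes through the link."""
    return request_count * per_request_overhead_s + total_bytes / bandwidth_bytes_per_s


messages = [b"x" * 100] * 10_000
total_bytes = sum(len(m) for m in messages)          # 1,000,000 bytes

batches = split_into_batches(messages, batch_size=16_384)

one_by_one = estimated_seconds(len(messages), total_bytes)
batched = estimated_seconds(len(batches), total_bytes)

print(len(batches))                                  # 62
print(round(one_by_one, 3), round(batched, 3))
```

Under these assumptions the transfer time itself is the same 0.1 s in both cases; the one-by-one variant pays the per-request cost 10,000 times (about 10 s in total), while the batched variant pays it only 62 times.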
