Ewen Cheslack-Postava  March 25, 2015

As part of Confluent Platform 1.0 released about a month ago, we included a new Kafka REST Proxy to allow more flexibility for developers and to significantly broaden the number of systems and languages that can access Apache Kafka clusters. In this post, I’ll explain the REST Proxy’s features, how it works, and why we built it.

What is the REST Proxy and why do you need one?

The REST Proxy is an open source HTTP-based proxy for your Kafka cluster. The API supports many interactions with your cluster, including producing and consuming messages and accessing cluster metadata such as the set of topics and mapping of partitions to brokers. Just as with Kafka, it can work with arbitrary binary data, but also includes first-class support for Avro and integrates well with Confluent’s Schema Registry. And it is scalable, designed to be deployed in clusters and work with a variety of load balancing solutions.

We built the REST Proxy first and foremost to meet the growing demands of many organizations that want to use Kafka, but also want more freedom to select languages beyond those for which stable native clients exist today. However, it also includes functionality beyond traditional clients, making it useful for building tools for managing your Kafka cluster. See the documentation for a more detailed description of the included features.

A quick example

If you’ve used the Confluent Platform Quickstart to start a local test cluster, starting the REST Proxy for your local Kafka cluster should be as simple as running

$ kafka-rest-start

To use it with a real cluster, you only need to specify a few connection settings. The proxy includes good default settings so you can start using it without any need for customization.
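
As a sketch, a minimal configuration for a real cluster might look like the following. The property names (id, zookeeper.connect, schema.registry.url) come from the REST Proxy configuration documentation, but the hosts are placeholders and you should verify the names against the version you deploy:

$ cat > kafka-rest.properties <<EOF
# illustrative settings only -- substitute your own ZooKeeper and Schema Registry hosts
id=kafka-rest-server-1
zookeeper.connect=zk1.example.com:2181,zk2.example.com:2181
schema.registry.url=http://schemaregistry.example.com:8081
EOF
$ kafka-rest-start kafka-rest.properties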

The complete API provides too much functionality to cover in this blog post, but as an example I’ll show a couple of the most common use cases. To keep things language agnostic, I’ll just show the cURL commands. Producing messages is as easy as:
$ curl -i -X POST -H "Content-Type: application/vnd.kafka.avro.v1+json" --data '{
   "value_schema": "{\"type\": \"record\", \"name\": \"User\", \"fields\": [{\"name\": \"username\", \"type\": \"string\"}]}",
   "records": [
     {"value": {"username": "testUser"}},
     {"value": {"username": "testUser2"}}
   ]
}' \
http://localhost:8082/topics/avrotest

This sends an HTTP request using the POST method to the endpoint http://localhost:8082/topics/avrotest, which is a resource representing the topic avrotest. The content type, application/vnd.kafka.avro.v1+json, indicates that the data in the request is intended for the Kafka REST Proxy (application/vnd.kafka), contains Avro keys and values (.avro), uses the first API version (.v1), and is JSON encoded (+json). The payload contains a value schema to specify the format of the data (records with a single field username) and a set of records. Records can specify a key, value, and partition, but in this case we only include the value. The values are just JSON objects because the REST Proxy can automatically translate Avro’s JSON encoding, which is more convenient for your applications, into the more efficient binary encoding you want to store in Kafka.

The server will respond with:
HTTP/1.1 200 OK
Content-Length: 209
Content-Type: application/vnd.kafka.v1+json
Server: Jetty(8.1.16.v20140903)

{
   "key_schema_id": null,
   "value_schema_id": 1,
   "offsets": [
     {"partition": 0, "offset": 0, "error_code": null, "error": null},
     {"partition": 0, "offset": 1, "error_code": null, "error": null}
   ]
}

This indicates the request was successful (200 OK) and returns some information about the messages. Schema IDs are included, which can be used as shorthand for the same schema in future requests. Information about any error for an individual message (error and error_code) is provided in case of failure, and is null on success. Successfully recorded messages include the partition and offset of the message.
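
For example, a follow-up produce request can reference the registered schema by ID rather than repeating the full schema (a sketch assuming the v1 produce request accepts a value_schema_id field, as described in the API documentation; the ID value 1 matches the response above):

$ curl -i -X POST -H "Content-Type: application/vnd.kafka.avro.v1+json" \
--data '{"value_schema_id": 1, "records": [{"value": {"username": "testUser3"}}]}' \
http://localhost:8082/topics/avrotest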

Consuming messages requires a bit more setup and cleanup, but is also easy. Here we’ll consume the messages we just produced to the topic avrotest. Start by creating the consumer, which will return a base URI you use for all subsequent requests:
$ curl -i -X POST -H "Content-Type: application/vnd.kafka.v1+json" \
--data '{"format": "avro", "auto.offset.reset": "smallest"}' \
http://localhost:8082/consumers/my_avro_consumer

HTTP/1.1 200 OK
Content-Length: 121
Content-Type: application/vnd.kafka.v1+json
Server: Jetty(8.1.16.v20140903)

{
  "instance_id": "rest-consumer-1",
   "base_uri": "http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1"
}

Notice that we specified that we’ll consume Avro messages and that we want to read from the beginning of topics we subscribe to. Next, just GET a topic resource under that consumer to start receiving messages:
$ curl -i -X GET -H "Accept: application/vnd.kafka.avro.v1+json" \
http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1/topics/avrotest

HTTP/1.1 200 OK
Content-Length: 134
Content-Type: application/vnd.kafka.avro.v1+json
Server: Jetty(8.1.16.v20140903)

[
  {
     "key": null,
     "value": {"username": "testUser"},
     "partition": 0,
     "offset": 0
   },
   {
     "key": null,
     "value": {"username": "testUser2"},
     "partition": 0,
     "offset": 1
   }
]

When you’re done consuming or need to shut down this instance, you should try to clean up the consumer so other instances in the same group can quickly pick up the partitions that had been assigned to this consumer:
$ curl -i -X DELETE \
http://localhost:8082/consumers/my_avro_consumer/instances/rest-consumer-1

HTTP/1.1 204 No Content
Server: Jetty(8.1.16.v20140903)
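
Cluster metadata is available through the same interface. As a quick sketch (the paths below follow the v1 API documentation; responses are omitted since they depend on your cluster):

$ curl http://localhost:8082/topics
$ curl http://localhost:8082/topics/avrotest
$ curl http://localhost:8082/topics/avrotest/partitions
$ curl http://localhost:8082/brokers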

This is only a short example of the API, but hopefully shows how simple it is to work with. The example above can quickly be adapted to any language with a good HTTP library in just a few lines of code. Using the API in real applications is just as simple.

How does the REST Proxy work?

I’ll leave a detailed explanation of the REST Proxy implementation for another post, but I do want to highlight some high level design decisions.

HTTP wrapper of Java libraries – At its core, the REST Proxy simply wraps the existing libraries provided with the Apache Kafka open source project. This includes not only the producer and consumer you would expect, but also access to cluster metadata and admin operations. Currently this means using some internal (but publicly visible) interfaces that may require ongoing maintenance because that code has no compatibility guarantees; the REST Proxy therefore depends on specific versions of the Kafka libraries, and the Confluent Platform packaging ensures it uses a compatible version. As the Kafka code standardizes interfaces (and protocols) for these operations, we’ll be able to rely on those public interfaces and be less tied to particular Kafka versions.

JSON with flexible embedded data – Good JSON libraries are available just about everywhere, so this was an easy choice. However, REST Proxy requests need to include embedded data — the serialized key and value data that Kafka deals with. To make this flexible, we use vendor specific content types in Content-Type and Accept headers to make the format of the data explicit. We recommend using Avro to help your organization build a manageable and scalable stream data platform. Avro has a JSON encoding, so you can embed your data in a natural, readable way, as the introductory example demonstrated. Schemas need to be included with every request so they can be registered and validated against the Confluent Schema Registry and the data can be serialized using the more efficient binary encoding. To avoid the overhead of sending schemas with every request, API responses include the schema ID that the Schema Registry returns, which can subsequently be used in place of the full schema. If you opt to use raw binary data, it cannot be embedded directly in JSON, so the API uses a string containing the base64 encoded data.
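
As a rough sketch, producing raw bytes therefore looks much like the Avro example, but with the binary content type and base64 strings for the values ("S2Fma2E=" below is simply the base64 encoding of the string "Kafka", and the topic name is a placeholder):

$ curl -i -X POST -H "Content-Type: application/vnd.kafka.binary.v1+json" \
--data '{"records": [{"value": "S2Fma2E="}]}' \
http://localhost:8082/topics/binarytest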

Stateful consumers – Consumers are stateful and tied to a particular proxy instance. This is actually a violation of REST principles, which state that requests should be stateless and should contain all the information necessary to serve them. There are ways to make the REST Proxy layer stateless, e.g., by moving consumer state to a separate service, just as a separate database is used to manage state for a RESTful API. However, we felt the added complexity and the cost of extra network hops and latency wasn’t warranted in this case. The approach we use is simpler and the consumer implementation already provides fault tolerance: if one instance of the proxy fails, other consumers in the same group automatically pick up the slack until the failed consumers can be restarted.

Distributed and load balanced – Despite consumers being tied to the proxy instance that created them, the entire REST Proxy API is designed to be accessed via any load balancing mechanism (e.g. discovery services, round robin DNS, or software load balancing proxy). To support consumers, each instance must be individually addressable by clients, but all non-consumer requests can be handled by any REST Proxy instance.

Why a REST Proxy?

In addition to the official Java clients, there are more than a dozen community-contributed clients across a variety of languages listed on the wiki. So why bother writing the REST Proxy at all?

It turns out that writing a feature complete, high performance Kafka client is currently a pretty difficult endeavor for a few reasons:

Flexible consumer group model – Kafka has a consumer group model, as shown in the figure below, that generalizes a few different messaging models, including both queuing and publish-subscribe semantics. However, the protocol currently leaves much of the work to clients. This has some significant benefits for brokers, e.g., minimizing the work they do and making performance very predictable. But it also means Kafka consumers are “thick clients” that have to implement complex algorithms for features like partition load balancing and offset management. As a result, it’s common to find Kafka client libraries supporting only a fraction of the Java client’s functionality, sometimes omitting high-level consumer support entirely.

Kafka’s consumer group model supports multiple consumers on the same topic, organized into groups of variable and dynamic size, and supports offset management. This is very flexible, scalable, and fault tolerant, but means non-Java clients have to implement more functionality to achieve feature parity with the Java clients.

Low-level networking – Even a small Kafka cluster on moderate hardware can support very high throughput, but writing a client that takes advantage of this requires low-level networking code. Producers need to efficiently batch messages to achieve high throughput while minimizing latency, manage many connections to brokers (usually using an async I/O API), and gracefully handle a wide variety of edge cases and errors. It’s a substantial time investment to write, debug, and maintain these libraries.

Protocol evolution – The 0.8 releases come with a promise of compatibility and a new versioned protocol to enable this. The community has also implemented a new approach to making user visible changes, both to catch compatibility problems before they occur and to act as documentation for users. As the protocol evolves and new features are added, client implementations need to add support for new features and remove deprecated ones. By leveraging the official Java clients, the REST Proxy picks up most of these changes and fixes for free. Some new features will require REST API changes, but careful API design that anticipates new features will help insulate applications from these changes.

The REST Proxy is a simple solution to these problems: instead of waiting for a good, feature-complete client for your language of choice, you can use a simple HTTP-based interface that should be easily accessible from just about any language.

But there were already a few REST proxies for Kafka out there, so why build another? Although the existing proxies were good — and informed the design of the Confluent REST Proxy — it was clear that they were written to address a few specific use cases. We wanted to provide a truly comprehensive interface to Kafka, which includes administrative actions in addition to complete producer and consumer functionality. We haven’t finished yet — today we provide read-only access to cluster metadata but plan to add support for administrative actions such as partition assignment — but soon you should be able to both operate and use your cluster entirely through the REST Proxy.

Finally, it’s worth pointing out that the goal of the REST Proxy is not to replace existing clients. For example, if you’re using C or Python and only need to produce or consume messages, you may be better served by the existing high quality librdkafka library for C/C++ or kafka-python library for Python. Confluent is invested in development of native clients for a variety of popular languages. However, the REST Proxy provides a broad-reaching solution very quickly and supports languages where no developer has yet taken on the task of developing a client.

Tradeoffs

Of course, there are some tradeoffs to using the REST Proxy instead of native clients for your language, but in many cases these tradeoffs are acceptable when you consider the capabilities you get with the REST Proxy.

The most obvious tradeoff is complexity: the REST Proxy is yet another service you need to run and maintain. We’ve tried to make this as simple as possible, but adding any service to your stack has a cost.

Another cost is in performance. The REST Proxy adds extra processing: clients construct and make HTTP requests, and the REST Proxy needs to parse requests, transform data between formats for both produce and consume requests, and handle all the interaction with Kafka itself, at least doubling bandwidth usage. And the cost isn’t simply a doubling, because the REST Proxy can’t use all the same optimizations the Kafka broker can (e.g., sendfile is not an option for proxy consumer responses). Still, the REST Proxy can achieve high throughput. I’ll discuss benchmarking numbers in more detail in a future post, but the most recent version of the proxy achieves about ⅔ the throughput of the Java clients when producing binary messages and about ⅕ the rate when consuming binary messages. Some of this performance can be regained with higher performance CPUs on your REST Proxy instances, because a significant fraction of the reduction is due to increased CPU usage (parsing JSON, decoding embedded content in produce requests or encoding it in responses, and serializing JSON responses).

The REST Proxy also exposes a very limited subset of the normal client configuration options to applications. It is designed to act as shared infrastructure, sitting between your Kafka cluster and many of your applications. Limiting configuration of many settings to the operator of the proxy in this multi-tenant environment is important to maintaining stability. One good example where this is critical is consumer memory settings. To achieve good throughput, consumer responses should include a relatively large amount of data to amortize per-request overheads over many messages. But the proxy needs to buffer that amount of data for each consumer, so it would be risky to give complete control of that setting to each application that creates consumers. One misbehaving consumer could affect all your applications. Future versions should offer more control, but these options need to be added to the API with considerable care.
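
As an illustration of the kind of setting that stays in the operator’s hands, the proxy’s configuration includes a cap on how many bytes a single consumer request may return. The property name below is taken from the REST Proxy configuration documentation, but treat it (and the value) as an assumption to verify against your version:

# in kafka-rest.properties, set by the proxy operator -- illustrative value only
consumer.request.max.bytes=33554432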

Finally, unless you have a language-specific wrapper of the API, you won’t get an idiomatic API for your language. This can have a significant impact on the verbosity and readability of your code. However, this is also the simplest problem to fix, and at relatively low cost. For example, to evaluate and validate the API, I wrote a thin node.js wrapper of the API. The entire library clocks in at about 650 lines of non-comment non-whitespace code, and a simple example that ingests data from Twitter’s public stream and computes trending hashtags is only about 200 lines of code.

Conclusion

This first version of the REST Proxy provides basic metadata, producer, and consumer support using Avro or arbitrary binary data. We hope it will enable broader adoption of Kafka by making all of Kafka’s features accessible, no matter what language or technology stack you’re working with.

We’re just getting started with the REST Proxy and have a lot more functionality planned for future versions. You can get some idea of where we’re heading by looking at the features we’ve noted are missing right now, but we also want to hear from you about what features you would find most useful, either on our community mailing list, on the open source project’s issue tracker, or through pull requests.
