https://buoyant.io/2017/04/25/whats-a-service-mesh-and-why-do-i-need-one/

Update 2018-02-06: Since this article was published, we’ve launched Conduit, an open source, ultralight service mesh for Kubernetes! To learn more, check out our Conduit launch blog post.

tl;dr: A service mesh is a dedicated infrastructure layer for making service-to-service communication safe, fast, and reliable. If you’re building a cloud native application, you need a service mesh.

Over the past year, the service mesh has emerged as a critical component of the cloud native stack. High-traffic companies like PayPal, Ticketmaster, and Credit Karma have all added a service mesh to their production applications, and this January, Linkerd, the open source service mesh for cloud native applications, became an official project of the Cloud Native Computing Foundation. But what is a service mesh, exactly? And why is it suddenly relevant?

In this article, I’ll define the service mesh and trace its lineage through shifts in application architecture over the past decade. I’ll distinguish the service mesh from the related, but distinct, concepts of API gateways, edge proxies, and the enterprise service bus. Finally, I’ll describe where the service mesh is heading, and what to expect as this concept evolves alongside cloud native adoption.

WHAT IS A SERVICE MESH?

A service mesh is a dedicated infrastructure layer for handling service-to-service communication. It’s responsible for the reliable delivery of requests through the complex topology of services that comprise a modern, cloud native application. In practice, the service mesh is typically implemented as an array of lightweight network proxies that are deployed alongside application code, without the application needing to be aware. (But there are variations to this idea, as we’ll see.)
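
To make the proxy idea concrete, here is a minimal sketch (in Go) of a sidecar: a process that listens on localhost, forwards the application’s outbound HTTP traffic to an upstream service, and records latency along the way. The port and the upstream address are illustrative assumptions, not Linkerd’s actual configuration.

```go
// sidecar.go: a toy sidecar proxy, for illustration only.
// The application sends outbound requests to localhost:4140;
// the proxy forwards them upstream and records the latency.
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"time"
)

func main() {
	// Hypothetical upstream address; a real mesh would resolve this
	// dynamically via service discovery.
	upstream, err := url.Parse("http://service-b.internal:8080")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		start := time.Now()
		proxy.ServeHTTP(w, r)
		// A real mesh would export these metrics to a centralized system.
		log.Printf("%s %s took %v", r.Method, r.URL.Path, time.Since(start))
	})

	// The application only ever talks to localhost; it does not need
	// to know where service-b actually lives.
	log.Fatal(http.ListenAndServe("127.0.0.1:4140", handler))
}
```

Because the proxy sits outside the application process, the same reliability and telemetry logic works regardless of what language each service is written in.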

The concept of the service mesh as a separate layer is tied to the rise of the cloud native application. In the cloud native model, a single application might consist of hundreds of services; each service might have thousands of instances; and each of those instances might be in a constantly-changing state as they are dynamically scheduled by an orchestrator like Kubernetes. Not only is service communication in this world incredibly complex, it’s a pervasive and fundamental part of runtime behavior. Managing it is vital to ensuring end-to-end performance and reliability.

IS THE SERVICE MESH A NETWORKING MODEL?

The service mesh is a networking model that sits at a layer of abstraction above TCP/IP. It assumes that the underlying L3/L4 network is present and capable of delivering bytes from point to point. (It also assumes that this network, as with every other aspect of the environment, is unreliable; the service mesh must therefore also be capable of handling network failures.)

In some ways, the service mesh is analogous to TCP/IP. Just as the TCP stack abstracts the mechanics of reliably delivering bytes between network endpoints, the service mesh abstracts the mechanics of reliably delivering requests between services. Like TCP, the service mesh doesn’t care about the actual payload or how it’s encoded. The application has a high-level goal (“send something from A to B”), and the job of the service mesh, like that of TCP, is to accomplish this goal while handling any failures along the way.

Unlike TCP, the service mesh has a significant goal beyond “just make it work”: it provides a uniform, application-wide point for introducing visibility and control into the application runtime. The explicit goal of the service mesh is to move service communication out of the realm of the invisible, implied infrastructure, and into the role of a first-class member of the ecosystem—where it can be monitored, managed and controlled.

WHAT DOES A SERVICE MESH ACTUALLY DO?

Reliably delivering requests in a cloud native application can be incredibly complex. A service mesh like Linkerd manages this complexity with a wide array of powerful techniques: circuit-breaking, latency-aware load balancing, eventually consistent (“advisory”) service discovery, retries, and deadlines. These features must all work in conjunction, and the interactions between these features and the complex environment in which they operate can be quite subtle.

For example, when a request is made to a service through Linkerd, a very simplified timeline of events is as follows (a code sketch of several of these steps follows the list):

  1. Linkerd applies dynamic routing rules to determine which service the requester intended. Should the request be routed to a service in production or in staging? To a service in a local datacenter or one in the cloud? To the most recent version of a service that’s being tested or to an older one that’s been vetted in production? All of these routing rules are dynamically configurable, and can be applied both globally and for arbitrary slices of traffic.
  2. Having found the correct destination, Linkerd retrieves the corresponding pool of instances from the relevant service discovery endpoint, of which there may be several. If this information diverges from what Linkerd has observed in practice, Linkerd makes a decision about which source of information to trust.
  3. Linkerd chooses the instance most likely to return a fast response based on a variety of factors, including its observed latency for recent requests.
  4. Linkerd attempts to send the request to the instance, recording the latency and response type of the result.
  5. If the instance is down, unresponsive, or fails to process the request, Linkerd retries the request on another instance (but only if it knows the request is idempotent).
  6. If an instance is consistently returning errors, Linkerd evicts it from the load balancing pool, to be periodically retried later (for example, an instance may be undergoing a transient failure).
  7. If the deadline for the request has elapsed, Linkerd proactively fails the request rather than adding load with further retries.
  8. Linkerd captures every aspect of the above behavior in the form of metrics and distributed tracing, which are emitted to a centralized metrics system.
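
To make several of these steps concrete, here is a toy sketch (in Go) of a client-side balancer that chooses instances by observed latency, enforces a deadline, retries only idempotent requests, and evicts instances that fail repeatedly. It illustrates the general techniques behind steps 3 through 7; it is not Linkerd’s implementation, and every name and threshold in it is an illustrative assumption.

```go
// Package mesh sketches latency-aware balancing, deadlines,
// idempotent-only retries, and eviction of failing instances.
package mesh

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

type instance struct {
	addr       string
	ewmaMillis float64 // exponentially weighted moving average of latency
	failures   int     // consecutive errors, used for eviction
}

type balancer struct {
	pool []*instance
}

// pick uses "power of two choices": sample two instances at random
// and take the one with the lower observed latency (step 3).
func (b *balancer) pick() (*instance, error) {
	if len(b.pool) == 0 {
		return nil, errors.New("no healthy instances")
	}
	x := b.pool[rand.Intn(len(b.pool))]
	y := b.pool[rand.Intn(len(b.pool))]
	if x.ewmaMillis <= y.ewmaMillis {
		return x, nil
	}
	return y, nil
}

// record updates the latency estimate and evicts instances that are
// consistently returning errors (steps 4 and 6).
func (b *balancer) record(in *instance, elapsed time.Duration, failed bool) {
	const decay = 0.8
	in.ewmaMillis = decay*in.ewmaMillis + (1-decay)*float64(elapsed.Milliseconds())
	if !failed {
		in.failures = 0
		return
	}
	in.failures++
	if in.failures >= 3 { // arbitrary threshold, for illustration
		b.evict(in)
	}
}

func (b *balancer) evict(target *instance) {
	for i, in := range b.pool {
		if in == target {
			b.pool = append(b.pool[:i], b.pool[i+1:]...)
			return
		}
	}
}

// do sends the request with a deadline carried by ctx, retrying on
// another instance only when the request is idempotent (steps 5 and 7).
func (b *balancer) do(ctx context.Context, path string, idempotent bool) error {
	for {
		if err := ctx.Err(); err != nil {
			return fmt.Errorf("deadline elapsed, failing fast: %w", err)
		}
		in, err := b.pick()
		if err != nil {
			return err
		}
		start := time.Now()
		req, _ := http.NewRequestWithContext(ctx, http.MethodGet, "http://"+in.addr+path, nil)
		resp, err := http.DefaultClient.Do(req)
		failed := err != nil || resp.StatusCode >= 500
		if resp != nil {
			resp.Body.Close()
		}
		b.record(in, time.Since(start), failed)
		if !failed {
			return nil
		}
		if !idempotent {
			return errors.New("request failed and is not safe to retry")
		}
		// Otherwise loop: retry on another instance until the deadline.
	}
}
```

The “power of two choices” trick keeps instance selection cheap while still steering traffic away from slow instances, and the idempotency check in do is what makes the retry in step 5 safe.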

And that’s just the simplified version: Linkerd can also initiate and terminate TLS, perform protocol upgrades, dynamically shift traffic, and fail over between datacenters!

(Diagram: the Linkerd service mesh manages service-to-service communication and decouples it from application code.)

It’s important to note that these features are intended to provide both pointwise resilience and application-wide resilience. Large-scale distributed systems, no matter how they’re architected, have one defining characteristic: they provide many opportunities for small, localized failures to escalate into system-wide catastrophic failures. The service mesh must be designed to safeguard against these escalations by shedding load and failing fast when the underlying systems approach their limits.
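
As a small illustration of that last point, here is a sketch (again in Go, with an arbitrary limit chosen purely for illustration) of load shedding: when a service is already handling as many requests as it can, new ones are rejected immediately rather than queued, so load cannot pile up and escalate.

```go
package mesh

import "net/http"

// shedLoad wraps a handler with a concurrency limit. When maxInFlight
// requests are already in progress, new requests fail fast with a 503
// instead of queueing and adding load to a struggling service.
func shedLoad(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}: // capacity available: handle the request
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default: // at the limit: fail fast rather than escalate
			http.Error(w, "overloaded", http.StatusServiceUnavailable)
		}
	})
}
```

Rejecting early can feel counterintuitive, but a fast 503 gives callers an immediate signal to back off, while a queued request ties up resources on both sides of the connection.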

WHY IS THE SERVICE MESH NECESSARY?

The service mesh is ultimately not an introduction of new functionality, but rather a shift in where functionality is located. Web applications have always had to manage the complexity of service communication. The origins of the service mesh model can be traced through the evolution of these applications over the past decade and a half.

Consider the typical architecture of a medium-sized web application in the 2000s: the three-tiered app. In this model, application logic, web serving logic, and storage logic are each a separate layer. The communication between layers, while complex, is limited in scope—there are only two hops, after all. There is no “mesh”, but there is communication logic between hops, handled within the code of each layer.

When this architectural approach was pushed to very high scale, it started to break. Companies like Google, Netflix, and Twitter, faced with massive traffic requirements, implemented what was effectively a predecessor of the cloud native approach: the application layer was split into many services (sometimes called “microservices”), and the tiers became a topology. In these systems, a generalized communication layer became suddenly relevant, but typically took the form of a “fat client” library—Twitter’s Finagle, Netflix’s Hystrix, and Google’s Stubby being cases in point.

In many ways, libraries like Finagle, Stubby, and Hystrix were the first service meshes. While they were specific to the details of their surrounding environment, and required the use of specific languages and frameworks, they were forms of dedicated infrastructure for managing service-to-service communication, and (in the case of the open source Finagle and Hystrix libraries) found use outside of their origin companies.

Fast forward to the modern cloud native application. The cloud native model combines the microservices approach of many small services with two additional factors: containers (e.g. Docker), which provide resource isolation and dependency management, and an orchestration layer (e.g. Kubernetes), which abstracts away the underlying hardware into a homogenous pool.

These three components allow applications to adapt with natural mechanisms for scaling under load and for handling the ever-present partial failures of the cloud environment. But with hundreds of services or thousands of instances, and an orchestration layer that’s rescheduling instances from moment to moment, the path that a single request follows through the service topology can be incredibly complex. And since containers make it easy for each service to be written in a different language, the library approach is no longer feasible.

This combination of complexity and criticality motivates the need for a dedicated layer for service-to-service communication decoupled from application code and able to capture the highly dynamic nature of the underlying environment. This layer is the service mesh.

THE FUTURE OF THE SERVICE MESH

While service mesh adoption in the cloud native ecosystem is growing rapidly, there is still an extensive and exciting roadmap to explore. The requirements for serverless computing (e.g. Amazon’s Lambda) fit directly into the service mesh’s model of naming and linking, and form a natural extension of its role in the cloud native ecosystem. The roles of service identity and access policy are still very nascent in cloud native environments, and the service mesh is well poised to play a fundamental part in the story here. Finally, the service mesh, like TCP/IP before it, will continue to be pushed further into the underlying infrastructure. Just as Linkerd evolved from systems like Finagle, the current incarnation of the service mesh as a separate, user-space proxy that must be explicitly added to a cloud native stack will also continue to evolve.

CONCLUSION

The service mesh is a critical component of the cloud native stack. A little more than a year after its launch, Linkerd is part of the Cloud Native Computing Foundation and has a thriving community of contributors and users. Adopters range from startups like Monzo, which is disrupting the UK banking industry, to high-scale Internet companies like PayPal, Ticketmaster, and Credit Karma, to companies that have been in business for hundreds of years, like Houghton Mifflin Harcourt.

The Linkerd open source community of adopters and contributors is demonstrating the value of the service mesh model every day. We’re committed to building an amazing product and continuing to grow our incredible community. Join us!
