Build Telemetry for Distributed Services: Elastic APM
Official site: https://www.elastic.co/guide/en/apm/get-started/current/index.html
Overview
Elastic APM is an application performance monitoring system built on the Elastic Stack. It allows you to monitor software services and applications in real time — collect detailed performance information on response time for incoming requests, database queries, calls to caches, external HTTP requests, and more. This makes it easy to pinpoint and fix performance problems quickly.
Elastic APM also automatically collects unhandled errors and exceptions. Errors are grouped based primarily on the stacktrace, so you can identify new errors as they appear and keep an eye on how many times specific errors happen.
Metrics are another important source of information when debugging production systems. Elastic APM agents automatically pick up basic host-level metrics and agent specific metrics, like JVM metrics in the Java Agent, and Go runtime metrics in the Go Agent.
Components and documentation
Elastic APM consists of four components: APM Agents, APM Server, Elasticsearch, and Kibana.
APM Agents
APM agents are open source libraries written in the same language as your service. You may only need one, or you might use all of them. You install them into your service as you would install any other library. They instrument your code and collect performance data and errors at runtime. This data is buffered for a short period and sent on to APM Server.
Each agent has its own documentation:
- Go agent
- Java agent
- .NET agent
- Node.js agent
- Python agent
- Ruby agent
- JavaScript Real User Monitoring (RUM) agent
APM Server
APM Server is an open source application that receives performance data from your APM agents. It’s a separate component by design, which helps keep the agents light, prevents certain security risks, and improves compatibility across the Elastic Stack.
After the APM Server has validated and processed events from the APM agents, the server transforms the data into Elasticsearch documents and stores them in corresponding Elasticsearch indices. In a matter of seconds you can start viewing your application performance data in the Kibana APM UI.
The APM Server reference provides everything you need when it comes to working with the server. Here you can learn about installation, configuration, security, monitoring, and more.
Elasticsearch
Elasticsearch is a highly scalable open-source full-text search and analytics engine. It allows you to store, search, and analyze large volumes of data quickly and in near real time. Elasticsearch is used to store APM performance metrics and make use of its aggregations.
APM Kibana UI
Kibana is an open source analytics and visualization platform designed to work with Elasticsearch. You use Kibana to search, view, and interact with data stored in Elasticsearch.
Since application performance monitoring is all about visualizing data and detecting bottlenecks, it’s crucial you understand how to use the Kibana APM UI. The following sections will help you get started:
APM also has built-in integrations with Machine Learning. To learn more about this feature, refer to the Kibana UI documentation for Machine learning integration.
Visualizing Application Bottlenecks
Elastic APM captures different types of information from within instrumented applications:
- Spans contain information about a specific code path that has been executed. They measure from the start to end of an activity, and they can have a parent/child relationship with other spans.
- Transactions are a special kind of span that have extra metadata associated with them. You can think of transactions as the highest level of work you’re measuring within a service. As an example, a transaction could be a request to your server, a batch job, or a custom transaction type.
- Errors contain information about the original exception that occurred or about a log created when the exception occurred.
Each of these information types has a specific page associated with it in the APM UI. These pages display the captured data in curated charts and tables that allow you to easily compare and debug your applications.
For example, you can see information about response times, requests per minute, and status codes per endpoint. You can even dive into a specific request sample and get a complete waterfall view of what your application is spending its time on. You might see that your bottlenecks are in database queries, cache calls, or external requests. For each incoming request and each application error, you can also see contextual information such as the request header, user information, system values, or custom data that you manually attached to the request.
Having access to application-level insights with just a few clicks can drastically decrease the time you spend debugging errors, slow response times, and crashes.
Using APM
APM is designed to be as intuitive as possible, but you might come across certain terms or concepts that don’t feel native to you. Not to worry, we’ve created this guide to help you get the most out of Elastic APM.
APM is available via the navigation sidebar in Kibana.
- Filters
- Services overview
- Traces overview
- Transaction overview
- Span timeline visualization
- Debug errors
- Metrics overview
- Machine learning integration
- Advanced queries
Services overview
The Services overview gives you quick insights into the health and general performance of each service.
You can add services by setting the service.name configuration in each of the APM agents you’re instrumenting.
Traces overview
The Traces overview displays the entry transaction for all traces in your application. If you’re using Distributed tracing, this view is key to finding the critical paths within your application. Transactions with the same name are grouped together and only shown once in this table.
By default, transactions are sorted by Impact. Impact helps show the most used and slowest endpoints in your service - in other words, it’s the collective amount of pain a specific endpoint is causing your users. If there’s a particular endpoint you’re worried about, you can click on it to view the transaction details.
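The text does not give a formula for Impact. The sketch below assumes it is proportional to the total time spent in a transaction group (call count times average duration), which matches the "most used and slowest" description. The function name `rank_by_impact` and the sample data are hypothetical.

```python
from collections import defaultdict


def rank_by_impact(transactions):
    """Group (name, duration_ms) pairs by name and rank groups by total time."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, duration_ms in transactions:
        totals[name] += duration_ms
        counts[name] += 1
    # Guard against division by zero when every duration is 0.
    grand_total = sum(totals.values()) or 1.0
    ranked = [
        {
            "name": name,
            "count": counts[name],
            "avg_ms": totals[name] / counts[name],
            # Assumption: impact is this group's share of all time spent.
            "impact": totals[name] / grand_total,
        }
        for name in totals
    ]
    return sorted(ranked, key=lambda group: group["impact"], reverse=True)


sample = [
    ("GET /users/:id", 20.0),
    ("GET /users/:id", 30.0),
    ("GET /users/:id", 25.0),
    ("POST /orders", 60.0),
    ("GET /health", 1.0),
]
for group in rank_by_impact(sample):
    print(group["name"], group["count"], round(group["impact"], 2))
```

In this sample the frequently called `GET /users/:id` ranks above the slower but rarer `POST /orders`, because its total time is larger.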
Distributed tracing
Elastic APM supports distributed tracing. Distributed tracing is a key feature of modern application performance monitoring as application architectures are shifting from monolithic to more distributed, service-based architectures.
Distributed tracing allows APM users to automatically trace requests all the way through the service architecture, and visualize those traces in one single view in the APM UI. This is accomplished by tracing all of the requests, from the initial web request to your front-end service, to queries made to your back-end services. This makes finding possible bottlenecks throughout your application much easier and faster.
By definition, a distributed trace includes more than one transaction. You can use the span timeline visualization to view a waterfall display of all of the transactions from individual services that are connected in a trace.
Distributed tracing is supported by all APM agents and there’s no additional configuration needed.
Transaction overview
A transaction describes an event captured by an Elastic APM agent instrumenting a service. The APM agents automatically collect performance metrics on HTTP requests, database queries, and much more.
Selecting a service brings you to the transactions overview. The time spent by span type, transaction duration, and requests per minute charts display information on all transactions associated with the selected service. The Transactions table, however, provides only a list of transaction groups for the selected service. In other words, this view groups all transactions of the same name together, and only displays one transaction for each group.
Time spent by span type (beta): This functionality is in beta and is subject to change. The design and code are less mature than official GA features and are being provided as-is with no warranties. Beta features are not subject to the support SLA of official GA features.
Certain agents support breakdown graphs in the APM UI. This graph is an easy way to visualize where your application is spending most of its time. For example, is your app spending time in external calls, database processing, or application code execution?
The time a transaction took to complete is also recorded and displayed on the chart under the "app" label. "app" indicates that something was happening within the application, but we’re not sure exactly what. This could be a sign that the agent does not have auto-instrumentation for whatever was happening during that time.
It’s important to note that if you have asynchronous spans, the sum of all span times may exceed the duration of the transaction.
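A small worked example may make the asynchronous case concrete. The sketch below represents spans as hypothetical `(start_ms, end_ms)` pairs and compares the plain sum of span durations with the time actually covered by at least one span. Treating the uncovered remainder as "app" time is an assumption for illustration, not a description of how the agents compute the breakdown.

```python
def covered_time(spans):
    """Return the total time covered by at least one span (union of intervals)."""
    covered = 0.0
    current_start = current_end = None
    for start, end in sorted(spans):
        if current_end is None or start > current_end:
            # A gap: close the previous merged interval and open a new one.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlap: extend the merged interval.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


transaction = (0.0, 100.0)            # one transaction lasting 100 ms
spans = [(10.0, 70.0), (20.0, 90.0)]  # two overlapping asynchronous calls

span_sum = sum(end - start for start, end in spans)                 # 60 + 70 = 130
app_time = (transaction[1] - transaction[0]) - covered_time(spans)  # 100 - 80 = 20
print(span_sum, app_time)
```

The span durations add up to 130 ms even though the transaction took only 100 ms, because the two calls ran concurrently for 50 ms.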
If the Time spent by span type chart is missing in the APM UI, it means your agent does not support this feature yet.
Transaction duration shows the response times for this service and is broken down into average, 95th, and 99th percentile. If there’s a weird spike that you’d like to investigate, you can simply zoom in on the graph - this will adjust the specific time range, and all of the data on the page will update accordingly.
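To show why the chart reports percentiles alongside the average, here is a minimal sketch that computes all three from a list of durations. It uses the nearest-rank percentile definition, which is an assumption; the APM UI may use a different estimator. The function names and sample data are hypothetical.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an ascending list, for pct in (0, 100]."""
    if not sorted_values:
        raise ValueError("no values")
    rank = math.ceil(pct * len(sorted_values) / 100)
    return sorted_values[max(rank, 1) - 1]


def summarize(durations_ms):
    """Return the average, 95th, and 99th percentile of the given durations."""
    ordered = sorted(durations_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
    }


# 98 fast requests and two slow outliers.
durations = [10] * 98 + [500, 900]
print(summarize(durations))
```

With this sample the average is 23.8 ms and the 95th percentile is still 10 ms, while the 99th percentile jumps to 500 ms, so the slow tail is only visible in the highest percentile.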
Requests per minute is divided into response codes: 2xx, 3xx, 4xx, etc., and is useful for determining if you’re serving more of one code than you typically do. Like in the Transaction duration graph, you can zoom in on anomalies to further investigate them.
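The grouping behind such a chart can be sketched as follows. Events are hypothetical `(epoch_seconds, status_code)` pairs, and the function name `requests_per_minute` is chosen for this example only.

```python
from collections import Counter


def requests_per_minute(events):
    """Count (epoch_seconds, status_code) pairs per minute and status class."""
    counts = Counter()
    for timestamp, status in events:
        minute = int(timestamp // 60) * 60   # start of the minute bucket
        status_class = f"{status // 100}xx"  # 404 -> "4xx"
        counts[(minute, status_class)] += 1
    return counts


events = [(0, 200), (15, 200), (30, 404), (59, 500), (60, 200), (61, 302)]
for (minute, status_class), count in sorted(requests_per_minute(events).items()):
    print(minute, status_class, count)
```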
The Transactions table is similar to the traces overview and shows the name of each transaction occurring in the selected service. Transactions with the same name are grouped together and only shown once in this table. By default, transaction groups are sorted by Impact. Impact helps show the most used and slowest endpoints in your service - in other words, it’s the collective amount of pain a specific endpoint is causing your users. If there’s a particular endpoint you’re worried about, you can click on it to view the transaction details.
The transaction overview will only display helpful information when the transactions in your service are named correctly.
Elastic APM Agents come with built-in support for popular frameworks out-of-the-box. However, if you only see one route in the Transaction overview page, or if you have transactions named "unknown route", it could be a symptom that the agent either wasn’t installed correctly or doesn’t support your framework.
For further details, including troubleshooting and custom implementation instructions, refer to the documentation for each APM Agent you’ve implemented.
Transaction details
Selecting a transaction group will bring you to the transaction details. Transaction details include a high-level overview of the time spent by span type, transaction group duration, requests per minute, and transaction group duration distribution. It’s important to note that all of these graphs show data from every transaction within the selected transaction group.
A single sampled transaction is also displayed. This sampled transaction is based on your selection in the Transactions duration distribution. You can update the sampled transaction by selecting a new bucket in the transactions duration distribution graph. The number of requests per bucket is displayed when hovering over the graph, and the selected bucket is highlighted to stand out.
For a particular transaction sample, we can get even more information in the metadata tab:
- Labels - Custom labels added by agents
- HTTP request/response information
- Host information
- Container information
- Service - The service/application runtime, agent, name, etc.
- Process - The process id that served up the request.
- Agent information
- URL
- User - Requires additional configuration, but allows you to see which user experienced the current transaction.
- Custom - You can configure your agent to add custom contextual information on transactions.
All of this data is stored in documents in Elasticsearch. This means you can select "Actions - View sample document" to see the actual Elasticsearch document under the discover tab.
Span timeline
A span is defined as the duration of a single event. Spans are automatically captured by APM agents, and you can also define custom spans. Each span has a type and is defined by a different color in the timeline/waterfall visualization.
The span timeline visualization is a bird’s-eye view of what your application was doing while it was trying to respond to the request that came in. This makes it useful for visualizing where the selected transaction spent most of its time.
View a span in detail by clicking on it in the timeline waterfall. For example, clicking on an SQL Select database query displays the actual SQL that was executed, how long it took, and the percentage of the trace’s total time. You also get a stack trace, which shows where the SQL query was issued in your code. Finally, APM knows which files are your code and which are just modules or libraries that you’ve installed. These library frames are minimized by default in order to show you the most relevant stack trace.
If your span timeline is colorful, it’s indicative of a distributed trace. Services in a distributed trace are separated by color and listed in the order they occur.
Don’t forget, a distributed trace includes more than one transaction. When viewing these distributed traces in the timeline waterfall, you’ll see an icon that indicates the next transaction in the trace. These transactions can be expanded and viewed in detail by clicking on them.
After exploring these traces, you can return to the full trace by clicking View full trace in the upper right hand corner of the page.
Metrics overview
The Metrics overview provides agent-specific metrics, which lets you perform more in-depth root cause analysis investigations within the APM UI.
If you’re experiencing a problem with your service, you can use this page to attempt to find the underlying cause. For example, you might be able to correlate a high number of errors with a long transaction duration, high CPU usage, or a memory leak.
Machine Learning integration
The Machine Learning integration will initiate a new job predefined to calculate anomaly scores on transaction response times. The response time graph will show the expected bounds and annotate the graph when the anomaly score is 75 or above.
Jobs can be created per transaction type and based on the average response time. You can manage jobs in the Machine Learning jobs management. It might take some time for results to appear on the graph.
Machine learning is a platinum feature. For a comparison of the Elastic license levels, see the subscription page.
Data Model
Elastic APM agents capture different types of information from within their instrumented applications. These are known as events, and can be spans, transactions, errors, or metrics.
Events can contain additional metadata which further enriches your data.
Spans
Spans contain information about a specific code path that has been executed. They measure from the start to end of an activity, and they can have a parent/child relationship with other spans.
Agents automatically instrument a variety of libraries to capture these spans from within your application. In addition, you can use the Agent API for ad hoc instrumentation of specific code paths.
A span contains:
- A transaction.id attribute that refers to its parent transaction.
- A parent.id attribute that refers to its parent span, or its transaction.
- start time and duration
- name
- type
- stack trace (optional)
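The parent/child links described above are enough to rebuild the nesting shown in a waterfall view. The sketch below is an illustration only: spans are plain dictionaries with simplified, hypothetical keys (`id`, `parent_id`, `name`, `start_us`) rather than the exact document fields stored in Elasticsearch.

```python
def waterfall(transaction_id, spans):
    """Return (depth, name) rows for the spans nested under a transaction."""
    # Index spans by the id of their parent (a span or the transaction itself).
    children = {}
    for span in spans:
        children.setdefault(span["parent_id"], []).append(span)

    rows = []

    def visit(parent_id, depth):
        # Siblings are listed in start order, children directly below parents.
        for span in sorted(children.get(parent_id, []), key=lambda s: s["start_us"]):
            rows.append((depth, span["name"]))
            visit(span["id"], depth + 1)

    visit(transaction_id, 0)
    return rows


spans = [
    {"id": "s2", "parent_id": "s1", "name": "SELECT FROM users", "start_us": 1500},
    {"id": "s1", "parent_id": "t1", "name": "UsersController#show", "start_us": 1000},
    {"id": "s3", "parent_id": "t1", "name": "GET cache", "start_us": 200},
]
for depth, name in waterfall("t1", spans):
    print("  " * depth + name)
```

The output lists `GET cache` first, then `UsersController#show` with `SELECT FROM users` indented beneath it.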
Most agents limit keyword fields (e.g. span.id) to 1024 characters, and non-keyword fields (e.g. span.start.us) to 10,000 characters.
Metrics
APM agents automatically pick up basic host-level metrics, including system and process-level CPU and memory metrics. Agent specific metrics are also available, like JVM metrics in the Java Agent, and Go runtime metrics in the Go Agent.
Infrastructure and application metrics are important sources of information when debugging production systems, which is why we’ve made it easy to filter metrics for specific hosts or containers in the Kibana metrics overview.
Metrics have the processor.event property set to metric.
Metrics are stored in metric indices.
For a full list of tracked metrics, see the relevant agent documentation.
Transactions
Transactions are a special kind of span that have additional attributes associated with them. They describe an event captured by an Elastic APM agent instrumenting a service. You can think of transactions as the highest level of work you’re measuring within a service. As an example, a transaction might be a:
- Request to your server
- Batch job
- Background job
- Custom transaction type
Agents decide whether to sample transactions or not, and provide settings to control sampling behavior. If sampled, the spans of a transaction are sent and stored as separate documents. Within one transaction there can be 0, 1, or many spans captured.
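The sampling decision can be sketched as a simple random draw against a configured rate. This is an illustration under assumptions: the function names are hypothetical, the agents' real decision logic may differ, and keeping the transaction document even when it is not sampled is assumed here rather than stated in the text.

```python
import random


def should_sample(sample_rate, rng=random):
    """Decide whether to sample a transaction; sample_rate is in [0.0, 1.0]."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    # random() returns a value in [0.0, 1.0), so 1.0 always samples
    # and 0.0 never does.
    return rng.random() < sample_rate


def record(transaction, spans, sample_rate, rng=random):
    """Return the documents to store for one transaction."""
    sampled = should_sample(sample_rate, rng)
    # Assumption: the transaction itself is always kept, flagged as sampled or not.
    documents = [dict(transaction, sampled=sampled)]
    if sampled:
        # Spans are stored as separate documents only for sampled transactions.
        documents.extend(spans)
    return documents


print(len(record({"id": "t1"}, [{"id": "s1"}, {"id": "s2"}], sample_rate=1.0)))
```

Lowering the rate trades detail for storage: fewer transactions carry their spans, but every transaction still contributes to counts and durations under the assumption above.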
A transaction contains:
- The timestamp of the event
- A unique id, type, and name
- Data about the environment in which the event is recorded:
- Service - environment, framework, language, etc.
- Host - architecture, hostname, IP, etc.
- Process - args, PID, PPID, etc.
- URL - full, domain, port, query, etc.
- User - (if supplied) email, ID, username, etc.
- Other relevant information depending on the agent. Example: The JavaScript RUM agent captures transaction marks, which are points in time relative to the start of the transaction with some label.
In addition, agents provide options for users to capture custom metadata. Metadata can be indexed (labels) or not indexed (custom).
Transactions are grouped by their type and name in the APM UI’s Transaction overview. If you’re using a supported framework, APM agents will automatically handle the naming for you. If you’re not, or if you wish to override the default, all agents have API methods to manually set the type and name.
- type should be a keyword of specific relevance in the service’s domain, e.g. request, backgroundjob, etc.
- name should be a generic designation of a transaction in the scope of a single service, e.g. GET /users/:id, UsersController#show, etc.
Most agents limit keyword fields (e.g. labels) to 1024 characters, and non-keyword fields (e.g. span.db.statement) to 10,000 characters.
Transactions are stored in transaction indices.
Errors
An error event contains at least information about the original exception that occurred or about a log created when the exception occurred. For simplicity, errors are represented by a unique ID.
An Error contains:
- Both the captured exception and the captured log of an error can contain a stack trace, which is helpful for debugging.
- The culprit of an error indicates where it originated.
- An error might relate to the transaction during which it happened, via the transaction.id.
- Data about the environment in which the event is recorded:
- Service - environment, framework, language, etc.
- Host - architecture, hostname, IP, etc.
- Process - args, PID, PPID, etc.
- URL - full, domain, port, query, etc.
- User - (if supplied) email, ID, username, etc.
In addition, agents provide options for users to capture custom metadata. Metadata can be indexed (labels) or not indexed (custom).
Errors are stored in error indices.
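The overview notes that errors are grouped primarily by stack trace. The exact grouping algorithm is not described here, so the sketch below is only one plausible approach: it hashes the exception type together with the module and function of each frame. The names `grouping_key` and `count_by_group` and the sample errors are hypothetical.

```python
import hashlib
from collections import Counter


def grouping_key(exception_type, frames):
    """Hash the exception type plus the (module, function) of each stack frame."""
    digest = hashlib.sha256(exception_type.encode("utf-8"))
    for module, function in frames:
        digest.update(f"|{module}:{function}".encode("utf-8"))
    # A short prefix is enough to identify a group in this example.
    return digest.hexdigest()[:16]


def count_by_group(errors):
    """Count how many error events fall into each group."""
    return Counter(grouping_key(e["type"], e["frames"]) for e in errors)


errors = [
    {"type": "KeyError", "frames": [("views", "show_user"), ("db", "get")]},
    {"type": "KeyError", "frames": [("views", "show_user"), ("db", "get")]},
    {"type": "KeyError", "frames": [("views", "list_orders"), ("db", "get")]},
]
for key, count in count_by_group(errors).most_common():
    print(key, count)
```

The first two errors share a stack and land in the same group; the third has a different frame and forms a new group, which is how a newly appearing error would stand out.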
Distributed tracing
Together, Transactions and Spans form a Trace. Traces are not events, but group together events that have a common root.
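Because a trace is a grouping rather than an event, it can be reconstructed from the events themselves. The sketch below assumes each event carries a shared trace identifier; the flat keys `trace_id`, `kind`, and `service` are simplified, hypothetical stand-ins for the stored fields.

```python
from collections import defaultdict


def group_by_trace(events):
    """Summarize each trace: services involved, transaction count, distributed flag."""
    traces = defaultdict(list)
    for event in events:
        traces[event["trace_id"]].append(event)

    summary = {}
    for trace_id, members in traces.items():
        transactions = [e for e in members if e["kind"] == "transaction"]
        summary[trace_id] = {
            "services": sorted({e["service"] for e in members}),
            "transactions": len(transactions),
            # A distributed trace includes more than one transaction.
            "distributed": len(transactions) > 1,
        }
    return summary


events = [
    {"trace_id": "a", "kind": "transaction", "service": "frontend"},
    {"trace_id": "a", "kind": "span", "service": "frontend"},
    {"trace_id": "a", "kind": "transaction", "service": "orders"},
    {"trace_id": "b", "kind": "transaction", "service": "frontend"},
]
for trace_id, info in sorted(group_by_trace(events).items()):
    print(trace_id, info)
```

Trace `a` spans two services and two transactions, so it counts as distributed; trace `b` stays within a single service.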
Elastic APM supports distributed tracing. Distributed tracing enables you to analyze performance throughout your microservices architecture all in one view. This is accomplished by tracing all of the requests - from the initial web request to your front-end service - to queries made to your back-end services. This makes finding possible bottlenecks throughout your application much easier and faster. Best of all, there’s no additional configuration needed for distributed tracing, just ensure you’re using the latest version of the applicable agent.
The APM UI in Kibana also supports distributed tracing. The Timeline visualization has been redesigned to show all of the transactions from individual services that are connected in a trace.
Real User Monitoring (RUM)
Real User Monitoring captures user interaction with clients such as web browsers. The JavaScript Agent is Elastic’s RUM Agent. To use it you need to enable RUM support in the APM Server.
Unlike Elastic APM backend agents which monitor requests and responses, the RUM JavaScript agent monitors the real user experience and interaction within your client-side application. The RUM JavaScript agent is also framework-agnostic, which means it can be used with any frontend JavaScript application.
You will be able to measure metrics such as "Time to First Byte", domInteractive, and domComplete, which help you discover performance issues within your client-side application as well as issues that relate to the latency of your server-side application.
OpenTracing bridge
All Elastic APM agents have OpenTracing compatible bridges.
The OpenTracing bridge allows you to create Elastic APM transactions and spans using the OpenTracing API. This means you can reuse your existing OpenTracing instrumentation to quickly and easily begin using Elastic APM.
Agent specific details
Not all features of the OpenTracing API are supported. In addition, there are some Elastic APM specific tags you should be aware of. Please see the relevant Agent documentation for more detailed information.