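The Splunk indexing process
Terminology:
Event: Events are records of activity in log files, stored in Splunk indexes. Put simply, each line in a processed log or call detail record (CDR) file is one event.
Source type: Identifies the format of the data. Put simply, a log with one particular format can be defined as one source type. By default Splunk ships with more than 500 source types for well-known data formats, including Apache logs, logs from common operating systems, and logs from Cisco and other network devices.
Index: The index is the repository for Splunk Enterprise data. Splunk transforms incoming data into events, which it stores in indexes. The term has two meanings: it names the physical storage of the data, and it also names the act of processing data, as in "Splunk indexes your data". That process produces two kinds of data: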
The raw data in compressed form (rawdata)
Indexes that point to the raw data, plus some metadata files (index files)
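Indexer: An indexer is a Splunk Enterprise instance that indexes data. When people talk about "the index" in general, they often also mean this specific Splunk component, the Indexer, which is one kind of Splunk Enterprise instance.
Bucket: The two kinds of data stored in an index are organized by age into separate directories, called buckets.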
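Responsibilities (see the diagrams later in this post for details):
Search Head: Front-end search.
Deployment Server: Acts as a configuration management center that centrally manages the other nodes.
Forwarder: Collects, preprocesses, and forwards data to the indexers (consume data and forward it on to indexers). Together they form a mechanism similar to the Agent and Collector in Flume. Forwarder capabilities include: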
· Tagging of metadata (source, sourcetype, and host)
· Configurable buffering
· Data compression
· SSL security
· Use of any available network ports
· Running scripted inputs locally
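Note: A forwarder can transmit three types of data: raw, unparsed, and parsed. Which types a forwarder can send depends on the forwarder type and on how it is configured. Universal forwarders and light forwarders can send raw or unparsed data. Heavy forwarders can send raw or parsed data.
Indexer: Turns the data into indexed form. This is the indexing process, also called event processing. It includes:
· Separating the datastream into individual, searchable events. (line breaking)
· Creating or identifying timestamps. (timestamp recognition)
· Extracting fields such as host, source, and sourcetype. (handling of the common default fields)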
· Performing user-defined actions on the incoming data, such as identifying custom fields, masking sensitive data, writing new or modified keys, applying breaking rules for multi-line events, filtering unwanted events, and routing events to specified indexes or servers.
Parts of an indexer cluster (distributed deployment)
An indexer cluster is a group of Splunk Enterprise instances, or nodes, that, working in concert, provide a redundant indexing and searching capability. Each cluster has three types of nodes:
- A single master node to manage the cluster.
- Several to many peer nodes to index and maintain multiple copies of the data and to search the data.
- One or more search heads to coordinate searches across the set of peer nodes.
The master node manages the cluster. It coordinates the replicating activities of the peer nodes and tells the search head where to find data. It also helps manage the configuration of peer nodes and orchestrates remedial activities if a peer goes down.
The peer nodes receive and index incoming data, just like non-clustered, stand-alone indexers. Unlike stand-alone indexers, however, peer nodes also replicate data from other nodes in the cluster. A peer node can index its own incoming data while simultaneously storing copies of data from other nodes. You must have at least as many peer nodes as the replication factor. That is, to support a replication factor of 3, you need a minimum of three peer nodes.
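The replication factor rule above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and choosing the first available peers as replication targets is an assumption made for the example, not a description of how a real cluster assigns targets.

```python
def supports_replication_factor(peer_count, replication_factor):
    """A cluster needs at least as many peers as the replication factor."""
    return peer_count >= replication_factor


def pick_target_peers(source_peer, peers, replication_factor):
    """Pick the peers that will hold the extra copies of source_peer's data.

    The source peer keeps one copy itself, so replication_factor - 1
    other peers are needed. Taking the first ones in list order is an
    assumption for illustration only.
    """
    if not supports_replication_factor(len(peers), replication_factor):
        raise ValueError("not enough peer nodes for this replication factor")
    others = [peer for peer in peers if peer != source_peer]
    return others[: replication_factor - 1]


peers = ["peer1", "peer2", "peer3"]
print(supports_replication_factor(len(peers), 3))
print(pick_target_peers("peer1", peers, 3))
```

With three peers and a replication factor of 3, every peer ends up holding a copy of every other peer's data.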
The search head runs searches across the set of peer nodes. You must use a search head to manage searches across indexer clusters. In other words, it sends the search request to the indexer nodes and then merges the search results.
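A rough Python sketch of this scatter-and-merge pattern follows. Everything in it is an assumption for illustration: peers are modeled as in-memory lists of (timestamp, raw text) tuples, the search is a plain substring match, and the merged results are ordered newest first.

```python
def search_peer(peer_events, keyword):
    """Run the search on one peer: keep the events whose text contains the keyword."""
    return [event for event in peer_events if keyword in event[1]]


def search_head(peers, keyword):
    """Send the same search to every peer, then merge the partial results."""
    merged = []
    for peer_events in peers.values():
        merged.extend(search_peer(peer_events, keyword))
    # Order the merged results by timestamp, newest first (an assumed ordering).
    return sorted(merged, key=lambda event: event[0], reverse=True)


peers = {
    "peer1": [(1, "error: disk full"), (3, "login ok")],
    "peer2": [(2, "error: timeout")],
}
for timestamp, raw in search_head(peers, "error"):
    print(timestamp, raw)
```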
For most purposes, it is recommended that you use forwarders to get data into the cluster.
Here is a diagram of a basic, single-site indexer cluster, containing three peer nodes and supporting a replication factor of 3:
This diagram shows a simple deployment, similar to a small-scale non-clustered deployment, with some forwarders sending load-balanced data to a group of indexers (peer nodes), and the indexers sending search results to a search head. There are two additions that you don't find in a non-clustered deployment:
- The indexers are streaming copies of their data to other indexers.
- The master node, while it doesn't participate in any data streaming, coordinates a range of activities involving the search peers and the search head.
How indexing works
Splunk Enterprise can index any type of time-series data (data with timestamps). When Splunk Enterprise indexes data, it breaks it into events, based on the timestamps.
Event processing
Event processing occurs in two stages, parsing and indexing. All data that comes into Splunk Enterprise enters through the parsing pipeline as large (10,000 bytes) chunks. During parsing, Splunk Enterprise breaks these chunks into events which it hands off to the indexing pipeline, where final processing occurs.
While parsing, Splunk Enterprise performs a number of actions, including:
- Extracting a set of default fields for each event, including host, source, and sourcetype.
- Configuring character set encoding.
- Identifying line termination using linebreaking rules. While many events are short and only take up a line or two, others can be long.
- Identifying timestamps or creating them if they don't exist. At the same time that it processes timestamps, Splunk identifies event boundaries.
- Splunk can be set up to mask sensitive event data (such as credit card or social security numbers) at this stage. It can also be configured to apply custom metadata to incoming events.
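The parsing steps listed above can be imitated with a small Python sketch. It is a toy model, not Splunk's implementation: it assumes one event per line, a timestamp of the form YYYY-MM-DD HH:MM:SS at the start of the line, and a 16-digit number standing in for a credit card number. The function name, the `_time` and `_raw` keys, and the `fallback_time` parameter are all chosen for the example.

```python
import re
from datetime import datetime

# Assumed formats, chosen only for this example.
TS_PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
CARD_PATTERN = re.compile(r"\b\d{12}(\d{4})\b")


def parse_chunk(chunk, host, source, sourcetype, fallback_time):
    """Break a chunk of text into events and attach default fields."""
    events = []
    for line in chunk.splitlines():          # line breaking: one event per line
        if not line.strip():
            continue
        match = TS_PATTERN.match(line)
        if match:                            # identify an existing timestamp
            timestamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        else:                                # or create one if none exists
            timestamp = fallback_time
        # Mask all but the last four digits of anything that looks like a card number.
        masked = CARD_PATTERN.sub(lambda m: "X" * 12 + m.group(1), line)
        events.append({
            "_time": timestamp,
            "_raw": masked,
            "host": host,
            "source": source,
            "sourcetype": sourcetype,
        })
    return events


chunk = "2024-01-02 03:04:05 paid with 1234567812345678\nno timestamp here\n"
for event in parse_chunk(chunk, "web01", "/var/log/app.log", "app_log", datetime(2024, 1, 1)):
    print(event["_time"], event["_raw"])
```

Multi-line events, which the real pipeline handles through line-merging rules, are deliberately left out to keep the sketch short.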
In the indexing pipeline, Splunk Enterprise performs additional processing, including:
- Breaking all events into segments that can then be searched upon. You can determine the level of segmentation, which affects indexing and searching speed, search capability, and efficiency of disk compression.
- Building the index data structures.
- Writing the raw data and index files to disk, where post-indexing compression occurs.
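To make the idea of segments and index data structures more concrete, here is a minimal inverted-index sketch in Python. It is an analogy only: the splitting rule (break on anything that is not a letter, digit, or underscore) and the in-memory dictionary are assumptions for the example, and Splunk's real segmentation rules and on-disk index files are more involved.

```python
import re
from collections import defaultdict


def segment(raw):
    """Break raw text into lowercase search segments (assumed splitting rule)."""
    return [token.lower() for token in re.split(r"[^A-Za-z0-9_]+", raw) if token]


def build_index(events):
    """Map every segment to the set of event ids that contain it."""
    index = defaultdict(set)
    for event_id, raw in enumerate(events):
        for term in segment(raw):
            index[term].add(event_id)
    return index


def search(index, events, query):
    """Return the events that contain every segment of the query."""
    terms = segment(query)
    if not terms:
        return []
    ids = set.intersection(*(index.get(term, set()) for term in terms))
    return [events[i] for i in sorted(ids)]


events = [
    "ERROR disk full on host-a",
    "INFO login ok user=bob",
    "ERROR login failed user=bob",
]
index = build_index(events)
print(search(index, events, "error login"))
```

Finer segmentation produces more index entries, which is one way to picture why the segmentation level affects indexing speed, search capability, and disk usage.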
The breakdown between parsing and indexing pipelines is of relevance mainly when deploying forwarders. Heavy forwarders can parse data and then forward the parsed data on to indexers for final indexing. Some source types - those that reference structured data - require configuration on the forwarder prior to indexing. See "Extract data from files with headers".
For more information about events and what happens to them during the indexing process, see the chapter "Configure event processing" in the Getting Data In Manual.
Note: Indexing is an I/O-intensive process.
This diagram shows the main processes inherent in indexing:
Note: This diagram represents a simplified view of the indexing architecture. It provides a functional view of the architecture and does not fully describe Splunk Enterprise internals. In particular, the parsing pipeline actually consists of three pipelines: parsing, merging, and typing, which together handle the parsing function. The distinction can matter during troubleshooting, but does not generally affect how you configure or deploy Splunk Enterprise.
How indexer acknowledgment works
In brief, indexer acknowledgment works like this: The forwarder sends data continuously to the receiving peer, in blocks of approximately 64kB. The forwarder maintains a copy of each block in memory until it gets an acknowledgment from the peer. While waiting, it continues to send more data blocks.
If all goes well, the receiving peer:
1. receives the block of data, parses and indexes it, and writes the data (raw data and index data) to the file system.
2. streams copies of the raw data to each of its target peers.
3. sends an acknowledgment back to the forwarder.
The acknowledgment assures the forwarder that the data was successfully written to the cluster. Upon receiving the acknowledgment, the forwarder releases the block from memory.
If the forwarder does not receive the acknowledgment, that means there was a failure along the way. Either the receiving peer went down or that peer was unable to contact its set of target peers. The forwarder then automatically resends the block of data. If the forwarder is using load-balancing, it sends the block to another receiving node in the load-balanced group. If the forwarder is not set up for load-balancing, it attempts to resend data to the same node as before.
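The acknowledgment flow described above can be modeled with a short Python simulation. It is a simplified sketch under stated assumptions: the `Peer` and `Forwarder` classes are hypothetical, sending is synchronous (a real forwarder keeps sending further blocks while it waits), replication to target peers is not modeled, and a failed send is retried immediately on the next peer in the load-balanced group.

```python
BLOCK_SIZE = 64 * 1024  # blocks of approximately 64 KB, as described above


def split_into_blocks(data, size=BLOCK_SIZE):
    """Cut a byte string into consecutive blocks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class Peer:
    def __init__(self, name, up=True):
        self.name = name
        self.up = up
        self.stored = []

    def receive(self, block):
        """Write the block and return True as the acknowledgment, or False on failure."""
        if not self.up:
            return False
        self.stored.append(block)
        return True


class Forwarder:
    def __init__(self, peers):
        self.peers = peers      # the load-balanced group of receiving peers
        self.pending = {}       # block_id -> block kept in memory until acknowledged
        self.next_id = 0

    def send(self, block):
        """Send one block; return the name of the peer that acknowledged it, or None."""
        block_id = self.next_id
        self.next_id += 1
        self.pending[block_id] = block
        for peer in self.peers:
            if peer.receive(block):
                del self.pending[block_id]   # acknowledged: release from memory
                return peer.name
        return None                          # no acknowledgment: block stays in memory


forwarder = Forwarder([Peer("peer1", up=False), Peer("peer2")])
for block in split_into_blocks(b"x" * 150000):
    print(len(block), forwarder.send(block))
```

In this run the first peer is down, so each block is acknowledged by the second peer and nothing remains in the forwarder's pending buffer.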
Important: To ensure end-to-end data fidelity, you must explicitly enable indexer acknowledgment for each forwarder that's sending data to the cluster, as described earlier in this topic. If end-to-end data fidelity is not a requirement for your deployment, you can skip this step.
For more information on how indexer acknowledgment works, read "Protect against loss of in-flight data" in the Forwarding Data manual.