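Logstash: An Example of Importing a CSV File with Logstash

Reposted from: https://elasticstack.blog.csdn.net/article/details/114374804

In this article I show how to import a CSV file using the file input together with the multiline codec. I covered multiline previously in the article "Analyzing Spring Boot Microservice Logs with the Elastic Stack (Part 1)". I have also written two other examples of importing CSV with Logstash:

    Logstash: Hands-on Practice - Loading CSV Documents into Elasticsearch

    Logstash: Importing a zipcode CSV File and Trying Out Geo Search

For CSV imports, we can also use Filebeat to parse the CSV file. If you are interested, see:

    Beats: Analyzing and Visualizing COVID-19 Data with the Elastic Stack

Preparing the data

In today's exercise we have the following test data: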

multiline.csv

    INV-12402400071,05/31/2018,2595,Hy-Vee Wine and Spirits / Denison,"1620  4th Ave, South",Denison,51442,"1620 4th Ave, South Denison 51442(42.012395, -95.348601)",24,CRAWFORD,1011100,Blended Whiskies,260,DIAGEO AMERICAS,25608,Seagrams 7 Crown Bl Whiskey,6,1750,11.96,17.94,1,107.64,1.75,0.46
    S29195400002,11/21/2015,2205,Ding's Honk And Holler,900 E WASHINGTON,CLARINDA,51632,"900 E WASHINGTON
    CLARINDA 51632
    (40.739238, -95.02756)",73,Page,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,12,325.68,9.00,2.38
    S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
    KEOKUK 52632
    (40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19
    S29198800001,11/20/2015,2191,Keokuk Spirits,1013 MAIN,KEOKUK,52632,"1013 MAIN
    KEOKUK 52632
    (40.39978, -91.387531)",56,Lee,,,255,Wilson Daniels Ltd.,297,Templeton Rye w/Flask,6,750,18.09,27.14,6,162.84,4.50,1.19

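This data comes from https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy/data. Some records contain extra newline characters ("\n") inside a field, so a single record is spread over several lines, although this is fairly rare. Above, we can see three distinct invoice numbers (S29198800001 appears twice in the sample):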

    INV-12402400071

    S29195400002

    S29198800001

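The S29195400002 and S29198800001 documents each span three lines, which is clearly different from the first document. So how do we handle this? First, notice that every document begins on a line that starts with INV- or S. In general, the Logstash architecture looks like this:

A pipeline starts with an input, passes events through zero or more filters, and finally sends them to an output.

For our case, we can process the data with the following architecture:

We use the file input together with the multiline codec, and then pass the data through the csv, mutate, and grok filters.

First, we create a file named logstash_csv.conf: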

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec: any line that does not start with S or INV- belongs to the previous line, because addresses contain line breaks
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Above, we use the file input to read multiline.csv from the specified path, with the following codec:

        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }

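The pattern matches lines that begin with S or INV- immediately followed by two digits (0-9). Setting negate to true means that lines which do not match are appended to the previous matching line, so that together they form one document. If this is still unclear, see the description in the earlier article "Beats: Shipping Multiline Logs with Filebeat".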
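To make the grouping rule concrete, here is a minimal Python sketch of the same idea. It is only an illustration of what negate => true with what => previous means, not how Logstash implements the codec. The function name group_multiline and the shortened sample lines are assumptions made for this example.

```python
import re

# Same regular expression as the codec's `pattern` setting.
START_OF_RECORD = re.compile(r"^(S|INV-)[0-9][0-9]")


def group_multiline(lines):
    """Append every line that does NOT match the pattern to the previous event."""
    events = []
    for line in lines:
        if START_OF_RECORD.search(line) or not events:
            events.append(line)          # a new record starts here
        else:
            events[-1] += "\n" + line    # continuation of the previous record
    return events


# Shortened, illustrative versions of the rows in multiline.csv.
sample_lines = [
    'INV-12402400071,05/31/2018,2595,"single-line record"',
    'S29195400002,11/21/2015,2205,"900 E WASHINGTON',
    "CLARINDA 51632",
    '(40.739238, -95.02756)",73,Page',
]

events = group_multiline(sample_lines)
for event in events:
    print(repr(event))
```

The four input lines collapse into two events, and the second event keeps its two embedded newline characters, which is why \n shows up in the Logstash output later on.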

We run Logstash with the configuration file above:

    sudo ./bin/logstash -f logstash_csv.conf

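In the output, we can see that although a document is split across three lines, it is still correctly recognized as a single document. The document also contains \n characters, which we need to remove in later processing.

Next we use the csv filter: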

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec: any line that does not start with S or INV- belongs to the previous line, because addresses contain line breaks
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
        convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
        remove_field => ["message"]
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

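Above, we parse the entries of each CSV document into individual fields. We also use convert to turn the numeric fields into numeric types so that they are easier to analyze, and we remove the message field.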
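As a rough illustration of what this step does, the following Python sketch parses one multiline event with the standard csv module and casts a couple of fields. It assumes a shortened record with only six of the columns, and skipping empty values during conversion is a choice made for this sketch, not a statement about how the csv filter treats them.

```python
import csv
import io

# One event as emitted by the multiline step, shortened to six columns.
event = (
    'S29195400002,11/21/2015,2205,"900 E WASHINGTON\n'
    'CLARINDA 51632\n'
    '(40.739238, -95.02756)",73,Page'
)

# Illustrative subset of the column names used in the csv filter.
columns = ["InvoiceItemNumber", "Date", "StoreNumber",
           "StoreLocation", "CountyNumber", "County"]
convert = {"StoreNumber": int, "CountyNumber": int}

# A quoted field may contain newlines, so the reader returns a single row.
row = next(csv.reader(io.StringIO(event, newline="")))
record = dict(zip(columns, row))

for field, cast in convert.items():
    if record[field] != "":          # leave empty values untouched
        record[field] = cast(record[field])

print(record)
```

The quoted StoreLocation value stays in one field even though it spans three lines, and StoreNumber and CountyNumber become integers.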

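Re-run Logstash and check the result. In the output, the County and City values contain uppercase letters with inconsistent casing, and we want to convert them to lowercase. We also find \n characters in StoreLocation. We add mutate to the filter section to handle both: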

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec: any line that does not start with S or INV- belongs to the previous line, because addresses contain line breaks
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
        convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Re-run Logstash and check the output. County and City are now lowercase, and StoreLocation no longer contains \n characters.
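The effect of the two mutate settings can be mimicked in a few lines of Python. This is only a sketch of the transformation on a hand-built record, using values taken from the sample data; it says nothing about mutate's internals.

```python
import re

record = {
    "City": "CLARINDA",
    "County": "Page",
    "StoreLocation": "900 E WASHINGTON\nCLARINDA 51632\n(40.739238, -95.02756)",
}

# Equivalent of: gsub => [ "StoreLocation", "\n", " " ]
record["StoreLocation"] = re.sub(r"\n", " ", record["StoreLocation"])

# Equivalent of: lowercase => [ "County", "City" ]
for field in ("County", "City"):
    record[field] = record[field].lower()

print(record)
```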

Next, we want to extract the position information from StoreLocation. It contains a coordinate pair (latitude and longitude), which we can match with the grok filter:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec: any line that does not start with S or INV- belongs to the previous line, because addresses contain line breaks
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
        convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }

      # Get the lat/lon if there is a (numbers,numbers) pair in the location
      grok {
        match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

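We match the content inside the parentheses () in StoreLocation and assign it to location. The allowed characters are the minus sign, the comma, the period, the digits 0-9, and the space. Re-running Logstash shows that location has been extracted from StoreLocation.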
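The same regular expression can be tried outside Logstash. The sketch below uses Python's re module, which spells named groups as (?P<name>...) instead of the (?<name>...) form used in the grok pattern. Splitting the match into numeric lat and lon values is a hypothetical follow-up step added for illustration; the article's configuration does not do it.

```python
import re

# Same character class as the grok pattern, with Python's named-group syntax.
LOCATION = re.compile(r"\((?P<location>[-,.0-9 ]*)\)")

store_location = "900 E WASHINGTON CLARINDA 51632 (40.739238, -95.02756)"

match = LOCATION.search(store_location)
location = match.group("location") if match else None
print(location)

# Hypothetical follow-up (not in the article): numeric latitude and longitude.
lat, lon = (float(part) for part in location.split(","))
print(lat, lon)
```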

Next, we set the event time to the time from the document itself. At the moment, @timestamp is not the time in the document's Date field, so we add a date filter:

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec: any line that does not start with S or INV- belongs to the previous line, because addresses contain line breaks
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
        convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }

      # Get the lat/lon if there is a (numbers,numbers) pair in the location
      grok {
        match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
      }

      # Parse the date (daily resolution) in the correct timezone
      date {
        match => [ "Date", "MM/dd/YYYY" ]
        timezone => "America/Chicago"
      }
    }

    output {
      stdout {
        codec => rubydebug
      }
    }

Run Logstash again. @timestamp now clearly comes from the time in the document.

Next we can add an output to Elasticsearch:
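To see why the timezone setting matters, the following Python sketch parses the same month/day/year strings as local midnight in America/Chicago and converts them to UTC, which is how @timestamp is displayed. The helper name to_timestamp is made up for this example, and the code assumes Python 3.9+ with the IANA time zone database available.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def to_timestamp(date_text, tz_name="America/Chicago"):
    """Parse a month/day/year string as local midnight and return the UTC instant."""
    local = datetime.strptime(date_text, "%m/%d/%Y").replace(tzinfo=ZoneInfo(tz_name))
    return local.astimezone(timezone.utc)


summer = to_timestamp("05/31/2018")   # daylight saving time, UTC-5
winter = to_timestamp("11/21/2015")   # standard time, UTC-6

print(summer.isoformat())  # 2018-05-31T05:00:00+00:00
print(winter.isoformat())  # 2015-11-21T06:00:00+00:00
```

The two sample dates land at 05:00 and 06:00 UTC respectively, because one falls in daylight saving time and the other does not.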

logstash_csv.conf

    input {
      # Read the CSV file with the multiline codec: any line that does not start with S or INV- belongs to the previous line, because addresses contain line breaks
      file {
        start_position => "beginning"
        path => "/Users/liuxg/data/logstash_multiline/multiline.csv"
        sincedb_path => "/dev/null"
        codec => multiline {
          pattern => "^(S|INV-)[0-9][0-9]"
          negate => "true"
          what => "previous"
        }
      }
    }

    filter {
      # Parse the CSV values and define fields as integers and floats
      csv {
        columns => ["InvoiceItemNumber","Date","StoreNumber","StoreName","Address","City","ZipCode","StoreLocation","CountyNumber","County","Category","CategoryName","VendorNumber","VendorName","ItemNumber","ItemDescription","Pack","BottleVolumeml","StateBottleCost","StateBottleRetail","BottlesSold","SaleDollars","VolumeSoldLiters","VolumeSoldGallons"]
        convert => { "StoreNumber" => "integer" "ItemNumber" => "integer" "Category" => "integer" "CountyNumber" => "integer" "VendorNumber" => "integer" "Pack" => "integer" "SaleDollars" => "float" "StateBottleCost" => "float" "StateBottleRetail" => "float" "BottleVolumeml" => "float" "BottlesSold" => "float" "VolumeSoldLiters" => "float" "VolumeSoldGallons" => "float"}
        remove_field => ["message"]
      }

      # Replace the line breaks in the location with spaces, and lowercase the city and county because their casing varies in the source file
      mutate {
        gsub => [ "StoreLocation", "\n", " " ]
        lowercase => [ "County", "City" ]
      }

      # Get the lat/lon if there is a (numbers,numbers) pair in the location
      grok {
        match => { "StoreLocation" => "\((?<location>[-,.0-9 ]*)\)" }
      }

      # Parse the date (daily resolution) in the correct timezone
      date {
        match => [ "Date", "MM/dd/YYYY" ]
        timezone => "America/Chicago"
      }
    }

    output {
      elasticsearch {
        hosts => ["https://your.cluster.here:9243"]
        index => "iowa-liquor"
        user => "elastic"
        password => "redacted"
        manage_template => false
      }

      # Output dots while we process
      stdout { codec => "dots" }

      # If we saw a date parse failure, dump the event to the screen for review
      if "_dateparsefailure" in [tags] {
        stdout { codec => "rubydebug" }
      }
    }
