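With tap-minio-csv, CSV files stored in S3 can be written to different systems through a Singer target. The tap works with S3-compatible storage.
Below is an integration test with MinIO that imports CSV data from MinIO into PostgreSQL.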

Environment setup

  • docker-compose file
version: "3"
services:
  s3: 
    image: minio/minio
    command: server /data
    ports: 
    - "9000:9000"
    environment:
    - "MINIO_ACCESS_KEY=dalongapp"
    - "MINIO_SECRET_KEY=dalongapp"
    volumes: 
    - "./data:/data" 
  target:
    image: postgres:9.6.11
    ports:
    - "5432:5432"
    environment:
    - "POSTGRES_PASSWORD=dalong"
 
 
  • Create the bucket (demo in this example) and upload the files under the exports prefix

  • File format
    my_table.csv
id,username,userage2,classinfo
1,"dalong",11,"v1"
2,"rong",29,"v2"
3,"appdemo",30,"v3"
4,"tetst",30,"v4"

my_table2.csv

id2,username2,userage3,classinfo2
7,"dalong",11,"v1"
8,"rong",29,"v2"
9,"appdemo",30,"v3"
10,"tetst",30,"v4"
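To reproduce the test data, the two files can be generated with a short script. This is a minimal sketch using only the Python standard library; the helper names (render_csv, sample_rows, write_samples) are made up for illustration, and the files still have to be uploaded to the bucket afterwards.

```python
import csv
import io
from pathlib import Path


def render_csv(header, rows):
    """Render rows as CSV text: plain header, quoted strings, bare numbers."""
    buf = io.StringIO()
    buf.write(",".join(header) + "\n")
    # QUOTE_NONNUMERIC quotes string fields and leaves integers unquoted,
    # which matches the layout of the sample files above.
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


USERS = ["dalong", "rong", "appdemo", "tetst"]
AGES = [11, 29, 30, 30]


def sample_rows(first_id):
    """Build the four sample rows, with ids starting at first_id."""
    return [(first_id + i, USERS[i], AGES[i], f"v{i + 1}") for i in range(4)]


SAMPLES = {
    "my_table.csv": render_csv(
        ["id", "username", "userage2", "classinfo"], sample_rows(1)
    ),
    "my_table2.csv": render_csv(
        ["id2", "username2", "userage3", "classinfo2"], sample_rows(7)
    ),
}


def write_samples(directory):
    """Write both sample files into directory, creating it if needed."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    for name, text in SAMPLES.items():
        (target / name).write_text(text)
```

Calling write_samples("exports") creates a local exports directory (the local directory name is an arbitrary choice) holding both files, ready to be uploaded.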

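Tap setup

  • Initialize the venv
mkdir s3-tap
python3 -m venv s3-tap/venv
  • Activate the virtual environment
source s3-tap/venv/bin/activate
  • Install the tap
pip install tap-minio-csv
  • Configure the tap, saved as s3-tap/tap-config.json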
{
    "start_date": "2017-11-02T00:00:00Z",
    "bucket": "demo",
    "aws_access_key_id":"dalongapp",
    "aws_secret_access_key":"dalongapp",
    "endpoint_url":"http://localhost:9000",
    "tables": "[{\"search_prefix\":\"exports\",\"search_pattern\":\"my_table.csv\",\"table_name\":\"my_table\",\"key_properties\":\"id\",\"delimiter\":\",\"},{\"search_prefix\":\"exports\",\"search_pattern\":\"my_table2.csv\",\"table_name\":\"my_table2\",\"key_properties\":\"id2\",\"delimiter\":\",\"}]"
}
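  • Brief explanation
    The tables setting defines how files are located and how each CSV is processed. Its value is a JSON-encoded string; each entry sets search_prefix, search_pattern, table_name, key_properties, and delimiter.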
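Because tables is a string containing JSON, escaping it by hand is error-prone. The following is a minimal sketch that builds the same config in Python and lets json.dumps do the escaping; build_tap_config is a hypothetical helper, not part of the tap.

```python
import json


def build_tap_config(bucket, access_key, secret_key, endpoint_url, tables, start_date):
    """Return a tap config dict whose "tables" value is a JSON-encoded string."""
    return {
        "start_date": start_date,
        "bucket": bucket,
        "aws_access_key_id": access_key,
        "aws_secret_access_key": secret_key,
        "endpoint_url": endpoint_url,
        # As in the config above, "tables" is a string, not a nested list.
        "tables": json.dumps(tables),
    }


TABLES = [
    {
        "search_prefix": "exports",
        "search_pattern": "my_table.csv",
        "table_name": "my_table",
        "key_properties": "id",
        "delimiter": ",",
    },
    {
        "search_prefix": "exports",
        "search_pattern": "my_table2.csv",
        "table_name": "my_table2",
        "key_properties": "id2",
        "delimiter": ",",
    },
]

config = build_tap_config(
    bucket="demo",
    access_key="dalongapp",
    secret_key="dalongapp",
    endpoint_url="http://localhost:9000",
    tables=TABLES,
    start_date="2017-11-02T00:00:00Z",
)

# Print the result; redirect it to s3-tap/tap-config.json to use it.
print(json.dumps(config, indent=4))
```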

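Target setup

  • Initialize the venv
mkdir pg-target
python3 -m venv pg-target/venv
  • Activate the virtual environment
source pg-target/venv/bin/activate
  • Install the target
pip install target_postgres
  • Configure the target, saved as pg-target/target.json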
{
    "host": "localhost",
    "port": 5432,
    "dbname": "postgres",
    "user": "postgres",
    "password": "dalong",
    "schema": "copy"
}

Running the integration

  • Schema discovery
s3-tap/venv/bin/tap-minio-csv -c s3-tap/tap-config.json -d > catalog.json

Output

WARNING I have direct access to the bucket without assuming the configured role.
INFO Starting discover
INFO Sampling records to determine table schema.
INFO Sampling files (max files: 5)
INFO Checking bucket "demo" for keys matching "my_table.csv"
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO Will download key "exports/my_table.csv" as it was last modified 2019-08-22 05:27:52+00:00
INFO Sampling exports/my_table.csv (max records: 1000, sample rate: 5)
INFO Sampled 1 rows from exports/my_table.csv
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Found 2 files.
INFO Sampling records to determine table schema.
INFO Sampling files (max files: 5)
INFO Checking bucket "demo" for keys matching "my_table2.csv"
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Will download key "exports/my_table2.csv" as it was last modified 2019-08-22 09:01:45+00:00
INFO Sampling exports/my_table2.csv (max records: 1000, sample rate: 5)
INFO Sampled 1 rows from exports/my_table2.csv
INFO Found 2 files.
INFO Finished discover
  • Enable sync
    Set selected to true in the top-level metadata entry (the one with an empty breadcrumb) of each stream
 
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "table-key-properties": [
              "id"
            ],
+ "selected":true
          }
        }

The complete catalog:

{
  "streams": [
    {
      "stream": "my_table",
      "tap_stream_id": "my_table",
      "schema": {
        "type": "object",
        "properties": {
          "id": {
            "type": [
              "null",
              "integer",
              "string"
            ]
          },
          "username": {
            "type": [
              "null",
              "string"
            ]
          },
          "userage2": {
            "type": [
              "null",
              "integer",
              "string"
            ]
          },
          "classinfo": {
            "type": [
              "null",
              "string"
            ]
          },
          "_sdc_source_bucket": {
            "type": "string"
          },
          "_sdc_source_file": {
            "type": "string"
          },
          "_sdc_source_lineno": {
            "type": "integer"
          },
          "_sdc_extra": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        }
      },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "table-key-properties": [
              "id"
            ],
            "selected":true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "id"
          ],
          "metadata": {
            "inclusion": "automatic"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "username"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "userage2"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "classinfo"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_source_bucket"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_source_file"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_source_lineno"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_extra"
          ],
          "metadata": {
            "inclusion": "available"
          }
        }
      ]
    },
    {
      "stream": "my_table2",
      "tap_stream_id": "my_table2",
      "schema": {
        "type": "object",
        "properties": {
          "id2": {
            "type": [
              "null",
              "integer",
              "string"
            ]
          },
          "username2": {
            "type": [
              "null",
              "string"
            ]
          },
          "userage3": {
            "type": [
              "null",
              "integer",
              "string"
            ]
          },
          "classinfo2": {
            "type": [
              "null",
              "string"
            ]
          },
          "_sdc_source_bucket": {
            "type": "string"
          },
          "_sdc_source_file": {
            "type": "string"
          },
          "_sdc_source_lineno": {
            "type": "integer"
          },
          "_sdc_extra": {
            "type": "array",
            "items": {
              "type": "string"
            }
          }
        }
      },
      "metadata": [
        {
          "breadcrumb": [],
          "metadata": {
            "table-key-properties": [
              "id2"
            ],
            "selected":true
          }
        },
        {
          "breadcrumb": [
            "properties",
            "id2"
          ],
          "metadata": {
            "inclusion": "automatic"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "username2"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "userage3"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "classinfo2"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_source_bucket"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_source_file"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_source_lineno"
          ],
          "metadata": {
            "inclusion": "available"
          }
        },
        {
          "breadcrumb": [
            "properties",
            "_sdc_extra"
          ],
          "metadata": {
            "inclusion": "available"
          }
        }
      ]
    }
  ]
}
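Instead of editing catalog.json by hand, the selected flag can be set with a short script. This is a minimal sketch; select_all_streams and select_in_file are hypothetical helpers, and the small demo catalog only stands in for the discovered one.

```python
import json


def select_all_streams(catalog):
    """Set "selected": true on the top-level metadata entry of every stream."""
    for stream in catalog["streams"]:
        for entry in stream["metadata"]:
            # The stream-level entry is the one with an empty breadcrumb.
            if not entry["breadcrumb"]:
                entry["metadata"]["selected"] = True
    return catalog


def select_in_file(path="catalog.json"):
    """Load a catalog file, select every stream, and write it back."""
    with open(path) as f:
        catalog = json.load(f)
    select_all_streams(catalog)
    with open(path, "w") as f:
        json.dump(catalog, f, indent=2)


# Tiny stand-in for the discovered catalog, to show the effect.
demo = {
    "streams": [
        {
            "stream": "my_table",
            "metadata": [
                {"breadcrumb": [], "metadata": {"table-key-properties": ["id"]}},
                {"breadcrumb": ["properties", "id"], "metadata": {"inclusion": "automatic"}},
            ],
        }
    ]
}
select_all_streams(demo)
print(json.dumps(demo["streams"][0]["metadata"][0]))
```

Running select_in_file() right after discovery updates catalog.json in place.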
 
  • Run the sync
s3-tap/venv/bin/tap-minio-csv -c s3-tap/tap-config.json -p catalog.json | pg-target/venv/bin/target-postgres -c pg-target/target.json
 

Output:

WARNING I have direct access to the bucket without assuming the configured role.
INFO Starting sync.
INFO my_table: Starting sync
INFO Syncing table "my_table".
INFO Getting files modified since 2017-11-02 00:00:00+00:00.
INFO Checking bucket "demo" for keys matching "my_table.csv"
INFO Table 'my_table' does not exist. Creating... CREATE TABLE copy.my_table ("_sdc_extra" jsonb, "_sdc_source_bucket" character varying, "_sdc_source_file" character varying, "_sdc_source_lineno" bigint, "classinfo" character varying, "id" character varying, "userage2" character varying, "username" character varying, PRIMARY KEY ("id"))
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO Will download key "exports/my_table.csv" as it was last modified 2019-08-22 05:27:52+00:00
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Found 2 files.
INFO Syncing file "exports/my_table.csv".
INFO Wrote 4 records for table "my_table".
INFO my_table: Completed sync (4 rows)
INFO my_table2: Starting sync
INFO Syncing table "my_table2".
INFO Getting files modified since 2017-11-02 00:00:00+00:00.
INFO Checking bucket "demo" for keys matching "my_table2.csv"
INFO Table 'my_table2' does not exist. Creating... CREATE TABLE copy.my_table2 ("_sdc_extra" jsonb, "_sdc_source_bucket" character varying, "_sdc_source_file" character varying, "_sdc_source_lineno" bigint, "classinfo2" character varying, "id2" character varying, "userage3" character varying, "username2" character varying, PRIMARY KEY ("id2"))
INFO exports/my_table.csv
INFO 2019-08-22 05:27:52+00:00
INFO exports/my_table2.csv
INFO 2019-08-22 09:01:45+00:00
INFO Will download key "exports/my_table2.csv" as it was last modified 2019-08-22 09:01:45+00:00
INFO Found 2 files.
INFO Syncing file "exports/my_table2.csv".
INFO Wrote 4 records for table "my_table2".
INFO my_table2: Completed sync (4 rows)
INFO Done syncing.
INFO Loading 4 rows into 'my_table'
INFO COPY my_table_temp ("_sdc_extra", "_sdc_source_bucket", "_sdc_source_file", "_sdc_source_lineno", "classinfo", "id", "userage2", "username") FROM STDIN WITH (FORMAT CSV, ESCAPE '\')
INFO UPDATE 0
INFO INSERT 0 4
INFO Loading 4 rows into 'my_table2'
INFO COPY my_table2_temp ("_sdc_extra", "_sdc_source_bucket", "_sdc_source_file", "_sdc_source_lineno", "classinfo2", "id2", "userage3", "username2") FROM STDIN WITH (FORMAT CSV, ESCAPE '\')
INFO UPDATE 0
INFO INSERT 0 4
{"bookmarks": {"my_table": {"modified_since": "2019-08-22T05:27:52+00:00"}, "my_table2": {"modified_since": "2019-08-22T09:01:45+00:00"}}}
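The last line of the output is the Singer state message, which records a modified_since bookmark per stream. The following is a minimal sketch for reading it, for example to see which files are already covered; parse_state is a hypothetical helper. That a later run resumes from these bookmarks when handed this state is the usual Singer behavior and is assumed here rather than demonstrated.

```python
import json
from datetime import datetime


def parse_state(line):
    """Parse one Singer state line into {stream: modified_since datetime}."""
    state = json.loads(line)
    return {
        stream: datetime.fromisoformat(bookmark["modified_since"])
        for stream, bookmark in state.get("bookmarks", {}).items()
    }


# The state line printed at the end of the sync above.
STATE_LINE = (
    '{"bookmarks": {"my_table": {"modified_since": "2019-08-22T05:27:52+00:00"}, '
    '"my_table2": {"modified_since": "2019-08-22T09:01:45+00:00"}}}'
)

bookmarks = parse_state(STATE_LINE)
for stream, modified_since in bookmarks.items():
    print(stream, modified_since.isoformat())
```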
 
 

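PostgreSQL contents

According to the log above, the copy schema now contains the tables my_table and my_table2, each loaded with the 4 rows from its CSV file.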

Notes

The above is a simple walkthrough. For detailed usage, see https://github.com/rongfengliang/tap-minio-csv/blob/master/README.md

References

https://github.com/rongfengliang/tap-minio-csv/blob/master/README.md
https://github.com/rongfengliang/tap-minio-csv-demo
