How to use a PDI job to move a file into HDFS.

Prerequisites

In order to follow along with this how-to guide you will need the following:

  • Hadoop
  • Pentaho Data Integration

Sample Files

The sample data file needed for this guide is:

File Name: weblogs_rebuild.txt.zip
Content:   Unparsed, raw weblog data

Step-By-Step Instructions

Setup

Start Hadoop if it is not already running.
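
If you are not sure whether HDFS is up, a quick command-line check can save time later. The snippet below is a minimal sketch for a simple single-node setup where the Hadoop scripts are on your PATH; script names and locations vary by distribution.

    # List the Java daemons; for a running single-node HDFS you should see NameNode and DataNode
    jps

    # If HDFS is not running, start it (script location depends on your install)
    start-dfs.sh

    # Confirm the filesystem responds
    hadoop fs -ls /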

Create a Job to Put the Files into Hadoop

In this task you will load a file into HDFS.

Speed Tip
You can download the Kettle job load_hdfs.kjb if you don't want to do every step manually.
  1. Start PDI on your desktop. Once it is running choose 'File' -> 'New' -> 'Job' from the menu system or click on the 'New file' icon on the toolbar and choose the 'Job' option.
  2. Add a Start Job Entry: You need to tell PDI where to start the job, so expand the 'General' section of the Design palette and drag a 'Start' job entry onto the job canvas.
  3. Add a Copy Files Job Entry: You will copy files from your local disk to HDFS, so expand the 'Big Data' section of the Design palette and drag a 'Hadoop Copy Files' job entry onto the job canvas.
  4. Connect the Start and Copy Files Job Entries: Hover the mouse over the 'Start' node and a tooltip will appear. Click on the output connector (the green arrow pointing to the right) and drag a connector arrow to the 'Hadoop Copy Files' node.
  5. Edit the Copy Files Job Entry: Double-click on the 'Hadoop Copy Files' node to edit its properties. Enter this information:
    1. File/Folder source(s): The folder containing the sample files you want to add to HDFS.
    2. File/Folder destination(s): hdfs://<NAMENODE>:<PORT>/user/pdi/weblogs/raw
    3. Wildcard (RegExp): Enter ^.*\.txt
    4. Click the Add button to add the above entries to the list of files you wish to copy.
    5. Check the "Create destination folder" option to ensure that the weblogs folder is created in HDFS the first time this job is executed.
      When you are done, review the entries in the list (your file paths may be different from the example).

      Click 'OK' to close the window.
  6. Save the Job: Choose 'File' -> 'Save as...' from the menu system. Save the job as 'load_hdfs.kjb' into a folder of your choice.
  7. Run the Job: Choose 'Action' -> 'Run' from the menu system or click on the green run button on the job toolbar. An 'Execute a job' window will open. Click on the 'Launch' button. An 'Execution Results' panel will open at the bottom of the PDI window and show the progress of the job as it runs. After a few seconds the job should finish successfully. If you prefer to run the saved job from the command line instead, see the Kitchen sketch after this list.

    If any errors occurred, the job entry that failed will be highlighted in red, and you can use the 'Logging' tab to view the error messages.
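
If you would rather run the saved job without the graphical tool (for example from a scheduled script), PDI ships with Kitchen, its command-line job runner. The lines below are a sketch only; the installation directory and the path to load_hdfs.kjb are placeholders for wherever you installed PDI and saved the job.

    # Run the job headless with Kitchen (paths are illustrative)
    cd /opt/pentaho/data-integration
    ./kitchen.sh -file=/path/to/load_hdfs.kjb -level=Basic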

Check Hadoop

  1. Run the following command:

    hadoop fs -ls /user/pdi/weblogs/raw

    This should return:
    -rwxrwxrwx 3 demo demo 77908174 2011-12-28 07:16 /user/pdi/weblogs/raw/weblog_raw.txt
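
    A couple of further optional checks, using standard HDFS shell commands against the same path, can confirm the size and contents of the copied file:

    # Report the size of everything under the destination folder
    hadoop fs -du /user/pdi/weblogs/raw

    # Peek at the first few lines of the copied file
    hadoop fs -cat /user/pdi/weblogs/raw/weblog_raw.txt | head -n 5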

Summary

In this guide you learned how to copy local files into HDFS using PDI's graphical design tool. You can use this tool to put files into HDFS from many different sources.

Troubleshooting

  • Make sure you have the correct shim configured and that it matches your Hadoop cluster's distro and version.
  • Problem: The Hadoop Copy Files step creates an empty file in HDFS and then hangs, never writing any data.
    Check: The Hadoop client-side API that Pentaho calls to copy files to HDFS requires that PDI has network connectivity to the nodes in the cluster. The DNS names or IP addresses used within the cluster must resolve the same way from the PDI machine as they do inside the cluster. When PDI asks to put a file into HDFS, the NameNode returns the DNS names (or IP addresses, depending on the configuration) of the actual DataNodes the data will be copied to, and PDI must be able to reach them directly; the sketch after this list shows a quick way to test this from the PDI machine.
  • Problem: Permission denied: user=XXXX, access=EXECUTE, inode="/user/pdi/weblogs/raw":raw:hadoop:drwxr-x---
    When not using Kerberos security, the Hadoop API used by this step sends the username of the locally logged-in user when copying the file(s), regardless of the username entered in the connection field. To change the user, set the environment variable HADOOP_USER_NAME. You can modify spoon.bat or spoon.sh by changing the OPT variable (or export the variable before launching Spoon, as in the sketch after this list):

    OPT="$OPT .... -DHADOOP_USER_NAME=HadoopNameToSpoof"
 
 
