Overview

The File System (FS) shell includes various shell-like commands that directly interact with the Hadoop Distributed File System (HDFS) as well as other file systems that Hadoop supports, such as Local FS, WebHDFS, S3 FS, and others. The FS shell is invoked by:

bin/hadoop fs <args>

All FS shell commands take path URIs as arguments. The URI format is scheme://authority/path. For HDFS the scheme is hdfs, and for the Local FS the scheme is file. The scheme and authority are optional. If not specified, the default scheme specified in the configuration is used. An HDFS file or directory such as /parent/child can be specified as hdfs://namenodehost/parent/child or simply as /parent/child (given that your configuration is set to point to hdfs://namenodehost).
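
For example, assuming the configuration points the default filesystem at hdfs://namenodehost, the following two commands refer to the same directory:

hadoop fs -ls hdfs://namenodehost/parent/child
hadoop fs -ls /parent/child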

Most of the commands in FS shell behave like corresponding Unix commands. Differences are described with each of the commands. Error information is sent to stderr and the output is sent to stdout.

If HDFS is being used, hdfs dfs is a synonym.
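
For example, when the default filesystem is HDFS, the following two invocations are equivalent:

hadoop fs -ls /user/hadoop
hdfs dfs -ls /user/hadoop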

Relative paths can be used. For HDFS, the current working directory is the HDFS home directory /user/<username> that often has to be created manually. The HDFS home directory can also be implicitly accessed, e.g., when using the HDFS trash folder, the .Trash directory in the home directory.
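
As a minimal sketch, an administrator would typically create the home directory for a user named hadoopuser (an illustrative name) with:

hadoop fs -mkdir -p /user/hadoopuser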

See the Commands Manual for generic shell options.

appendToFile

Usage: hadoop fs -appendToFile <localsrc> ... <dst>

Append single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and appends to destination file system.

  • hadoop fs -appendToFile localfile /user/hadoop/hadoopfile
  • hadoop fs -appendToFile localfile1 localfile2 /user/hadoop/hadoopfile
  • hadoop fs -appendToFile localfile hdfs://nn.example.com/hadoop/hadoopfile
  • hadoop fs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.

Exit Code:

Returns 0 on success and 1 on error.

cat

Usage: hadoop fs -cat [-ignoreCrc] URI [URI ...]

Copies source paths to stdout.

Options

  • The -ignoreCrc option disables checksum verification.

Example:

  • hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
  • hadoop fs -cat file:///file3 /user/hadoop/file4

Exit Code:

Returns 0 on success and -1 on error.

checksum

Usage: hadoop fs -checksum URI

Returns the checksum information of a file.

Example:

  • hadoop fs -checksum hdfs://nn1.example.com/file1
  • hadoop fs -checksum file:///etc/hosts

chgrp

Usage: hadoop fs -chgrp [-R] GROUP URI [URI ...]

Change group association of files. The user must be the owner of files, or else a super-user. Additional information is in the Permissions Guide.

Options

  • The -R option will make the change recursively through the directory structure.
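
An illustrative invocation (the group and path are placeholders):

hadoop fs -chgrp -R hadoop /user/hadoop/dir1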

chmod

Usage: hadoop fs -chmod [-R] <MODE[,MODE]... | OCTALMODE> URI [URI ...]

Change the permissions of files. With -R, make the change recursively through the directory structure. The user must be the owner of the file, or else a super-user. Additional information is in the Permissions Guide.

Options

  • The -R option will make the change recursively through the directory structure.
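
Both octal and symbolic modes are accepted; illustrative invocations (the paths are placeholders):

hadoop fs -chmod 644 /user/hadoop/file1
hadoop fs -chmod -R u+rwx,go-w /user/hadoop/dir1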

chown

Usage: hadoop fs -chown [-R] [OWNER][:[GROUP]] URI [URI ]

Change the owner of files. The user must be a super-user. Additional information is in the Permissions Guide.

Options

  • The -R option will make the change recursively through the directory structure.
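
An illustrative invocation (the owner, group and path are placeholders):

hadoop fs -chown -R hadoopuser:hadoop /user/hadoop/dir1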

copyFromLocal

Usage: hadoop fs -copyFromLocal <localsrc> URI

Similar to the fs -put command, except that the source is restricted to a local file reference.

Options:

  • -p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
  • -f : Overwrites the destination if it already exists.
  • -l : Allow DataNode to lazily persist the file to disk. Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
  • -d : Skip creation of temporary file with the suffix ._COPYING_.
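
An illustrative invocation (the paths are placeholders):

hadoop fs -copyFromLocal -f localfile /user/hadoop/hadoopfile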

copyToLocal

Usage: hadoop fs -copyToLocal [-ignorecrc] [-crc] URI <localdst>

Similar to get command, except that the destination is restricted to a local file reference.
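
An illustrative invocation (the paths are placeholders):

hadoop fs -copyToLocal /user/hadoop/file1 localfile1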

count

Usage: hadoop fs -count [-q] [-h] [-v] [-x] [-t [<storage type>]] [-u] [-e] <paths>

Count the number of directories, files and bytes under the paths that match the specified file pattern. Get the quota and the usage. The output columns with -count are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The -u and -q options control what columns the output contains. -q means show quotas, -u limits the output to show quotas and usage only.

The output columns with -count -q are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME

The output columns with -count -u are: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, REMAINING_SPACE_QUOTA, PATHNAME

The -t option shows the quota and usage for each storage type. The -t option is ignored if the -u or -q option is not given. The list of possible parameters that can be used in the -t option (case insensitive except the parameter ""): "", "all", "ram_disk", "ssd", "disk" or "archive".

The -h option shows sizes in human readable format.

The -v option displays a header line.

The -x option excludes snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path. The -x option is ignored if -u or -q option is given.

The -e option shows the erasure coding policy for each file.

The output columns with -count -e are: DIR_COUNT, FILE_COUNT, CONTENT_SIZE, ERASURECODING_POLICY, PATHNAME

The ERASURECODING_POLICY is the name of the policy for the file. If an erasure coding policy is set on that file, it will return the name of the policy. If no erasure coding policy is set, it will return "Replicated", which means the file uses the replication storage strategy.

Example:

  • hadoop fs -count hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
  • hadoop fs -count -q hdfs://nn1.example.com/file1
  • hadoop fs -count -q -h hdfs://nn1.example.com/file1
  • hadoop fs -count -q -h -v hdfs://nn1.example.com/file1
  • hadoop fs -count -u hdfs://nn1.example.com/file1
  • hadoop fs -count -u -h hdfs://nn1.example.com/file1
  • hadoop fs -count -u -h -v hdfs://nn1.example.com/file1
  • hadoop fs -count -e hdfs://nn1.example.com/file1

Exit Code:

Returns 0 on success and -1 on error.

cp

Usage: hadoop fs -cp [-f] [-p | -p[topax]] URI [URI ...] <dest>

Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory.

‘raw.*’ namespace extended attributes are preserved if (1) the source and destination filesystems support them (HDFS only), and (2) all source and destination pathnames are in the /.reserved/raw hierarchy. Determination of whether raw.* namespace xattrs are preserved is independent of the -p (preserve) flag.

Options:

  • The -f option will overwrite the destination if it already exists.
  • The -p option will preserve file attributes [topx] (timestamps, ownership, permission, ACL, XAttr). If -p is specified with no arg, then preserves timestamps, ownership, permission. If -pa is specified, then preserves permission also because ACL is a super-set of permission. Determination of whether raw namespace extended attributes are preserved is independent of the -p flag.

Example:

  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

createSnapshot

See HDFS Snapshots Guide.

deleteSnapshot

See HDFS Snapshots Guide.

df

Usage: hadoop fs -df [-h] URI [URI ...]

Displays free space.

Options:

  • The -h option will format file sizes in a “human-readable” fashion (e.g 64.0m instead of 67108864)

Example:

  • hadoop dfs -df /user/hadoop/dir1

du

Usage: hadoop fs -du [-s] [-h] [-v] [-x] URI [URI ...]

Displays sizes of files and directories contained in the given directory, or the length of a file in case it's just a file.

Options:

  • The -s option will result in an aggregate summary of file lengths being displayed, rather than the individual files. Without the -s option, calculation is done by going 1-level deep from the given path.
  • The -h option will format file sizes in a “human-readable” fashion (e.g 64.0m instead of 67108864)
  • The -v option will display the names of columns as a header line.
  • The -x option will exclude snapshots from the result calculation. Without the -x option (default), the result is always calculated from all INodes, including all snapshots under the given path.

The du returns three columns with the following format:

size disk_space_consumed_with_all_replicas full_path_name

Example:

  • hadoop fs -du /user/hadoop/dir1 /user/hadoop/file1 hdfs://nn.example.com/user/hadoop/dir1

Exit Code: Returns 0 on success and -1 on error.

dus

Usage: hadoop fs -dus <args>

Displays a summary of file lengths.

Note: This command is deprecated. Instead use hadoop fs -du -s.

expunge

Usage: hadoop fs -expunge

Permanently delete files in checkpoints older than the retention threshold from trash directory, and create new checkpoint.

When checkpoint is created, recently deleted files in trash are moved under the checkpoint. Files in checkpoints older than fs.trash.interval will be permanently deleted on the next invocation of -expunge command.

If the file system supports the feature, users can configure to create and delete checkpoints periodically by the parameter stored as fs.trash.checkpoint.interval (in core-site.xml). This value should be smaller or equal to fs.trash.interval.
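
A minimal sketch of the workflow, assuming fs.trash.interval has been set to a value greater than zero in core-site.xml (the path is a placeholder):

# the deleted file is moved to the trash directory instead of being removed
hadoop fs -rm /user/hadoop/old-data.csv
# create a new checkpoint and permanently delete checkpoints older than fs.trash.interval
hadoop fs -expunge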

Refer to the HDFS Architecture guide for more information about trash feature of HDFS.

find

Usage: hadoop fs -find <path> ... <expression> ...

Finds all files that match the specified expression and applies selected actions to them. If no path is specified then defaults to the current working directory. If no expression is specified then defaults to -print.

The following primary expressions are recognised:

  • -name pattern
    -iname pattern

    Evaluates as true if the basename of the file matches the pattern using standard file system globbing. If -iname is used then the match is case insensitive.

  • -print
    -print0

    Always evaluates to true. Causes the current pathname to be written to standard output. If the -print0 expression is used then an ASCII NULL character is appended.

The following operators are recognised:

  • expression -a expression
    expression -and expression
    expression expression

    Logical AND operator for joining two expressions. Returns true if both child expressions return true. Implied by the juxtaposition of two expressions and so does not need to be explicitly specified. The second expression will not be applied if the first fails.

Example:

hadoop fs -find / -name test -print
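
A case-insensitive variant relying on the implicit AND between expressions (the path and pattern are illustrative; the wildcard is quoted so the local shell does not expand it):

hadoop fs -find /user/hadoop -iname '*.TXT' -print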

Exit Code:

Returns 0 on success and -1 on error.

get

Usage: hadoop fs -get [-ignorecrc] [-crc] [-p] [-f] <src> <localdst>

Copy files to the local file system. Files that fail the CRC check may be copied with the -ignorecrc option. Files and CRCs may be copied using the -crc option.

Example:

  • hadoop fs -get /user/hadoop/file localfile
  • hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile

Exit Code:

Returns 0 on success and -1 on error.

Options:

  • -p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
  • -f : Overwrites the destination if it already exists.
  • -ignorecrc : Skip CRC checks on the file(s) downloaded.
  • -crc: write CRC checksums for the files downloaded.

getfacl

Usage: hadoop fs -getfacl [-R] <path>

Displays the Access Control Lists (ACLs) of files and directories. If a directory has a default ACL, then getfacl also displays the default ACL.

Options:

  • -R: List the ACLs of all files and directories recursively.
  • path: File or directory to list.

Examples:

  • hadoop fs -getfacl /file
  • hadoop fs -getfacl -R /dir

Exit Code:

Returns 0 on success and non-zero on error.

getfattr

Usage: hadoop fs -getfattr [-R] -n name | -d [-e en] <path>

Displays the extended attribute names and values (if any) for a file or directory.

Options:

  • -R: Recursively list the attributes for all files and directories.
  • -n name: Dump the named extended attribute value.
  • -d: Dump all extended attribute values associated with pathname.
  • -e encoding: Encode values after retrieving them. Valid encodings are “text”, “hex”, and “base64”. Values encoded as text strings are enclosed in double quotes ("), and values encoded as hexadecimal and base64 are prefixed with 0x and 0s, respectively.
  • path: The file or directory.

Examples:

  • hadoop fs -getfattr -d /file
  • hadoop fs -getfattr -R -n user.myAttr /dir

Exit Code:

Returns 0 on success and non-zero on error.

getmerge

Usage: hadoop fs -getmerge [-nl] <src> <localdst>

Takes a source directory and a destination file as input and concatenates files in src into the destination local file. Optionally -nl can be set to enable adding a newline character (LF) at the end of each file. -skip-empty-file can be used to avoid unwanted newline characters in case of empty files.

Examples:

  • hadoop fs -getmerge -nl /src /opt/output.txt
  • hadoop fs -getmerge -nl /src/file1.txt /src/file2.txt /output.txt

Exit Code:

Returns 0 on success and non-zero on error.

help

Usage: hadoop fs -help

Return usage output.

ls

Usage: hadoop fs -ls [-C] [-d] [-h] [-q] [-R] [-t] [-S] [-r] [-u] [-e] <args>

Options:

  • -C: Display the paths of files and directories only.
  • -d: Directories are listed as plain files.
  • -h: Format file sizes in a human-readable fashion (eg 64.0m instead of 67108864).
  • -q: Print ? instead of non-printable characters.
  • -R: Recursively list subdirectories encountered.
  • -t: Sort output by modification time (most recent first).
  • -S: Sort output by file size.
  • -r: Reverse the sort order.
  • -u: Use access time rather than modification time for display and sorting.
  • -e: Display the erasure coding policy of files and directories only.

For a file ls returns stat on the file with the following format:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename

For a directory it returns list of its direct children as in Unix. A directory is listed as:

permissions userid groupid modification_date modification_time dirname

Files within a directory are ordered by filename by default.

Example:

  • hadoop fs -ls /user/hadoop/file1
  • hadoop fs -ls -e /ecdir

Exit Code:

Returns 0 on success and -1 on error.

lsr

Usage: hadoop fs -lsr <args>

Recursive version of ls.

Note: This command is deprecated. Instead use hadoop fs -ls -R

mkdir

Usage: hadoop fs -mkdir [-p] <paths>

Takes path uri’s as argument and creates directories.

Options:

  • The -p option behavior is much like Unix mkdir -p, creating parent directories along the path.

Example:

  • hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
  • hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir

Exit Code:

Returns 0 on success and -1 on error.

moveFromLocal

Usage: hadoop fs -moveFromLocal <localsrc> <dst>

Similar to put command, except that the source localsrc is deleted after it’s copied.
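
An illustrative invocation (the paths are placeholders):

hadoop fs -moveFromLocal localfile /user/hadoop/hadoopfile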

moveToLocal

Usage: hadoop fs -moveToLocal [-crc] <src> <dst>

Displays a “Not implemented yet” message.

mv

Usage: hadoop fs -mv URI [URI ...] <dest>

Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across file systems is not permitted.

Example:

  • hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1

Exit Code:

Returns 0 on success and -1 on error.

put

Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ]. <dst>

Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”

Copying fails if the file already exists, unless the -f flag is given.

Options:

  • -p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
  • -f : Overwrites the destination if it already exists.
  • -l : Allow DataNode to lazily persist the file to disk. Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
  • -d : Skip creation of temporary file with the suffix ._COPYING_.

Examples:

  • hadoop fs -put localfile /user/hadoop/hadoopfile
  • hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
  • hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
  • hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.

Exit Code:

Returns 0 on success and -1 on error.

renameSnapshot

See HDFS Snapshots Guide.

rm

Usage: hadoop fs -rm [-f] [-r |-R] [-skipTrash] [-safely] URI [URI ...]

Delete files specified as args.

If trash is enabled, file system instead moves the deleted file to a trash directory (given by FileSystem#getTrashRoot).

Currently, the trash feature is disabled by default. User can enable trash by setting a value greater than zero for parameter fs.trash.interval (in core-site.xml).

See expunge about deletion of files in trash.

Options:

  • The -f option will not display a diagnostic message or modify the exit status to reflect an error if the file does not exist.
  • The -R option deletes the directory and any content under it recursively.
  • The -r option is equivalent to -R.
  • The -skipTrash option will bypass trash, if enabled, and delete the specified file(s) immediately. This can be useful when it is necessary to delete files from an over-quota directory.
  • The -safely option will require safety confirmation before deleting directory with total number of files greater than hadoop.shell.delete.limit.num.files (in core-site.xml, default: 100). It can be used with -skipTrash to prevent accidental deletion of large directories. Delay is expected when walking over large directory recursively to count the number of files to be deleted before the confirmation.

Example:

  • hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

Exit Code:

Returns 0 on success and -1 on error.

rmdir

Usage: hadoop fs -rmdir [--ignore-fail-on-non-empty] URI [URI ...]

Delete a directory.

Options:

  • --ignore-fail-on-non-empty: When using wildcards, do not fail if a directory still contains files.

Example:

  • hadoop fs -rmdir /user/hadoop/emptydir

rmr

Usage: hadoop fs -rmr [-skipTrash] URI [URI ...]

Recursive version of delete.

Note: This command is deprecated. Instead use hadoop fs -rm -r

setfacl

Usage: hadoop fs -setfacl [-R] [-b |-k -m |-x <acl_spec> <path>] |[--set <acl_spec> <path>]

Sets Access Control Lists (ACLs) of files and directories.

Options:

  • -b: Remove all but the base ACL entries. The entries for user, group and others are retained for compatibility with permission bits.
  • -k: Remove the default ACL.
  • -R: Apply operations to all files and directories recursively.
  • -m: Modify ACL. New entries are added to the ACL, and existing entries are retained.
  • -x: Remove specified ACL entries. Other ACL entries are retained.
  • --set: Fully replace the ACL, discarding all existing entries. The acl_spec must include entries for user, group, and others for compatibility with permission bits.
  • acl_spec: Comma separated list of ACL entries.
  • path: File or directory to modify.

Examples:

  • hadoop fs -setfacl -m user:hadoop:rw- /file
  • hadoop fs -setfacl -x user:hadoop /file
  • hadoop fs -setfacl -b /file
  • hadoop fs -setfacl -k /dir
  • hadoop fs -setfacl --set user::rw-,user:hadoop:rw-,group::r--,other::r-- /file
  • hadoop fs -setfacl -R -m user:hadoop:r-x /dir
  • hadoop fs -setfacl -m default:user:hadoop:r-x /dir

Exit Code:

Returns 0 on success and non-zero on error.

setfattr

Usage: hadoop fs -setfattr -n name [-v value] | -x name <path>

Sets an extended attribute name and value for a file or directory.

Options:

  • -n name: The extended attribute name.
  • -v value: The extended attribute value. There are three different encoding methods for the value. If the argument is enclosed in double quotes, then the value is the string inside the quotes. If the argument is prefixed with 0x or 0X, then it is taken as a hexadecimal number. If the argument begins with 0s or 0S, then it is taken as a base64 encoding.
  • -x name: Remove the extended attribute.
  • path: The file or directory.

Examples:

  • hadoop fs -setfattr -n user.myAttr -v myValue /file
  • hadoop fs -setfattr -n user.noValue /file
  • hadoop fs -setfattr -x user.myAttr /file

Exit Code:

Returns 0 on success and non-zero on error.

setrep

Usage: hadoop fs -setrep [-R] [-w] <numReplicas> <path>

Changes the replication factor of a file. If path is a directory then the command recursively changes the replication factor of all files under the directory tree rooted at path. The EC files will be ignored when executing this command.

Options:

  • The -w flag requests that the command wait for the replication to complete. This can potentially take a very long time.
  • The -R flag is accepted for backwards compatibility. It has no effect.

Example:

  • hadoop fs -setrep -w 3 /user/hadoop/dir1

Exit Code:

Returns 0 on success and -1 on error.

stat

Usage: hadoop fs -stat [format] <path> ...

Print statistics about the file/directory at <path> in the specified format. Format accepts permissions in octal (%a) and symbolic (%A), filesize in bytes (%b), type (%F), group name of owner (%g), name (%n), block size (%o), replication (%r), user name of owner(%u), and modification date (%y, %Y). %y shows UTC date as “yyyy-MM-dd HH:mm:ss” and %Y shows milliseconds since January 1, 1970 UTC. If the format is not specified, %y is used by default.

Example:

  • hadoop fs -stat "%F %a %u:%g %b %y %n" /file

Exit Code: Returns 0 on success and -1 on error.

tail

Usage: hadoop fs -tail [-f] URI

Displays last kilobyte of the file to stdout.

Options:

  • The -f option will output appended data as the file grows, as in Unix.

Example:

  • hadoop fs -tail pathname

Exit Code: Returns 0 on success and -1 on error.

test

Usage: hadoop fs -test -[defsz] URI

Options:

  • -d: if the path is a directory, return 0.
  • -e: if the path exists, return 0.
  • -f: if the path is a file, return 0.
  • -s: if the path is not empty, return 0.
  • -r: if the path exists and read permission is granted, return 0.
  • -w: if the path exists and write permission is granted, return 0.
  • -z: if the file is zero length, return 0.

Example:

  • hadoop fs -test -e filename
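
Because -test communicates only through its exit code, it is typically combined with shell conditionals; a minimal sketch (the path is a placeholder):

if hadoop fs -test -d /user/hadoop/dir1; then
  echo "dir1 exists and is a directory"
fi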

text

Usage: hadoop fs -text <src>

Takes a source file and outputs the file in text format. The allowed formats are zip and TextRecordInputStream.
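
An illustrative invocation (the path is a placeholder):

hadoop fs -text /user/hadoop/data.seq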

touchz

Usage: hadoop fs -touchz URI [URI ...]

Create a file of zero length. An error is returned if the file exists with non-zero length.

Example:

  • hadoop fs -touchz pathname

Exit Code: Returns 0 on success and -1 on error.

truncate

Usage: hadoop fs -truncate [-w] <length> <paths>

Truncate all files that match the specified file pattern to the specified length.

Options:

  • The -w flag requests that the command waits for block recovery to complete, if necessary.
    Without -w flag the file may remain unclosed for some time while the recovery is in progress.
    During this time file cannot be reopened for append.

Example:

  • hadoop fs -truncate 55 /user/hadoop/file1 /user/hadoop/file2
  • hadoop fs -truncate -w 127 hdfs://nn1.example.com/user/hadoop/file1

usage

Usage: hadoop fs -usage command

Return the help for an individual command.
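
For example, to print the usage summary of the ls command:

hadoop fs -usage ls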

Working with Object Storage

The Hadoop FileSystem shell works with Object Stores such as Amazon S3, Azure WASB and OpenStack Swift.

# Create a directory
hadoop fs -mkdir s3a://bucket/datasets/

# Upload a file from the cluster filesystem
hadoop fs -put /datasets/example.orc s3a://bucket/datasets/

# touch a file
hadoop fs -touchz wasb://yourcontainer@youraccount.blob.core.windows.net/touched

Unlike a normal filesystem, renaming files and directories in an object store usually takes time proportional to the size of the objects being manipulated. As many of the filesystem shell operations use renaming as the final stage in operations, skipping that stage can avoid long delays.

In particular, the put and copyFromLocal commands should both have the -d options set for a direct upload.

# Upload a file from the cluster filesystem
hadoop fs -put -d /datasets/example.orc s3a://bucket/datasets/

# Upload a file from under the user's home directory in the local filesystem.
# Note it is the shell expanding the "~", not the hadoop fs command
hadoop fs -copyFromLocal -d -f ~/datasets/devices.orc s3a://bucket/datasets/

# create a file from stdin
# the special "-" source means "use stdin"
echo "hello" | hadoop fs -put -d -f - wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

Objects can be downloaded and viewed:

# copy a directory to the local filesystem
hadoop fs -copyToLocal s3a://bucket/datasets/

# copy a file from the object store to the cluster filesystem.
hadoop fs -get wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt /examples

# print the object
hadoop fs -cat wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

# print the object, unzipping it if necessary
hadoop fs -text wasb://yourcontainer@youraccount.blob.core.windows.net/hello.txt

## download log files into a local file
hadoop fs -getmerge wasb://yourcontainer@youraccount.blob.core.windows.net/logs\* log.txt

Commands which list many files tend to be significantly slower than when working with HDFS or other filesystems

hadoop fs -count s3a://bucket/
hadoop fs -du s3a://bucket/

Other slow commands include find, mv, cp and rm.

Find

This can be very slow on a large store with many directories under the path supplied.

# enumerate all files in the object store's container.
hadoop fs -find s3a://bucket/ -print

# remember to escape the wildcards to stop the shell trying to expand them first
hadoop fs -find s3a://bucket/datasets/ -name \*.txt -print

Rename

The time to rename a file depends on its size.

The time to rename a directory depends on the number and size of all files beneath that directory.

hadoop fs -mv s3a://bucket/datasets s3a://bucket/historical

If the operation is interrupted, the object store will be in an undefined state.

Copy

hadoop fs -cp s3a://bucket/datasets s3a://bucket/historical

The copy operation reads each file and then writes it back to the object store; the time to complete depends on the amount of data to copy, and the bandwidth in both directions between the local computer and the object store.

The further the computer is from the object store, the longer the copy takes

Deleting objects

The rm command will delete objects and directories full of objects. If the object store is eventually consistent, fs ls commands and other accessors may briefly return the details of the now-deleted objects; this is an artifact of object stores which cannot be avoided.

If the filesystem client is configured to copy files to a trash directory, this will be in the bucket; the rm operation will then take time proportional to the size of the data. Furthermore, the deleted files will continue to incur storage costs.

To avoid this, use the -skipTrash option.

hadoop fs -rm -skipTrash s3a://bucket/dataset

Data moved to the .Trash directory can be purged using the expunge command. As this command only works with the default filesystem, it must be configured to make the default filesystem the target object store.

hadoop fs -expunge -D fs.defaultFS=s3a://bucket/

Overwriting Objects

If an object store is eventually consistent, then any operation which overwrites existing objects may not be immediately visible to all clients/queries. That is: later operations which query the same object’s status or contents may get the previous object. This can sometimes surface within the same client, while reading a single object.

Avoid having a sequence of commands which overwrite objects and then immediately work on the updated data; there is a risk that the previous data will be used instead.
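
A sketch of the pattern to avoid on an eventually consistent store (the bucket and file names are illustrative):

# overwrite an existing object...
hadoop fs -put -f -d updated.csv s3a://bucket/datasets/data.csv
# ...then immediately read it back: the previous version may still be returned
hadoop fs -cat s3a://bucket/datasets/data.csv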

Timestamps

Timestamps of objects and directories in Object Stores may not follow the behavior of files and directories in HDFS.

  1. The creation and initial modification times of an object will be the time it was created on the object store; this will be at the end of the write process, not the beginning.
  2. The timestamp will be taken from the object store infrastructure’s clock, not that of the client.
  3. If an object is overwritten, the modification time will be updated.
  4. Directories may or may not have valid timestamps. They are unlikely to have their modification times updated when an object underneath is updated.
  5. The atime access time feature is not supported by any of the object stores found in the Apache Hadoop codebase.

Consult the DistCp documentation for details on how this may affect the distcp -update operation.

Security model and operations

The security and permissions models of object stores are usually very different from those of a Unix-style filesystem; operations which query or manipulate permissions are generally unsupported.

Operations to which this applies include: chgrp, chmod, chown, getfacl, and setfacl. The related attribute commands getfattr and setfattr are also usually unavailable.

  • Filesystem commands which list permission and user/group details, usually simulate these details.

  • Operations which try to preserve permissions (example fs -put -p) do not preserve permissions for this reason. (Special case: wasb://, which preserves permissions but does not enforce them).

When interacting with read-only object stores, the permissions found in “list” and “stat” commands may indicate that the user has write access, when in fact they do not.

Object stores usually have permissions models of their own, models can be manipulated through store-specific tooling. Be aware that some of the permissions which an object store may provide (such as write-only paths, or different permissions on the root path) may be incompatible with the Hadoop filesystem clients. These tend to require full read and write access to the entire object store bucket/container into which they write data.

As an example of how permissions are mocked, here is a listing of Amazon’s public, read-only bucket of Landsat images:

$ hadoop fs -ls s3a://landsat-pds/
Found 10 items
drwxrwxrwx - mapred 0 2016-09-26 12:16 s3a://landsat-pds/L8
-rw-rw-rw- 1 mapred 23764 2015-01-28 18:13 s3a://landsat-pds/index.html
drwxrwxrwx - mapred 0 2016-09-26 12:16 s3a://landsat-pds/landsat-pds_stats
-rw-rw-rw- 1 mapred 105 2016-08-19 18:12 s3a://landsat-pds/robots.txt
-rw-rw-rw- 1 mapred 38 2016-09-26 12:16 s3a://landsat-pds/run_info.json
drwxrwxrwx - mapred 0 2016-09-26 12:16 s3a://landsat-pds/runs
-rw-rw-rw- 1 mapred 27458808 2016-09-26 12:16 s3a://landsat-pds/scene_list.gz
drwxrwxrwx - mapred 0 2016-09-26 12:16 s3a://landsat-pds/tarq
drwxrwxrwx - mapred 0 2016-09-26 12:16 s3a://landsat-pds/tarq_corrupt
drwxrwxrwx - mapred 0 2016-09-26 12:16 s3a://landsat-pds/test

  1. All files are listed as having full read/write permissions.
  2. All directories appear to have full rwx permissions.
  3. The replication count of all files is “1”.
  4. The owner of all files and directories is declared to be the current user (mapred).
  5. The timestamp of all directories is actually that of the time the -ls operation was executed. This is because these directories are not actual objects in the store; they are simulated directories based on the existence of objects under their paths.

When an attempt is made to delete one of the files, the operation fails —despite the permissions shown by the ls command:

$ hadoop fs -rm s3a://landsat-pds/scene_list.gz
rm: s3a://landsat-pds/scene_list.gz: delete on s3a://landsat-pds/scene_list.gz:
com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied (Service: Amazon S3;
Status Code: 403; Error Code: AccessDenied; Request ID: 1EF98D5957BCAB3D),
S3 Extended Request ID: wi3veOXFuFqWBUCJgV3Z+NQVj9gWgZVdXlPU4KBbYMsw/gA+hyhRXcaQ+PogOsDgHh31HlTCebQ=

This demonstrates that the listed permissions cannot be taken as evidence of write access; only object manipulation can determine this.

Note that the Microsoft Azure WASB filesystem does allow permissions to be set and checked, however the permissions are not actually enforced. This feature offers the ability for a HDFS directory tree to be backed up with DistCp, with its permissions preserved, permissions which may be restored when copying the directory back into HDFS. For securing access to the data in the object store, however, Azure’s own model and tools must be used.

Commands of limited value

Here is the list of shell commands which generally have no effect —and may actually fail.

command          limitations
appendToFile     generally unsupported
checksum         the usual checksum is “NONE”
chgrp            generally unsupported permissions model; no-op
chmod            generally unsupported permissions model; no-op
chown            generally unsupported permissions model; no-op
createSnapshot   generally unsupported
deleteSnapshot   generally unsupported
df               default values are normally displayed
getfacl          may or may not be supported
getfattr         generally supported
renameSnapshot   generally unsupported
setfacl          generally unsupported permissions model
setfattr         generally unsupported permissions model
setrep           has no effect
truncate         generally unsupported

Different object store clients may support these commands: do consult the documentation and test against the target store.
