High-Level File Operations in Python (the shutil Module)


Author: 尹正杰 (Yin Zhengjie)

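　　If we were asked to copy a file with Python's basic file handling, many of us would open two file objects, read from the source, and write into the destination. That approach loses the stat information (permissions and so on), because none of it is ever copied. And how would we copy a whole directory?

　　Python ships a convenient library for this: shutil (high-level file operations). It solves the problems above, so let's walk through it.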

I. Copying

1>.copyfileobj

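 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil

 src_file = r"E:\temp\a.txt"
 dest_file = r"E:\temp\a.txt-1"

 #Write some test data
 with open(src_file,"w",encoding="utf8") as f:
     f.write("尹正杰到此一游!")

 with open(src_file,"r",encoding="utf8") as f1:
     #Open the destination with the same encoding as the source, otherwise the platform default is used
     with open(dest_file,"w",encoding="utf8") as f2:
         """
             Copies between file objects: f1 and f2 are objects returned by open(), and only the content is copied.
         The trailing length argument is the buffer size. It can be omitted because the function has a default
         (16*1024 in older Python versions; newer versions use a larger, platform-dependent value).
         Reading the source in an IDE is recommended.
         """
         shutil.copyfileobj(f1,f2,length=4096)
         print("File copied successfully")

 #Output of the code above:
 File copied successfully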

2>.copyfile

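 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil

 src_file = r"E:\temp\a.txt"
 dest_file = r"E:\temp\a.txt-2"

 #Write some test data
 with open(src_file,"w",encoding="utf8") as f:
     f.write("尹正杰到此一游!")

 #copyfile takes paths rather than file objects, so the files do not need to be opened first.
 #It copies the content only, without metadata. In older Python versions it is a thin wrapper around
 #copyfileobj; newer versions may use a faster platform-specific copy, still without metadata.
 shutil.copyfile(src_file,dest_file)
 print("File copied successfully")

 #Output of the code above:
 File copied successfully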
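
Two details of copyfile are easy to miss: it returns the destination path, and it refuses to copy a file onto itself. The following is a minimal sketch of both, run in a temporary directory; the file names are made up for illustration.

```python
import os
import shutil
import tempfile

# Work in a throwaway directory so nothing on disk is overwritten.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.txt")
with open(src, "w", encoding="utf8") as f:
    f.write("hello")

# copyfile returns the path of the file it wrote.
dst = shutil.copyfile(src, os.path.join(workdir, "a.txt-2"))

# Copying a file onto itself raises SameFileError instead of truncating it.
try:
    shutil.copyfile(src, src)
    same_file_rejected = False
except shutil.SameFileError:
    same_file_rejected = True

print(dst, same_file_rejected)
shutil.rmtree(workdir)
```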

3>.copymode

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil,os

 src_file = r"E:\temp\a.txt"
 dest_file = r"E:\temp\a.txt-1"

 print(os.stat(src_file))
 print(os.stat(dest_file))

 shutil.copymode(src_file,dest_file)         #Copies the permission bits only; the effect is easiest to see on Linux

 print(os.stat(src_file))
 print(os.stat(dest_file))
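
Raw os.stat output is hard to read, so here is a small sketch that isolates the permission bits with stat.S_IMODE before and after copymode. It assumes a POSIX-like system for the chmod values to be meaningful; the temporary files are illustrative only.

```python
import os
import shutil
import stat
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.txt")
dst = os.path.join(workdir, "a.txt-1")
for p in (src, dst):
    with open(p, "w", encoding="utf8") as f:
        f.write("data")

# Give the two files different permission bits on purpose.
os.chmod(src, 0o640)
os.chmod(dst, 0o600)

# Only the permission bits travel; content, owner and timestamps stay as they were.
shutil.copymode(src, dst)

src_mode = stat.S_IMODE(os.stat(src).st_mode)
dst_mode = stat.S_IMODE(os.stat(dst).st_mode)
print(oct(src_mode), oct(dst_mode))
shutil.rmtree(workdir)
```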

4>.copystat

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil,os

 src_file = r"E:\temp\a.txt"
 dest_file = r"E:\temp\a.txt-1"

 print(os.stat(src_file))
 print(os.stat(dest_file))

 shutil.copystat(src_file,dest_file)         #Copies metadata: permission bits, access and modification times, and flags

 print(os.stat(src_file))
 print(os.stat(dest_file))

5>.copy

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil,os

 src_file = r"E:\temp\a.txt"
 dest_file = r"E:\temp\a.txt-1"

 print(os.stat(src_file))
 print(os.stat(dest_file))

 """
     Copies the file content and the permission bits. Other metadata, such as the creation and
 modification times, is not preserved. Internally it calls copyfile and copymode; the source can be inspected in an IDE.
 """
 shutil.copy(src_file,dest_file)

 print(os.stat(src_file))
 print(os.stat(dest_file))

6>.copy2

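 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil,os

 src_file = r"E:\temp\a.txt"
 dest_file = r"E:\temp\a.txt-3"

 print(os.stat(src_file))
 #dest_file may not exist before the copy, and os.stat raises FileNotFoundError for a missing path
 if os.path.exists(dest_file):
     print(os.stat(dest_file))

 """
    copy2 goes further than copy and tries to preserve all metadata, subject to platform support.
 Internally it calls copyfile and copystat.
 """
 shutil.copy2(src_file,dest_file)

 print(os.stat(src_file))
 print(os.stat(dest_file))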
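
To see the difference between copyfile, copy and copy2 side by side, the sketch below gives the source an old modification time and restrictive permission bits, copies it with each function, and records what survived. The timestamp and file names are arbitrary choices for the demonstration, and the mode comparison for copyfile depends on the process umask.

```python
import os
import shutil
import stat
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.txt")
with open(src, "w", encoding="utf8") as f:
    f.write("data")

# Make the source distinctive: owner-only permissions and a timestamp from 2001.
os.chmod(src, 0o600)
old = 1_000_000_000
os.utime(src, (old, old))
src_mode = stat.S_IMODE(os.stat(src).st_mode)

results = {}
for func in (shutil.copyfile, shutil.copy, shutil.copy2):
    dst = os.path.join(workdir, "a.txt-" + func.__name__)
    func(src, dst)
    st = os.stat(dst)
    results[func.__name__] = {
        "same_mode": stat.S_IMODE(st.st_mode) == src_mode,
        "same_mtime": int(st.st_mtime) == old,
    }

for name, flags in results.items():
    print(name, flags)
shutil.rmtree(workdir)
```

In this setup copy keeps the permission bits but gets a fresh modification time, while copy2 keeps both.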

7>.copytree

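 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil,os

 src_file = r"E:\temp\test"
 dest_file = r"E:\temp\bak"

 def my_ignore(src,names):
     #Skip every entry whose name starts with "a"
     ig = filter(lambda x:x.startswith("a"),names)
     return set(ig)

 """
    Recursively copies a directory. copy2 is the default copy function, so as much metadata as possible is carried over.
    The signature is:
         def copytree(src, dst, symlinks=False, ignore=None, copy_function=copy2,ignore_dangling_symlinks=False)
         src and dst must both be directories; src must exist and dst must not exist.
         (Python 3.8 and later also accept dirs_exist_ok=True to allow an existing dst.)
         ignore = func supplies a "callable(src, names) -> ignored_names". The function is called for each directory: src is the directory
     being visited and names is the result of os.listdir(src), that is, the entries in src. The return value is a set of the names to skip.
 """
 shutil.copytree(src_file,dest_file,ignore=my_ignore)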
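
For simple name-based filtering, the standard library already provides shutil.ignore_patterns, which builds the ignore callable from glob patterns. A minimal sketch that mirrors my_ignore above (skip names starting with "a"); the directory layout is invented for the demonstration.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
src_dir = os.path.join(workdir, "test")
dst_dir = os.path.join(workdir, "bak")
os.makedirs(os.path.join(src_dir, "sub"))
for name in ("a.txt", "b.txt", os.path.join("sub", "apple.log"), os.path.join("sub", "c.log")):
    open(os.path.join(src_dir, name), "w").close()

# ignore_patterns returns a callable with the (src, names) -> ignored_names shape.
shutil.copytree(src_dir, dst_dir, ignore=shutil.ignore_patterns("a*"))

# List what actually arrived in the destination, relative to dst_dir.
copied = sorted(
    os.path.relpath(os.path.join(root, name), dst_dir)
    for root, _dirs, files in os.walk(dst_dir)
    for name in files
)
print(copied)
shutil.rmtree(workdir)
```

The pattern applies at every level, which is why sub/apple.log is skipped as well.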

II. Deleting

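 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil

 dest_file = r"E:\temp\bak"

 """
     Recursive delete. It is as dangerous as "rm -rf" on Linux, so use it with care.
     It is not atomic: if an error occurs partway, the operation stops and whatever was already deleted stays deleted.
     The signature is:
         def rmtree(path, ignore_errors=False, onerror=None)
         ignore_errors=True means errors are ignored; when it is False or omitted, onerror takes effect.
         onerror is a callable that receives function, path and excinfo.
         (Python 3.12 adds an onexc parameter and deprecates onerror.)
 """
 shutil.rmtree(dest_file)            #The target directory must exist, otherwise an exception is raised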
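
Because rmtree raises on a missing directory, cleanup code often has to decide how to treat that case. The sketch below shows the two usual options, catching FileNotFoundError or passing ignore_errors=True, on a temporary directory; it is an illustration rather than a recommendation for every situation, since ignore_errors also hides genuine failures.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "bak")

# The directory does not exist yet, so a plain rmtree raises.
try:
    shutil.rmtree(target)
    raised = False
except FileNotFoundError:
    raised = True

# With ignore_errors=True the same call is silently a no-op.
shutil.rmtree(target, ignore_errors=True)

# Once the tree exists, rmtree removes it together with all nested directories.
os.makedirs(os.path.join(target, "x", "y"))
shutil.rmtree(target)
still_there = os.path.exists(target)

print(raised, still_there)
shutil.rmtree(workdir)
```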

III. Moving

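 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:http://www.cnblogs.com/yinzhengjie

 import shutil

 src_file = r"E:\temp\test\a.txt"
 dest_file = r"E:\temp\test\b.txt"

 """
     Recursively moves a file or directory to the destination and returns the destination.
     It uses os.rename where possible, which can be seen in the source.
     If rename is not possible (for example across file systems), a directory is copied with copytree and the source directory is then removed.
     The copy uses copy2 by default.
 """
 shutil.move(src_file,dest_file)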
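
How move treats the destination depends on whether it is an existing directory. A small sketch of both cases, using invented names in a temporary directory:

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "a.txt")
inbox = os.path.join(workdir, "inbox")
os.mkdir(inbox)
with open(src, "w", encoding="utf8") as f:
    f.write("data")

# Destination is an existing directory: the file is moved into it under its own name.
new_path = shutil.move(src, inbox)

# Destination is a file path: the move behaves like a rename.
renamed = shutil.move(new_path, os.path.join(inbox, "b.txt"))

src_gone = not os.path.exists(src)
moved_ok = os.path.exists(renamed)
print(new_path, renamed)
shutil.rmtree(workdir)
```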

IV. Practice

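　　shutil can also create archives: it builds tar files and can compress them. The supported formats are zip, tar, gztar (gzip), bztar (bzip2) and xztar (xz), depending on which compression modules are available. Readers who need this can look at how the source implements it; I will not go into detail here.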
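
As a brief illustration of the archiving functions, the sketch below packs a directory into a gzip-compressed tar file with make_archive and restores it with unpack_archive. The directory and archive names are made up, and the example assumes the zlib module is available.

```python
import os
import shutil
import tempfile

workdir = tempfile.mkdtemp()
data_dir = os.path.join(workdir, "data")
os.mkdir(data_dir)
with open(os.path.join(data_dir, "a.txt"), "w", encoding="utf8") as f:
    f.write("hello")

# Names of the archive formats this interpreter can write.
formats = [name for name, _desc in shutil.get_archive_formats()]

# base_name has no extension; make_archive appends it and returns the full path.
archive = shutil.make_archive(os.path.join(workdir, "backup"), "gztar", root_dir=data_dir)
archive_name = os.path.basename(archive)

# Unpack into a fresh directory and read the file back.
restore_dir = os.path.join(workdir, "restore")
shutil.unpack_archive(archive, restore_dir)
with open(os.path.join(restore_dir, "a.txt"), encoding="utf8") as f:
    restored = f.read()

print(formats, archive_name, restored)
shutil.rmtree(workdir)
```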

　　Below are a few exercises where the shutil module is useful, for reference only.

1>.Given a source file, copy it to a target path.

For example:
　　copy /tmp/test.txt to /tmp/test.txt-bak

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 from os import path

 basedir = "/tmp/"
 src = "test.txt"
 dst = "test.txt-bak"

 src = path.join(basedir,src)
 dst = path.join(basedir,dst)

 #Write some test content
 with open(src,"w",encoding="utf-8") as f:
     f.write("\n".join(("yinzhengjie","jason","https://www.cnblogs.com/yinzhengjie")))

 #A hand-written function that copies a file
 def copy(src,dst):
     with open(src,"rb") as f1:
         with open(dst,"wb") as f2:
             length = 16 * 1024
             while True:
                 buffer = f1.read(length)       #Read in fixed-size chunks so a large file is never loaded into memory at once
                 if not buffer:
                     break
                 f2.write(buffer)

 #Call our own function
 copy(src,dst)

Copying a file with a hand-written function

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 import shutil
 from os import path

 basedir = "/tmp/"
 src = "test.txt"
 dst = "test.txt-bak"

 src = path.join(basedir,src)
 dst = path.join(basedir,dst)

 #Write some test content
 with open(src,"w",encoding="utf-8") as f:
     f.write("\n".join(("yinzhengjie","jason","https://www.cnblogs.com/yinzhengjie")))

 #The same copy is a one-liner with shutil.
 #Any of the three functions below copies the file; they differ in how much metadata is preserved.
 # shutil.copyfile(src,dst)
 # shutil.copy(src,dst)
 shutil.copy2(src,dst)

Copying a file with shutil

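2>.Copying a directory

Pick an existing directory as the working directory, create the subdirectory structure a/b/c/d under it, and generate 50 regular files spread across the different levels, each named with 1 to 4 random lowercase letters. Then copy everything under a to a dst directory in the working directory, with the restriction that only regular files whose names start with x, y or z are copied.


For example:
　　Suppose the working directory is /tmp, so the structure is /tmp/a/b/c/d. Randomly generated files with random names are placed in a, b, c and d. In the end, all directories under a (that is, b, c and d) are copied to dst, together with the files whose names start with x, y or z.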

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 import shutil
 import random
 from pathlib import Path
 from string import ascii_lowercase

 # Current working directory
 basedir = Path('/tmp')
 sub = Path('a/b/c/d')
 # Every level that may receive files: a/b/c/d, a/b/c, a/b, a (the trailing '.' is dropped)
 dirs = [sub] + list(sub.parents)[:-1]
 print(dirs)

 # Create all directories
 (basedir / sub).mkdir(parents=True, exist_ok=True)

 # Generate random file names
 filenames = ("".join(random.choices(ascii_lowercase, k=random.randint(1,4))) for i in range(50))

 # Join the paths and create the files
 for name in filenames:
     (basedir / random.choice(dirs) / name).touch()

 headers = set('xyz') # Allowed first letters

 def ignore_files(src, names):
     # Skip regular files whose first letter is not x, y or z; directories are always kept
     return {name for name in names if name[0] not in headers and Path(src, name).is_file()}

 # copytree requires that the dst directory does not exist yet
 shutil.copytree(str(basedir / 'a'), str(basedir / 'dst'), ignore=ignore_files)

 # Walk through all files
 print('-' * 30)
 for f in basedir.rglob('*'):
     print(f)

Reference solution

3>.Word count

Navigation

index modules | next | previous |  Python » 3.5.3 Documentation » The Python Standard Library » 11. File and Directory Access »

11.2. os.path — Common pathname manipulations

Source code: Lib/posixpath.py (for POSIX), Lib/ntpath.py (for Windows NT), and Lib/macpath.py (for Macintosh)

--------------------------------------------------------------------------------

This module implements some useful functions on pathnames. To read or write files see open(), and for accessing the filesystem see the os module. The path parameters can be passed as either strings, or bytes. Applications are encouraged to represent file names as (Unicode) character strings. Unfortunately, some file names may not be representable as strings on Unix, so applications that need to support arbitrary file names on Unix should use bytes objects to represent path names. Vice versa, using bytes objects cannot represent all file names on Windows (in the standard mbcs encoding), hence Windows applications should use string objects to access all files.

Unlike a unix shell, Python does not do any automatic path expansions. Functions such as expanduser() and expandvars() can be invoked explicitly when an application desires shell-like path expansion. (See also the glob module.)

See also

The pathlib module offers high-level path objects.

Note

All of these functions accept either only bytes or only string objects as their parameters. The result is an object of the same type, if a path or file name is returned.

Note

Since different operating systems have different path name conventions, there are several versions of this module in the standard library. The os.path module is always the path module suitable for the operating system Python is running on, and therefore usable for local paths. However, you can also import and use the individual modules if you want to manipulate a path that is always in one of the different formats. They all have the same interface:

posixpath for UNIX-style paths

ntpath for Windows paths

macpath for old-style MacOS paths

os.path.abspath(path)

Return a normalized absolutized version of the pathname path. On most platforms, this is equivalent to calling the function normpath() as follows: normpath(join(os.getcwd(), path)).

os.path.basename(path)

Return the base name of pathname path. This is the second element of the pair returned by passing path to the function split(). Note that the result of this function is different from the Unix basename program; where basename for '/foo/bar/' returns 'bar', the basename() function returns an empty string ('').

os.path.commonpath(paths)

Return the longest common sub-path of each pathname in the sequence paths. Raise ValueError if paths contains both absolute and relative pathnames, or if paths is empty. Unlike commonprefix(), this returns a valid path.

Availability: Unix, Windows

New in version 3.5.

os.path.commonprefix(list)

Return the longest path prefix (taken character-by-character) that is a prefix of all paths in list. If list is empty, return the empty string ('').

Note

This function may return invalid paths because it works a character at a time. To obtain a valid path, see commonpath().

>>> os.path.commonprefix(['/usr/lib', '/usr/local/lib'])

'/usr/l'

>>> os.path.commonpath(['/usr/lib', '/usr/local/lib'])

'/usr'

os.path.dirname(path)

Return the directory name of pathname path. This is the first element of the pair returned by passing path to the function split().

os.path.exists(path)

Return True if path refers to an existing path or an open file descriptor. Returns False for broken symbolic links. On some platforms, this function may return False if permission is not granted to execute os.stat() on the requested file, even if the path physically exists.

Changed in version 3.3: path can now be an integer: True is returned if it is an open file descriptor, False otherwise.

os.path.lexists(path)

Return True if path refers to an existing path. Returns True for broken symbolic links. Equivalent to exists() on platforms lacking os.lstat().

os.path.expanduser(path)

On Unix and Windows, return the argument with an initial component of ~ or ~user replaced by that user‘s home directory.

On Unix, an initial ~ is replaced by the environment variable HOME if it is set; otherwise the current user’s home directory is looked up in the password directory through the built-in module pwd. An initial ~user is looked up directly in the password directory.

On Windows, HOME and USERPROFILE will be used if set, otherwise a combination of HOMEPATH and HOMEDRIVE will be used. An initial ~user is handled by stripping the last directory component from the created user path derived above.

If the expansion fails or if the path does not begin with a tilde, the path is returned unchanged.

os.path.expandvars(path)

Return the argument with environment variables expanded. Substrings of the form $name or ${name} are replaced by the value of environment variable name. Malformed variable names and references to non-existing variables are left unchanged.

On Windows, %name% expansions are supported in addition to $name and ${name}.

os.path.getatime(path)

Return the time of last access of path. The return value is a number giving the number of seconds since the epoch (see the time module). Raise OSError if the file does not exist or is inaccessible.

If os.stat_float_times() returns True, the result is a floating point number.

os.path.getmtime(path)

Return the time of last modification of path. The return value is a number giving the number of seconds since the epoch (see the time module). Raise OSError if the file does not exist or is inaccessible.

If os.stat_float_times() returns True, the result is a floating point number.

os.path.getctime(path)

Return the system’s ctime which, on some systems (like Unix) is the time of the last metadata change, and, on others (like Windows), is the creation time for path. The return value is a number giving the number of seconds since the epoch (see the time module). Raise OSError if the file does not exist or is inaccessible.

os.path.getsize(path)

Return the size, in bytes, of path. Raise OSError if the file does not exist or is inaccessible.

os.path.isabs(path)

Return True if path is an absolute pathname. On Unix, that means it begins with a slash, on Windows that it begins with a (back)slash after chopping off a potential drive letter.

os.path.isfile(path)

Return True if path is an existing regular file. This follows symbolic links, so both islink() and isfile() can be true for the same path.

os.path.isdir(path)

Return True if path is an existing directory. This follows symbolic links, so both islink() and isdir() can be true for the same path.

os.path.islink(path)

Return True if path refers to a directory entry that is a symbolic link. Always False if symbolic links are not supported by the Python runtime.

os.path.ismount(path)

Return True if pathname path is a mount point: a point in a file system where a different file system has been mounted. On POSIX, the function checks whether path‘s parent, path/.., is on a different device than path, or whether path/.. and path point to the same i-node on the same device — this should detect mount points for all Unix and POSIX variants. On Windows, a drive letter root and a share UNC are always mount points, and for any other path GetVolumePathName is called to see if it is different from the input path.

New in version 3.4: Support for detecting non-root mount points on Windows.

os.path.join(path, *paths)

Join one or more path components intelligently. The return value is the concatenation of path and any members of *paths with exactly one directory separator (os.sep) following each non-empty part except the last, meaning that the result will only end in a separator if the last part is empty. If a component is an absolute path, all previous components are thrown away and joining continues from the absolute path component.

On Windows, the drive letter is not reset when an absolute path component (e.g., r'\foo') is encountered. If a component contains a drive letter, all previous components are thrown away and the drive letter is reset. Note that since there is a current directory for each drive, os.path.join("c:", "foo") represents a path relative to the current directory on drive C: (c:foo), not c:\foo.

os.path.normcase(path)

Normalize the case of a pathname. On Unix and Mac OS X, this returns the path unchanged; on case-insensitive filesystems, it converts the path to lowercase. On Windows, it also converts forward slashes to backward slashes. Raise a TypeError if the type of path is not str or bytes.

os.path.normpath(path)

Normalize a pathname by collapsing redundant separators and up-level references so that A//B, A/B/, A/./B and A/foo/../B all become A/B. This string manipulation may change the meaning of a path that contains symbolic links. On Windows, it converts forward slashes to backward slashes. To normalize case, use normcase().

os.path.realpath(path)

Return the canonical path of the specified filename, eliminating any symbolic links encountered in the path (if they are supported by the operating system).

os.path.relpath(path, start=os.curdir)

Return a relative filepath to path either from the current directory or from an optional start directory. This is a path computation: the filesystem is not accessed to confirm the existence or nature of path or start.

start defaults to os.curdir.

Availability: Unix, Windows.

os.path.samefile(path1, path2)

Return True if both pathname arguments refer to the same file or directory. This is determined by the device number and i-node number and raises an exception if an os.stat() call on either pathname fails.

Availability: Unix, Windows.

Changed in version 3.2: Added Windows support.

Changed in version 3.4: Windows now uses the same implementation as all other platforms.

os.path.sameopenfile(fp1, fp2)

Return True if the file descriptors fp1 and fp2 refer to the same file.

Availability: Unix, Windows.

Changed in version 3.2: Added Windows support.

os.path.samestat(stat1, stat2)

Return True if the stat tuples stat1 and stat2 refer to the same file. These structures may have been returned by os.fstat(), os.lstat(), or os.stat(). This function implements the underlying comparison used by samefile() and sameopenfile().

Availability: Unix, Windows.

Changed in version 3.4: Added Windows support.

os.path.split(path)

Split the pathname path into a pair, (head, tail) where tail is the last pathname component and head is everything leading up to that. The tail part will never contain a slash; if path ends in a slash, tail will be empty. If there is no slash in path, head will be empty. If path is empty, both head and tail are empty. Trailing slashes are stripped from head unless it is the root (one or more slashes only). In all cases, join(head, tail) returns a path to the same location as path (but the strings may differ). Also see the functions dirname() and basename().

os.path.splitdrive(path)

Split the pathname path into a pair (drive, tail) where drive is either a mount point or the empty string. On systems which do not use drive specifications, drive will always be the empty string. In all cases, drive + tail will be the same as path.

On Windows, splits a pathname into drive/UNC sharepoint and relative path.

If the path contains a drive letter, drive will contain everything up to and including the colon. e.g. splitdrive("c:/dir") returns ("c:", "/dir")

If the path contains a UNC path, drive will contain the host name and share, up to but not including the fourth separator. e.g. splitdrive("//host/computer/dir") returns ("//host/computer", "/dir")

os.path.splitext(path)

Split the pathname path into a pair (root, ext) such that root + ext == path, and ext is empty or begins with a period and contains at most one period. Leading periods on the basename are ignored; splitext('.cshrc') returns ('.cshrc', '').

os.path.splitunc(path)

Deprecated since version 3.1: Use splitdrive instead.

Split the pathname path into a pair (unc, rest) so that unc is the UNC mount point (such as r'\\host\mount'), if present, and rest the rest of the path (such as r'\path\file.ext'). For paths containing drive letters, unc will always be the empty string.

Availability: Windows.

os.path.supports_unicode_filenames

True if arbitrary Unicode strings can be used as file names (within limitations imposed by the file system).

Navigation

index modules | next | previous |  Python » 3.5.3 Documentation » The Python Standard Library » 11. File and Directory Access »

© Copyright 2001-2017, Python Software Foundation.

The Python Software Foundation is a non-profit corporation. Please donate.

Last updated on Jan 16, 2017. Found a bug?

Created using Sphinx 1.3.1.

sample.txt

　　The sample.txt file above is the input. Count its words case-insensitively and show the 10 most frequent ones. [In theory the keyword "path" should come out on top.]

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 from collections import defaultdict

 filename = "sample.txt"

 d = {}
 with open(filename,encoding="utf-8") as f:
     for line in f:
         words = line.split()
         for word in map(str.lower,words):   #Case-insensitive
             d[word] = d.get(word,0) + 1

 #Sort by count, highest first, and keep the top 10
 print(sorted(d.items(),key=lambda item:item[1],reverse=True)[:10])

 #Or use a default dictionary
 d = defaultdict(int)
 with open(filename,encoding="utf8") as f:
     for line in f:
         words = line.split()
         for word in map(str.lower,words):
             d[word] += 1

 print(sorted(d.items(),key=lambda item:item[1],reverse=True)[:10])

Simple version that relies on split() alone (version 1, simple implementation)

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 from collections import defaultdict

 filename = "sample.txt"

 def makekey(src:str)->list:
     """
         This approach avoids calling replace() over and over, but it still builds a string of the same length
     and then walks it again with split() to cut it into pieces.
     Could the words be extracted while scanning the original string once, without replacing anything?
     """
     chars = set(r"""!()-+*'"/\#.[]{},""")
     key = src.lower()
     ret = []
     for c in key:
         ret.append(" " if c in chars else c)
     return "".join(ret).split()

 #Use a default dictionary
 d = defaultdict(int)
 with open(filename,encoding="utf8") as f:
     for line in f:
         for word in makekey(line):  #Every special character in the line has been replaced by a space
             d[word] += 1

 print(sorted(d.items(), key=lambda x:x[1], reverse=True)[:10])

 #List every counted word that contains "path"
 print(*(word for word in d if 'path' in word), sep='\n')

Replace each special character with a space (version 2, complete but not yet optimized)

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 from collections import defaultdict

 filename = "sample.txt"

 def makekey(src:str)->list:
     #The backslash is written as \\ because \# is not a valid escape sequence
     chars = set("""!()-+*'"/\\#.[]{}, \n""")
     key = src.lower()
     ret = []
     start = 0
     length = len(key)
     for i,c in enumerate(key):
         if c in chars:
             if start == i: # Two separators in a row: start is always equal to i
                 start += 1 # Advance by one and continue
                 continue
             ret.append(key[start:i])
             start = i + 1 # Plus one to skip the separator c itself
     if start < length: # Less than length means a word runs up to the end of the string
         ret.append(key[start:])
     return ret

 #Use a default dictionary
 d = defaultdict(int)
 with open(filename,encoding="utf8") as f:
     for line in f:
         for word in makekey(line):  #Words are extracted directly, without replacing characters
             d[word] += 1

 print(sorted(d.items(), key=lambda x:x[1], reverse=True)[:10])

 #List every counted word that contains "path"
 print(*(word for word in d if 'path' in word), sep='\n')

Extract the words in a single pass over the original string, without replacing anything (complete version)

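4>.Word count, advanced

　　Building on the previous exercise, let the user exclude certain words from the count. Words such as a, the and of carry no real meaning in the statistics and should be ignored.

　　All code must be wrapped in functions and run by calling them.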

 #!/usr/bin/env python
 #_*_coding:utf-8_*_
 #@author :yinzhengjie
 #blog:https://www.cnblogs.com/yinzhengjie
 #EMAIL:y1053419035@qq.com

 filename = 'sample.txt'

 #The backslash is written as \\ because \# is not a valid escape sequence
 def makekey2(line: str, chars=frozenset("""!()-+*'"/\\#.[]{}, &:\r\n""")):
     """Process a line and split it into words
     :param line: the line as a string
     :param chars: separator characters, whitespace included
     :return: a generator yielding the words
     """
     # Case folding is not this function's job, so it has been moved to the caller
     start = 0
     length = len(line)
     for i, c in enumerate(line):
         if c in chars:
             if start == i:  # Two separators in a row: start is always equal to i
                 start += 1 # Advance by one and continue
                 continue
             yield line[start:i]
             start = i + 1  # Plus one to skip the separator c itself
     if start < length:  # Less than length means a word runs up to the end of the line
         yield line[start:]

 def wordcount(filename, encoding='utf-8', ignore=frozenset()):
     d = {}
     with open(filename, encoding=encoding) as f:
         for line in f:
             for word in map(str.lower, makekey2(line)):
                 if word not in ignore:
                     d[word] = d.get(word, 0) + 1
     return d

 def top(d: dict, n=10):
     yield from sorted(d.items(), key=lambda x: x[1], reverse=True)[:n]

 # The most frequent words
 t = top(wordcount(filename, ignore={'the', 'is', 'a'}))
 print(list(t))

Reference solution
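
For comparison, the same exercise can be sketched with collections.Counter and a regular expression. This is a swapped-in technique, not the approach used above: the pattern `[a-z0-9_]+` is an assumption about what counts as a word, so the tokens will not match makekey2 exactly, and top_words is a hypothetical helper name.

```python
import re
from collections import Counter

def top_words(text, n=10, ignore=frozenset()):
    """Return the n most common words in text, case-insensitively, skipping ignored words."""
    # Assumed word definition: runs of letters, digits and underscores.
    words = re.findall(r"[a-z0-9_]+", text.lower())
    counts = Counter(w for w in words if w not in ignore)
    return counts.most_common(n)

sample = "The path module: os.path joins a path. The Path is a path!"
result = top_words(sample, n=2, ignore={"the", "is", "a"})
print(result)
```

For a real file, the text can be read with open(filename, encoding="utf-8").read() and passed to top_words.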
