Finding duplicate files in a directory with Python

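This Python script does three things:

1. Finds duplicate files under a directory (two files count as duplicates when they have the same name and size) and prints each group sorted by modification time, oldest first.

2. Lists the older copies in each group, ordered by modification time, as candidates for deletion.

3. Counts duplicates per directory. For example, "/tmp 5" means that 5 files in /tmp also exist somewhere else.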

The Python script

#!/usr/bin/env python

#coding=utf-8

'''

Created on Nov 30, 2016

@author: fangcheng

'''

from __future__ import print_function

from operator import itemgetter

import os

import time

# tt is a float timestamp; convert it to a "YYYY-MM-DD HH:MM:SS" string

def timeYS(tt):

    t1 = time.localtime(tt)

    t2 = time.strftime("%Y-%m-%d %H:%M:%S",t1)

    return t2

class File():

    '''

    copy move remove

    '''

    allfilecount = 0

    rddfilecount = 0

    singlefiles={}

    rddfiles={}

    rdddirs={}

    def __init__(self):

        '''

        Constructor

        '''

    def getFileMsg(self,filepath):

        '''

        Return the file's information as a tuple (filepath, ftime, size)

        '''

        if os.path.isfile(filepath):

            size = os.path.getsize(filepath)  # size in bytes

            if size <= 1024:

                size = '{0}B'.format(size)

            elif size <= 1024*1024:

                size = size // 1024  # floor division keeps whole kilobytes on Python 2 and 3

                size = '{0}K'.format(size)

            else:

                size = size // 1024 // 1024  # whole megabytes

                size = '{0}M'.format(size)

            #filename = os.path.basename(filepath)

            ftime = timeYS(os.path.getmtime(filepath))

            return (filepath,ftime,size)

        return ()

    def setRedundanceFile(self,filepath):

        '''

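        Decide whether a file is a duplicate by its name and size. File info is the tuple (filepath, mtime, size) returned by getFileMsg.

        1. Walk every file under the given directory.

        2. Build a key string from the file name and size, and store the file info under that key in the dict singlefiles.

        3. Before each store, check whether the key already exists; if it does, put the file info into the dict rddfiles.

        4. rddfiles uses the same name-and-size key; its value is a list of file info tuples.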

        '''

        try:

            if os.path.isdir(filepath):

                for fil in os.listdir(filepath):

                    fil = os.path.join(filepath,fil)

                    self.setRedundanceFile(fil)

            elif os.path.isfile(filepath):

                self.allfilecount = self.allfilecount + 1

                size = os.path.getsize(filepath)

                filename = os.path.basename(filepath)

                f = self.getFileMsg(filepath)

                filekey = '{0}_{1}'.format(filename, size)

                if filekey in self.singlefiles:

                    self.rddfilecount = self.rddfilecount + 1

                    # Rule: each time a duplicate is found, add 1 to the count of its parent directory

                    pardir = os.path.dirname(filepath)

                    self.rdddirs[pardir] = self.rdddirs.get(pardir, 0) + 1

                    if filekey in self.rddfiles:

                        self.rddfiles[filekey].append(f)

                    else:

                        self.rddfiles[filekey] = [f]

                        f = self.singlefiles.get(filekey)

                        self.rddfiles[filekey].append(f)

                        # First duplicate for this key: also add 1 for the parent directory of the original file stored in singlefiles

                        pardir = os.path.dirname(f[0])

                        self.rdddirs[pardir] = self.rdddirs.get(pardir, 0) + 1

                else:

                    self.singlefiles[filekey]=f

            else:

                return

        except Exception as e:

            print(e)

    def showFileCount(self):

        print(self.allfilecount)

    def showRedundanceFile(self,filepath):

        '''

        Decide whether files are duplicates by name and size

        '''

        self.allfilecount = 0

        self.rddfilecount = 0

        self.singlefiles={}

        self.rddfiles={}

        self.rdddirs={}  # reset per-directory counts too, so repeated runs do not accumulate

        self.setRedundanceFile(filepath)

        print('the total file num:{0},the redundance file num(not including the first file):{1}'.format(self.allfilecount,self.rddfilecount))

        print('-----------------------------------------')

        for k in self.rddfiles.keys():

            for l in sorted(self.rddfiles.get(k), key=itemgetter(1)): # sort by modification time, oldest first

                print(l)

            print('')

        print('------------------------------------------')

    def showCanRemoveFile(self,filepath):

        '''

        Decide whether files are duplicates by name and size

        Print the copies with the older modification times

        '''

        self.allfilecount = 0

        self.rddfilecount = 0

        self.singlefiles={}

        self.rddfiles={}

        self.rdddirs={}  # reset per-directory counts too, so repeated runs do not accumulate

        rmlist = []

        self.setRedundanceFile(filepath)

        for k in self.rddfiles.keys():

            tmplist = sorted(self.rddfiles.get(k), key=itemgetter(1))

            tmplist.pop()  # drop the newest copy; the remaining older copies are removal candidates

            rmlist.extend(tmplist)

        for rl in rmlist:

            print(rl[0])

    def rdddirstat(self):

        '''

        Count duplicate files per directory

        Output such as "/tmp 5" means that 5 files in /tmp also exist somewhere else

        '''

        if len(self.rdddirs)> 0 :

            print('The redundance file statistics by dirs:')

            for rd in self.rdddirs.keys():

                print('{0} {1}'.format(rd, self.rdddirs.get(rd)))

        else:

            print('There are no redundance files')

if __name__ == '__main__':

    f = File()

    filepath = os.getcwd()

    #filepath = '/scripts'

    f.showRedundanceFile(filepath) # show the duplicate files

    #f.showCanRemoveFile(filepath)  # list the older duplicate copies, by modification time

    f.rdddirstat()                 # count duplicate files per directory
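
For readers on Python 3, the same name-plus-size grouping can be sketched more compactly. This is not the author's code, only an illustration of the idea: the function name `find_duplicates` is hypothetical, and it swaps the recursive `os.listdir` traversal for `os.walk`, which by default does not descend into symlinked directories.

```python
import os
from collections import defaultdict


def find_duplicates(root):
    """Group files under root by (basename, size); keep groups of 2 or more.

    Hypothetical helper for illustration, not part of the original script.
    """
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
            except OSError:
                continue  # skip files that vanished or cannot be read
            groups[(name, info.st_size)].append((path, info.st_mtime))
    # Sort every group by modification time, oldest first
    return {
        key: sorted(entries, key=lambda entry: entry[1])
        for key, entries in groups.items()
        if len(entries) > 1
    }
```

Calling `find_duplicates(os.getcwd())` returns a dict keyed by `(name, size)`; each value lists `(path, mtime)` pairs with the oldest copy first, so everything except the last entry corresponds to what showCanRemoveFile prints.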

After adding execute permission, the script can be run directly on the server:
chmod +x findrdd.py

Example run on Linux

[root@bak scripts]# ./findrdd.py

the total file num:33,the redundance file num(not including the first file):5

-----------------------------------------

('/scripts/bkapp.sh', '2016-03-09 16:31:03', '3K')

('/scripts/esgcc/bkapp.sh', '2016-03-10 11:06:06', '3K')

('/scripts/show_rollbak.txt', '2016-03-09 10:50:02', '2K')

('/scripts/esgcc/show_rollbak.txt', '2016-03-10 11:06:06', '2K')

('/scripts/esgcc/deploy.sh', '2016-03-10 11:36:19', '8K')

('/scripts/deploy.sh', '2016-03-11 11:42:04', '8K')

('/scripts/rollback.sh', '2016-03-10 10:22:33', '10K')

('/scripts/esgcc/rollback.sh', '2016-03-10 11:06:06', '10K')

('/scripts/show_deploy.txt', '2016-03-09 10:50:02', '2K')

('/scripts/esgcc/show_deploy.txt', '2016-03-10 11:06:06', '2K')

------------------------------------------

The redundance file statistics by dirs:

/scripts 5

/scripts/esgcc 5
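
To see where the two counts of 5 come from, the per-directory rule can be replayed on the groups printed above. The snippet below is only a worked check, not part of the script: every file that belongs to a duplicate group adds 1 to its parent directory, which is what rdddirstat reports. `posixpath` is used so the Linux paths are split the same way on any platform.

```python
import posixpath
from collections import Counter

# Duplicate groups copied from the Linux run above (paths only)
groups = [
    ["/scripts/bkapp.sh", "/scripts/esgcc/bkapp.sh"],
    ["/scripts/show_rollbak.txt", "/scripts/esgcc/show_rollbak.txt"],
    ["/scripts/esgcc/deploy.sh", "/scripts/deploy.sh"],
    ["/scripts/rollback.sh", "/scripts/esgcc/rollback.sh"],
    ["/scripts/show_deploy.txt", "/scripts/esgcc/show_deploy.txt"],
]

# Every file in a duplicate group adds 1 to its parent directory's count
per_dir = Counter(posixpath.dirname(path) for group in groups for path in group)

for directory, count in sorted(per_dir.items()):
    print(directory, count)
# prints:
# /scripts 5
# /scripts/esgcc 5
```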

Example run on Windows (requires Python to be installed):

C:\Users\fei\Desktop\tmp>python findrdd.py

the total file num:42,the redundance file num(not including the first file):10

-----------------------------------------

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\src\\application\\application.css', '2016-11-22 13:11:51', '101B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\project\\src\\application\\application.css', '2016-11-22 13:11:51', '101B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\classes\\application\\application.css', '2016-11-22 13:11:53', '101B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\project\\src\\login\\Login.java', '2016-11-22 13:11:51', '3K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\src\\login\\Login.java', '2016-11-22 13:11:52', '3K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\dist\\LoginCSS.jar', '2016-11-22 13:11:53', '55K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\deploy\\LoginCSS.jar', '2016-11-22 13:11:54', '55K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\project\\src\\login\\background.jpg', '2016-11-22 13:11:51', '51K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\src\\login\\background.jpg', '2016-11-22 13:11:52', '51K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\classes\\login\\background.jpg', '2016-11-22 13:11:53', '51K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\project\\src\\application\\Main.java', '2016-11-22 13:11:50', '633B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\src\\application\\Main.java', '2016-11-22 13:11:51', '633B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\project\\src\\login\\Test.java', '2016-11-22 13:11:51', '443B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\src\\login\\Test.java', '2016-11-22 13:11:52', '443B')

('C:\\Users\\fei\\Desktop\\tmp\\build\\project\\src\\login\\Login.css', '2016-11-22 13:11:51', '2K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\src\\login\\Login.css', '2016-11-22 13:11:52', '2K')

('C:\\Users\\fei\\Desktop\\tmp\\build\\build\\classes\\login\\Login.css', '2016-11-22 13:11:53', '2K')

------------------------------------------

The redundance file statistics by dirs:

C:\Users\fei\Desktop\tmp\build\build\src\application 2

C:\Users\fei\Desktop\tmp\build\deploy 1

C:\Users\fei\Desktop\tmp\build\build\classes\application 1

C:\Users\fei\Desktop\tmp\build\project\src\login 4

C:\Users\fei\Desktop\tmp\build\dist 1

C:\Users\fei\Desktop\tmp\build\build\classes\login 2

C:\Users\fei\Desktop\tmp\build\project\src\application 2

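In the runs above, the second method, which prints the list of deletable files, is commented out. That deletion rule is for reference only: whether to filter by the policy "the most recently modified file is the valid one and all the others can go" is for you to decide.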
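
Because the script treats files as duplicates based on name and size alone, it may be worth confirming that the contents really match before deleting anything. The following is a minimal sketch of one way to do that with a SHA-256 digest; it is an added safeguard, not something the original script does, and the names `file_digest` and `same_content` are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(paths):
    """Return True if every file in paths has the same SHA-256 digest."""
    return len({file_digest(path) for path in paths}) <= 1
```

A group reported by the script could be passed to `same_content` as a list of paths, and only groups for which it returns True would then be treated as safe to prune.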
