Find and delete duplicate files
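Purpose: find all duplicate files under one or more specified directories and their subdirectories, list them in groups, and delete the redundant copies either by manual selection or automatically (the script's "Random selection" mode), keeping exactly one file from each group. File names containing spaces (for example "file name") are supported.
Implementation: find traverses the specified directories to locate every file, md5sum computes a checksum for each one, and files are grouped as duplicates by comparing their MD5 values.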
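The find-plus-MD5 idea can be sketched in a few lines. This is a minimal sketch rather than the script's own code: `list_duplicate_groups` is a hypothetical helper name, and it assumes GNU find, xargs, md5sum, sort, and uniq, plus file names without newlines.

```shell
#!/bin/bash
# Hypothetical helper: print the md5sum lines of all files that share a
# checksum with at least one other file, one blank line between groups.
# Assumes GNU tools; uniq -w32 compares only the 32-character MD5 prefix.
list_duplicate_groups() {
    find "$@" -type f -print0 |
        xargs -0 -r md5sum |
        sort |
        uniq -w32 --all-repeated=separate
}
```

Called as `list_duplicate_groups DIR1 DIR2`, it only lists groups and deletes nothing.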
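Limitations: traversing the directory tree with find is slow;
computing MD5 checksums for large files is slow;
checksumming and comparing every file is slow (a first-pass filter on file size could help, and is most effective for directories holding many large files; this script does not use it).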
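The size-based first pass mentioned above could look roughly like this. It is a sketch under stated assumptions, not part of the script: `size_candidates` is a hypothetical helper, it relies on GNU find's `-printf`, and it assumes file names contain no newlines.

```shell
#!/bin/bash
# Hypothetical helper: print only the files whose size is shared with at
# least one other file. Files with a unique size cannot have a duplicate,
# so they never need an MD5 check.
size_candidates() {
    find "$@" -type f -printf '%s\t%p\n' |
        awk -F'\t' '
            { count[$1]++; lines[NR] = $0; size[NR] = $1 }
            END {
                for (i = 1; i <= NR; i++)
                    if (count[size[i]] > 1) {
                        # Strip the leading "size<TAB>" and keep the path.
                        sub(/^[0-9]+\t/, "", lines[i])
                        print lines[i]
                    }
            }'
}
```

With GNU xargs, the result could then be checksummed with `size_candidates DIR | xargs -r -d '\n' md5sum`, so that only plausible duplicates are read in full.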
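Notes on the demo run:
While the script runs, it prints the MD5 checksum progress; when it finishes, it prints the following statistics:
Files: total number of files checksummed
Groups: number of duplicate-file groups
Size: total size of the redundant files, that is, the copies about to be deleted; in other words, the disk space freed once the duplicates are removed.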
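The three statistics can be derived from md5sum output alone. The following is a minimal sketch, not the script's code: `dup_stats` is a hypothetical helper, and it assumes bash 4+ (associative arrays) and GNU `stat -c %s`. For a group of n identical files of size s, the redundant size is (n - 1) * s.

```shell
#!/bin/bash
# Hypothetical helper: read "md5  filename" lines (md5sum output) on stdin
# and print "files groups redundant_bytes".
dup_stats() {
    local -A count=() size=()
    local sum name files=0 groups=0 bytes=0
    while read -r sum name; do
        files=$((files + 1))
        count[$sum]=$(( ${count[$sum]:-0} + 1 ))
        # Files with the same checksum are assumed to have the same size,
        # so one size lookup per group is enough.
        [[ -n ${size[$sum]:-} ]] || size[$sum]=$(stat -c %s -- "$name")
    done
    for sum in "${!count[@]}"; do
        if (( ${count[$sum]} > 1 )); then
            groups=$((groups + 1))
            bytes=$(( bytes + (${count[$sum]} - 1) * ${size[$sum]} ))
        fi
    done
    echo "$files $groups $bytes"
}
```

For example, three identical 5-byte files plus one unique file give 4 files, 1 group, and 10 redundant bytes.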
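At the "Show detailed information ?" prompt, press "y" to view the duplicate-file groups and confirm them, or skip straight to the menu for choosing how files are deleted.
There are two deletion modes. The first is manual selection (the default): each group of duplicates is listed in turn, you choose the file to keep, and the others are deleted; if you make no choice, the first file in the list is kept.
The second is automatic selection: the first file of each group is kept and the other duplicates are deleted automatically. To avoid deleting important files, the first mode is recommended.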
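Because deletion is irreversible, it can help to preview what either mode would remove. The sketch below is an illustration, not the script's code: `prune_group` and the `DRY_RUN` variable are hypothetical names introduced here.

```shell
#!/bin/bash
# Hypothetical helper: keep one file of a duplicate group, remove the rest.
# Usage: prune_group KEEP_INDEX FILE...
# With DRY_RUN=1 (an assumed variable, not part of the script) nothing is
# deleted; the files that would be removed are only printed.
prune_group() {
    local keep=$1 i=0 f
    shift
    for f in "$@"; do
        if (( i != keep )); then
            if [[ ${DRY_RUN:-0} -eq 1 ]]; then
                echo "would delete: $f"
            else
                rm -f -- "$f"
            fi
        fi
        i=$((i + 1))
    done
}
```

Automatic mode corresponds to `prune_group 0 FILE...` for every group; manual mode passes the index the user typed. Running once with `DRY_RUN=1` shows the effect before anything is removed.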
File names containing spaces are supported.
Code:
#!/bin/bash
#Author: LingYi
#Date:
#Func: Delete duplicate files
#EG  : $0 [ DIR1 DIR2 ... DIRn ]

#Define the temp file; confirm that you have write permission yourself.
md5sum_result_log="/tmp/$(date +%Y%m%d%H%M%S)"

echo -e "\033[1;31mMd5suming ...\033[0m"
#The start of this pipeline is missing in the source; find + xargs is a reconstruction.
find "$@" -type f -print0 | xargs -0 -I {} md5sum {} | tee -a "$md5sum_result_log"
files_sum=$(wc -l < "$md5sum_result_log")

# Define an array that uses the MD5 value as index and the file names as element.
# It must be declared in advance with declare -A (requires bash 4 or later).
declare -A md5sum_value_arry
while read -r md5sum_value md5sum_filename
do
    #To support spaces in file names, '+' is used as the separator character.
    #So if '+' appears in a file name there will be problems.
    #In that case the user should choose the manual mode to delete redundant files.
    md5sum_value_arry[$md5sum_value]="${md5sum_value_arry[$md5sum_value]}+$md5sum_filename"
    #Per-checksum counter stored in a variable named _<md5>.
    (( _${md5sum_value}+=1 ))
done < "$md5sum_result_log"

# Count the duplicate file groups and the size of the redundant files in this loop.
groups_sum=0
repfiles_size=0
for md5sum_value_index in "${!md5sum_value_arry[@]}"
do
    #The condition is missing in the source; "more than one file" is a reconstruction.
    if [[ $(eval echo \$_$md5sum_value_index) -gt 1 ]]; then
        let groups_sum++
        need_print_indexes="$need_print_indexes $md5sum_value_index"
        eval repfile_sum=\$\(\( \$_$md5sum_value_index - 1 \)\)
        repfile_size=$( ls -lS "`echo "${md5sum_value_arry[$md5sum_value_index]}"|awk -F'+' '{print $2}'`" | awk '{print $5}')
        repfiles_size=$(( repfiles_size + repfile_sum*repfile_size ))
    fi
done

#Output the statistical information.
echo -e "\033[1;31mFiles: $files_sum    Groups: $groups_sum    \
Size: ${repfiles_size}B $((repfiles_size/1024))K $((repfiles_size/1024/1024))M\033[0m"
[[ $groups_sum -eq 0 ]] && exit

#The user chooses whether to check the file grouping or not.
#The timeout values of the read calls are missing in the source; 300 s is an assumed placeholder.
read -n 1 -s -t 300 -p 'Show detailed information ?' user_ch
[[ $user_ch == 'n' ]] && echo || {
    [[ $user_ch == 'q' ]] && exit
    for print_value_index in $need_print_indexes
    do
        echo -ne "\n\033[1;35m$((++i)) \033[0m"
        eval echo -ne "\\\033[1\;34m$print_value_index [ \$_${print_value_index} ]:\\\033[0m"
        echo "${md5sum_value_arry[$print_value_index]}" | tr '+' '\n'
    done | more
}

#The user can choose the way of deleting files here.
echo -e "\n\nManual Selection by default !"
echo -e " 1 Manual selection\n 2 Random selection"
echo -ne "\033[1;31m"
read -t 300 USER_CH
echo -ne "\033[0m"
[[ $USER_CH == 'q' ]] && exit
[[ $USER_CH -ne 2 ]] && USER_CH=1 || {
    echo -ne "\033[31mWARNING: you have chosen the Random Selection mode, files will be deleted at random !\nAre you sure ?\033[0m"
    read -t 300 yn
    [[ $yn == 'q' ]] && exit
    #This line is truncated in the source; falling back to manual mode is a reconstruction.
    [[ $yn != 'y' ]] && USER_CH=1
}

#Handle files according to the user's selection.
echo -e "\033[31m\nWarn: keep the first file by default.\033[0m"
for exec_value_index in $need_print_indexes
do
    #Reset the array so entries from a previous, larger group do not leak in.
    unset file_choices_arry
    #This loop fills an array with the files of the current group.
    for ((i=0, j=2; i<$(echo "${md5sum_value_arry[$exec_value_index]}" | grep -o '+' | wc -l); i++, j++))
    do
        file_choices_arry[i]="$(echo "${md5sum_value_arry[$exec_value_index]}"|awk -F'+' '{print $J}' J=$j)"
    done

    eval file_sum=\$_$exec_value_index
    if [[ $USER_CH -eq 1 ]]; then
        #If the user selected the manual mode, handle the duplicate file groups one by one.
        echo -e "\033[1;34m$exec_value_index\033[0m"
        for ((j=0; j<${#file_choices_arry[@]}; j++))
        do
            echo "[ $j ] ${file_choices_arry[j]}"
        done
        read -p "Number of the file you want to keep: " num_ch
        [[ $num_ch == 'q' ]] && exit
        #The validation line is truncated in the source; this range check is a reconstruction.
        [[ $num_ch =~ ^(0|[1-9][0-9]*)$ ]] && (( num_ch < ${#file_choices_arry[@]} )) || num_ch=0
    else
        #Automatic mode: keep the first file of the group.
        num_ch=0
    fi

    #Delete every file of the group except the one to keep.
    for ((n=0; n<${#file_choices_arry[@]}; n++))
    do
        [[ $n -ne $num_ch ]] && {
            echo -ne "\033[1mDeleting file \" ${file_choices_arry[n]} \" ... \033[0m"
            rm -f "${file_choices_arry[n]}"
            [[ $? -eq 0 ]] && echo -e "\033[1;32mOK" || echo -e "\033[1;31mFAIL"
            echo -ne "\033[0m"
        }
    done
done