Python: Finding Duplicate Files on Your Hard Drive
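After downloading a lot of material without sorting it properly, I ended up with duplicate files scattered across many folders, so I wanted to write a small Python tool to find them.
The main idea is as follows:
1. Find files with the same name.
2. Use CRC32: first pick out files of the same size, then compute the CRC32 of each one, which yields lists of file names whose contents match.
Below is a script reprinted from elsewhere. It does the job, but it is very slow when scanning a large number of files; I will tune it when I find the time.
Code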
#!/usr/bin/env python
# coding=utf-8
import binascii
import os

filesizes = {}  # file size -> list of paths with that size
samefiles = []  # groups of paths whose size and CRC32 both match


def filesize(path):
    """Walk path recursively and group every file by its size."""
    if os.path.isdir(path):
        for name in os.listdir(path):
            filesize(os.path.join(path, name))
    else:
        size = os.path.getsize(path)
        filesizes.setdefault(size, []).append(path)


def filecrc(files):
    """Group same-sized files by CRC32 and record groups with more than one file."""
    filecrcs = {}
    for file in files:
        # Binary mode: the CRC32 must be computed over the raw bytes.
        with open(file, "rb") as f:
            crc = binascii.crc32(f.read())
        filecrcs.setdefault(crc, []).append(file)
    for filecrclist in filecrcs.values():
        if len(filecrclist) > 1:
            samefiles.append(filecrclist)


if __name__ == '__main__':
    path = r"J:\My Work"
    filesize(path)
    # Only files that share a size can be duplicates, so only those get a CRC32.
    for sizesamefilelist in filesizes.values():
        if len(sizesamefilelist) > 1:
            filecrc(sizesamefilelist)
    for samefile in samefiles:
        print("****** same file group ******")
        for file in samefile:
            print(file)
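The script is slow on large collections mainly because every candidate file is read into memory in full. The following is a rough sketch of one possible tuning, not the original author's and not a finished tool: read files in chunks, and compare the CRC32 of only the first few kilobytes before paying for a full read. The names `crc32_of_file`, `group_by`, `find_duplicates`, the 1 MiB `CHUNK_SIZE` and the 4096-byte head are all assumptions made for illustration.

```python
import os
import zlib
from collections import defaultdict

CHUNK_SIZE = 1024 * 1024  # assumed read size: 1 MiB per chunk


def crc32_of_file(path, limit=None):
    """Return the CRC32 of a file read in chunks; stop after `limit` bytes if given."""
    crc = 0
    remaining = limit
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            size = CHUNK_SIZE if remaining is None else min(CHUNK_SIZE, remaining)
            chunk = f.read(size)
            if not chunk:
                break
            crc = zlib.crc32(chunk, crc)  # continue the running checksum
            if remaining is not None:
                remaining -= len(chunk)
    return crc & 0xFFFFFFFF


def group_by(paths, key):
    """Group paths by key(path); keep only groups with more than one member."""
    groups = defaultdict(list)
    for p in paths:
        try:
            groups[key(p)].append(p)
        except OSError:
            continue  # unreadable file or broken link: skip it
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(root):
    """Return lists of paths under root that share size, head CRC32 and full CRC32."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    result = []
    for same_size in group_by(paths, os.path.getsize):
        # Cheap filter: checksum only the first 4096 bytes (assumed head size).
        for same_head in group_by(same_size, lambda p: crc32_of_file(p, 4096)):
            # Expensive step, done only for files that survived both filters.
            result.extend(group_by(same_head, crc32_of_file))
    return result


if __name__ == "__main__":
    for group in find_duplicates(r"J:\My Work"):
        print("****** same file group ******")
        for path in group:
            print(path)
```

A matching size and CRC32 is a strong hint but not proof that two files are identical, so a byte-by-byte check such as `filecmp.cmp(a, b, shallow=False)` is worth running before deleting anything.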
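Author: Shane
Source: http://bluescorpio.cnblogs.com
Copyright of this article is shared by the author and cnblogs (博客园). Reprints are welcome, but unless the author agrees otherwise, this notice must be retained and a link to the original article must be placed prominently on the page; otherwise the author reserves the right to pursue legal action.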