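Python file operations

Writing a file with Python's open() from several threads is not thread-safe, while the logging module is!

A problem I ran into at work: how do you record logs from several threads of one program at the same time?

At first I took the lazy route and wrote the log with the plain open() function. Because I opened the file in append mode ('a'), I saw nothing that looked thread-unsafe: the log messages of every thread ended up in the log file.

Later I changed the mode to write ('w'), and the problem showed up: only the message of one thread was recorded. The cause is that every open(..., 'w') call truncates the file, so each thread wipes out what the others wrote, and open() does nothing to coordinate concurrent writers.

At that point I decided to stop cutting corners and learn logging, the dedicated logging module in the Python standard library. It did not disappoint: logging is thread-safe. In both write mode and append mode, the messages of all threads were recorded correctly. (logging.basicConfig() configures the root logger only on its first effective call, so the file is opened once, and the handler serializes writes with a lock.)

Here is the test code: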

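import logging
import threading


def file_io(message, mode):
    # Plain open(): every call reopens the file; mode 'w' truncates it each time
    with open('log_test.log', mode) as f:
        f.write(message)
        f.write('\n')


def logging_io(message, mode):
    # basicConfig() only takes effect on the first call; later calls are no-ops
    logging.basicConfig(level='DEBUG',
                        filename='log_test1.log',
                        filemode=mode)
    logging.info(message)


if __name__ == '__main__':
    messages = ['---hello--', '----nihaojlj', '----world%%%%%%%%%%%%%%%%%%']
    threads = []
    for m in messages:
        # Swap the target to file_io to compare the two approaches
        th = threading.Thread(target=logging_io, args=(m, 'a'))
        th.start()
        threads.append(th)
    for th in threads:
        th.join()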

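Summary:

When several threads write to one file in append mode ('a'), I did not observe any thread-safety problem, but plain open() gives no guarantee.
For logging from multiple threads, use the logging module from the standard library. It is thread-safe. Use the specialized module for the specialized job!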
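The summary can be checked with a small self-contained sketch. It is only an illustration under assumptions: the logger name thread_demo and the helper log_from_threads are made up here, and a dedicated FileHandler is used instead of basicConfig() so that the root logger stays untouched.

```python
import logging
import threading


def log_from_threads(messages, log_path):
    """Log each message from its own thread and return the lines written."""
    # A dedicated logger (hypothetical name) keeps the root logger untouched.
    logger = logging.getLogger("thread_demo")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path, mode="w", encoding="utf-8")
    handler.setFormatter(logging.Formatter("%(levelname)s:%(message)s"))
    logger.addHandler(handler)
    try:
        threads = [threading.Thread(target=logger.info, args=(m,))
                   for m in messages]
        for th in threads:
            th.start()
        for th in threads:
            th.join()  # wait until every thread has logged
    finally:
        logger.removeHandler(handler)
        handler.close()
    with open(log_path, encoding="utf-8") as f:
        return f.read().splitlines()
```

The file is opened once by the handler, even in mode "w", and each thread only calls logger.info(), so no thread can truncate what another one wrote.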

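Getting the size of a file: os.path.getsize()

import os

def get_FileSize(filePath):
    '''Return the size of the file in MB, rounded to two decimal places.'''
    # unicode() exists only in Python 2; a Python 3 str path needs no conversion
    fsize = os.path.getsize(filePath)  # size in bytes
    fsize = fsize/float(1024*1024)
    return round(fsize,2)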

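Reading large files in Python

While learning Python I recently came across file reading. A file is usually read with open() together with read():

f = open(filename,'r')
f.read()

This works fine for small files, that is, files far smaller than available memory. But reading, say, a 10 GB log file that is larger than memory this way is a problem: it raises MemoryError because memory runs out.

The reason is that read() without an argument loads the whole file into memory in one go, so a file larger than memory fails.

Solutions:

Besides the bare read(), there are related methods: read(size), readline() and readlines().

(1) read(size): the argument caps how much is read per call, which avoids the problem of reading a huge file at once.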

while True:
    block = f.read(1024)
    if not block:
        break

(2) readline(): reads one line per call

while True:
    line = f.readline()
    if not line:
        break

(3) readlines(): reads all lines into a list that you then process. This still loads the whole file, so it can raise MemoryError just the same.

for line in f.readlines():
    ....

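That covers the basic ways of reading a file in Python, but none of them felt as elegant as Python likes to be. Then I found the following solution.

The Pythonic way (as I understand it, Python code that is very Python):

with open(filename, 'r') as file:
    for line in file:
        ....

Iterating over the file object uses buffered I/O and keeps only the current line (plus the buffer) in memory, so large files are not a concern.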
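As a concrete case of this iteration pattern, here is a minimal sketch that counts the lines containing a keyword. The function name count_matching_lines and its parameters are made up for illustration.

```python
def count_matching_lines(path, keyword, encoding="utf-8"):
    """Count lines containing keyword, reading one line at a time."""
    count = 0
    with open(path, "r", encoding=encoding) as f:
        for line in f:  # the file object yields lines lazily
            if keyword in line:
                count += 1
    return count
```

Only a counter and the current line are held at any moment, so the memory used does not grow with the size of the file.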

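Later I found another module, linecache, which can fetch a specific line by number. Note that linecache caches every line of the file in memory, so it suits repeated random access to files that fit in memory rather than files larger than memory.

import linecache

# Get line 2 (line numbers start at 1)
text = linecache.getline(filename, 2)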
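When the file is too large to cache, one line can still be fetched by number without linecache. The sketch below swaps in itertools.islice over the file iterator; read_line_lazily is a hypothetical helper name, and it returns an empty string for a missing line, as linecache.getline() does.

```python
from itertools import islice


def read_line_lazily(path, lineno, encoding="utf-8"):
    """Return line number lineno (1-based) without caching the file.

    Returns an empty string when the file has fewer lines.
    """
    with open(path, "r", encoding=encoding) as f:
        # islice skips the first lineno - 1 lines and stops after one line
        return next(islice(f, lineno - 1, lineno), "")
```

The trade-off is that every call scans the file from the beginning, so this fits occasional lookups better than repeated random access.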

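Reading large data in chunks with pandas to avoid running out of memory

#coding=utf-8
import pandas as pd
def read_data(file_name):
    '''
    file_name: path of the file
    '''
    # Opening the file ourselves lets paths containing Chinese characters work
    with open(file_name, 'rb') as inputfile:
        data = pd.read_csv(inputfile, iterator=True, header=None)
        loop = True
        chunkSize = 1000    # 1000 rows per chunk
        chunks = []
        while loop:
            try:
                chunk = data.get_chunk(chunkSize)
                chunks.append(chunk)
            except StopIteration:
                loop = False
                print("Iteration is stopped.")
    # Note: concat still holds the whole dataset in memory at the end
    data = pd.concat(chunks, ignore_index=True)
    #print(train.head())
    return data.values

print(read_data("./log_test1.log"))

By default read_csv treats the first row as column names, so your first row of data would be consumed as the header. Pass header=None to declare that the file has no header row; without it, the first row does not show up in the data.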
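To keep memory bounded, each chunk has to be processed as it arrives instead of being collected into one big table. The sketch below shows that idea with the standard csv module rather than pandas; iter_csv_chunks and count_rows are hypothetical names, and counting rows stands in for whatever per-chunk work is needed.

```python
import csv
from itertools import islice


def iter_csv_chunks(path, chunk_size=1000, encoding="utf-8"):
    """Yield lists of at most chunk_size rows from a CSV file."""
    with open(path, "r", newline="", encoding=encoding) as f:
        reader = csv.reader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def count_rows(path, chunk_size=1000):
    """Aggregate chunk by chunk; only one chunk is in memory at a time."""
    return sum(len(chunk) for chunk in iter_csv_chunks(path, chunk_size))
```

The same shape carries over to pandas: loop over the chunks and reduce each one before moving on to the next.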

Reading large files in Python 3

1. Method one: a generator built with yield

def readPart(filePath, size=1024, encoding="utf-8"):
    with open(filePath,"r",encoding=encoding) as f:
        while True:
            part = f.read(size)  # in text mode, size counts characters
            if part:
                yield part
            else:
                return None
filePath = r"filePath"
size = 2048 # amount read into memory per iteration
encoding = 'utf-8'
for part in readPart(filePath,size,encoding):
    print(part)
    # Processing data
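The same block-by-block reading works in binary mode. As one possible use, the sketch below computes a SHA-256 checksum of a large file; file_sha256 is a made-up helper name, and the two-argument form of iter() replaces the explicit while loop.

```python
import hashlib


def file_sha256(path, size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading size bytes at a time."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel calls f.read(size) until it returns b""
        for block in iter(lambda: f.read(size), b""):
            digest.update(block)
    return digest.hexdigest()
```

Because the digest is updated incrementally, the result is the same as hashing the whole content at once, without ever holding more than one block.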

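Fixing blank lines when writing a CSV file in Python

When I wrote data to a CSV file with the approach below, every other line in the file was blank:

with open(os.path.join(outpath,'result.csv'),'w') as cf:
        writer = csv.writer(cf)
        writer.writerow(['shader','file'])
        for key , value in result.items():
            writer.writerow([key,value])

To fix this I looked around and found that it comes from how the file is opened. The csv writer ends each row with '\r\n', and on Windows text mode turns that '\n' into '\r\n' again, which shows up as an empty line. In Python 2, changing the mode to 'wb' makes the problem go away; in other words,

in Python 2, CSV files should be read and written in binary mode. (In Python 3 the csv module needs a text-mode file, and the equivalent fix is open(..., 'w', newline='').)

with open(os.path.join(outpath,'result.csv'),'wb') as cf:
        writer = csv.writer(cf)
        writer.writerow(['shader','file'])
        for key , value in result.items():
            writer.writerow([key,value])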
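For Python 3, where the csv module expects a text-mode file, a minimal sketch of the same fix looks like this. The helper name write_rows is made up; newline='' is the documented way to hand newline handling over to the csv module.

```python
import csv


def write_rows(path, header, rows):
    """Write a header and rows; newline='' prevents extra blank lines."""
    with open(path, "w", newline="", encoding="utf-8") as cf:
        writer = csv.writer(cf)
        writer.writerow(header)
        writer.writerows(rows)
```

With newline='' the file contains exactly one line terminator per row on every platform.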

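A summary of ways to write CSV files in Python

The most common way is the pandas package

#coding=utf-8
import pandas as pd

# any number of lists
a = [1,2,3]
b = [4,5,6]

# the dict keys become the column names in the CSV
dataframe = pd.DataFrame({'a_name':a,'b_name':b})

# save the DataFrame as CSV; index controls whether row labels are written (default True)
dataframe.to_csv("test.csv",index=False,sep=',')

pandas also offers a simple way to read a CSV

import pandas as pd
data = pd.read_csv('test.csv')

This gives you data as a DataFrame; if you are not familiar with handling one, see the pandas "10 minutes to pandas" tutorial.

The other way is the csv package, writing row by row

import csv

# in Python 2, file could be used in place of open
# newline='' (Python 3) keeps the csv module from producing blank lines on Windows
with open("test.csv","w",newline='') as csvfile:
    writer = csv.writer(csvfile)

    # write the column names first
    writer.writerow(["index","a_name","b_name"])
    # writerows writes several rows at once
    writer.writerows([[0,1,3],[1,2,3],[2,3,4]])

Use reader to read a CSV file

import csv
with open("test.csv","r",newline='') as csvfile:
    reader = csv.reader(csvfile)
    # no readlines needed here
    for line in reader:
        print(line)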

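Reading and writing CSV files in Python (create, append, overwrite)

Create:

Open the file for writing and hand it to csv.writer; if the file does not exist, open() creates it. The .csv suffix is only a naming convention, but keep it so that other programs recognize the file as CSV.

Here the file is created and the CSV header row is written to it. Opening with 'w' also overwrites any existing file.

import csv
def create_csv():
    path = "aa.csv"
    # Python 3: text mode with newline=''; Python 2 used 'wb' here
    with open(path,'w',newline='') as f:
        csv_write = csv.writer(f)
        csv_head = ["good","bad"]
        csv_write.writerow(csv_head)

Append:

In Python, opening with 'a+' appends to the file

def write_csv():
    path  = "aa.csv"
    with open(path,'a+',newline='') as f:
        csv_write = csv.writer(f)
        data_row = ["1","2"]
        csv_write.writerow(data_row)

Read:

csv.reader reads a CSV file and returns an iterable object, csv_read, from which rows can be taken directly

def read_csv():
    path = "aa.csv"
    # Python 3: text mode with newline=''; Python 2 used 'rb' here
    with open(path,"r",newline='') as f:
        csv_read = csv.reader(f)
        for line in csv_read:
            print(line)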
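Create and append can also be folded into one step: write the header only when the file is new, then append the row. This is a sketch under assumptions (Python 3, UTF-8 files); append_row and read_rows are hypothetical helper names.

```python
import csv
import os


def append_row(path, header, row):
    """Append row to a CSV file, writing header first if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)


def read_rows(path):
    """Return every row of the CSV file as a list of lists of strings."""
    with open(path, "r", newline="", encoding="utf-8") as f:
        return list(csv.reader(f))
```

Calling append_row() repeatedly with the same header leaves a single header line at the top of the file.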

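Appendix: modes accepted by open():

'r'   read only (the default); the file must exist
'w'   write only; creates the file, or truncates it if it exists
'a'   append; creates the file if needed, writes go to the end
'r+'  read and write; the file must exist
'w+'  read and write; creates or truncates the file
'a+'  read and append; creates the file if needed
'b'   added to any of the above (for example 'rb', 'wb') for binary mode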

Converting an Excel file to CSV in Python

import pandas as pd
data = pd.read_excel('123.xls','Sheet1',index_col=0)
data.to_csv('data.csv',encoding='utf-8')

Getting all file names under the current folder

The os module has two functions for this:

os.walk()

os.listdir()

# -*- coding: utf-8 -*-

import os

def file_name(file_dir):
    for root, dirs, files in os.walk(file_dir):
        print(root)   # path of the current directory
        print(dirs)   # all subdirectories of the current directory
        print(files)  # all non-directory files in the current directory

#coding=utf-8
# recursively collect all file names under a path
import os
allfile = []
def file_name(file_dir):
    for root, dirs, files in os.walk(file_dir):
        print('root_dir:', root)  # path of the current directory
        print('sub_dirs:', dirs)  # all subdirectories of the current directory
        print('files:', files)  # all non-directory files in the current directory
        allfile.extend(files)

file_name('./aa')
print(allfile)

# -*- coding: utf-8 -*-

import os

def file_name(file_dir):
    L=[]
    for root, dirs, files in os.walk(file_dir):
        for file in files:
            if os.path.splitext(file)[1] == '.jpeg':
                L.append(os.path.join(root, file))
    return L


# os.path.splitext() splits a path into (root, extension)
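The extension filter can also be written with pathlib, which walks the tree recursively through rglob(). This is an alternative sketch, not the approach used above; find_files is a made-up name.

```python
from pathlib import Path


def find_files(root, suffix=".jpeg"):
    """Return sorted paths under root whose extension equals suffix."""
    return sorted(str(p) for p in Path(root).rglob("*")
                  if p.is_file() and p.suffix == suffix)
```

Path.suffix plays the role of os.path.splitext(file)[1], and the comparison is case-sensitive, as in the os.walk version.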

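Getting the names of the folders in a directory

import os

path = '/opt'
# os.listdir() returns files and folders alike, so keep only the folders
for name in os.listdir(path):
    if os.path.isdir(os.path.join(path, name)):
        print(name)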

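Getting a file's name and its folder from a path in Python

Seeing the title, experts will think of splitting the path string, for example path.split('\\'). I remember that when I processed large numbers of files, sometimes several hundred thousand text files, I often had to read each one, take its name, and save another file with the same name in a different format.

In fact Python solves each part with a single call, as follows.

Get the file name from a full path: os.path.basename(path)

Get the folder that contains the file: os.path.dirname(path)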
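For the use case mentioned above, saving a file with the same name in another format, the two calls combine with os.path.splitext(). A minimal sketch follows; sibling_with_ext is a hypothetical helper name.

```python
import os


def sibling_with_ext(path, new_ext):
    """Return path's folder, its file name, and a same-named path with new_ext."""
    directory = os.path.dirname(path)   # everything before the last separator
    filename = os.path.basename(path)   # the last path component
    stem, _old_ext = os.path.splitext(filename)
    return directory, filename, os.path.join(directory, stem + new_ext)
```

These functions only manipulate the path string, so the file does not have to exist.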

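Compressing and extracting ZIP files in Python

Reposted from: http://blog.csdn.net/linux__kernel/article/details/8271326

Many people keep searching Google for a compression tool that suits them, not realizing that Python's own compression support is quite good and worth a try. C# and Java have third-party classes for compression as well. Python is praised for many things, such as being strong in math and having high-level data structures, and compression is likewise simple to use in Python; getting the job done is what matters most.

Compressing to a ZIP file in Python:

import zipfile
f = zipfile.ZipFile(target,'w',zipfile.ZIP_DEFLATED)
f.write(filename,file_url)
f.close()

Here target is the path where the zip file is saved. It has to be the path of the file itself, not only a folder such as 'C:\\temp\\'.
ZIP_DEFLATED means compress. The other option, ZIP_STORED, only packs the files without compressing them, somewhat like the relation between tar and gz on Linux.
If write() gets only filename, the file is stored in the archive under the path given in filename. With two arguments, the file at filename is stored in the archive under the name file_url, with no extra folders added.
To compress many files, call write() in a loop.
Close the archive at the end.

Extracting a ZIP file in Python:

f = zipfile.ZipFile("zipfilePath",'r')
for file in f.namelist():
    f.extract(file,"temp/")
f.close()

zipfilePath is the path of the zip file.
The loop visits each file in the archive and extracts it into the "temp/" folder.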
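Putting the pieces together, the sketch below compresses a whole folder with write() in a loop and extracts it again. The helper names zip_dir and unzip_to are made up, and the archive is assumed to be saved outside the folder being compressed.

```python
import os
import zipfile


def zip_dir(src_dir, target):
    """Compress every file under src_dir into the zip file target."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # arcname relative to src_dir keeps the inner folder layout
                zf.write(full, arcname=os.path.relpath(full, src_dir))


def unzip_to(target, dest_dir):
    """Extract the whole archive into dest_dir and return the member names."""
    with zipfile.ZipFile(target, "r") as zf:
        zf.extractall(dest_dir)
        return zf.namelist()
```

The with statement closes the archive even if an error occurs, which replaces the explicit close() call.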

Extract the zip files in the current directory into the current directory, then delete the original zip files

import zipfile
import os

file_list = os.listdir(r'.')

for file_name in file_list:
    if os.path.splitext(file_name)[1] == '.zip':
        print(file_name)

        file_zip = zipfile.ZipFile(file_name, 'r')
        for file in file_zip.namelist():
            file_zip.extract(file, r'.')
        file_zip.close()
        os.remove(file_name)


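Several ways to extract archives in Python

This part covers extracting the following five kinds of archive with Python:

.gz .tar .tgz .zip .rar

Brief introduction

gz: gzip. It normally compresses a single file. Combined with tar, files can be packed first and then compressed.

tar: the packing tool on Linux systems. It only packs and does not compress.

tgz: tar.gz. The files are packed with tar first and the result is compressed with gzip.

zip: not the same as gzip, although it uses a similar algorithm. It can pack and compress several files, but it compresses each file separately, so its compression ratio is lower than that of tar.gz.

rar: packs and compresses files. It started on DOS and is mostly used on Windows.

Its compression ratio is higher than zip's, but it is slower, and random access is slow too.

For a comparison of zip and rar, see:

http://www.comicer.com/stronghorse/water/software/ziprar.htm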

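gz

Because gz normally compresses only one file, it is often used together with a packing tool. For example, pack with tar into XXX.tar, then compress that into XXX.tar.gz.

Extracting a gz file just means reading out the single file inside it. In Python:

import gzip
import os
def un_gz(file_name):
    """ungz zip file"""
    # output name: the input name without ".gz"
    f_name = file_name.replace(".gz", "")
    # create the gzip object
    g_file = gzip.GzipFile(file_name)
    # read() returns the decompressed bytes, so the output is opened in binary mode
    with open(f_name, "wb") as out:
        out.write(g_file.read())
    # close the gzip object
    g_file.close()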
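g_file.read() loads the whole decompressed content into memory, which runs into the large-file problem discussed earlier. A streaming variant can be sketched with shutil.copyfileobj; un_gz_stream and its out_name parameter are made up for illustration.

```python
import gzip
import shutil


def un_gz_stream(file_name, out_name):
    """Decompress file_name to out_name without loading it all into memory."""
    with gzip.open(file_name, "rb") as src, open(out_name, "wb") as dst:
        shutil.copyfileobj(src, dst)  # copies in fixed-size blocks
```

Both files are closed by the with statement, so no explicit close() is needed.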

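tar

Extracting XXX.tar.gz gives XXX.tar, which must then be unpacked in turn.

*Note: tgz and tar.gz are the same format. Old DOS versions allowed at most three characters in an extension, hence tgz.

Because there are several files here, we read all the file names first and then extract them:

import os
import tarfile
def un_tar(file_name):
    """untar zip file"""
    tar = tarfile.open(file_name)
    names = tar.getnames()
    # many files come out, so create a directory named after the archive first
    if os.path.isdir(file_name + "_files"):
        pass
    else:
        os.mkdir(file_name + "_files")
    for name in names:
        tar.extract(name, file_name + "_files/")
    tar.close()

*Note: tgz files are extracted the same way as tar files.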

zip

As with tar, read the file names first and then extract:

import os
import zipfile
def un_zip(file_name):
    """unzip zip file"""
    zip_file = zipfile.ZipFile(file_name)
    if os.path.isdir(file_name + "_files"):
        pass
    else:
        os.mkdir(file_name + "_files")
    for names in zip_file.namelist():
        zip_file.extract(names,file_name + "_files/")
    zip_file.close()

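rar

rar is mostly used on Windows, and Python needs the extra package rarfile for it.

Available at: http://sourceforge.net/projects/rarfile.berlios/files/rarfile-2.4.tar.gz/download

Unpack it into the /Scripts/ folder of the Python installation, open a command line there,

and run python setup.py install

to complete the installation. (The package is also on PyPI, so pip install rarfile should work too; rarfile relies on an external unrar-style tool to decompress.)

import rarfile
import os
def un_rar(file_name):
    """unrar zip file"""
    rar = rarfile.RarFile(file_name)
    if os.path.isdir(file_name + "_files"):
        pass
    else:
        os.mkdir(file_name + "_files")
    # extract into the target folder instead of changing the working directory
    rar.extractall(file_name + "_files")
    rar.close()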

Packing with tar

When writing the packing code, tar.add() stores each file under the path you pass in. Passing arcname lets you store the file in the tar archive under a name of your own choosing.

Packing code:

#!/usr/bin/env /usr/local/bin/python
# encoding: utf-8
import tarfile
import os
import time
start = time.time()
tar = tarfile.open('/path/to/your.tar', 'w')
for root, dirs, files in os.walk('/path/to/dir/'):
    for file in files:
        fullpath = os.path.join(root, file)
        # arcname=file drops the directory part of the stored name
        tar.add(fullpath, arcname=file)
tar.close()
print(time.time() - start)
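With arcname set to the bare file name, two files that share a name in different folders end up under one name in the archive. A possible variation, sketched here with a made-up helper tar_dir, stores each file under its path relative to the packed folder instead.

```python
import os
import tarfile


def tar_dir(src_dir, target, mode="w:gz"):
    """Pack every file under src_dir into target; return the stored names."""
    with tarfile.open(target, mode) as tar:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # A path relative to src_dir keeps same-named files apart
                tar.add(full, arcname=os.path.relpath(full, src_dir))
    with tarfile.open(target, "r:*") as tar:
        return sorted(tar.getnames())
```

The target is assumed to be outside src_dir, otherwise the walk could pick up the archive while it is being written.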

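A compression scheme can be chosen while packing, for example to pack with gzip compression:

tar=tarfile.open('/path/to/your.tar.gz','w:gz')

The other modes are listed in the table below.

tarfile.open accepts many modes: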

mode action

'r' or 'r:*'	Open for reading with transparent compression (recommended).
'r:'	Open for reading exclusively without compression.
'r:gz'	Open for reading with gzip compression.
'r:bz2'	Open for reading with bzip2 compression.
'a' or 'a:'	Open for appending with no compression. The file is created if it does not exist.
'w' or 'w:'	Open for uncompressed writing.
'w:gz'	Open for gzip compressed writing.
'w:bz2'	Open for bzip2 compressed writing.

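Unpacking tar

A tar archive can also be unpacked according to its compression format.

#!/usr/bin/env /usr/local/bin/python
# encoding: utf-8
import tarfile
import time
start = time.time()
t = tarfile.open("/path/to/your.tar", "r:")
t.extractall(path = '/path/to/extractdir/')
t.close()
print(time.time() - start)

The code above extracts everything. Members can also be handled one at a time, but if the archive holds a great many files, keep an eye on memory.

tar = tarfile.open(filename, 'r:gz')
for tar_info in tar:
    # extractfile() returns None for members that are not regular files
    file = tar.extractfile(tar_info)
    if file is not None:
        do_something_with(file)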
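As a stand-in for do_something_with, the sketch below reads every regular file in the archive block by block and records how many bytes it held. member_sizes is a hypothetical name, and the 64 KB block size is an arbitrary choice.

```python
import tarfile


def member_sizes(filename):
    """Map each regular file in the archive to the byte count read from it."""
    sizes = {}
    with tarfile.open(filename, "r:*") as tar:
        for tar_info in tar:
            if not tar_info.isfile():
                continue  # directories and other special members carry no data
            with tar.extractfile(tar_info) as member:
                total = 0
                # Read in blocks so one huge member never sits in memory whole
                for block in iter(lambda: member.read(64 * 1024), b""):
                    total += len(block)
            sizes[tar_info.name] = total
    return sizes
```

Nothing is written to disk here; extractfile() hands back a file-like object that reads straight from the archive.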

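Extra blank lines when reading a txt file line by line in Python

While doing a programming assignment these days I needed Python's file reading, and ran into a small problem when reading line by line with readlines(). Here it is.

The content of the txt file looks like this:

The code seems fine too:

with open('001.txt','r') as f:
    lines = f.readlines()
    for line in lines:
        print(line)

But when it runs:

A strange blank line appears between every two lines. What is going on?

Each line in the file ends with a hidden newline character "\n". It is still part of the string after reading, and print() adds a newline of its own, so the two newlines together produce a blank line.

How do we get rid of this bug?

It is simple. Strings have two built-in methods, .strip() and .rstrip():

.strip() removes the given characters from both ends of the string
.rstrip() removes the given characters from the end of the string

With nothing in the parentheses, they remove whitespace, which includes spaces, tabs and newlines.

OK, let's try again:

with open('001.txt','r') as f:
    lines = f.readlines()
    for line in lines:
        print(line.strip())

Output:

Problem solved!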
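One caveat: a bare strip() also removes leading indentation and trailing spaces. If only the line break should go, rstrip("\n") is the narrower tool. A small sketch, with read_clean_lines as a made-up helper name:

```python
def read_clean_lines(path, encoding="utf-8"):
    """Return the file's lines with only the trailing newline removed."""
    with open(path, "r", encoding=encoding) as f:
        # rstrip("\n") keeps leading indentation and trailing spaces,
        # which a bare strip() would also delete
        return [line.rstrip("\n") for line in f]
```

Another option is print(line, end=""), which leaves the string alone and stops print() from adding its own newline.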

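Deleting a file or a folder

import os
import shutil
name = "test"
if os.path.isdir(name):    # the folder exists
    if not os.listdir(name):    # is the folder empty?
        os.rmdir(name)  # rmdir can only remove an empty folder
    else:
        shutil.rmtree(name)    # removes a folder that is not empty

path = "test.txt"  # placeholder file name
if os.path.exists(path):  # if the file exists
    # delete the file
    os.remove(path)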
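The two cases can be wrapped in one helper. The sketch below is an illustration with a made-up name, remove_path; it relies on shutil.rmtree() handling empty and non-empty folders alike.

```python
import os
import shutil


def remove_path(path):
    """Delete a file or a folder (empty or not); return True if removed."""
    if os.path.isdir(path) and not os.path.islink(path):
        shutil.rmtree(path)  # handles empty and non-empty folders alike
        return True
    if os.path.lexists(path):
        os.remove(path)      # regular files and symbolic links
        return True
    return False
```

Returning False for a missing path lets the caller tell "nothing to delete" apart from a successful removal.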

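A note on fixing "'gb2312' codec can't decode byte 0x9b" when reading a CSV

Problem: UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 2-3: illegal multibyte sequence

Cause: the error occurs when Python (Python 2 here) converts an ordinary byte string into a unicode object,

for example u_string = unicode(string , "gb2312"). If the string contains certain characters, such as some traditional characters or the 滘 in "河滘小学",

then gb2312, an encoding for simplified Chinese, cannot decode them. You have to use GBK, the extended national standard encoding, which is a superset of gb2312 and supports traditional Chinese characters and Japanese kana.

Solution: use gbk instead of gb2312, for example u_string = unicode(string , "gbk"). In Python 3 the equivalent is string.decode("gbk") on a bytes object, or open(path, encoding="gbk") when reading a file.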
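In Python 3 the same idea can be sketched as a fallback chain: try the narrow encoding first and move on to its supersets when decoding fails. decode_chinese is a hypothetical helper, and gb18030 is added here as a last resort on my own assumption that a still wider superset may be useful.

```python
def decode_chinese(raw, encodings=("gb2312", "gbk", "gb18030")):
    """Decode raw bytes with the first encoding that accepts them.

    Returns (text, encoding_used); raises UnicodeDecodeError if none fits.
    """
    last_error = None
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError as exc:
            last_error = exc
    raise last_error
```

For bytes that contain only simplified characters the first encoding already succeeds; the fallbacks only come into play for characters outside gb2312.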

posted @ 2019-10-24 08:54 南哥的天下
