11. Data Reading (Part 3): File-Reading Workflow and Reading CSV Files

1. File-reading workflow. The files to be processed can be CSV files, binary files, image files, and other formats.

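Suppose there are four files, A, B, C, and D, each containing 100 samples.

  ① Build a file queue.

  ② Put the path + name of each file into the queue.

  ③ Read the file contents. By default one file is read at a time, one sample per read:

      CSV file: read one line (each line is one sample)

      Binary file: read the number of bytes that make up one sample

      Image file: read one image at a time

  ④ Decode. Different file types use different reading and decoding APIs.

  ⑤ Batch. Because each read returns only one sample, batching is needed to collect multiple samples from the files.

      Main thread: takes batches of samples for training, for example 50 samples at a time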
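The five steps can be sketched in plain Python before looking at the TensorFlow API. This is only an illustration under stated assumptions, not how TensorFlow implements its queues: the helper names (make_sample_files, read_samples, batches) and the file contents are hypothetical, and everything runs in a single thread.

```python
import csv
import os
import tempfile
from collections import deque


def make_sample_files(directory, names=("A", "B", "C", "D"), samples_per_file=100):
    """Write hypothetical CSV files, one sample per line: feature,label."""
    paths = []
    for name in names:
        path = os.path.join(directory, name + ".csv")
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            for i in range(samples_per_file):
                writer.writerow([f"{name}_feature{i}", f"{name}_label{i}"])
        paths.append(path)
    return paths


def read_samples(file_queue):
    """Steps 3 and 4: take one file at a time, read it line by line, decode each line."""
    while file_queue:
        path = file_queue.popleft()
        with open(path, newline="") as f:
            for row in csv.reader(f):
                yield row  # one decoded sample: [feature, label]


def batches(samples, batch_size):
    """Step 5: group single samples into batches of batch_size."""
    batch = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller batch


with tempfile.TemporaryDirectory() as tmp:
    # Steps 1 and 2: build a queue and put the path + name of every file into it.
    file_queue = deque(make_sample_files(tmp))
    all_batches = list(batches(read_samples(file_queue), batch_size=50))

print(len(all_batches), len(all_batches[0]))
```

With four files of 100 samples each and a batch size of 50, the sketch produces eight batches of 50 samples.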

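2. Reading CSV files

(1) File-reading API: building the file queue
  tf.train.string_input_producer(string_tensor, num_epochs=None, shuffle=True)
    Feeds the given strings (for example, file names) into a pipeline queue.
    ● string_tensor: a 1-D tensor containing the file names
    ● num_epochs: how many passes to make over the data; by default it cycles through the data indefinitely
    ● return: a queue holding the output strings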
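To make num_epochs and shuffle concrete, here is a small pure-Python sketch of the behaviour described above. The generator filename_producer and its seed parameter are hypothetical stand-ins, not part of TensorFlow; the sketch assumes that shuffling happens separately within each epoch.

```python
import random


def filename_producer(filenames, num_epochs=None, shuffle=True, seed=None):
    """Yield file names epoch by epoch; num_epochs=None means cycle forever."""
    rng = random.Random(seed)
    epoch = 0
    while num_epochs is None or epoch < num_epochs:
        order = list(filenames)
        if shuffle:
            rng.shuffle(order)  # a fresh order for every epoch
        for name in order:
            yield name
        epoch += 1


names = ["A.csv", "B.csv", "C.csv"]
two_epochs = list(filename_producer(names, num_epochs=2, shuffle=True, seed=0))
print(two_epochs)
```

Each epoch contains every file name exactly once; only the order inside an epoch changes when shuffle is on.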

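(2) File-reading API: file readers

  ① Choose the reader that matches the file format.

  ② class tf.TextLineReader

    ● Reads text files in comma-separated values (CSV) format, one line at a time by default

    ● return: a reader instance

  ③ tf.FixedLengthRecordReader(record_bytes)

    ● Reads binary files in which every record has a fixed number of bytes

    ● record_bytes: integer, the number of bytes to read each time

    ● return: a reader instance

  ④ tf.TFRecordReader

    ● Reads TFRecords files

All readers share a common read method:

● read(file_queue): reads the reader's unit of content (one line, or record_bytes bytes) from the files in the queue

Returns a tuple of tensors (key, value): key identifies the file the record came from, and value is the content that was read (a line or a fixed number of bytes).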
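The idea of reading fixed-size records and getting back a (key, value) pair can be illustrated without TensorFlow. The function below is a hypothetical sketch: the key format ("name:index") and the choice to drop a trailing partial record are assumptions made for the example, not documented TensorFlow behaviour.

```python
import io


def read_fixed_length_records(name, stream, record_bytes):
    """Yield (key, value) pairs, where value is exactly record_bytes long."""
    index = 0
    while True:
        value = stream.read(record_bytes)
        if len(value) < record_bytes:
            break  # end of file (a trailing partial record is dropped here)
        yield f"{name}:{index}", value
        index += 1


# Hypothetical binary file: 3 records of 4 bytes each.
data = io.BytesIO(bytes(range(12)))
records = list(read_fixed_length_records("samples.bin", data, record_bytes=4))
print(records)
```

Every value is still a raw byte string at this point, which is why a separate decoding step follows the read.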

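(3) File-reading API: file content decoders

  ① What is read from a file is a string, so a function is needed to parse these strings into tensors.

  ② tf.decode_csv(records, record_defaults, field_delim=',', name=None)

    Converts CSV records to tensors; used together with tf.TextLineReader.

    ● records: a string tensor in which each string is one record (line) of the CSV

    ● field_delim: the field delimiter, "," by default

    ● record_defaults: determines the type of each resulting tensor and sets a default value, which is used when the corresponding field is missing from the input string.

  ③ tf.decode_raw(bytes, out_type, little_endian=True, name=None)

    Converts bytes into a vector of numbers. bytes is a tensor of type string. Used together with tf.FixedLengthRecordReader; binary data is read as uint8.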
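The role of record_defaults (type plus fallback value) and the byte-to-number conversion can be mimicked in a few lines of plain Python. Both functions are hypothetical sketches: decode_csv_line uses a naive split that ignores quoted fields, and decode_raw_uint8 covers only the uint8 case.

```python
def decode_csv_line(line, record_defaults, field_delim=","):
    """Split one CSV line and convert each field to the type of its default."""
    fields = line.split(field_delim)
    if len(fields) != len(record_defaults):
        raise ValueError(
            "expected %d fields, got %d" % (len(record_defaults), len(fields))
        )
    decoded = []
    for field, default in zip(fields, record_defaults):
        if field == "":
            decoded.append(default[0])  # missing field: use the default value
        else:
            decoded.append(type(default[0])(field))  # the default's type drives parsing
    return decoded


def decode_raw_uint8(raw):
    """Interpret every byte as one unsigned 8-bit integer."""
    return list(raw)


print(decode_csv_line("Alpha1,A1", [["None"], ["None"]]))
print(decode_csv_line("3.5,,7", [[0.0], ["None"], [0]]))
print(decode_raw_uint8(b"\x00\x01\xff"))
```

In the second call the empty middle field falls back to its default, while the first and third fields are parsed as a float and an int because their defaults have those types.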

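(4) Batching at the reading end of the pipeline

  ① tf.train.batch(tensors, batch_size, num_threads=1, capacity=32, name=None)

    ● Reads a batch of the specified size (number of samples)

    ● tensors: can be a list of tensors

    ● batch_size: the number of samples in each batch taken from the queue

    ● num_threads: the number of threads enqueuing tensors

    ● capacity: integer, the maximum number of elements in the queue

    ● return: tensors

  ② tf.train.shuffle_batch(tensors, batch_size, capacity, min_after_dequeue, num_threads=1)

    ● Reads a batch of the specified size in random order

    ● min_after_dequeue: the minimum number of elements left in the queue after a dequeue, so that enough remain to keep the shuffling effective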
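How capacity and min_after_dequeue interact is easier to see in a single-threaded sketch. The generator shuffle_batches below is hypothetical and only approximates the idea: it requires capacity >= min_after_dequeue + batch_size (an assumption of this sketch) so that at least min_after_dequeue samples stay in the buffer while input remains, and it drops a final incomplete batch.

```python
import random


def shuffle_batches(samples, batch_size, capacity, min_after_dequeue, seed=None):
    """Yield batches drawn at random from a bounded buffer of samples."""
    if capacity < min_after_dequeue + batch_size:
        raise ValueError("capacity should be at least min_after_dequeue + batch_size")
    rng = random.Random(seed)
    source = iter(samples)
    buffer = []
    exhausted = False
    while True:
        # Refill the buffer up to capacity while input is available.
        while not exhausted and len(buffer) < capacity:
            try:
                buffer.append(next(source))
            except StopIteration:
                exhausted = True
        # Once the input has run out, the buffer is simply drained.
        if len(buffer) < batch_size:
            return  # a leftover smaller than batch_size is dropped
        batch = []
        for _ in range(batch_size):
            batch.append(buffer.pop(rng.randrange(len(buffer))))
        yield batch


shuffled = list(
    shuffle_batches(range(20), batch_size=5, capacity=15, min_after_dequeue=10, seed=0)
)
print(shuffled)
```

A larger min_after_dequeue means each batch is drawn from a bigger pool, which mixes samples better at the cost of more memory and a longer start-up fill.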

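(5) Step-by-step implementation

  ① Find the files first and build a list of path + name.

import os
file_name = os.listdir(r"E:\pythonprogram\deepLearning\base\csvdata")
print(file_name)
fileList = [os.path.join(r"E:\pythonprogram\deepLearning\base\csvdata", file) for file in file_name]  # path + name

Output:

['A.csv', 'B.csv', 'C.csv']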

  ② Build the file queue.

import tensorflow as tf  # TensorFlow 1.x
file_queue = tf.train.string_input_producer(fileList)  # first argument string_tensor: the list of file paths + names; returns a queue

  ③ Build a reader and read the queue contents (one line at a time).

reader = tf.TextLineReader()  # no arguments needed
key, value = reader.read(file_queue)  # returns a tuple of tensors (key: file name, value: the content read, here one line)
print("value: ", value)
print("key: \n", key)

Output:

value:  Tensor("ReaderReadV2:1", shape=(), dtype=string)
key: 
 Tensor("ReaderReadV2:0", shape=(), dtype=string)

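  ④ Decode the contents. record_defaults (a) specifies the type of each column of every sample (that is, the type to decode it as, such as int) and (b) specifies the default values.

records = [["None"], ["None"]]  # decode both columns as strings, with "None" as the default value
example, labels = tf.decode_csv(value, record_defaults=records)
print("example:\n", example)
print("labels:\n", labels)

Output:

example:
 Tensor("DecodeCSV:0", shape=(), dtype=string)
labels:
 Tensor("DecodeCSV:1", shape=(), dtype=string)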

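  ⑤ Batch (multiple samples). Reading more than one sample at a time requires batching.

example_batch, labels_batch = tf.train.batch([example, labels], batch_size=9, num_threads=1, capacity=9)
# batch_size decides how many samples each batch contains; capacity is the maximum number of elements the queue can hold (here set to the same value)
print("example_batch:", example_batch)
print("labels_batch:", labels_batch)

Output:

example_batch: Tensor("batch:0", shape=(9,), dtype=string)
labels_batch: Tensor("batch:1", shape=(9,), dtype=string)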

  ⑥ Run in a session.

# Open a session to run the graph and get the results
with tf.Session() as sess:
    # Create a thread coordinator
    coord = tf.train.Coordinator()

    # Start the threads that read the files
    threads = tf.train.start_queue_runners(sess, coord=coord)

    # Print what was read
    print(sess.run([example_batch, labels_batch]))

    # Stop and join the child threads
    coord.request_stop()
    coord.join(threads)

Output:

[array([b'Bee1', b'Bee2', b'Bee3', b'Alpha1', b'Alpha2', b'Alpha3',
       b'Sea1', b'Sea2', b'Sea3'], dtype=object), array([b'B1', b'B2', b'B3', b'A1', b'A2', b'A3', b'C1', b'C2', b'C3'],
      dtype=object)]

The files appear in the order B, A, C because string_input_producer shuffles the file names by default (shuffle=True).
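The contents of A.csv, B.csv, and C.csv are not shown in this post. Judging only from the printed batch, each file probably holds three lines of the form example,label. The sketch below writes files with that assumed content into a temporary directory, so the steps can be tried without the original csvdata folder; SAMPLE_ROWS and write_sample_csvs are hypothetical names.

```python
import os
import tempfile

# Assumed file contents, inferred from the printed batch; the original files are not shown.
SAMPLE_ROWS = {
    "A.csv": [("Alpha1", "A1"), ("Alpha2", "A2"), ("Alpha3", "A3")],
    "B.csv": [("Bee1", "B1"), ("Bee2", "B2"), ("Bee3", "B3")],
    "C.csv": [("Sea1", "C1"), ("Sea2", "C2"), ("Sea3", "C3")],
}


def write_sample_csvs(directory):
    """Write the assumed CSV files and return their full paths."""
    paths = []
    for name, rows in sorted(SAMPLE_ROWS.items()):
        path = os.path.join(directory, name)
        with open(path, "w") as f:
            for example, label in rows:
                f.write(example + "," + label + "\n")
        paths.append(path)
    return paths


csv_dir = tempfile.mkdtemp()  # stand-in for the csvdata directory
fileList = write_sample_csvs(csv_dir)
print(sorted(os.listdir(csv_dir)))
```

The resulting fileList can be passed to tf.train.string_input_producer in place of the list built from the original directory.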

(6) Complete code

# This code targets TensorFlow 1.x (queue-based input pipeline).
import tensorflow as tf
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # suppress warnings by raising the log level


def csvRead(fileList):
    """
    Read CSV files.
    :param fileList: list of file paths + names
    :return: the contents that were read
    """
    # 1. Build the file queue
    file_queue = tf.train.string_input_producer(fileList)  # first argument string_tensor: the file list built in main; returns a queue

    # 2. Build a reader to read the queue data, one line at a time by default
    reader = tf.TextLineReader()  # no arguments needed
    key, value = reader.read(file_queue)  # returns a tuple of tensors (key: file name, value: the content read, here one line)
    print("value: ", value)  # Tensor("ReaderReadV2:1", shape=(), dtype=string)
    print("key: \n", key)

    # 3. Decode the contents of each line
    # record_defaults (a) specifies the type of each column of every sample (the type to decode it as, such as int), (b) specifies the default values
    records = [["None"], ["None"]]  # decode both columns as strings
    example, labels = tf.decode_csv(value, record_defaults=records)
    print("example:\n", example)
    print("labels:\n", labels)

    # 4. Reading more than one sample at a time requires batching
    example_batch, labels_batch = tf.train.batch([example, labels], batch_size=9, num_threads=1, capacity=9)
    # batch_size decides how many samples each batch contains; capacity is the maximum number of elements the queue can hold
    print("example_batch:", example_batch)
    print("labels_batch:", labels_batch)

    return example_batch, labels_batch
# The batch size is independent of the queue and of the amount of data; it only decides how many samples this batch takes

if __name__ == '__main__':
    # 1. Find the files and put path + name into a list
    file_name = os.listdir(r"E:\pythonprogram\deepLearning\base\csvdata")
    print(file_name)
    fileList = [os.path.join(r"E:\pythonprogram\deepLearning\base\csvdata", file) for file in file_name]
    # print(fileList)
    example_batch, labels_batch = csvRead(fileList)  # batching

    # Open a session to run the graph and get the results
    with tf.Session() as sess:
        # Create a thread coordinator
        coord = tf.train.Coordinator()

        # Start the threads that read the files
        threads = tf.train.start_queue_runners(sess, coord=coord)

        # Print what was read
        print(sess.run([example_batch, labels_batch]))

        # Stop and join the child threads
        coord.request_stop()
        coord.join(threads)

posted on 2019-11-18 14:31 Luaser