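Fine-tuning CaffeNet for Style Recognition on “Flickr Style” Data: problems encountered while downloading the data
(The download prints no progress, so I could not tell whether it was still downloading or had hit an error and hung.) It simply never responded.
Before downloading, switch to root with sudo su, then run python examples/finetune_flickr_style/assemble_data.py --workers=1 --images=2000 --seed 831486
Alternatively, prefix the command with sudo.
Reference: http://blog.csdn.net/lujiandong1/article/details/50495454
While following this tutorial, I ran into two main problems:
1. The data would not download.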
- python examples/finetune_flickr_style/assemble_data.py --workers=1 --images=2000 --seed 831486
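When I ran the command above, the program simply stopped making progress: no files were downloaded, yet the process did not exit. It looked like a deadlock.
Looking at the source, assemble_data.py downloads the images through a multiprocessing worker pool. My fix was to change the source so that it no longer uses worker processes for downloading. I also set a timeout of 6 s, so a download that stalls for longer than that is treated as failed and skipped.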
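To illustrate why a timeout helps here, the following is a minimal sketch (not part of assemble_data.py) of what socket.setdefaulttimeout does. It assumes the hang comes from a connection that is opened but never answered, which is a guess consistent with the symptoms rather than something confirmed. The helper name recv_with_default_timeout and the local never-replying server are made up for the demonstration.

```python
import socket


def recv_with_default_timeout(seconds):
    """Connect to a local server that never replies; return True if recv timed out."""
    # Listening socket on the loopback interface; port 0 lets the OS pick a free port.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)  # applies to sockets created from now on
    try:
        client = socket.create_connection(server.getsockname())
        try:
            # The server never sends anything, so without a timeout
            # this call would block indefinitely.
            client.recv(1)
            return False
        except socket.timeout:
            return True
        finally:
            client.close()
    finally:
        socket.setdefaulttimeout(previous)  # restore the earlier setting
        server.close()


if __name__ == "__main__":
    print(recv_with_default_timeout(0.5))
```

Note that the timeout bounds each blocking socket operation (connect, read), not the total time of a download, so a slow but steadily progressing transfer is not cut off.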
====================================================================================================
The original multiprocessing code in assemble_data.py is:
- pool = multiprocessing.Pool(processes=num_workers)
- map_args = zip(df['image_url'], df['image_filename'])
- results = pool.map(download_image, map_args)
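One reason the run gives no feedback is that pool.map returns only after every item has finished, so a stall looks the same as slow progress. The sketch below is an alternative I have not tested against Flickr: it uses a thread pool with imap_unordered and prints a running count. The helper name run_with_progress and its parameters are hypothetical, and it does not by itself prevent a stalled connection, so a socket timeout is still needed.

```python
from multiprocessing.pool import ThreadPool


def run_with_progress(func, items, num_workers=4, report_every=100):
    """Apply func to every item on a thread pool, printing a running count.

    Results are returned in the same order as the input items.
    """
    items = list(items)
    results = [None] * len(items)

    def indexed(pair):
        # Carry the position along so results can be stored in input order.
        index, item = pair
        return index, func(item)

    pool = ThreadPool(processes=num_workers)
    try:
        done = 0
        # imap_unordered yields each result as soon as it is ready,
        # unlike map, which returns only after all items have finished.
        for index, value in pool.imap_unordered(indexed, enumerate(items)):
            results[index] = value
            done += 1
            if done % report_every == 0 or done == len(items):
                print("{}/{} done".format(done, len(items)))
    finally:
        pool.close()
        pool.join()
    return results
```

In the script this would be called as results = run_with_progress(download_image, map_args, num_workers), after which df[results] works as before because the order of the results matches the order of the rows.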
===================================================================================================
My modified source is as follows:
- #!/usr/bin/env python3
- """
- Form a subset of the Flickr Style data, download images to dirname, and write
- Caffe ImagesDataLayer training file.
- """
- import os
- import hashlib
- import argparse
- import numpy as np
- import pandas as pd
- from skimage import io
- import multiprocessing
- import socket
- # urlretrieve lives in different modules in Python 2 and Python 3.
- try:
-     from urllib.request import urlretrieve  # Python 3
- except ImportError:
-     from urllib import urlretrieve  # Python 2
- # Flickr returns a special image if the request is unavailable.
- MISSING_IMAGE_SHA1 = '6a92790b1c2a301c6e7ddef645dca1f53ea97ac2'
- example_dirname = os.path.abspath(os.path.dirname(__file__))
- caffe_dirname = os.path.abspath(os.path.join(example_dirname, '../..'))
- training_dirname = os.path.join(caffe_dirname, 'data/flickr_style')
- def download_image(args_tuple):
-     "Original worker for multiprocessing map (no longer called). Returns True on success, False on failure."
-     try:
-         url, filename = args_tuple
-         if not os.path.exists(filename):
-             urlretrieve(url, filename)
-         with open(filename, 'rb') as f:  # binary mode: hash the raw bytes
-             assert hashlib.sha1(f.read()).hexdigest() != MISSING_IMAGE_SHA1
-         test_read_image = io.imread(filename)  # fails if the file is not a readable image
-         return True
-     except KeyboardInterrupt:
-         raise Exception()  # multiprocessing doesn't catch keyboard exceptions
-     except Exception:
-         return False
- def mydownload_image(args_tuple):
-     "Copy of download_image used by the sequential loop. Returns True on success, False on failure."
-     try:
-         url, filename = args_tuple
-         if not os.path.exists(filename):
-             urlretrieve(url, filename)
-         with open(filename, 'rb') as f:  # binary mode: hash the raw bytes
-             assert hashlib.sha1(f.read()).hexdigest() != MISSING_IMAGE_SHA1
-         test_read_image = io.imread(filename)  # fails if the file is not a readable image
-         return True
-     except KeyboardInterrupt:
-         raise Exception()  # multiprocessing doesn't catch keyboard exceptions
-     except Exception:
-         return False
- if __name__ == '__main__':
-     parser = argparse.ArgumentParser(
-         description='Download a subset of Flickr Style to a directory')
-     parser.add_argument(
-         '-s', '--seed', type=int, default=0,
-         help="random seed")
-     parser.add_argument(
-         '-i', '--images', type=int, default=-1,
-         help="number of images to use (-1 for all [default])",
-     )
-     parser.add_argument(
-         '-w', '--workers', type=int, default=-1,
-         help="num workers used to download images. -x uses (all - x) cores [-1 default]."
-     )
-     parser.add_argument(
-         '-l', '--labels', type=int, default=0,
-         help="if set to a positive value, only sample images from the first number of labels."
-     )
-     args = parser.parse_args()
-     np.random.seed(args.seed)
-     # Read data, shuffle order, and subsample.
-     csv_filename = os.path.join(example_dirname, 'flickr_style.csv.gz')
-     df = pd.read_csv(csv_filename, index_col=0, compression='gzip')
-     df = df.iloc[np.random.permutation(df.shape[0])]
-     if args.labels > 0:
-         df = df.loc[df['label'] < args.labels]
-     if args.images > 0 and args.images < df.shape[0]:
-         df = df.iloc[:args.images]
-     # Make directory for images and get local filenames.
-     if training_dirname is None:
-         training_dirname = os.path.join(caffe_dirname, 'data/flickr_style')
-     images_dirname = os.path.join(training_dirname, 'images')
-     if not os.path.exists(images_dirname):
-         os.makedirs(images_dirname)
-     df['image_filename'] = [
-         os.path.join(images_dirname, _.split('/')[-1]) for _ in df['image_url']
-     ]
-     # Download images.
-     num_workers = args.workers
-     if num_workers <= 0:
-         num_workers = multiprocessing.cpu_count() + num_workers
-     print('Downloading {} images with {} workers...'.format(
-         df.shape[0], num_workers))
-     #pool = multiprocessing.Pool(processes=num_workers)
-     map_args = zip(df['image_url'], df['image_filename'])
-     #results = pool.map(download_image, map_args)
-     socket.setdefaulttimeout(6)  # give up on a stalled connection after 6 s
-     results = []
-     # Download sequentially in this process instead of through the pool.
-     for item in map_args:
-         value = mydownload_image(item)
-         results.append(value)
-         if not value:
-             print('False')
-         else:
-             print('1')
-     # Only keep rows with valid images, and write out training file lists.
-     print(len(results))
-     df = df[results]
-     for split in ['train', 'test']:
-         split_df = df[df['_split'] == split]
-         filename = os.path.join(training_dirname, '{}.txt'.format(split))
-         split_df[['image_filename', 'label']].to_csv(
-             filename, sep=' ', header=None, index=None)
-     print('Writing train/val for {} successfully downloaded images.'.format(
-         df.shape[0]))
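The main changes are:
1. The shebang line is #!/usr/bin/env python3, i.e. Python 3. (The shebang only takes effect when the script is executed directly; running it as python examples/... uses whichever interpreter python points to.)
2. The download loop: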
-     #pool = multiprocessing.Pool(processes=num_workers)
-     map_args = zip(df['image_url'], df['image_filename'])
-     #results = pool.map(download_image, map_args)
-     socket.setdefaulttimeout(6)
-     results = []
-     for item in map_args:
-         value = mydownload_image(item)
-         results.append(value)
-         if not value:
-             print('False')
-         else:
-             print('1')
-     # Only keep rows with valid images, and write out training file lists.
-     print(len(results))
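The images are now downloaded sequentially in a single process instead of through the multiprocessing pool, and the connection timeout is set to 6 s with socket.setdefaulttimeout(6).
With these changes, the data downloads successfully.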
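socket.setdefaulttimeout(6) changes the timeout for every socket the process creates afterwards. If that is too broad, one alternative is to pass the timeout per request. The sketch below is my own variation, not the code used above: the function name fetch_with_timeout and the ".part" temporary suffix are hypothetical, and it targets Python 3 only.

```python
import os
import shutil
import urllib.request


def fetch_with_timeout(url, filename, timeout=6.0):
    """Download url to filename; return True on success, False on any error."""
    part = filename + ".part"  # write here first so a failed download leaves no partial file
    try:
        # The timeout applies to this request only and leaves the
        # process-wide default untouched.
        with urllib.request.urlopen(url, timeout=timeout) as response, \
                open(part, "wb") as out:
            shutil.copyfileobj(response, out)
        os.replace(part, filename)  # move into place only after a complete download
        return True
    except Exception:
        if os.path.exists(part):
            os.remove(part)
        return False
```

Writing to a temporary name first matters for this script because it skips the download when the target file already exists, so a truncated file left behind by an interrupted transfer would otherwise never be fetched again.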
===================================================================================================
2. The pretrained weights failed to parse.
When I ran the command:
- ./build/tools/caffe train -solver models/finetune_flickr_style/solver.prototxt -weights models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
I got the error:
Failed to parse NetParameter file: models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel
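The error occurs because the bvlc_reference_caffenet.caffemodel file passed in was no longer valid binary data.
Cause: downloading directly on the server with wget was too slow, so I downloaded bvlc_reference_caffenet.caffemodel on Windows 7 and transferred it to the server with WinSCP. During the transfer, WinSCP altered the contents of bvlc_reference_caffenet.caffemodel, so the copy on the server was no longer the original binary file.
Solution: set WinSCP's transfer mode to binary, which resolves the problem.
For details, see this blog post: http://blog.chinaunix.net/uid-20332519-id-5585964.html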
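A quick way to confirm that a transfer left the file intact is to compare a checksum computed on both machines. The following is a minimal sketch; the helper name file_sha1 is hypothetical, and the path in the usage note is simply the one from the command above.

```python
import hashlib


def file_sha1(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in binary chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # Read in chunks so a large model file does not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Run file_sha1("models/bvlc_reference_caffenet/bvlc_reference_caffenet.caffemodel") on the server and the same function on the Windows copy. If the transfer changed any bytes, for example by converting line endings in text mode, the two digests (and usually the file sizes) will differ.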