[SRA] Processing public data
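Downloading public data with ascp is satisfyingly fast, but unpacking and compressing a large downloaded SRA file can sometimes take even longer than the download itself, which is infuriating.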
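The simplest fix is to download the fq.gz files directly whenever they exist. I usually download through ENA's ascp links; when a run has an SRA file, ENA generally provides fq.gz links for it as well.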
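To give a rough idea of what such a link looks like: ENA arranges its fastq directories by run accession, so the directory can be derived from the run ID. The sketch below is only an illustration under that assumption. The helper name ena_fastq_dir, the ASPERA_KEY variable and the 300m rate limit are mine, not part of any tool. The dependable source for the real link is the fastq_aspera field in ENA's file report for the run, because file names (single-end vs. _1/_2) differ between runs.

```shell
# ena_fastq_dir ACCESSION
# Print the ENA fastq directory for a run accession (hypothetical helper).
# Assumed layout:
#   vol1/fastq/<first 6 chars>/[<extra digits zero-padded to 3>/]<accession>
ena_fastq_dir() {
    local acc prefix rest sub
    acc="$1"
    prefix="${acc%"${acc#??????}"}"   # first 6 characters, e.g. SRR224
    rest="${acc#?????????}"           # characters after the 9th, if any
    case "${#acc}" in
        9)  sub="" ;;                 # e.g. ERR164407: no extra level
        10) sub="00${rest}/" ;;       # e.g. SRR2244401: 001/
        11) sub="0${rest}/" ;;
        12) sub="${rest}/" ;;
        *)  echo "unsupported accession: $acc" >&2; return 1 ;;
    esac
    printf 'vol1/fastq/%s/%s%s\n' "$prefix" "$sub" "$acc"
}

# Usage for the first mate of a paired-end run (ASPERA_KEY must point to the
# asperaweb_id_dsa.openssh key of your own Aspera installation):
#   ascp -QT -l 300m -P 33001 -i "$ASPERA_KEY" \
#       "era-fasp@fasp.sra.ebi.ac.uk:$(ena_fastq_dir SRR2244401)/SRR2244401_1.fastq.gz" .
```

Always check the printed path against the link ENA reports before starting a large batch.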
If you have already downloaded the SRA files, there are three ways to unpack them:
1. fastq-dump
fastq-dump ${line} --split-3 --skip-technical --gzip
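This is the most basic tool for processing SRA files and ships with the SRA-tools package. Its drawback is obvious: it is very inefficient, and it has almost no advantages. Even when a few hundred sra files are unpacked at the same time, the slow compression is maddening. It is acceptable for small files, but for large files, forget it if you still want to graduate.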
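The ${line} in the command above suggests it runs inside a loop over a list of files. A minimal sketch of such a loop, assuming a text file with one .sra path or accession per line. The function name dump_all and the FASTQ_DUMP override variable are my own additions, there only so the loop can be dry-run with echo before committing hours of CPU time.

```shell
# dump_all LIST_FILE
# Run fastq-dump on every non-empty line of LIST_FILE (hypothetical helper).
# Set FASTQ_DUMP=echo to print the commands instead of running them.
dump_all() {
    local list line
    list="$1"
    # "|| [ -n "$line" ]" keeps a last line that lacks a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue    # skip blank lines
        "${FASTQ_DUMP:-fastq-dump}" "$line" --split-3 --skip-technical --gzip
    done < "$list"
}

# Usage:
#   FASTQ_DUMP=echo dump_all sra_list.txt   # dry run
#   dump_all sra_list.txt                   # real run
```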
2. fasterq-dump
Usage: fasterq-dump <path> [options]
       fasterq-dump <accession> [options]
Options:
  -F|--format              format (special, fastq, default=fastq)
  -o|--outfile             output-file
  -O|--outdir              output-dir
  -b|--bufsize             size of file-buffer dflt=1MB
  -c|--curcache            size of cursor-cache dflt=10MB
  -m|--mem                 memory limit for sorting dflt=100MB
  -t|--temp                where to put temp. files dflt=curr dir
  -e|--threads             how many thread dflt=6
  -p|--progress            show progress
  -x|--details             print details
  -s|--split-spot          split spots into reads
  -S|--split-files         write reads into different files
  -3|--split-3             writes single reads in special file
  --concatenate-reads      writes whole spots into one file
  -Z|--stdout              print output to stdout
  -f|--force               force to overwrite existing file(s)
  --skip-technical         skip technical reads
  --include-technical      include technical reads
  -M|--min-read-len        filter by sequence-len
  --table                  which seq-table to use in case of pacbio
  -B|--bases               filter by bases
  -A|--append              append to output-file
  --fasta                  produce FASTA output
  --fasta-unsorted         produce FASTA output, unsorted
  --fasta-ref-tbl          produce FASTA output from REFERENCE tbl
  --fasta-concat-all       concatenate all rows and produce FASTA
  --internal-ref           extract only internal REFERENCEs
  --external-ref           extract only external REFERENCEs
  --ref-name               extract only these REFERENCEs
  --ref-report             enumerate references
  --use-name               print name instead of seq-id
  --seq-defline            custom defline for sequence: $ac=accession, $sn=spot-name, $sg=spot-group, $si=spot-id, $ri=read-id, $rl=read-length
  --qual-defline           custom defline for qualities: same as seq-defline
  -U|--only-unaligned      process only unaligned reads
  -a|--only-aligned        process only aligned reads
  --disk-limit             explicitly set disk-limit
  --disk-limit-tmp         explicitly set disk-limit for temp. files
  --size-check             switch to control: on=perform size-check (default), off=do not perform size-check, only=perform size-check only
  --ngc <PATH>             PATH to ngc file
  -h|--help                Output brief explanation for the program.
  -V|--version             Display the version of the program then quit.
  -L|--log-level <level>   Logging level as number or enum string. One of (fatal|sys|int|err|warn|info|debug) or (0-6) Current/default is warn.
  -v|--verbose             Increase the verbosity of the program status messages. Use multiple times for more verbosity. Negates quiet.
  -q|--quiet               Turn off all status messages for the program. Negated by verbose.
  --option-file <file>     Read more options and parameters from the file.
for more information visit:
  https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump
  https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump
fasterq-dump : 3.0.10

fasterq-dump DRR271585 --skip-technical -3 -m 100GB -p -e 30
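Its options are similar to those of fastq-dump, with some extra ones for memory control and parallel processing. Unpacking really is fast, but to my surprise it has no built-in compression, and compressing the output with gzip afterwards is just as slow, so the end result is almost the same as with the original tool. If your disk space is limited and you need fq.gz files, there is little point in using it.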
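For reference, the two-step workflow described here (unpack, then compress) could be sketched roughly as follows. The function name dump_and_compress and the FASTERQ_DUMP / COMPRESSOR override variables are hypothetical. The sketch also assumes the input is a plain accession, so that the output files are named after it. The gzip step is the bottleneck; swapping in a multi-threaded compressor such as pigz through COMPRESSOR is an option I have not benchmarked here.

```shell
# dump_and_compress ACCESSION [THREADS] [OUTDIR]
# Unpack one run with fasterq-dump, then compress each resulting .fastq.
# Hypothetical helper; the two override variables exist only for
# experimentation, e.g. COMPRESSOR=pigz.
dump_and_compress() {
    local acc threads outdir f
    acc="$1"
    threads="${2:-6}"     # 6 is fasterq-dump's own default
    outdir="${3:-.}"
    "${FASTERQ_DUMP:-fasterq-dump}" "$acc" --skip-technical -3 \
        -e "$threads" -O "$outdir" || return 1
    # Assumed output names: ACC_1.fastq, ACC_2.fastq and/or ACC.fastq
    for f in "$outdir"/"$acc"*.fastq; do
        [ -e "$f" ] || continue       # glob matched nothing
        "${COMPRESSOR:-gzip}" "$f" || return 1
    done
}

# Usage:
#   dump_and_compress DRR271585 30 fastq_out
```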
3. parallel-fastq-dump
I strongly recommend this tool: it keeps the time cost of unpacking many large files within an acceptable range.
usage: parallel-fastq-dump [-h] [-s SRA_ID] [-t THREADS] [-O OUTDIR] [-T TMPDIR] [-N MINSPOTID] [-X MAXSPOTID] [-V]
parallel fastq-dump wrapper, extra args will be passed through
options:
-h, --help show this help message and exit
-s SRA_ID, --sra-id SRA_ID
SRA id (default: None)
-t THREADS, --threads THREADS
number of threads (default: 1)
-O OUTDIR, --outdir OUTDIR
output directory (default: .)
-T TMPDIR, --tmpdir TMPDIR
temporary directory (default: None)
-N MINSPOTID, --minSpotId MINSPOTID
Minimum spot id (default: 1)
-X MAXSPOTID, --maxSpotId MAXSPOTID
Maximum spot id (default: None)
-V, --version shows version (default: False)
DESCRIPTION:
Example: parallel-fastq-dump --sra-id SRR2244401 --threads 4 --outdir out/ --split-files --gzip
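There is only one thing to watch out for: do not use its built-in download tool. The --sra-id option also accepts the path to a local sra file. According to the description on its GitHub page, running 8 cores in parallel is generally enough.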
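Putting that advice together, a call on an already-downloaded file might look like the sketch below. The function name unpack_local_sra and the PFD override variable are made up for illustration. The flags are the ones from the help text and example above, with --split-files and --gzip passed through to fastq-dump.

```shell
# unpack_local_sra SRA_PATH [OUTDIR] [THREADS]
# Unpack a local .sra file with parallel-fastq-dump (hypothetical helper).
# Set PFD=echo to print the command instead of running it.
unpack_local_sra() {
    local sra outdir threads
    sra="$1"
    outdir="${2:-out}"
    threads="${3:-8}"     # 8 threads, following the note above
    mkdir -p "$outdir" || return 1
    # Passing a file path to --sra-id avoids the built-in download
    "${PFD:-parallel-fastq-dump}" --sra-id "$sra" --threads "$threads" \
        --outdir "$outdir" --split-files --gzip
}

# Usage:
#   unpack_local_sra /data/sra/SRR2244401.sra out 8
```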