[SRA] Processing public data

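Downloading public data with ascp can reach a satisfying speed, but unpacking and compressing a large downloaded SRA file sometimes takes even longer than the download itself, which is infuriating.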

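The simplest approach: when fq.gz files are available, download them directly. I usually use the ascp links from ENA; when a run exists in SRA, there is generally a fq.gz link for it as well.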
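
As a rough sketch of where those fq.gz files live, the function below builds the directory part of the ENA path from a run accession. It assumes the commonly described ENA layout (vol1/fastq/, then the first six characters, then an optional zero-padded suffix, then the accession). The function name ena_fastq_dir is made up for illustration, and the links shown on the ENA record page remain the authoritative source.

```shell
#!/bin/sh
# Hypothetical helper: print the ENA fastq directory for a run accession.
# Assumed layout: vol1/fastq/<first 6 chars>/[<padded suffix>/]<accession>
ena_fastq_dir() (
    acc="$1"
    prefix=$(printf '%s' "$acc" | cut -c1-6)
    extra=$(printf '%s' "$acc" | cut -c10-)   # characters after the 9th, if any
    if [ -z "$extra" ]; then
        # 9-character accessions (e.g. DRR271585) have no extra subdirectory
        printf 'vol1/fastq/%s/%s\n' "$prefix" "$acc"
    else
        # Left-pad the extra digits with zeros to three characters (1 -> 001, 95 -> 095)
        case ${#extra} in
            1) pad="00$extra" ;;
            2) pad="0$extra" ;;
            *) pad="$extra" ;;
        esac
        printf 'vol1/fastq/%s/%s/%s\n' "$prefix" "$pad" "$acc"
    fi
)

ena_fastq_dir DRR271585
ena_fastq_dir SRR2244401
```

The file names inside that directory (single file, or _1/_2 for paired reads) cannot be guessed from the accession alone, so copy them from the ENA page before passing the full path to ascp.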

If you have already downloaded the SRA file, there are three ways to unpack it:

1. fastq-dump

fastq-dump ${line} --split-3 --skip-technical --gzip

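This is the most basic tool for processing SRA files and ships with the SRA-tools package. Its drawback is obvious: it is very slow, and it has almost nothing else going for it. Even if you unpack several hundred sra files at the same time, the compression step is slow enough to drive you crazy. It is acceptable for small files, but for large files, forget it if you still want to graduate.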
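
The ${line} in the command above suggests a loop over a list of files. A minimal sketch of such a loop follows. It assumes a plain-text list with one .sra path per line; the names dump_list, RUN, and sra_list.txt are made up for illustration. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch: run the fastq-dump command above once per line of a list file.
# RUN is a hypothetical switch: the default "echo" only prints each command;
# export RUN="" to really execute them (requires sra-tools on PATH).
RUN="${RUN-echo}"

dump_list() (
    list="$1"
    while IFS= read -r line; do
        # Skip blank lines in the list
        [ -n "$line" ] || continue
        # </dev/null keeps the command from consuming the list on stdin
        $RUN fastq-dump "$line" --split-3 --skip-technical --gzip </dev/null
    done < "$list"
)

# Only runs when the (hypothetical) list file exists
if [ -f sra_list.txt ]; then
    dump_list sra_list.txt
fi
```

The loop processes files one after another, so it does nothing about the speed problem; it only saves typing.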

2. fasterq-dump

Usage:
  fasterq-dump <path> [options]
  fasterq-dump <accession> [options]

Options:
  -F|--format                      format (special, fastq, default=fastq) 
  -o|--outfile                     output-file 
  -O|--outdir                      output-dir 
  -b|--bufsize                     size of file-buffer dflt=1MB 
  -c|--curcache                    size of cursor-cache dflt=10MB 
  -m|--mem                         memory limit for sorting dflt=100MB 
  -t|--temp                        where to put temp. files dflt=curr dir 
  -e|--threads                     how many thread dflt=6 
  -p|--progress                    show progress 
  -x|--details                     print details 
  -s|--split-spot                  split spots into reads 
  -S|--split-files                 write reads into different files 
  -3|--split-3                     writes single reads in special file 
  --concatenate-reads              writes whole spots into one file 
  -Z|--stdout                      print output to stdout 
  -f|--force                       force to overwrite existing file(s) 
  --skip-technical                 skip technical reads 
  --include-technical              include technical reads 
  -M|--min-read-len                filter by sequence-len 
  --table                          which seq-table to use in case of pacbio 
  -B|--bases                       filter by bases 
  -A|--append                      append to output-file 
  --fasta                          produce FASTA output 
  --fasta-unsorted                 produce FASTA output, unsorted 
  --fasta-ref-tbl                  produce FASTA output from REFERENCE tbl 
  --fasta-concat-all               concatenate all rows and produce FASTA 
  --internal-ref                   extract only internal REFERENCEs 
  --external-ref                   extract only external REFERENCEs 
  --ref-name                       extract only these REFERENCEs 
  --ref-report                     enumerate references 
  --use-name                       print name instead of seq-id 
  --seq-defline                    custom defline for sequence:  $ac=accession, 
                                   $sn=spot-name,  $sg=spot-group, $si=spot-id,  
                                   $ri=read-id, $rl=read-length 
  --qual-defline                   custom defline for qualities:  same as 
                                   seq-defline 
  -U|--only-unaligned              process only unaligned reads 
  -a|--only-aligned                process only aligned reads 
  --disk-limit                     explicitly set disk-limit 
  --disk-limit-tmp                 explicitly set disk-limit for temp. files 
  --size-check                     switch to control: on=perform size-check 
                                   (default),  off=do not perform size-check,  
                                   only=perform size-check only 
  --ngc <PATH>                     PATH to ngc file 

  -h|--help                        Output brief explanation for the program. 
  -V|--version                     Display the version of the program then 
                                   quit. 
  -L|--log-level <level>           Logging level as number or enum string. One 
                                   of (fatal|sys|int|err|warn|info|debug) or 
                                   (0-6) Current/default is warn. 
  -v|--verbose                     Increase the verbosity of the program 
                                   status messages. Use multiple times for more 
                                   verbosity. Negates quiet. 
  -q|--quiet                       Turn off all status messages for the 
                                   program. Negated by verbose. 
  --option-file <file>             Read more options and parameters from the 
                                   file. 
for more information visit:
   https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump
   https://github.com/ncbi/sra-tools/wiki/08.-prefetch-and-fasterq-dump

fasterq-dump : 3.0.10

fasterq-dump DRR271585 --skip-technical -3 -m 100GB -p -e 30

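The options are similar to those of fastq-dump, plus a few for memory control and parallel processing. Unpacking really is fast, but I did not expect it to have no built-in compression. Compressing the output with gzip afterwards is just as slow, so the end result is almost the same as with the original tool. If disk space is limited and you need fq.gz files, there is little point in using it.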
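
One possible way to soften the compression step, sketched below, is to use a multi-threaded compressor when one is installed. This is my own assumption, not something tested in this post: pigz is a separate parallel gzip implementation and is not part of SRA-tools, and the function names pick_compressor and compress_fastq are made up for illustration.

```shell
#!/bin/sh
# Sketch: choose a compressor for the .fastq files written by fasterq-dump.
# Assumption: pigz may or may not be installed; fall back to gzip otherwise.
pick_compressor() (
    threads="${1:-4}"
    if command -v pigz >/dev/null 2>&1; then
        printf 'pigz -p %s\n' "$threads"
    else
        printf 'gzip\n'
    fi
)

# Compress each existing file given after the thread count.
compress_fastq() (
    threads="$1"
    shift
    cmd=$(pick_compressor "$threads")
    for fq in "$@"; do
        # Silently skip names that do not exist (e.g. an unmatched glob)
        [ -f "$fq" ] || continue
        $cmd "$fq"
    done
)
```

A possible call after the fasterq-dump command above would be compress_fastq 30 DRR271585*.fastq. Whether this ends up faster than parallel-fastq-dump depends on the machine, so treat it as something to try rather than a recommendation.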

3. parallel-fastq-dump

I strongly recommend this tool. It keeps the time cost of unpacking many large files within an acceptable range.

usage: parallel-fastq-dump [-h] [-s SRA_ID] [-t THREADS] [-O OUTDIR] [-T TMPDIR] [-N MINSPOTID] [-X MAXSPOTID] [-V]

parallel fastq-dump wrapper, extra args will be passed through

options:
  -h, --help            show this help message and exit
  -s SRA_ID, --sra-id SRA_ID
                        SRA id (default: None)
  -t THREADS, --threads THREADS
                        number of threads (default: 1)
  -O OUTDIR, --outdir OUTDIR
                        output directory (default: .)
  -T TMPDIR, --tmpdir TMPDIR
                        temporary directory (default: None)
  -N MINSPOTID, --minSpotId MINSPOTID
                        Minimum spot id (default: 1)
  -X MAXSPOTID, --maxSpotId MAXSPOTID
                        Maximum spot id (default: None)
  -V, --version         shows version (default: False)

DESCRIPTION:
Example: parallel-fastq-dump --sra-id SRR2244401 --threads 4 --outdir out/ --split-files --gzip

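There is only one thing to watch out for: do not let it download the data with its built-in downloader. The "--sra-id" option also accepts the path to a local sra file. According to the description on the GitHub page, running 8 cores in parallel is generally enough.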
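
Putting that together, a call on an already downloaded file could look like the sketch below. The names unpack_sra and RUN, the output directory, and the file path are placeholders for illustration. The extra options (--split-3 --skip-technical --gzip) are passed through to fastq-dump, as the help text above describes. By default the script only prints the command.

```shell
#!/bin/sh
# Sketch: unpack a local .sra file with parallel-fastq-dump.
# RUN is a hypothetical dry-run switch: the default "echo" prints the command;
# export RUN="" to execute it (requires parallel-fastq-dump and sra-tools).
RUN="${RUN-echo}"

unpack_sra() (
    sra_path="$1"        # path to an already downloaded .sra file
    threads="${2:-8}"    # 8 threads by default, following the note above
    outdir="${3:-out}"   # placeholder output directory
    $RUN parallel-fastq-dump --sra-id "$sra_path" --threads "$threads" \
        --outdir "$outdir" --split-3 --skip-technical --gzip
)

unpack_sra ./SRR2244401.sra
```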

posted @ 2024-02-22 17:22  xjce