sratoolkit | Mining public single-cell data | GEO | SRA
December 15, 2022
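I needed to quickly check the specificity of the antibody used in a paper's CUT&RUN experiment, which means downloading the raw data from SRA and looking at the alignment rate.
How can SRA data be downloaded quickly and efficiently?
Find the SRA link for the data: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA589292, then open it in Run Selector.
Download the accession list and transfer it to the server.
Download https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.2/sratoolkit.3.0.2-ubuntu64.tar.gz (see the notes further down).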
prefetch --option-file SRR_Acc_List.txt
SRA to fastq
for i in SRR*/*.sra
do
    echo $i
    time fastq-dump --gzip --split-3 -A $i && echo "** ${i} to fastq done **"
done
Reference:
heart single-cell dataset
Single-cell transcriptomic landscape of cardiac neural crest cell derivatives during development
https://www.ncbi.nlm.nih.gov/bioproject/PRJNA562135/
https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA562135&o=acc_s%3Aa
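Visualization site: http://scrnaseqcncc.fwgenetics.org/
Problem: no processed expression matrix is provided, so the raw data has to be downloaded and processed from scratch.
It is best to download the prebuilt binary release of SRA-Toolkit; otherwise resolving SRR accessions runs into problems. After downloading, the toolkit has to be configured, and an upper limit on the download file size has to be set when fetching.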
# conda install sratoolkit
# go to https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit
prefetch -h
~/project/Data_center/softwares/sratoolkit.3.0.0-centos_linux64/bin/prefetch SRR10065158 SRR10065151 SRR10065152 SRR10065153 SRR10065154 SRR10065155 SRR10065156 SRR10065157 -O SRA --max-size 1000G
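With a batch of large runs it is easy to end up with one download missing or empty, so it can help to check before converting. A minimal sketch, assuming the `-O SRA` layout used above (one `SRA/<accession>/<accession>.sra` per run) and a list file with one accession per line; the function name `list_missing_runs` is made up for illustration and is not part of sratoolkit:

```shell
# list_missing_runs LIST DIR
# Print every accession in LIST that has no non-empty DIR/<acc>/<acc>.sra file.
# Hypothetical helper; the directory layout is assumed from `prefetch -O DIR`.
list_missing_runs() (
    list=$1
    dir=$2
    # Strip carriage returns in case the list was saved on Windows.
    tr -d '\r' < "$list" | while IFS= read -r acc || [ -n "$acc" ]; do
        [ -z "$acc" ] && continue                    # skip blank lines
        [ -s "$dir/$acc/$acc.sra" ] || echo "$acc"   # missing or zero-size file
    done
)
```

Something like `list_missing_runs SRR_Acc_List.txt SRA > missing.txt` followed by `prefetch --option-file missing.txt -O SRA` would then retry only the gaps. This checks presence and size only; the toolkit's `vdb-validate` is the tool for checking integrity.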
for i in SRA/*/*.sra
do
    echo $i
    time fastq-dump --gzip --split-3 -A $i && echo "** ${i} to fastq done **"
done
# For example, rename the original SRR7692286_1.fastq.gz to SRR7692286_S1_L001_I1_001.fastq.gz
# and so on: the original _2 becomes R1 and _3 becomes R2
# cat SRR_Acc_List-9245-3.txt | while read i ;do (mv ${i}_1*.gz ${i}_S1_L001_I1_001.fastq.gz;mv ${i}_2*.gz ${i}_S1_L001_R1_001.fastq.gz;mv ${i}_3*.gz ${i}_S1_L001_R2_001.fastq.gz);done
# Target naming pattern: SampleName_S1_L001_R1_001.fastq.gz
I was too lazy to write a loop this time, so the renames are spelled out one by one.
mv SRA_SRR10065151_SRR10065151.sra_1.fastq.gz SRR10065151_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065151_SRR10065151.sra_2.fastq.gz SRR10065151_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065153_SRR10065153.sra_1.fastq.gz SRR10065153_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065153_SRR10065153.sra_2.fastq.gz SRR10065153_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065158_SRR10065158.sra_1.fastq.gz SRR10065158_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065158_SRR10065158.sra_2.fastq.gz SRR10065158_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065154_SRR10065154.sra_1.fastq.gz SRR10065154_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065154_SRR10065154.sra_2.fastq.gz SRR10065154_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065156_SRR10065156.sra_1.fastq.gz SRR10065156_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065156_SRR10065156.sra_2.fastq.gz SRR10065156_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065155_SRR10065155.sra_1.fastq.gz SRR10065155_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065155_SRR10065155.sra_2.fastq.gz SRR10065155_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065152_SRR10065152.sra_1.fastq.gz SRR10065152_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065152_SRR10065152.sra_2.fastq.gz SRR10065152_S1_L001_R2_001.fastq.gz
mv SRA_SRR10065157_SRR10065157.sra_1.fastq.gz SRR10065157_S1_L001_R1_001.fastq.gz
mv SRA_SRR10065157_SRR10065157.sra_2.fastq.gz SRR10065157_S1_L001_R2_001.fastq.gz
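For the record, the same renames can be written as a loop. This is only a sketch, assuming the two-file case shown in the `mv` commands above, where fastq-dump produced `SRA_<acc>_<acc>.sra_1.fastq.gz` and `SRA_<acc>_<acc>.sra_2.fastq.gz` and they map to R1 and R2. The function name `rename_for_cellranger` is hypothetical. Runs that yield three files (I1/R1/R2, as in the commented example earlier) need a different mapping.

```shell
# rename_for_cellranger DIR
# Rename SRA_<acc>_<acc>.sra_{1,2}.fastq.gz in DIR to <acc>_S1_L001_{R1,R2}_001.fastq.gz.
# Hypothetical helper; assumes _1 is read 1 and _2 is read 2.
rename_for_cellranger() (
    cd "$1" || exit 1
    for f in SRA_*_*.sra_[12].fastq.gz; do
        [ -e "$f" ] || continue      # the glob matched nothing
        rest=${f#SRA_}               # <acc>_<acc>.sra_N.fastq.gz
        acc=${rest%%_*}              # <acc>
        mate=${f%.fastq.gz}          # drop the extension
        mate=${mate##*_}             # 1 or 2
        mv -- "$f" "${acc}_S1_L001_R${mate}_001.fastq.gz"
    done
)
```

Calling `rename_for_cellranger .` in the directory holding the fastq files covers all eight samples at once. It is worth trying it on a copy first, since `mv` overwrites silently.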
Processing the fastq files with cellranger
# export PATH=~/softwares/cellranger-2.1.1:$PATH
export PATH=~/softwares/cellranger-3.1.0:$PATH
sampleName=SRR10065151
workdir=~/project/Data_center/public/2021_Chen
# appdir=~/softwares/cellranger-2.1.1/
appdir=~/softwares/cellranger-3.1.0/
# refdir=~/databases/cellranger_ref/2019_Aug/refdata-cellranger-GRCh38-3.0.0
refdir=~/databases/cellranger_ref/2019_Aug/refdata-cellranger-mm10-3.0.0
$appdir/cellranger count --id=${sampleName}_report \
    --transcriptome=${refdir} \
    --jobmode=local \
    --localcores=12 \
    --localmem=100 \
    --sample=${sampleName} \
    --fastqs=$workdir
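The final output is small, only about 50G, but the intermediate files can reach several hundred GB, so with 8 samples running in parallel, 2-3T of space is gone almost immediately. To avoid filling the disk, run at most 3 at a time.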
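One way to enforce that cap is to let `xargs` do the scheduling. A rough sketch, assuming an `xargs` that supports `-P` (GNU and BSD both do) and a hypothetical wrapper script `count_one.sh` that takes a sample name as its only argument and runs the `cellranger count` command above; neither `run_limited` nor `count_one.sh` exists anywhere but here:

```shell
# run_limited MAX CMD [ARGS...]
# Read one sample name per line from stdin and run "CMD ARGS... <sample>"
# for each, with at most MAX jobs running at the same time.
# Hypothetical helper built on xargs -P.
run_limited() {
    max=$1
    shift
    xargs -n 1 -P "$max" "$@"
}
```

Usage would look like `printf '%s\n' SRR10065151 SRR10065152 SRR10065153 SRR10065154 | run_limited 3 ./count_one.sh`. If three jobs share one machine, `--localcores` and `--localmem` probably need to be lowered so that the three requests together fit.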
To aggregate the samples, see https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/aggregate
cellranger aggr --id=aggr --csv=aggr.csv
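I prefer to do the integration in R.
Note that cellranger versions differ in some processing details; for example, older versions require library_id, while newer versions use sample.
With 10X data you might as well run velocity while you are at it; it takes almost no effort, just one line of code.
How the paper describes its fastq processing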
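The `aggr.csv` passed to `cellranger aggr` is a small CSV with one row per sample pointing at that sample's `molecule_info.h5`. A sketch for generating it, assuming each `cellranger count` run used `--id=<sample>_report` as above and was launched from a single working directory. The function name `make_aggr_csv` is hypothetical, and the first-column header is left as a parameter because I am assuming, not confirming, that the version difference shows up there (`library_id` in older releases, `sample_id` in newer ones); check the documentation for the release in use.

```shell
# make_aggr_csv ID_COLUMN WORKDIR SAMPLE...
# Print an aggregation CSV with one row per sample.
# ID_COLUMN is the header of the first column (assumed to be library_id or
# sample_id depending on the cellranger release).
# WORKDIR is the directory that holds the <sample>_report output folders.
make_aggr_csv() {
    id_col=$1
    workdir=$2
    shift 2
    printf '%s,molecule_h5\n' "$id_col"
    for s in "$@"; do
        printf '%s,%s/%s_report/outs/molecule_info.h5\n' "$s" "$workdir" "$s"
    done
}
```

For example, `make_aggr_csv library_id "$PWD" SRR10065151 SRR10065152 > aggr.csv` writes a two-sample file for the command above.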
The official software Cell Ranger v3.0.2 (https://support.10xgenomics.com) was applied for sample demultiplexing, barcode processing, and unique molecular identifier (UMI) counting. Briefly, the raw base call files generated by the sequencers were demultiplexed into reads in FASTQ format using the “cellranger mkfastq” pipeline. Then, the reads were processed using the “cellranger count” pipeline to generate a gene-barcode matrix for each library. During this step, the reads were aligned to the mouse (Mus musculus) reference genome (version: mm10) and the tdTomato sequence. The resulting gene-cell UMI count matrices of all samples were ultimately concatenated into one matrix using the “cellranger aggr” pipeline.
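To be continued~
References:
- Single-cell data analysis - collection
- Single-cell in practice (1): data download
- Single-cell in practice (2): things to note before using Cell Ranger
- Single-cell in practice (3): a first look at Cell Ranger
- scRNA: single-cell in practice (4): overview of the Cell Ranger workflow
- Single-cell in practice (5): understanding the output of cellranger count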