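Upstream FASTQ Processing for Single-Cell Transcriptomics

Based on cellranger and dropseqRunner

###If the sequencing files come from the 10x Genomics platform, process the FASTQ files with the standard cellranger count workflow; if they come from the Drop-seq platform, process them with dropseqRunner.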

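I. Configuring cellranger

1. Install the software and view the help documentation

# Download the installation package (the signed URL below is time-limited; if it has expired, request a fresh link from the 10x Genomics download page. The trailing & runs wget in the background, so wait for it to finish before extracting)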

 wget -O cellranger-7.1.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-7.1.0.tar.gz?Expires=1694703729&Policy=eyJTdGF0ZW1lbnQiOlt7IlJlc291cmNlIjoiaHR0cHM6Ly9jZi4xMHhnZW5vbWljcy5jb20vcmVsZWFzZXMvY2VsbC1leHAvY2VsbHJhbmdlci03LjEuMC50YXIuZ3oiLCJDb25kaXRpb24iOnsiRGF0ZUxlc3NUaGFuIjp7IkFXUzpFcG9jaFRpbWUiOjE2OTQ3MDM3Mjl9fX1dfQ__&Signature=YmIZ3TsEI7VxGNIY7SdL~8oH0jr7ktjMZ48HRiLDQfcYLN4YWcs5nk0CZeKkeemvygGK3VryeHnvZpA21r2jN2YKfSeAHC03t-aDKzjctzbPvnv9UbckvrOghyxW7mH14W7uzMJJ1C9PbBo869EDRH04vxfsYGFQONCxvb~iBamTU1ZJ-6etWVioLjzb7o4-Y3v4v46nw67qf2NaPTwNXr4PIA-vFdWe9v9YhQQM6VlHR8a5crTmaM39hGC~2PatW0qlEd-DsMHeeNb34~Gr5N8XNIHv6K1VcuMq8VobqLQKxeoz3obmA23~kWkPNOSZNCVXosd0p6Ok7fUHiVUt-Q__&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA" &

# Extract the archive (the file name must match the one passed to wget -O above)

tar -zxvf cellranger-7.1.0.tar.gz

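# Add the cellranger directory to $PATH so that it can be called directly

# Option 1: open ~/.bashrc in an editor and add the export line by hand

vi ~/.bashrc

export PATH="/data5/tan/zengchuanj/Software/cellranger-7.1.0:$PATH"

# Option 2: append the same line from the command line

echo 'export PATH="/data5/tan/zengchuanj/Software/cellranger-7.1.0:$PATH"' >> ~/.bashrc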

# Reload the shell configuration file

source ~/.bashrc
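
Appending the export line more than once leaves duplicate entries in ~/.bashrc and in $PATH. The following is a minimal sketch of an idempotent alternative for a POSIX shell; the function name prepend_path is hypothetical and is not part of cellranger.

```shell
# prepend_path DIR: print a PATH value with DIR in front, unless DIR is already listed.
# Hypothetical helper for illustration; it only reads $PATH and changes nothing itself.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) printf '%s\n' "$PATH" ;;       # already present: keep PATH as it is
        *)        printf '%s\n' "$1:$PATH" ;;    # not present: put DIR first
    esac
}

# Example: update PATH for the current shell only
PATH="$(prepend_path /data5/tan/zengchuanj/Software/cellranger-7.1.0)"
export PATH
```

After this, `command -v cellranger` prints the resolved executable if the directory is correct.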

# View the usage instructions for cellranger count

cellranger count --help

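2. Download the reference genomes

# Human reference dataset (wget runs in the background because of the trailing &; wait for the download to finish before extracting)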

wget -o human.log https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2020-A.tar.gz &

tar -xvf refdata-gex-GRCh38-2020-A.tar.gz

# Mouse reference dataset

wget -o mouse.log https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-mm10-2020-A.tar.gz &

tar -xvf refdata-gex-mm10-2020-A.tar.gz

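# Download the test dataset

wget -o sample.log 'http://cf.10xgenomics.com/samples/cell-exp/2.1.0/neurons_900/neurons_900_fastqs.tar' &

tar -xvf neurons_900_fastqs.tar # extract

# Run cellranger count on the test data (adjust --transcriptome and --fastqs to the directories where the archives were actually extracted)

cellranger count --id=result --transcriptome=../refdata-gex-mm10-2020-A/ --fastqs=/neurons_900_fastqs --sample=neurons_900 --expect-cells=1000 --nosecondary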

Note: # explanation of the cellranger count parameters

cellranger count --id=sample \
--transcriptome=/opt/refdata-cellranger-GRCh38-1.2.0 \
--fastqs=/home/scRNA/runs/HAWT7ADXX/outs/fastq_path \
--sample=mysample \
--expect-cells=1000 \
--nosecondary

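# --id sets the name of the output directory

# --transcriptome points to a cellranger-compatible reference

# --fastqs points to the directory of FASTQ files, either produced by mkfastq or prepared by hand

# --sample must match the sample-name prefix of the FASTQ file names; cellranger uses it to find the files

# --expect-cells sets the expected number of recovered cells, which should follow from the experimental design

# --nosecondary produces only the expression matrix and skips dimensionality reduction, clustering, and visualization (recommended when the downstream analysis is done in Seurat anyway, since it saves compute)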
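
The value passed to --sample can be read off the FASTQ file names, which follow the pattern <sample>_S<n>_L<lane>_<read>_001.fastq.gz. Below is a minimal sketch that strips that suffix; sample_from_fastq is a hypothetical helper, and the file name in the example is assumed rather than taken from a listing of the test archive.

```shell
# sample_from_fastq FILE: print the sample-name prefix of a FASTQ file named like
# <sample>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz (hypothetical helper).
# A name that does not follow the pattern is printed unchanged.
sample_from_fastq() {
    basename "$1" | sed -E 's/_S[0-9]+_L[0-9]+_[RI][0-9]_[0-9]+\.fastq(\.gz)?$//'
}

# Example with an assumed file name from the test dataset
sample_from_fastq neurons_900_fastqs/neurons_900_S1_L001_R1_001.fastq.gz
```

Running the helper over every file in the FASTQ directory and piping the result through `sort -u` gives the list of distinct sample names available to --sample.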

3. Interpreting the results

Ref: https://zhuanlan.zhihu.com/p/390516422

Outputs:

- Run summary HTML: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/web_summary.html

- Run summary CSV: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/metrics_summary.csv

- BAM: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/possorted_genome_bam.bam

- BAM index: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/possorted_genome_bam.bam.bai

- Filtered feature-barcode matrices MEX: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/filtered_feature_bc_matrix

- Filtered feature-barcode matrices HDF5: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/filtered_feature_bc_matrix.h5

- Unfiltered feature-barcode matrices MEX: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/raw_feature_bc_matrix

- Unfiltered feature-barcode matrices HDF5: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/raw_feature_bc_matrix_h5.h5

- Secondary analysis output CSV: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/analysis

- Per-molecule read information: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/molecule_info.h5

- Loupe Browser file: /data5/tan/zengchuanj/pipeline/cellranger/result/outs/cloupe.cloupe

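outs/raw_feature_bc_matrix: This folder holds the raw feature-barcode matrix. Each row is a feature (gene) and each column is a barcode, including barcodes that were not called as cells. The values are raw UMI counts, with no normalization or filtering applied.

outs/filtered_feature_bc_matrix: This folder holds the filtered feature-barcode matrix, restricted to the barcodes that cellranger called as cells; genes are not filtered. It is the main input for downstream analysis. The folder contains three files: barcodes.tsv.gz, features.tsv.gz, and matrix.mtx.gz, which store the cell barcodes, the feature names, and the count of each feature in each cell.

outs/metrics_summary.csv: This CSV file holds run-level quality-control metrics for the sample, such as the estimated number of cells, the mean reads per cell, and the median genes per cell.

outs/web_summary.html: This HTML file is an interactive report summarizing the run, including the cell count and quality-control metrics. Clustering plots appear only when secondary analysis has been run, that is, without --nosecondary.

outs/cloupe.cloupe: This file can be opened in the 10x Genomics Loupe Browser to view and explore the single-cell data interactively.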
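
A quick sanity check before loading a matrix folder is to read the size line of matrix.mtx.gz, which in the Matrix Market format gives the number of rows (features), columns (barcodes), and non-zero entries. The sketch below assumes gzip and awk are available; mtx_dims is a hypothetical helper name.

```shell
# mtx_dims FILE: print "features barcodes nonzero" from a gzipped Matrix Market file.
# Lines starting with % are header or comment lines; the first other line holds the sizes.
mtx_dims() {
    gzip -dc "$1" | awk '!/^%/ { print $1, $2, $3; exit }'
}

# Example with the output folder from this walkthrough (adjust to your own run):
# mtx_dims result/outs/filtered_feature_bc_matrix/matrix.mtx.gz
```

The second number should equal the line count of barcodes.tsv.gz and the first the line count of features.tsv.gz in the same folder.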

II. Configuring dropseqRunner

1. Install conda

dropseqRunner depends on conda and Python. Before installing it, make sure compatible versions of conda and Python are available on the server.

wget https://mirrors.tuna.tsinghua.edu.cn/anaconda/archive/Anaconda3-5.3.1-Linux-x86_64.sh

bash Anaconda3-5.3.1-Linux-x86_64.sh

2. Install dropseqRunner

wget https://codeload.github.com/aselewa/dropseqRunner/zip/master

mv master master.zip

unzip master.zip

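# Create the conda environment for dropseqRunner (run this from the extracted dropseqRunner-master directory, which contains environment.yaml)

cd dropseqRunner-master

conda env create -f environment.yaml

# Activate the environment before every run; snakemake cannot be called otherwise

conda activate dropRunner

# Compile; the main script is not available until this step has been run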

make

3. Download the reference data and build the alignment index

# Mouse is used as the example here

wget -o mm.log https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/GCF_000001635.27_GRCm39_genomic.fna.gz &

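# Decompress the genome FASTA once the download has finished (STAR reads uncompressed FASTA)

gunzip GCF_000001635.27_GRCm39_genomic.fna.gz

# Install gffread, a tool for processing GFF files

conda install -c bioconda gffread

# Convert the GFF annotation to GTF (the matching GCF_000001635.27_GRCm39_genomic.gff must already be downloaded and decompressed)

gffread GCF_000001635.27_GRCm39_genomic.gff -T -o mice.gtf

# Build the STAR reference index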

STAR --runThreadN 4 --runMode genomeGenerate --genomeDir reference/ --genomeFastaFiles GCF_000001635.27_GRCm39_genomic.fna --sjdbGTFfile mice.gtf
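
STAR expects the sequence names in the first column of the GTF to match the sequence names in the FASTA, so it can be worth comparing the two before or after building the index. The following is a minimal sketch under that assumption; the three function names are hypothetical helpers, not part of STAR or gffread.

```shell
# fasta_seqnames FILE: print sequence IDs (text after ">" up to the first whitespace), sorted.
fasta_seqnames() {
    grep '^>' "$1" | sed 's/^>//; s/[[:space:]].*$//' | sort -u
}

# gtf_seqnames FILE: print the distinct values of column 1, skipping comments and empty lines.
gtf_seqnames() {
    awk -F '\t' '!/^#/ && NF { print $1 }' "$1" | sort -u
}

# missing_in_fasta FASTA GTF: print GTF sequence names that are absent from the FASTA.
missing_in_fasta() {
    fa_tmp=$(mktemp) && gtf_tmp=$(mktemp) || return 1
    fasta_seqnames "$1" > "$fa_tmp"
    gtf_seqnames "$2" > "$gtf_tmp"
    comm -13 "$fa_tmp" "$gtf_tmp"     # keep only lines unique to the second (GTF) list
    rm -f "$fa_tmp" "$gtf_tmp"
}

# Example with the files from this section:
# missing_in_fasta GCF_000001635.27_GRCm39_genomic.fna mice.gtf
```

Empty output means every sequence name used in the GTF is present in the FASTA.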

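4. Using dropseqRunner

# Usage of the main program

python /dropseqRunner-master/dropRunner.py --R1 SRR11799731_R1.fastq.gz --R2 SRR11799731_R2.fastq.gz --indices /dropseqRunner-master/db/reference --sample SRR11799731 --protocol drop

# Parameters:

# --R1 and --R2 are the two FASTQ files of the paired-end run

# --indices is the reference index built in the previous step

# --sample is the sample name prefix

# --protocol names the library protocol; drop is used here for Drop-seq data

# After the run finishes, the data to load into Seurat is under /sample/output/SRR11799731_0_Solo.out/Gene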