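Single Cell Multiome | Single-cell multi-omics | Single-cell open chromatin data analysis | scATAC-seq | multi-omics
November 22, 2023
I only started learning how 10x multiome library construction works today, and I am beginning to appreciate the technology. Many questions can only be answered with it, for example epigenetic plasticity.
"Multi-omics" here really means multi-modal: scRNA-seq and scATAC-seq are measured in the same cell, rather than measured separately and then integrated. The two approaches are fundamentally different. [The price difference is also large.]
The principle is quite simple:
- Hyperactive transposase Tn5: NGS adapters are loaded onto the transposase, which allows simultaneous fragmentation of chromatin and integration of those adapters into open chromatin regions. Tn5 not only cuts the accessible DNA but also attaches its own adapters, so the fragments can then be captured by the Spacer-containing oligo on the Gel Bead;
- The gene expression side is even simpler: the poly(dT) sequence on the Gel Bead captures poly(A)-tailed mRNA, which is then reverse transcribed into cDNA;
- Amplification then adds NGS adapters such as P5 and P7 (for flow cell binding) and i5 and i7 (sample indexes);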
Spacer: An 8 bp sequence on the Gel Bead ATAC Barcode oligo that enables barcode attachment to transposed DNA fragments.
The P5 and P7 adapter sequences hybridize to the complementary oligos on the flow cell surface.
NGS Adapters can also have tags — like sample barcodes. This tagged region of the adapter is called an index or barcode region.
It is enough to carefully read pages 17-20 of the Chromium Next GEM Single Cell Multiome ATAC + Gene Expression Reagent Kits User Guide.
October 25, 2023
Change the location of 'fragments' in a multiome Seurat object (update the fragment file path)
Reset the current fragments object in the Seurat object
Make a new path
Add to your object
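library(Signac)
# 1. Reset the current fragments object
Fragments(seuset.flt.allen@assays$ATAC) <- NULL
# 2. Point to the new fragment file location
fragments <- CreateFragmentObject(
  path = "/home/zz950/projects/ApcKO_multiomics/fragments/atac_fragments.tsv.gz",
  cells = colnames(seuset.flt.allen),
  validate.fragments = TRUE
)
# 3. Attach the new fragment object, then recompute gene activities
Fragments(seuset.flt.allen@assays$ATAC) <- fragments
gene.activities <- GeneActivity(seuset.flt.allen)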
http://localhost:17435/notebooks/projects/ApcKO_multiomics/7.placiticity_estimation.ipynb
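Finally about to start analyzing scATAC-seq data. Combined with scRNA-seq it becomes multi-omics, which allows a deeper look at TF regulatory mechanisms.
Start with Seurat and Signac: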
- cellRanger - Understanding Output
- 10x - ATAC Data Concepts
- 10x - Cell Ranger ATAC Algorithms Overview
- Integrating scRNA-seq and scATAC-seq data
- Weighted Nearest Neighbor Analysis
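- Analyzing PBMC scATAC-seq - Signac [the most specialized package]
You must understand how ATAC-seq works:
- How is the library constructed?
- Which regions of the genome do the reads in the sequencer's FASTQ output come from?
- How are fragments and peaks derived from the FASTQ files?
- What is the intuition behind the core algorithms (e.g. peak calling)?
Read the excellent 2022 Nature Protocols paper: Chromatin accessibility profiling by ATAC-seq
Library construction principle: the engineered Tn5 carries the sequencing adapters itself
ATAC-seq uses the activity of an engineered, hyperactive Tn5 transposase preloaded with sequencing adapters to determine the sites of accessible chromatin.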
(i) a transposase had previously been used to generate ‘tagmentation’ libraries, in which a Tn5 transposase was preloaded with sequencing adapters and used to simultaneously fragment and tag genomic DNA for high-throughput sequencing library preparation
(ii) the observation that in vivo Tn5 could efficiently insert into nucleosome-free regions
Transposition events
Each unique transposition event, termed an ‘insertion’, marks a location in the genome where a Tn5 transposase dimer is able to access DNA and perform a cut-and-paste reaction.
The transposase simultaneously fragments the DNA and inserts sequence handles that are then used for amplification during library preparation.
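A sequenceable ATAC-seq DNA fragment is created by two separate transposase insertion events. [This is where the sequenced reads come from, and the key to understanding what the FASTQ files contain.]
Tn5 is only responsible for inserting adapters. Any fragment flanked by a compatible pair of adapters can be captured and sequenced by the sequencer, but because sequencing is short-read, only the two ends of each fragment are read.
The precise biochemical interactions that govern Tn5 transposition at these sites are not yet fully understood. [The mechanism itself is still not understood.]
Basic FASTQ processing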
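To make the "two insertions make one fragment" idea concrete, here is a toy sketch. The insertion positions, the read length, the maximum fragment length, and both helper functions are made up for illustration. The model also ignores adapter orientation and PCR size bias, both of which matter in a real library.

```python
def fragments_from_insertions(insertions, max_len=1000):
    """Pair each insertion with the next one downstream to form a fragment.

    Toy model: fragments longer than max_len are assumed not to amplify.
    """
    sites = sorted(insertions)
    frags = []
    for left, right in zip(sites, sites[1:]):
        if right - left <= max_len:
            frags.append((left, right))
    return frags


def paired_end_reads(fragment, read_len=50):
    """Return the genomic spans covered by read 1 and read 2 of one fragment."""
    start, end = fragment
    span = min(read_len, end - start)  # reads cannot be longer than the fragment
    read1 = (start, start + span)      # read from the left end
    read2 = (end - span, end)          # read from the right end
    return read1, read2


# Hypothetical Tn5 insertion positions on one chromosome
insertions = [100, 250, 2000, 2120]
frags = fragments_from_insertions(insertions)
reads = paired_end_reads(frags[0])
```

Each fragment end is a cut site, and the distance between the two ends is the insert size that paired-end sequencing recovers.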
- processing and alignment of ATAC-seq fragments
- enrichment of Tn5 transposition events
- identify peaks of Tn5-accessible chromatin
- Chromatin accessibility signal within these peak regions can be compared between different sample types
downstream analyses
- peaks can be linked to putative gene targets by using orthogonal chromatin conformation capture datasets or by naively assigning each peak to the nearest gene
- These predicted gene regulatory interactions can provide a hint as to the functional importance of a given peak
- genes with several ATAC-seq peaks in their promoter and gene body are inferred to be actively expressed in that cell type
- A common application of ATAC-seq is to identify novel enhancers or gene regulatory regions for a given cell type or cell context of interest.
- ATAC-seq peaks can also be annotated for the presence of various TF motif sequences
- enrichment tests can be used to predict the drivers of differential chromatin accessibility
- ATAC-seq data can also be used to infer the positions of nucleosomes, providing insights into chromatin regulation beyond TF binding
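The "naively assigning each peak to the nearest gene" step in the list above can be sketched as follows. This is only a minimal illustration: the `nearest_gene` helper, the gene names, and the TSS coordinates are hypothetical, and using the peak midpoint as the reference point is an assumption rather than something the paper prescribes.

```python
import bisect


def nearest_gene(peak, tss_by_chrom):
    """Assign a peak (chrom, start, end) to the gene with the closest TSS.

    tss_by_chrom maps chromosome -> list of (tss_position, gene), sorted by position.
    Returns (gene, distance) or None if the chromosome has no annotated TSS.
    """
    chrom, start, end = peak
    mid = (start + end) // 2
    tss_list = tss_by_chrom.get(chrom, [])
    if not tss_list:
        return None
    positions = [pos for pos, _ in tss_list]
    i = bisect.bisect_left(positions, mid)
    # Only the TSS just before and just after the midpoint can be the nearest
    candidates = tss_list[max(i - 1, 0): i + 1]
    pos, gene = min(candidates, key=lambda t: abs(t[0] - mid))
    return gene, abs(pos - mid)


# Hypothetical annotation and peaks
tss_by_chrom = {"chr1": [(1000, "GENE_A"), (5000, "GENE_B")]}
hit_a = nearest_gene(("chr1", 1800, 2200), tss_by_chrom)
hit_b = nearest_gene(("chr1", 4500, 4700), tss_by_chrom)
no_hit = nearest_gene(("chr2", 100, 300), tss_by_chrom)
```

Nearest-gene assignment is only a first guess, since enhancers can act over long distances. That is why the list above also mentions chromatin conformation capture data as an orthogonal source of links.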
While gene expression is more accurately measured by RNA sequencing (RNA-seq), ATAC-seq can explain the mechanism behind how gene expression is regulated or why it might be different between two cell types or conditions.
With the principles covered, on to the processing details:
10x - Cell Ranger ATAC Algorithms Overview
peak calling
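Peak calling for ChIP-seq and ATAC-seq is not quite the same. ChIP-seq pulls down the target sequences directly, which are then sequenced on the forward and reverse strands, centered on the protein captured by the antibody.
ATAC-seq yields sequenceable fragments of DNA where each end identifies a transposase cut site. What is ultimately sequenced are the fragments bounded by cut sites, and paired-end sequencing gives the insert size.
Because each sample may have cells with different patterns of chromatin accessibility, peaks must be called directly from ATAC data with each run of the pipeline. Peaks therefore have to be called separately for each run.
Removing background noise: the desired signal (open chromatin causing localized enrichment in cut sites) must be distinguished from background noise (random transposase activity throughout the genome).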
The fragment ends, corrected for the estimated binding position of the transposase enzyme and de-duplicated, are identified in the position-sorted fragments.tsv.gz file produced by Cell Ranger ATAC.
Taking these data, the number of transposition events at each position in the genome is counted.
Because of local variability in transposase binding affinity, this raw signal is smoothed with a 401bp moving window sum to generate a smoothed signal profile, so that the signal at each genomic position represents the total number of transposase cut sites within the window around that position across all barcodes.
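As a rough sketch of the counting and smoothing just described, the code below counts fragment ends per position and sums them in a 401 bp window. It is not Cell Ranger ATAC's implementation. The records follow the five-column layout of fragments.tsv.gz (chrom, start, end, barcode, duplicate count), the example values and helper names are invented, and treating both `start` and `end` directly as cut-site positions is a simplification.

```python
from collections import Counter


def cut_site_counts(fragments):
    """Count transposition events per genomic position across all barcodes.

    Each de-duplicated fragment contributes two cut sites: its start and its end.
    """
    counts = Counter()
    for chrom, start, end, _barcode, _n_dup in fragments:
        counts[(chrom, start)] += 1
        counts[(chrom, end)] += 1
    return counts


def smoothed_signal(counts, chrom, position, window=401):
    """Moving-window sum: total cut sites within the window centered on position."""
    half = window // 2
    return sum(
        counts.get((chrom, p), 0)
        for p in range(position - half, position + half + 1)
    )


# Hypothetical fragment records
fragments = [
    ("chr1", 1000, 1180, "AAAC-1", 2),
    ("chr1", 1010, 1300, "AAAG-1", 1),
    ("chr1", 5000, 5150, "AAAC-1", 1),
]
counts = cut_site_counts(fragments)
signal_open = smoothed_signal(counts, "chr1", 1100)
signal_far = smoothed_signal(counts, "chr1", 3000)
```

A peak caller then has to decide which stretches of this smoothed profile rise above the background expected from random transposition.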
This fully explains what the raw peak-barcode matrix is:
Cell Ranger ATAC produces a count matrix consisting of the counts of fragment ends (or cut sites) within each peak region for each barcode. This is the raw peak-barcode matrix and it captures the enrichment of open chromatin per barcode. The matrix is then filtered to consist of only cell barcodes, which is then used in subsequent analysis such as dimensionality reduction, clustering and visualization.
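A minimal sketch of how such a matrix could be assembled from fragment records and a peak list. The helper names, barcodes, and coordinates are hypothetical. The linear scan over peaks is for readability only, and a real pipeline would use an interval index and a sparse matrix.

```python
from collections import defaultdict


def peak_barcode_matrix(fragments, peaks):
    """Count fragment ends (cut sites) falling inside each peak, per barcode.

    Returns {peak_name: {barcode: count}}; peaks are half-open (start, end) intervals.
    """
    matrix = defaultdict(lambda: defaultdict(int))
    for chrom, start, end, barcode, _n_dup in fragments:
        for site in (start, end):
            for p_chrom, p_start, p_end in peaks:
                if chrom == p_chrom and p_start <= site < p_end:
                    matrix[f"{p_chrom}:{p_start}-{p_end}"][barcode] += 1
                    break  # peaks do not overlap, so stop at the first hit
    return matrix


def filter_barcodes(matrix, cell_barcodes):
    """Keep only the columns (barcodes) that were called as cells."""
    return {
        peak: {bc: n for bc, n in row.items() if bc in cell_barcodes}
        for peak, row in matrix.items()
    }


# Hypothetical inputs
peaks = [("chr1", 900, 1400), ("chr1", 4900, 5100)]
fragments = [
    ("chr1", 1000, 1180, "AAAC-1", 2),
    ("chr1", 1010, 1300, "AAAG-1", 1),
    ("chr1", 5000, 5150, "AAAC-1", 1),
]
raw = peak_barcode_matrix(fragments, peaks)
filtered = filter_barcodes(raw, {"AAAC-1"})
```

Here `raw` plays the role of the raw peak-barcode matrix and `filtered` the version restricted to cell barcodes.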
What the tf-barcode matrix is takes more explanation.
Understanding the pipeline and the output files:
cellranger-atac mkfastq
cellranger-atac count
The workflow is similar to scRNA-seq.
The output files in the outs folder differ somewhat:
Outputs:
- Per-barcode fragment counts & metrics: /home/jdoe/runs/sample345/outs/singlecell.csv
- Position sorted BAM file: /home/jdoe/runs/sample345/outs/possorted_bam.bam
- Position sorted BAM index: /home/jdoe/runs/sample345/outs/possorted_bam.bam.bai
- Summary of all data metrics: /home/jdoe/runs/sample345/outs/summary.json
- HTML file summarizing data & analysis: /home/jdoe/runs/sample345/outs/web_summary.html
- Bed file of all called peak locations: /home/jdoe/runs/sample345/outs/peaks.bed
- Smoothed transposition site track: /home/jdoe/runs/sample345/outs/cut_sites.bigwig
- Raw peak barcode matrix in hdf5 format: /home/jdoe/runs/sample345/outs/raw_peak_bc_matrix.h5
- Raw peak barcode matrix in mex format: /home/jdoe/runs/sample345/outs/raw_peak_bc_matrix
- Directory of analysis files: /home/jdoe/runs/sample345/outs/analysis
- Filtered peak barcode matrix in hdf5 format: /home/jdoe/runs/sample345/outs/filtered_peak_bc_matrix.h5
- Filtered peak barcode matrix in mex format: /home/jdoe/runs/sample345/outs/filtered_peak_bc_matrix
- Barcoded and aligned fragment file: /home/jdoe/runs/sample345/outs/fragments.tsv.gz
- Fragment file index: /home/jdoe/runs/sample345/outs/fragments.tsv.gz.tbi
- Filtered tf barcode matrix in hdf5 format: /home/jdoe/runs/sample345/outs/filtered_tf_bc_matrix.h5
- Filtered tf barcode matrix in mex format: /home/jdoe/runs/sample345/outs/filtered_tf_bc_matrix
- Loupe Browser input file: /home/jdoe/runs/sample345/outs/cloupe.cloupe
- csv summarizing important metrics and values: /home/jdoe/runs/sample345/outs/summary.csv
- Annotation of peaks with genes: /home/jdoe/runs/sample345/outs/peak_annotation.tsv
- Peak-motif associations: /home/jdoe/runs/sample345/outs/peak_motif_mapping.bed
summary HTML file: QC
- The feature here is the fragment
- The QC metrics here are considerably more complex
Understanding a few of the files:
- Filtered peak-barcode matrix: Each peak is an interval on the genome that has a local enrichment of transposase cut-sites. Cell Ranger ATAC produces peaks that are numerically sorted and non-overlapping. Each peak has a corresponding row in the feature-barcode matrix.
- Filtered tf-barcode matrix
- Fragments File: typically used for plotting, e.g. CoveragePlot.
Questions:
What is the difference between a fragment and a peak? Concept, length distribution.
- UMI count per cell is the unit of gene expression. Cut sites per cell is the unit of accessibility.
- Cut-site: a genome location where transposase cuts the DNA and inserts adapters.
- Genes are the rows of a Gene Expression matrix. Peaks are the rows of a Chromatin Accessibility matrix.
- Peaks are genomic regions where there were significant upticks in fragment cut sites, which indicate regions of open chromatin. They are named by their location (e.g., "chr1:10244-10510")
- Unlike genes, peaks are likely to be different between different datasets.
- There are typically more distinct peaks in an ATAC dataset than there are genes in a reference.
- Promoter sums: the sums of cut sites per cell (within peaks) which are close to one of the transcription start sites for that gene. These features are named "(Gene) Sum". Not all peaks are associated with a gene.
- Transcription factor motifs: the sums of cut sites per cell which fall within peaks associated with a motif by the Cell Ranger ATAC pipeline. Motif features are named after the motifs themselves (e.g., "SPI1"). A peak is usually associated with multiple motifs.
- An ATAC dataset takes up several times as much disk space (per cell) as a Gene Expression dataset.
- To see fragment locations per cluster in high resolution, you need access to the fragments.tsv.gz file for that run, generated by the Cell Ranger ATAC pipeline. These files are not bundled with the .cloupe file because they are typically several times larger than it. You can either specify the location of this file on a locally mounted file system, or on the web via a URL.
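For the fragment-versus-peak question above, a small sketch may help. A fragment is one sequenced molecule bounded by two cut sites in a single barcode, whereas a peak is a called region aggregated over many fragments. The code below computes fragment lengths from fragments.tsv-style lines and the width of a peak from its name. The helper names and example records are invented, and the lines assume the usual five tab-separated columns (chrom, start, end, barcode, duplicate count).

```python
def parse_peak_name(name):
    """Split a peak name such as 'chr1:10244-10510' into (chrom, start, end)."""
    chrom, span = name.rsplit(":", 1)
    start, end = span.split("-")
    return chrom, int(start), int(end)


def fragment_lengths(lines):
    """Return end - start for each fragment record, skipping comment and blank lines."""
    lengths = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        _chrom, start, end, _barcode, _count = line.rstrip("\n").split("\t")[:5]
        lengths.append(int(end) - int(start))
    return lengths


# Hypothetical records in fragments.tsv layout
example_lines = [
    "# comment line\n",
    "chr1\t10073\t10210\tAAACGAAAGACACGGT-1\t1\n",
    "chr1\t10245\t10427\tAAACGAAAGCATACCT-1\t2\n",
]
lengths = fragment_lengths(example_lines)
chrom, start, end = parse_peak_name("chr1:10244-10510")
peak_width = end - start
```

To run this on a real file, the lines could come from `gzip.open(path, "rt")`. A histogram of the resulting lengths gives the fragment length distribution.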
Basic tutorials
Analysis tools
signac - https://github.com/timoast/signac
- QC
- PCA and UMAP
- peak to gene
- integration with scRNA-seq
- DE Peaks
- CoveragePlot
- TF footprinting
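Applying Scasat, a single-cell ATAC-seq analysis tool, to cell classification
scJoint: integrating scRNA-seq and scATAC-seq data with transfer learning
In February 2020, Qing Nie's group published "scAI: an unsupervised approach for the integrative analysis of parallel single-cell transcriptomic and epigenomic profiles" in Genome Biology.
NC | SCALE accurately identifies chromatin accessibility features in single-cell ATAC-seq data
In the paper, the authors list the current scATAC-seq analysis software from a developer's perspective (chromVAR, scABC, cisTopic, scVI) and find that each has certain shortcomings. From a user's perspective, though, it is worth trying all of these tools.
Among scATAC-seq analysis tools, the three best-known R packages are ArchR, SnapATAC, and Signac.
Nat Comm | Jiekai Chen's group and colleagues develop scTE, a tool for analyzing transposable element expression in single-cell sequencing data
Application examples
- scRNA-seq combined with scATAC-seq dissects gene regulation in human cerebral cortex development
- Stanford University publishes snRNA-seq combined with scATAC-seq revealing cell state changes during malignant transformation of colorectal polyps
- Cell Press journal | Integrating scRNA-seq and ATAC-seq data to analyze the regulation of hematopoiesis during human development
- scRNA-Seq and scATAC-Seq team up for a high-impact paper in Blood
- scRNA-seq & scATAC-seq reveal how stimulating neogenic hair follicles promotes wound healing
- Literature roundup | An overview of studies combining scATAC-seq and scRNA-seq
Advanced TF (transcription factor) analysis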
Transcription factors
General and specific transcription factors. Transcription initiation complex & looping. Combinatorial regulation.
To be continued~