STARR-seq

STARR-seq peak calling

STARRPeaker output file format

Final Peak Call Format (up to v1.0; BED6+4)

  • Column 1: Chromosome
  • Column 2: Start position
  • Column 3: End position
  • Column 4: Name (peak rank based on score, 1 being the highest rank)
  • Column 5: Score (integer value of "100 * fold change", maxed at 1000 per BED format specification)
  • Column 6: Strand
  • Column 7: Fold change (output/normalized-input)
  • Column 8: Output fragment coverage
  • Column 9: -log10 of P-value
  • Column 10: -log10 of Q-value (Benjamini-Hochberg False Discovery Rate, FDR)
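The ten columns listed above map directly onto a small record type. The following is a minimal Python sketch, assuming tab-separated lines with exactly ten columns; `StarrPeak` and `parse_starrpeaker_line` are illustrative names and are not part of STARRPeaker itself.

```python
from dataclasses import dataclass


@dataclass
class StarrPeak:
    chrom: str
    start: int
    end: int
    name: str               # peak rank based on score
    score: int              # 100 * fold change, capped at 1000
    strand: str
    fold_change: float      # output / normalized input
    output_coverage: float  # output fragment coverage
    neg_log10_p: float
    neg_log10_q: float      # Benjamini-Hochberg FDR, -log10 scale


def parse_starrpeaker_line(line: str) -> StarrPeak:
    """Parse one tab-separated BED6+4 line into a StarrPeak record."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 10:
        raise ValueError(f"expected 10 columns, got {len(f)}")
    return StarrPeak(
        chrom=f[0], start=int(f[1]), end=int(f[2]), name=f[3],
        score=int(f[4]), strand=f[5], fold_change=float(f[6]),
        output_coverage=float(f[7]), neg_log10_p=float(f[8]),
        neg_log10_q=float(f[9]),
    )


peak = parse_starrpeaker_line(
    "chr1\t737031\t737531\tpeak_28774\t292\t.\t2.917\t113\t2.999\t1.862"
)
print(peak.name, peak.end - peak.start, peak.fold_change)
```

Because BED coordinates are 0-based and half-open, `end - start` gives the peak width directly.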

Source: STARRPeaker: uniform processing and accurate identification of STARR-seq active regions

STARRPeaker example:

chr1	737031	737531	peak_28774	292	.	2.917	113	2.999	1.862
chr1	779519	780068	peak_30539	286	.	2.859	151	3.421	2.181
chr1	851507	852190	peak_14391	369	.	3.691	136	4.882	3.371
chr1	882860	883360	peak_15546	359	.	3.589	203	5.654	4.034
chr1	943023	943523	peak_8220	453	.	4.534	129	3.701	2.399
chr1	983423	983986	peak_16365	352	.	3.522	225	4.083	2.707
chr1	1006435	1007050	peak_2367	733	.	7.331	392	10.014	7.957
chr1	1013140	1013640	peak_8504	448	.	4.481	221	6.173	4.488
chr1	1053795	1054295	peak_11463	400	.	4.004	289	4.051	2.681
chr1	1098488	1098988	peak_52307	237	.	2.372	135	2.783	1.707
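In the rows above, column 5 matches 100 × fold change rounded to the nearest integer (for example 2.917 gives 292), and columns 9 and 10 are on a -log10 scale, so a q-value is recovered as 10 raised to the negative of column 10. The sketch below checks both relationships on three of the rows. The helper names and the 0.01 FDR cutoff are assumptions chosen for illustration only.

```python
SAMPLE_ROWS = [
    "chr1 737031 737531 peak_28774 292 . 2.917 113 2.999 1.862",
    "chr1 1006435 1007050 peak_2367 733 . 7.331 392 10.014 7.957",
    "chr1 1098488 1098988 peak_52307 237 . 2.372 135 2.783 1.707",
]


def expected_score(fold_change: float) -> int:
    """BED score as it appears in the sample rows: 100 * fold change, capped at 1000."""
    return min(1000, round(100 * fold_change))


def q_value(neg_log10_q: float) -> float:
    """Convert a -log10 q-value back to a plain q-value."""
    return 10.0 ** (-neg_log10_q)


FDR_CUTOFF = 0.01  # illustrative threshold, not a STARRPeaker default

significant = []
for row in SAMPLE_ROWS:
    f = row.split()
    name, score = f[3], int(f[4])
    fold_change, neg_log10_q = float(f[6]), float(f[9])
    # The reported score should agree with the fold change column.
    assert score == expected_score(fold_change)
    if q_value(neg_log10_q) < FDR_CUTOFF:
        significant.append(name)

print(significant)
```

A -log10 q-value of 2 corresponds to q = 0.01, so filtering on column 10 being greater than 2 is equivalent to the comparison used here.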

 

 

https://hbctraining.github.io/Intro-to-ChIPseq/lessons/05_peak_calling_macs.html

  • _peaks.narrowPeak: BED6+4 format file containing the peak locations together with the peak summit, p-value, and q-value

MACS example:

chr1	137988	139077	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_1	107	.	2.07609	12.79223	10.79328	800
chr1	183813	184198	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_2	62	.	1.80156	7.98671	6.25028	362
chr1	778328	779235	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_3	1000	.	4.02790	127.46170	123.78461	396
chr1	818721	819243	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_4	84	.	1.87128	10.32806	8.45139	341
chr1	826970	827885	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_5	268	.	2.54440	29.41094	26.89010	583
chr1	831880	832489	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_6	58	.	1.67430	7.58716	5.87863	200
chr1	842755	844744	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_7	122	.	1.98321	14.35260	12.28650	1394
chr1	851489	852108	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_8	27	.	1.52704	4.20449	2.78904	90
chr1	856307	856911	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_9	21	.	1.42146	3.44552	2.11941	202
chr1	860455	861134	A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_10	50	.	1.67918	6.73655	5.08994	220
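Although both files are BED6+4, the extra columns differ. In narrowPeak, columns 7 to 9 are the signal value, -log10 p-value, and -log10 q-value, and column 10 is the summit offset counted from chromStart, whereas STARRPeaker uses columns 7 to 10 for fold change, output coverage, -log10 p-value, and -log10 q-value. A minimal sketch of reading a narrowPeak line and deriving the absolute summit position follows; `parse_narrowpeak` and the added `summit` key are illustrative names, not part of MACS.

```python
NARROWPEAK_FIELDS = (
    "chrom", "chromStart", "chromEnd", "name", "score", "strand",
    "signalValue", "pValue", "qValue", "peak",
)
INT_FIELDS = {"chromStart", "chromEnd", "score", "peak"}
FLOAT_FIELDS = {"signalValue", "pValue", "qValue"}


def parse_narrowpeak(line: str) -> dict:
    """Parse one whitespace-separated narrowPeak line into a typed dict."""
    values = line.split()
    if len(values) != len(NARROWPEAK_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    rec = {}
    for key, value in zip(NARROWPEAK_FIELDS, values):
        if key in INT_FIELDS:
            rec[key] = int(value)
        elif key in FLOAT_FIELDS:
            rec[key] = float(value)
        else:
            rec[key] = value
    # Column 10 is an offset from chromStart, not a genomic coordinate.
    rec["summit"] = rec["chromStart"] + rec["peak"]
    return rec


rec = parse_narrowpeak(
    "chr1 778328 779235 "
    "A001-K562.f3q10.sorted.dups_marked.macs_keep_dups_norm_peak_3 "
    "1000 . 4.02790 127.46170 123.78461 396"
)
print(rec["chrom"], rec["summit"])
```

Splitting on any whitespace is a convenience that works here because none of the fields contain spaces; a stricter reader would split on tabs only.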

https://gander.wustl.edu/cgi-bin/hgTables?db=dm6&hgta_group=regulation&hgta_track=STARRseq_macs&hgta_table=S2_STARRseq_macs&hgta_doSchema=describe+table+schema
field example description
chrom chr2L Reference sequence chromosome or scaffold
chromStart 15790602 Pseudogene alignment start position
chromEnd 15791396 Pseudogene alignment end position
name dm6_S2_macs_peak_500 Name of pseudogene
score 7 Score of pseudogene with gene (0-1000)
strand . + or - or . for unknown
thickStart 15790602 Start of where display should be thick (start codon)
thickEnd 15791396 End of where display should be thick (stop codon)
reserved 0 Always zero for now
blockCount 3 Number of blocks
blockSizes 1,726,1 Comma separated list of block sizes
chromStarts 0,48,793 Start positions relative to chromStart
signalValue 1.26549 Measurement of average enrichment for the region
pValue 1.57219 Statistical significance of signal value (-log10). Set to -1 if not used.
qValue 0.78136 Statistical significance with multiple-test correction applied (FDR). Set to -1 if not used.
 

Sample Rows
 
 
chrom chromStart chromEnd name score strand thickStart thickEnd reserved blockCount blockSizes chromStarts signalValue pValue qValue
chr2L 15790602 15791396 dm6_S2_macs_peak_500 7 . 15790602 15791396 0 3 1,726,1 0,48,793 1.26549 1.57219 0.78136
chr2L 15793470 15793982 dm6_S2_macs_peak_501 12 . 15793470 15793982 0 2 1,1 0,511 1.45984 2.08082 1.23422
chr2L 15877966 15879024 dm6_S2_macs_peak_502 12 . 15877966 15879024 0 2 1,1 0,1057 1.42725 2.06278 1.24344
chr2L 15892565 15893078 dm6_S2_macs_peak_503 7 . 15892565 15893078 0 2 1,1 0,512 1.32721 1.51711 0.73666
chr2L 15913711 15914639 dm6_S2_macs_peak_504 6 . 15913711 15914639 0 3 1,570,1 0,357,927 1.24213 1.39006 0.67120
chr2L 15921259 15923111 dm6_S2_macs_peak_505 20 . 15921259 15923111 0 3 1,688,1 0,27,1851 1.80132 2.93227 2.00504
chr2L 15934282 15935000 dm6_S2_macs_peak_506 9 . 15934282 15935000 0 3 1,664,1 0,49,717 1.39124 1.81636 0.95197
chr2L 15985186 15986307 dm6_S2_macs_peak_507 35 . 15985186 15986307 0 2 1,1 0,1120 2.22505 4.50524 3.53903
chr2L 16063228 16065183 dm6_S2_macs_peak_508 88 . 16063228 16065183 0 2 1,1 0,1954 3.77957 10.09229 8.80252
chr2L 16191197 16191985 dm6_S2_macs_peak_509 54 . 16191197 16191985 0 2 1,1 0,787 2.44281 6.54044 5.47299
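The blockSizes and chromStarts columns are parallel comma-separated lists, and each start is an offset relative to chromStart. The sketch below expands them into absolute, half-open intervals for the first sample row. The function name `expand_blocks` is an illustrative assumption and is not taken from the table browser or from MACS.

```python
def expand_blocks(chrom_start: int, block_sizes: str, block_starts: str):
    """Turn BED12 block columns into absolute (start, end) intervals."""
    # A trailing comma is tolerated because some BED writers emit one.
    sizes = [int(x) for x in block_sizes.rstrip(",").split(",")]
    starts = [int(x) for x in block_starts.rstrip(",").split(",")]
    if len(sizes) != len(starts):
        raise ValueError("blockSizes and chromStarts differ in length")
    return [(chrom_start + s, chrom_start + s + n) for s, n in zip(starts, sizes)]


# First sample row: chr2L 15790602 15791396 ... 3 1,726,1 0,48,793
intervals = expand_blocks(15790602, "1,726,1", "0,48,793")
for start, end in intervals:
    print("chr2L", start, end)
```

For this row the last block ends at 15791396, which equals chromEnd, as the BED12 layout requires.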
 




 

STARR-seq: a method for assessing enhancer (and promoter) activity.

Epromoters: promoters that can also act as enhancers of distal genes.

 

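The workflow for identifying enhancers and Epromoters is as follows:

Published STARR-seq and MPRA datasets are collected from the GEO database, enhancer peak calling is run on each dataset, and sequences with enhancer or Epromoter activity are identified.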

 

 


 
Link: https://www.jianshu.com/p/98ce9a3f3fe5

 

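STARR-seq is now widely used to measure enhancer activity. However, the accuracy of conventional STARR-seq depends heavily on complete recovery of the self-transcribed mRNA that starts at the reporter gene promoter.

During plasmid construction, a polyadenylation site (PAS) is added downstream of the reporter gene. Because this PAS is designed to polyadenylate the self-transcripts (STs), it is called the "DPAS". However, the DNA sequence being tested may itself contain an alternative polyadenylation site, which is also potentially influenced by the enhancer; this is called the "APAS". The APAS is not detected in STARR-seq.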



 
Link: https://www.jianshu.com/p/57d18d57d34f

 

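High-throughput validation methods such as self-transcribing active regulatory region sequencing (STARR-seq) allow large-scale assessment of cis-regulatory modules (CRMs) across the genome. In STARR-seq, genomic DNA is fragmented and barcoded, and the fragments are cloned into the 3' UTR of a reporter gene to create a reporter library. The RNA fragments derived from this library indicate the activity of the CRMs (Fig. 5B).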

 

 https://zhuanlan.zhihu.com/p/440319953

 

https://news.sciencenet.cn/htmlpaper/2020/8/20208123422227357873.shtm

 

 

 https://wenku.baidu.com/view/4ad853d4920ef12d2af90242a8956bec0975a5d4.html?_wkts_=1674140985734&bdQuery=starr-seq

 

 

 https://www.science.org/doi/10.1126/science.1232542

Genome-Wide Quantitative Enhancer Activity Maps Identified by STARR-seq

SCIENCE
17 Jan 2013
Vol 339, Issue 6123
pp. 1074-1077

DOI: 10.1126/science.1232542

Abstract

Genomic enhancers are important regulators of gene expression, but their identification is a challenge, and methods depend on indirect measures of activity. We developed a method termed STARR-seq to directly and quantitatively assess enhancer activity for millions of candidates from arbitrary sources of DNA, which enables screens across entire genomes. When applied to the Drosophila genome, STARR-seq identifies thousands of cell type–specific enhancers across a broad continuum of strengths, links differential gene expression to differences in enhancer activity, and creates a genome-wide quantitative enhancer map. This map reveals the highly complex regulation of transcription, with several independent enhancers for both developmental regulators and ubiquitously expressed genes. STARR-seq can be used to identify and quantify enhancer activity in other eukaryotes, including humans.

 

 

 

 

 

 

 

Methodology

Genomic DNA is randomly sheared into small fragments, and adaptors are ligated to the size-selected fragments. The adaptor-linked fragments are amplified, and the purified PCR products are placed downstream of a minimal promoter in the screening vector, giving them the opportunity to transcribe themselves. Candidate cells are then transfected with the reporter library and cultured. Total RNA is extracted and poly-A RNA is isolated. cDNA is produced by reverse transcription and amplified, and the candidate fragments are then sequenced by high-throughput paired-end sequencing. Sequence reads are mapped to the reference genome and processed computationally.[1]

 

 

 https://www.sciencedirect.com/science/article/pii/S0888754315300100

 https://en.wikipedia.org/wiki/STARR-seq


 

 

 

posted @ 2023-05-03 17:43  emanlee