hg19 | GRCh38 | mm10 | 基因组 | 功能区域 | 位置提取
2022年12月01日
准备bedtools需要用的-g参数文件,-g $chromSize,genome.txt,信息就在比对的bam文件里,提取一下即可。
1 | samtools view -H N3_H3K27Ac_1_CKDL220013314-1a_HY2YCDSX3_L1_1_val_1.bowtie2.bam | grep @SQ| sed 's/@SQ\tSN:\|LN://g' > genome.txt |
2021年08月09日
提取转录本id,查询对应的基因id
1 2 3 | ~ /databases/cellranger_ref/refdata-cellranger-GRCh38-1 .2.0 /genes/ cat genes.gtf | awk '{if($3=="transcript")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > .. /transcript .gene.anno.hg38.3.0.0.csv |
如何获取hg19的CDS、UTR、intergenic、intron等的位置信息?
如何从gtf中提取基因的注释信息 ?
2020年09月06日更新。
不积跬步无以至千里,从最基本的格式转换开始处理。
看一下gtf的格式
1 2 3 4 5 6 7 8 9 | seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below. source - name of the program that generated this feature, or the data source (database or project name) feature - feature type name, e.g. Gene, Variation, Similarity start - Start position* of the feature, with sequence numbering starting at 1. end - End position* of the feature, with sequence numbering starting at 1. score - A floating point value. strand - defined as + (forward) or - (reverse). frame - One of '0' , '1' or '2' . '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on.. attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature. |
了解基本的awk和sed命令即可。
基因的注释文件版本太多太乱了,做转录组就以10x提供的为准吧。
1 | cat genes.gtf | awk '{if($3=="gene")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../gene.anno.mm10.3.0.0.csv |
1 | cat genes.gtf | awk '{if($3=="gene")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../gene.anno.GRCh38.3.0.0.csv |
1 | cat genes.gtf | awk '{if($3=="gene")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../gene.anno.hg19.3.0.0.csv |
提取出来的基因注释,可以直接用csv读取:
1 2 3 4 5 6 7 8 | 1;gene;3205901;3671498;-;gene_id;ENSMUSG00000051951;gene_name;Xkr4;gene_biotype;protein_coding 1;gene;3466587;3513553;+;gene_id;ENSMUSG00000089699;gene_name;Gm1992;gene_biotype;antisense 1;gene;29554;31109;+;gene_id;ENSG00000243485;gene_name;MIR1302-2HG;gene_biotype;lincRNA 1;gene;34554;36081;-;gene_id;ENSG00000237613;gene_name;FAM138A;gene_biotype;lincRNA 1;gene;29554;31109;+;gene_id;ENSG00000243485;gene_name;MIR1302-10;gene_biotype;lincRNA 1;gene;34554;36081;-;gene_id;ENSG00000237613;gene_name;FAM138A;gene_biotype;lincRNA |
上传到github,以后数据分析时会很常用,写好说明文档。
参考手册:
Hg19 regions for Intergenic, Promoters, Enhancer, Exon, Intron, 5-UTR, 3-UTR
The coding region of a gene, also known as the CDS (from coding sequence), is that portion of a gene's DNA or RNA that codes for protein. The region usually begins at the 5' end by a start codon and ends at the 3' end with a stop codon.
CDS就是所有exon的组合
有细微的差别:
Exons = gene - introns
CDS = gene - introns - UTRs
therefore also:
CDS = Exons - UTRs
就是UTR也算作了exon了。
也就是exon算是一个比较大的概念,所以文章里用的CDS区域来统计,而不是exon。
获取数据的方法:
1. UCSC - 最全,最个性化
How to download the most similar annotation file as the author required from UCSC browser directly.
Go to UCSC brower and Tool, Table caterogy. and pick you reference genome and select right version under clade/genome/assembly
Make sure the group is "Genes and Gene Predictions"
Choose your preferred track (RefSeq/RefGene or UCSC gene/KnownGene)
Choose the table that gives gene information (RefSeq or KnownGene)
Select your region or the entire genome to get coordinates for
Select BED format as your output format
Name your output file
Click "get output"
Be careful, the ouput files don't have exon, intron, integenic, 5-UTR, 3-UTR informatics if you save it as a single file. You can save them as separated files so that you know the information for each subset.
需要自己再sort一下:
1 2 3 4 5 6 7 8 | cat raw.txt /Gene .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > Gene.bed cat raw.txt /UTR3 .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > UTR3.bed cat raw.txt /UTR5 .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > UTR5.bed cat raw.txt /down2K .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > Down2K.bed cat raw.txt /CDS .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > CDS.bed cat raw.txt /exon .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > Exon.bed cat raw.txt /up2K .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > Up2K.bed cat raw.txt /intron .txt | egrep "^chr([0-9]{1,2})" | grep - v random | bedtools sort -g .. /genome .txt > Intron.bed |
这里得到的CDS是有overlap的,因为是按转录本算的,同一个基因有多个转录本。
2. genecode - 最自动化,全部用代码搞定
3. 其他
RSeQC有比较好的bed注释文件,但是不是完全的明文,不好利用。
附录:
intergenic的需要自己计算
1 | wget http: //ftp .ensembl.org /pub/release-75/gtf/homo_sapiens/Homo_sapiens .GRCh37.75.gtf.gz |
1 2 3 4 5 | grep - v "_" hg19.chrlen.bed | egrep "^chr([0-9]{1,2})" | cut -f1,3 > hg19.chrlen.1_22.bed # sort -k1,1 hg19.chrlen.1_22.bed > sorted.genome bedtools complement -i UCSC.anno /Gene .bed -g hg19.chrlen.1_22.bed > intergenic.bed sort -k1,1 -k2,2n intergenic.bed > sorted.intergenic.bed bedtools merge -i sorted.intergenic.bed > merged.sorted.intergenic.bed |
【推荐】国内首个AI IDE,深度理解中文开发场景,立即下载体验Trae
【推荐】编程新体验,更懂你的AI,立即体验豆包MarsCode编程助手
【推荐】抖音旗下AI助手豆包,你的智能百科全书,全免费不限次数
【推荐】轻量又高性能的 SSH 工具 IShell:AI 加持,快人一步
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
· winform 绘制太阳,地球,月球 运作规律
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 上周热点回顾(3.3-3.9)