hg19 | GRCh38 | mm10 | 基因组 | 功能区域 | 位置提取

2022年12月01日

准备bedtools需要用的-g参数文件，-g $chromSize，genome.txt，信息就在比对的bam文件里，提取一下即可。

samtools view -H N3_H3K27Ac_1_CKDL220013314-1a_HY2YCDSX3_L1_1_val_1.bowtie2.bam | grep @SQ|sed 's/@SQ\tSN:\|LN://g' > genome.txt

2021年08月09日

提取转录本id，查询对应的基因id

~/databases/cellranger_ref/refdata-cellranger-GRCh38-1.2.0/genes/

cat genes.gtf | awk '{if($3=="transcript")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../transcript.gene.anno.hg38.3.0.0.csv

如何获取hg19的CDS、UTR、intergenic、intron等的位置信息？

如何从gtf中提取基因的注释信息？

2020年09月06日更新。

不积跬步无以至千里，从最基本的格式转换开始处理。

看一下gtf的格式

seqname - name of the chromosome or scaffold; chromosome names can be given with or without the 'chr' prefix. Important note: the seqname must be one used within Ensembl, i.e. a standard chromosome name or an Ensembl identifier such as a scaffold ID, without any additional content such as species or assembly. See the example GFF output below.
source - name of the program that generated this feature, or the data source (database or project name)
feature - feature type name, e.g. Gene, Variation, Similarity
start - Start position* of the feature, with sequence numbering starting at 1.
end - End position* of the feature, with sequence numbering starting at 1.
score - A floating point value.
strand - defined as + (forward) or - (reverse).
frame - One of '0', '1' or '2'. '0' indicates that the first base of the feature is the first base of a codon, '1' that the second base is the first base of a codon, and so on..
attribute - A semicolon-separated list of tag-value pairs, providing additional information about each feature.

了解基本的awk和sed命令即可。

基因的注释文件版本太多太乱了，做转录组就以10x提供的为准吧。

cat genes.gtf | awk '{if($3=="gene")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../gene.anno.mm10.3.0.0.csv

cat genes.gtf | awk '{if($3=="gene")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../gene.anno.GRCh38.3.0.0.csv

cat genes.gtf | awk '{if($3=="gene")print $0}' | cut -f1,3,4,5,7,9 | cut -d\; -f1,3,5 | sed 's/\t/;/g' | sed 's/ /;/g' | sed 's/;;/;/g' | sed 's/\"//g' > ../gene.anno.hg19.3.0.0.csv

提取出来的基因注释，可以直接用csv读取：

1;gene;3205901;3671498;-;gene_id;ENSMUSG00000051951;gene_name;Xkr4;gene_biotype;protein_coding
1;gene;3466587;3513553;+;gene_id;ENSMUSG00000089699;gene_name;Gm1992;gene_biotype;antisense

1;gene;29554;31109;+;gene_id;ENSG00000243485;gene_name;MIR1302-2HG;gene_biotype;lincRNA
1;gene;34554;36081;-;gene_id;ENSG00000237613;gene_name;FAM138A;gene_biotype;lincRNA

1;gene;29554;31109;+;gene_id;ENSG00000243485;gene_name;MIR1302-10;gene_biotype;lincRNA
1;gene;34554;36081;-;gene_id;ENSG00000237613;gene_name;FAM138A;gene_biotype;lincRNA

上传到github，以后数据分析时会很常用，写好说明文档。　　

参考手册：

Hg19 regions for Intergenic, Promoters, Enhancer, Exon, Intron, 5-UTR, 3-UTR

怎么从gtf文件获取genome feature的区间

The coding region of a gene, also known as the CDS (from coding sequence), is that portion of a gene's DNA or RNA that codes for protein. The region usually begins at the 5' end by a start codon and ends at the 3' end with a stop codon.

CDS就是所有exon的组合

有细微的差别：

Exons = gene - introns

CDS = gene - introns - UTRs

therefore also:

CDS = Exons - UTRs

就是UTR也算作了exon了。

也就是exon算是一个比较大的概念，所以文章里用的CDS区域来统计，而不是exon。

获取数据的方法：

1. UCSC - 最全，最个性化

How to download the most similar annotation file as the author required from UCSC browser directly.

Go to UCSC brower and Tool, Table caterogy. and pick you reference genome and select right version under clade/genome/assembly
Make sure the group is "Genes and Gene Predictions"
Choose your preferred track (RefSeq/RefGene or UCSC gene/KnownGene)
Choose the table that gives gene information (RefSeq or KnownGene)
Select your region or the entire genome to get coordinates for
Select BED format as your output format
Name your output file
Click "get output"
Be careful, the ouput files don't have exon, intron, integenic, 5-UTR, 3-UTR informatics if you save it as a single file. You can save them as separated files so that you know the information for each subset.

需要自己再sort一下：

cat raw.txt/Gene.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Gene.bed
cat raw.txt/UTR3.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > UTR3.bed
cat raw.txt/UTR5.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > UTR5.bed
cat raw.txt/down2K.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Down2K.bed
cat raw.txt/CDS.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > CDS.bed
cat raw.txt/exon.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Exon.bed
cat raw.txt/up2K.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Up2K.bed
cat raw.txt/intron.txt | egrep  "^chr([0-9]{1,2})" | grep -v random | bedtools sort -g ../genome.txt > Intron.bed

这里得到的CDS是有overlap的，因为是按转录本算的，同一个基因有多个转录本。

2. genecode - 最自动化，全部用代码搞定

3. 其他

RSeQC有比较好的bed注释文件，但是不是完全的明文，不好利用。

附录：

intergenic的需要自己计算

wget http://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz

grep -v "_" hg19.chrlen.bed | egrep  "^chr([0-9]{1,2})" | cut -f1,3 > hg19.chrlen.1_22.bed
# sort -k1,1 hg19.chrlen.1_22.bed > sorted.genome
bedtools complement -i UCSC.anno/Gene.bed -g hg19.chrlen.1_22.bed > intergenic.bed
sort -k1,1 -k2,2n intergenic.bed > sorted.intergenic.bed
bedtools merge -i sorted.intergenic.bed > merged.sorted.intergenic.bed

posted @ 2019-11-19 15:39 Life·Intelligence 阅读(7041) 评论(0) 编辑收藏举报

刷新页面返回顶部

Digital-LI

hg19 | GRCh38 | mm10 | 基因组 | 功能区域 | 位置提取