Which genome and annotation version should you use? | Hands-on test
What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)
This is a small detail but a very practical question: which version should you actually use?
References:
What Ensembl genome version should I use for alignments? (e.g. toplevel.fa vs. primary_assembly.fa)
Results differ when using different ensembl versions
First part options:
- dna_sm - Repeats soft-masked (converts repeat nucleotides to lowercase)
- dna_rm - Repeats hard-masked (converts repeats to N's)
- dna - No masking
Second part options:
- .toplevel - Includes haplotype information (not sure how aligners deal with this)
- .primary_assembly - Single reference base per position
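The two-part naming scheme above can be read straight off a file name. The following is a minimal sketch, assuming the usual Ensembl pattern of species, assembly, sequence type, region type; the helper name `describe_ensembl_fasta` and the description strings are hypothetical, not part of any Ensembl tool.

```python
# Hypothetical helper: split an Ensembl-style DNA FASTA name into its parts.
# The descriptions mirror the "first part" options listed above.
MASKING = {
    "dna": "no masking",
    "dna_sm": "repeats soft-masked (lowercase)",
    "dna_rm": "repeats hard-masked (N)",
}


def describe_ensembl_fasta(filename):
    """Return (species, assembly, masking, region) for an Ensembl DNA FASTA name."""
    parts = filename.split(".")
    # Find the sequence-type token (dna, dna_sm or dna_rm).
    for i, token in enumerate(parts):
        if token in MASKING:
            break
    else:
        raise ValueError(f"no dna/dna_sm/dna_rm token in {filename!r}")
    species = parts[0]
    assembly = ".".join(parts[1:i])   # e.g. GRCh38
    region = parts[i + 1]             # e.g. toplevel or primary_assembly
    return species, assembly, MASKING[token], region


print(describe_ensembl_fasta("Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz"))
```

Checking the name this way before building an index is a cheap way to confirm which masking and which region set a downloaded file actually contains.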
Most sources recommend the soft-masked version (dna_sm), in which repeats are kept as lowercase bases rather than replaced with N.
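To make the difference between the masking options concrete, here is a small sketch on a made-up sequence. The function names and the toy sequence are illustrative assumptions; real genome files would be streamed record by record rather than held in one string.

```python
import re


def hard_mask(soft_masked_seq):
    """Replace lowercase (soft-masked) bases with N, mimicking a dna_rm file."""
    return re.sub(r"[a-z]", "N", soft_masked_seq)


def unmask(soft_masked_seq):
    """Uppercase everything, mimicking an unmasked dna file."""
    return soft_masked_seq.upper()


def masked_fraction(seq):
    """Fraction of bases that are lowercase, i.e. flagged as repeats."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


toy = "ACGTacgtacGGCC"  # made-up soft-masked sequence, not real genome data
print(hard_mask(toy))
print(unmask(toy))
print(f"{masked_fraction(toy):.2f}")
```

The soft-masked form loses nothing: both of the other forms can be derived from it, while a hard-masked file cannot be turned back into the original bases.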
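Download the hg19 genome: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/
Reference: how the various genome build names correspond to one another
Download the hg19 annotation from GENCODE: ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_27/ (release 27 itself is annotated on GRCh38, so for hg19 use the version of that release lifted over to GRCh37).
UCSC also provides annotation, but only as an export from the Table Browser web page: http://genome.ucsc.edu/cgi-bin/hgTables
Note: GENCODE seems to have a problem. The EBI links on https://www.gencodegenes.org/releases/26lift37.html can no longer be downloaded.
Reference: http://www.biotrainee.com/thread-2035-1-1.html
A newer genome is not automatically better. Look at recent CNS (Cell, Nature, Science) papers: very few use the latest genome build. Why? Because the annotation has not caught up, and your results may not line up with other people's.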
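Hands-on test
What difference does the genome version actually make?
I ran a transcriptome (RNA-seq) test using hg19 and GRCh38.
Conclusions:
1. Read alignment to the genome is roughly the same; there is essentially no difference.
2. With different annotation files, the gene expression results differ greatly, even though both runs used featureCounts.
GRCh38 results: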
Assigned                        306852
Unassigned_Unmapped             0
Unassigned_MappingQuality       0
Unassigned_Chimera              0
Unassigned_FragmentLength       0
Unassigned_Duplicate            0
Unassigned_MultiMapping         36280
Unassigned_Secondary            0
Unassigned_Nonjunction          0
Unassigned_NoFeatures           56950
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity            19771
//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/lizhixin/databases/ensembl/release91/Homo_s ... ||
||    Features : 1199851                                                      ||
||    Meta-features : 58302                                                   ||
||    Chromosomes/contigs : 47                                                ||
||                                                                            ||
|| Process BAM file /home/lizhixin/project/scRNA-seq/reanalyze/first_five ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||                                                                            ||
||    WARNING: reads from the same pair were found not adjacent to each       ||
||             other in the input (due to read sorting by location or         ||
||             reporting of multi-mapping read pairs).                        ||
||                                                                            ||
||    Read re-ordering is performed.                                          ||
||                                                                            ||
||    Total fragments : 419853                                                ||
||    Successfully assigned fragments : 306852 (73.1%)                        ||
||    Running time : 0.05 minutes                                             ||
hg19 results:
Assigned                        586467
Unassigned_Unmapped             0
Unassigned_MappingQuality       0
Unassigned_Chimera              0
Unassigned_FragmentLength       0
Unassigned_Duplicate            0
Unassigned_MultiMapping         66997
Unassigned_Secondary            0
Unassigned_Nonjunction          0
Unassigned_NoFeatures           133437
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity            47278
//================================= Running ==================================\\
||                                                                            ||
|| Load annotation file /home/lizhixin/databases/cellranger_ref/refdata-c ... ||
||    Features : 1130716                                                      ||
||    Meta-features : 32738                                                   ||
||    Chromosomes/contigs : 45                                                ||
||                                                                            ||
|| Process BAM file /home/lizhixin/project/scRNA-seq/reanalyze/first_five ... ||
||    Paired-end reads are included.                                          ||
||    Assign fragments (read pairs) to features...                            ||
||    Total fragments : 834179                                                ||
||    Successfully assigned fragments : 586467 (70.3%)                        ||
||    Running time : 0.05 minutes                                             ||
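For comparing runs like these, it can help to recompute the totals and assignment rates from the summary counts instead of reading them off the log. The sketch below uses only the non-zero rows reported above; the function names are hypothetical, and a real featureCounts `.summary` file also has a header line that would need to be skipped.

```python
def parse_summary(text):
    """Parse 'Status count' lines into a dict of integer counts."""
    counts = {}
    for line in text.strip().splitlines():
        status, value = line.split()
        counts[status] = int(value)
    return counts


def assigned_fraction(counts):
    """Share of all fragments that were assigned to a feature."""
    return counts["Assigned"] / sum(counts.values())


# Non-zero rows copied from the two summaries above (zero rows omitted).
grch38 = """
Assigned 306852
Unassigned_MultiMapping 36280
Unassigned_NoFeatures 56950
Unassigned_Ambiguity 19771
"""
hg19 = """
Assigned 586467
Unassigned_MultiMapping 66997
Unassigned_NoFeatures 133437
Unassigned_Ambiguity 47278
"""

for name, text in (("GRCh38", grch38), ("hg19", hg19)):
    counts = parse_summary(text)
    total = sum(counts.values())
    print(f"{name}: {counts['Assigned']} / {total} = {assigned_fraction(counts):.1%}")
```

The recomputed totals (419853 and 834179 fragments) and rates (73.1% and 70.3%) agree with the "Total fragments" and "Successfully assigned fragments" lines in the two logs.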
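Never mix and match annotation files! Always use the annotation that belongs to the genome build you aligned against.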
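One inexpensive guard against a mismatched pair is to compare sequence names between the genome FASTA and the annotation GTF before counting. This is only a sketch under the assumption that a naming mismatch (for example UCSC-style `chr1` versus Ensembl-style `1`) is one visible symptom of mixing sources; matching names alone do not prove the coordinates agree. The helper names and the toy records are made up for illustration.

```python
def fasta_contigs(fasta_lines):
    """Collect sequence names from FASTA header lines ('>name description')."""
    return {line[1:].split()[0] for line in fasta_lines
            if line.startswith(">") and len(line.strip()) > 1}


def gtf_contigs(gtf_lines):
    """Collect the first column of every non-comment GTF line."""
    return {line.split("\t", 1)[0] for line in gtf_lines
            if line.strip() and not line.startswith("#")}


def missing_from_genome(fasta_lines, gtf_lines):
    """Sequence names used by the annotation but absent from the genome."""
    return sorted(gtf_contigs(gtf_lines) - fasta_contigs(fasta_lines))


# Toy records: a UCSC-style genome paired with an Ensembl-style annotation.
toy_fasta = [">chr1", "ACGT", ">chr2", "GGCC"]
toy_gtf = [
    "#!genome-build example",
    "1\tsource\tgene\t100\t200\t.\t+\t.\tgene_id \"g1\";",
    "2\tsource\tgene\t300\t400\t.\t-\t.\tgene_id \"g2\";",
]
print(missing_from_genome(toy_fasta, toy_gtf))
```

A non-empty result means reads aligned to those genome sequences can never be assigned to the annotation's features, which is worth catching before interpreting expression counts.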