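Bioinformatics Algorithms in Practice

I have been working on 16S rRNA data recently and found an excellent opportunity to practice algorithms.

See the paper: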

A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy.

The paper uses a Bayesian model and calls BLAST and MUSCLE to perform taxonomy assignment on OTUs.

The source code is worth reading; it is very simple.

Bayesian-based LCA taxonomic classification method
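
To make the "LCA" part of the name concrete, here is a minimal sketch of a plain lowest-common-ancestor assignment over the taxonomic lineages of a set of hits. It is not the paper's method: the Bayesian weighting and the BLAST and MUSCLE steps are left out, and the function name and the example lineages are hypothetical.

```python
from typing import List, Sequence


def lowest_common_ancestor(lineages: Sequence[Sequence[str]]) -> List[str]:
    """Return the longest rank-by-rank prefix shared by every lineage."""
    if not lineages:
        return []
    common: List[str] = []
    # zip(*lineages) walks the ranks in parallel (kingdom, phylum, ...)
    # and stops at the shortest lineage.
    for ranks in zip(*lineages):
        if all(rank == ranks[0] for rank in ranks):
            common.append(ranks[0])
        else:
            break  # first disagreement: the LCA is the prefix so far
    return common


# Hypothetical lineages of three database hits for one OTU (illustration only).
hits = [
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Lactobacillaceae"],
    ["Bacteria", "Firmicutes", "Bacilli", "Lactobacillales", "Streptococcaceae"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae"],
]
lca = lowest_common_ancestor(hits)
print(";".join(lca))
```

A plain LCA treats every hit equally, so a single distant hit pulls the assignment up to a higher rank. Weighting hits by how well they match is one way to soften that.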

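If you can thoroughly understand the algorithm in this paper and follow its source code, you have basically reached an entry-level command of bioinformatics algorithms.

Who knows, one day you may publish an algorithm paper of your own!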


BLAST

Query id
Subject id
% identity
alignment length
mismatches
gap openings
q. start
q. end
s. start
s. end
e-value
bit score

These terms must be understood thoroughly!
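
The twelve fields above match the columns of BLAST's tab-delimited output, so one way to get familiar with them is to parse a hit line into named fields. The sketch below assumes one hit per line with the columns in the order listed. The Python field names and the sample line are made up for illustration.

```python
from typing import NamedTuple


class BlastHit(NamedTuple):
    query_id: str
    subject_id: str
    pct_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


# One converter per column, in the same order as the fields above.
CONVERTERS = (str, str, float, int, int, int, int, int, int, int, float, float)


def parse_hit(line: str) -> BlastHit:
    """Parse one tab-delimited BLAST hit line into a BlastHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONVERTERS):
        raise ValueError(f"expected {len(CONVERTERS)} columns, got {len(fields)}")
    return BlastHit(*(convert(value) for convert, value in zip(CONVERTERS, fields)))


# Hypothetical hit line, not real BLAST output.
sample = "OTU_1\tref_0042\t98.50\t400\t6\t0\t1\t400\t12\t411\t1e-180\t710.0"
hit = parse_hit(sample)
print(hit.subject_id, hit.pct_identity, hit.evalue)
```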

Sequence identity is the number of characters that match exactly between two sequences. Gaps are not counted, and the measure is relative to the shorter of the two sequences. As a result, sequence identity is not transitive: if A is 100% identical to B and B is 100% identical to C, A is not necessarily 100% identical to C:

A: AAGGCTT
B: AAGGC
C: AAGGCAT

Here identity(A,B) = 100% (5 identical nucleotides / min(length(A), length(B)) = 5/5).
identity(B,C) = 100%, but identity(A,C) = 86% (6 identical nucleotides / 7).
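
The numbers in this example can be reproduced with a few lines of code. This is a minimal sketch, assuming the two sequences are compared position by position from their first character, with no gaps, and the match count is divided by the length of the shorter sequence. The function name is hypothetical.

```python
def identity(a: str, b: str) -> float:
    """Fraction of matching positions, relative to the shorter sequence."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        raise ValueError("sequences must not be empty")
    # zip stops at the end of the shorter sequence, so overhanging
    # characters of the longer one are ignored (no gaps are scored).
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


A, B, C = "AAGGCTT", "AAGGC", "AAGGCAT"
for name, (s, t) in {"A,B": (A, B), "B,C": (B, C), "A,C": (A, C)}.items():
    print(f"identity({name}) = {identity(s, t):.0%}")
```

A is fully identical to B, and B to C, yet A and C differ at one position, which is the non-transitivity described above.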

Similarity is the degree of resemblance between two sequences when they are compared. It depends on their identity and shows the extent to which residues are aligned. Similar sequences tend to have similar properties.
Sequence similarity is first of all a general description of a relationship, but it is common practice to define it as an optimal matching problem (for sequence alignments, or unless defined otherwise). The optimal matching algorithm finds the minimal number of edit operations (insertions, deletions, and substitutions) needed to turn one sequence into the other. On this basis, the percentage similarities of the example sequences above are sim(A,B) = 60%, sim(B,C) = 60%, and sim(A,C) = 86%. These figures are consistent with 1 - d / min(length), where d is the edit distance; for example, sim(A,B) = 1 - 2/5 = 60%.
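
The similarity percentages for A, B, and C can be checked with a standard edit-distance (Levenshtein) computation. The normalization used here, 1 - d / min(length), is an assumption chosen because it reproduces the quoted figures; the text does not state the formula, and other normalizations are also in use. Function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimal number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # match or substitute
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - edit distance / length of the shorter sequence."""
    return 1 - edit_distance(a, b) / min(len(a), len(b))


A, B, C = "AAGGCTT", "AAGGC", "AAGGCAT"
for name, (s, t) in {"A,B": (A, B), "B,C": (B, C), "A,C": (A, C)}.items():
    print(f"sim({name}) = {similarity(s, t):.0%}")
```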

BLAST Glossary

The bit score gives an indication of how good the alignment is; the higher the score, the better the alignment. In general terms, this score is calculated from a formula that takes into account the alignment of similar or identical residues, as well as any gaps introduced to align the sequences.
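
For reference, the usual Karlin-Altschul relationship between a raw alignment score S and the bit score S' is S' = (lambda * S - ln K) / ln 2, and the expected number of chance hits with at least that score is roughly E = m * n * 2^(-S'). The sketch below applies these two formulas. The values of lambda and K depend on the scoring scheme, so the numbers used here are placeholders, not BLAST defaults, and plain sequence lengths stand in for the effective lengths that BLAST uses.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(bits: float, query_len: int, db_len: int) -> float:
    """Approximate E-value: E = m * n * 2^(-S')."""
    return query_len * db_len * 2.0 ** (-bits)


# Placeholder parameters for illustration only.
lam, k = 1.0, 0.5
bits = bit_score(raw_score=50, lam=lam, k=k)
print(f"bit score = {bits:.1f}")
print(f"E-value   = {expect_value(bits, query_len=400, db_len=1_000_000):.2e}")
```

Because the bit score is already normalized by lambda and K, it can be compared across searches that use different scoring schemes, whereas the E-value also depends on the size of the search space.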

Posted by Life·Intelligence