biopython之成对序列比对

成对序列比对

Bio.Align.PairwiseAligner

请注意，Bio.pairwise2在1.80版中被弃用。作为替代，请考虑使用Bio.Align.PairwiseAligner。

成对序列比对是通过优化两个序列之间的相似性得分来将它们彼此比对的过程。Bio.Align模块包含PairwiseAligner类，用于使用Needleman-Wunsch、Smith-Waterman、Gotoh(three-state)和Waterman-Smith-Beyer全局和局部成对比对算法进行全局和局部比对，并具有许多更改比对参数的选项。

说明：
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec127

def count_mismatches(s1, s2):
    # return sum(1 for a, b in zip(s1, s2) if a != b and (a != '-' and b != '-'))
    return sum(1 for a, b in zip(s1, s2) if a != b)

# 导入模块
from Bio import Align

# To generate pairwise alignments, first create a PairwiseAligner object:
aligner = Align.PairwiseAligner(match_score=1.0, mismatch_score=-1.0, gap_score=-1.0, extend_gap_score=-1.0)
# 设置为全局比对，局部比对则是 aligner.mode = "local"
aligner.mode = "global"
seqA = 'TCAGTCG'
seqB = 'TCAGTCG'
alignments = aligner.align(seqA, seqB)
for alignment in alignments:
    print(alignment)
    print(alignment.score) #得分
    print(alignment.query)
    print(alignment.target)
    # print(alignment.aligned)
    # print(format_alignment(alignment))
    mismatches = count_mismatches(alignment.query, alignment.target) #计算mismatch的个数
    print(mismatches)

"""输出：
TCAGTCG
|||||||
TCAGTCG

7.0
TCAGTCG
TCAGTCG
0
"""

Bio.pairwise2

请注意，Bio.pairwise2在1.80版中被弃用。作为替代，请考虑使用Bio.Align.PairwiseAligner。

说明：
https://biopython.org/docs/1.79/api/Bio.pairwise2.html
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec176

此模块中的对齐函数的名称遵循约定

global + XX
local + XX

XX是2个字符的代码，表示它所采用的参数。

第一个字符表示匹配（和不匹配）的参数，第二个字符表示间隙罚分的参数。
匹配参数是：

x 没有参数。相同的字符得分为1，否则为0。
m 匹配分数是相同字符的分数，否则不匹配得分。
d 字典返回任意一对字符的分数。
c 回调函数返回分数。

gap 惩罚参数是：

x 无gap罚款。
s 同样的开放和延长两个序列的gap罚分。
d 序列具有不同的开放和延长间隙罚分。
c 回调函数返回间隙罚分。

示例1：

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seqA='TCAGTCG'
seqB='TCAGTCG'
alignments = pairwise2.align.globalxx(seqA, seqB)
for alignment in alignments:
    print(alignment)
    print(format_alignment(*alignment))
    print(alignment[2]) #比对得分
    # 第一个是query序列，第二个是target序列，第三列是得分，后面两个是比对的起始和终止位置

示例2：
做全局比对。相同的字符给出1个点，每个不相同的字符扣除1个点，打开间隙时扣除0.5分，并且在延长时扣除0.1分。

from Bio import pairwise2
from Bio.pairwise2 import format_alignment

seqA='TCAGTCG'
seqB='TCAGTCG'
alignments = pairwise2.align.globalms(seqA, seqB, match=1, mismatch=-1, open=-.5, extend=-.1)
for alignment in alignments:
    print(alignment)
    print(format_alignment(*alignment))
    print(alignment[2]) #比对得分
    # 第一个是query序列，第二个是target序列，第三列是得分，后面两个是比对的起始和终止位置

示例3：
对齐函数还可以使用已包含在Biopython中的已知矩阵。

from Bio import pairwise2
from Bio.Align import substitution_matrices
matrix = substitution_matrices.load("BLOSUM62")
for a in pairwise2.align.globaldx("KEVLA", "EVL", matrix):
    print(format_alignment(*a))

示例4：
计算错配数。

from Bio import pairwise2

def count_mismatches(s1, s2):
    # return sum(1 for a, b in zip(s1, s2) if a != b and (a != '-' and b != '-'))
    return sum(1 for a, b in zip(s1, s2) if a != b)

bc = "TCAGTCG"
seq1 = "TCAGTCG"
alignments = pairwise2.align.globalxx(bc, seq1)
for alignment in alignments:
    # 计算错配数
    mismatches = count_mismatches(
        alignment[0], alignment[1]
    )
    print(alignment)
    print(format_alignment(*alignment))
    print(alignment[2])
    print(mismatches)

示例5：
获得两两序列的相似比（这里用的是全局比对）

def get_similarity_pct(query, target):
    # 全局比对，相同的残基就给1分，不同和gap不扣分
    alignments = pairwise2.align.globalxx(query, target)
    seq_length = min(len(query), len(target))
    matches = alignments[0][2]
    percent_match = (matches / seq_length) * 100
    return percent_match