reference:https://davetang.org/wiki/tiki-index.php?page=SAM
@SQ SN:contig1 LN:9401 (序列ID及长度)
参考序列名,这些参考序列决定了比对结果sort的顺序,SN是参考序列名;LN是参考序列长度;每个参考序列为一行。
@PG ID:bowtie2 PN:bowtie2 VN:2.0.0-beta7 (比对所使用的软件及版本)
这里的ID是bwa,PN是bwa,VN是0.7.12-r1039版本。CL可以认为是运行程序
K00133:143:H5LL5BBXX:8:1101:1194:1103 77 * 0 0 * * 0 0A NCCATCCATAATTTATAGGGAAGTGTTAAAATGCGAATAATATCCTTTTCTTTTCCTGGGATTCGTTCAACATTTCCATTTCCATTTCCCTCCGAGTT GATTTATGGGAATTCTCAGCAGCCTTGTCCATAATTCATGAACAATCCCTG A#AFFAFFJJJJJJJJJJJJJJ7FJJJJAJJJJJJJJJJFJJ JJJJJJJJJJJJJAJJJJJJJJJFJAJJJJJFJJFJJJJJJJJJJJJFFFJJJFA7AA7FFJAJ<AJ<AFF-FFFJFJFJJJF<AFFAFAFJJFAJJJ FA<<<AFJJA AS:i:0 XS:i:0 K00133:143:H5LL5BBXX:8:1101:1194:1103 141 * 0 0 * * 0 0N AAGGAGAAGTTCCCATAGCAACAGCACCATTTTTTCCAGGGAAAAGGAGGAAAGAATCAAGGATGCGAAAGAGGGAATCAAAGCAGCTTAAGGGATAT AAAAAAAACAGGGATTGTTCATGAATTATGGACAAGGCAGCTGAGAATTCC #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJA-<7-7 7A<7AJJF<AJJJJJF7JFAJF7AJJFJJJJJJJFF-AFJFJJJJJ--FFJJJJJJAAJFF<AJJJJJF777-----<-7F7-<--<<<<<7F<--7) --A7FF--<7 AS:i:0 XS:i:0 K00133:143:H5LL5BBXX:8:1101:1255:1103 77 * 0 0 * * 0 0T NTTTGAGTGCCTTCTAATTTTTTTTCTCCTGTATCTAAATTGCATGAGTATACTCTGCTACATTTAATCTTCAGGTAAAACAAAACTTTCTTTTATCA TATTTAGGAAGTATCACTAATCATGAATTAAAATTAAATGTGTATTGTCAG A#AFAJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJFJJJJJJ FJJJFJJJJJJJJFJFJJJJJJJJJJJJJAJJFJAAJ77-AJFJFJJJJJFJJJJJJJFJJJFJF7AFFJAFFJFFA<FFAFJJAFAJFJJJJJJF-A FFJF77-A-- AS:i:0 XS:i:0 K00133:143:H5LL5BBXX:8:1101:1255:1103 141 * 0 0 * * 0 0N CTGGCAGGCACAGATGCCCACAGTACACACACTGATATTTATAAAGCACTTCACAAAAAGTCACAAAAAAACCCCCAAAAACGAACCAAACCCAAAAA CCCCACCAAGCAAGCTGACAATACACATTTAATTTTAATTCATGATTAGTG #AAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJ JFJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJFFJJJJJ<JFJFJFJJFJ<JJJJJ<FJJJJJFJJJJF7F<FJFJ7AFAJJ7FFFAF7F-AAF<7 JFA<-77<-< AS:i:0 XS:i:0
$1:QNAME( 比对的序列名称,即一条测序reads的名称 ):
K00133:143:H5LL5BBXX:8:1101:1194:1103
$2:FLAG(表明比对类型:paring,strand,mate strand等):
可以使用samtools查到flag信息
$ samtools flags 77 0x4d 77 PAIRED,UNMAP,MUNMAP,READ1 $ samtools flags 141 0x8d 141 PAIRED,UNMAP,MUNMAP,READ2
$3:RNAME(表示read比对的那条序列的序列名称):
情况一:这条read有比对上的序列,则名称与头部的@SQ相对应,eg:
$ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |les|awk '{if($3!="*"){print $0}}' @SQ SN:scaffold332 LN:2588 @SQ SN:scaffold322 LN:8291 @SQ SN:scaffold342 LN:24194 @SQ SN:scaffold191 LN:43246 @SQ SN:scaffold1157 LN:21100 @PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz K00133:143:H5LL5BBXX:8:1101:14245:1138 69 scaffold191 6389 0 * = 6389 0 CNTCTTTTCAACTTTCCTACCTGGGACACCAAAAGAAGTGAATTTGAAATGCTGGAATACCCATTTCTTTGTAATATACTACCAGGCAATAATTTCTTCACCAGTGTGTGTAAGCCTGTAGCACTGCCTTGTGACCCACGGGCAGGACCC A#AFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJ MC:Z:47S40M63S AS:i:0 XS:i:0 K00133:143:H5LL5BBXX:8:1101:14245:1138 137 scaffold191 6389 60 47S40M63S = 6389 0 NCAAGCTGGGTGGGAGTGTTGATCTGCTGGAGGGCAGGAGGGCTCTGCAGAGGGATCTGGACAGGCTGGATCGATGGGCTGAGGCCAGTTGTATGAGGTTTAATAAGGTGAAATGCTGGGTCCTGCCCGTGGGTCACAAGGCAGTGCTAC #AAFFJJJJJAJJJJJ<JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJFJJJJJJJJJJJJAJJJJFJJFFFJFJJJJJJ-<FJJJJJJJJJJJJFFFJAFJJJJJJJ7FJ<JJJFJJJJJJJJJJJJJ<FF NM:i:1 MD:Z:26C13 AS:i:35 XS:i:0
情况二:这条read没有比对上的序列,则名称是“*”,且这一行的第四,五,八,九 列是“0”,第六,七列与该列是相同的表示方法
$4:POS(表示read比对到RNAME这条序列的最左边的位置)
noted:
(1)无论该read是正向比对或是反向比对到该序列,比对结果均是正向序列或反向序列最左端的比对位置
(2)如果该read能够完全比对到这条序列,则CIGAR string为M
$5:MAPQ(表示为mapping的质量值)
(1)255:mapping值不可用,
(2)0:通常对于unmapped read
(3)一般结果是这一列的数值是从0到60,且0和60这两个数字出现次数最多
(4)第五列为60表示mapping率最高,
$6:CIGARCIGAR,简要比对信息表达式(Compact Idiosyncratic Gapped Alignment Report),其以参考序列为基础
M:表示read可mapping到第三列的序列上,则read的碱基序列与第三列的序列碱基相同,表示正常的mapping结果,M表示完全匹配,但是无论reads与序列的正确匹配或是错误匹配该位置都显示
I:表示read的碱基序列相对于第三列的RNAME序列,有碱基的插入
D:表示read的碱基序列相对于第三列的RNAME序列,有碱基的删除
N:read 中有gap
P:read 和reference sequence都有gap
S:read 与reference没有mapping上的区域,但是保留了该mismatchs
H:read 与reference没有mapping上的区域,但是没有保留该mismatchs
=:正确匹配到序列上
X:错误匹配到序列上
set ‘*’ if unavailable
而H只出现在一条read的前端或末端,但不会出现在中间,S一般会和H成对出现,当有H出现时,一定会有一个与之对应的S出现
例如:
162M89S
162H89M
149M102S
149H102M
40S211M
20M1D20M211H
S可以单独出现,而H必须有与之对应的S出现时才可能出现,不可在相同第一列的情况下单独出现
N:如果是mRNA-to-genome,N出现的位置代表内含子,其它比对形式出现N时则没有具体解释
M/I/S/=/X:这些数值的加和等于第10列SEQ的长度
如图:
$7:MRNM
情况一:这条reads第二次比对的位置,
在利用bwa mem产生sam文件时,如果该列是“=”而第3列RNAME不是“*”,则表示该reads比对到第3列显示序列名的序列上,而没有比对到其他位置,
在利用bwa aln及bwa sampe比对生成的sam文件,如果和上述情况相同,则第7列为“=”,上述情况均表示该reads只比对到这一个位置
情况二:如果这条reads没有匹配上的序列,第3列RNAME和第7列MRNM都为“*”
情况三:如果这条reads匹配两个序列,则第一个序列的名称出现在第3列,而第二个序列的名称出现在第7列
$8:MPOS:该列表示与该reads对应的mate pair reads的比对位置,
如果这对pair-end reads比对到同一条reference序列上,在sam文件中reads的id出现2次,Read1比对的第4列等于Read2比对的第8列。同样Read1比对的第8列等于Read2比对的第4列。例如:
第1列(Read id)····第4列(Read1比对位置)····第8列(mate-pair reads比对位置)
22699:1759····124057649····124057667
22699:1759····124057667····124057649
相同的reads id一个来自Read1文件,一个来自Read2文件,第4列和第8列是对应的
$ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"\t"$4"\t"$8}'|tail -2 K00133:143:H5LL5BBXX:8:1101:14245:1138 6389 6389 K00133:143:H5LL5BBXX:8:1101:14245:1138 6389 6389
$9:TLEN:insert length+mapping base
如果R1端的read和R2端的read能够mapping到同一条Reference序列上(即第三列RNAME相同),则该列的值表示第8列减去第4列加上第6列的M值,即为R1端和R2端相同id的reads其第九列值相同,但该值为一正一负,R1文件的reads和R2文件的reads,相同id的reads要相对来看。在进行该第列值的计算时,如果取第6列的数值,一定要取出现M的值,S或H的值不能取。It is set as 0 for single-segment template or when the information is unavailable.
$ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"\t"$9}'|tail -2 K00133:143:H5LL5BBXX:8:1101:14245:1138 0 K00133:143:H5LL5BBXX:8:1101:14245:1138 0
$10:SEQ,(序列片段的序列信息,如果不存储此类信息,此处为’*‘,注意CIGAR中M/I/S/=/X对应数字的和要等于序列长度);
$ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"\t"$10}'|tail -2 K00133:143:H5LL5BBXX:8:1101:14245:1138 CNTCTTTTCAACTTTCCTACCTGGGACACCAAAAGAAGTGAATTTGAAATGCTGGAATACCCATTTCTTTGTAATATACTACCAGGCAATAATTTCTTCACCAGTGTGTGTAAGCCTGTAGCACTGCCTTGTGACCCACGGGCAGGACCC K00133:143:H5LL5BBXX:8:1101:14245:1138 NCAAGCTGGGTGGGAGTGTTGATCTGCTGGAGGGCAGGAGGGCTCTGCAGAGGGATCTGGACAGGCTGGATCGATGGGCTGAGGCCAGTTGTATGAGGTTTAATAAGGTGAAATGCTGGGTCCTGCCCGTGGGTCACAAGGCAGTGCTAC
$11:QUAL,(ASCII,read质量的ASCII编码。)
$ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log "$13"\t"$14"\t"$15"\t"$16}'|tail -2 K00133:143:H5LL5BBXX:8:1101:14245:1138 MC:Z:47S40M63S AS:i:0 XS:i:0 K00133:143:H5LL5BBXX:8:1101:14245:1138 NM:i:1 MD:Z:26C13 AS:i:35 XS:i:0
$12之后是自选option
$ bwa mem -t 4 -M ../02.index/t/t1.fa ../01.grep/t/o3.gz ../01.grep/t/o4.gz 2>bwa.log |awk '{if($3!="*"){print $0}}'|awk '{print $1"\t"$12"\t"$13"\t"$14"\t"$15"\t"$16}'|tail -2 K00133:143:H5LL5BBXX:8:1101:14245:1138 MC:Z:47S40M63S AS:i:0 XS:i:0 K00133:143:H5LL5BBXX:8:1101:14245:1138 NM:i:1 MD:Z:26C13 AS:i:35 XS:i:0