[笔记] cluster是否中标
完整代码:
use strict; use warnings; my ($tss, $cluster, $out) = @ARGV; die $! unless (@ARGV ==3); open OUT,">$out"; open TSS, $tss; my (%ftss, %rtss); while (<TSS>){ chomp; my ($s, $e, $str) = (split /\s+/, $_)[1,2,3]; if ($str eq "+"){ $ftss{$e} = $s; } elsif ($str eq "-"){ $rtss{$s} = $e; } } my (@ftss, @rtss); @ftss = sort {$b <=> $a} keys %ftss; @rtss = sort {$b <=> $a} keys %rtss; my ($sum, $fsum, $rsum, $ftrue1,$ftrue2, $fother, $rtrue1, $rtrue2, $rother) = (0,0,0,0,0,0,0,0,0); open CLST, $cluster; while (<CLST>){ $sum ++; chomp; my ($fmk1,$fmk2,$rmk1,$rmk2) = (0,0,0,0); my ($cs, $ce, $v, $strand) = (split /\t/, $_)[1,2,3,4]; if ($strand eq "+"){ $fsum ++; foreach my $tssend (@ftss){ my $fdis = 0; $fdis = $cs - $tssend; next if ($fdis < 0 or $fdis >100); if ($fdis <= 5 ){ #$ftrue1 ++; $fmk1 = 1; print OUT "chr\t$cs\t$ce\t$v\t$strand\t$ftss{$tssend}\t$tssend\tknown++\n"; last; } elsif ($fdis <= 10){ #$ftrue2 ++; $fmk2 = 1; print OUT "chr\t$cs\t$ce\t$v\t$strand\t$ftss{$tssend}\t$tssend\tknown+\n"; last; } #else { #$fother ++; #print "chr\t$cs\t$ce\t$v\t$strand\t-\t-\tnovel\n"; #last; #} } } elsif ($strand eq "-"){ $rsum ++; foreach my $tsstart (@rtss){ my $rdis = 0; $rdis = $tsstart - $ce; next if ($rdis <0 or $rdis >100); # next if ($rdis <0 or $rdis >100); if ($rdis <= 5){ #$rtrue1 ++; $rmk1 = -1; print OUT"chr\t$cs\t$ce\t$v\t$strand\t$tsstart\t$rtss{$tsstart}\tknown++\n"; last; } elsif ($rdis <= 10){ #$rtrue2 ++; $rmk2 = -1; print OUT"chr\t$cs\t$ce\t$v\t$strand\t$tsstart\t$rtss{$tsstart}\tknown+\n"; last; } #else{ #$rother ++; #print "chr\t$cs\t$ce\t$v\t$strand\t-\t-\tnovel\n"; #} } } $ftrue1 ++ if ($fmk1 == 1); $ftrue2 ++ if ($fmk2 == 1); if ($fmk1 == 0 and $fmk2 == 0 and $strand eq "+"){ $fother ++; print OUT "chr\t$cs\t$ce\t$v\t$strand\t-\t-\tnovel\n"; } $rtrue1 ++ if ($rmk1 == -1); $rtrue2 ++ if ($rmk2 == -1); if ($rmk1 == 0 and$rmk2 == 0 and $strand eq "-"){ $rother ++ ; print OUT "chr\t$cs\t$ce\t$v\t$strand\t-\t-\tnovel\n"; } } print "$sum, $fsum, $rsum, $ftrue1,$ftrue2, $fother, $rtrue1, $rtrue2, $rother";
【目的】一个文件记录着cluster,一个文件记录着如Transcription Start Site信息。希望知道每一个cluster上游100nt以内是否潜在地对应一个(当然实际搜索时会有多个)TSS。
【方法】因为正负链5‘->3‘方向顺序考究不同:
对于正链cluster,0< cluster_start - TSS_end < 100nt;
对于负链cluster,0< TSS_start - cluster_end < 100nt;
【具体】
首先各制备一个正负TSS哈希,记录着起始终止信息。注意正负的不同:
if ($str eq "+"){ $ftss{$e} = $s; } elsif ($str eq "-"){ $rtss{$s} = $e; }
然后,逐个读取cluster,针对每个cluster,用TSS列表去审视扫描分类。我们尽可能希望能更准确快速找到TSS,所以考虑如下:
1. 一旦找到符合要求的TSS,就跳出本轮审视。开启下一轮,对下一个cluster审视。因而在每一个if里,都有last,如:
if ($rdis <= 5){ #$rtrue1 ++; $rmk1 = -1; print OUT"chr\t$cs\t$ce\t$v\t$strand\t$tsstart\t$rtss{$tsstart}\tknown++\n"; last; }
有无last的效果的决然不同的。可想而知,如果本cluster已经找到一个TSS,它们的距离<=5,确实可以输出;但是如果没有last,还有可能继续查找,得到一个新的TSS,使得他们的距离<=100。很明显,<=5,才是要查找的。
2. 即使有last,因为提前终止扫描,那么会忽略掉更可靠的TSS。因为第一条道理,一样可以继续找,得到一个新的TSS,同样距离也<=5。
所以TSS起始/终止数组顺序必须是有序的!而且必须是逆序来查找。就像两头开凿山洞一样。
cluster读取是小->大,而TSS表格翻阅则是大->小,这样就能自动找到最靠近cluster的TSS,即最靠谱的。
所以这里才会有这么一步:
@ftss = sort {$b <=> $a} keys %ftss; @rtss = sort {$b <=> $a} keys %rtss;
3. 至于这些
$fmk1,$fmk2,$rmk1,$rmk2
则是为了控制输出。
--end