Classified counting by several combined conditions

The task is to compute statistics over a text file like this:

1   @FL   ID=rs2151655; status=known; support=C30; mut_type=Hom; region=Intergenic; function=5-UTR; mut_name=c.*2978C->T/T;flank5=CTGCC.CCCAA;flank5=CAGCC.GTGGT;
2   @DL  ID=snp1;  support=A249; mut_type=Hom; region=Gene; NM_ID=NM_005957; flank5=cacca.ttgtt;function=5-UTR; mut_name=c.*2978C->T/T; flank5=CTGCC.CCCAA;
3   @EL  ID=rs3737966; status=known; support=T184; mut_type=Hom; region=Gene; NM_ID=NM_005957; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*2978C->T/T; flank5=CTGCC.CCCAA;
4   @AL  ID=rs2077360; status=known; support=G164; mut_type=Hom; region=Intergenic; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*1858A->G/G; flank5=GTGCG.GGCTG;
5   @BL  ID=rs868014; status=known; support=G130; mut_type=Hom; region=Intergenic; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*1290A->G/G; flank5=TGCTC.GGCGG;
6   @CL  ID=snp1; status=novel; support=C95/71T; mut_type=Het; region=Gene;  NM_ID=NM_005957; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*1096T->C/T; flank5=CAGCT.GGACT;

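I want to classify the records by three fields of the third column at once: ID, mut_type and function. ID falls into two classes, values that match "^rs[0-9]*" and values that do not, but I do not know in advance how many distinct values the other two fields take. For every combination of mut_type and function I want to count both ID classes and print the counts. For example: with mut_type Het and function 3-UTR, N IDs match "^rs[0-9]*" and M do not; with mut_type Het and function 5-UTR, N1 match and M1 do not; with mut_type Hom and function 3-UTR, N2 match and M2 do not;
and so on...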

 

----------------------------------

 

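shell (GNU awk, because gensub is a gawk extension; assumes tab-separated columns):
awk '{
    # Pull the ID, mut_type and function values out of the attribute column
    x = gensub(/.*\tID=([^;]*).*/, "\\1", 1)
    y = gensub(/.*mut_type=([^;]*).*/, "\\1", 1)
    z = gensub(/.*function=([^;]*).*/, "\\1", 1)
    f[z] = 1                                          # every function value seen
    if (x ~ /^rs[0-9]*$/) { s++; a[z]++; b[z,y]++ }   # ID looks like rs<digits>
    else                  { c[z]++; d[z,y]++ }        # any other ID counts as novel
}
END {
    e[1] = "Het"; e[2] = "Hom"
    printf "%-10s%-10s%-10s%-10s\n", "", "Sum", "in_dbsnp", "novel"
    printf "%-10s%-10d%-10d%-10d\n", "", NR, s, NR - s
    # Loop over f, not a, so functions that only have novel IDs are printed too
    for (i in f) {
        printf "%-10s%-10d%-10d%-10d\n", i, a[i] + c[i], a[i], c[i]
        for (j = 1; j <= 2; j++)
            printf "%-10s%-10d%-10d%-10d\n", e[j], b[i,e[j]] + d[i,e[j]], b[i,e[j]], d[i,e[j]]
    }
}' test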
----------------------------------------------------------------
perl:
#!/usr/bin/perl
use 5.010;
open(FILE, "test") or die "cannot open test: $!";
while (<FILE>) {
        # Pull out the three attributes used for grouping
        $x = $1 if /\bID=([^;]*)/;
        $y = $1 if /\bmut_type=([^;]*)/;
        $z = $1 if /\bfunction=([^;]*)/;
        if ($x =~ /^rs[^;]+$/) {        # ID starts with rs
                $s1++;
                $h1{$z}++;
                $h2{$z}{$y}++;
        } else {                        # any other ID counts as novel
                $s2++;
                $h3{$z}++;
                $h4{$z}{$y}++;
        }
        $h5{$y}++;      # every mut_type seen
        $h6{$z}++;      # every function seen, including novel-only ones
}
close FILE;
END {
        say "\tSum\tin_dbsnp\tnovel";
        printf "\t%d\t%d\t%d\n",$s1+$s2,$s1,$s2;
        # Loop over %h6, not %h1, so functions with only novel IDs are printed too
        foreach $k1 (sort keys %h6) {
                printf "%s\t%d\t%d\t%d\n",$k1,$h1{$k1}+$h3{$k1},$h1{$k1},$h3{$k1};
                foreach $k2 (sort keys %h5) {
                        printf "%s\t%d\t%d\t%d\n",$k2,$h2{$k1}{$k2}+$h4{$k1}{$k2},$h2{$k1}{$k2},$h4{$k1}{$k2};
                }
        }
}
------------------------------------------------------------------------------
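
A flatter variant of the same grouping can also be sketched as follows. It is not the approach used in the two scripts above, just one possible alternative: it splits each line on separators instead of using gensub, so it should also run on non-GNU awk, and it prints one row per (function, mut_type, ID class) combination. The function name count_variants and the labels known/novel are made up for this illustration, and the ID test here is "rs followed by at least one digit".

```shell
# Hypothetical helper: count_variants FILE
# Prints "function<TAB>mut_type<TAB>class<TAB>count", one row per combination.
count_variants() {
    awk '{
        id = fn = mt = ""
        # Split on ";" (plus optional blanks) or on runs of blanks/tabs
        n = split($0, parts, /;[ \t]*|[ \t]+/)
        for (k = 1; k <= n; k++) {
            # Anchoring at ^ keeps NM_ID= and sub_region= from matching
            if (parts[k] ~ /^ID=/)       id = substr(parts[k], 4)
            if (parts[k] ~ /^mut_type=/) mt = substr(parts[k], 10)
            if (parts[k] ~ /^function=/) fn = substr(parts[k], 10)
        }
        cls = (id ~ /^rs[0-9]+$/) ? "known" : "novel"
        count[fn "\t" mt "\t" cls]++
    }
    END { for (key in count) print key "\t" count[key] }' "$1" | LC_ALL=C sort
}
```

Called as `count_variants test` on the six sample lines above, this should print four tab-separated rows: 3-UTR Het novel 1, 3-UTR Hom known 3, 5-UTR Hom known 1, 5-UTR Hom novel 1. Combinations that never occur are simply absent, so no list of mut_type values has to be hard-coded.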

 

posted on 2013-12-13 10:47  三川