Grouped counts over several combined conditions
The task is to compute statistics over a text file like this:
1 @FL ID=rs2151655; status=known; support=C30; mut_type=Hom; region=Intergenic; function=5-UTR; mut_name=c.*2978C->T/T;flank5=CTGCC.CCCAA;flank5=CAGCC.GTGGT;
2 @DL ID=snp1; support=A249; mut_type=Hom; region=Gene; NM_ID=NM_005957; flank5=cacca.ttgtt;function=5-UTR; mut_name=c.*2978C->T/T; flank5=CTGCC.CCCAA;
3 @EL ID=rs3737966; status=known; support=T184; mut_type=Hom; region=Gene; NM_ID=NM_005957; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*2978C->T/T; flank5=CTGCC.CCCAA;
4 @AL ID=rs2077360; status=known; support=G164; mut_type=Hom; region=Intergenic; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*1858A->G/G; flank5=GTGCG.GGCTG;
5 @BL ID=rs868014; status=known; support=G130; mut_type=Hom; region=Intergenic; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*1290A->G/G; flank5=TGCTC.GGCGG;
6 @CL ID=snp1; status=novel; support=C95/71T; mut_type=Het; region=Gene; NM_ID=NM_005957; strand_of_gene="-"; sub_region=3-UTR; function=3-UTR; mut_name=c.*1096T->C/T; flank5=CAGCT.GGACT;
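I want to classify the records by ID, mut_type and function together, all of which sit in the third column. An ID either matches "^rs[0-9]*" or it does not, but I do not know in advance how many distinct values mut_type and function take. For every combination of mut_type and function I want to count the two kinds of ID and print the result. For example: where mut_type is Het and function is 3-UTR, N records have an ID matching "^rs[0-9]*" and M do not; where mut_type is Het and function is 5-UTR, the counts are N1 and M1; where mut_type is Hom and function is 3-UTR, the counts are N2 and M2;
and so on for every other combination.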
----------------------------------
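shell (needs GNU awk, because gensub is a gawk extension):
awk '{
    # Pull the three fields out of the annotation column.
    # The greedy leading .* means the last occurrence on the line is kept.
    x = gensub(/.*[ \t]ID=([^;]*).*/, "\\1", 1)
    y = gensub(/.*mut_type=([^;]*).*/, "\\1", 1)
    z = gensub(/.*function=([^;]*).*/, "\\1", 1)
    f[z]; m[y]    # remember every function and every mut_type seen
    if (x ~ /^rs[0-9]*$/) { s++; a[z]++; b[z,y]++ }   # rs ID: already in dbSNP
    else                  {      c[z]++; d[z,y]++ }   # anything else: novel
}
END {
    printf "%-10s%-10s%-10s%-10s\n", "", "Sum", "in_dbSNP", "novel"
    printf "%-10s%-10d%-10d%-10d\n", "", NR, s, NR - s
    # Loop over all functions and mut_types that occurred, not only those with
    # an rs ID, so novel-only groups are reported as well.
    for (i in f) {
        printf "%-10s%-10d%-10d%-10d\n", i, a[i] + c[i], a[i], c[i]
        for (j in m)
            printf "%-10s%-10d%-10d%-10d\n", j, b[i,j] + d[i,j], b[i,j], d[i,j]
    }
}' test
----------------------------------------------------------------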
perl:
#!/usr/bin/perl
use 5.010;
open(FILE, "test") or die "cannot open test: $!";
while (<FILE>) {
    # Extract the first ID, mut_type and function on the line.
    # List assignment leaves the variable undefined when nothing matches,
    # so a value from the previous line is never reused.
    my ($x) = /\bID=([^;]*)/;
    my ($y) = /\bmut_type=([^;]*)/;
    my ($z) = /\bfunction=([^;]*)/;
    if ($x =~ /^rs[0-9]*$/) {    # rs ID: already in dbSNP
        $s1++;
        $h1{$z}++;
        $h2{$z}{$y}++;
    } else {                     # anything else: novel
        $s2++;
        $h3{$z}++;
        $h4{$z}{$y}++;
    }
    $h5{$y}++;    # every mut_type seen
    $h6{$z}++;    # every function seen, with or without an rs ID
}
close FILE;
END {
    say "\tSum\tin_dbSNP\tnovel";
    printf "\t%d\t%d\t%d\n", $s1 + $s2, $s1, $s2;
    foreach $k1 (sort keys %h6) {
        printf "%s\t%d\t%d\t%d\n", $k1, $h1{$k1} + $h3{$k1}, $h1{$k1}, $h3{$k1};
        foreach $k2 (sort keys %h5) {
            printf "%s\t%d\t%d\t%d\n", $k2, $h2{$k1}{$k2} + $h4{$k1}{$k2}, $h2{$k1}{$k2}, $h4{$k1}{$k2};
        }
    }
}
------------------------------------------------------------------------------
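To cross-check either script, the same grouping can be sketched in Python. This is only an illustration and not one of the original answers: the helper names field, tally and report are made up here, the output is one row per (function, mut_type) pair instead of the nested layout used above, and the two demo records are shortened versions of sample lines. It assumes one record per line and takes the first ID=, mut_type= and function= found on each line.

```python
import re
from collections import defaultdict

RS_ID = re.compile(r"^rs[0-9]*$")  # the "known ID" pattern from the question


def field(line, name):
    """Return the first value of `name=` on the line, or "" if it is absent."""
    # \b stops "ID=" from matching inside "NM_ID=".
    m = re.search(r"\b" + re.escape(name) + r"=([^;]*)", line)
    return m.group(1).strip() if m else ""


def tally(lines):
    """Count records per (function, mut_type) as [rs_count, non_rs_count]."""
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        key = (field(line, "function"), field(line, "mut_type"))
        slot = 0 if RS_ID.match(field(line, "ID")) else 1
        counts[key][slot] += 1
    return dict(counts)


def report(counts):
    """Format one tab-separated row per (function, mut_type) combination."""
    rows = ["function\tmut_type\trs\tnon_rs"]
    for (function, mut_type), (rs, non_rs) in sorted(counts.items()):
        rows.append(f"{function}\t{mut_type}\t{rs}\t{non_rs}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Shortened demo records; for the real file use: tally(open("test"))
    demo = [
        "3\t@EL\tID=rs3737966; mut_type=Hom; function=3-UTR;",
        "6\t@CL\tID=snp1; mut_type=Het; function=3-UTR;",
    ]
    print(report(tally(demo)))
```

Because the groups are discovered from the data, no mut_type or function value has to be listed in advance, which is the situation described in the question.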