reads去污染接头

对于reads:一段一段的文件,比如说下面这个文件,它是四行一段的,我只想匹配每一段第一行末尾是1还是2,匹配是1的话,匹配第二行是否以ATCG开头或者结尾,如果是输出到一个文件,不是输出到另一个文件。匹配是2的话,匹配第二行是否以GCTA开头或者结尾,如果是输出到一个文件,不是输出到另一个文件。

@FCC36J1ACXX:1:1101:1218:2181#GATTAGTG/1  
ATCGNATATTGTATAGTGATAGAGACTAATATGTACAGAAAACACATGGCTCAATGTAATCATAAAGGTAGGACATGCTTGTAAGAACTC
+
^\\\BQ\ceceggddfgggegdgbfgdgdgggfefdecfhhh`cegfhdfhhhehhhgfgffhdaffh\bdebgcdghegggggdbcddd
@FCC36J1ACXX:1:1101:1236:2181#GATTAGTG/1
GTACNGAAGACAGAGCTGGCAAACAGAAGTTTAAAAGCCCCATGAATACTTCCTGGTGTTCCTTTAACACAGCTGACTGGGTCCTATCG
+
b__eBQ`cgegggghhhiecgifgffhhidghihiihghiihihifdf[egffhihgghhfhiiiehhgggggdcbdedbc_bcccccbc
@FCC36J1ACXX:1:1101:1416:2128#GATTAGTG/2
GGCATGGCTAATGTAAAAGCAGCCCCAAAGATAATTGACACAGGAGGAGGCTTCATTTTAGAAGAGGAAGAAGAAGAAGAACAGAAAATT
+
bbbeeeeeggggggiiiiifhiiiiiiiiihiiiiiiiihiiiighidghbghiiiiiiiighiiiggggeeeeedddddcccccccccc
@FCC36J1ACXX:1:1101:1278:2132#GATTAGTG/2
ATGTCATCCCCATCAAGTTCAGCTTTTCTAAACACTTCCTTCACTCATTCATTTATTCATTCTTACGTGTATTCATTGTGAGAACAGCTA
+
bbbeceeeggfgggiiihifghdghhhgghfifhhghdfhhhihhiiii_ggfhgiihhihfhdfegfgaeffS

shell:
1 awk -vRS="@" -vfq="131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1" 'NR>1{if(($1~/2$/&&$2~/^ATCG|ATCG$/)||($1~/1$/&&$2~/^GCTA|GCTA$/))printf RS""$0 > fq"@in.fq_2";else printf RS""$0 > fq"@un.fq_2"}' 131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1.fq.gz@un.fq.out

PERL:

1 #!/usr/bI/perl
2 my @f = map { open my $f, '>', $_; $f } qw/1.txt 2.txt/;
3 my %h = qw/1 ATCG 2 GCTA/;
4 while (<>) {
5     my @l   = ( ~~<>,<>.<> );
6     my ($n) = /(.)$/;
7     my $i   = grep $_ eq $h{$n}, $l[0] =~ /^(.{4}).*(.{4})$/;
8     print { $f[ $i == 0 ] } $_, @l;
9 }

 

pe测序要求两条reads是位置对应的,我们需要删除一部分被污染,另一部分没污染保留下来的:

shell:

1 => cat ref.sh
2 grep @ 131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1@in.fq_2 >52.ref
3 grep @ 131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_2@in.fq_2 >>52.ref
4 perl ref.pl 52.ref 131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_2@un.fq_2 >131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_2.fq.result
5 perl ref.pl 52.ref 131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1@un.fq_2 >131206_I175_FCC36J1ACXX_L1_RHUMjdaXACDAAAPEI-52_1.fq.result

perl部分:

 1 => cat ref.pl
 2 #!/usr/bin/perl
 3 $ref=$ARGV[0];
 4 $fq=$ARGV[1];
 5 #$outdir=$ARGV[2];
 6 open IN1,"< $ref" or die"$!";
 7 open DATA,"< $fq" or die"$!";
 8 
 9 my %R = map { split /\// } <IN1>;
10 while (<DATA>) {
11     my ($k) = split /\//;
12     $R{$k}
13       ? ( <DATA>, <DATA>, <DATA> )
14       : print $_, <DATA>, <DATA>, <DATA>;
15 }

单独shell:

awk -F/ -vRS='@' 'NR==FNR{a[$1]=1;next}NF&&a[$1]{printf RS$0}' file1 file2

 

posted on 2013-12-24 12:07  三川  阅读(1043)  评论(0编辑  收藏  举报