FASTQ format
每个FASTQ文件中每个序列通常有四行信息:
1: 以 '@' 字符开头,后面紧接着的是序列标识符和可选字段的描述(类似FASTA title line).
2: 序列
3: 以 '+' 字符开头, 后面紧接着的是可选字段的描述性信息
4: 第二行序列的质量信息
Illumina sequence identifiers
@HWUSI-EAS100R:6:73:941:1973#0/1
sequence identifiers | description |
---|---|
HWUSI-EAS100R | the unique instrument name |
6 | flowcell lane |
73 | tile number within the flowcell lane |
941 | 'x'-coordinate of the cluster within the tile |
1973 | 'y'-coordinate of the cluster within the tile |
#0 | index number for a multiplexed sample (0 for no indexing) |
/1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
Versions of the Illumina pipeline since 1.4 appear to use #NNNNNN instead of #0 for the multiplex ID, where NNNNNN is the sequence of the multiplex tag.
With Casava 1.8 the format of the '@' line has changed:
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
sequence identifiers | description |
---|---|
EAS139 | the unique instrument name |
136 | the run id |
FC706VJ | the flowcell id |
2 | flowcell lane |
2104 | tile number within the flowcell lane |
15343 | 'x'-coordinate of the cluster within the tile |
197393 | 'y'-coordinate of the cluster within the tile |
1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
Y | Y if the read is filtered, N otherwise |
18 | 0 when none of the control bits are on, otherwise it is an even number(偶数) |
ATCACG | index sequence |
将FASTQ 转换为 FASTA 格式:
zcat input_file.fastq.gz | awk 'NR%4==1{printf ">%s\n", substr($0,2)}NR%4==2{print}' > output_file.fa
#printf 命令的语法:format-string 为格式控制字符串,arguments 为参数列表。
printf format-string [arguments...]
#substr(s,p) 返回字符串s中从p开始的后缀部分
#substr(s,p,n) 返回字符串s中从p开始长度为n的后缀部分。