fastq demultiplexing
最后merge一下
cat *_1_D7_1uM_1* index2/*_1_D7_1uM_1* > D7_1uM_1_1.fq cat *_1_D7_1uM_2* index2/*_1_D7_1uM_2* > D7_1uM_2_1.fq cat *_1_D7_DMSO_1* index2/*_1_D7_DMSO_1* > D7_DMSO_1_1.fq cat *_1_D7_DMSO_2* index2/*_1_D7_DMSO_2* > D7_DMSO_2_1.fq cat *_1_D2_DMSO_1* index2/*_1_D2_DMSO_1* > D2_DMSO_1_1.fq cat *_1_D2_DMSO_2* index2/*_1_D2_DMSO_2* > D2_DMSO_2_1.fq cat *_1_D2_0.1uM_1* index2/*_1_D2_0.1uM_1* > D2_0.1uM_1_1.fq cat *_1_D2_0.1uM_2* index2/*_1_D2_0.1uM_2* > D2_0.1uM_2_1.fq cat *_1_Plasmid* index2/*_1_Plasmid* > Plasmid_1.fq cat *_2_D7_1uM_1* index2/*_2_D7_1uM_1* > D7_1uM_1_2.fq cat *_2_D7_1uM_2* index2/*_2_D7_1uM_2* > D7_1uM_2_2.fq cat *_2_D7_DMSO_1* index2/*_2_D7_DMSO_1* > D7_DMSO_1_2.fq cat *_2_D7_DMSO_2* index2/*_2_D7_DMSO_2* > D7_DMSO_2_2.fq cat *_2_D2_DMSO_1* index2/*_2_D2_DMSO_1* > D2_DMSO_1_2.fq cat *_2_D2_DMSO_2* index2/*_2_D2_DMSO_2* > D2_DMSO_2_2.fq cat *_2_D2_0.1uM_1* index2/*_2_D2_0.1uM_1* > D2_0.1uM_1_2.fq cat *_2_D2_0.1uM_2* index2/*_2_D2_0.1uM_2* > D2_0.1uM_2_2.fq cat *_2_Plasmid* index2/*_2_Plasmid* > Plasmid_2.fq
2024年02月01日
Demultiplex真的能同时处理三个fastq,所以只需要分别对index1/2提取就行。
demultiplex demux -r index.csv index1.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq Undetermined_Undetermined_22GFCYLT3_L4_2.fq demultiplex demux -r ../index.csv ../index2.fq ../Undetermined_Undetermined_22GFCYLT3_L4_1.fq ../Undetermined_Undetermined_22GFCYLT3_L4_2.fq
发邮件给novogene,bcl文件已经删了,不得不自己解决。
万万没想到,最终我还是东拼西凑的解决了这个问题。
方法就是,用下面那个biostars的代码,把dual index分别提取成单独的index1/2 fastq序列,用awk可以很快实现,边读边写。
然后就用之前的工具demultiplex来处理,主要允许mismatch。
Demultiplex any number of files given a list of barcodes.【不确定能不能同时输入三个文件,因为是根据第一个文件来寻找匹配】
cat Undetermined_Undetermined_22GFCYLT3_L4_1.fq | awk '{if( (NR%4)==1){ print $0; print substr($2,length($2)-16,8); print "+"; print "FFFFFFFF"} }' > index1.fq cat Undetermined_Undetermined_22GFCYLT3_L4_1.fq | awk '{if( (NR%4)==1){ print $0; print substr($2,length($2)-7,8); print "+"; print "FFFFFFFF"} }' > index2.fq
demultiplex demux -r index.csv -m 2 -d index1.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq demultiplex demux -r index.csv -m 2 -d index1.fq Undetermined_Undetermined_22GFCYLT3_L4_2.fq demultiplex demux -r ../index.csv -m 2 -d ../index2.fq ../Undetermined_Undetermined_22GFCYLT3_L4_1.fq demultiplex demux -r ../index.csv -m 2 -d ../index2.fq ../Undetermined_Undetermined_22GFCYLT3_L4_2.fq # demultiplex match index.csv -m 2 -d index1.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq
一些无用的尝试
# trim意义不大 fastp -i Undetermined_Undetermined_22GFCYLT3_L4_1.fq -I Undetermined_Undetermined_22GFCYLT3_L4_2.fq -o filter_1.fq -O filter_2.fq conda install bioconda::pear pear -f filter_1.fq -r filter_2.fq -o merged.fq # 用不到 pear -v 0 -f merged.fq.unassembled.forward.fastq -r merged.fq.unassembled.reverse.fastq -o unassembled # 完全无用 # reformat.sh in1=merged.fq.unassembled.forward.fastq in2=merged.fq.unassembled.reverse.fastq out=unassembled.fq # 效果更差 demultiplex match new_index.csv pear.assembled.fq # 不work,因为是dual index demultiplex demux -m 2 -d Undetermined_Undetermined_22GFCYLT3_L4_1.fq Undetermined_Undetermined_22GFCYLT3_L4_2.fq # 用grep手动检查了,assemble之后反而grep不到那么多序列了,最直接的RC后拼接就这么难做到吗?不需要你assemble什么。 # 能看到它识别的是什么index demultiplex guess -o barcodes.tsv -f -t 5 -n 10000 Undetermined_Undetermined_22GFCYLT3_L4_1.fq demultiplex guess -o barcodes.tsv -t 10 -n 100000 Undetermined_Undetermined_22GFCYLT3_L4_1.fq
2024年01月31日
没事找事,本来测序商可以解决的问题,你在这折腾了好久,还没有简单的解法。
才发现,index都在header里,其中只有一个index是有用的
@LH00328:146:22GFCYLT3:4:1101:25854:2410 2:N:0:GACGAACT+TCGCCTTA @LH00328:146:22GFCYLT3:4:1101:37846:6143 2:N:0:TCGCCTTA+TCAAGGAC @LH00328:146:22GFCYLT3:4:1101:25308:13642 2:N:0:GACTAACT+TCGCCTTA @LH00328:146:22GFCYLT3:4:1102:8610:8995 2:N:0:GACGAACT+TCGCCTTA @LH00328:146:22GFCYLT3:4:1102:48534:15885 2:N:0:AACAAACC+TCGCCTTA @LH00328:146:22GFCYLT3:4:1102:47369:17615 2:N:0:TACGACCT+TCGCCTTA @LH00328:146:22GFCYLT3:4:1102:10588:19986 2:N:0:AACAACCT+TCGCCTTA @LH00328:146:22GFCYLT3:4:1102:9303:21092 2:N:0:TCCCAACT+TCGCCTTA
但新问题来了,这是dual index,只提供一个index是无法提取出read group的。
demultiplex必须接受AACAAACC+TCGCCTTA这样的index,那就很难处理了,这样的index就太多了。
grep '@' Undetermined_Undetermined_22GFCYLT3_L4_2.fq | grep TCGCCTTA | cut -d':' -f 10 > all.TCGCCTTA
FASTQ header (line 1) contains various information separated either by ':' or a space:
An example from a HiSeq FASTQ header:
@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG
@ - Each sequence identifier line starts with @.
InstrumentID - unique identifier of the sequencer (EAS139)
RunNumber - Run number on instrument (136).
Flowcell_ID - ID of flowcell (FC706VJ).
LaneNumber - positive integer, currently 1-8 (2)
TileNumber - positive integer (5)
X - x coordinate of the spot. Integer which can be negative (1000)
Y - y coordinate of the spot. Integer which can be negative (12850)
ReadNumber - 1 for single reads; 1 or 2 for paired ends (1)
whether it is filtered - NB: Y if the read is filtered out, not in the delivered fastq file, N otherwise (N)
ControlNumber - 0 when none of the control bits are on, otherwise it is an even number (18)
IndexSequence - Index sequence for this read (ATCACG)
参考:
- Demultiplexing based on dual indices in headers while allowing 1 mismatch to each index
- Demultiplexing a Sequencing Run
- https://github.com/grenaud/deML
2024年01月29日
最终work的是这个命令,bug的原因有点搞笑,就是应该用RC,反向互补序列。
demultiplex match new_index.csv Undetermined_Undetermined_22GFCYLT3_L4_2.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq
有时候需要手动拆分demultiplexing fastq
- 你递交给测序商的index是不对的,得到的只有Undetermined fastq;
- 类似10x的库,有非常多的index需要拆分;
一个现成的工具:demultiplex
需要先构建一个tsv index文件:ATACSeq_oligos.txt
默认index在read name里
这里我们的index在read本身,所以用了参数-r,然后index所在的fastq要放在第一位,而不是第二位。
demultiplex demux -r -s 1 ATACSeq_oligos.txt Undetermined_Undetermined_22GFCYLT3_L4_2.fq.gz Undetermined_Undetermined_22GFCYLT3_L4_1.fq.gz
$ demultiplex demux -h usage: demultiplex demux [-h] [-r] [--format {normal,x,umi,unknown}] [-s START] [-e END] [-m MISMATCH] [-d] [-p PATH] BARCODES INPUT [INPUT ...] Demultiplex any number of files given a list of barcodes. positional arguments: BARCODES barcodes file INPUT input files optional arguments: -h, --help show this help message and exit -r extract the barcodes from the read (default: False) --format {normal,x,umi,unknown} provdide the header format (default: None) -s START start of the selection (default: None) -e END end of the selection (default: None) -m MISMATCH number of mismatches (default: 1) -d use Levenshtein distance (default: False) -p PATH output directory (default: .)
如果你发现自己的indices文件不work,那就让其自己去猜,有哪些index
-s和-e是指start和end,-r是在read里去找而不是head,-f是指频率,-t是指出现的最少次数,-n是取样多少。
demultiplex guess -s 1 -e 53 -r -o barcodes.tsv -f -t 500 -n 10000000 Undetermined_Undetermined_22GFCYLT3_L4_2.fq
注意,gzip压缩非常耗时,所以非常不建议处理gz的文件,最好先解压缩为fastq文件。
另一个可以参考的工具:https://github.com/Molmed/fastq_demux
用最原始的grep来count indices也是可以的。【注意最后设置一个empty,否则index后面会自带一个\n】
cat ATACSeq_oligos.txt | while IFS=" " read name index empty; do # echo $index echo $name # echo "fq 1" grep $index Undetermined_Undetermined_22GFCYLT3_L4_1.fq | wc -l # echo "fq 2" grep $index Undetermined_Undetermined_22GFCYLT3_L4_2.fq | wc -l done