fastq demultiplexing

 

最后merge一下

cat *_1_D7_1uM_1* index2/*_1_D7_1uM_1* > D7_1uM_1_1.fq
cat *_1_D7_1uM_2* index2/*_1_D7_1uM_2* > D7_1uM_2_1.fq
cat *_1_D7_DMSO_1* index2/*_1_D7_DMSO_1* > D7_DMSO_1_1.fq
cat *_1_D7_DMSO_2* index2/*_1_D7_DMSO_2* > D7_DMSO_2_1.fq
cat *_1_D2_DMSO_1* index2/*_1_D2_DMSO_1* > D2_DMSO_1_1.fq
cat *_1_D2_DMSO_2* index2/*_1_D2_DMSO_2* > D2_DMSO_2_1.fq
cat *_1_D2_0.1uM_1* index2/*_1_D2_0.1uM_1* > D2_0.1uM_1_1.fq
cat *_1_D2_0.1uM_2* index2/*_1_D2_0.1uM_2* > D2_0.1uM_2_1.fq
cat *_1_Plasmid* index2/*_1_Plasmid* > Plasmid_1.fq

cat *_2_D7_1uM_1* index2/*_2_D7_1uM_1* > D7_1uM_1_2.fq
cat *_2_D7_1uM_2* index2/*_2_D7_1uM_2* > D7_1uM_2_2.fq
cat *_2_D7_DMSO_1* index2/*_2_D7_DMSO_1* > D7_DMSO_1_2.fq
cat *_2_D7_DMSO_2* index2/*_2_D7_DMSO_2* > D7_DMSO_2_2.fq
cat *_2_D2_DMSO_1* index2/*_2_D2_DMSO_1* > D2_DMSO_1_2.fq
cat *_2_D2_DMSO_2* index2/*_2_D2_DMSO_2* > D2_DMSO_2_2.fq
cat *_2_D2_0.1uM_1* index2/*_2_D2_0.1uM_1* > D2_0.1uM_1_2.fq
cat *_2_D2_0.1uM_2* index2/*_2_D2_0.1uM_2* > D2_0.1uM_2_2.fq
cat *_2_Plasmid* index2/*_2_Plasmid* > Plasmid_2.fq

  

 

2024年02月01日

Demultiplex真的能同时处理三个fastq,所以只需要分别对index1/2提取就行。

demultiplex demux -r index.csv index1.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq Undetermined_Undetermined_22GFCYLT3_L4_2.fq
demultiplex demux -r ../index.csv ../index2.fq ../Undetermined_Undetermined_22GFCYLT3_L4_1.fq ../Undetermined_Undetermined_22GFCYLT3_L4_2.fq

 

发邮件给novogene,bcl文件已经删了,不得不自己解决。

万万没想到,最终我还是东拼西凑的解决了这个问题。

方法就是,用下面那个biostars的代码,把dual index分别提取成单独的index1/2 fastq序列,用awk可以很快实现,边读边写。

然后就用之前的工具demultiplex来处理,主要允许mismatch。

Demultiplex any number of files given a list of barcodes.【不确定能不能同时输入三个文件,因为是根据第一个文件来寻找匹配】

cat Undetermined_Undetermined_22GFCYLT3_L4_1.fq | awk '{if( (NR%4)==1){ print $0; print substr($2,length($2)-16,8); print "+"; print "FFFFFFFF"} }' > index1.fq

cat Undetermined_Undetermined_22GFCYLT3_L4_1.fq | awk '{if( (NR%4)==1){ print $0; print substr($2,length($2)-7,8); print "+"; print "FFFFFFFF"} }' > index2.fq

  

demultiplex demux -r index.csv -m 2 -d index1.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq
demultiplex demux -r index.csv -m 2 -d index1.fq Undetermined_Undetermined_22GFCYLT3_L4_2.fq

demultiplex demux -r ../index.csv -m 2 -d ../index2.fq ../Undetermined_Undetermined_22GFCYLT3_L4_1.fq
demultiplex demux -r ../index.csv -m 2 -d ../index2.fq ../Undetermined_Undetermined_22GFCYLT3_L4_2.fq

# demultiplex match index.csv -m 2 -d index1.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq

  

一些无用的尝试

# trim意义不大
fastp -i Undetermined_Undetermined_22GFCYLT3_L4_1.fq -I Undetermined_Undetermined_22GFCYLT3_L4_2.fq -o filter_1.fq -O filter_2.fq

conda install bioconda::pear
pear -f filter_1.fq -r filter_2.fq -o merged.fq

# 用不到
pear -v 0 -f merged.fq.unassembled.forward.fastq -r merged.fq.unassembled.reverse.fastq -o unassembled

# 完全无用
# reformat.sh in1=merged.fq.unassembled.forward.fastq in2=merged.fq.unassembled.reverse.fastq out=unassembled.fq

# 效果更差
demultiplex match new_index.csv pear.assembled.fq

# 不work,因为是dual index
demultiplex demux -m 2 -d Undetermined_Undetermined_22GFCYLT3_L4_1.fq Undetermined_Undetermined_22GFCYLT3_L4_2.fq

# 用grep手动检查了,assemble之后反而grep不到那么多序列了,最直接的RC后拼接就这么难做到吗?不需要你assemble什么。

# 能看到它识别的是什么index
demultiplex guess -o barcodes.tsv -f -t 5 -n 10000 Undetermined_Undetermined_22GFCYLT3_L4_1.fq
demultiplex guess -o barcodes.tsv -t 10 -n 100000 Undetermined_Undetermined_22GFCYLT3_L4_1.fq

  

 

2024年01月31日

没事找事,本来测序商可以解决的问题,你在这折腾了好久,还没有简单的解法。

才发现,index都在header里,其中只有一个index是有用的

@LH00328:146:22GFCYLT3:4:1101:25854:2410 2:N:0:GACGAACT+TCGCCTTA
@LH00328:146:22GFCYLT3:4:1101:37846:6143 2:N:0:TCGCCTTA+TCAAGGAC
@LH00328:146:22GFCYLT3:4:1101:25308:13642 2:N:0:GACTAACT+TCGCCTTA
@LH00328:146:22GFCYLT3:4:1102:8610:8995 2:N:0:GACGAACT+TCGCCTTA
@LH00328:146:22GFCYLT3:4:1102:48534:15885 2:N:0:AACAAACC+TCGCCTTA
@LH00328:146:22GFCYLT3:4:1102:47369:17615 2:N:0:TACGACCT+TCGCCTTA
@LH00328:146:22GFCYLT3:4:1102:10588:19986 2:N:0:AACAACCT+TCGCCTTA
@LH00328:146:22GFCYLT3:4:1102:9303:21092 2:N:0:TCCCAACT+TCGCCTTA

 

  

但新问题来了,这是dual index,只提供一个index是无法提取出read group的。

demultiplex必须接受AACAAACC+TCGCCTTA这样的index,那就很难处理了,这样的index就太多了。

grep '@' Undetermined_Undetermined_22GFCYLT3_L4_2.fq | grep TCGCCTTA | cut -d':' -f 10 > all.TCGCCTTA

  

FASTQ header (line 1) contains various information separated either by ':' or a space:

An example from a HiSeq FASTQ header:

@EAS139:136:FC706VJ:2:5:1000:12850 1:N:18:ATCACG

@ - Each sequence identifier line starts with @.
InstrumentID - unique identifier of the sequencer (EAS139)
RunNumber - Run number on instrument (136).
Flowcell_ID - ID of flowcell (FC706VJ).
LaneNumber - positive integer, currently 1-8 (2)
TileNumber - positive integer (5)
X - x coordinate of the spot. Integer which can be negative (1000)
Y - y coordinate of the spot. Integer which can be negative (12850)
ReadNumber - 1 for single reads; 1 or 2 for paired ends (1)
whether it is filtered - NB: Y if the read is filtered out, not in the delivered fastq file, N otherwise (N)
ControlNumber - 0 when none of the control bits are on, otherwise it is an even number (18)
IndexSequence - Index sequence for this read (ATCACG)

 

参考:

 

2024年01月29日

最终work的是这个命令,bug的原因有点搞笑,就是应该用RC,反向互补序列。

demultiplex match new_index.csv Undetermined_Undetermined_22GFCYLT3_L4_2.fq Undetermined_Undetermined_22GFCYLT3_L4_1.fq

 

 

有时候需要手动拆分demultiplexing fastq

  • 你递交给测序商的index是不对的,得到的只有Undetermined fastq;
  • 类似10x的库,有非常多的index需要拆分;

 

一个现成的工具:demultiplex 

需要先构建一个tsv index文件:ATACSeq_oligos.txt

默认index在read name里

这里我们的index在read本身,所以用了参数-r,然后index所在的fastq要放在第一位,而不是第二位。

demultiplex demux -r -s 1 ATACSeq_oligos.txt Undetermined_Undetermined_22GFCYLT3_L4_2.fq.gz Undetermined_Undetermined_22GFCYLT3_L4_1.fq.gz

 

$ demultiplex demux -h
usage: demultiplex demux [-h] [-r] [--format {normal,x,umi,unknown}] [-s START] [-e END] [-m MISMATCH] [-d] [-p PATH] BARCODES INPUT [INPUT ...]

Demultiplex any number of files given a list of barcodes.

positional arguments:
  BARCODES              barcodes file
  INPUT                 input files

optional arguments:
  -h, --help            show this help message and exit
  -r                    extract the barcodes from the read (default: False)
  --format {normal,x,umi,unknown}
                        provdide the header format (default: None)
  -s START              start of the selection (default: None)
  -e END                end of the selection (default: None)
  -m MISMATCH           number of mismatches (default: 1)
  -d                    use Levenshtein distance (default: False)
  -p PATH               output directory (default: .)

 

如果你发现自己的indices文件不work,那就让其自己去猜,有哪些index

-s和-e是指start和end,-r是在read里去找而不是head,-f是指频率,-t是指出现的最少次数,-n是取样多少。

demultiplex guess -s 1 -e 53 -r -o barcodes.tsv -f -t 500 -n 10000000 Undetermined_Undetermined_22GFCYLT3_L4_2.fq

  

 

注意,gzip压缩非常耗时,所以非常不建议处理gz的文件,最好先解压缩为fastq文件。

 

另一个可以参考的工具:https://github.com/Molmed/fastq_demux

 

用最原始的grep来count indices也是可以的。【注意最后设置一个empty,否则index后面会自带一个\n】

cat ATACSeq_oligos.txt | while IFS=" " read name index empty; do
# echo $index
echo $name
# echo "fq 1"
grep $index Undetermined_Undetermined_22GFCYLT3_L4_1.fq | wc -l
# echo "fq 2"
grep $index Undetermined_Undetermined_22GFCYLT3_L4_2.fq | wc -l
done

  

 

posted @ 2024-01-10 01:08  Life·Intelligence  阅读(118)  评论(0编辑  收藏  举报
TOP