根据荧光蛋白来做单细胞的lineage tracing | GFP | YFP

 

2023年08月15日 【成功检测出tdTomato,yes】

1
2
3
4
5
6
7
8
9
10
11
12
13
14
cp best.td.cdx2.fasta tdTomato.fa
vi tdTomato.fa # change fasta name to tdTomato
cat tdTomato.fa | grep -v "^>" | tr -d "\n" | wc -c
echo -e 'tdTomato\tunknown\texon\t1\t7149\t.\t+\t.\tgene_id "tdTomato"; transcript_id "tdTomato"; gene_name "tdTomato"; gene_biotype "protein_coding";' > tdTomato.gtf
mkdir refdata-gex-mm10-2020-A-tdTomato
cp find_best_egfp/tdTomato.* ./refdata-gex-mm10-2020-A-tdTomato/
cat tdTomato.fa >> genome.fa
grep ">" genome.fa
cat tdTomato.gtf >> genes.gtf
tail genes.gtf
 
/home/zz950/softwares/cellranger-7.1.0/cellranger mkref --genome=mm10_genome_tdTomato \
  --fasta=genome.fa \
  --genes=genes.gtf

 

2023年06月22日

基本步骤

1. ask what is the fluorescent protein, in this case, it's https://www.jax.org/strain/008875
2. search the protein sequence, enhanced green fluorescent protein sequence (EGFP) mouse, https://www.uniprot.org/uniprotkb/C5MKY7/entry#sequences
3. NCBI, tblastn, https://blast.ncbi.nlm.nih.gov/Blast.cgi#
4. download all fasta, get full DNA sequences, remove duplicate
5. build star index: STAR --runThreadN 20 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles seqdump_noDup.fa --genomeSAindexNbases 7
6. align to find best sequence

 

Cdx2-td and Cdx2-ApcKO scRNA-seq data that we recently generated (unsorted).
tdTomato fluorescent protein sequence
https://www.fpbase.org/protein/tdtomato/

 

找到的EGFP 蛋白序列

1
2
3
4
5
>tr|C5MKY7|C5MKY7_HCMV Enhanced green fluorescent protein OS=Human cytomegalovirus OX=10359 GN=egfp PE=1 SV=1
MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPT
LVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTL
VNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLA
DHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK

  

tdTomato蛋白序列

1
2
3
4
5
6
7
8
>AAV52169.1 tandem-dimer red fluorescent protein [synthetic construct]
MVSKGEEVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQFMYGS
KAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTVTQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKK
TMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDTKLDITSHNEDYT
IVEQYERSEGRHHLFLGHGTGSTGSGSSGTASSEDNNMAVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRP
YEGTQTAKLKVTKGGPLPFAWDILSPQFMYGSKAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTV
TQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKKTMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEF
KTIYMAKKPVQLPGYYYVDTKLDITSHNEDYTIVEQYERSEGRHHLFLYGMDELYK

 

工具

1
2
source activate splicing
module load samtools-1.9-gcc-5.4.0-jjq5nua

  

生成Reference

1
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir td_star_index --genomeFastaFiles seqdump-td_noDup.fa --genomeSAindexNbases 7

 

比对

1
2
3
4
5
6
7
8
9
10
11
12
13
14
STAR --runThreadN 20 \
--genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/egfp_star_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix egfp_slideD \
--readFilesCommand gunzip -c \
--readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz
 
 
STAR --runThreadN 20 \
--genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix td_slideD \
--readFilesCommand gunzip -c \
--readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz

  

1
2
3
4
5
6
7
# CDX2
STAR --runThreadN 20 \
--genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix td_slideD \
--readFilesCommand gunzip -c \
--readFilesIn NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R1_001.fastq.gz NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R2_001.fastq.gz

  

找到最佳比对序列

1
samtools view td_slideDAligned.sortedByCoord.out.bam | cut -f3 | sort | uniq -c

  

1
2
>MZ708019.1 Cloning vector pAAV-SynaptoTAG2, complete sequence
CCTGCAGGCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTGCGGCCGCACGCGTCTGCAGAGGGCCCTGCGTATGAGTGCAAGTGGGTTTTAGGACCAGGATGAGGCGGGGTGGGGGTGCCTACCTGACGACCGACCCCGACCCACTGGACAAGCACCCAACCCCCATTCCCCAAATTGCGCATCCCCTATCAGAGAGGGGGAGGGGAAACAGGATGCGGCGAGGCGCGTGCGCACTGCCAGCTTCAGCACCGCGGACAGTGCCTTCGCCCCCGCCTGGCGGCGCGCGCCACCGCCGCCTCAGCACTGAAGGCGCGCTGACGTCACTCGCCGGTCCCCCGCAAACTCCCCTTCCCGGCCACCTTGGTCGCGTCCGCGCCGCCGCCGGCCCAGCCGGACCGCACCACGCGAGGCGCGAGATAGGGGGGCACGGGCGCGACCATCTGCGCTGCGGCGCCGGCGACTCAGCGCTGCCTCAGTCTGCGGTGGGCAGCGGAGGAGTCGTGTCGTGCCTGAGAGCGCAGTCGCCACCATGGGATCCACCGGTGCCACCATGGTCGAGATGGTGAGCAAGGGCGAGGAGGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGGGGCATGGCACCGGCAGCACCGGCAGCGGCAGCTCCGGCACCGCCTCCTCCGAGGACAACAACATGGCCGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGTACGGCATGGACGAGCTGTACAAGGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTCAATTCATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGGGTGGAGGTGGATCGGCTACCGCTGCCACCGTCCCGCCTGCCGCCCCGGCCGGCGAGGGTGGCCCCCCTGCACCTCCTCCAAATCTTACCAGTAACAGGAGACTGCAGCAGACCCAGGCCCAGGTGGATGAGGTGGTGGACATCATGAGGGTGAATGTGGACAAGGTCCTGGAGCGAGACCAGAAGCTATCGGAACTGGATGATCGCGCAGATGCCCTCCAGGCAGGGGCCTCCCAGTTTGAAACAAGTGCAGCCAAGCTCAAGCGCAAATACTGGTGGAAAAACCTCAAGATGATGATCATCTTGGGAGTGATTTGCGCCATCATCCTCATCATCATCATCGTTTACTTCAGCACTTAAGAATTGAGATCTGAATTCGATATCAAGCTTATCGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTGGTATTCTTAACTATGTTGCTCCTTTTACGCTATGTGGATACGCTGCTTTAATGCCTTTGTATCATGCTATTGCTTCCCGTATGGCTTTCATTTTCTCCTCCTTGTATAAATCCTGGTTGCTGTCTCTTTATGAGGAGTTGTGGCCCGTTGTCAGGCAACGTGGCGTGGTGTGCACTGTGTTTGCTGACGCAACCCCCACTGGTTGGGGCATTGCCACCACCTGTCAGCTCCTTTCCGGGACTTTCGCTTTCCCCCTCCCTATTGCCACGGCGGAACTCATCGCCGCCTGCCTTGCCCGCTGCTGGACAGGGGCTCGGCTGTTGGGCACTGACAATTCCGTGGTGTTGTCGGGGAAATCATCGTCCTTTCCTTGGCTGCTCGCCTGTGTTGCCACCTGGATTCTGCGCGGGACGTCCTTCTGCTACGTCCCTTCGGCCCTCAATCCAGCGGACCTTCCTTCCCGCGGCCTGCTGCCGGCTCTGCGGCCTCTTCCGCGTCTTCGCCTTCGCCCTCAGACGAGTCGGATCTCCCTTTGGGCCGCCTCCCCGCATCGATACCGAGCGCTGCTCGAGAGATCTACGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTCCTGGCCCTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTCCTAATAAAATTAAGTTGCATCATTTTGTCTGACTAGGTGTCCTTCTATAATATTATGGGGTGGAGGGGGGTGGTATGGAGCAAGGGGCAAGTTGGGAAGACAACCTGTAGGGCCTGCGGGGTCTATTGGGAACCAAGCTGGAGTGCAGTGGCACAATCTTGGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAGTTGTTGGGATTCCAGGCATGCATGACCAGGCTCAGCTAATTTTTGTTTTTTTGGTAGAGACGGGGTTTCACCATATTGGCCAGGCTGGTCTCCAACTCCTAATCTCAGGTGATCTACCCACCTTGGCCTCCCAAATTGCTGGGATTACAGGCGTGAACCACTGCTCCCTTCCCTGTCCTTCTGATTTTGTAGGTAACCACGTGCGGACCGAGCGGCCGCAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGCTGCCTGCAGGGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATACGTCAAAGCAACCATAGTACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTTGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGGCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTTATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGT

  

 

 

 

 

 

 

 


 

如何找到可能的YFP?

  1. NCBI
  2. 搜protein sequence,然后tblastn

去除merge之后的冗余:Remove Duplicates from a Fasta File and manipulate names

上次的最佳匹配是这个序列:

1
2
>AB971579.1 Synthetic construct sfgfp gene for superfolder green fluorescent protein, complete cds
ATGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGCGTGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCGTCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCGTTCAAGGACGACGGCACATACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTTAACAGCCACAACGTCTATATCACAGCCGACAAGCAGAAGAACGGCATCAAGGCAAACTTCAAGATCCGCCACAACGTTGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGTTCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCACGGCATGGACGAGCTGTACAAGTAA

  

构建STAR references【为了保险,还是再筛一次】

1
2
3
4
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
 
mkdir GFP_YFP_index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir GFP_YFP_index --genomeFastaFiles GFP_YFP.fasta --genomeSAindexNbases 7

  

以后换个GFP和YFP荧光蛋白还是可以继续用。

找出最优的DNA序列

1
2
3
4
5
6
7
8
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
 
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
--genomeDir /home/lizhixin/references/GFP_YFP_tracing/GFP_YFP_index/GFP_YFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix test.best \
--readFilesCommand gunzip -c \
--readFilesIn NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R1_001.fastq.gz,NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R2_001.fastq.gz

  

 

 

 


 

一定要善于搜索,这种常见的需求其实早就有了官方解决方案。

今天突然就搜索到了:Google:cellranger add genes to reference

 

提取fasta

参见下文

 

构建gtf【gtf格式每列必须以tab分割,去sublime里手动调节】

1
2
3
YFP unknown exon    1   717 .   +   .   gene_id "YFP"; gene_version "1"; transcript_id "YFP"; transcript_version "1"; exon_number "1"; gene_name "YFP"; gene_source "unknown"; gene_biotype "protein_coding"; transcript_name "YFP"; transcript_source "unknown"; transcript_biotype "protein_coding"; protein_id "YFP"; protein_version "1"; tag "basic"; transcript_support_level "5"
 
YFP unknown exon    1   717 .   +   .   gene_id "YFP"; transcript_id "YFP"; gene_name "YFP"; gene_biotype "protein_coding"; 

最新的GFP

1
GFP unknown exon    1   1116    .   +   .   gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";

 

mkref【最好用最新版的cellranger,目前是7.0了】

1
2
3
~/softwares/cellranger-3.1.0/cellranger mkref --genome=mm10_GFP \
  --fasta=genome.fa \
  --genes=genes.gtf
1
2
3
~/softwares/cellranger-7.0.1/cellranger mkref --genome=mm10_GFP \
  --fasta=mm.genome.GFP.fa \
  --genes=mm.genes.GFP.gtf

  

 


 

大部分时候我们都可以用marker的基因表达来区分细胞的lineage,荧光蛋白大部分时候就是用于staining验证。

有时候我们也需要检测荧光蛋白的mRNA来分析这些cell,比如这篇paper:Rainbow-Seq: Combining Cell Lineage Tracing with Single-Cell RNA Sequencing in Preimplantation Embryos

 

其实是可行的,首先找到荧光蛋白DNA序列,通过蛋白序列去blast,然后把我们的fastq比对上去找出真正的荧光蛋白DNA reference。

然后把荧光蛋白DNA reference加入到genome reference里,以及gtf里,就可以正常分析出它的表达量了。

 

比如PHOX2B项目,但这是个伪命题,检测出来不需要这么做,这么做是因为PHOX2B检测不到,那荧光蛋白自然也检测不到,实际上我们也没法通过fastq找到荧光蛋白DNA reference。【除非你非常确定你的荧光蛋白DNA reference,否则后面没法做】

 

通过fastq找到荧光蛋白DNA reference

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l mem=100G
#PBS -l walltime=84:00:00
#PBS -q large
#PBS -N STAR
 
# q: medium_ext, legacy
 
cd $PBS_O_WORKDIR
 
# source activate /home/lizhixin/softwares/anaconda3/envs/splicing
 
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
--genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix all/7Ala-D60-BO-2 \
--readFilesCommand gunzip -c \
--readFilesIn 7Ala-D60-BO-2-1_S25_L003_R1_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R1_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R1_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R1_001.fastq.gz 7Ala-D60-BO-2-1_S25_L003_R2_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R2_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R2_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R2_001.fastq.gz &&\
 
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
--genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix all/UE-D60-BO-2 \
--readFilesCommand gunzip -c \
--readFilesIn UE-D60-BO-2-1_S21_L003_R1_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R1_001.fastq.gz,UE-D60-BO-2-3_S23_L00
3_R1_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R1_001.fastq.gz UE-D60-BO-2-1_S21_L003_R2_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R2_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R2_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R2_001.fastq.gz &&\

  

 

 


 

有时候需要个性化处理原始序列,自己写python脚本太慢,且速度太慢,可以用seqkit这个工具,开发得不错。

 

2021年12月02日

另一个需求:需要从Cell Ranger处理后的bam文件里提取出未比对的reads,然后再去比对到YFP序列上,确定每个细胞是否被YFP标记了。

参考1:Cell Ranger的bam文件的解读 Barcoded BAM

The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.

参考2:如何提取出未比对上的reads How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?

1
samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt

  

总结:

提取未比对序列的命令

1
2
3
4
samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam
samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam
1
samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam

其中CB是校正后的细胞barcode,没有的话也可以用CR代替。

 

之后的环节

  1. bam转fastq,然后比对到YFP基因组
  2. 提取出比对reads对应的cell barcode

 

bam转fastq bedtools bamtofastq

1
bedtools bamtofastq -i unmapped.bam -fq unmapped.fq

  

通过blast搜索所有的YFP相关序列,然后用STAR构建索引。

1
2
3
4
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
 
mkdir GFP_index
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta

  

比对

1
2
3
4
5
STAR --runThreadN 10 \
--genomeDir YFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix unmap/PHOX2B \
--readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq

  

提取出比对上的序列

 

提取对应的cell barcode

 

 

 

 

 

 

 

 

 


 

比如提取10x genomics的barcode,fastq里的前16个碱基【搞错了,没这么简单】。

1
seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq

 

所有需要的信息都在这个bam文件里面,可以进行二次分析

possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information  

 


 

统计fasta的每条序列的长度

目的:通过统计人基因组里转录本的长度分布,来评估不同的基因治疗的载体的局限。

 

1
2
# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/
cat GRCh38.transcripts.fa | seqkit  fx2tab -l | cut -f1,4 > transcript.len.txt

  

 

 

 

 

参考:

fasta/fq文件处理万能工具——Seqkit学习记录

Single-Library Analysis with cellranger count

 

posted @   Life·Intelligence  阅读(927)  评论(0编辑  收藏  举报
(评论功能已被禁用)
编辑推荐:
· 从 HTTP 原因短语缺失研究 HTTP/2 和 HTTP/3 的设计差异
· AI与.NET技术实操系列:向量存储与相似性搜索在 .NET 中的实现
· 基于Microsoft.Extensions.AI核心库实现RAG应用
· Linux系列:如何用heaptrack跟踪.NET程序的非托管内存泄露
· 开发者必知的日志记录最佳实践
阅读排行:
· winform 绘制太阳,地球,月球 运作规律
· AI与.NET技术实操系列(五):向量存储与相似性搜索在 .NET 中的实现
· 超详细:普通电脑也行Windows部署deepseek R1训练数据并当服务器共享给他人
· 【硬核科普】Trae如何「偷看」你的代码?零基础破解AI编程运行原理
· 上周热点回顾(3.3-3.9)
TOP
点击右上角即可分享
微信分享提示