Single-cell lineage tracing with fluorescent proteins | GFP | YFP

 

2023-08-15 [tdTomato detected successfully]

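# Use the best-matching sequence from the alignment screen as the tdTomato reference
cp best.td.cdx2.fasta tdTomato.fa
vi tdTomato.fa # change the FASTA header name to tdTomato
# Sequence length; the exon end coordinate in the GTF line below (7149) must equal this number
cat tdTomato.fa | grep -v "^>" | tr -d "\n" | wc -c
echo -e 'tdTomato\tunknown\texon\t1\t7149\t.\t+\t.\tgene_id "tdTomato"; transcript_id "tdTomato"; gene_name "tdTomato"; gene_biotype "protein_coding";' > tdTomato.gtf
mkdir refdata-gex-mm10-2020-A-tdTomato
cp find_best_egfp/tdTomato.* ./refdata-gex-mm10-2020-A-tdTomato/
# Append the reporter to genome.fa and genes.gtf, then check that it was added
cat tdTomato.fa >> genome.fa
grep ">" genome.fa
cat tdTomato.gtf >> genes.gtf
tail genes.gtf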

/home/zz950/softwares/cellranger-7.1.0/cellranger mkref --genome=mm10_genome_tdTomato \
  --fasta=genome.fa \
  --genes=genes.gtf
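
The exon end coordinate in tdTomato.gtf is typed by hand from the wc -c output, which is easy to get wrong when the reporter sequence changes. A minimal sketch of deriving it instead, assuming the FASTA holds exactly one record; make_reporter_gtf is a hypothetical helper name, not part of Cell Ranger or the original notes:

```shell
# Hypothetical helper: print a one-exon GTF record whose end coordinate is
# the length of the single sequence stored in the given FASTA file.
make_reporter_gtf() {
  local name="$1" fasta="$2" len
  # Length = every non-header character, with line breaks removed.
  len=$(grep -v '^>' "$fasta" | tr -d '\n\r' | wc -c | tr -d ' ')
  printf '%s\tunknown\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' \
    "$name" "$len" "$name" "$name" "$name"
}
```

Usage would look like: make_reporter_gtf tdTomato tdTomato.fa > tdTomato.gtf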

 

2023-06-22

Basic steps

1. Ask which fluorescent protein the mouse strain carries; in this case it is https://www.jax.org/strain/008875
2. Search for the protein sequence (query: enhanced green fluorescent protein sequence (EGFP) mouse): https://www.uniprot.org/uniprotkb/C5MKY7/entry#sequences
3. Run tblastn with the protein sequence at NCBI: https://blast.ncbi.nlm.nih.gov/Blast.cgi#
4. Download all hits as FASTA to get the full DNA sequences, then remove duplicates
5. Build a STAR index: STAR --runThreadN 20 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles seqdump_noDup.fa --genomeSAindexNbases 7
6. Align the reads to the candidates to find the best sequence
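
Step 4 (removing duplicate sequences from the downloaded hits) can be sketched with plain awk. This is only one possible approach, assuming duplicates are defined as identical sequences regardless of case; dedup_fasta is a hypothetical helper name, and seqkit rmdup -s should do the same job if seqkit is installed:

```shell
# Hypothetical helper: keep the first FASTA record for each distinct sequence.
# Reads FASTA from the given files (or stdin) and writes unwrapped records.
dedup_fasta() {
  awk '
    function emit() {
      # Print the buffered record only if its sequence was not seen before.
      if (header != "" && !(toupper(seq) in seen)) {
        seen[toupper(seq)] = 1
        print header
        print seq
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { gsub(/[ \t\r]/, ""); seq = seq $0 }
    END { emit() }
  ' "$@"
}
```

Usage would look like: dedup_fasta seqdump.fa > seqdump_noDup.fa (the input file name here is an assumption).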

 

Cdx2-td and Cdx2-ApcKO scRNA-seq data that we recently generated (unsorted).
tdTomato fluorescent protein sequence
https://www.fpbase.org/protein/tdtomato/

 

EGFP protein sequence found

>tr|C5MKY7|C5MKY7_HCMV Enhanced green fluorescent protein OS=Human cytomegalovirus OX=10359 GN=egfp PE=1 SV=1
MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPT
LVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTL
VNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLA
DHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK

  

tdTomato protein sequence

>AAV52169.1 tandem-dimer red fluorescent protein [synthetic construct]
MVSKGEEVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQFMYGS
KAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTVTQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKK
TMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDTKLDITSHNEDYT
IVEQYERSEGRHHLFLGHGTGSTGSGSSGTASSEDNNMAVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRP
YEGTQTAKLKVTKGGPLPFAWDILSPQFMYGSKAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTV
TQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKKTMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEF
KTIYMAKKPVQLPGYYYVDTKLDITSHNEDYTIVEQYERSEGRHHLFLYGMDELYK

 

Tools

source activate splicing
module load samtools-1.9-gcc-5.4.0-jjq5nua

  

Build the reference

STAR --runThreadN 20 --runMode genomeGenerate --genomeDir td_star_index --genomeFastaFiles seqdump-td_noDup.fa --genomeSAindexNbases 7

 

Alignment

STAR --runThreadN 20 \
--genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/egfp_star_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix egfp_slideD \
--readFilesCommand gunzip -c \
--readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz


STAR --runThreadN 20 \
--genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix td_slideD \
--readFilesCommand gunzip -c \
--readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz

  

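# CDX2 (note: this run reuses the td_slideD prefix, so it overwrites the slide D output above unless a different --outFileNamePrefix is given)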
STAR --runThreadN 20 \
--genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix td_slideD \
--readFilesCommand gunzip -c \
--readFilesIn NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R1_001.fastq.gz NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R2_001.fastq.gz

  

Find the best-matching sequence

samtools view td_slideDAligned.sortedByCoord.out.bam | cut -f3 | sort | uniq -c
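
The uniq -c output is in alphabetical order, so the top hit still has to be picked out by eye. A small sketch that ranks the candidate references by read count instead; rank_references is a hypothetical helper that reads SAM text on stdin:

```shell
# Hypothetical helper: count reads per reference name (SAM column 3),
# ignoring header lines and unaligned reads, most frequent first.
rank_references() {
  awk -F'\t' '!/^@/ && $3 != "*" { n[$3]++ }
              END { for (r in n) print n[r] "\t" r }' |
    sort -k1,1nr -k2,2
}
```

Usage would look like: samtools view td_slideDAligned.sortedByCoord.out.bam | rank_references | head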

  

>MZ708019.1 Cloning vector pAAV-SynaptoTAG2, complete sequence
CCTGCAGGCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTGCGGCCGCACGCGTCTGCAGAGGGCCCTGCGTATGAGTGCAAGTGGGTTTTAGGACCAGGATGAGGCGGGGTGGGGGTGCCTACCTGACGACCGACCCCGACCCACTGGACAAGCACCCAACCCCCATTCCCCAAATTGCGCATCCCCTATCAGAGAGGGGGAGGGGAAACAGGATGCGGCGAGGCGCGTGCGCACTGCCAGCTTCAGCACCGCGGACAGTGCCTTCGCCCCCGCCTGGCGGCGCGCGCCACCGCCGCCTCAGCACTGAAGGCGCGCTGACGTCACTCGCCGGTCCCCCGCAAACTCCCCTTCCCGGCCACCTTGGTCGCGTCCGCGCCGCCGCCGGCCCAGCCGGACCGCACCACGCGAGGCGCGAGATAGGGGGGCACGGGCGCGACCATCTGCGCTGCGGCGCCGGCGACTCAGCGCTGCCTCAGTCTGCGGTGGGCAGCGGAGGAGTCGTGTCGTGCCTGAGAGCGCAGTCGCCACCATGGGATCCACCGGTGCCACCATGGTCGAGATGGTGAGCAAGGGCGAGGAGGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGGGGCATGGCACCGGCAGCACCGGCAGCGGCAGCTCCGGCACCGCCTCCTCCGAGGACAACAACATGGCCGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCC
CACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGTACGGCATGGACGAGCTGTACAAGGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTCAATTCATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGGGTGGAGGTGGATCGGCTACCGCTGCCACCGTCCCGCCTGCCGCCCCGGCCGGCGAGGGTGGCCCCCCTGCACCTCCTCCAAATCTTACCAGTAACAGGAGACTGCAGCAGACCCAGGCCCAGGTGGATGAGGTGGTGGACATCATGAGGGTGAATGTGGACAAGGTCCTGGAGCGAGACCAGAAGCTATCGGAACTGGATGATCGCGCAGATGCCCTCCAGGCAGGGGCCTCCCAGTTTGAAACAAGTGCAGCCAAGCTCAAGCGCAAATACTGGTGGAAAAACCTCAAGATGATGATCATCTTGGGAGTGATTTGCGCCATCATCCTCATCATCATCATCGTTTACTTCAGCACTTAAGAATTGAGATCTGAATTCGATATCAAGCTTATCGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTGGTATTCTTAACTATGTTGCTCCTTTTACGCTATGTGGATACGCTGCTTTAATGCCTTTGTATCATGCTATTGCTTCCCGTATGGCTTTCATTTTCTCCTCCTTGTATAAATCCTGGTTGCTGTCTCTTTATGAGGAGTTGTGGCCCGTTGTCAGGCAACGTGGCGTGGTGTGCACTGTGTTTGCTGACGCAACCCCCACTGGTTGGGGCATTGCCACCACCTGTCAGCTCCTTTCCGGGACTTTCGCTTTCCCCCTCCCTATTGCCACGGCGGAACTCATCGCCGCCTGCCTTGCCCGCTGCTGGACAGGGGCTCGGCTGTTGGGCACTGACAATTCCGTGGTGTTGTCGGGGAAATCATCGTCCTTTCCTTGGCTGCTCGCCTGTGTTGCCACCTGGATTCTGCGCGGGACGTCCTTCTGCTACGTCCCTTCGGCCCTCAATCCAGCGGACCTTCCTTCCCGCGGCCTGCTGCCGGCTCTGCGGCCTCTTCCGCGTCTTCGCCTTCGCCCTCAGACGAGTCGGATCTCCCTTTGGGCCGCCTCCCCGCATCGATACCGAGCGCTGCTCGAGAGATCTACGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTCCTGGCCCTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTCCTAATAAAATTAAGTTGCATCATTTTG
TCTGACTAGGTGTCCTTCTATAATATTATGGGGTGGAGGGGGGTGGTATGGAGCAAGGGGCAAGTTGGGAAGACAACCTGTAGGGCCTGCGGGGTCTATTGGGAACCAAGCTGGAGTGCAGTGGCACAATCTTGGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAGTTGTTGGGATTCCAGGCATGCATGACCAGGCTCAGCTAATTTTTGTTTTTTTGGTAGAGACGGGGTTTCACCATATTGGCCAGGCTGGTCTCCAACTCCTAATCTCAGGTGATCTACCCACCTTGGCCTCCCAAATTGCTGGGATTACAGGCGTGAACCACTGCTCCCTTCCCTGTCCTTCTGATTTTGTAGGTAACCACGTGCGGACCGAGCGGCCGCAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGCTGCCTGCAGGGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATACGTCAAAGCAACCATAGTACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTTGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGGCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTTATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACA
CCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGT


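How to find candidate YFP sequences?

  1. NCBI
  2. Search for the protein sequence, then run tblastn

Remove the redundancy left after merging: Remove Duplicates from a Fasta File and manipulate names

The best match last time was this sequence: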

>AB971579.1 Synthetic construct sfgfp gene for superfolder green fluorescent protein, complete cds
ATGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGCGTGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCGTCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCGTTCAAGGACGACGGCACATACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTTAACAGCCACAACGTCTATATCACAGCCGACAAGCAGAAGAACGGCATCAAGGCAAACTTCAAGATCCGCCACAACGTTGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGTTCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCACGGCATGGACGAGCTGTACAAGTAA

  

Build the STAR reference [to be safe, screen the candidates once more]

source activate /home/lizhixin/softwares/anaconda3/envs/splicing

mkdir GFP_YFP_index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir GFP_YFP_index --genomeFastaFiles GFP_YFP.fasta --genomeSAindexNbases 7

  

This index can be reused later when switching to other GFP or YFP fluorescent proteins.

Find the best DNA sequence

source activate /home/lizhixin/softwares/anaconda3/envs/splicing

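# Note: STAR treats a comma-separated --readFilesIn list as several files of the same mate,
# so R1 and R2 below are aligned as single-end reads; separate them with a space for paired-end mode.
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \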
--genomeDir /home/lizhixin/references/GFP_YFP_tracing/GFP_YFP_index/GFP_YFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix test.best \
--readFilesCommand gunzip -c \
--readFilesIn NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R1_001.fastq.gz,NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R2_001.fastq.gz

  

 

 

 


 

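Always search first: a common need like this usually already has an official solution.

Today I happened to find it with Google: cellranger add genes to reference

 

Extract the FASTA

See below

 

Build the GTF [GTF columns must be tab-separated; fix them by hand in Sublime]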

YFP	unknown	exon	1	717	.	+	.	gene_id "YFP"; gene_version "1"; transcript_id "YFP"; transcript_version "1"; exon_number "1"; gene_name "YFP"; gene_source "unknown"; gene_biotype "protein_coding"; transcript_name "YFP"; transcript_source "unknown"; transcript_biotype "protein_coding"; protein_id "YFP"; protein_version "1"; tag "basic"; transcript_support_level "5"

YFP	unknown	exon	1	717	.	+	.	gene_id "YFP"; transcript_id "YFP"; gene_name "YFP"; gene_biotype "protein_coding"; 

Latest GFP

GFP	unknown	exon	1	1116	.	+	.	gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";

 

mkref [preferably with the latest cellranger, 7.0 at the time of writing]

~/softwares/cellranger-3.1.0/cellranger mkref --genome=mm10_GFP \
  --fasta=genome.fa \
  --genes=genes.gtf
~/softwares/cellranger-7.0.1/cellranger mkref --genome=mm10_GFP \
  --fasta=mm.genome.GFP.fa \
  --genes=mm.genes.GFP.gtf

  

 


 

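Most of the time marker gene expression is enough to distinguish cell lineages, and the fluorescent protein is mainly used for validation by staining.

Sometimes we also need to detect the fluorescent protein mRNA in order to analyze these cells, as in this paper: Rainbow-Seq: Combining Cell Lineage Tracing with Single-Cell RNA Sequencing in Preimplantation Embryos

 

This is feasible: first collect candidate DNA sequences for the fluorescent protein by BLASTing its protein sequence, then align our fastq reads to the candidates to identify the actual fluorescent protein DNA reference.

Then add that DNA reference to the genome FASTA and to the GTF, and its expression can be quantified like any other gene.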
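
One thing worth checking after appending the reporter is that every sequence name used in the GTF (column 1) also exists as a record in the FASTA, since the two files have to agree on contig names. A minimal sketch of that check; missing_contigs is a hypothetical helper, not a Cell Ranger command:

```shell
# Hypothetical helper: list GTF sequence names that have no matching
# FASTA record. Prints nothing when the two files are consistent.
missing_contigs() {
  local fasta="$1" gtf="$2"
  awk -F'\t' '
    # First file (FASTA): remember each record ID, the header up to the first blank.
    NR == FNR {
      if ($0 ~ /^>/) { split(substr($0, 2), id, /[ \t]/); have[id[1]] = 1 }
      next
    }
    # Second file (GTF): skip comments and short lines, report unknown names once.
    /^#/ || NF < 9 { next }
    !($1 in have) && !($1 in seen) { seen[$1] = 1; print $1 }
  ' "$fasta" "$gtf"
}
```

Usage would look like: missing_contigs genome.fa genes.gtf (empty output means the reporter record and its GTF line match).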

 

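Take the PHOX2B project as an example, although the premise turned out to be flawed: if PHOX2B could be detected there would be no need for this, and we only tried it because PHOX2B could not be detected. In that case the fluorescent protein cannot be detected either, and in practice we could not recover the fluorescent protein DNA reference from the fastq. [Unless you are very sure of your fluorescent protein DNA reference, the later steps cannot be done.]

 

Finding the fluorescent protein DNA reference from the fastq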

#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l mem=100G
#PBS -l walltime=84:00:00
#PBS -q large
#PBS -N STAR

# q: medium_ext, legacy

cd $PBS_O_WORKDIR

# source activate /home/lizhixin/softwares/anaconda3/envs/splicing

/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
--genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix all/7Ala-D60-BO-2 \
--readFilesCommand gunzip -c \
--readFilesIn 7Ala-D60-BO-2-1_S25_L003_R1_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R1_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R1_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R1_001.fastq.gz 7Ala-D60-BO-2-1_S25_L003_R2_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R2_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R2_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R2_001.fastq.gz &&\

/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
--genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix all/UE-D60-BO-2 \
--readFilesCommand gunzip -c \
--readFilesIn UE-D60-BO-2-1_S21_L003_R1_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R1_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R1_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R1_001.fastq.gz UE-D60-BO-2-1_S21_L003_R2_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R2_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R2_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R2_001.fastq.gz

  

 

 


 

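Sometimes the raw sequences need custom processing. Writing your own Python script takes time and the script runs slowly; the seqkit tool handles this well and is nicely developed.

 

2021-12-02

Another need: extract the unaligned reads from the BAM file produced by Cell Ranger, align them to the YFP sequence, and determine whether each cell is labeled with YFP.

Reference 1: how to read the Cell Ranger BAM file: Barcoded BAM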

The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.

Reference 2: how to extract the unmapped reads: How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?

samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt

  

Summary:

Commands to extract the unaligned reads

samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam
samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam
samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam

In the BAM tags, CB is the error-corrected cell barcode; when it is missing, the raw barcode CR can be used instead.
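
The CB-or-CR fallback can be sketched as a filter over SAM text. This is an illustration under the assumption that the barcode tags appear as CB:Z: and CR:Z: optional fields; read_barcodes is a hypothetical helper name:

```shell
# Hypothetical helper: print "read name <TAB> barcode" for each SAM record,
# preferring the corrected CB tag and falling back to the raw CR tag.
# Records carrying neither tag are skipped.
read_barcodes() {
  awk -F'\t' '
    /^@/ { next }
    {
      cb = ""; cr = ""
      # Optional SAM fields start at column 12.
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^CB:Z:/) cb = substr($i, 6)
        else if ($i ~ /^CR:Z:/) cr = substr($i, 6)
      }
      bc = (cb != "") ? cb : cr
      if (bc != "") print $1 "\t" bc
    }
  '
}
```

Usage would look like: samtools view unmapped.bam | read_barcodes > read2barcode.tsv (the output file name is an assumption).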

 

Next steps

  1. Convert the BAM to fastq, then align it to the YFP reference
  2. Extract the cell barcodes of the reads that aligned

 

Convert BAM to fastq with bedtools bamtofastq

bedtools bamtofastq -i unmapped.bam -fq unmapped.fq

  

Collect all YFP-related sequences with BLAST, then build an index with STAR.

source activate /home/lizhixin/softwares/anaconda3/envs/splicing

mkdir GFP_index
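# Small reference: scale --genomeSAindexNbases down, as in the other index builds in these notes
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta --genomeSAindexNbases 7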

  

Alignment

STAR --runThreadN 10 \
--genomeDir GFP_index \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix unmap/PHOX2B \
--readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq

  

Extract the reads that aligned

 

Extract the corresponding cell barcodes
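
The fastq written by bedtools bamtofastq keeps only the read name, sequence and quality, so the barcode has to be recovered by joining read names back to the original BAM. A minimal sketch, assuming two input files that are not part of the original notes: a list of aligned read names (one per line, non-empty) and a two-column table of read name and barcode; count_reads_per_barcode is a hypothetical helper name:

```shell
# Hypothetical helper: count reporter-aligned reads per cell barcode.
#   $1: file with the names of reads that aligned to the reporter
#   $2: tab-separated table "read name <TAB> barcode"
count_reads_per_barcode() {
  local names="$1" table="$2"
  awk -F'\t' '
    # First file: remember the aligned read names.
    NR == FNR { hit[$1] = 1; next }
    # Second file: count each aligned read once under its barcode.
    ($1 in hit) && !($1 in counted) { counted[$1] = 1; n[$2]++ }
    END { for (bc in n) print bc "\t" n[bc] }
  ' "$names" "$table" | sort -k2,2nr -k1,1
}
```

The name list could come from something like samtools view -F 4 unmap/PHOX2BAligned.sortedByCoord.out.bam | cut -f1 | sort -u > aligned_names.txt. Barcodes with at least one reporter read are then candidates for labeled cells; any count threshold is a separate judgment call.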


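For example, extracting the 10x Genomics barcode as the first 16 bases of the fastq [this was a mistake, it is not that simple: in 10x 3' libraries the 16 bp barcode sits at the start of R1 rather than R2, and raw barcodes still need correction against the whitelist].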

seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq

 

All the information needed is in this BAM file, so it can be used for secondary analysis:

possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information  

 


 

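Count the length of every sequence in a FASTA file

Goal: use the length distribution of transcripts in the human genome to assess the limits of different gene therapy vectors.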

 

# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/
cat GRCh38.transcripts.fa | seqkit  fx2tab -l | cut -f1,4 > transcript.len.txt
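
With transcript.len.txt in hand, the share of transcripts that fit under a given payload size can be summarized in one pass. A minimal sketch; fraction_within is a hypothetical helper, and the 4700 nt limit in the usage line is only an illustrative value (roughly the commonly quoted AAV packaging capacity), not a number from the original notes:

```shell
# Hypothetical helper: read "name <TAB> length" lines on stdin and report
# how many sequences are no longer than the given limit.
fraction_within() {
  local limit="$1"
  awk -F'\t' -v limit="$limit" '
    $2 ~ /^[0-9]+$/ { total++; if ($2 + 0 <= limit + 0) ok++ }
    END { if (total > 0) printf "%d/%d (%.1f%%)\n", ok, total, 100 * ok / total }
  '
}
```

Usage would look like: fraction_within 4700 < transcript.len.txt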

  

 

 

 

 

References:

Seqkit study notes: an all-purpose tool for processing fasta/fq files

Single-Library Analysis with cellranger count

 

posted @ 2021-10-18 18:14  Life·Intelligence