Single-cell lineage tracing with fluorescent proteins | GFP | YFP
2023-08-15 [tdTomato successfully detected]
cp best.td.cdx2.fasta tdTomato.fa
vi tdTomato.fa # change fasta name to tdTomato
cat tdTomato.fa | grep -v "^>" | tr -d "\n" | wc -c
echo -e 'tdTomato\tunknown\texon\t1\t7149\t.\t+\t.\tgene_id "tdTomato"; transcript_id "tdTomato"; gene_name "tdTomato"; gene_biotype "protein_coding";' > tdTomato.gtf
mkdir refdata-gex-mm10-2020-A-tdTomato
cp find_best_egfp/tdTomato.* ./refdata-gex-mm10-2020-A-tdTomato/
cat tdTomato.fa >> genome.fa
grep ">" genome.fa
cat tdTomato.gtf >> genes.gtf
tail genes.gtf
/home/zz950/softwares/cellranger-7.1.0/cellranger mkref --genome=mm10_genome_tdTomato \
  --fasta=genome.fa \
  --genes=genes.gtf
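The exon end coordinate in the GTF line has to equal the length of the sequence appended to genome.fa. Typing it by hand invites a mismatch, so here is a minimal sketch that derives it from the FASTA. It assumes a single-record FASTA; `make_marker_gtf` is a hypothetical helper name, not a Cell Ranger command.

```shell
# Hypothetical helper (not part of Cell Ranger): print a one-exon GTF record
# spanning the whole single-record FASTA in $1, using $2 as the contig/gene name.
make_marker_gtf() {
  fasta=$1
  name=$2
  # Sequence length = every non-header character, with line breaks removed.
  len=$(grep -v '^>' "$fasta" | tr -d '\r\n' | wc -c | tr -d ' ')
  printf '%s\tunknown\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' \
    "$name" "$len" "$name" "$name" "$name"
}
```

Usage would be something like `make_marker_gtf tdTomato.fa tdTomato > tdTomato.gtf`. The first column must match the FASTA header name exactly.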
2023-06-22
Basic steps
1. Find out which fluorescent protein the strain carries; in this case, it's https://www.jax.org/strain/008875
2. Search for the protein sequence, e.g. "enhanced green fluorescent protein sequence (EGFP) mouse", https://www.uniprot.org/uniprotkb/C5MKY7/entry#sequences
3. Run tblastn at NCBI, https://blast.ncbi.nlm.nih.gov/Blast.cgi#
4. Download all hits as FASTA to get the full DNA sequences, then remove duplicates
5. Build a STAR index: STAR --runThreadN 20 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles seqdump_noDup.fa --genomeSAindexNbases 7
6. Align the reads to find the best-matching sequence
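Step 4 (removing duplicates) can be done without extra tools. The sketch below keeps the first record for each distinct sequence; `dedup_fasta` and the input name `seqdump.fa` are assumptions for illustration. It uppercases sequences and only removes exact duplicates, so reverse complements and sub-sequences are kept.

```shell
# Hypothetical helper: keep the first FASTA record for each distinct sequence.
# Reads a (possibly multi-line) FASTA on stdin and writes each kept record
# as a header line followed by one sequence line.
dedup_fasta() {
  awk '
    function emit() {
      # Print the record only if this exact sequence has not been seen yet.
      if (!(seq in seen)) { seen[seq] = 1; print hdr; print seq }
    }
    /^>/ { if (hdr != "") emit(); hdr = $0; seq = ""; next }
         { gsub(/[ \t\r]/, ""); seq = seq toupper($0) }
    END  { if (hdr != "") emit() }
  '
}
```

For example: `dedup_fasta < seqdump.fa > seqdump_noDup.fa`.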
Cdx2-td and Cdx2-ApcKO scRNA-seq data that we recently generated (unsorted).
tdTomato fluorescent protein sequence
https://www.fpbase.org/protein/tdtomato/
EGFP protein sequence found
>tr|C5MKY7|C5MKY7_HCMV Enhanced green fluorescent protein OS=Human cytomegalovirus OX=10359 GN=egfp PE=1 SV=1
MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPT
LVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTL
VNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLA
DHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK
tdTomato protein sequence
>AAV52169.1 tandem-dimer red fluorescent protein [synthetic construct]
MVSKGEEVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQFMYGS
KAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTVTQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKK
TMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDTKLDITSHNEDYT
IVEQYERSEGRHHLFLGHGTGSTGSGSSGTASSEDNNMAVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRP
YEGTQTAKLKVTKGGPLPFAWDILSPQFMYGSKAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTV
TQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKKTMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEF
KTIYMAKKPVQLPGYYYVDTKLDITSHNEDYTIVEQYERSEGRHHLFLYGMDELYK
Tools
source activate splicing
module load samtools-1.9-gcc-5.4.0-jjq5nua
Build the reference
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir td_star_index --genomeFastaFiles seqdump-td_noDup.fa --genomeSAindexNbases 7
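The `--genomeSAindexNbases 7` value is needed because the reference is tiny. The STAR manual suggests scaling it as min(14, log2(GenomeLength)/2 - 1). A small sketch of that rule follows. Both function names are hypothetical, and the result may differ from the hand-picked 7 depending on the total length of the FASTA.

```shell
# Hypothetical helper: total number of bases in the FASTA file $1
# (all non-header characters, line breaks removed).
fasta_total_length() {
  grep -v '^>' "$1" | tr -d '\r\n' | wc -c | tr -d ' '
}

# Hypothetical helper: apply the STAR manual's rule
# min(14, log2(GenomeLength)/2 - 1) to the genome length given as $1.
sa_index_nbases_for_length() {
  awk -v len="$1" 'BEGIN {
    if (len < 1) exit 1                      # refuse empty or invalid lengths
    n = int(log(len) / log(2) / 2 - 1)       # log2 via natural logarithms
    if (n > 14) n = 14                       # STAR default is the upper bound
    if (n < 1) n = 1                         # keep the value positive
    print n
  }'
}
```

For example: `sa_index_nbases_for_length "$(fasta_total_length seqdump-td_noDup.fa)"`, then pass the printed number to `--genomeSAindexNbases`.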
Alignment
STAR --runThreadN 20 \
  --genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/egfp_star_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix egfp_slideD \
  --readFilesCommand gunzip -c \
  --readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz \
    LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz
STAR --runThreadN 20 \
  --genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix td_slideD \
  --readFilesCommand gunzip -c \
  --readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz \
    LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz
# CDX2 (same td_slideD prefix as above; if run in the same directory it overwrites the previous output)
STAR --runThreadN 20 \
  --genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix td_slideD \
  --readFilesCommand gunzip -c \
  --readFilesIn NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R1_001.fastq.gz NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R2_001.fastq.gz
Find the best-matching sequence
samtools view td_slideDAligned.sortedByCoord.out.bam | cut -f3 | sort | uniq -c
>MZ708019.1 Cloning vector pAAV-SynaptoTAG2, complete sequence CCTGCAGGCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTGCGGCCGCACGCGTCTGCAGAGGGCCCTGCGTATGAGTGCAAGTGGGTTTTAGGACCAGGATGAGGCGGGGTGGGGGTGCCTACCTGACGACCGACCCCGACCCACTGGACAAGCACCCAACCCCCATTCCCCAAATTGCGCATCCCCTATCAGAGAGGGGGAGGGGAAACAGGATGCGGCGAGGCGCGTGCGCACTGCCAGCTTCAGCACCGCGGACAGTGCCTTCGCCCCCGCCTGGCGGCGCGCGCCACCGCCGCCTCAGCACTGAAGGCGCGCTGACGTCACTCGCCGGTCCCCCGCAAACTCCCCTTCCCGGCCACCTTGGTCGCGTCCGCGCCGCCGCCGGCCCAGCCGGACCGCACCACGCGAGGCGCGAGATAGGGGGGCACGGGCGCGACCATCTGCGCTGCGGCGCCGGCGACTCAGCGCTGCCTCAGTCTGCGGTGGGCAGCGGAGGAGTCGTGTCGTGCCTGAGAGCGCAGTCGCCACCATGGGATCCACCGGTGCCACCATGGTCGAGATGGTGAGCAAGGGCGAGGAGGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGGGGCATGGCACCGGCAGCACCGGCAGCGGCAGCTCCGGCACCGCCTCCTCCGAGGACAACAACATGGCCGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATG
GCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGTACGGCATGGACGAGCTGTACAAGGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTCAATTCATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGGGTGGAGGTGGATCGGCTACCGCTGCCACCGTCCCGCCTGCCGCCCCGGCCGGCGAGGGTGGCCCCCCTGCACCTCCTCCAAATCTTACCAGTAACAGGAGACTGCAGCAGACCCAGGCCCAGGTGGATGAGGTGGTGGACATCATGAGGGTGAATGTGGACAAGGTCCTGGAGCGAGACCAGAAGCTATCGGAACTGGATGATCGCGCAGATGCCCTCCAGGCAGGGGCCTCCCAGTTTGAAACAAGTGCAGCCAAGCTCAAGCGCAAATACTGGTGGAAAAACCTCAAGATGATGATCATCTTGGGAGTGATTTGCGCCATCATCCTCATCATCATCATCGTTTACTTCAGCACTTAAGAATTGAGATCTGAATTCGATATCAAGCTTATCGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTGGTATTCTTAACTATGTTGCTCCTTTTACGCTATGTGGATACGCTGCTTTAATGCCTTTGTATCATGCTATTGCTTCCCGTATGGCTTTCATTTTCTCCTCCTTGTATAAATCCTGGTTGCTGTCTCTTTATGAGGAGTTGTGGCCCGTTGTCAGGCAACGTGGCGTGGTGTGCACTGTGTTTGCTGACGCAACCCCCACTGGTTGGGGCATTGCCACCACCTGTCAGCTCCTTTCCGGGACTTTCGCTTTCCCCCTCCCTATTGCCACGGCGGAACTCATCGCCGCCTGCCTTGCCCGCTGCTGGACAGGGGCTCGGCTGTTGGGCACTGACAATTCCGTGGTGTTGTCGGGGAAATCATCGTCCTTTCCTTGGCTGCTCGCCTGTGTTGCCACCTGGATTCTGCGCGGGACGTCCTTCTGCTACGTCCCTTCGGCCCTCAATCCAGCGGACCTTCCTTCCCGCGGCCTGCTGCCGGCTCTGCGGCCTCTTCCGCGTCTTCGCCTTCGCCCTCAGACGAGTCGGATCTCCCTTTGGGCCGCCTCCCCGCATCGATACCGAGCGCTGCTCGAGAGATCTACGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTCCTGGCC
CTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTCCTAATAAAATTAAGTTGCATCATTTTGTCTGACTAGGTGTCCTTCTATAATATTATGGGGTGGAGGGGGGTGGTATGGAGCAAGGGGCAAGTTGGGAAGACAACCTGTAGGGCCTGCGGGGTCTATTGGGAACCAAGCTGGAGTGCAGTGGCACAATCTTGGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAGTTGTTGGGATTCCAGGCATGCATGACCAGGCTCAGCTAATTTTTGTTTTTTTGGTAGAGACGGGGTTTCACCATATTGGCCAGGCTGGTCTCCAACTCCTAATCTCAGGTGATCTACCCACCTTGGCCTCCCAAATTGCTGGGATTACAGGCGTGAACCACTGCTCCCTTCCCTGTCCTTCTGATTTTGTAGGTAACCACGTGCGGACCGAGCGGCCGCAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGCTGCCTGCAGGGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATACGTCAAAGCAACCATAGTACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTTGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGGCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTTATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATG
TAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGT
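Picking the best candidate amounts to ranking the reference names by how many reads landed on each. The sketch below does that on SAM text from stdin; `rank_references` is a hypothetical helper name. Because the candidate sequences are nearly identical, many reads will multi-map, so whether to drop secondary alignments first is a judgment call.

```shell
# Hypothetical helper: read SAM records on stdin (e.g. from `samtools view`),
# count alignments per reference name (column 3), and print "count<TAB>name"
# sorted from the most to the fewest reads.
rank_references() {
  awk -F'\t' '
    /^@/      { next }          # skip SAM header lines if present
    $3 != "*" { n[$3]++ }       # "*" marks reads without a reference
    END       { for (r in n) printf "%d\t%s\n", n[r], r }
  ' | sort -k1,1nr -k2,2
}
```

For example: `samtools view -F 4 td_slideDAligned.sortedByCoord.out.bam | rank_references | head -n 5`.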
How to find candidate YFP sequences?
- NCBI
- Search for the protein sequence, then run tblastn
Remove redundancy after merging: Remove Duplicates from a Fasta File and manipulate names
The best match last time was this sequence:
>AB971579.1 Synthetic construct sfgfp gene for superfolder green fluorescent protein, complete cds ATGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGCGTGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCGTCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCGTTCAAGGACGACGGCACATACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTTAACAGCCACAACGTCTATATCACAGCCGACAAGCAGAAGAACGGCATCAAGGCAAACTTCAAGATCCGCCACAACGTTGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGTTCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCACGGCATGGACGAGCTGTACAAGTAA
Build the STAR reference [to be safe, screen the candidates once more]
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
mkdir GFP_YFP_index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir GFP_YFP_index --genomeFastaFiles GFP_YFP.fasta --genomeSAindexNbases 7
This index can be reused later when switching to other GFP or YFP variants.
Find the best DNA sequence
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
# Note: R1 and R2 are joined by a comma, so STAR treats them as two single-end files; separate them with a space for paired-end alignment.
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
  --genomeDir /home/lizhixin/references/GFP_YFP_tracing/GFP_YFP_index/GFP_YFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix test.best \
  --readFilesCommand gunzip -c \
  --readFilesIn NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R1_001.fastq.gz,NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R2_001.fastq.gz
Always search first: a common need like this already has an official solution.
I stumbled on it today by googling: cellranger add genes to reference
- Build a Custom Reference (cellranger mkref)
- Creating a Reference Package with cellranger mkref
- Add a marker gene to the FASTA and GTF [the key section to follow]
Extract the FASTA
See below
Build the GTF [GTF columns must be tab-separated; fix them by hand in Sublime Text]
YFP unknown exon 1 717 . + . gene_id "YFP"; gene_version "1"; transcript_id "YFP"; transcript_version "1"; exon_number "1"; gene_name "YFP"; gene_source "unknown"; gene_biotype "protein_coding"; transcript_name "YFP"; transcript_source "unknown"; transcript_biotype "protein_coding"; protein_id "YFP"; protein_version "1"; tag "basic"; transcript_support_level "5"
YFP unknown exon 1 717 . + . gene_id "YFP"; transcript_id "YFP"; gene_name "YFP"; gene_biotype "protein_coding";
Latest GFP
GFP unknown exon 1 1116 . + . gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
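Since hand-edited GTF lines easily end up with spaces where tabs belong, a quick check before running mkref can save a failed build. This is a minimal sketch; `check_gtf_tabs` is a hypothetical helper name, and it only checks the column count and the start/end coordinates, not the attribute syntax.

```shell
# Hypothetical helper: report GTF records in file $1 that would break
# tab-based parsing. Exit status 0 means every record looked fine.
check_gtf_tabs() {
  awk -F'\t' '
    /^#/ || /^$/ { next }     # skip comments and blank lines
    NF != 9 {
      printf "line %d: %d tab-separated fields (expected 9)\n", NR, NF
      bad = 1; next
    }
    $4 !~ /^[0-9]+$/ || $5 !~ /^[0-9]+$/ || $4 + 0 > $5 + 0 {
      printf "line %d: bad start/end %s-%s\n", NR, $4, $5
      bad = 1
    }
    END { exit bad }
  ' "$1"
}
```

For example: `check_gtf_tabs genes.gtf && echo "GTF columns look OK"`.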
mkref [preferably with the latest cellranger, currently 7.0]
~/softwares/cellranger-3.1.0/cellranger mkref --genome=mm10_GFP \
  --fasta=genome.fa \
  --genes=genes.gtf
~/softwares/cellranger-7.0.1/cellranger mkref --genome=mm10_GFP \
  --fasta=mm.genome.GFP.fa \
  --genes=mm.genes.GFP.gtf
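Most of the time, marker gene expression is enough to distinguish cell lineages, and the fluorescent protein mainly serves for validation by staining.
Sometimes, though, we need to detect the fluorescent protein mRNA to analyze these cells, as in this paper: Rainbow-Seq: Combining Cell Lineage Tracing with Single-Cell RNA Sequencing in Preimplantation Embryos
This is feasible. First collect candidate DNA sequences for the fluorescent protein by running BLAST with the protein sequence, then align our FASTQ reads to them to identify the actual fluorescent protein DNA reference.
Then add that DNA reference to the genome FASTA and to the GTF, and its expression can be quantified as usual.
Take the PHOX2B project as an example. It turned out to be a false premise: if PHOX2B could be detected, none of this would be needed. We tried it only because PHOX2B was undetectable, but then the fluorescent protein is undetectable too, and in practice we could not recover the fluorescent protein DNA reference from the FASTQ files. [Unless you are certain of your fluorescent protein DNA reference, the downstream steps cannot be done.]
Find the fluorescent protein DNA reference from the FASTQ files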
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l mem=100G
#PBS -l walltime=84:00:00
#PBS -q large
#PBS -N STAR
# q: medium_ext, legacy
cd "$PBS_O_WORKDIR"
# source activate /home/lizhixin/softwares/anaconda3/envs/splicing
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
  --genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix all/7Ala-D60-BO-2 \
  --readFilesCommand gunzip -c \
  --readFilesIn 7Ala-D60-BO-2-1_S25_L003_R1_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R1_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R1_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R1_001.fastq.gz \
    7Ala-D60-BO-2-1_S25_L003_R2_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R2_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R2_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R2_001.fastq.gz && \
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
  --genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix all/UE-D60-BO-2 \
  --readFilesCommand gunzip -c \
  --readFilesIn UE-D60-BO-2-1_S21_L003_R1_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R1_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R1_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R1_001.fastq.gz \
    UE-D60-BO-2-1_S21_L003_R2_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R2_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R2_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R2_001.fastq.gz
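Sometimes the raw sequences need custom processing. Writing your own Python script takes time and the script runs slowly; seqkit is a well-built tool for this.
2021-12-02
Another need: extract the unmapped reads from the BAM file produced by Cell Ranger, then align them to the YFP sequence to determine whether each cell is YFP-labeled.
Reference 1: interpreting the Cell Ranger BAM file: Barcoded BAM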
The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.
Reference 2: how to extract the unmapped reads: How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?
samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt
Summary:
Commands to extract the unmapped reads
samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam
samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam
samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam
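In these BAM files, CB is the error-corrected cell barcode tag; if it is absent, the raw CR tag can be used instead.
Next steps
- Convert the BAM to FASTQ, then align to the YFP reference
- Extract the cell barcodes of the aligned reads
BAM to FASTQ: bedtools bamtofastq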
bedtools bamtofastq -i unmapped.bam -fq unmapped.fq
Collect all YFP-related sequences by BLAST, then build a STAR index from them.
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
mkdir GFP_index
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta
Alignment
STAR --runThreadN 10 \
  --genomeDir YFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix unmap/PHOX2B \
  --readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq
Extract the aligned reads
Extract the corresponding cell barcodes
For example, take the 10x Genomics barcode as the first 16 bases of the FASTQ read [this was wrong; it is not that simple].
seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq
All the information needed is already in this BAM file, so it can be reanalyzed directly
possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information
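Because the barcode travels with each read as a BAM tag, counting reads per cell can be done on the SAM text itself. The sketch below prefers the corrected CB tag and falls back to the raw CR tag. `count_reads_per_barcode` is a hypothetical helper name, and it counts reads, not UMIs.

```shell
# Hypothetical helper: read SAM records on stdin and print "count<TAB>barcode",
# using the corrected CB:Z: tag when present and the raw CR:Z: tag otherwise.
count_reads_per_barcode() {
  awk -F'\t' '
    /^@/ { next }                          # skip SAM header lines
    {
      cb = ""; cr = ""
      for (i = 12; i <= NF; i++) {         # optional tags start at column 12
        if      ($i ~ /^CB:Z:/) cb = substr($i, 6)
        else if ($i ~ /^CR:Z:/) cr = substr($i, 6)
      }
      bc = (cb != "") ? cb : cr
      if (bc != "") n[bc]++                # reads without any barcode are ignored
    }
    END { for (b in n) printf "%d\t%s\n", n[b], b }
  ' | sort -k1,1nr -k2,2
}
```

One possible use, assuming the names of the reads that aligned to YFP were saved to a file (here called `yfp_read_names.txt`, an invented name) and that the installed samtools supports `view -N` (added in newer releases than the 1.9 module loaded above): `samtools view -N yfp_read_names.txt possorted_genome_bam.bam | count_reads_per_barcode`.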
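Compute the length of every sequence in a FASTA file
Purpose: use the length distribution of transcripts in the human genome to assess the size limits of different gene therapy vectors.
# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/
cat GRCh38.transcripts.fa | seqkit fx2tab -l | cut -f1,4 > transcript.len.txt
References: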
Single-Library Analysis with cellranger count