Single-cell lineage tracing with fluorescent proteins | GFP | YFP
August 15, 2023 [tdTomato successfully detected]
cp best.td.cdx2.fasta tdTomato.fa
vi tdTomato.fa  # change fasta name to tdTomato
cat tdTomato.fa | grep -v "^>" | tr -d "\n" | wc -c
echo -e 'tdTomato\tunknown\texon\t1\t7149\t.\t+\t.\tgene_id "tdTomato"; transcript_id "tdTomato"; gene_name "tdTomato"; gene_biotype "protein_coding";' > tdTomato.gtf
mkdir refdata-gex-mm10-2020-A-tdTomato
cp find_best_egfp/tdTomato.* ./refdata-gex-mm10-2020-A-tdTomato/
cat tdTomato.fa >> genome.fa
grep ">" genome.fa
cat tdTomato.gtf >> genes.gtf
tail genes.gtf
/home/zz950/softwares/cellranger-7.1.0/cellranger mkref --genome=mm10_genome_tdTomato \
  --fasta=genome.fa \
  --genes=genes.gtf
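The exon end coordinate in the GTF record (7149 above) has to equal the length of the sequence appended to the genome FASTA. A minimal sketch that derives the length from the FASTA and writes the record in one go, so the two cannot drift apart. The toy sequence and the temporary directory are assumptions for illustration; in practice the input would be the real `tdTomato.fa`.

```shell
# Toy stand-in for tdTomato.fa (hypothetical sequence, 16 bases in total).
workdir=$(mktemp -d)
printf '>tdTomato\nACGTACGTAC\nGGTTAA\n' > "$workdir/tdTomato.fa"

# Sequence length = all non-header characters with newlines removed.
seq_len=$(grep -v '^>' "$workdir/tdTomato.fa" | tr -d '\n' | wc -c | tr -d ' ')

# Write one tab-separated exon record spanning the whole contig.
printf '%s\tunknown\texon\t1\t%s\t.\t+\t.\tgene_id "%s"; transcript_id "%s"; gene_name "%s"; gene_biotype "protein_coding";\n' \
  tdTomato "$seq_len" tdTomato tdTomato tdTomato > "$workdir/tdTomato.gtf"

cat "$workdir/tdTomato.gtf"
```

Using `printf` with explicit `\t` escapes also guarantees real tab separators, which the GTF format requires.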
June 22, 2023
Basic steps
1. Find out which fluorescent protein the mouse line carries; in this case it is https://www.jax.org/strain/008875
2. Look up the protein sequence (search: "enhanced green fluorescent protein sequence (EGFP) mouse"), https://www.uniprot.org/uniprotkb/C5MKY7/entry#sequences
3. Run tblastn at NCBI, https://blast.ncbi.nlm.nih.gov/Blast.cgi#
4. Download all hits as FASTA to get the full DNA sequences, then remove duplicates
5. Build a STAR index: STAR --runThreadN 20 --runMode genomeGenerate --genomeDir star_index --genomeFastaFiles seqdump_noDup.fa --genomeSAindexNbases 7
6. Align the reads to find the best-matching sequence
Cdx2-td and Cdx2-ApcKO scRNA-seq data that we recently generated (unsorted).
tdTomato fluorescent protein sequence
https://www.fpbase.org/protein/tdtomato/
EGFP protein sequence found
>tr|C5MKY7|C5MKY7_HCMV Enhanced green fluorescent protein OS=Human cytomegalovirus OX=10359 GN=egfp PE=1 SV=1
MVSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPT
LVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTL
VNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLA
DHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITLGMDELYK
tdTomato protein sequence
>AAV52169.1 tandem-dimer red fluorescent protein [synthetic construct]
MVSKGEEVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRPYEGTQTAKLKVTKGGPLPFAWDILSPQFMYGS
KAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTVTQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKK
TMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEFKTIYMAKKPVQLPGYYYVDTKLDITSHNEDYT
IVEQYERSEGRHHLFLGHGTGSTGSGSSGTASSEDNNMAVIKEFMRFKVRMEGSMNGHEFEIEGEGEGRP
YEGTQTAKLKVTKGGPLPFAWDILSPQFMYGSKAYVKHPADIPDYKKLSFPEGFKWERVMNFEDGGLVTV
TQDSSLQDGTLIYKVKMRGTNFPPDGPVMQKKTMGWEASTERLYPRDGVLKGEIHQALKLKDGGHYLVEF
KTIYMAKKPVQLPGYYYVDTKLDITSHNEDYTIVEQYERSEGRHHLFLYGMDELYK
Tools
source activate splicing
module load samtools-1.9-gcc-5.4.0-jjq5nua
Build the reference
STAR --runThreadN 20 --runMode genomeGenerate --genomeDir td_star_index --genomeFastaFiles seqdump-td_noDup.fa --genomeSAindexNbases 7
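The `--genomeSAindexNbases 7` setting matters because this index is built from a handful of short candidate sequences rather than a full genome. The STAR manual recommends scaling the parameter down to min(14, log2(GenomeLength)/2 - 1) for small genomes. A minimal sketch of that calculation follows; `genome_len` is a hypothetical total length chosen for illustration, not the measured size of `seqdump-td_noDup.fa`.

```shell
# Hypothetical total number of bases across all candidate sequences.
genome_len=100000

# min(14, log2(genome_len) / 2 - 1), truncated to an integer.
sa_nbases=$(awk -v n="$genome_len" 'BEGIN {
  v = int(log(n) / log(2) / 2 - 1)
  if (v > 14) v = 14
  print v
}')

echo "--genomeSAindexNbases $sa_nbases"
```

For a real run, the total length can be obtained with the same `grep -v "^>" | tr -d "\n" | wc -c` pipeline used earlier in this post.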
Alignment
STAR --runThreadN 20 \
  --genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/egfp_star_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix egfp_slideD \
  --readFilesCommand gunzip -c \
  --readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz

STAR --runThreadN 20 \
  --genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix td_slideD \
  --readFilesCommand gunzip -c \
  --readFilesIn LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R1_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R1_001.fastq.gz LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L001_R2_001.fastq.gz,LIB058905_CRN00258603_VisiumFFPE4-D1-H1_S4_L002_R2_001.fastq.gz

# CDX2
# Note: this reuses the td_slideD prefix, so if it is run in the same directory it overwrites the output of the run above.
STAR --runThreadN 20 \
  --genomeDir /home/zz950/reference/10x_EGFP_index/find_best_egfp/td_star_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix td_slideD \
  --readFilesCommand gunzip -c \
  --readFilesIn NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R1_001.fastq.gz NS3_CKDL230008140-1A_H3TGMDSX7_S3_L002_R2_001.fastq.gz
Find the best-matching sequence
samtools view td_slideDAligned.sortedByCoord.out.bam | cut -f3 | sort | uniq -c
1 2 | >MZ708019.1 Cloning vector pAAV-SynaptoTAG2, complete sequence CCTGCAGGCAGCTGCGCGCTCGCTCGCTCACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGGCGACCTTTGGTCGCCCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTGGCCAACTCCATCACTAGGGGTTCCTGCGGCCGCACGCGTCTGCAGAGGGCCCTGCGTATGAGTGCAAGTGGGTTTTAGGACCAGGATGAGGCGGGGTGGGGGTGCCTACCTGACGACCGACCCCGACCCACTGGACAAGCACCCAACCCCCATTCCCCAAATTGCGCATCCCCTATCAGAGAGGGGGAGGGGAAACAGGATGCGGCGAGGCGCGTGCGCACTGCCAGCTTCAGCACCGCGGACAGTGCCTTCGCCCCCGCCTGGCGGCGCGCGCCACCGCCGCCTCAGCACTGAAGGCGCGCTGACGTCACTCGCCGGTCCCCCGCAAACTCCCCTTCCCGGCCACCTTGGTCGCGTCCGCGCCGCCGCCGGCCCAGCCGGACCGCACCACGCGAGGCGCGAGATAGGGGGGCACGGGCGCGACCATCTGCGCTGCGGCGCCGGCGACTCAGCGCTGCCTCAGTCTGCGGTGGGCAGCGGAGGAGTCGTGTCGTGCCTGAGAGCGCAGTCGCCACCATGGGATCCACCGGTGCCACCATGGTCGAGATGGTGAGCAAGGGCGAGGAGGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATCTACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGGGGCATGGCACCGGCAGCACCGGCAGCGGCAGCTCCGGCACCGCCTCCTCCGAGGACAACAACATGGCCGTCATCAAAGAGTTCATGCGCTTCAAGGTGCGCATGGAGGGCTCCATGAACGGCCACGAGTTCGAGATCGAGGGCGAGGGCGAGGGCCGCCCCTACGAGGGCACCCAGACCGCCAAGCTGAAGGTGACCAAGGGCGGCCCCCTGCCCTTCGCCTGGGACATCCTGTCCCCCCAGTTCATGTACGGCTCCAAGGCGTACGTGAAGCACCCCGCCGACATCCCCGATTACAAGAAGCTGTCCTTCCCCGAGGGCTTCAAGTGGGAGCGCGTGATGAACTTCGAGGACGGCGGTCTGGTGACCGTGACCCAGGACTCCTCCCTGCAGGACGGCACGCTGATCTACAAGGTGAAGATGCGCGGCACCAACTTCCCCCCCGACGGCCCCGTAATGCAGAAGAAGACCATGGGCTGGGAGGCCTCCACCGAGCGCCTGTACCCCCGCGACGGCGTGCTGAAGGGCGAGATCCACCAGGCCCTGAAGCTGAAGGACGGCGGCCACTACCTGGTGGAGTTCAAGACCATC
TACATGGCCAAGAAGCCCGTGCAACTGCCCGGCTACTACTACGTGGACACCAAGCTGGACATCACCTCCCACAACGAGGACTACACCATCGTGGAACAGTACGAGCGCTCCGAGGGCCGCCACCACCTGTTCCTGTACGGCATGGACGAGCTGTACAAGGGAAGCGGAGCTACTAACTTCAGCCTGCTGAAGCAGGCTGGAGACGTGGAGGAGAACCCTGGACCTCAATTCATGGTGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGTCCGGCGAGGGCGAGGGCGATGCCACCTACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCAGCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTTCTTCAAGGACGACGGCAACTACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTACAACAGCCACAACGTCTATATCATGGCCGACAAGCAGAAGAACGGCATCAAGGTGAACTTCAAGATCCGCCACAACATCGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGCCCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCTCGGCATGGACGAGCTGTACAAGGGTGGAGGTGGATCGGCTACCGCTGCCACCGTCCCGCCTGCCGCCCCGGCCGGCGAGGGTGGCCCCCCTGCACCTCCTCCAAATCTTACCAGTAACAGGAGACTGCAGCAGACCCAGGCCCAGGTGGATGAGGTGGTGGACATCATGAGGGTGAATGTGGACAAGGTCCTGGAGCGAGACCAGAAGCTATCGGAACTGGATGATCGCGCAGATGCCCTCCAGGCAGGGGCCTCCCAGTTTGAAACAAGTGCAGCCAAGCTCAAGCGCAAATACTGGTGGAAAAACCTCAAGATGATGATCATCTTGGGAGTGATTTGCGCCATCATCCTCATCATCATCATCGTTTACTTCAGCACTTAAGAATTGAGATCTGAATTCGATATCAAGCTTATCGATAATCAACCTCTGGATTACAAAATTTGTGAAAGATTGACTGGTATTCTTAACTATGTTGCTCCTTTTACGCTATGTGGATACGCTGCTTTAATGCCTTTGTATCATGCTATTGCTTCCCGTATGGCTTTCATTTTCTCCTCCTTGTATAAATCCTGGTTGCTGTCTCTTTATGAGGAGTTGTGGCCCGTTGTCAGGCAACGTGGCGTGGTGTGCACTGTGTTTGCTGACGCAACCCCCACTGGTTGGGGCATTGCCACCACCTGTCAGCTCCTTTCCGGGACTTTCGCTTTCCCCCTCCCTATTGCCACGGCGGAACTCATCGCCGCCTGCCTTGCCCGCTGCTGGACAGGGGCTCGGCTGTTGGGCACTGACAATTCCGTGGTGTTGTCGGGGAAATCATCGTCCTTTCCTTGGCTGCTCGCCTGTGTTGCCACCTGGATTCTGCGCGGGACGTCCTTCTGCTACGTCCCTTCGGCCCTCAATCCAGCGGACCTTCCTTCCCGCGGCCTGCTGCCGGCTCTGCGGCCTCTTCCGCGTCTTCGCCTTCGCCCTCAGACGAGTCGGATCTCCCTTTGGGCCGCCTCCCCGCATCGATACCGAGCGCTGCTCGAGAGATCTACGGGTGGCATCCCTGTGACCCCTCCCCAGTGCCTCTC
CTGGCCCTGGAAGTTGCCACTCCAGTGCCCACCAGCCTTGTCCTAATAAAATTAAGTTGCATCATTTTGTCTGACTAGGTGTCCTTCTATAATATTATGGGGTGGAGGGGGGTGGTATGGAGCAAGGGGCAAGTTGGGAAGACAACCTGTAGGGCCTGCGGGGTCTATTGGGAACCAAGCTGGAGTGCAGTGGCACAATCTTGGCTCACTGCAATCTCCGCCTCCTGGGTTCAAGCGATTCTCCTGCCTCAGCCTCCCGAGTTGTTGGGATTCCAGGCATGCATGACCAGGCTCAGCTAATTTTTGTTTTTTTGGTAGAGACGGGGTTTCACCATATTGGCCAGGCTGGTCTCCAACTCCTAATCTCAGGTGATCTACCCACCTTGGCCTCCCAAATTGCTGGGATTACAGGCGTGAACCACTGCTCCCTTCCCTGTCCTTCTGATTTTGTAGGTAACCACGTGCGGACCGAGCGGCCGCAGGAACCCCTAGTGATGGAGTTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCCCGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGCTGCCTGCAGGGGCGCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATACGTCAAAGCAACCATAGTACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTTGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGGCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTTATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGACGTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGG
ATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGT |
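The `cut -f3 | sort | uniq -c` tally used to pick this reference can be extended so that the most-hit candidate comes out first. A minimal sketch on toy SAM text (the three records and their read names are made up for illustration); in practice the input would be piped from `samtools view` on the STAR output BAM.

```shell
workdir=$(mktemp -d)

# Toy SAM records (tab-separated); column 3 is the reference name (RNAME).
printf 'r1\t0\tMZ708019.1\t10\t255\t4M\t*\t0\t0\tACGT\tIIII\n'   > "$workdir/toy.sam"
printf 'r2\t0\tAB971579.1\t5\t255\t4M\t*\t0\t0\tACGT\tIIII\n'   >> "$workdir/toy.sam"
printf 'r3\t16\tMZ708019.1\t20\t255\t4M\t*\t0\t0\tACGT\tIIII\n' >> "$workdir/toy.sam"

# Count reads per reference and sort by count, highest first.
cut -f3 "$workdir/toy.sam" | sort | uniq -c | sort -k1,1nr > "$workdir/ref_counts.txt"

# The first line now names the candidate with the most aligned reads.
best_ref=$(awk 'NR == 1 { print $2 }' "$workdir/ref_counts.txt")
echo "$best_ref"
```

Raw read counts favor long constructs that contain extra elements, so it is worth checking where on the winning sequence the reads actually pile up before trusting it.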
How to find candidate YFP sequences?
- NCBI
- Search for the protein sequence, then run tblastn
Remove the redundancy left after merging: Remove Duplicates from a Fasta File and manipulate names
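Merged BLAST downloads often contain the same sequence under several accessions. A minimal awk sketch of sequence-level de-duplication that keeps the first record for each distinct sequence; the file names and toy records are assumptions for illustration. seqkit offers the same operation as `seqkit rmdup -s`.

```shell
workdir=$(mktemp -d)

# Toy merged FASTA: records a and c carry the same sequence (ACGTAC).
printf '>a\nACGT\nAC\n>b\nTTTT\n>c\nACGTAC\n' > "$workdir/seqdump.fa"

# Join wrapped sequence lines per record, then print a record only the
# first time its sequence is seen.
awk '
  /^>/ { if (name != "") emit(); name = $0; seq = ""; next }
       { seq = seq $0 }
  END  { if (name != "") emit() }
  function emit() {
    if (!(seq in seen)) { seen[seq] = 1; print name; print seq }
  }
' "$workdir/seqdump.fa" > "$workdir/seqdump_noDup.fa"

cat "$workdir/seqdump_noDup.fa"
```

This sketch compares sequences exactly as written, so entries that differ only in letter case or strand are not treated as duplicates.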
The best match last time was this sequence:
>AB971579.1 Synthetic construct sfgfp gene for superfolder green fluorescent protein, complete cds
ATGAGCAAGGGCGAGGAGCTGTTCACCGGGGTGGTGCCCATCCTGGTCGAGCTGGACGGCGACGTAAACGGCCACAAGTTCAGCGTGCGTGGCGAGGGCGAGGGCGATGCCACCAACGGCAAGCTGACCCTGAAGTTCATCTGCACCACCGGCAAGCTGCCCGTGCCCTGGCCCACCCTCGTGACCACCCTGACCTACGGCGTGCAGTGCTTCAGCCGCTACCCCGACCACATGAAGCGTCACGACTTCTTCAAGTCCGCCATGCCCGAAGGCTACGTCCAGGAGCGCACCATCTCGTTCAAGGACGACGGCACATACAAGACCCGCGCCGAGGTGAAGTTCGAGGGCGACACCCTGGTGAACCGCATCGAGCTGAAGGGCATCGACTTCAAGGAGGACGGCAACATCCTGGGGCACAAGCTGGAGTACAACTTTAACAGCCACAACGTCTATATCACAGCCGACAAGCAGAAGAACGGCATCAAGGCAAACTTCAAGATCCGCCACAACGTTGAGGACGGCAGCGTGCAGCTCGCCGACCACTACCAGCAGAACACCCCCATCGGCGACGGCCCCGTGCTGCTGCCCGACAACCACTACCTGAGCACCCAGTCCGTTCTGAGCAAAGACCCCAACGAGAAGCGCGATCACATGGTCCTGCTGGAGTTCGTGACCGCCGCCGGGATCACTCACGGCATGGACGAGCTGTACAAGTAA
Build the STAR reference [to be safe, screen the candidates once more]
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
mkdir GFP_YFP_index
STAR --runThreadN 8 --runMode genomeGenerate --genomeDir GFP_YFP_index --genomeFastaFiles GFP_YFP.fasta --genomeSAindexNbases 7
This index can be reused later when working with other GFP or YFP variants.
Find the best DNA sequence
source activate /home/lizhixin/softwares/anaconda3/envs/splicing

# Note: R1 and R2 are joined by a comma here, so STAR reads them as single-end files rather than as mate pairs.
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
  --genomeDir /home/lizhixin/references/GFP_YFP_tracing/GFP_YFP_index/GFP_YFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix test.best \
  --readFilesCommand gunzip -c \
  --readFilesIn NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R1_001.fastq.gz,NganE_10XSpatialRNASO_CPOS-220608-CWL-12975a/primary_seq/EVSE220525-VclCN-A1-1_S1_L002_R2_001.fastq.gz
Always search first: a common need like this already has an official solution.
I only came across it today by Googling: cellranger add genes to reference
- Build a Custom Reference (cellranger mkref)
- Creating a Reference Package with cellranger mkref
- Add a marker gene to the FASTA and GTF [this is the key section to follow]
Extract the FASTA
See below.
Build the GTF [GTF columns must be tab-separated; fix them by hand in Sublime Text if needed]
YFP	unknown	exon	1	717	.	+	.	gene_id "YFP"; gene_version "1"; transcript_id "YFP"; transcript_version "1"; exon_number "1"; gene_name "YFP"; gene_source "unknown"; gene_biotype "protein_coding"; transcript_name "YFP"; transcript_source "unknown"; transcript_biotype "protein_coding"; protein_id "YFP"; protein_version "1"; tag "basic"; transcript_support_level "5"

YFP	unknown	exon	1	717	.	+	.	gene_id "YFP"; transcript_id "YFP"; gene_name "YFP"; gene_biotype "protein_coding";
Latest GFP
GFP	unknown	exon	1	1116	.	+	.	gene_id "GFP"; transcript_id "GFP"; gene_name "GFP"; gene_biotype "protein_coding";
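Because a GTF record pasted from a web page or an editor can silently end up with spaces instead of tabs, a quick field-count check before running `cellranger mkref` can save a failed build. A minimal sketch, with a hypothetical `check.gtf` holding one well-formed and one space-separated record:

```shell
workdir=$(mktemp -d)

# Line 1 uses real tabs; line 2 uses spaces and is therefore malformed.
printf 'GFP\tunknown\texon\t1\t1116\t.\t+\t.\tgene_id "GFP"; transcript_id "GFP";\n' > "$workdir/check.gtf"
printf 'YFP unknown exon 1 717 . + . gene_id "YFP"; transcript_id "YFP";\n'        >> "$workdir/check.gtf"

# Report every non-comment line that does not have exactly 9 tab-separated fields.
bad_lines=$(awk -F'\t' '!/^#/ && NF != 9 { print NR }' "$workdir/check.gtf")
echo "malformed GTF lines: $bad_lines"
```

An empty `bad_lines` means every record has the nine columns the format expects.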
mkref [preferably with the latest Cell Ranger, which was 7.0 at the time of writing]
~/softwares/cellranger-3.1.0/cellranger mkref --genome=mm10_GFP \
  --fasta=genome.fa \
  --genes=genes.gtf

~/softwares/cellranger-7.0.1/cellranger mkref --genome=mm10_GFP \
  --fasta=mm.genome.GFP.fa \
  --genes=mm.genes.GFP.gtf
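Most of the time, marker gene expression is enough to distinguish cell lineages, and the fluorescent protein mainly serves for validation by staining.
Sometimes we also need to detect the fluorescent protein's mRNA to analyze these cells, as in this paper: Rainbow-Seq: Combining Cell Lineage Tracing with Single-Cell RNA Sequencing in Preimplantation Embryos
This is feasible: first collect candidate DNA sequences for the fluorescent protein by BLASTing with the protein sequence, then align our FASTQ reads against the candidates to identify the DNA reference that is actually present.
Then add that DNA reference to the genome FASTA and to the GTF, and its expression can be quantified as usual.
Take the PHOX2B project as a counterexample, where this turned out to be a false premise. If PHOX2B had been detectable, none of this would have been needed. We tried it because PHOX2B could not be detected, but in that case the fluorescent protein is not detected either, and in practice we could not recover the fluorescent protein DNA reference from the FASTQ files. [Unless you are very sure of your fluorescent protein DNA reference, the downstream steps cannot be done.]
Find the fluorescent protein DNA reference from the FASTQ files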
#!/bin/bash
#PBS -l nodes=1:ppn=12
#PBS -l mem=100G
#PBS -l walltime=84:00:00
#PBS -q large
#PBS -N STAR

# q: medium_ext, legacy

cd $PBS_O_WORKDIR

# source activate /home/lizhixin/softwares/anaconda3/envs/splicing

# First sample; the second run starts only if this one succeeds.
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
  --genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix all/7Ala-D60-BO-2 \
  --readFilesCommand gunzip -c \
  --readFilesIn 7Ala-D60-BO-2-1_S25_L003_R1_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R1_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R1_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R1_001.fastq.gz 7Ala-D60-BO-2-1_S25_L003_R2_001.fastq.gz,7Ala-D60-BO-2-2_S26_L003_R2_001.fastq.gz,7Ala-D60-BO-2-3_S27_L003_R2_001.fastq.gz,7Ala-D60-BO-2-4_S28_L003_R2_001.fastq.gz &&\

# Second sample; the trailing "&&\" of the original was dropped because nothing follows it.
/home/lizhixin/softwares/anaconda3/envs/splicing/bin/STAR --runThreadN 10 \
  --genomeDir /home/lizhixin/project/scRNA-seq/lineageTracing/reference/GFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix all/UE-D60-BO-2 \
  --readFilesCommand gunzip -c \
  --readFilesIn UE-D60-BO-2-1_S21_L003_R1_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R1_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R1_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R1_001.fastq.gz UE-D60-BO-2-1_S21_L003_R2_001.fastq.gz,UE-D60-BO-2-2_S22_L003_R2_001.fastq.gz,UE-D60-BO-2-3_S23_L003_R2_001.fastq.gz,UE-D60-BO-2-4_S24_L003_R2_001.fastq.gz
Sometimes the raw sequences need custom processing. Writing your own Python script takes time and the result runs slowly; seqkit is a well-built tool for this kind of work.
December 2, 2021
Another requirement: extract the unaligned reads from the BAM file produced by Cell Ranger, realign them to the YFP sequence, and determine whether each cell is labeled with YFP.
Reference 1: interpreting the Cell Ranger BAM file: Barcoded BAM
The cellranger pipeline outputs an indexed BAM file containing position-sorted reads aligned to the genome and transcriptome, as well as unaligned reads. Each read in this BAM file has Chromium cellular and molecular barcode information attached.
Reference 2: how to extract the unmapped reads: How do I identify the unmapped reads in my Cell Ranger or Long Ranger output?
samtools view -f 4 phased_possorted_bam.bam | cut -f1 > unmapped_reads.txt
Summary:
Commands to extract the unmapped reads
samtools view -b -f 4 7Ala-D60-BO-2.bam > 7Ala-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-2.bam > UE-D60-BO-2.unmapped.bam
samtools view -b -f 4 UE-D60-BO-3.bam > UE-D60-BO-3.unmapped.bam
samtools view -b -f 4 7Ala-D60-BO-3.bam > 7Ala-D60-BO-3.unmapped.bam

samtools merge unmapped.bam 7Ala-D60-BO-2.unmapped.bam UE-D60-BO-2.unmapped.bam UE-D60-BO-3.unmapped.bam 7Ala-D60-BO-3.unmapped.bam
In the BAM tags, CB is the error-corrected cell barcode; if it is absent, the raw barcode tag CR can be used instead.
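The CB-then-CR fallback can be sketched with awk on SAM text. This assumes the records still carry the 10x tags, as they do when read straight from the Cell Ranger BAM with `samtools view`. The three toy records and the `YFP` reference name are made up for illustration.

```shell
workdir=$(mktemp -d)

# Toy SAM records: r1 and r3 have a corrected barcode (CB), r2 only a raw one (CR).
printf 'r1\t0\tYFP\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\tCR:Z:AAAC\tCB:Z:AAAC-1\n'  > "$workdir/hits.sam"
printf 'r2\t0\tYFP\t9\t255\t4M\t*\t0\t0\tACGT\tIIII\tCR:Z:AAAG\n'              >> "$workdir/hits.sam"
printf 'r3\t0\tYFP\t3\t255\t4M\t*\t0\t0\tACGT\tIIII\tCR:Z:AAAC\tCB:Z:AAAC-1\n' >> "$workdir/hits.sam"

# Optional tags start at column 12. Prefer CB, fall back to CR, count reads per barcode.
awk -F'\t' '{
  cb = ""; cr = ""
  for (i = 12; i <= NF; i++) {
    if ($i ~ /^CB:Z:/) cb = substr($i, 6)
    if ($i ~ /^CR:Z:/) cr = substr($i, 6)
  }
  bc = (cb != "") ? cb : cr
  if (bc != "") count[bc]++
}
END { for (bc in count) print bc "\t" count[bc] }' "$workdir/hits.sam" | sort > "$workdir/barcode_counts.txt"

cat "$workdir/barcode_counts.txt"
```

Raw CR barcodes lack the `-1` suffix and may contain sequencing errors, so counts keyed on CR do not map one-to-one onto the barcodes in the filtered matrix.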
Next steps
- Convert the BAM to FASTQ, then align to the YFP reference
- Extract the cell barcodes of the aligned reads
BAM to FASTQ: bedtools bamtofastq
bedtools bamtofastq -i unmapped.bam -fq unmapped.fq
Use BLAST to collect all YFP-related sequences, then build an index with STAR.
source activate /home/lizhixin/softwares/anaconda3/envs/splicing
mkdir GFP_index
STAR --runThreadN 6 --runMode genomeGenerate --genomeDir GFP_index --genomeFastaFiles GFP.fasta
Alignment
STAR --runThreadN 10 \
  --genomeDir YFP_index \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix unmap/PHOX2B \
  --readFilesIn /home/lizhixin/project/scRNA-seq/rawData/10x/7Ala/unmapped.fq
Extract the aligned reads
Extract the corresponding cell barcodes
For example, extract the 10x Genomics barcode as the first 16 bases of each FASTQ read [this was a mistake; it is not that simple].
seqkit subseq Vcl-YFP-CNCC_3_S35_L004_R2_001.fastq.gz -r 1:16 > tmp.fastq
All the required information is already in this BAM file, which can be used for secondary analysis:
possorted_genome_bam.bam BAM file containing both unaligned reads and reads aligned to the genome and transcriptome annotated with barcode information
Compute the length of every sequence in a FASTA file
Purpose: use the length distribution of transcripts in the human genome to assess the size limits of different gene therapy vectors.
# ~/databases/ensembl.release-90/homo_sapiens/RSEM/bowtie2/
cat GRCh38.transcripts.fa | seqkit fx2tab -l | cut -f1,4 > transcript.len.txt
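Where seqkit is not installed, the same per-sequence length table can be approximated with awk. A minimal sketch on a toy FASTA (the file name and records are assumptions for illustration); unlike the seqkit command above, it prints only the sequence ID, meaning the header text up to the first space.

```shell
workdir=$(mktemp -d)

# Toy transcript FASTA; tx1 is wrapped over two lines (4 + 2 bases).
printf '>tx1 gene=A\nACGT\nAC\n>tx2\nTTT\n' > "$workdir/toy.transcripts.fa"

# Sum line lengths per record and print "ID<TAB>length" when the record ends.
awk '
  /^>/ { if (id != "") print id "\t" len; id = substr($1, 2); len = 0; next }
       { len += length($0) }
  END  { if (id != "") print id "\t" len }
' "$workdir/toy.transcripts.fa" > "$workdir/transcript.len.txt"

cat "$workdir/transcript.len.txt"
```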
Reference:
Single-Library Analysis with cellranger count