AlphaPulldown | in silico co-IP | 蛋白质相互作用预测 | 人工智能 | alphafold
代码:
- https://github.com/KosinskiLab/AlphaPulldown
- https://github.com/google-deepmind/alphafold
生物专业入坑教程:Alphafold的原理和展望 钟博子韬 | 钰沐菡 公益公开课
2023年12月09日
我就想把示例pipeline跑完,看个效果,结果第二步继续报错。
知道代码哪里出错了以及大致出错原因,就是矩阵相乘shape不匹配,但没有能力去修改代码。
先弃坑,等下个版本更新了再来,现在github上已经快200个issue了,工具还很不成熟,没时间debug了。
module load gcc/9.2.0 module spider cuda/11.7 python3 /home/zhl595/softwares/AlphaPulldown/alphapulldown/create_individual_features.py \ --fasta_paths=baits.fasta \ --data_dir=/n/shared_db/alphafold-2.3 \ --output_dir=test \ --skip_existing=True \ --use_mmseqs2=True \ --max_template_date=2050-01-01 /home/zhl595/miniconda3/envs/AlphaPulldown/bin/run_multimer_jobs.py --mode=pulldown \ --num_cycle=3 \ --num_predictions_per_model=1 \ --output_path=test2 \ --data_dir=/n/shared_db/alphafold-2.3 \ --protein_lists=baits.txt,candidates_shorter.txt \ --monomer_objects_dir=test
# error
File "/home/zhl595/miniconda3/envs/AlphaPulldown/lib/python3.10/site-packages/alphafold/data/feature_processing.py", line 194, in _make_msa_mask
np_example['msa_mask'] *= seq_mask[None]
ValueError: operands could not be broadcast together with shapes (4095,1276) (1,1314) (4095,1276)
选择mmseqs2会快很多
运行报错
# /n/shared_db/alphafold-2.3 # /n/shared_db/alphafold python3 /home/zhl595/softwares/AlphaPulldown/alphapulldown/create_individual_features.py \ --fasta_paths=example_1_sequences_shorter.fasta \ --data_dir=/n/shared_db/alphafold-2.3 \ --output_dir=test \ --skip_existing=True \ --use_mmseqs2=True \ --max_template_date=2050-01-01
# step Launching sub-process ['/home/zhl595/miniconda3/envs/AlphaPulldown/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/tmp/tmpcn9fkn18/output.sto', '/tmp/tmpcn9fkn18/query.hmm', '/n/shared_db/alphafold-2.3/pdb_seqres/pdb_seqres.txt'] # error (pdb_id, chain_id, min(mapping.values()) + mapping_offset, ValueError: min() arg is an empty sequence
没比对上的mapping的values是个空字典,用了min函数就报错了。
找到出错的alphafold脚本,把报错的代码给注释了。
流程里面用到的软件:
jackhmmer - iteratively search sequence(s) against a protein database
hmmbuild - build a profile HMM from an alignment
hmmsearch - search a sequence database with a profile HMM
hhblits - sensitive protein sequence searching based on the pairwise alignment of hidden Markov models (HMMs)
AlphaFold数据库【必须搞清楚每一个数据库大致是干什么用得,否则你都不知道程序每一步在干什么】
alphafold_database/ # Total: ~ 2.2 TB (download: 438 GB) bfd/ # ~ 1.7 TB (download: 271.6 GB) # 6 files. mgnify/ # ~ 64 GB (download: 32.9 GB) mgy_clusters_2018_12.fa params/ # ~ 3.5 GB (download: 3.5 GB) # 5 CASP14 models, # 5 pTM models, # 5 AlphaFold-Multimer models, # LICENSE, # = 16 files. pdb70/ # ~ 56 GB (download: 19.5 GB) # 9 files. pdb_mmcif/ # ~ 206 GB (download: 46 GB) mmcif_files/ # About 180,000 .cif files. obsolete.dat pdb_seqres/ # ~ 0.2 GB (download: 0.2 GB) pdb_seqres.txt small_bfd/ # ~ 17 GB (download: 9.6 GB) bfd-first_non_consensus_sequences.fasta uniclust30/ # ~ 86 GB (download: 24.9 GB) uniclust30_2018_08/ # 13 files. uniprot/ # ~ 98.3 GB (download: 49 GB) uniprot.fasta uniref90/ # ~ 58 GB (download: 29.7 GB) uniref90.fasta
UniProt蛋白质数据库简介 - 2019
uniref90.fasta里有1.5亿条蛋白序列,可以理解为阈值为90%的全物种蛋白序列。
蛋白质序列参考集UniRef分为三个数据集(Sequence Cluster),分别为UniRef100、UniRef90和UniRef50。
UniRef三个数据集的构建采用一定算法,分三步进行。第一步是把不同物种中长度不小于11个氨基酸的相同序列和序列片段合并在一起,得到UniRef100数据集。第二步是按相同位点所占序列全长比例90%为阈值,将UniRef100数据集中高度相似序列合并在一起,产生UniRef90数据集。第三步则是按相同位点所占序列全长比例50%为阈值,将UniRef90数据集中具有一定相似性的序列合并在一起,所得数据集即UniRef50。
cat /n/shared_db/alphafold-2.3/uniref90/uniref90.fasta | grep '>' | wc -l 153742194
MGnify(以前称为EBI Metagenomics)里面是微生物的蛋白质无冗余序列。6亿条蛋白序列。
$ cat /n/shared_db/alphafold-2.3/mgnify/mgy_clusters_2022_05.fa | grep '>' | wc -l 623796864
pdb_seqres,仅有70万条蛋白序列,已经用实验方法解析的蛋白序列。可以在https://www.rcsb.org/structure/101M数据库里查到结构信息。
The sequences for all currently released experimental structures are contained in pdb_seqres.txt, available in uncompressed form at ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt and in Unix compressed (".Z") format at ftp://ftp.rcsb.org/pub/pdb/derived_data/pdb_seqres.txt.Z.
grep '>' /n/shared_db/alphafold-2.3/pdb_seqres/pdb_seqres.txt | wc -l 717700
>101m_A mol:protein length:154 MYOGLOBIN MVLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRVKHLKTEAEMKASEDLKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHPGNFGADAQGAMNKALELFRKDIAAKYKELGYQG
pdb_mmcif,已知结构蛋白的结构文件,mmCIF is a flexible and extensible tag-value format for representing macromolecular structural data.
由hhblits构建的数据和索引文件,其中ffdata文件是可以打开浏览的,相当于多重比对后的数据库。
多序列比对(Multiple Sequence Alignment,MSA)是对多个生物序列进行对齐的过程,以揭示它们之间的共同模式和结构。在生物信息学中,有多种文件格式用于存储多序列比对的结果,其中包括Stockholm (.sto) 和 A3M (.a3m) 格式。 注释信息: Stockholm文件通常包含更多的注释信息,提供关于序列和比对的额外信息。相比之下,A3M文件主要关注于序列本身。 应用领域: Stockholm文件广泛用于多种生物信息学工具,如HMMER。A3M文件通常用于蛋白质结构预测和深度学习模型的训练,如Alphafold。
参考:HMMER学习笔记
$ lt /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03* -rw-r-xr-x 1 at237 rccg 6.8G Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03_cs219.ffdata -rw-r-xr-x 1 at237 rccg 684M Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03_cs219.ffindex -rw-r-xr-x 1 at237 rccg 160G Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03_a3m.ffdata -rw-r-xr-x 1 at237 rccg 758M Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03_a3m.ffindex -rw-r-xr-x 1 at237 rccg 22M Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03_hhm.ffindex -rw-r-xr-x 1 at237 rccg 38G Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03_hhm.ffdata -rw-r-xr-x 1 at237 rccg 379 Jul 18 2021 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03.md5sums -rw-rwxr-x 1 at237 rccg 53G Nov 24 2022 /n/shared_db/alphafold-2.3/uniref30/UniRef30_2021_03.tar.gz
BFD数据库,最主要的数据库,一人独占1.8T的空间。是啥呢?序列按相似性聚类的结果。
BFD was created by clustering 2.5 billion protein sequences from Uniprot/TrEMBL+Swissprot, Metaclust and Soil Reference Catalog Marine Eukaryotic Reference Catalog assembled by Plass. We clustered sequences that could be aligned to a longer sequence with 90% of their residues and a sequence identity of 30% using Linclust/MMseqs2 --cov-mode 1 --min-seq-id 0.3. We removed all clusters with less than three sequences and turned the database into an HH-suite3 database using the Uniclust pipeline.
配套工具:MMseqs2 (Many-against-Many sequence searching) is a software suite to search and cluster huge protein and nucleotide sequence sets.
$ lt alphafold-2.3/bfd/ total 1.8T -rw-r-xr-x 1 rp189 rccg 1.7G Mar 2 2019 bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffindex -rw-r-xr-x 1 rp189 rccg 124M Mar 2 2019 bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffindex -rw-r-xr-x 1 rp189 rccg 305G Mar 2 2019 bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_hhm.ffdata -rw-r-xr-x 1 rp189 rccg 1.5T Mar 3 2019 bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_a3m.ffdata -rwxr-xr-x 1 rp189 rccg 16G Mar 4 2019 bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffdata -rwxr-xr-x 1 rp189 rccg 1.6G Mar 4 2019 bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt_cs219.ffindex
PDB70,所有能比对到已知结构蛋白的未知蛋白。
Full chains of known 3D structures:
PDB70: PSI-BLAST alignments produced with sequences of PDB full chain representativies (<70% sequence identity) as queries
Uniclust30
The Uniclust90, Uniclust50, Uniclust30 databases cluster UniProtKB sequences at the level of 90%, 50% and 30% pairwise sequence identity.
原理:
AlphaFold Multimer is an extension of AlphaFold2 that has been specifically built to predict protein-protein complexes.
Despite near-experimental accuracy on single-chain predictions, there is still scope for improvement among multimeric predictions. Methods like AlphaFold-Multimer and FoldDock can accurately model dimers.
Inspired by pull-down assays, one can specify one or more proteins as "bait" and another list of proteins as "candidates". Then the programme will use AlphafoldMultimerV2 to predict interactions between baits (as in example_data/baits.txt) and candidates (as in example_data/candidates.txt).
可以做in silico的co-IP,试一下效果。
测试一下SOX9和JARID2
2023年12月07日
新版数据库有问题,直接下载了是不能用的,需要构建索引。
ln -s UniRef30_2021_03_a3m.ffdata UniRef30_2021_03_a3m_db ln -s UniRef30_2021_03_hhm.ffdata UniRef30_2021_03_hhm_db ln -s pdb70_a3m.ffdata pdb70_a3m_db ln -s pdb70_hhm.ffdata pdb70_hhm_db
自己一个人用,下载几个T的数据库有点太浪费资源了,存储本来就不够用。
HMS o2 alphafold,另一个服务器上已经下载好了,有2018盒2021两个版本。
GPU node的申请,以及cuda环境的配置非常重要。
# prepare cuda env module load gcc/9.2.0 module spider cuda/11.7 # module load alphafold/2.2.0 # don't do thies python3 -m pip install tensorflow[and-cuda] # pip install tensorflow-gpu==2.8.0 # Verify the installation: python3 -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))" # python3 -c "import tensorflow as tf; print(tf.test.gpu_device_name())"
2023年12月06日
alphafold的默认下载方式有点bug,用的是aria2c这个工具,下载2-3个TB的数据经常中断,且没有断点续传功能;
但其实下载链接都在script里,你用wget也能下载,稳定可控。
以为一切都下载完成了,结果运行时直接报错。
stderr: - 14:42:02.531 ERROR: Could find neither hhm_db nor a3m_db!
一查,发现是pdb70里面缺失了一些文件。
真的是何必的,alphafold你提供全部的下载链接,我wget -bi file直接就搞定了,到头来我还得一个一个排错。
案例数据
git clone https://github.com/KosinskiLab/AlphaPulldown.git
conda activate AlphaPulldown cd softwares/AlphaPulldown/example_data/
create_individual_features.py \ --fasta_paths=baits.fasta,example_1_sequences.fasta \ --data_dir=/home/zz950/softwares/alphafold/DB \ --save_msa_files=False \ --output_dir=test \ --max_template_date=2050-01-01 \ --use_precomputed_msas=False \ --skip_existing=False
各种疑难杂症,明明装好了module,Python里可以import,但运行就是显示ModuleNotFoundError: No module named。
折腾了一会儿,直接用了绝对路径来执行这个脚本,就work了。得用alphafold-2.3的数据库,2018版的会报错。
好久没有debug了,技能显得有点生疏了。
export PYTHONPATH=/home/zhl595/.local/lib/python3.10/site-packages:$PYTHONPATH python3 /home/zhl595/softwares/AlphaPulldown/alphapulldown/create_individual_features.py \ --fasta_paths=baits.fasta,example_1_sequences.fasta \ --data_dir=/n/shared_db/alphafold-2.3 \ --save_msa_files=False \ --output_dir=test \ --max_template_date=2050-01-01 \ --use_precomputed_msas=False \ --skip_existing=False
# make sure you don't see "GPU will not be used"
一定要用上GPU,否则速度感人。
安装及测试
conda create -n AlphaPulldown -c omnia -c bioconda -c conda-forge python==3.10 conda activate AlphaPulldown conda install -c omnia -c bioconda -c conda-forge openmm==8.0 pdbfixer==1.9 kalign2 cctbx-base pytest importlib_metadata python3 -m pip install alphapulldown==1.0.2 pip install jax==0.4.16 jaxlib==0.4.16+cuda11.cudnn86 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html # flax 0.7.5 requires jax>=0.4.19, but you have jax 0.4.16 which is incompatible. conda install -c bioconda hmmer hhsuite
安装问题:
- 版本冲突
安装很顺利,但需要GPU的支持,而且需要下载非常大的alphafold数据库。
conda install -c bioconda aria2
./scripts/download_all_data.sh DB
而且pulldown是有biased,因为计算资源的限制,没法全部pull,所以会选择大概几百个蛋白的pathway。
Whatever,再结合实验上的co-IP,也算是比较丰富的数据。
可以蹭一波alphafold的热度,也可以预测大概得结合位点和方式。