Book recommendation for genomics cloud computing: Genomics in the Cloud: Using Docker, GATK, and WDL in Terra

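For everyone learning genomics cloud computing alongside me, here is a book recommendation: "Genomics in the Cloud: Using Docker, GATK, and WDL in Terra". Its authors include a GATK community manager, and it was published in 2020, so it is still fairly recent.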

GitHub repository:
genomics-in-the-cloud

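The book covers:

  • Basic background in genomics and computing technology
  • Basic cloud computing operations
  • Getting started with GATK, plus the three major GATK Best Practices pipelines
  • Automating analysis with scripted workflows in WDL and Cromwell
  • Scaling up workflow execution in the cloud, including parallelization and cost optimization
  • Interactive analysis in the cloud using Jupyter notebooks
  • Secure collaboration and computational reproducibility using Terra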

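The book has some weaknesses. It is thick and devotes a lot of space to Broad's own products, even though most of us will rarely use its cloud platform, Terra, and the typesetting is poor. It is also written around the human genome, so its scope is limited. Still, reading a selection of chapters is well worth it, since there are so few books on this topic.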

The table of contents follows. To get the PDF e-book, follow the WeChat official account Bioinfarmer and reply "cloud" in the account's chat.

  1. Introduction
    The Promises and Challenges of Big Data in Biology and Life Sciences
    Infrastructure Challenges
    Toward a Cloud-Based Ecosystem for Data Sharing and Analysis
    Cloud-Hosted Data and Compute
    Platforms for Research in the Life Sciences
    Standardization and Reuse of Infrastructure
    Being FAIR
    Wrap-Up and Next Steps
  2. Genomics in a Nutshell: A Primer for Newcomers to the Field
    Introduction to Genomics
    The Gene as a Discrete Unit of Inheritance (Sort Of)
    The Central Dogma of Biology: DNA to RNA to Protein
    The Origins and Consequences of DNA Mutations
    Genomics as an Inventory of Variation in and Among Genomes
    The Challenge of Genomic Scale, by the Numbers
    Genomic Variation
    The Reference Genome as Common Framework
    Physical Classification of Variants
    Germline Variants Versus Somatic Alterations
    High-Throughput Sequencing Data Generation
    From Biological Sample to Huge Pile of Read Data
    Types of DNA Libraries: Choosing the Right Experimental Design
    Data Processing and Analysis
    Mapping Reads to the Reference Genome
    Variant Calling
    Data Quality and Sources of Error
    Functional Equivalence Pipeline Specification
    Wrap-Up and Next Steps
  3. Computing Technology Basics for Life Scientists
    Basic Infrastructure Components and Performance Bottlenecks
    Types of Processor Hardware: CPU, GPU, TPU, FPGA, OMG
    Levels of Compute Organization: Core, Node, Cluster, and Cloud
    Addressing Performance Bottlenecks
    Parallel Computing
    Parallelizing a Simple Analysis
    From Cores to Clusters and Clouds: Many Levels of Parallelism
    Trade-Offs of Parallelism: Speed, Efficiency, and Cost
    Pipelining for Parallelization and Automation
    Workflow Languages
    Popular Pipelining Languages for Genomics
    Workflow Management Systems
    Virtualization and the Cloud
    VMs and Containers
    Introducing the Cloud
    Categories of Research Use Cases for Cloud Services
    Wrap-Up and Next Steps
  4. First Steps in the Cloud
    Setting Up Your Google Cloud Account and First Project
    Creating a Project
    Checking Your Billing Account and Activating Free Credits
    Running Basic Commands in Google Cloud Shell
    Logging in to the Cloud Shell VM
    Using gsutil to Access and Manage Files
    Pulling a Docker Image and Spinning Up the Container
    Mounting a Volume to Access the Filesystem from Within the Container
    Setting Up Your Own Custom VM
    Creating and Configuring Your VM Instance
    Logging into Your VM by Using SSH
    Checking Your Authentication
    Copying the Book Materials to Your VM
    Installing Docker on Your VM
    Setting Up the GATK Container Image
    Stopping Your VM…to Stop It from Costing You Money
    Configuring IGV to Read Data from GCS Buckets
    Wrap-Up and Next Steps
  5. First Steps with GATK
    Getting Started with GATK
    Operating Requirements
    Command-Line Syntax
    Multithreading with Spark
    Running GATK in Practice
    Getting Started with Variant Discovery
    Calling Germline SNPs and Indels with HaplotypeCaller
    Filtering Based on Variant Context Annotations
    Introducing the GATK Best Practices
    Best Practices Workflows Covered in This Book
    Other Major Use Cases
    Wrap-Up and Next Steps
  6. GATK Best Practices for Germline Short Variant Discovery
    Data Preprocessing
    Mapping Reads to the Genome Reference
    Marking Duplicates
    Recalibrating Base Quality Scores
    Joint Discovery Analysis
    Overview of the Joint Calling Workflow
    Calling Variants per Sample to Generate GVCFs
    Consolidating GVCFs
    Applying Joint Genotyping to Multiple Samples
    Filtering the Joint Callset with Variant Quality Score Recalibration
    Refining Genotype Assignments and Adjusting Genotype Confidence
    Next Steps and Further Reading
    Single-Sample Calling with CNN Filtering
    Overview of the CNN Single-Sample Workflow
    Applying 1D CNN to Filter a Single-Sample WGS Callset
    Applying 2D CNN to Include Read Data in the Modeling
    Wrap-Up and Next Steps
  7. GATK Best Practices for Somatic Variant Discovery
    Challenges in Cancer Genomics
    Somatic Short Variants (SNVs and Indels)
    Overview of the Tumor-Normal Pair Analysis Workflow
    Creating a Mutect2 PoN
    Running Mutect2 on the Tumor-Normal Pair
    Estimating Cross-Sample Contamination
    Filtering Mutect2 Calls
    Annotating Predicted Functional Effects with Funcotator
    Somatic Copy-Number Alterations
    Overview of the Tumor-Only Analysis Workflow
    Creating a Somatic CNA PoN
    Applying Denoising
    Performing Segmentation and Call CNAs
    Additional Analysis Options
    Wrap-Up and Next Steps
  8. Automating Analysis Execution with Workflows
    Introducing WDL and Cromwell
    Installing and Setting Up Cromwell
    Your First WDL: Hello World
    Learning Basic WDL Syntax Through a Minimalist Example
    Running a Simple WDL with Cromwell on Your Google VM
    Interpreting the Important Parts of Cromwell’s Logging Output
    Adding a Variable and Providing Inputs via JSON
    Adding Another Task to Make It a Proper Workflow
    Your First GATK Workflow: Hello HaplotypeCaller
    Exploring the WDL
    Generating the Inputs JSON
    Running the Workflow
    Breaking the Workflow to Test Syntax Validation and Error Messaging
    Introducing Scatter-Gather Parallelism
    Exploring the WDL
    Generating a Graph Diagram for Visualization
    Wrap-Up and Next Steps
  9. Deciphering Real Genomics Workflows
    Mystery Workflow #1: Flexibility Through Conditionals
    Mapping Out the Workflow
    Reverse Engineering the Conditional Switch
    Mystery Workflow #2: Modularity and Code Reuse
    Mapping Out the Workflow
    Unpacking the Nesting Dolls
    Wrap-Up and Next Steps
  10. Running Single Workflows at Scale with Pipelines API
    Introducing the GCP Genomics Pipelines API Service
    Enabling Genomics API and Related APIs in Your Google Cloud Project
    Directly Dispatching Cromwell Jobs to PAPI
    Configuring Cromwell to Communicate with PAPI
    Running Scattered HaplotypeCaller via PAPI
    Monitoring Workflow Execution on Google Compute Engine
    Understanding and Optimizing Workflow Efficiency
    Granularity of Operations
    Balance of Time Versus Money
    Suggested Cost-Saving Optimizations
    Platform-Specific Optimization Versus Portability
    Wrapping Cromwell and PAPI Execution with WDL Runner
    Setting Up WDL Runner
    Running the Scattered HaplotypeCaller Workflow with WDL Runner
    Monitoring WDL Runner Execution
    Wrap-Up and Next Steps
  11. Running Many Workflows Conveniently in Terra
    Getting Started with Terra
    Creating an Account
    Creating a Billing Project
    Cloning the Preconfigured Workspace
    Running Workflows with the Cromwell Server in Terra
    Running a Workflow on a Single Sample
    Running a Workflow on Multiple Samples in a Data Table
    Monitoring Workflow Execution
    Locating Workflow Outputs in the Data Table
    Running the Same Workflow Again to Demonstrate Call Caching
    Running a Real GATK Best Practices Pipeline at Full Scale
    Finding and Cloning the GATK Best Practices Workspace for Germline Short Variant Discovery
    Examining the Preloaded Data
    Selecting Data and Configuring the Full-Scale Workflow
    Launching the Full-Scale Workflow and Monitoring Execution
    Options for Downloading Output Data—or Not
    Wrap-Up and Next Steps
  12. Interactive Analysis in Jupyter Notebook
    Introduction to Jupyter in Terra
    Jupyter Notebooks in General
    How Jupyter Notebooks Work in Terra
    Getting Started with Jupyter in Terra
    Inspecting and Customizing the Notebook Runtime Configuration
    Opening Notebook in Edit Mode and Checking the Kernel
    Running the Hello World Cells
    Using gsutil to Interact with Google Cloud Storage Buckets
    Setting Up a Variable Pointing to the Germline Data in the Book Bucket
    Setting Up a Sandbox and Saving Output Files to the Workspace Bucket
    Visualizing Genomic Data in an Embedded IGV Window
    Setting Up the Embedded IGV Browser
    Adding Data to the IGV Browser
    Setting Up an Access Token to View Private Data
    Running GATK Commands to Learn, Test, or Troubleshoot
    Running a Basic GATK Command: HaplotypeCaller
    Loading the Data (BAM and VCF) into IGV
    Troubleshooting a Questionable Variant Call in the Embedded IGV Browser
    Visualizing Variant Context Annotation Data
    Exporting Annotations of Interest with VariantsToTable
    Loading R Script to Make Plotting Functions Available
    Making Density Plots for QUAL by Using makeDensityPlot
    Making a Scatter Plot of QUAL Versus DP
    Making a Scatter Plot Flanked by Marginal Density Plots
    Wrap-Up and Next Steps
  13. Assembling Your Own Workspace in Terra
    Managing Data Inside and Outside of Workspaces
    The Workspace Bucket as Data Repository
    Accessing Private Data That You Manage Outside of Terra
    Accessing Data in the Terra Data Library
    Re-Creating the Tutorial Workspace from Base Components
    Creating a New Workspace
    Adding the Workflow to the Methods Repository and Importing It into the Workspace
    Creating a Configuration Quickly with a JSON File
    Adding the Data Table
    Filling in the Workspace Resource Data Table
    Creating a Workflow Configuration That Uses the Data Tables
    Adding the Notebook and Checking the Runtime Environment
    Documenting Your Workspace and Sharing It
    Starting from a GATK Best Practices Workspace
    Cloning a GATK Best Practices Workspace
    Examining GATK Workspace Data Tables to Understand How the Data Is Structured
    Getting to Know the 1000 Genomes High Coverage Dataset
    Copying Data Tables from the 1000 Genomes Workspace
    Using TSV Load Files to Import Data from the 1000 Genomes Workspace
    Running a Joint-Calling Analysis on the Federated Dataset
    Building a Workspace Around a Dataset
    Cloning the 1000 Genomes Data Workspace
    Importing a Workflow from Dockstore
    Configuring the Workflow to Use the Data Tables
    Wrap-Up and Next Steps
  14. Making a Fully Reproducible Paper
    Overview of the Case Study
    Computational Reproducibility and the FAIR Framework
    Original Research Study and History of the Case Study
    Assessing the Available Information and Key Challenges
    Designing a Reproducible Implementation
    Generating a Synthetic Dataset as a Stand-In for the Private Data
    Overall Methodology
    Retrieving the Variant Data from 1000 Genomes Participants
    Creating Fake Exomes Based on Real People
    Mutating the Fake Exomes
    Generating the Definitive Dataset
    Re-Creating the Data Processing and Analysis Methodology
    Mapping and Variant Discovery
    Variant Effect Prediction, Prioritization, and Variant Load Analysis
    Analytical Performance of the New Implementation
    The Long, Winding Road to FAIRness
    Final Conclusions

https://www.oreilly.com/library/view/genomics-in-the/9781491975183/
https://www.amazon.ca/Genomics-Cloud-GATK-Spark-Docker/dp/1491975199

Posted 2022-05-23 18:16 by 生物信息与育种