GWAS | 原理和流程 | 全基因组关联分析 | Linkage disequilibrium (LD)连锁不平衡 | 曼哈顿图 Manhattan_plot | QQ plot

GWAS入门必看教程：Statistical analysis of genome-wide association (GWAS) data

名词解释和基本问题：

关联分析：就是AS的中文，全称是GWAS。应用基因组中数以百万计的单核苷酸多态；SNP为分子遗传标记，进行全基因组水平上的对照分析或相关性分析，通过比较发现影响复杂性状的基因变异的一种新策略。在全基因组范围内选择遗传变异进行基因分析，比较异常和对照组之间每个遗传变异及其频率的差异，统计分析每个变异与目标性状之间的关联性大小，选出最相关的遗传变异进行验证，并根据验证结果最终确认其与目标性状之间的相关性。

连锁不平衡：LD，P（AB）= P（A）*P（B）。不连锁就独立，如果不存在连锁不平衡——相互独立，随机组合，实际观察到的群体中单倍体基因型 A和B 同时出现的概率。P (AB) = D + P (A) * P (B) 。D是表示两位点间LD程度值。

曼哈顿图：在生物和统计学上，做频率统计、突变分布、GWAS关联分析的时候，我们经常会看到一些非常漂亮的manhattan plot，能够对候选位点的分布和数值一目了然。位点坐标和pvalue。map文件至少包含三列——染色体号，SNP名字，SNP物理位置。assoc文件包含SNP名字和pvalue。haploview即可画出。

SNP的本质属性是什么？广义上讲是变异：most common type of genetic variation，平级的还有indel、CNV、SV。Each SNP represents a difference in a single DNA building block, called a nucleotide. 狭义上讲是标记：biological markers，因为SNP是单碱基的，所以SNP又是一个位点，标记了染色体上的一个位置。大部分人的基因组，99%都是一模一样的，还有些SNP的位点，就是一些可变的位点，在人群中有差异。这些差异/标记可以用于疾病的分析，根据统计学原理，找出与疾病最相关的位点，从而确定某个疾病的risk allele。

SNP array是如何工作的？SNP array测得不是单个碱基，而是allele。所以GWAS的结果是三种：(1 - AA; 2 - AB; 3 - BB)，也可能是0、1、2.

linkage disequilibrium (LD)和 pairwise correlation的区别？

如何鉴定Somatic vs Germline Mutations？In multicellular organisms, mutations can be classed as either somatic or germ-line。必须做通常需要trios或healthy tissue的测序才能确定。最显然的是cancer里大部分都是somatic的variations。

SNP、variant和mutation有什么区别？SNP是中性的，mutation显然和疾病相关；其次就是频率，频率很高的是SNP，mutation则很低。variant和variation是同义词，因此和SNP是等价的。

为什么还需要haplotype？HapMap计划的动机是什么？The HapMap is valuable by reducing the number of SNPs required to examine the entire genome for association with a phenotype from the 10 million SNPs that exist to roughly 500,000 tag SNPs.

common variant和rare variant是根据什么来区别的？paper 怎么理解这里的common和rare？variant就是SNP，”常见的变异“，SNP就是位点，一个位点怎么能说常见和不常见呢？这里是有点反直觉的。这里的common说的是minor allele，就是the second most common allele。比如一个SNP：rs78601809，它的位置可知，在不同人群中的allele frequency可知，总体的MAF是0.39 (T)。一个SNP的MAF<1%，那就是rare variant。直觉理解就是这个位点的碱基在人群中很少发生变化。rare variants (MAF < 0.05) appeared more frequently in coding regions than common variants (MAF > 0.05) in this population

Genetic variants that are outside the reach of the most statistically powered association studies [13] are thought to contribute to the missing heritability of many human traits, including common variants (here denoted by minor allele frequency [MAF] >5%) of very weak effect, low-frequency (MAF 1–5%) and rare variants (MAF <1%) of small to modest effect, or a combination of both, with several possible scenarios all deemed plausible in simulation studies [14].

为什么genetic这么执着于MAF？

因为从进化角度，risk allele更有可能是minor allele，自然选择。不绝对，但可以说是富集。看文章：Are minor alleles more likely to be risk alleles?

common variants together account for a small proportion of heritability estimated from family studies，common variants通常都在非编码区，占总variants的很小一部分，同时effect size也比较低。

SNP的small effect和large effect是什么意思？effect size

极其容易搞混的术语：SNP、mutation、variant、allele、genotype。Allele frequency、Genotype frequency，alternative allele frequency、MAF。一定要能快速区分这些术语的差异，否则你做的就是假的统计遗传学。

gene-based rare-variant burden tests是用来干什么的？Increased Burden of Rare Variants Among S-HSCR。

epistatic effects是什么？

为什么说L-HSCR是autosomal dominant？很难说是完全的线性，显隐性的关系是非常复杂的，存在不完全和剂量效应。

DNA序列角度如何看待等位基因，显隐性的关系？关于Allele(等位基因)的理解，allele在基因上的组合，传统的等位基因是非常抽象的概念。Dominant vs. Recessive 我们是两倍体，对每个基因来说，我们都有两个等位基因，杂合的话，这两个基因序列就不同，表达出来的蛋白也就不同，而且两个等位基因有复杂的显隐性关系。所以说我们传统的基因表达分析其实是很粗糙的，最好要做到isoform层次的表达，毕竟基因离蛋白还是有一段距离。现在之所以还没做到isoform水平，大部分原因是我们对蛋白的研究还不够。

一个新的课题，全球范围内，人种是如何逐步分化到今天，哪些核心的遗传因素决定了人种的表型差异；其次，不同的人种在某些疾病上为什么会出现显著的频率差异，为什么亚洲人的HSCR发病率会更高？遗传因素在其中发挥了什么作用？

遗传效应：Additive genetic effects occur when two or more genes source a single contribution to the final phenotype, or when alleles of a single gene (in heterozygotes) combine so that their combined effects equal the sum of their individual effects.[1][2] Non-additive genetic effects involve dominance (of alleles at a single locus) or epistasis (of alleles at different loci). 就是risk allele的数量和患病率之间成正比。

人类基因组里有多少个variant/SNP？ 1000 genome里的数据是84.4 million，这是保守数据，因为只包括了2504个人，相当于每个population只测了100个人，虽然具有一定的代表，性，但实际肯定更多，那就保守估计一下300 million吧，那就真是百分之一了，也就是100个碱基里就有一个variant。算到个体，就是3 million左右，也就是万分之一。

先从直觉上理解一下GWAS的原理：

核心就是SNP与表型的关联，对于每一个genome位点，如果某个SNP总是与某疾病同时出现 SNP与phenotype这两个维度协同变化，那我们就可以推测这个SNP极有可能与此phenotype（疾病）相关。

规范点讲就是看某个SNP在case和control两个population间是否有allel frequency的显著差异。

而现实情况是，我们样本数有限，而且有时候control和case样本不平衡，样本还分男女、人群，而且我们需要对3亿个碱基位点都做统计检验。

我们应该设计哪些指标来评价一个snp与表型的关联呢？

思考：如果一个位点有多个SNP，而只有其中的一个SNP与疾病相关怎么办？错误认知，一个基因组位点只能有一个SNP，可以有很多种allele。

牢记：曼哈顿图中的点代表的不是样品，而是SNP。

思考：曼哈顿图中，显著的SNP并不是鹤立鸡群的冒出来，而是似乎被捧出来的，就像高楼大厦一样，从底下逐步冒出来的。这一座大厦其实就是连锁在一起的SNP，具有很高的LD score。

思考：虽然曼哈顿图里每个点是SNP，但是通常都会把最显著的SNP指向某个基因，因为大家最关注的还是SNP的致病根源，但这样找出来的只有编码区的SNP。

注意：最突出的SNP极有可能不是causal SNP，它只是near the causal SNP。问题就来了，怎么找causal SNP呢？fine mapping

基本背景

什么是SNP？进化过程中随机产生的单点突变，并能稳定的在群体中遗传。

什么是allele frequency in population？每一个genome位点都有两个或多个allele，不同allel之间有明显的频率上的差异，简单点理解就是A和a两个性质的频率，但这里是碱基位点，而不是性状基因。

GWAS分析的前提

sample size足够，学过统计的都知道sample size会影响power，没有足够的power是得不出正确结论的，GWAS通常需要大量的样本，几千是标配，几百就太少，现在有的都达到了几万几十万级别；

一个大误区就是GWAS会测全基因组WGS，其实不是的，那太贵了，大部分是做DNA chip DNA芯片（专业的叫SNP array），只包含了常见的10^6个SNP。稍微有钱的就会上WES，就会得到所有编码区的SNP；最有钱的就是WGS了，全部检测，编码非编码，常见罕见，1000genome就是靠这个才NB的。

大致原理已经讲了，其实还有统计原理，暂时略过，先看实操。

怎么用PLINK来做GWAS？油管视频：GWAS in Plink 里面有paper、示例数据、代码下载，可以跑跑熟悉一下。

参考：

Analysis of Microarray Data

Single nucleotide polymorphism arrays: a decade of biological, computational and technological advances

Genotype Calling (CRLMM) and Copy Number Analysis tool for Affymetrix SNP 5.0 and 6.0 and Illumina arrays

Discriminating somatic and germline mutations in tumor DNA samples without matching normals

The impact of rare and low-frequency genetic variants in common disease

发表了paper的，GWAS pipeline：A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis。

github地址

一下着重讲解一下这个流程的操作细节：

主要是四方面的分析：

All essential GWAS QC steps along with scripts for data visualization.
Dealing with population stratification, using 1000 genomes as a reference.
Association analyses of GWAS data.
Polygenic risk score (PRS) analyses.

先看下PLINK的文本文件格式：

ped：行是个体，列是表型和SNP的基因型数据；

map：snp的特征数据；

二进制有三个格式：

主要就是把ped拆成了fam和bed，map变成了bim。

通常要做covariate分析，所以还有个covariate文件。

QC：

Step	Command	Function
1: Missingness of SNPs and individuals	‐‐geno	Excludes SNPs that are missing in a large proportion of the subjects. In this step, SNPs with low genotype calls are removed.
1: Missingness of SNPs and individuals	‐‐mind	Excludes individuals who have high rates of genotype missingness. In this step, individual with low genotype calls are removed.
2: Sex discrepancy	‐‐check‐sex	Checks for discrepancies between sex of the individuals recorded in the dataset and their sex based on X chromosome heterozygosity/homozygosity rates.
3: Minor allele frequency (MAF)	‐‐maf	Includes only SNPs above the set MAF threshold.
4: Hardy–Weinberg equilibrium (HWE)	‐‐hwe	Excludes markers which deviate from Hardy–Weinberg equilibrium.
5: Heterozygosity	For an example script see https://github.com/MareesAT/GWA_tutorial/	Excludes individuals with high or low heterozygosity rates
6: Relatedness	‐‐genome	Calculates identity by descent (IBD) of all sample pairs.
6: Relatedness	‐‐min	Sets threshold and creates a list of individuals with relatedness above the chosen threshold. Meaning that subjects who are related at, for example, pi‐hat >0.2 (i.e., second degree relatives) can be detected.
7: Population stratification	‐‐genome	Calculates identity by descent (IBD) of all sample pairs.
7: Population stratification	‐‐cluster ‐‐mds‐plot k	Produces a k‐dimensional representation of any substructure in the data, based on IBS.

fine mapping

一个常识就是GWAS是2007年才出现得，所以2017年才出了篇有名的综述ten years of GWAS，fine mapping是GWAS后才出现得。

实验室很早就开始研究fine mapping了：2009 - Fine mapping of the 9q31 Hirschsprung’s disease locus

看一下introduction，什么是fine mapping？

目的很简单：GWAS找到的大多不是causal variants，fine mapping就是就fill这个gap。

GWAS得到大体的SNP后，必须做两方面的深入分析：

第一步就是对SNP给一个概率上的causality，这就是fine-mapping；第二步就是根据功能注释来确定该SNP确实能导致某个基因。

The first is to assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping. The second step is to try to connect these variants to likely genes whose perturbation leads to altered disease risk by functional annotation.

基本原理：

Strategies for fine-mapping complex traits -

Integrative modeling of eQTLs and cis-regulatory elements suggests mechanisms underlying cell type specificity of eQTLs

Although eQTLs are increasingly used to provide mechanistic interpretations for human disease associations, the cell type specificity of eQTLs presents a problem. Because the cell type from which a given physiological phenotype arises may not be known, and because eQTL data exist for a limited number of cell types, it is critical to quantify and understand the mechanisms generating cell type specific eQTLs. For example, if a GWAS identifies a set of SNPs associated with risk of type II diabetes, the researcher must choose a target cell type to develop a mechanistic model of the molecular phenotype that causes the gross physiological change. One can imagine that the relevant cell type might be adipose tissue, liver, pancreas, or another hormone-regulating tissue. Furthermore, if the GWAS SNP produces a molecular phenotype (i.e., is an eQTL) in lymphoblastoid cell lines (LCLs), it is not necessarily the case that the SNP will generate a similar molecular phenotype in the cell type of interest. Furthermore, there are many examples of cell types with particular relevance to common diseases, for example dopaminergic neurons and Parkinson's disease, that lack comprehensive eQTL data or catalogs of CREs. The utility of eQTLs for complex trait interpretation will therefore be improved by a more thorough annotation of their cell type specificity.

eQTL最大的问题还是celltype的特异性不够，关键还是要celltype的定义足够精准！

现在GWAS已经属于比较古老的技术了，主要是碰到严重的瓶颈了，单纯的snp与表现的关联已经不够，需要具体的生物学解释，这些snp是如何具体导致疾病的发生的。

而且，大多数病找到的都不是个别显著的snp，大多数都找到了很多的snp，而且snp都落在非编码区了，这就导致对这些snp的解读非常的困难。

经典解读看这篇新英格兰杂志上的文章：FTO Obesity Variant Circuitry and Adipocyte Browning in Humans

GWAS的核心结果就两个，曼哈顿图和QQ-plot，看懂就够了。

单纯会跑GWAS pipeline已经没什么价值了，现在重在下游的分析，有几个热点：

Polygenic risk score (PRS) analyses
meta-analysis

The International HapMap Project (http://hapmap. ncbi.nlm.nih.gov/; Gibbs et al., 2003) described the patterns of com- mon SNPs within the human DNA sequence whereas the 1000 Genomes (1KG) project (http://www.1000genomes.org/; Altshuler et al., 2012) provided a map of both common and rare SNPs.

common和rare就是根据allele frequency来界定的，但是似乎没有明确界限。

HapMap用的是array，所有测得都是一些人为挑的点，所以就是common snps；而1000 genomes是WGS，所以包含了所有的点，所以有common和rare一起。

GWAS和核心就是LD，目前大部分的GWAS都是测得array，因为便宜。

GWAS会漏掉很多点，所以才会有fine-mapping，根据haplotype来做一些imputation。

Linkage disequilibrium (LD)连锁不平衡：不同基因座位的各等位基因在人群中以一定的频率出现。在某一群体中，不同座位某两个等位基因出现在同一条染色体上的频率高于预期的随机频率的现象。（就是孟德尔的分离不是随机的，在染色体上越靠近的allele越倾向于绑在一起，属于物质性的限制。）

例如两个相邻的基因A B, 他们各自的等位基因为a b. 假设A B相互独立遗传，则后代群体中观察得到的单倍体基因型 AB 中出现的P(AB)的概率为 P(A) * P（B）．实际观察得到群体中单倍体基因型 AB 同时出现的概率为P(AB)。计算这种不平衡的方法为： D = P（AB）－ P(A) * P（B）．

事实上，可以检测遍布基因组中的大量遗传标记位点snp，或者候选基因附近的遗传标记来寻找到因为与致病位点距离足够近而表现出与疾病相关的位点，这就是等位基因关联分析或连锁不平衡定位基因的基本思想。

待看的paper：Strategies for fine-mapping complex traits

assign well-calibrated probabilities of causality to candidate variants, known as fine-mapping.

还有一些非常重要的概念：

effect size：效应量

power：功效，power analyses

Underestimated Effect Sizes in GWAS: Fundamental Limitations of Single SNP Analysis for Dichotomous Phenotypes

在语境里理解：One explanation of the missing heritability is that complex diseases are caused by a large number of causal variants with small effect sizes.

PRS combines the effect sizes of multiple SNPs into a single aggregated score that can be used to predict disease risk

haplotype phasing单倍体分型

Positions with 00 and 11 are called homozygous positions. Positions with 10 or 01 are called heterozygous positions. We note that the reference genome is neither the paternal nor the maternal genome but the genome of an un-related human (or more precisely the mixture of genomes of a few individuals). An individual’s haplotype is the set of variations in that individual’s chromosomes. We note that as any two human haplotypes are 99.9% similar, the mapping problem can be solved quite easily.

Haplotype phasing is the problem of inferring information about an individual’s haplotype. To solve this problem, there are many methods.

Lecture 10: Haplotype Phasing - Community Recovery

参考：PLINK | File format reference

vcftools

plink的主要功能：数据处理，质量控制的基本统计，群体分层分析，单位点的基本关联分析，家系数据的传递不平衡检验，多点连锁分析，单倍体关联分析，拷贝数变异分析，Meta分析等等。

首先必须了解plink的三种格式：bed、fam和bim。（注意：这里的bed和我们genome里的区域文件bed完全不同）

plink需要的格式一般可以从vcf文件转化而来 (顺便了解一下ped和map两种格式)：

PED: Original standard text format for sample pedigree information and genotype calls. Normally must be accompanied by a .map file. 谱系信息和基因型信息。每一行是一个人。

MAP: Variant information file accompanying a .ped text pedigree + genotype table. 变异信息。每一行是一个变异 | snp。

# PED
     1 1 0 0 1  0    G G    2 2    C C
     1 2 0 0 1  0    A A    0 0    A C
     1 3 1 2 1  2    0 0    1 2    A C
     2 1 0 0 1  0    A A    2 2    0 0
     2 2 0 0 1  2    A A    2 2    0 0
     2 3 1 2 1  2    A A    2 2    A A

# MAP 
     1 snp1 0 1
     1 snp2 0 2
     1 snp3 0 3

# vcf转ped和map
plink --vcf file.vcf --recode --out file
# ped和map转bed、bim和fam
plink --file test --make-bed --out test

三种格式的官方介绍

bed文件（真实的bed文件是二进制的，比较难读）

bed：Primary representation of genotype calls at biallelic variants. Must be accompanied by .bim and .fam files. Loaded with --bfile; generated in many situations, most notably when the --make-bed command is used. Do not confuse this with the UCSC Genome Browser's BED format, which is totally different. 基因型信息。所以转换后就是一个matrix，每一行是一个个体，每一列就是一个变异。其中0、1、2分别对应了aa、Aa或aA和AA。不考虑碱基型，因为我们不关注ATGC的变化。

fam：Sample information file accompanying a .bed binary genotype table. 样本信息。每一行就是一个样本。

bim：Extended variant information file accompanying a .bed binary genotype table. 每一行是一个变异，及其注释信息。

             rs4970383 rs3748592 rs9442373 rs1571150 rs6687029
NA19916         2         0         0         0         1
NA19835         1         0         1         2         0
NA20282         1         0         1         0         1
NA19703         0         0         0         2         0
NA19901         1         0         1         2         2

OR
# xxd -b test.bed
00000000: 01101100 00011011 00000001 11011100 00001111 11100111 l.....
00000006: 00001111 01101011 00000001 .k.

First two bytes 01101100 00011011 for PLINK v1.00 BED file
Third byte is 00000001 (SNP-major) or 00000000 (individual-major)
Genotype data, either in SNP-major or individual-major order
New "row" always starts a new byte
Each byte encodes up to 4 genotypes
10 indicates missing genotype, otherwise 0 and 1 point to allele 1 or allele 2 in the BIM file, respectively
Bits in each byte read in reverse order

fam文件

2431 NA19916  0  0  1
2424 NA19835  0  0  2
2469 NA20282  0  0  2
2368 NA19703  0  0  1
2425 NA19901  0  0  2

bim文件

1  1 rs4970383  0  828418  A
2  1 rs3748592  0  870101  A
3  1 rs9442373  0 1052501  C
4  1 rs1571150  0 1464167  A
5  1 rs6687029  0 1508931  C

OR
1       snp1    0       1       G       A
1       snp2    0       2       1       2
1       snp3    0       3       A       C

跑跑PLINK工具

1 2	`plink` `--bfile` `--pheno` `--pheno-name t16` `--linear hide-covar` `--covar` `--covar-name` `AGE,SEX,PC1,PC2,PC3,PC4` `--ci` `0.95` `--out`

--bfile  将snp文件变成二进制格式
--pheno 这里导入我们刚刚处理的性状文件 
--pheno-name t16 要处理的性状名字是t16
--linear hide-covar 使用线性模型，hide-covar指的是不要对我没加入的协变量进行分析
--covar  --covar-name AGE,SEX,PC1,PC2,PC3,PC4 把我们选取的协变量加入线性回归模型中，我们选的协变量有：AGE，SEX，PC1，PC2，PC3，PC4
--ci 0.95 设置置信区间

SNP过滤问题

使用vcftools过滤：
1. MAF<0.05
vcftools --vcf test.vcf --maf 0.05 --out XX
2.完整度大于90%
vcftools --vcf test.vcf  --max-missing 0.9 --OUT XX
3.平均深度大于5
vcftools --vcf test.vc --min-meanDP 5 --out xx
 
注：
使用--gvcf更为快捷
使用plink过滤
1.vcf转化plink格式
vcftools --vcf test.vcf --plink --out  xxx
2.plink --noweb --file plink --geno 0.05 --maf 0.05 --hwe 0.0001 --make-bed

跟一个官网的教学，无需写代码，教学材料：Resources available for download 非常通俗，容易入门。

ped文件：谱系信息和基因型；

Contains no header line, and one line per sample with 2V+6 fields where V is the number of variants. The first six fields are the same as those in a .fam file.

The seventh and eighth fields are allele calls for the first variant in the .map file ('0' = no call); the 9th and 10th are allele calls for the second variant; and so on.

前6行就和fam文件一样，家庭id，家庭内id，性别，表型。

后面两个一组，比如第7和第8就是map中第一个snp的等位基因（人有两条染色体，每条DNA都是双链的，不考虑双链，因为有互补配对）。

fam文件：样本信息；

Family ID ('FID')

Within-family ID ('IID'; cannot be '0')

Within-family ID of father ('0' if father isn't in dataset)

Within-family ID of mother ('0' if mother isn't in dataset)

Sex code ('1' = male, '2' = female, '0' = unknown)

Phenotype value ('1' = control, '2' = case, '-9'/'0'/non-numeric = missing data if case/control)

map文件：突变信息；

Chromosome code. PLINK 1.9 also permits contig names here, but most older programs do not.

Variant identifier

Position in morgans or centimorgans (optional; also safe to use dummy value of '0')

Base-pair coordinate

bim文件：额外的突变信息；

Chromosome code (either an integer, or 'X'/'Y'/'XY'/'MT'; '0' indicates unknown) or name

Variant identifier

Position in morgans or centimorgans (safe to use dummy value of '0')

Base-pair coordinate (normally 1-based, but 0 ok; limited to 231-2)

Allele 1 (corresponding to clear bits in .bed; usually minor)

Allele 2 (corresponding to set bits in .bed; usually major)

MAF, Minor allele frequency: SNPs with a minor allele frequency of 0.05 or greater were targeted by the HapMap project. 最小等位基因频率