机器学习 —— 概率图模型(Homework: StructuredCPD)
Week2的作业主要是关于概率图模型的构造,主要任务可以分为两个部分:1、构造CPD;2、构造Graph。对于有向图而言,在获得单个节点的CPD之后就可依据图对Combine CPD进行构造。在获得Combine CPD之后则可利用变量的观测来进行问答。此周作业的大背景是对基因型与表现型之间的关系进行探索。在已知表现性的情况下对基因型以及下一代的基因进行推测。这是一个很有实际意义的有向图网络。
1、CPD构造
1.1、基因型与表现型的关系——确定
在孟德尔遗传假说基础上,对双碱基配对的基因推测表现型与基因型的关系。也就是说,表现型是一个var,基因型是一个var,求factor. 由于表现型被显性基因或者隐性基因支配,所以显性,隐性与基因型之间是0,1关系。这是一种很特殊的CPD.
1 function phenotypeFactor = phenotypeGivenGenotypeMendelianFactor(isDominant, genotypeVar, phenotypeVar) 2 % This function computes the probability of each phenotype given the 3 % different genotypes for a trait. It assumes that there is 1 dominant 4 % allele and 1 recessive allele. 5 % 6 % If you do not have much background in genetics, you should read the 7 % on-line Appendix or watch the Khan Academy Introduction to Heredity Video 8 % (http://www.khanacademy.org/video/introduction-to-heredity?playlist=Biology) 9 % before you start coding. 10 % 11 % For the genotypes, assignment 1 maps to homozygous dominant, assignment 2 12 % maps to heterozygous, and assignment 3 maps to homozygous recessive. For 13 % example, say that there is a gene with two alleles, dominant allele A and 14 % recessive allele a. Assignment 1 would map to AA, assignment 2 would 15 % make to Aa, and assignment 3 would map to aa. For the phenotypes, 16 % assignment 1 maps to having the physical trait, and assignment 2 maps to 17 % not having the physical trait. 18 % 19 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST 20 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES 21 % 22 % Input: 23 % isDominant: 1 if the trait is caused by the dominant allele (trait 24 % would be caused by A in example above) and 0 if the trait is caused by 25 % the recessive allele (trait would be caused by a in the example above) 26 % genotypeVar: The variable number for the genotype variable (goes in the 27 % .var part of the factor) 28 % phenotypeVar: The variable number for the phenotype variable (goes in 29 % the .var part of the factor) 30 % 31 % Output: 32 % phenotypeFactor: Factor denoting the probability of having each 33 % phenotype for each genotype 34 35 phenotypeFactor = struct('var', [], 'card', [], 'val', []); 36 37 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 38 %INSERT YOUR CODE HERE 39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 40 41 % Fill in phenotypeFactor.var. This should be a 1-D row vector. 42 % Fill in phenotypeFactor.card. This should be a 1-D row vector. 43 phenotypeFactor.var = [phenotypeVar genotypeVar]; 44 phenotypeFactor.card= [2 3]; 45 46 47 array = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card); 48 49 if isDominant 50 phenotypeFactor.val = ones(1, prod(phenotypeFactor.card)); 51 for i = 1:length(array) 52 if array(i,2)==3&&array(i,1)==1 53 phenotypeFactor.val(i)=0; 54 elseif ismember(array(i,2),[1 2])&&array(i,1)==2 55 phenotypeFactor.val(i)=0; 56 end 57 end 58 else 59 phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); 60 for i = 1:length(array) 61 if array(i,2)==3&&array(i,1)==1 62 phenotypeFactor.val(i)=1; 63 end 64 end 65 end 66 end 67 68 % Replace the zeros in phentoypeFactor.val with the correct values. 69 70 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.2 基因型与表现型的关系——不确定
这可认为是非孟德尔遗传。也就是说,对于给定的基因型,表现型不一定出现,基因只影响致病概率。
function phenotypeFactor = phenotypeGivenGenotypeFactor(alphaList, genotypeVar, phenotypeVar) % This function computes the probability of each phenotype given the % different genotypes for a trait. Genotypes (assignments to the genotype % variable) are indexed from 1 to the number of genotypes, and the alphas % are provided in the same order as the corresponding genotypes so that the % alpha for genotype assignment i is alphaList(i). % % For the phenotypes, assignment 1 maps to having the physical trait, and % assignment 2 maps to not having the physical trait. % % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES % % Input: % alphaList: Vector of alpha values for each genotype (n x 1 vector, % where n is the number of genotypes) -- the alpha value for a genotype % is the probability that a person with that genotype will have the % physical trait % genotypeVar: The variable number for the genotype variable (goes in the % .var part of the factor) % phenotypeVar: The variable number for the phenotype variable (goes in % the .var part of the factor) % % Output: % phenotypeFactor: Factor in which the val has the probability of having % each phenotype for each genotype combination (note that this is the % FULL CPD with no evidence observed) phenotypeFactor = struct('var', [], 'card', [], 'val', []); %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %INSERT YOUR CODE HERE % The number of genotypes is the length of alphaList. % The number of phenotypes is 2. %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Fill in phenotypeFactor.var. This should be a 1-D row vector. % Fill in phenotypeFactor.card. This should be a 1-D row vector. phenotypeFactor.var = [phenotypeVar genotypeVar]; phenotypeFactor.card= [2 3]; phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); % Replace the zeros in phentoypeFactor.val with the correct values. assign_ = IndexToAssignment(1:prod(phenotypeFactor.card),[2 3]); for i = 1:prod(phenotypeFactor.card) if assign_(i,1) == 1 switch assign_(i,2) case 1 phenotypeFactor.val(i) = alphaList(1); case 2 phenotypeFactor.val(i) = alphaList(2); case 3 phenotypeFactor.val(i) = alphaList(3); otherwise 'mistake in genotype' return end else switch assign_(i,2) case 1 phenotypeFactor.val(i) = 1-alphaList(1); case 2 phenotypeFactor.val(i) = 1-alphaList(2); case 3 phenotypeFactor.val(i) = 1-alphaList(3); otherwise 'mistake in genotype' return end end end %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.3 表现型的分布
表现型由基因决定,而基因是由统计决定的。也就是说,只观测到某个基因出现频率,要推测人群中表现型的概率
1 function genotypeFactor = genotypeGivenAlleleFreqsFactor(alleleFreqs, genotypeVar) 2 3 % alleleFreqs = [0.1; 0.9]; 4 % genotypeVar = 1; 5 6 % This function computes the probability of each genotype given the allele 7 % frequencies in the population. 8 9 % Note that we assume that the copies of the gene are independent. Thus, 10 % knowing the allele for one copy of the gene does not affect the 11 % probability of having each allele for the other copy of the gene. As a 12 % result, the probability of a genotype is the product of the frequencies 13 % of its constituent alleles (or twice that product for heterozygous 14 % genotypes). 15 16 % Input: 17 % alleleFreqs: An n x 1 vector of the frequencies of the alleles in the 18 % population, where n is the number of alleles 19 % genotypeVar: The variable number for the genotype (goes in the .var 20 % part of the factor) 21 % 22 % Output: 23 % genotypeFactor: Factor in which the val has the probability of having 24 % each genotype (note that this is the FULL CPD with no evidence 25 % observed) 26 27 % The number of genotypes is (number of alleles choose 2) + number of 28 % alleles -- need to add number of alleles at the end to account for 29 % homozygotes 30 31 genotypeFactor = struct('var', [], 'card', [], 'val', []); 32 numAlleles = length(alleleFreqs); 33 34 % Each allele has an ID that is the index of its allele frequency in the 35 % allele frequency list. Each genotype also has an ID. We need allele and 36 % genotype IDs so that we know what genotype and alleles correspond to each 37 % probability in the .val part of the factor. For example, the first entry 38 % in .val corresponds to the probability of having the genotype with 39 % genotype ID 1, which consists of having two copies of the allele with 40 % allele ID 1. There is a mapping from a pair of allele IDs to genotype 41 % IDs and from genotype IDs to a pair of allele IDs below; we compute this 42 % mapping using generateAlleleGenotypeMappers(numAlleles). (A genotype 43 % consists of 2 alleles.) 44 45 [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles); 46 47 % One or both of these matrices might be useful. 48 % 49 % 1. allelesToGenotypes: n x n matrix that maps pairs of allele IDs to 50 % genotype IDs, where n is the number of alleles -- if 51 % allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of 52 % the alleles with IDs i and j 53 % 54 % 2. genotypesToAlleles: m x 2 matrix of allele IDs, where m is the 55 % number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the 56 % genotype with ID k is comprised of the allele with ID i and the allele 57 % with ID j 58 59 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 60 %INSERT YOUR CODE HERE 61 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 62 63 % Fill in genotypeFactor.var. This should be a 1-D row vector. 64 % Fill in genotypeFactor.card. This should be a 1-D row vector. 65 genotypeFactor.var = genotypeVar; 66 genotypeFactor.card = length(genotypesToAlleles(:,1)); 67 68 genotypeFactor.val = zeros(1, prod(genotypeFactor.card)); 69 % Replace the zeros in genotypeFactor.val with the correct values. 70 n_ = length(alleleFreqs); 71 for i = 1:n_ 72 for j = 1:n_ 73 num_of_val = allelesToGenotypes(i,j); 74 genotypeFactor.val(num_of_val) = genotypeFactor.val(num_of_val)+alleleFreqs(i)*alleleFreqs(j) ; 75 end 76 end 77 78 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.4 父辈基因型与子辈基因型的关系
每个基因拥有两个碱基,两位父辈对子辈进行等位遗传。也就是每个碱基遗传给子类的概率是相同的。那么父辈基因——子辈基因的关系则如下所示:
1 function genotypeFactor = genotypeGivenParentsGenotypesFactor(numAlleles, genotypeVarChild, genotypeVarParentOne, genotypeVarParentTwo) 2 3 4 5 % This function computes a factor representing the CPD for the genotype of 6 % a child given the parents' genotypes. 7 8 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST 9 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES 10 11 % When writing this function, make sure to consider all possible genotypes 12 % from both parents and all possible genotypes for the child. 13 14 % Input: 15 % numAlleles: int that is the number of alleles 16 % genotypeVarChild: Variable number corresponding to the variable for the 17 % child's genotype (goes in the .var part of the factor) 18 % genotypeVarParentOne: Variable number corresponding to the variable for 19 % the first parent's genotype (goes in the .var part of the factor) 20 % genotypeVarParentTwo: Variable number corresponding to the variable for 21 % the second parent's genotype (goes in the .var part of the factor) 22 % 23 % Output: 24 % genotypeFactor: Factor in which val is probability of the child having 25 % each genotype (note that this is the FULL CPD with no evidence 26 % observed) 27 28 % The number of genotypes is (number of alleles choose 2) + number of 29 % alleles -- need to add number of alleles at the end to account for homozygotes 30 31 genotypeFactor = struct('var', [], 'card', [], 'val', []); 32 33 % Each allele has an ID. Each genotype also has an ID. We need allele and 34 % genotype IDs so that we know what genotype and alleles correspond to each 35 % probability in the .val part of the factor. For example, the first entry 36 % in .val corresponds to the probability of having the genotype with 37 % genotype ID 1, which consists of having two copies of the allele with 38 % allele ID 1, given that both parents also have the genotype with genotype 39 % ID 1. There is a mapping from a pair of allele IDs to genotype IDs and 40 % from genotype IDs to a pair of allele IDs below; we compute this mapping 41 % using generateAlleleGenotypeMappers(numAlleles). (A genotype consists of 42 % 2 alleles.) 43 44 [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles); 45 46 % One or both of these matrices might be useful. 47 % 48 % 1. allelesToGenotypes: n x n matrix that maps pairs of allele IDs to 49 % genotype IDs, where n is the number of alleles -- if 50 % allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of 51 % the alleles with IDs i and j 52 % 53 % 2. genotypesToAlleles: m x 2 matrix of allele IDs, where m is the 54 % number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the 55 % genotype with ID k is comprised of the allele with ID i and the allele 56 % with ID j 57 58 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 59 %INSERT YOUR CODE HERE 60 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 61 62 % Fill in genotypeFactor.var. This should be a 1-D row vector. 63 % Fill in genotypeFactor.card. This should be a 1-D row vector. 64 65 genotypeFactor.var = [genotypeVarChild, genotypeVarParentOne, genotypeVarParentTwo]; 66 var_num_ = length(genotypeFactor.var); 67 var_length_ = length(genotypesToAlleles(:,1)); 68 genotypeFactor.card = linspace(var_length_,var_length_,var_num_); 69 genotypeFactor.val = zeros(1, prod(genotypeFactor.card)); 70 assign_ = IndexToAssignment(1:prod(genotypeFactor.card),genotypeFactor.card); 71 % Replace the zeros in genotypeFactor.val with the correct values. 72 for i = 1:length(assign_) 73 GeType_child = assign_(i,1); 74 GeType_par1 = assign_(i,2); 75 GeType_par2 = assign_(i,3); 76 AllType_child = genotypesToAlleles(GeType_child,:); 77 AllType_par1 = genotypesToAlleles(GeType_par1,:); 78 AllType_par2 = genotypesToAlleles(GeType_par2,:); 79 combination = []; 80 length_tmp_allpar = length(AllType_par1); 81 for j = 1:length_tmp_allpar 82 for k = 1:length_tmp_allpar 83 combination = [combination;AllType_par1(j) AllType_par2(k)]; 84 end 85 end 86 num_count = 0; 87 length_tmp_combination = length(combination); 88 for j = 1:length_tmp_combination 89 combin_row = combination(j,:); 90 if sort(AllType_child) == sort(combin_row) 91 num_count = num_count+1; 92 end 93 end 94 genotypeFactor.val(i)= num_count/length_tmp_combination; 95 end 96 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
1.5 表现型与碱基对之间的关系
从生物上来讲,碱基对和基因说的是同一个东西。但是在概率图中,有两种建模方式:1、基因是var ;2、碱基是var。 如果基因是var,那么表现型(var)则只由一个var决定,如果碱基是var,则表现型由两个var决定。如果使用碱基作为var来对图进行建模,可以大幅降低CPD的复杂程度。如果每个碱基有n种值,那么基因的CPD会达到恐怖的n!种取值。但是如果使用两个碱基进行描述,则CPD的取值是n+n。从而达到解耦的目的。
1 function phenotypeFactor = phenotypeGivenCopiesFactor(alphaList, numAlleles, geneCopyVarOne, geneCopyVarTwo, phenotypeVar) 2 3 % alphaList = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01]; 4 % numAlleles = 3; 5 % geneCopyVarOne= 1; 6 % geneCopyVarTwo = 2; 7 % phenotypeVar = 3; 8 9 % This function makes a factor whose values are the probabilities of 10 % a phenotype given an allele combination. Note that a person has one 11 % copy of the gene from each parent. 12 13 % In the factor, each assignment maps to the allele at the corresponding 14 % location on the allele list, so allele assignment 1 maps to 15 % alleleList{1}, allele assignment 2 maps to alleleList{2}, .... For the 16 % phenotypes, assignment 1 maps to having the physical trait, and 17 % assignment 2 maps to not having the physical trait. 18 19 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST 20 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES 21 22 % Input: 23 % alphaList: m x 1 vector of alpha values for the different genotypes, 24 % where m is the number of genotypes -- the alpha value for a genotype 25 % is the probability that a person with that genotype will have the 26 % physical trait 27 % numAlleles: int that is the number of alleles 28 % geneCopyVarOne: Variable number corresponding to the variable for 29 % the first copy of the gene (goes in the .var part of the factor) 30 % geneCopyVarTwo: Variable number corresponding to the variable for 31 % the second copy of the gene (goes in the .var part of the factor) 32 % phenotypeVar: Variable number corresponding to the variable for the 33 % phenotype (goes in the .var part of the factor) 34 % 35 % Output: 36 % phenotypeFactor: Factor in which the values are the probabilities of 37 % having each phenotype for each allele combination (note that this is 38 % the FULL CPD with no evidence observed) 39 40 phenotypeFactor = struct('var', [], 'card', [], 'val', []); 41 42 % Each allele has an ID. Each genotype also has an ID, which is the index 43 % of its alpha in the list of alphas. There is a mapping from a pair of 44 % allele IDs to genotype IDs and from genotype IDs to a pair of allele IDs 45 % below; we compute this mapping using 46 % generateAlleleGenotypeMappers(numAlleles). (A genotype consists of 2 47 % alleles.) 48 49 [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles); 50 51 % One or both of these matrices might be useful. 52 % 53 % 1. allelesToGenotypes: n x n matrix that maps pairs of allele IDs to 54 % genotype IDs, where n is the number of alleles -- if 55 % allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of 56 % the alleles with IDs i and j 57 % 58 % 2. genotypesToAlleles: m x 2 matrix of allele IDs, where m is the 59 % number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the 60 % genotype with ID k is comprised of the allele with ID i and the allele 61 % with ID j 62 63 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 64 %INSERT YOUR CODE HERE 65 % The number of phenotypes is 2 66 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 67 68 % Fill in phenotypeFactor.var. This should be a 1-D row vector. 69 % Fill in phenotypeFactor.card. This should be a 1-D row vector. 70 phenotypeFactor.var = [phenotypeVar geneCopyVarOne geneCopyVarTwo]; 71 phenotypeFactor.card= [2 numAlleles numAlleles]; 72 73 phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); 74 % Replace the zeros in phentoypeFactor.val with the correct values. 75 assign_alle = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card); 76 77 for i = 1:prod(phenotypeFactor.card) 78 alle_type = assign_alle(i,2:3); 79 if assign_alle(i,1) == 1 80 phenotypeFactor.val(i) = alphaList(allelesToGenotypes(alle_type(1),alle_type(2))); 81 else 82 phenotypeFactor.val(i) = 1 - alphaList(allelesToGenotypes(alle_type(1),alle_type(2))); 83 end 84 end 85 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
2. 构造Graph
2.1 通常的Graph
在有了CPD之后,只需要把CPD串起来,就可以得到图模型了。CPD描述的是变量与变量之间的关系,对于基因---表现型而言,这种关系是“一般的”。也就是这种关系放在任意一个普系中都适用。构造Graph的函数应该返回很多的局部CPD。也就是变量与其父节点构成的CPD。在得到这些CPD之后,只需要按照贝耶斯网络规律连乘,就可以得到Combine CPD。使用evidence Observe,就可以观察到变量的分布情况。
1 function factorList = constructGeneticNetwork(pedigree, alleleFreqs, alphaList) 2 % pedigree = struct('parents', [0,0;1,3;0,0]); 3 % pedigree.names = {'Ira','James','Robin'}; 4 % alleleFreqs = [0.1; 0.9]; 5 % alphaList = [0.8; 0.6; 0.1]; 6 % This function constructs a Bayesian network for genetic inheritance. It 7 % assumes that there are only 2 phenotypes. It also assumes that either 8 % both parents are specified or neither parent is specified. 9 % 10 % In Matlab, each variable will have a number. We need a consistent way of 11 % numbering variables when instantiating CPDs so that we know what 12 % variables are involved in each CPD. For example, if IraGenotype is in 13 % multiple CPDs and IraGenotype is number variable 1 in a CPD, then 14 % IraGenotype should be numbered as variable 1 in all CPDs and no other 15 % variables should be numbered 1; thus, every time variable 1 appears in 16 % our network, we will know that it refers to Ira's genotype. 17 % 18 % Here is how the variables should be numbered, for a pedigree with n 19 % people: 20 % 21 % 1. The first n variables should be the genotype variables and should 22 % be numbered according to the index of the corresponding person in 23 % pedigree.names; the ith person with name pedigree.names{i} has genotype 24 % variable number i. 25 % 26 % 2. The next n variables should be the phenotype variables and should be 27 % numbered according to the index of the corresponding person in 28 % pedigree.names; the ith person with name pedigree.names{i} has phenotype 29 % variable number n+i. 30 % 31 % Here is an example of how the variable numbering should work: if 32 % pedigree.names = {'Ira', 'James', 'Robin'} and pedigree.parents = [0, 0; 33 % 1, 3; 0, 0], then the variable numbering is as follows: 34 % 35 % Variable 1: IraGenotype 36 % Variable 2: JamesGenotype 37 % Variable 3: RobinGenotype 38 % Variable 4: IraPhenotype 39 % Variable 5: JamesPhenotype 40 % Variable 6: RobinPhenotype 41 % 42 % Input: 43 % pedigree: Data structure that includes names, genders, and parent-child 44 % relationships 45 % alleleFreqs: Frequencies of alleles in the population 46 % alphaList: m x 1 vector of alpha values for genotypes, where m is the 47 % number of genotypes -- the alpha value for a genotype is the 48 % probability that a person with that genotype will have the 49 % physical trait 50 % 51 % Output: 52 % factorList: Struct array of factors for the genetic network (In each 53 % factor, .var, .card, and .val are all row 1-D vectors.) 54 55 numPeople = length(pedigree.names); 56 57 % Initialize factors 58 % The number of factors is twice the number of people because there is a 59 % factor for each person's genotype and a separate factor for each person's 60 % phenotype. Note that the order of the factors in the list does not 61 % matter. 62 factorList(2*numPeople) = struct('var', [], 'card', [], 'val', []); 63 64 numAlleles = length(alleleFreqs); % Number of alleles 65 66 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 67 %INSERT YOUR CODE HERE 68 % Variable numbers: 69 % 1 - numPeople: genotype variables 70 % numPeople+1 - 2*numPeople: phenotype variables 71 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 72 73 for i = 1 : 2*numPeople 74 if i <= numPeople 75 pedigree_parents = pedigree.parents(i,:); 76 if max(pedigree_parents) == 0 77 tmp_ = genotypeGivenAlleleFreqsFactor(alleleFreqs,i); 78 else 79 tmp_ = genotypeGivenParentsGenotypesFactor(length(alleleFreqs),i,pedigree_parents(1),pedigree_parents(2)); 80 end 81 else 82 tmp_ = phenotypeGivenGenotypeFactor(alphaList,i-numPeople,i); 83 end 84 factorList(i) = tmp_; 85 end 86 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
在完成这个构造规律函数之后,Koller 教授提供了一种更优雅可可视化方法,可以将这些CPD发送到SamIam里。
2.2 解耦网络的Graph
解耦网络的好处是可以应对某个碱基有多种取值的情况。但是代价是解耦网络的结构和普通网络完全不同。
1 function factorList = constructDecoupledGeneticNetwork(pedigree, alleleFreqs, alphaList) 2 3 % pedigree = struct('parents', [0,0;1,3;0,0]); 4 % pedigree.names = {'Ira','James','Robin'}; 5 % alleleFreqs = [0.1; 0.7; 0.2]; 6 % alleleListThree = {'F', 'f', 'n'}; 7 % alphaList = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01]; 8 9 % This function constructs a Bayesian network for genetic inheritance. It 10 % assumes that there are only 2 phenotypes. It also assumes that, in the 11 % pedigree, either both parents are specified or neither parent is 12 % specified. 13 % 14 % In Matlab, each variable will have a number. We need a consistent way of 15 % numbering variables when instantiating CPDs so that we know what 16 % variables are involved in each CPD. For example, if IraGenotype is in 17 % multiple CPDs and IraGenotype is number variable 1 in a CPD, then 18 % IraGenotype should be numbered as variable 1 in all CPDs and no other 19 % variables should be numbered 1; thus, every time variable 1 appears in 20 % our network, we will know that it refers to Ira's genotype. 21 % 22 % Here is how the variables should be numbered, for a pedigree with n 23 % people: 24 % 25 % 1. The first n variables should be the gene copy 1 variables and should 26 % be numbered according to the index of the corresponding person in 27 % pedigree.names; the ith person with name pedigree.names{i} has gene copy 28 % 1 variable number i. If the parents are specified, then gene copy 1 is the 29 % copy inherited from pedigree.names{pedigree.parents(i, 1)}. 30 % 31 % 2. The next n variables should be the gene copy 2 variables and should 32 % be numbered according to the index of the corresponding person in 33 % pedigree.names; the ith person with name pedigree.names{i} has gene copy 34 % 2 variable number n+i. If the parents are specified, then gene copy 2 is the 35 % copy inherited from pedigree.parents(i, 2). 36 % 37 % 3. The final n variables should be the phenotype variables and should be 38 % numbered according to the index of the corresponding person in 39 % pedigree.names; the ith person with name pedigree.names{i} has phenotype 40 % variable number 2n+i. 41 % 42 % Here is an example of how the variable numbering should work: if 43 % pedigree.names = {'Ira', 'James', 'Robin'} and pedigree.parents = [0, 0; 44 % 1, 3; 0, 0], then the variable numbering is as follows: 45 % 46 % Variable 1: IraGeneCopy1 47 % Variable 2: JamesGeneCopy1 48 % Variable 3: RobinGeneCopy1 49 % Variable 4: IraGeneCopy2 50 % Variable 5: JamesGeneCopy2 51 % Variable 6: RobinGeneCopy2 52 % Variable 7: IraPhenotype 53 % Variable 8: JamesPhenotype 54 % Variable 9: RobinPhenotype 55 % 56 % Input: 57 % pedigree: Data structure that includes the names and parents of each 58 % person 59 % alleleFreqs: n x 1 vector of allele frequencies in the population, 60 % where n is the number of alleles 61 % alphaList: m x 1 vector of alphas for different genotypes, where m is 62 % the number of genotypes -- the alpha value for a genotype is the 63 % probability that a person with that genotype will have the physical 64 % trait 65 % 66 % Output: 67 % factorList: struct array of factors for the genetic network (In each 68 % factor, .var, .card, and .val are row 1-D vectors.) 69 70 numPeople = length(pedigree.names); 71 72 % Initialize factors 73 % We Need 3*numPeople factors because, for each person, there is a factor 74 % for the CPD at each of 2 parental copies of the gene and a factor for the 75 % CPD at the phenotype Note that the order of the factors in the list does 76 % not matter. 77 factorList(3*numPeople) = struct('var', [], 'card', [], 'val', []); 78 79 % Initialize factors 80 factorList(3*numPeople) = struct('var', [], 'card', [], 'val', []); 81 82 numAlleles = length(alleleFreqs); % Number of alleles 83 84 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 85 % INSERT YOUR CODE HERE 86 % Variable numbers: 87 % 1 - numPeople: first parent copy of gene variables 88 % numPeople+1 - 2*numPeople: second parent copy of gene variables 89 % 2*numPeople+1 - 3*numPeople: phenotype variables 90 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 91 92 93 for k = 1:3 94 switch k 95 case 1 96 for i = 1:numPeople 97 pedigree_parents = pedigree.parents(i,:); 98 if max(pedigree_parents)==0 99 factorList(i) = childCopyGivenFreqsFactor(alleleFreqs,i); 100 else 101 factorList(i) = childCopyGivenParentalsFactor(numAlleles,i,pedigree_parents(1),pedigree_parents(2)); 102 end 103 end 104 case 2 105 for i = (numPeople+1):2*numPeople 106 pedigree_parents = pedigree.parents(i-numPeople,:); 107 if max(pedigree_parents)==0 108 factorList(i) = childCopyGivenFreqsFactor(alleleFreqs,i); 109 else 110 factorList(i) = childCopyGivenParentalsFactor(numAlleles,i,pedigree_parents(1),pedigree_parents(2)); 111 end 112 end 113 case 3 114 for i = 2*numPeople+1:3*numPeople 115 factorList(i) = phenotypeGivenCopiesFactor(alphaList,numAlleles,i,i-numPeople,i-2*numPeople); 116 end 117 end 118 end 119 120 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
同样,此网络也可以可视化。不赘述。
3、多父节点CPD构造
对于父节点较少的情况,我们可以不同的程序生成CPD。但是现实生活中,某个变量往往由多个变量决定。这是如果要一项一项指定CPD表格会是一件非常麻烦的事情。不如直接指定父节点与子节点连接权重,然后后由indicate函数点亮连接。最后再由sigmoid函数生成变量之间的关系。这是一种常见的方法。
此实例是某种表现型由多个碱基(>4)的取值决定。每种碱基影响不同。故此CPD至少有5个entry(表现型+4个碱基对),假设每个entry两种取值,则有32种组合。
1 function phenotypeFactor = constructSigmoidPhenotypeFactor(alleleWeights, geneCopyVarOneList, geneCopyVarTwoList, phenotypeVar) 2 3 % alleleWeights = {[3, -3], [0.9, -0.8]}; 4 % geneCopyVarOneList = [1; 2]; 5 % geneCopyVarTwoList = [4; 5]; 6 % phenotypeVar = 3; 7 8 % This function takes a cell array of alleles' weights and constructs a 9 % factor expressing a sigmoid CPD. 10 % 11 % You can assume that there are only 2 genes involved in the CPD. 12 % 13 % In the factor, for each gene, each allele assignment maps to the allele 14 % whose weight is at the corresponding location. For example, for gene 1, 15 % allele assignment 1 maps to the allele whose weight is at 16 % alleleWeights{1}(1) (same as w_1^1), allele assignment 2 maps to the 17 % allele whose weight is at alleleWeights{1}(2) (same as w_2^1),.... 18 % 19 % You may assume that there are 2 possible phenotypes. 20 % For the phenotypes, assignment 1 maps to having the physical trait, and 21 % assignment 2 maps to not having the physical trait. 22 % 23 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST 24 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES 25 % 26 % Input: 27 % alleleWeights: Cell array of weights, where each entry is an 1 x n 28 % of weights for the alleles for a gene (n is the number of alleles for 29 % the gene) 30 % geneCopyVarOneList: m x 1 vector (m is the number of genes) of variable 31 % numbers that are the variable numbers for each of the first parent's 32 % copy of each gene (numbers in this list go in the .var part of the 33 % factor) 34 % geneCopyVarTwoList: m x 1 vector (m is the number of genes) of variable 35 % numbers that are the variable numbers for each of the second parent's 36 % copy of each gene (numbers in this list go in the .var part of the 37 % factor) -- Note that both copies of each gene are from the same person, 38 % but each copy originally came from a different parent 39 % phenotypeVar: Variable number corresponding to the variable for the 40 % phenotype (goes in the .var part of the factor) 41 % 42 % Output: 43 % phenotypeFactor: Factor in which the values are the probabilities of 44 % having each phenotype for each allele combination (note that this is 45 % the FULL CPD with no evidence observed) 46 47 phenotypeFactor = struct('var', [], 'card', [], 'val', []); 48 49 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 50 % INSERT YOUR CODE HERE 51 % Note that computeSigmoid.m will be useful for this function. 52 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 53 54 % Fill in phenotypeFactor.var. This should be a 1-D row vector. 55 % Fill in phenotypeFactor.card. This should be a 1-D row vector. 56 phenotypeFactor.var = [phenotypeVar geneCopyVarOneList' geneCopyVarTwoList']; 57 %num_of_alle_in_each_geno 58 geno_var_card = []; 59 60 for i=1:length(geneCopyVarOneList) 61 geno_var_card = [geno_var_card max(size(alleleWeights{i}))]; 62 end 63 phenotypeFactor.card =[2 geno_var_card geno_var_card]; 64 phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card)); 65 % Replace the zeros in phentoypeFactor.val with the correct values. 66 assign_allele = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card); 67 68 for i = 1:prod(phenotypeFactor.card) 69 pheno_allele_ = assign_allele(i,:); 70 % start from the second var 71 k = 2; 72 weight = 0; 73 % a pair of gene come from *two* people 74 for j = 1:2 75 % get num of allele from one people 76 for l = 1:length(geneCopyVarOneList) 77 weight = weight+alleleWeights{l}(pheno_allele_(k)); 78 k = k+1; 79 end 80 end 81 switch pheno_allele_(1) 82 case 1 83 phenotypeFactor.val(i) = computeSigmoid(weight); 84 case 2 85 phenotypeFactor.val(i) = 1-computeSigmoid(weight); 86 end 87 end 88 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
所有代码点击这里