机器学习 —— 概率图模型（Homework: StructuredCPD）

　　Week2的作业主要是关于概率图模型的构造，主要任务可以分为两个部分：1、构造CPD;2、构造Graph。对于有向图而言，在获得单个节点的CPD之后就可依据图对Combine CPD进行构造。在获得Combine CPD之后则可利用变量的观测来进行问答。此周作业的大背景是对基因型与表现型之间的关系进行探索。在已知表现性的情况下对基因型以及下一代的基因进行推测。这是一个很有实际意义的有向图网络。

1、CPD构造

1.1、基因型与表现型的关系——确定

　　在孟德尔遗传假说基础上，对双碱基配对的基因推测表现型与基因型的关系。也就是说，表现型是一个var，基因型是一个var，求factor. 由于表现型被显性基因或者隐性基因支配，所以显性，隐性与基因型之间是0,1关系。这是一种很特殊的CPD.

 1 function phenotypeFactor = phenotypeGivenGenotypeMendelianFactor(isDominant, genotypeVar, phenotypeVar)
 2 % This function computes the probability of each phenotype given the
 3 % different genotypes for a trait.  It assumes that there is 1 dominant
 4 % allele and 1 recessive allele.
 5 %
 6 % If you do not have much background in genetics, you should read the
 7 % on-line Appendix or watch the Khan Academy Introduction to Heredity Video
 8 % (http://www.khanacademy.org/video/introduction-to-heredity?playlist=Biology)
 9 % before you start coding.
10 %
11 % For the genotypes, assignment 1 maps to homozygous dominant, assignment 2
12 % maps to heterozygous, and assignment 3 maps to homozygous recessive.  For
13 % example, say that there is a gene with two alleles, dominant allele A and
14 % recessive allele a.  Assignment 1 would map to AA, assignment 2 would
15 % make to Aa, and assignment 3 would map to aa.  For the phenotypes, 
16 % assignment 1 maps to having the physical trait, and assignment 2 maps to 
17 % not having the physical trait.
18 %
19 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
20 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
21 %
22 % Input:
23 %   isDominant: 1 if the trait is caused by the dominant allele (trait 
24 %   would be caused by A in example above) and 0 if the trait is caused by
25 %   the recessive allele (trait would be caused by a in the example above)
26 %   genotypeVar: The variable number for the genotype variable (goes in the
27 %   .var part of the factor)
28 %   phenotypeVar: The variable number for the phenotype variable (goes in
29 %   the .var part of the factor)
30 %
31 % Output:
32 %   phenotypeFactor: Factor denoting the probability of having each 
33 %   phenotype for each genotype
34 
35 phenotypeFactor = struct('var', [], 'card', [], 'val', []);
36 
37 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
38 %INSERT YOUR CODE HERE
39 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
40 
41 % Fill in phenotypeFactor.var.  This should be a 1-D row vector.
42 % Fill in phenotypeFactor.card.  This should be a 1-D row vector.
43 phenotypeFactor.var = [phenotypeVar genotypeVar];
44 phenotypeFactor.card= [2 3];
45 
46 
47 array = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card);
48 
49     if isDominant
50         phenotypeFactor.val = ones(1, prod(phenotypeFactor.card));
51         for i = 1:length(array)
52             if array(i,2)==3&&array(i,1)==1
53                 phenotypeFactor.val(i)=0;
54             elseif ismember(array(i,2),[1 2])&&array(i,1)==2
55                 phenotypeFactor.val(i)=0;
56             end
57         end
58     else
59         phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
60         for i = 1:length(array)
61             if array(i,2)==3&&array(i,1)==1
62                 phenotypeFactor.val(i)=1;             
63             end
64         end
65     end
66 end
67 
68 % Replace the zeros in phentoypeFactor.val with the correct values.
69 
70 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

1.2 基因型与表现型的关系——不确定

　　这可认为是非孟德尔遗传。也就是说，对于给定的基因型，表现型不一定出现，基因只影响致病概率。

function phenotypeFactor = phenotypeGivenGenotypeFactor(alphaList, genotypeVar, phenotypeVar)
% This function computes the probability of each phenotype given the 
% different genotypes for a trait. Genotypes (assignments to the genotype
% variable) are indexed from 1 to the number of genotypes, and the alphas
% are provided in the same order as the corresponding genotypes so that the
% alpha for genotype assignment i is alphaList(i).
%
% For the phenotypes, assignment 1 maps to having the physical trait, and 
% assignment 2 maps to not having the physical trait.
%
% THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
% VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
%
% Input:
%   alphaList: Vector of alpha values for each genotype (n x 1 vector,
%   where n is the number of genotypes) -- the alpha value for a genotype
%   is the probability that a person with that genotype will have the
%   physical trait 
%   genotypeVar: The variable number for the genotype variable (goes in the
%   .var part of the factor)
%   phenotypeVar: The variable number for the phenotype variable (goes in
%   the .var part of the factor)
%
% Output:
%   phenotypeFactor: Factor in which the val has the probability of having 
%   each phenotype for each genotype combination (note that this is the 
%   FULL CPD with no evidence observed)

phenotypeFactor = struct('var', [], 'card', [], 'val', []);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%INSERT YOUR CODE HERE
% The number of genotypes is the length of alphaList.
% The number of phenotypes is 2.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  

% Fill in phenotypeFactor.var.  This should be a 1-D row vector.
% Fill in phenotypeFactor.card.  This should be a 1-D row vector.
phenotypeFactor.var = [phenotypeVar genotypeVar];
phenotypeFactor.card= [2 3];
phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
% Replace the zeros in phentoypeFactor.val with the correct values.
assign_ = IndexToAssignment(1:prod(phenotypeFactor.card),[2 3]);

for i = 1:prod(phenotypeFactor.card)
    if assign_(i,1) == 1
        switch assign_(i,2)
            case 1
                phenotypeFactor.val(i) = alphaList(1);
            case 2
                phenotypeFactor.val(i) = alphaList(2);
            case 3
                phenotypeFactor.val(i) = alphaList(3);
            otherwise
                'mistake in genotype'
                return
        end
    else
         switch assign_(i,2)
            case 1
                phenotypeFactor.val(i) = 1-alphaList(1);
            case 2
                phenotypeFactor.val(i) = 1-alphaList(2);
            case 3
                phenotypeFactor.val(i) = 1-alphaList(3);
            otherwise
                'mistake in genotype'
                return
        end
    end
end
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

1.3 表现型的分布

　　表现型由基因决定，而基因是由统计决定的。也就是说，只观测到某个基因出现频率，要推测人群中表现型的概率

 1 function genotypeFactor = genotypeGivenAlleleFreqsFactor(alleleFreqs, genotypeVar)
 2 
 3 % alleleFreqs = [0.1; 0.9];
 4 % genotypeVar = 1;
 5 
 6 % This function computes the probability of each genotype given the allele 
 7 % frequencies in the population.
 8 
 9 % Note that we assume that the copies of the gene are independent.  Thus,
10 % knowing the allele for one copy of the gene does not affect the
11 % probability of having each allele for the other copy of the gene.  As a
12 % result, the probability of a genotype is the product of the frequencies 
13 % of its constituent alleles (or twice that product for heterozygous 
14 % genotypes).
15 
16 % Input:
17 %   alleleFreqs: An n x 1 vector of the frequencies of the alleles in the 
18 %   population, where n is the number of alleles
19 %   genotypeVar: The variable number for the genotype (goes in the .var
20 %   part of the factor)
21 %
22 % Output:
23 %   genotypeFactor: Factor in which the val has the probability of having 
24 %   each genotype (note that this is the FULL CPD with no evidence 
25 %   observed)
26 
27 % The number of genotypes is (number of alleles choose 2) + number of 
28 % alleles -- need to add number of alleles at the end to account for 
29 % homozygotes
30 
31 genotypeFactor = struct('var', [], 'card', [], 'val', []);
32 numAlleles = length(alleleFreqs);
33 
34 % Each allele has an ID that is the index of its allele frequency in the 
35 % allele frequency list.  Each genotype also has an ID.  We need allele and
36 % genotype IDs so that we know what genotype and alleles correspond to each
37 % probability in the .val part of the factor.  For example, the first entry
38 % in .val corresponds to the probability of having the genotype with
39 % genotype ID 1, which consists of having two copies of the allele with
40 % allele ID 1.  There is a mapping from a pair of allele IDs to genotype 
41 % IDs and from genotype IDs to a pair of allele IDs below; we compute this 
42 % mapping using generateAlleleGenotypeMappers(numAlleles). (A genotype 
43 % consists of 2 alleles.)
44 
45 [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles);
46 
47 % One or both of these matrices might be useful.
48 %
49 %   1.  allelesToGenotypes: n x n matrix that maps pairs of allele IDs to 
50 %   genotype IDs, where n is the number of alleles -- if 
51 %   allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of 
52 %   the alleles with IDs i and j
53 %
54 %   2.  genotypesToAlleles: m x 2 matrix of allele IDs, where m is the 
55 %   number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the 
56 %   genotype with ID k is comprised of the allele with ID i and the allele 
57 %   with ID j
58 
59 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
60 %INSERT YOUR CODE HERE
61 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
62 
63 % Fill in genotypeFactor.var.  This should be a 1-D row vector.
64 % Fill in genotypeFactor.card.  This should be a 1-D row vector.
65 genotypeFactor.var = genotypeVar;
66 genotypeFactor.card = length(genotypesToAlleles(:,1));
67 
68 genotypeFactor.val = zeros(1, prod(genotypeFactor.card));
69 % Replace the zeros in genotypeFactor.val with the correct values.
70 n_ = length(alleleFreqs);
71 for i = 1:n_
72     for j = 1:n_
73         num_of_val = allelesToGenotypes(i,j);
74         genotypeFactor.val(num_of_val) = genotypeFactor.val(num_of_val)+alleleFreqs(i)*alleleFreqs(j) ;
75     end
76 end
77 
78 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

1.4 父辈基因型与子辈基因型的关系

　　每个基因拥有两个碱基，两位父辈对子辈进行等位遗传。也就是每个碱基遗传给子类的概率是相同的。那么父辈基因——子辈基因的关系则如下所示：

 1 function genotypeFactor = genotypeGivenParentsGenotypesFactor(numAlleles, genotypeVarChild, genotypeVarParentOne, genotypeVarParentTwo)
 2 
 3 
 4 
 5 % This function computes a factor representing the CPD for the genotype of
 6 % a child given the parents' genotypes.
 7 
 8 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
 9 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
10 
11 % When writing this function, make sure to consider all possible genotypes 
12 % from both parents and all possible genotypes for the child.
13 
14 % Input:
15 %   numAlleles: int that is the number of alleles
16 %   genotypeVarChild: Variable number corresponding to the variable for the
17 %   child's genotype (goes in the .var part of the factor)
18 %   genotypeVarParentOne: Variable number corresponding to the variable for
19 %   the first parent's genotype (goes in the .var part of the factor)
20 %   genotypeVarParentTwo: Variable number corresponding to the variable for
21 %   the second parent's genotype (goes in the .var part of the factor)
22 %
23 % Output:
24 %   genotypeFactor: Factor in which val is probability of the child having 
25 %   each genotype (note that this is the FULL CPD with no evidence 
26 %   observed)
27 
28 % The number of genotypes is (number of alleles choose 2) + number of 
29 % alleles -- need to add number of alleles at the end to account for homozygotes
30 
31 genotypeFactor = struct('var', [], 'card', [], 'val', []);
32 
33 % Each allele has an ID.  Each genotype also has an ID.  We need allele and
34 % genotype IDs so that we know what genotype and alleles correspond to each
35 % probability in the .val part of the factor.  For example, the first entry
36 % in .val corresponds to the probability of having the genotype with
37 % genotype ID 1, which consists of having two copies of the allele with
38 % allele ID 1, given that both parents also have the genotype with genotype
39 % ID 1.  There is a mapping from a pair of allele IDs to genotype IDs and 
40 % from genotype IDs to a pair of allele IDs below; we compute this mapping 
41 % using generateAlleleGenotypeMappers(numAlleles). (A genotype consists of 
42 % 2 alleles.)
43 
44 [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles);
45 
46 % One or both of these matrices might be useful.
47 %
48 %   1.  allelesToGenotypes: n x n matrix that maps pairs of allele IDs to 
49 %   genotype IDs, where n is the number of alleles -- if 
50 %   allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of 
51 %   the alleles with IDs i and j
52 %
53 %   2.  genotypesToAlleles: m x 2 matrix of allele IDs, where m is the 
54 %   number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the 
55 %   genotype with ID k is comprised of the allele with ID i and the allele 
56 %   with ID j
57 
58 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
59 %INSERT YOUR CODE HERE
60 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
61 
62 % Fill in genotypeFactor.var.  This should be a 1-D row vector.
63 % Fill in genotypeFactor.card.  This should be a 1-D row vector.
64 
65 genotypeFactor.var = [genotypeVarChild, genotypeVarParentOne, genotypeVarParentTwo];
66 var_num_ = length(genotypeFactor.var);
67 var_length_ = length(genotypesToAlleles(:,1));
68 genotypeFactor.card = linspace(var_length_,var_length_,var_num_);
69 genotypeFactor.val = zeros(1, prod(genotypeFactor.card));
70 assign_ = IndexToAssignment(1:prod(genotypeFactor.card),genotypeFactor.card);
71 % Replace the zeros in genotypeFactor.val with the correct values.
72 for i = 1:length(assign_)
73     GeType_child = assign_(i,1);
74     GeType_par1  = assign_(i,2);
75     GeType_par2  = assign_(i,3);
76     AllType_child = genotypesToAlleles(GeType_child,:);
77     AllType_par1 = genotypesToAlleles(GeType_par1,:);
78     AllType_par2 = genotypesToAlleles(GeType_par2,:);
79     combination = [];
80     length_tmp_allpar = length(AllType_par1);
81     for j = 1:length_tmp_allpar
82         for k = 1:length_tmp_allpar
83             combination = [combination;AllType_par1(j) AllType_par2(k)];
84         end
85     end
86     num_count = 0;
87     length_tmp_combination = length(combination);
88     for j = 1:length_tmp_combination
89         combin_row = combination(j,:);
90         if sort(AllType_child) == sort(combin_row)
91             num_count = num_count+1;
92         end
93     end
94     genotypeFactor.val(i)= num_count/length_tmp_combination;
95 end
96 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

1.5 表现型与碱基对之间的关系

　　从生物上来讲，碱基对和基因说的是同一个东西。但是在概率图中，有两种建模方式：1、基因是var ;2、碱基是var。如果基因是var，那么表现型（var）则只由一个var决定，如果碱基是var，则表现型由两个var决定。如果使用碱基作为var来对图进行建模，可以大幅降低CPD的复杂程度。如果每个碱基有n种值，那么基因的CPD会达到恐怖的n!种取值。但是如果使用两个碱基进行描述，则CPD的取值是n+n。从而达到解耦的目的。

 1  function phenotypeFactor = phenotypeGivenCopiesFactor(alphaList, numAlleles, geneCopyVarOne, geneCopyVarTwo, phenotypeVar)
 2 
 3 % alphaList = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01];
 4 % numAlleles = 3;
 5 % geneCopyVarOne= 1;
 6 % geneCopyVarTwo = 2;
 7 % phenotypeVar = 3;
 8 
 9 % This function makes a factor whose values are the probabilities of 
10 % a phenotype given an allele combination.  Note that a person has one
11 % copy of the gene from each parent.
12 
13 % In the factor, each assignment maps to the allele at the corresponding
14 % location on the allele list, so allele assignment 1 maps to
15 % alleleList{1}, allele assignment 2 maps to alleleList{2}, ....  For the
16 % phenotypes, assignment 1 maps to having the physical trait, and
17 % assignment 2 maps to not having the physical trait.
18 
19 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
20 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
21 
22 % Input:
23 %   alphaList: m x 1 vector of alpha values for the different genotypes,
24 %   where m is the number of genotypes -- the alpha value for a genotype
25 %   is the probability that a person with that genotype will have the
26 %   physical trait
27 %   numAlleles: int that is the number of alleles
28 %   geneCopyVarOne: Variable number corresponding to the variable for
29 %   the first copy of the gene (goes in the .var part of the factor)
30 %   geneCopyVarTwo: Variable number corresponding to the variable for
31 %   the second copy of the gene (goes in the .var part of the factor)
32 %   phenotypeVar: Variable number corresponding to the variable for the 
33 %   phenotype (goes in the .var part of the factor)
34 %
35 % Output:
36 %   phenotypeFactor: Factor in which the values are the probabilities of 
37 %   having each phenotype for each allele combination (note that this is 
38 %   the FULL CPD with no evidence observed)
39 
40 phenotypeFactor = struct('var', [], 'card', [], 'val', []);
41 
42 % Each allele has an ID. Each genotype also has an ID, which is the index 
43 % of its alpha in the list of alphas.  There is a mapping from a pair of 
44 % allele IDs to genotype IDs and from genotype IDs to a pair of allele IDs 
45 % below; we compute this mapping using 
46 % generateAlleleGenotypeMappers(numAlleles). (A genotype consists of 2 
47 % alleles.)
48 
49 [allelesToGenotypes, genotypesToAlleles] = generateAlleleGenotypeMappers(numAlleles);
50 
51 % One or both of these matrices might be useful.
52 %
53 %   1.  allelesToGenotypes: n x n matrix that maps pairs of allele IDs to 
54 %   genotype IDs, where n is the number of alleles -- if 
55 %   allelesToGenotypes(i, j) = k, then the genotype with ID k comprises of 
56 %   the alleles with IDs i and j
57 %
58 %   2.  genotypesToAlleles: m x 2 matrix of allele IDs, where m is the 
59 %   number of genotypes -- if genotypesToAlleles(k, :) = [i, j], then the 
60 %   genotype with ID k is comprised of the allele with ID i and the allele 
61 %   with ID j
62 
63 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
64 %INSERT YOUR CODE HERE
65 % The number of phenotypes is 2
66 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
67 
68 % Fill in phenotypeFactor.var.  This should be a 1-D row vector.
69 % Fill in phenotypeFactor.card.  This should be a 1-D row vector.
70 phenotypeFactor.var = [phenotypeVar geneCopyVarOne geneCopyVarTwo];
71 phenotypeFactor.card= [2 numAlleles numAlleles];
72 
73 phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
74 % Replace the zeros in phentoypeFactor.val with the correct values.
75 assign_alle = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card);
76 
77 for i = 1:prod(phenotypeFactor.card)
78     alle_type = assign_alle(i,2:3); 
79         if assign_alle(i,1) == 1
80             phenotypeFactor.val(i) = alphaList(allelesToGenotypes(alle_type(1),alle_type(2)));
81         else
82             phenotypeFactor.val(i) = 1 - alphaList(allelesToGenotypes(alle_type(1),alle_type(2)));
83         end
84 end
85 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

2. 构造Graph

2.1 通常的Graph

　　在有了CPD之后，只需要把CPD串起来，就可以得到图模型了。CPD描述的是变量与变量之间的关系，对于基因---表现型而言，这种关系是“一般的”。也就是这种关系放在任意一个普系中都适用。构造Graph的函数应该返回很多的局部CPD。也就是变量与其父节点构成的CPD。在得到这些CPD之后，只需要按照贝耶斯网络规律连乘，就可以得到Combine CPD。使用evidence Observe，就可以观察到变量的分布情况。

 1  function factorList = constructGeneticNetwork(pedigree, alleleFreqs, alphaList)
 2 % pedigree = struct('parents', [0,0;1,3;0,0]);
 3 % pedigree.names = {'Ira','James','Robin'};
 4 % alleleFreqs = [0.1; 0.9];
 5 % alphaList = [0.8; 0.6; 0.1];
 6 % This function constructs a Bayesian network for genetic inheritance.  It
 7 % assumes that there are only 2 phenotypes.  It also assumes that either 
 8 % both parents are specified or neither parent is specified.
 9 %
10 % In Matlab, each variable will have a number.  We need a consistent way of
11 % numbering variables when instantiating CPDs so that we know what
12 % variables are involved in each CPD.  For example, if IraGenotype is in
13 % multiple CPDs and IraGenotype is number variable 1 in a CPD, then
14 % IraGenotype should be numbered as variable 1 in all CPDs and no other
15 % variables should be numbered 1; thus, every time variable 1 appears in
16 % our network, we will know that it refers to Ira's genotype.
17 %
18 % Here is how the variables should be numbered, for a pedigree with n
19 % people:
20 %
21 % 1.  The first n variables should be the genotype variables and should
22 % be numbered according to the index of the corresponding person in
23 % pedigree.names; the ith person with name pedigree.names{i} has genotype
24 % variable number i.
25 %
26 % 2.  The next n variables should be the phenotype variables and should be
27 % numbered according to the index of the corresponding person in
28 % pedigree.names; the ith person with name pedigree.names{i} has phenotype
29 % variable number n+i.
30 %
31 % Here is an example of how the variable numbering should work: if
32 % pedigree.names = {'Ira', 'James', 'Robin'} and pedigree.parents = [0, 0;
33 % 1, 3; 0, 0], then the variable numbering is as follows:
34 %
35 % Variable 1: IraGenotype
36 % Variable 2: JamesGenotype
37 % Variable 3: RobinGenotype
38 % Variable 4: IraPhenotype
39 % Variable 5: JamesPhenotype
40 % Variable 6: RobinPhenotype
41 %
42 % Input:
43 %   pedigree: Data structure that includes names, genders, and parent-child
44 %   relationships
45 %   alleleFreqs: Frequencies of alleles in the population
46 %   alphaList: m x 1 vector of alpha values for genotypes, where m is the
47 %   number of genotypes -- the alpha value for a genotype is the 
48 %   probability that a person with that genotype will have the
49 %   physical trait
50 %
51 % Output:
52 %   factorList: Struct array of factors for the genetic network (In each
53 %   factor, .var, .card, and .val are all row 1-D vectors.)
54 
55 numPeople = length(pedigree.names);
56 
57 % Initialize factors
58 % The number of factors is twice the number of people because there is a 
59 % factor for each person's genotype and a separate factor for each person's 
60 % phenotype.  Note that the order of the factors in the list does not
61 % matter.
62 factorList(2*numPeople) = struct('var', [], 'card', [], 'val', []);
63 
64 numAlleles = length(alleleFreqs); % Number of alleles
65 
66 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
67 %INSERT YOUR CODE HERE
68 % Variable numbers:
69 % 1 - numPeople: genotype variables
70 % numPeople+1 - 2*numPeople: phenotype variables
71 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% 
72 
73 for i = 1 : 2*numPeople    
74     if i <= numPeople
75         pedigree_parents = pedigree.parents(i,:);
76         if max(pedigree_parents) == 0
77             tmp_ = genotypeGivenAlleleFreqsFactor(alleleFreqs,i);
78         else
79             tmp_ = genotypeGivenParentsGenotypesFactor(length(alleleFreqs),i,pedigree_parents(1),pedigree_parents(2));
80         end
81     else
82         tmp_ = phenotypeGivenGenotypeFactor(alphaList,i-numPeople,i);
83     end
84     factorList(i) = tmp_;
85 end
86 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

　　在完成这个构造规律函数之后，Koller 教授提供了一种更优雅可可视化方法，可以将这些CPD发送到SamIam里。

2.2 解耦网络的Graph

　　解耦网络的好处是可以应对某个碱基有多种取值的情况。但是代价是解耦网络的结构和普通网络完全不同。

  1 function factorList = constructDecoupledGeneticNetwork(pedigree, alleleFreqs, alphaList)
  2 
  3 % pedigree = struct('parents', [0,0;1,3;0,0]);
  4 % pedigree.names = {'Ira','James','Robin'};
  5 % alleleFreqs = [0.1; 0.7; 0.2];
  6 % alleleListThree = {'F', 'f', 'n'};
  7 % alphaList = [0.8; 0.6; 0.1; 0.5; 0.05; 0.01];
  8 
  9 % This function constructs a Bayesian network for genetic inheritance.  It
 10 % assumes that there are only 2 phenotypes.  It also assumes that, in the
 11 % pedigree, either both parents are specified or neither parent is
 12 % specified.
 13 %
 14 % In Matlab, each variable will have a number.  We need a consistent way of
 15 % numbering variables when instantiating CPDs so that we know what
 16 % variables are involved in each CPD.  For example, if IraGenotype is in
 17 % multiple CPDs and IraGenotype is number variable 1 in a CPD, then
 18 % IraGenotype should be numbered as variable 1 in all CPDs and no other
 19 % variables should be numbered 1; thus, every time variable 1 appears in
 20 % our network, we will know that it refers to Ira's genotype.
 21 %
 22 % Here is how the variables should be numbered, for a pedigree with n
 23 % people:
 24 %
 25 % 1.  The first n variables should be the gene copy 1 variables and should
 26 % be numbered according to the index of the corresponding person in
 27 % pedigree.names; the ith person with name pedigree.names{i} has gene copy
 28 % 1 variable number i.  If the parents are specified, then gene copy 1 is the
 29 % copy inherited from pedigree.names{pedigree.parents(i, 1)}.
 30 %
 31 % 2.  The next n variables should be the gene copy 2 variables and should
 32 % be numbered according to the index of the corresponding person in
 33 % pedigree.names; the ith person with name pedigree.names{i} has gene copy
 34 % 2 variable number n+i.  If the parents are specified, then gene copy 2 is the
 35 % copy inherited from pedigree.parents(i, 2).
 36 %
 37 % 3.  The final n variables should be the phenotype variables and should be
 38 % numbered according to the index of the corresponding person in
 39 % pedigree.names; the ith person with name pedigree.names{i} has phenotype
 40 % variable number 2n+i.
 41 %
 42 % Here is an example of how the variable numbering should work: if
 43 % pedigree.names = {'Ira', 'James', 'Robin'} and pedigree.parents = [0, 0;
 44 % 1, 3; 0, 0], then the variable numbering is as follows:
 45 %
 46 % Variable 1: IraGeneCopy1
 47 % Variable 2: JamesGeneCopy1
 48 % Variable 3: RobinGeneCopy1
 49 % Variable 4: IraGeneCopy2
 50 % Variable 5: JamesGeneCopy2
 51 % Variable 6: RobinGeneCopy2
 52 % Variable 7: IraPhenotype
 53 % Variable 8: JamesPhenotype
 54 % Variable 9: RobinPhenotype
 55 %
 56 % Input:
 57 %   pedigree: Data structure that includes the names and parents of each
 58 %   person
 59 %   alleleFreqs: n x 1 vector of allele frequencies in the population,
 60 %   where n is the number of alleles
 61 %   alphaList: m x 1 vector of alphas for different genotypes, where m is
 62 %   the number of genotypes -- the alpha value for a genotype is the 
 63 %   probability that a person with that genotype will have the physical 
 64 %   trait
 65 %
 66 % Output:
 67 %   factorList: struct array of factors for the genetic network (In each
 68 %   factor, .var, .card, and .val are row 1-D vectors.)
 69 
 70 numPeople = length(pedigree.names);
 71 
 72 % Initialize factors
 73 % We Need 3*numPeople factors because, for each person, there is a factor 
 74 % for the CPD at each of 2 parental copies of the gene and a factor for the 
 75 % CPD at the phenotype Note that the order of the factors in the list does 
 76 % not matter.
 77 factorList(3*numPeople) = struct('var', [], 'card', [], 'val', []);
 78 
 79 % Initialize factors
 80 factorList(3*numPeople) = struct('var', [], 'card', [], 'val', []);
 81 
 82 numAlleles = length(alleleFreqs); % Number of alleles
 83 
 84 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 85 % INSERT YOUR CODE HERE
 86 % Variable numbers:
 87 % 1 - numPeople: first parent copy of gene variables
 88 % numPeople+1 - 2*numPeople: second parent copy of gene variables
 89 % 2*numPeople+1 - 3*numPeople: phenotype variables
 90 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
 91 
 92 
 93 for k = 1:3
 94     switch k
 95         case 1
 96             for i = 1:numPeople
 97                 pedigree_parents = pedigree.parents(i,:);
 98                 if max(pedigree_parents)==0
 99                     factorList(i) = childCopyGivenFreqsFactor(alleleFreqs,i);
100                 else
101                     factorList(i) = childCopyGivenParentalsFactor(numAlleles,i,pedigree_parents(1),pedigree_parents(2));
102                 end
103             end
104         case 2
105             for i = (numPeople+1):2*numPeople 
106                 pedigree_parents = pedigree.parents(i-numPeople,:);
107                 if max(pedigree_parents)==0
108                     factorList(i) = childCopyGivenFreqsFactor(alleleFreqs,i);
109                 else
110                     factorList(i) = childCopyGivenParentalsFactor(numAlleles,i,pedigree_parents(1),pedigree_parents(2));
111                 end
112             end
113         case 3
114             for i = 2*numPeople+1:3*numPeople
115                  factorList(i) = phenotypeGivenCopiesFactor(alphaList,numAlleles,i,i-numPeople,i-2*numPeople);
116             end
117     end
118 end
119         
120 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

　　同样，此网络也可以可视化。不赘述。

3、多父节点CPD构造

　　对于父节点较少的情况，我们可以不同的程序生成CPD。但是现实生活中，某个变量往往由多个变量决定。这是如果要一项一项指定CPD表格会是一件非常麻烦的事情。不如直接指定父节点与子节点连接权重，然后后由indicate函数点亮连接。最后再由sigmoid函数生成变量之间的关系。这是一种常见的方法。

　　此实例是某种表现型由多个碱基（>4）的取值决定。每种碱基影响不同。故此CPD至少有5个entry(表现型+4个碱基对），假设每个entry两种取值，则有32种组合。

 1  function phenotypeFactor = constructSigmoidPhenotypeFactor(alleleWeights, geneCopyVarOneList, geneCopyVarTwoList, phenotypeVar)
 2 
 3 % alleleWeights = {[3, -3], [0.9, -0.8]};
 4 % geneCopyVarOneList = [1; 2];
 5 % geneCopyVarTwoList = [4; 5];
 6 % phenotypeVar = 3;
 7 
 8 % This function takes a cell array of alleles' weights and constructs a 
 9 % factor expressing a sigmoid CPD.
10 %
11 % You can assume that there are only 2 genes involved in the CPD.
12 %
13 % In the factor, for each gene, each allele assignment maps to the allele
14 % whose weight is at the corresponding location.  For example, for gene 1,
15 % allele assignment 1 maps to the allele whose weight is at
16 % alleleWeights{1}(1) (same as w_1^1), allele assignment 2 maps to the
17 % allele whose weight is at alleleWeights{1}(2) (same as w_2^1),....  
18 % 
19 % You may assume that there are 2 possible phenotypes.
20 % For the phenotypes, assignment 1 maps to having the physical trait, and
21 % assignment 2 maps to not having the physical trait.
22 %
23 % THE VARIABLE TO THE LEFT OF THE CONDITIONING BAR MUST BE THE FIRST
24 % VARIABLE IN THE .var FIELD FOR GRADING PURPOSES
25 %
26 % Input:
27 %   alleleWeights: Cell array of weights, where each entry is an 1 x n 
28 %   of weights for the alleles for a gene (n is the number of alleles for
29 %   the gene)
30 %   geneCopyVarOneList: m x 1 vector (m is the number of genes) of variable 
31 %   numbers that are the variable numbers for each of the first parent's 
32 %   copy of each gene (numbers in this list go in the .var part of the
33 %   factor)
34 %   geneCopyVarTwoList: m x 1 vector (m is the number of genes) of variable 
35 %   numbers that are the variable numbers for each of the second parent's 
36 %   copy of each gene (numbers in this list go in the .var part of the
37 %   factor) -- Note that both copies of each gene are from the same person,
38 %   but each copy originally came from a different parent
39 %   phenotypeVar: Variable number corresponding to the variable for the 
40 %   phenotype (goes in the .var part of the factor)
41 %
42 % Output:
43 %   phenotypeFactor: Factor in which the values are the probabilities of 
44 %   having each phenotype for each allele combination (note that this is 
45 %   the FULL CPD with no evidence observed)
46 
47 phenotypeFactor = struct('var', [], 'card', [], 'val', []);
48 
49 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
50 % INSERT YOUR CODE HERE
51 % Note that computeSigmoid.m will be useful for this function.
52 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%  
53 
54 % Fill in phenotypeFactor.var.  This should be a 1-D row vector.
55 % Fill in phenotypeFactor.card.  This should be a 1-D row vector.
56 phenotypeFactor.var = [phenotypeVar geneCopyVarOneList' geneCopyVarTwoList'];
57 %num_of_alle_in_each_geno
58 geno_var_card = [];
59 
60 for i=1:length(geneCopyVarOneList)
61     geno_var_card = [geno_var_card max(size(alleleWeights{i}))];
62 end
63 phenotypeFactor.card =[2 geno_var_card geno_var_card];
64 phenotypeFactor.val = zeros(1, prod(phenotypeFactor.card));
65 % Replace the zeros in phentoypeFactor.val with the correct values.
66 assign_allele = IndexToAssignment(1:prod(phenotypeFactor.card),phenotypeFactor.card);
67 
68 for i = 1:prod(phenotypeFactor.card)
69     pheno_allele_ = assign_allele(i,:);
70 % start from the second var
71     k = 2;
72     weight = 0;
73 % a pair of gene come from *two* people
74     for j = 1:2
75 % get num of allele from one people
76         for l = 1:length(geneCopyVarOneList)
77             weight = weight+alleleWeights{l}(pheno_allele_(k));
78             k = k+1;
79         end
80     end
81     switch pheno_allele_(1)
82         case 1
83             phenotypeFactor.val(i) = computeSigmoid(weight);            
84         case 2
85             phenotypeFactor.val(i) = 1-computeSigmoid(weight);                
86     end
87 end
88 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

View Code

　　所有代码点击这里

posted @ 2016-03-29 17:38 IronStark 阅读(1341) 评论(0) 编辑收藏举报

刷新页面返回顶部

IronStark

Ph.D, HUST.

机器学习 —— 概率图模型（Homework: StructuredCPD）

1、CPD构造

1.1、基因型与表现型的关系——确定

1.2 基因型与表现型的关系——不确定

1.3 表现型的分布

1.4 父辈基因型与子辈基因型的关系

1.5 表现型与碱基对之间的关系

2. 构造Graph

2.1 通常的Graph

2.2 解耦网络的Graph

3、多父节点CPD构造

公告