This bio-recipe shows how to test for independence in a contingency table of counts. To do that we will use a chi-square test which is also known under the name of Pearson's chi-square test. As in the bio-recipe AApairs, we will be using the total number of counts of amino acid pairs in the SwissProt database as an example.
The null-hypothesis we want to test is that the observed number of pairs of amino acids corresponds to the number of pairs we would expect based on the frequencies of the two amino acid involved in the pair.
The data we want to use for our null-hypothesis consists of the counts matrix C obtained from the bio-recipe AApairs. So we use the code from AApairs, especially the procedure Size that extracts the counts of the amino acid pairs from the data base.
Size := proc( x:string ) t := SearchSeqDb(x); t[1,2]-t[1,1]+1 end:define the location of the SwissProt database and load it
SwissProt := '~cbrg/DB/SwissProt.Z': ReadDb(SwissProt);
Reading 169638448 characters from file /home/cbrg/DB/SwissProt.Z Pre-processing input (peptides) 163235 sequences within 163235 entries considered Peptide file(/home/cbrg/DB/SwissProt.Z(169638448), 163235 entries, 59631787 aminoacids)
M := 8: C := CreateArray(1..M,1..M): for i to M do for j to M do aapair := IntToA(i).IntToA(j): C[i,j] := Size(aapair): od: od:print a header line and the count matrix
for i to M do printf( '%7s', IntToAAA(i) ) od; printf('\n'); print(C);
Ala Arg Asn Asp Cys Gln Glu Gly 469536 246987 165610 234294 67935 182718 302217 342032 236998 221709 128206 168203 49457 131368 216985 205625 172817 119157 127122 123057 39732 96894 148390 174090 238027 156099 125148 177308 46553 104121 226639 220279 60995 52031 38877 49757 24430 37576 52023 76560 193952 137005 97651 108404 35500 141753 156841 147336 315161 221906 180628 218457 51900 161480 345095 238123 313624 222980 162380 214222 64665 152035 250593 333941
The test statistic that we want to use for our test is computed as the squared difference between the observed and the expected counts divided by the expected counts, summed over all elements in the matrix. The expected counts under the null-hypothesis correspond to the product of the amino acid frequencies times the total number of amino acid pairs. Let us now start to compute the test statistic by defining a matrix E with expected counts of amino acid pairs.
nrRow := length(C); nrCol := length(C[1]); E:= CreateArray(1..nrRow, 1..nrCol): sumRow := [ seq(sum(C[i]), i=1..nrRow) ]; Ct := transpose(C): sumCol := [ seq(sum(Ct[i]), i=1..nrCol) ]; sumC := sum(sumCol); for i to nrRow do for j to nrCol do E[i,j] := sumRow[i] * sumCol[j] / sumC; od; od; printf( 'Expected counts of amino acid pairs: ' ); print(E);
nrRow := 8 nrCol := 8 sumRow := [2011329, 1358551, 1001259, 1294174, 392249, 1018442, 1732750, 1714440 ] sumCol := [2001110, 1377874, 1025622, 1293702, 380172, 1007945, 1698783, 1737986 ] sumC := 10523194 Expected counts of amino acid pairs: 382478.04 263357.11 196030.15 247269.07 72663.39 192651.49 324693.39 332186.37 258344.57 177884.40 132408.45 167017.75 49080.45 130126.34 219313.96 224375.09 190401.26 131101.71 97585.70 123092.93 36172.54 95903.77 161635.50 165365.58 246102.52 169455.08 126134.07 159103.36 46754.69 123960.10 208921.43 213742.74 74590.79 51359.85 38229.76 48222.37 14170.80 37570.86 63321.64 64782.92 193668.81 133351.60 99260.41 125205.38 36793.31 97549.61 164409.39 168203.49 329502.94 226880.85 168879.00 213021.08 62599.15 165968.31 279721.75 286176.92 326021.08 224483.39 167094.46 210770.08 61937.67 164214.52 276765.93 283152.88
Now that we have the expected counts for the amino acid pairs, the next step is to compute the test statistic which corresponds to the squared differences between each element of the two matrices of C and E divided by the expected count, and summed over all elements in matrix C. The degree of freedom and the significance level are both important to determine the limit of the test statistic at which we are willing to either accept or reject our null-hypothesis. The level of significance has to be determined by the researcher. The degrees of freedom for this test can be computed as the product of the number of rows and the number of columns each subtracted by one.
testStat := 0: for i to nrRow do for j to nrCol do testStat := testStat + (C[i,j] - E[i,j])^2 / E[i,j]: od: od: df := (nrRow - 1) * (nrCol - 1): printf( 'Chi-square test statistic: %.2f\n', testStat); printf( 'Degrees of freedom: %.0f\n', df);
Chi-square test statistic: 144847.38 Degrees of freedom: 49
With the test statistic, the degree of freedom, and the significance level which you have to decide on your own, you can get the test decision by looking at any table of the chi-square distribution.
© 2005 by Peter von Rohr, Informatik, ETH Zurich
Last updated on Fri Sep 2 14:43:55 2005 by PvR