It is well known that coding DNA converts to mRNA and that the mRNA is translated into proteins with the mediation of tRNA molecules. The translation rules from DNA to mRNA are trivial. The Thymines (T) are changed to Uracils (U). The translation from mRNA to amino acids, for most forms of life, is governed by the genetic code.
?GeneticCode
the following help files contain (approximately) the keyword: ?AAAToInt ?AToCInt ?AToCodon ?AToInt ?AltGenCode ?AminoToInt ?BBBToInt ?BToInt ?BaseToInt ?CIntToA ?CIntToAAA ?CIntToAmino ?CIntToCodon ?CIntToInt ?CodonToA ?CodonToCInt ?CodonToInt ?Complement ?ComplementSequence ?GetComplement ?IntToA ?IntToAAA ?IntToAmino ?IntToB ?IntToBBB ?IntToBase ?IntToCInt ?IntToCodon ------------
The groups of 3 nucleotides which are used for coding an amino acid or a stop signal are called codons. The amino acids which have more than one codon represent an ambiguity or redundancy which is not known to play a definitive role. Normally, each codon is translated by one tRNA molecule, although a tRNA molecule can translate more than one codon. There are 61 different codons, 41 tRNA molecules (for Saccharomyces cerevisiae, baker's yeast) and 20 amino acids. So the genetic code has two levels of redundancy.

The amino acids have particular frequencies which are correlated to the number of codons coding for it, but otherwise not uniform. The "Total Number of Pairs of Amino Acids" bio-recipe shows how to compute amino acid and pairs of amino acids frequencies. We will load the yeast database and compute these frequencies.
ReadDb( '/home/darwin/DB/genomes/YEAST/yeast.db' );
Size := proc(x) t := SearchSeqDb(x); t[1,2]-t[1,1]+1 end:
for i to 20 do
s := Size(IntToA(i));
printf( '%13s %6d %.2f%%\n',
IntToAmino(i), s, 100*s/DB[TotAA] )
od:
Peptide file(/home/darwin/DB/genomes/YEAST/yeast.db(12985775), 6194 entries,
2902851 aminoacids)
Alanine 159685 5.50%
Arginine 129516 4.46%
Asparagine 176570 6.08%
Aspartic acid 168212 5.79%
Cysteine 38075 1.31%
Glutamine 113385 3.91%
Glutamic acid 189023 6.51%
Glycine 144464 4.98%
Histidine 62450 2.15%
Isoleucine 190412 6.56%
Leucine 278480 9.59%
Lysine 211892 7.30%
Methionine 60549 2.09%
Phenylalanine 131155 4.52%
Proline 125962 4.34%
Serine 261986 9.03%
Threonine 170558 5.88%
Tryptophan 30403 1.05%
Tyrosine 97413 3.36%
Valine 162661 5.60%
The RNA codons used are not uniformly distributed either. The departure of the codon frequencies is normally called codon bias. It is also true, and sort of a consequence of the above, that the tRNAs also have a biased distribution. We will use the DNA fields of the yeast database. to count the codon frequencies (a codon is 3 DNA/RNA bases in the correct reading frame).
Codons := CreateArray(1..64):
for e in Entries(DB) do
tD := SearchTag(DNA,e);
for i by 3 to length(tD) do
j := CodonToCInt( tD[i..i+2] );
Codons[j] := Codons[j]+1
od
od:
for i to 64 do
printf( ' %s %.2f%%', CIntToCodon(i), 100*Codons[i]/DB[TotAA] );
if mod(i,4)=0 then lprint() fi
od:
AAA 4.25% AAC 2.46% AAG 3.05% AAT 3.63% ACA 1.80% ACC 1.25% ACG 0.82% ACT 2.01% AGA 2.10% AGC 1.01% AGG 0.96% AGT 1.46% ATA 1.85% ATC 1.69% ATG 2.09% ATT 3.02% CAA 2.67% CAC 0.77% CAG 1.23% CAT 1.38% CCA 1.76% CCC 0.69% CCG 0.54% CCT 1.34% CGA 0.32% CGC 0.27% CGG 0.19% CGT 0.63% CTA 1.36% CTC 0.57% CTG 1.08% CTT 1.26% GAA 4.56% GAC 2.01% GAG 1.96% GAT 3.78% GCA 1.63% GCC 1.22% GCG 0.63% GCT 2.02% GGA 1.13% GGC 0.98% GGG 0.62% GGT 2.25% GTA 1.22% GTC 1.13% GTG 1.09% GTT 2.16% TAA 0.10% TAC 1.45% TAG 0.05% TAT 1.91% TCA 1.91% TCC 1.42% TCG 0.88% TCT 2.35% TGA 0.06% TGC 0.50% TGG 1.05% TGT 0.81% TTA 2.64% TTC 1.83% TTG 2.68% TTT 2.69%
The stop codons, TAA, TAG and TGA show frequency 0 as they never appear in a coding sequence. The DNA that we have recorded in this database is just the coding part. The most and least probable codons are quite far apart, if we sort the values (and ignore the 3 stop codons) we can compute the worst ratio:
sCodons := sort(Codons): sCodons[64] / sCodons[4];
24.1807
There is a factor of 24 between the least and most frequent codons.
To count the frequencies of the tRNAs we can use the results from the previously computed codon frequencies, but we need to use the tRNA translation table.
tRNAtable := [ [[GCT,GCC],[GCA,GCG],Ala], [[CGT,CGC,CGA],[CGG],[AGA],[AGG],Arg], [[AAT,AAC],Asn], [[GAT,GAC],Asp], [[TGT,TGC],Cys], [[CAA],[CAG],Gln], [[GAA],[GAG],Glu], [[GGT,GGC],[GGA],[GGG],Gly], [[CAT,CAC],His], [[ATT,ATC],[ATA],Ile], [[TTA],[TTG],[CTT,CTC],[CTA,CTG],Leu], [[AAA],[AAG],Lys], [[ATG],Met], [[TTT,TTC],Phe], [[CCT,CCC],[CCA,CCG],Pro], [[TCT,TCC],[TCA],[TCG],[AGT,AGC],Ser], [[ACT,ACC],[ACA],[ACG],Thr], [[TGG],Trp], [[TAT,TAC],Tyr], [[GTT,GTC],[GTA],[GTG],Val] ]:
From these tables we would like to create various functions which convert to and from codons, amino acids and tRNAs. This is done by the function SetuptRNA.
?SetuptRNA
Function ComputeTPI - TPI index of a DNA sequence
Calling Sequence: ComputeTPI(e,mode)
Parameters:
Name Type Description
------------------------------------------------------
e string an Entry which contains a DNA sequence
mode string (optional), the string AllAA
Returns:
list({list,numeric})
Synopsis: The TPI index measures how much correlation there exists among the
consecutive tRNAs coding for each amino acid. This autocorrelation is
measured in a way that it is insensitive to different frequencies of amino
acids, different frequencies of tRNAs and different frequencies of bases.
Two indices are computed, which are two representations of the same
magnitude. In both cases, the TPI measures the cumulative distribution of
the number of pairs of consecutive tRNAs coding for the same amino acids.
If the actual number of pairs it too low, this means that the tRNAs are
"rotated" around quite often. If the number of pairs is high, this means
that the tRNAs are "reused" often. The first value returned is a normal
deviate with the same cumulative probability of having the observed number
of pairs. A negative value means less correlation than expected, a positive
value higher correlation than expected. Since it is a normal deviate, N(0,
1), it is easy to estimate how rare the values are. E.g. 1.96 means that it
is higher only 2.5% of the time, etc. The second value is the cumulative
probability of having the observed number of pairs spread over the interval
-1 .. 1.
The function cannot compute the TPI unless the tRNA information is given.
This is normally done with the functin SetuptRNA.
If a second argument, AllAA, is given, then both indices are computed for all
the individual amino acids as well as for the ensemble. In this case, a
list of lists is returned, where each component is a list with the two
values and the name of the amino acid.
See also: ?SetuptRNA ?TPIDistr
------------
SetuptRNA( tRNAtable );
With these tables, which will also be used later, we can compute and print the tRNA frequencies of use. We place them in an array so that we can sort them and find their largest bias.
tRNA_freq := CreateArray(1..ntRNA):
for i to ntRNA do
for c in tIntToCInt(i) do tRNA_freq[i] := tRNA_freq[i] + Codons[c] od;
printf( 'tRNA %2d %s, %.2f%% %a\n', i, tIntTotRNA(i),
100*tRNA_freq[i]/DB[TotAA], tIntToCodon(i) )
od;
tRNA 1 Ala1, 3.24% {GCC,GCT}
tRNA 2 Ala2, 2.26% {GCA,GCG}
tRNA 3 Arg1, 1.21% {CGA,CGC,CGT}
tRNA 4 Arg2, 0.19% {CGG}
tRNA 5 Arg3, 2.10% {AGA}
tRNA 6 Arg4, 0.96% {AGG}
tRNA 7 Asn1, 6.08% {AAC,AAT}
tRNA 8 Asp1, 5.79% {GAC,GAT}
tRNA 9 Cys1, 1.31% {TGC,TGT}
tRNA 10 Gln1, 2.67% {CAA}
tRNA 11 Gln2, 1.23% {CAG}
tRNA 12 Glu1, 4.56% {GAA}
tRNA 13 Glu2, 1.96% {GAG}
tRNA 14 Gly1, 3.23% {GGC,GGT}
tRNA 15 Gly2, 1.13% {GGA}
tRNA 16 Gly3, 0.62% {GGG}
tRNA 17 His1, 2.15% {CAC,CAT}
tRNA 18 Ile1, 4.71% {ATC,ATT}
tRNA 19 Ile2, 1.85% {ATA}
tRNA 20 Leu1, 2.64% {TTA}
tRNA 21 Leu2, 2.68% {TTG}
tRNA 22 Leu3, 1.83% {CTC,CTT}
tRNA 23 Leu4, 2.44% {CTA,CTG}
tRNA 24 Lys1, 4.25% {AAA}
tRNA 25 Lys2, 3.05% {AAG}
tRNA 26 Met1, 2.09% {ATG}
tRNA 27 Phe1, 4.52% {TTC,TTT}
tRNA 28 Pro1, 2.04% {CCC,CCT}
tRNA 29 Pro2, 2.30% {CCA,CCG}
tRNA 30 Ser1, 3.77% {TCC,TCT}
tRNA 31 Ser2, 1.91% {TCA}
tRNA 32 Ser3, 0.88% {TCG}
tRNA 33 Ser4, 2.47% {AGC,AGT}
tRNA 34 Thr1, 3.26% {ACC,ACT}
tRNA 35 Thr2, 1.80% {ACA}
tRNA 36 Thr3, 0.82% {ACG}
tRNA 37 Trp1, 1.05% {TGG}
tRNA 38 Tyr1, 3.36% {TAC,TAT}
tRNA 39 Val1, 3.29% {GTC,GTT}
tRNA 40 Val2, 1.22% {GTA}
tRNA 41 Val3, 1.09% {GTG}
The ratio between the most frequent and less frequent is:
max(tRNA_freq) / min(tRNA_freq);
32.2856
Why is autocorrelation relevant? Autocorrelation measures the reuse of a particular codon/tRNA as contrasted with its use. E.g. a codon may be very rare, that is it appears with low frequency, but it could be highly autocorrelated, i.e. it is very likely to appears in tandems. In terms of reloading tRNA molecules or simply having the tRNA available, autocorrelation of codons/tRNA will give us an indication of whether this is a relevant process for the regulation of protein synthesis.
Besides their biased frequencies, codons, tRNA and amino acids show signs of autocorrelation. Autocorrelation is a departure of the expected frequency when we know the previous symbol. In other words, the frequencies of pairs are not the same as the product of the individual frequencies. It is of particular interest to study the autocorrelation of the codons coding for one particular amino acid or the tRNAs coding for one particular amino acid. Notice that within the codons or tRNA coding for a particular amino acid, there can be no natural selection pressures as the proteins produced are the same.
We will now compute the codon autocorrelation for serine. Serine is coded by the following codons and tRNA:
tRNAtable[ AToInt(S) ];
[[TCT, TCC], [TCA], [TCG], [AGT, AGC], Ser]
For this autocorrelation we will take one protein (database entry) at a time. We will not correlate amino acids between different proteins (last of one with first of the next one), as we have no information about the order of the proteins within the yeast genome (there are also other good reasons for not doing this). It is simpler to do it for all amino acids in one pass, we just keep an array of 20 entries with the information of the previous codon used by each protein. Notice that we should skip any stop codon that we encounter which are coded as amino acid number 22.
Cod_Cod := CreateArray(1..64,1..64):
for e in Entries(DB) do
tD := SearchTag(DNA,e);
prev := CreateArray(1..20);
for i by 3 to length(tD) do
j := CodonToCInt( tD[i..i+2] );
k := CIntToInt(j);
if k = 22 then next
elif prev[k] > 0 then
Cod_Cod[prev[k],j] := Cod_Cod[prev[k],j]+1 fi;
prev[k] := j;
od
od:
Now we print the portion of the matrix corresponding to serine.
for i to 64 do if CIntToA(i)='S' then
printf( '%s ', CIntToCodon(i) );
for j to 64 do if CIntToA(j)='S' then
printf( '%6d', Cod_Cod[i,j] ) fi od;
printf( '\n' )
fi od;
AGC 4001 5260 5487 4035 2757 6888 AGT 5282 7553 8544 6008 3883 9957 TCA 5755 8704 12409 8231 5468 13571 TCC 4171 6136 8097 6856 3874 11057 TCG 2923 3909 5396 3976 2765 6132 TCT 6476 9771 14028 11131 6260 19044
From the counts alone it is difficult to see the autocorrelation. The different codons have different frequencies. To see the autocorrelation we will subtract the expected values of the counts and divide by the standard deviation. This is what is called a z-transform, and it has the advantage of revealing how far from expected our results are. In Np we will store the total number of pairs of codons. The probability of a pair appearing is the product of the individual probabilities, divided by the probability of having a serine (this is because the pairs are counted exclusively among the serines). After a simple cancellation we obtain the formula below.
Np := sum(sum(Cod_Cod));
for i to 64 do if CIntToA(i)='S' then
printf( '%s ', CIntToCodon(i) );
for j to 64 do if CIntToA(j)='S' then
pr := Codons[i]/DB[TotAA] * Codons[j]/Size(S);
std := sqrt( Np*pr*(1-pr) );
printf( '%7.2f', ( Cod_Cod[i,j] - Np*pr ) / std ) fi od;
printf( '\n' )
fi od;
Np := 2780865 AGC 15.33 10.69 -5.73 -5.68 0.19 -4.97 AGT 11.02 12.26 -0.23 -4.61 -1.37 -5.95 TCA -2.24 1.50 11.62 -1.01 3.98 -1.92 TCC -3.63 -3.00 -2.48 8.41 0.27 7.79 TCG 3.36 -0.96 2.98 1.92 7.42 -3.27 TCT -9.80 -7.76 1.98 8.53 -1.67 15.56
We can see that there are obvious signs of autocorrelation, the diagonal values are all positive and almost always the largest for their columns/rows. There are also signs of tRNA autocorrelation. If we observe that codons AGC,AGT are translated by the same tRNA and ditto for codons TCC,TCT we see autocorrelation also at the tRNA level.
The above analysis of autocorrelation suffers from a bias that may be induced by different base frequencies in different parts of the genome. It is known that some parts of the genome are GC rich while other parts are GC poor. Such long-streched biases induce an autocorrelation in the codons, which could be significant.
So we would like to measure the autocorrelation in a way that is independent of the base composition. Furthermore it would be nice to be able to "add" together the autocorrelations of different amino acids. To meet these goals we have defined the TPI (tRNA Pairing Index).
The TPI is an index which is computed for each protein and measures how autocorrelated (positively or negatively) are its codons. Is is completely independent of the frequencies of the amino acids, tRNA or codons, so it will not suffer from any of the common sources of bias.
The way of measuring the autocorrelation independently of everything else is by transforming the problem into a combinatorial one. For example, suppose that we are considering an amino acid X which occurs 7 times in a protein and can be translated by two tRNAs, A and B (e.g. 3 A's and 4 B's). We will extract the tRNAs from our sequence and represent them as a sequence of 7 symbols. E.g. AABABBB.
Highly autocorrelated cases are AAABBBB and BBBBAAA. A highly negatively autocorrelated case is BABABAB. This autocorrelation can be quantified by the number of identical pairs in the sequence or, conversely, by the number of changes (nc) as we read from left to right. Notice that for a sequence of length n, the number of identical pairs plus the number of changes is n-1. The mathematics is completely analogous for the number of pairs or number of changes. We call these breaks in the sequences changes, with the thought that if a tRNA molecule is doing the translation for one particular amino acid, when these breaks happen, this tRNA will have to be changed for another molecule. The first two examples have 1 change each, the last example has 6 changes. The TPI measures how high the actual number of pairs are or how low the nc is compared to all possible permutations of the sequence of tRNAs. In this case we have 35 cases:
| sequence/nc | sequence/nc | sequence/nc | sequence/nc | sequence/nc |
| AAABBBB 1 | AABABBB 3 | AABBABB 3 | AABBBAB 3 | AABBBBA 2 |
| ABAABBB 3 | ABABABB 5 | ABABBAB 5 | ABABBBA 4 | ABBAABB 3 |
| ABBABAB 5 | ABBABBA 4 | ABBBAAB 3 | ABBBABA 4 | ABBBBAA 2 |
| BAAABBB 2 | BAABABB 4 | BAABBAB 4 | BAABBBA 3 | BABAABB 4 |
| BABABAB 6 | BABABBA 5 | BABBAAB 4 | BABBABA 5 | BABBBAA 3 |
| BBAAABB 2 | BBAABAB 4 | BBAABBA 3 | BBABAAB 4 | BBABABA 5 |
| BBABBAA 3 | BBBAAAB 2 | BBBAABA 3 | BBBABAA 3 | BBBBAAA 1 |
We can compute the probability of a given nc happening. This is the number of times they happen divided by 35 (each of the permutation is equally likely to happen). The following histogram shows the distribution of the nc. From such a curve we can judge how rare or common a given nc is.

We will now write a function to compute the probability and cumulative distribution of a given nc. This is a fairly mathematical and computational problem, and some readers may want to skip to the section where the TPI is computed.
It is easy to observe that the probability of the number of changes (nc) does not depend on what the symbols are, but rather depends on how many of each. We will write a function, called nc, that computes the probability of a having a given nc from a random permutation. This function will be based on another function, called nc_r, which does the recursive part of the computation. Nc_r assumes that we are not at the beginning of the sequence, but rather that the last symbol observed is known. To identify this known symbol (all symbols are otherwise equivalent), we will make it the first of the arguments. Our function nc_r assumes that it is called with a symbol of the first class preceeding the n1+n2 other symbols.
nc_r := proc( nc:integer, n1:integer, n2:integer )
If any of the arguments is negative or if more changes than symbols are exptected, then the answer is 0.
if n1 < 0 or n2 < 0 or nc < 0 or nc > n1+n2 then 0
if there are no more symbols (and nc must be 0) then the probability is 1.
elif n1=0 and n2=0 then 1
Now we recurse. The first symbol is either from the class of n1 (no change) or from the class of n2, in which case now the preceeding symbol is of the second class and we invert the arguments.
else n1/(n1+n2) * nc_r( nc, n1-1, n2 ) +
n2/(n1+n2) * nc_r( nc-1, n2-1, n1 )
fi
end:
Testing it with the above example, computing the probability of 3 changes assuming that the previous symbol was an A:
nc_r( 3, 3, 4 ) * 35;
9.0000
There are 6 cases starting with A which have 3 changes and 3 cases starting with B with 2 changes. This coincides with the probability computed (once multiplied by 35). Before we resolve the original problem we will generalize nc_r to be able to handle any number of groups of symbols. In our application we may have up to 4 symbols (yeast has some amino acids which are translated by 4 different tRNA molecules). Also, the code as written above, is badly exponential. We could use ideas from dynamic programming to speed it up. In this case it is much simpler to use the remember function to avoid recomputing cases.
?remember
Function Clusters - find Clusters of seqs or objects
Calling Sequence: Clusters(seqs,lim)
Clusters(AllAll,lim)
Clusters(Dist,lim)
Parameters:
Name Type Description
-------------------------------------------------------------------
seqs list(string) a list of Sequences (DNA or proteins)
AllAll matrix(Alignment) all vs all Alignment matrix
Dist matrix(numeric) all vs all distance matrix (symmetric)
lim symbol = positive mode and value used to define clusters
Returns:
list(set(posint))
Synopsis: This function finds clusters in a set of sequences or any objects
from their distance or similarity constraints. The input is either a set of
sequences or a distance matrix or an AllAll matrix and the result is a list
of sets of clusters. The components of the clusters are identified by the
indices to the seqs or AllAll or Dist arrays. The parameters can be:
List of sequences - n sequences. The sequences are aligned all against all
using Global alignments with the default DM matrix. (the rest
is as with AllAll matrix).
AllAll matrix - an n x n symmetric matrix of Alignments. If the cluster
definition is based on MaxDistance=ddd or AveDistance=dd then
the clusters are selected so that the PamDistance (or average)
of the Alignments are less than ddd. If MinSimil=sss or
AveSimil=sss is specified, the the clusters will be determined
by the Score (or average) of the Alignments being larger than
sss.
Distance matrix - an n x n symmetric distance matrix. MaxDistance=ddd or
AveDistance=ddd should be specified and the clusters are
determined by this maximum/average distance.
MaxDistance = ddd - The clusters are determined by the distance ddd. I.e. any
two sequences or objects which are separated by less than ddd
will be part of the same cluster
AveDistance = ddd - The clusters are determined by the distance ddd. The
clusters are built one at a time, starting with the first
sequence/object and adding one member at a time. The member
added is the one whose average distance to the rest of the
cluster is less than ddd. The clusters built this way, may
depend on the order of the input sequences.
MinSimil = sss - Like MaxDistance, but the selection criteria is based on
Similarity or Score being greater than sss.
AveSimil = sss - Like AveDistance, but the selection criteria is based on the
average Similarity or Score being greater than sss.
The output is the list of sets of indices. Each set is a cluster. All
indices are included, hence some clusters may be singletons.
Examples:
> seqs := [SSSSS, AAAAA, AAAAS, SASSS, SSSSA, ASAAA]:
> Clusters(seqs,AveSimil=8);
[{1,4,5}, {2,3,6}]
See also: ?CircularTour ?ComputeTSP ?FindCircularOrder ?Malign
------------
The code has a few other optimizations which are intended to reuse previous computations as much as possible and to cut useless computation as early as possible. One of the ways of reusing the previous results as much as possible is to keep the 2nd to last sizes (3rd to last argument) in increasing order. (The order of these arguments does not matter, only the first is relevant).
nc_r := proc( ch:integer, n1:integer, n2:integer ) n := sum( args[i], i=2..nargs );
resolve as many trivial cases as possible
if min(args) < 0 or ch > n or ch > 2*(n-max(args[2..nargs])) + 1 then 0 elif max(args) = 0 then 1 elif nargs >= 3 and n2=0 then nc_r(ch,n1,args[4..nargs])
if any of the args[3..nargs] are out of order, reorder them and call recursively
else for i from 3 to nargs-1 do if args[i] > args[i+1] then return( nc_r( args[1..i-1], args[i+1], args[i], args[i+2..nargs] )) fi od;
compute recursively (remember/reuse previous values)
r := n1/n * remember( nc_r( ch, n1-1, args[3..nargs] ));
for i from 3 to nargs do
r := r + args[i]/n * remember( nc_r( ch-1, args[i]-1,
args[2..i-1], args[i+1..nargs] ))
od;
r
fi
end:
Warning: procedure nc_r reassigned
Now we can write our main function to compute the probability of a given nc. To make coding simpler, we assume we have an extra symbol which appears 0 times and is preceeding our sequence. This works correctly unless we have the exceptional case of no changes and no symbols at all. The function becomes:
nc := proc( ch )
if {args}=0 then 1 else nc_r( ch+1, 0, args[2..nargs] ) fi end:
and we can test it with a more significant example, say 7 and 10 symbols:
DrawHistogram( [ seq( nc(i,7,10), i=1..17 ) ] );

It is wise to check that the probabilities add up to 1:
sum( nc(i,7,10), i=1..17 ) - 1;
-1.1102e-16
the difference to 1 is well within round-off errors.
Due to the importance of this function in our research and its complexity for large arguments, it has been implemented in the kernel to make it faster and more compact. The kernel function is called TPIDistr and computes the entire distribution of the nc for a given set of number of symbols. For example:
DrawHistogram( TPIDistr(5,10,20,30) );

Finally, to estimate how rare a given nc is, we need to compute its cumulative distribution. Since the distribution is over the integers, we will take the cumulative distribution which adds one half of the probability at the point. (The second to last arguments of this function are the numbers of each symbol.)
nc_cum := proc( ch:integer ) d := TPIDistr(args[2..nargs]); if ch < 0 then 0 elif ch >= length(d) then 1 else d[ch+1] / 2 + sum( d[1..ch] ) fi end: nc_cum(3,7,10);
0.00365076
The previous computation was designed to find how common/rare is a particular number of changes for a single amino acid. When we study a database entry, we would like to take all the amino acids and somehow summarize all the pairs/changes information into a single index. To do this we will consider that an nc is the unit to measure and we will add the nc's of all the amino acids. The distribution of the sum of all the nc's is the convolution of the individual distributions for each amino acid.
The previous computation was designed to find how common/rare is a particular number of changes for a single amino acid. When we study a database entry, we would like to take all the amino acids and somehow summarize all the pairs/changes information into a single index. To do this we will consider that an nc is the unit to measure and we will add the nc's of all the amino acids. The distribution of the sum of all the nc's is the convolution of the individual distributions for each amino acid.
Darwin has a function which does this computation for a given Entry called ComputeTPI. The Entry has to contain DNA information for this to be possible. ComputeTPI computes the number of pairs/changes for each amino acid and the total number. Then computes the distribution of the pairs/changes for each amino acid and convolves the distributions. Finally it computes the cumulative probability. Two indices, mathematically equivalent, are returned from this cumulative probability. The first one is just a normal N(0,1) deviate with the same cumulative probability. The second is just the cumulative distribution linearly spread over the interval -1 ... 1. We test the function against an arbitrary Entry:
ComputeTPI( Entry(9) );
[1.2222, 0.7783]
Darwin has a function which does this computation for a given Entry called ComputeTPI. The Entry has to contain DNA information for this to be possible. ComputeTPI computes the number of pairs/changes for each amino acid and the total number. Then computes the distribution of the pairs/changes for each amino acid and convolves the distributions. Finally it computes the cumulative probability. Two indices, mathematically equivalent, are returned from this cumulative probability. The first one is just a normal N(0,1) deviate with the same cumulative probability. The second is just the cumulative distribution linearly spread over the interval -1 ... 1. We test the function against an arbitrary Entry:
ComputeTPI( Entry(9) );
[1.2222, 0.7783]
We now compute the univariate and covariate statistics of the TPI indices for a few random entries.
cov := Covariance( 'TPI index', [normal,'-1...1'] ): to 100 do cov + ComputeTPI( Rand(Entry) ) od: print( cov );
Covariance analysis for TPI index, 100 data points
normal -1...1
Means 0.1284 0.0572
Covariance matrix
normal 0.9110
-1...1 0.5252 0.3137
and we see that the TPI indices are correlated and their means are positively biased. (This is also true for the entire yeast genome). This means that there is a positive autocorrelation of tRNA completely independent of their frequencies.
© 2005 by Gaston Gonnet, Informatik, ETH Zurich
Please cite as:
@techreport{ Gonnet-TPI,
author = {Gaston H. Gonnet},
title = {The tRNA Pairing Index},
month = { April },
year = {2003},
number = { 399 },
howpublished = { Electronic publication },
copyright = {code under the GNU General Public License},
institution = {Informatik, ETH, Zurich},
URL = { http://www.biorecipes.com/g/TPI/code.html }
}
Last updated on Fri Sep 2 17:08:50 2005 by GhG