The tRNA Pairing Index


by Gaston H. Gonnet

     It is well known that coding DNA converts to mRNA and that the mRNA is translated into proteins with the mediation of tRNA molecules. The translation rules from DNA to mRNA are trivial. The Thymines (T) are changed to Uracils (U). The translation from mRNA to amino acids, for most forms of life, is governed by the genetic code.

?GeneticCode
the following help files contain (approximately) the keyword:
  ?AAAToInt
  ?AToCInt
  ?AToCodon
  ?AToInt
  ?AltGenCode
  ?AminoToInt
  ?BBBToInt
  ?BToInt
  ?BaseToInt
  ?CIntToA
  ?CIntToAAA
  ?CIntToAmino
  ?CIntToCodon
  ?CIntToInt
  ?CodonToA
  ?CodonToCInt
  ?CodonToInt
  ?Complement
  ?ComplementSequence
  ?GetComplement
  ?IntToA
  ?IntToAAA
  ?IntToAmino
  ?IntToB
  ?IntToBBB
  ?IntToBase
  ?IntToCInt
  ?IntToCodon
   ------------

     The groups of 3 nucleotides which are used for coding an amino acid or a stop signal are called codons. The amino acids which have more than one codon represent an ambiguity or redundancy which is not known to play a definitive role. Normally, each codon is translated by one tRNA molecule, although a tRNA molecule can translate more than one codon. There are 61 different codons, 41 tRNA molecules (for Saccharomyces cerevisiae, baker's yeast) and 20 amino acids. So the genetic code has two levels of redundancy.

mappings from codons to tRNA to amino acids

     The amino acids have particular frequencies which are correlated to the number of codons coding for it, but otherwise not uniform. The "Total Number of Pairs of Amino Acids" bio-recipe shows how to compute amino acid and pairs of amino acids frequencies. We will load the yeast database and compute these frequencies.

ReadDb( '/home/darwin/DB/genomes/YEAST/yeast.db' );
Size := proc(x) t := SearchSeqDb(x);  t[1,2]-t[1,1]+1 end:
for i to 20 do
    s := Size(IntToA(i));
    printf( '%13s  %6d  %.2f%%\n',
	IntToAmino(i), s, 100*s/DB[TotAA] )
    od:
Peptide file(/home/darwin/DB/genomes/YEAST/yeast.db(12985775), 6194 entries, 
2902851 aminoacids)
      Alanine  159685  5.50%
     Arginine  129516  4.46%
   Asparagine  176570  6.08%
Aspartic acid  168212  5.79%
     Cysteine   38075  1.31%
    Glutamine  113385  3.91%
Glutamic acid  189023  6.51%
      Glycine  144464  4.98%
    Histidine   62450  2.15%
   Isoleucine  190412  6.56%
      Leucine  278480  9.59%
       Lysine  211892  7.30%
   Methionine   60549  2.09%
Phenylalanine  131155  4.52%
      Proline  125962  4.34%
       Serine  261986  9.03%
    Threonine  170558  5.88%
   Tryptophan   30403  1.05%
     Tyrosine   97413  3.36%
       Valine  162661  5.60%

     The RNA codons used are not uniformly distributed either. The departure of the codon frequencies is normally called codon bias. It is also true, and sort of a consequence of the above, that the tRNAs also have a biased distribution. We will use the DNA fields of the yeast database. to count the codon frequencies (a codon is 3 DNA/RNA bases in the correct reading frame).

Codons := CreateArray(1..64):
for e in Entries(DB) do
    tD := SearchTag(DNA,e);
    for i by 3 to length(tD) do
	j := CodonToCInt( tD[i..i+2] );
	Codons[j] := Codons[j]+1
	od
    od:
for i to 64 do
    printf( '   %s %.2f%%', CIntToCodon(i), 100*Codons[i]/DB[TotAA] );
    if mod(i,4)=0 then lprint() fi
    od:
   AAA 4.25%   AAC 2.46%   AAG 3.05%   AAT 3.63%
   ACA 1.80%   ACC 1.25%   ACG 0.82%   ACT 2.01%
   AGA 2.10%   AGC 1.01%   AGG 0.96%   AGT 1.46%
   ATA 1.85%   ATC 1.69%   ATG 2.09%   ATT 3.02%
   CAA 2.67%   CAC 0.77%   CAG 1.23%   CAT 1.38%
   CCA 1.76%   CCC 0.69%   CCG 0.54%   CCT 1.34%
   CGA 0.32%   CGC 0.27%   CGG 0.19%   CGT 0.63%
   CTA 1.36%   CTC 0.57%   CTG 1.08%   CTT 1.26%
   GAA 4.56%   GAC 2.01%   GAG 1.96%   GAT 3.78%
   GCA 1.63%   GCC 1.22%   GCG 0.63%   GCT 2.02%
   GGA 1.13%   GGC 0.98%   GGG 0.62%   GGT 2.25%
   GTA 1.22%   GTC 1.13%   GTG 1.09%   GTT 2.16%
   TAA 0.10%   TAC 1.45%   TAG 0.05%   TAT 1.91%
   TCA 1.91%   TCC 1.42%   TCG 0.88%   TCT 2.35%
   TGA 0.06%   TGC 0.50%   TGG 1.05%   TGT 0.81%
   TTA 2.64%   TTC 1.83%   TTG 2.68%   TTT 2.69%

The stop codons, TAA, TAG and TGA show frequency 0 as they never appear in a coding sequence. The DNA that we have recorded in this database is just the coding part. The most and least probable codons are quite far apart, if we sort the values (and ignore the 3 stop codons) we can compute the worst ratio:

sCodons := sort(Codons):
sCodons[64] / sCodons[4];
24.1807

There is a factor of 24 between the least and most frequent codons.

     To count the frequencies of the tRNAs we can use the results from the previously computed codon frequencies, but we need to use the tRNA translation table.

tRNAtable := [
  [[GCT,GCC],[GCA,GCG],Ala],
  [[CGT,CGC,CGA],[CGG],[AGA],[AGG],Arg],
  [[AAT,AAC],Asn],
  [[GAT,GAC],Asp],
  [[TGT,TGC],Cys],
  [[CAA],[CAG],Gln],
  [[GAA],[GAG],Glu],
  [[GGT,GGC],[GGA],[GGG],Gly],
  [[CAT,CAC],His],
  [[ATT,ATC],[ATA],Ile],
  [[TTA],[TTG],[CTT,CTC],[CTA,CTG],Leu],
  [[AAA],[AAG],Lys],
  [[ATG],Met],
  [[TTT,TTC],Phe],
  [[CCT,CCC],[CCA,CCG],Pro],
  [[TCT,TCC],[TCA],[TCG],[AGT,AGC],Ser],
  [[ACT,ACC],[ACA],[ACG],Thr],
  [[TGG],Trp],
  [[TAT,TAC],Tyr],
  [[GTT,GTC],[GTA],[GTG],Val] ]:

     From these tables we would like to create various functions which convert to and from codons, amino acids and tRNAs. This is done by the function SetuptRNA.

?SetuptRNA

Function ComputeTPI - TPI index of a DNA sequence

Calling Sequence:  ComputeTPI(e,mode)

Parameters:
  Name   Type     Description                           
  ------------------------------------------------------
  e      string   an Entry which contains a DNA sequence
  mode   string   (optional), the string AllAA          

Returns:
	list({list,numeric})
Synopsis:  The TPI index measures how much correlation there exists among the
  consecutive tRNAs coding for each amino acid.  This autocorrelation is
  measured in a way that it is insensitive to different frequencies of amino
  acids, different frequencies of tRNAs and different frequencies of bases. 
  Two indices are computed, which are two representations of the same
  magnitude.  In both cases, the TPI measures the cumulative distribution of
  the number of pairs of consecutive tRNAs coding for the same amino acids. 
  If the actual number of pairs it too low, this means that the tRNAs are
  "rotated" around quite often.  If the number of pairs is high, this means
  that the tRNAs are "reused" often.  The first value returned is a normal
  deviate with the same cumulative probability of having the observed number
  of pairs.  A negative value means less correlation than expected, a positive
  value higher correlation than expected.  Since it is a normal deviate, N(0,
  1), it is easy to estimate how rare the values are.  E.g. 1.96 means that it
  is higher only 2.5% of the time, etc.  The second value is the cumulative
  probability of having the observed number of pairs spread over the interval
  -1 .. 1.
The function cannot compute the TPI unless the tRNA information is given. 
  This is normally done with the functin SetuptRNA.
If a second argument, AllAA, is given, then both indices are computed for all
  the individual amino acids as well as for the ensemble.  In this case, a
  list of lists is returned, where each component is a list with the two
  values and the name of the amino acid.
See also:  ?SetuptRNA  ?TPIDistr


   ------------
SetuptRNA( tRNAtable );

     With these tables, which will also be used later, we can compute and print the tRNA frequencies of use. We place them in an array so that we can sort them and find their largest bias.

tRNA_freq := CreateArray(1..ntRNA):
for i to ntRNA do
    for c in tIntToCInt(i) do tRNA_freq[i] := tRNA_freq[i] + Codons[c] od;
    printf( 'tRNA %2d %s,  %.2f%%  %a\n', i, tIntTotRNA(i),
	100*tRNA_freq[i]/DB[TotAA], tIntToCodon(i) )
    od;
tRNA  1 Ala1,  3.24%  {GCC,GCT}
tRNA  2 Ala2,  2.26%  {GCA,GCG}
tRNA  3 Arg1,  1.21%  {CGA,CGC,CGT}
tRNA  4 Arg2,  0.19%  {CGG}
tRNA  5 Arg3,  2.10%  {AGA}
tRNA  6 Arg4,  0.96%  {AGG}
tRNA  7 Asn1,  6.08%  {AAC,AAT}
tRNA  8 Asp1,  5.79%  {GAC,GAT}
tRNA  9 Cys1,  1.31%  {TGC,TGT}
tRNA 10 Gln1,  2.67%  {CAA}
tRNA 11 Gln2,  1.23%  {CAG}
tRNA 12 Glu1,  4.56%  {GAA}
tRNA 13 Glu2,  1.96%  {GAG}
tRNA 14 Gly1,  3.23%  {GGC,GGT}
tRNA 15 Gly2,  1.13%  {GGA}
tRNA 16 Gly3,  0.62%  {GGG}
tRNA 17 His1,  2.15%  {CAC,CAT}
tRNA 18 Ile1,  4.71%  {ATC,ATT}
tRNA 19 Ile2,  1.85%  {ATA}
tRNA 20 Leu1,  2.64%  {TTA}
tRNA 21 Leu2,  2.68%  {TTG}
tRNA 22 Leu3,  1.83%  {CTC,CTT}
tRNA 23 Leu4,  2.44%  {CTA,CTG}
tRNA 24 Lys1,  4.25%  {AAA}
tRNA 25 Lys2,  3.05%  {AAG}
tRNA 26 Met1,  2.09%  {ATG}
tRNA 27 Phe1,  4.52%  {TTC,TTT}
tRNA 28 Pro1,  2.04%  {CCC,CCT}
tRNA 29 Pro2,  2.30%  {CCA,CCG}
tRNA 30 Ser1,  3.77%  {TCC,TCT}
tRNA 31 Ser2,  1.91%  {TCA}
tRNA 32 Ser3,  0.88%  {TCG}
tRNA 33 Ser4,  2.47%  {AGC,AGT}
tRNA 34 Thr1,  3.26%  {ACC,ACT}
tRNA 35 Thr2,  1.80%  {ACA}
tRNA 36 Thr3,  0.82%  {ACG}
tRNA 37 Trp1,  1.05%  {TGG}
tRNA 38 Tyr1,  3.36%  {TAC,TAT}
tRNA 39 Val1,  3.29%  {GTC,GTT}
tRNA 40 Val2,  1.22%  {GTA}
tRNA 41 Val3,  1.09%  {GTG}

     The ratio between the most frequent and less frequent is:

max(tRNA_freq) / min(tRNA_freq);
32.2856

Autocorrelation of codons, tRNA or amino acids

     Why is autocorrelation relevant? Autocorrelation measures the reuse of a particular codon/tRNA as contrasted with its use. E.g. a codon may be very rare, that is it appears with low frequency, but it could be highly autocorrelated, i.e. it is very likely to appears in tandems. In terms of reloading tRNA molecules or simply having the tRNA available, autocorrelation of codons/tRNA will give us an indication of whether this is a relevant process for the regulation of protein synthesis.

     Besides their biased frequencies, codons, tRNA and amino acids show signs of autocorrelation. Autocorrelation is a departure of the expected frequency when we know the previous symbol. In other words, the frequencies of pairs are not the same as the product of the individual frequencies. It is of particular interest to study the autocorrelation of the codons coding for one particular amino acid or the tRNAs coding for one particular amino acid. Notice that within the codons or tRNA coding for a particular amino acid, there can be no natural selection pressures as the proteins produced are the same.

     We will now compute the codon autocorrelation for serine. Serine is coded by the following codons and tRNA:

tRNAtable[ AToInt(S) ];
[[TCT, TCC], [TCA], [TCG], [AGT, AGC], Ser]

For this autocorrelation we will take one protein (database entry) at a time. We will not correlate amino acids between different proteins (last of one with first of the next one), as we have no information about the order of the proteins within the yeast genome (there are also other good reasons for not doing this). It is simpler to do it for all amino acids in one pass, we just keep an array of 20 entries with the information of the previous codon used by each protein. Notice that we should skip any stop codon that we encounter which are coded as amino acid number 22.

Cod_Cod := CreateArray(1..64,1..64):
for e in Entries(DB) do
    tD := SearchTag(DNA,e);
    prev := CreateArray(1..20);
    for i by 3 to length(tD) do
	j := CodonToCInt( tD[i..i+2] );
	k := CIntToInt(j);
	if k = 22 then next
	elif prev[k] > 0 then
	    Cod_Cod[prev[k],j] := Cod_Cod[prev[k],j]+1 fi;
	prev[k] := j;
	od
    od:

     Now we print the portion of the matrix corresponding to serine.

for i to 64 do if CIntToA(i)='S' then
    printf( '%s   ', CIntToCodon(i) );
    for j to 64 do if CIntToA(j)='S' then
	printf( '%6d', Cod_Cod[i,j] ) fi od;
    printf( '\n' )
    fi od;
AGC     4001  5260  5487  4035  2757  6888
AGT     5282  7553  8544  6008  3883  9957
TCA     5755  8704 12409  8231  5468 13571
TCC     4171  6136  8097  6856  3874 11057
TCG     2923  3909  5396  3976  2765  6132
TCT     6476  9771 14028 11131  6260 19044

     From the counts alone it is difficult to see the autocorrelation. The different codons have different frequencies. To see the autocorrelation we will subtract the expected values of the counts and divide by the standard deviation. This is what is called a z-transform, and it has the advantage of revealing how far from expected our results are. In Np we will store the total number of pairs of codons. The probability of a pair appearing is the product of the individual probabilities, divided by the probability of having a serine (this is because the pairs are counted exclusively among the serines). After a simple cancellation we obtain the formula below.

Np := sum(sum(Cod_Cod));
for i to 64 do if CIntToA(i)='S' then
    printf( '%s   ', CIntToCodon(i) );
    for j to 64 do if CIntToA(j)='S' then
	pr := Codons[i]/DB[TotAA] * Codons[j]/Size(S);
	std := sqrt( Np*pr*(1-pr) );
	printf( '%7.2f', ( Cod_Cod[i,j] - Np*pr ) / std ) fi od;
    printf( '\n' )
    fi od;
Np := 2780865
AGC     15.33  10.69  -5.73  -5.68   0.19  -4.97
AGT     11.02  12.26  -0.23  -4.61  -1.37  -5.95
TCA     -2.24   1.50  11.62  -1.01   3.98  -1.92
TCC     -3.63  -3.00  -2.48   8.41   0.27   7.79
TCG      3.36  -0.96   2.98   1.92   7.42  -3.27
TCT     -9.80  -7.76   1.98   8.53  -1.67  15.56

We can see that there are obvious signs of autocorrelation, the diagonal values are all positive and almost always the largest for their columns/rows. There are also signs of tRNA autocorrelation. If we observe that codons AGC,AGT are translated by the same tRNA and ditto for codons TCC,TCT we see autocorrelation also at the tRNA level.

     The above analysis of autocorrelation suffers from a bias that may be induced by different base frequencies in different parts of the genome. It is known that some parts of the genome are GC rich while other parts are GC poor. Such long-streched biases induce an autocorrelation in the codons, which could be significant.

     So we would like to measure the autocorrelation in a way that is independent of the base composition. Furthermore it would be nice to be able to "add" together the autocorrelations of different amino acids. To meet these goals we have defined the TPI (tRNA Pairing Index).

tRNA Pairing Index for a given amino acid

     The TPI is an index which is computed for each protein and measures how autocorrelated (positively or negatively) are its codons. Is is completely independent of the frequencies of the amino acids, tRNA or codons, so it will not suffer from any of the common sources of bias.

     The way of measuring the autocorrelation independently of everything else is by transforming the problem into a combinatorial one. For example, suppose that we are considering an amino acid X which occurs 7 times in a protein and can be translated by two tRNAs, A and B (e.g. 3 A's and 4 B's). We will extract the tRNAs from our sequence and represent them as a sequence of 7 symbols. E.g. AABABBB.

     Highly autocorrelated cases are AAABBBB and BBBBAAA. A highly negatively autocorrelated case is BABABAB. This autocorrelation can be quantified by the number of identical pairs in the sequence or, conversely, by the number of changes (nc) as we read from left to right. Notice that for a sequence of length n, the number of identical pairs plus the number of changes is n-1. The mathematics is completely analogous for the number of pairs or number of changes. We call these breaks in the sequences changes, with the thought that if a tRNA molecule is doing the translation for one particular amino acid, when these breaks happen, this tRNA will have to be changed for another molecule. The first two examples have 1 change each, the last example has 6 changes. The TPI measures how high the actual number of pairs are or how low the nc is compared to all possible permutations of the sequence of tRNAs. In this case we have 35 cases:


sequence/ncsequence/ncsequence/ncsequence/ncsequence/nc

AAABBBB 1AABABBB 3AABBABB 3AABBBAB 3AABBBBA 2
ABAABBB 3ABABABB 5ABABBAB 5ABABBBA 4ABBAABB 3
ABBABAB 5ABBABBA 4ABBBAAB 3ABBBABA 4ABBBBAA 2
BAAABBB 2BAABABB 4BAABBAB 4BAABBBA 3BABAABB 4
BABABAB 6BABABBA 5BABBAAB 4BABBABA 5BABBBAA 3
BBAAABB 2BBAABAB 4BBAABBA 3BBABAAB 4BBABABA 5
BBABBAA 3BBBAAAB 2BBBAABA 3BBBABAA 3BBBBAAA 1

     We can compute the probability of a given nc happening. This is the number of times they happen divided by 35 (each of the permutation is equally likely to happen). The following histogram shows the distribution of the nc. From such a curve we can judge how rare or common a given nc is.

sequences per number of changes (nc)

Computation of the probability of the number of changes (nc)

     We will now write a function to compute the probability and cumulative distribution of a given nc. This is a fairly mathematical and computational problem, and some readers may want to skip to the section where the TPI is computed.

     It is easy to observe that the probability of the number of changes (nc) does not depend on what the symbols are, but rather depends on how many of each. We will write a function, called nc, that computes the probability of a having a given nc from a random permutation. This function will be based on another function, called nc_r, which does the recursive part of the computation. Nc_r assumes that we are not at the beginning of the sequence, but rather that the last symbol observed is known. To identify this known symbol (all symbols are otherwise equivalent), we will make it the first of the arguments. Our function nc_r assumes that it is called with a symbol of the first class preceeding the n1+n2 other symbols.

nc_r := proc( nc:integer, n1:integer, n2:integer )

If any of the arguments is negative or if more changes than symbols are exptected, then the answer is 0.

if n1 < 0 or n2 < 0 or nc < 0 or nc > n1+n2 then 0

if there are no more symbols (and nc must be 0) then the probability is 1.

elif n1=0 and n2=0 then 1

Now we recurse. The first symbol is either from the class of n1 (no change) or from the class of n2, in which case now the preceeding symbol is of the second class and we invert the arguments.

else n1/(n1+n2) * nc_r( nc, n1-1, n2 ) +
     n2/(n1+n2) * nc_r( nc-1, n2-1, n1 )
     fi
end:

     Testing it with the above example, computing the probability of 3 changes assuming that the previous symbol was an A:

nc_r( 3, 3, 4 ) * 35;
9.0000

There are 6 cases starting with A which have 3 changes and 3 cases starting with B with 2 changes. This coincides with the probability computed (once multiplied by 35). Before we resolve the original problem we will generalize nc_r to be able to handle any number of groups of symbols. In our application we may have up to 4 symbols (yeast has some amino acids which are translated by 4 different tRNA molecules). Also, the code as written above, is badly exponential. We could use ideas from dynamic programming to speed it up. In this case it is much simpler to use the remember function to avoid recomputing cases.

?remember

Function Clusters - find Clusters of seqs or objects

Calling Sequence:  Clusters(seqs,lim)
                   Clusters(AllAll,lim)
                   Clusters(Dist,lim)

Parameters:
  Name     Type                Description                           
  -------------------------------------------------------------------
  seqs     list(string)        a list of Sequences (DNA or proteins) 
  AllAll   matrix(Alignment)   all vs all Alignment matrix           
  Dist     matrix(numeric)     all vs all distance matrix (symmetric)
  lim      symbol = positive   mode and value used to define clusters

Returns:
	list(set(posint))
Synopsis:  This function finds clusters in a set of sequences or any objects
  from their distance or similarity constraints.  The input is either a set of
  sequences or a distance matrix or an AllAll matrix and the result is a list
  of sets of clusters.  The components of the clusters are identified by the
  indices to the seqs or AllAll or Dist arrays.  The parameters can be:
List of sequences - n sequences.  The sequences are aligned all against all
               using Global alignments with the default DM matrix.  (the rest
               is as with AllAll matrix).
AllAll matrix - an n x n symmetric matrix of Alignments.  If the cluster
               definition is based on MaxDistance=ddd or AveDistance=dd then
               the clusters are selected so that the PamDistance (or average)
               of the Alignments are less than ddd.  If MinSimil=sss or
               AveSimil=sss is specified, the the clusters will be determined
               by the Score (or average) of the Alignments being larger than
               sss.
Distance matrix - an n x n symmetric distance matrix.  MaxDistance=ddd or
               AveDistance=ddd should be specified and the clusters are
               determined by this maximum/average distance.
MaxDistance = ddd - The clusters are determined by the distance ddd.  I.e. any
               two sequences or objects which are separated by less than ddd
               will be part of the same cluster
AveDistance = ddd - The clusters are determined by the distance ddd.  The
               clusters are built one at a time, starting with the first
               sequence/object and adding one member at a time.  The member
               added is the one whose average distance to the rest of the
               cluster is less than ddd.  The clusters built this way, may
               depend on the order of the input sequences.
MinSimil = sss - Like MaxDistance, but the selection criteria is based on
               Similarity or Score being greater than sss.
AveSimil = sss - Like AveDistance, but the selection criteria is based on the
               average Similarity or Score being greater than sss.
The output is the list of sets of indices.  Each set is a cluster.  All
  indices are included, hence some clusters may be singletons.

Examples:
> seqs := [SSSSS, AAAAA, AAAAS, SASSS, SSSSA, ASAAA]:
> Clusters(seqs,AveSimil=8);
[{1,4,5}, {2,3,6}]
See also:  ?CircularTour  ?ComputeTSP  ?FindCircularOrder  ?Malign


   ------------

The code has a few other optimizations which are intended to reuse previous computations as much as possible and to cut useless computation as early as possible. One of the ways of reusing the previous results as much as possible is to keep the 2nd to last sizes (3rd to last argument) in increasing order. (The order of these arguments does not matter, only the first is relevant).

nc_r := proc( ch:integer, n1:integer, n2:integer )
n := sum( args[i], i=2..nargs );

resolve as many trivial cases as possible

if min(args) < 0 or ch > n or ch > 2*(n-max(args[2..nargs])) + 1 then 0
elif max(args) = 0 then 1
elif nargs >= 3 and n2=0 then nc_r(ch,n1,args[4..nargs])

if any of the args[3..nargs] are out of order, reorder them and call recursively

else for i from 3 to nargs-1 do if args[i] > args[i+1] then
	 return( nc_r( args[1..i-1], args[i+1], args[i], args[i+2..nargs] ))
	 fi od;

compute recursively (remember/reuse previous values)

     r := n1/n * remember( nc_r( ch, n1-1, args[3..nargs] ));
     for i from 3 to nargs do
	 r := r + args[i]/n * remember( nc_r( ch-1, args[i]-1,
	     args[2..i-1], args[i+1..nargs] ))
	 od;
     r
     fi
end:
Warning: procedure nc_r reassigned

     Now we can write our main function to compute the probability of a given nc. To make coding simpler, we assume we have an extra symbol which appears 0 times and is preceeding our sequence. This works correctly unless we have the exceptional case of no changes and no symbols at all. The function becomes:

nc := proc( ch )
if {args}=0 then 1 else nc_r( ch+1, 0, args[2..nargs] ) fi end:

and we can test it with a more significant example, say 7 and 10 symbols:

DrawHistogram( [ seq( nc(i,7,10), i=1..17 ) ] );

probabilities per number of changes (nc)

     It is wise to check that the probabilities add up to 1:

sum( nc(i,7,10), i=1..17 ) - 1;
-1.1102e-16

the difference to 1 is well within round-off errors.

     Due to the importance of this function in our research and its complexity for large arguments, it has been implemented in the kernel to make it faster and more compact. The kernel function is called TPIDistr and computes the entire distribution of the nc for a given set of number of symbols. For example:

DrawHistogram( TPIDistr(5,10,20,30) );

probabilities per number of changes (nc)

     Finally, to estimate how rare a given nc is, we need to compute its cumulative distribution. Since the distribution is over the integers, we will take the cumulative distribution which adds one half of the probability at the point. (The second to last arguments of this function are the numbers of each symbol.)

nc_cum := proc( ch:integer )
d := TPIDistr(args[2..nargs]);
if ch < 0 then 0
elif ch >= length(d) then 1
else d[ch+1] / 2 + sum( d[1..ch] ) fi
end:
nc_cum(3,7,10);
0.00365076

Computing the TPI for all the amino acids in a given entry

     The previous computation was designed to find how common/rare is a particular number of changes for a single amino acid. When we study a database entry, we would like to take all the amino acids and somehow summarize all the pairs/changes information into a single index. To do this we will consider that an nc is the unit to measure and we will add the nc's of all the amino acids. The distribution of the sum of all the nc's is the convolution of the individual distributions for each amino acid.

     The previous computation was designed to find how common/rare is a particular number of changes for a single amino acid. When we study a database entry, we would like to take all the amino acids and somehow summarize all the pairs/changes information into a single index. To do this we will consider that an nc is the unit to measure and we will add the nc's of all the amino acids. The distribution of the sum of all the nc's is the convolution of the individual distributions for each amino acid.

     Darwin has a function which does this computation for a given Entry called ComputeTPI. The Entry has to contain DNA information for this to be possible. ComputeTPI computes the number of pairs/changes for each amino acid and the total number. Then computes the distribution of the pairs/changes for each amino acid and convolves the distributions. Finally it computes the cumulative probability. Two indices, mathematically equivalent, are returned from this cumulative probability. The first one is just a normal N(0,1) deviate with the same cumulative probability. The second is just the cumulative distribution linearly spread over the interval -1 ... 1. We test the function against an arbitrary Entry:

ComputeTPI( Entry(9) );
[1.2222, 0.7783]

     Darwin has a function which does this computation for a given Entry called ComputeTPI. The Entry has to contain DNA information for this to be possible. ComputeTPI computes the number of pairs/changes for each amino acid and the total number. Then computes the distribution of the pairs/changes for each amino acid and convolves the distributions. Finally it computes the cumulative probability. Two indices, mathematically equivalent, are returned from this cumulative probability. The first one is just a normal N(0,1) deviate with the same cumulative probability. The second is just the cumulative distribution linearly spread over the interval -1 ... 1. We test the function against an arbitrary Entry:

ComputeTPI( Entry(9) );
[1.2222, 0.7783]

     We now compute the univariate and covariate statistics of the TPI indices for a few random entries.

cov := Covariance( 'TPI index', [normal,'-1...1'] ):
to 100 do cov + ComputeTPI( Rand(Entry) ) od:
print( cov );
Covariance analysis for TPI index, 100 data points
         normal  -1...1
Means    0.1284  0.0572
Covariance matrix
normal   0.9110
-1...1   0.5252  0.3137

and we see that the TPI indices are correlated and their means are positively biased. (This is also true for the entire yeast genome). This means that there is a positive autocorrelation of tRNA completely independent of their frequencies.

© 2005 by Gaston Gonnet, Informatik, ETH Zurich

Please cite as:

@techreport{ Gonnet-TPI,
        author = {Gaston H. Gonnet},
        title = {The tRNA Pairing Index},
        month = { April },
        year = {2003},
        number = { 399 },
        howpublished = { Electronic publication },
        copyright = {code under the GNU General Public License},
        institution = {Informatik, ETH, Zurich},
        URL = { http://www.biorecipes.com/g/TPI/code.html }
        }

Index of bio-recipes

Last updated on Fri Sep 2 17:08:50 2005 by GhG