The tRNA Pairing Index


by Gaston H. Gonnet

     It is well known that coding DNA converts to mRNA and that the mRNA is translated into proteins with the mediation of tRNA molecules. The translation rules from DNA to mRNA are trivial. The Thymines (T) are changed to Uracils (U). The translation from mRNA to amino acids, for most forms of life, is governed by the genetic code.

?GeneticCode

GGG  G  Gly    AGG  R  Arg    CGG  R  Arg    UGG  W  Trp    
GGA  G  Gly    AGA  R  Arg    CGA  R  Arg    UGA    Stop    
GGC  G  Gly    AGC  S  Ser    CGC  R  Arg    UGC  C  Cys    
GGU  G  Gly    AGU  S  Ser    CGU  R  Arg    UGU  C  Cys    

GAG  E  Glu    AAG  K  Lys    CAG  Q  Gln    UAG    Stop    
GAA  E  Glu    AAA  K  Lys    CAA  Q  Gln    UAA    Stop    
GAC  D  Asp    AAC  N  Asn    CAC  H  His    UAC  Y  Tyr    
GAU  D  Asp    AAU  N  Asn    CAU  H  His    UAU  Y  Tyr    

GCG  A  Ala    ACG  T  Thr    CCG  P  Pro    UCG  S  Ser    
GCA  A  Ala    ACA  T  Thr    CCA  P  Pro    UCA  S  Ser    
GCC  A  Ala    ACC  T  Thr    CCC  P  Pro    UCC  S  Ser    
GCU  A  Ala    ACU  T  Thr    CCU  P  Pro    UCU  S  Ser    

GUG  V  Val    AUG  M  Met    CUG  L  Leu    UUG  L  Leu    
GUA  V  Val    AUA  I  Ile    CUA  L  Leu    UUA  L  Leu    
GUC  V  Val    AUC  I  Ile    CUC  L  Leu    UUC  F  Phe    
GUU  V  Val    AUU  I  Ile    CUU  L  Leu    UUU  F  Phe    

See Also:
   ?AltGenCode      ?BaseToInt    ?CIntToAmino    ?CodonToInt     ?IntToBBB  
   ?AminoToInt      ?BBBToInt     ?CIntToCodon    ?Complement     ?IntToCInt 
   ?antiparallel    ?BToInt       ?CIntToInt      ?GeneticCode    ?IntToCodon
   ?AToCInt         ?CIntToA      ?CodonToA       ?IntToB         ?Reverse   
   ?AToCodon        ?CIntToAAA    ?CodonToCInt    ?IntToBase                 


   ------------

     The groups of 3 nucleotides which are used for coding an amino acid or a stop signal are called codons. The amino acids which have more than one codon represent an ambiguity or redundancy which is not known to play a definitive role. Normally, each codon is translated by one tRNA molecule, although a tRNA molecule can translate more than one codon. There are 61 different codons, 41 tRNA molecules (for Saccharomyces cerevisiae, baker's yeast) and 20 amino acids. So the genetic code has two levels of redundancy.

mappings from codons to tRNA to amino acids

     The amino acids have particular frequencies which are correlated to the number of codons coding for it, but otherwise not uniform. The "Total Number of Pairs of Amino Acids" bio-recipe shows how to compute amino acid and pairs of amino acids frequencies. We will load the yeast database and compute these frequencies.

ReadDb( '/home/darwin/DB/yeast.db' );
Size := proc(x) t := SearchSeqDb(x);  t[1,2]-t[1,1]+1 end:
for i to 20 do
    s := Size(IntToA(i));
    printf( '%13s  %6d  %.2f%%\n',
	IntToAmino(i), s, 100*s/DB[TotAA] )
    od:
Peptide file(/pub/projects/cbrg-oma-data/current/genomes/YEAST/yeast.db(12985775
), 6194 entries, 2902851 aminoacids)
      Alanine  159685  5.50%
     Arginine  129516  4.46%
   Asparagine  176570  6.08%
Aspartic acid  168212  5.79%
     Cysteine   38075  1.31%
    Glutamine  113385  3.91%
Glutamic acid  189023  6.51%
      Glycine  144464  4.98%
    Histidine   62450  2.15%
   Isoleucine  190412  6.56%
      Leucine  278480  9.59%
       Lysine  211892  7.30%
   Methionine   60549  2.09%
Phenylalanine  131155  4.52%
      Proline  125962  4.34%
       Serine  261986  9.03%
    Threonine  170558  5.88%
   Tryptophan   30403  1.05%
     Tyrosine   97413  3.36%
       Valine  162661  5.60%

     The RNA codons used are not uniformly distributed either. The departure of the codon frequencies is normally called codon bias. It is also true, and sort of a consequence of the above, that the tRNAs also have a biased distribution. We will use the DNA fields of the yeast database. to count the codon frequencies (a codon is 3 DNA/RNA bases in the correct reading frame).

Codons := CreateArray(1..64):
for e in Entries(DB) do
    tD := SearchTag(DNA,e);
    for i by 3 to length(tD) do
	j := CodonToCInt( tD[i..i+2] );
	Codons[j] := Codons[j]+1
	od
    od:
for i to 64 do
    printf( '   %s %.2f%%', CIntToCodon(i), 100*Codons[i]/DB[TotAA] );
    if mod(i,4)=0 then lprint() fi
    od:
   AAA 4.25%   AAC 2.46%   AAG 3.05%   AAT 3.63%
   ACA 1.80%   ACC 1.25%   ACG 0.82%   ACT 2.01%
   AGA 2.10%   AGC 1.01%   AGG 0.96%   AGT 1.46%
   ATA 1.85%   ATC 1.69%   ATG 2.09%   ATT 3.02%
   CAA 2.67%   CAC 0.77%   CAG 1.23%   CAT 1.38%
   CCA 1.76%   CCC 0.69%   CCG 0.54%   CCT 1.34%
   CGA 0.32%   CGC 0.27%   CGG 0.19%   CGT 0.63%
   CTA 1.36%   CTC 0.57%   CTG 1.08%   CTT 1.26%
   GAA 4.56%   GAC 2.01%   GAG 1.96%   GAT 3.78%
   GCA 1.63%   GCC 1.22%   GCG 0.63%   GCT 2.02%
   GGA 1.13%   GGC 0.98%   GGG 0.62%   GGT 2.25%
   GTA 1.22%   GTC 1.13%   GTG 1.09%   GTT 2.16%
   TAA 0.10%   TAC 1.45%   TAG 0.05%   TAT 1.91%
   TCA 1.91%   TCC 1.42%   TCG 0.88%   TCT 2.35%
   TGA 0.06%   TGC 0.50%   TGG 1.05%   TGT 0.81%
   TTA 2.64%   TTC 1.83%   TTG 2.68%   TTT 2.69%

The stop codons, TAA, TAG and TGA show frequency 0 as they never appear in a coding sequence. The DNA that we have recorded in this database is just the coding part. The most and least probable codons are quite far apart, if we sort the values (and ignore the 3 stop codons) we can compute the worst ratio:

sCodons := sort(Codons):
sCodons[64] / sCodons[4];
24.1807

There is a factor of 24 between the least and most frequent codons.

     To count the frequencies of the tRNAs we can use the results from the previously computed codon frequencies, but we need to use the tRNA translation table.

tRNAtable := [
  [[GCT,GCC],[GCA,GCG],Ala],
  [[CGT,CGC,CGA],[CGG],[AGA],[AGG],Arg],
  [[AAT,AAC],Asn],
  [[GAT,GAC],Asp],
  [[TGT,TGC],Cys],
  [[CAA],[CAG],Gln],
  [[GAA],[GAG],Glu],
  [[GGT,GGC],[GGA],[GGG],Gly],
  [[CAT,CAC],His],
  [[ATT,ATC],[ATA],Ile],
  [[TTA],[TTG],[CTT,CTC],[CTA,CTG],Leu],
  [[AAA],[AAG],Lys],
  [[ATG],Met],
  [[TTT,TTC],Phe],
  [[CCT,CCC],[CCA,CCG],Pro],
  [[TCT,TCC],[TCA],[TCG],[AGT,AGC],Ser],
  [[ACT,ACC],[ACA],[ACG],Thr],
  [[TGG],Trp],
  [[TAT,TAC],Tyr],
  [[GTT,GTC],[GTA],[GTG],Val] ]:

     From these tables we would like to create various functions which convert to and from codons, amino acids and tRNAs. This is done by the function SetuptRNA.

?SetuptRNA

Function SetuptRNA - set up functions for tRNA translations

Calling Sequence:  SetuptRNA(d)

Parameters:
  Name   Type         Description                        
  -------------------------------------------------------
  d      list(list)   a list (by aa) of list of codons or
  d      string       the name of a known table of tRNA  

Returns:
	NULL
Global Variables:  CIntTotInt_list  IntTotInt_list  ntRNA  tIntToCInt_list 
           tIntToInt_list  tIntTotRNA_list
Synopsis:  This function sets up all the necessary functions to translate
  tRNAs.  These are from tInt to A, AAA, Amino, Int, CInt and Codon and from
  Int and CInt to tRNA and tInt.  Its input is either a string (which means a
  predefined name) or it is a list of 20 (one per amino acid) lists of tRNAs. 
  The format is best given by an example, see the file lib/SetuptRNA. 
  Execution of SetuptRNA causes the following functions and values to be
  defined:
      Name         Description                                         
      -----------------------------------------------------------------
      ntRNA        integer, the number of tRNA molecules used          
      -----------------------------------------------------------------
      tIntToInt    tInt (1..ntRNA) to Int (aa number, 1..20)           
      tIntToA      tInt (1..ntRNA) to A (aa one-letter code)           
      tIntToAAA    tInt (1..ntRNA) to AAA (aa 3-letter code)           
      tIntToAmino  tInt (1..ntRNA) to Amino (aa full name)             
      -----------------------------------------------------------------
      tIntToCInt   tInt (1..ntRNA) to set of CInt (codon number, 1..64)
      tIntToCodon  tInt (1..ntRNA) to set of Codon (3-letter codon)    
      -----------------------------------------------------------------
      tIntTotRNA   tInt (1..ntRNA) to tRNA (tRNA name)                 
      tRNATotInt   tRNA (tRNA name) to tInt (1..ntRNA)                 
      -----------------------------------------------------------------
      IntTotInt    Int (aa number, 1..20) to set of tInt (1..ntRNA)    
      IntTotRNA    Int (aa number, 1..20) to set of tRNA (tRNA name)   
      -----------------------------------------------------------------
      CIntTotInt   CInt (codon number, 1..64) to tInt (1..ntRNA)       
      CIntTotRNA   CInt (codon number, 1..64) to tRNA (tRNA name)      
Currently the following names are recognized as arguments for SetuptRNA:
  [Archaea, Bacteria, Eukaryota, eukaryotes, prokaryotes, YEAST, yeast]

Examples:
> SetuptRNA(yeast);

See also:  ?ComputeTPI  ?TPIDistr


   ------------
SetuptRNA( tRNAtable );

     With these tables, which will also be used later, we can compute and print the tRNA frequencies of use. We place them in an array so that we can sort them and find their largest bias.

tRNA_freq := CreateArray(1..ntRNA):
for i to ntRNA do
    for c in tIntToCInt(i) do tRNA_freq[i] := tRNA_freq[i] + Codons[c] od;
    printf( 'tRNA %2d %s,  %.2f%%  %a\n', i, tIntTotRNA(i),
	100*tRNA_freq[i]/DB[TotAA], tIntToCodon(i) )
    od;
tRNA  1 Ala1,  3.24%  {GCC,GCT}
tRNA  2 Ala2,  2.26%  {GCA,GCG}
tRNA  3 Arg1,  1.21%  {CGA,CGC,CGT}
tRNA  4 Arg2,  0.19%  {CGG}
tRNA  5 Arg3,  2.10%  {AGA}
tRNA  6 Arg4,  0.96%  {AGG}
tRNA  7 Asn1,  6.08%  {AAC,AAT}
tRNA  8 Asp1,  5.79%  {GAC,GAT}
tRNA  9 Cys1,  1.31%  {TGC,TGT}
tRNA 10 Gln1,  2.67%  {CAA}
tRNA 11 Gln2,  1.23%  {CAG}
tRNA 12 Glu1,  4.56%  {GAA}
tRNA 13 Glu2,  1.96%  {GAG}
tRNA 14 Gly1,  3.23%  {GGC,GGT}
tRNA 15 Gly2,  1.13%  {GGA}
tRNA 16 Gly3,  0.62%  {GGG}
tRNA 17 His1,  2.15%  {CAC,CAT}
tRNA 18 Ile1,  4.71%  {ATC,ATT}
tRNA 19 Ile2,  1.85%  {ATA}
tRNA 20 Leu1,  2.64%  {TTA}
tRNA 21 Leu2,  2.68%  {TTG}
tRNA 22 Leu3,  1.83%  {CTC,CTT}
tRNA 23 Leu4,  2.44%  {CTA,CTG}
tRNA 24 Lys1,  4.25%  {AAA}
tRNA 25 Lys2,  3.05%  {AAG}
tRNA 26 Met1,  2.09%  {ATG}
tRNA 27 Phe1,  4.52%  {TTC,TTT}
tRNA 28 Pro1,  2.04%  {CCC,CCT}
tRNA 29 Pro2,  2.30%  {CCA,CCG}
tRNA 30 Ser1,  3.77%  {TCC,TCT}
tRNA 31 Ser2,  1.91%  {TCA}
tRNA 32 Ser3,  0.88%  {TCG}
tRNA 33 Ser4,  2.47%  {AGC,AGT}
tRNA 34 Thr1,  3.26%  {ACC,ACT}
tRNA 35 Thr2,  1.80%  {ACA}
tRNA 36 Thr3,  0.82%  {ACG}
tRNA 37 Trp1,  1.05%  {TGG}
tRNA 38 Tyr1,  3.36%  {TAC,TAT}
tRNA 39 Val1,  3.29%  {GTC,GTT}
tRNA 40 Val2,  1.22%  {GTA}
tRNA 41 Val3,  1.09%  {GTG}

     The ratio between the most frequent and less frequent is:

max(tRNA_freq) / min(tRNA_freq);
32.2856

Autocorrelation of codons, tRNA or amino acids

     Why is autocorrelation relevant? Autocorrelation measures the reuse of a particular codon/tRNA as contrasted with its use. E.g. a codon may be very rare, that is it appears with low frequency, but it could be highly autocorrelated, i.e. it is very likely to appears in tandems. In terms of reloading tRNA molecules or simply having the tRNA available, autocorrelation of codons/tRNA will give us an indication of whether this is a relevant process for the regulation of protein synthesis.

     Besides their biased frequencies, codons, tRNA and amino acids show signs of autocorrelation. Autocorrelation is a departure of the expected frequency when we know the previous symbol. In other words, the frequencies of pairs are not the same as the product of the individual frequencies. It is of particular interest to study the autocorrelation of the codons coding for one particular amino acid or the tRNAs coding for one particular amino acid. Notice that within the codons or tRNA coding for a particular amino acid, there can be no natural selection pressures as the proteins produced are the same.

     We will now compute the codon autocorrelation for serine. Serine is coded by the following codons and tRNA:

tRNAtable[ AToInt(S) ];
[[TCT, TCC], [TCA], [TCG], [AGT, AGC], Ser]

For this autocorrelation we will take one protein (database entry) at a time. We will not correlate amino acids between different proteins (last of one with first of the next one), as we have no information about the order of the proteins within the yeast genome (there are also other good reasons for not doing this). It is simpler to do it for all amino acids in one pass, we just keep an array of 20 entries with the information of the previous codon used by each protein. Notice that we should skip any stop codon that we encounter which are coded as amino acid number 22.

Cod_Cod := CreateArray(1..64,1..64):
for e in Entries(DB) do
    tD := SearchTag(DNA,e);
    prev := CreateArray(1..20);
    for i by 3 to length(tD) do
	j := CodonToCInt( tD[i..i+2] );
	k := CIntToInt(j);
	if k = 22 then next
	elif prev[k] > 0 then
	    Cod_Cod[prev[k],j] := Cod_Cod[prev[k],j]+1 fi;
	prev[k] := j;
	od
    od:

     Now we print the portion of the matrix corresponding to serine.

for i to 64 do if CIntToA(i)='S' then
    printf( '%s   ', CIntToCodon(i) );
    for j to 64 do if CIntToA(j)='S' then
	printf( '%6d', Cod_Cod[i,j] ) fi od;
    printf( '\n' )
    fi od;
AGC     4001  5260  5487  4035  2757  6888
AGT     5282  7553  8544  6008  3883  9957
TCA     5755  8704 12409  8231  5468 13571
TCC     4171  6136  8097  6856  3874 11057
TCG     2923  3909  5396  3976  2765  6132
TCT     6476  9771 14028 11131  6260 19044

     From the counts alone it is difficult to see the autocorrelation. The different codons have different frequencies. To see the autocorrelation we will subtract the expected values of the counts and divide by the standard deviation. This is what is called a z-transform, and it has the advantage of revealing how far from expected our results are. In Np we will store the total number of pairs of codons. The probability of a pair appearing is the product of the individual probabilities, divided by the probability of having a serine (this is because the pairs are counted exclusively among the serines). After a simple cancellation we obtain the formula below.

Np := sum(sum(Cod_Cod));
for i to 64 do if CIntToA(i)='S' then
    printf( '%s   ', CIntToCodon(i) );
    for j to 64 do if CIntToA(j)='S' then
	pr := Codons[i]/DB[TotAA] * Codons[j]/Size(S);
	std := sqrt( Np*pr*(1-pr) );
	printf( '%7.2f', ( Cod_Cod[i,j] - Np*pr ) / std ) fi od;
    printf( '\n' )
    fi od;
Np := 2780865
AGC     15.33  10.69  -5.73  -5.68   0.19  -4.97
AGT     11.02  12.26  -0.23  -4.61  -1.37  -5.95
TCA     -2.24   1.50  11.62  -1.01   3.98  -1.92
TCC     -3.63  -3.00  -2.48   8.41   0.27   7.79
TCG      3.36  -0.96   2.98   1.92   7.42  -3.27
TCT     -9.80  -7.76   1.98   8.53  -1.67  15.56

We can see that there are obvious signs of autocorrelation, the diagonal values are all positive and almost always the largest for their columns/rows. There are also signs of tRNA autocorrelation. If we observe that codons AGC,AGT are translated by the same tRNA and ditto for codons TCC,TCT we see autocorrelation also at the tRNA level.

     The above analysis of autocorrelation suffers from a bias that may be induced by different base frequencies in different parts of the genome. It is known that some parts of the genome are GC rich while other parts are GC poor. Such long-streched biases induce an autocorrelation in the codons, which could be significant.

     So we would like to measure the autocorrelation in a way that is independent of the base composition. Furthermore it would be nice to be able to "add" together the autocorrelations of different amino acids. To meet these goals we have defined the TPI (tRNA Pairing Index).

tRNA Pairing Index for a given amino acid

     The TPI is an index which is computed for each protein and measures how autocorrelated (positively or negatively) are its codons. Is is completely independent of the frequencies of the amino acids, tRNA or codons, so it will not suffer from any of the common sources of bias.

     The way of measuring the autocorrelation independently of everything else is by transforming the problem into a combinatorial one. For example, suppose that we are considering an amino acid X which occurs 7 times in a protein and can be translated by two tRNAs, A and B (e.g. 3 A's and 4 B's). We will extract the tRNAs from our sequence and represent them as a sequence of 7 symbols. E.g. AABABBB.

     Highly autocorrelated cases are AAABBBB and BBBBAAA. A highly negatively autocorrelated case is BABABAB. This autocorrelation can be quantified by the number of identical pairs in the sequence or, conversely, by the number of changes (nc) as we read from left to right. Notice that for a sequence of length n, the number of identical pairs plus the number of changes is n-1. The mathematics is completely analogous for the number of pairs or number of changes. We call these breaks in the sequences changes, with the thought that if a tRNA molecule is doing the translation for one particular amino acid, when these breaks happen, this tRNA will have to be changed for another molecule. The first two examples have 1 change each, the last example has 6 changes. The TPI measures how high the actual number of pairs are or how low the nc is compared to all possible permutations of the sequence of tRNAs. In this case we have 35 cases:

sequence/ncsequence/ncsequence/ncsequence/ncsequence/nc
AAABBBB 1AABABBB 3AABBABB 3AABBBAB 3AABBBBA 2
ABAABBB 3ABABABB 5ABABBAB 5ABABBBA 4ABBAABB 3
ABBABAB 5ABBABBA 4ABBBAAB 3ABBBABA 4ABBBBAA 2
BAAABBB 2BAABABB 4BAABBAB 4BAABBBA 3BABAABB 4
BABABAB 6BABABBA 5BABBAAB 4BABBABA 5BABBBAA 3
BBAAABB 2BBAABAB 4BBAABBA 3BBABAAB 4BBABABA 5
BBABBAA 3BBBAAAB 2BBBAABA 3BBBABAA 3BBBBAAA 1

     We can compute the probability of a given nc happening. This is the number of times they happen divided by 35 (each of the permutation is equally likely to happen). The following histogram shows the distribution of the nc. From such a curve we can judge how rare or common a given nc is.

sequences per number of changes (nc)

Computation of the probability of the number of changes (nc)

     We will now write a function to compute the probability and cumulative distribution of a given nc. This is a fairly mathematical and computational problem, and some readers may want to skip to the section where the TPI is computed.

     It is easy to observe that the probability of the number of changes (nc) does not depend on what the symbols are, but rather depends on how many of each. We will write a function, called nc, that computes the probability of a having a given nc from a random permutation. This function will be based on another function, called nc_r, which does the recursive part of the computation. Nc_r assumes that we are not at the beginning of the sequence, but rather that the last symbol observed is known. To identify this known symbol (all symbols are otherwise equivalent), we will make it the first of the arguments. Our function nc_r assumes that it is called with a symbol of the first class preceeding the n1+n2 other symbols.

nc_r := proc( nc:integer, n1:integer, n2:integer )

If any of the arguments is negative or if more changes than symbols are exptected, then the answer is 0.

if n1 < 0 or n2 < 0 or nc < 0 or nc > n1+n2 then 0

if there are no more symbols (and nc must be 0) then the probability is 1.

elif n1=0 and n2=0 then 1

Now we recurse. The first symbol is either from the class of n1 (no change) or from the class of n2, in which case now the preceeding symbol is of the second class and we invert the arguments.

else n1/(n1+n2) * nc_r( nc, n1-1, n2 ) +
     n2/(n1+n2) * nc_r( nc-1, n2-1, n1 )
     fi
end:

     Testing it with the above example, computing the probability of 3 changes assuming that the previous symbol was an A:

nc_r( 3, 3, 4 ) * 35;
9.0000

There are 6 cases starting with A which have 3 changes and 3 cases starting with B with 2 changes. This coincides with the probability computed (once multiplied by 35). Before we resolve the original problem we will generalize nc_r to be able to handle any number of groups of symbols. In our application we may have up to 4 symbols (yeast has some amino acids which are translated by 4 different tRNA molecules). Also, the code as written above, is badly exponential. We could use ideas from dynamic programming to speed it up. In this case it is much simpler to use the remember function to avoid recomputing cases.

?remember

Function remember - evaluate a function and remember its result
Option: builtin

Calling Sequence:  remember(func_call)

Parameters:
  Name        Type        Description    
  ---------------------------------------
  func_call   structure   a function call

Returns:
	anything
Synopsis:  The remember function stores results of function evaluations in an
  internal table for the purpose of saving computation time.  When remember is
  called, the system checks to see if the argument function has been called
  previously with the same arguments, and if so, then the previous result is
  returned.  If it is not found, the function call is executed and its result
  stored in the internal table as well as returned to the user.  The internal
  table does not keep all the results forever, at garbage collection time
  arguments that are no longer available will cause the corresponding entries
  to be removed.  Eventually, all unused entries will be removed.  The user of
  remember should keep in mind that this is a heuristic saving of evaluations,
  it should not be counted on happening every time.
Remember is usable when the argument function does not have side effects (for
  example printing), as it will be unpredictable when these side effects will
  happen.  It should also be used on functions which do a significant amount
  of computation, else its effort is not justifiable.  The profiling tools are
  good to determine which functions will profit from remembering.
Warning: When the returned value is a structure (e.g. a matrix or a class),
  changing the structure will also change value stored in the remember-table!
  This will lead to unexpected behaviour.
In case that the user wants to erase the remember table, (for example the
  function to be remembered has changed its behavior in some way and old
  values should not be remembered), calling remember with the argument "erase"
  will erase all previously remembered values
For the example below we compute the Fibonacci numbers with their simple
  recurrence.  Without the remember function, this definition takes
  exponential time.

Examples:
> F := proc( n:integer )
if n < 2 then n else
remember(F(n-1)) + remember(F(n-2)) fi end:
> [ seq( F(i), i=0..10 )];
[0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
> F(50);
12586269025
See also:  ?hash  ?profiling  ?table


   ------------

The code has a few other optimizations which are intended to reuse previous computations as much as possible and to cut useless computation as early as possible. One of the ways of reusing the previous results as much as possible is to keep the 2nd to last sizes (3rd to last argument) in increasing order. (The order of these arguments does not matter, only the first is relevant).

nc_r := proc( ch:integer, n1:integer, n2:integer )
n := sum( args[i], i=2..nargs );

resolve as many trivial cases as possible

if min(args) < 0 or ch > n or ch > 2*(n-max(args[2..nargs])) + 1 then 0
elif max(args) = 0 then 1
elif nargs >= 3 and n2=0 then nc_r(ch,n1,args[4..nargs])

if any of the args[3..nargs] are out of order, reorder them and call recursively

else for i from 3 to nargs-1 do if args[i] > args[i+1] then
	 return( nc_r( args[1..i-1], args[i+1], args[i], args[i+2..nargs] ))
	 fi od;

compute recursively (remember/reuse previous values)

     r := n1/n * remember( nc_r( ch, n1-1, args[3..nargs] ));
     for i from 3 to nargs do
	 r := r + args[i]/n * remember( nc_r( ch-1, args[i]-1,
	     args[2..i-1], args[i+1..nargs] ))
	 od;
     r
     fi
end:
Warning: procedure nc_r reassigned

     Now we can write our main function to compute the probability of a given nc. To make coding simpler, we assume we have an extra symbol which appears 0 times and is preceeding our sequence. This works correctly unless we have the exceptional case of no changes and no symbols at all. The function becomes:

nc := proc( ch )
if {args}=0 then 1 else nc_r( ch+1, 0, args[2..nargs] ) fi end:

and we can test it with a more significant example, say 7 and 10 symbols:

DrawHistogram( [ seq( nc(i,7,10), i=1..17 ) ] );

probabilities per number of changes (nc)

     It is wise to check that the probabilities add up to 1:

sum( nc(i,7,10), i=1..17 ) - 1;
-1.1102e-16

the difference to 1 is well within round-off errors.

     Due to the importance of this function in our research and its complexity for large arguments, it has been implemented in the kernel to make it faster and more compact. The kernel function is called TPIDistr and computes the entire distribution of the nc for a given set of number of symbols. For example:

DrawHistogram( TPIDistr(5,10,20,30) );

probabilities per number of changes (nc)

     Finally, to estimate how rare a given nc is, we need to compute its cumulative distribution. Since the distribution is over the integers, we will take the cumulative distribution which adds one half of the probability at the point. (The second to last arguments of this function are the numbers of each symbol.)

nc_cum := proc( ch:integer )
d := TPIDistr(args[2..nargs]);
if ch < 0 then 0
elif ch >= length(d) then 1
else d[ch+1] / 2 + sum( d[1..ch] ) fi
end:
nc_cum(3,7,10);
0.00365076

Computing the TPI for all the amino acids in a given entry

     The previous computation was designed to find how common/rare is a particular number of changes for a single amino acid. When we study a database entry, we would like to take all the amino acids and somehow summarize all the pairs/changes information into a single index. To do this we will consider that an nc is the unit to measure and we will add the nc's of all the amino acids. The distribution of the sum of all the nc's is the convolution of the individual distributions for each amino acid.

     The previous computation was designed to find how common/rare is a particular number of changes for a single amino acid. When we study a database entry, we would like to take all the amino acids and somehow summarize all the pairs/changes information into a single index. To do this we will consider that an nc is the unit to measure and we will add the nc's of all the amino acids. The distribution of the sum of all the nc's is the convolution of the individual distributions for each amino acid.

     Darwin has a function which does this computation for a given Entry called ComputeTPI. The Entry has to contain DNA information for this to be possible. ComputeTPI computes the number of pairs/changes for each amino acid and the total number. Then computes the distribution of the pairs/changes for each amino acid and convolves the distributions. Finally it computes the cumulative probability. Two indices, mathematically equivalent, are returned from this cumulative probability. The first one is just a normal N(0,1) deviate with the same cumulative probability. The second is just the cumulative distribution linearly spread over the interval -1 ... 1. We test the function against an arbitrary Entry:

ComputeTPI( Entry(9) );
[1.2222, 0.7783]

     Darwin has a function which does this computation for a given Entry called ComputeTPI. The Entry has to contain DNA information for this to be possible. ComputeTPI computes the number of pairs/changes for each amino acid and the total number. Then computes the distribution of the pairs/changes for each amino acid and convolves the distributions. Finally it computes the cumulative probability. Two indices, mathematically equivalent, are returned from this cumulative probability. The first one is just a normal N(0,1) deviate with the same cumulative probability. The second is just the cumulative distribution linearly spread over the interval -1 ... 1. We test the function against an arbitrary Entry:

ComputeTPI( Entry(9) );
[1.2222, 0.7783]

     We now compute the univariate and covariate statistics of the TPI indices for a few random entries.

cov := Covariance( 'TPI index', [normal,'-1...1'] ):
to 100 do cov + ComputeTPI( Rand(Entry) ) od:
print( cov );
Covariance analysis for TPI index, 100 data points
         normal  -1...1
Means    0.1284  0.0572
Covariance matrix
normal   0.9110
-1...1   0.5252  0.3137

and we see that the TPI indices are correlated and their means are positively biased. (This is also true for the entire yeast genome). This means that there is a positive autocorrelation of tRNA completely independent of their frequencies.

© 2015 by Gaston Gonnet, Informatik, ETH Zurich

Please cite as:

@techreport{ Gonnet-TPI,
        author = {Gaston H. Gonnet},
        title = {The tRNA Pairing Index},
        month = { April },
        year = {2003},
        number = { 399 },
        howpublished = { Electronic publication },
        copyright = {code under the GNU General Public License},
        institution = {Informatik, ETH, Zurich},
        URL = { http://www.biorecipes.com/TPI/code.html }
        }

Index of bio-recipes

Last updated on Thu Dec 10 09:54:20 2015 by GhG

!!! Dieses Dokument stammt aus dem ETH Web-Archiv und wird nicht mehr gepflegt !!!
!!! This document is stored in the ETH Web archive and is no longer maintained !!!