Virus Classification using k-nucleotide Frequencies

by Markus Friberg

     This bio-recipe shows how to classify viruses using their k-nucleotide (di-, tri- and tetra-nucleotide) frequencies. It also shows how we can build a phylogenetic tree of the viruses. The content is based on the research by Kurt Tobler & Mathias Ackermann, Institute of Virology, University of Zurich.

Virus database

     We have put together a virus genome database. Our goal is to build a phylogenetic tree of these viruses. First, let us load the database.

DNA file(/home/darwin/v2/source/bio-recipes/VirusClassification/viruses.db(
702100), 49 entries, 695010 bases)

Frequencies of nucleotides and dinucleotides

     We want to classify the viruses based on their k-nucleotide frequencies. Research has shown that dinucleotide frequencies are similar for related organisms and different for unrelated organisms. Furthermore, the frequencies are relatively constant over the whole genome, in both coding and non-coding regions of DNA. Hence, the vector of k-nucleotide frequencies is a signature of the genome.

     The standard method of phylogenetic analysis is to construct trees from multiple sequence alignments (MSA). However, that approach suffers from the disadvantage that we have to limit the analysis to parts of the DNA which can be reliably aligned, for example coding regions. Naturally, we have to know the coding regions of the DNA, and we have to make an arbitrary choice of which regions to compare. Different regions can lead to different trees. Furthermore, finding a multiple sequence alignment for long regions of many organisms is computationally very intensive. However, the main problem by this method is the quality of the MSA. In the case of dinucleotide analysis, these problems do not exist.

     To show the method, we start by looking at the nucleotide and dinucleotide frequencies of one single virus, namely SARS. First, we create a function that calculates the k-nucleotide frequencies of a DNA sequence s.

CalcFreq := proc(s: string, k: posint);
    freq := CreateArray(1..4^k);
    n := length(s)-k+1;
    for i to n do

     We encode a k-nucleotide as a number base 4. For example, AAA -> 0 and TTTT -> 256.

        knuc := 0;
        for j to k do
            knuc := 4*knuc + NToInt(s[i+j-1]) - 1; od;
        knuc := knuc + 1;
        freq[knuc] := freq[knuc] + 1;
    freq / n;

     Next, we locate the SARS genome in the database.

s := Sequence(SearchID('SARS'));

     We calculate the nucleotide frequencies and print them.

NF := CalcFreq(s, 1):
print('Nucleotide frequencies:');
for i to 4 do
    printf('%s    %f\n', IntToN(i), NF[i]); od;
printf('The nucleotide frequencies add to %f\n', sum(NF));
Nucleotide frequencies:
A    0.285066
C    0.199657
G    0.207959
T    0.307317
The nucleotide frequencies add to 1.000000

     Finally, we calculate the dinucleotide frequencies and show them in a histogram.

diNF := CalcFreq(s, 2):
labels := CreateArray(1..16):
for i to 4 do
    for j to 4 do
        a := 4*(i-1) + j;
        labels[a] := IntToN(i) . IntToN(j); od; od;
DrawHistogram(diNF, labels, Title='Diucleotide frequencies');

Dinucleotide frequencies.

k-nucleotide bias

     The observed frequency of a dinucleotide XY is not the same as the expected frequency of XY, which we can express as fXfY. As an example, we compare the observed and expected frequency of the dinucleotide AA.

NF[1] * NF[1];

      We will use the bias between observed and expected frequencies to classify the viruses. This bias can be calculated in various ways: as an odds ratio, as the logarithm of the odds ratio, as the deviation from a normalized Poisson distribution etc. Here we choose odds ratio, since it has been experimentally shown to be a good measure for phylogeny. The bias is calculated as b = fobs/fexp, where fobs=fXY and fexp=fXfY.

OddsRatio := proc(Fobs: list, Fexp: list);
    n := length(Fobs);
    bias := CreateArray(1..n);
    for i to n do
        bias[i] := Fobs[i] / Fexp[i]; od;

CalcBias := proc(Fobs: list, NF: list, method: string);
    n := length(Fobs);
    k := ln(n) / ln(4);
    Fexp := CreateArray(1..4^k, 1);
    for i to n do
        j := i-1;
        to k do
            Fexp[i] := Fexp[i] * NF[mod(j,4)+1];  j := floor(j/4) od; od;
    BiasFunction := symbol(method);
    if nargs >= 4 then
        BiasFunction(Fobs, Fexp, args[4..nargs]);
        BiasFunction(Fobs, Fexp);

      As an example, let us calculate this bias for the dinucleotides of SARS.

diNucBias := CalcBias(diNF, NF, 'OddsRatio'):
print('Dinucleotide bias:');
for i to 4 do
    for j to 4 do
        a := 4*(i-1) + j;
        printf('%s%s    %f\n', IntToN(i), IntToN(j), diNucBias[a]): od; od;
Dinucleotide bias:
AA    1.039476
AC    1.168768
AG    0.993398
AT    0.857931
CA    1.302831
CC    0.822144
CG    0.459831
CT    1.200282
GA    0.952573
GC    1.160101
GG    0.935021
GT    0.984059
TA    0.798459
TC    0.850771
TG    1.401140
TT    1.012561

      We need to calculate this bias for all the viruses in the database. The result will be stored in the matrix diNucBias, which will be used soon.

n := DB['TotEntries']:
diNucBias := CreateArray(1..n, 1..4^2):
for i to n do
    s := Sequence(Entry(i)):
    NF := CalcFreq(s, 1):
    diNF := CalcFreq(s, 2):
    diNucBias[i] := CalcBias(diNF, NF, 'OddsRatio'):

Calculating distances

     In order to create a phylogenetic tree, we need to know the evolutionary distance between the organisms in our population. This distance is defined as the euclidean distance of the k-nucleotide bias of two organisms a and b. For example, the dinucleotide distance between the first and the second organism in our database is:


     For the phylogenetic tree, we need the distances and variances of all pairs of organisms.

CalcDistances := proc(v: list)
    n := length(v);
    dist := CreateArray(1..n, 1..n);
    for i to n-1 do
        for j from i+1 to n do
            dist[i, j] := dist[j, i] := sqrt((v[i]-v[j])^2); od; od;

CalcVariances := proc(v: list(list));
    n := length(v);
    vmean := sum(Flatten(v)) / (n*(n-1));

dist := CalcDistances(diNucBias):

      We also need labels for the tree nodes.

GetLabels := proc(ID: string);
    n := DB['TotEntries'];
    labels := CreateArray(1..n);
    for i to n do
        labels[i] := SearchTag(ID, Entry(i)); od;

labels := GetLabels('ID'):

Phylogenetic tree

     When generating the phylogenetic tree, we will use the built-in function MinSquareTree. It will start with a random tree and adapt it to our data. In order to improve the result, we run MinSquareTree several times, each one with a new random tree, and finally choose the best tree generated.

CreateTree := proc(dist, labels)
    global MinLen, MST_Qual;
    bestq := DBL_MAX;
    oldpr := printlevel;    printlevel := 0;
    MinLen := 0;
    vari := CalcVariances(dist);
    to 10 do     
        t := MinSquareTree(dist, vari, labels, Random):
        if MST_Qual < bestq then
            bestq := MST_Qual;    
            bestt := t; 
    printlevel := oldpr;
    MST_Qual := bestq;

t := CreateTree(dist, labels):
#DrawUnrootedTree(t, 'Phylogenetic tree of viruses. Built from dinucleotide data. ' . sprintf('    MST_Qual=%f', MST_Qual));
cls := []:
for i to DB[TotEntries] do cls := append(cls, SearchTag(LA,Entry(i))[1] ) od:
DrawTree( t, Unrooted, OrderLeaves=cls, Clusters=cls, LengthFormat='' );
DrawColorShape expects a 1st argument, t:Leaf, found: [Leaf(IBV,-0.3887,22), 
85.5405, 64.8237, 72.3895, 72.0127]
Error, (in LeafLab) invalid arguments
Phylogenetic tree of viruses. dinuc.

     As we can see from the picture, the classification works quite well. The Corona, Flavi, Retro, Toga and Falvi viruses are all clustered together in the tree. All except one of the Mononega viruses are also together. The Picorna viruses are quite spread out, possibly because they have relatively inhomogeneous genomes. Here is the list of the viruses and their descriptions:

print('Label  Short     Description');
for e in Entries() do
    desc := SearchTag('DE', e):
    desc := If(length(desc)<=60, desc, desc[1..60]):
    printf(' %-4s  %-9s %s\n', SearchTag('LA', e), SearchTag('ID', e), desc);
Label  Short     Description
 c4    BCoV      Bovine coronavirus isolate BCoV-ENT, complete genome
 m9    BDV       BDU04608 Borna disease virus V, complete genome
 x1    BVDV      BVDCG Bovine viral diarrhea virus complete genome
 m1    CDV       AF164967 Canine distemper virus strain A75/17, complete geno
 x2    CSFV      HCU45477 Hog cholera virus strain Riems, complete genome
 n1    EAV       TOEAV Equine arteritis virus (EAV) RNA genome
 m5    EBOV      Zaire Ebola virus, complete genome
 p8    EMCV      EVCGAA Encephalomyocarditis virus complete RNA genome
 r2    FFV       FFVAJ3851 feline foamy virus RNA genome
 r4    FIV       FIVCG Feline immunodeficiency virus complete genome
 p6    FMDV      PIFMDV2 Foot and mouth disease virus (FMDV-O1K) RNA for poly
 . . . . (26 output lines skipped) . . . .
 m7    SeV       SNDT5RCG Sendai virus strain mutant T-5 revertant, complete 
 t4    SFV       ALSFV42S Semliki forest virus 42S RNA genome
 r6    SiFV      SFU04327 Simian foamy virus SFVcpz, complete genome
 t2    SINV      SINCG Sindbis virus (hrsp and wild-type strains) complete ge
 m11   SV-41     SV41RNA Simian virus 41 genes NP, P, M, F, HN and L
 f4    TBEV      Tick-borne encephalitis virus, complete genome
 c5    TGEV      TGA271965 Transmissible gastroenteritis virus complete genom
 p2    ThV       TMEGDVCG Theiler murine encephalomyelitis, complete genome
 t3    VEEV      EEVCOMGEN Venezuelan equine encephalitis virus, complete gen
 m3    VSV       VSVCG Vesicular stomatitis Indiana virus, complete genome
 f3    WNV       WNFCG West Nile virus RNA, complete genome
 f2    YEV       Yellow fever virus, complete genome

Trinucleotides and tetranucleotides

      We will try to improve the tree by studying trinucleotides and tetranucleotides. They are calculated in the same way as dinucleotides.

triNucBias := CreateArray(1..n, 1..4^3):
tetraNucBias := CreateArray(1..n, 1..4^4):
for i to n do
    s := Sequence(Entry(i)):
    NF := CalcFreq(s, 1):
    triNF := CalcFreq(s, 3):
    tetraNF := CalcFreq(s, 4):
    triNucBias[i] := CalcBias(triNF, NF, 'OddsRatio'):
    tetraNucBias[i] := CalcBias(tetraNF, NF, 'OddsRatio'):

     We build a tree from the trinucleotide data.

t := CreateTree(CalcDistances(triNucBias), labels):
#DrawUnrootedTree(t, 'Phylogenetic tree of viruses. Built from trinucleotide data' . sprintf('    MST_Qual=%f', MST_Qual));
DrawTree( t, Unrooted, OrderLeaves=cls, Clusters=cls, LengthFormat='' );
DrawColorShape expects a 1st argument, t:Leaf, found: [Leaf(IBV,-1.2247,22), 
88.7950, 73.0908, 70.9763, 79.7575]
Error, (in LeafLab) invalid arguments
Phylogenetic tree of viruses. trinuc.

     The tree from the tetranucleotide data looks like this.

t := CreateTree(CalcDistances(tetraNucBias), labels):
#DrawUnrootedTree(t, 'Phylogenetic tree of viruses. Built from tetranucleotide data' . sprintf('    MST_Qual=%f', MST_Qual));
DrawTree( t, Unrooted, OrderLeaves=cls, Clusters=cls, LengthFormat='' );
DrawColorShape expects a 1st argument, t:Leaf, found: [Leaf(MHV-A59,-3.3318,26),
95.2419, 44.7201, 80.4434, 48.2343]
Error, (in LeafLab) invalid arguments
Phylogenetic tree of viruses. tetranuc.

Information vs error

     It is clear that increasing the dimension k of the k-nucleotides increases the selectivity of the pattern, and hence the distances should be more accurate. So in principle, the higher the k value, the more information we capture and the more reliable the results should be. However, this is counterbalanced by the background noise of the distances, which is proportional to O(4^k). I.e. in every component of the vector there is some expected error. In the case of the Poisson model, this error has variance 1. As we compute the distance between two vectors, these errors add together, and because of the exponential size of the vectors, the error becomes very significant for k>3. The additional noise, which acts as an added value to each distance, is reflected in the trees themselves which become more corona-shaped (which can be seen in the images above). Furthermore, the additional noise, which makes all the distances larger by an additive constant, makes the tree fitting by least squares numerically better. This is quite disconcerting, as a better fit means worse trees. We choose the trinucleotide results as the best compromise between information and noise.

Classifying the SARS virus

     In the trees that we have produced so far, the coronaviruses are clearly clustered together on their own branch. SARS is among the coronaviruses in each tree, so we draw the conclusion that SARS is a corona virus. However, the internal topology of the coronaviruses differs from tree to tree. A possible reason for this is that the minimum square tree is calculated based on the all-against-all distances, and a global optimum is searched, even if this is not optimal for the local virus groups. As a result, the relative positions of the coronaviruses may be affected by the distances of the other virus groups. By studying the coronaviruses alone, this problem is avoided.

     We will do the same analysis as above for the coronaviruses only. This is easy now that we already have the functions.

n := DB['TotEntries']:
diNucBias := CreateArray(1..n, 1..4^2):
triNucBias := CreateArray(1..n, 1..4^3):
tetraNucBias := CreateArray(1..n, 1..4^4):
for i to n do
    s := Sequence(Entry(i)):
    NF := CalcFreq(s, 1):
    diNF := CalcFreq(s, 2):
    triNF := CalcFreq(s, 3):
    tetraNF := CalcFreq(s, 4):
    diNucBias[i] := CalcBias(diNF, NF, 'OddsRatio'):
    triNucBias[i] := CalcBias(triNF, NF, 'OddsRatio'):
    tetraNucBias[i] := CalcBias(tetraNF, NF, 'OddsRatio'):

labels := GetLabels('ID'):

diNucDist := CalcDistances(diNucBias):
t := CreateTree(diNucDist, labels):
#DrawUnrootedTree(t, 'Coronaviruses. Built from dinucleotide data' . sprintf('   MST_Qual=%f', MST_Qual));
DrawTree( t, Unrooted, LengthFormat='' );
DNA file(/home/darwin/v2/source/bio-recipes/VirusClassification/coronaviruses.db
(204872), 7 entries, 203680 bases)
Phylogenetic tree of coronaviruses. dinuc.
triNucDist := CalcDistances(triNucBias):
t := CreateTree(triNucDist, labels):
#DrawUnrootedTree(t, 'Built from trinucleotide data' . sprintf('   MST_Qual=%f', MST_Qual));
DrawTree( t, Unrooted, LengthFormat='' );

Phylogenetic tree of coronaviruses. trinuc.
tetraNucDist := CalcDistances(tetraNucBias):
t := CreateTree(tetraNucDist, labels):
#DrawUnrootedTree(t, 'Coronaviruses. Built from tetranucleotide data' . sprintf('   MST_Qual=%f', MST_Qual));
DrawTree( t, Unrooted, LengthFormat='' );

Phylogenetic tree of coronaviruses. tetranuc.

     The trees from di-, tri- and tetranucleotides all have the same topology. This is a good indication that our phylogenetic trees are reliable. The three neighboring coronaviruses of SARS (HCoV-229E, PEDV and TGEV) are known to belong to group I of coronaviruses. From these trees, we draw the conclusion that SARS also belongs to this group. SARS is in our tree closest to TGEV, so we hypothesize a close relationship to the TGEV virus.

Poisson distribution

     Instead of odds ratios, we can use the Poisson distribution as a bias measure.

PoissonDistr := proc(Fobs: list, Fexp: list, N: posint);
    n := length(Fobs);
    bias := CreateArray(1..n);
    for i to n do
        bias[i] := (Fobs[i] - Fexp[i]) / (N * Fexp[i] * (1-Fexp[i])); od;

triNucBias := CreateArray(1..n, 1..4^3):
for i to n do
    s := Sequence(Entry(i)):
    NF := CalcFreq(s, 1):
    triNF := CalcFreq(s, 3):
    triNucBias[i] := CalcBias(triNF, NF, 'PoissonDistr', length(s)):
t := CreateTree(CalcDistances(triNucBias), labels):
#DrawUnrootedTree(t, 'Tree from trinucleotide data using Poisson distr.' . sprintf('    MST_Qual=%f', MST_Qual));
DrawTree( t, Unrooted, LengthFormat='' );

trinuc poisson

     The topology of this tree is identical to the ones produced by the odds ratio method. SARS is placed among group I of coronaviruses. The fact that two different methods generate the same topology makes our results more reliable.

Pairwise alignments

     To further test our conclusion, we create a phylogenetic tree using optimal pairwise alignments. The pairwise alignments are based on an RNA mutation matrix derived from the 50 viruses.


     Since these alignments take a long time to compute, we will try to load them from a file. If the file is not found, the alignments will be computed and saved.

if not type(AllAll,matrix({0,Alignment})) then
    AllAll := CreateArray(1..n, 1..n):
    for i to n do
        for j from i+1 to n do
            AllAll[i,j] := AllAll[j,i] := Align(Entry(i),Entry(j), DMS);    lprint(i,j); od od;
    printf('AllAll := %A:\n', AllAll);

     Now we are ready to create the tree based on the pairwise alignments.

t := PhylogeneticTree( [ seq(Sequence(Entry(i)), i=1..n)],
    [ seq(SearchTag(ID,Entry(i)), i=1..n ) ],
    Distance, AllAll):

DrawUnrootedTree(t, 'Corona viruses, distance tree');
0.9904),-6.7172,Leaf(TGEV,-36.0311,7),1),1)),Corona viruses, distance tree)
Corona viruses, distance tree

     This method does not confirm our results, since SARS is here placed on a branch of its own, not among the group I of coronaviruses. Hence we have two methods (k-nucleotides and pairwise alignments) which give different results. We will investigate this in more detail.

Pairwise alignments vs k-nucleotides

     It is interesting to see which virus is most related to SARS. We compare the distances from the pairwise alignments to the distances from the k-nucleotides. For the PAM distances we can compute their variances, so we have an estimate of the confidence. We show 95% confidence intervals. For the euclidean distances of the k-nucleotideswe do not have an estimate of the variance. The distances have been scaled so that the minimum distances coincide.

      First, we prepare a list of PAM distances and variances between SARS and each of the other coronaviruses. We are not interested in the distance between SARS and itself, so we treat that case specially.

PAMdist := CreateArray(1..n-1):
PAMvari := CreateArray(1..n-1):
i := 0:
for z in AllAll[6] do
    if z = 0 then next fi:
    i := i+1:
    PAMdist[i] := z[PamDistance]:
    PAMvari[i] := z[PamVariance]:

      The k-nucleotide distances are normalized. Also here we need to exclude the distance between SARS and itself.

NormalizeDistances := proc(dist, m);
    d := dist[6];
    d := [op(d[1..5]), d[7]];
    k := m / min(d);

m := min(PAMdist):
diNucDist := NormalizeDistances(diNucDist, m):
triNucDist := NormalizeDistances(triNucDist, m):
tetraNucDist := NormalizeDistances(tetraNucDist, m):

      Finally, we print the results.

printf('             PAM distance     dinucl. dist.    trinucl. dist.   tetranucl. dist\n');
j := 0:
for i to 7 do
    if i = 6 then next fi;
    j := j + 1:
    printf('%9s    %.1f +- %.2f  %12.1f  %15.1f  %15.1f\n', SearchTag('ID', Entry(i)), PAMdist[j], 1.96*sqrt(PAMvari[j]), diNucDist[j], triNucDist[j], tetraNucDist[j]); od;
             PAM distance     dinucl. dist.    trinucl. dist.   tetranucl. dist
     BCoV    68.0 +- 1.93         249.9            125.5            104.7
HCoV-229E    70.0 +- 2.16         161.0             86.0             77.3
      IBV    77.0 +- 2.12         169.7             97.1             88.9
  MHV-A59    65.0 +- 1.98         294.9            138.8            110.4
     PEDV    71.0 +- 2.17         126.6             81.6             77.1
     TGEV    70.0 +- 2.15          65.0             65.0             65.0

     According to the pairwise alignments method, SARS is closest to MHV. However, note that the difference in distance between SARS and the other viruses is quite small, especially when considering the confidence interval. In the case of k-nucleotides, the differences are much clearer, which could indicate that the clustering is more reliable. Furthermore, the distances coincide for di-, tri- and tetranucleotides.

Inspection of a pairwise alignment

     In order to check the reliability of the pairwise alignment method, we print the alignment between SARS and the closest neighbor: MHV.

SARSlength := length(Sequence(Entry(6))):
SARSaligned := AllAll[6][4][Length2]:
printf('\n%d nucleotides out of %d (%.1f%%) in the SARS genome were used in the local pairwise alignment.\n', SARSaligned, SARSlength, 100*SARSaligned/SARSlength);
lengths=14199,14181 simil=13023.9, PAM_dist=65, identity=56.4%
ID=MHV-A59   DE=Mouse hepatitis virus strain MHV-A59 C12 mutant, complete genome

ID=SARS   DE=SARS coronavirus TOR2, complete genome


 . . . . (701 output lines skipped) . . . .



14181 nucleotides out of 29751 (47.7%) in the SARS genome were used in the local pairwise alignment.

     The number of identical nucleotides is quite low, and less than half of the nucleotides were used in the alignment.


     Our result that SARS belongs to group I of coronaviruses contradicts the result presented in (Rota, 2003 and Merra, 2003), where it is concluded that SARS does not belong to any of the three known coronavirus groups, but instead forms a new group. Their analysis was based on MSAs of 4-5 proteins that are common for the coronaviruses. There is strong evidence for both methods, and we do not claim that ours has to be the correct one. The purpose of this bio-recipe is rather to show that there are two scientifically sound methods that give different results.

© 2009 by Markus Friberg, Informatik, ETH Zurich

Please cite as:

@techreport{ Friberg-virusclassif
        author = {Markus Friberg},
        title = {Virus Classification using <i>k</i>-nucleotide Frequencies},
        month = { May },
        year = {2003},
        number = { 401 },
        howpublished = { Electronic publication },
        copyright = {code under the GNU General Public License},
        institution = {Informatik, ETH, Zurich},
        URL = { }

Index of bio-recipes

Last updated on Thu Mar 12 12:20:47 2009 by MF