Phylogenetic tree of the "Nigerian Prince" email scam


     Most people have received messages which at some point or other state "I would like to deposit 20,000,000.00 dollars in your account....". This is a common email fraud attempt which has been named the "Nigerian Prince" scam since the earliest versions of it were sent by a supposedly such member of royalty. The United States Postal Service has issued a warning on the subject and a google search will show hundreds of related hits.

     The messages, although very different, have a similar pattern. In this bio-recipe we will try to construct the phylogenetic tree of these messages and hence reconstruct how it is likely that people have copied and modified the original message(s) and created new versions. If the authors had copied from a single source and then modified the messages to their own taste, this procedure would recreate the tree quite well. However it is possible that messages have been merged or copied from several sources, obscuring the process. The latter would be equivalent to lateral transfer of genetic information. It is also possible that similar messages have been reinvented from scratch. In any case, the whole procedure is an interesting example that may serve as a pattern for dealing with text, for constructing phylogenetic trees from diverse sources, separating the members in clusters and on some special forms of alignment. The resulting tree is also interesting in itself. Applications outside molecular biology are also possible, e.g. the construction of a phylogenetic tree of some piece of software that underwent development at different places/times, would follow exactly the same steps.

Reading the data: email messages

     The raw email messages were collected for some time and kept in a standard plain text format. We read this file as is, and do all the processing in Darwin.

all := ReadRawFile( '/home/darwin/v2/source/bio-recipes/NigerianPrince/mbox' ):
length(all);
1095228

     To store and process the messages we will define a class called Message. If it receives a single argument, it assumes that it is a raw message and proceeds to parse it. It is always convenient to write a method that will print the instances of the class, and finally we run CompleteClass which will generate some standard methods automatically.

Message := proc( id:string, shortid:string, subject:string,
	header:string, body:string )

if nargs=1 then
     if id[1..5] <> 'From ' then
	  error(id,'message does not start with From') fi;
     i0 := SearchString( '\n', id );
     if i0 < 0 then error(id,'does not have a single line') fi;
     id1 := id[1..i0];

     i1 := SearchString( '\n\n', id );
     if i1 < 0 then error(id,'message does not appear to have a body') fi;
     header1 := id[1..i1];

     i2 := CaseSearchString( '\nSubject: ', header1 );
     if i2 < 0 then subject1 := ''
     else i3 := SearchString( '\n', i2+2+header1 );
	  subject1 := header1[ i2+11 .. i2+i3+2 ];
	  while length(subject1) >= 4 and
	        uppercase(subject1[1..4]) = 'RE: ' do
	      subject1 := 4+subject1 od
	  fi;

     body1 := id[i1+3..-1];
     Message( id1, ShortId(id1), subject1, header1,
	ConvertMessage(body1) )
     
elif nargs=5 then noeval(procname(args))
else error(args,'invalid message format') fi
end:

Message_print := proc( m:Message )
printf( '\n%s  (%s)\nSubject: %s\n\nbody: %s\n',
	m[id], m[shortid], m[subject], m[body] )
end:
CompleteClass(Message):

Converting the text to a comparable form

     The following function converts a general text into a form suitable for comparison and alignment. The messages are written over an alphabet of 26 characters, in upper and lower case and numbers and special symbols are also used. The mapping of lowercase characters to uppercase characters, or the grouping of several symbols in the same class (e.g. punctuation) can be done with a mapping function included in the Dayhoff matrices.

     ConvertMessage does this conversion, character by character. We search each character of the message in a string which contains all letters and digits (case-independent search). This is a simple and efficient way to find out if our character is a letter or a digit. The boolean variable spacing is used to keep only one spacing character per group. We will keep the first one, the rest will be just ignored. Finally, the result is accumulated in r , which may be shorter than the original string. One final exception is that in no case underscore characters are allowed in sequence alignment. The reason for this is that underscores indicate gaps, so if they would be allowed as sequence characters it would not be possible to determine where deletions happen. We transform all underscores to dashes.

ConvertMessage := proc( s:string )
ls := length(s);
r := CreateString(ls);
lr := 0;
spacing := true;
for i to ls do
    k := SearchString(s[i],'abcdefghijklmnopqrstuvwxyz0123456789');
    if k >= 0 then
         lr := lr+1;  r[lr] := s[i];  spacing := false
    elif not spacing then
         lr := lr+1;
         r[lr] := If( s[i]='_', '-', s[i] );
         spacing := true
    fi
od;
r[1..lr]
end:

     At this point it is convenient to define the function that will map the characters to positions in the scoring matrix. In our terminology this is a Mapping function. Numbers are rare in these messages, and their actual value does not have a very relevant meaning, so we will group all digits in a single symbol. The dimension of the scoring matrix will be 28. This is composed of 26 letters, 1 symbol for any of the 10 digits and 1 symbol for spacing and punctuation. Using the convenience of SearchString this is written as:

Dim := 28:
map := proc( ch:string )
if length(ch) <> 1 then error('invalid argument') fi;
k := SearchString(ch,'abcdefghijklmnopqrstuvwxyz');
if k >= 0 then k+1
elif SearchString(ch,'0123456789') >= 0 then 27
else 28 fi
end:

     The function ShortId is an auxiliary function to select the first part of the sending email address. This is typically the userid, and it is intended as a quick identification of the email message. It searches for an "@" or a blank as a separator and return whatever comes after the "From" until the separator.

ShortId := proc( s:string )
if s[1..5] = 'From ' then procname( 5+s )
else i := SearchString( '@', s );
     if i > 0 then return( s[1..i] ) fi;
     i := SearchString( ' ', s );
     if i > 0 then return( s[1..i] ) fi;
     s fi
end:

     Now we are ready to process all the messages in the string "all". Each message starts with a new line with the word "From" followed by a space, which is the standard email convention. We place all the messages in a list.

msgs := []:
do  i := CaseSearchString( '\nFrom ', all );
    if i < 0 then
	 if length(all) > 10 and all[1..5] = 'From ' then
	      msgs := append( msgs, Message(all) ) fi;
	 break
    else msgs := append( msgs, Message(all[1..i+1]) );
	 all := i+1+all
	 fi
    od:

     For curiosity and checking, we can print one message.

print(msgs[2]);
From petersigho@management.com Fri, 3 Nov 2000 21:27:40  (petersigho)
Subject: INVESTMENT PROPOSAL

body: ECO INTERNATIONAL BANK HEADQUARTERS
PLOT 84 AJOSE ADEOGUN STREET
VICTORIA ISLAND LAGOS
ATTN:PRESIDENT/CHAIRMAN
STRICTLY A PRIVATE BUSINESS PROPOSAL
I am DR PETER IGHOa manager in the Bills and Exchange at the Foreign
Remittance Department of the ECO INTERNATIONAL BANK I am writing this letter
to ask for your support and cooperation to carry out this business
opportunity in my department.We discovered an abandoned sum of$15,000,000.00 Fifteen million United
States Dollars only)in an account that belongs to one of our foreign
customers who died along with his entire family of a wife and two children
in November 1997 in a Plane crash.Since we heard of his death,we have been
expecting his next-of-kin to come over and put claims for his money as the
heir,because we cannot release the fund from his account unless someone
applies for claim as the next-of-kin to the deceased as indicated in our
banking guidelines.Unfortunately,neither their family member nor distant
relative has everappeared to claim the said fund.Upon this discovery,I and
other officials in my department have agreed to make business with you and
release the total amount into your account as the heir of the fund since no
one came for it or discovered he maintained account with our bank,otherwise
the fund will be returned to the banks treasury as unclaimed fund.We have
agreed that our ratio of sharing will be as stated thus;20 for you as
foreign partner,75 for us the officials in my department and 5 for the
settlement of all local and foreign expences incurred by us and you during
the course of this business.Upon the successful completion of this transfer,I and one of my colleagues will come to your country and mind our share.It
is from our 75 we intend to import Agricultural Machineries into my
country as a way of recycling the fund.To commence this transaction,we
require you to immediately indicate your interest by a return e-mail and enclose your private contact telephone number,fax number full name
and address and your designated bank coordinates to enable us file letter of
claim to the appropriate departments for necessary approvals before the
transfer can be made.Note also,this transaction must be kept STRICTLY CONFIDENTIAL because of
its nature.I look forward to receiving your prompt response.FAX
NUMBER234-1-7592911
Best Regarde
DR PETER IGHO

Eliminating identical duplicates

     Pairs of messages which are identical serve no purpose whatsoever. They will be part of the same subtree, they contribute no useful information about mutations (there are none) but cost a lot of additional computation. Hence it is desirable to eliminate the exact repetitions. The following loop compares every pair of messages and if they are identical removes the second one from the list. Since the list may shorten during the process, the loops are controlled by a while condition.

for i while i <= length(msgs) do
    for j from i+1 while j <= length(msgs) do
	if msgs[i,body] = msgs[j,body] then
	     printf( 'msg %s is a duplicate of\n    %s; removed\n',
		 msgs[j,id], msgs[i,id] );
	     msgs := [ op(msgs[1..j-1]), op(msgs[j+1..-1]) ];
	     fi
    od od;
msg From peterjohn2004@go.com Wed Feb 26 12:10 MET 2003 is a duplicate of
    From peterjohn@ny.com Wed Feb 26 10:43 MET 2003; removed
msg From benwills9@netzero.com Sat Apr  5 12:38 MET 2003 is a duplicate of
    From benwills11@netzero.com Sat Apr  5 12:28 MET 2003; removed

     As expected the exact duplicates come from (nearly) the same sender and at about the same time. This is expected from spam.

Building the Dayhoff matrices for this particular data

     Running the all-against-all of 348 messages requires too much time for a normal interactive session (60378 long alignments). We will set a flag, named PartialRun , under which we can run just a few entries to show the process while the complete results are read from a file. Conversely, if PartialRun is not set, the complete computation will be run, and the files of results will be stored.

PartialRun := true;
PartialRun := true

     The Dayhoff matrices used for protein alignments are completely useless in this case. The symbols do not correspond to amino acids, and their mutation probabilities will certainly differ. This is a situation that may be encountered quite often in biology too when we are faced with sequences which are not average proteins. So the following analysis is also very relevant for biology, in particular it is the standard procedure to derive Dayhoff matrices from new data.

     The procedure for finding the similarity/scoring matrices is iterative. It repeats the following steps:

(1)(first time only) Select an initial Dayhoff matrix which is sufficiently diagonally dominant. That is, identical matches will be positive and mismatches will be negative. We start with a matrix at PAM 30 which has all mutations equally likely. We will call this matrix DM1.
(2)Align all the sequences against each other with this similarity matrix.
(3)Select a subset of the alignments which ensures that there is no relative over-representation of some evolutionary event. This can be achieved by using a circular tour of the sequences with accumulated maximal similarity. It can also be achieved with the methods described for the Unbiased selection of sample alignments
(4)From each of the selected alignments, count the matches and mismatches per symbol pair. Mismatches are counted 1/2 for each pair, as we do not know which one mutated into which.
(5)Compute the frequencies, mutation matrix and Dayhoff matrices from the Counts matrix. (Darwin's ComputeDayMatrices command.)
(6)Repeat from step 2 with these new matrices until convergence. Convergence is normally very fast, and in this example we will run a single iteration.

     For the initial matrix, it is easier to use the darwin function CreateDayMatrices which requires a matrix of counts for each pair of symbols. We will produce a synthetic counts matrix with all 1's except for the diagonal (10's).

SynCounts := CreateArray(1..Dim,1..Dim,1):
for i to Dim do SynCounts[i,i] := 10 od:
CreateDayMatrices( SynCounts, map, type=Alphabetic );
DM1 := SearchDayMatrix(30,DMS);
DM1 := DayMatrix(Alphabetic, pam=30.2968, Sim: max=13.157, min=-5.673, Mapping=
map, dim=28, del=-26.627-1.396*(k-1))

     now we run the complete or restricted all-against-all matching.

n := If( PartialRun, 6, DB[TotEntries] );
AllAll := CreateArray(1..n,1..n):
Dist := CreateArray(1..n,1..n):
for i to n do
    for j from i+1 to n do
        AllAll[i,j] := AllAll[j,i] :=
	    Align(msgs[i,body],msgs[j,body],DM1);
        Dist[i,j] := Dist[j,i] := "[Score];
od od:
maxdist := max(Dist);
n := 6
maxdist := 14539.1689

Find a circular tour of the leaves to collect mutation information

     The set of all messages, assuming that they are all related, will form a phylogenetic tree. This tree is at this point unknown, (is what we want to construct). An excellent way to inspect all the pairs of leaves of the tree is by doing a circular tour of the tree in any of its planar representations. A circular tour that considers each pair of leaves in the tour and uses the information of the path that connects both leaves will consider each of the tree branches exactly twice. This is very important when we do not want to over-represent information. Notice that an all-against-all, would use some edges much more than others.

     If shorter edges mean higher similarity, a planar tour has minimum total sum of edges and maximum total similarity. A circular tour with maximal similarity can be found by using the Traveling Salesman Problem solution (TSP). The TSP function minimizes a tour, so we will complement the similarities so that by finding a minimum of the complements, we find the desired maximum. For convenience, we complement against the maximum so that all distances end up positive.

for i to n do for j from i+1 to n do
    Dist[i,j] := Dist[j,i] := maxdist - Dist[i,j] od od;

t := ComputeTSP( Dist );
t := [op(t),t[1]];
t := [1, 3, 5, 4, 2, 6]
t := [1, 3, 5, 4, 2, 6, 1]

     For programming convenience we have added the first sequence of the tour to the end. We are now ready to follow the tour and for each alignment count all the matches in a Count matrix.

Counts := CreateArray(1..Dim,1..Dim,If(PartialRun,1,0)):
if PartialRun then for i to Dim do Counts[i,i] := 10 od fi;
for i to n do
    dps := DynProgStrings( AllAll[t[i],t[i+1]] );
    for j to length(dps[2]) do
	i1 := map(dps[2,j]);
	i2 := map(dps[3,j]);
	Counts[i1,i2] := Counts[i1,i2] + 1/2;
	Counts[i2,i1] := Counts[i2,i1] + 1/2;
    od;
od:

and now we compute the new Dayhoff matrices.

CreateDayMatrices(Counts,map,type=Alphabetical);

     For the partial runs we usually have too few sequences, which leave some counts empty. When there are empty counts, it becomes impossible to compute the scoring matrix. So for the partial run we start with ones on each count and 10 on each diagonal to prevent empty counts.

Compute the all-against-all distance matrix

     Now we are ready (by having a better scoring matrix) to align the messages and estimate their distances. This is done with the function Align when it receives an array of Dayhoff matrices instead of a single one (e.g. DMS). Align will compute the estimated distance and its variance for every pair of messages. The distance in this case, corresponds to a Markovian model of random independent mutations (which is not exactly what happens to the messages, but an approximation). A final technical problem is caused by alignments which are spurious, very good, and very short. This could happen by a very short phrase which appears in two unrelated messages. E.g. "thousand united states dollars". These short, good matches falsify the distance computations. To prevent this from happening, we use a mode of Align which forces an alignment of at least 400 amino acids of each message (parameter MinLength(400)).

for i to n do
    for j from i+1 to n do
        AllAll[i,j] := AllAll[j,i] :=
            Align(msgs[i,body],msgs[j,body],DMS,MinLength(400));
od od:
bytes alloc=39200000, time=461.540

     If we are running a partial run, we will now read the full results from a pre-prepared file. We will also reset the number of messages and recompute the Dayhoff matrices. If this was a complete run, then we will create the file of results for future partial runs.

if PartialRun then
     ReadProgram( 'NigerianPrince/AllAll.drw' );
else OpenWriting( 'NigerianPrince/AllAll.drw' );
     printf( 'n := %d:\n', n );
     printf( 'Counts := %A:\n', Counts );
     printf( 'CreateDayMatrices(Counts,map,type=Alphabetical);\n' );
     printf( 'AllAll := CreateArray(1..n,1..n);\n' );
     printf( 'AllAll:=%A:\n', AllAll );
     OpenWriting( terminal );
fi:
bytes alloc=45600000, time=736.330

     It is interesting to see some alignments at different PAM distances. To make this easier we write a function which selects the pair of messages which are at a distance as close as possible to a given one and prints their alignment. This will allow us to test the limit of quality of the alignments.

PrintAlignmentAt := proc( d:positive )
best := AllAll[1,2];
for i to n do for j from i+1 to n do
    if |best[PamDistance]-d| > |AllAll[i,j,PamDistance]-d| then
	best := AllAll[i,j] fi od od;
print( best )
end:

     We now print alignments close to 30, 50 and 70 to see where the boundary is.

PrintAlignmentAt( 30 );
lengths=427,456 simil=1025.1, PAM_dist=30.2968, identity=45.4%
 assist us_ to__________ transfer the sum of_________________ Twenty Million Fiv
|||||||....||||......|..|||||||||||||||||||||..|..|...|...|..||...||||||||||||||
 assistance to enable me transfer the sum of US$30,500,000.00 Thirty Million,Fiv

e Hundred Thousand United St_ates Dollars 20,500,000)into ______his her_______ a
||||||||||||||||||||||||||||||||||||||||||..|...|...||||||....|..|...|........||
e Hundred Thousand United St ates Dollars___________)into your private/company a

ccount.This fund________________ resulted from an over-invoiced bill from contra
|||||||||..||||||....|.....|..|.|||||||..|..|.||.|....|........|....|....|||||||
ccount.Th_e fund came about as a result____ of a_________________________ contra

cts awarded by us unde___________r the budget allocation to___________ my Minist
||.|||||||||..|..|.||.........|..||...|......|.........|..|.|......|..||||||||||
ct_ awarded______ and executed for ___________________and on behalf of my Minist

ry and this bill has been approved for payment by the concerned authorities.The 
|||...|....|....|...|....|........|...|.......|..|...|.........|...........|||||
ry_________________________________________________________________________.The 

contract has since been execu___ted,commissioned and the_______________________ 
|||||||||.||||....|....|....|....|||.|..........|...|..||.......|..|...|.......|
contract was s______________upposed to b_______________e awarded to two foreign 

contractor was _________________________________________________________________
|||||||||||..||..|...|....|..|..|...|...|...|..|...|.......|...|......|.......|.
contractor___s to the tune of US$180,000,000.00 One hundred and Eighty Million U

_______________________paid the actual co__st of____________ the contract.W
.....|......|.......|....|.|||||......|||..|.||||...........|||||||||||||||
nited States Dollars)But in the_______ course of negotiation,the contract w
PrintAlignmentAt( 50 );
lengths=531,576 simil=416.0, PAM_dist=50, identity=36.5%
 of Contract Award_______________ Committee with the Federal Ministry of Agricul
|||||||||||||||||||...|..........|||||||||||....|||||.......|||||||||||||.......
 OF CONTRACT AWARD AND MONITORING COMMITTEE O__F THE________ MINISTRY OF _______

ture.By the virtue of our position and ___________________________the power best
....|..|...|......|..|.||........||||||.....|...........|..|....|...|.||||||....
_______________________URB______AN AND RURAL DEVELOPMENT.MY DUTY AS EMPOWER_____

owed on us by the_________ government,we carefully and deliberately over invoice
..|||..|..|||||||.........||||||||||||..|.........|...|........|.. .||..|....|.|
__ED______ BY THEMAURITIUS GOVERNMENT IS ______________________TO PROV_______IDE

d the v_______________________________alue of some contract_____________________
.|||||.....|.........|......|.........||..|..|....|.....|||.......|..|.....|...|
_ THE BASIC AMENITIES,SOCIAL RECRATIONAL _______________ACTIVITIES IN URBAN AND 

__________s th_____________________at we a_________________________________ward 
.....|....||||..|........|........||.....|...|..|........|.....|............|.||
RURAL AREAS,THIS PROGRAMM INCLUDES ASSISTANCE TO DEPRIVED LOCAL COMMUNITIES AND 
 . . . . (5 output lines skipped) . . . .
_______________________________________________________________________________u
...........|....|....|........|..|....|....|....|..|.......|....|..........|...|
FURTHERMORE FROM THIS PROJECTS WE HAVE BEEN ABLE TO SECURED SOME REASONABLE AMOU

ne of_________ Twenty-Two Million five_ Hundred Thousand United State Dollars US
| ||||.|.|..|.|||||||| ..|||||||||.|...|||||||||||||||||||.....||....|||||||||..
NT OF U.S.21.8(TWENTY ONE MILLION EIGHT HUNDRED THOUSAND U_____.S____.DOLLARS __

$22,500,000 00)Now the contracts have been fully executed and commission and pay
|..|...|...|..|...|...|.||......|....|....|...|||...|....|...|...|||||||| . ....
________________________ON____________________LY)AS CO___________MISSION FROM __

ment are about to be made to all the contractors who have successfully executed 
.....||.|..||.|..|..|....|..|...|...|||||||||||||...|....|....||..||..|........|
____VARI___OU______________________S CONTRACTORS R____________ES__UL____________

their contracts,the_________ over invoic
.....|.........|.......|....||||||||||||
___________________TING FROM OVER INVOIC
PrintAlignmentAt( 70 );
lengths=434,594 simil=459.0, PAM_dist=70, identity=39.7%
 account will be compensated with twenty percent 20%of the total sum if you_____
|||||||||....|..|.... |.|...|....|.... ||| | ...|..||||||||.| .|.. ..  |||||....
 account ____________in a_____________ny part______ of the wor_ld,which you prov

___ will stand____________________________________________ as the beneficiary___
...||||||   |...........|...|........|..|....|.....|..|...||||||||||||||||||||..
ide,will then facilitate the transfer of this money to you as the beneficiary ne

__ of______________________ the fun______________d.Eighty_______________________
..||||...|..|..|.....|.....|||||. |..|....|..|...||  . .|...|.......|...|..|..|.
xt of kin of Mr.Barry Kelly.The money will be paid into your account for us to s

_ percent___________ 80%for we_______________ the ______offic____ials involved.N
...|.  .|..|.....|..||||||||.||...|..|...|...||||..|..|.|. | .|...|| |........|.
hare in the ratio of 60%for me and 40%for you.There is no risk at all___________

ote that we have done our home work very well in our country,this transaction is
...|||..|.|| |..|....|...|...|.|||||. |.|....|..|...|.......||||||||||||||||||  
___ th____e pap______________erwork for_____________________ this transaction wi

 safe_____________________ and__________________________________ guaranteed ___1
. ..||....|..|...|........|||||..|........|..|...|......|.......||||||||||.|....
ll be done by the attorney and my position as the Branch Manager guarantees the 

00%risk free____________________________.If you are interested please send down 
  .  |.. ..|........|..|....|...........|||||||||||||||||||||||||||||| | .|....|
successful execution of this transaction.If you are interested,please rep_______

to us account information,so that we will immediately seek a____pproval o_______
..|..|.......|...........|..|....|..|..| |||||||||||||.....||....|| || .......|.
_______________________________________ly immediately ___via the private email a

____f the fund on your be__________half from the________________________________
........|.  . .||||||||.|......|.|.|||.|....||||.|.......|...|....|....|.......|
ddress below.Upon your response,I shal_____l then provide you with more details 

___ relevant 
...||||||||||
and relevant 

     We can see that the boundary is fuzzy, but at 50 there are very large portions of text which have been duplicated whereas at 70 the relation is dubious. Consequently we will choose 50 as the limiting distance for a trustworthy relation. It is interesting to notice that in Biology we are used to consider and study homologies at much larger distances, up to 250 in a reliable way. Clearly the techniques that have been developed are much more sensitive than the human eye.

     Now that we have computed all the distances, we will produce the full tree.

Labels := [ seq( m[shortid], m=msgs )]:
Seqs := [ seq( m[body], m=msgs )]:

tree := PhylogeneticTree( Seqs, Labels, Distance, AllAll ):
DrawTree( tree, Unrooted, CrossReference );
MinLen not set, using 0.17908
dimensionless fitting index 16.06
complete phylogenetic tree

     The full tree is quite large and impossible to read. To improve this we will do several things. First we will not print branch lengths nor nodes. In these cases, the Phylogram style is the most readable format. We will use the reordering that makes the left side the largest one so that the top of the tree appears more organized. Because the tree is so large we will trim it down by removing the entries which are 80% or more identical. To do this we will use the fact that we have constructed a tree based on distances, and hence very similar sequences will have a very short distance and hence will be neighbour leaves in the tree. Using this idea we write a function that returns a trimmed tree:

TrimTree := proc( t:Tree, allall:matrix )
if type(t,Leaf) then t
else l := TrimTree( t[Left], allall );
     r := TrimTree( t[Right], allall );
     if type(l,Leaf) and type(r,Leaf) and
	allall[l[3],r[3],Identity] > 0.8 then
	  printf( '%s and %s have identity %.2f%%, %s ignored\n',
	      l[Label], r[Label], allall[l[3],r[3],Identity]*100,
	      r[Label] );
	  l
     else Tree(l,t[2],r)
     fi
fi
end:

     With all these changes the tree is now:

DrawTree( TrimTree(tree,AllAll), Phylogram, Legend,
    OrderLeaves=LeftHeavy, LengthFormat='' );
sadu25 and sadu25 have identity 96.35%, sadu25 ignored
simonmuzenda and simonmuzenda have identity 98.60%, simonmuzenda ignored
simonmuzenda and muzendaasimon have identity 96.58%, muzendaasimon ignored
danstevens101 and dsmnd23 have identity 85.45%, dsmnd23 ignored
rubutuwena99 and rubutuwena99 have identity 99.72%, rubutuwena99 ignored
donaldste10 and jamesufon5050 have identity 96.79%, jamesufon5050 ignored
emmastevens100 and donaldste10 have identity 99.02%, donaldste10 ignored
emmastevens100 and emmastevens100 have identity 98.13%, emmastevens100 ignored
rubutuwena99 and emmastevens100 have identity 92.87%, emmastevens100 ignored
franksalihu23 and rubutuwena99 have identity 95.29%, rubutuwena99 ignored
musa10smith and franksalihu23 have identity 94.01%, franksalihu23 ignored
mahabile50 and mahabile10 have identity 99.54%, mahabile10 ignored
sodindo2001 and sodindo2001 have identity 99.94%, sodindo2001 ignored
harenma and harenma have identity 92.50%, harenma ignored
ekidinn and omoeki66 have identity 97.43%, omoeki66 ignored
ekiodion45 and ekiodion45 have identity 99.92%, ekiodion45 ignored
ekidinn and ekiodion45 have identity 97.51%, ekiodion45 ignored
ekidinn and gloriaeze2003 have identity 89.07%, gloriaeze2003 ignored
kambirelama and kambirelama have identity 99.95%, kambirelama ignored
farsan1 and farsan1 have identity 97.97%, farsan1 ignored
 . . . . (30 output lines skipped) . . . .
ericjons2 and ericjons01 have identity 97.90%, ericjons01 ignored
ericjons2 and ericm001 have identity 97.90%, ericm001 ignored
malik3 and afiz have identity 99.85%, afiz ignored
aya and aya have identity 99.73%, aya ignored
pat27 and colzaqb have identity 98.85%, colzaqb ignored
josephmobutu56 and jmobutu48 have identity 98.76%, jmobutu48 ignored
kurumamarvin and kurumamarvin have identity 99.89%, kurumamarvin ignored
seseseko20 and mari500 have identity 95.52%, mari500 ignored
seseko and seseseko20 have identity 93.62%, seseseko20 ignored
ladawamobutu3000 and ladawa_mobutu have identity 99.96%, ladawa_mobutu ignored
ladawamobutu3000 and ladawamobutu3000 have identity 99.67%, ladawamobutu3000 ignored
mrs_mariamsese13 and mrs_mariamsese17 have identity 99.71%, mrs_mariamsese17 ignored
mrs_mariamsese13 and mrs_mariamsese13 have identity 98.85%, mrs_mariamsese13 ignored
sesekom and mrs_mariamsese13 have identity 95.41%, mrs_mariamsese13 ignored
ladawamobutu3000 and sesekom have identity 94.44%, sesekom ignored
seseko and ladawamobutu3000 have identity 88.02%, ladawamobutu3000 ignored
sekmobut and seseko have identity 93.31%, seseko ignored
solomongarba22 and bulawa2000 have identity 96.81%, bulawa2000 ignored
luckygarbi and simonmuzendda have identity 95.75%, simonmuzendda ignored
almost complete tree

which shows some of the structure, but still virtually impossible to find the details.

Separating the sequences into clusters

     Sequences which are completely unrelated to one another disrupt the phylogenetic trees. Sometimes this disruption is so great that it is difficult to recognize the true properties. This can very well happen in this case, where there may be more than a single source of messages. Like with genomic sequences, trying to associate completely different classes of objects destroys real relations. Once that we have computed the all-against-all alignments we can cluster the sequences using the Clusters function in Darwin. This function provides a distance parameter to make the clusters and hence the algorithm can be tuned to the data. In this case we know that we would like to separate the clusters around distance 40

clusters := Clusters( AllAll, AveDistance=40 ):

     These clusters are composed of truly related messages. We write a function to compute a tree for a given cluster. This function receives a set of Entry numbers (a cluster) and uses the AllAll matrix to select the desired alignments; no further computation of alignments is needed. The rest is identical to the previous tree construction.

ClusterTree := proc( c:set(posint) )
n2 := length(c);
allall := CreateArray(1..n2, 1..n2);
for i to n2 do for j from i+1 to n2 do
    allall[i,j] := allall[j,i] := AllAll[c[i],c[j]];
    od od;
Labels := [ seq( msgs[i,shortid], i=c )];
Seqs := [ seq( msgs[i,body], i=c )];
tree := PhylogeneticTree( Seqs, Labels, Distance, allall );
DrawTree( TrimTree(tree,allall), Unrooted, CrossReference );
end:

     We now test the function over three largest clusters. To do this, we first sort the clusters by reverse size (so the three largest are the first three) and display the two largest.

clusters := sort( clusters, x -> -length(x) ):
clusters[1];
clusters[2];
{16,20,27,40,52,54,73,99,108,113,117,127,131,137,152,155,160,182,186,196,201,202
,207,212,221,226,235,246,247,256,269,276,287,295,318,323,336}
{18,42,47,67,102,146,150,154,156,183,214,215,242,261,262,270}
ClusterTree( clusters[1] );
MinLen not set, using 0.011506
dimensionless fitting index 2.407
muzendaasimon and simonmuzenda have identity 97.78%, simonmuzenda ignored
muzendaasimon and simonmuzenda have identity 96.58%, simonmuzenda ignored
mahabile50 and mahabile10 have identity 99.54%, mahabile10 ignored
rubutuwena99 and rubutuwena99 have identity 99.72%, rubutuwena99 ignored
musa10smith and franksalihu23 have identity 94.01%, franksalihu23 ignored
rubutuwena99 and musa10smith have identity 95.29%, musa10smith ignored
rubutuwena99 and emmastevens100 have identity 92.87%, emmastevens100 ignored
sophikan12 and lindaibrahim3 have identity 89.78%, lindaibrahim3 ignored
transvageazimba and mbowenilizzy have identity 95.24%, mbowenilizzy ignored
sodindo2001 and sodindo2001 have identity 99.94%, sodindo2001 ignored
solomongarba22 and bulawa2000 have identity 96.81%, bulawa2000 ignored
phylogenetic tree of largest cluster
ClusterTree( clusters[2] );
MinLen not set, using 0.0088926
dimensionless fitting index 2.269
benwills20012002 and benwills11 have identity 96.84%, benwills11 ignored
mk111km and francisduru12 have identity 95.18%, francisduru12 ignored
phylogenetic tree of 2nd largest cluster
ClusterTree( clusters[3] );
MinLen not set, using 0.015775
dimensionless fitting index 0.2383
malik3 and afiz have identity 99.85%, afiz ignored
aya and aya have identity 99.73%, aya ignored
pat27 and colzaqb have identity 98.85%, colzaqb ignored
phylogenetic tree of 3rd largest cluster

     The clusters show a pattern which is typical of sequences which are mutated again and again and we have all the intermediate steps of their evolution (which is not the case in biology). This is indicated by nodes which have branches so short that they lie against the main branch. As we collect more information, it is likely that we will connect some of the disjoint clusters.

     The source data for this bio-recipe (as an electronic mailbox standard format) can be found here.

© 2005 by Gaston Gonnet, Informatik, ETH Zurich

Please cite as:

@techreport{ Gonnet-NigerianPrince,
	author = {Gaston H. Gonnet},
	title = {Phylogenetic tree of the "Nigerian Prince" email scam},
	month = { July },
	year = {2003},
	number = { 415 },
	howpublished = { Electronic publication },
	copyright = {code under the GNU General Public License},
	institution = {Informatik, ETH, Zurich},
	URL = { http://www.biorecipes.com/NigerianPrince/code.html }
	}

Index of bio-recipes

Last updated on Mon Oct 3 15:07:57 2005 by GhG