In this bio-recipe we will analyze how significant is an alignment. This will be done by comparing the real alignment against alignments made out of shufflings of the same sequences.
First we define the file name where we can find the SwissProt database and then load it. We also create Dayhoff matrices to be able to align sequences.
SwissProt := '~cbrg/DB/SwissProt.Z': ReadDb(SwissProt); CreateDayMatrices();
Reading 169638448 characters from file /home/cbrg/DB/SwissProt.Z Pre-processing input (peptides) 163235 sequences within 163235 entries considered Peptide file(/home/cbrg/DB/SwissProt.Z(169638448), 163235 entries, 59631787 aminoacids)
Given two sequences from the database (given by the accession numbers)
seq1 := Sequence(AC(P58051)); seq2 := Sequence(AC(Q64403));
seq1 := MIWWFIVGASFFFAFILIAKDTRTTKKNLPPGPPRLPIIGNLHQLGSKPQ ..(496).. LQLIPVLTQWT seq2 := MGLLTGDALFSVAVAVAIFLLLVDLMHRRQRWAARYPPGPVPVPGLGNLL ..(500).. PSPYQLCAVLR
We align the two sequences using a local alignment, to align the best subsequences.
al1 := Align(seq1,seq2,Local);
al1 := Alignment(Sequence(AC('P58051'))[2..464],Sequence(AC('Q64403'))[9..474],376.1353,DM,0,0,{Local})
The alignment printed in a readable format is:
print(al1);
lengths=463,466 simil=376.1, PAM_dist=250, identity=23.3% ID=C72E_ARATH AC=P58051; DE=Cytochrome P450 71B14 (EC 1.14.-.-). OS=Arabidopsis thaliana (Mouse-ear cress). OC=Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids II; Brassicales; Brassicaceae; Arabidopsis. KW=Heme; Monooxygenase; Multigene family; Oxidoreductase; Transmembrane. ID=CPDG_CAVPO AC=Q64403; O54866; DE=Cytochrome P450 2D16 (EC 1.14.14.1) (CYPIID16). OS=Cavia porcellus (Guinea pig). OC=Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Hystricognathi; Caviidae; Cavia. KW=Direct protein sequencing; Electron transport; Endoplasmic reticulum; Heme; Membrane; Microsome; Monooxygenase; Oxidoreductase. IWWFIVGASFFFAFILIAKDTRTTKKNLPPGPPRLPIIGNLHQLG_SKPQRSLFKLSEKYGSLMSLKFGNVSAVVASTPE !! :.|::::|:.:! !.:..:. ..:.||||..:| !|||.|:: ::...|. ||.::!|:::||:| :::||::.. LFSVAVAVAIFLLLVDLMHRRQRWAARYPPGPVPVPGLGNLLQVDFENMAYSCDKLRHQFGDVFSLQFVWTPVVVVNGLL TVKDVLKTFDAECCSRPYMTYPARVTYNFNDLAF__SPYSKYWREVRKMTVIELYT_AKRVKSFQNVRQEEVASFVDFIK :|!!:|.: ::!.::||.!:..|.:.!: :..:: : |:..|||.|!::|.:|.: :...||::: .:||:|::...:: AVREALVNNSTDTSDRPTLPTNALLGFGPKAQGVIGAYYGPAWREQRRFSVSSLRNFGLGKKSLEQWVTEEAACLCAAFT QHASLEKTVNMKQKLVKLSGSVICKVGFGISLEWS_____KLANTYEEVIQGTMEVVGRFAAADYFPIIGRIIDRITGLH :||: :::..|..|.|...:||::: !:..:!!: !|.:..||:!:.:.:: !:.:.:.:|!! || ..:.:. NHAG__QPFCPKALLNKAVCNVISSLIYARRFDYDDPMVLRLLEFLEETLRENSSL__KIQVLNSIPLLLRIPCVAAKVL SKCEKVFKEMDSFFDQSIKHHLEDTNIKDDIIGLLLKMEKGETGLGEFQLTRNHTKGILLNVLIAGVDTSGHTVTWVMTH |..::::. .|:::.:. :....|:. !| . ::|.:|:|:: | :| :::.::.! !:.::: ||: |::.|::|:!.. SAQRSFIALNDKLLAEHNTGWAPDQPPRDLTDAFLTEMHKAQ_GNSESSFNDENLRLLVSDLFGAGMVTTSVTLSWALLL LIKNPRVMKKAQAEVREVIKNKDDITEEDIERLEYLKMVIKETLRINPLVPLLIPREASKYIKIGGYDIPKKTWIYVNIW !|.:|.|.!::|.|!.|||.:.. ...|..:!.!.:.||:|:.|:..!||! !|:.:|! .:!.|! |||.|.!!:|! MILHPDVQRHVQEEIDEVIGQVRCPEMADQAHMPFTNAVIHEVQRFADIVPMGVPHMTSRDTEVQGFLIPKGTMLFTNLS AVQRNPNVWKDPEVFIPERFMHSEIDYKGVDFELLPFGSGRRMCPGMGLGMALVHLTLINLLYRFDWKLPEG :|.!:.:||::| .|.|.:|!::| .!...! .:!||::|.|:|.|..|:.. :.|.:.:||.||:!::||| SVLKDETVWEKPLHFHPGHFLDAEGRFVKRE_AFMPFSAGPRICLGEPLARMELFLFFTSLLQRFSFSVPEG
Then we shuffle randomly the amino acids of both sequences and align them again. We repeat the random shuffling and alignment many times, so that we can have some reliable values for the average score. First we would like to have some information about how to shuffle sequences.
?Shuffle
the following help files contain (approximately) the keyword: ?Beta_Rand ?Binomial_Rand ?ChiSquare_Rand ?Exponential_Rand ?FDist_Rand ?GammaDist_Rand ?Geometric_Rand ------------
The code to compute 400 random alignments and gather the statistics of its score is:
s := Stat('random score of shuffled seq1 and seq2'):
to 400 do
a := Align(Shuffle(seq1),Shuffle(seq2),Global);
s+a[Score]
od:
This code uses the variable s
which is assigned a structure to collect univariate statistics.
Now we can print the results onto the screen
(in plain characters), or we can see them as a web page
(other possibilities, like Latex are also possible).
print(s);
random score of shuffled seq1 and seq2: number of sample points=400 mean = -135.6 +- 2.3 variance = 531 +- 78 skewness=0.43412, excess=0.246474 minimum=-191.574, maximum=-54.6842
View(HTML(s));
Global alignments are really poor (forcing the entire shuffled entries to match), so we may try Local alignments, that is the best subsequences that match. At this point we also want to create a histogram of the scores, to have a visual clue on their distribution.
Hist := CreateArray(1..100):
s2 := Stat('random (Local) score of seq1 and seq2'):
to 800 do
a := Align(Shuffle(seq1),Shuffle(seq2),Local);
s2+a[Score];
i := round(a[Score]);
Hist[i] := Hist[i]+1;
od:
print(s2);
random (Local) score of seq1 and seq2: number of sample points=800 mean = 46.39 +- 0.38 variance = 30.6 +- 4.1 skewness=0.970446, excess=1.74417 minimum=36.2973, maximum=74.8547
The histogram, which is produced as a postscript file, can be viewed on a separate window.
View(Histogram(Hist));
If we want to focus the histogram on the part where all the significant scores happen, we select the values between 35 and 80.
View(Histogram(Hist[35..80]));
We see from the histogram, that although the distribution is skewed to the right, it is pretty much normal-shaped. Given that the original score was around 376, and that the average of a random score is about 46 with a variance of 31, the hypothesis that the original score could be obtained at random is extremely unlikely. The following expression computes how many standard deviations away it is from the average:
(al1[Score]-s2[Mean]) / sqrt(s2[Variance]);
59.5698
And the final conclusion is quite clear, a random alignment would never have the score that we obtained, nor anything close. The reason for the high score lies in the homology of the sequences.
© 2005 by Gaston Gonnet, Informatik, ETH Zurich
Last updated on Fri Sep 2 16:42:03 2005 by GhG