We divide the algorithms into two groups: first, the algorithms that produce a classification; second, the algorithms that take a classification and refine (improve) it.
The initial classification algorithms compute a hyperplane
that divides the space in the best possible way. There are
several paradigms for computing this hyperplane.
The first paradigm is a least-squares fit of the logistic function

logistic(x) = 1/(1+exp(-x)).

It minimizes the sum of

(1-logistic(A[i]*x))^2

for the positives and

logistic(A[i]*x)^2

for the negatives.
This is a non-linear least-squares problem, and it is solved iteratively.
In this case we use steepest descent (SD) starting from x=0,
with steps small enough to guarantee a decrease along the
direction of the gradient. The minimum is attained only as the
norm of x goes to infinity, so the SD procedure is stopped
after 200 steps.
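As a concrete illustration, here is a minimal NumPy sketch of this least-squares fit, assuming labels y in {0,1}; the step size eta and the boolean mask pos marking the positive rows are illustrative choices, not taken from the text.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def sd_least_squares(A, pos, steps=200, eta=0.01):
    # Steepest descent on the sum of (1-logistic(A[i]*x))^2 over the
    # positives plus logistic(A[i]*x)^2 over the negatives, from x = 0.
    x = np.zeros(A.shape[1])
    y = pos.astype(float)              # 1 for positives, 0 for negatives
    for _ in range(steps):             # minimum lies at infinite norm: stop early
        p = logistic(A @ x)
        # gradient of sum (p_i - y_i)^2 is 2 * A^T ((p - y) * p * (1 - p))
        grad = 2.0 * A.T @ ((p - y) * p * (1.0 - p))
        x -= eta * grad
    return x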
The second paradigm minimizes the negative log-likelihood, that is, the sum of

-ln(logistic(A[i]*x))

for the positives and

-ln(1-logistic(A[i]*x))

for the negatives.
This function is minimized in terms of x using steepest
descent, exactly as for the least-squares case.
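Only the loss changes relative to the sketch above; for the negative log-likelihood the gradient takes the familiar logistic-regression form A^T(p-y). A minimal sketch under the same assumptions:

import numpy as np

def sd_log_likelihood(A, pos, steps=200, eta=0.01):
    # Steepest descent on -ln(logistic(A[i]*x)) for the positives and
    # -ln(1-logistic(A[i]*x)) for the negatives, starting from x = 0.
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    x = np.zeros(A.shape[1])
    y = pos.astype(float)
    for _ in range(steps):
        p = logistic(A @ x)
        grad = A.T @ (p - y)           # gradient of the negative log-likelihood
        x -= eta * grad
    return x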
The refinement algorithms take an initial classification and
change it if it can be improved. Otherwise it is left as is.
For the refinements we always try to minimize the weighted
number of misclassifications (errors).
We have four basic forms of refinement. Two of them admit an
additional parameter in the form of a distribution (a common
skeleton shared by all four is sketched after the list):

  Midpoint
  Swath or Swath(distribution)
  Random
  Progressive or Progressive(distribution)
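The internals of the four forms are not described here, so the following sketch only shows the skeleton they share: score a candidate hyperplane by the weighted error count and keep it only if it improves. The propose argument standing in for Midpoint/Swath/Random/Progressive is hypothetical, and the decision threshold of 0 is an assumption.

import numpy as np

def weighted_errors(x, A, pos, threshold=0.0):
    # Weighted misclassifications: each class is normalized so that all
    # the positives together weigh the same as all the negatives.
    pred = (A @ x) > threshold
    fn = np.sum(pos & ~pred) / np.sum(pos)    # missed positives
    fp = np.sum(~pos & pred) / np.sum(~pos)   # false positives
    return fn + fp

def refine(x, A, pos, propose):
    # Keep the proposed hyperplane only if it lowers the weighted error
    # count; otherwise the classification is left as is.
    cand = propose(x, A, pos)                 # hypothetical proposal step
    if weighted_errors(cand, A, pos) < weighted_errors(x, A, pos):
        return cand
    return x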
The simulation results use a matrix of independent random values drawn from U(0,1). This matrix has 1000 rows and 25 columns, i.e. 1000 data points with 25 values each (m=1000, n=25). The simulations differ only in the way the positives/negatives are determined.
The score column shows the average score, which is the weighted number of misclassifications (weighted so that all the positives together weigh the same as all the negatives). Lower scores are better.
The time is in seconds.
The quality column is only available for the non-random simulations. It measures the alignment of the computed hyperplane with the one used for the construction of the positives/negatives (which is known to us). The measure is the normalized inner product, so a value of 1 is a perfect result and a value of 0 is a perfectly orthogonal result. It should be noticed that, almost independently of the model, method, and simulation, the lower the score, the better the quality.
Values in green indicate the best value for the column.
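Under these definitions, the simulation matrix and the quality measure can be written down directly; the RNG seed is illustrative, and taking the absolute value of the cosine is an assumption consistent with 0 meaning a perfectly orthogonal result.

import numpy as np

rng = np.random.default_rng(0)              # illustrative seed
A = rng.uniform(0.0, 1.0, size=(1000, 25))  # m=1000 points, n=25 values, U(0,1)

def quality(x, direction):
    # Normalized inner product between the computed hyperplane normal x
    # and the direction used to construct the positives/negatives.
    cos = (x @ direction) / (np.linalg.norm(x) * np.linalg.norm(direction))
    return abs(cos)                         # 1 = perfect, 0 = orthogonal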
Random
In this simulation the positives are decided randomly, so any classification is a random result. It serves as the null hypothesis of classification.
Deterministic
This simulation uses a linear combination of the entries
to determine positives. In particular, if
A[i,3]-3*A[i,4]+7*A[i,5] > 1/2
is true, then the row is considered a positive.
Deterministic+Noise
This simulation uses a linear combination of the entries
plus some random noise to determine positives. In particular, if
A[i,3]-3*A[i,4]+7*A[i,5]+Rand() > 1/2
is true, then the row is considered a positive.
Deterministic-full direction + Noise
This simulation uses a linear combination of the entries
plus some random noise to determine positives. In particular, if
A[i]*Direction+Rand()/4 > 1/2
is true, then the row is considered a positive. Direction
is a random direction in n dimensions.
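Reading Rand() as an independent U(0,1) draw, the four labelling rules can be reproduced as follows; the fair split for the Random simulation and the Gaussian construction of the random Direction are assumptions.

import numpy as np

rng = np.random.default_rng(0)
m, n = 1000, 25
A = rng.uniform(size=(m, n))

pos_random = rng.uniform(size=m) > 0.5               # Random: assumed fair split
# The text uses 1-based columns, so A[i,3] is A[:, 2] here.
pos_det    = A[:, 2] - 3*A[:, 3] + 7*A[:, 4] > 0.5   # Deterministic
pos_noise  = A[:, 2] - 3*A[:, 3] + 7*A[:, 4] + rng.uniform(size=m) > 0.5

direction = rng.standard_normal(n)                   # assumed construction
direction /= np.linalg.norm(direction)               # random direction in n dims
pos_full   = A @ direction + rng.uniform(size=m)/4 > 0.5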
Some of the conclusions are obvious.
The construction methods are uniformly mediocre to poor compared with the Best method. Notice that the Best method is much better, but it also requires more time. One exception is Svd(First(10)), which is much worse than the others: using just a few singular values appears to be a poor choice.
As an initial method, we choose Svd(1e-8), as it gives results as good as the best and it takes only a fraction of the time.
Any initial method is much worse than Svd(1e-8) plus one iteration of refinement. In particular, Swath appears to be the most effective refinement. (Progressive is better, but it takes much longer.)
The weight distributions for Swath and Progressive are roughly equivalent, with a minor edge in favour of the normal distribution.
The Random refinements are, as expected, the weakest refinements.
Last updated on Sat Feb 8 20:28:51 2003 by GhG