K-Means Clustering

http://people.revoledu.com/kardi/tutorial/kMean/index.html
explains k-means clustering and gives a numerical example. Data is grouped 
into clusters by an iterative process. The points4.txt file contains the 
four data points from the above article.  points10000a.txt contains 10,000 
points and points100000a.txt contains 100,000. 

This is a good example to illustrate concurrency. For larger files you could 
divide the data among multiple agents.  Each cluster should have its own 
agent. A mutable ref may be used to hold the points associated with a cluster,
so there will be one such ref for each cluster.
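
One way to divide the data among several data agents is to give each one a
contiguous range of point indices. The Java sketch below shows only that
arithmetic. It does not use agents or refs, and the class name ChunkSketch and
the method split are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the assignment: partitions point indices
// into contiguous ranges, one range per data agent.
final class ChunkSketch {
    // Splits indices 0..n-1 into `parts` ranges of nearly equal size.
    // Each range is {start, end} with end exclusive.
    static List<int[]> split(int n, int parts) {
        List<int[]> ranges = new ArrayList<>();
        int base = n / parts;   // minimum size of every range
        int extra = n % parts;  // the first `extra` ranges get one more index
        int start = 0;
        for (int p = 0; p < parts; p++) {
            int size = base + (p < extra ? 1 : 0);
            ranges.add(new int[] {start, start + size});
            start += size;
        }
        return ranges;
    }
}
```

With 10,000 points and two data agents, split(10000, 2) gives the ranges
0 to 5000 and 5000 to 10000 (end exclusive).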

Page 111 of the book shows how to read a file. These data files have one point 
on each line. The first two values are the x- and y-coordinates.  Ignore 
any other values.  All points are two-dimensional, although this procedure 
works for any number of dimensions. A Java StringTokenizer divides a string 
into tokens, which you retrieve one at a time with the nextToken method.  
Pass " ," as the delimiter string so that both the space and the comma act 
as separators.
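
The tokenizing step might look like the following Java sketch. The class name
PointParser, the method parsePoint, and the sample input line are assumptions
made for illustration; the actual layout of the data files may differ in
detail.

```java
import java.util.StringTokenizer;

// Hypothetical helper, not part of the assignment.
final class PointParser {
    // Parses one data line into {x, y}, ignoring any values after the first two.
    static double[] parsePoint(String line) {
        // " ," makes both the space and the comma act as separators.
        StringTokenizer tokens = new StringTokenizer(line, " ,");
        double x = Double.parseDouble(tokens.nextToken());
        double y = Double.parseDouble(tokens.nextToken());
        return new double[] {x, y};
    }

    public static void main(String[] args) {
        double[] p = parsePoint("1.0, 2.5, 7"); // assumed sample line
        System.out.println(p[0] + " " + p[1]);  // prints "1.0 2.5"
    }
}
```

Because consecutive delimiters are skipped, a line separated by ", " yields
the same tokens as one separated by a single space or a single comma.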

Assuming k clusters, assign the first k points as the initial cluster centers. 
The algorithm iterates until no point moves from one cluster to another.  At 
each stage we first assign each point to a cluster, then calculate the new 
cluster centers and check for any changes. Assign each point to the cluster 
whose center is at the minimum distance from the point (the closest 
cluster). Add it to the ref for that cluster, and update the point with 
its new cluster assignment. When all the points have been assigned to 
clusters, have each cluster compute its new center as the average of the 
points assigned to it. Clear the member ref associated with the cluster to 
be ready for the next round of assignments. Associate the new center with 
the cluster agent.  
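
The loop described above can be sketched sequentially in Java. This is only an
illustration under stated assumptions, not the agent-based design: plain
arrays stand in for the agents and refs, the names KMeansSketch, Result, and
cluster are hypothetical, and the four points in main are sample values. Two
details are also assumptions, because the text does not specify them: a
cluster that receives no points keeps its previous center, and the iteration
count includes the final pass in which nothing changes.

```java
import java.util.Arrays;

// Hypothetical sequential sketch; names are not from the assignment.
final class KMeansSketch {
    // Holds the final centers and the number of assignment passes.
    static final class Result {
        final double[][] centers;
        final int iterations;
        Result(double[][] centers, int iterations) {
            this.centers = centers;
            this.iterations = iterations;
        }
    }

    static Result cluster(double[][] points, int k) {
        // The first k points serve as the initial centers.
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) centers[c] = points[c].clone();

        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1); // -1 means "not yet assigned"
        int iterations = 0;
        boolean changed;
        do {
            changed = false;
            double[][] sums = new double[k][2]; // per-cluster coordinate sums
            int[] counts = new int[k];          // per-cluster member counts
            for (int i = 0; i < points.length; i++) {
                // Find the closest center (squared distance preserves the ordering).
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centers[c][0];
                    double dy = points[i][1] - centers[c][1];
                    double dist = dx * dx + dy * dy;
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                if (assignment[i] != best) { assignment[i] = best; changed = true; }
                sums[best][0] += points[i][0];
                sums[best][1] += points[i][1];
                counts[best]++;
            }
            // New center = average of members; an empty cluster keeps its old center.
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c] = new double[] {sums[c][0] / counts[c], sums[c][1] / counts[c]};
                }
            }
            iterations++;
        } while (changed);
        return new Result(centers, iterations);
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {4, 3}, {5, 4}}; // sample values
        Result r = cluster(points, 2);
        System.out.println(Arrays.toString(r.centers[0])); // prints [1.5, 1.0]
        System.out.println(Arrays.toString(r.centers[1])); // prints [4.5, 3.5]
        System.out.println(r.iterations);                  // prints 3
    }
}
```

In a concurrent version, the inner assignment loop is the part that could be
split among data agents, while each cluster agent would own one row of sums,
counts, and centers.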

When the algorithm completes, output the cluster centers and the number of 
iterations performed. Compare the timings for the 10,000- and 100,000-point 
files. Use five cluster agents and two data agents for these files. For fun, 
you might try other numbers of clusters.