nodes is a frequency chart. If 200lbs. appears twice, nodes[200] is 2. The strongest node wins. The reduction is the mode average of the knowledge set.
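A minimal sketch of that mode reduction, assuming the knowledge set is a plain list of weights. The name `reduce_to_mode` and the sample data are hypothetical; only `nodes` comes from the slide.

```python
from collections import Counter

def reduce_to_mode(knowledge_set):
    """Return the most frequent value in the knowledge set (the mode)."""
    nodes = Counter(knowledge_set)           # frequency chart: value -> count
    value, _count = nodes.most_common(1)[0]  # the strongest node wins
    return value

# Hypothetical knowledge set of weights in lbs.
weights = [180, 200, 150, 200, 175]
nodes = Counter(weights)
prediction = reduce_to_mode(weights)
```

Here `nodes[200]` is 2, so 200 is the strongest node. Ties are broken by whichever value was seen first, which is one arbitrary choice among several.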
This can get really complicated...
K-Means Clustering
i.e. sorting M&Ms by color
clusters inputs into k buckets identified by k different mean averages
infers the mean value of the bucket including or closest to the input
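The two bullets above can be sketched in one dimension as follows. This is an illustration under stated assumptions, not a production implementation: the function names, the evenly spaced seeding, and the fixed iteration count are all hypothetical choices.

```python
def k_means(values, k, iterations=20):
    """Cluster 1-D values into k buckets and return the k bucket means."""
    ordered = sorted(values)
    step = len(ordered) / k
    # Assumption: seed the means with k evenly spaced values from the input.
    means = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for v in values:
            # Put each value in the bucket whose mean is closest.
            nearest = min(range(k), key=lambda i: abs(v - means[i]))
            buckets[nearest].append(v)
        # Recompute each mean; an empty bucket keeps its previous mean.
        means = [sum(b) / len(b) if b else means[i]
                 for i, b in enumerate(buckets)]
    return means

def infer(means, x):
    """Infer the mean of the bucket closest to the input x."""
    return min(means, key=lambda m: abs(x - m))

means = k_means([1, 2, 3, 10, 11, 12], k=2)
guess = infer(means, 4)
```

With this input the two means settle at 2.0 and 11.0, and an input of 4 is inferred as 2.0.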
Bayes Classification
Thomas Bayes' theorem
given some known probabilities, calculate some unknown probability, and infer from it
P(A|B) = P(B|A) * P(A) / P(B)
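A worked instance of the formula, with made-up probabilities purely for illustration: A is "the user likes jazz" and B is "the user played song X".

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)"""
    return p_b_given_a * p_a / p_b

# Hypothetical known probabilities.
p_a = 0.2          # P(A): 20% of users like jazz
p_b_given_a = 0.5  # P(B|A): half of jazz fans played song X
p_b = 0.25         # P(B): a quarter of all users played song X

# Unknown probability: how likely is a user who played song X to like jazz?
p_a_given_b = bayes(p_b_given_a, p_a, p_b)
```

Under these assumed numbers P(A|B) works out to 0.4, so having played song X doubles the estimated chance that a user likes jazz.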
The machine learning algorithms have already been written for you
Machine learning is a "big data" problem. Learn how those guys solve these kinds of problems.
MapReduce - I called it "reduction" for a reason
Apache Pig
Apache Hadoop
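To show why the word "reduction" fits, here is the frequency-chart idea written as a map phase and a reduce phase in plain Python. This is a single-process sketch of the pattern only; it does not use the Hadoop or Pig APIs, and the function names and data chunks are hypothetical.

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    """Map: count the values in one chunk of the knowledge set."""
    return Counter(chunk)

def reduce_phase(left, right):
    """Reduce: merge two partial frequency charts into one."""
    return left + right

# Hypothetical knowledge set, split into chunks as if spread across machines.
chunks = [[200, 180], [200, 150], [175]]
nodes = reduce(reduce_phase, map(map_phase, chunks), Counter())
strongest = nodes.most_common(1)[0][0]
```

Each chunk can be counted independently, which is what lets a real cluster spread the map phase across many machines before merging.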
The Performance Problem
Say your website has 1 million songs. Imagine a grid with 10^6 rows and 10^6 columns. How many pairwise comparisons does that produce?
10^6 * 10^6 = 10^12 (1 trillion*!)
*actually more like half a trillion if you ignore A vs. A and assume (A vs. B) = (B vs. A), but still...
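The arithmetic behind both figures, as a quick check. The variable names are illustrative.

```python
n = 10**6  # 1 million songs

# Every cell in the n-by-n grid, including A vs. A and both orderings.
all_cells = n * n

# Unique pairs only: drop A vs. A and count (A vs. B) and (B vs. A) once.
unique_pairs = n * (n - 1) // 2
```

`all_cells` is 1,000,000,000,000 and `unique_pairs` is 499,999,500,000, which is where "half a trillion" comes from.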
Now add a collaborative analysis: every user is compared to every other user. More input!
How long does it take to calculate all those scores?
How many numbers make up each song's "DNA"?
Where do you store the results?
Hey database guy! You got room for 1 trillion rows lying around somewhere?
How quickly can you look up a result at runtime?
e.g. Generate the top 5 most similar songs.
How often do you need to update your library?
e.g. If you update weekly, but it takes two weeks to structure and reduce the data, you'll never catch up.
A Common Performance Solution - Clustering
Cluster the songs so that similar songs get grouped together. All songs in the same cluster are treated as equally similar to each other.
e.g. If you have 1 million songs, group them into chunks of 100 songs each, leaving 10,000 song-clusters.
Clustering algorithms:
hierarchical
k-means
self-organizing maps
Clustering Benefits
Storage: You only have to store one cluster id per song (1 million ids pointing at 10,000 clusters) instead of 1 trillion sim scores.
Lookup: To find the songs most similar to any given song, randomly select other songs from the same cluster.
Updates: Calculate "average DNAs" for each cluster. Place the new song in the cluster whose average DNA is closest. For the 1,000,001st song, that's 10,000 comparisons instead of 1 million.
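The lookup and update steps above can be sketched as follows, assuming each song's DNA is a list of numbers and that "closest" means Euclidean distance. The function names, cluster ids, and DNA values are all hypothetical.

```python
import math
import random

def average_dna(songs):
    """Average the DNA vectors of the songs in one cluster."""
    dims = len(songs[0])
    return [sum(s[d] for s in songs) / len(songs) for d in range(dims)]

def nearest_cluster(centroids, dna):
    """Update: pick the cluster whose average DNA is closest to the new song."""
    return min(centroids, key=lambda cid: math.dist(centroids[cid], dna))

def similar_songs(members, song, count=5):
    """Lookup: randomly select other songs from the same cluster."""
    others = [s for s in members if s != song]
    return random.sample(others, min(count, len(others)))

# Hypothetical two-number DNAs grouped into two clusters.
clusters = {
    "mellow": [[0.1, 0.2], [0.2, 0.1]],
    "loud": [[0.9, 0.8], [0.8, 0.9]],
}
centroids = {cid: average_dna(songs) for cid, songs in clusters.items()}
home = nearest_cluster(centroids, [0.85, 0.7])
picks = similar_songs(["a", "b", "c", "d"], "a", count=2)
```

Placing a new song costs one distance calculation per cluster, so with 10,000 clusters that is 10,000 comparisons. Lookup never touches a similarity score at all.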