
Machine Learning

(Ruby edition)


Ed Toro


Miami Ruby Brigade, CareCloud, 2012-09-24

How do you teach a machine to learn?
  1. Input.
  2. Data structures.
     e.g. [[1,2,3], [4,5,6], [7,8,9]]
  3. Reduction.
  4. Inference.

Input: visitors to a "guess your weight" site

Data Structure: you can't record everything about a visitor, so store only what's relevant.

Reduction: you can't guess every value you've recorded, so collapse them.
Average weight of all previous visitors: 150 lbs

Inference: What can we infer from the data? Guess 150 lbs for the next visitor.
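Applied to the weight site, the four steps might look like this in Ruby (a minimal sketch; the numbers are made up):

```ruby
# Input: weights reported by previous visitors,
# stored in a simple array (the data structure)
weights = [140, 155, 160, 145]

# Reduction: collapse the knowledge set to one number, the mean
mean = weights.inject(:+) / weights.size.to_f  # => 150.0

# Inference: guess the mean for the next visitor
guess = mean.round  # => 150
```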

Ruby Data Structures
Request/Response Cycle - Feedback
  1. Receive user request for data
  2. Infer a response based on some prior knowledge reduction
  3. Receive user request with feedback
  4. Add the feedback to the existing body of knowledge
  5. Reduce
Request/Response Cycle - Training
  1. Collect user data
  2. Build data structures
  3. Reduce

  1. Receive user request
  2. Infer a response based on prior knowledge and training

Simple example: Inference by mean average



The mean average "learns". When it's wrong, it adapts.
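The slide's code didn't survive this transcript; a minimal sketch of the idea:

```ruby
# Knowledge set: weights of all previous visitors
knowledge = [140, 160]

# Inference: guess the mean of everything seen so far
guess = knowledge.inject(:+) / knowledge.size.to_f  # => 150.0

# Feedback: the visitor tells us their real weight...
actual = 180

# ...and adding it to the knowledge set shifts every future guess
knowledge << actual
next_guess = knowledge.inject(:+) / knowledge.size.to_f  # => 160.0
```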

Faster simple example: Inference by mean average



The knowledge set is big data. The ability to do incremental updates is key to machine learning performance.
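One way to make that fast (a sketch, not the slide's original code): keep a running count and mean, and fold each new value in incrementally instead of re-summing the whole knowledge set:

```ruby
# Running state: how many values we've seen, and their mean
count = 2
mean  = 150.0  # mean of [140, 160]

# Incremental update: O(1) per new value, no matter how big
# the knowledge set gets
new_weight = 180
count += 1
mean  += (new_weight - mean) / count  # => 160.0
```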

Another simple example: genetic algorithm


# Serve whichever guess has the highest success score so far
best_guess = retrieve_most_successful_guess
response.inference = best_guess

# Feedback from the visitor adjusts the score
if we_guessed_correctly?
  best_guess.success += 1
else
  best_guess.success -= 1
end

best_guess.save

Use whichever guess is the most successful "in the wild". The visitor has to let us know if we were correct.

Complex example: neural network


nodes is a frequency chart. If 200lbs. appears twice, nodes[200] is 2. The strongest node wins. The reduction is the mode average of the knowledge set.
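The frequency chart described above can be sketched with a Ruby hash that defaults missing keys to zero (the weights are made up):

```ruby
# nodes is a frequency chart: nodes[weight] counts how many
# times that weight has appeared in the knowledge set
nodes = Hash.new(0)
[200, 150, 200, 180].each { |weight| nodes[weight] += 1 }
# nodes[200] => 2

# The strongest node wins: the mode of the knowledge set
mode = nodes.max_by { |weight, count| count }.first  # => 200
```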

This can get really complicated...


The machine learning algorithms have already been written for you
Stop inventing "new" machine learning algorithms for your site. Just pick one!

OMGWTF! There are so many. Which one do I pick?

All of the above!

The Meta-Learning Algorithm


# The same feedback loop, but run over whole learning algorithms
best_learning_algo = retrieve_most_successful_algorithm
response.inference = best_learning_algo.guess(user_data)

if we_guessed_correctly?
  best_learning_algo.success += 1
else
  best_learning_algo.success -= 1
end

best_learning_algo.save

This is the genetic algorithm applied to learning algorithms. Implement them all, then pick the most successful one.

The "best"? The "most successful"?

A "Bestest" algorithm - a weighted random choice



algos = { kmeans: 50, mean: 25, bayes: 20, neural: 5 }

def weighted_choice(algos)
  sum = algos.values.inject(:+)
  needle = rand(sum)
  algos.sort_by { |key, val| -val }.each do |key, val|
    needle -= val
    return key if needle < 0
  end
end

FAIL! All weights must be > 0 for that to work, so clamp the score when decrementing:
success = [0, success - 1].max

The algorithms may be free (or at least "cloud cheap"), but the data isn't. Stop worrying about algorithms. It's the data, stupid!
There are two types of data
  • Content
  • Collaborative
Content Data

It's data that describes the content on your site. When two pieces of content are similar, we infer a user who likes one will also like the other.
Collaborative Data
It's data describing the visitors to your site. When two visitors are similar, we infer that they'll like the same stuff.
"Numerification"

Data needs to be numerical so you can do math on it.

When possible, "numerification" should be meaningful. Labels that are related map to numbers that are closer.
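For instance (an illustrative mapping, not from the slides): shirt sizes are ordered labels, so their numbers should preserve that order:

```ruby
# Related labels map to nearby numbers: "small" really is
# closer to "medium" than it is to "large"
SIZES = { 'small' => 1, 'medium' => 2, 'large' => 3 }

SIZES['medium'] - SIZES['small']  # => 1
SIZES['large']  - SIZES['small']  # => 2
```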
Sim Scores

The fully numerified description of a piece of content or a visitor is its "DNA".

['gender', 'age', 'location_lat', 'location_long']

becomes

[0, 30, 26, -80]

The similarity score between two pieces of content is the "distance" between these two "points". The lower the score, the more similar the content.

Sim Score Algorithm

# male, 30yo, Miami (lat, long)
person1 = [0, 30, 25.8, -80.2]
# female, 25yo, Ft. Lauderdale (lat, long)
person2 = [1, 25, 26.1, -80.1]

score = Math.sqrt(person1.zip(person2).map { |a, b| a - b }.map { |d| d * d }.inject(:+))
# score is approx 5

This magic formula represents the Euclidean distance as given by the Pythagorean theorem.

score**2 = (a1-a2)**2 + (b1-b2)**2 + (c1-c2)**2 + (d1-d2)**2
Performance

Machine learning is a "big data" problem. Learn how those guys solve these kinds of problems.


The Performance Problem

Say your website has 1 million songs. Imagine a grid with 10^6 rows and 10^6 columns. How many pairwise comparisons does that produce?

10^6 * 10^6 = 10^12 (1 trillion*!)

*actually more like half a trillion if you ignore A vs. A and assume (A vs. B) = (B vs. A), but still...
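The arithmetic, checked in Ruby:

```ruby
n = 1_000_000

# Every song against every song, including A vs. A
all_pairs = n * n  # => 1_000_000_000_000

# Unique unordered pairs: ignore A vs. A and count
# (A vs. B) and (B vs. A) once -- "half a trillion"
unique_pairs = n * (n - 1) / 2  # => 499_999_500_000
```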

Now add a collaborative analysis: every user is compared to every other user. More input!

A Common Performance Solution - Clustering

Cluster the songs so that similar songs get grouped together. All songs in the same cluster are equally similar to each other.

e.g. If you have 1 million songs, group them into chunks of 100 songs each, leaving 10,000 song-clusters.

Clustering algorithms (e.g. k-means) handle the grouping.
Clustering Benefits

Storage: You only have to store 10,000 cluster ids instead of 1 trillion sim scores.

Lookup: To find the songs most similar to any given song, randomly select other songs from the same cluster.

Updates: Calculate "average DNAs" for each cluster. Place the new song in the cluster whose average DNA is closest. For the 1,000,001st song, that's 10,000 comparisons instead of 1 million.
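The update step above can be sketched as nearest-centroid assignment (the centroids, DNA fields, and helper are illustrative):

```ruby
# Euclidean distance between two DNA vectors
def distance(a, b)
  Math.sqrt(a.zip(b).map { |x, y| (x - y)**2 }.inject(:+))
end

# "Average DNA" for each cluster: [tempo, loudness]
centroids = {
  rock: [120.0, 0.8],
  folk: [ 90.0, 0.3]
}

# Place the new song in the cluster whose average DNA is
# closest: one comparison per cluster, not per song
new_song = [118.0, 0.7]
cluster  = centroids.min_by { |id, dna| distance(new_song, dna) }.first
# => :rock
```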

Johnny 5 can learn!

Thanks for listening!
@eddroid
github.com/eddroid
Image Credits:

Google Images:

http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/

Can I use these images? We beat that SOPA/PIPA thing, right?
Extra Credit
