
Machine Learning

(Python edition)


Ed Toro

How do you teach a machine to learn?
  1. Input.
  2. Data structures. e.g. [[1,2,3], [4,5,6], [7,8,9]]
  3. Reduction.
  4. Inference.

Input: visitors to a "guess your weight" site

Data structure: you can't record everything about a visitor

Reduction: you can't work with every raw value you've recorded, so reduce them to something small, e.g. the average weight of all previous visitors: 150 lbs

Inference: what can we infer from the data? What weight do we guess?

Python Data Structures
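
A minimal sketch of the kind of structures meant here, assuming visitors are recorded as dicts in a list (the field names are illustrative):

knowledge = [
  {'gender': 'male', 'age': 30, 'weight': 180},
  {'gender': 'female', 'age': 25, 'weight': 130},
]

# Reduction: collapse the set to a single value we can guess with.
average_weight = sum(v['weight'] for v in knowledge) / len(knowledge)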
Request/Response Cycle - Feedback
  1. Receive user request
  2. Infer a response based on the prior-knowledge data structures
  3. Receive user request with feedback
  4. Combine user feedback with existing body of knowledge
  5. Reduce
Request/Response Cycle - Training
  1. Collect user data
  2. Build data structures
  3. Reduce
  4. Receive user request
  5. Infer a response based on prior knowledge

Simple example: Inference by mean average
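
A minimal sketch, with an assumed list of past visitor weights standing in for the knowledge set:

weights = [140, 150, 160]  # knowledge set: weights of past visitors

def guess():
  return sum(weights) / len(weights)

print(guess())       # 150.0
weights.append(190)  # feedback: the visitor's actual weight
print(guess())       # 160.0 - the guess adapts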



The mean average "learns". When it's wrong, it adapts.

Faster simple example: Inference by mean average
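
A sketch of the same guess with an incremental update; a running total and count (assumed here) are the reduced knowledge set, so each new visitor costs O(1) instead of a re-scan:

total, count = 450, 3  # reduced knowledge: running total and count

def guess():
  return total / count

def learn(actual_weight):
  global total, count
  total += actual_weight
  count += 1

print(guess())  # 150.0
learn(190)
print(guess())  # 160.0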



The knowledge set is big data. The ability to do incremental updates is key to machine learning performance.

Another simple example: genetic algorithm


# retrieve_most_successful_guess(), we_guessed_correctly(), and
# response are placeholders for your storage layer and request cycle.
best_guess = retrieve_most_successful_guess()
response.inference = best_guess

if we_guessed_correctly():
  best_guess.success += 1  # reward a correct guess
else:
  best_guess.success -= 1  # reduce its fitness when it fails

best_guess.save()

Use whichever guess is the most successful "in the wild". When a guess fails, reduce its fitness, allowing another best guess to possibly arise. The visitor has to let us know if we were correct.

Complex example: neural network
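
A minimal sketch of the frequency chart described below, using collections.Counter (the sample weights are invented):

from collections import Counter

observed = [150, 200, 160, 200]  # weights seen so far
nodes = Counter(observed)        # frequency chart: value -> count

# The strongest node wins: guess the most frequent value.
best_weight, strength = nodes.most_common(1)[0]
print(best_weight)  # 200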


nodes is a frequency chart. If 200 appears twice, nodes[200] == 2. The strongest node wins. It's the mode average of the knowledge set. It's just another averaging function!

This can get really complicated...


Good news - the machine learning algorithms have already been written for you.
Stop inventing "new" machine learning algorithms for your site. Just pick one!

OMGWTF! There are so many. Which one do I pick?

All of the above!

The Meta-Learning Algorithm


# The genetic algorithm again, but the "genes" are whole learning
# algorithms. The helpers are placeholders, as before.
best_learning_algo = retrieve_most_successful_algorithm()
response.inference = best_learning_algo.guess(user_data)

if we_guessed_correctly():
  best_learning_algo.success += 1
else:
  best_learning_algo.success -= 1

best_learning_algo.save()

It's the genetic algorithm applied to other learning algorithms. Implement all the learners, store them somewhere, and order them by how successful they are.

The "best"? The "most successful"?

A "Bestest" algorithm - a weighted random choice



from random import uniform

# success weights: higher means chosen more often
algos = { 'kmeans': 50, 'mean': 25, 'bayes': 20, 'neural': 5 }

# drop a needle somewhere along the total weight, then walk the
# algorithms until the needle lands inside one of them
needle = uniform(0, sum(algos.values()))
for name in sorted(algos, key=algos.get):
  needle -= algos[name]
  if needle <= 0:
    break
# name now holds the weighted random choice

FAIL! That only works if no weight is negative. The success counters above can drift below zero, so clamp them when they lose:

success = max(0, success - 1)

The algorithms may be free or cloud-cheap, but the data isn't. Stop worrying about algorithms. It's the data, stupid!
There are two types of data
  • Content
  • Collaborative
Content Data

It's data that describes the content on your site. When two pieces of content are similar, we infer a user who likes one will also like the other.
Collaborative Data
It's data describing the visitors to your site. When two visitors are similar, we infer that they'll like the same stuff.
"Numerification"

Data needs to be numerical so you can do math on it.

When possible, "numerification" should be meaningful: related labels map to numbers that are close together.
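
A sketch of the difference; the genre codes here are invented for illustration:

# Meaningless numerification: alphabetical ids put unrelated genres side by side.
genre_id = { 'classical': 0, 'jazz': 1, 'metal': 2, 'rock': 3 }

# Meaningful numerification: related genres get nearby numbers.
genre_code = { 'classical': 0.0, 'jazz': 2.0, 'rock': 8.0, 'metal': 9.0 }

print(abs(genre_code['rock'] - genre_code['metal']))      # 1.0 - similar
print(abs(genre_code['rock'] - genre_code['classical']))  # 8.0 - not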
Sim Scores

The fully numerified description of a piece of content or a visitor is its "DNA".

['gender', 'age', 'location_lat', 'location_long']

becomes

[0, 30, 26, -80]

The similarity score between two pieces of content is the "distance" between these two "points". The lower the score, the more similar the content.

Sim Score Algorithm

import numpy
# male, 30, Miami -> [gender, age, lat, long]
person1 = numpy.array((0, 30, 25.8, -80.2))
# female, 25, Ft. Lauderdale
person2 = numpy.array((1, 25, 26.1, -80.1))

score = numpy.linalg.norm(person1 - person2)
# score is approx 5.1

In numpy, the distance between the two points is the norm of the vector between those points. You may remember it better as the Euclidean distance as given by the Pythagorean theorem.

score^2 = (a1-a2)^2 + (b1-b2)^2 + (c1-c2)^2 + (d1-d2)^2
Performance

Machine learning is a "big data" problem. Learn how those guys solve these kinds of problems.
(Hints: MapReduce - I called it "reduction" for a reason, Apache Pig, Apache Hadoop, etc.)
The Performance Problem

Say your website has 1 million songs. Imagine a grid with 10^6 rows and 10^6 columns. How many pairwise comparisons does that produce?

10^6 * 10^6 = 10^12 (1 trillion*!)

*actually more like half a trillion if you ignore A vs. A and assume (A vs. B) = (B vs. A), but still...

Now add a collaborative analysis: every user is compared to every other user. More input!

A Common Performance Solution - Clustering

Cluster the songs so that similar songs get grouped together. All songs in the same cluster are equally similar to each other.

e.g. If you have 1 million songs, group them into chunks of 100 songs each, leaving 10,000 song-clusters.

Algorithm options: k-means, among others.
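
A minimal sketch using scikit-learn's KMeans (scikit-learn is an assumed dependency; the data and cluster count are stand-ins):

import numpy
from sklearn.cluster import KMeans

songs = numpy.random.rand(1000, 4)  # each row: a song's numerified "DNA"

kmeans = KMeans(n_clusters=10).fit(songs)

print(kmeans.labels_[:5])           # cluster id per song
print(kmeans.cluster_centers_[0])   # the "average DNA" of cluster 0

new_song = numpy.random.rand(1, 4)
print(kmeans.predict(new_song))     # assign a new song: 10 comparisons, not 1000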
Clustering Benefits

Storage: You only have to store one cluster id per song, plus 10,000 cluster "average DNAs", instead of a trillion sim scores.

Lookup: To find the songs most similar to any given song, randomly select other songs from the same cluster.

Updates: Calculate "average DNAs" for each cluster. Place the new song in the cluster whose average DNA is closest. For the 1,000,001st song, that's 10,000 comparisons instead of 1 million.
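
The update step as a numpy-only sketch (the array contents are stand-ins):

import numpy

avg_dnas = numpy.random.rand(10000, 4)  # one "average DNA" per cluster
new_song = numpy.random.rand(4)         # the 1,000,001st song's DNA

# 10,000 sim scores instead of 1,000,000: distance to each cluster average.
scores = numpy.linalg.norm(avg_dnas - new_song, axis=1)
best_cluster = scores.argmin()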

Johnny 5 can learn!

Thanks for listening!
@eddroid
github.com/eddroid
Image Credits:

Google Images

http://eli.thegreenplace.net/2010/01/22/weighted-random-generation-in-python/

We beat that SOPA/PIPA thing, right?
Extra Credit
