When it comes to natural language processing (NLP) – how do we find how likely a word is to appear in the context of another word using machine learning?
We have to convert these words to vectors via word embedding. Word embedding provides a lower-dimension vector representation of words while preserving the relationship / meaning between words. Here’s a good answer to what is word embedding.
In terms of machine learning terminology, the term word embedding is
“the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.” – Wikipedia
So when it comes to NLP (natural language processing) – stuff related to analyzing words – one popular unsupervised learning technique for building word embeddings is Word2Vec.
There are two main algorithms for implementing word2vec:
Continuous Bag of Words (CBOW) – is a type of word2vec implementation that predicts a word given the context word(s) around it
Example: The cat sat on the (mat) – given “the cat sat on the” – predict “mat”
Whereas Skip-Gram – is a type of word2vec implementation for predicting other words near a given word
Example: Given cat, predict that the words (the, sat, on, the, mat) – generally appear nearby
Skip-Gram looks at a window of words to the left and to the right of the word cat – and pairs each of them with cat as a training example.
We’ll look specifically at the Skip-Gram version of Word2Vec today.
In Skip-Gram, we look at the context of a focus word.
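To make the "window around a focus word" idea concrete, here is a minimal sketch in plain Python. The function name skipgram_pairs and the window size of 2 are my own assumptions for illustration, not something fixed by word2vec.

```python
def skipgram_pairs(words, window=2):
    """Return (focus, context) pairs for every word within `window` positions."""
    pairs = []
    for i, focus in enumerate(words):
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((focus, words[j]))
    return pairs


sentence = "the cat sat on the mat".split()
pairs = skipgram_pairs(sentence, window=2)

# Context words seen within 2 positions of "cat"
cat_context = [context for focus, context in pairs if focus == "cat"]
print(cat_context)  # ['the', 'sat', 'on']
```

With a bigger window, "cat" would also get paired with words further away, such as "mat".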
Remember that when we use machine learning – it’s all about turning words into numbers, or rather vectors of numbers, and making sense of the relationship between these vector representations of words.
Word2vec converts words into vectors:
Once the words are mapped into this vector space, the location of a word relative to another word gives us the relationship between them.
Normally, you have 1 dimension for each word. So you may have a large vector of 50,000 spots, all of them are 0 — except there’s a 1 in one of them – representing that one word out of the 50,000+ words.
But this big 50,000 x 1 vector of mostly 0’s doesn’t really tell us much about words. All it tells us is that we are representing a single word, say “cat” – with a “1” – and all other 50,000 words are “0’s”
normal vector representing the single word “cat”: [ 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 …….]
It’s also highly inefficient from a math and timing perspective to work with large matrices/vectors that are mostly 0’s when we can likely do the same ( and more) with smaller-sized vectors.
So what word2vec – and these 2 associated algorithms above do — is they provide a BETTER mathematical representation of that word “cat.”
Instead of having a big-ass vector array of mostly 0’s, word2vec instead maps each individual word into some theoretical multi-dimensional map – so that how “close” this word is to another word – can represent some relationship that word has with another word.
For example, the word “cat” – might be located at some random coordinate point [2, 4, 3, 1, 5], and similar words such as “dog” might be close on certain dimensions, say [2, 4, 3, 8, 9].
I just made up those numbers — but the point is the machine can sort of graphically plot these words in relation to each other across tons and tons of dimensions.
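One way to put a number on how "close" two word vectors are is cosine similarity. That choice is my assumption here (it is a common measure, but nothing above requires it). The "car" vector is made up purely for comparison.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


cat = [2, 4, 3, 1, 5]  # made-up coordinates from the text
dog = [2, 4, 3, 8, 9]  # made-up coordinates from the text
car = [9, 1, 7, 2, 0]  # hypothetical unrelated word, invented for this sketch

print(round(cosine_similarity(cat, dog), 2))  # 0.84
print(cosine_similarity(cat, dog) > cosine_similarity(cat, car))  # True
```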
We are used to visualizing things in 3-dimensional space. In the example above I picked 5 numbers, so that’s 5 dimensions. But the kind of dimensions I’m talking about is more like hundreds of dimensions. So the vector representing each word is still pretty big.
However, in terms of dimensions, it’s not nearly as many dimensions as what we had before.
Recall that before, we may have had 50,000 dimensions — one for each word that we have in our imaginary dictionary.
Here with word2vec – the number of dimensions is sort of a parameter of your choosing – that doesn’t have to be as high as 50,000. In fact, the representation can be highly effective with significantly fewer dimensions. How many fewer? I’ve seen examples that reference between 200 and 500 dimensions.
But the idea is that word2vec allows you to better represent the same word “cat” – with fewer dimensions – and in a more effective way, because its vector representation has some “meaning” thanks to its proximity / relationship to other related words such as “dog”, “pet”, etc.
First, the “one-hot” vector representation is basically a long vector where you have a “1” on the “hot spot” – and “0” everywhere else.
So let’s say you have the words “The big fat cat sneezed”
Since “cat” is the 4th word, then your “one-hot” vector representation for “cat” would look like:
[0 0 0 1 0 ]
This is a 1 x 5 (1 row, 5 columns) one-hot vector representation of the word cat.
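Here is a minimal sketch of building that one-hot vector in plain Python. The helper name one_hot is just something I picked for illustration.

```python
def one_hot(word, vocab):
    """Return a list of 0's with a single 1 at the position of `word`."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec


vocab = "The big fat cat sneezed".split()
print(one_hot("cat", vocab))  # [0, 0, 0, 1, 0]
```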
In order to make a better mathematical representation, you make this 1 x 5 vector go through a “hidden layer” by multiplying it by a (5 x ?) weight matrix – such that you get a (1 x ?) vector, where ? = number of features.
So that (? = number of features) can be different from the 5 columns we had before, where each column represented a different word.
The number of features is a parameter of your choosing. So if we pick 3 features as our parameter, then the matrix multiplication would look something like this:
[ 0 0 0 1 0 ] = original representation of the word “cat”
[ 10 12 19 ] = new word2vec representation of the word “cat” – with fewer dimensions
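The multiplication can be sketched in plain Python like this. Only the “cat” row [10, 12, 19] comes from the example. The other rows of the matrix are numbers I made up so there is something to multiply against.

```python
def vec_mat_mul(vec, matrix):
    """Multiply a (1 x n) row vector by an (n x m) matrix, giving a (1 x m) vector."""
    n_cols = len(matrix[0])
    return [
        sum(vec[i] * matrix[i][j] for i in range(len(vec)))
        for j in range(n_cols)
    ]


# 5 words x 3 features. Every row except "cat" is invented for this sketch.
embedding_matrix = [
    [17, 24, 1],   # "The"
    [23, 5, 7],    # "big"
    [4, 6, 13],    # "fat"
    [10, 12, 19],  # "cat"
    [11, 18, 25],  # "sneezed"
]

cat_one_hot = [0, 0, 0, 1, 0]
cat_vec = vec_mat_mul(cat_one_hot, embedding_matrix)
print(cat_vec)  # [10, 12, 19]
```

Because the one-hot vector is all 0's except for one 1, the multiplication simply picks out the 4th row of the matrix.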
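In terms of TensorFlow code (1.x API),
import tensorflow as tf

# int_to_vocab, train_graph and inputs are assumed to be defined elsewhere
n_vocab = len(int_to_vocab)  # Number of unique words
n_embedding = 3 # Number of embedding features
with train_graph.as_default():
    # (n_vocab x n_embedding) matrix of random values between -1 and 1
    embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
    # Select the rows of the matrix that match the integer word ids in inputs
    embed = tf.nn.embedding_lookup(embedding, inputs)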
The number of words from “The big fat cat sneezed” – is 5.
Since int_to_vocab maps each integer id back to its word, with one entry per unique word, taking its length gives us the total number of unique words.
Hence, n_vocab = len(int_to_vocab) = 5.
I mentioned earlier that the number of features is a parameter of our choosing – that’s where n_embedding = 3 comes into play. We define this.
The code below creates the middle portion of the diagram above: the matrix that has the number of words (n_vocab) as the number of rows, and the number of embedding features (n_embedding) as the number of columns.
embedding = tf.Variable(tf.random_uniform((n_vocab, n_embedding), -1, 1))
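These two parameters (number of words n_vocab and number of embedding features n_embedding) are passed as the shape into the TensorFlow random uniform generating function above, and the resulting matrix of random values between -1 and 1 is stored in “embedding.”
Then, TensorFlow has this other function, tf.nn.embedding_lookup, that lets you take that middle matrix we just created and combine it with our input word to create our new input. Rather than actually multiplying by the 1×5 vector (of mostly 0’s), it takes the integer id of the word and picks out the matching row of the matrix, which gives the same result as the multiplication.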
embed = tf.nn.embedding_lookup(embedding, inputs)
That new input – [ 10 12 19] – in our example above – is now stored into “embed” – which we will later feed into a softmax function.
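If you don’t have TensorFlow handy, the same two steps can be imitated in plain Python. This is only a stand-in I wrote for illustration: the embedding_lookup function below is my own hypothetical helper, not the TensorFlow one, and the fixed seed is there just to make the sketch repeatable.

```python
import random

rng = random.Random(0)  # fixed seed, only so the sketch is repeatable
n_vocab, n_embedding = 5, 3

# Stand-in for tf.random_uniform((n_vocab, n_embedding), -1, 1):
# an (n_vocab x n_embedding) matrix of random values between -1 and 1
embedding = [
    [rng.uniform(-1, 1) for _ in range(n_embedding)]
    for _ in range(n_vocab)
]


def embedding_lookup(matrix, ids):
    """Hypothetical helper: return the rows of `matrix` selected by integer ids."""
    return [matrix[i] for i in ids]


cat_id = 3  # "cat" is the 4th word, so its zero-based id is 3
embed = embedding_lookup(embedding, [cat_id])
print(embed[0])  # the 3 random starting features for "cat"
```

The values printed here are random starting points, not the [ 10 12 19 ] from the example. They only become useful after training.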
So we basically “converted” a (1 x 5) vector representation of the word “cat” – into a (1 x 3) vector representation of the word “cat.”
The numbers in the diagram (the embedding/hidden layer) don’t actually mean anything at this point, since they start out as random values.
The point is that you turned a vector that was mostly 0’s and just one “1” into a smaller dimensional vector representation of that same word, where every number is (generally) non-zero and can come to mean something once we train it.
So effectively, we are reducing the size of the vectors we are working with, but making them more effective because each number matters, whereas before each number was mostly a 0.
We passed a (1 x 5) vector through a “hidden / embedding layer” – the size of which is controlled by a parameter of our choosing. We chose a parameter (3) that is smaller than the number of words in our example (5) – so the effect would be reducing our 1×5 vector into a 1×3 vector.
So now we have our new input vector for the word “cat” – one that we already fed through a hidden layer and reduced the size of.
Our vector now looks smaller – something like this: [10 12 19]
But what does that tell us about the word “cat?”
Well, the numbers 10, 12, 19 need to be tweaked in relation to other words so we can make sense of their relationship.
In order to find the relationship between words, we need to multiply this new “cat” vector with weight vectors for other words – to get the relationship between them.
What these words are – depends on what we feed it in the training data set. If we feed the machine lots of examples of the word “cat” – in context of various words, say, in a sentence – the machine can start putting together a probability distribution of how likely a certain word is to appear to the left or to the right of the word “cat.”
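To get a feel for the kind of distribution we are after, here is a sketch that simply counts neighbors of “cat” in a tiny made-up corpus. To be clear, this counting is not how word2vec works (it learns the distribution through training). The corpus, the function name and the window of 1 are all assumptions for illustration.

```python
from collections import Counter


def context_distribution(sentences, focus, window=1):
    """Estimate P(context word | focus word) by counting neighbors."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for i, word in enumerate(words):
            if word != focus:
                continue
            lo = max(0, i - window)
            hi = min(len(words), i + window + 1)
            counts.update(words[j] for j in range(lo, hi) if j != i)
    total = sum(counts.values())
    if total == 0:  # focus word never appeared
        return {}
    return {w: c / total for w, c in counts.items()}


corpus = [  # made-up training sentences
    "the big fat cat sneezed",
    "the fat cat sat",
    "a big cat sneezed",
]
dist = context_distribution(corpus, "cat")
print(dist)
```

In this toy corpus, “fat” and “sneezed” each show up next to “cat” twice out of six neighbor slots, so they get the highest probabilities.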
For each word that is near the word “cat” in our many examples found in the training data set – we can sort of pre-define a weight matrix that is of the correct dimensions and initialize it with random values. Then we adjust these weights through the use of fancy calculus math (gradient descent) – such that we get a probability distribution that accurately represents the likelihood of each word appearing in the immediate vicinity of the word “cat.”
Recall our hypothetical word2vec input vector for “cat” from before is:
[ 10 12 19]
This represents the blue horizontal vector below. The diagram says it’s for “ants” – but let’s say it’s for “cat”
Now, if we want the relationship between “cat” (blue) and another word “car” (red) – we first need a weight vector that is 3 rows x 1 column. This is the dimension we need if our input vector is 1 row x 3 columns – with the 3 representing the parameter we chose (the number of features).
So we are multiplying a (1 x 3) input vector for “cat” with a (3 x 1) output weight vector for a different word “car” – such that we get a single value (1 x 1).
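In plain Python that multiplication is just a dot product. The weights for “car” below are hypothetical, and I picked them so the result lands on a nice round number.

```python
cat_vec = [10, 12, 19]         # (1 x 3) input vector for "cat"
car_weights = [0.3, 0.1, 0.2]  # (3 x 1) output weights for "car", made up

# Multiply matching positions and add them up: a (1 x 1) result
score = sum(h * w for h, w in zip(cat_vec, car_weights))
print(round(score, 6))  # 8.0
```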
That single value – let’s say it’s 8 — needs to be in the form of a decimal — such that that decimal, when summed up with all other decimals corresponding to each of the other relevant words — sums to 1.
So the relationship between “cat” and “car” – might be = .15
The relationship between “cat” and “fat” might be = .30
The relationship between “cat” and “big” might be = .25
The relationship between “cat” and “sneezed” might be = .30
— such that the sum of these probabilities = .15 + .30 + .25 + .30 = 1
Squashing the numerical value “8” for the relationship between “cat” and “car” – as well as whatever numerical values we get for the relationship between all other words and “cat” – into decimals that sum to 1 is what the softmax function does. That’s the fancy math equation you see up there in the grey box.
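Here is a minimal softmax sketch in plain Python. The score of 8.0 for “car” is from the example. The other three scores are numbers I reverse-engineered so the output comes out close to the probabilities listed above, so treat them as assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


words = ["car", "fat", "big", "sneezed"]
scores = [8.0, 8.6931, 8.5108, 8.6931]  # made-up raw scores against "cat"
probs = softmax(scores)

for word, p in zip(words, probs):
    print(word, round(p, 2))
```

Softmax exponentiates each score and divides by the sum of all the exponentiated scores, which is why bigger scores end up with bigger probabilities and everything sums to 1.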
We took our initial “one-hot” vector representation of the word “cat” where we have a “1” in the 4th column – representing the 4th word in a training data set example of 5 different words.
We took this [ 0 0 0 1 0 ] vector representation and multiplied it by a matrix whose number of columns is a parameter of our choosing (this is the feature parameter that we specify). Usually this feature parameter will be smaller than the number of words – hence the feature parameter = 3 is smaller than the original number of words = 5.
The result is we get a 1×3 vector, which is smaller than our original 1×5 vector. That’s a good thing in terms of math efficiency: we accomplish the same thing with a smaller-sized vector. In fact, we accomplish more, because we now set it up such that we can form and develop relationships between words – something we were not able to do before.
Now that we have our word2vec representation of the word “cat”, we next multiply it by a weight vector of matching dimension (3 x 1), filled with random placeholder values – such that when we multiply the two together, we get a single numerical value.
We get this single numerical value – one for each of the words that may be associated with our input word “cat.”
The larger the value, the more likely we will see this word in context of our input word “cat.”
Once we go through all the training data – these values, or weights, will be adjusted such that when we are done with training, we will have values that accurately represent how likely we are to see a word in the immediate vicinity (to the left or to the right) of the word “cat.”
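To show what "adjusting the weights" can look like, here is a rough sketch of a single gradient descent step in plain Python. It is heavily simplified and rests on my own assumptions: the weights start at 0 instead of random values, only the output weights are updated (real training also updates the embedding matrix), and the learning rate is arbitrary.

```python
import math


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(hidden, out_weights, target, lr=0.001):
    """One gradient descent step for one (focus, context) training pair.

    Returns the probabilities computed BEFORE the weights were updated.
    """
    words = list(out_weights)
    scores = [sum(h * w for h, w in zip(hidden, out_weights[word])) for word in words]
    probs = softmax(scores)
    for word, p in zip(words, probs):
        # Gradient of the cross-entropy loss with respect to this word's score
        error = p - (1.0 if word == target else 0.0)
        out_weights[word] = [
            w - lr * error * h for w, h in zip(out_weights[word], hidden)
        ]
    return dict(zip(words, probs))


hidden = [10.0, 12.0, 19.0]  # the "cat" vector from the example
out_weights = {w: [0.0, 0.0, 0.0] for w in ["car", "fat", "big", "sneezed"]}

# Show the model the pair ("cat", "fat") twice and compare its predictions
before = train_step(hidden, out_weights, target="fat")
after = train_step(hidden, out_weights, target="fat")
print(before["fat"], after["fat"])
```

After seeing “fat” next to “cat” once, the predicted probability for “fat” goes up and the probabilities for the other words go down. Repeat that over lots of training pairs and the weights settle into a sensible distribution.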
The proportion of this value (after exponentiating it) in relation to the sum of all the exponentiated values for other words – represents the probability distribution (softmax).
If we have 4 different words, then we would repeat the above diagram 4 times, and our end result would be a 4 rows x 1 column vector representing probabilities of each word in relation to our input word “cat.”
So the probability vector representation might look like this:
[.15
.30
.25
.30]
The first row (.15) might represent “car”, the second row (.30) might represent “fat” — and so on.
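Putting the pieces together, "repeating the diagram 4 times" amounts to stacking the four weight vectors into one matrix and scoring them all in one go. This is a sketch under my own assumptions: the function name is hypothetical and all of the weights are invented, so the probabilities it prints will not match the example numbers.

```python
import math


def predict_context_probs(hidden, output_matrix):
    """Score `hidden` against every row of weights, then apply softmax."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in output_matrix]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


cat_vec = [10, 12, 19]
output_matrix = [  # one row of made-up weights per candidate word
    [0.30, 0.10, 0.20],  # "car"
    [0.05, 0.02, 0.42],  # "fat"
    [0.10, 0.25, 0.23],  # "big"
    [0.20, 0.15, 0.26],  # "sneezed"
]

probs = predict_context_probs(cat_vec, output_matrix)
print(probs)  # 4 probabilities, one per word, summing to 1
```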
We have just implemented the Skip-Gram algorithm / architecture of word2vec – which is one of the 2 algorithms of word2vec — this one focused on finding words that are close to the word “cat”.
Word embedding is the collective term used to describe mapping words to vectors as we did above for natural language processing (NLP).
We enhanced the “one-hot” vector representation – and reduced the dimension size to a parameter of our choosing by running it through (multiplying it by) a hidden layer.
We then got a smaller input vector that no longer has mostly 0’s.
We multiplied this vector by placeholder weight vectors – one for each other word whose relationship with our input word we are trying to find.
We then generate a probability distribution through a Softmax function – that end result might look like this:
[.15
.30
.25
.30]
So now we have trained the machine to be able to predict (via a probability distribution vector) how likely each relevant word is to appear in context of another (input) word.
That’s pretty cool.
Allen is an entrepreneur and amateur machine learning enthusiast. His career has spanned management consulting with Booz & Co to derivatives trading on Wall Street and even mobile product management at TripAdvisor. But his biggest impact has been as founder of GMATPill.com – an online GMAT course that has helped thousands of students pass the rigorous GMAT exam used in MBA admissions. He received a B.S. in Management Science from Stanford University with a focus on Finance and Decision engineering. He was actually in the audience during Steve Jobs’ now famous graduation speech. This trading education blog is partly a result of the inspiration from that speech. One day, Allen hopes to incorporate machine learning into stock market pattern recognition so he can automate a lot of the manual pattern recognition that he is doing on a daily basis.