Word2Vec: A Smart Overview

In this blog, we will answer the following questions:

  1. What is Word2Vec?
  2. Why is Word2Vec needed?
  3. How does it work?
  4. How to implement Word2Vec?

In natural language processing, we need to convert the given sentences into some numerical representation. As you may know, we convert these sentences, or rather the words in them, into vector representations. Choosing how to build these vectors is the tricky part.

The main point of writing this blog is to give a bird's-eye view of the concept. This blog is different from my other blogs: it is written purely for practical purposes. After reading it, you will have enough understanding of the concept to implement it without any hurdles. But this blog will have failed if it doesn't encourage you to dive deeper into all the concepts covered. Don't let me stop your curiosity.


Naïve approaches and TFIDF

Consider the following sentence and how we might represent its words.

Learning machine learning without understanding maths is like having sex without orgasm.

There are a total of 12 words. After removing the stop word 'is', we have 11 words.
Note that 'Learning' and 'learning' can be considered two separate words, as they can carry different contexts.

We can represent these words using one-hot encoding. For example, the word 'machine' gets the representation:

[0,1,0,0,0,0,0,0,0,0,0]

The above representation may work for this particular sentence, but as the size of the corpus increases, representing each word this way becomes problematic. This kind of representation may work for sentence-level understanding, but it will not extend to understanding entire documents. In industry, most NLP tasks involve document classification, and clearly we cannot use this approach there.
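Before moving on, here is a minimal sketch of the one-hot idea in code. It is my own illustration, not the blog's code; a real vocabulary keeps one slot per unique word, so the vector length here is the number of distinct words.

# A minimal sketch of one-hot encoding (illustration only)
sentence = "Learning machine learning without understanding maths like having sex without orgasm"
tokens = sentence.split()            # the 11 tokens left after dropping the stop word 'is'
vocab = list(dict.fromkeys(tokens))  # unique words, in order of first appearance

def one_hot(word, vocab):
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1       # a single 1 at this word's position
    return vec

print(one_hot("machine", vocab))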

Another way of doing this is TFIDF (term frequency–inverse document frequency). This representation is widely used for document classification. It is similar to what we did previously, except that here we take the frequency of words in a particular document into account. Words that are frequent across many documents get less weight than words that occur only in particular types of documents. This way we can solve classification problems more easily.
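To make this concrete, here is a minimal sketch using scikit-learn's TfidfVectorizer. scikit-learn is my own choice here (the rest of the blog uses only nltk and gensim), and the toy documents are made up.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "learning machine learning without understanding maths",
    "dogs like to play in the park",
    "machine learning needs maths",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one row per document, one column per word

# words shared across many documents (e.g. 'learning', 'maths') get lower weights
print(vectorizer.get_feature_names_out())
print(tfidf.toarray().round(2))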

The need for word embeddings

Algorithms like TFIDF are good for machine-level classification, but they don't provide any substantial information about context. They merely provide a way to tokenise words. To understand the meaning of a sentence, we need to form some kind of relationship between the different tokens (or vectors).

Here comes the idea of embedding. Word embeddings are one way to create (or vectorise) these vectors so that words with similar meanings or contexts end up with vectors close to each other, while words with no correlation end up with vectors far apart.

We can see this in the following figure.

The words 'embedding' and 'vector' have a similar meaning and context, whereas 'tfidf', although also a vector representation, has a different context. (For representation purposes only.)
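In practice, this "closeness" is usually measured with cosine similarity between the vectors. Here is a tiny sketch with made-up three-dimensional vectors, purely for illustration:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy vectors, not real embeddings
embedding = np.array([0.9, 0.1, 0.3])
vector    = np.array([0.8, 0.2, 0.4])
tfidf     = np.array([0.1, 0.9, 0.0])

print(cosine_similarity(embedding, vector))  # close to 1: similar context
print(cosine_similarity(embedding, tfidf))   # noticeably smaller: different context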

What is word2vec?

Word2vec is one type of word embedding algorithm. As the name suggests, word2vec creates vector forms of words. As we discussed, an embedding is a way to create vector representations such that words with similar contexts have vectors closer to each other.

In the original paper, word2vec is based on the skip-gram model. While many improvements have been made to the algorithm since, the basic idea remains pretty much the same. A full treatment of the skip-gram model is beyond the scope of this blog (as this is just a conceptual overview), but we will go through it at a high level below.

The skip-gram model

The skip-gram model takes a word as input and predicts the context of that word as output. Consider the following sentence,

I like data science and mathematics but don't like to play with dogs. 

Consider this sentence as our corpus. If we want to find word embeddings for it, the terms 'data science' and 'mathematics' should come close to each other, as they are correlated, while the word 'dogs' should have a lower correlation with both of them.

The skip-gram model uses a multi-output (and multi-loss) model to represent a word. If we give it the term 'data science' as input, it will try to predict the context (here, by context we mean the words surrounding the given term). To make things simpler, instead of taking the entire sentence, we only take a text window into consideration. The following picture helps clarify this.

Here we take a window into account instead of the entire corpus to understand the context. This makes things easier and faster for the embedding algorithm, as it has to map each new embedding based on fewer words.

You can already see that the skip-gram model has to produce five outputs (the window size) from a single input. The model needs five separate loss functions, as each context-word prediction is made separately. To understand more about skip-gram, you can refer to the original paper.
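To make the window idea concrete, here is a small sketch (my own illustration, not the paper's code) that builds the (input word, context word) pairs a skip-gram model is trained on, with a window of two words on each side:

def skipgram_pairs(tokens, window=2):
    # for each centre word, pair it with every word inside the window
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

tokens = "i like data science and mathematics".split()
print(skipgram_pairs(tokens, window=2))
# e.g. ('data', 'i'), ('data', 'like'), ('data', 'science'), ('data', 'and'), ...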

The original paper uses the hierarchical softmax function, which is basically a hierarchical binary classification method.

In the end, we get vector representations of the words. Thanks to the skip-gram model, these vectors are correlated according to the contexts the words appear in. This is what word2vec is used for! Next, we will see how to implement it in a real-world scenario.

Implementation of Word2Vec

This is pretty much straightforward. All we need to do is split our paragraphs into sentences and then split those sentences into words. We then do some preprocessing on these words (like removing stop words and lemmatisation) and pass them to the word2vec library.

Here we go,

1. Importing libraries

We can make use of the nltk library to remove stop words, to create sentences from paragraphs, and to create words from those sentences.

To use Word2Vec we can use gensim's models. (You will need to download the stop words and the sentence tokeniser from nltk first if you are doing this for the first time.)

import pandas as pd 
import nltk 
from nltk.corpus import stopwords
from gensim.models import Word2Vec
import re
import os
#nltk.download('stopwords')
#nltk.download('punkt')

2. Importing data

For demonstration purposes, I simply used the raw text of this very blog and created the data from it. Word2Vec needs a large amount of data to train properly, but for now we can just import a text file.

with open('this_awesome_blog.txt', 'r') as f: 
    text = f.read()

3. Cleaning data

As you may have guessed, it is really important to clean the data before training word2vec. In particular, stop words (like 'is', 'then', 'a', etc.) should be removed: their high frequency in sentences can cause problems during embedding.

Also, stemming (or, better, lemmatisation) can help make better predictions. In my experience, lemmatisation gives a slight edge towards a better representation.

And finally, it is no surprise that we also remove special characters from the corpus.

text = text.lower()

# create sentences from the data first, so the tokeniser still sees full stops
sentences = nltk.sent_tokenize(text)

# basic cleaning operations: drop special characters and squeeze whitespace
sentences = [re.sub('[^a-z]', ' ', sent) for sent in sentences]
sentences = [re.sub(r'\s+', ' ', sent).strip() for sent in sentences]

# create words from the sentences (stop words are removed in the next step)
word_list = [nltk.word_tokenize(sent) for sent in sentences]

Why do we tokenise sentences/words before Word2Vec?

Remember the window that we talked about in the word2vec explanation? Our final goal is to create a representation of a word given its context, and the sentence is that context. We pass every sentence separately (as a list of words, as in the code above), and this creates the representation of each word based on its context.

4. Creating word tokens

Now, all we need to do is pass the final word_list to Word2Vec. But before doing that, we want to remove stop words. Keep in mind that we don't pass a flat list of words to Word2Vec but a list of lists of words (a list of sentences). Here is the unoptimised code for it; a slightly faster version follows after this block.

final_list = []
for words in word_list:
    temp = []
    for word in words:
        if str(word) not in stopwords.words('english'):
            temp.append(word)

    final_list.append(temp)
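For reference, here is a slightly faster version of the same step. Building the stop word set once avoids re-reading the list for every single word; the result is the same:

stop_words = set(stopwords.words('english'))  # build the set once

final_list = [[word for word in words if word not in stop_words]
              for words in word_list]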

5. Implementing Word2Vec

We can create a word2vec representation by writing just one line of code.

word2vec = Word2Vec(final_list, min_count=1)
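min_count=1 keeps every word, which only makes sense on a corpus this tiny. On real data you will usually set a few more parameters. The names below follow gensim 4.x, and the values are just common starting points rather than anything from the original blog:

word2vec = Word2Vec(
    final_list,
    vector_size=100,   # dimensionality of the word vectors
    window=5,          # context words considered on each side
    min_count=5,       # ignore words that appear fewer than 5 times
    sg=1,              # 1 = skip-gram, 0 = CBOW
    workers=4,         # training threads
)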

To see the mapping from each word to its index in the vocabulary, use the following:

word2vec.wv.key_to_index

And finally, to find similar words within the corpus, use this code.

word2vec.wv.most_similar("machine")

For this particular blog, I got the following output.

[('way', 0.2357744574546814),
 ('things', 0.21105220913887024),
 ('us', 0.19719639420509338),
 ('remains', 0.19654203951358795),
 ('becomes', 0.19585859775543213),
 ('know', 0.19291925430297852),
 ('merely', 0.19115883111953735),
 ('input', 0.1848684698343277),
 ('less', 0.17467346787452698),
 ('classification', 0.17092865705490112)]
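You can also pull out the raw vector for any word in the vocabulary; with the default settings it is a 100-dimensional numpy array:

vector = word2vec.wv['machine']
print(vector.shape)   # (100,) with the default vector_size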

Conclusion

Word2vec was one of the groundbreaking ideas in NLP. The skip-gram model that we talked about briefly here works almost like magic on our corpus. In the end, we just have to understand the basic principles. In today's machine-learning world things are moving fast, and newer techniques like BERT have emerged. We will look at those some other day.

If you found this blog helpful, have any concerns or doubts, or even just want to talk, feel free to reach out to me on LinkedIn or at [email protected]!


References and further reading:

  1. Original Paper: https://arxiv.org/pdf/1310.4546.pdf
  2. The coding train
  3. Implementing Word2Vec with Gensim Library in Python
  4. Deep Learning(CS7015): Lec 10.5 Skip-gram model
