By: Danijela Zarić
Word2Vec model
The Word2Vec model is used to learn word representations in vector space. Word2Vec improves on the BoW and TF-IDF models because it captures the context in which a word is used, which means that words with similar meanings end up with very similar embedding vectors [1]. Word2Vec maximizes accuracy using two different architectures:
- Continuous Bag of Words (CBOW)
- Skip-Gram
Continuous Bag of Words (CBOW)
The aim of the CBOW model is to predict a target word using all the words in its neighborhood. Considering a simple sentence “Machine learning is the key for data science” and a window of size 2, we get pairs such as ([machine, is, the], learning), ([key, for, science], data), ([is, the, for, data], key), etc.
The CBOW model architecture is shown in the figure below:
Fig.1 The CBOW model architecture, Source: https://towardsdatascience.com/word2vec-explained-49c52b4ccb71
When we build the CBOW model, we take the context words as input and try to predict the target word. This architecture is simpler than the Skip-Gram model.
Let’s look into the parameters of the Word2Vec model:
- window – Maximum distance between the current and predicted word within a sentence.
For example, a window size of 2 implies that for every word we pick the 2 words before and the 2 words after it. Let’s look at the following example (a short code sketch that reproduces these pairs follows the table):
Sentence: The brown fox is jumping.
| Sentence | Word pairs |
| --- | --- |
| The brown fox is jumping. | (the, brown), (the, fox) |
| The brown fox is jumping. | (brown, the), (brown, fox), (brown, is) |
| The brown fox is jumping. | (fox, the), (fox, brown), (fox, is), (fox, jumping) |
| The brown fox is jumping. | (is, brown), (is, fox), (is, jumping) |
| The brown fox is jumping. | (jumping, fox), (jumping, is) |
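To make the window parameter concrete, here is a minimal sketch in plain Python that generates the same (word, context) pairs as the table above. It is only an illustration of how the window is applied, not how Gensim builds its training samples internally.

```python
# Reproduce the (word, context) pairs from the table for a window of size 2.
sentence = "the brown fox is jumping".split()
window = 2

pairs = []
for i, word in enumerate(sentence):
    # take up to `window` words before and after the current word
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((word, sentence[j]))

print(pairs)
# [('the', 'brown'), ('the', 'fox'), ('brown', 'the'), ('brown', 'fox'), ('brown', 'is'), ...]
```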
- min_count – Ignores all words with a total frequency lower than this.
- sg = 0 – Training algorithm is CBOW.
- hs = 1 – Hierarchical softmax will be used for model training.
To build a Word2Vec model, you first need to install the Gensim library. As in the previous example, we will use a cleaned dataset. The only difference is that the Word2Vec model requires the training data as a “list of lists”, i.e. a list of tokenized sentences.
You can do this simply by using the word_tokenize function from the nltk library (imported as nltk), as shown below:
list_of_lists = [nltk.word_tokenize(sentence) for sentence in corpus]
The output for the first two sentences is simply each sentence turned into a list of its tokens.
Now, we can build the model:
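Below is a minimal sketch of the model-building step, assuming the Gensim 4.x API (where the embedding dimensionality is called vector_size); the vector size of 100 is an illustrative value, not necessarily the one used in the original experiment.

```python
from gensim.models import Word2Vec

# CBOW (sg=0) with hierarchical softmax (hs=1), a window of 2
# and no minimum frequency filtering (min_count=1)
model = Word2Vec(
    sentences=list_of_lists,
    vector_size=100,   # dimensionality of the word vectors (assumed value)
    window=2,
    min_count=1,
    sg=0,
    hs=1,
)
```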
We are going to use the most_similar function to find the three words that are most similar to the word “compani”.
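A sketch of the call, using the most_similar method on the trained model’s keyed vectors; the returned words and scores depend on the dataset, so none are shown here.

```python
# top three words most similar to "compani" (by cosine similarity of their vectors)
print(model.wv.most_similar('compani', topn=3))
```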
To show that this really works, let’s compare the similarity of two word pairs, as in the example below:
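A minimal check using Gensim’s similarity method; the word pairs come from the article’s own example, and the exact scores depend on the trained model.

```python
# similarity between two related words vs. two unrelated words
print(model.wv.similarity('compani', 'invest'))
print(model.wv.similarity('russia', 'meter'))
```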
As expected, the similarity between “russia” and “meter” is lower than the similarity between “compani” and “invest”.
The similarity was calculated according to the formula:
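Gensim computes this score as the cosine similarity between the two embedding vectors $\mathbf{a}$ and $\mathbf{b}$:

$$\text{similarity}(\mathbf{a}, \mathbf{b}) = \cos\theta = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}$$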
Skip-Gram
If we change the parameter sg to 1 and hs to 0, the training algorithm will be Skip-Gram and negative sampling will be used. For negative sampling we also need to set one more parameter, negative = 1 (the number of “noise words” drawn for each positive sample), as in the sketch below.
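A minimal sketch of the Skip-Gram variant, again assuming the Gensim 4.x API and an illustrative vector size:

```python
from gensim.models import Word2Vec

# Skip-Gram (sg=1) with negative sampling (hs=0, negative > 0)
model_sg = Word2Vec(
    sentences=list_of_lists,
    vector_size=100,   # assumed dimensionality
    window=2,
    min_count=1,
    sg=1,
    hs=0,
    negative=1,        # one noise word per positive sample, as in the text
)
```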
The difference between Skip-Gram and CBOW is that the Skip-Gram model tries to achieve the reverse of what the CBOW model does: it predicts the context from the target word.
Considering our earlier sentence “Machine learning is the key for data science”, the Skip-Gram model tries to predict the context window words from the target word. That means the task becomes predicting the context [key, for, science] given the target word “data”, or [machine, is, the] given the target word “learning”.
The problem with this model is how to learn which pairs of words are contextually relevant. We achieve that with positive and negative input samples. If we have a pair (target, context word) with the label 1, it is treated as a positive sample: the context word actually occurs near the target word, and the label 1 indicates that this is a contextually relevant pair.
On the other hand, if we have a pair (target, random word) with the label 0, it is treated as a negative sample: the random word is just a randomly selected word from the vocabulary that has no contextual association with the target word. An illustrative sketch of this sampling scheme is shown below.
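To illustrate how such labeled pairs could be generated, here is a minimal sketch in plain Python. It is only a conceptual illustration (for instance, it does not prevent a randomly drawn noise word from accidentally being a true context word), not the sampling routine Word2Vec actually uses.

```python
import random

sentence = "machine learning is the key for data science".split()
vocabulary = list(set(sentence))   # stand-in for the full corpus vocabulary
window = 2

samples = []
for i, target in enumerate(sentence):
    # positive samples: words that really occur inside the context window (label 1)
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            samples.append((target, sentence[j], 1))
    # negative sample: a random vocabulary word paired with the target (label 0)
    noise = random.choice(vocabulary)
    samples.append((target, noise, 0))

print(samples[:5])
```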
The Skip – Gram model architecture is shown in the figure below:
Fig.2 The Skip-Gram model architecture, Source: https://arxiv.org/pdf/1301.3781.pdf, Mikolov et al. [2]
Now let’s look at the word embeddings of the word “compani”, just to gain a sense of what the vector embedding actually looks like:
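A sketch of how to inspect a single word vector with Gensim; the actual numbers depend on the trained model and are omitted here.

```python
# the embedding is a dense NumPy array of length vector_size (100 in our sketch)
vector = model.wv['compani']
print(vector.shape)
print(vector[:10])   # first ten components, just to get a feel for the values
```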
The end goal of Word2Vec here is to produce embeddings that can be used to train a model for sentiment. Bearing in mind that sentences are not always the same length, it is best to average the word vectors element-wise so that each document ends up as a vector of fixed length. The resulting vectors are then used for training algorithms such as the Multinomial Naive Bayes classifier, Random Forest classifier, Support Vector Machine, Logistic Regression, etc.
Now, let’s write the following code. Note that the variable “embeddings” will be used in the train/test split for model training.
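A minimal sketch of the averaging step, assuming the CBOW model and the tokenized corpus from above; words missing from the Word2Vec vocabulary are simply skipped.

```python
import numpy as np

def document_vector(model, tokens):
    # keep only tokens that the Word2Vec model has a vector for
    words = [w for w in tokens if w in model.wv.key_to_index]
    if not words:
        # fall back to a zero vector for documents with no known words
        return np.zeros(model.vector_size)
    # element-wise average -> one fixed-length vector per document
    return model.wv[words].mean(axis=0)

embeddings = np.array([document_vector(model, tokens) for tokens in list_of_lists])
print(embeddings.shape)   # (number of documents, vector_size)
```

The resulting embeddings array can then be passed to scikit-learn’s train_test_split and to any of the classifiers mentioned above.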
Summary
Word embeddings try to find better word representations than the existing ones. With word embeddings we are able to convert the high-dimensional feature space of words into low-dimensional feature vectors while preserving the contextual similarity in the corpus. This article aimed to explain some of the workings of the embedding models BoW, TF-IDF and Word2Vec. To get the best possible accuracy, it is important to clean and preprocess the data. The CBOW model trains faster than Skip-Gram and represents frequent words better, which means it gives slightly better accuracy for frequent words. The Skip-Gram model works well with a small amount of training data and represents rare words or phrases better.
Word2Vec, although an improvement over the typical binary representation of data, still has issues with polysemy (i.e. when one word can have a different meaning depending on the context in which it is used; for example, “bank” can be an institution, but also a river bank). Transformers, the new state of the art in NLP, can tackle this issue efficiently, but more on that in the next article.
References
[1] Vatsal, “Word2Vec Explained”, Towards Data Science, July 2021. [Online]. Available: https://towardsdatascience.com/word2vec-explained-49c52b4ccb71. [Accessed: 11-April-2022]
[2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space”, arXiv:1301.3781, September 2013. [Online]. Available: https://arxiv.org/pdf/1301.3781.pdf