
By: Danijela Zaric

This blog presents a type of word representation called word embeddings. Algorithms that map words to vectors are important in Natural Language Processing (NLP), i.e., the ability of a computer program to understand human language as it is spoken and written [1].

The aim of this article is to understand:

  1. What are Word Embeddings?
  2. How to clean data?
  3. Differences between the Bag of Words model, the TF-IDF model and Word2Vec

What are Word Embeddings?

Word embeddings are numerical representations of texts, where each word is represented by a real-valued vector. The length of these vectors is essentially a hyperparameter, often tens or hundreds of dimensions. The main goal of word embedding is to convert the high-dimensional feature space of words into low-dimensional feature vectors while preserving the contextual similarity in the corpus [2].

If we visualize the learned vectors, we can see that similar words cluster together.

The figure below shows that the vectors capture some general, and in fact quite useful, semantic information about words and their relationships to one another. It was discovered that certain directions in the induced vector space specialize towards certain semantic relationships, e.g. male-female, verb tense and even country-capital relationships between words.

There are numerous algorithms that can map words to vectors, such as Bag of Words, TF-IDF vectorization, Word2Vec etc.

Fig.1 Word Embeddings plot [3]

How to clean data?

Before you start using vectorization, you need to clean and preprocess your data. To prepare text data for model building, you need to remove punctuation and stop words, lowercase the text, tokenize it, and apply stemming or lemmatization; a short sketch of these steps is shown after the list below.

  • Punctuation is often unnecessary, as it does not add value or meaning to the NLP model.
  • Stop words are very common words that carry little information and rarely help in distinguishing one text from another.
  • Tokenizing is the process of splitting strings into a list of words.
  • We employ stemming to reduce words to their basic form or stem, which may or may not be a legitimate word in the language. For instance, the stem of these three words, connections, connected, connects, is “connect”. On the other hand, the root of trouble, troubled, and troubles is “troubl”, which is not a recognized word.
  • Lemmatization is used to reduce words to a normalized form. In lemmatization, the transformation uses a dictionary to map different variants of a word back to its root form. With this approach, we are able to reduce non-trivial inflections such as “is”, “was”, “were” back to the root “be”.
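
A minimal sketch of these cleaning steps using NLTK is shown below; the sample sentence is illustrative (not a row from the dataset), and the punkt and stopwords resources are assumed to have been downloaded:

```python
# A possible cleaning pipeline: lower casing, punctuation removal,
# tokenization and stop-word removal (stemming/lemmatization come next).
# Assumed one-time downloads: nltk.download("punkt"), nltk.download("stopwords")
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

def clean_text(text):
    text = text.lower()                                                # lower casing
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove punctuation
    tokens = word_tokenize(text)                                       # tokenization
    stop_words = set(stopwords.words("english"))
    return [token for token in tokens if token not in stop_words]      # remove stop words

# Illustrative sentence only.
print(clean_text("According to the analysts, the company has no plans to move production."))
```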

The difference between stemming and lemmatization is that stemming sometimes produces stems that are not valid English words, whereas lemmatization always returns a meaningful word. Stemming is also faster than lemmatization, so there is a trade-off between speed and accuracy [4].
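
As a small illustration, here is a sketch comparing NLTK's PorterStemmer and WordNetLemmatizer on the word groups mentioned above (the wordnet resource is assumed to have been downloaded):

```python
# Compare stemming and lemmatization on the word groups from the text.
# Assumed one-time download: nltk.download("wordnet")
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print([stemmer.stem(w) for w in ["connections", "connected", "connects"]])   # ['connect', 'connect', 'connect']
print([stemmer.stem(w) for w in ["trouble", "troubled", "troubles"]])        # ['troubl', 'troubl', 'troubl']
print([lemmatizer.lemmatize(w, pos="v") for w in ["is", "was", "were"]])     # ['be', 'be', 'be']
```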

If you use WordNet for lemmatization, it is good practice to write a function that maps Part-of-Speech (POS) tags, because without a POS tag the lemmatizer assumes that everything you feed it is a noun.

The code to generate the POS tag is given below, together with the example output:
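
Below is a minimal sketch of such a mapping using NLTK's pos_tag; the three example words are illustrative, chosen so that they are tagged as a noun, an adverb and a verb, and the averaged_perceptron_tagger and wordnet resources are assumed to have been downloaded:

```python
# Map an NLTK/Penn Treebank POS tag to the tag format expected by WordNetLemmatizer.
# Assumed one-time downloads: nltk.download("averaged_perceptron_tagger"), nltk.download("wordnet")
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(word):
    """Return the WordNet POS tag ('n', 'v', 'a' or 'r') for a single word."""
    tag = nltk.pos_tag([word])[0][1][0].upper()   # first letter of the Penn Treebank tag
    tag_dict = {"J": wordnet.ADJ, "N": wordnet.NOUN,
                "V": wordnet.VERB, "R": wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)        # default to noun

lemmatizer = WordNetLemmatizer()
for word in ["company", "quickly", "growing"]:    # illustrative words
    pos = get_wordnet_pos(word)
    print(word, "-" + pos, "->", lemmatizer.lemmatize(word, pos))
```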

The first word is a noun (-n), the second word is an adverb (-r) and the third word is a verb (-v).

In the following examples, the NLTK library will be used [5]. NLTK is a leading platform for building Python programs to work with human language data.

The differences between the Bag of Words model, the TF-IDF model and Word2Vec are described below using the Sentiment Analysis for Financial News dataset, which can be found here [6].

Bag of Words Model

A bag-of-words (BoW) model is a way of extracting features from text so that the text can be used with machine learning algorithms. A BoW model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears. The BoW model is very simple to understand and implement.

The disadvantage of BoW is that the model only records whether known words occur in the document, not where in the document they occur; any information about the order or structure of words is discarded. Also, if new sentences contain new words, the vocabulary grows and with it the length of the vectors. This approach is rarely used for training on its own, but it is often used as an input to more complex algorithms [7].

For example, we use the first two sentences from the cleaned dataset:

First, CountVectorizer must be created, then fit on the text documents:
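
A sketch of this step with scikit-learn is shown below; the two "cleaned" sentences are illustrative reconstructions consistent with the word counts discussed in the text, not the exact dataset rows:

```python
# Create a CountVectorizer and learn the vocabulary from the documents.
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative stand-ins for the first two cleaned sentences of the dataset.
sentences = [
    "accord gran compani plan move product russia although compani grow",
    "compani report net sale increas last quarter",   # hypothetical second sentence
]

vectorizer = CountVectorizer()
vectorizer.fit(sentences)
print(vectorizer.vocabulary_)   # maps each word to its column index
```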

Now, we count the number of times each word occurs in each document:
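
Continuing with the vectorizer fitted in the sketch above:

```python
# Transform the documents into a document-term matrix of word counts.
X = vectorizer.transform(sentences)
print(vectorizer.get_feature_names_out())   # vocabulary, in alphabetical column order
print(X.toarray())                          # one row of counts per sentence
```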

In the first sentence, “compani” appears twice, while “accord”, “although”, “gran”, “grow”, “move”, “plan”, “product” and “russia” each appear once. In the second sentence, “compani” appears only once, and in the resulting matrix that word occupies the same column as in the first sentence.

TF-IDF Model

Term frequency-inverse document frequency (TF-IDF) is a text vectorizer that transforms text into a usable vector. It combines two concepts: term frequency (TF) and document frequency (DF). Term frequency is the number of times a term occurs in a document and indicates how important that term is within the document. Suppose we have a set of text documents and wish to rank which document is most relevant to the query "All about Data Science"; a simple starting point is to count the number of times each query term occurs in each document and sort the documents by those counts.

The difference between TF and DF is that document frequency is the number of documents in the collection that contain the term, whereas term frequency is the number of times the term occurs within a single document.

Finally, IDF is the inverse document frequency factor. Its task is to reduce the weight of words that are repeated across many documents, such as "is", "of" and "that", because they carry little importance. This weighting is needed to improve classification accuracy, because some words (e.g. "the", "is", "a") can be very frequent in a BoW representation yet useless to the classifier, since they carry little information about the actual contents of the document [8].

TF(t, document) = (number of times term t appears in the document) / (total number of terms in the document)

IDF(t) = log(total number of documents / number of documents in which term t appears)

TF-IDF(t, document) = TF(t, document) × IDF(t)

To simplify it, let’s look at an example, where we want to know how relevant the word “city” is for both document 1 and document 2:

Document 1: The Metropolitan city of Rome is the most populous metropolitan city in Italy.

Document 2: He felt a tiny tremor of excitement as he glimpsed the city lights.

In the first document, the word “city” appears twice, and the length of the document is 13 words, so:

TF(“city”, doc1) = 2/13 ≈ 0.154

In the same way, we calculate TF for the second document:

TF(“city”, doc2) = 1/13 ≈ 0.077

IDF is constant per corpus and accounts for the ratio of documents that include the specific term. Here we use the variant that adds 1 to the logarithm (as scikit-learn does), so that a term occurring in every document does not end up with a weight of zero:

IDF(“city”) = log(2/2) + 1 = 1

And finally we have:

TF-IDF(“city”, doc1) = 0.154 × 1 = 0.154

TF-IDF(“city”, doc2) = 0.077 × 1 = 0.077

which means that the word “city” is more relevant for the first document than the second document.
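
For a quick check, the same numbers can be reproduced with a few lines of Python; this is a sketch that follows the formulas used above, including the +1 added to the IDF:

```python
# Verify the hand calculation for the word "city" in the two documents.
import math

doc1 = "The Metropolitan city of Rome is the most populous metropolitan city in Italy".lower().split()
doc2 = "He felt a tiny tremor of excitement as he glimpsed the city lights".lower().split()
docs = [doc1, doc2]

def tf(term, doc):
    # term frequency: occurrences of the term divided by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency with the +1 used in the example above
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) + 1

for i, doc in enumerate(docs, start=1):
    print(f"TF-IDF('city', doc{i}) = {tf('city', doc) * idf('city', docs):.3f}")
# TF-IDF('city', doc1) = 0.154
# TF-IDF('city', doc2) = 0.077
```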

Applying the same procedure to our corpus from the Sentiment Analysis for Financial News dataset, we have:
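
A sketch of this step with scikit-learn's TfidfVectorizer is shown below, reusing the two illustrative sentences from the Bag of Words example; note that TfidfVectorizer applies smoothed IDF and L2 normalization by default, so its weights differ slightly from the hand formula above:

```python
# TF-IDF weights with scikit-learn (smoothed IDF and L2 normalization by default).
from sklearn.feature_extraction.text import TfidfVectorizer

# The same illustrative "cleaned" sentences as in the Bag of Words example.
sentences = [
    "accord gran compani plan move product russia although compani grow",
    "compani report net sale increas last quarter",   # hypothetical second sentence
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(sentences)

for word, weight in zip(tfidf.get_feature_names_out(), X.toarray()[0]):
    print(f"{word}: {weight:.3f}")   # TF-IDF weights for the first sentence
```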

Stay tuned for part II where we will discuss the next popular model, Word2Vec!

References

[1] B. Lutkevich, “Natural Language Processing”, Techtarget, March 2021. [Online]. Available:  https://www.techtarget.com/searchenterpriseai/definition/natural-language-processing-NLP. [Accessed: 29-March-2022]

[2] J. Brownlee, “What Are Word Embeddings for Text”, Machine Learning Mastery, August 2019. [Online]. Available: https://machinelearningmastery.com/what-are-word-embeddings. [Accessed: 31-March-2022]

[3] E. Bujokas, “Creating Word Embeddings: Coding the Word2Vec Algorithm in Python using Deep Learning”, Towards Data Science, March 2020. [Online]. Available: https://towardsdatascience.com/creating-word-embeddings-coding-the-word2vec-algorithm-in-python-using-deep-learning-b337d0ba17a8. [Accessed: 06-April-2022] 

[4] N. Lang, “Stemming vs. Lemmatization in NLP”, Towards Data Science, February 2019. [Online]. Available: https://towardsdatascience.com/stemming-vs-lemmatization-in-nlp-dea008600a0. [Accessed: 07-April-2022]

[5] NLTK Project, "Natural Language Toolkit". [Online]. Available: https://www.nltk.org/. [Accessed: 08-April-2022]

[6] A. Sinha, “Sentiment Analysis for Financial News”, Kaggle, 2019. [Online]. Available: https://www.kaggle.com/datasets/ankurzing/sentiment-analysis-for-financial-news. [Accessed: 07-April-2022]

[7] J. Brownlee, "A Gentle Introduction to the Bag-of-Words Model", Machine Learning Mastery, August 2019. [Online]. Available: https://machinelearningmastery.com/gentle-introduction-bag-words-model. [Accessed: 11-April-2022]

[8] T. Amuthan, “A Quick Intro to TFIDF”, Medium, January 2021. [Online]. Available: https://medium.com/swlh/a-quick-intro-to-tf-idf-483db9a749f5. [Accessed: 11-April-2022]