By: Belma Muftic
As we all know, computers talk mainly in numbers (e.g. ASCII, binary), and as such, even in data science, models expect numbers as inputs. But what happens when the input is a word, a sentence, a paragraph, or a voice message? Data science has a branch which deals with exactly this kind of input: Natural Language Processing (NLP for short). Natural language refers to the way we humans communicate with each other: speech and text. However, for a computer to understand us (like R2D2, or our E387), we need to convert our natural language into theirs: numbers.
NLP combines computational linguistics—rule-based modeling of human language—with statistical, machine learning, and deep learning models. In the early days of NLP, systems relied on hand-written rules, but since language is not as simple as an if-else block, there were too many exceptions to those rules, which led to bringing models into the equation. The movie Arrival nicely portrays the complexity of languages, although not quite the kind we are dealing with here.
We will dig into specific NLP techniques later in this series of blogs; this post is rather an overview of what you can accomplish with NLP (warning: there is a LOT of it):
- Sentiment analysis – extracts subjective characteristics, e.g. whether a sentence is positive or negative; social media has a lot of these use cases, such as checking whether a movie review is good or bad (a small sketch follows this list)
- Named Entity Recognition (NER) – identifies words as entities, e.g. Google would be identified as a company, whereas New York would be a location
- Part of Speech tagging (PoS) – or grammatical tagging, identifies the structural role of a word (noun, verb, adjective, etc.); both PoS tagging and NER are sketched in code after this list
- Speech recognition – converts voice data into text; the challenging part is recognizing speech through mumbling, slurring, tone, accent, incorrect grammar, and the like; Alexa and Siri are examples where speech recognition is used to analyze and respond to a voice command
- Word sense disambiguation – deciphers the intended meaning of a word with multiple meanings through semantic analysis, e.g. leg in “break a leg” (meaning good luck) versus “pull someone’s leg” (to play a joke on someone); a classic algorithmic sketch follows this list
- Co-reference resolution – identifies when two words refer to the same entity, e.g. “she” can refer to Penny, whereas “genius” can refer to Sheldon
- Natural language generation – generates text from, well, text (check this out); this is often used for text summarization, which, as the name implies, condenses a big chunk of text for you
- Machine translation – Google Translate says it all, but try translating a text into another language and then back again; you will find amusing results
- Content categorization – classifies texts according to their content, e.g. legal documents for lawsuits, purchases, contracts, etc.; this can be achieved with topic discovery and modeling, so that documents which use similar phrases are grouped together (but what if one document talks about a bank as an institution, and another about the bank of a river?)
- Spam detection – to be fair, sometimes it flags messages unfairly, but a lot of spam filters use NLP in the background: they check for misspelled names, false urgency, bad grammar, and so on
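To make a few of these concrete, here is a minimal sentiment analysis sketch using nltk’s bundled VADER model (the example reviews are made up for illustration):

```python
# Minimal sentiment analysis sketch using NLTK's bundled VADER model.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon

sia = SentimentIntensityAnalyzer()
reviews = [
    "This movie was an absolute masterpiece!",      # made-up positive review
    "Two hours of my life I will never get back.",  # made-up negative review
]
for review in reviews:
    scores = sia.polarity_scores(review)
    # 'compound' is an overall score in [-1, 1]; its sign gives the polarity
    label = "positive" if scores["compound"] >= 0 else "negative"
    print(f"{label}: {review}")
```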
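PoS tagging and NER can likewise be sketched with nltk’s built-in tagger and chunker (the sentence is an invented example):

```python
# PoS tagging and NER sketch with NLTK's built-in tagger and chunker.
import nltk

# One-time downloads for the tokenizer, tagger, and chunker models.
for pkg in ["punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"]:
    nltk.download(pkg)

sentence = "Google opened a new office in New York."  # invented example
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # e.g. ('Google', 'NNP'), ('opened', 'VBD'), ...
print(tagged)

tree = nltk.ne_chunk(tagged)   # groups tagged tokens into named-entity subtrees
for subtree in tree:
    if hasattr(subtree, "label"):  # only entity subtrees carry a label
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), "->", entity)
```

Note that nltk’s chunker has its own label set, so Google may come out as an ORGANIZATION and New York as a GPE (geo-political entity) rather than “company” and “location”.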
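And for word sense disambiguation, nltk ships the classic Lesk algorithm, which picks the WordNet sense whose dictionary definition overlaps most with the surrounding words. It is a simple baseline, so do not expect it to always pick the right sense:

```python
# Word sense disambiguation sketch using the Lesk algorithm from NLTK.
import nltk
from nltk.wsd import lesk

nltk.download("wordnet")  # WordNet provides the inventory of word senses

# "bank" is ambiguous: a financial institution, or the side of a river.
context = "I went to the bank to deposit my money".split()
sense = lesk(context, "bank")
if sense is not None:  # lesk returns None when no sense matches
    print(sense.name(), "-", sense.definition())
```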
Practical NLP
As you can see, there are a lot of applications which leverage the power of NLP. On paper, some of them sound easy, but there are a lot of steps involved in cleaning your data, and even the cleaning depends on the use case you are working on. Nevertheless, some of the most common cleaning steps in NLP are:
- tokenization – splitting chunks of text into smaller elements, be it symbols, words, sentences, or paragraphs. We need to do this so that we can treat these elements as a kind of categorical data.
- stop word removal – there are a lot of words which do not carry much meaning on their own, e.g. “the” tells us nothing about the sentiment of a sentence, so we consider it a stop word. Removing these helps with dimensionality reduction, as well as noise removal, once you start working on your model (both steps are sketched in code after this list).
- stemming – there are a lot of forms a word can take (especially depending on the language used), e.g. take the word “inform”: you can derive “information”, “informs”, “informed”, and “informing” from it, but they all share a common root, and stemming reduces each word to that root (note that the root might not always be a valid English word)
- lemmatization – similar to stemming, but in this case, lemmatization finds the dictionary form of a word, so that the output is an actual English word, e.g. the verb “go” can appear as “going”, “went”, or “goes”, but with lemmatization, they are all reduced to “go”. Below is an example of what stemming and lemmatization would produce for the same group of words:
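Here is a small sketch, assuming nltk’s PorterStemmer and WordNet lemmatizer, that prints both outputs side by side (the word list extends the examples above):

```python
# Comparing NLTK's PorterStemmer with its WordNet lemmatizer.
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")  # the lemmatizer needs WordNet

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["information", "informs", "informed", "informing", "going", "went", "goes"]
for word in words:
    stem = stemmer.stem(word)
    # The lemmatizer needs a part of speech; pos="v" treats each word as a verb.
    lemma = lemmatizer.lemmatize(word, pos="v")
    print(f"{word:12} stem: {stem:10} lemma: {lemma}")
```

The stemmer collapses the “inform” family to inform but leaves irregular forms such as “went” untouched (and produces non-words like “goe” for “goes”), while the lemmatizer maps “went” and “goes” back to the real word “go”.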
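And here is a sketch of the first two steps, tokenization and stop word removal, chained into one small preprocessing pass (the sentence is invented for illustration):

```python
# Tokenization and stop word removal chained into one preprocessing pass.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")      # tokenizer models
nltk.download("stopwords")  # per-language stop word lists

text = "The movie was not the masterpiece the critics promised."  # invented example
tokens = word_tokenize(text.lower())          # split the sentence into word tokens
stop_words = set(stopwords.words("english"))
filtered = [t for t in tokens if t.isalpha() and t not in stop_words]
print(filtered)  # ['movie', 'masterpiece', 'critics', 'promised']
```

Notice that “not” was thrown away as a stop word, which would flip the meaning of the sentence for a sentiment model; this is exactly why the cleaning steps depend on your use case.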
When it comes down to which library to use for NLP, the best one is nltk. It provides stop word lists, stemmers, and lemmatizers, and supports a variety of languages, depending on the task. But do pay attention: each language has its own traits and challenges, and a nice comparison between English and Croatian can be found here.
As you can see, there are a lot of areas where NLP is being used, but of course, they all have limitations (and those often cause frustration). In this series, we will aim to cover some of the most widely used NLP techniques, so stay tuned for more!