4 NLP libraries that are awesome

There are many libraries created to solve NLP problems. Here are some of the most amazing ones that helped us deliver quality projects to our clients over the years. Keep in mind that this list is not a complete overview of all the available NLP libraries, but these are the ones that we think are awesome.

In the past, an NLP project required many great minds working together: you needed mathematicians, machine learning engineers and linguists. Now, developers can use ready-made tools that simplify text preprocessing, so they can concentrate on building machine learning models.

Why Python?

First of all, Python is our programming language of choice for NLP projects. Its simple syntax and transparent semantics make it an excellent choice.

But there is another reason why Python is an excellent language for helping computers cope with natural languages: its extensive collection of NLP libraries, which handle tasks such as sentiment analysis, tokenization, and classification.

NLTK

NLTK (the Natural Language Toolkit) is generally the most popular library. This is because of the wide range of applications it allows, such as sentiment analysis, tokenization, and classification. NLTK can also be applied to many languages, including Dutch, which is often not the case with other libraries. It is especially useful for text processing.

The downside is that this library can be pretty slow and difficult to use; the learning curve is steep.

Natural language toolkit features include:

  • Text classification
  • Part-of-speech tagging
  • Entity extraction
  • Tokenization
  • Parsing
  • Stemming
  • Semantic reasoning

from nltk.tokenize import word_tokenize

# Requires the Punkt tokenizer data: nltk.download("punkt")
# (recent NLTK releases use "punkt_tab" instead).
sample_text = "this text needs to be tokenized"
word_tokenize(sample_text)

# ----- Expected output -----
# ['this', 'text', 'needs', 'to', 'be', 'tokenized']

from nltk.stem.snowball import SnowballStemmer

# Snowball stemmers exist for several languages, including Dutch
dutch_stemmer = SnowballStemmer("dutch")
dutch_stemmer.stem("artikelen")

# ----- Expected output -----
# 'artikel'

SpaCy

SpaCy, which is written in Python for convenience and Cython for speed, is the next step in the evolution from NLTK, which is clumsy and slow when it comes to more complex business applications.

We also prefer this library over NLTK because of its speed, which it owes to being written in Cython. It is a relatively young library designed for production use, which also makes it more accessible than many other Python NLP libraries.

SpaCy is good at syntactic analysis, which is handy for aspect-based sentiment analysis and conversational user interface optimization.

SpaCy is also an excellent choice for named-entity recognition. You can use SpaCy for business insights and market research.

It’s perfect for comparing customer profiles, product profiles, or text documents.

It includes almost every feature found in competing frameworks:

  • Part-of-speech tagging
  • Dependency parsing
  • Named entity recognition
  • Tokenization
  • Sentence segmentation
  • Rule-based match operations
  • Word vectors

You can also build word vectors, which are used in, for example, topic modelling. This is a real advantage of the library: unlike OpenNLP and CoreNLP, SpaCy works with word2vec and doc2vec.

Its most significant advantage over the other NLP tools is its API. SpaCy exposes all of its functions through one interface, so you don’t need to select modules individually.

However, this tool also has a big downside: it supports fewer languages than the alternatives. This should improve as its popularity increases.

import spacy

# Requires the small English model: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("ML2Grow is a fast growing startup located in Ghent")

# Print every named entity together with its label
for ent in doc.ents:
    print(ent.text, ent.label_)

# ----- Expected output -----
# ML2Grow ORG
# Ghent GPE

ORG: companies, agencies, institutions.
GPE: geopolitical entities, i.e. countries, cities, states.

Gensim

Sometimes you need to extract specific information to discover business insights. Gensim is the perfect tool for such things.

We mainly use Gensim for topic modelling and for finding similarities between text documents. It represents the content of the documents as vectors, groups them into clusters, and then classifies them. Combined with the Python library pyLDAvis, it provides a beautiful visualization of the topics.
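
To make the vector-space idea concrete, here is a minimal sketch in plain Python. It is not Gensim's own implementation: each document simply becomes a bag-of-words count vector, and two documents are compared by the cosine of the angle between their vectors. The function names `bag_of_words` and `cosine_similarity` and the sample sentences are made up for illustration; Gensim adds weighting schemes, topic models and memory-efficient streaming on top of this basic idea.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count how often each word occurs."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse count vectors (0.0 to 1.0)."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[w] * vec_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is similar to nothing
    return dot / (norm_a * norm_b)


doc_1 = bag_of_words("The customer ordered a new laptop")
doc_2 = bag_of_words("A new laptop was ordered by the customer")
doc_3 = bag_of_words("Heavy rain is expected in Ghent tomorrow")

print(cosine_similarity(doc_1, doc_2))  # high: the documents share most words
print(cosine_similarity(doc_1, doc_3))  # 0.0: no words in common
```

The first two sentences describe the same event in a different word order and score close to 1, while the unrelated third sentence scores 0.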

It also has excellent memory usage optimization and processing speed. That’s why it can handle large amounts of text data.

The most prominent Gensim use cases are:

  • Data analysis
  • Semantic search applications
  • Text generation applications (chatbot, service customization, text summarization, etc.)

import gensim

# Load pre-trained Word2Vec model
model = gensim.models.Word2Vec.load("modelName.model")

# The word vectors live on model.wv; both words must be in the vocabulary
model.wv.similarity('Complement', 'Compliment')

# ----- Expected output -----
# 0.961089779453727

Flair

Flair is a simple NLP library. Flair’s framework builds directly on PyTorch, one of the best deep learning frameworks.

Flair is an excellent library for entity recognition and part-of-speech tagging. It works very well on English text but gives poor results on Dutch documents. Since most of our customers have Dutch-language sources, it makes little sense for us to use this library, but we still love it.

Main NLP tasks:

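  • Named-entity recognition
  • Part-of-speech tagging
  • Text classification
  • Training custom models

from flair.data import Sentence
from flair.models import TextClassifier

# Load the pre-trained English sentiment classifier
classifier = TextClassifier.load('en-sentiment')
sentence = Sentence('NLP libraries are awesome!')

# predict() attaches the predicted label to the sentence
classifier.predict(sentence)
print(sentence.labels)

# ----- Expected output (exact formatting depends on the Flair version) -----
# [Positive (1.0)]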

Side note

Although they are often multilingual, the libraries mentioned above are firmly focused on English. Likewise, most NLP research concentrates on English and other widely spoken languages such as Chinese or Spanish, which naturally leads to further marginalization of other languages.

One NLP technology that is certainly worth mentioning but didn’t make the list is BERT, an open-source neural-network-based technique for NLP developed by Google.

BERT is an acronym for Bidirectional Encoder Representations from Transformers. The term bidirectional means that the context of a word is given both by the words that precede it and by the words that follow it. This makes the model hard to train but very effective.
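
The difference between one-directional and bidirectional context can be illustrated with a toy sketch in plain Python. This is not BERT, and `left_context` and `bidirectional_context` are hypothetical helper names; the sketch only shows which neighbouring words each approach is allowed to look at when interpreting a target word.

```python
def left_context(tokens, index, window=3):
    """Words a left-to-right model can see: only those before the target."""
    return tokens[max(0, index - window):index]


def bidirectional_context(tokens, index, window=3):
    """Words a bidirectional model can see: those before and after the target."""
    before = tokens[max(0, index - window):index]
    after = tokens[index + 1:index + 1 + window]
    return before, after


tokens = "she sat on the bank of the river".split()
target = tokens.index("bank")

print(left_context(tokens, target))
# ['sat', 'on', 'the']
print(bidirectional_context(tokens, target))
# (['sat', 'on', 'the'], ['of', 'the', 'river'])
```

Only the second view contains "river", the word that tells a reader that "bank" means a riverbank rather than a financial institution.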

In 2019 BERT was adopted by Google Search for over 70 languages. Last year, almost every single English-based search query was processed by BERT.

In the next and final post of this series, we will delve deeper into the practical side of NLP in business and industry.

Gilles Deweerdt
