Document classification using (un)structured text

Document tagging is the process of adding extra information to documents, typically keywords. It works by applying tags to documents from a predefined list and simplifies the organisation and maintenance of documents and data. For instance, if you have open-ended feedback forms, you can find out how many responses mention your customer support and what percentage of visitors talk about pricing.
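As a minimal sketch of tagging from a predefined list, the feedback-form example could look as follows. The tag list, the trigger keywords and the sample responses are all hypothetical, and plain keyword matching is only the simplest possible approach, not a description of any particular product.

```python
import re
from collections import Counter

# Hypothetical predefined tag list: each tag maps to keywords that trigger it.
TAG_KEYWORDS = {
    "customer support": {"support", "helpdesk", "agent"},
    "pricing": {"price", "pricing", "expensive", "cheap"},
}

def tag_document(text, tag_keywords=TAG_KEYWORDS):
    """Return the sorted list of tags whose keywords occur in the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(tag for tag, keywords in tag_keywords.items() if words & keywords)

def tag_shares(documents, tag_keywords=TAG_KEYWORDS):
    """Return, per tag, the percentage of documents that carry it."""
    counts = Counter(tag for doc in documents for tag in tag_document(doc, tag_keywords))
    return {tag: 100 * counts[tag] / len(documents) for tag in tag_keywords}

# Made-up open-ended feedback responses.
feedback = [
    "The support agent solved my problem quickly.",
    "Too expensive for what you get.",
    "Great pricing, but the helpdesk was slow to reply.",
    "I love the new design.",
]
shares = tag_shares(feedback)
print(shares)
```

With these four sample responses, half mention customer support and half mention pricing; one response carries both tags and one carries none.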

Document classification lets you organise unstructured text collections into specific categories. Classifying large document collections is essential to make them more manageable and to obtain valuable insights. But for people, the incoming volume of data can be very hard to manage, and processing it by hand is tedious and inefficient.

Both processes used to be mainly performed manually, where people such as librarians/documentalists developed classification tools and mastered the art of efficiently classifying data or objects to preserve knowledge. However, this is slow and monotonous work and not practical in a business context where large volumes of documents are processed daily. As a result, many companies don’t tap into the potential of their unstructured text data, whether internal reports, customer interactions, service logs or case files.

Instead, automatic document classification through machine learning is much faster, more cost-efficient and, at high volumes, often more accurate. Using Natural Language Processing (NLP) algorithms, we can automatically process vast amounts of text.

Machine learning tools are faster and more scalable than manual classification because machines never get tired or bored. They might even discover novel categories of interest!

Why Use Document Classification/Tagging?

Businesses are drowning in unstructured data, and this is where automated processing can help:

  • Accelerated workflows at lower costs: Improve the throughput rate of your classification-heavy processes without increasing costs.
  • Compliance: Quickly and comprehensively scan documents for any sensitive information. Once identified, the software can redact the information.
  • Discovery: Automate the process of grouping documents and use this information to focus faster on the topics that matter the most to your company at any given time.
  • Migration: Take collections of documents and easily organise them, extract critical metadata, and apply it to simplify moving the documents into a content management system.
  • Reduce ROT (Redundant, Obsolete, Trivial): Identify duplicate documents and documents that don’t need to be preserved, and remove them.
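To make the ROT point concrete, here is a small sketch that flags duplicate documents by hashing their normalised text. The file names and contents are invented for illustration. This only catches copies that are identical after lowercasing and whitespace clean-up; near-duplicates with small wording changes would need a similarity measure instead.

```python
import hashlib
import re

def fingerprint(text):
    """Hash the text after lowercasing and collapsing whitespace."""
    normalised = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def find_duplicates(documents):
    """Map each duplicate document id to the id of the first copy seen."""
    first_seen = {}
    duplicates = {}
    for doc_id, text in documents.items():
        key = fingerprint(text)
        if key in first_seen:
            duplicates[doc_id] = first_seen[key]
        else:
            first_seen[key] = doc_id
    return duplicates

# Hypothetical document collection.
documents = {
    "report_v1.txt": "Quarterly report:  sales rose by 4%.",
    "report_copy.txt": "quarterly report: sales rose by 4%.\n",
    "minutes.txt": "Minutes of the board meeting.",
}
duplicates = find_duplicates(documents)
print(duplicates)
```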

Getting Started with Document Classification/Tagging

Like any other machine learning project, there are two essential but straightforward steps: gathering and analysing your dataset and training the algorithm.

Gather your dataset

Without data, you can’t train a model; without quality data, the model will not do a great job. When working with annotated data, the quality of the labels is critical. If most of the examples fed to the classifier are incorrectly tagged, the model will learn from these mistakes and commit similar errors when making predictions.
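One inexpensive label-quality check, sketched here under the assumption that annotations are stored as (text, label) pairs, is to look for identical texts that received different labels. The sample annotations are invented.

```python
from collections import defaultdict

def conflicting_labels(examples):
    """Return texts that appear more than once with different labels."""
    labels_by_text = defaultdict(set)
    for text, label in examples:
        # Normalise case and whitespace so trivial variations still match.
        labels_by_text[" ".join(text.lower().split())].add(label)
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }

# Hypothetical annotated examples.
examples = [
    ("Where is my invoice?", "billing"),
    ("where is my  invoice?", "support"),
    ("The app crashes on startup", "technical"),
]
conflicts = conflicting_labels(examples)
print(conflicts)
```

Texts flagged this way are good candidates for a second look by an annotator before training starts.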

Training the Algorithm

Different approaches are possible depending on the dataset that is available:

Supervised

In this method, a set of predefined categories or tags is created and part of the data is annotated manually. The machine learning model then tries to generalise the knowledge in this training set in order to classify or augment newly available data.

Advantage: 

  • When the annotations are correct and the training set is representative, the accuracy and performance of the model can be estimated reliably.

Disadvantage:

  • Documents first need to be manually assigned to categories, which can be time-consuming.
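The supervised workflow can be illustrated with a small multinomial Naive Bayes classifier written in plain Python. This is a sketch for illustration only: the categories, the training sentences and the held-out examples are invented, and a real project would normally use an established library and far more data.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        # Words never seen in training carry no information, so skip them.
        words = [w for w in tokenize(text) if w in self.vocabulary]
        scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            score = math.log(n_docs / total_docs)  # log prior
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)

# Hypothetical manually annotated training set.
train_texts = [
    "The invoice amount is wrong",
    "I was charged twice this month",
    "Please refund the payment",
    "The app crashes on startup",
    "I cannot log in to my account",
    "The page shows an error message",
]
train_labels = ["billing"] * 3 + ["technical"] * 3
model = NaiveBayesClassifier().fit(train_texts, train_labels)

# Held-out annotated examples give an estimate of the model accuracy.
held_out = [
    ("I was charged the wrong amount", "billing"),
    ("The app shows an error", "technical"),
]
correct = sum(model.predict(text) == label for text, label in held_out)
accuracy = correct / len(held_out)
print(accuracy)
```

Keeping some annotated documents out of training, as in `held_out` above, is what makes the accuracy estimate meaningful: the model is scored on text it has not seen.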

Unsupervised

This method groups documents containing similar content. Historically, this was mainly keyword-based, while more recent algorithms can grasp a larger context or more abstract concepts. You can use unsupervised classification as an exploratory tool (topic modelling) when you do not have a clear idea of the possible categories. One possible scenario is to use unsupervised classification to provide an initial set of categories and subsequently build on these through supervised classification.

Advantages:

  • No tags whatsoever need to be provided with the documents.
  • It helps to discover patterns and content similarities in a set of documents that might be overlooked.

Disadvantage: 

  • Clustering might result in unexpected groupings since the clustering operation is not user-defined but based on an algorithm.
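As a rough illustration of keyword-based grouping without labels, the sketch below uses single-pass clustering on the cosine similarity of word-count vectors. It is not a topic model; the stop-word list, the similarity threshold of 0.25 and the sample texts are arbitrary assumptions chosen for the example.

```python
import math
import re
from collections import Counter

# Small, hypothetical stop-word list.
STOPWORDS = {"the", "a", "an", "is", "my", "to", "of", "i", "on", "for", "and", "in"}

def vectorise(text):
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster(texts, threshold=0.25):
    """Single-pass clustering: join the most similar cluster or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(texts):
        vector = vectorise(text)
        best = max(clusters, key=lambda c: cosine(vector, c["centroid"]), default=None)
        if best is not None and cosine(vector, best["centroid"]) >= threshold:
            best["members"].append(index)
            best["centroid"].update(vector)  # summed counts act as the centroid
        else:
            clusters.append({"centroid": Counter(vector), "members": [index]})
    return [c["members"] for c in clusters]

texts = [
    "Invoice payment failed",
    "Password reset link expired",
    "Refund for the failed payment",
    "Cannot reset my password",
]
groups = cluster(texts)
print(groups)
```

The groups come out of the algorithm and the threshold rather than from user-defined categories, which is exactly why unexpected groupings can appear: a different threshold or document order may produce different clusters.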

 

Use case: Belgian Data Protection Authority

The Belgian Data Protection Authority (Belgian DPA) is responsible for compliance with the data protection regulations in Belgium. Every day, it deals with high volumes of questions from the public about data protection. It has prepared answers to frequently asked questions, interpretations, and clarifications on the law. However, specialists at the Belgian DPA spend a lot of time answering similar questions that have already been discussed in the past.

The problem

The Belgian Data Protection Authority receives questions and requests from citizens and organisations via email. An expert needs to link these questions to specific rulings and legislation, after which an answer is presented and published on the website.

Many of the incoming requests concern questions that have already been answered in earlier publications, and in these cases, the expert spends time simply linking the previous publication(s) to the incoming request.

The Belgian Data Protection Authority wanted to explore the possibility of using an intelligent agent integrated into a website, which could automate this. The agent would need to start a natural conversation and intelligently guide the conversation to the relevant publications and documents.

Our solution

We used machine learning to solve this problem. Firstly, we used topic modelling to gain insights into the structure of the data. This meant we could identify the relevant themes in the data protection legislation. We used this information to steer the questioning and conversation. Secondly, we created an intelligent agent. This was used to mine information and steer the conversation, functioning in several languages. If the interaction with the user did not resolve the question, the intelligent agent presented the user with a search function to retrieve relevant publications.
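The search fallback can be pictured with a very small TF-IDF ranking sketch. This is not the search engine built for the Belgian DPA, whose implementation is not described here; the publication titles and texts are invented, and the scoring is the textbook term-frequency times inverse-document-frequency idea.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def rank(query, publications):
    """Return publication titles ordered by TF-IDF overlap with the query."""
    tokens = {title: Counter(tokenize(body)) for title, body in publications.items()}
    n_docs = len(publications)
    # Document frequency: in how many publications each word occurs.
    doc_freq = Counter(word for counts in tokens.values() for word in counts)
    scores = {}
    for title, counts in tokens.items():
        score = 0.0
        for word in set(tokenize(query)):
            if word in counts:
                idf = math.log(1 + n_docs / doc_freq[word])  # rarer words weigh more
                score += counts[word] * idf
        if score > 0:
            scores[title] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical publications, keyed by title.
publications = {
    "Camera surveillance": "Rules for camera surveillance at home and at work",
    "Direct marketing": "Consent rules for direct marketing by email",
    "Data breach": "How to report a data breach to the authority",
}
results = rank("rules for email marketing", publications)
print(results)
```

Publications that share no words with the question are left out of the results entirely, so an unrelated question returns an empty list rather than a weak match.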

Work performed

  • Examining the underlying structure of document collections
  • Intelligent agent
  • High-precision search engine

How we added value

The experiment was valuable for learning how novel technologies could help decrease specialist workloads.

Key points to remember

Documents are some of the richest sources of information for any business. Be it articles, customer surveys, or support tickets, all of them contain valuable insights. The best way to get to these insights is by classifying all the data you receive so you can start making sense of it.

Manual classification of documents can be a nightmare, especially if the volume of information is high. In this scenario, labelling documents becomes repetitive, and humans will likely make mistakes.

Automated document classification with NLP is much more efficient and cost-effective than manual work, and often more accurate.

Gilles Deweerdt
