Definition of Automatic Summarization
According to Wikipedia, automatic summarization is the process of computationally shortening a set of data to create a subset (a summary) that represents the most important or relevant information within the original content.
How does text summarization work?
There are several computational methods used to perform this task. When it comes to summarizing textual data, they can be classified into two main approaches: extractive and abstractive summarization.
Extractive techniques aim to extract the relevant data from the original text without modifying the data itself. These kinds of algorithms prioritize already-highlighted elements of the text, such as headings or visual elements, as pivotal points for the summarization, as well as the first and last paragraphs of a document. Extractive methods have the advantage of repurposing actual content of the text as the output. This is the most researched and most widely used type of text summarization.
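The position-based heuristic described above can be sketched in a few lines. This is a toy illustration, not a production summarizer: the function name is invented for this example, and splitting on periods is a simplifying assumption about sentence boundaries.

```python
# Minimal sketch of a position-based extractive summarizer.
# Heuristic (assumed for illustration): the first and last
# sentences of a document are often the most informative.

def extract_summary(text, n=2):
    """Return up to n sentences taken verbatim from the text."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    if len(sentences) <= n:
        return ". ".join(sentences) + "."
    # Keep the first and last sentences, in their original order.
    picked = [sentences[0], sentences[-1]]
    return ". ".join(picked[:n]) + "."

doc = ("Automatic summarization shortens a document. "
       "Many methods exist. "
       "Extractive methods reuse the original sentences.")
print(extract_summary(doc))
```

Note that the output is made entirely of sentences copied from the input, which is exactly the defining property of extractive methods.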
Contrary to the previous approach, as the name already points out, abstractive methods aim to produce a summary by abstracting away from the information present in the text. That is, they aim to paraphrase the text instead of building a summary by chopping out what are considered the key parts of the original. It is not hard to see that this kind of approach is computationally much more demanding and challenging. Its advantage is that it aims to emulate human behaviour when summarizing a text, instead of limiting itself to reducing the text to a few key points.
Machine learning and text summarization
Generally speaking, we can classify machine learning algorithms into two different branches: supervised learning and unsupervised learning. Once again, the names hint at the difference between the two approaches.
In supervised learning, we aim to create an algorithm that searches for repeating patterns in the provided data. The key here is that this training procedure is performed on data that has already been labeled with the correct answers. Once training is done, the model applies what it has learned from the training data to any new input we give it. In its simplest form, a supervised learning algorithm is represented as Y = f(x): a mapping from inputs to outputs learned from labeled examples.
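The Y = f(x) formulation can be made concrete with a toy one-parameter model fitted to labeled pairs. Everything below (the data, the single-weight model) is illustrative, not a real summarization model:

```python
# Minimal illustration of supervised learning as Y = f(x):
# fit a one-parameter model f(x) = w * x to labeled pairs (x, y)
# by closed-form least squares, then apply it to unseen input.

def fit(xs, ys):
    # Least-squares solution for a single weight w.
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Labeled training data: inputs paired with their correct outputs.
xs, ys = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
w = fit(xs, ys)          # the model "learns" w = 2.0
print(w * 5.0)           # apply f to a new input: prints 10.0
```

The two phases mirror the description above: training consumes labeled data to learn f, and inference applies f to inputs the model has never seen.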
Most supervised learning algorithms may be used for text summarization, including decision trees and the Naive Bayes algorithm.
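As a hedged sketch of how Naive Bayes could be applied here, the classifier below labels sentences as summary-worthy (1) or not (0) from word counts. The two training sentences and their labels are toy data invented for this example, not a real corpus:

```python
# Toy Naive Bayes sentence classifier with Laplace smoothing.
from collections import Counter
import math

def train(sentences, labels):
    # Per-class word counts and class priors from labeled data.
    counts = {0: Counter(), 1: Counter()}
    priors = Counter(labels)
    for s, y in zip(sentences, labels):
        counts[y].update(s.lower().split())
    return counts, priors

def predict(sentence, counts, priors):
    vocab = set(counts[0]) | set(counts[1])
    scores = {}
    for y in (0, 1):
        total = sum(counts[y].values())
        score = math.log(priors[y])
        for w in sentence.lower().split():
            # Laplace smoothing so unseen words don't zero the score.
            score += math.log((counts[y][w] + 1) / (total + len(vocab)))
        scores[y] = score
    return max(scores, key=scores.get)

sents = ["the key result is improved accuracy",
         "we thank the reviewers for comments"]
labels = [1, 0]          # 1 = belongs in the summary (toy labels)
counts, priors = train(sents, labels)
print(predict("the key accuracy result", counts, priors))  # prints 1
```

In a real extractive system, each sentence classified as 1 would be copied into the output summary.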
The main hurdle of supervised models is the need to collect a lot of labeled data before training, and this data cannot easily be extrapolated to other domains. Unsupervised models try to solve these problems by removing the need for training data. In this case, unsupervised algorithms scan the text to determine intrinsic structural elements, which are then used to compute a summary of the data.
A well-known example of this type of algorithm is TextRank. TextRank computes the similarity between each sentence and all the other sentences in the same text; a sentence that is similar to many others is considered more representative of the text, and therefore more relevant for the summary. To achieve this, it is first necessary to generate a bag-of-words matrix with all the sentences and their words, which then needs to be normalized before running the TextRank algorithm on it.
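A minimal sketch of that pipeline is shown below, assuming whitespace tokenization and cosine similarity over raw word counts (the original TextRank paper uses a different sentence-overlap measure, so treat this as an approximation):

```python
# Toy TextRank: bag-of-words vectors, pairwise cosine similarity,
# then a PageRank-style power iteration so that sentences similar
# to many other sentences receive the highest scores.
import math
from collections import Counter

def cosine(a, b):
    common = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def textrank(sentences, d=0.85, iters=50):
    bags = [Counter(s.lower().split()) for s in sentences]
    n = len(bags)
    # Similarity matrix over all sentence pairs (zero diagonal).
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iters):
        new = []
        for i in range(n):
            rank = 0.0
            for j in range(n):
                out = sum(sim[j])  # total edge weight leaving j
                if sim[j][i] and out:
                    rank += sim[j][i] / out * scores[j]
            new.append((1 - d) / n + d * rank)
        scores = new
    return scores

sents = ["cats chase mice", "dogs chase cats", "stocks fell sharply"]
scores = textrank(sents)
print(scores.index(min(scores)))  # prints 2: the unrelated sentence
```

The two sentences that share vocabulary reinforce each other's scores, while the off-topic sentence ends up at the bottom of the ranking, which is the behaviour the paragraph above describes.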
NLP Libraries of Interest