Coherence Score of a Topic Model

The coherence score is a measure used to evaluate the quality of topics generated by a topic model. It quantifies the degree of semantic similarity between the top words that define a topic. A high coherence score suggests that a topic is semantically consistent and therefore easier to interpret. Here is an example.

Let us say you are building a topic model on a collection of news articles, and one of the topics generated has the following top words: Apple, Watermelon, Banana, Grape. Now, consider another topic with these top words: Apple, Computer, Watermelon, Mouse. The first topic clearly revolves around fruits. Its words are closely related to each other in meaning, so it will get a high coherence score. The second topic seems mixed. While “Apple” and “Watermelon” are fruits, “Computer” and “Mouse” are related to technology. Since these words are less semantically consistent as a group, this topic will receive a lower coherence score.

The coherence score is calculated by measuring pairwise similarity scores for the top words in a topic and then averaging them. The similarity scores are often based on word co-occurrence statistics in the text corpus.
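As a rough illustration, here is a minimal sketch of one co-occurrence-based variant (a UMass-style score): for each pair of top words, take the log of how often they co-occur in a document relative to how often the second word appears on its own, and average over all pairs. The toy corpus, word lists, and function name are my own, not part of any library.

```python
from itertools import combinations
import math

def umass_coherence(top_words, documents, eps=1e-12):
    """Average pairwise co-occurrence score (UMass-style) for a topic's top words."""
    doc_sets = [set(doc) for doc in documents]

    def df(*words):
        # Number of documents containing all the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for w1, w2 in combinations(top_words, 2):
        denom = df(w2)
        if denom == 0:
            continue  # w2 never appears in the corpus; skip the pair
        scores.append(math.log((df(w1, w2) + eps) / denom))
    return sum(scores) / len(scores) if scores else float("nan")

docs = [
    ["apple", "banana", "grape", "watermelon"],
    ["apple", "computer", "mouse", "keyboard"],
    ["banana", "grape", "smoothie", "watermelon"],
]
print(umass_coherence(["apple", "banana", "grape"], docs))     # coherent: fruits
print(umass_coherence(["apple", "computer", "banana"], docs))  # mixed: lower score
```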

Text Summary Using Paragraphs

The idea of extracting key paragraphs to form a summary was presented by Mitra et al. in the paper “Automatic Text Summarization by Paragraph Extraction”, in 2000. The key idea is to represent each paragraph as a vector, where each element corresponds to a word within that paragraph. For every pair of paragraphs, calculate a similarity score from the dot product of their vectors. Establish a threshold for the similarity score, and mark every pair of paragraphs exceeding this threshold as ‘connected’. Finally, identify the top N most connected paragraphs and arrange them in the sequence they occur in the original text to produce the summarized extract.
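Here is a rough sketch of that procedure in Python, assuming the text has already been split into paragraphs. The bag-of-words vectors, the threshold value, and the function name are simplifications chosen for illustration rather than the paper's exact setup.

```python
from collections import Counter

def summarize_paragraphs(paragraphs, threshold=2, top_n=3):
    """Pick the most 'connected' paragraphs and return them in original order."""
    # Bag-of-words vector for each paragraph.
    vectors = [Counter(p.lower().split()) for p in paragraphs]

    def dot(u, v):
        return sum(u[w] * v[w] for w in u if w in v)

    # Count, for each paragraph, how many others it is 'connected' to,
    # i.e. how many of its pairwise similarities exceed the threshold.
    connections = [0] * len(paragraphs)
    for i in range(len(paragraphs)):
        for j in range(i + 1, len(paragraphs)):
            if dot(vectors[i], vectors[j]) > threshold:
                connections[i] += 1
                connections[j] += 1

    # Top N most connected paragraphs, kept in the order they appear in the text.
    chosen = sorted(sorted(range(len(paragraphs)),
                           key=lambda i: connections[i], reverse=True)[:top_n])
    return [paragraphs[i] for i in chosen]
```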

Identifying Topics Using LSA

Latent Semantic Analysis (LSA) is a method for identifying the key topics in a set of documents. It was presented by Deerwester et al. in the paper “Indexing by Latent Semantic Analysis”, in 1990. First, a term-document matrix is constructed, where rows represent terms, columns represent documents, and each entry indicates the frequency of a term in a document. Next, the term-document matrix is decomposed into three matrices using an operation called Singular Value Decomposition (SVD). The three matrices are a term-topic matrix, a topic-topic matrix, and a topic-document matrix. All values except the diagonal elements of the topic-topic matrix are zero.

By keeping only the top N values in the topic-topic matrix, we can identify the top N topics in the collection. From the term-topic matrix, the truncated topic-topic matrix, and the topic-document matrix, we can measure the strength of the relationship between each term and a topic, and between each document and a topic. In other words, we can identify the set of terms associated with each topic, and the topics discussed in each document.
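A small numerical sketch of the decomposition, using NumPy's SVD on a toy term-document matrix (the terms, counts, and choice of N are invented for illustration):

```python
import numpy as np

# Toy term-document matrix: rows are terms, columns are documents, and each
# entry is a term frequency. The terms and counts are made up for illustration.
terms = ["apple", "banana", "computer", "mouse"]
A = np.array([
    [3, 2, 0, 0],   # apple
    [2, 3, 0, 0],   # banana
    [0, 0, 4, 1],   # computer
    [0, 0, 1, 3],   # mouse
], dtype=float)

# Singular Value Decomposition: A = U @ diag(S) @ Vt
U, S, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the top N singular values, i.e. the top N topics.
N = 2
U_n, S_n, Vt_n = U[:, :N], S[:N], Vt[:N, :]

# U_n relates terms to topics; Vt_n relates topics to documents.
for k in range(N):
    strongest = [terms[i] for i in np.argsort(-np.abs(U_n[:, k]))[:2]]
    print(f"topic {k}: strongest terms {strongest}, document weights {np.round(Vt_n[k], 2)}")
```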

Topic Shift Detection Using Words

An algorithm to detect topic boundaries in a text was presented by Kozima in his 1993 paper “Text Segmentation Based on Similarity Between Words”. The method detects topic boundaries using a concept called the lexical cohesion profile, aka LCP.

The LCP is derived by moving a fixed-width window across the text and calculating the cohesiveness of the words within that window. The cohesiveness of a set of words is scored based on the semantic relationships between the words and their significance in a corpus. If the words within the window come from the same segment (or topic), they are expected to be cohesive, leading to a higher value in the LCP. Conversely, if the window spans a segment boundary, the words will be less cohesive, resulting in a dip or valley in the LCP. Plot the LCP and identify the minimum points in the plot; these points correspond to the topic or segment boundaries.
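Kozima scores cohesiveness by spreading activation over a semantic network built from a dictionary, which is too involved to reproduce here. The sketch below keeps only the mechanics of the method, the sliding window and the valley detection, and substitutes a crude stand-in for cohesiveness: the fraction of words in the window that are repeated within it.

```python
def lexical_cohesion_profile(tokens, window=25):
    """Slide a fixed-width window over the token list, score each position, and
    report local minima (valleys) as candidate segment boundaries.
    The score used here is a crude proxy for Kozima's cohesiveness: the fraction
    of tokens in the window that also occur elsewhere in the same window."""
    profile = []
    for start in range(len(tokens) - window + 1):
        w = tokens[start:start + window]
        repeated = sum(1 for t in w if w.count(t) > 1)
        profile.append(repeated / window)

    # A valley in the profile suggests a topic boundary near the window's centre.
    valleys = [
        i + window // 2
        for i in range(1, len(profile) - 1)
        if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]
    ]
    return profile, valleys
```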

Experiments revealed that a fixed-width window of 25 words yielded the best results. Applying this algorithm to O. Henry’s “Springtime à la Carte” shows that the LCP’s valleys mostly align with the topic boundaries identified by human readers.

Finding Topic Boundaries

A method to identify topic boundaries using lexical cohesion between sentences was presented by Jeffrey C. Reynar in the paper “An Automatic Method of Finding Topic Boundaries”. The core idea is that when a set of sentences or a passage of text revolves around the same topic, the sentences are more likely to share common vocabulary. One can identify topic shifts, or boundaries, by finding points where the lexical cohesion drops.

To implement this, the text is divided into blocks of sentences. These blocks are then compared with each other to calculate a cohesion score. The idea is to measure how much vocabulary is shared between adjacent blocks. If two consecutive blocks share a lot of vocabulary, their cohesion score will be high. On the other hand, if they don’t share much vocabulary, the cohesion score will be low, potentially indicating a topic boundary.
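Here is a minimal sketch of that block comparison, assuming the text is already split into sentences. The block size and the overlap measure are my own simplifications, not necessarily the paper's exact choices.

```python
def cohesion_scores(sentences, block_size=3):
    """Score each gap between adjacent blocks of sentences by vocabulary overlap.
    Gaps with the lowest scores are candidate topic boundaries."""
    def vocab(block):
        return {word.lower() for sent in block for word in sent.split()}

    scores = []
    for i in range(block_size, len(sentences) - block_size + 1):
        left = vocab(sentences[i - block_size:i])
        right = vocab(sentences[i:i + block_size])
        # Shared vocabulary, normalised by the smaller block's vocabulary size.
        overlap = len(left & right) / max(1, min(len(left), len(right)))
        scores.append((i, overlap))  # a boundary candidate before sentence i
    return scores
```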

Text Summarization using Simplified Lesk

This algorithm for generating text summaries was presented by Pal et al. in 2013, in the paper “An Approach to Automatic Text Summarization Using Simplified Lesk Algorithm and WordNet”. The key idea is to score each sentence by taking its words, looking up their meanings in the WordNet dictionary, and measuring how much those meanings overlap with the words of the original text.

Begin with the first sentence in the text. For each word in this sentence, find its meaning in WordNet, a lexical database of English. Count how many words are shared between the WordNet meaning and the original text; this count is the word’s score. Add up the scores of all the words in a sentence to get the sentence’s score. Do this for every sentence in the text, so that each sentence has a score. Finally, select the sentences with the highest scores to form the summary.
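A rough sketch of this procedure using NLTK's WordNet interface is shown below. It treats the gloss of a word's first synset as the word's 'meaning', and splits sentences naively on full stops; both are simplifications for illustration.

```python
# Requires NLTK and its WordNet data: pip install nltk, then nltk.download("wordnet").
from nltk.corpus import wordnet as wn

def word_score(word, text_words):
    """Overlap between the word's WordNet gloss (first synset) and the text's words."""
    synsets = wn.synsets(word)
    if not synsets:
        return 0
    gloss_words = set(synsets[0].definition().lower().split())
    return len(gloss_words & text_words)

def summarize(text, n_sentences=2):
    # Naive sentence splitting on full stops, purely for illustration.
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    text_words = set(text.lower().split())

    def sentence_score(sentence):
        return sum(word_score(w, text_words) for w in sentence.lower().split())

    ranked = sorted(sentences, key=sentence_score, reverse=True)
    return ". ".join(ranked[:n_sentences]) + "."
```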

Summarizer Using Naive Bayes

The idea of using a Naive Bayes classifier to identify sentences that should be included in a summary was presented by Kupiec et al. in 1995, in the paper titled “A Trainable Document Summarizer”.

The idea is to extract the key features of each sentence: length, all two-word combinations aka bigrams, whether the sentence is present in the first five or the last five paragraphs, the location of the sentence within its paragraph (beginning, middle, end), etc. Then, using an existing set of documents and their manual summaries, three probabilities are estimated: the probability of observing the sentence’s features given that it is in the summary, the probability of a sentence being in the manual summary, and the probability of observing the sentence’s features in the document overall. By combining these three probabilities using Bayes’ rule, we can compute, for each sentence, the probability that it belongs in the summary. The highest-scoring sentences form the summary.
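The combination step can be written out directly. The sketch below applies Bayes' rule under the naive independence assumption; the feature names and probability values are placeholders, since in the paper they are estimated from a training corpus of documents paired with manual summaries.

```python
import math

def summary_score(features, p_feat_given_summary, p_feat, p_summary):
    """Kupiec-style score under the naive independence assumption:
    P(s in summary | F1..Fk) is proportional to
    P(s in summary) * product of P(Fj | s in summary) / P(Fj).
    Computed in log space to avoid underflow."""
    log_score = math.log(p_summary)
    for f in features:
        log_score += math.log(p_feat_given_summary[f]) - math.log(p_feat[f])
    return log_score

# Placeholder probabilities, for illustration only. In practice they are
# estimated from documents paired with manual summaries.
p_given_summary = {"short_sentence": 0.1, "first_paragraph": 0.6, "has_cue_phrase": 0.4}
p_overall = {"short_sentence": 0.3, "first_paragraph": 0.2, "has_cue_phrase": 0.1}

# Score a sentence that sits in an opening paragraph and contains a cue phrase.
print(summary_score(["first_paragraph", "has_cue_phrase"],
                    p_given_summary, p_overall, p_summary=0.2))
```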

Luhn’s Algorithm To Generate Abstracts

Here is a simplified explanation of Luhn’s algorithm to generate abstracts from a text. This is based on his paper titled “Automatic Creation of Literature Abstracts” from 1958.

Find the frequency of occurrence of each word in the document, and order the words by frequency. Pick a top cutoff and get rid of all words that appear more frequently than the top cutoff. This will get rid of common words like “a” and “the.” Pick a bottom cutoff and get rid of all words that appear less frequently than the bottom cutoff. This will get rid of unimportant words. Write down the words that are left. These words are important.

For each sentence in the text, look for groups of important words. A group is a run of important words with no more than four unimportant words between them. Using the number of important words in a group and the total number of words it spans, compute a “score” for the group (Luhn uses the square of the number of important words divided by the span); a sentence’s score is that of its highest-scoring group. This score tells you how important the sentence is. Pick the sentences with the highest scores. These sentences will give you the abstract.
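Here is a rough sketch of the whole procedure; the cutoff values, the gap of four words, and the naive sentence splitting are illustrative choices, not Luhn's exact parameters.

```python
from collections import Counter

def luhn_summary(text, common_fraction=0.2, min_count=2, gap=4, n_sentences=2):
    """Rough sketch of Luhn's method: keep mid-frequency 'important' words,
    cluster them within each sentence, and score sentences by their best group."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    freq = Counter(text.lower().split())

    # Top cutoff: drop the most frequent words (a crude stand-in for a stop list).
    # Bottom cutoff: drop words that appear fewer than min_count times.
    ranked = [w for w, _ in freq.most_common()]
    too_common = set(ranked[: int(len(ranked) * common_fraction)])
    important = {w for w, c in freq.items() if c >= min_count and w not in too_common}

    def score(sentence):
        tokens = sentence.lower().split()
        flags = [t in important for t in tokens]
        best, i = 0.0, 0
        while i < len(tokens):
            if not flags[i]:
                i += 1
                continue
            # Grow a group: keep extending while the next important word is at
            # most `gap` unimportant words away from the last one.
            last, j = i, i
            while j + 1 < len(tokens) and (j + 1) - last <= gap + 1:
                j += 1
                if flags[j]:
                    last = j
            group = flags[i:last + 1]
            best = max(best, sum(group) ** 2 / len(group))
            i = last + 1
        return best

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    # Keep the chosen sentences in their original order.
    return ". ".join(s for s in sentences if s in top) + "."
```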