
Kozima presented an algorithm for detecting topic boundaries in a text in his 1993 paper “Text Segmentation Based on Similarity Between Words”. The paper introduces a method that detects topic boundaries using a concept called the lexical cohesion profile (LCP).
The LCP is derived by moving a fixed-width window across the text and calculating the cohesiveness of the words within that window at each position. The cohesiveness of a set of words is scored based on the semantic relationships between the words and their significance in a corpus. If the words within the window come from the same segment (or topic), they are expected to be cohesive, leading to a higher value in the LCP. Conversely, if the window spans a segment boundary, the words will be less cohesive, resulting in a dip or valley in the LCP. Plotting the LCP and locating its local minima therefore identifies the topic or segment boundaries.
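The sliding-window procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the paper's implementation: the cohesion score here is a plain mean of pairwise similarities, and the similarity function `same_topic`, the `TOPIC` lookup table, and the twelve-word sample text are all hypothetical stand-ins for the semantic similarity measure the paper relies on. The window width is shrunk to 4 so the toy input stays readable.

```python
from itertools import combinations

def cohesion(window, similarity):
    """Mean pairwise similarity of the words in one window (a stand-in score)."""
    pairs = list(combinations(window, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

def lexical_cohesion_profile(words, similarity, width=25):
    """Slide a fixed-width window over the text one word at a time."""
    if len(words) < width:
        return []
    return [cohesion(words[i:i + width], similarity)
            for i in range(len(words) - width + 1)]

def valleys(profile):
    """Indices of strict local minima in the profile."""
    return [i for i in range(1, len(profile) - 1)
            if profile[i] < profile[i - 1] and profile[i] < profile[i + 1]]

# Hypothetical toy data: each word is hand-labelled with a topic.
TOPIC = {
    "menu": "food", "soup": "food", "bread": "food",
    "waiter": "food", "dinner": "food", "salad": "food",
    "typewriter": "typing", "keys": "typing", "ribbon": "typing",
    "paper": "typing", "copy": "typing", "ink": "typing",
}

def same_topic(a, b):
    """Toy similarity (an assumption): 1.0 if both words share a topic label."""
    return 1.0 if TOPIC[a] == TOPIC[b] else 0.0

words = list(TOPIC)  # six "food" words followed by six "typing" words
width = 4
profile = lexical_cohesion_profile(words, same_topic, width=width)

# Map each valley (a window start index) to the word at the window's centre.
boundaries = [i + width // 2 for i in valleys(profile)]
print(profile)
print([words[b] for b in boundaries])
```

On this toy input the profile stays high while the window sits inside one topic and drops to a single valley when the window straddles the switch from the first topic to the second. A real similarity measure produces a much noisier profile, so some smoothing or a minimum valley depth would likely be needed before treating every local minimum as a boundary.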
Experiments revealed that a fixed-width window of 25 words yielded the best results. Applied to O. Henry’s “Springtime à la Carte”, the algorithm produces an LCP whose valleys mostly align with the topic boundaries identified by human readers.