Summarizer Using Naive Bayes

The idea of using a Naïve Bayes classifier to identify the sentences that should be included in a summary was presented by Kupiec et al. in 1995, in the paper titled "A Trainable Document Summarizer".

The approach extracts the key features of each sentence: its length, its two-word combinations (bigrams), whether it appears in the first five or the last five paragraphs, its position within the paragraph (beginning, middle, or end), and so on. Then, using an existing set of documents and their manual summaries, three probabilities are estimated: the probability of observing the sentence's features given that the sentence is in the summary, the probability of a sentence being in the manual summary, and the probability of observing the sentence's features anywhere in the document. Combining these three probabilities with Bayes' rule gives, for each sentence, the probability that it belongs in the summary. The sentences with the highest scores form the summary.
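
The scoring step can be illustrated with a minimal sketch. It is not the implementation from the paper: the feature names (is_long, in_edge_paragraph), the toy training data, and the add-alpha smoothing are all assumptions made for illustration, and every feature is treated as binary. Because the features are assumed independent, the result is best read as a ranking score rather than a calibrated probability.

```python
from collections import Counter
from math import prod


def train(labeled):
    """Count feature values over all sentences and over summary sentences.

    labeled: list of (features, in_summary) pairs, where features maps a
    binary feature name (hypothetical, chosen for this sketch) to a bool.
    """
    counts_all = Counter()
    counts_summary = Counter()
    n_summary = 0
    for features, in_summary in labeled:
        n_summary += in_summary
        for item in features.items():
            counts_all[item] += 1
            if in_summary:
                counts_summary[item] += 1
    return {"n": len(labeled), "n_summary": n_summary,
            "all": counts_all, "summary": counts_summary}


def score(model, features, alpha=1.0):
    """Return P(in summary) * prod P(f | in summary) / prod P(f)."""
    n, n_s = model["n"], model["n_summary"]
    prior = n_s / n
    # Add-alpha smoothing is an assumption of this sketch; the factor 2
    # reflects that each feature takes one of two values.
    likelihood = prod((model["summary"][item] + alpha) / (n_s + 2 * alpha)
                      for item in features.items())
    evidence = prod((model["all"][item] + alpha) / (n + 2 * alpha)
                    for item in features.items())
    return prior * likelihood / evidence


def summarize(model, sentences, k):
    """Return the k highest-scoring sentences, kept in document order.

    sentences: list of (text, features) pairs.
    """
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(model, sentences[i][1]),
                    reverse=True)
    return [sentences[i][0] for i in sorted(ranked[:k])]


# Toy training set: (features, was the sentence in the manual summary?)
training = [
    ({"is_long": True, "in_edge_paragraph": True}, True),
    ({"is_long": True, "in_edge_paragraph": False}, True),
    ({"is_long": False, "in_edge_paragraph": False}, False),
    ({"is_long": False, "in_edge_paragraph": True}, False),
    ({"is_long": False, "in_edge_paragraph": False}, False),
    ({"is_long": True, "in_edge_paragraph": False}, False),
]
model = train(training)
```

With this toy data, a long sentence from an edge paragraph scores higher than a short sentence from the middle of the document, so summarize would pick it first. In practice the features would be computed from the text itself, and multi-valued features such as position within the paragraph would need the smoothing denominator adjusted to their number of values.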