Here is a super-short intro to the concept of text similarity. There are many ways to quantify the similarity between two pieces of text. One popular approach is the Jaccard similarity. Consider these two sentences:
- I like movies
- I enjoy watching movies
The number of words that occur in both sentences is two (“I”, “movies”). The total number of unique words across the two sentences is five (“I”, “like”, “movies”, “enjoy”, “watching”). The Jaccard similarity score is the first number divided by the second: 2/5, or 40% similarity. The idea of text similarity has several applications. For example, given a question and a collection of sentences that are potential answers to the question, a simple approach is to pick the sentence with the highest similarity score as the most likely answer.
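The calculation above, and the answer-picking idea, can be sketched in a few lines of Python. This is only a minimal illustration under some assumptions that are not part of the description above: words are obtained by lowercasing and splitting on whitespace (so punctuation is not handled), and the function names `jaccard_similarity` and `best_answer` are made up for this example.

```python
def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Shared unique words divided by total unique words."""
    # Assumption: lowercase + whitespace split is a good enough tokenizer here.
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    union = words_a | words_b
    if not union:
        # Both texts are empty; treat them as having no similarity.
        return 0.0
    return len(words_a & words_b) / len(union)


def best_answer(question: str, candidates: list[str]) -> str:
    """Return the candidate sentence most similar to the question."""
    return max(candidates, key=lambda c: jaccard_similarity(question, c))


score = jaccard_similarity("I like movies", "I enjoy watching movies")
print(score)

answer = best_answer(
    "which movies do you like",
    ["I like action movies", "the weather is nice today"],
)
print(answer)
```

For the two example sentences, the score comes out to 2/5, matching the hand calculation. A real system would likely need better tokenization (stripping punctuation, handling word variants such as "like" and "liked") before the scores become useful.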
