Identifying Topics using LSA

Latent Semantic Analysis (LSA) is a method for identifying the key topics in a set of documents. It was presented by Deerwester and others in the 1990 paper “Indexing by Latent Semantic Analysis”. First, a term-document matrix is constructed, in which rows represent words, columns represent documents, and each entry records the frequency of that term in that document. Next, the term-document matrix is decomposed into three matrices using an operation called Singular Value Decomposition (SVD): a term-topic matrix, a topic-topic matrix, and a topic-document matrix. The topic-topic matrix is diagonal, so all of its off-diagonal entries are zero; its diagonal entries are the singular values, which indicate the relative weight of each topic.
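The construction and decomposition described above can be sketched in a few lines of Python with NumPy. This is only a minimal illustration, not the procedure from the original paper: the toy corpus, the whitespace tokenisation, and the variable names (`docs`, `vocab`, `A`, `U`, `S`, `Vt`) are all assumptions made for the example.

```python
import numpy as np

# Hypothetical toy corpus, chosen only for illustration.
docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cat and dog sat on the mat",
    "stocks fell as markets closed",
    "markets rallied as stocks rose",
]

# Vocabulary: one row of the matrix per distinct word.
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Term-document matrix: entry (i, j) counts word i in document j.
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

# Thin SVD: A = U @ S @ Vt, with singular values sorted largest first.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)  # the diagonal "topic-topic" matrix

print("term-topic matrix:    ", U.shape)
print("topic-topic matrix:   ", S.shape)
print("topic-document matrix:", Vt.shape)
print("product rebuilds A:   ", np.allclose(A, U @ S @ Vt))
```

Here `U` plays the role of the term-topic matrix, `S` the topic-topic matrix, and `Vt` the topic-document matrix. Real applications usually weight the raw counts (for example with TF-IDF) before the decomposition; that step is left out to keep the sketch close to the description above.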

By keeping only the N largest values in the topic-topic matrix, we can identify the top N topics in the document set. From the term-topic matrix, the truncated topic-topic matrix, and the topic-document matrix, we can then measure how strongly each term and each document is related to each topic. In other words, we can identify the set of terms associated with each topic, and the topics discussed in each document.
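As a rough sketch of this truncation step, the Python example below keeps the N largest singular values and reads off the strongest terms for each topic and the dominant topic for each document. The toy corpus, the choice N = 2, and the decision to measure strength as the absolute value of the entries scaled by the singular values are assumptions made for illustration; other weightings are possible.

```python
import numpy as np

# Hypothetical toy corpus, chosen only for illustration.
docs = [
    "cat sat on the mat",
    "dog sat on the log",
    "cat and dog sat on the mat",
    "stocks fell as markets closed",
    "markets rallied as stocks rose",
]
vocab = sorted({w for d in docs for w in d.split()})
row = {w: i for i, w in enumerate(vocab)}

# Term-document count matrix (rows: words, columns: documents).
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[row[w], j] += 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the N largest singular values (N = 2 is an assumed choice).
N = 2
U_n, s_n, Vt_n = U[:, :N], s[:N], Vt[:N, :]

# Strength of each term/topic and topic/document relationship.
# The sign of a singular vector is arbitrary, so absolute values are used.
term_strength = np.abs(U_n * s_n)           # shape: terms x topics
doc_strength = np.abs(s_n[:, None] * Vt_n)  # shape: topics x documents

# Terms most strongly associated with each topic.
top_terms = {}
for k in range(N):
    best = np.argsort(term_strength[:, k])[::-1][:3]
    top_terms[k] = [vocab[i] for i in best]
    print(f"topic {k}: {top_terms[k]}")

# Dominant topic of each document.
doc_topic = doc_strength.argmax(axis=0)
for j, d in enumerate(docs):
    print(f"doc {j} -> topic {doc_topic[j]}: {d}")
```

On this corpus the two groups of documents share no words, so each retained topic lines up with one group. With real text the topics overlap, and a document typically has non-trivial strength on several topics.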

Published by eananth