Question
The term frequency matrix for five articles (A1 to A5) is shown below.
Answer the following questions:
1) What is the TF-IDF value for (A4, Corona)?
2) Find the cosine similarity between each pair of articles. Identify the two articles that are most similar.
Solution
==================================================================================================
Ans 1: Find the TF-IDF Value
Let t = term = ‘Corona’, d = article = A4
TF(t,d) = 0.63
N = number of articles in the corpus = 5
df(t) = number of articles in which t occurs
Corona occurs in A1 and A4, so df(Corona) = |{A1, A4}| = 2
IDF(Corona) = 0.74
TF-IDF(Corona, A4) = 0.63 * 0.74 = 0.47
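The arithmetic above can be checked with a short Python sketch. The IDF formula itself is not visible in the solution, so the form used here, log2(N / (df + 1)), is an assumption: it is a common smoothed variant and it reproduces the stated 0.74 for N = 5 and df = 2. The TF value 0.63 is taken directly from the solution, and the function name idf_smoothed is illustrative only.

```python
import math


def idf_smoothed(n_docs, doc_freq):
    """Smoothed IDF, ASSUMED form: log2(N / (df + 1)).

    This variant is an assumption chosen because it matches the
    solution's stated value; the original formula is not shown.
    """
    return math.log2(n_docs / (doc_freq + 1))


n_docs = 5            # N: number of articles in the corpus
df_corona = 2         # Corona appears in A1 and A4
tf_corona_a4 = 0.63   # TF(Corona, A4), taken as given from the solution

idf_corona = idf_smoothed(n_docs, df_corona)

# The solution multiplies the already-rounded factors.
tfidf_from_rounded = round(tf_corona_a4 * round(idf_corona, 2), 2)

# Multiplying before rounding gives a slightly different result.
tfidf_unrounded = round(tf_corona_a4 * idf_corona, 2)

print(round(idf_corona, 2))   # 0.74
print(tfidf_from_rounded)     # 0.47
print(tfidf_unrounded)        # 0.46
```

Under this assumed formula, the stated 0.47 comes from multiplying the rounded factors 0.63 and 0.74. Keeping full precision until the end gives 0.46 instead.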
Ans 2: Find the Most Similar Articles Using Cosine Similarity
Calculating Cosine Similarity across articles:
Based on Cosine Similarity, the most similar articles are A2 and A3.
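The pairwise comparison can be sketched in Python as follows. Cosine similarity between two term vectors is their dot product divided by the product of their Euclidean norms. The term frequency matrix from the question is not reproduced here, so the vectors below are hypothetical placeholders (documents D1 to D3), not the values for A1 to A5. The helper names cosine_similarity and most_similar_pair are also illustrative.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention used here for an all-zero vector
    return dot / (norm_u * norm_v)


def most_similar_pair(vectors):
    """Return the pair of document names with the highest cosine similarity."""
    pairs = combinations(sorted(vectors), 2)
    return max(
        pairs,
        key=lambda p: cosine_similarity(vectors[p[0]], vectors[p[1]]),
    )


# HYPOTHETICAL placeholder vectors, NOT the matrix from the question.
example_vectors = {
    "D1": [1, 0, 2],
    "D2": [2, 0, 4],
    "D3": [0, 3, 0],
}

best = most_similar_pair(example_vectors)
print(best)
```

To reproduce the answer above, replace example_vectors with the rows of the actual term frequency (or TF-IDF) matrix, keyed by A1 to A5.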