Question

The term frequency matrix for five articles (A1 to A5) is shown below.

    \[ \begin{tabular}{|c|c|c|c|c|c|} \hline
    Articles & Trump & JNU & AAP & Corona & Divestiture\\ \hline
    A1 & 14 & 1 & 0 & 6 & 3\\ \hline
    A2 & 0 & 21 & 5 & 0 & 0\\ \hline
    A3 & 0 & 15 & 18 & 0 & 5\\ \hline
    A4 & 5 & 2 & 0 & 12 & 0\\ \hline
    A5 & 0 & 0 & 5 & 0 & 0\\ \hline
    \end{tabular} \]

Answer the following questions:
1) What is the TF-IDF value for (A4, Corona)?
2) Find the cosine similarity between each pair of articles. Identify the two articles that are the most similar.

Solution

==================================================================================================

Ans 1: Find the TF-IDF Value

    \begin{align*}  \boxed{tf(t, d) = \frac{count \ of \ t \ in \ d}{no. \ of \ words \ in \ d} } \end{align*}

Let t = term = ‘Corona’, d = article = A4

Total terms counted in A4 = 5 + 2 + 0 + 12 + 0 = 19
TF(t,d) = \frac{12}{19} = 0.63

    \begin{align*}  \boxed{IDF(t) = \log_2 \left[ \frac{N}{df + 1} \right] } \end{align*}

N = number of documents in the corpus = 5
df(t) = number of documents that contain t

Corona appears in A1 and A4, so df(Corona) = 2
IDF(Corona) = \log_2[\frac{5}{2+1}] = \log_2(1.67) = 0.74 (logarithm taken to base 2)

    \begin{align*}  \boxed{\text{TF-IDF}(t, d) = tf(t, d) * \log_2 \left[ \frac{N}{df + 1} \right] = TF * IDF } \end{align*}

TF-IDF(Corona, A4) = 0.63 * 0.74 = 0.47
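The arithmetic above can be checked with a short Python sketch. It assumes the counts are exactly the rows of the term frequency matrix in the question, that a document's length is the sum of its five tracked term counts, and that the logarithm is base 2 (an inference from the value 0.74, not something the question states). The names `TERMS`, `COUNTS`, `tf`, `idf` and `tf_idf` are illustrative only.

```python
import math

# Rows copied from the term frequency matrix in the question.
TERMS = ["Trump", "JNU", "AAP", "Corona", "Divestiture"]
COUNTS = {
    "A1": [14, 1, 0, 6, 3],
    "A2": [0, 21, 5, 0, 0],
    "A3": [0, 15, 18, 0, 5],
    "A4": [5, 2, 0, 12, 0],
    "A5": [0, 0, 5, 0, 0],
}

def tf(term, doc):
    """Count of the term in the document divided by the document's total count."""
    row = COUNTS[doc]
    return row[TERMS.index(term)] / sum(row)

def idf(term):
    """log2(N / (df + 1)); base 2 is an assumption inferred from the worked value."""
    col = TERMS.index(term)
    df = sum(1 for row in COUNTS.values() if row[col] > 0)
    return math.log2(len(COUNTS) / (df + 1))

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

print(round(tf("Corona", "A4"), 2))      # 0.63
print(round(idf("Corona"), 2))           # 0.74
print(round(tf_idf("Corona", "A4"), 2))  # 0.47
```

With a natural or base-10 logarithm the IDF, and therefore the TF-IDF value, would be different, so the base should be stated whenever these numbers are compared.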

Ans 2: Find the Most Similar Articles Using Cosine Similarity

    \begin{align*}  \boxed{Cosine \ Similarity = \frac{A.B}{\lVert A \rVert . \lVert B \rVert} } \end{align*}

Calculating Cosine Similarity across articles:

  • CS(A1, A2) = \frac{(14,1,0,6,3).(0,21,5,0,0)}{\sqrt{14^2 + 1^2 + 6^2 + 3^2} . \sqrt{21^2 + 5^2}} = \frac{21}{15.56 * 21.59} = 0.063
  • CS(A1, A3) = \frac{(14,1,0,6,3).(0,15,18,0,5)}{15.56 * \sqrt{15^2 + 18^2 + 5^2}} = \frac{30}{15.56 * 23.96} = 0.08
  • CS(A1, A4) = \frac{(14,1,0,6,3).(5,2,0,12,0)}{15.56 * \sqrt{5^2 + 2^2 + 12^2}} = \frac{144}{15.56 * 13.15} = 0.7
  • CS(A1, A5) = \frac{(14,1,0,6,3).(0,0,5,0,0)}{15.56 * \sqrt{5^2}} = \frac{0}{15.56 * 5} = 0
  • CS(A2, A3) = \frac{(0,21,5,0,0).(0,15,18,0,5)}{21.59 * 23.96} = \frac{405}{21.59 * 23.96} = 0.78
  • CS(A2, A4) = \frac{(0,21,5,0,0).(5,2,0,12,0)}{21.59 * 13.15} = \frac{42}{21.59 * 13.15} = 0.15
  • CS(A2, A5) = \frac{(0,21,5,0,0).(0,0,5,0,0)}{21.59 * 5} = \frac{25}{21.59 * 5} = 0.23
  • CS(A3, A4) = \frac{(0,15,18,0,5).(5,2,0,12,0)}{23.96 * 13.15} = \frac{30}{23.96 * 13.15} = 0.095
  • CS(A3, A5) = \frac{(0,15,18,0,5).(0,0,5,0,0)}{23.96 * 5} = \frac{90}{23.96 * 5} = 0.75
  • CS(A4, A5) = \frac{(5,2,0,12,0).(0,0,5,0,0)}{13.15 * 5} = \frac{0}{13.15 * 5} = 0

Based on cosine similarity, the most similar articles are A2 and A3, whose score of 0.78 is the highest of the ten pairs.
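The pairwise comparison can also be reproduced programmatically. The following is a minimal sketch, assuming the article vectors are exactly the rows of the term frequency matrix in the question and that no article is all zeros (otherwise the norm would be zero and the division undefined). The names `ARTICLES`, `cosine_similarity`, `scores` and `best_pair` are illustrative only.

```python
import math
from itertools import combinations

# Rows copied from the term frequency matrix in the question
# (columns: Trump, JNU, AAP, Corona, Divestiture).
ARTICLES = {
    "A1": [14, 1, 0, 6, 3],
    "A2": [0, 21, 5, 0, 0],
    "A3": [0, 15, 18, 0, 5],
    "A4": [5, 2, 0, 12, 0],
    "A5": [0, 0, 5, 0, 0],
}

def cosine_similarity(a, b):
    """Dot product of a and b divided by the product of their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Score every unordered pair of articles (10 pairs for 5 articles).
scores = {
    (p, q): cosine_similarity(ARTICLES[p], ARTICLES[q])
    for p, q in combinations(ARTICLES, 2)
}

# Print the pairs from most to least similar.
for pair, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(pair, round(score, 2))

best_pair = max(scores, key=scores.get)
print("most similar:", best_pair)  # ('A2', 'A3')
```

Using raw term counts means long articles are not penalised, because cosine similarity depends only on the direction of each vector, not on its length.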
