Question

An organization has one million documents in its repository. A document X has the term ‘Mining’ occurring 4 times and the term ‘Discovery’ occurring 5 times. Other words occur less frequently. Altogether, 100 documents contain the term ‘Mining’ and 1000 documents contain the term ‘Discovery’.

Calculate TF-IDF values for both terms w.r.t. document X.

Solution

==================================================================================================

Before we start, let’s write down the terms.

  • t_{1} = term 1 = ‘Mining’
  • t_{2} = term 2 = ‘Discovery’
  • N = number of documents in the corpus = 10^6
  • tf_{1} = term frequency of term 1 in document X
  • tf_{2} = term frequency of term 2 in document X
  • df(t) = number of documents that contain term t
  • df(t_{1}) = 100
  • df(t_{2}) = 1000

Assumption: The question does not give the length of document X, so let the number of words in X be 100.

Calculate TF:

    \begin{align*}  \boxed{tf(t, d) = \frac{count \ of \ t \ in \ d}{no. \ of \ words \ in \ d} } \end{align*}

  • tf(t_{1},X) = tf_{1} = \frac{4}{100} = 0.04
  • tf(t_{2},X) = tf_{2} = \frac{5}{100} = 0.05

Calculate IDF:

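    \begin{align*}  \boxed{IDF(t) = \log_{10}\left[ \frac{N}{df(t) + 1} \right] } \end{align*}

The question does not specify the base of the logarithm; base 10 is assumed here.

  • IDF(t_{1}) = \log_{10}[ \frac{10^6}{100 + 1} ] = \log_{10}(9900.99) \approx 3.9957
  • IDF(t_{2}) = \log_{10}[ \frac{10^6}{1000 + 1} ] = \log_{10}(999.00) \approx 2.9996

Calculate TF-IDF:

    \begin{align*}  \boxed{TF-IDF(t, d) = tf(t, d) * \log_{10}\left[ \frac{N}{df(t) + 1} \right] = TF * IDF } \end{align*}

  • TF-IDF(t_{1}, X) = 0.04 * 3.9957 \approx 0.1598
  • TF-IDF(t_{2}, X) = 0.05 * 2.9996 \approx 0.1500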

Thus, we have our answers.
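
As a quick check, the calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the logarithm is taken to base 10 (the question does not name a base), document X is assumed to contain 100 words, and the function names `tf`, `idf` and `tf_idf` are made up for this example.

```python
import math


def tf(term_count, doc_length):
    """Term frequency: occurrences of the term divided by words in the document."""
    return term_count / doc_length


def idf(num_docs, doc_freq):
    """Inverse document frequency with +1 smoothing in the denominator.

    Assumption: base-10 logarithm.
    """
    return math.log10(num_docs / (doc_freq + 1))


def tf_idf(term_count, doc_length, num_docs, doc_freq):
    """TF-IDF is the product of the two quantities above."""
    return tf(term_count, doc_length) * idf(num_docs, doc_freq)


N = 10**6          # documents in the repository
DOC_LENGTH = 100   # assumed number of words in document X

mining = tf_idf(4, DOC_LENGTH, N, 100)       # 'Mining': 4 occurrences, df = 100
discovery = tf_idf(5, DOC_LENGTH, N, 1000)   # 'Discovery': 5 occurrences, df = 1000

print(f"TF-IDF(Mining, X)    = {mining:.4f}")     # 0.1598
print(f"TF-IDF(Discovery, X) = {discovery:.4f}")  # 0.1500
```

Note that ‘Mining’ scores slightly higher than ‘Discovery’ even though it occurs fewer times in X, because it appears in fewer documents across the corpus. Choosing a different logarithm base (for example the natural log) would scale both scores by the same constant and leave their order unchanged.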
