Question
An organization has one million documents in its repository. A document X has the term ‘Mining’ occurring 4 times and the term ‘Discovery’ occurring 5 times. Other words occur less frequently. Altogether, 100 documents contain the term ‘Mining’ and 1000 documents contain the term ‘Discovery’.
Calculate TF-IDF values for both terms w.r.t. document X.
Solution
==================================================================================================
Before we start, let’s write down the terms.
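- $t_1$ = term 1 = ‘Mining’
- $t_2$ = term 2 = ‘Discovery’
- N = number of documents in the corpus = $10^6$
- $f_1$ = frequency of term 1 in document X = 4
- $f_2$ = frequency of term 2 in document X = 5
- df(t) = number of documents in which term t occurs
- df($t_1$) = 100
- df($t_2$) = 1000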
Assumption: The question does not give the length of document X, so let the number of words in document X be 100.
Calculate TF:
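- tf($t_1$, X) = (occurrences of $t_1$ in X) / (number of words in X) = $\frac{4}{100}$ = 0.04
- tf($t_2$, X) = (occurrences of $t_2$ in X) / (number of words in X) = $\frac{5}{100}$ = 0.05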
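The term-frequency step can be sketched in a few lines of Python. This is only an illustration: the function name `term_frequency` is hypothetical, and the document length of 100 words is the assumption stated above, not a value given in the question.

```python
def term_frequency(term_count: int, total_words: int) -> float:
    """Share of the document's words that are the given term."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return term_count / total_words


# Document X: 4 x 'Mining', 5 x 'Discovery', assumed length of 100 words.
tf_mining = term_frequency(4, 100)
tf_discovery = term_frequency(5, 100)
print(tf_mining, tf_discovery)  # prints: 0.04 0.05
```

A different assumed document length would scale both TF values by the same factor, so the ratio between the two terms would stay the same.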
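Calculate IDF (taken here as the plain ratio N / (df(t) + 1), without the logarithm that many definitions apply):
- IDF($t_1$) = $\frac{N}{df(t_1) + 1}$ = $\frac{10^6}{100 + 1}$ ≈ 9901
- IDF($t_2$) = $\frac{N}{df(t_2) + 1}$ = $\frac{10^6}{1000 + 1}$ ≈ 999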
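The IDF values 9901 and 999 are consistent with the unlogged ratio N / (df + 1). The sketch below reproduces them and, for comparison, also shows a commonly used logged variant. The function names `idf_ratio` and `idf_log` are hypothetical, and the choice of natural logarithm in the second one is an assumption, since conventions differ on the base.

```python
import math


def idf_ratio(n_docs: int, doc_freq: int) -> float:
    """Unlogged ratio N / (df + 1), which matches the values in this solution."""
    return n_docs / (doc_freq + 1)


def idf_log(n_docs: int, doc_freq: int) -> float:
    """Logged variant log(N / (df + 1)); natural log is an assumption here."""
    return math.log(n_docs / (doc_freq + 1))


N_DOCS = 1_000_000
for term, df in (("Mining", 100), ("Discovery", 1000)):
    # The ratio is rounded to the nearest integer, as in the worked solution.
    print(term, round(idf_ratio(N_DOCS, df)), idf_log(N_DOCS, df))
```

With either variant, the rarer term ‘Mining’ gets the larger IDF. The logged form compresses the gap between the two terms considerably.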
Calculate TF-IDF:
- TF-IDF($t_1$, X) = 0.04 * 9901 = 396.04
- TF-IDF($t_2$, X) = 0.05 * 999 = 49.95
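Thus, the TF-IDF values w.r.t. document X are 396.04 for ‘Mining’ and 49.95 for ‘Discovery’. ‘Mining’ scores higher even though it occurs less often in X, because it appears in far fewer documents of the corpus.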
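The whole calculation can be reproduced end to end with a short Python sketch. It assumes the same 100-word length for document X and the same unlogged ratio N / (df + 1) for IDF. The names `tf_idf`, `term_stats` and `scores` are hypothetical and chosen only for this illustration.

```python
def tf_idf(term_count: int, total_words: int, n_docs: int, doc_freq: int) -> float:
    """TF-IDF with tf = count / total words and idf = N / (df + 1)."""
    tf = term_count / total_words
    idf = n_docs / (doc_freq + 1)
    return tf * idf


TOTAL_WORDS_X = 100   # assumed length of document X
N_DOCS = 1_000_000    # documents in the repository

# term -> (occurrences in X, number of documents containing the term)
term_stats = {"Mining": (4, 100), "Discovery": (5, 1000)}

scores = {
    term: tf_idf(count, TOTAL_WORDS_X, N_DOCS, df)
    for term, (count, df) in term_stats.items()
}

# List the terms from highest to lowest score.
for term, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term}: {score:.2f}")
# prints:
# Mining: 396.04
# Discovery: 49.95
```

The sketch multiplies by the unrounded IDF, whereas the hand calculation rounds IDF first. To two decimal places, both give the same results.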