Question
An organization has one million documents in its repository. A document X contains the term ‘Mining’ 4 times and the term ‘Discovery’ 5 times. Other words occur less frequently. Altogether, 100 documents contain the term ‘Mining’ and 1000 documents contain the term ‘Discovery’.
Calculate the TF-IDF values of both terms with respect to document X.
Solution
==================================================================================================
Before we start, let’s write down the terms.
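- $t_1$ = term 1 = ‘Mining’
- $t_2$ = term 2 = ‘Discovery’
- $N$ = number of documents in the corpus = 1,000,000
- $f_{t_1}$ = frequency of term 1 in document X = 4
- $f_{t_2}$ = frequency of term 2 in document X = 5
- $\mathrm{df}(t)$ = number of documents that contain term $t$
- $\mathrm{df}(t_1)$ = 100
- $\mathrm{df}(t_2)$ = 1000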
Assumption: the question does not give the length of document X, so let the total number of words in document X be 100.
Calculate TF:
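- $\mathrm{tf}(t_1, X) = \frac{f_{t_1}}{\text{number of words in } X} = \frac{4}{100} = 0.04$
- $\mathrm{tf}(t_2, X) = \frac{f_{t_2}}{\text{number of words in } X} = \frac{5}{100} = 0.05$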
Calculate IDF:
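- $\mathrm{IDF}(t_1) = \frac{N}{\mathrm{df}(t_1) + 1} = \frac{1{,}000{,}000}{101} \approx 9901$
- $\mathrm{IDF}(t_2) = \frac{N}{\mathrm{df}(t_2) + 1} = \frac{1{,}000{,}000}{1001} \approx 999$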
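The values 9901 and 999 correspond to the plain ratio N / (df(t) + 1), with no logarithm. Many references instead define IDF with a logarithm, which compresses the scale considerably. The sketch below computes both side by side for comparison; the function names `idf_ratio` and `idf_log` are illustrative, and the natural-log variant is shown as an assumption about a common alternative, not as part of this solution.

```python
import math

N = 1_000_000  # documents in the corpus
DOC_FREQ = {"Mining": 100, "Discovery": 1000}  # df(t) from the question


def idf_ratio(n_docs: int, doc_freq: int) -> float:
    """Unlogged IDF with +1 smoothing, as used in this solution."""
    return n_docs / (doc_freq + 1)


def idf_log(n_docs: int, doc_freq: int) -> float:
    """Log-scaled variant (assumption: natural log, same +1 smoothing)."""
    return math.log(n_docs / (doc_freq + 1))


# Map each term to (ratio IDF, log IDF).
idf_table = {
    term: (idf_ratio(N, df), idf_log(N, df)) for term, df in DOC_FREQ.items()
}

for term, (ratio, logged) in idf_table.items():
    print(f"{term}: ratio IDF = {ratio:.2f}, log IDF = {logged:.4f}")
```

Either way, the rarer term ‘Mining’ receives the larger IDF; only the magnitude of the gap changes.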
Calculate TF-IDF:
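- $\text{TF-IDF}(t_1, X) = \mathrm{tf}(t_1, X) \times \mathrm{IDF}(t_1) = 0.04 \times 9901 = 396.04$
- $\text{TF-IDF}(t_2, X) = \mathrm{tf}(t_2, X) \times \mathrm{IDF}(t_2) = 0.05 \times 999 = 49.95$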
Thus, we have our answers.
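As a quick check, the whole calculation can be reproduced in a few lines of Python. This is a minimal sketch under the same assumptions as the solution: a document length of 100 words (assumed, not given in the question) and the unlogged IDF N / (df(t) + 1). The helper names `tf`, `idf`, and `tf_idf_scores` are illustrative.

```python
N = 1_000_000      # documents in the corpus
DOC_LENGTH = 100   # assumed number of words in document X

# term -> (occurrences in X, number of documents containing the term)
TERMS = {"Mining": (4, 100), "Discovery": (5, 1000)}


def tf(term_count: int, doc_length: int) -> float:
    """Term frequency: share of the document's words that are this term."""
    return term_count / doc_length


def idf(n_docs: int, doc_freq: int) -> float:
    """Unlogged inverse document frequency with +1 smoothing."""
    return n_docs / (doc_freq + 1)


def tf_idf_scores(terms: dict, n_docs: int, doc_length: int) -> dict:
    """Return the TF-IDF score of every term for a single document."""
    return {
        term: tf(count, doc_length) * idf(n_docs, doc_freq)
        for term, (count, doc_freq) in terms.items()
    }


scores = tf_idf_scores(TERMS, N, DOC_LENGTH)
for term, score in scores.items():
    print(f"{term}: {score:.2f}")
# Mining: 396.04
# Discovery: 49.95
```

Although ‘Discovery’ occurs more often in X, ‘Mining’ scores much higher because it appears in far fewer documents across the corpus.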