Question

Consider a supermarket that stocks 1000 products. In a market-basket analysis, you want to compare the baskets of two customers, C1 and C2, to measure the similarity in their buying behavior. C1’s basket contains sugar, coffee, tea, rice and eggs. C2’s basket contains sugar, coffee, bread and biscuit. Find the Jaccard and simple matching coefficients for the two customers. Comment on which coefficient is more suitable.

Solution

==================================================================================================
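Let’s first write the given data in table form (1 = in the basket, 0 = not in the basket). Only the 7 products that appear in at least one basket are shown; the remaining 993 products are absent from both: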

    \[ \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline Basket & Sugar & Coffee & Tea & Rice & Eggs & Bread & Biscuit\\ \hline C1 & 1 & 1 & 1 & 1 & 1 & 0 & 0\\ \hline C2 & 1 & 1 & 0 & 0 & 0 & 1 & 1\\ \hline \end{tabular} \]

Step-1: Create a Contingency table

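Contingency table structure (rows describe C1, columns describe C2):

    \[ \begin{tabular}{|c|c|c|c|} \hline  & 1 & 0 & Sum\\ \hline 1 & p & q & p+q\\ \hline 0 & r & s & r+s\\ \hline Sum & p+r & q+s & t\\ \hline \end{tabular} \]

Here p is the number of products in both baskets, q the number only in C1’s basket, r the number only in C2’s basket, s the number in neither basket, and t = p + q + r + s the total number of products.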

Contingency table for the problem:

    \[ \begin{tabular}{|c|c|c|c|} \hline  & 1 & 0 & Sum\\ \hline 1 & 2 & 3 & 5\\ \hline 0 & 2 & 993 & 995\\ \hline Sum & 4 & 996 & 1000\\ \hline \end{tabular} \]
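The four counts in the table can be cross-checked with a few set operations. This is only a minimal sketch: the names `c1`, `c2`, `n_products` and `contingency_counts` are made up for illustration, and each basket is assumed to be a set of distinct product names.

```python
# Baskets and catalogue size as given in the question.
c1 = {"sugar", "coffee", "tea", "rice", "eggs"}
c2 = {"sugar", "coffee", "bread", "biscuit"}
n_products = 1000


def contingency_counts(a, b, total):
    """Return (p, q, r, s) for two baskets drawn from `total` products."""
    p = len(a & b)            # products in both baskets
    q = len(a - b)            # products only in the first basket
    r = len(b - a)            # products only in the second basket
    s = total - (p + q + r)   # products in neither basket
    return p, q, r, s


p, q, r, s = contingency_counts(c1, c2, n_products)
print(p, q, r, s)  # 2 3 2 993
```

The row and column sums follow directly: p + q = 5, r + s = 995, p + r = 4 and q + s = 996.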

Step-2: Calculate Jaccard Coefficient

These are asymmetric binary variables, so the joint absences (s) are left out of the dissimilarity. Thus,

    \begin{align*}  \boxed{d(i, j) = \frac{q + r}{p + q + r} } \end{align*}

    \begin{align*} d(C1, C2) = \frac{3 + 2}{2 + 3 + 2} = \frac{5}{7} \approx 0.71 \end{align*}

    \begin{align*} \text{Therefore, Jaccard Coefficient} = sim(C1, C2) = 1 - d(C1, C2) \end{align*}

    \begin{align*} \text{Jaccard Coefficient} = sim(C1, C2) = 1 - 0.71 = 0.29 \end{align*}
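The same number can be obtained directly from the baskets, because p / (p + q + r) is the size of the intersection divided by the size of the union. The sketch below assumes the baskets are stored as Python sets; `jaccard_similarity` is an illustrative helper, not a library function.

```python
# Baskets as given in the question.
c1 = {"sugar", "coffee", "tea", "rice", "eggs"}
c2 = {"sugar", "coffee", "bread", "biscuit"}


def jaccard_similarity(a, b):
    """Jaccard coefficient: |a & b| / |a | b|, i.e. p / (p + q + r)."""
    union = a | b
    if not union:
        raise ValueError("Jaccard coefficient is undefined for two empty baskets")
    return len(a & b) / len(union)


sim = jaccard_similarity(c1, c2)   # 2 / 7
dist = 1 - sim                     # 5 / 7, the dissimilarity d(C1, C2)
print(round(sim, 2), round(dist, 2))  # 0.29 0.71
```

Note that the catalogue size of 1000 never enters this calculation, which is exactly the point of treating the variables as asymmetric.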

Step-3: Calculate Simple Matching Coefficient

    \begin{align*}  \boxed{d(i, j) = \frac{q + r}{t} } \end{align*}

    \begin{align*} d(C1, C2) = \frac{3 + 2}{1000} = 0.005 \end{align*}

    \begin{align*} \text{Therefore, Simple Matching Coefficient} = 1 - d(C1, C2) = 0.995 \end{align*}
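As a quick check, the simple matching calculation can be sketched in Python as well. The helper name `simple_matching` is hypothetical, and the baskets are again assumed to be sets; the mismatches q + r are the products that appear in exactly one basket, which is the symmetric difference of the two sets.

```python
# Baskets and catalogue size as given in the question.
c1 = {"sugar", "coffee", "tea", "rice", "eggs"}
c2 = {"sugar", "coffee", "bread", "biscuit"}
n_products = 1000


def simple_matching(a, b, total):
    """Simple matching coefficient: (p + s) / t = 1 - (q + r) / t."""
    mismatches = len(a ^ b)   # q + r: products in exactly one basket
    return (total - mismatches) / total


smc = simple_matching(c1, c2, n_products)
print(smc)  # 0.995
```

Unlike the Jaccard calculation, this one needs the total number of products, because joint absences count as matches.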

The absence of a particular product from a basket is not informative; only its presence is. The variables are therefore asymmetric binary. The Simple Matching Coefficient ignores this: its value of 0.995 is driven almost entirely by the 993 products that neither customer bought. The Jaccard Coefficient discards those joint absences, and its value of 0.29 reflects that only 2 of the 7 distinct products purchased are common to both baskets. The Jaccard Coefficient is therefore the more suitable measure of similarity here.
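To see why joint absences distort the picture, one can vary the catalogue size while keeping the two baskets fixed. The catalogue sizes below are hypothetical values chosen for illustration (only 1000 comes from the question), and both helper functions are illustrative sketches rather than library calls.

```python
# Baskets as given in the question.
c1 = {"sugar", "coffee", "tea", "rice", "eggs"}
c2 = {"sugar", "coffee", "bread", "biscuit"}


def jaccard(a, b):
    """Ignores products that are absent from both baskets."""
    return len(a & b) / len(a | b)


def simple_matching(a, b, total):
    """Counts products absent from both baskets as matches."""
    return (total - len(a ^ b)) / total


# Hypothetical catalogue sizes; the baskets themselves never change.
for total in (10, 100, 1000, 10000):
    print(total,
          round(simple_matching(c1, c2, total), 4),
          round(jaccard(c1, c2), 4))
```

The simple matching value climbs toward 1 as the catalogue grows, even though the customers' purchases are unchanged, while the Jaccard value stays at 2/7.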
