PHILADELPHIA REFLECTIONS
Musings of a Philadelphia Physician who has served the community for six decades

Return to Home

TF-IDF Term Frequency-Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) is an algorithm for determining the most-relevant terms in a document within a large collection of documents. One example is Google looking for important terms in a vast collection of websites. I attempted a more modest analysis to find the most important term in each of the works of Shakespeare.

I wrote a collection of Java MapReduce programs to run under Hadoop. My computer setup was/is the following:

The whole program consists of a MapReduce driver and 5 Mapper/Reducer steps:

NOTES: This algorithm may run into problems at scale either because of memory usage or because of the startup/teardown cost of numerous steps.

  1. Memory is likely to be the larger problem
    • In the first step you can implement stemming to reduce the number of terms and you can include certain high frequency terms in the stop word list.
    • The memory problems will be more severe in the reducer steps and implementing combiners may help by offloading some of the memory usage to the mapper tasks, which are likely to be more numerous than the reducer tasks.
    • Instead of using in-memory HashMaps, you can implement temporary disk storage.
  2. The second and fourth steps can fairly easily be folded into the previous steps. There are two penalties for this: (1) the code is more complex and harder to understand, explain and debug (2) the individual tasks will require more resources.

My thanks to the following for the help I derived from their websites:

My thanks also to Jure Leskovec and Daniel Templeton for teaching Stanford's CS246 and CS246H

(2793)

Testing the comment facility from grfiv.github.io
Posted by: G4   |   Feb 5, 2015 5:04 PM

Please Let Us Know What You Think


(HTML tags provide better formatting)

Because of robot spam we ask you to confirm your comment: we will send you an email containing a link to click. We apologize for this inconvenience but this ensures the quality of the comments. (Your email will not be displayed.)
Thank you.