TF-IDF (Term Frequency-Inverse Document Frequency) is a weighting scheme for identifying the terms most relevant to a document within a large collection of documents: a term scores highly when it appears often in one document but rarely in the rest of the collection. One example is Google looking for important terms in a vast collection of websites. I attempted a more modest analysis: finding the most important term in each of the works of Shakespeare.
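To make the idea concrete, here is a minimal in-memory sketch of the TF-IDF calculation. It is not the MapReduce implementation described in this post, only an illustration of the arithmetic. The class name `TfIdfSketch`, its method names, and the three toy "documents" are made up for the example. It assumes one common variant of the formula: term frequency is the raw count divided by the document length, and inverse document frequency is the natural log of the total number of documents divided by the number of documents containing the term.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of the Hadoop programs described in this post.
public class TfIdfSketch {

    // Term frequency: occurrences of the term divided by the document's word count.
    public static double tf(List<String> doc, String term) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    // Inverse document frequency: ln(total documents / documents containing the term).
    // Assumption: a term that appears in no document gets an IDF of 0.
    public static double idf(List<List<String>> corpus, String term) {
        long containing = corpus.stream().filter(d -> d.contains(term)).count();
        if (containing == 0) {
            return 0.0;
        }
        return Math.log((double) corpus.size() / containing);
    }

    // TF-IDF score of one term in one document, relative to the whole corpus.
    public static double tfIdf(List<String> doc, List<List<String>> corpus, String term) {
        return tf(doc, term) * idf(corpus, term);
    }

    // Highest-scoring term in the document; ties go to the term that occurs first.
    public static String topTerm(List<String> doc, List<List<String>> corpus) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String term : doc) {
            double score = tfIdf(doc, corpus, term);
            if (score > bestScore) {
                bestScore = score;
                best = term;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Toy corpus: each inner list stands in for one tokenized work.
        List<List<String>> corpus = Arrays.asList(
                Arrays.asList("the", "king", "and", "the", "crown"),
                Arrays.asList("the", "fool", "and", "the", "storm"),
                Arrays.asList("the", "ghost", "and", "the", "ghost"));
        for (List<String> doc : corpus) {
            System.out.println(topTerm(doc, corpus));
        }
    }
}
```

Words such as "the" and "and" appear in every toy document, so their IDF is ln(1) = 0 and they can never be the top term, however often they occur. That is the property that makes TF-IDF useful for picking out the distinctive vocabulary of a single work.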
I wrote a collection of Java MapReduce programs to run under Hadoop. My computer setup was/is the following:
The whole program consists of a MapReduce driver and five Mapper/Reducer steps:
NOTE: This algorithm may run into problems at scale, either because of memory usage or because of the startup/teardown cost of its numerous steps.
My thanks to the following for the help I derived from their websites:
My thanks also to Jure Leskovec and Daniel Templeton for teaching Stanford's CS246 and CS246H.
Originally published: Wednesday, February 04, 2015; most recently modified: Saturday, September 07, 2019