Philadelphia Reflections

The musings of a physician who has served the community for over six decades

Volumes

George IV

The Age of the Philadelphia Computer
Computers have a long, slow history. The computer industry, however, had an abrupt start and a sudden decline in Philadelphia.

George and Computers(2)


Central Securities

Macroeconomics of The 2007 Collapse

Sudden wealth creation, whether from the discovery of gold or oil, the conversion of poverty into useful cheap labor, or the sudden abundance of cheap credit, is of course a good thing. It can be compared with a stone thrown into a pond, causing a splash and ripples, but leaving a somewhat higher water level after things calm down. The globalization of trade and finance in the past fifty years has caused 150 such disturbances, mostly confined to a primitive developing country and its neighbors. Only the 2007 disruption has been large enough to upset the biggest economies. It remains to be seen whether disorder on a world scale will result in a revised world monetary arrangement. One hopes so, but national currencies, tightly controlled by local governments, have been successful in the past in confining the damage. This time, the challenge is to breach the dykes somewhat without letting destructive tidal waves sweep past them. Many will resist this idea, claiming instead that it would be better to have higher dykes.

It is the suddenness of new wealth creation in a particular region which upsets existing currency arrangements. Large economies "float" their currencies in response to the fluxes of trade; smaller economies can be permitted to "peg" their currencies to larger ones, with only infrequent readjustments. Even the floating nations "cheat" a little, in response to the political needs of the governing party, or to stimulate or depress their economies as locally thought best. All politicians in all countries therefore fear a strictly honest floating system, and their negotiations about revising the present one will surely find loopholes for each other; the search for flexible floating will therefore be presented as the search for an arrangement which is "workable".

In thousands of years of government, rulers have invariably sought ways to substitute inflated currency for unpopular taxes. The heart of any international payment system is to find ways to resist local inflation stratagems. Aside from using gunboats, only two methods have proven successful. The most time-honored is to link currencies to gold or other precious substances, which has the main handicap of inflexibility in response to economic fluctuations. After breaking the link to gold in 1971, central banks regulated the supply of national currency in response to national inflation, so-called "inflation targeting". It worked far better than many feared, apparently allowing twenty years without a recession. It remains to be investigated whether the substitution of foreign currency defeated the system, and therefore whether the system can be repaired by improving the precision of universal floating, or by tightening obedience to targets, or both. Even these mildest of measures involve a certain surrender of national sovereignty; stronger methods would require even more draconian external force. The worse it gets, the more likely it could be enforced only by military threat. Even the Roman Empire required gold and precious metals to enforce a world currency. The use of the International Monetary Fund (IMF) invites attempts to dominate the politics of the IMF. So it comes to the same thing: this crisis will have to get a lot worse, perhaps with some rioting and revolutions, before we can expect anything more satisfactory than a rickety negotiated international arrangement, riddled with embarrassing "earmarks". Economic recovery will be slow and gradual, unless that arrangement is better, or the social upheavals worse, than presently appears likely.

An iPhone web app

The iPhone is the best PDA to come along since the BlackBerry 15 years ago. It is to the BlackBerry what the BlackBerry was to cell phones.

Philadelphia Reflections is now a fully-fledged iPhone web app. The application will appear on your iPhone in the appropriate format automatically: just navigate to http://www.philadelphia-reflections.com with the iPhone browser; we will detect it and do the right thing.

A two-step process is required to get a little icon onto your iPhone home screen so you can go there directly:

  1. Click the "+" plus sign at the bottom of the iPhone screen
  2. Click the "Add to Home Screen" button that appears.

That's it. We do all the rest.

We are listed in the Apple web-app store: click here to go see our special page.

[Screenshot: Click "Add to Home Screen"]
[Screenshot: Click the plus sign]

SMTP Authorization and Handling Bounced Emails with PEAR Mail

Recently our ISP started requiring user signon in order to send emails. PHP's mail function stopped working as a result.

Naturally, the ISP did not notify us of this change so we were quite surprised when many thousands of emails on our newsletter list were rejected (every one of them, in fact).

What error message was returned to notify us of the problem? Why, this helpful note:

Mail sent by user nobody being discarded due to sender restrictions in WHM->Tweak Settings

Doesn't that just say it all?

I'm being snide, but our ISP is really quite good about keeping its software up to date and, aside from an occasional surprise like this, they are very reliable. Being up to date included the automatic incorporation of the PEAR Mail facility, which we are now using.

PEAR's Mail system works quite well but two problems were very vexing until we stumbled our way to a solution:

  1. How, exactly, do we sign on to the SMTP server?
  2. How do we ensure that bounced emails (the bane of all email lists) get returned to us?

You might not think that the first question would be so hard, but it actually took a good deal of trial and error to get it right. As for the second question, there is an awful lot of wrong information out in Internet land (including but not limited to VERP and XVERP, which I advise you to avoid).

With PEAR Mail you first set up a "factory" and then send emails, either singly or in a loop. We keep the user id, password, etc. in a file "above" the web server's document root, in hopes that will keep them secret. Here's the code (it actually is in production and it does in fact work):

<?php
include('Mail.php');

# the email constants are contained in a file outside the web server
include("/level1/level2/level3/constants.php");

$headers = array (
         'From' => '"name"<addr@domain.com>',
         'Sender' => '"name"<addr@domain.com>',
         'Reply-To' => '"name"<addr@domain.com>',
         'Return-Path' => 'addr@domain.com',
         'Content-type' => 'text/html; charset=iso-8859-1',
         'X-Mailer' => 'PHP/' . phpversion(),
         'Date' => date("D, j M Y H:i:s O",time()),
         'Content-Language' => 'en-us',
         'MIME-Version' => '1.0'
         );

// call the PEAR mail "factory"
$smtp = Mail::factory('smtp',
      array (
            'host' => EMAIL_HOST,
            'port' => EMAIL_PORT,
            'auth' => true,
            'username' => EMAIL_USERNAME,
            'password' => EMAIL_PASSWORD,
            'persist' => true,
            'debug' => false
            ), '-f addr@domain.com'
      );

# to send emails:
#
# $headers['To']      = $to;        # provide the "$to" variable, something like $to = '"name"<addr@domain.com>';
#                                   # note that the first parameter of $smtp->send can be "decorated" this way or just a naked email address
# $headers['Subject'] = $subject;   # provide the "$subject" variable
# $mail = $smtp->send($to, $headers, $contents_of_the_email);
#                          -------- ................................> except for 'To' and 'Subject',
#                                                                     $headers is provided by this module but can be over-ridden
# if (PEAR::isError($mail))
# {
#   echo "<p style='color:red;'>The email failed; debug information follows:<br />";
#   echo $mail->getDebugInfo() . "<br />";
#   echo $mail->getMessage()   . "</p>";
# }
# else
# {
#   echo "<p>email successfully sent</p>";
# }

?>

My thanks to http://htmlentities.net/ for the HTML entities conversion.

TF-IDF Driver

<- TF-IDF Main Page

package com.georgefisher.tfidf;
 
import java.util.Arrays;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
 
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
 
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
 
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
 
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
 
// ============ DRIVER ===============
public class TFIDF_driver extends Configured implements Tool{
    public static void main(String[] args) throws Exception {
 
        System.out.println("driver main() args: " + Arrays.toString(args));
        int res = ToolRunner.run(new Configuration(), new TFIDF_driver(), args);
 
        System.exit(res);
    }
 
    @Override
    public int run(String[] args) throws Exception {
        System.out.println("driver  run() args: " + Arrays.toString(args));
 
        Configuration conf = getConf();  // use the Configuration supplied by ToolRunner
        int step = 1;
 
        // ======================== step 1 ============================
        // input  (byteCount, line)
        // output (word@doc, n)       n = the frequency of word in doc
 
        Job job1 = Job.getInstance(conf);
        job1.setJobName("TFIDF" + step);
        System.out.println("job: " + job1.getJobName().toString());
        job1.setJarByClass(TFIDF_driver.class);
 
        // Set the output Key, Value types for the Mapper
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(IntWritable.class);
 
        // Set the output Key, Value types for the Reducer
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
 
        // Mapper, Combiner, Reducer
        // -------------------------
        job1.setMapperClass(TFIDF_step1.Map.class);
        job1.setReducerClass(TFIDF_step1.Reduce.class);
 
        // Specify that the Mapper & Reducer are reading text files
        job1.setInputFormatClass(TextInputFormat.class);
        job1.setOutputFormatClass(TextOutputFormat.class);
 
        // Specify Mapper input and Reducer output paths
        conf.set("inputDir", args[0]);
        conf.set("originalInputDir", args[0]);
        conf.set("outputDir", job1.getJobName());
 
        FileInputFormat.addInputPath(job1, new Path(conf.get("inputDir")));
 
        FileSystem fs1 = FileSystem.get(conf);
        if (fs1.exists(new Path(conf.get("outputDir"))))
            fs1.delete(new Path(conf.get("outputDir")), true);
        FileOutputFormat.setOutputPath(job1, new Path(conf.get("outputDir")));
 
        for (Path inputPath: FileInputFormat.getInputPaths(job1))
            System.out.println("input  path " + inputPath.toString());
        System.out.println("output path " +
                FileOutputFormat.getOutputPath(job1).toString());
 
        job1.waitForCompletion(true);
        System.out.println("job completed: " + job1.getJobName().toString());
 
        // ======================== step 2 =============================
        // input  (word@doc, n)         n = the frequency of word in doc
        // output (word@doc, n;N)       N = total words in doc
 
        conf.set("inputDir", conf.get("outputDir"));
        step++;
 
        Job job2 = Job.getInstance(conf);
        job2.setJobName("TFIDF" + step);
        System.out.println("job : " + job2.getJobName().toString());
        job2.setJarByClass(TFIDF_driver.class);
 
        // Set the output Key, Value types for the Mapper
        job2.setMapOutputKeyClass(Text.class);
        job2.setMapOutputValueClass(Text.class);
 
        // Set the output Key, Value types for the Reducer
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
 
        // Mapper, Combiner, Reducer
        // -------------------------
        job2.setMapperClass(TFIDF_step2.Map.class);
        job2.setReducerClass(TFIDF_step2.Reduce.class);
 
        // Specify that the Mapper is reading "key tab value"
        conf.set("key.value.separator.in.input.line", "\t");
        job2.setInputFormatClass(KeyValueTextInputFormat.class);
 
        // Specify that the Reducer is writing a text file
        job2.setOutputFormatClass(TextOutputFormat.class);
 
        // Specify Mapper input and Reducer output paths
        conf.set("outputDir", job2.getJobName());
 
        FileInputFormat.addInputPath(job2, new Path(conf.get("inputDir")));
 
        FileSystem fs2 = FileSystem.get(conf);
        if (fs2.exists(new Path(conf.get("outputDir"))))
            fs2.delete(new Path(conf.get("outputDir")), true);
        FileOutputFormat.setOutputPath(job2, new Path(conf.get("outputDir")));
 
        for (Path inputPath: FileInputFormat.getInputPaths(job2))
            System.out.println("input  path " + inputPath.toString());
        System.out.println("output path " +
                FileOutputFormat.getOutputPath(job2).toString());
 
        job2.waitForCompletion(true);
        System.out.println("job completed: " + job2.getJobName().toString());
 
        // ======================== step 3 =================================
        // input  (word@doc, n;N)       n  = the frequency of word in doc
        //                              N  = total words in doc
        // output (word@doc, n;N;df)    df = the frequency of word in dataset
 
        conf.set("inputDir", conf.get("outputDir"));
        step++;
 
        Job job3 = Job.getInstance(conf);
        job3.setJobName("TFIDF" + step);
        System.out.println("job : " + job3.getJobName().toString());
        job3.setJarByClass(TFIDF_driver.class);
 
        // Set the output Key, Value types for the Mapper
        job3.setMapOutputKeyClass(Text.class);
        job3.setMapOutputValueClass(Text.class);
 
        // Set the output Key, Value types for the Reducer
        job3.setOutputKeyClass(Text.class);
        job3.setOutputValueClass(Text.class);
 
        // Mapper, Combiner, Reducer
        // -------------------------
        job3.setMapperClass(TFIDF_step3.Map.class);
        job3.setReducerClass(TFIDF_step3.Reduce.class);
 
        // Specify that the Mapper is reading "key tab value"
        conf.set("key.value.separator.in.input.line", "\t");
        job3.setInputFormatClass(KeyValueTextInputFormat.class);
 
        // Specify that the Reducer is writing a text file
        job3.setOutputFormatClass(TextOutputFormat.class);
 
        // Specify Mapper input and Reducer output paths
        conf.set("outputDir", job3.getJobName());
 
        FileInputFormat.addInputPath(job3, new Path(conf.get("inputDir")));
 
        FileSystem fs3 = FileSystem.get(conf);
        if (fs3.exists(new Path(conf.get("outputDir"))))
            fs3.delete(new Path(conf.get("outputDir")), true);
        FileOutputFormat.setOutputPath(job3, new Path(conf.get("outputDir")));
 
        for (Path inputPath: FileInputFormat.getInputPaths(job3))
            System.out.println("input  path " + inputPath.toString());
        System.out.println("output path " +
                FileOutputFormat.getOutputPath(job3).toString());
 
        job3.waitForCompletion(true);
        System.out.println("job completed: " + job3.getJobName().toString());
 
        // ======================== step 4 =================================
        // input  (word@doc, n;N;df)    n  = the frequency of word in doc
        //                              N  = total words in doc
        //                              df = the frequency of word in dataset
        // output (word@doc, [tf, idf, tfidf])
        //
        // map-only
        // --------
 
        conf.set("inputDir", conf.get("outputDir"));
        step++;
 
        Job job4 = Job.getInstance(conf);
        job4.setJobName("TFIDF" + step);
        System.out.println("job : " + job4.getJobName().toString());
        job4.setJarByClass(TFIDF_driver.class);
 
        // Set the output Key, Value types for the Mapper
        job4.setMapOutputKeyClass(Text.class);
        job4.setMapOutputValueClass(Text.class);
 
        job4.setNumReduceTasks(0);
 
        // Mapper, Combiner, Reducer
        // -------------------------
        job4.setMapperClass(TFIDF_step4.Map.class);
 
        // Specify that the Mapper is reading "key tab value"
        conf.set("key.value.separator.in.input.line", "\t");
        job4.setInputFormatClass(KeyValueTextInputFormat.class);
 
        // Specify Mapper input and output paths
        conf.set("outputDir", job4.getJobName());
 
        FileInputFormat.addInputPath(job4, new Path(conf.get("inputDir")));
 
        FileSystem fs4 = FileSystem.get(conf);
        if (fs4.exists(new Path(conf.get("outputDir"))))
            fs4.delete(new Path(conf.get("outputDir")), true);
        FileOutputFormat.setOutputPath(job4, new Path(conf.get("outputDir")));
 
        for (Path inputPath: FileInputFormat.getInputPaths(job4))
            System.out.println("input  path " + inputPath.toString());
        System.out.println("output path " +
                FileOutputFormat.getOutputPath(job4).toString());
 
        job4.waitForCompletion(true);
        System.out.println("job completed: " + job4.getJobName().toString());
 
        // ======================== step 5 =================================
        // ========= Find the max TF-IDF word in each document =============
 
        // input  (word@doc, [tf, idf, tfidf])
        // [tf=2 idf=log(fileCount/df)=log(38/1)=3.6375861597263857 tfidf=tf*idf=7.275172319452771]
        //                             
        // output (word@doc, max-tfidf)   
 
        conf.set("inputDir", conf.get("outputDir"));
        step++;
 
        Job job5 = Job.getInstance(conf);
        job5.setJobName("TFIDF" + step);
        System.out.println("job : " + job5.getJobName().toString());
        job5.setJarByClass(TFIDF_driver.class);
 
        // Set the output Key, Value types for the Mapper
        job5.setMapOutputKeyClass(Text.class);
        job5.setMapOutputValueClass(Text.class);
 
        // Set the output Key, Value types for the Reducer
        job5.setOutputKeyClass(Text.class);
        job5.setOutputValueClass(DoubleWritable.class);
 
        // Mapper, Combiner, Reducer
        // -------------------------
        job5.setMapperClass(TFIDF_step5.Map.class);
        job5.setReducerClass(TFIDF_step5.Reduce.class);
 
        // Specify that the Mapper is reading "key tab value"
        conf.set("key.value.separator.in.input.line", "\t");
        job5.setInputFormatClass(KeyValueTextInputFormat.class);
 
        // Specify that the Reducer is writing a text file
        job5.setOutputFormatClass(TextOutputFormat.class);
 
        // Specify Mapper input and Reducer output paths
        conf.set("outputDir", job5.getJobName());
 
        FileInputFormat.addInputPath(job5, new Path(conf.get("inputDir")));
 
        FileSystem fs5 = FileSystem.get(conf);
        if (fs5.exists(new Path(conf.get("outputDir"))))
            fs5.delete(new Path(conf.get("outputDir")), true);
        FileOutputFormat.setOutputPath(job5, new Path(conf.get("outputDir")));
 
        for (Path inputPath: FileInputFormat.getInputPaths(job5))
            System.out.println("input  path " + inputPath.toString());
        System.out.println("output path " +
                FileOutputFormat.getOutputPath(job5).toString());
 
        job5.waitForCompletion(true);
        System.out.println("job completed: " + job5.getJobName().toString());
 
        // =================================================================
 
        return 0;
    }
}
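
A note on running it: the driver expects a single argument, the HDFS directory holding the input documents, which it records as both inputDir and originalInputDir; each subsequent job then reads the previous job's output directory (the jobs name their output directories TFIDF1 through TFIDF5). Assuming the classes are packaged into a jar (the jar and input-directory names below are just placeholders), the invocation would look something like:

hadoop jar tfidf.jar com.georgefisher.tfidf.TFIDF_driver shakespeare-input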

TF-IDF Step1 ->

TF-IDF Step1

<- TF-IDF Driver
package com.georgefisher.tfidf;
 
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
 
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
 
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
 
//MapReduce TF-IDF
//
//step 1
//------
//map       in  (byteCount, line)
//          out (word@doc, 1)
//
//reduce    in  (word@doc, [count, count, ...])
//          out (word@doc, n)
 
public class TFIDF_step1 { 
 
    // ============ MAPPER ===============
    public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {
 
        private final static IntWritable ONE = new IntWritable(1);
        private static       Text   word_doc = new Text();
 
        // NOTE: stemming might also be considered
         
        // NOTE: at scale, memory may be an issue;
        //       possibly forcing the inclusion of certain high-frequency
        //       words that are not technically stop words.
        private static Set<String> stopWords;
        static {
            // we stop ALL words less than 3 in length
            stopWords = new HashSet<String>();
            stopWords.add("about"); stopWords.add("and");
            stopWords.add("are");   stopWords.add("com");
            stopWords.add("for");   stopWords.add("from");
            stopWords.add("how");
            stopWords.add("that");  stopWords.add("the");
            stopWords.add("this");  stopWords.add("was");
            stopWords.add("what");  stopWords.add("when");
            stopWords.add("where"); stopWords.add("with");
            stopWords.add("who");   stopWords.add("will");
            stopWords.add("the");   stopWords.add("www");
             
            stopWords.add("verse"); // for shakespeare's sonnets
        }
 
        //map       in  (byteCount, line)
        //          out (word@doc, 1)
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
 
            String line    = value.toString().toLowerCase();
            String[] words = line.split("\\W+");
 
            // skip stop words and junk
            for (int i=0; i < words.length; i++) {
                String word = words[i];
                if (word.length() < 3                   || // note
                    !Character.isLetter(word.charAt(0)) ||
                    Character.isDigit(word.charAt(0))   ||
                    stopWords.contains(word)            ||
                    word.contains("_")) {
                    continue;
                }
 
                String fileName =
                        ((FileSplit) context.getInputSplit()).getPath().getName();
 
                word_doc.set(word+"@"+fileName);
                context.write(word_doc, ONE);
            }
        }
    }
 
    // ============ REDUCER ===============
    public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
 
        private static IntWritable n = new IntWritable();
 
        //reduce    in  (word@doc, [count, count, ...])
        //          out (word@doc, n)
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
 
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            n.set(sum);
            context.write(key, n);
        }
    }
}
TF-IDF Step2 ->

TF-IDF Step 2

<- TF-IDF Step1
package com.georgefisher.tfidf;
 
import java.io.IOException;
import java.util.HashMap;
 
import org.apache.hadoop.io.Text;
 
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
 
//MapReduce TF-IDF
//
//step 2
//------
//map       in  (word@doc, n)
//          out (doc, word;n)
//
//reduce    in  (doc, [word;n, word;n, ...])
//          out (word@doc, n;N)
 
public class TFIDF_step2 {
    // ============ MAPPER ===============
    public static class Map
    extends Mapper<Text, Text, Text, Text> {
 
        private final static Text doc    = new Text();
        private static       Text word_n = new Text();
 
        //map       in  (word@doc, n)
        //          out (doc, word;n)
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
 
            String[] word_doc = key.toString().split("@");
            String word       = word_doc[0];
            doc.set(word_doc[1]);
             
            String n = value.toString();
            word_n.set(word+";"+n);
 
            context.write(doc, word_n);
        }
    }
 
    // ============ REDUCER ===============
    public static class Reduce
    extends Reducer<Text, Text, Text, Text> {
 
        private static Text word_doc = new Text();
        private static Text n_N       = new Text();
 
        //reduce    in  (doc, [word;n, word;n, ...])
        //          out (word@doc, n;N)
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
             
            String doc = key.toString();
             
            // NOTE: at scale, memory may be an issue;
            //       possibly forcing the use of intermediate disk storage
            int N = 0;
            HashMap<String, Integer> wordList = new HashMap<String, Integer>();
            wordList.clear();
            for (Text word_n: values) {
                String[] bits = word_n.toString().split(";");
                String word = bits[0];
                int n       = Integer.parseInt(bits[1]);
                wordList.put(word, n);
                N += n;
            }
 
            for (String word: wordList.keySet()) {
                word_doc.set(word+"@"+doc);
                n_N.set(wordList.get(word)+";"+N);
                context.write(word_doc,  n_N);
            }
        }
    }
}
TF-IDF Step3 ->

TF-IDF Step 3

<- TF-IDF Step2
package com.georgefisher.tfidf;
 
import java.io.IOException;
import java.util.HashMap;
 
import org.apache.hadoop.io.Text;
 
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
 
//MapReduce TF-IDF
//
//step 3
//------
//map       in  (word@doc, n;N)
//          out (word, doc;n;N;1)
//
//reduce    in  (word, [doc;n;N;count, doc;n;N;count, ...])
//          out (word@doc, n;N;df)
 
public class TFIDF_step3 {
    // ============ MAPPER ===============
    public static class Map
    extends Mapper<Text, Text, Text, Text> {
 
        private final static Text word      = new Text();
        private static       Text doc_n_N_1 = new Text();
 
        //map       in  (word@doc, n;N)
        //          out (word, doc;n;N;1)
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
 
            String[] word_doc = key.toString().split("@");
            word.set(word_doc[0]);
            String doc = word_doc[1];
 
            String[] n_N = value.toString().split(";");
            String n = n_N[0];
            String N = n_N[1];
 
            doc_n_N_1.set(doc +";"+ n +";"+ N +";"+ 1);
 
            context.write(word, doc_n_N_1);
        }
    }
 
    // ============ REDUCER ===============
    public static class Reduce
    extends Reducer<Text, Text, Text, Text> {
 
        private static Text word_doc = new Text();
        private static Text n_N_df   = new Text();
 
        //reduce    in  (word, [doc;n;N;count, doc;n;N;count, ...])
        //          out (word@doc, n;N;df)
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
 
            // NOTE: at scale, memory may be an issue;
            //       possibly forcing the use of intermediate disk storage
            HashMap<String, Integer> word_docList = new HashMap<String, Integer>();
            HashMap<String, Integer> docList      = new HashMap<String, Integer>();
            HashMap<String, Integer> wordList     = new HashMap<String, Integer>();
            word_docList.clear();
            docList.clear();
            wordList.clear();
 
            String word = key.toString();
 
            for (Text value: values) {
                // doc;n;N;count
                String[] doc_n_N_count = value.toString().split(";");
                String  doc    = doc_n_N_count[0];
                Integer n      = Integer.parseInt(doc_n_N_count[1]);
                Integer N      = Integer.parseInt(doc_n_N_count[2]);
                Integer count  = Integer.parseInt(doc_n_N_count[3]);
 
                // save (doc, N), the total number of words in doc
                if (!docList.containsKey(doc)) {
                    docList.put(doc,N);
                } else {
                    if (!N.equals(docList.get(doc))) {   // compare Integer values, not references
                        System.out.println("N != docList.get(doc)"+
                                ": N="+N+
                                "; doc="+doc+
                                "; docList.get(doc)="+docList.get(doc));
                        System.exit(-1);
                    }
                }
                 
                // save (word@doc, n), the frequency of word in doc
                if (!word_docList.containsKey(word+"@"+doc)) {
                    word_docList.put(word+"@"+doc,n);
                } else {
                    if (!n.equals(word_docList.get(word+"@"+doc))) {   // compare Integer values, not references
                        System.out.println("n != word_docList.get(word+\"@\"+doc)"+
                                ": n="+n+
                                "; (word+\"@\"+doc="+word+"@"+doc+
                                "; word_docList="+word_docList.get(word+"@"+doc));
                        System.exit(-1);
                    }
                }
                 
                // increment df, the frequency of word in the dataset
                if (!wordList.containsKey(word)) {
                    wordList.put(word, count);
                } else {
                    Integer df = wordList.get(word);
                    df += count;
                    wordList.put(word, df);
                }
            }
 
            // compose the (word@doc, n;N;df) output
            for (String WordDoc: word_docList.keySet()) {
                String[] Word_Doc = WordDoc.split("@");
                String Word       = Word_Doc[0];
                String Doc        = Word_Doc[1];
                Integer little_en = word_docList.get(WordDoc);
                Integer big_en    = docList.get(Doc);
                Integer df        = wordList.get(Word);
 
                word_doc.set(WordDoc);
                n_N_df.set(little_en+";"+big_en+";"+df);
                context.write(word_doc,  n_N_df);
            }
        }
    }
}
TF-IDF Step4 ->

TF-IDF Step 4

<- TF-IDF Step3
package com.georgefisher.tfidf;
 
import java.io.IOException;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
 
import org.apache.hadoop.mapreduce.Mapper;
 
//MapReduce TF-IDF
//
//step 4
//------
//map       in  (word@doc, n;N;df)
//          out (word@doc, [tf, idf, tfidf])
//reduce    none, map-only
 
public class TFIDF_step4 {
    // ============ MAPPER ===============
    public static class Map
    extends Mapper<Text, Text, Text, Text> {
         
        private static int fileCount;
         
        public void setup(Context context) throws IOException {
            // get the number of documents in the original input folder
            Configuration  conf = context.getConfiguration();
            FileSystem     fs   = FileSystem.get(conf);
            Path           pt   = new Path(conf.get("originalInputDir"));
            ContentSummary cs   = fs.getContentSummary(pt);
            fileCount           = (int)cs.getFileCount();
        }
 
        private final static Text word_doc = new Text();
        private static       Text result   = new Text();
 
        //map       in  (word@doc, n;N;df)
        //          out (word@doc, [tf, idf, tfidf])
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
 
            // NOTE: since this step is map-only, its logic could easily be
            //       incorporated into a loop at the end of the previous
            //       step if the cost of job setup and teardown exceeds
            //       the desire for clear code
            String[] n_N_m = value.toString().split(";");
            Integer n = Integer.parseInt(n_N_m[0]);
          //Integer N = Integer.parseInt(n_N_m[1]);
            Integer df = Integer.parseInt(n_N_m[2]);
             
            double tf    = (double)n;
            double idf   = Math.log((double)fileCount / (double)df);
            double tfidf = tf * idf;
             
            word_doc.set(key);
            result.set("[tf="+n +
                        " idf=log(fileCount/df)=log("+fileCount+"/"+df+")="+idf+
                        " tfidf=tf*idf="+tfidf+"]");
 
            context.write(word_doc, result);
        }
    }
}
TF-IDF Step5 ->

TF-IDF Step 5

<- TF-IDF Step4
package com.georgefisher.tfidf;
 
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;
 
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
 
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
 
//MapReduce TF-IDF
//
// step 5   find the word in each document with the highest TF-IDF
// ------   =====================================================
//
// map      in  (word@doc, [tf, idf, tfidf])
//          out (doc, word;tfidf)
//  
// reduce   in  (doc, [word;tfidf, word;tfidf, ...])
//          out (word@doc, max_tfidf)
 
public class TFIDF_step5 {
    // ============ MAPPER ===============
    public static class Map
    extends Mapper<Text, Text, Text, Text> {
 
        private static Pattern regex;
 
        public void setup(Context context) throws PatternSyntaxException {
            regex = Pattern.compile("tfidf=tf\\*idf=([-]?\\d+\\.\\d+)", Pattern.MULTILINE);
        }
 
        private static Text doc        = new Text();
        private static Text word_tfidf = new Text();
 
        // map      in  (word@doc, [tf, idf, tfidf])
        //          out (doc, word;tfidf)
        @Override
        public void map(Text key, Text value, Context context)
                throws IOException, InterruptedException {
 
            String[] word_doc = key.toString().split("@");
            String word       = word_doc[0];
            doc.set(word_doc[1]);
 
            String inputValue = value.toString();
            String tfidf = null;
            Matcher regexMatcher = regex.matcher(inputValue);
            if (regexMatcher.find()) {
                tfidf = regexMatcher.group(1);
            }
            word_tfidf.set(word+";"+tfidf);
            context.write(doc,  word_tfidf);
        }
    }
 
    // ============ REDUCER ===============
    public static class Reduce
    extends Reducer<Text, Text, Text, DoubleWritable> {
 
        private static Text           word_doc  = new Text();
        private static DoubleWritable max_tfidf = new DoubleWritable();
 
        // reduce   in  (doc, [word;tfidf, word;tfidf, ...])
        //          out (word@doc, max_tfidf)
        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
 
            String doc = key.toString();
 
            double max = Double.NEGATIVE_INFINITY;  // Double.MIN_VALUE is the smallest positive double, not the most negative
            for (Text value: values) {
                String[] word_tfidf = value.toString().split(";");
                String word  = word_tfidf[0];
                Double tfidf = Double.parseDouble(word_tfidf[1]);
                if (tfidf > max) {
                    max = tfidf;
                    max_tfidf.set(tfidf);
                    word_doc.set(word+"@"+doc);
                }
            }
            context.write(word_doc,  max_tfidf);
        }
    }
}
TF-IDF Result ->

TF-IDF Most-Significant Term In Each of Shakespeare's Plays

<- TF-IDF Step5
parolles@ALLS_WELL_THAT_ENDS_WELL	636.5775779521175
rosalind@AS_YOU_LIKE_IT	960.3227461677658
org@A_LOVER_S_COMPLAINT	58.20137855562217
hermia@A_MIDSUMMER_NIGHT_S_DREAM	378.3089606115441
imogen@CYMBELINE	603.83930251458
wolsey@KING_HENRY_THE_EIGHTH	367.396202132365
hubert@KING_JOHN	349.208271333733
buckingham@KING_RICHARD_III	337.6937697909743
bolingbroke@KING_RICHARD_THE_SECOND	304.22223709384275
berowne@LOVE_S_LABOUR_S_LOST	716.6044734660979
isabella@MEASURE_FOR_MEASURE	578.3761993964953
pedro@MUCH_ADO_ABOUT_NOTHING	563.8258547575898
falstaff@SECOND_PART_OF_KING_HENRY_IV	497.5354874920355
syracuse@THE_COMEDY_OF_ERRORS	865.7455060148798
talbot@THE_FIRST_PART_OF_HENRY_THE_SIXTH	431.6255580799069
fal@THE_FIRST_PART_OF_KING_HENRY_THE_FOURTH	549.2755101186842
cressida@THE_HISTORY_OF_TROILUS_AND_CRESSIDA	524.1101382916264
fluellen@THE_LIFE_OF_KING_HENRY_THE_FIFTH	323.74516821564833
timon@THE_LIFE_OF_TIMON_OF_ATHENS	1021.7203257707548
bassanio@THE_MERCHANT_OF_VENICE	461.973442285251
ford@THE_MERRY_WIVES_OF_WINDSOR	853.8873039582677
cade@THE_SECOND_PART_OF_KING_HENRY_THE_SIXTH	356.4834436531858
muse@THE_SONNETS	14.872575337986811
petruchio@THE_TAMING_OF_THE_SHREW	633.0543805207847
prospero@THE_TEMPEST	527.4499931603259
warwick@THE_THIRD_PART_OF_KING_HENRY_THE_SIXTH	465.14832600557935
antony@THE_TRAGEDY_OF_ANTONY_AND_CLEOPATRA	710.6432758418573
menenius@THE_TRAGEDY_OF_CORIOLANUS	712.9668873063716
ham@THE_TRAGEDY_OF_HAMLET_PRINCE_OF_DENMARK	1302.255845182046
cassius@THE_TRAGEDY_OF_JULIUS_CAESAR	668.3876482707819
lear@THE_TRAGEDY_OF_KING_LEAR	847.5575752162479
macbeth@THE_TRAGEDY_OF_MACBETH	1058.5375724803782
iago@THE_TRAGEDY_OF_OTHELLO_MOOR_OF_VENICE	1316.8061898209517
rom@THE_TRAGEDY_OF_ROMEO_AND_JULIET	592.9265440354009
titus@THE_TRAGEDY_OF_TITUS_ANDRONICUS	436.7506089296601
proteus@THE_TWO_GENTLEMEN_OF_VERONA	803.9065412995312
leontes@THE_WINTER_S_TALE	545.6379239589578
toby@TWELFTH_NIGHT_OR_WHAT_YOU_WILL	752.9803350633618
TF-IDF Main Page ->

TF-IDF Term Frequency-Inverse Document Frequency

TF-IDF (Term Frequency-Inverse Document Frequency) is an algorithm for determining the most-relevant terms in a document within a large collection of documents. One example is Google looking for important terms in a vast collection of websites. I attempted a more modest analysis to find the most important term in each of the works of Shakespeare.
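
As Step 4 below computes them, the exact quantities are: for a word w in document d, tf = n, the raw count of w in d; idf = ln(D / df), where D is the number of documents in the collection and df is the number of documents that contain w; and tf-idf = tf × idf. For example, a word that appears twice in one document of a 38-document collection scores tf-idf = 2 × ln(38/1) ≈ 7.28.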

I wrote a collection of Java MapReduce programs to run under Hadoop. My computer setup was/is the following:

The whole program consists of a MapReduce driver and five Mapper/Reducer steps:

  1. Count how many times each word appears in each document: (word@doc, n)
  2. Add N, the total number of words in each document: (word@doc, n;N)
  3. Add df, the number of documents in which each word appears: (word@doc, n;N;df)
  4. Compute tf, idf and tf-idf for each word/document pair (map-only): (word@doc, [tf, idf, tfidf])
  5. Find the word with the highest tf-idf in each document: (word@doc, max-tfidf)

NOTES: This algorithm may run into problems at scale, either because of memory usage or because of the startup/teardown cost of numerous steps.

  1. Memory is likely to be the larger problem.
    • In the first step you can implement stemming to reduce the number of terms, and you can include certain high-frequency terms in the stop-word list.
    • The memory problems will be more severe in the reducer steps; implementing combiners may help by offloading some of the memory usage to the mapper tasks, which are likely to be more numerous than the reducer tasks (see the sketch after this list).
    • Instead of using in-memory HashMaps, you can implement temporary disk storage.
  2. The second and fourth steps can fairly easily be folded into the previous steps. There are two penalties for this: (1) the code is more complex and harder to understand, explain and debug; (2) the individual tasks will require more resources.
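
As a minimal sketch of the combiner suggestion above: the step 1 Reducer only sums counts, an associative and commutative operation, so the same class can be reused as a combiner and the partial sums get computed on the map side before the shuffle. The only change needed is one extra line in the step 1 section of TFIDF_driver.run():

        // Mapper, Combiner, Reducer
        // -------------------------
        job1.setMapperClass(TFIDF_step1.Map.class);
        job1.setCombinerClass(TFIDF_step1.Reduce.class);  // partial sums computed map-side
        job1.setReducerClass(TFIDF_step1.Reduce.class);

The other reducers are not simple associative aggregations (and step 5's output types differ from its map output types), so they could not be reused this way without more work.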

My thanks to the following for the help I derived from their websites:

My thanks also to Jure Leskovec and Daniel Templeton for teaching Stanford's CS246 and CS246H

 

Blogs

Central Securities
An analysis of Central Securities (CET).

Macroeconomics of The 2007 Collapse
What happened to America in 2007 has happened to hundreds of developing economies in the past fifty years.

An iPhone web app
Philadelphia Reflections is now available on the iPhone as a web app

SMTP Authorization and Handling Bounced Emails with PEAR Mail
Sign on to an SMTP server and get bounced emails returned to you

TF-IDF Driver
Driver code for a Hadoop MapReduce program to perform TF-IDF on any collection of documents.

TF-IDF Step1
First MapReduce job in a series of five to analyze a collection of documents and return the most-significant words in each document

TF-IDF Step 2
Second MapReduce job in a series of five to analyze a collection of documents and return the most-significant words in each document

TF-IDF Step 3
Third MapReduce job in a series of five to analyze a collection of documents and return the most-significant words in each document

TF-IDF Step 4
Fourth MapReduce job in a series of five to analyze a collection of documents and return the most-significant words in each document

TF-IDF Step 5
Fifth MapReduce job in a series of five to analyze a collection of documents and return the most-significant words in each document

TF-IDF Most-Significant Term In Each of Shakespeare's Plays
Run the TF-IDF algorithm using Hadoop and MapReduce on the collected works of Shakespeare and it will return this list of most-significant terms.

TF-IDF Term Frequency-Inverse Document Frequency
A "Big Data" algorithm for Hadoop MapReduce to determine the most-relevant terms is a large collection of documents