MapReduce
Counts words in a text corpus using a MapReduce strategy.
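As a rough illustration of the strategy (not the application's actual code), the sketch below counts tokens per chunk in a map step and merges the partial counts in a reduce step. The function names `map_count` and `reduce_counts` and the whitespace tokenization are assumptions made for this example.

```python
from collections import Counter
from functools import reduce


def map_count(text: str) -> Counter:
    """Map step: count whitespace-separated tokens in one chunk of text."""
    return Counter(text.split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


# Three small chunks stand in for the files handled by separate map tasks.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
partials = [map_count(chunk) for chunk in chunks]
total = reduce(reduce_counts, partials, Counter())
print(total.most_common(2))  # [('the', 3), ('fox', 2)]
```

Because each map step only needs its own chunk, the map calls can run independently and only the small partial counts need to be combined at the end.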
Installation
This application only requires TaPS to be installed.
Data
The Enron email dataset is available at https://www.cs.cmu.edu/~enron/.
The following command will download and extract the tarfile to data/maildir.
Example
To see all parameters, run the following command:
Enron Corpus
The following command distributes the text files of the Enron Corpus within data/maildir across 16 map tasks.
Once the computations have finished, the top 10 most common tokens will be printed.
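The sketch below shows one way the files could be split across map tasks and the most common tokens reported. It is a sequential approximation under stated assumptions, not how TaPS schedules the work: the helper names `partition`, `map_task`, and `top_tokens`, the round-robin split, and the whitespace tokenization are all hypothetical. The demo runs on a temporary directory rather than data/maildir.

```python
import tempfile
from collections import Counter
from pathlib import Path


def partition(paths, n_tasks):
    """Split paths round-robin into at most n_tasks non-empty groups."""
    groups = [paths[i::n_tasks] for i in range(n_tasks)]
    return [group for group in groups if group]


def map_task(paths):
    """Count whitespace-separated tokens across one group of files."""
    counts = Counter()
    for path in paths:
        counts.update(Path(path).read_text(errors="ignore").split())
    return counts


def top_tokens(root, n_tasks=16, n_top=10):
    """Run the map tasks one after another, merge, and return the top tokens."""
    paths = sorted(p for p in Path(root).rglob("*") if p.is_file())
    total = Counter()
    for group in partition(paths, n_tasks):
        total.update(map_task(group))
    return total.most_common(n_top)


# Small demo on a temporary directory instead of data/maildir.
with tempfile.TemporaryDirectory() as tmp:
    for i in range(4):
        Path(tmp, f"mail{i}.txt").write_text("hello world hello")
    demo_top = top_tokens(tmp, n_tasks=2)
    print(demo_top)  # [('hello', 8), ('world', 4)]
```

Pointing `top_tokens` at data/maildir with the default `n_tasks=16` would mirror the split described above, although the tokens reported depend on how the real application tokenizes the text.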
Randomly Generated
Here, we will generate 16 random files for each of 16 map tasks.
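For intuition, generating such an input set might look like the sketch below. The function `generate_files`, the directory layout, the word lengths, and the number of words per file are assumptions chosen for illustration; the application's own generator may differ.

```python
import random
import string
import tempfile
from pathlib import Path


def generate_files(root, n_tasks=16, files_per_task=16, words_per_file=100, seed=0):
    """Write files_per_task random text files for each task; return paths per task."""
    rng = random.Random(seed)
    groups = []
    for task in range(n_tasks):
        task_dir = Path(root, f"task-{task:02d}")
        task_dir.mkdir(parents=True, exist_ok=True)
        group = []
        for index in range(files_per_task):
            # Each word is 1 to 8 random lowercase letters.
            words = [
                "".join(rng.choices(string.ascii_lowercase, k=rng.randint(1, 8)))
                for _ in range(words_per_file)
            ]
            path = task_dir / f"file-{index:02d}.txt"
            path.write_text(" ".join(words))
            group.append(path)
        groups.append(group)
    return groups


with tempfile.TemporaryDirectory() as tmp:
    groups = generate_files(tmp)
    n_files = sum(len(group) for group in groups)
    first_len = len(groups[0][0].read_text().split())
    print(len(groups), n_files)  # 16 256
```

With 16 files for each of 16 map tasks, the run processes 256 files in total.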