MapReduce
Counts words in a text corpus using a MapReduce strategy.
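As a rough illustration of the strategy (a minimal sketch, not the application's actual implementation), each map task can count the tokens in its share of the text, and a reduce step can merge the partial counts. The function names `map_count` and `reduce_counts`, the sample documents, and the lowercase whitespace tokenization are all assumptions made for this example.

```python
from collections import Counter
from functools import reduce


def map_count(text: str) -> Counter:
    """Map step: count the lowercased, whitespace-separated tokens in one text."""
    return Counter(text.lower().split())


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge two partial counts into one."""
    return left + right


# Hypothetical stand-ins for the contents of three text files.
documents = ["the cat sat", "the dog sat", "the end"]

partial_counts = [map_count(doc) for doc in documents]
total = reduce(reduce_counts, partial_counts, Counter())

print(total.most_common(2))  # [('the', 3), ('sat', 2)]
```

Because merging counts is associative, the partial counts can be produced independently and combined in any order, which is what makes the map step easy to run in parallel.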
Installation
This application only requires TaPS to be installed.
Data
The Enron email dataset is available at https://www.cs.cmu.edu/~enron/.
The following command will download and extract the tarfile to data/maildir.
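For readers who prefer to fetch the data from Python, the download-and-extract step might be sketched as follows. This is an assumption-laden sketch: `download_and_extract` is a hypothetical helper, and the tarball URL is deliberately left as a parameter because the archive's file name is not given here.

```python
import tarfile
import urllib.request
from pathlib import Path


def download_and_extract(url: str, dest: str = "data") -> None:
    """Stream a gzipped tarball from `url` and unpack it under `dest`."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        # "r|gz" reads the archive as a stream, so the tarball itself
        # is not written to disk before extraction.
        with tarfile.open(fileobj=response, mode="r|gz") as archive:
            if hasattr(tarfile, "data_filter"):
                # Reject unsafe members on Python versions with extraction filters.
                archive.extractall(dest, filter="data")
            else:
                archive.extractall(dest)
```

Calling `download_and_extract(url)` with the tarball URL listed on the dataset page should produce data/maildir, assuming the archive's top-level directory is named maildir.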
Example
To see all parameters, run the following command:
python -m taps.run --app mapreduce --help
Enron Corpus
The following command distributes the text files of the Enron Corpus within data/maildir across 16 map tasks.
Once the computations have finished, the top 10 most common tokens will be printed.
python -m taps.run --app mapreduce \
    --app.data-dir data/maildir --app.map-tasks 16 \
    --engine.executor process-pool
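One way to picture how a directory of files could be divided among a fixed number of map tasks is sketched below. The round-robin split and the helper name `partition_files` are assumptions for illustration; nothing here states how the application itself assigns files to tasks.

```python
from pathlib import Path


def partition_files(data_dir: str, map_tasks: int) -> list[list[Path]]:
    """Split every regular file under `data_dir` into `map_tasks` groups."""
    if map_tasks < 1:
        raise ValueError("map_tasks must be at least 1")
    # Sort so that the assignment is deterministic across runs.
    files = sorted(p for p in Path(data_dir).rglob("*") if p.is_file())
    # Round-robin assignment keeps group sizes within one file of each other.
    return [files[i::map_tasks] for i in range(map_tasks)]
```

Under these assumptions, `partition_files("data/maildir", 16)` would return 16 lists of paths, one per map task, and each list could then be handed to a separate worker.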
Randomly Generated
Here, we will generate 16 random files for each of 16 map tasks.