Processing small (100s of megabytes) to large (terabytes) volumes of data is frequently necessary during development. And quite often in rapidly evolving projects, the required infrastructure to run these workloads is not present, difficult to deploy against, or simply not ready yet. But we do have access to a multicore machine - anywhere from a local laptop to a 30-core blade.
Having worked with subject matter experts in multiple domains, ranging from healthcare and finance to machine learning and statistics, a recurring theme is the need to process large amounts of data during development using a variety of infrastructures: in-house distributed systems, PySpark, Hadoop, or plain ol' large machines. Rapid prototyping and fast development on these systems is often quite time-consuming, and sometimes the infrastructure isn't even deployed yet. A consistent challenge has been the need to do all of this with code written in R, Python, Java, or Perl.

When these situations occur, the Unix Philosophy comes to mind: the data pipeline under development should be up and running quickly, it needs to be modular, and it should use text streams or the file system to pass data between modules. This is where Make comes into play.
One could easily think, "Wait, what? Make? That really old C build tool?"
That is a fair question. Plenty of good pipeline tools exist for large datasets, yet you will surely find yourself in a scenario without them: no one maintains a PySpark cluster or an Airflow deployment, and no one is around to translate the code out of Scala. One common scenario in research and cross-functional domains is that a core contributor may not be a software engineer, yet they understand their code well enough to read a file in, process it, and write a file out. Make helps in these scenarios to create an appropriately abstracted, *simple* data pipeline that is easily parallelisable and, indeed, even extendable.
This post aims to help you use Make to set up a simple pipeline that can be iterated on quickly. Once the data pipeline is up and running, you will have a simple design abstraction that can later be ported to a more appropriate infrastructure.
There are many tools that perform pipeline processing, and a comparison of tools is beyond the scope of this document. However, there are a few points in Make's favour: it is installed on virtually every *nix machine, it reruns only the steps whose data or code have changed, and it can parallelise independent steps with a single flag.
Map Reduce is a well-known technique for parallel processing. It's most useful for distributing jobs across large clusters with many machines. But we don't always need a large Hadoop cluster to get the benefit of the paradigm.
Make is one of the greatest programs in the *nix ecosystem. It's been in active development for over forty years, meaning that it does its job really, really well. What is its job? Most people know it as a build tool for compiling C programs, or perhaps as a way of keeping small aliases in a directory for tasks like ‘install’ or ‘clean’. But more fundamentally, its job is to be a scheduler.
We could spend a lot of time getting into the what, why, and how, but for now let's keep this a practical tutorial and get up and running with Map Reduce in Make.
Make has a long history and excellent documentation at the GNU Make website. But to make the following tutorial more digestible, we will refresh some basics about Make here.
Variables are assigned using the = operator. Convention is that variable names are all caps because they’re considered to be constants. They are defined at the beginning of make execution and do not change during each call to make. We can also pass variables to make via the command line.
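For example (DATA_DIR and N_FILES are just illustrative names):

```make
# Defined once at the start of execution; by convention the names are all caps.
DATA_DIR = ./data
N_FILES  = 1000

# Any of these can be overridden from the command line, e.g.:
#   make DATA_DIR=./other_data
```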
Special variables help us create reusable recipes:
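The ones we will lean on later are the automatic variables. For example, in a hypothetical rule like the following (helper.sh and the extensions are placeholders), $@ expands to the target, $< to the first prerequisite, and $^ to all prerequisites:

```make
%.out: %.in helper.sh
	./helper.sh $< > $@
```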
In Unix, Everything is a File, and it's helpful to remember that targets are files as well. However, for Make Reduce, we can adopt a convention: if a target does not have a file extension, then it does not represent a file to be updated. This is known as a Phony Target: it's simply a name for the recipe that will be executed.
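For instance, a hypothetical status target:

```make
# "report" is never created as a file, so make runs this recipe whenever asked.
.PHONY: report
report:
	@echo "pipeline status: OK"
```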
We will work with a typical example of Map Reduce: word counting.
Goal: Count all of the unique words in some group of text files.
Pipeline Plan:
1. We will grab some text files from Project Gutenberg (initialisation)
2. We will clean each text file (map)
3. We will count the words in each cleaned file (map)
4. We will sum up all of the counted words and merge them together (reduce)
The tasks above marked with (map) are obviously parallel: task 2 can operate on each file independently, since cleaning one file has no bearing on cleaning any other file. Task 3 is obviously parallel in the same way: the word count for one file is independent of every other file. Finally, the reduce step adds everything up and outputs a single count file.
We can now begin building our Makefile. We’ll start with a few initialisation variables: what directory our source text files will be downloaded to, and what directory our work will be stored in.
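A minimal sketch of those variables - ./make_reduce/data is referenced later in this post, while the work directory name and the exact rsync size filter are assumptions:

```make
DOWNLOAD_DIR = make_reduce/data
WORK_DIR     = make_reduce/work

# Restrict the rsync download to roughly 1000 files; leave blank to fetch everything.
LIMITSIZE = --min-size=100k --max-size=105k
```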
We'll start with a simple rule, init, to initialise and download. We want our rule's recipe to do the following:
1. Create the download and work directories
2. rsync a limited set of text files from a Project Gutenberg mirror
3. Soft link all of the downloaded .txt files into a single flat directory
We'll be downloading files from Project Gutenberg. In order to follow their robot policy, we'll rsync files from a mirror at aleph.gutenberg.org. The LIMITSIZE variable allows us to download only about 1000 files by limiting the files available for download to those between 100KB and 105KB. If we set the LIMITSIZE variable blank, we will download all ~100,000 files serially, which could take an hour or so.
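A sketch of the init rule using the variables above; the rsync module name, the include/exclude filters, and the linking step are assumptions based on the description (recipe lines are indented with a tab):

```make
init: $(DOWNLOAD_DIR)

$(DOWNLOAD_DIR):
	mkdir -p $(DOWNLOAD_DIR) $(WORK_DIR)
	# pull size-limited .txt files from the Gutenberg mirror
	rsync -av $(LIMITSIZE) --include='*/' --include='*.txt' --exclude='*' \
		aleph.gutenberg.org::gutenberg $(DOWNLOAD_DIR)/
	# flatten the mirror's directory hierarchy with soft links
	find $(CURDIR)/$(DOWNLOAD_DIR) -name '*.txt' -exec ln -sf {} $(WORK_DIR)/ \;
```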
To execute the initialisation rules, run the following command:
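With the sketch above, that is simply:

```sh
make init
```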
In this command, we've told make that our goal is to create or update the ./make_reduce/data directory. make reviews the dependency graph in the Makefile, topologically sorts it to determine the correct order of operations, and then executes all the recipes required. We've fed it a graph, and it executes it for us!
Once it completes, you should see something similar to the output below - we have downloaded all the files and then soft linked them:
You should be able to see all the downloaded files in the DOWNLOAD_DIR:
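One way to peek, assuming the paths sketched earlier:

```sh
find make_reduce/data -name '*.txt' | head
```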
It is a little messy because it follows the directory hierarchy from Gutenberg, which is why we soft linked all the .txt files into a single directory.
And the file count:
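Again assuming the layout above:

```sh
ls make_reduce/work/*.txt | wc -l
```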
We have about 1000 text files that we want to perform a word count on. Here are the first 10 lines of The Vigil of Venus and Other Poems by “Q”:
Let's create a function that might make a data scientist a bit more comfortable. We'll write our clean function in Python and have the recipe execute it from the shell.
Below is our Python script for cleaning text files. It reads a file into memory, lowercases all characters, removes all punctuation, and removes any tokens that are fewer than three characters long. It then writes the cleaned text, one word per line, to an output file:
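A sketch of such a script, here called clean_txt.py - the filename and argument order are assumptions that the Makefile rule below reuses:

```python
#!/usr/bin/env python3
"""clean_txt.py: lowercase, strip punctuation, drop short tokens."""
import string
import sys


def clean(in_path, out_path):
    # read the whole file into memory and lowercase it
    with open(in_path, encoding="utf-8", errors="ignore") as f:
        text = f.read().lower()
    # remove all punctuation
    text = text.translate(str.maketrans("", "", string.punctuation))
    # keep only tokens of three or more characters
    words = [w for w in text.split() if len(w) >= 3]
    # write the cleaned text, one word per line
    with open(out_path, "w", encoding="utf-8") as f:
        f.write("\n".join(words) + "\n")


if __name__ == "__main__":
    clean(sys.argv[1], sys.argv[2])
```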
Let’s encode the clean step into our Makefile using the following strategy:
Create a Match-Anything Pattern Rule which links each input file name to the output filename.
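A sketch of that rule, plus a clean goal that asks for every *.cln file, reusing the hypothetical clean_txt.py from above:

```make
TXT_FILES = $(wildcard $(WORK_DIR)/*.txt)
CLN_FILES = $(TXT_FILES:.txt=.cln)

# "clean" here means "produce every cleaned file", per our no-extension convention.
clean: $(CLN_FILES)

# Pattern rule: each .cln file is built from the .txt file of the same name.
%.cln: %.txt
	python3 clean_txt.py $< $@
```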
We can see the creation of the *.cln files in our WORK_DIR:
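For example:

```sh
ls make_reduce/work/*.cln | head
```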
Each pair of *.txt and *.cln files is independent of every other pair, which means the cleaning step is obviously parallel. make has a critical option that we can pass from the command line to process each of the independent targets in parallel: -j. -j or --jobs tells make how many jobs to run in parallel. We can say -j 4, which will process 4 jobs at a time, or we can pass -j without a number and make will place no limit on how many jobs run simultaneously.
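For example:

```sh
make clean -j 4   # at most four jobs at a time
make clean -j     # no limit on the number of parallel jobs
```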
Below is the GNU Time output from running make clean in serial - one job at a time. It took about 1 minute to process 1000 files on a standard laptop.
Next we will run make -j, which parallelises as much as possible. If you would like to try it, you will need to delete the *.cln files so that make knows to remake them:
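Under the assumed layout, that looks something like:

```sh
rm -f make_reduce/work/*.cln    # make the .cln targets stale again
/usr/bin/time make clean -j     # rebuild them, this time in parallel
```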
Below is the GNU Time output from running make -j with 6 cores. It took about 12 seconds - roughly a 4-5x speed up.
Each of the *.cln files needs to be counted. We'll use simple Bash commands for this task:
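A sketch of such a script, here called count_cln.sh to match the prerequisite discussed below; the exact commands are an assumption:

```sh
#!/usr/bin/env bash
# count_cln.sh: count occurrences of each unique word in the given cleaned
# file(s); output lines look like "<count> <word>", most frequent first.
set -euo pipefail
sort "$@" | uniq -c | sort -rn
```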
We will repeat the steps that we did with clean. You should notice a small difference in the rule that creates the *.cnt files: we have added the executable count_cln.sh as a prerequisite. Why? Because if we change our count function, make will automatically rerun our count step! Only the parts of the pipeline with either new data or a new function get restarted. We also change our special variable from $< to $^ so that all normal prerequisites are passed to the recipe.
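One way to write the rule consistently with that description - putting the script first in the prerequisite list is a choice of mine, so that `bash $^` reads as `bash count_cln.sh somefile.cln`:

```make
CNT_FILES = $(CLN_FILES:.cln=.cnt)

count: $(CNT_FILES)

# count_cln.sh is a prerequisite: editing the script makes every .cnt stale.
%.cnt: count_cln.sh %.cln
	bash $^ > $@
```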
Once we run make count -j, we should see *.cnt files with unique counts for each word:
Now that we have cleaned and counted all of the words and put them into their own files, we need to perform the reduce step. We will use our $^ variable again to pass in all of the normal prerequisites to the recipe.
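A sketch of the reduce rule, assuming the "<count> <word>" lines produced by the count script above and an output filename of my own choosing:

```make
reduce: $(WORK_DIR)/total_counts.txt

# Merge the per-file counts: sum the first column for each word, sort by total.
$(WORK_DIR)/total_counts.txt: $(CNT_FILES)
	cat $^ | awk '{ t[$$2] += $$1 } END { for (w in t) print t[w], w }' | sort -rn > $@
```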
We have been executing these steps individually to understand how each one works. Now let's put them all together in their own phony target. We will define a goal, pipeline, that executes all of the make commands recursively from the shell using the special variable $(MAKE). We can set our parallel parameters with the -j flag and end with the reduce step.
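A sketch, reusing the stage names from above:

```make
pipeline:
	$(MAKE) init
	$(MAKE) clean -j
	$(MAKE) count -j
	$(MAKE) reduce
```

Running each stage as its own sub-make also means wildcard-based variables like CLN_FILES are re-evaluated after the previous stage has produced its files.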
If you've been following along with the commands, the make pipeline command will probably just let you know that everything is already complete. That's okay. We can force it to execute in a couple of different ways: by passing a new WORK_DIR variable, or with the force option -B.
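For example:

```sh
make pipeline WORK_DIR=make_reduce/work2   # a fresh work directory (example path)
make pipeline -B                           # force every recipe to run again
```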
We have also been playing with a small sample - approximately 1000 files. When we ran the pipeline on a standard laptop, it finished in about a minute after the initial download. Not bad. But if we want to see what happens when we run against all ~100,000 text files of the Gutenberg corpus, we need to change the LIMITSIZE variable. The initialisation step of downloading all of that data then becomes the bottleneck; rsync helps by not re-downloading files we already have. Once the initial download is complete, we will be able to execute against a much, much larger dataset.
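With the variables sketched earlier, that change is just another command-line override:

```sh
# an empty LIMITSIZE drops the rsync size filter and mirrors the full corpus
make init LIMITSIZE=
```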