At a recent client engagement, I used MapReduce to run some API-level checks in parallel. It was surprisingly simple and needed me to change fewer than a dozen lines of code in my tests. MapReduce generally makes people think they need a farm of servers, but in this case I did it using only my local machine. That was possible because the map (i.e., parallel) action is an API call over the network, so the script spends its time waiting on the remote server and does not need parallel CPUs to perform the action.
I’ve cooked up a simple example to show testers how they can take advantage of MapReduce to speed up their API tests.
Ummmmm… what is MapReduce?
Fair warning: I’m going to explain MapReduce in a slightly inaccurate way just so it is easy for readers to pick up the gist of it. I want this post to be about running API checks in parallel rather than a full-blown tutorial on MapReduce. For folks who want to get into the details, please Google around. Minds much smarter than mine have written a lot on MapReduce.
Think of MapReduce as a programming model that has four parts:
1) chunks of data: break up your input data into smaller chunks. This part should be done so each chunk of data can be processed independently by 3) a method
2) Map: takes two arguments – a method you want to run in parallel and a list/array containing the chunks of data
3) a method: that you want to run in parallel. You may need to modify the arguments the method accepts so it matches your 1) chunk of data
4) Reduce: collects and merges the processing result from each chunk of input data
Usually, your method will process the whole blob of input data sequentially. However, by chunking your input data and then sandwiching map and reduce on either side of the method, you can get the method to process the input data in parallel.
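Here is a minimal sketch of the four parts on a toy problem (summing squares). The names chunk and process_chunk are made up for illustration, and Python's built-in map runs sequentially; it only shows the shape that a parallel map slots into.

```python
from functools import reduce


def chunk(data, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def process_chunk(numbers):
    """The method: sums the squares of one chunk, independently of the others."""
    return sum(n * n for n in numbers)


data = list(range(1, 11))                 # the whole blob of input data
chunks = chunk(data, 3)                   # 1) chunks of data
partials = map(process_chunk, chunks)     # 2) Map applies 3) the method to each chunk
total = reduce(lambda a, b: a + b, partials, 0)  # 4) Reduce merges the partial results

print(total)
```

Because process_chunk never looks outside its own chunk, swapping the built-in map for a parallel one should not change the answer.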
The canonical example when learning MapReduce is to count the number of words in a text repository. Assume you have a directory with 100 text files. If you wanted to count the total number of words within all the text files in the directory using MapReduce, here is how you would break it down:
1) chunks of data: each file is a chunk of data
2) Map: takes two arguments – count_file_words and a list/array containing the filenames
3) a method: that counts and returns the number of words within a file. Let's call it count_file_words(filename)
4) Reduce: collects and then adds up the results returned by count_file_words for each file.
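The word count breakdown above could be sketched roughly like this. The helper count_directory_words and the three sample files are assumptions made up for the demo, and a "word" is taken to be anything separated by whitespace. The built-in map is used here, so the files are still read one after another.

```python
import os
import tempfile
from functools import reduce


def count_file_words(filename):
    """The method: count whitespace-separated words in one file."""
    with open(filename, encoding="utf-8") as handle:
        return len(handle.read().split())


def count_directory_words(directory):
    """Hypothetical helper: map count_file_words over every file, then reduce."""
    filenames = [os.path.join(directory, name)
                 for name in sorted(os.listdir(directory))]  # chunks of data
    counts = map(count_file_words, filenames)                 # Map
    return reduce(lambda total, count: total + count, counts, 0)  # Reduce


# Demo on three small, made-up files in a temporary directory
with tempfile.TemporaryDirectory() as directory:
    samples = {
        "a.txt": "the quick brown fox",
        "b.txt": "jumps over",
        "c.txt": "the lazy dog",
    }
    for name, text in samples.items():
        with open(os.path.join(directory, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    total_words = count_directory_words(directory)

print(total_words)
```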
Again, if you want more technical rigor, I encourage you to Google around. There are umpteen examples. MapReduce is not a concept that is beyond the reach or technical competence of the average tester.
Example problem: the sequential way
Let’s say you have an API that can tell you the average salary in a country for a given age group. Let’s pretend the API call is a request to a path of the form /country/<country_name>/age/<age>.
So your method to get the salary, get_salary, would look something like:
    def get_salary(country_name, age):
        "Make the api call to get the salary"
        url = '/country/%s/age/%d' % (country_name, age)
        salary = fetch_url(url)  # Assume fetch_url gets the URL, parses it and returns the salary
        return salary
If you wanted to compile a list of salaries by age, you would write a snippet like:
    actual_salaries = []
    for country in countries:
        for age in ages:
            salary = get_salary(country, age)
            actual_salaries.append(salary)
You may have noticed that every time you call fetch_url(), which in turn makes a network call, your script does nothing while waiting for the remote call to finish. That is not ideal. Let's use the magic of MapReduce to make the network calls in parallel. That way you greatly reduce the time your script spends waiting and doing nothing.
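To get a feel for the cost of waiting, here is a small sketch that times the sequential loop. Everything about fetch_url is an assumption: it is a hypothetical stand-in that sleeps for 50 ms to imitate network latency and returns a made-up number, and the countries and ages are sample values.

```python
import time

SIMULATED_LATENCY = 0.05  # seconds per call; an assumed value, not a measurement


def fetch_url(url):
    """Hypothetical stand-in for the real network call."""
    time.sleep(SIMULATED_LATENCY)  # the script sits idle here
    return len(url) * 1000         # made-up "salary" so the demo has a value


def get_salary(country_name, age):
    "Make the api call to get the salary"
    url = '/country/%s/age/%d' % (country_name, age)
    return fetch_url(url)


countries = ['India', 'USA']  # sample data
ages = [24, 32, 64]

start = time.perf_counter()
actual_salaries = []
for country in countries:
    for age in ages:
        actual_salaries.append(get_salary(country, age))
elapsed = time.perf_counter() - start

print('%d calls took %.2f seconds' % (len(actual_salaries), elapsed))
```

With six calls, the total time is roughly six times the latency of a single call, because each call starts only after the previous one has returned.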
Example problem: the MapReduce way
Let's break down our problem into the four components of MapReduce:
1) chunks of data: I’d define each chunk of data as [country_name,age] because that is the discrete chunk of data that is operated upon by the method we want to run in parallel.
    data_chunks = []
    for country in countries:
        for age in ages:
            data_chunks.append([country, age])
So our data_chunks list looks something like: [['India', 32], ['India', 64], ['USA', 24]].
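If you want every combination of country and age, the standard library's itertools.product can build the same list without the nested loops. This is only a sketch; the countries and ages below are sample values.

```python
from itertools import product

countries = ['India', 'USA']  # sample data
ages = [24, 32, 64]

# One [country, age] chunk per combination, in the same order as the nested loops
data_chunks = [[country, age] for country, age in product(countries, ages)]

print(data_chunks)
```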
2) Map: takes two arguments – get_salary and a list/array containing the discrete chunks of data. With Python, introducing map is only a few lines of code:
    from multiprocessing import Pool  # multiprocessing ships with the Python standard library (2.6+)

    # Define the number of parallel processes you want to run.
    # I found this script to be unstable with too many parallel processes on my machine,
    # so I cap it at 15. YMMV.
    num_parallel_processes = min(len(data_chunks), 15)
    pool = Pool(processes=num_parallel_processes)
    results = pool.map(get_salary, data_chunks)
3) a method: get_salary. But notice that we need to modify it slightly so it takes our chunk of data as the argument. So instead of two arguments called country_name and age, let's modify it to take a list and then unpack the list within the method.
    def get_salary(a_list):
        "Make the api call to get the salary"
        # a_list is a list of form [country_name, age]
        country_name = a_list[0]
        age = a_list[1]
        url = '/country/%s/age/%d' % (country_name, age)
        salary = fetch_url(url)  # Assume fetch_url gets the URL, parses it and returns the salary
        return salary
WOOT! Our method is ready!!
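As an aside, on Python 3.3 and later there is an alternative that leaves the two-argument get_salary untouched: the pool's starmap method unpacks each chunk into separate arguments. The sketch below is not what I used at the client. It assumes a hypothetical fetch_url stub that returns a made-up value, and it uses the thread-backed pool from multiprocessing.dummy (same API as multiprocessing.Pool) so it can be pasted into any script and run as is.

```python
from multiprocessing.dummy import Pool  # thread-backed pool, same API as multiprocessing.Pool


def fetch_url(url):
    """Hypothetical stand-in for the real network call; returns a made-up value."""
    return len(url)


def get_salary(country_name, age):
    "Original two-argument signature, unchanged"
    url = '/country/%s/age/%d' % (country_name, age)
    return fetch_url(url)


data_chunks = [['India', 32], ['India', 64], ['USA', 24]]

with Pool(processes=3) as pool:
    # starmap calls get_salary(*chunk) for each chunk and keeps the input order
    results = pool.starmap(get_salary, data_chunks)

print(results)
```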
4) Reduce: form an ordered list of salaries for each [country_name,age]. Guess what – there is nothing special you need to do here. This one line below does that for you automagically!
results = pool.map(get_salary,data_chunks)
And that’s about it! Really! With fewer than a dozen lines of code, you can make your API checks run in parallel.
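Putting the pieces together, an end-to-end run could look something like the sketch below. Several things here are assumptions: fetch_url is a hypothetical stub that sleeps to imitate latency, get_salaries_in_parallel is a made-up wrapper name, and the pool comes from multiprocessing.dummy, which is backed by threads rather than processes. Threads are usually enough when the work is waiting on the network. Changing the import to "from multiprocessing import Pool" should give the process-based version, provided the script's entry point sits under an if __name__ == '__main__': guard.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed; same map API as multiprocessing.Pool

SIMULATED_LATENCY = 0.05  # seconds per call; an assumed value


def fetch_url(url):
    """Hypothetical stand-in for the real network call."""
    time.sleep(SIMULATED_LATENCY)
    return len(url) * 1000  # made-up "salary"


def get_salary(a_list):
    "Make the api call to get the salary"
    country_name, age = a_list  # a_list is of the form [country_name, age]
    url = '/country/%s/age/%d' % (country_name, age)
    return fetch_url(url)


def get_salaries_in_parallel(data_chunks, max_parallel=15):
    """Map get_salary over the chunks; results come back in input order."""
    if not data_chunks:
        return []  # a pool cannot be created with zero workers
    num_parallel = min(len(data_chunks), max_parallel)
    with Pool(processes=num_parallel) as pool:
        return pool.map(get_salary, data_chunks)


countries = ['India', 'USA']  # sample data
ages = [24, 32, 64]
data_chunks = [[country, age] for country in countries for age in ages]

start = time.perf_counter()
results = get_salaries_in_parallel(data_chunks)
elapsed = time.perf_counter() - start

# Because the order is preserved, each result lines up with its chunk
for chunk, salary in zip(data_chunks, results):
    print(chunk, salary)
print('%d calls took %.2f seconds' % (len(results), elapsed))
```

Since results[i] corresponds to data_chunks[i], an API check can compare each salary against its expected value by position.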
Notes
1. You may run into PicklingErrors if you are passing around objects that are not serializable.
2. On my machine, the script grew unstable beyond ~20 parallel processes. So I limit myself to 15 parallel calls. You need to experiment on your machines to figure out how many such parallel calls you can make.
3. The real life example I used MapReduce on was more complex and does not fit within a blog post. If you are interested in the details, hit me up at mak at qxf2.com
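On the PicklingErrors in note 1: multiprocessing pickles the method and each chunk to send them to the worker processes, so one way to track down an offender is to try pickling the pieces yourself. The helper is_picklable below is a hypothetical name, and this is only a rough diagnostic sketch.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if obj can be serialized with pickle, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, AttributeError, TypeError):
        return False
    return True


print(is_picklable(['India', 32]))        # a plain list of a string and an int
print(is_picklable(lambda chunk: chunk))  # a lambda cannot be looked up by name
print(is_picklable(threading.Lock()))     # locks hold OS-level state
```

Plain data such as lists, strings and numbers pickles fine. Lambdas, locks, open connections and similar objects typically do not, so keep them out of the chunks and create them inside the method instead.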
References
1. Python parallelism in one line: An excellent post to get you started with both MapReduce as well as multiprocessing
2. PyMOTW: MapReduce with multiprocessing: Doug Hellmann’s guides are a fantastic way to learn new Python modules.
I have tried my best to generalize my approach with this example. If it is lacking, please do let me know how I can make it more illustrative!
I want to find out what conditions produce remarkable software. A few years ago, I chose to work as the first professional tester at a startup. I successfully won credibility for testers and established a world-class team. I have led the testing for early versions of multiple products. Today, I run Qxf2 Services. Qxf2 provides software testing services for startups. If you are interested in what Qxf2 offers or simply want to talk about testing, you can contact me at: mak at qxf2.com. I like testing, math, chess and dogs.