
Concurrency: processing multiple files in Python

Some Notes on Tim Bray's Wide Finder Benchmark

Fredrik Lundh | Updated October 12, 2007 | Originally posted October 6, 2007

The Problem #

Tim Bray recently posted about his experiences from using Erlang to do some straightforward parsing of a large log file, inspired by a chapter he wrote for the book Beautiful Code. As it turned out, Erlang isn’t exactly optimized for tasks like this. After trying to parse a 1,000,000-line log file, Tim notes:

“My first cut in Erlang, based on the per-process dictionary, took around eight minutes of CPU, and kept one of my MacBook’s Core Duo processors pegged at 97% while it was running. Ouch!”

That’s less than a half megabyte per second. Not very impressive. Let’s see if we can come up with something better in Python.

A Single-Threaded Python Solution #

Santiago Gala followed up on Tim’s original post with a nice map/reduce-based implementation in Python:

http://memojo.com/~sgala/blog/2007/09/29/Python-Erlang-Map-Reduce

Santiago’s script uses a series of nested generators to do filtering and mapping, and then uses a for-in-loop to reduce the mapped stream into a dictionary.

To benchmark the script, I created a sample by concatenating 100 copies of Tim’s original 10,000-line sample file. With that file, Santiago’s script needs about 6.7 seconds wall-time to parse 200 megabytes of log data on my Core Duo laptop (using Windows XP, warmed-up disk caches, and the final print statement replaced with a pass).

Tim’s 1.67 GHz Core Duo L2400 MacBook should match my 1.66 GHz Core Duo T2300 HP notebook pretty well, so that’s about 70 times faster than his Erlang program, and about twice as fast as his Ruby version. Not too shabby.

But we can speed things up a bit more, of course.

Compiling the RE #

Python’s RE engine caches compiled expressions, but it’s usually a good idea to move the cache lookup out of the inner loop anyway. And while we’re at it, we can move the method lookup out of the loop as well:

pat = r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) "
search = re.compile(pat).search

matches = (search(line) for line in file("o10k.ap"))

With these changes, the script finishes in 4.1 seconds.
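A rough way to see this effect on your own machine is to time both call styles with timeit. The sketch below is written for Python 3 (the listings in this article are Python 2), and the sample log line is made up for illustration; absolute numbers will differ from the ones quoted here.

```python
import re
import timeit

PAT = r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) "

# Made-up log line, for illustration only.
LINE = ('1.2.3.4 - - [06/Oct/2007:00:00:00 -0700] '
        '"GET /ongoing/When/200x/2007/10/06/Wide-Finder HTTP/1.1" 200 1234\n')

search = re.compile(PAT).search     # compile once, keep the bound method


def via_module():
    return re.search(PAT, LINE)     # goes through re's pattern cache on every call


def via_bound_method():
    return search(LINE)             # no cache lookup, no attribute lookup


for func in (via_module, via_bound_method):
    seconds = timeit.timeit(func, number=100000)
    print("%-18s %.3f s" % (func.__name__, seconds))
```

Both functions return the same match; only the per-call overhead differs.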

Skipping lines that cannot match #

Somewhat less obvious is the fact that we can use Python’s in operator to filter out lines that cannot match:

matches = (search(line) for line in file("o10k.ap")
    if "GET /ongoing/When" in line)

The RE engine does indeed use special code for literal prefixes, but the sublinear substring search algorithm that was introduced in 2.5 is a lot faster in cases like this, so this simple change gives a noticeable speedup; the script now runs in 2.9 seconds.
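The prefilter is only safe because the literal is a prefix of everything the pattern can match, so it can never reject a line the RE would accept. A minimal Python 3 sketch that checks this on a few made-up lines (the sample data is an assumption, not taken from Tim's log):

```python
import re

search = re.compile(
    r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ").search

# Made-up sample lines: one article hit, one image, one unrelated request.
LINES = [
    '"GET /ongoing/When/200x/2007/10/06/Wide-Finder HTTP/1.1" 200 1234\n',
    '"GET /ongoing/When/200x/2007/10/06/picture.png HTTP/1.1" 200 9999\n',
    '"GET /favicon.ico HTTP/1.1" 404 0\n',
]


def pages_regex_only(lines):
    matches = (search(line) for line in lines)
    return [m.group(1) for m in matches if m]


def pages_prefiltered(lines):
    # The cheap substring test runs first; the RE only sees candidate lines.
    matches = (search(line) for line in lines if "GET /ongoing/When" in line)
    return [m.group(1) for m in matches if m]


print(pages_regex_only(LINES) == pages_prefiltered(LINES))
```

The image request passes the substring test but is still rejected by the RE, since the pattern does not allow a dot in the page name.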

Reading files in binary mode (Windows) #

On Windows (and in theory, on other platforms that distinguish between text files and binary files), data read via the standard file object are scanned for Windows-style line endings (“\r\n”). Any such character combination is then translated to a single newline, for consistency.

This is of course very convenient, since it allows you to treat text files in the same way no matter what platform you’re on, but on files this large, the performance penalty is starting to get noticeable.

We can turn this off simply by passing in the “rb” flag (read binary) to the open function.

matches = (search(line) for line in file("o10k.ap", "rb")
    if "GET /ongoing/When" in line)

The file object will still break things up in lines, and our code doesn’t look at the line endings, so we still get the same result. Just a bit quicker.
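If you try this in Python 3, the difference between the two modes is more visible: text mode returns str and, by default, translates "\r\n" to "\n" on every platform, while "rb" returns bytes, so the pattern has to be a bytes pattern as well. A small sketch, using a temporary file with one made-up log line:

```python
import os
import re
import tempfile

TEXT_PAT = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")
BYTES_PAT = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")


def read_text(path):
    with open(path, "r", encoding="ascii") as f:    # newline translation on
        return [(line, TEXT_PAT.search(line).group(1)) for line in f]


def read_binary(path):
    with open(path, "rb") as f:                     # raw bytes, no translation
        return [(line, BYTES_PAT.search(line).group(1)) for line in f]


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.log")
    with open(sample, "wb") as f:
        # One made-up log line with a Windows-style line ending.
        f.write(b'"GET /ongoing/When/200x/2007/10/06/Wide-Finder HTTP/1.1" 200\r\n')
    text_rows = read_text(sample)
    binary_rows = read_binary(sample)

print(repr(text_rows[0][0][-2:]), repr(binary_rows[0][0][-2:]))
```

The line endings differ, but the captured page name is the same in both modes.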

The Code #

Here’s the final version of Santiago’s script:

import re
from collections import defaultdict

FILE = "o1000k.ap"

pat = re.compile(r"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")

search = pat.search

# map
matches = (search(line) for line in file(FILE, "rb") if "GET /ongoing/When" in line)
mapp    = (match.group(1) for match in matches if match)

# reduce
count = defaultdict(int)
for page in mapp:
    count[page] += 1

# print the ten most requested pages, most popular first
for key in sorted(count, key=count.get, reverse=True)[:10]:
    print "%40s = %s" % (key, count[key])

To get a version that’s set up for benchmarking, get the wf-2.py file from this directory:

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

This version of the script finishes in 1.9 seconds. This is a 3.5x speedup over Santiago’s version, and over 250x faster than Tim’s Erlang version. Pretty good for a short single-threaded script, don’t you think?
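For readers on Python 3, where the file builtin and the print statement are gone, the same script might look roughly like this. This is a sketch, not the wf-2.py file mentioned above; the function names count_pages and top_pages are made up here, and collections.Counter stands in for the defaultdict.

```python
import re
from collections import Counter

PAT = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")


def count_pages(lines):
    """Count article fetches in an iterable of bytes lines."""
    search = PAT.search
    # map: prefilter with a substring test, then apply the RE
    matches = (search(line) for line in lines if b"GET /ongoing/When" in line)
    # reduce: Counter tallies the captured page names
    return Counter(m.group(1) for m in matches if m)


def top_pages(path, n=10):
    """Return the n most requested pages in the log file at path."""
    with open(path, "rb") as f:
        return count_pages(f).most_common(n)


# Made-up sample data, so the sketch runs without a log file.
SAMPLE = ([b'"GET /ongoing/When/200x/2007/10/06/A HTTP/1.1" 200\n'] * 3 +
          [b'"GET /ongoing/When/200x/2007/10/05/B HTTP/1.1" 200\n',
           b'"GET /favicon.ico HTTP/1.1" 404\n'])
sample_counts = count_pages(SAMPLE)
print(sample_counts.most_common(2))
```

With a real log file, top_pages("o1000k.ap") would return a list of (page, hits) pairs.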

But I’m running this on a Core Duo machine. Two CPU cores, that is. What about using them both for this task?

A Multi-Threaded Python Solution #

To run multiple subtasks in parallel, we need to split the task up in some way. Since the program reads a single text file, the easiest way to do that is to split the file into multiple pieces on the way in. Here’s a simple function that rushes through the file, splitting it up in 1 megabyte chunks, and returns chunk offsets and sizes:

def getchunks(file, size=1024*1024):
    f = open(file, "rb") # binary mode; tell can misbehave on text files
    while 1:
        start = f.tell()
        f.seek(size, 1)
        s = f.readline()
        yield start, f.tell() - start
        if not s:
            break

By default, this splits the file in megabyte-sized chunks:

>>> for chunk in getchunks("o1000k.ap"):
...     print chunk
(0L, 1048637L)
(1048637L, 1048810L)
(2097447L, 1048793L)
(3146240L, 1048603L)

Note the use of readline to make sure that each chunk ends at a newline character. (Without this, there’s a small chance that we’ll miss some entries here and there. This is probably not much of a problem in practice, but let’s stick to the exact solution for now.)
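The property worth checking is that the chunks cover the whole file, do not overlap, and each end at a line break. Here is a Python 3 sketch of the same idea with that check attached. The iter_chunks and read_chunk names are made up for this example, and unlike getchunks this variant clamps the last chunk to the file size.

```python
import os
import tempfile


def iter_chunks(path, size=1024 * 1024):
    """Yield (offset, length) pairs whose boundaries fall on line ends."""
    total = os.path.getsize(path)
    with open(path, "rb") as f:
        start = 0
        while start < total:
            f.seek(min(start + size, total))
            f.readline()                 # advance to the end of the current line
            end = f.tell()
            yield start, end - start
            start = end


def read_chunk(path, chunk):
    offset, length = chunk
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log")
    with open(path, "wb") as f:
        for i in range(200):             # made-up lines of varying length
            f.write(b"line %d %s\n" % (i, b"x" * (i % 17)))
    chunks = list(iter_chunks(path, size=256))
    pieces = [read_chunk(path, chunk) for chunk in chunks]
    with open(path, "rb") as f:
        whole = f.read()

print(len(chunks), "chunks;", b"".join(pieces) == whole)
```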

So, given a list of chunks, we need something that takes a chunk, and produces a partial result. Here’s a first attempt, where the map and reduce steps are combined into a single loop:

pat = re.compile(...)

def process(file, chunk):
    f = open(file, "rb") # binary mode, so offsets match getchunks
    f.seek(chunk[0])
    d = defaultdict(int)
    search = pat.search
    for line in f.read(chunk[1]).splitlines():
        if "GET /ongoing/When" in line:
            m = search(line)
            if m:
                d[m.group(1)] += 1
    return d

Note that we cannot loop over the file object itself, since we need to stop when we reach the end of the chunk, not the end of the file. The above version solves this by reading the entire chunk, and then splitting it into lines.

To test this code, we can loop over the chunks and feed them to the process function, one by one, and combine the result:

count = defaultdict(int)
for chunk in getchunks(file):
    for key, value in process(file, chunk).items():
        count[key] += value

This version is a bit slower than the non-chunked version on my machine; one pass over the 200 megabyte file takes about 2.6 seconds.

However, since a chunk is guaranteed to contain a full set of lines, we can speed things up a bit more by looking for matches in the chunk itself instead of splitting it into lines:

def process(file, chunk):
    f = open(file, "rb") # binary mode, so offsets match getchunks
    f.seek(chunk[0])
    d = defaultdict(int)
    for page in pat.findall(f.read(chunk[1])):
        d[page] += 1
    return d

With this change, the time drops to 1.8 seconds (3.7x faster than the original version).
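The two versions of process should produce identical counts on well-formed log data, which is easy to confirm on a small sample. One caveat: the character class [^ .] also matches a newline, so the findall version agrees with the line-by-line version only as long as every URL is followed by a space on the same line. A Python 3 sketch with made-up data:

```python
import re
from collections import Counter

PAT = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")


def count_by_line(data):
    counts = Counter()
    for line in data.splitlines():
        if b"GET /ongoing/When" in line:
            m = PAT.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts


def count_by_findall(data):
    # With a single group in the pattern, findall returns the group contents.
    return Counter(PAT.findall(data))


# Made-up chunk: two hits on A, one image (rejected), one hit on B.
DATA = (
    b'"GET /ongoing/When/200x/2007/10/06/A HTTP/1.1" 200\n'
    b'"GET /ongoing/When/200x/2007/10/06/pic.png HTTP/1.1" 200\n'
    b'"GET /ongoing/When/200x/2007/10/06/A HTTP/1.1" 304\n'
    b'"GET /ongoing/When/200x/2007/10/05/B HTTP/1.1" 200\n'
)

print(count_by_line(DATA) == count_by_findall(DATA))
```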

The next step is to set things up so we can do the processing in parallel. First, we’ll call the process function from a standard “worker thread” wrapper:

import threading, Queue

# job queue
queue = Queue.Queue()

# result queue
result = []

class Worker(threading.Thread):
    def run(self):
        while 1:
            args = queue.get()
            if args is None:
                break
            result.append(process(*args))
            queue.task_done()

This uses the standard “worker thread” pattern, with a thread-safe Queue for pending jobs, and a plain list object to collect the results (list.append is an atomic operation in CPython).
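In Python 3 the Queue module is called queue, but the pattern is unchanged. The sketch below wraps it in a hypothetical run_workers helper and uses a toy job (squaring numbers) in place of process, so it can be run on its own:

```python
import queue
import threading


def run_workers(func, jobs, num_threads=4):
    """Apply func to every job using a pool of worker threads."""
    pending = queue.Queue()
    results = []

    def worker():
        while True:
            job = pending.get()
            if job is None:               # sentinel: no more work
                pending.task_done()
                break
            results.append(func(job))     # list.append is atomic in CPython
            pending.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for job in jobs:
        pending.put(job)
    for _ in threads:
        pending.put(None)                 # one sentinel per worker
    pending.join()                        # wait until every item is marked done
    return results


squares = run_workers(lambda n: n * n, range(10))
print(sorted(squares))
```

Results arrive in completion order, not submission order, which is fine here because the partial dictionaries are merged anyway.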

To finish the script, just create a bunch of workers, give them something to do (via the queue), and collect the results into a single dictionary:

for i in range(4):
    w = Worker()
    w.setDaemon(1)
    w.start()

for chunk in getchunks(file):
    queue.put((file, chunk))

queue.join()

count = defaultdict(int)
for item in result:
    for key, value in item.items():
        count[key] += value

With a single thread, this runs in about 1.8 seconds (same as the non-threaded version). When we increase the number of threads, things improve slightly, though not right away:

Two threads: 1.9 seconds
Three: 1.7 seconds
Four to eight: 1.6 seconds

For this specific test, the ideal number appears to be three threads per CPU. With fewer threads, the CPUs will occasionally get stuck waiting for I/O.

Or perhaps they’re waiting for the interpreter itself; Python uses a global interpreter lock to protect the interpreter internals from simultaneous access, so there’s probably some fighting over the interpreter going on as well. To get even more performance out of this, we need to get around the lock in some way.

Luckily, for this kind of problem, the solution is straightforward.

A Multi-Processor Python Solution #

To fully get around the interpreter lock, we need to run each subtask in a separate process. An easy way to do that is to let each worker thread start an associated process, send it a chunk, and read back the result. To make things really simple, and also portable, we’ll use the script itself as the subprocess, and use a special option to enter “subprocess” mode.

Here’s the updated worker thread:

import subprocess, sys

executable = [sys.executable]
if sys.platform == "win32":
    executable.append("-u") # use raw mode on windows

class Worker(threading.Thread):
    def run(self):
        process = subprocess.Popen(
            executable + [sys.argv[0], "--process"],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE
            )
        stdin = process.stdin
        stdout = process.stdout
        while 1:
            cmd = queue.get()
            if cmd is None:
                putobject(stdin, None)
                break
            putobject(stdin, cmd)
            result.append(getobject(stdout))
            queue.task_done()

where the getobject and putobject helpers are defined as:

import marshal, struct

def putobject(file, object):
    data = marshal.dumps(object)
    file.write(struct.pack("I", len(data)))
    file.write(data)
    file.flush()

def getobject(file):
    try:
        n = struct.unpack("I", file.read(4))[0]
    except struct.error:
        return None
    return marshal.loads(file.read(n))
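The helpers implement simple length-prefixed framing: a 4-byte native-order length, followed by the marshalled payload. Since a pipe is just a byte stream, the same code can be exercised against an in-memory buffer. A Python 3 sketch (the sample messages are made up):

```python
import io
import marshal
import struct


def putobject(file, obj):
    data = marshal.dumps(obj)
    file.write(struct.pack("I", len(data)))   # 4-byte length prefix
    file.write(data)
    file.flush()


def getobject(file):
    try:
        n = struct.unpack("I", file.read(4))[0]
    except struct.error:                       # short read: end of stream
        return None
    return marshal.loads(file.read(n))


stream = io.BytesIO()
putobject(stream, ("o1000k.ap", (0, 1048637)))       # a job: file name and chunk
putobject(stream, {"2007/10/06/Wide-Finder": 3})     # a partial result
stream.seek(0)
received = [getobject(stream), getobject(stream), getobject(stream)]
print(received)
```

Note that getobject returns None both for an explicit None message and for end of stream, which is why None works as the shutdown signal in either direction.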

The worker thread runs a copy of the script itself, and passes in the “--process” option. To enter subprocess mode, we need to look for that before we do anything else:

if "--process" in sys.argv:
    stdin = sys.stdin
    stdout = sys.stdout
    while 1:
        args = getobject(stdin)
        if args is None:
            sys.exit(0) # done
        result = process(*args)
        putobject(stdout, result)
else:
    ... create worker threads ...

With this approach, the processing time drops to 1.2 seconds, when using two threads/processes (one per CPU). But that’s about as good as it gets; adding more processes doesn’t really improve things on this machine.
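To see the whole round trip in one place, here is a self-contained Python 3 sketch of the parent/child exchange. It is a simplification under stated assumptions: the child program is passed inline with -c instead of re-running the script with a special option, the job is a toy (summing a list of numbers), and the send, receive, and run_jobs names are made up for this example.

```python
import marshal
import struct
import subprocess
import sys

# Child program: read framed jobs from stdin, write framed results to stdout.
CHILD = r'''
import marshal, struct, sys
inp, out = sys.stdin.buffer, sys.stdout.buffer
while True:
    header = inp.read(4)
    if len(header) < 4:
        break
    job = marshal.loads(inp.read(struct.unpack("I", header)[0]))
    if job is None:
        break
    data = marshal.dumps(sum(job))
    out.write(struct.pack("I", len(data)) + data)
    out.flush()
'''


def send(pipe, obj):
    data = marshal.dumps(obj)
    pipe.write(struct.pack("I", len(data)) + data)
    pipe.flush()


def receive(pipe):
    header = pipe.read(4)
    if len(header) < 4:
        return None
    return marshal.loads(pipe.read(struct.unpack("I", header)[0]))


def run_jobs(jobs):
    results = []
    with subprocess.Popen([sys.executable, "-c", CHILD],
                          stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE) as child:
        for job in jobs:
            send(child.stdin, job)
            results.append(receive(child.stdout))
        send(child.stdin, None)           # tell the child to exit
    return results


totals = run_jobs([[1, 2, 3], [10, 20]])
print(totals)
```

In the article's design, each worker thread owns one such child, so several children can run at the same time while the threads mostly wait on their pipes.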

Memory Mapping #

So, is this the best we can get? Not quite. We can speed up the file access as well, by switching to memory mapping:

import mmap, os

filemap = fileobj = None

def process(file, chunk):
    global filemap, fileobj
    # map the file on first use (or when a different file is requested)
    if filemap is None or fileobj.name != file:
        fileobj = open(file, "rb")
        filemap = mmap.mmap(
            fileobj.fileno(),
            os.path.getsize(file),
            access=mmap.ACCESS_READ
        )
    d = defaultdict(int)
    # search only the chunk's byte range of the mapped file
    for page in pat.findall(filemap, chunk[0], chunk[0]+chunk[1]):
        d[page] += 1
    return d

Note that findall can be applied directly to the mapped region, thanks to Python’s internal memory buffer interface. Also note that the mmap module doesn’t support windowing, so the code needs to map the entire file in each subprocess. This can result in excessive use of virtual memory on some platforms, so running this on your own log files on a shared web server is not necessarily a good idea. (Yes, I’ve tried ;-)

Anyway, this gets the job done in 0.9 seconds, with the original chunk size. But since we’re mapping the entire file anyway in each subprocess, we can increase the chunk size to reduce the process communication overhead. With 50 megabyte chunks, the script runs in just under 0.8 seconds.
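A Python 3 sketch of the same technique, assuming a bytes pattern and a small temporary file in place of the real log. For brevity it maps the file on every call instead of caching the map as the process function above does; the count_region name is made up here.

```python
import mmap
import os
import re
import tempfile
from collections import Counter

PAT = re.compile(rb"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) ")


def count_region(path, offset, length):
    """Count matches that lie within [offset, offset + length) of the file."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pos/endpos then select the chunk.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return Counter(PAT.findall(mapped, offset, offset + length))


# Made-up log lines, for illustration only.
LINE_A = b'"GET /ongoing/When/200x/2007/10/06/A HTTP/1.1" 200\n'
LINE_B = b'"GET /ongoing/When/200x/2007/10/05/B HTTP/1.1" 200\n'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.log")
    with open(path, "wb") as f:
        f.write(LINE_A * 2 + LINE_B)
    first_two = count_region(path, 0, 2 * len(LINE_A))
    last_one = count_region(path, 2 * len(LINE_A), len(LINE_B))

print(first_two, last_one)
```

No data is copied into a Python string before the search; the RE engine reads straight from the mapped pages.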

Summary #

In this article, we took a relatively fast Python implementation and optimized it, using a number of tricks:

Pre-compiled RE patterns
Fast filtering of candidate lines
Chunked reading
Multiple processes
Memory mapping, combined with support for RE operations on mapped buffers

This reduced the time needed to parse 200 megabytes of log data from 6.7 seconds to 0.8 seconds on the test machine. Or in other words, the final version is over 8 times faster than the original Python version, and (potentially) 600 times faster than Tim’s original Erlang version.
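The ratios quoted in this article can be rechecked with a few lines of arithmetic. The sketch below takes "around eight minutes" as 480 seconds and, like the comparison in the text, assumes that Tim's log file is about the same size as the 200 megabyte benchmark file; both are approximations.

```python
ERLANG_SECONDS = 8 * 60          # "around eight minutes of CPU"
LOG_MEGABYTES = 200              # size of the benchmark file used here

TIMINGS = {                      # wall-clock seconds quoted in the article
    "Santiago's script": 6.7,
    "final single-threaded script": 1.9,
    "multi-process with mmap, 50 MB chunks": 0.8,
}

erlang_rate = LOG_MEGABYTES / ERLANG_SECONDS      # megabytes per second

speedups = {
    name: (TIMINGS["Santiago's script"] / seconds, ERLANG_SECONDS / seconds)
    for name, seconds in TIMINGS.items()
}

print("Erlang throughput: %.2f MB/s" % erlang_rate)
for name, (vs_python, vs_erlang) in speedups.items():
    print("%-40s %4.1fx vs Santiago, %4.0fx vs Erlang"
          % (name, vs_python, vs_erlang))
```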

However, it should be noted that the benchmark I’ve been using focuses on processing speed, not disk speed. The code will most likely behave differently on cold caches (and will definitely take longer to run), on machines with different disk systems, and of course also on machines with additional cores.

If you have some time to spare and some interesting hardware to run it on, feel free to grab the code and take it for a ride:

http://svn.effbot.org/public/stuff/sandbox/wide-finder/

(see the README.txt file for details.)

Addenda #

2007-10-07: Stanley Seibert has adapted the code to use the processing library, which provides multiprocess functionality with a lot less (user) code; see Parallel Processing in Python with processing for details.

2007-10-07: Bioinformatics veteran and fellow Python string-type hacker Andrew Dalke points out, via mail, that it’s possible to shave off a few more cycles by extracting all URLs that start with “/ongoing/When/” (which we’re looking for anyway), and then removing bogus URLs during post-processing. Andrew has also written a custom parser based on mxTextTools, which is quite a bit faster than the RE solution. Hopefully, he’ll turn his findings into a blog post, so I can link to his work ;-) See More notes on Wide Finder for the full story (which is more about fast “narrow finding” than “wide finding”, though).

2007-10-07: Bill de hÓra has some code too.

2007-10-07: And Steve Vinoski has tried the code from this article on some big iron: “I ran his wf-6.py on an 8-core 2.33 GHz Intel Xeon Linux box with 8GB of RAM, and it ran best at 5 processes, clocking in at 0.336 sec. Another process-based approach, wf-5.py, executed best with 8 processes, presumably one per core, in 0.358 sec. The multithreaded approach, wf-4.py, ran best with 5 threads, at 1.402 sec (but also got the same result with 19 threads, go figure). Using the same dataset, I get 11.8 sec from my best Erlang effort, which is obviously considerably slower.”

2007-10-08: Paul Boddie provides code and results using a different parallelization library, pprocess.

2007-10-08: Tim Bray summarizes recent developments.

2007-10-12: Updated the article to use binary mode on Windows. This makes the chunk calculations a bit more reliable (tell can misbehave on text files), and speeds things up quite a bit, since the I/O layer no longer needs to convert line endings.

2007-10-31: Tim Bray has tested a bunch of implementations on a multicore Solaris box. As I write this, Python’s in the lead ;-)


Python multiprocessing: sharing a large read-only object between processes?

Do child processes spawned via multiprocessing share objects created earlier in the program?

I have the following setup:

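import glob
import marshal
from multiprocessing import Pool

def do_some_processing(filename):
    for line in file(filename):
        if line.split(',')[0] in big_lookup_object:
            pass  # something here

if __name__ == '__main__':
    # marshal.load expects an open file object, not a file name
    big_lookup_object = marshal.load(open('file.bin', 'rb'))
    pool = Pool(processes=4)
    print pool.map(do_some_processing, glob.glob('*.data'))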

I'm loading some big object into memory, then creating a pool of workers that need to make use of that big object. The big object is accessed read-only, I don't need to pass modifications of it between processes.

My question is: is the big object loaded into shared memory, as it would be if I spawned a process in unix/c, or does each process load its own copy of the big object?

Update: to clarify further - big_lookup_object is a shared lookup object. I don't need to split that up and process it separately. I need to keep a single copy of it. The work that I need to split is reading lots of other large files and looking up the items in those large files against the lookup object.

Further update: database is a fine solution, memcached might be a better solution, and file on disk (shelve or dbm) might be even better. In this question I was particularly interested in an in memory solution. For the final solution I'll be using hadoop, but I wanted to see if I can have a local in-memory version as well.

Question by: Parand

This question originated from: stackoverflow.com

Answer

"Do child processes spawned via multiprocessing share objects created earlier in the program?"

No.

Processes have independent memory space.

Solution 1

To make best use of a large structure with lots of workers, do this.

Write each worker as a "filter" -- reads intermediate results from stdin, does work, writes intermediate results on stdout.

Connect all the workers as a pipeline:

process1 <source | process2 | process3 | ... | processn >result

Each process reads, does work and writes.

This is remarkably efficient since all processes are running concurrently. The writes and reads pass directly through shared buffers between the processes.
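As a rough illustration of what one such filter might look like in Python 3, the sketch below separates the filter logic from the streams it reads and writes. The run_filter helper and the sample lines are made up for this example; in a real pipeline each stage would be its own script calling something like run_filter(sys.stdin, sys.stdout, ...), with the shell connecting the stages.

```python
import io


def run_filter(source, sink, keep):
    """Copy lines from source to sink, keeping those for which keep(line) is true."""
    for line in source:
        if keep(line):
            sink.write(line)


# Two stages chained through in-memory streams, standing in for shell pipes.
source = io.StringIO("GET /a\nPOST /b\nGET /c.png\n")
stage1 = io.StringIO()
run_filter(source, stage1, lambda line: line.startswith("GET"))

stage1.seek(0)
result = io.StringIO()
run_filter(stage1, result, lambda line: ".png" not in line)

print(result.getvalue(), end="")
```

Here the stages run one after the other; with real pipes the operating system runs them concurrently and blocks each stage only when its pipe buffer is full or empty.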

Solution 2

In some cases, you have a more complex structure -- often a "fan-out" structure. In this case you have a parent with multiple children.

Parent opens source data. Parent forks a number of children.
Parent reads source, farms parts of the source out to each concurrently running child.
When parent reaches the end, close the pipe. Child gets end of file and finishes normally.

The child parts are pleasant to write because each child simply reads sys.stdin.

The parent has a little bit of fancy footwork in spawning all the children and retaining the pipes properly, but it's not too bad.
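A minimal Python 3 sketch of the fan-out structure, using subprocess rather than a raw fork. The child program is a made-up one-liner that counts the lines it receives, and the fan_out name and round-robin distribution are assumptions chosen to keep the example short.

```python
import subprocess
import sys

# Child program: count the lines it is given on stdin, print the count.
CHILD = "import sys; print(sum(1 for line in sys.stdin))"


def fan_out(lines, num_children=2):
    """Deal lines out to several children and return what each one reports."""
    children = [
        subprocess.Popen([sys.executable, "-c", CHILD],
                         stdin=subprocess.PIPE,
                         stdout=subprocess.PIPE,
                         text=True)
        for _ in range(num_children)
    ]
    for i, line in enumerate(lines):
        children[i % num_children].stdin.write(line)    # round-robin
    for child in children:
        child.stdin.close()               # child sees end of file and finishes
    counts = []
    for child in children:
        counts.append(int(child.stdout.read()))
        child.stdout.close()
        child.wait()
    return counts


counts = fan_out(["line %d\n" % i for i in range(10)], num_children=3)
print(counts)
```

This only works without deadlock because each child writes its single line of output after it has read all of its input; children that produce output as they go need the fan-in handling described next.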

Fan-in is the opposite structure. A number of independently running processes need to interleave their inputs into a common process. The collector is not as easy to write, since it has to read from many sources.

Reading from many named pipes is often done using the select module to see which pipes have pending input.
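A small Python 3 sketch of a select-based collector. It uses anonymous pipes from os.pipe instead of named pipes so that it runs on its own, and it is Unix-only, since select on Windows accepts sockets but not pipes. The fan_in name and the sample messages are made up.

```python
import os
import select


def fan_in(read_fds):
    """Collect everything written to several pipe read ends, as it arrives."""
    open_fds = set(read_fds)
    collected = {fd: b"" for fd in read_fds}
    while open_fds:
        # Block until at least one pipe has data or has been closed.
        readable, _, _ = select.select(list(open_fds), [], [])
        for fd in readable:
            data = os.read(fd, 4096)
            if data:
                collected[fd] += data
            else:                         # empty read: the writer closed its end
                open_fds.discard(fd)
                os.close(fd)
    return [collected[fd] for fd in read_fds]


# Stand-ins for three children: write one message each, then close.
pipes = [os.pipe() for _ in range(3)]
for n, (_read_end, write_end) in enumerate(pipes):
    os.write(write_end, b"result from child %d\n" % n)
    os.close(write_end)

merged = fan_in([read_end for read_end, _write_end in pipes])
print(merged)
```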

Solution 3

Shared lookup is the definition of a database.

Solution 3A -- load a database. Let the workers process the data in the database.

Solution 3B -- create a very simple server using werkzeug (or similar) to provide WSGI applications that respond to HTTP GET so the workers can query the server.
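As one possible reading of Solution 3A, the sketch below uses sqlite3 from the standard library as the database. The answer does not name a specific database, so that choice, the table layout, and the build_lookup and contains helpers are all assumptions made for illustration (Python 3).

```python
import os
import sqlite3
import tempfile


def build_lookup(path, keys):
    """Create a one-table database holding the lookup keys."""
    con = sqlite3.connect(path)
    with con:                             # commits on success
        con.execute("CREATE TABLE lookup (key TEXT PRIMARY KEY)")
        con.executemany("INSERT INTO lookup VALUES (?)", ((k,) for k in keys))
    con.close()


def contains(con, key):
    """Return True if key is present in the lookup table."""
    row = con.execute("SELECT 1 FROM lookup WHERE key = ?", (key,)).fetchone()
    return row is not None


with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "lookup.db")
    build_lookup(db_path, ["alpha", "beta", "gamma"])
    worker_con = sqlite3.connect(db_path)   # each worker opens its own connection
    hits = [contains(worker_con, key) for key in ("beta", "delta")]
    worker_con.close()

print(hits)
```

Each worker process would open its own connection to the same file; the lookup data then lives in one place, and only the pages a worker touches are read.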

Solution 4

Shared filesystem object. Unix offers shared memory objects. These are just files that are mapped to memory so that swapping I/O is done instead of more conventional buffered reads.

You can do this from a Python context in several ways:

Write a startup program that (1) breaks your original gigantic object into smaller objects, and (2) starts workers, each with a smaller object. The smaller objects could be pickled Python objects to save a tiny bit of file reading time.

Write a startup program that (1) reads your original gigantic object and writes a page-structured, byte-coded file using seek operations to assure that individual sections are easy to find with simple seeks. This is what a database engine does -- break the data into pages, make each page easy to locate via a seek.

Spawn workers with access to this large page-structured file. Each worker can seek to the relevant parts and do their work there.
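A much simplified Python 3 sketch of the seek-based idea: instead of real pages, it stores sorted, fixed-width records, so any record can be located with a single seek and a membership test becomes a binary search over the file. The 16-byte record width, the ASCII-only keys (assumed to fit in one record), and the write_table and lookup names are assumptions for this example.

```python
import os
import tempfile

RECORD = 16   # bytes per record; hypothetical fixed width for this sketch


def write_table(path, keys):
    """Write the keys as sorted, NUL-padded, fixed-width records."""
    with open(path, "wb") as f:
        for key in sorted(keys):
            f.write(key.encode("ascii").ljust(RECORD, b"\0"))


def lookup(f, key):
    """Binary search an open table file for key, one seek per probe."""
    target = key.encode("ascii").ljust(RECORD, b"\0")
    lo, hi = 0, os.fstat(f.fileno()).st_size // RECORD
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD)
        record = f.read(RECORD)
        if record == target:
            return True
        if record < target:
            lo = mid + 1
        else:
            hi = mid
    return False


with tempfile.TemporaryDirectory() as tmp:
    table = os.path.join(tmp, "lookup.tbl")
    write_table(table, ["k%04d" % i for i in range(0, 1000, 2)])
    with open(table, "rb") as f:          # each worker opens the file itself
        found = lookup(f, "k0042")
        missing = lookup(f, "k0043")

print(found, missing)
```

Because every worker only reads the file, the operating system's page cache ends up holding a single copy of the hot parts of the table, whichever process touched them first.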

http://www.doughellmann.com/PyMOTW/multiprocessing/mapreduce.html

Implementing MapReduce with multiprocessing

The Pool class can be used to create a simple single-server MapReduce implementation. Although it does not give the full benefits of distributed processing, it does illustrate how easy it is to break some problems down into distributable units of work.

SimpleMapReduce

In MapReduce, input data is broken down into chunks for processing by different worker instances. Each chunk of input data is mapped to an intermediate state using a simple transformation. The intermediate data is then collected together and partitioned based on a key value so that all of the related values are together. Finally, the partitioned data is reduced to a result set.

import collections
import itertools
import multiprocessing

class SimpleMapReduce(object):
    
    def __init__(self, map_func, reduce_func, num_workers=None):
        """
        map_func

          Function to map inputs to intermediate data. Takes as
          argument one input value and returns a tuple with the key
          and a value to be reduced.
        
        reduce_func

          Function to reduce partitioned version of intermediate data
          to final output. Takes as argument a key as produced by
          map_func and a sequence of the values associated with that
          key.
         
        num_workers

          The number of workers to create in the pool. Defaults to the
          number of CPUs available on the current host.
        """
        self.map_func = map_func
        self.reduce_func = reduce_func
        self.pool = multiprocessing.Pool(num_workers)
    
    def partition(self, mapped_values):
        """Organize the mapped values by their key.
        Returns an unsorted sequence of tuples with a key and a sequence of values.
        """
        partitioned_data = collections.defaultdict(list)
        for key, value in mapped_values:
            partitioned_data[key].append(value)
        return partitioned_data.items()
    
    def __call__(self, inputs, chunksize=1):
        """Process the inputs through the map and reduce functions given.
        
        inputs
          An iterable containing the input data to be processed.
        
        chunksize=1
          The portion of the input data to hand to each worker.  This
          can be used to tune performance during the mapping phase.
        """
        map_responses = self.pool.map(self.map_func, inputs, chunksize=chunksize)
        partitioned_data = self.partition(itertools.chain(*map_responses))
        reduced_values = self.pool.map(self.reduce_func, partitioned_data)
        return reduced_values

Counting Words in Files

The following example script uses SimpleMapReduce to count the “words” in the reStructuredText source for this article, ignoring some of the markup.

import multiprocessing
import string

from multiprocessing_mapreduce import SimpleMapReduce

def file_to_words(filename):
    """Read a file and return a sequence of (word, occurrences) values.
    """
    STOP_WORDS = set([
            'a', 'an', 'and', 'are', 'as', 'be', 'by', 'for', 'if', 'in', 
            'is', 'it', 'of', 'or', 'py', 'rst', 'that', 'the', 'to', 'with',
            ])
    TR = string.maketrans(string.punctuation, ' ' * len(string.punctuation))

    print multiprocessing.current_process().name, 'reading', filename
    output = []

    with open(filename, 'rt') as f:
        for line in f:
            if line.lstrip().startswith('..'): # Skip rst comment lines
                continue
            line = line.translate(TR) # Strip punctuation
            for word in line.split():
                word = word.lower()
                if word.isalpha() and word not in STOP_WORDS:
                    output.append( (word, 1) )
    return output


def count_words(item):
    """Convert the partitioned data for a word to a
    tuple containing the word and the number of occurrences.
    """
    word, occurances = item
    return (word, sum(occurances))


if __name__ == '__main__':
    import operator
    import glob

    input_files = glob.glob('*.rst')
    
    mapper = SimpleMapReduce(file_to_words, count_words)
    word_counts = mapper(input_files)
    word_counts.sort(key=operator.itemgetter(1))
    word_counts.reverse()
    
    print '\nTOP 20 WORDS BY FREQUENCY\n'
    top20 = word_counts[:20]
    longest = max(len(word) for word, count in top20)
    for word, count in top20:
        print '%-*s: %5s' % (longest+1, word, count)

Each input filename is converted to a sequence of (word, 1) pairs by file_to_words. The data is partitioned by SimpleMapReduce.partition() using the word as the key, so the partitioned data consists of a key and a sequence of 1 values representing the number of occurrences of the word. The reduction phase converts that to a pair of (word, count) values by calling count_words for each element of the partitioned data set.

$ python multiprocessing_wordcount.py

PoolWorker-2 reading communication.rst
PoolWorker-2 reading index.rst
PoolWorker-1 reading basics.rst
PoolWorker-1 reading mapreduce.rst

TOP 20 WORDS BY FREQUENCY

process         :    75
multiprocessing :    40
worker          :    35
after           :    30
running         :    29
start           :    28
processes       :    26
python          :    26
literal         :    25
header          :    25
pymotw          :    25
end             :    25
daemon          :    23
now             :    21
consumer        :    19
starting        :    18
exiting         :    16
event           :    15
value           :    14
run             :    13
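To make the data flow between the three phases concrete, here is a single-process Python 3 sketch that pushes a handful of made-up (word, 1) pairs through the partition and reduce steps. It mirrors the logic of SimpleMapReduce.partition and count_words without the worker pool, so the intermediate values can be inspected directly.

```python
import collections


def partition(mapped_values):
    """Group mapped (key, value) pairs by key, as SimpleMapReduce.partition does."""
    partitioned = collections.defaultdict(list)
    for key, value in mapped_values:
        partitioned[key].append(value)
    return list(partitioned.items())


# Output of the map phase for a tiny, made-up input.
mapped = [("process", 1), ("worker", 1), ("process", 1), ("process", 1)]

partitioned = partition(mapped)
reduced = [(word, sum(ones)) for word, ones in partitioned]

print(partitioned)
print(reduced)
```

In the real class, the map and reduce calls are handed to the pool, while the partition step runs in the parent process between them.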
