Playing in the bakery

Inspired by Erik and Bring your own project, I figured out some actually meaningful and fun things to do with Bakery, my Python script to build this site.

I have added two parts: multiple processes and the basics of incremental building or caching.

Multiple processes

The main thing Bakery does is read a long list of Markdown files, convert them to HTML and save the results in an output folder. Plenty of disk access, and thus limited potential gain from running things in parallel (everyone is waiting for the same disk after all). Still, it was a fun thing to do, and by using the multiprocessing pool the list of files became a natural point to split work. Each page can be worked on independently, so the pool maps over the list, handing work off to worker processes as it sees fit and gathering up the results nicely.

The main challenge was figuring out how to define the worker function and how to pass it arguments in a convenient way. The pool takes a function to invoke and a list which contains an argument - work to be done - for each invocation of the function. As my worker expected multiple arguments, I needed to create a list of tuples, one tuple for each page, with my arguments and then unpack that tuple into variables inside the worker function. Nothing difficult, it just needed to be figured out.
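A minimal sketch of that tuple-per-page pattern. The names (bake_page, make_jobs, bake_all) and the three arguments are made up for illustration, and the worker only computes an output path instead of doing real Markdown conversion; the actual Bakery code may look different.

```python
from multiprocessing import Pool


def bake_page(job):
    """Worker: the pool hands it ONE argument, a tuple, which it unpacks."""
    source_path, output_dir, site_title = job
    # Stand-in for the real work (read Markdown, convert, write HTML).
    output_path = output_dir + "/" + source_path.rsplit(".", 1)[0] + ".html"
    return output_path, site_title


def make_jobs(source_paths, output_dir, site_title):
    """Build one argument tuple per page."""
    return [(path, output_dir, site_title) for path in source_paths]


def bake_all(source_paths, output_dir, site_title, processes=2):
    """Map the worker over all jobs; results come back in input order."""
    jobs = make_jobs(source_paths, output_dir, site_title)
    with Pool(processes=processes) as pool:
        return pool.map(bake_page, jobs)
```

Call bake_all(["index.md", "about.md"], "output", "My site") from under an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. In Python 3, Pool.starmap does the tuple unpacking for you, so the worker could take three ordinary parameters instead.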

For the function itself, I ran into problems because it was originally a method on the Bakery object itself, meaning self was used here and there. The pool (or Python, or something in between) serializes the worker to send it to the new processes, so if I wanted to use a method on Bakery I would need to handle serialization and deserialization (pickling, as they like to call it). It turned out to be much easier to just change the worker to be a free function and pass a bakery along as another argument.
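Roughly what that change looks like, as a sketch. The Bakery class here is a hypothetical stand-in with two invented attributes, and it assumes the bakery holds only plain data that pickles without special handling.

```python
import os


class Bakery:
    """Hypothetical stand-in for the real Bakery object; plain data only."""

    def __init__(self, source_dir, output_dir):
        self.source_dir = source_dir
        self.output_dir = output_dir


def bake_page(job):
    """Free function at module level, so the pool can pickle it by name.

    The bakery arrives as part of the job tuple instead of as self.
    """
    bakery, relative_path = job
    stem, _extension = os.path.splitext(relative_path)
    return os.path.join(bakery.output_dir, stem + ".html")


def make_jobs(bakery, relative_paths):
    """Pair the same bakery with every page path."""
    return [(bakery, path) for path in relative_paths]
```

Every job tuple carries the bakery, so it gets pickled along with the work. That is cheap for a small settings-like object, but worth keeping in mind if the bakery ever grows large.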

Adjustments done, I suddenly had my first multi-process Python script running along happily! Speed gains were limited as expected, but there were some. Plus, it is always satisfying to run measurements and see something you wrote using multiple CPU cores at once.

Incremental builds

On to the actual serious speed improvement. Once a run is finished, I write an extra file containing just the timestamp of the run. Next time, I only write files which have changed since then, instead of clearing out the results and building everything from scratch. I still do the reading and Markdown conversion for every page though, because I need to update the index page and RSS feed and have not yet thought of an elegant way to avoid it. The speed increase was even higher before I realized I needed to do that …
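The timestamp idea can be sketched like this. The function names and the stamp file format (a single float written as text) are assumptions for illustration, not necessarily what Bakery does.

```python
import os
import time


def read_last_run(stamp_path):
    """Return the previous run's timestamp, or None if there is no usable stamp."""
    try:
        with open(stamp_path) as stamp_file:
            return float(stamp_file.read().strip())
    except (FileNotFoundError, ValueError):
        return None


def write_last_run(stamp_path, when=None):
    """Record the time of this run (defaults to now) in the stamp file."""
    with open(stamp_path, "w") as stamp_file:
        stamp_file.write(repr(time.time() if when is None else when))


def changed_since(path, last_run):
    """True if there was no previous run or the file was modified after it."""
    return last_run is None or os.path.getmtime(path) > last_run
```

A build would call read_last_run at the start, write only the pages for which changed_since is true, and call write_last_run at the very end. Deleting the stamp file then forces a full rebuild.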

Another thing I can improve is that I will still need to do a full rebuild - and remember to do so by removing the timestamp file - if I add or change any non-Markdown files in the site. Images or CSS for example. This is because I currently never check those for modification. When I have no timestamp file, I clear the output folder and copy all files over. When I have the timestamp, I skip that step completely. I should of course check timestamps there too, I just have not got to it yet.
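One way that missing check could look, as a sketch only. Nothing like this exists in Bakery yet, and the function name, the ".md" test and the folder layout are all assumptions.

```python
import os
import shutil


def copy_changed_assets(source_dir, output_dir, last_run):
    """Copy non-Markdown files modified after last_run; return what was copied.

    With last_run set to None (no timestamp file), everything is copied.
    """
    copied = []
    for folder, _subfolders, filenames in os.walk(source_dir):
        for filename in filenames:
            if filename.endswith(".md"):
                continue  # pages are handled by the Markdown step
            source_path = os.path.join(folder, filename)
            if last_run is not None and os.path.getmtime(source_path) <= last_run:
                continue  # unchanged since the previous run
            relative_path = os.path.relpath(source_path, source_dir)
            target_path = os.path.join(output_dir, relative_path)
            os.makedirs(os.path.dirname(target_path), exist_ok=True)
            shutil.copy2(source_path, target_path)
            copied.append(relative_path)
    return sorted(copied)
```

This only handles added and changed files. A file deleted from the source would still linger in the output folder, so an occasional full rebuild remains useful.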

This is fun!

This is the first meaningful weekend coding I have done in a while. I like to think it is another small step in improving my ability to find fun little projects (I still think my main challenge is spotting opportunities for little projects). The code is on GitHub, as is customary.