Comfort food was clearly the order of the day this weekend. Comfort food in the form - once again - of the very first few episodes of Cortex. I am probably close to having listened to the whole run of Cortex twice at this point, and the first episodes at least once and possibly twice on top of that. What is the appeal? I think it has something to do with the topics being quite timeless. People may change how they work, but hearing how they worked at a certain time, in a certain context, is still just as interesting. News of any kind rarely interrupts, either. It is just the problems with email, todo lists, notifications, app icon placements … the classics.
Inspired by all of the above, I of course ended up taking yet another long, hard look at my phone's home screen, and another round of serious thought about what I want there and what I want to make easy for myself to do. Then, I somewhat surprised myself by removing both Ice Cubes (my current Mastodon app of choice) and Safari from my home screen. Both are apps I open almost reflexively, so getting them even just a bit more out of sight and mind feels helpful and somewhat relaxing.
I ended up not looking at Mastodon all morning, instead of catching up almost as soon as I woke up. I do not imagine this will last forever, or even for very long, but I will be happy about needing just a tiny bit more intention to dive into posts and websites for as long as it lasts.
What about whispers?
Glad you asked! I spent a bit of time setting up whisper.cpp on my M2 MacBook Air. Whisper is a speech recognition model from OpenAI, and whisper.cpp is a C/C++ implementation of it.
What does that mean, then? You can transcribe audio on your own machine. With great performance! Not only is compiled C code fast; whisper.cpp also takes advantage of Apple's custom chips for machine learning-type tasks.
For a project built with make, getting whisper.cpp compiled and running was pretty much as straightforward as it could be. The only snag I hit was down to me not following the instructions closely enough. Soon, I was watching it chew through a recent podcast episode, spitting out a surprisingly good interpretation of what was said.
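For reference, the whole dance looked roughly like this for me. Treat it as a sketch: the file names are made up, and the scripts and flags are as I remember them from the version I built, so they may have changed since.

```
# Clone and build (plain make at the time; newer versions may differ)
git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp
make

# Fetch a model - "base" here; bigger models are slower but more accurate
bash ./models/download-ggml-model.sh base

# whisper.cpp wants 16 kHz mono WAV input, so convert the episode first
# (episode.mp3 is a placeholder name)
ffmpeg -i episode.mp3 -ar 16000 -ac 1 -c:a pcm_s16le episode.wav

# Transcribe
./main -m models/ggml-base.bin -f episode.wav
```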
No less impressive since the recording was in Swedish. At first, the output was in - also surprisingly good - English, which confused me greatly. Apparently, the model translates to English if no language has been specified. Which is cool and all, but quite confusing when all the settings point to translation being disabled by default.
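As far as I can tell, the surprise comes from the language option defaulting to English, rather than from translation actually being switched on. Being explicit about the language fixes it - a sketch, with the same made-up file names as above:

```
# Tell the model the audio is Swedish, so it transcribes rather than translates
./main -m models/ggml-base.bin -f episode.wav -l sv

# Or ask it to detect the language itself
./main -m models/ggml-base.bin -f episode.wav -l auto

# The separate --translate flag really is off by default, for what it is worth
```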
Running the model also brought with it the geeky joy of the computer's hardware getting to stretch its legs properly for a change. All four performance cores were pretty much working at top speed, a couple of gigabytes of memory were in use, and of course the machine was as dead silent and responsive as usual.
Speed was better than I had expected as well. My 54 minutes of audio were processed in somewhere between 14 and 18 minutes - so 3x to almost 4x faster than real time. Imagine the speed I could get with a few more cores …
What will you use this for?
My first idea is to find a summarization service, feed it a transcript and see if that can provide a solid base for episode information. I often wonder as I write the episode information if I am missing something interesting, and whether there is a nicer way of tying it all together or noticing topics that I am just not seeing at the moment. Having a machine draft would definitely not hurt my creativity.
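I have not picked a service yet, so this is purely a placeholder sketch - the summarizer URL is entirely made up - but the shape of the pipeline would be something like:

```
# Write the transcript to a plain text file (-otxt appends .txt to the input name)
./main -m models/ggml-base.bin -f episode.wav -otxt

# Send it to some summarization service - this endpoint is invented for illustration
curl -X POST https://summarizer.example/v1/summarize \
     --data-binary @episode.wav.txt
```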
Apart from that, it would of course be plain nice to offer transcripts for accessibility and searchability as well.
But first things first: building C code which can use all the power of your computer is just plain cool.
… or perhaps I should say hot. The cores ran at a stable 80 degrees Celsius, after all.