Posts for the month of October 2009

Kmail + SpamBayes

I've been meaning to do something about spam filtering on my email, especially the email from this domain. I recently stumbled upon a menu entry in Kmail I hadn't noticed before: 'Tools -> Anti-Spam Wizard...'. If you have SpamBayes installed (yum install spambayes), it is listed as an option for setting up spam filtering. Follow the prompts through the wizard, and click 'Finish' when done.

But now what? Nothing seemed to change; no 'mark as ham' or 'mark as spam' options suddenly appeared in the context menu.

And thus it sat, unused, and therefore... useless.

Today I started looking a bit more closely at the filters that the wizard created. There were two that stood out: 'Classify as Spam' and 'Classify as NOT Spam'. These two are not applied to incoming mail, but are added to the 'Apply Filter' context menu. And apparently that is how you tell SpamBayes what is spam and what is ham.

So I went to my spam folder, selected today's spam, right-clicked, and chose 'Apply Filter -> Classify as Spam'. That fed the messages to SpamBayes and moved them into the spam folder. Then I selected a chunk of my read messages (ham), right-clicked, and chose 'Apply Filter -> Classify as NOT Spam'; that trained SpamBayes on them and left them where they were.
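
For the curious: as far as I can tell, the wizard's filters just pipe each selected message through SpamBayes' command-line filter. If you wanted to bulk-train from outside Kmail, something along these lines should work. This is only a rough sketch; it assumes sb_filter.py accepts -s (train as spam) and -g (train as ham) for a message on stdin, and the maildir paths are placeholders.

    # Rough sketch: bulk-train SpamBayes by piping messages to
    # sb_filter.py. Assumes -s trains a message as spam and -g as
    # ham; the maildir paths below are placeholders.
    import subprocess
    from pathlib import Path

    def train(maildir, as_spam):
        flag = '-s' if as_spam else '-g'
        for msg in sorted(Path(maildir).iterdir()):
            if msg.is_file():
                with open(msg, 'rb') as fh:
                    subprocess.run(['sb_filter.py', flag], stdin=fh, check=True)

    train('/home/me/Mail/spam/cur', as_spam=True)
    train('/home/me/Mail/inbox/cur', as_spam=False)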

I checked my email, and like magic, the incoming spam wound up in the spam folder without my intervention.

Moral of the story? The Anti-Spam Wizard in Kmail needs to include some basic 'getting started' instructions, or a link to some help on the topic. None of this was obvious to me until I went digging through the filters. It makes sense in hindsight, but it wasn't intuitive.

But now I know. And so do you.

Edit: And now I see a pair of new buttons in the toolbar. I don't think they were there before; I noticed them after a restart. So, after setting up SpamBayes, restart Kmail, and you should see 'spam' and 'ham' toolbar buttons beside the 'trash' button.

LPub4 for Linux, 2nd iteration

LPub4 is a program by Kevin Clague for creating high-quality instructions for Lego models. It runs on OS X and Windows; I ported it to Linux a while ago, and I've since done some more work on it.

Two general fixes:

And then the Linux porting patches:

I'm also working on a couple of new features for LPub as time permits. More on those when they're ready.

The Joys of Harddrive Failures

Jamie's Mac, an old G4 tower, decided it was time to byte the dust, loudly, and refuse to boot past a grey screen. Backups of that machine were two months old, so recovering the data was important.

So first things first: get Linux running from CD on the Mac.

I downloaded the first CD of Fedora 11 for PPC. The eject button on the keyboard didn't do any good, but I managed to find the hardware eject button behind the face of the case, push it with a screwdriver, and load the CD into the machine. I held down 'C' while powering the machine back on, was greeted by the CD's bootloader, and typed 'linux rescue'. From there I followed the prompts to that comfortable root shell, with networking up and running and nothing mounted.

Now to get the data.

Trusty dd grabbed the first 19M of the 80G drive before erroring out. A friend pointed me to ddrescue. So I pulled the drive, slapped it into my USB enclosure, and tried it from my Fedora 11 laptop. The problem was that Fedora wanted to access parts of the drive before I explicitly told it to, and ddrescue would just hang in D+ state. Unplugging the drive and plugging it back in while ddrescue was running allocated a different device node, so ddrescue couldn't see the drive once it was reinserted.

I could have tried to get ddrescue built for PPC and connected my large USB drive to the Mac directly, but I didn't want to connect that drive to a failing machine; I wanted it insulated from any problems on the dying Mac by a network. I suppose I could have set up sshfs or something, but since I was running from a CD in rescue mode, I figured that would get painful quickly.

Besides, it's so much more fun to reinvent the wheel.

The Fedora 11 CD's rescue mode has Python installed. So I wrote a tool in Python (salvage_data.py) that reads data from a given device, prioritizing good data over problem areas, and writes it as a log to stdout, which I then piped over ssh to another machine with sufficient storage.

The basic idea is to start at the beginning of the drive and read data until hitting a problem (either an IOError or a short read), then split the remaining section of the drive in half, store a note to take care of the first half later, and repeat the process on the second half. Once a section completes, the tool grabs one of the remaining sections and repeats the process on it.
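
The heart of it looks something like the sketch below. This isn't the actual salvage_data.py, just a simplified reconstruction of the bisection idea; the 64kB read size, the exact queue ordering, and the paths in the usage line afterwards are all stand-ins.

    # Simplified sketch of the salvage approach, not the original
    # tool: read sequentially until a read fails, then bisect the
    # remainder, deferring the half that contains the bad spot.
    import os
    import sys

    BLOCK = 65536  # read granularity for this sketch

    def salvage(device, out):
        fd = os.open(device, os.O_RDONLY)
        size = os.lseek(fd, 0, os.SEEK_END)
        todo = [(0, size)]                      # sections left to examine
        while todo:
            pos, length = todo.pop()
            end = pos + length
            while pos < end:
                want = min(BLOCK, end - pos)
                try:
                    os.lseek(fd, pos, os.SEEK_SET)
                    data = os.read(fd, want)
                except OSError:
                    data = b''
                if data:
                    out.write(b'D %d %d\n' % (pos, len(data)))
                    out.write(data + b'\n')
                    pos += len(data)
                if len(data) == want:
                    continue                    # clean read; keep going
                # An IOError or short read: we've hit trouble.
                left = end - pos
                if left <= 1:
                    out.write(b'E %d\n' % pos)  # one unreadable byte
                    pos += 1
                else:
                    half = left // 2
                    todo.append((pos, half))                # bad half: later
                    todo.append((pos + half, left - half))  # good half: next
                    break
        os.close(fd)

    if __name__ == '__main__':
        salvage(sys.argv[1], sys.stdout.buffer)

Something like 'python salvage.py /dev/sdb | ssh otherhost "cat > mac.log"' then keeps the log safely off the dying machine.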

This quickly gets the bulk of the good data off the drive while staying away from the bad sections. The queuing strategy the code uses isn't perfect, though: it heads back to the beginning of the drive, where the damage is, only to discover that it needs to split the next section again. A better approach would be to pre-emptively split the first half of a section recursively, so that the tool would work backwards through the drive. It also doesn't limit sections to the harddrive's blocksize or boundaries, so as it approaches the end it winds up tracking individual unknown bytes. But I had already reached the wee hours of the morning, and decided the additional complexity was more than I was willing to attempt at that time.

The format of the data log that salvage_data.py generates takes some cues from the Subversion dump format. Each record starts a line with 'D' for data or 'E' for error, followed by a decimal offset and length. (In the 'E' case, the length is assumed to be 1 if not specified.) The data itself starts on the next line and has a newline appended. This yields a log format that is self-describing enough for a human to reverse-engineer easily, something I think is important for file formats. The log files can be replayed with a second tool I wrote (recover_log.py) to create an image of the drive. That tool can also write a new log file with ordered and coalesced sections, which can save a lot of space when a large number of bad bytes have each been recorded individually.
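
A couple of records make the shape of it obvious: a line like 'D 0 65536' is followed by 65536 bytes of raw data and a newline, while 'E 1048576' marks a single unreadable byte (the offsets here are made up). Replaying is correspondingly simple. Here is a minimal sketch of the idea (not the actual recover_log.py); it writes a sparse image and leaves the 'E' regions as zero-filled holes:

    # Minimal sketch of a log replayer (not the actual
    # recover_log.py): apply 'D offset length' records to an image
    # file; 'E' records carry no payload, so unreadable bytes are
    # simply left as zeros in the sparse image.
    import sys

    def replay(log_path, image_path):
        with open(log_path, 'rb') as log, open(image_path, 'wb') as img:
            while True:
                header = log.readline()
                if not header:
                    break
                fields = header.split()
                if fields and fields[0] == b'D':
                    offset, length = int(fields[1]), int(fields[2])
                    data = log.read(length + 1)[:length]  # drop trailing newline
                    img.seek(offset)
                    img.write(data)

    if __name__ == '__main__':
        replay(sys.argv[1], sys.argv[2])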

The challenge of getting useful data off of a corrupt image is left as an exercise for the reader. In my case, the bad 20kB out of 80GB appears to have left a corrupt catalog file, which prevents every tool I tried from understanding the filesystem. Hmmmm... I seem to hear the siren song of Technical Note TN1150: HFS Plus Volume Format calling to me over the terrified cries of my "free time".