The Joys of Harddrive Failures

Jamie's Mac, an old G4 tower, decided it was time to byte the dust, loudly, and refuse to boot past a grey screen. Backups of that machine are 2 months old, so recovering the data is important.

So first things first: get Linux running from CD on the Mac.

I downloaded the first CD of Fedora 11 for PPC. The eject button on the keyboard didn't do any good, but I managed to find the hardware eject button behind the face of the case, push it with a screwdriver and load the CD into the machine. I held down 'C' while powering it back on, was greeted with the bootloader from the CD, and typed 'linux rescue'. Followed the prompts to that comfortable root shell with networking up and running and nothing mounted.

Now to get the data.

Trusty dd grabbed the first 19M of the 80G drive before erroring out. A friend pointed me to ddrescue. So I pulled the drive, slapped it into my USB enclosure, and tried it from my Fedora 11 laptop. The problem was that Fedora seemed to want to access parts of the drive before I explicitly told it to, and ddrescue would just hang in the D+ state. Unplugging the drive and plugging it back in while ddrescue was running would get it assigned a different device node, so ddrescue could not see the device once it was reinserted.

I could have tried to get ddrescue built for PPC and connected my large USB drive directly to the Mac, but I didn't want to connect that drive to a failing machine; I wanted it insulated from any problems on the dying Mac by a network. I suppose I could have set up sshfs or something, but since I was running from a CD in rescue mode, I figured that was going to get painful quickly.

Besides, it's so much more fun to reinvent the wheel.

The Fedora 11 CD rescue mode has Python installed. So I wrote a tool (salvage_data.py) in Python that reads data from a given device, prioritizing good data over problem areas. It writes that data as a log to stdout, which I then piped over ssh to another machine with sufficient storage.

The basic idea is to start at the beginning of the drive and read data until hitting a problem (either an IOError or a short read). At that point, split the remaining section of the drive in half, store a note to take care of the first half later, and repeat the process on the second half. Once a section is completed, grab one of the remaining sections and repeat the process there.
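The core loop can be sketched like this. This is a reconstruction from the description above, not the actual salvage_data.py; the chunk size, the `salvage` name, and the exact record writing are my own guesses:

```python
CHUNK = 64 * 1024  # read size for good regions; the original tool's value isn't stated

def salvage(dev, size, out):
    """Walk an open block device, writing 'D offset length\\n<data>\\n' for
    readable spans and 'E offset\\n' for bad bytes, bisecting on errors."""
    sections = [(0, size)]                 # (start, length) pairs still to examine
    while sections:
        start, length = sections.pop()     # grab a remaining section
        while length > 0:
            want = min(CHUNK, length)
            dev.seek(start)
            try:
                data = dev.read(want)      # may be short, or raise, near damage
            except IOError:
                data = b''
            if data:                       # log whatever came back cleanly
                out.write(b'D %d %d\n' % (start, len(data)))
                out.write(data + b'\n')
                start += len(data)
                length -= len(data)
            elif length == 1:              # narrowed down to a single bad byte
                out.write(b'E %d\n' % start)
                length = 0
            else:                          # trouble: note the first half for
                half = length // 2         # later, keep working the second half
                sections.append((start, half))
                start, length = start + half, length - half
```

Fed the raw device and with stdout piped over ssh, something shaped like this streams the log to whatever machine has room to hold it.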

This quickly gets the bulk of the good data off the drive while trying to stay away from the bad sections. The queuing strategy this code uses isn't perfect, though; it will head back to the beginning of the drive, where the damage is, just to figure out that it needs to split the next section. A better approach would be to recursively split the first half of a section pre-emptively, so that it would work backwards through the drive. It also does not limit sections to the hard drive's block size or block boundaries, so as it approaches the end it's tracking individual unknown bytes on the drive. But I had already reached the wee hours of the morning, and decided the additional complexity was more than I was willing to attempt at that time.

The format of the data log that salvage_data.py generates takes some cues from the Subversion dump format. Each record starts a line with 'D' for data or 'E' for error, followed by a decimal offset and length. (In the 'E' case, the length is assumed to be 1 if not specified.) For a 'D' record, the data starts on the next line and has a newline appended. This yields a log file format that is self-describing enough for a human to reverse-engineer easily, something I think is important for file formats. The log files can be replayed with a second tool I wrote (recover_log.py) to create an image of the drive. That tool can also write a new log file with the sections ordered and coalesced, which can save a lot of space when you have a large number of bad bytes each recorded individually.
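The replay side of that format is straightforward. Here's a sketch of the idea (my reconstruction, not the actual recover_log.py, and it omits the ordering and coalescing feature):

```python
def replay(log, image):
    """Replay a salvage log into an image file: a 'D offset length' header
    is followed by length bytes of data plus a newline; 'E offset [length]'
    records mark bad bytes, which are simply left as holes in the image."""
    while True:
        header = log.readline()        # headers never contain raw data,
        if not header:                 # so a line-based read is safe here
            break
        fields = header.split()
        if fields[0] == b'D':
            offset, length = int(fields[1]), int(fields[2])
            data = log.read(length)    # data is read by length, since it
            log.read(1)                # may itself contain newlines
            image.seek(offset)
            image.write(data)
        # 'E' records need no action: unwritten spots stay zero-filled
```

Reading the data payload by its declared length, rather than line by line, is what makes embedded newlines in the salvaged data harmless.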

The challenge of getting useful data off of a corrupt image is left as an exercise for the reader. In my case, the bad 20kB of the 80GB appears to have left a corrupt catalog file, which is preventing every tool I've tried from understanding the filesystem. Hmmmm.... I seem to hear the siren song of Technical Note TN1150: HFS Plus Volume Format calling to me over the terrified cries of my "free time".
