Posts for the month of December 2024

Generality in solutions; an example in HttpFile

A few (ok, ok, over a dozen) years ago, I came across a question on Stack Overflow from someone who wanted to unzip part of a ZIP file hosted on the web without having to download the entire file.

He had not found a Python library for doing this, so he modified the ZipFile library to create an HTTPZipFile class that knew about both ZIP files and HTTP requests. I posted a different approach. Since then, Stack Overflow has changed its goals, and that question and its answers have been closed as off-topic. I believe there is still value in the question and answer, and I think a fuller treatment of the answer would be fruitful for others to learn from.

Seams and Layers

The idea is to think about the interfaces: the seams or layers in the code.

The ZipFile class expects a file object. Our file lives on a website, but we can create a file-like object that knows how to retrieve parts of the file over HTTP using Range GET requests, and that behaves like a file in that it is seekable.

Let's walk through this pedagogically:

We want to create a file-like object whose constructor takes a URL.

So let's start with our demo script:

#!/usr/bin/env python3
from zipfile import ZipFile

from httpfile import HttpFile

URL = "https://www.python.org/ftp/python/3.12.0/python-3.12.0-embed-amd64.zip"

# Try it
my_zip = ZipFile(HttpFile(URL))
print("\n".join(my_zip.namelist()))

And create httpfile.py with just the constructor as a starting point:

#!/usr/bin/env python3
class HttpFile:
    def __init__(self, url):
        self.url = url

Trying that, we get:

AttributeError: 'HttpFile' object has no attribute 'seek'

So let's implement seek:

#!/usr/bin/env python3
import requests

class HttpFile:
    def __init__(self, url):
        self.url = url
        self.offset = 0
        self._size = -1

    def size(self):
        if self._size < 0:
            response = requests.get(self.url, stream=True)
            response.raise_for_status()
            if response.status_code != 200:
                raise OSError(f"Bad response of {response.status_code}")
            self._size = int(response.headers["Content-Length"], 10)
            # stream=True means we never read the body; close the connection.
            response.close()
        return self._size

    def seek(self, offset, whence=0):
        if whence == 0:
            self.offset = offset
        elif whence == 1:
            self.offset += offset
        elif whence == 2:
            self.offset = self.size() + offset
        else:
            raise ValueError(f"whence value {whence} unsupported")
        return self.offset

That gets us to the next error:

AttributeError: 'HttpFile' object has no attribute 'tell'

So we implement tell():

    def tell(self):
        return self.offset

Making progress, we reach the next error:

AttributeError: 'HttpFile' object has no attribute 'read'

So we implement read:

    def read(self, count=-1):
        # A Range of "bytes=X-(X-1)" is invalid, so handle the empty cases first.
        if count == 0 or self.offset >= self.size():
            return b""
        if count < 0:
            end = self.size() - 1
        else:
            end = self.offset + count - 1
        headers = {
            'Range': f"bytes={self.offset}-{end}",
        }
        response = requests.get(self.url, headers=headers, stream=True)
        if response.status_code != 206:
            raise OSError(f"Bad response of {response.status_code}")
        # The response headers tell us whether the server honored the range
        # request.  When it does, we get something like:
        #   Accept-Ranges: bytes
        #   Content-Length: 22
        #   Content-Range: bytes 27382-27403/27404
        # When it does not, there is no Content-Range header, and
        # Content-Length is the size of the whole file:
        #   Content-Length: 27404
        content_range = response.headers.get('Content-Range')
        if not content_range:
            raise OSError("Server does not support Range requests")
        if content_range != f"bytes {self.offset}-{end}/{self.size()}":
            raise OSError(f"Server returned unexpected range {content_range}")
        # End of paranoia checks
        chunk = len(response.content)
        if count >= 0 and chunk != count:
            raise OSError(f"Asked for {count} bytes but got {chunk}")
        self.offset += chunk
        return response.content

We have a lot going on here; particularly around handling error checking and ensuring the responses match what we expect. We want to fail loudly if we get something unexpected rather than attempt to forge ahead and fail in an obscure way later on.

And now we finally reach some success, giving a listing of the filenames within the ZIP:

python.exe
pythonw.exe
python312.dll
python3.dll
vcruntime140.dll
vcruntime140_1.dll
LICENSE.txt
pyexpat.pyd
select.pyd
unicodedata.pyd
winsound.pyd
_asyncio.pyd
_bz2.pyd
_ctypes.pyd
_decimal.pyd
_elementtree.pyd
_hashlib.pyd
_lzma.pyd
_msi.pyd
_multiprocessing.pyd
_overlapped.pyd
_queue.pyd
_socket.pyd
_sqlite3.pyd
_ssl.pyd
_uuid.pyd
_wmi.pyd
_zoneinfo.pyd
libcrypto-3.dll
libffi-8.dll
libssl-3.dll
sqlite3.dll
python312.zip
python312._pth
python.cat

So let's see if we can extract part of the LICENSE.txt file from within the zip:

data = my_zip.open('LICENSE.txt')
data.seek(99)
print(data.read(239).decode('utf-8'))

That triggers a new error (one which, as a commenter pointed out eight years after the original answer was posted, appears as of Python 3.7):

AttributeError: 'HttpFile' object has no attribute 'seekable'

So a trivial implementation of that:

    def seekable(self):
        return True

and we now get the content:

Guido van Rossum at Stichting
Mathematisch Centrum (CWI, see https://www.cwi.nl) in the Netherlands
as a successor of a language called ABC.  Guido remains Python's
principal author, although it includes many contributions from others.

There are a number of ways to further improve this code for production use, but for our pedagogical purposes here, I think we can call that "good enough". (Areas of improvement from an engineering perspective include: actual unit tests, integration tests that do not rely on a remote server, filling out the file object's full interface, addressing the read-only nature of the file access, using a session to support authentication mechanisms and connection reuse, among others.)
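One of those improvements, filling out the full file-object interface, deserves a quick sketch. Rather than hand-writing every method, a file-like class can subclass io.RawIOBase, which supplies read(), readall(), line iteration, and context-manager support once a few primitives exist. Here is a toy illustration over an in-memory byte string; the RangeReader name and its sample data are mine, not part of the original answer, but the same structure would apply to HttpFile:

```python
import io


class RangeReader(io.RawIOBase):
    """Toy seekable reader over an in-memory byte string.

    Subclassing io.RawIOBase means we only supply readinto(), seek(),
    tell(), and the capability flags; read(), readall(), and
    context-manager support come for free.
    """
    def __init__(self, data):
        self._data = data
        self._offset = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._offset

    def seek(self, offset, whence=0):
        if whence == 0:
            self._offset = offset
        elif whence == 1:
            self._offset += offset
        elif whence == 2:
            self._offset = len(self._data) + offset
        else:
            raise ValueError(f"whence value {whence} unsupported")
        return self._offset

    def readinto(self, buffer):
        # Copy as many bytes as fit into the caller-supplied buffer.
        chunk = self._data[self._offset:self._offset + len(buffer)]
        buffer[:len(chunk)] = chunk
        self._offset += len(chunk)
        return len(chunk)


reader = RangeReader(b"hello world")
reader.seek(6)
print(reader.read())   # b'world' -- read() itself comes from RawIOBase
```

An HttpFile built this way would implement readinto() with the Range GET logic and inherit the rest of the interface.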

This gets us an object that acts like a local file even though it reaches over the network, and the implementation requires less code than a modified HTTPZipFile would.

This same interface of a file-like object can be used for other purposes as well.

A Second Application Of The Pattern

Let's continue with our motivating use case of accessing parts of remote zip files where we don't want to download the entire file. If we don't want to download the entire file, then surely we would not want to download part of the file multiple times, right? So we would like HttpFile to cache data. But then we wind up mixing caching into the HTTP logic. Instead, we can again use the file-like-object interface to add a caching layer for a file-like object.

So we will need a class that takes a file object and a location to save the cached data. To keep this simple, let's say we point to a directory where the cache for this one file object will be stored. We will want to be able to store the file's total size, every chunk of data, and where each chunk of data maps into the file. So let's say the directory can contain a file named size with the file's size as a base 10 string with a newline, and any number of data.<offset> files with a chunk of data. This makes it easy for a human to understand how the data on disk works. I would not call it exactly "self describing", but it does lean in that general direction. (There are many, many ways we could store the data in the cache directory. Each one has its own set of trade-offs. Here I'm aiming for ease of implementation and obviousness.)
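Before writing the class, it may help to see the layout concretely. The sketch below builds a tiny cache directory by hand following the convention just described; the sizes and byte values are invented for illustration (the numbers echo the earlier Content-Range example):

```python
import os
import tempfile

# Build a sample cache directory by hand: a "size" file plus two
# "data.<offset>" chunk files.
cache_dir = tempfile.mkdtemp()
with open(os.path.join(cache_dir, "size"), "w", encoding="utf-8") as f:
    f.write("27404\n")
with open(os.path.join(cache_dir, "data.0"), "wb") as f:
    f.write(b"PK\x03\x04")          # a ZIP file's first four bytes
with open(os.path.join(cache_dir, "data.27382"), "wb") as f:
    f.write(b"\x00" * 22)           # 22 bytes cached near the end

# Recover (offset, length) pairs the same way the constructor will:
segments = sorted(
    (int(name[len("data."):], 10),
     os.stat(os.path.join(cache_dir, name)).st_size)
    for name in os.listdir(cache_dir)
    if name.startswith("data."))
print(segments)   # [(0, 4), (27382, 22)]
```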

Since the file's data will be stored in segments, we will want to be able to think in terms of segments which can be ordered, check if two segments overlap, or if one segment contains another. So let's create a class to provide that abstraction:

import functools
import os


@functools.total_ordering
class Segment:
    def __init__(self, offset, length):
        self.offset = offset
        self.length = length

    def overlaps(self, other):
        return (
            self.offset < other.offset + other.length and
            other.offset < self.offset + self.length
        )

    def contains(self, offset):
        return self.offset <= offset < (self.offset + self.length)

    def __lt__(self, other):
        return (self.offset, self.length) < (other.offset, other.length)

    def __eq__(self, other):
        return self.offset == other.offset and self.length == other.length
Using that class, we can create a constructor that loads the metadata from the cache directory:

class CachingFile:
    def __init__(self, fileobj, backingstore):
        """fileobj is a file-like object to cache.  backingstore is a directory name.
        """
        self.fileobj = fileobj
        self.backingstore = backingstore
        self.offset = 0
        os.makedirs(backingstore, exist_ok=True)
        try:
            with open(os.path.join(backingstore, 'size'), 'r', encoding='utf-8') as size_file:
                self._size = int(size_file.read().strip(), 10)
        except Exception:
            self._size = -1

        # Gather names and sizes for any pre-existing data, so that
        # self.available_segments is a sorted list of Segments.
        self.available_segments = sorted(
            Segment(int(filename[len("data."):], 10),
                    os.stat(os.path.join(self.backingstore, filename)).st_size)
            for filename in os.listdir(self.backingstore)
            if filename.startswith("data."))

and the simple seek/tell/seekable parts of the interface we learned above:

    def size(self):
        if self._size < 0:
            self._size = self.fileobj.seek(0, 2)
            with open(os.path.join(self.backingstore, 'size'), 'w', encoding='utf-8') as size_file:
                size_file.write(f"{self._size}\n")
        return self._size

    def seek(self, offset, whence=0):
        if whence == 0:
            self.offset = offset
        elif whence == 1:
            self.offset += offset
        elif whence == 2:
            self.offset = self.size() + offset
        else:
            raise ValueError("Invalid whence")
        return self.offset

    def tell(self):
        return self.offset

    def seekable(self):
        return True

Implementing read() is a bit more complex: it needs to handle reads with nothing in the cache, reads with everything in the cache, and reads that span multiple cached and uncached segments.

    def _read(self, offset, count):
        """Does not update self.offset"""
        if offset >= self.size() or count == 0:
            return b""
        desired_segment = Segment(offset, count)
        # Is there a cached segment for the start of this segment?
        matches = sorted(segment for segment in self.available_segments if segment.contains(offset))
        if matches: # Read data from cache
            match = matches[0]
            with open(os.path.join(self.backingstore, f"data.{match.offset}"), 'rb') as data_file:
                data_file.seek(offset - match.offset)
                data = data_file.read(min(offset + count, match.offset + match.length) - offset)
        else: # Read data from underlying file
            # The beginning of the requested data is not cached, but if a later
            # portion of the data is cached, we don't want to re-read it, so
            # request only up to the next cached segment.
            matches = sorted(segment for segment in self.available_segments if segment.overlaps(desired_segment))
            if matches:
                match = matches[0]
                chunk_size = match.offset - offset
            else:
                chunk_size = count
            # Read from the underlying file object
            if not self.fileobj:
                raise RuntimeError(f"No underlying file to satisfy read of {count} bytes at offset {offset}")
            self.fileobj.seek(offset)
            data = self.fileobj.read(chunk_size)
            # Save to the backing store
            with open(os.path.join(self.backingstore, f"data.{offset}"), 'wb') as data_file:
                data_file.write(data)
            # Add it to the list of available segments
            self.available_segments.append(Segment(offset, chunk_size))
        # Read the rest of the data if needed
        if len(data) < count:
            data += self._read(offset+len(data), count-len(data))
        return data

    def read(self, count=-1):
        if count < 0:
            count = self.size() - self.offset
        data = self._read(self.offset, count)
        self.offset += len(data)
        return data

Notice the RuntimeError raised if we created the CachingFile object with fileobj=None. Why would we ever do that? Well, if we have fully cached the file, then we can run entirely from cache. If the original file (or URL, in our HttpFile case) is no longer available, the cache may be all we have. Or perhaps we want to isolate some operation, so we run once in "non-isolated" mode with the file object passed in, and then run in "isolated" mode with no file object. If the second run works, we know we have locally cached everything needed for the operation in question.

Our motivation is to use this with HttpFile, but it could be used in other situations. Perhaps you have mounted an sshfs file system over a slow or expensive link; CachingFile would improve performance or reduce cost. Or maybe you have the original files on a hard drive but put the cache on an SSD so repeated reads are faster. (Though in the latter case, Linux offers functionality that would likely be superior to anything implemented in Python.)

Generalized Lesson

So those are a couple of handy utilities, but they demonstrate a more profound point.

When you design your code around standard interfaces, your solutions apply in a broader range of situations, and you reduce the amount of code you must write to achieve your goal.

When faced with a problem of the form "I want to perform an operation on <something>, but I only know how to operate on <something else>", consider if you can create code that takes the "something" you have, and provides an interface that looks like the "something else" that you can use. If you can write that code to adapt one kind of thing to another kind of thing, you can solve your problem without having to reimplement the operation you already have code to do. And you might find there are more uses for the result than you anticipated.
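The standard library itself is full of adapters in exactly this spirit. io.BytesIO, for example, gives in-memory bytes the file interface, so ZipFile can work on an archive that never touches disk, just as it worked on our HttpFile:

```python
import io
import zipfile

# io.BytesIO adapts a byte buffer to the seekable-file interface that
# ZipFile expects -- the same role HttpFile plays for a URL.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zip_out:
    zip_out.writestr("hello.txt", "Hello, adapter!")

# Reopen the same buffer for reading; ZipFile neither knows nor cares
# that there is no file on disk.
print(zipfile.ZipFile(buffer).namelist())   # ['hello.txt']
```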