> Introducing the ‘packed’ data format, a binary format that allows using data as it is, without the need for a deserialisation step. A notable perk of this format is that traversals on packed trees are proven to be faster than on ‘unpacked’ trees: as the fields of data structures are inlined, there are no pointer jumps, thus making the most of the L1 cache.
That is, a "memory dump -> zero-copy memory read" of a subgraph of Haskell objects, allowing to pass such trees / subgraphs directly over a network. Slightly reminiscent of Cap'n Proto.
One thing that sometimes gets tricky in these things is handling sub-term sharing. I wonder how they implemented it.
They mention this in the article.
It reminds me more of FlatBuffers, though. Does protobuf also have zero allocation (beyond initial ingestion) and no pointer jumps?
We are always reinventing wheels. If we didn't, they'd all still be made of wood.
> the serialised version of the data is usually bigger than its in-memory representation
I don’t think this is common. Perhaps for arrays of floats serialized as JSON or something. But I can’t think of a case where binary serialization is bigger. Data types like maps are necessarily larger in memory to support fast lookup and mutability.
If you use a lot of sharing in immutable data, it can grow a lot when serialized. A simple pathological example is a tree whose left and right subtrees are the same shared object at every level: it takes O(height) space in memory, but O(2^height) when serialized (sketch below).
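Something like this, in illustrative Haskell, assuming a serialiser that doesn't detect sharing:

```haskell
data Tree = Leaf | Node Tree Tree

-- Each level reuses one shared child, so the heap holds only
-- O(n) nodes for `shared n`.
shared :: Int -> Tree
shared 0 = Leaf
shared n = let t = shared (n - 1) in Node t t  -- t is shared, not copied

-- A sharing-oblivious serialiser writes the subtree out twice at
-- every Node, so the output contains 2^n leaves.
serialise :: Tree -> String
serialise Leaf       = "L"
serialise (Node l r) = "N" ++ serialise l ++ serialise r
```

`shared 30` is about 30 heap objects, but `length (serialise (shared 30))` is over two billion characters.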
I suppose all self-describing formats, like Protobuf, Thrift, or, well, JSON, are bigger than the efficient machine representation, because they carry the schema in every message, one way or another.
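Back-of-the-envelope illustration, with hand-rolled encoders (no particular library, just to show where the extra bytes go):

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C8
import Data.Bits (shiftR)
import Data.Int (Int32)
import Data.Word (Word8)

data Point = Point { px :: Int32, py :: Int32 }

-- Schema-less binary: 8 bytes; the reader must already know the layout.
encodeRaw :: Point -> BS.ByteString
encodeRaw (Point x y) = BS.pack (le32 x ++ le32 y)
  where
    le32 v = [fromIntegral (v `shiftR` s) :: Word8 | s <- [0, 8, 16, 24]]

-- Self-describing JSON: the field names travel with every message,
-- so even this tiny record costs 13+ bytes.
encodeJson :: Point -> BS.ByteString
encodeJson (Point x y) =
  C8.pack ("{\"x\":" ++ show x ++ ",\"y\":" ++ show y ++ "}")
```

Protobuf sits in between: it drops the field names, but still spends a tag byte or so per field.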
Is this like MessagePack for Haskell?
This was very well written. Excellent article!
Honestly, I wish more stuff worked this way - fewer hops in memory always make me happy.