2051 points by Uptrenda 1319 days ago | 753 comments on HN
| Mild positive Community · v3.7· 2026-03-01 04:16:03 0
Summary Science & Culture Neutral
This is a user post on Hacker News soliciting discussion on interesting data structures, with the author providing an example of Bloom filters and Golomb Coded Sets. The content engages lightly with rights related to sharing information and participating in scientific culture. The evaluation is primarily neutral, with mildly positive scores stemming from the act of knowledge sharing on a platform designed for that purpose.
We use them a lot at Pyroscope for compressing strings that have common prefixes. They are also used in databases (e.g indexes in Mongo) or file formats (e.g debug symbols in macOS/iOS Mach-O format are compressed using tries).
We have an article with some animations that illustrate the concept in case anyone's interested [0].
Conflict-free replicated data type (CRDT) is super interesting data type (and algorithm), especially when it is implemented for complex data structures like JSON: A JSON CRDT is "[...] an algorithm and formal semantics for a JSON data structure that automatically resolves concurrent modifications such that no updates are lost, and such that all replicas converge towards the same state (a conflict-free replicated datatype or CRDT)." [1].
Finger trees allow you to do amazing things. In essence they let you build an index, or multiple indices, for your dataset and then store them in a single structure.
When I go from Haskell back to imperative land I find myself greatly missing this ability. Sure I can make multiple hashmaps or trees or whatever but being able to stuff it all in one data structure is amazing.
One structure I built with them that is much more difficult with typical structures is a "block list" structure.
In this structure each block has a particular width and they're all stacked side by side.
I want to perform a query, "which box is at position X". So if my boxes are of widths 7,20,10, then the lookup for 2 yields the first box, the lookup for 12 yields the second, etc.
More interestingly, if add a new box between the second and third, the indices covered by the last box is increased.
With finger trees you use a sum monoid. This is easy. In other languages you have to roll your own structure either using a list (with o(n) lookup) or a tree with o(log n) lookup, but o(n) inserts (to translate the indices of future blocks)
The problem space of ANN is one of those really deep holes you can go down. It’s a game of balancing time and space, and it’s got plenty of fascinating algorithms and datastructures.
Check out http://ann-benchmarks.com/ for a comparison. HNSW is not “the best” but it’s easy to understand and is quite effective.
A vaguely related notion is that naive analysis of big-O complexity in typical CS texts ignores over the increasing latency/cost of data access as the data size grows. This can't be ignored, no matter how much we would like to hand-wave it away, because physics gets in the way.
A way to think about it is that a CPU core is like a central point with "rings" of data arranged around it in a more-or-less a flat plane. The L1 cache is a tiny ring, then L2 is a bit further out physically and has a larger area, then L3 is even bigger and further away, etc... all the way out to permanent storage that's potentially across the building somewhere in a disk array.
In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.
Hence, a lot of algorithms that on paper have identical performance don't in reality, because one of the two may have an extra sqrt(n) factor in there.
This is why streaming and array-based data structures and algorithms tend to be faster than random-access, even if the latter is theoretically more efficient. So for example merge join is faster than nested loop join, even though they have the same performance in theory.
It's like a Red-Black tree in use case, but much faster to implement, which is good for competitive programming. The average case complexity is the same for all operations, but there's an unlikely degeneration to worst-case linked-list behaviour.
Say that you have data that is identified with points in 2D or 3D space. The standard way that CS students learn in school to represent this is via a quadtree or octree. However, these tree data structures tend to have a lot of "fluff" (needless allocations and pointer chasing) and need a lot of work to be made efficient.
Spatial hashing is the stupidly simple solution of just rounding coordinates to a grid and storing nearby objects together in a hash table. As long as your data is reasonably well behaved then you get many of the advantages of a quadtree, and it's completely trivial to implement.
(Fantastic post idea OP. One of the best I've ever seen :D)
Related to bloom filters, xor filters are faster and more memory efficient, but immutable.
HyperLogLog is an efficient way to estimate cardinality.
Coolest thing I've learned recently was Y-fast trie. If your dataset M is bounded integers (say, the set of all 128 bit numbers), you get membership, predecessor, or successor queries in log log time, not log, like in a normal tree.
Would love to learn more "stupidly fast at the cost of conceptual complexity" things.
edit:
(adding)
I don't know a name for it, because it's not one thing but a combination, but once can:
Use the first N bits from a very quick hash of a key from an unknown distribution (say a file path, or a variable name, or an address, or a binary blob,..) as a way to "shard" this key across M other fast data structures (like a tree) for search/add/remove. By changing M, you can tune the size of the terms in the O(1)+O(log) equation for running time.
Trees getting too deep for fast search? Every increase of N moves the computation from the log search of the tree to the space tradeoff of having more trees.
Added benefit is you can scale to multiple threads easily. Instead of locking the whole tree, you lock a tiny sub-tree.
Very clever. (I first saw in the Aerospike key value store)
The "golden section search" to find a the minimum (or maximum) of a unimodal function. An actual real-world use case for the golden ratio.
Exponentially Weighted Moving Average filters. Or how to have a moving average without saving any data points..
Some of my classic favorites:
Skiplists: they are sorted trees, but the algorithms are low complexity which is nice.
Boyer-Moore string search is awesome..
Bit twiddling algorithms from Hackers Delight: for example (x &-x) isolates the least significant set bit: basically use the ALU's carry chain to determine priority.
Another is compress from Hacker's Delight. It very quickly compresses selected bits (from a bitmask), all to the right of the word. It's useful to make certain hash functions.
The humble circular queue. If your circular queue is a power of 2 in size, then the number of bits needed for indexes is at least log2(size) + 1. The + 1 is key- it allows you to distinguish between completely full and completely empty with no effort. So:
The union-find data structure / algorithm is useful and a lot of fun.
The goal is a data structure where you can perform operations like "a and b are in the same set", "b and c are in the same set" and then get answers to questions like "are a and c in the same set?" (yes, in this example.)
The implementation starts out pretty obvious - a tree where every element either points at itself or some thing it was merged with. To check if two elements are in the same set, check if they have the same parent. Without analyzing it, it sounds like you'll average findRoot() performance of O(log(n)), worst-case O(n).
There are a couple of simple optimizations you can do to this structure, the type of things that seem like they shouldn't end up affecting asymptotic runtime all that much. The first is that, whenever you find a root, you can re-parent all the nodes you visited on the way to that root, so they'll all be quicker to look up next time. The other is that you keep track of the size of sets, and always make the larger set be the parent of the smaller.
And neither of those actually do anything impressive alone, but if you use both, the algorithm suddenly becomes incredibly fast, with the slowest-growing (non-constant) complexity I've ever heard of: O(the inverse of the Ackermann function(n)). Or, for any reasonable N, O(4 or less).
The Zipper acts like a linked-list with a cursor, or "focused element"; it's implemented as a pair of lists in opposite orders; or, equivalently but more symmetric, as triple of (List[A], A, List[A])
Say we have a zipper containing [0, 1, 2, 3, 4, 5], and we're focusing on the 3. In code this will look like:
([2, 1, 0], 3, [4, 5])
Where [a, b, c] denotes a singly-linked list, with O(1) head (returning a) and tail (returning [b, c]). Notice that the first list is in reverse order.
To focus on the next element, we put the currently focused element on the first list, and pull the head off the second list:
([3, 2, 1, 0], 4, [5])
And vice versa to focus on the previous element:
([1, 0], 2, [3, 4, 5])
I like to imagine this as a string of beads, with the focused element held in our fingers and the rest hanging down:
3
/ \
2 4
| |
1 5
|
0
To move the focus forwards and backwards, we move our grip to the next/previous bead.
This works nicely as an immutable datastructure, since the tails of both lists can be shared (i.e. we don't need to copy the whole list, just the wrap/unwrap an element on to the head)
Zippers were first applied to lists, but have since been generalised:
- To trees: focusing on one node, and being able to move the focus up, to the left-child, or to the right child
- To having more than one focus
- To datastructures more generally, by treating them as 'derivatives' (as in calculus)
As an example, the XMonad window manager uses a zipper to keep track of the currently-focused window.
I've also used Zippers for window management, e.g. having a list of Displays, with one of them focused (accepting keyboard input); each Display containing a list of Workspaces, with one of them focused (currently shown); and each Workspace containing a list of Windows, with one of them focused (the active window). Keyboard shortcuts could shift between the previous/next Window, Workspace, or Display.
"This is the story of a clever trick that's been around for at least 35 years, in which array values can be left uninitialized and then read during normal operations, yet the code behaves correctly no matter what garbage is sitting in the array. Like the best programming tricks, this one is the right tool for the job in certain situations. The sleaziness of uninitialized data access is offset by performance improvements: some important operations change from linear to constant time."
HAMT: Hash Array Mapped Trie. This data structure makes efficient immutable data possible. You can update a list of a million items, and keep a reference to the original list, by changing 3 or 4 references and some bytes.
This should replace copy-on-write for scripting languages. I really want to see it in a JS spec soon. There are libraries that can do it, but they add translation penalties and extra steps. I’d complain less if I could use Clojurescript professionally, because it and Clojure are built around them.
These are binary search trees, but when you 'splay' the tree (such as by search) it rebalances the tree so that the search result is on top. This means while it is O(log n) for operations like other self-balancing trees, it optimizes tree depth in the face of non-random access so that recently accessed items are fewer steps into the tree.
Piece tables:
A bit more common for text editors, where you need to represent a sequence of characters in a form that allows efficient memory use and fast insertions/removals (so editing the first line isn't moving the 10MB of data that follows it). You create a series of references to spans (or pieces) of the document, possibly starting with a single span pointing to a mmap() version. Edits are done by appending/prepending pieces, which are potentially just references to subsequences of items created in fresh memory, appended into a buffer. Saving can create a single sequence (and a single span).
This has interesting variations:
- Put attributes on the pieces for formatting, such as indicating text should be rendered a different color or bolded.
- Create a hierarchy of pieces-of-pieces. With formatting attributes you are then dangerously close to a DOM.
- Retain old copies of a piece table - since your original mmap() file hasn't changed and your changes are in an append-only buffer, those piece table copies provide undo/redo state.
Not a very deep CS-y one, but still one of my favourite data structures: Promise Maps.
It only works in languages where promises/futures/tasks are a first-class citizen. Eg JavaScript.
When caching the result of an expensive computation or a network call, don't actually cache the result, but cache the promise that awaits the result. Ie don't make a
Map<Key, Result>
but a
Map<Key, Promise<Result>>
This way, if a new, uncached key gets requested twice in rapid succession, ie faster than the computation takes, you avoid computing/fetching the same value twice. This trick works because:
- promises hold on to their result value indefinitely (until they're GC'ed)
- you can await (or .then()) an existing promise as many times as you want
- awaiting an already-resolved promise is a very low-overhead operation.
In other words, the promise acts as a mutex around the computation, and the resulting code is understandable even by people unfamiliar with mutexes, locks and so on.
Equality graphs (e-graphs) for theorem proving and equality saturation and other equality-related things.
They're awesome data structures that efficiently maintain a congruence relation over many expressions
> At a high level, e-graphs extend union-find to compactly represent equivalence classes of expressions while maintaining a key invariant: the equivalence relation is closed under congruence.
e.g. If I were to represent "f(x)" and "f(y)" in the e-graph, and then said "x == y" (merged "x" and "y" in the e-graph), then the e-graph, by congruence, would be able to tell me that "f(x) == f(y)"
e.g. If I were to represent "a*(2/2)", in the e-graph, then say "2/2 == 1", and "x*1 == x", by congruence the e-graph would know "a*(2/2) == a" !
The most recent description of e-graphs with an added insight on implementation is https://arxiv.org/pdf/2004.03082.pdf to the best of my knowledge.
They’re like tries, but with identical subtrees glued together. They’re capable of encoding real-world dictionaries of millions of words into ~1 byte per word, when stored cleverly (see [0]). They let you do things like efficiently finding a list of words in a dictionary whose Levenshtein’s distance to a given word is less than x. Their cousins, GADDAGs, are _the_ data structure of choice in Scrabble-playing engines.
Purely Functional Data Structures[1] by Chris Okasaki is worth reading. There's a book version if you prefer vs reading a thesis.
Even though the domain application is functional programming, these datastructures can come in handy when you want to enable state sharing / keeping old versions around without having to copy data.
Think of it as a JWT that you can narrow down the authorizations for without needing to communicate with a server, so if you have read permissions to all your photos you can add a caveat saying `photoid=123456` and share it, and the recipient can only read the photo 123456. The caveats can be anything, including requiring third party authentication via third-party caveats. I've implemented systems with it being validated by a lua module in nginx which worked well, but the whole concept never took off.
I still think it seems like one of the most interesting authZ strategies around.
A minor quibble with your use-case explanation: The advantage of a bloom filter isn't strictly time complexity. For example, a hash table would also have constant lookup time (best case), and would give a definitive answer on set membership. However, to store 1 million IPv6 addresses would take 16 MB. You can see very quickly that this would not scale very well to, say, a billion addresses stored in-memory on a laptop. With a bloom filter, we can shrink the amount of storage space required* while maintaining an acceptable, calculable false positive rate.
* IP addresses actually aren't a great use case for basic bloom filters, as they're fairly storage efficient to begin with, as opposed to a url for example. Taking your example, say we need to store 1 million IP addresses in our bloom filter and we're okay with a ~1% false positive rate. Well then, if we use a bloom filter with 2^23 bits (1 MB), the optimal number of hash functions is (2^23)/(10^6)*ln(2) = 6, yielding a false positive rate of (1 - exp(-6* 10^6 /2^23))^6 = ~1.8%. So we're using 6% of the storage space, but with a nearly 2% false positive rate.
I wouldn't consider tries to be obscure tbh. They are the recommended solution for many leetcode-style interview problems involving string searching. I think anyone who has done serious interview prep has encountered them.
Speaking of ease of implementation, I just discovered AA trees. They are probably not "obscure" but I think they are worthy of more fame because they perform jsut as well as red-black trees and are easier to implement. Finding
clear comprehensive documentation about them was not easy though, so here is for you, the best I could find : https://www.cs.umd.edu/class/fall2019/cmsc420-0201/Lects/lec...
Does anyone actually use cache-oblivious data structure in practice? Not, like, "yes I know there's a cache I will write a datastructure for that", that's common, but specifically cache-oblivious data structures?
People mention them a lot but I've never heard anyone say they actually used them.
You can do this with any tree, finger or otherwise, and the big-O stuff applies as long as the tree is within a constant factor of balanced. For some bizarre reason, though, the concept of caching a monoid in each node doesn’t seem to have spread to the standard libraries of most languages. It’s a pity, since that would be incredibly useful.
Regarding the bit “compress” operation, this is supported in hardware (PEXT instruction) by Haswell and newer processors, though unfortunately most AMD processors have poor implementations. I’ve found it to be quite handy when implementing subsetting operations.
You might enjoy a paper I’m a coauthor on which combines finger trees with B-trees to build a data structure for optimal updates to sliding windows: Optimal and General Out-of-Order Sliding-Window Aggregation, https://www.scott-a-s.com/files/vldb2019_fiba.pdf
> In essence, as data size 'n' grows, the random access time grows as sqrt(n), because that's the radius of the growing circle with area 'n'.
I was about to write a comment suggesting that if we made better use of three dimensional space in constructing our computers and data storage devices, we could get this extra latency factor down to the cube root of n.
But then, I decided to imagine an absurdly large computer. For that, one has to take into account a curious fact in black hole physics: the radius of the event horizon is directly proportional to the mass, rather than, as one might expect, the cube root of the mass [1]. In other words, as your storage medium gets really _really_ big, you also have to spread it out thinner and thinner to keep it from collapsing. No fixed density is safe. So, in fact, I think the extra factor for latency in galactic data sets is neither the square root nor cube root, but n itself!
A fantastic thing about HyperLogLog is that it can be merged, so you can split your data between multiple server, precompute HLL for all IPs every minute, and then ask "how many unique IPs was there yesterday".
Discovered HLL because it's used in ClickHouse, which employ a ton of cool but obscure data structure.
I don’t understand why wouldn’t you just use a list and an index. You can always access list[index+1] or list[index-1] or list[0] or list[list.length-1].
I encountered this when I tried implementing a Brainfuck interpreter in Haskell. Representing the memory array using a List + index would be way too slow; but a Zipper is perfect!
I spent a long time looking at an algorithm for long range force calculations called the Fast Multipole Method which is O(N) and found that practically it couldn't compete against an O(N log N) method we used that involved FFTs because the coefficient was so large that you'd need to simulate systems way bigger than is feasible in order for it to pay off because of cache locality, etc.
Another interesting approach to copy-on-write for immutable collections (for example arrays) is where you actively mutate the array in place, but leave the original as a lazy description/view of how to get back to the before state.
From the outside the effect is the same, but the performance is optimized for accessing the updated collection and only possibly using the old value.
Great for cases where you want the immutability guarantee, but where it might be unlikely the old value is actually going to be used in the general case.
Y-fast tries are some of my favorites. I think they are heavily under utilized in modern terms. They sat by the the wayside for a long term because datasets where relatively small, at the time they where created ram didn't exist, and bitwise operations where inefficient along with many other constant factors.
Today; however, a lot of people have datasets on the order of 2^16 or 2^32 keys they need to maintain. And efficient bitwise operations (on upto 512 bits) are the norm. Y-fast tries are faster than b-trees or other datastructures. Also because they divide _space_ not the _data_ they are very amenable to multi-threaded and distributed algorithms. For example the hash-tables in a y-fast trie can actually be a rendezvous hash pointing to a database node. Once in that node you can hash on cores again to get to a local process for example.
Oh, huh, this is serendipitously handy because I need to be doing something like that this weekend (translating from Go to Rust - I already have a reparenting solution using hashes but hopefully this is easier / better.)
Apache foundation has a fantastic DataSketches library that includes HLL and many other powerful data analytics algorithms: https://datasketches.apache.org/
Lee Rhodes has done an excellent introduction to this library - explaining some of the use cases, advantages, and things to be aware of when using these techniques: https://www.youtube.com/watch?v=nO9pauS-mGQ
Even though a trie is pretty standard and expected (to be known) these days, I will vote for this. It was my first deep dive into more exotic data structures after an interview question about implementing T9 that stumped me many years ago.
Indeed, but it took a while for me to appreciate Tries.
Originally, i thought "nice idea", but I rarely ever encountered a problem where I really needed a prefix tree.
But while reading into HAMT and Clojure's datastructure implementations, it dawned on me that prefix trees aren't only useful for strings (or arbitrary byte sequences): Hashes in a dictionary, for example, are sequences of bits, too. And the design of a HAMT is exactly such that it doesn't concern itself with bytes but any blocks of N bits (determined by the branching factor; i.e. a factor of 32 implies 6-bit-blocks). And if you know the length of each sequence (hash), you know the maximum Trie depth, which also works for you, not against you.
I didn't know this was a formal design pattern; I do recall having implemented this myself once, as a means to avoid double requests from different components.
Later on I switched to using react-query which has something like this built-in, it's been really good.
build 1ad9551+j7zs · deployed 2026-03-02 09:09 UTC · evaluated 2026-03-02 11:31:12 UTC
Support HN HRCB
Each evaluation uses real API credits. HN HRCB runs on donations — no ads, no paywalls.
If you find it useful, please consider helping keep it running.