How does open addressing handle deletions? I never figured this out.
Update: So I just read the article (I didn't have the chance earlier) and I see it explains tombstones, which seem like a pretty clever solution I wasn't thinking about. What I had been confused about, though, was the other, more obvious attempt at a solution, which is backshifting. Where I think I had gotten stuck is what you do with the spot that opens up after the backshift... it might have been part of another probe chain (or multiple, for that matter) that had jumped over it before, so how do you find one of those chains to move an item back into this slot (which you would have to do)?
The article mentions a couple of options, one of which is tombstones. In a chaining implementation, the load factor is straightforward: num_items / table_size. Adding items increases the load factor and deleting items reduces it.
With open addressing, deletions do not decrease the load factor (at least not immediately, in general), because a deleted item becomes a tombstone. So for open addressing, load factor is (num_items + num_tombstones) / table_size.
The upshot is that in a scenario where items are being repeatedly added and removed, over time the load factor will gradually increase until eventually the table has to be recreated. This is true even if the average number of items remains relatively constant over time, and is unlike a chaining implementation.
In this scenario, the average insertion time might be lower with open addressing, but the worst case will be much worse than for chaining.
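To make the accounting concrete, here's a minimal sketch of just the bookkeeping (the names are mine; a real table would store the entries too):

    # Sketch of the load-factor bookkeeping only; a real open-addressing
    # table would also store the entries themselves.
    class LoadFactorBookkeeping:
        def __init__(self, table_size):
            self.table_size = table_size
            self.num_items = 0
            self.num_tombstones = 0

        @property
        def load_factor(self):
            # Tombstones still occupy slots, so they count toward the load.
            return (self.num_items + self.num_tombstones) / self.table_size

        def on_insert(self, reused_tombstone=False):
            self.num_items += 1
            if reused_tombstone:
                self.num_tombstones -= 1   # reusing a tombstone keeps the load flat

        def on_delete(self):
            # The item becomes a tombstone: the load factor does NOT drop.
            self.num_items -= 1
            self.num_tombstones += 1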
# any time a tombstone immediately precedes an empty, it can be marked empty
[ $ _ ] -> [ _ _ ]
# any time you look up a key, it can be swapped with a tombstone immediately preceding it
[ $ A ] -> [ A $ ]
# (moving the tombstone closer to an empty that will destroy it)
# if you don't have iterators, you can also jump over other keys
[ $ B A ] -> [ A B $ ]
[ $ B C A ] -> [ A B C $ ] # etc
# (this will cause an iterator at A to yield B again, or at B to skip A)
How well this keeps the load factor down depends on how aggressively you look for tombstone disposal opportunities, but it does keep it down.
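Here's roughly what those rules look like applied during a lookup on a linear-probing table (a sketch; slots hold (key, value) tuples or my EMPTY/TOMB sentinels):

    EMPTY, TOMB = object(), object()

    def lookup_with_scavenge(slots, key):
        n = len(slots)
        i = hash(key) % n
        prev_tomb = None                     # last tombstone seen on this probe
        for _ in range(n):                   # bound the probe to one full cycle
            s = slots[i]
            if s is EMPTY:
                # rule 1: a tombstone immediately before an empty can be freed
                if prev_tomb is not None and (prev_tomb + 1) % n == i:
                    slots[prev_tomb] = EMPTY
                return None
            if s is TOMB:
                prev_tomb = i
            elif s[0] == key:
                # rule 2: swap the hit with an earlier tombstone (possibly
                # jumping over other keys, with the iterator caveat above)
                if prev_tomb is not None:
                    slots[prev_tomb], slots[i] = s, TOMB
                return s[1]
            i = (i + 1) % n
        return None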
I would argue that "the table" is not mutated, only the internal state of its implementation. Every time you access any information, a cache at some layer below you is updated. Is that also gross?
Yes. Normally you can have one thread writing to a data structure OR many threads reading the data structure at any given time and not need to worry about them causing problems. (This situation is common enough that we have things called "reader-writer mutexes" or "shared-exclusive" mutexes.)
As soon as your reads can modify the internal state of the data structure, it might modify the state in a way which trips up another read; so you can no longer have many threads reading the data structure at once.
But you don't need to write on every access, only on occasion, so you can actually use a read-write lock, and in the nominal case many threads can read just fine.
That said, it's probably still better to avoid this unless it's absolutely necessary to modify the underlying structure sometimes; I recently had to do this for an LRU cache.
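For what it's worth, the pattern looks something like this (toy reader-writer lock, naive and writer-starvation-prone; probe/scavenge are hypothetical stand-ins for whatever fixup the table wants to do):

    import threading

    class RWLock:                            # toy readers-writer lock
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0

        def acquire_read(self):
            with self._cond:
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def acquire_write(self):             # holds the lock until release
            self._cond.acquire()
            while self._readers:
                self._cond.wait()

        def release_write(self):
            self._cond.release()

    lock = RWLock()

    def get(table, key):
        lock.acquire_read()
        try:
            value, needs_fixup = table.probe(key)    # hypothetical API
        finally:
            lock.release_read()
        if needs_fixup:                      # rare: take the exclusive path
            lock.acquire_write()
            try:
                table.scavenge(key)          # hypothetical fixup pass
            finally:
                lock.release_write()
        return value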
Right. And in Rust, implementing the hash table that way will suddenly mean it is no longer flagged as "Sync" by the compiler, so you will be unable to share it between threads.
Eh, it’s the classic amortized approach. Whenever you can “touch” the data and you’re right there already due to a lookup, it makes sense to amortize your data structure housekeeping IMO.
TBH, the right answer always depends on the user’s use case (amortization and housekeeping really help with purely functional data structures) and on benchmark data.
Well, it still works (just slower) if you only do fixups during {table[key] = val} operations. But honestly, if you're using a probabilistic data structure like a hash table, the ship has already sailed on gross implementation details.
Reading the article reveals that you can either use tombstones or move the elements back into the slot they "would have been in" if the deleted element had never been added.
The easiest one is tombstones (i.e. "this item is deleted") to keep the chain alive; the alternative is backshifting (i.e. moving the items later in the chain back one slot).
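For concreteness, a minimal linear-probing sketch of the tombstone option (my code, not the article's):

    EMPTY, TOMB = object(), object()

    class TombstoneTable:                    # minimal linear probing, no resize
        def __init__(self, size=8):
            self.slots = [EMPTY] * size

        def _probe(self, key):
            i = hash(key) % len(self.slots)
            for _ in range(len(self.slots)):
                yield i
                i = (i + 1) % len(self.slots)

        def get(self, key):
            for i in self._probe(key):
                s = self.slots[i]
                if s is EMPTY:               # end of chain: key is absent
                    return None
                # a tombstone keeps the chain alive: probe right past it
                if s is not TOMB and s[0] == key:
                    return s[1]
            return None

        def delete(self, key):
            for i in self._probe(key):
                s = self.slots[i]
                if s is EMPTY:
                    return False
                if s is not TOMB and s[0] == key:
                    self.slots[i] = TOMB     # mark dead; don't break the chain
                    return True
            return False

        def insert(self, key, value):
            first_tomb = None
            for i in self._probe(key):
                s = self.slots[i]
                if s is TOMB:
                    if first_tomb is None:
                        first_tomb = i       # reusable slot, if key is absent
                elif s is EMPTY:
                    j = first_tomb if first_tomb is not None else i
                    self.slots[j] = (key, value)
                    return
                elif s[0] == key:
                    self.slots[i] = (key, value)
                    return
            if first_tomb is not None:       # only keys/tombstones remain
                self.slots[first_tomb] = (key, value)
                return
            raise RuntimeError("full; a real table would resize here")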
For back-shifting, I'd prefer to actually seek out the tail item in the chain (just before you'd prove the key doesn't exist) and move that BACK to the evicted slot, rather than updating all of the keys.
However the tombstone idea fits better with minimizing mutations and improving the potential for concurrency if the design only cares that inserts/deletes are atomic (not that they're propagated in parallel instantly).
For the 'move back' idea to be that safe I'd still want to use a tombstone value, but it would need to also include a pointer to the moved location. The list of tombstones would need to be submitted (as in, message queue) to a kind of garbage collection thread that sat on the messages for a pre-determined validity time and then did a scrub of the key/value to make sure the compaction still made sense. A shorter interval might also allow for that deferred compaction to be replaced by a new entry instead.
I don't like any of that as a generic solution though, as the trade-offs with each bit of extra effort and code complexity seem likely to be premature optimization and sources of potential bugs when not considered in a specific problem domain.
I'm not sure that backshifting is as easy as "moving all items in the chain forward by one slot". Consider the hashtable [A, B1, B2, _, _], where one element is subsequently added after B2, giving [A, B1, B2, X, _]. Now when we remove B1 and shift B2 forward one slot ([A, B2, _, X, _]), we have to shift X forward if it hashes to the second or the third slot in the table, but not the fourth. So there might be multiple chains involved; if it was one contiguous chain only, we could simply arrange for the "hole" in it to be filled with no need for shifting all the items in it, and it would be quite efficient. However, it seems that it's not so simple.
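Right, and the textbook backshift handles exactly this: scan the whole cluster and pull an entry back only if its home slot allows it to occupy the hole. A sketch (essentially the classic linear-probing deletion, cf. Knuth TAOCP vol. 3; slots hold (key, value) tuples or None):

    def backshift_delete(slots, i):
        """Empty slot i, then repair the cluster that follows it."""
        n = len(slots)
        slots[i] = None
        j = i
        while True:
            j = (j + 1) % n
            if slots[j] is None:
                return                       # reached the end of the cluster
            k = hash(slots[j][0]) % n        # home slot of the entry at j
            # If k lies cyclically in (i, j], the entry is still reachable
            # from its home slot and must stay put; otherwise pull it back.
            if (i < k <= j) if i <= j else (k > i or k <= j):
                continue
            slots[i] = slots[j]
            slots[j] = None
            i = j                            # the hole moves to j; keep going

Run on the example above, this pulls B2 back into B1's slot and then moves X only when X's home slot precedes the new hole, which matches the case analysis in the parent comment.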
Yup, the chains can be interleaved. Often it's best to save the hash of each key as well as the key itself. Comparing the full hash (rather than the modulus of the hash) will eliminate many keys faster than doing a full key comparison, and having the full hash available means recreation on expansion is cheap, as all the keys don't need rehashing. The full hashes can then be used to distinguish between different chains if you decide to do backfilling, again cheaper than recalculating key hashes.
Of course, having the hashes available also speeds up recreation, should the tombstone approach be used.
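A sketch of that layout (names are mine), with each entry carrying its full hash so lookups compare hashes first and resizing never rehashes:

    class Entry:
        __slots__ = ("full_hash", "key", "value")
        def __init__(self, full_hash, key, value):
            self.full_hash, self.key, self.value = full_hash, key, value

    def find(slots, key):
        h = hash(key)                        # computed once per lookup
        n = len(slots)
        i = h % n
        for _ in range(n):                   # at most one full cycle
            e = slots[i]
            if e is None:
                return None
            # cheap integer compare first; key comparison only on a hash match
            if e.full_hash == h and e.key == key:
                return e.value
            i = (i + 1) % n
        return None

    def grow(slots):
        new = [None] * (len(slots) * 2)
        for e in slots:
            if e is not None:
                i = e.full_hash % len(new)   # stored hash: no rehashing needed
                while new[i] is not None:
                    i = (i + 1) % len(new)
                new[i] = e
        return new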
The key is usually indirected, which means a pointer, which is usually bigger than the hash code.
(And of course you could use parallel arrays if you're super concerned about cache lines, though the tradeoffs would very much depend on whether hash misses or hits are the expected mode.)
And if it’s not a pointer but a word-size integer, just use an invertible hash function (aka permutation aka block cipher) so you can store the hash code in place of the key :)
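A sketch of that trick for 64-bit integer keys: multiplication by an odd constant is a bijection mod 2^64, so the stored "hash" can be mapped back to the key (the constant here is the usual Fibonacci-hashing one; any odd multiplier works):

    MASK = (1 << 64) - 1
    A = 0x9E3779B97F4A7C15           # odd, hence invertible mod 2**64
    A_INV = pow(A, -1, 1 << 64)      # modular inverse (Python 3.8+)

    def hash64(x):
        return (x * A) & MASK

    def unhash64(h):
        return (h * A_INV) & MASK    # recover the key from the stored hash

    assert all(unhash64(hash64(x)) == x for x in (0, 1, 42, MASK))

A bare multiply mixes the low bits poorly; real designs add an xor-shift round, which is also invertible.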