How does open addressing handle deletions? I never figured this out.
Update: So I just read the article (I didn't have the chance earlier) and I see it explains tombstones, which seem like a pretty clever solution I wasn't thinking about. What I had been confused about, though, was the other, more obvious attempt at a solution, which is backshifting. Where I think I had gotten stuck is what you do with the spot that opens up after the backshift... it might have been part of another probe chain (or multiple, for that matter) that had jumped over it before, so how do you find one of those chains to move an item back into this slot (which you would have to do)?
The article mentions a couple of options, one of which is tombstones. In a chaining implementation, the load factor is straightforward: num_items / table_size. Adding items increases the load factor and deleting items reduces it.
With open addressing, deletions do not decrease the load factor (at least not immediately, in general), because a deleted item becomes a tombstone. So for open addressing, load factor is (num_items + num_tombstones) / table_size.
The upshot is that in a scenario where items are being repeatedly added and removed, over time the load factor will gradually increase until eventually the table has to be recreated. This is true even if the average number of items remains relatively constant over time, and is unlike a chaining implementation.
In this scenario, the average insertion time might be lower with open addressing, but the worst case will be much worse than for chaining.
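To make the accounting concrete, here's a minimal sketch of just the bookkeeping (the names are mine; a real table would store the entries too):

    # Sketch of the load-factor bookkeeping only; a real open-addressing
    # table would also store the entries themselves.
    class LoadFactorBookkeeping:
        def __init__(self, table_size):
            self.table_size = table_size
            self.num_items = 0
            self.num_tombstones = 0

        @property
        def load_factor(self):
            # Tombstones still occupy slots, so they count toward the load.
            return (self.num_items + self.num_tombstones) / self.table_size

        def on_insert(self, reused_tombstone=False):
            self.num_items += 1
            if reused_tombstone:
                self.num_tombstones -= 1   # reusing a tombstone keeps the load flat

        def on_delete(self):
            # The item becomes a tombstone: the load factor does NOT drop.
            self.num_items -= 1
            self.num_tombstones += 1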
# any time a tombstone immediately precedes an empty, it can be marked empty
[ $ _ ] -> [ _ _ ]
# any time you look up a key, it can be swapped with a tombstone immediately preceding it
[ $ A ] -> [ A $ ]
# (moving the tombstone closer to an empty that will destroy it)
# if you don't have iterators, you can also jump over other keys
[ $ B A ] -> [ A B $ ]
[ $ B C A ] -> [ A B C $ ] # etc
# (this will cause an iterator at A to yield B again, or at B to skip A)
How well this keeps the load factor down depends on how aggressively you look for tombstone disposal opportunities, but it does keep it down.
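Here's roughly what those rules look like applied during a lookup on a linear-probing table (a sketch; slots hold (key, value) tuples or my EMPTY/TOMB sentinels):

    EMPTY, TOMB = object(), object()

    def lookup_with_scavenge(slots, key):
        n = len(slots)
        i = hash(key) % n
        prev_tomb = None                     # last tombstone seen on this probe
        for _ in range(n):                   # bound the probe to one full cycle
            s = slots[i]
            if s is EMPTY:
                # rule 1: a tombstone immediately before an empty can be freed
                if prev_tomb is not None and (prev_tomb + 1) % n == i:
                    slots[prev_tomb] = EMPTY
                return None
            if s is TOMB:
                prev_tomb = i
            elif s[0] == key:
                # rule 2: swap the hit with an earlier tombstone (possibly
                # jumping over other keys, with the iterator caveat above)
                if prev_tomb is not None:
                    slots[prev_tomb], slots[i] = s, TOMB
                return s[1]
            i = (i + 1) % n
        return None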
I would argue that "the table" is not mutated, only the internal state of its implementation. Every time you access any information, a cache at some layer below you is updated. Is that also gross?
Yes. Normally you can have one thread writing to a data structure OR many threads reading the data structure at any given time and not need to worry about them causing problems. (This situation is common enough that we have things called "reader-writer mutexes" or "shared-exclusive" mutexes.)
As soon as your reads can modify the internal state of the data structure, it might modify the state in a way which trips up another read; so you can no longer have many threads reading the data structure at once.
But you don't need to write on every access, only on occasion, so you can actually use a read-write lock, and in the nominal case many threads can read just fine.
That said, it's probably still better to avoid this unless it's absolutely necessary to modify the underlying structure sometimes; I recently had to do this for an LRU cache.
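For what it's worth, the pattern looks something like this (toy reader-writer lock, naive and writer-starvation-prone; probe/scavenge are hypothetical stand-ins for whatever fixup the table wants to do):

    import threading

    class RWLock:                            # toy readers-writer lock
        def __init__(self):
            self._cond = threading.Condition()
            self._readers = 0

        def acquire_read(self):
            with self._cond:
                self._readers += 1

        def release_read(self):
            with self._cond:
                self._readers -= 1
                if self._readers == 0:
                    self._cond.notify_all()

        def acquire_write(self):             # holds the lock until release
            self._cond.acquire()
            while self._readers:
                self._cond.wait()

        def release_write(self):
            self._cond.release()

    lock = RWLock()

    def get(table, key):
        lock.acquire_read()
        try:
            value, needs_fixup = table.probe(key)    # hypothetical API
        finally:
            lock.release_read()
        if needs_fixup:                      # rare: take the exclusive path
            lock.acquire_write()
            try:
                table.scavenge(key)          # hypothetical fixup pass
            finally:
                lock.release_write()
        return value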
Right. And in Rust, implementing the hash table that way will suddenly mean it is no longer flagged as "Sync" by the compiler, so you will be unable to share it between threads.
Eh, it’s the classic amortized approach. Whenever you can “touch” the data and you’re right there already due to a lookup, it makes sense to amortize your data structure housekeeping IMO.
TBH, the right answer always depends on the user’s use case (amortization and housekeeping really help with purely functional data structures) and on benchmark data.
Well, it still works (just slower) if you only do fixups during {table[key] = val} operations. But honestly, if you're using a probabilistic data structure like a hash table, the ship has already sailed on gross implementation details.
Reading the article reveals that you can either use tombstones or move the elements back into the slot they "would have been in" if the deleted element had never been added.
The easiest one is tombstones (i.e. "this item is deleted") to keep the chain alive; the alternative is backshifting (i.e. moving the items later in the chain back one slot).
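For concreteness, a minimal linear-probing sketch of the tombstone option (my code, not the article's):

    EMPTY, TOMB = object(), object()

    class TombstoneTable:                    # minimal linear probing, no resize
        def __init__(self, size=8):
            self.slots = [EMPTY] * size

        def _probe(self, key):
            i = hash(key) % len(self.slots)
            for _ in range(len(self.slots)):
                yield i
                i = (i + 1) % len(self.slots)

        def get(self, key):
            for i in self._probe(key):
                s = self.slots[i]
                if s is EMPTY:               # end of chain: key is absent
                    return None
                # a tombstone keeps the chain alive: probe right past it
                if s is not TOMB and s[0] == key:
                    return s[1]
            return None

        def delete(self, key):
            for i in self._probe(key):
                s = self.slots[i]
                if s is EMPTY:
                    return False
                if s is not TOMB and s[0] == key:
                    self.slots[i] = TOMB     # mark dead; don't break the chain
                    return True
            return False

        def insert(self, key, value):
            first_tomb = None
            for i in self._probe(key):
                s = self.slots[i]
                if s is TOMB:
                    if first_tomb is None:
                        first_tomb = i       # reusable slot, if key is absent
                elif s is EMPTY:
                    j = first_tomb if first_tomb is not None else i
                    self.slots[j] = (key, value)
                    return
                elif s[0] == key:
                    self.slots[i] = (key, value)
                    return
            if first_tomb is not None:       # only keys/tombstones remain
                self.slots[first_tomb] = (key, value)
                return
            raise RuntimeError("full; a real table would resize here")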
For back-shifting, I'd prefer to actually seek out the tail item in the chain (just before you'd prove the key doesn't exist) and move that BACK to the evicted slot, rather than updating all of the keys.
However the tombstone idea fits better with minimizing mutations and improving the potential for concurrency if the design only cares that inserts/deletes are atomic (not that they're propagated in parallel instantly).
For the 'move back' idea to be that safe I'd still want to use a tombstone value, but it would need to also include a pointer to the moved location. The list of tombstones would need to be submitted (as in, message queue) to a kind of garbage collection thread that sat on the messages for a pre-determined validity time and then did a scrub of the key/value to make sure the compaction still made sense. A shorter interval might also allow for that deferred compaction to be replaced by a new entry instead.
I don't like any of that as a generic solution though, as the trade-offs with each bit of extra effort and code complexity seem likely to be premature optimization and sources of potential bugs when not considered in a specific problem domain.
I'm not sure that backshifting is as easy as "moving all items in the chain forward by one slot". Consider the hashtable [A, B1, B2, _, _], where one element is subsequently added after B2, giving [A, B1, B2, X, _]. Now when we remove B1 and shift B2 forward one slot ([A, B2, _, X, _]), we have to shift X forward if it hashes to the second or the third slot in the table, but not the fourth. So there might be multiple chains involved; if it was one contiguous chain only, we could simply arrange for the "hole" in it to be filled with no need for shifting all the items in it, and it would be quite efficient. However, it seems that it's not so simple.
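Right, and the textbook backshift handles exactly this: scan the whole cluster and pull an entry back only if its home slot allows it to occupy the hole. A sketch (essentially the classic linear-probing deletion, cf. Knuth TAOCP vol. 3; slots hold (key, value) tuples or None):

    def backshift_delete(slots, i):
        """Empty slot i, then repair the cluster that follows it."""
        n = len(slots)
        slots[i] = None
        j = i
        while True:
            j = (j + 1) % n
            if slots[j] is None:
                return                       # reached the end of the cluster
            k = hash(slots[j][0]) % n        # home slot of the entry at j
            # If k lies cyclically in (i, j], the entry is still reachable
            # from its home slot and must stay put; otherwise pull it back.
            if (i < k <= j) if i <= j else (k > i or k <= j):
                continue
            slots[i] = slots[j]
            slots[j] = None
            i = j                            # the hole moves to j; keep going

Run on the example above, this pulls B2 back into B1's slot and then moves X only when X's home slot precedes the new hole, which matches the case analysis in the parent comment.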
Yup, the chains can be interleaved. Often it's best to save the hash of each key as well as the key itself. Comparing the full hash (rather than the modulus of the hash) will eliminate many keys faster than doing a full key comparison, and having the full hash available means recreation on expansion is cheap, as all the keys don't need rehashing. The full hashes can then be used to distinguish between different chains if you decide to do backfilling, again cheaper than recalculating key hashes.
Of course, having the hashes available also speeds up recreation, should the tombstone approach be used.
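A sketch of that layout (names are mine), with each entry carrying its full hash so lookups compare hashes first and resizing never rehashes:

    class Entry:
        __slots__ = ("full_hash", "key", "value")
        def __init__(self, full_hash, key, value):
            self.full_hash, self.key, self.value = full_hash, key, value

    def find(slots, key):
        h = hash(key)                        # computed once per lookup
        n = len(slots)
        i = h % n
        for _ in range(n):                   # at most one full cycle
            e = slots[i]
            if e is None:
                return None
            # cheap integer compare first; key comparison only on a hash match
            if e.full_hash == h and e.key == key:
                return e.value
            i = (i + 1) % n
        return None

    def grow(slots):
        new = [None] * (len(slots) * 2)
        for e in slots:
            if e is not None:
                i = e.full_hash % len(new)   # stored hash: no rehashing needed
                while new[i] is not None:
                    i = (i + 1) % len(new)
                new[i] = e
        return new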
The key is usually indirected, which means a pointer, which is usually bigger than the hash code.
(And of course you could use parallel arrays if you're super concerned about cache lines, though the tradeoffs would very much depend on whether hash misses or hits are the expected mode.)
And if it’s not a pointer but a word-size integer, just use an invertible hash function (aka permutation aka block cipher) so you can store the hash code in place of the key :)
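A sketch of that trick for 64-bit integer keys: multiplication by an odd constant is a bijection mod 2^64, so the stored "hash" can be mapped back to the key (the constant here is the usual Fibonacci-hashing one; any odd multiplier works):

    MASK = (1 << 64) - 1
    A = 0x9E3779B97F4A7C15           # odd, hence invertible mod 2**64
    A_INV = pow(A, -1, 1 << 64)      # modular inverse (Python 3.8+)

    def hash64(x):
        return (x * A) & MASK

    def unhash64(h):
        return (h * A_INV) & MASK    # recover the key from the stored hash

    assert all(unhash64(hash64(x)) == x for x in (0, 1, 42, MASK))

A bare multiply mixes the low bits poorly; real designs add an xor-shift round, which is also invertible.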