CPU speed is often bound by memory bandwidth and latency... it's all related. If you can't keep the CPU fed, it doesn't matter how fast it is theoretically.
What I mean is that (to my understanding) memory bandwidth in modern devices is already high enough to keep a CPU fed during decompression. Bandwidth isn't a bottleneck in this scenario, so raising it doesn't make decompression any faster.
RAM limitations (latency and throughput) are generally hidden by the multiple layers of cache between the RAM and the CPU, which prefetch more data than is usually needed. Having memory on-package can reduce latency, but as AMD showed with HBM on a previous generation of its GPUs, it's not a silver-bullet solution.
I am going to speculate now, but maybe, just maybe, if some of the silicon Apple has used on the M1 is dedicated to compression/decompression, they could be transparently compressing all RAM in hardware. Since this is offloaded from the CPU cores and allows a compressed stream of data to and from memory, they would achieve greater effective RAM bandwidth, lower latency, and less usage for a given amount of memory. If that is the case, I hope the memory has ECC and/or the compression has parity checking...
> I am going to speculate now, but maybe, just maybe, if some of the silicon Apple has used on the M1 is dedicated to compression/decompression, they could be transparently compressing all RAM in hardware. Since this is offloaded from the CPU cores and allows a compressed stream of data to and from memory, they would achieve greater effective RAM bandwidth, lower latency, and less usage for a given amount of memory.
Are you aware of any x86 chips that utilize this method?
Not that I am aware of. I remember Apple doing something like it in software on the Intel Macs, which is why I speculated about it being hardware on the M1.
> Blosc [...] has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call. Blosc is the first compressor (that I'm aware of) that is meant not only to reduce the size of large datasets on-disk or in-memory, but also to accelerate memory-bound computations (which is typical in vector-vector operations).
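The core idea is easy to demonstrate. Here's a minimal sketch using Python's stdlib `zlib` as a stand-in for Blosc (which uses much faster codecs tuned for this purpose): a low-entropy in-memory buffer shrinks dramatically, so far fewer bytes have to cross the memory bus, and decompression restores the original bit-for-bit.

```python
import zlib

# A repetitive 8 MiB buffer, standing in for a typical numeric dataset
# with low entropy -- the kind of data Blosc targets.
original = (b"\x00\x01\x02\x03\x04\x05\x06\x07" * 1024) * 1024  # 8 MiB

# Fast, low-effort compression setting, analogous to Blosc's
# speed-over-ratio design point.
compressed = zlib.compress(original, 1)
ratio = len(original) / len(compressed)

# Round-trip check: decompression must reproduce the buffer exactly.
restored = zlib.decompress(compressed)
assert restored == original
print(f"{len(original)} -> {len(compressed)} bytes ({ratio:.0f}x smaller)")
```

The buffer and compression level here are illustrative choices; the point is only that when data compresses well, the bytes actually fetched from RAM can be a small fraction of the working set, which is how a compressor can beat a plain `memcpy()` for memory-bound work.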