I've been advocating for SQLite+NVMe for a while now. For me it is a new kind of...

crazygringo · on March 13, 2025

> I've been advocating for SQLite+NVMe for a while now.

Why SQLite instead of a traditional client-server database like Postgres? Maybe it's a smidge faster on a single host, but you're just making it harder for yourself the moment you have 2 webservers instead of 1, and both need to write to the database.

> Latency is king in all performance matters.

This seems misleading. First of all, your performance doesn't matter if you don't have consistency, which is what you now have to figure out the moment you have multiple webservers. And secondly, database latency is generally miniscule compared to internet round-trip latency, which itself is miniscule compared to the "latency" of waiting for all page assets to load like images and code libraries.

> Especially in those where items must be processed serially.

You should be avoiding serial database queries as much as possible in the first place. You should be using joins whenever possible instead of separate queries, and whenever not possible you should be issuing queries asynchronously at once as much as possible, so they execute in parallel.

bob1029 · on March 13, 2025

The entire point is to avoid the network hop.

Application <-> SQLite <-> NVMe

has orders of magnitude less latency than

Application <-> Postgres Client <-> Network <-> Postgres Server <-> NVMe

> You should be avoiding serial database queries as much as possible in the first place.

I don't get to decide this. The business does.

_1tem · on March 14, 2025

Postgres supports Unix sockets when running on the same machine. That’s what I use, for a significant latency improvement over the TCP stack even at 127.0.0.1.

sedatk · on March 13, 2025

"...has orders of magnitude less latency than..."

[citation needed]. Local network access shouldn't be much different than local IPC.

bob1029 · on March 13, 2025

> Local network access

In what production scenarios do MySQL, Postgres, DB2, Oracle, et. al., live on the same machine as the application that uses them?

I am pretty sure most of these vendors would offer strict guidance to not do that.

TylerE · on March 14, 2025

Like 95% of websites that aren’t Amazon or google? Ton of sites that run in a single small vm. Postgres scales down quite nicely and will happily run in say, 512MB.

chatmasta · on March 14, 2025

It’s not a stretch to imagine that a scenario where you’re willing to run SQLite locally is also one where it’s acceptable to run Postgres locally. You’ve presumably already got the sharding problem solved, so why not? It’s less esoteric of an architecture than multiwriter SQLite.

crazygringo · on March 14, 2025

> I am pretty sure most of these vendors would offer strict guidance to not do that.

Then you'd be wrong. Running Postgres or MySQL on the same host where Apache is running is an extremely common scenario for sites starting out. They run together on 512 MB instances just fine. And on an SSD, that can often handle a surprising amount of traffic.

As popularity grows, the next step is to separate out the database on its own server, but mostly as a side effect of the fact that you now need multiple web servers, but still a single source of truth for data. Databases are lighter-weight than you seem to think.

immibis · on March 14, 2025

In the scenario where you were choosing between it and SQLite...

bobmcnamara · on March 14, 2025

Context switches plus mmap accesses are often slower than mmap accesses.

badmintonbaseba · on March 14, 2025

You don't have IPC for sqlite, do you?

nolist_policy · on March 14, 2025

You do of you access the same database from miltiple processes.

badmintonbaseba · on March 14, 2025

What IPC mechanisms exist between sqlite processes accessing the same database, other than file locking and some atomic IO operations ensured by the OS.

_1tem · on March 14, 2025

I’ve tested this before and Postgres is measurably faster over Unix socket than over local network.

crazygringo · on March 13, 2025

Perhaps I wasn't clear enough in my comment. When I said "database latency is generally miniscule compared to internet round-trip latency", I meant between the user and the website. Because they're often thousands of miles away, there are network buffers, etc.

But no, a local network hop doesn't introduce "orders of magnitude" more latency. The article itself describes how it is only 5x slower within a datacenter for the roundtrip part -- not 100x or 1,000x as you are claiming. But even that is generally significantly less than the time it takes the database to actually execute the query -- so maybe you see a 1% or 5% speedup of your query. It's just not a major factor, since queries are generally so fast anyways.

The kind of database latency that you seem to be trying to optimize for is a classic example of premature optimization. In the context of a web application, you're shaving microseconds for a page load time that is probably measured in hundreds of milliseconds for the user.

> I don't get to decide this. The business does.

You have enough power to design the entire database architecture, but you can't write and execute queries more efficiently, following best practices?

392 · on March 14, 2025

Sqlite can be run in process. Latency and bandwidth can be made 10x worst by process context switching alone. Plus being able to get away with n+1s could save a lot of dev time depending on the crew, before Claude (tho the dev still needs to learn that the speed problem is due to this and refactor the query, or write it fast the first time)

crazygringo · on March 14, 2025

> Latency and bandwidth can be made 10x worst by process context switching alone.

No they can't. That doesn't even make sense as a claim regarding bandwidth since SQLite doesn't use any, but please re-read what I said about being a 1% or 5% difference in speed. Not 10x.

bob1029 · on March 14, 2025

Yes they absolutely can.

Same-core context switching costs a few microseconds.

Going across core complexes can cost tens to hundreds of microseconds.

These figures are several orders of magnitude (5-6) slower than L1 access on the same thread.

crazygringo · on March 15, 2025

Hundreds of microseconds? L1 access? I don't have the faintest idea of what you're talking about.

Communication between processes is negligible compared to all of the sequential disk/SSD accesses and processing required for executing queries.

The database isn't stored in L1 and communication isn't taking hundreds of microseconds. I don't know where you're getting your information.

The fact that SQLite is in-process is primarily about simplicity and convenience, not performance. Performance can even be worse, e.g. due to the lack of a query cache.

immibis · on March 14, 2025

If you're concerned about the overhead of IPC when using postgres on the same server, weigh your intuition of it against your intuition of the savings from having a persistent process. SQLite can't cache a lot of things because some other process might have completely changed the database between transactions. Postgres knows everything that happens to the database.

conradev · on March 13, 2025

Until you hit the single-writer limitation in SQLite, you do not need to spend more CPU cycles on Postgres

chatmasta · on March 14, 2025

That’s a limitation you’ll hit pretty quickly unless you’ve specifically planned your architecture to be mostly read-only SQLite or one SQLite per session.

spratzt · on March 14, 2025

You certainly won’t hit it with most corporate OLAP processing, which is nearly all read-only SQlite. Writes are generally batched and processed outside ‘normal’ business hours, where the limitations of SQlite writing are irrelevant.

andai · on March 14, 2025

Where are they batched?

chatmasta · on March 14, 2025

In a separate system maintained by the DuckDB cargo cultists.

conradev · on March 14, 2025

I'm having a hard time imagining this. Aren't most CRUD apps using OLTP mostly read-only in the first place?

I just feel like you'd need thousands of concurrent users on a typical CRUD app to even get close to straining SQLite.

theamk · on March 14, 2025

I'd recommend going with postgres if there is a good chance you'll need it, instead of starting with SQLite and switching later - as their capabilities and data models are quite different.

For small traffic, it's pretty simple to run it on the same host as web app, and unix auth means there are no passwords to manage. And once you need to have multiple writers, there is no need to rewrite all the database queries.

dangoodmanUT · on March 13, 2025

The SQLite filesystem is laid out to hedge against HDD defragging. It wouldn't benefit as much as changing it to a more modern layout that's SSD-native, then using NVMe

cynicalsecurity · on March 13, 2025

Sqlite doesn't work super well with parallelism in writing. It supports it, yes, but in a bit clunky way and it still can fail. To avoid problems with parallel writing besides setting a specific clunky mode of operations a trick of using a single thread for writing in an app can be used. Which usually makes the already complicated parallel code slightly more complicated.

If only one thread of writing is required, then SQLite works absolutely great.

bob1029 · on March 13, 2025

> If only one thread of writing is required, then SQLite works absolutely great.

The whole point of getting your commands down to microsecond execution time is so that you can get away with just one thread of writing.

Entire financial exchanges operate on this premise.

chatmasta · on March 14, 2025

Entire financial exchanges are not running single threaded writes to their persistent data store. If they are, and you have a link, I’d love to be proven wrong.

_1tem · on March 14, 2025

https://use.expensify.com/blog/scaling-sqlite-to-4m-qps-on-a...

bob1029 · on March 14, 2025

https://www.infoq.com/presentations/LMAX/

diziet_sma · on March 14, 2025

These aren't financial exchanges, they're a sports betting and an expense management system.

I share OPs skepticism. Market makers invest in microwave towers, FPGAs, etc. I would be surprised if sqlite backed by NVME is on the other end of all that specialized hardware.

Order matching is a single threaded thing though. I would be curious if anyone knows how electronic trading systems are actually implemented.

bob1029 · on March 14, 2025

> I share OPs skepticism

> I would be surprised if sqlite backed by NVME is on the other end of all that specialized hardware.

I was not making this assertion. I am surprised anything like it got inferred (i.e., my use of the word "premise" regarding single thread/writer policy).

I agree that what you describe would be ridiculous in practice.

> I would be curious if anyone knows how electronic trading systems are actually implemented.

> Order matching is a single threaded thing though.

I think you answered your own question.

jstimpfle · on March 13, 2025

I still measure 1-2ms of latency with an NVMe disk on my Desktop computer, doing fsync() on a file on a ext4 filesystem.

Update: about 800us on a more modern system.

rbranson · on March 13, 2025

Not so sure that's true. This is single-threaded direct I/O doing a fio randwrite workload on a WD 850X Gen4 SSD:

    write: IOPS=18.8k, BW=73.5MiB/s (77.1MB/s)(4412MiB/60001msec); 0 zone resets
    slat (usec): min=2, max=335, avg= 3.42, stdev= 1.65
    clat (nsec): min=932, max=24868k, avg=49188.32, stdev=65291.21
     lat (usec): min=29, max=24880, avg=52.67, stdev=65.73
    clat percentiles (usec):
     |  1.00th=[   33],  5.00th=[   34], 10.00th=[   34], 20.00th=[   35],
     | 30.00th=[   37], 40.00th=[   38], 50.00th=[   40], 60.00th=[   43],
     | 70.00th=[   53], 80.00th=[   60], 90.00th=[   70], 95.00th=[   84],
     | 99.00th=[  137], 99.50th=[  174], 99.90th=[  404], 99.95th=[  652],
     | 99.99th=[ 2311]

jstimpfle · on March 13, 2025

I checked again with O_DIRECT and now I stand corrected. I didn't know that O_DIRECT could make such a huge difference. Thanks!

jstimpfle · on March 13, 2025

Oops, O_DIRECT does not actually make that big of a difference. I had updated my ad-hoc test to use O_DIRECT, but didn't check that write() now returned errors because of wrong alignment ;-)

As mentioned in the sibling comment, syncs are still slow. My initial 1-2ms number came from a desktop I bought in 2018, to which I added an NVME drive connected to an M.1 slot in 2022. On my current test system I'm seeing avg latencies of around 250us, sometimes a lot more (there a fluctuations).

   # put the following in a file "fio.job" and run "fio fio.job"
   # enable either direct=1 (O_DIRECT) or fsync=1 (fsync() after each write())
   [Job1]
   #direct=1
   fsync=1
   readwrite=randwrite
   bs=64k  # size of each write()
   size=256m  # total size written

tanelpoder · on March 14, 2025

Add sync=1 to your fio O_DIRECT write tests (not fsync, but sync=1) and you’ll see a big difference on consumer SSDs without power loss protection for their controller buffers. It adds the FUA flag (force unit access) to the write requests to ensure persistence of your writes, O_DIRECT alone won’t do that

wmf · on March 13, 2025

Random writes and fsync aren't the same thing. A single unflushed random write on a consumer SSD is extremely fast because it's not durable.

rbranson · on March 13, 2025

You're right. Sync writes are ten times as slow. 331µs.

  write: IOPS=3007, BW=11.7MiB/s (12.3MB/s)(118MiB/10001msec); 0 zone resets
    clat (usec): min=196, max=23274, avg=331.13, stdev=220.25
     lat (usec): min=196, max=23275, avg=331.25, stdev=220.27
    clat percentiles (usec):
     |  1.00th=[  210],  5.00th=[  223], 10.00th=[  235], 20.00th=[  262],
     | 30.00th=[  297], 40.00th=[  318], 50.00th=[  330], 60.00th=[  343],
     | 70.00th=[  355], 80.00th=[  371], 90.00th=[  400], 95.00th=[  429],
     | 99.00th=[  523], 99.50th=[  603], 99.90th=[ 1631], 99.95th=[ 2966],
     | 99.99th=[ 8225]

madisp · on March 13, 2025

I'm not an expert, but I think an enterprise NVMe will have some sort of power loss protection so it can afford to fsync to ram/caches as they will be written down in a power loss. Consumer NVMe drives afaik lack this so fsync will force the file to be written.

dogben · on March 13, 2025

I believe that's power saving in action. A single operation at idle is slow, the drive needs time to wake from idle.

dzr0001 · on March 13, 2025

What drive is this and does it need a trim? Not all NVMe devices are created equal, especially in consumer drives. In a previous role I was responsible for qualifying drives. Any datacenter or enterprise class drive that had that sort of latency in direct IO write benchmarks after proper pre-conditioning would have failed our validation.

jstimpfle · on March 13, 2025

My current one reads SAMSUNG MZVL21T0HCLR-00BH1 and is built into a quite new work laptop. I can't get below around 250us avg.

On my older system I had a WD_BLACK SN850X but had it connected to an M.1 slot which may be limiting. This is where I measured 1-2ms latency.

Is there any good place to get numbers of what is possible with enterprise hardware today? I've struggled for some time to find a good source.

dzr0001 · on March 14, 2025

Unfortunately, this data is harder to find than it should be. For instance, just looking at Kioxia, which I've found to be very performant, their datasheets for the CD series drives don't mention write latency at all. Blocks and Files[1] mentions that they claim <255us average, so they must have published that somewhere. This is why we would extensively test multiple units ourselves, following proper preconditioning as defined by SNIA. Averaging 250us for direct writes is pretty good.

[1] https://blocksandfiles.com/2023/08/07/kioxias-rocketship-dat...

the8472 · on March 13, 2025

I assume fsyncing a whole file does more work than just ensuring that specific blocks made it to the WAL which it can achieve with direct IO or maybe sync_file_range.

delamon · on March 14, 2025

Enterprise NVMe can do fsync much faster than consumer hardware. This is because they can cheat and report successful fsync() before data actually had been flushed to flash. They have backup capacitors which allow them to flush caches in case of power loss, so no data loss.

Here PM983 doing `fio --name=fsync_test --ioengine=sync --rw=randwrite --bs=4k --size=1G --numjobs=1 --runtime=10s --time_based --fsync=1`

  Jobs: 1 (f=1): [w(1)][100.0%][w=183MiB/s][w=46.7k IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=11905: Fri Mar 14 13:34:34 2025
    write: IOPS=39.1k, BW=153MiB/s (160MB/s)(1527MiB/10001msec); 0 zone resets
      clat (nsec): min=1052, max=223288, avg=1606.69, stdev=2345.64
       lat (nsec): min=1082, max=223458, avg=1653.08, stdev=2346.58
      clat percentiles (nsec):
       |  1.00th=[  1128],  5.00th=[  1176], 10.00th=[  1240], 20.00th=[  1320],
       | 30.00th=[  1448], 40.00th=[  1496], 50.00th=[  1528], 60.00th=[  1576],
       | 70.00th=[  1640], 80.00th=[  1720], 90.00th=[  1816], 95.00th=[  1960],
       | 99.00th=[  2576], 99.50th=[  3376], 99.90th=[ 10816], 99.95th=[ 32640],
       | 99.99th=[124416]
     bw (  KiB/s): min=123168, max=190568, per=99.00%, avg=154788.63, stdev=19610.50, samples=19
     iops        : min=30792, max=47642, avg=38697.16, stdev=4902.62, samples=19
    lat (usec)   : 2=95.61%, 4=4.10%, 10=0.19%, 20=0.04%, 50=0.03%
    lat (usec)   : 100=0.02%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=13, max=1238, avg=23.08, stdev= 9.27
      sync percentiles (usec):
       |  1.00th=[   15],  5.00th=[   16], 10.00th=[   16], 20.00th=[   17],
       | 30.00th=[   18], 40.00th=[   25], 50.00th=[   26], 60.00th=[   26],
       | 70.00th=[   26], 80.00th=[   26], 90.00th=[   26], 95.00th=[   27],
       | 99.00th=[   34], 99.50th=[   79], 99.90th=[  101], 99.95th=[  126],
       | 99.99th=[  347]

The same test on SN850X

  Jobs: 1 (f=1): [w(1)][100.0%][w=22.9MiB/s][w=5859 IOPS][eta 00m:00s]
  fsync_test: (groupid=0, jobs=1): err= 0: pid=23328: Fri Mar 14 13:35:04 2025
    write: IOPS=5742, BW=22.4MiB/s (23.5MB/s)(224MiB/10001msec); 0 zone resets
      clat (nsec): min=400, max=110253, avg=797.80, stdev=1244.19
       lat (nsec): min=430, max=110273, avg=826.49, stdev=1248.86
      clat percentiles (nsec):
       |  1.00th=[  502],  5.00th=[  540], 10.00th=[  572], 20.00th=[  612],
       | 30.00th=[  644], 40.00th=[  668], 50.00th=[  708], 60.00th=[  748],
       | 70.00th=[  804], 80.00th=[  868], 90.00th=[ 1032], 95.00th=[ 1176],
       | 99.00th=[ 1560], 99.50th=[ 2224], 99.90th=[ 8384], 99.95th=[23424],
       | 99.99th=[66048]
     bw (  KiB/s): min=19800, max=24080, per=100.00%, avg=23004.21, stdev=1039.13, s  amples=19
     iops        : min= 4950, max= 6020, avg=5751.05, stdev=259.78, samples=19
    lat (nsec)   : 500=0.80%, 750=58.72%, 1000=29.04%
    lat (usec)   : 2=10.89%, 4=0.28%, 10=0.18%, 20=0.04%, 50=0.04%
    lat (usec)   : 100=0.01%, 250=0.01%
    fsync/fdatasync/sync_file_range:
      sync (usec): min=136, max=28040, avg=172.88, stdev=195.00
      sync percentiles (usec):
       |  1.00th=[  145],  5.00th=[  149], 10.00th=[  151], 20.00th=[  151],
       | 30.00th=[  159], 40.00th=[  159], 50.00th=[  159], 60.00th=[  159],
       | 70.00th=[  159], 80.00th=[  161], 90.00th=[  198], 95.00th=[  202],
       | 99.00th=[  396], 99.50th=[  416], 99.90th=[  594], 99.95th=[ 1467],
     | 99.99th=[ 5145]

kev009 · on March 14, 2025

NVMe is just a protocol. There are drives that are absolute shit and others that cost as much as luxury automobiles. In either case not quite DRAM latency because it is expansion bus attached.

bobmcnamara · on March 15, 2025

RIP Optane DIMMs.

sergiotapia · on March 13, 2025

I had a lot of fun with Coolify running my app and my database on the same machine. It was pretty cool to see zero latency in my SQL queries, just the cost of the engine.