Does anyone have any experience with this DB? Or context about where it came fro...

StevenBtw · 2026-03-22T19:24:57 1774207497

Hi, I'm the one building grafeo, I have no idea why it is being posted everywhere. But I can probably answer your questions.

The first version was largely a (slightly rearchitected) port of a local graph database I had been building called graphos. Most of the engine and core are handwritten, so are the python bindings and conformance tests. The rest is indeed largely AI generated, so is the documentation (Mkdocs). The AI generated parts are curated and validated, although it's not up to par for a production release yet.

This is not a resume portfolio project and in no way related to my day job. I started writing grafeo(then graphos) out of frustration with Neo4j and being inspired by some discussions about database internals with Hännes from duckdb at a conference. I tried ladybug, but found memory usage insanely high and was sure I could do better. Anyone looking for an embedded battle tested graph database should probably still look at ladybug though. Grafeo is not that mature yet.

And to be honest I also have no real plans with grafeo, I am using it myself for now and am very happy with it, but that's n=1. It's fully free and open source and contributors are very welcome, but its also not yet fully where I would want it to be, hence the beta status. I have no commercial interest, but had a lot of fun pouring multiple hundreds of hours in and creating something that I enjoy using myself.

Hope that clarifies some things!

adsharma · 2026-03-25T04:24:11 1774412651

Thank you for the shout out! I looked into your benchmark setup a bit. Two things going on:

- Ladybug by default allocates 80% of the physical memory to the buffer pool. You can limit it. This wasn't the main reason.

- Much of the RSS is in ladybug native memory connected to the python connection object. I noticed that you keep the connection open between benchmark runs. For whatever reason, python is not able to garbage collect the memory.

We ran into similar lifetime issues with golang and nodejs bindings as well. Many race conditions where the garbage collector releases memory while another thread still has a reference to native memory. We now require that the connection be closed for the memory to be released.

  https://github.com/LadybugDB/ladybug/issues/320
  https://github.com/LadybugDB/go-ladybug/issues/7
  https://github.com/LadybugDB/ladybug-nodejs/pull/1

jandrewrogers · 2026-03-21T16:36:07 1774110967

That is a lot of code for what appears to be a vanilla graph database with a conventional architecture. The thing I would be cautious about is that graph database engines in particular are known for hiding many sharp edges without a lot of subtle and sophisticated design. It isn't obvious that the necessary level of attention to detail has been paid here.

adsharma · 2026-03-21T16:45:27 1774111527

Are you talking about Andy Pavlo bet here?

https://news.ycombinator.com/item?id=29737326

Kuzu folks took some of these discussions and implemented them. SIP, ASP joins, factorized joins and WCOJ.

Internally it's structured very similar to DuckDB, except for the differences noted above.

DuckDB 1.5 implemented sideways information passing (SIP). And LadybugDB is bringing in support for DuckDB node tables.

So the idea that graph databases have shaky internals stems primarily from pre 2021 incumbents.

4 more years to go to 2030!

jandrewrogers · 2026-03-21T18:02:11 1774116131

I wasn't referring to the Pavlo bet but I would make the same one! Poor algorithm and architecture scalability is a serious bottleneck. I was part of a research program working on the fundamental computer science of high-scale graph databases ~15 years ago. Even back then we could show that the architectures you mention couldn't scale even in theory. Just about everyone has been re-hashing the same basic design for decades.

As I like to point out, for two decades DARPA has offered to pay many millions of dollars to anyone who can demonstrate a graph database that can handle a sparse trillion-edge graph. That data model easily fits on a single machine. No one has been able to claim the money.

Inexplicably, major advances in this area 15-20 years ago under the auspices of government programs never bled into the academic literature even though it materially improved the situation. (This case is the best example I've seen of obviously valuable advanced research that became lost for mundane reasons, which is pretty wild if you think about it.)

zozbot234 · 2026-03-22T00:26:50 1774139210

What do you need one trillion edges for? Wikidata is a huge, general purpose knowledge graph and it gets away with ~1B triples, give or take.

jandrewrogers · 2026-03-22T03:32:10 1774150330

Almost all analytic graphs of general scope surpass 1T edges, see below. DARPA also has an unfilled objective for 1B edge real-time continuously updated operational graphs. These are smaller and the write throughput requirements are in line with non-graph analytical databases but graph databases struggle to meet that standard.

There are countless smaller graphs for narrow domains that may be <1B edges but many people have the ambition to stitch together these narrow graphs into a larger graph. When stitching graphs together, the number of edges is usually super-linear. A billion edges is kind of considered “Hello World” for system testing.

The Semantic Web companies in the 2000s had graphs that were 100B+ edges. They wanted to go much larger but hit hard scaling walls around that point. That scaling wall killed them.

Classic mapping data models are typically 10-100B edges. These could be much, much larger if they could process all the data available to them.

Of course, intelligence agencies had all kinds of graphs far beyond trillions of edges 20 years ago. People, places, things, events.

Any type of spatiotemporal entity graphs with large geographic scope are quadrillions of edges. It isn’t just a lot of inferred relationships between entities, the relationships evolve over time which also must be captured. These are probably the most commercially valuable type of graph. You could build hundreds of different graphs of this type with 1T+ edges in most regions, never mind doing it at scale. These are so large that we usually don’t store them. Subgraphs are generated on demand, which is computationally expensive.

These spatiotemporal entity graphs also have the largest write loads. Single sources generate tens of PB/day of new edges. There is a ton of industrial data that looks like this; it isn’t just people slinging structured data.

Graphs are everywhere but we furiously avoid them because the scalability of operations over anything but severely constrained graphs is so poor. Selection bias.

NSA in particular heavily funded foundational theoretical and applied computer science research into scaling graph computing for decades. They had all kinds of boring graphs where trillions of edges was their Tuesday. The US military also uses large graph databases in fairly boring applications that probably didn’t require a graph database.

adsharma · 2026-03-21T20:05:42 1774123542

> many millions of dollars to anyone who can demonstrate a graph database that can handle a sparse trillion-edge graph.

I wonder why no one has claimed it. It's possible to compress large graphs to 1 byte per edge via Graph reordering techniques. So a trillion scale graph becomes 1TB, which can fit into high end machines.

Obviously it won't handle high write rates and mutations well. But with Apache Arrow based compression, it's certainly possible to handle read-only and read-mostly graphs.

Also the single machine constraint feels artificial. For any columnar database written in the last 5 years, implementing object store support is tablestakes.

jandrewrogers · 2026-03-21T22:15:48 1774131348

Achieving adequate performance at 1T edges in one aspect requires severe tradeoffs in other aspects, making every implementation impractical at that scale. You touched on a couple of the key issues when I was working in this domain.

There is no single machine constraint, just the observation that we routinely run non-graph databases at similar scale on single machines without issue. It doesn't scale on in-memory supercomputers either, so the hardware details are unrelated to the problem:

- Graph database with good query performance typically has terrible write performance. It doesn't matter how fast queries are if it takes too long to get data into the system. At this scale there can be no secondary indexing structures into the graph; you need a graph cutting algorithm efficient for both scalable writes and join recursion. This was solved.

- Graph workloads break cache replacement algorithms for well-understood theory reasons. Avoiding disk just removes one layer of broken caching among many but doesn't address the abstract purpose for which a cache exists. This is why in-memory systems still scale poorly. We've known how to solve this in theory since at least the 1980s. The caveat is it is surprisingly difficult to fully reduce to practice in software, especially at scale, so no one really has. This is a work in progress.

- Most implementations use global synchronization barriers when parallelizing algorithms such as BFS, which greatly increases resource consumption while throttling hardware scalability and performance. My contribution to research was actually in this area: I discovered a way to efficiently use error correction algorithms to elide the barriers. I think there is room to make this even better but I don't think anyone has worked on it since.

The pathological cache replacement behavior is the real killer here. It is what is left even if you don't care about write performance or parallelization.

I haven't worked in this area for many years but I do keep tabs on new graph databases to see if someone is exploiting that prior R&D, even if developed independently.

mleonhard · 2026-03-22T09:11:42 1774170702

> Inexplicably, major advances in this area 15-20 years ago under the auspices of government programs never bled into the academic literature even though it materially improved the situation.

Would you please share some more info about this? Were the advances implemented in software and never written up and published? What are the names of the government programs?

rossjudson · 2026-03-21T21:43:36 1774129416

I guess it all depends on the meaning of the word "handle", and what the use cases are.

darkteflon · 2026-03-21T23:28:44 1774135724

KuzuDB, now in [maintenance mode](https://github.com/kuzudb/kuzu). Quite annoyed about that one, was using it extensively.

gdotv · 2026-03-23T12:52:32 1774270352

LadybugDB (https://github.com/LadybugDB/ladybug) at this point seems to be the only sustainable fork. When deciding what to do about the Kuzu archival on https://gdotv.com, we've gone with maintaining support for the last available version of Kuzu (it's still heavily used from what I'm seeing) whilst introducing support for LadybugDB. I've looked into a few other forks and at this point in time none seem to be actively maintained for more than a few weeks before getting dropped.

darkteflon · 2026-03-24T20:24:47 1774383887

Thanks, that’s interesting - I didn’t know that Ladybug was a KuzuDB fork.

adsharma · 2026-03-21T17:33:49 1774114429

Source: https://www.theregister.com/2023/03/08/great_graph_debate_we...

> There are some additional optimizations that are specific to graphs that a relational DBMS needs to incorporate: [...]

This is essentially what Kuzu implemented and DuckDB tried to implement (DuckPGQ), without touching relational storage.

The jury is out on which one is a better approach.

justonceokay · 2026-03-21T16:41:10 1774111270

Yes a graph database will happily lead you down a n^3 (or worse!) path when trying to query for a single relation if you are not wise about your indexes, etc.

cluckindan · 2026-03-21T18:15:51 1774116951

That sounds like a ”graph” DB which implements edges as separate tables, like building a graph in a standard SQL RDB.

If you wish to avoid that particular caveat, look for a graph DB which materializes edges within vertices/nodes. The obvious caveat there is that the edges are not normalized, which may or may not be an issue for your particulat application.

adsharma · 2026-03-21T16:48:30 1774111710

Are you talking about the query plan for scanning the rel table? Kuzu used a hash index and a join.

Trying to make it optional.

Try

explain match (a)-[b]->(c) return a.rowid, b.rowid, c.rowid;

stult · 2026-03-21T20:43:23 1774125803

It certainly does seem problematic to have a graph database hiding edges, sharp or not

gdotv · 2026-03-21T16:20:35 1774110035

Agreed, there's been a literal explosion in the last 3 months of new graph databases coded from scratch, clearly largely LLM assisted. I'm having to keep track of the industry quite a bit to decide what to add support for on https://gdotv.com and frankly these days it's getting tedious.

piyh · 2026-03-21T18:10:59 1774116659

I'm turning off my brain and using neo4j

gdotv · 2026-03-21T22:26:55 1774132015

proof that Neo4j won the popularity contest!

UltraSane · 2026-03-21T19:35:00 1774121700

Neo4j is pretty nice.

aorth · 2026-03-21T18:41:47 1774118507

Figurative!

ozgrakkurt · 2026-03-21T18:17:17 1774117037

Using a LLM coded database sounds like hell considering even major databases can have some rough edges and be painful to use.

arthurjean · 2026-03-21T17:17:11 1774113431

Sounds about right for someone who ships fast and iterates. 54 days for a v0 that probably needs refactoring isn't that crazy if the dev has a real DB background. We've all seen open source projects drag on for 3 years without shipping anything, that's not necessarily better

Aurornis · 2026-03-21T19:43:04 1774122184

200,000 lines of code on week 1 is not a sign of a quality codebase with careful thought put into it.

> We've all seen open source projects drag on for 3 years without shipping anything, that's not necessarily better

There are more options than “never ship anything” and “use AI to slip 200,000 lines of code into a codebase”

TheJord · 2026-03-21T21:23:48 1774128228

shipping fast matters a lot less than shipping something you actually understand. 200k lines in a week means nobody knows what's in there, including the author. that's not a codebase, it's a liability