
This is a classic appeal to authority. Let's play the argument, not the man.

(My understanding is that the GP wrote LMDB, works on OpenLDAP, and was a maintainer for BerkeleyDB for a number of years. But even if he'd only written 'hello, world!' I'm much more interested in the specific arguments).



Correct, and thank you. I wrote LMDB, wrote a lot of OpenLDAP, and worked on BerkeleyDB for many years. And actually Andy Pavlo invited me to CMU to give a lecture on LMDB a few years back. https://www.youtube.com/watch?v=tEa5sAh-kVk

Andy and I have had this debate going for a long time already.


Well, I eat my shorts.

Isn't LMDB closer to an embedded key-value store than an RDBMS, though? Also there's a section in the paper that mentions it's single-writer.


Yes, LMDB is an embedded key/value store but it can be used as the backing store of any other DB model you care for. E.g. as a backend to MySQL, or SQLite, or OpenLDAP, or whatever.


I think the real argument is more nuanced. Where you see mmap() fail badly on Linux, even for read-only workloads, is under a few specific conditions: very large storage volumes, highly concurrent access, non-trivial access patterns (e.g. high-dimensionality access methods). Most people do not operate data models under these conditions, but if you do then you can achieve large integer factor gains in throughput by not using mmap().

Interestingly, most of the reason for these problems has to do with theoretical limitations of cache replacement algorithms as drivers of I/O scheduling. There are alternative approaches to scheduling I/O that work much better in these cases but mmap() can’t express them, so in those cases bypassing mmap() offers large gains.
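To make the "mmap can't express them" point concrete: the entire vocabulary mmap offers for steering the kernel's I/O scheduling is a handful of coarse madvise-style hints, with no priorities, ordering, or batching. A minimal Python sketch (assumes a POSIX system and Python 3.8+ for `mmap.madvise`; file contents are illustrative):

```python
import mmap
import os
import tempfile

# Create a small temporary file to map (illustrative data only).
fd, path = tempfile.mkstemp()
os.write(fd, b"x" * mmap.PAGESIZE * 4)
os.fsync(fd)

with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    # This is roughly the whole vocabulary for steering readahead/eviction:
    if hasattr(m, "madvise") and hasattr(mmap, "MADV_RANDOM"):
        m.madvise(mmap.MADV_RANDOM)    # "access will be random" - that's it
        m.madvise(mmap.MADV_WILLNEED)  # "prefetch this" - no order, no priority
    data = bytes(m[:mmap.PAGESIZE])    # a cache miss here blocks this thread

os.close(fd)
os.remove(path)
```

An engine doing its own I/O can instead issue exact, prioritized, batched reads; mmap reduces all of that to hints the kernel is free to ignore.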


GP wrote a key-value store called LMDB that is constrained to a single writer, and often used for small databases that fit entirely in memory but need to persist to disk. There's a whole different world for more scalable databases.


"fit entirely in memory" is not a requirement. LMDB is not a main-memory database, it is an on-disk database that uses memory mapping.


Can you explain "high-dimensionality access methods" to me? (Or if it's too big for an HN comment, maybe recommend a paper).


This guy talks a lot of crap. See his website for examples, and don't waste your time with him

<<<There is one significant drawback that should not be understated. Algorithm design using topology manipulation can be enormously challenging to reason about. You are often taking a conceptually simple algorithm, like a nested loop or hash join, and replacing it with a much more efficient algorithm involving the non-trivial manipulation of complex high-dimensionality constraint spaces that effect the same result. Routinely reasoning about complex object relationships in greater than three dimensions, and constructing correct parallel algorithms that exploit them, becomes easier but never easy.>>>

http://www.jandrewrogers.com/2015/10/08/spacecurve/


I'd imagine the same kind of worst-case access would also be a problem doing I/O the "classical" way


The argument is that:

- Queries can trigger blocking page faults when accessing (transparently) evicted pages, causing unexpected I/O stalls

- mmap() complicates transactionality and error-handling

- Page table contention, single-threaded page eviction, and TLB shootdowns become bottlenecks
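The first bullet is really about control: with mmap, a cache miss surfaces as a page fault inside an ordinary memory access, while a buffer-pool design makes the same read an explicit call the engine can schedule. A toy Python contrast (temp file and payload are illustrative):

```python
import mmap
import os
import tempfile

fd, path = tempfile.mkstemp()
os.write(fd, b"page0pay" + b"\0" * (mmap.PAGESIZE - 8))
os.fsync(fd)

# mmap style: the read is just a memory access. If the page was evicted,
# this line stalls the thread in a fault the program never sees as I/O.
with mmap.mmap(fd, 0, access=mmap.ACCESS_READ) as m:
    via_mmap = bytes(m[:8])

# Buffer-pool style: the I/O is an explicit call the engine can defer,
# batch, or hand to another thread before anything blocks.
via_pread = os.pread(fd, 8, 0)

os.close(fd)
os.remove(path)
assert via_mmap == via_pread
```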


1 - for reading any uncached data, the I/O stalls are unavoidable. Whatever client requested that data is going to have to wait regardless.

2 - complexity? this is simply false. LMDB's ACID txns using MVCC are much simpler than any "traditional" approach.

3 - contention is a red herring since this approach is already single-writer, as is common for most embedded k/v stores these days. You lose more perf by trying to make the write path multi-threaded, in lock contention and cache thrashing.


> for reading any uncached data, the I/O stalls are unavoidable.

Excuse me for a silly question, but whilst an I/O stall may be unavoidable, wouldn't a thread stall be avoidable if you're not using mmap?

Assuming that you're not swapping, you'll generally know if you've loaded something into memory or not, whilst mmap doesn't help you know if the relevant page is cached. If the data isn't in memory, you can send the I/O request to a thread to retrieve it, and the initiating thread can then move onto the next connection. I suspect this isn't doable under mmap based access?
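A minimal sketch of the handoff described above, in Python (the structure is illustrative, not any real engine's code): the initiating thread submits the read to a worker pool and is immediately free to service other connections.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

fd, path = tempfile.mkstemp()
os.write(fd, b"record-A")

pool = ThreadPoolExecutor(max_workers=4)

def fetch(offset, length):
    # The worker thread absorbs the I/O stall. os.pread is safe to call
    # from multiple threads because it doesn't move the shared file offset.
    return os.pread(fd, length, offset)

future = pool.submit(fetch, 0, 8)  # initiating thread returns immediately
# ... initiating thread services other connections here ...
data = future.result()             # collect when this connection is resumed

pool.shutdown()
os.close(fd)
os.remove(path)
```

Under mmap the equivalent "read" is a plain memory dereference, so there is no request object to hand off; the fault lands on whichever thread touched the page.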


It's kind of disingenuous to talk about how great your concurrency system is when you only allow a single writer. RCU (which I imagine your system is isomorphic to) is pretty simple compared to what many DB engines use to do ACID transactions that involve both reads and writes.


You don't need more than single-writer concurrency if your write txns are fast enough.

Our experience with OpenLDAP was that multi-writer concurrency cost too much overhead. Even though you may be writing primary records to independent regions of the DB, if you're indexing any of that data (which all real DBs do, for query perf) you wind up getting a lot of contention in the indices. That leads to row locking conflicts, txn rollbacks, and retries. With a single writer txn model, you never get conflicts, never need rollbacks.
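The single-writer model described above can be sketched as a queue that one thread drains in order, updating primary records and the secondary index together, so there is no conflict detection and no rollback path (toy Python, not LMDB's actual design):

```python
import queue
import threading

db, index = {}, {}   # primary records and a secondary index
writes = queue.Queue()

def writer():
    # The only thread that mutates db/index: txns apply in queue order,
    # so index updates never conflict and never need to roll back.
    while True:
        txn = writes.get()
        if txn is None:
            break
        for key, value in txn:
            db[key] = value
            index.setdefault(value, set()).add(key)

t = threading.Thread(target=writer)
t.start()
writes.put([("k1", "blue"), ("k2", "blue")])  # one write txn
writes.put([("k3", "red")])                   # applied strictly after
writes.put(None)                              # shut the writer down
t.join()
```

The contention the comment describes shows up precisely in `index`: two concurrent writers inserting values that hash or sort near each other would fight over the same index pages; serialized, they never do.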


> You don't need more than single-writer concurrency if your write txns are fast enough.

This only works on systems with sufficiently slow storage. If your server has a bunch of NVMe, which is a pretty normal database config these days, you will be hard-pressed to get anywhere close to the theoretical throughput of the storage with a single writer. That requires 10+ GB/s sustained. It is a piece of cake with multiple writers and a good architecture.

Writes through indexing can be sustained at this rate (assuming appropriate data structures), most of the technical challenge is driving the network at the necessary rate in my experience.


That's all just false. Just because you're single-writer at the application level doesn't mean the OS isn't queueing enough writes to saturate storage at the device level. We've benchmarked plenty of high speed NVMe devices, like Intel Optane SSDs, etc. showing this. http://www.lmdb.tech/bench/optanessd/


this guy is a fool. Ignore him. Or see his website.


That's probably because your OpenLDAP benchmarks used a tiny database. If you have multi-terabyte databases, you will start to see huge gains from a multi-writer setup because you will regularly be loading pages from disk, rather than keeping almost all of your btree/LSM tree in RAM.


Yeah, no. Not with a DB 50x larger than RAM, anyway.

http://www.lmdb.tech/bench/hyperdex/

RAM is relatively cheap too, there's no real reason to be running multi-TB databases at greater than a 50x ratio.


Sorry, but "50x larger than RAM" is a pretty small DB - that's an 800 GB database on a machine with 16 GB of RAM. I usually have seen machines with 500-1000x ratios of flash to RAM. "RAM is relatively cheap" is also false when you're storing truly huge amounts of data, which is how the systems you compare yourself to (LevelDB, etc) are usually deployed. Note that RAM is now the single greatest cost when buying servers.

> Now that the total database is 50 times larger than RAM, around half of the key lookups will require a disk I/O.

That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector. I am also a little surprised at the lack of a random writes benchmark. I get that this is marketing material, though.


> I am also a little surprised at the lack of a random writes benchmark.

Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.

> That is an insanely high cache hit rate, which should have probably set off your "unrepresentative benchmark" detector.

No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.
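The claim is easy to sanity-check: in a B+tree the branch pages are a tiny fraction of the total, so they stay resident even when the leaves don't. Rough arithmetic (the fan-outs and record count below are assumed round numbers, not LMDB's measured values):

```python
# Assumed shape: ~100 children per branch page, ~50 records per leaf page.
records = 100_000_000
leaf_entries, fanout = 50, 100

leaf_pages = records // leaf_entries     # 2,000,000 leaf pages
branch_pages, level = 0, leaf_pages
while level > 1:                         # sum the branch pages per level
    level = -(-level // fanout)          # ceiling division
    branch_pages += level

# Branch pages come to ~1% of the tree, so they fit in cache; a lookup's
# only likely miss is the final leaf page.
branch_fraction = branch_pages / (leaf_pages + branch_pages)
```

Under these assumptions, roughly 20,000 branch pages index 2,000,000 leaf pages, which is why the branch levels stay hot regardless of workload.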


> Eh? This was 20% random writes, 80% random reads. LMDB is for read-heavy workloads.

The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.

> No, that is normal for a B+tree; the root page and most of the branch pages will always be in cache. This is why you can get excellent efficiency and performance from a DB without tuning to a specific workload.

It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."


> The page says "updates," not "writes." Updates are a constrained form of write where you are writing to an existing key. Updates, importantly, do not affect your index structure, while writes do.

OK, I see your point. It would only have made things even worse for LevelDB here to do an Add/Delete workload because its garbage compaction passes would have had to do a lot more work.

> It is normal for a small B+tree relative to the memory size available on the machine. The "small" was the unrepresentative part of the benchmark, not the "B+tree."

This was 100 million records, and a 5-level deep tree. To get to 6 levels deep it would be about 10 billion records. Most of the branch pages would still fit in RAM; most queries would require at most 1 more I/O than the 5-level case. The cost is still better than any other approach.
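The depth claim follows from the tree geometry: with fan-out f, depth grows as the ceiling of log base f of the record count, so multiplying the data by 100x adds roughly one level, i.e. at most one extra I/O per lookup once the branch levels are cached. A sketch (fan-out 100 is an assumed round number; the thread's 5-to-6-level figures imply a somewhat smaller effective fan-out):

```python
def btree_depth(records, fanout):
    # Smallest depth d such that fanout**d >= records, computed with
    # integers to avoid floating-point edge cases at exact powers.
    depth, capacity = 1, fanout
    while capacity < records:
        capacity *= fanout
        depth += 1
    return depth

# With fan-out 100: 10^8 keys need 4 levels, 10^10 need 5 -
# 100x more data costs exactly one extra level.
shallow = btree_depth(10**8, 100)
deep = btree_depth(10**10, 100)
```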


Take a look at http://nms.csail.mit.edu/~stavros/pubs/OLTP_sigmod08.pdf - the overhead of coordinating multiple writers often makes multi-writer databases slower than single-writer databases. Remember, everything has to be serialized when it goes to the write-ahead log, so as long as you can do the database updates as fast as you can write to the log, concurrent writers are of no benefit.


This is another cool example of a toy database that is again very small:

> The database size for one warehouse is approximately 100 MB (we experiment with five warehouses for a total size of 500MB).

It is not surprising that when your database basically fits in RAM, serializing on one writer is worth doing, because it just plainly reduces contention. You basically gain nothing in a DB engine from multi-writer transactions when this is the case. A large part of a write (the vast majority of write latency) in many systems with a large database comes from reading the index up to the point where you plan to write. If that tree is in RAM, there is no work here, and you instead incur overhead on consistency of that tree by having multiple writers.

I'm not suggesting that these results are useless. They are useful for people whose databases are small because they are meaningfully better than RocksDB/LevelDB which implicitly assume that your database is a *lot* bigger than RAM.


> RocksDB/LevelDB which implicitly assume that your database is a lot bigger than RAM.

Where are you getting that assumption from? LevelDB was built to be used in Google Chrome, not for multi-TB DBs. RocksDB was optimized specifically for in-memory workloads.


I worked with the Bigtable folks at Google. LevelDB's design is ripped straight from BigTable, which was designed with that assumption in mind. I'm also pretty sure it was not designed specifically for Google Chrome's use case - it was written to be a general key-value storage engine based on BigTable, and Google Chrome was the first customer.

RocksDB is Facebook's offshoot of LevelDB, basically keeping the core architecture of the storage engine (but multithreading it), and is used internally at Facebook as the backing store for many of their database systems. I have never heard from anyone that RocksDB was optimized for in-memory workloads at all, and I think most benchmarks can conclusively say the opposite: both of those DB engines are pretty bad for workloads that fit in memory.


I think we've gone off on a tangent. At any rate, both LevelDB and RocksDB are still single-writer, so whatever the point was, it seems to have been lost along the way.


I've used RocksDB for an in-memory K/V store of ~600GB in size and it worked really well. Not saying it's the best choice out there but it did the job very well for us. And in particular because our dataset was always growing and we needed the option to fallback to disk if needed, RocksDB worked very well.

Was a PITA to optimise though; tons of options and little insight into which ones work.


I am using the same rough model, and I'm using it on a 1.5 TB DB running on a Raspberry Pi very successfully.

Pretty much all storage libraries written in the past couple of decades are using single writer. Note that single writer doesn't mean single transaction. Merging transactions is easy and highly profitable, after all.
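"Single writer doesn't mean single transaction" can be sketched as group commit: the writer drains whatever logical transactions have queued up and commits them as one batch, amortizing the log flush (toy Python, assumed structure rather than any real engine's):

```python
import queue

pending = queue.Queue()
committed = {}
log = []   # stands in for the write-ahead log: one entry per flush

def submit(txn):
    # Callers enqueue logical transactions and move on.
    pending.put(txn)

def writer_pass():
    # Drain everything queued so far and commit it as ONE write txn:
    # many logical transactions, a single log flush.
    batch = []
    while not pending.empty():
        batch.append(pending.get())
    for txn in batch:
        committed.update(txn)
    if batch:
        log.append(len(batch))   # one "fsync" covering the whole batch

submit({"a": 1})
submit({"b": 2})
submit({"a": 3})
writer_pass()   # three logical txns land with a single flush
```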


Yeah for workloads with any long running write transactions a single writer design is a pretty big limitation. Say some long running data load (or a big bulk deletion) running along with some faster high throughput key value writes - the big data load would block all the faster key-value writes when it runs.

No "mainstream" database I'm aware of has a global single writer design.


"Taking full control of your I/O and buffer management is great if (a) your developers are all smart and experienced enough to be kernel programmers" is already an appeal to authority in itself.

We shouldn't apply a higher bar to the counterargument than we applied to the argument in the first place.



