> your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.
There's nothing special about kernel programmers. In fact, if I had to compare, I'd say storage people are the more experienced / knowledgeable ones. They work in a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are largely in the same group as storage people (oftentimes literally the same people).
As for the "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen, as the OS has its own processes to run. But if you ignore that -- no DBA worth their salt would put a database in an environment where it has to share resources with applications. People who do that are probably Web developers who don't have high expectations of their database anyway and would have no idea how to configure / tune it for high performance. So it doesn't matter how they run it; they aren't the target audience -- they are light years behind what's possible to achieve with their resources.
This has nothing to do with mmap, though. mmap shouldn't be used for storage applications for other reasons: it doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throwaway work, but definitely not for storing users' data or the database's own data.
> There's nothing special about kernel programmers.
Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.
> no DBA worth their salt would put database in the environment where it has to share resources with applications.
Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There are far more DBs in use in the world than there are DBAs managing them.
> mmap shouldn't be used for storage applications for other reasons: it doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throwaway work, but definitely not for storing users' data or the database's own data.
Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
> "people who've studied computer architecture" - which most application developers never have
If you are developing a DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.
From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.
Not to be gatekeeping or anything, but it's a pretty well-studied field with lots of very knowledgeable people around, who are probably more than keen to help. There aren't too many qualified jobs around, and you probably have a budget if you are developing a database commercially.
> mmap doesn't allow its users to precisely control the persistence aspect
It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().
msync lets you force a flush so you can control the latest possible moment for a writeout. But the OS can flush before that, and you have no way to detect or control that. So you can only control the late side of the timing, not the early side. And in databases, you usually need writes to be persisted in a specific order; early writes are just as harmful as late writes.
I'd even take a memory ordering guarantee, something like, within each page, data is read out sequentially as atomic aligned 64-bit reads with acquire ordering. (Though this probably is what you get on AMD64.) As-is, there's not even a guarantee against an atomic aligned write being torn when written out.
That is absolutely not what you actually get from the hardware.
For fun, there is no guarantee about the order in which the contents of a page are written. SQLite documents that they assume (but cannot verify) that _sector_ writes are linear, but not atomic.
https://www.sqlite.org/atomiccommit.html
> If a power failure occurs in the middle of a sector write it might be that part of the sector was modified and another part was left unchanged. The key assumption by SQLite is that if any part of the sector gets changed, then either the first or the last bytes will be changed. So the hardware will never start writing a sector in the middle and work towards the ends. We do not know if this assumption is always true but it seems reasonable.
You are talking several levels higher than that, at the page level (composed of multiple sectors).
Assume that they reside in _different_ physical locations, and are written at different times. That's fun.
Every HDD since the 1980s has guaranteed atomic sector writes:
> Currently all hard drive/SSD manufacturers guarantee that 512 byte sector
> writes are atomic. As such, failure to write the 106 byte header is not
> something we account for in current LMDB releases. Also, failures of this type
> should result in ECC errors in the disk sector - it should be impossible to
> successfully read a sector that was written incorrectly in the ways you describe.
> Even in extreme cases, the probability of failure to write the leading 128 out
> of 512 bytes of a sector is nearly nil - even on very old hard drives, before
> 512-byte sector write guarantees. We would have to go back nearly 30 years to
> find such a device, e.g.
> Page 23, Section 2.1
> "No damage or loss of data will occur if power is applied or removed during
> drive operation, except that data may be lost in the sector being written at
> the time of power loss."
> From the specs on page 15, the data transfer rate to/from the platters is
> 1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to
> write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You
> would have to be very very unlucky to have a power failure hit the drive
> within this .1ms window of time. Fast-forward to present day and it's simply
> not an issue.
It also doesn't help when you are running on virtual / networked hardware. Nothing ensures that what you think is a sector write actually aligns properly with the hardware.
The design of the virtualized hardware provides that guarantee. I've worked on several such products. They all guarantee atomic sector writes (typically via copy-on-write).
> Most applications today are running on smartphones/mobile devices.
That's patently false. There are about 8 billion people. Even if everyone had a smartphone or two, it's nothing compared to the total of all devices that can be called a "computer". I think smart TVs alone will beat the number of smartphones. But even that is a drop in the bucket compared to the total of running programs on Earth / in its orbit.
But that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant to this conversation because they are the wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hope of achieving good performance, and whether they use mmap or not is of no consequence -- they lost the race before they even qualified for it.
> LMDB databases may have only one writer at a time
(Taken from the page above.) This isn't a serious contender in the database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't do.