> your DBMS is the only process running on a machine. In practice, (a) is never true, and (b) is no longer true because everyone is running apps inside containers inside shared VMs.
There's nothing special about kernel programmers. In fact, if I had to compare, I'd say storage people are the more experienced / knowledgeable ones. They work in a highly competitive environment, which requires a lot more understanding and inventiveness to succeed, whereas kernel programmers proper don't compete -- Linux won many years ago. Kernel programmers who deal with stuff like drivers or various "extensions" are largely in the same group as storage people (oftentimes literally the same people).
As for the "single process" argument... well, if you run a database inside an OS, then, obviously, that will never happen, as the OS has its own processes to run. But if you ignore that -- no DBA worth their salt would put a database in an environment where it has to share resources with applications. People who do that are probably Web developers who don't have high expectations of their database anyway and would have no idea how to configure / tune it for high performance. So it doesn't matter how they run it; they aren't the target audience -- they are light years behind what's possible to achieve with their resources.
This has nothing to do with mmap, though. mmap shouldn't be used for storage applications for other reasons: it doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throwaway work, but definitely not for storing users' data or the database's own data.
> There's nothing special about kernel programmers.
Yes, that was a shorthand generalization for "people who've studied computer architecture" - which most application developers never have.
> no DBA worth their salt would put database in the environment where it has to share resources with applications.
Most applications today are running on smartphones/mobile devices. That means they're running with local embedded databases - it's all about "edge computing". There are far more DBs in use in the world than there are DBAs managing them.
> mmap shouldn't be used for storage applications for other reasons: it doesn't allow its users to precisely control the persistence aspect... which is kind of the central point of databases. So it's a mostly worthless tool in that context. Maybe fine for some throwaway work, but definitely not for storing users' data or the database's own data.
Well, you're half right. That's why by default LMDB uses a read-only mmap and uses regular (p)write syscalls for writes. But the central point of databases is to be able to persist data such that it can be retrieved again in the future, efficiently. And that's where the read characteristics of using mmap are superior.
> "people who've studied computer architecture" - which most application developers never have
If you are developing a DBMS and haven't studied computer architecture, the best idea is probably to ask more experienced people to help out with your ideas.
From my limited knowledge, I don't think the article is old enough to be obsolete, just that there's a lot more to it.
Not to be gatekeeping or anything, but it's a pretty well-studied field with lots of very knowledgeable people around, who are probably more than keen to help. There aren't too many qualified jobs around, and you probably have a budget if you are developing a database commercially.
> mmap doesn't allow its users to precisely control the persistence aspect
It's been a while since I've dealt with mmap(), but isn't this what msync() does? You can synchronously or asynchronously force dirty pages to be flushed to disk without waiting until munmap().
msync lets you force a flush so you can control the latest possible moment for a writeout. But the OS can flush before that, and you have no way to detect or control that. So you can only control the late side of the timing, not the early side. And in databases, you usually need writes to be persisted in a specific order; early writes are just as harmful as late writes.
I'd even take a memory ordering guarantee, something like, within each page, data is read out sequentially as atomic aligned 64-bit reads with acquire ordering. (Though this probably is what you get on AMD64.) As-is, there's not even a guarantee against an atomic aligned write being torn when written out.
That is absolutely not what you actually get from the hardware.
For fun, there is no guarantee about the order in which the contents of a page are written. SQLite documents that they assume (but cannot verify) that _sector_ writes are linear, but not atomic.
https://www.sqlite.org/atomiccommit.html
> If a power failure occurs in the middle of a sector write it might be that part of the sector was modified and another part was left unchanged. The key assumption by SQLite is that if any part of the sector gets changed, then either the first or the last bytes will be changed. So the hardware will never start writing a sector in the middle and work towards the ends. We do not know if this assumption is always true but it seems reasonable.
You are talking several levels higher than that, at the page level (composed of multiple sectors).
Assume that they reside in _different_ physical locations, and are written at different times. That's fun.
Every HDD since the 1980s has guaranteed atomic sector writes:
> Currently all hard drive/SSD manufacturers guarantee that 512 byte sector
> writes are atomic. As such, failure to write the 106 byte header is not
> something we account for in current LMDB releases. Also, failures of this type
> should result in ECC errors in the disk sector - it should be impossible to
> successfully read a sector that was written incorrectly in the ways you describe.
> Even in extreme cases, the probability of failure to write the leading 128 out
> of 512 bytes of a sector is nearly nil - even on very old hard drives, before
> 512-byte sector write guarantees. We would have to go back nearly 30 years to
> find such a device, e.g.
> Page 23, Section 2.1
> "No damage or loss of data will occur if power is applied or removed during
> drive operation, except that data may be lost in the sector being written at
> the time of power loss."
> From the specs on page 15, the data transfer rate to/from the platters is
> 1.25MB/sec, so the time to write one full sector is 0.4096ms; the time to
> write the leading 128 bytes of the sector is thus 1/4 of that: 0.10ms. You
> would have to be very very unlucky to have a power failure hit the drive
> within this .1ms window of time. Fast-forward to present day and it's simply
> not an issue.
It also doesn't help when you are running on virtual / networked hardware. Nothing ensures that what you think is a sector write actually aligns properly with the hardware.
The design of the virtualized hardware provides that guarantee. I've worked on several such products. They all guarantee atomic sector writes (typically via copy-on-write).
> Most applications today are running on smartphones/mobile devices.
That's patently false. There are about 8 billion people. Even if everyone had a smartphone or two, it's nothing compared to the total of all devices that can be called a "computer". I think smart TVs alone will beat the number of smartphones. But even that is a drop in the bucket compared to the total of running programs on Earth / in its orbit.
But that's beside the point. Smartphones aren't designed to run database servers. Even if they indeed were the majority, they'd still be irrelevant to this conversation because they are the wrong platform for deploying databases. In other words, it doesn't matter how people deploy databases to smartphones -- they have no hope of achieving good performance, and whether they use mmap or not is of no consequence -- they lost the race before they even qualified for it.
> LMDB databases may have only one writer at a time
(Taken from the page above.) This isn't a serious contender in the database server space. It's a toy database. You shouldn't give general advice based on whatever this system does or doesn't do.