I need to implement the fastest possible way to store a key/value pair in a distributed system on Linux. Records of the database are tiny, 256 bytes on average.
I am thinking to use open(), write() and read() system calls and write the key-value pairs directly at some offset in the file. I can omit fdatasync() system call since I will be using SSD disk with battery, so I don't have to worry about ACID compliance if an unexpected shutdown of the system happens. Linux already provides disk cache implementation, so no reads/writes will happen on sectors that were already loaded in memory. This (i think) would be the fastest way to store data, much faster than any other cache capable database engine like for example GT.M or Intersystem's Globals.
However the data is not replicated, and to achieve replication, I can mount a filesystem of another Linux server with NFS and copy the data there, so for example, if I have 2 data servers (1 local and 1 remote), I would issue 2 open(), 2 write() and 2 close() calls. If a transaction fails on remote server, I would mark it as "out of sync" and simply copy the good file again when the remote server comes back.
What do you think of this approach? Will it be fast? I can use NFS over UDP so I will avoid the TCP Stack overhead.
Advantage list so far goes like this:
- Linux disk cache reused
- Few lines of code
- High performance
I will be coding this in C. To locate the record in the file I will keep a btree in memory with a pointer to physical location.
A few suggestions come to mind.
- is it necessary to open()/write()/close() for every transaction? the system call overhead of open() in particular is probably non-trivial
- could you use mmap() instead of explicit write()s?
- if you're doing 2 write() calls (1 local, 1 NFS) for each transaction, it seems like any kind of network problem (latency, dropped packets, etc.) has the potential to bring your application to a screeching halt if you're waiting for the NFS write() call to succeed. And if you're not waiting, for example by doing the NFS writes from a separate thread, your complexity will rapidly grow (I don't think "Few lines of code" will remain true.)
In general, I would suggest that you really prove to yourself that the available tools don't meet your performance requirements before choosing to re-invent this particular wheel.
You might look into a real distributed filesystem rather than using NFS, which as you point out, still provides a single point of failure and no replication.
The Andrew File System (AFS) originally developed by CMU may be a solution for you. It's a commercial product, but you might check out OpenAFS which works on linux (and other systems).
Warning though: AFS has a learning curve.