Disk layout

General Layout
Here is the overall hierarchy of data storage within Pleiades:

Monolithic KVStore -> KVRangeStore -> Raft Shard -> Raft Replica -> RocksDB -> Host

Ultimately, RocksDB is the core storage engine on which everything is built. To optimize data storage and lower the number of files we have to track, data for multiple Raft Replicas is colocated into a single RocksDB instance. However, each replica's data is split into its own column family so we can support atomic writes across multiple replicas on the same host without contention.
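To make the colocation model concrete, here's a minimal sketch of opening one RocksDB instance with a column family per replica and batching a write across two replicas. This is not the actual Pleiades storage code: the grocksdb binding, the path, and the column family names are assumptions used purely for illustration.

```go
package main

import (
	"fmt"

	"github.com/linxGnu/grocksdb"
)

func main() {
	opts := grocksdb.NewDefaultOptions()
	opts.SetCreateIfMissing(true)
	opts.SetCreateIfMissingColumnFamilies(true)

	// One column family per Raft Replica hosted on this node, plus the
	// mandatory "default" family. The names are illustrative only.
	cfNames := []string{"default", "replica-1", "replica-2"}
	cfOpts := make([]*grocksdb.Options, len(cfNames))
	for i := range cfOpts {
		cfOpts[i] = grocksdb.NewDefaultOptions()
	}

	db, cfs, err := grocksdb.OpenDbColumnFamilies(opts, "/tmp/pleiades-example", cfNames, cfOpts)
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// A write touching two replicas on the same host: both mutations go
	// into one WriteBatch, so they commit (or fail) together.
	wb := grocksdb.NewWriteBatch()
	defer wb.Destroy()
	wb.PutCF(cfs[1], []byte("/shardId/raft/vote"), []byte("vote-for-replica-1"))
	wb.PutCF(cfs[2], []byte("/shardId/raft/vote"), []byte("vote-for-replica-2"))

	wo := grocksdb.NewDefaultWriteOptions()
	defer wo.Destroy()
	if err := db.Write(wo, wb); err != nil {
		panic(err)
	}
	fmt.Println("wrote to two replica column families in one batch")
}
```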

There are a few optimizations we can make through namespacing and delineation that are specific to RocksDB. The first is the column families, which provide isolation between the replicas. The second is namespacing the Raft metadata into the raft namespace and the application data into the data namespace. The third is a Raft-specific optimization: linearization - all operations, while asynchronous, are completely linear. These 3 optimizations allow a host to have almost complete atomicity for pretty much every key - we still wrap each write in a transaction, however.

To make things easier on the brain, here's a simple way of understanding how data is laid out on disk. Keep in mind this representation is human-friendly, whereas the actual implementation uses column families & byte-specific layouts.

/shardId/raft/ ->
/shardId/data/ ->

Over time, this might change, but for the most part these locations are static.
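Rendered as code, the split might look like a pair of prefix helpers. The function names and the slash-delimited strings are hypothetical; as noted above, the real layout uses column families and byte-specific encodings.

```go
package keys

import "fmt"

// raftPrefix is the namespace holding a shard's Raft metadata.
func raftPrefix(shardID uint64) []byte {
	return []byte(fmt.Sprintf("/%d/raft/", shardID))
}

// dataPrefix is the namespace holding a shard's application data.
func dataPrefix(shardID uint64) []byte {
	return []byte(fmt.Sprintf("/%d/data/", shardID))
}
```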

Raft Layout
While the shardId namespaces are column families, the core namespace delimiters are there for faster sorting of the keys.

/shardId/raft/vote -> vote
/shardId/raft/last_purged_log_id
/shardId/raft/logs/ -> logId
/shardId/raft/snapshots/metadata -> snapshotMetadata
/shardId/raft/snapshots/ ->
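Continuing the hypothetical keys package from above, the Raft keys could be built like this. The helper names and the big-endian log index encoding are assumptions, meant only to show how the namespace delimiters keep related keys sorted together.

```go
package keys

import (
	"encoding/binary"
	"fmt"
)

func raftVoteKey(shardID uint64) []byte {
	return []byte(fmt.Sprintf("/%d/raft/vote", shardID))
}

func raftLastPurgedLogIDKey(shardID uint64) []byte {
	return []byte(fmt.Sprintf("/%d/raft/last_purged_log_id", shardID))
}

// raftLogKey appends a big-endian log index so log entries sort in order
// under the logs/ delimiter.
func raftLogKey(shardID, logIndex uint64) []byte {
	key := []byte(fmt.Sprintf("/%d/raft/logs/", shardID))
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], logIndex)
	return append(key, idx[:]...)
}

func raftSnapshotMetadataKey(shardID uint64) []byte {
	return []byte(fmt.Sprintf("/%d/raft/snapshots/metadata", shardID))
}
```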

Data Key Encodings
The key encodings for application data are fairly straightforward but also incredibly powerful for addressing. Generally, Pleiades supports tagging, where keys are tagged with specific values that allow for fast retrievals, descriptor storage, and other general aspects. For each binary key (re: keys created by applications), 2 bytes are appended to the key and reserved for metadata that allows us to support complex key usages. The overall byte alignment looks like so:

[binary-key][delimiter, tag]

There are a few different types of delimiters that make decoding a bit easier and more consistent.

Generally, this allows us to support a couple of different use cases: general tags, the latest version, and specific versions. Regarding versioning, the latest version tag is a quick way for us to fetch the latest version of a key, as well as potentially signalling a cascade of data updates to older key versions.
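To make the two reserved bytes and the delimiter idea concrete, here is a minimal sketch of encoding and decoding them. The delimiter byte values (. for the descriptor, @ for the latest version, : for a specific version) are inferred from the human-readable examples below and are assumptions about the actual on-disk values.

```go
package keys

// Hypothetical delimiter bytes, inferred from the human-readable examples
// in this document; the real on-disk byte values may differ.
const (
	DelimDescriptor = byte('.') // key descriptor / metadata
	DelimLatest     = byte('@') // latest version of the key
	DelimVersion    = byte(':') // a specific version of the key
)

// encodeKey appends the 2 reserved metadata bytes to an application key:
// one delimiter byte followed by one tag byte.
func encodeKey(binaryKey []byte, delim, tag byte) []byte {
	out := make([]byte, 0, len(binaryKey)+2)
	out = append(out, binaryKey...)
	return append(out, delim, tag)
}

// decodeKey splits an encoded key back into its three parts.
func decodeKey(encoded []byte) (binaryKey []byte, delim, tag byte) {
	n := len(encoded)
	return encoded[:n-2], encoded[n-2], encoded[n-1]
}
```

For instance, encodeKey(key, DelimVersion, 3) would produce the ":3"-style key shown in the layout below; carrying the version in a single tag byte is also consistent with the 255-version cap described later.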

To understand how the last two bytes would logically look on disk, consider something like this:

/shardId/data/ .d ->
/shardId/data/ @l ->
/shardId/data/ :3 ->
/shardId/data/ :2 ->
/shardId/data/ :1 ->

As an important bit of information, the @l tag always points at the newest version of the key, and 255 is the maximum supported number of versions. Using the above example, there are 4 versions of the key, with the @l tag being version 4. If the key is going to be updated, creating a 5th version, the order of operations looks like this (sketched in code after the list):


 * 1) The descriptor key (.d) is read so we can get the key metadata, which contains the latest version information.
 * 2) The latest-version key (@l) is read so we can get the current value.
 * 3) The :4 key is created from @l (which is really just a rename).
 * 4) @l is updated with the new binary data.
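Here's a hedged sketch of those four steps, reusing the encodeKey helper and delimiter constants from the earlier sketch. The kvStore interface, the tag values, and the descriptor format are stand-ins, and the transaction wrapping mentioned above is omitted for brevity.

```go
package keys

// kvStore is a stand-in for the replica's column family; in practice every
// write below would be wrapped in a single transaction, as noted earlier.
type kvStore interface {
	Get(key []byte) ([]byte, error)
	Put(key, value []byte) error
	Delete(key []byte) error
}

// Hypothetical tag bytes; the document doesn't spell out what the tag byte
// holds for the descriptor or latest-version keys.
const (
	descriptorTag = byte(0)
	latestTag     = byte(0)
)

// updateKey creates the next version of key, following the four steps above.
func updateKey(kv kvStore, key, newValue []byte) error {
	// 1) Read the descriptor (.d) to learn the latest version number. The
	// descriptor layout is assumed here to keep the sketch short.
	desc, err := kv.Get(encodeKey(key, DelimDescriptor, descriptorTag))
	if err != nil {
		return err
	}
	latest := desc[0] // e.g. 4 in the example above

	// 2) Read the current value from the latest-version (@l) key.
	current, err := kv.Get(encodeKey(key, DelimLatest, latestTag))
	if err != nil {
		return err
	}

	// 3) "Rename" the current value to an explicit version key, e.g. :4.
	if err := kv.Put(encodeKey(key, DelimVersion, latest), current); err != nil {
		return err
	}

	// 4) Write the new binary data into @l and bump the descriptor.
	if err := kv.Put(encodeKey(key, DelimLatest, latestTag), newValue); err != nil {
		return err
	}
	desc[0] = latest + 1
	return kv.Put(encodeKey(key, DelimDescriptor, descriptorTag), desc)
}
```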

There is a specific scenario where an update will cause a cascading update. Extending the above example, let's say there are 255 existing versions of the key:

/shardId/data/ .d ->
/shardId/data/ @l ->
/shardId/data/ :254 ->
/shardId/data/ :253 ->
/shardId/data/ :252 ->

If you issue a key update, a rollover occurs: the oldest version drops off, and every existing version is shifted down by one so that the relative key versions remain consistent. For example, version 254 becomes version 253, version 253 becomes version 252, and so on. Overall there will be a total of 255 writes and 257 reads if a key is already at its maximum version of 255. It's up to the user to determine how they want to handle this.
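A sketch of the rollover itself, using the same stand-ins as the update sketch above; it only shows the renumbering pass, with the subsequent @l update following the normal four-step flow.

```go
package keys

const maxVersions = byte(255)

// rollover drops the oldest version and shifts every remaining explicit
// version down by one, freeing the top slot for the version being renamed
// out of @l by the normal update flow.
func rollover(kv kvStore, key []byte) error {
	// The oldest version falls off the end.
	if err := kv.Delete(encodeKey(key, DelimVersion, 1)); err != nil {
		return err
	}
	// Version 2 becomes 1, version 3 becomes 2, ..., 254 becomes 253.
	for v := byte(2); v < maxVersions; v++ {
		val, err := kv.Get(encodeKey(key, DelimVersion, v))
		if err != nil {
			return err
		}
		if err := kv.Put(encodeKey(key, DelimVersion, v-1), val); err != nil {
			return err
		}
	}
	return nil
}
```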

Vacuuming
The hooks in the key-value pair structs exist, but the vacuum logic hasn't been implemented yet. The general idea is that vacuuming the key store would garbage collect older key versions based on their vacuumable flag. Since we don't store versioned metadata, users would have to issue a read-then-write to update the flag. As none of this logic exists yet, suggestions are welcome!

One thing we might want to capture is keys which are always vacuumable (re: the vacuumable flag set on the initial put). This might allow for some optimizations, but that's to be determined!