An Introduction to Facebook MyRocks Database

Published in

The Startup

5 min readNov 5, 2020

How Facebook designed a database with 10X write throughput and 2X compression rate compared to InnoDB

Databases are at the heart of virtually all internet applications. Database throughput and latencies are normally the bottleneck of many internet services and have attracted attentions from many researchers. As the amount of data grows exponentially, it is more and more important to store the data efficiently. In this blog, we will introduce Facebook’s MyRocks database and examine how it improves the write throughput by 10X and compression rate by 2X compared to InnoDB.

The architecture of InnoDB

InnoDB, like other classic relational database, uses B-Tree to organize its data. A B-Tree is similar to a binary search tree with a few differences,

Each node has more than 1 value

As shown in the figure below, each node can have more than 1 value. For instance, there are 3 values in the root node. All the values inside the subtree rooted at the leftmost child of the root node are smaller than 100. All the values inside the subtree rooted at the second leftmost child of the root are between 100 and 155.

Each node takes a constant amount of space on disk

InnoDB allocates a constant amount of space for each node of the B-Tree. Suppose the database uses 28 bytes for each node. Root node has used all the space allocated. Assuming both the pointer and the value take 4 bytes each, we need 28 bytes to store values and pointers in the root node. However for the node storing 128 and 140, 8 bytes are wasted.

How Database B-Tree Indexing Works - DZone Database — B-Tree

The advantages of the InnoDB

Read is fast

The read operation is very fast with a B-Tree. To find the value we are interested in, we can use binary search. Suppose we want to find 145. At the root, we can follow the second leftmost pointer to reach the node holding 128 and 140. From that node, we can follow the rightmost edge to find the target value.

B-Tree in the database is balanced, which means it takes O(logₓN) to find the data. x is the number of values inside each node and N is total number of values stored in the database. In reality, each node can have 100 or even more values. Assume each entry in the database holds 1KB of user data. The database has 1,000,000,000 entries if it stores 1TB of data. The height of B-Tree is 5 if the average number of values inside each node is 100. Therefore we usually won’t need to exam more than 5 nodes to find the data and runtime for the read operation can be considered constant.

The issues with the InnoDB

Wasted storage

Since the database allocates a constant amount of storage to each node, spaces are wasted if the node doesn’t hold enough values. One of the reason that the database uses constant memory is actually to save storage! With constant memory, we can easily find out the location of the node with the following formula,

location = address_of_first_node + node_number * constant_memory_size

Should we keep a map from node number to its physical address, we will use more storage than those wasted due to B-Tree fragmentation.

From the experiment in [1], fragmentation in InnoDB wastes 25% — 30% space.

Write is slow

Assume the database allocates 32KB for each node. If we change content of one entry inside one node, we have to write the whole node back. Changing 1KB of data would actually cause 32 KB write to the disk!

How MyRocks DB solved it

MyRocks DB used a different format to store the data, as shown below.

Data are stored into an in-memory buffer called Mem-Table. (The data is written to a Write-Ahead Log as well otherwise there will be data loss if the machine crashes before the data inside Mem-Table is flushed to the disk.) The Mem-Table could use a classic balanced binary search tree to hold its data.

When the size of the Mem-Table reaches a certain threshold, those data are flushed to a sorted-string table (SST) data file. Each of the SSTs stores data in sorted order. Following figure is an example of SSTable, with integer as the key and string as the value.

SSTable

SSTables are immutable. As time goes by, we will have multiple SSTables, each holding a sorted list of data. We will merge those SSTable files into a bigger SSTable file, which we can level 2 SSTable file. When we have a few level 2 files, we will merge them to level 3 file and so on.

Write becomes faster

Writes are substantially faster. When we want to modify 1 KB of data, we only need to flush 1KB of data to the disk. (Well, this is not exactly true, there is another 1KB copy in the WAL.) We could even employ group-commit of the underlying file system to reduce the number of I/O’s.

Space usage is reduced

As can be seen in the figure, there is no space fragmentation.

Read becomes slower

There is no free lunch. Read becomes slower. Since the data could be inside the SSTable at any level. We need to do a binary search on all SSTable files at all levels! If we have 4 levels, the read could be 4 times slower. Though we can parallelize the read operations, it would introduce much more read traffic to the disk and limiting the database throughput.

Performance evaluation

Facebook launched MyRocks DB to its user database, which holds the data of all Facebook users. The space usage dropped from 2187 GB to 824 GB. The amount of data written dropped from 13MB/s to 3.4 MB/s. CPU seconds indicates the efficiency of the system and MyRocks DB outperforms InnDB as well.

Facebook has open sourced MyRocks DB, you can find it here if you are interested.

Reference

[1] MyRocks: LSM-Tree Database Storage Engine Serving Facebook’s Social Graph, Matsunobu et al., VLDB’ 20