Golem-Base data storage

Ethereum Storage Fundamentals

To understand GolemBase's data storage approach, it's helpful to first review the fundamentals of how Ethereum storage works.

Ethereum storage is based on an internal key-value structure. The underlying storage engine is typically LevelDB or RocksDB (the Geth client has historically used LevelDB, and newer releases also support Pebble). The fundamental unit of storage in Ethereum is a 256-bit (32-byte) word, which is used for both keys and values. Where possible, the Solidity compiler packs multiple smaller typed values into a single 32-byte slot.

The core data structure Ethereum uses for integrity verification and compact storage is the Merkle Patricia Trie. Two aspects of this structure are particularly important:

Data resides in the leaves, while parent nodes reference their children via keys computed from the hashes of the children. This ensures integrity—any change affects the root hash, making tampering detectable (this is the Merkle part).
The trie uses a special structure (Branch Nodes, Extension Nodes, and Leaf Nodes) with specialized encodings (Hex-Prefix [HP] and Recursive Length Prefix [RLP]) to minimize storage. The goal is to extract shared prefixes of keys and store them in a common node (this is the Patricia part).
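The Hex-Prefix encoding mentioned above can be sketched in a few lines. This is a minimal illustration of the encoding as commonly described for Ethereum tries, not code taken from GolemBase; the function name `hexPrefixEncode` is chosen here for illustration. A path is a list of 4-bit nibbles, and the first encoded byte carries two flags: whether the node is a leaf and whether the path length is odd.

```go
package main

import "fmt"

// hexPrefixEncode packs a nibble path (each element in 0..15) into bytes
// using Hex-Prefix encoding. isLeaf sets the terminator flag that
// distinguishes a leaf node from an extension node.
// The function name is illustrative, not part of any GolemBase API.
func hexPrefixEncode(nibbles []byte, isLeaf bool) []byte {
	var flag byte
	if isLeaf {
		flag = 2
	}
	out := make([]byte, 0, len(nibbles)/2+1)
	if len(nibbles)%2 == 1 {
		// Odd length: the low nibble of the first byte carries the first path nibble.
		out = append(out, (flag|1)<<4|nibbles[0])
		nibbles = nibbles[1:]
	} else {
		// Even length: the low nibble of the first byte is zero padding.
		out = append(out, flag<<4)
	}
	// The remaining nibbles are packed two per byte.
	for i := 0; i < len(nibbles); i += 2 {
		out = append(out, nibbles[i]<<4|nibbles[i+1])
	}
	return out
}

func main() {
	fmt.Printf("%x\n", hexPrefixEncode([]byte{1, 2, 3, 4, 5}, false))      // prints 112345
	fmt.Printf("%x\n", hexPrefixEncode([]byte{0xf, 1, 0xc, 0xb, 8}, true)) // prints 3f1cb8
}
```

Because the flags live in the first nibble, a reader of the trie can tell leaf nodes from extension nodes and recover the exact path length without any extra metadata.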

Figure source: https://arxiv.org/pdf/2108.05513

Ethereum maintains several such tries, as it has multiple types of storage:

World State Trie – stores basic account data such as balance, nonce, storage root, and code hash (for contract accounts).
Transaction Trie – stores transaction details including nonce, gas price, gas limit, etc.
Receipt Trie – stores the results of processed transactions, such as gas used and logs.
Account Storage Trie – stores contract-specific data for each account.

The diagram below provides more detail on the general storage structure:

For more in-depth information on how Ethereum storage is organized internally, refer to this paper: https://arxiv.org/pdf/2108.05513

The key aspect of GolemBase storage is that it stores all its data in the Account Storage Trie under the hardcoded address:
0x0000000000000000000000000000000060138453

Storage schema for various GolemBase types

GolemBase consists of two types of stored data: Set and Payload (also referred to as Blob).
Sets are used to keep track of:

all entities in the GolemBase store,
all entities per address (data owner),
expiration times for each entity,
and annotations.

Payloads are used to store the actual entity data (as a byte array) and a Metadata structure (also called the metadata payload).
The metadata is encoded using RLP (Recursive Length Prefix) encoding. Below is the EntityMetaData structure:

type EntityMetaData struct {
    ExpiresAtBlock     uint64              `json:"expiresAtBlock"`
    StringAnnotations  []StringAnnotation  `json:"stringAnnotations"`
    NumericAnnotations []NumericAnnotation `json:"numericAnnotations"`
    Owner              common.Address      `json:"owner"`
}
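To give a feel for what the RLP-encoded metadata looks like on the wire, here is a minimal, stdlib-only RLP sketch. It assumes the struct is encoded as an RLP list of its fields in declaration order (the convention followed by go-ethereum's rlp package) and leaves both annotation lists empty, since the StringAnnotation and NumericAnnotation types are not shown here. All helper names (`rlpBytes`, `rlpList`, `encodeMetaData`, and so on) are hypothetical; a real implementation would more likely call an existing RLP library.

```go
package main

import "fmt"

// bigEndian returns v as big-endian bytes without leading zeros (empty for 0).
func bigEndian(v uint64) []byte {
	var out []byte
	for ; v > 0; v >>= 8 {
		out = append([]byte{byte(v)}, out...)
	}
	return out
}

// rlpHeader builds the length prefix; base is 0x80 for strings, 0xc0 for lists.
func rlpHeader(base byte, n int) []byte {
	if n <= 55 {
		return []byte{base + byte(n)}
	}
	lenBytes := bigEndian(uint64(n))
	return append([]byte{base + 55 + byte(len(lenBytes))}, lenBytes...)
}

// rlpBytes encodes a byte string. A single byte below 0x80 encodes as itself.
func rlpBytes(b []byte) []byte {
	if len(b) == 1 && b[0] < 0x80 {
		return []byte{b[0]}
	}
	return append(rlpHeader(0x80, len(b)), b...)
}

// rlpUint encodes an unsigned integer as its minimal big-endian byte string.
func rlpUint(v uint64) []byte { return rlpBytes(bigEndian(v)) }

// rlpList wraps already-encoded items in a list header.
func rlpList(items ...[]byte) []byte {
	var payload []byte
	for _, it := range items {
		payload = append(payload, it...)
	}
	return append(rlpHeader(0xc0, len(payload)), payload...)
}

// encodeMetaData encodes ExpiresAtBlock, two empty annotation lists, and Owner.
// Assumption: fields are emitted in struct declaration order.
func encodeMetaData(expiresAtBlock uint64, owner [20]byte) []byte {
	return rlpList(
		rlpUint(expiresAtBlock),
		rlpList(), // StringAnnotations (empty in this sketch)
		rlpList(), // NumericAnnotations (empty in this sketch)
		rlpBytes(owner[:]),
	)
}

func main() {
	var owner [20]byte
	owner[19] = 0x01
	enc := encodeMetaData(100, owner)
	fmt.Printf("%d bytes: %x\n", len(enc), enc)
}
```

Under these assumptions, metadata without annotations occupies 25 bytes, which is the kind of value that feeds into the metadata payload size used in the storage-size calculation.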

Schema of the Set store

The Set store consists of three elements:

A special entry that stores the number of elements in the Set. The key for this entry is the name of the Set, referred to as the SetKey.
A special entry whose key is computed by hashing the name of the Set (the SetKey), prefixed with golemBase.keyset.map. The value is the item to be stored, typically a hash referencing an Entity (i.e., the Entity’s key). This entry makes it possible to check whether a given value already exists in the Set without scanning every element.

Conceptually, this can be treated as adding the value to a Map, where the key is the Entity’s key. This structure allows for fast membership checks within the Set.
An entry containing the value we want to store in the Set. The key is computed as SetKey + index of the entry.

This part can be seen as an array of elements in the Set, where the offset is the SetKey and each element is stored at SetKey + index (typically hashed).

To add an item to the Set, three steps are required:

Update the length of the Set.
Add the special entry required for the contains operation (i.e., insert the item's value into the Map).
Add the entry to the Set (i.e., append the item's value to the array).

The process is illustrated in the following diagram:

AddToSetFlow
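The three-step insertion can be sketched with an in-memory map standing in for the Account Storage Trie. This is only an illustration under several assumptions: the exact key derivation is not spelled out above, so the sketch hashes the prefix, the Set name, and the value together for the membership entry, and hashes the Set name with the index for array entries. SHA-256 is used purely as a stand-in available in the Go standard library; Ethereum itself uses Keccak-256. All type and function names are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type slot = [32]byte

// Store stands in for the Account Storage Trie: 32-byte keys, 32-byte values.
type Store map[slot]slot

// hashKey derives a 32-byte key from its parts (SHA-256 is a stand-in hash).
func hashKey(parts ...[]byte) slot {
	h := sha256.New()
	for _, p := range parts {
		h.Write(p)
	}
	var out slot
	copy(out[:], h.Sum(nil))
	return out
}

// Key derivations below are assumptions made for this sketch.
func lengthKey(set string) slot { return hashKey([]byte(set)) }

func memberKey(set string, v slot) slot {
	return hashKey([]byte("golemBase.keyset.map"), []byte(set), v[:])
}

func indexKey(set string, i uint64) slot {
	var idx [8]byte
	binary.BigEndian.PutUint64(idx[:], i)
	return hashKey([]byte(set), idx[:])
}

// Len reads the element count stored under the Set's length entry.
func (s Store) Len(set string) uint64 {
	v := s[lengthKey(set)]
	return binary.BigEndian.Uint64(v[24:])
}

// Contains is a single lookup in the membership map, with no array scan.
func (s Store) Contains(set string, v slot) bool {
	_, ok := s[memberKey(set, v)]
	return ok
}

// Add performs the three writes and reports whether the value was new.
func (s Store) Add(set string, v slot) bool {
	if s.Contains(set, v) {
		return false
	}
	n := s.Len(set)
	var length slot
	binary.BigEndian.PutUint64(length[24:], n+1)
	s[lengthKey(set)] = length // 1. update the length of the Set
	s[memberKey(set, v)] = v   // 2. insert into the Map used by contains
	s[indexKey(set, n)] = v    // 3. append to the array at position n
	return true
}

func main() {
	s := Store{}
	entity := hashKey([]byte("example-entity"))
	fmt.Println(s.Add("all-entities", entity), s.Len("all-entities"))
	fmt.Println(s.Add("all-entities", entity), s.Len("all-entities"))
}
```

The design trade-off is visible in the write count: every new element costs two fresh entries plus an update of the length entry, in exchange for membership checks that do not depend on the size of the Set.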

Schema of the Payload/Blob store

A Payload is stored as a plain sequence of bytes under a hashed key, which can be treated as the unique ID of the entity. There are two ways of storing a Payload, depending on its size:

Payload <= 31 bytes: the Payload is saved as a single entry, since it fits into one 32-byte slot (1 byte is reserved for the length).
Payload > 31 bytes: the Payload has to be split into a sequence of entries, and the size information (number of sequences) is kept in the last byte of the first entry.

NOTE: To distinguish a payload packed into a single slot from one that continues into the following slots, the length is multiplied by 2, and 1 is added if the payload is a sequence (so the least significant bit indicates that the payload continues in the following slots).
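The length-flag scheme can be illustrated with a short sketch. It assumes a layout modeled on Solidity's handling of dynamic byte arrays: a short payload shares its slot with a trailing length byte, while a long payload gets a header slot holding the byte length times 2 plus 1, followed by 32-byte data slots. Whether GolemBase records the byte length or the number of slots in the header, and how the keys of the follow-up slots are derived, is not specified above, so those details are assumptions, and the function names are hypothetical.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// encodePayload returns the slot values for a payload.
// Assumed layout: short payloads are left-aligned with length*2 in the last
// byte; long payloads start with a header slot holding length*2+1.
func encodePayload(p []byte) [][32]byte {
	if len(p) <= 31 {
		var s [32]byte
		copy(s[:], p)
		s[31] = byte(len(p) * 2) // even value: least significant bit is 0
		return [][32]byte{s}
	}
	var header [32]byte
	// Odd value: least significant bit 1 marks a continued payload.
	binary.BigEndian.PutUint64(header[24:], uint64(len(p))*2+1)
	slots := [][32]byte{header}
	for off := 0; off < len(p); off += 32 {
		var s [32]byte
		copy(s[:], p[off:]) // copies at most 32 bytes; the tail is zero-padded
		slots = append(slots, s)
	}
	return slots
}

// isContinued reports whether the first slot signals follow-up slots.
func isContinued(first [32]byte) bool { return first[31]&1 == 1 }

func main() {
	short := encodePayload([]byte("hello"))
	long := encodePayload(make([]byte, 40))
	fmt.Println(len(short), isContinued(short[0]))
	fmt.Println(len(long), isContinued(long[0]))
}
```

A reader only needs to inspect the least significant bit of the first slot to decide whether a single read is enough or whether further slots must be fetched.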

Add Entity Flow

Now that we understand the two types GolemBase operates on, and how they are stored in detail, the process of adding a new Entity to GolemDB is illustrated in the following diagram:

LargeBlob

As shown, adding an Entity involves:

Inserting entries into three Set structures, where each insertion includes two Map entries (one for presence checking and one for value storage),
Inserting into N annotation Sets (for N annotations),
Creating two new entries: one for the Entity payload and one for the Entity metadata.

Each entry consumes 64 bytes: 32 bytes for the key and 32 bytes for the value.

Thus, the total storage size required to add an Entity in GolemBase is calculated as:

storage-size = 3 x 2 x 64 bytes + N x 2 x 64 bytes + 2 x 32 bytes + X bytes + Y bytes = 448 bytes + N x 128 bytes + (X + Y) bytes,

Where:

N is the number of annotations,
X is the size of the metadata payload,
Y is the size of the entity payload.
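As a quick check of the arithmetic, the formula can be written as a small function. The name `EntityStorageSize` and the constant names are chosen for illustration; the terms mirror the formula one by one.

```go
package main

import "fmt"

const (
	entryBytes = 64 // 32-byte key + 32-byte value
	keyBytes   = 32 // key of a payload or metadata entry
)

// EntityStorageSize returns the storage consumed by adding one Entity,
// where n is the number of annotations, x the metadata payload size and
// y the entity payload size, all in bytes.
func EntityStorageSize(n, x, y int) int {
	fixedSets := 3 * 2 * entryBytes      // three Sets, two entries each
	annotationSets := n * 2 * entryBytes // one Set insertion per annotation
	blobKeys := 2 * keyBytes             // keys for payload and metadata entries
	return fixedSets + annotationSets + blobKeys + x + y
}

func main() {
	// Two annotations, 50 bytes of metadata, 100 bytes of payload.
	fmt.Println(EntityStorageSize(2, 50, 100)) // prints 854
}
```

For example, an Entity with two annotations, 50 bytes of metadata, and a 100-byte payload requires 448 + 2 x 128 + 150 = 854 bytes.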