# Merkle Patricia Trie

Uhariri wa mwisho: @wackerow(opens in a new tab), 3 Julai 2024

The state of Ethereum (the totality of all accounts, balances, and smart contracts), is encoded into a special version of the data structure known generally in computer science as a Merkle Tree. This structure is useful for many applications in cryptography because it creates a verifiable relationship between all the individual pieces of data entangled in the tree, resulting in a single **root** value that can be used to prove things about the data.

Ethereum's data structure is a 'modified Merkle-Patricia Trie', named so because it borrows some features of PATRICIA (the Practical Algorithm To Retrieve Information Coded in Alphanumeric), and because it is designed for efficient data re**trie**val of items that comprise the Ethereum state.

A Merkle-Patricia trie is deterministic and cryptographically verifiable: The only way to generate a state root is by computing it from each individual piece of the state, and two states that are identical can be easily proven so by comparing the root hash and the hashes that led to it (*a Merkle proof*). Conversely, there is no way to create two different states with the same root hash, and any attempt to modify state with different values will result in a different state root hash. Theoretically, this structure provides the 'holy grail' of `O(log(n))`

efficiency for inserts, lookups and deletes.

In the near future, Ethereum plans to migrate to a Verkle Tree(opens in a new tab) structure, which will open up many new possibilities for future protocol improvements.

## Prerequisites

To better understand this page, it would be helpful to have basic knowledge of hashes(opens in a new tab), Merkle trees(opens in a new tab), tries(opens in a new tab) and serialization(opens in a new tab). This article begins with a description of a basic radix tree(opens in a new tab), then gradually introduces the modifications necessary for Ethereum's more optimized data structure.

## Basic radix tries

In a basic radix trie, every node looks as follows:

1 [i_0, i_1 ... i_n, value]

Where `i_0 ... i_n`

represent the symbols of the alphabet (often binary or hex), `value`

is the terminal value at the node, and the values in the `i_0, i_1 ... i_n`

slots are either `NULL`

or pointers to (in our case, hashes of) other nodes. This forms a basic `(key, value)`

store.

Say you wanted to use a radix tree data structure for persisting an order over a set of key value pairs. To find the value currently mapped to the key `dog`

in the trie, you would first convert `dog`

into letters of the alphabet (giving `64 6f 67`

), and then descend the trie following that path until you find the value. That is, you start by looking up the root hash in a flat key/value DB to find the root node of the trie. It is represented as an array of keys pointing to other nodes. You would use the value at index `6`

as a key and look it up in the flat key/value DB to get the node one level down. Then pick index `4`

to look up the next value, then pick index `6`

, and so on, until, once you followed the path: `root -> 6 -> 4 -> 6 -> 15 -> 6 -> 7`

, you would look up the value of the node and return the result.

There is a difference between looking something up in the 'trie' and the underlying flat key/value 'DB'. They both define key/value arrangements, but the underlying DB can do a traditional 1 step lookup of a key. Looking up a key in the trie requires multiple underlying DB lookups to get to the final value described above. Let's refer to the latter as a `path`

to eliminate ambiguity.

The update and delete operations for radix tries can be defined as follows:

1 def update(node,path,value):2 curnode = db.get(node) if node else [ NULL ] * 173 newnode = curnode.copy()4 if path == '':5 newnode[-1] = value6 else:7 newindex = update(curnode[path[0]],path[1:],value)8 newnode[path[0]] = newindex9 db.put(hash(newnode),newnode)10 return hash(newnode)1112 def delete(node,path):13 if node is NULL:14 return NULL15 else:16 curnode = db.get(node)17 newnode = curnode.copy()18 if path == '':19 newnode[-1] = NULL20 else:21 newindex = delete(curnode[path[0]],path[1:])22 newnode[path[0]] = newindex2324 if all(x is NULL for x in newnode):25 return NULL26 else:27 db.put(hash(newnode),newnode)28 return hash(newnode)Onyesha yote

A "Merkle" Radix tree is built by linking nodes using deterministically-generated cryptographic hash digests. This content-addressing (in the key/value DB `key == keccak256(rlp(value))`

) provides a cryptographic integrity guarantee of the stored data. If the root hash of a given trie is publicly known, then anyone with access to the underlying leaf data can construct a proof that the trie includes a given value at a specific path by providing the hashes of each node joining a specific value to the tree root.

It is impossible for an attacker to provide a proof of a `(path, value)`

pair that does not exist since the root hash is ultimately based on all hashes below it. Any underlying modification would change the root hash. You can think of the hash as a compressed representation of structural information about the data, secured by the pre-image protection of the hashing function.

We'll refer to an atomic unit of a radix tree (e.g. a single hex character, or 4 bit binary number) as a "nibble". While traversing a path one nibble at a time, as described above, nodes can maximally refer to 16 children but include a `value`

element. We, hence, represent them as an array of length 17. We call these 17-element arrays "branch nodes".

## Merkle Patricia Trie

Radix tries have one major limitation: they are inefficient. If you want to store one `(path, value)`

binding where the path, like in Ethereum, is 64 characters long (the number of nibbles in `bytes32`

), we will need over a kilobyte of extra space to store one level per character, and each lookup or delete will take the full 64 steps. The Patricia trie introduced in the following solves this issue.

### Optimization

A node in a Merkle Patricia trie is one of the following:

`NULL`

(represented as the empty string)`branch`

A 17-item node`[ v0 ... v15, vt ]`

`leaf`

A 2-item node`[ encodedPath, value ]`

`extension`

A 2-item node`[ encodedPath, key ]`

With 64 character paths it is inevitable that after traversing the first few layers of the trie, you will reach a node where no divergent path exists for at least part of the way down. To avoid having to create up to 15 sparse `NULL`

nodes along the path, we shortcut the descent by setting up an `extension`

node of the form `[ encodedPath, key ]`

, where `encodedPath`

contains the "partial path" to skip ahead (using a compact encoding described below), and the `key`

is for the next DB lookup.

For a `leaf`

node, which can be marked by a flag in the first nibble of the `encodedPath`

, the path encodes all prior node's path fragments and we can look up the `value`

directly.

This above optimization, however, introduces ambiguity.

When traversing paths in nibbles, we may end up with an odd number of nibbles to traverse, but because all data is stored in `bytes`

format. It is not possible to differentiate between, for instance, the nibble `1`

, and the nibbles `01`

(both must be stored as `<01>`

). To specify odd length, the partial path is prefixed with a flag.

### Specification: Compact encoding of hex sequence with optional terminator

The flagging of both *odd vs. even remaining partial path length* and *leaf vs. extension node* as described above reside in the first nibble of the partial path of any 2-item node. They result in the following:

1hex char bits | node type partial path length2----------------------------------------------------------3 0 0000 | extension even4 1 0001 | extension odd5 2 0010 | terminating (leaf) even6 3 0011 | terminating (leaf) odd

For even remaining path length (`0`

or `2`

), another `0`

"padding" nibble will always follow.

1 def compact_encode(hexarray):2 term = 1 if hexarray[-1] == 16 else 03 if term: hexarray = hexarray[:-1]4 oddlen = len(hexarray) % 25 flags = 2 * term + oddlen6 if oddlen:7 hexarray = [flags] + hexarray8 else:9 hexarray = [flags] + [0] + hexarray10 // hexarray now has an even length whose first nibble is the flags.11 o = ''12 for i in range(0,len(hexarray),2):13 o += chr(16 * hexarray[i] + hexarray[i+1])14 return oOnyesha yote

Examples:

1 > [ 1, 2, 3, 4, 5, ...]2 '11 23 45'3 > [ 0, 1, 2, 3, 4, 5, ...]4 '00 01 23 45'5 > [ 0, f, 1, c, b, 8, 10]6 '20 0f 1c b8'7 > [ f, 1, c, b, 8, 10]8 '3f 1c b8'

Here is the extended code for getting a node in the Merkle Patricia trie:

1 def get_helper(node,path):2 if path == []: return node3 if node = '': return ''4 curnode = rlp.decode(node if len(node) < 32 else db.get(node))5 if len(curnode) == 2:6 (k2, v2) = curnode7 k2 = compact_decode(k2)8 if k2 == path[:len(k2)]:9 return get(v2, path[len(k2):])10 else:11 return ''12 elif len(curnode) == 17:13 return get_helper(curnode[path[0]],path[1:])1415 def get(node,path):16 path2 = []17 for i in range(len(path)):18 path2.push(int(ord(path[i]) / 16))19 path2.push(ord(path[i]) % 16)20 path2.push(16)21 return get_helper(node,path2)Onyesha yote

### Example Trie

Suppose we want a trie containing four path/value pairs `('do', 'verb')`

, `('dog', 'puppy')`

, `('doge', 'coins')`

, `('horse', 'stallion')`

.

First, we convert both paths and values to `bytes`

. Below, actual byte representations for *paths* are denoted by `<>`

, although *values* are still shown as strings, denoted by `''`

, for easier comprehension (they, too, would actually be `bytes`

):

1 <64 6f> : 'verb'2 <64 6f 67> : 'puppy'3 <64 6f 67 65> : 'coins'4 <68 6f 72 73 65> : 'stallion'

Now, we build such a trie with the following key/value pairs in the underlying DB:

1 rootHash: [ <16>, hashA ]2 hashA: [ <>, <>, <>, <>, hashB, <>, <>, <>, [ <20 6f 72 73 65>, 'stallion' ], <>, <>, <>, <>, <>, <>, <>, <> ]3 hashB: [ <00 6f>, hashC ]4 hashC: [ <>, <>, <>, <>, <>, <>, hashD, <>, <>, <>, <>, <>, <>, <>, <>, <>, 'verb' ]5 hashD: [ <17>, [ <>, <>, <>, <>, <>, <>, [ <35>, 'coins' ], <>, <>, <>, <>, <>, <>, <>, <>, <>, 'puppy' ] ]

When one node is referenced inside another node, what is included is `H(rlp.encode(node))`

, where `H(x) = keccak256(x) if len(x) >= 32 else x`

and `rlp.encode`

is the RLP encoding function.

Note that when updating a trie, one needs to store the key/value pair `(keccak256(x), x)`

in a persistent lookup table *if* the newly-created node has length >= 32. However, if the node is shorter than that, one does not need to store anything, since the function f(x) = x is reversible.

## Tries in Ethereum

All of the merkle tries in Ethereum's execution layer use a Merkle Patricia Trie.

From a block header there are 3 roots from 3 of these tries.

- stateRoot
- transactionsRoot
- receiptsRoot

### State Trie

There is one global state trie, and it is updated every time a client processes a block. In it, a `path`

is always: `keccak256(ethereumAddress)`

and a `value`

is always: `rlp(ethereumAccount)`

. More specifically an ethereum `account`

is a 4 item array of `[nonce,balance,storageRoot,codeHash]`

. At this point, it's worth noting that this `storageRoot`

is the root of another patricia trie:

### Storage Trie

Storage trie is where *all* contract data lives. There is a separate storage trie for each account. To retrieve values at specific storage positions at a given address the storage address, integer position of the stored data in the storage, and the block ID are required. These can then be passed as arguments to the `eth_getStorageAt`

defined in the JSON-RPC API, e.g. to retrieve the data in storage slot 0 for address `0x295a70b2de5e3953354a6a8344e616ed314d7251`

:

1curl -X POST --data '{"jsonrpc":"2.0", "method": "eth_getStorageAt", "params": ["0x295a70b2de5e3953354a6a8344e616ed314d7251", "0x0", "latest"], "id": 1}' localhost:854523{"jsonrpc":"2.0","id":1,"result":"0x00000000000000000000000000000000000000000000000000000000000004d2"}4

Retrieving other elements in storage is slightly more involved because the position in the storage trie must first be calculated. The position is calculated as the `keccak256`

hash of the address and the storage position, both left-padded with zeros to a length of 32 bytes. For example, the position for the data in storage slot 1 for address `0x391694e7e0b0cce554cb130d723a9d27458f9298`

is:

1keccak256(decodeHex("000000000000000000000000391694e7e0b0cce554cb130d723a9d27458f9298" + "0000000000000000000000000000000000000000000000000000000000000001"))

In a Geth console, this can be calculated as follows:

1> var key = "000000000000000000000000391694e7e0b0cce554cb130d723a9d27458f9298" + "0000000000000000000000000000000000000000000000000000000000000001"2undefined3> web3.sha3(key, {"encoding": "hex"})4"0x6661e9d6d8b923d5bbaab1b96e1dd51ff6ea2a93520fdc9eb75d059238b8c5e9"

The `path`

is therefore `keccak256(<6661e9d6d8b923d5bbaab1b96e1dd51ff6ea2a93520fdc9eb75d059238b8c5e9>)`

. This can now be used to retrieve the data from the storage trie as before:

1curl -X POST --data '{"jsonrpc":"2.0", "method": "eth_getStorageAt", "params": ["0x295a70b2de5e3953354a6a8344e616ed314d7251", "0x6661e9d6d8b923d5bbaab1b96e1dd51ff6ea2a93520fdc9eb75d059238b8c5e9", "latest"], "id": 1}' localhost:854523{"jsonrpc":"2.0","id":1,"result":"0x000000000000000000000000000000000000000000000000000000000000162e"}

Note: The `storageRoot`

for an Ethereum account is empty by default if it's not a contract account.

### Transactions Trie

There is a separate transactions trie for every block, again storing `(key, value)`

pairs. A path here is: `rlp(transactionIndex)`

which represents the key that corresponds to a value determined by:

1if legacyTx:2 value = rlp(tx)3else:4 value = TxType | encode(tx)

More information on this can be found in the EIP 2718(opens in a new tab) documentation.

### Receipts Trie

Every block has its own Receipts trie. A `path`

here is: `rlp(transactionIndex)`

. `transactionIndex`

is its index within the block it was included in. The receipts trie is never updated. Similar to the Transactions trie, there are current and legacy receipts. To query a specific receipt in the Receipts trie, the index of the transaction in its block, the receipt payload and the transaction type are required. The Returned receipt can be of type `Receipt`

which is defined as the concatenation of `TransactionType`

and `ReceiptPayload`

or it can be of type `LegacyReceipt`

which is defined as `rlp([status, cumulativeGasUsed, logsBloom, logs])`

.

More information on this can be found in the EIP 2718(opens in a new tab) documentation.