Merkle proofs for offline data integrity
Introduction
Ideally we'd like to store everything in Ethereum storage, which is stored across thousands of computers and has extremely high availability (the data cannot be censored) and integrity (the data cannot be modified in an unauthorized manner), but storing a 32-byte word typically costs 20,000 gas. As I'm writing this, that cost is equivalent to $6.60. At 21 cents per byte this is too expensive for many uses.
To solve this problem the Ethereum ecosystem developed many alternative ways to store data in a decentralized fashion. Usually they involve a tradeoff between availability and price. However, integrity is usually assured.
In this article you learn how to ensure data integrity without storing the data on the blockchain, using Merkle proofs(opens in a new tab).
How does it work?
In theory we could just store the hash of the data on chain, and send all the data in transactions that require it. However, this is still too expensive. A byte of data to a transaction costs about 16 gas, currently about half a cent, or about $5 per kilobyte. At $5000 per megabyte, this is still too expensive for many uses, even without the added cost of hashing the data.
The solution is to repeatedly hash different subsets of the data, so for the data that you don't need to send you can just send a hash. You do this using a Merkle tree, a tree data structure where each node is a hash of the nodes below it:
The root hash is the only part that needs to be stored on chain. To prove a certain value, you provide all the hashes that need to be combined with it to obtain the root. For example, to prove C
you provide D
, H(A-B)
, and H(E-H)
.
Implementation
The sample code is provided here(opens in a new tab).
Off-chain code
In this article we use JavaScript for the off-chain computations. Most decentralized applications have their off-chain component in JavaScript.
Creating the Merkle root
First we need to provide the Merkle root to the chain.
1const ethers = require("ethers")2
We use the hash function from the ethers package(opens in a new tab).
1// The raw data whose integrity we have to verify. The first two bytes a2// are a user identifier, and the last two bytes the amount of tokens the3// user owns at present.4const dataArray = [5 0x0bad0010, 0x60a70020, 0xbeef0030, 0xdead0040, 0xca110050, 0x0e660060,6 0xface0070, 0xbad00080, 0x060d0091,7]8
Encoding each entry into a single 256-bit integer results in less readable code than using JSON, for example. However, this means significantly less processing to retrieve the data in the contract, so much lower gas costs. You can read JSON on chain(opens in a new tab), it's just a bad idea if avoidable.
1// The array of hash values, as BigInts2const hashArray = dataArray3
In this case our data is 256-bit values to begin with, so no processing is needed. If we use a more complicated data structure, such as strings, we need to make sure we hash the data first to get an array of hashes. Note that this is also because we don't care if users know other users' information. Otherwise we would have had to hash so user 1 won't know the value for user 0, user 2 won't know the value for user 3, etc.
1// Convert between the string the hash function expects and the2// BigInt we use everywhere else.3const hash = (x) =>4 BigInt(ethers.utils.keccak256("0x" + x.toString(16).padStart(64, 0)))5
The ethers hash function expects to get a JavaScript string with a hexadecimal number, such as 0x60A7
, and responds with another string with the same structure. However, for the rest of the code it's easier to use BigInt
, so we convert to a hexadecimal string and back again.
1// Symmetrical hash of a pair so we won't care if the order is reversed.2const pairHash = (a, b) => hash(hash(a) ^ hash(b))3
This function is symmetrical (hash of a xor(opens in a new tab) b). This means that when we check the Merkle proof we don't need to worry about whether to put the value from the proof before or after the calculated value. Merkle proof checking is done on chain, so the less we need to do there the better.
Warning:
Cryptography is harder than it looks.
The initial version of this article had the hash function hash(a^b)
.
That was a bad idea because it meant that if you knew the legitimate values of a
and b
you could use b' = a^b^a'
to prove any desired a'
value.
With this function you'd have to calculate b'
such that hash(a') ^ hash(b')
is equal to a known value (the next branch on the way to root), which is a lot harder.
1// The value to denote that a certain branch is empty, doesn't2// have a value3const empty = 0n4
When the number of values is not an integer power of two we need to handle empty branches. The way this program does it is to put zero as a place holder.
1// Calculate one level up the tree of a hash array by taking the hash of2// each pair in sequence3const oneLevelUp = (inputArray) => {4 var result = []5 var inp = [...inputArray] // To avoid over writing the input // Add an empty value if necessary (we need all the leaves to be // paired)67 if (inp.length % 2 === 1) inp.push(empty)89 for (var i = 0; i < inp.length; i += 2)10 result.push(pairHash(inp[i], inp[i + 1]))1112 return result13} // oneLevelUp14அனைத்தையும் காட்டு
This function "climbs" one level in the Merkle tree by hashing the pairs of values at the current layer. Note that this is not the most efficient implementation, we could have avoided copying the input and just added hashEmpty
when appropriate in the loop, but this code is optimized for readability.
1const getMerkleRoot = (inputArray) => {2 var result34 result = [...inputArray] // Climb up the tree until there is only one value, that is the // root. // // If a layer has an odd number of entries the // code in oneLevelUp adds an empty value, so if we have, for example, // 10 leaves we'll have 5 branches in the second layer, 3 // branches in the third, 2 in the fourth and the root is the fifth56 while (result.length > 1) result = oneLevelUp(result)78 return result[0]9}10அனைத்தையும் காட்டு
To get the root, climb until there is only one value left.
Creating a Merkle proof
A Merkle proof is the values to hash together with the value being proved to get back the Merkle root. The value to prove is often available from other data, so I prefer to provide it separately rather than as part of the code.
1// A merkle proof consists of the value of the list of entries to2// hash with. Because we use a symmetrical hash function, we don't3// need the item's location to verify the proof, only to create it4const getMerkleProof = (inputArray, n) => {5 var result = [], currentLayer = [...inputArray], currentN = n67 // Until we reach the top8 while (currentLayer.length > 1) {9 // No odd length layers10 if (currentLayer.length % 2)11 currentLayer.push(empty)1213 result.push(currentN % 214 // If currentN is odd, add with the value before it to the proof15