Introduction#
In practice, it is easier to construct hashing algorithms which operate on relatively small, fixed input lengths, whilst still keeping the output length even smaller (\(l_{\text{out}}\) is still less than \(l_{\text{in}}\)). But hash functions are usually used on much larger inputs - for example, creating checksums for integrity verification of files. The Merkle-Damgård transform allows us to turn such a hash function, which operates on small fixed input lengths, into a hash function which operates on inputs of arbitrary lengths.
The Merkle-Damgård Construction#
In particular, given a compression function \(H'(\textit{input}: \textbf{str}[\textbf{const } l_{\text{in}}]) \to \textbf{str}[\textbf{const }l]\) which works with inputs of a "small", fixed input length \(l_{\text{in}}\) and has outputs with length \(l\), the Merkle-Damgård transform allows us to use \(H'\) to a construct a hash function \(H(\textit{input}: \textbf{str}[l_{\text{m}}]) \to \textbf{str}[\textbf{const }l]\) which takes messages of arbitrary length, denoted by \(l_{\text{m}}\), and produces digests of the same output length \(l\) as \(H'\).
The construction is similar to a [[index|block cipher]] in the sense that the message \(m\) is chopped up into blocks. In contrast to block ciphers, however, this is done rather differently. Each block has length \(l_{\text{in}}\) (since each block will be input into \(H'\)), but it is not comprised entirely of message bits. Instead, each block contains \(l_{\text{mb}}\) message bits and the other \(l_t\) bits (\(l_t = l_{\text{in}} - l_{\text{mb}}\)) represent the so-called chaining variable for the current block.
This means that the message \(m\) needs to be chopped up into \(q\) message fragments \(\mu_1, \mu_2, ..., \mu_q\), all of length \(l_{\text{mb}}\). If the message length \(l_m\) is not a multiple of \(l_{\text{mb}}\), then the message is padded by appending a 1 to it and then appending 0s until the message length is short of a multiple of the fragment length \(l_{\text{mb}}\) by exactly the number of bits \(l_e\) needed to encode the message length \(l_m\). The total length of the padding (including the 1, the 0s and the encoding of the message length) is denoted by \(l_{\text{pad}} = 1 + \textit{count}(0) + l_e\).
![[res/Merkle-Damgard Padding.svg]]
When the message length \(l_m\) is a multiple of the fragment length \(l_{\text{mb}}\), padding still needs to be added. In a particular, an additional padding block is appended to the message, following the exact same procedure as before. The padding block begins with a \(1\) and is followed by \(0\)s - the last \(l_e\) bits of the padding block again encode the message length \(l_m\).
Note
The number of bits reserved for encoding the message length \(l_{mb}\) is fixed for a given Merkle-Damgård construction. Usually \(l_e\) is 64 bits, resulting in a maximum message length of \(2^{64} - 1\), which is quite a reasonable limit.
After padding, the actual hash algorithm begins by appending an initialisation vector (IV) of length \(l\) to the first message fragment \(\mu_1\). The IV is always a constant which is pre-defined in the specification of the Merkle-Damgård construction.
Example
The SHA256 hash function uses the following 256-bit IV (the value is in hex):
This initialisation vector serves as the initial chaining variable \(t_1\). The concatenation \(\mu_1 || IV\) of the first message block \(\mu_1\) with the IV is passed to the compression function \(H'\), whose output becomes the next chaining variable. In general, the \(i\)-th iteration takes the \(i\)-th message block \(\mu_i\) and appends to it the chaining variable \(t_i\). The chaining variable \(t_i\) for the current stage is simply the output of \(H'\) from the previous iteration. The final output, i.e. the hash generated by the Merkle-Damgård function \(H\), is the final chaining variable.
![[res/Merkle-Damgård Construction.svg]]
Security of Merkle-Damgård Constructions#
The reason why the Merkle-Damgård transform is used ubiquitously is the fact that it preserves collision resistance.
Theorem: Merkle-Damgård Collision Resistance
If the compression function \(H'\) is collision resistant, then so is the Merkle-Damgård function \(H\).
Proof
Suppose, towards contradiction that there is an efficient collision finder \(\mathcal{A}\) which can find a collision in \(H\) with non-negligible probability. Let \(x\) and \(x'\) be two inputs of lengths \(L\) and \(L'\), respectively, such that \(H(x) = H(x')\). Let \(x_1, x_2, ..., x_q\) be the \(q\) blocks which \(x\) is divided into, and, let \(x_1', x_2', ..., x_{q'}'\) be the \(q'\) blocks which \(x'\) is divided into. Similarly, let \(t_1, t_2, ..., t_q, t_{q+1}\) and \(t_1', t_2', ..., t_{q'}', t_{q'+1}'\) be the chaining variables used at each iteration of the hashing of \(x\) and \(x'\), respectively (remember that the chaining variables \(t_{q+1}\) and \(t_{q'+1}'\) are also the output of \(H\)).
Case 1: If the two inputs have different lengths, i.e. \(L \ne L'\), then the hash \(H(x)\) is \(t_{q+1} \coloneqq H'(x_q||t_{q})\) and the hash \(H(x')\) is \(t_{q'+1}' \coloneqq H'(x_q'||t_{q'}')\). However, \(H(x) = H(x')\) means that \(H'(x_q||t_{q}) = H'(x_q'||t_{q'}')\) which is a contradiction because \(L\ne L'\) and so \(x_q \ne x_{q'}'\) (remember that the length is appended to the message when padding) - we have found two different inputs which cause a collision in the collision resistant \(H'\).
Case 2: If the two inputs have the same length, i.e. \(L = L'\), then they are also divided into the same number of blocks \(q = q'\). Let \(I_i \coloneqq x_i||t_i\) denote the \(i\)-th input passed to \(H'\) when computing \(H(x)\), and let \(I_i' \coloneqq x_i'||t_i'\) denote the \(i\)-th input passed to \(H'\) when computing \(H(x')\). Additionally, we will denote the output of \(H(x)\) as \(I_{q+1} \coloneqq t_{q+1}\), and we will denote the output of \(H(x')\) as \(I_{q'+1}' \coloneqq t_{q'+1}'\).
Now, \(H(x) = H(x')\) and so \(I_{q+1} = t_{q+1} = I_{q'+1}' = t_{q'+1}'\). This can only happen if \(I_q = I_{q'}'\) or if \((I_q, I_{q'}')\) is a collision pair for \(H'\) and the same logic propagates backwards - in general, \(H'(I_i) = H'(I_i')\) can be true only if \(I_i = I_i'\) or if \((I_i, I_{i'}')\) is a collision pair for \(H'\). The inputs \(x, x'\) are a collision pair for \(H\) which means that \(x \ne x'\) and so there must be some index \(j\) for which \(x_j \ne x_j'\) which means for sure that \(I_j \ne I_j'\) and so \((I_j, I_j')\) turn out to be a collision in \(H'\), which is a contradiction.