Canonical chain extraction tool

I recently got to work on a really fun problem while building our indexer and wanted to share a command line tool I made as a result.

The problem
Given a directory with a bunch of precomputed blocks (presumably, all current mainnet blocks), extract the canonical chain.

The solution
The idea is simple, but the rust code linked above isn’t so simple, the devil lies in the details. It goes like this:
Find the most recent canonical block (here I take canonical to mean that there are at least 10 descendants). Then, get that block’s parent hash, then get that block’s parent hash, etc until you end at the block with height 2 (the genesis block is not included as a precomputed block).

The command line tool linked above allows you to pass in a block directory and will perform the algorithm I just described and print the file names of all the blocks in the canonical chain in order, followed by a height sorted list of all successor blocks. It’s easy to use, just clone the repo

git clone git@github.com:Isaac-DeFrain/fn.git
cd fn/blockchain
cargo run --release -- backward -b path/to/your/precomputed/blocks/dir

and let the magic happen! The output file defaults to $HOME/.output but that can be changed via the -o flag. Also, you can prefix that command with RUST_LOG=info to have the logs keep you informed.

For more options, access the help menu via

cargo run -- --help

There are actually two canonical chain discovery algorithms included (forward and backward), but the forward one is super slow. That’s why I recommend using backward!

In more detail now
To find the most recent canonical block, we start by sorting all block file names in the directory under this lexicographic ordering:

  • blockchain_length (the current precomputed block file naming convention puts this u32 value in the file name; mainnet-{blockchain_length}-{state_hash}.json)
  • state_hash lexicographically as it’s base58 encoded String

Once, we have the length-sorted file names, all we need to do to find the highest canonical block is

  • start at any block with the highest length (sorry, highest length sounds weird)
  • go back 10 blocks

The parent hash is located in the JSON object obj["protoco_state"]["previous_state_hash"]. We definitely want avoid parsing this JSON into a full-blown PrecomputedBlock struct. Currently, the mechanism for getting this value is quite primitive; we simply open the file, seek to a specific byte index (75), and read exactly 52 bytes (the length of the base58 encoded state hash). Since the format is standardized, this works, but it’s less than ideal. Ultimately, I’d like to implement a custom serde Deserializer which only parses this field. But alas, art is long and life is short!

Once this is in place, we simply walk the canonical chain back to block 2 (or whatever the lowest block length is in your directory) and reverse the Vec of canonical block file names we’ve built. The successive blocks are just the length-sorted blocks that are higher than the canonical tip.

I’d love to hear your thoughts!

Enjoy!

2 Likes

Very cool, thanks @Quantifier!

2 Likes