What is the data format layout for txindex LevelDB values?

I understand the keys: t followed by the 32-byte transaction hash.

My problem is the values. From sources such as "What are the keys used in the blockchain levelDB (ie what are the key:value pairs)?", I understand that each value should encode three fields: the dat file number, the block offset within that file, and the tx offset within the block.

But I’ve noticed that the values vary in length, between 5 and 10 bytes across the first thousand entries, so I’m not sure how to decode them into those three fields. Are those fields simply 3 varint values?

Here’s my Plyvel code that prints each value's length and tries to decode it, using plyvel==1.5.1, Bitcoin Core v26.0.0 on Ubuntu 23.10:

#!/usr/bin/env python3

import struct

import plyvel

def decode_varint(data):
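    """
    Decode the variable length integer from the Bitcoin protocol documentation.
    Returns (value, number of bytes consumed).
    https://github.com/alecalve/python-bitcoin-blockchain-parser/blob/c06f420995b345c9a193c8be6e0916eb70335863/blockchain_parser/utils.py#L41
    """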
    assert(len(data) > 0)
    size = int(data[0])
    assert(size <= 255)

    if size < 253:
        return size, 1

    if size == 253:
        format_ = '<H'
    elif size == 254:
        format_ = '<I'
    elif size == 255:
        format_ = '<Q'
    else:
        # Should never be reached
        assert 0, "unknown format_ for size : %s" % size

    size = struct.calcsize(format_)
    return struct.unpack(format_, data[1:size+1])[0], size + 1

ldb = plyvel.DB('/home/ciro/snap/bitcoin-core/common/.bitcoin/indexes/txindex/', compression=None)
i = 0
for key, value in ldb:
    if key[0:1] == b't':
        txid = bytes(reversed(key[1:])).hex()
        print(i)
        print(txid)
        print(len(value))
        print(value.hex(' '))
        # Decode the three fields in order, tracking the cumulative offset.
        file, n = decode_varint(value)
        off = n
        blk_off, n = decode_varint(value[off:])
        off += n
        tx_off, n = decode_varint(value[off:])
        print((txid, file, blk_off, tx_off))
        print()
        i += 1

but it eventually blows up at:

131344
ec4de461b0dd1350b7596f95c0d7576aa825214d9af0e8c54de567ab0ce70800
7
42 ff c0 43 8b 94 35
Traceback (most recent call last):
  File "/home/ciro/bak/git/bitcoin-strings-with-txids/./tmp.py", line 39, in <module>
    blk_off, off = decode_varint(value[off:])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ciro/bak/git/bitcoin-strings-with-txids/./tmp.py", line 29, in decode_varint
    return struct.unpack(format_, data[1:size+1])[0], size + 1
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
struct.error: unpack requires a buffer of 8 bytes

So I wonder if I guessed the format wrong, or if it’s just a bug in my code.

Going by https://en.bitcoin.it/wiki/Protocol_documentation#Variable_length_integer, I would decode:

42 ff c0 43 8b 94 35

manually as:

  • 42
  • ff: expect 8 bytes next
    • c0 43 8b 94 35: only 5 bytes left, blowup

I also tried reversing the bytes of the value before decoding:

value = bytes(reversed(value))

but then it blows up very early, so that is definitely wrong.

I also tried ignoring the error to see if other entries decode, but the same error occurred hundreds of thousands of times, so something is definitely wrong.
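
One possibility worth testing: Bitcoin Core's serialize.h defines its own VARINT encoding (base-128, most significant group first, with 1 added on every continuation byte), which is different from the variable length integer described on the wiki page above. The sketch below assumes, without confirmation from the sources cited here, that each txindex value is three such integers in the order file number, block offset, tx offset. The function names read_core_varint and decode_txindex_value are made up for illustration.

```python
def read_core_varint(data, pos=0):
    """Decode one Bitcoin Core style VARINT starting at data[pos].

    Returns (value, next_pos). This is NOT the wiki's variable length integer.
    """
    n = 0
    while True:
        byte = data[pos]
        pos += 1
        # Each byte carries 7 payload bits, most significant group first.
        n = (n << 7) | (byte & 0x7F)
        if byte & 0x80:
            # High bit set: more bytes follow, and the encoding adds 1 here.
            n += 1
        else:
            return n, pos


def decode_txindex_value(value):
    """Assumed layout: three VARINTs (file number, block offset, tx offset)."""
    file_no, pos = read_core_varint(value, 0)
    blk_off, pos = read_core_varint(value, pos)
    tx_off, pos = read_core_varint(value, pos)
    if pos != len(value):
        raise ValueError("trailing bytes after three varints")
    return file_no, blk_off, tx_off


# The value that failed above.
sample = bytes.fromhex("42ffc0438b9435")
print(decode_txindex_value(sample))  # (66, 2105539, 199349)
```

On the failing value 42 ff c0 43 8b 94 35 this consumes exactly 7 bytes and yields file 66, block offset 2105539, tx offset 199349. Those numbers look plausible but have not been checked against the actual blk dat file here.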
