Understanding the BitTorrent Protocol: A Technical Analysis and Experimentation Guide

0. Introduction

Times change, and many things have quietly sunk into the ocean of data. Still, we should not forget the internet that existed for mutual communication and information sharing. With browser/server (B/S) and client/server (C/S) architectures now dominant, it is worth occasionally recalling why peer-to-peer mattered. Setting aside questions of right and wrong, let’s revisit it from a purely technical perspective, especially as its presence diminishes.

This article analyzes the BitTorrent protocol to build a clearer understanding of it and to provide a theoretical basis for further experiments. **It does not cover DHT, PEX, or other extensions that have not been officially confirmed by the BDFL. Those may be discussed in other articles.**

1. Overview

BitTorrent is a protocol for distributing files. Unlike FTP and HTTP, when several downloads of the same file run at the same time, the downloaders upload to each other. The protocol can therefore serve a large number of download requests without significantly increasing the load on the file source.

2. Components

The official documentation describes the components of a BitTorrent file distribution as follows:

  • An ordinary web server
  • A static ‘metainfo’ file
  • A BitTorrent tracker
  • An ‘original’ downloader
  • The end user web browsers
  • The end user downloaders

That is: a web server, a metainfo file (the torrent file), a tracker, an “original” downloader (the initial seeder), the users’ browsers, and the users’ downloaders.

The web server and the users’ browsers only serve to distribute the metainfo file. It can just as well be delivered some other way, for example by posting it in a group chat.

The tracker lets downloaders find each other by exchanging peer information.

Downloading and seeding are usually done by the same program, so both sides can be regarded as downloaders.

3. Encoding

The specification defines an encoding, known as bencoding, for four types: strings, integers, lists, and dictionaries. Integers, lists, and dictionaries are wrapped in a type-specific start marker and a common end marker. Specifically:

Strings

A string consists of a decimal number representing its length, followed by a colon and the string itself.

Examples:

  • 4:spam corresponds to the string ‘spam’, where “4” represents the length of the string;
  • 5:cloud corresponds to ‘cloud’;
  • 7:tencent corresponds to ‘tencent’;
  • 3:com corresponds to ‘com’;
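
One detail worth illustrating is that, in practice, the length prefix counts bytes rather than characters, which matters for non-ASCII text. The following is a minimal sketch under that assumption; `encode_string` is a hypothetical helper name and is not part of the specification.

```python
def encode_string(value: bytes) -> bytes:
    """Bencode a byte string as <length>:<bytes> (length counted in bytes)."""
    return str(len(value)).encode("ascii") + b":" + value


print(encode_string(b"spam"))
# "café" is 4 characters but 5 bytes in UTF-8, so the prefix is 5.
print(encode_string("café".encode("utf-8")))
```

Working on bytes throughout avoids ambiguity, since values such as piece hashes are not valid text at all.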

Integers

An integer starts with a lowercase “i” and ends with a lowercase “e”, with the number written in decimal in between.

Examples:

  • i3e corresponds to the number 3;
  • i-3e corresponds to the integer -3;
  • i0e corresponds to the number 0;

Note: i0e is the only encoding that may begin with a zero; other numbers must not have leading zeros (for example, i03e is invalid). i-0e is not allowed either.

Mnemonic: int number end
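
The leading-zero and negative-zero rules are easy to overlook in a parser. As a rough sketch of how they could be enforced, the following uses a regular expression; `decode_strict_integer` is a hypothetical name chosen for this illustration.

```python
import re

# Either a single "0", or an optional minus sign followed by digits
# that do not start with zero.
_INT_RE = re.compile(rb"i(0|-?[1-9][0-9]*)e")


def decode_strict_integer(data: bytes):
    """Return (value, remaining bytes); raise ValueError on malformed input."""
    match = _INT_RE.match(data)
    if match is None:
        raise ValueError(f"invalid bencoded integer: {data[:16]!r}")
    return int(match.group(1).decode("ascii")), data[match.end():]


print(decode_strict_integer(b"i-3e4:spam"))
for bad in (b"i03e", b"i-0e", b"ie"):
    try:
        decode_strict_integer(bad)
    except ValueError as exc:
        print("rejected:", exc)
```

A lenient parser that simply calls `int()` on the digits would accept all three malformed inputs' numeric parts where they parse, which is why the check is done on the raw bytes first.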

Lists

A list starts with a lowercase “l” and ends with “e”; its elements are bencoded values placed back to back in between.

Examples:

  • l4:spam4:eggse corresponds to [‘spam’, ‘eggs’]

Mnemonic: list item item … end

Dictionaries

Dictionaries are similar to lists: they start with a lowercase “d” and end with “e”, with keys and values alternating in between.

Examples:

  • d3:cow3:moo4:spam4:eggse corresponds to the dictionary {‘cow’: ‘moo’, ‘spam’: ‘eggs’}

Keys must be strings, and they must appear in sorted order, compared as raw strings rather than alphanumerically.

Mnemonic: dictionary key value key value … end
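
To make the ordering rule concrete, here is a small sketch that sorts keys as raw bytes. The helper name `keys_in_canonical_order` and the sample keys are assumptions made for illustration only.

```python
def keys_in_canonical_order(keys):
    """Sort dictionary keys as raw byte strings (str keys are UTF-8 encoded)."""
    return sorted(k.encode("utf-8") if isinstance(k, str) else k for k in keys)


# In byte order, uppercase "Z" (0x5A) sorts before lowercase "c" (0x63).
print(keys_in_canonical_order(["spam", "cow", "Zebra"]))
# Typical top-level torrent keys, given in arbitrary order.
print(keys_in_canonical_order(["info", "creation date", "announce", "created by"]))
```

A case-insensitive or locale-aware sort would produce a different order for the first list, so comparing raw bytes is the safer choice when encoding.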

Metadata File (Torrent File)

In BitTorrent, the metadata (metainfo) file is the familiar torrent file (.torrent). For readability, “torrent” and “torrent file” below both refer to this metadata file.

Torrent Structure

The torrent file as a whole is a single bencoded dictionary. Written in a JSON-like notation for readability, its structure is roughly as follows:

    {
        "announce": "TrackerAddressString",
        "info": {
            "name": "NameString",
            "piece length": 262144,
            "pieces": "hashString...",
            "length": 3276800
        }
    }

Or like this:

    {
        "announce": "TrackerAddressString",
        "info": {
            "name": "NameString",
            "piece length": 262144,
            "pieces": "hashString...",
            "files": [
                {
                    "length": 4096,
                    "path": ["Folder", "FileName1"]
                },
                {
                    "length": 8192,
                    "path": ["Folder", "FileName2"]
                },
                {
                    "length": 16384,
                    "path": ["Folder", "Subfolder", "FileName3"]
                }
            ]
        }
    }
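
As an illustration of how the multi-file layout could be read, the sketch below lists each file's relative path and sums the lengths. It assumes that, in the multi-file case, name is used as the top-level directory; `list_files` is a hypothetical helper, and plain str keys are used for readability.

```python
from pathlib import PurePosixPath

# The multi-file example above, reduced to the fields used here.
info = {
    "name": "NameString",
    "files": [
        {"length": 4096, "path": ["Folder", "FileName1"]},
        {"length": 8192, "path": ["Folder", "FileName2"]},
        {"length": 16384, "path": ["Folder", "Subfolder", "FileName3"]},
    ],
}


def list_files(info):
    """Yield (relative path, length) for each entry of a multi-file info dict."""
    for entry in info["files"]:
        yield PurePosixPath(info["name"], *entry["path"]), entry["length"]


files = list(list_files(info))
total_length = sum(length for _, length in files)
for path, length in files:
    print(path, length)
print("total bytes:", total_length)
```

The total matters because pieces are cut from the files treated as one continuous stream, in the order they appear in the list.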

In detail:

  • announce: The URL of the tracker, string
  • info: Dictionary
    • name: UTF-8 encoded suggested name for saving the file (or directory), string
    • piece length: The number of bytes in each piece the file is split into, integer
    • pieces: The concatenation of the SHA-1 hashes of all pieces, string. Each SHA-1 hash is 20 bytes long, so the length of this string is 20 times the number of pieces.
    • length or files: Distinguishes the single-file case from the multi-file case. **Exactly one** of length and files must be present; length is an integer and files is a list:
      • length: The number of bytes in the file, integer
      • files: A list of dictionaries, one per file. Each dictionary has:
        • length: The number of bytes in the file, integer
        • path: A list of UTF-8 encoded strings giving the subdirectory names, with the file name as the last element.
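
The constraints in this list can be checked mechanically. The following is a minimal sketch, not a complete validator: `check_info` is a hypothetical function, and it assumes the dictionary has already been decoded with plain str keys and a bytes value for pieces.

```python
def check_info(info: dict) -> int:
    """Check basic consistency of a decoded info dict; return the piece count."""
    has_length = "length" in info
    has_files = "files" in info
    if has_length == has_files:
        raise ValueError("exactly one of 'length' and 'files' must be present")

    total = info["length"] if has_length else sum(f["length"] for f in info["files"])

    if len(info["pieces"]) % 20 != 0:
        raise ValueError("'pieces' length must be a multiple of 20")

    piece_length = info["piece length"]
    expected = (total + piece_length - 1) // piece_length  # ceiling division
    actual = len(info["pieces"]) // 20
    if expected != actual:
        raise ValueError(f"expected {expected} piece hashes, found {actual}")
    return actual


# The single-file example above: 3276800 / 262144 = 12.5, so 13 pieces.
example = {
    "name": "NameString",
    "piece length": 262144,
    "pieces": bytes(20 * 13),  # placeholder zero bytes, not real hashes
    "length": 3276800,
}
print(check_info(example))
```

The last piece is usually shorter than the others, which is why the count is rounded up rather than truncated.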

Examples

Decoding

A simple parser follows directly from the specification. Python is used here as the example; other languages are omitted. The implementation is as follows:

    def decode_bencode(data):
        """Decode the first bencoded value in data; return (value, remaining bytes)."""

        # Decode string: <length>:<bytes>
        def decode_string(data):
            colon_idx = data.index(b':')
            length_str = data[:colon_idx].decode('utf-8')
            length = int(length_str)
            start_idx = colon_idx + 1
            end_idx = start_idx + length
            string = data[start_idx:end_idx]
            return string, data[end_idx:]

        # Decode integer: i<number>e
        def decode_integer(data):
            end_idx = data.index(b'e')
            integer_str = data[1:end_idx].decode('utf-8')
            return int(integer_str), data[end_idx + 1:]

        # Decode list: l<items>e
        def decode_list(data):
            decoded_list = []
            data = data[1:]
            while not data.startswith(b'e'):
                item, data = decode_bencode(data)
                decoded_list.append(item)
            return decoded_list, data[1:]

        # Decode dictionary: d<key><value>...e
        def decode_dict(data):
            decoded_dict = {}
            data = data[1:]
            while not data.startswith(b'e'):
                key, data = decode_string(data)
                value, data = decode_bencode(data)
                decoded_dict[key] = value
            return decoded_dict, data[1:]

        # Call the corresponding decoding function according to the data type
        if data.startswith(b'd'):
            return decode_dict(data)
        elif data.startswith(b'l'):
            return decode_list(data)
        elif data.startswith(b'i'):
            return decode_integer(data)
        else:
            return decode_string(data)

Instance

First, let’s look at a torrent created by qBittorrent and its decoded information.


    {
        "announce": "https://www.example.com/announce.php",
        "created by": "qBittorrent v4.4.5",
        "creation date": 1695555982,
        "info": {
            "length": 1373744,
            "name": "ChromeSetup.exe",
            "piece length": 524288,
            "pieces": b"L\xb2k\xd9\x83\xa4\x84\x84\x00g\xeb\xf7\x1d\xfe3\xa2\xd9\x95\x0f\\\xa6\xb2E\xcd!^\xe3\xed\x8a\x85\xe7>(\x99\x9dU\x06g%b\x08@\xc9\x9fG\xb8S\x8f\x067K#3\xa7\xbf\xb8`N\xac3"
        }
    }

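Reading this against the structure described earlier, the metadata says:

  • Tracker address: https://www.example.com/announce.php
  • Piece length: 524288
  • File size: 1373744
  • Suggested name: ChromeSetup.exe
  • Piece hash value: (omitted; 60 bytes, i.e. three 20-byte hashes, because 1373744 bytes at 524288 bytes per piece round up to 3 pieces)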

Hash Value Verification

The following program computes the SHA-1 hash of each piece of the file:

    import hashlib


    def calculate_sha1(data):
        """Return the SHA-1 digest of data as a hex string."""
        h = hashlib.sha1()
        h.update(data)
        return h.hexdigest()


    def slice_and_hash_file(f, s):
        """Read file f in slices of s bytes and return the hash of each slice."""
        h = []
        with open(f, 'rb') as file:
            while True:
                data = file.read(s)
                if not data:
                    break
                h.append(calculate_sha1(data))
        return h


    file_path = 'ChromeSetup.exe'
    slice_size = 524288  # Must equal the torrent's piece length (512 KiB here)
    slice_hashes = slice_and_hash_file(file_path, slice_size)
    for sha1_hash in slice_hashes:
        print(sha1_hash)

Running the program on the target file gives one hash per piece:

    4cb26bd983a484840067ebf71dfe33a2d9950f5c
    a6b245cd215ee3ed8a85e73e28999d5506672562
    0840c99f47b8538f06374b2333a7bfb8604eac33

Concatenated, these are exactly the bytes of the pieces value in the metadata.
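
The comparison itself can also be automated. The sketch below splits a pieces value into 20-byte digests and checks them against freshly computed hashes. The function names and the in-memory sample data are assumptions for illustration; a real check would read the downloaded file instead.

```python
import hashlib


def split_pieces(pieces: bytes):
    """Split the concatenated 'pieces' value into 20-byte SHA-1 digests."""
    return [pieces[i:i + 20] for i in range(0, len(pieces), 20)]


def verify(data: bytes, piece_length: int, pieces: bytes) -> bool:
    """Return True if every slice of data hashes to the matching digest."""
    expected = split_pieces(pieces)
    actual = [
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    ]
    return actual == expected


# Hypothetical stand-in data: 160 bytes cut into pieces of 64 bytes.
data = b"example payload " * 10
piece_length = 64
pieces = b"".join(
    hashlib.sha1(data[i:i + piece_length]).digest()
    for i in range(0, len(data), piece_length)
)

print(len(split_pieces(pieces)))                  # number of piece hashes
print(verify(data, piece_length, pieces))         # untouched data
print(verify(data + b"x", piece_length, pieces))  # data with one extra byte
```

Comparing raw digests rather than hex strings avoids a conversion step, since the metadata stores the hashes as raw bytes.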


Note also that qBittorrent adds the created by and creation date key-value pairs when it creates the torrent. These keys are mainly proposed by other BEPs and supported by the corresponding clients. For the version of the protocol discussed here they are not mandatory, and their presence does not pose a significant risk.

Manually Creating a Metadata File

Next, let’s construct a torrent file by hand. A single-file torrent is used as the example, again with the Chrome installer ChromeSetup.exe.

Target File Properties

Assume the tracker address is https://www.example.com/announce and set the piece length to 2097152 bytes (2 MiB). The file size is 1373744 bytes, and the SHA-1 of the file is 3b4964c5f6aead8d2aee4818976c041cb485b81b. Because the file is smaller than one piece, there is exactly one piece, and its hash is the hash of the whole file. Filling in the data structure gives:

    {
        "announce": "https://www.example.com/announce",
        "info": {
            "length": 1373744,
            "name": "ChromeSetup.exe",
            "piece length": 2097152,
            "pieces": b';Id\xc5\xf6\xae\xad\x8d*\xeeH\x18\x97l\x04\x1c\xb4\x85\xb8\x1b'
        }
    }

Use the following program for encoding:

    def encode_bencode(data):
        if isinstance(data, dict):
            # Encode dictionary; keys are emitted in sorted order
            encoded_data = b'd'
            for key in sorted(data.keys()):
                encoded_data += encode_bencode(key) + encode_bencode(data[key])
            encoded_data += b'e'
        elif isinstance(data, list):
            # Encode list
            encoded_data = b'l'
            for item in data:
                encoded_data += encode_bencode(item)
            encoded_data += b'e'
        elif isinstance(data, int):
            # Encode integer
            encoded_data = f"i{data}e".encode('utf-8')
        elif isinstance(data, bytes):
            # Encode raw byte string (needed for the "pieces" hashes)
            encoded_data = str(len(data)).encode('utf-8') + b':' + data
        elif isinstance(data, str):
            # Encode text string; the length prefix counts UTF-8 bytes
            raw = data.encode('utf-8')
            encoded_data = str(len(raw)).encode('utf-8') + b':' + raw
        else:
            raise ValueError("Unsupported data type")
        return encoded_data

Save the encoded result to a file. The encoded result is as follows:

    b'd8:announce32:https://www.example.com/announce4:infod6:lengthi1373744e4:name15:ChromeSetup.exe12:piece lengthi2097152e6:pieces20:;Id\xc5\xf6\xae\xad\x8d*\xeeH\x18\x97l\x04\x1c\xb4\x85\xb8\x1bee'

Open the file in qBittorrent to verify it. Verification passed.

qBittorrent Recognition Successful

Metadata File – Complete

This concludes the first part, on the metadata file. Tracker communication and peer communication will be covered in follow-up articles. The links will be placed here: