# BitNet Inference Tutorial

*20 minutes to your first LLM inference with ternary weights*
## Goal

Run a BitNet b1.58 model and perform inference.

What you will learn:

- How to download a BitNet model
- How to run the Firebird engine
- How to perform chat inference
- How to measure performance
## What is BitNet b1.58?

BitNet b1.58 is a neural network architecture in which every weight is quantized to one of three ternary values: {-1, 0, +1}. Three states carry log₂(3) ≈ 1.58 bits of information, hence the name.
| Advantage | Description |
|---|---|
| Memory | ~20× smaller than float32 (≈1.58 bits vs 32 bits per weight) |
| Compute | Matrix multiplies reduce to additions and subtractions |
| Energy | Lower power consumption |
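The quantization itself is simple: the BitNet b1.58 paper scales each weight matrix by the mean absolute value of its entries, then rounds and clips to {-1, 0, +1}. The sketch below shows that "absmean" scheme on a plain Python list; the actual GGUF `i2_s` format additionally bit-packs the codes, which is not shown here.

```python
def quantize_ternary(weights):
    """Absmean quantization to {-1, 0, +1}, as in BitNet b1.58.

    log2(3) ~= 1.58 bits per weight vs 32-bit floats -> ~20x smaller.
    """
    # Scale: mean absolute value of the weights (fall back to 1.0 if all-zero).
    gamma = sum(abs(w) for w in weights) / len(weights) or 1.0

    def round_clip(x):
        # Round to the nearest integer, then clip into the ternary range.
        return max(-1, min(1, round(x)))

    return [round_clip(w / gamma) for w in weights], gamma

w = [0.42, -1.3, 0.05, 2.1, -0.02]
q, gamma = quantize_ternary(w)
print(q)  # → [1, -1, 0, 1, 0]
```

Large weights saturate to ±1 and small ones snap to 0, which is where the sparsity and the add/subtract-only compute come from.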
## Step 1: Download Model

```sh
# Create models directory
mkdir -p models

# Download BitNet b1.58-2B-4T GGUF
pip install huggingface_hub
python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'microsoft/bitnet-b1.58-2B-4T-gguf',
    'ggml-model-i2_s.gguf',
    local_dir='./models'
)
"
```
Size: ~1.1 GB
## Step 2: Build Firebird

```sh
# Build Firebird LLM CLI
zig build firebird

# Or build TRI with Firebird included
zig build tri
```
## Step 3: Run Inference

### CPU Inference

```sh
# Interactive chat
./zig-out/bin/tri chat --model ./models/ggml-model-i2_s.gguf

# Single prompt
./zig-out/bin/tri chat --model ./models/ggml-model-i2_s.gguf \
  --prompt "Explain ternary computing"
```
### Server Mode

```sh
# Start HTTP server
./zig-out/bin/tri serve --model ./models/ggml-model-i2_s.gguf --port 8080

# Query via API
curl -X POST http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello Trinity"}'
```
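The same request can be made from Python with only the standard library. The `/api/generate` endpoint and the `{"prompt": ...}` body come from the curl example above; the shape of the server's response is not specified in this tutorial, so this sketch returns the raw body and leaves parsing to you.

```python
import json
import urllib.request

def payload(prompt):
    """Build the JSON request body used in the curl example above."""
    return json.dumps({"prompt": prompt}).encode("utf-8")

def generate(prompt, host="http://localhost:8080"):
    """POST a prompt to the tri serve API and return the raw response body."""
    req = urllib.request.Request(
        host + "/api/generate",
        data=payload(prompt),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

# With the server from this step running:
#   print(generate("Hello Trinity"))
```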
## Step 4: Performance

### Expected Results

| Hardware | Speed | Notes |
|---|---|---|
| Apple M1/M2 | 5-15 tok/s | CPU only |
| x86_64 (AVX2) | 10-20 tok/s | CPU only |
| RTX 3090 (GPU) | 100K+ tok/s | via bitnet.cpp |
| H100 (GPU) | 298K tok/s | via bitnet.cpp |
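To check your own hardware against this table, the arithmetic is just tokens generated divided by wall-clock seconds; wrap any generation call with two timestamps and divide.

```python
import time  # time.perf_counter() gives a monotonic wall-clock for timing

def tokens_per_second(n_tokens, seconds):
    """Throughput = tokens generated / wall-clock seconds elapsed."""
    return n_tokens / seconds

# Typical use around a generation call:
#   t0 = time.perf_counter()
#   ... generate n_tokens ...
#   rate = tokens_per_second(n_tokens, time.perf_counter() - t0)

# Example: 128 tokens in 10.7 s lands inside the Apple M1/M2 row above.
print(round(tokens_per_second(128, 10.7), 1))  # → 12.0
```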
### Benchmark

```sh
# Run benchmark
./zig-out/bin/tri bench --model ./models/ggml-model-i2_s.gguf
```
## Code Example

```zig
const std = @import("std");
const firebird = @import("firebird");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Load model
    var model = try firebird.Model.load(allocator, "./models/ggml-model-i2_s.gguf");
    defer model.unload();

    // Generate
    const prompt = "The golden ratio is";
    const output = try model.generate(prompt, .{
        .max_tokens = 50,
        .temperature = 0.7,
    });
    std.debug.print("{s}\n", .{output});
    // → "The golden ratio is approximately 1.618, known as phi..."
}
```
## Troubleshooting

| Problem | Solution |
|---|---|
| Model file not found | Check the path to the GGUF file |
| Out of memory | Reduce the context size or use a smaller model |
| Slow inference | Use a GPU backend or a smaller quantized model |
## What's Next?

| Tutorial | Description |
|---|---|
| VSA Operations | Vector operations |
| DePIN Node | Run an inference node |
φ² + 1/φ² = 3 = TRINITY