BitNet Inference Tutorial

20 minutes to your first LLM inference with ternary weights


Goal

Run a BitNet b1.58 model and perform inference.

What you will learn:

  • How to download a BitNet model
  • How to run the Firebird engine
  • How to perform chat inference
  • How to measure performance

What is BitNet b1.58?

BitNet b1.58 is a neural network architecture whose weights are quantized to the ternary values {-1, 0, +1}. Encoding three states takes log2(3) ≈ 1.58 bits per weight, which is where the name comes from.

Advantage  Description
Memory     ~20x smaller than float32 weights
Compute    Addition/subtraction only (no multiplies)
Energy     Lower power consumption
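The quantization behind this can be sketched in a few lines. The NumPy snippet below is illustrative only (not Firebird's actual kernel): it applies the absmean scheme from the BitNet b1.58 paper, scaling a weight tensor by its mean absolute value and rounding into {-1, 0, +1}.

```python
import numpy as np

def quantize_ternary(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Absmean quantization: scale by mean |w|, then round to {-1, 0, +1}."""
    scale = np.mean(np.abs(w)) + 1e-8  # epsilon guards against all-zero tensors
    q = np.clip(np.round(w / scale), -1, 1)
    return q.astype(np.int8), scale

w = np.array([0.9, -0.4, 0.05, -1.2])
q, s = quantize_ternary(w)
print(q)  # → [ 1 -1  0 -1]
```

At inference time a matrix-vector product against `q` needs only additions and subtractions (plus one multiply by `s` per tensor), which is the source of the compute and energy advantages in the table above.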

Step 1: Download Model

# Create models directory
mkdir -p models

# Download BitNet b1.58-2B-4T GGUF
pip install huggingface_hub

python3 -c "
from huggingface_hub import hf_hub_download
hf_hub_download(
    'microsoft/bitnet-b1.58-2B-4T-gguf',
    'ggml-model-i2_s.gguf',
    local_dir='./models',
)
"

Size: ~1.1 GB
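To verify the download is a valid GGUF file before loading it, you can check the file header: GGUF files start with the 4-byte ASCII magic "GGUF". A minimal check (the on-disk path matches Step 1; run it only after the download finishes):

```python
import struct

def is_gguf(header: bytes) -> bool:
    """True if the bytes begin with the GGUF magic (b'GGUF' + u32 version)."""
    return len(header) >= 4 and header[:4] == b"GGUF"

# After downloading:
# with open("./models/ggml-model-i2_s.gguf", "rb") as f:
#     assert is_gguf(f.read(8)), "not a GGUF file"

print(is_gguf(b"GGUF" + struct.pack("<I", 3)))  # → True
```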


Step 2: Build Firebird

# Build Firebird LLM CLI
zig build firebird

# Or build TRI with Firebird included
zig build tri

Step 3: Run Inference

CPU Inference

# Interactive chat
./zig-out/bin/tri chat --model ./models/ggml-model-i2_s.gguf

# Single prompt
./zig-out/bin/tri chat --model ./models/ggml-model-i2_s.gguf \
  --prompt "Explain ternary computing"

Server Mode

# Start HTTP server
./zig-out/bin/tri serve --model ./models/ggml-model-i2_s.gguf --port 8080

# Query via API
curl -X POST http://localhost:8080/api/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt":"Hello Trinity"}'
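The same request can be issued from Python with only the standard library. This sketch mirrors the curl call above (same endpoint and JSON field); it assumes the server returns a plain response body, so adjust the parsing to whatever `tri serve` actually emits:

```python
import json
import urllib.request

def generate(prompt: str, host: str = "http://localhost:8080") -> str:
    """POST a prompt to the /api/generate endpoint and return the raw body."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps({"prompt": prompt}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode()

# Requires a running server from the command above:
# print(generate("Hello Trinity"))
```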

Step 4: Performance

Expected Results

Hardware        Speed        Notes
Apple M1/M2     5-15 tok/s   CPU only
x86_64 (AVX2)   10-20 tok/s  CPU only
RTX 3090 (GPU)  100K+ tok/s  via bitnet.cpp
H100 (GPU)      298K tok/s   via bitnet.cpp

Benchmark

# Run benchmark
./zig-out/bin/tri bench --model ./models/ggml-model-i2_s.gguf
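Whatever format the bench command reports, throughput reduces to generated tokens divided by wall-clock seconds, so you can also measure it by hand around any chat run:

```python
def tokens_per_second(n_tokens: int, elapsed_s: float) -> float:
    """Throughput in tok/s: generated tokens over wall-clock time."""
    return n_tokens / elapsed_s

# Example: 50 tokens in 4.0 s falls in the Apple M1/M2 range from the table above.
print(tokens_per_second(50, 4.0))  # → 12.5
```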

Code Example

const std = @import("std");
const firebird = @import("firebird");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    // Load model
    var model = try firebird.Model.load(allocator, "./models/ggml-model-i2_s.gguf");
    defer model.unload();

    // Generate
    const prompt = "The golden ratio is";
    const output = try model.generate(prompt, .{
        .max_tokens = 50,
        .temperature = 0.7,
    });

    std.debug.print("{s}\n", .{output});
    // → "The golden ratio is approximately 1.618, known as phi..."
}

Troubleshooting

Problem               Solution
Model file not found  Check the path to the GGUF file
Out of memory         Reduce the context size or use a smaller model
Slow inference        Use a GPU or a quantized model
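A quick sanity check for the "Out of memory" row is to estimate the packed weight size yourself. The sketch below assumes the i2_s format packs roughly 2 bits per ternary weight (an assumption about the packing, not a spec); the gap between this estimate (~0.5 GB for 2B parameters) and the ~1.1 GB file from Step 1 is presumably higher-precision tensors (embeddings, scales) and metadata:

```python
def packed_weight_gb(n_params: int, bits_per_weight: float = 2.0) -> float:
    """Approximate packed weight size in GB, assuming a fixed bits-per-weight."""
    return n_params * bits_per_weight / 8 / 1e9

print(round(packed_weight_gb(2_000_000_000), 2))  # → 0.5
```

For comparison, the same 2B parameters in float32 would be about 8 GB, which is the ~20x memory advantage cited earlier (relative to the ~1.58 effective bits per weight).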

What's Next?

Tutorial        Description
VSA Operations  Vector operations
DePIN Node      Run an inference node

φ² + 1/φ² = 3 = TRINITY