RunPod GPU Deployment

RunPod provides on-demand GPU instances suitable for high-throughput BitNet b1.58 inference. This guide walks through deploying Trinity on a RunPod instance for benchmarking and production inference.

| GPU | VRAM | Cost Tier | Expected Throughput (2B model) | Best For |
|-----|------|-----------|--------------------------------|----------|
| NVIDIA H100 SXM | 80 GB | High | ~300K+ tok/s | Maximum performance, AVX-512 VNNI on CPU |
| NVIDIA A100 80GB | 80 GB | Medium | ~274K tok/s | Production workloads |
| NVIDIA RTX 3090 | 24 GB | Low | ~298K tok/s | Cost-effective benchmarking |
| NVIDIA RTX 4090 | 24 GB | Medium | ~310K tok/s | Best consumer-grade performance |

For the BitNet b1.58-2B-4T model (1.1 GB GGUF), even a 24 GB GPU has more than sufficient VRAM. The CPU capabilities of the host (particularly AVX-512 VNNI support) also significantly affect throughput, as bitnet.cpp uses CPU kernels for ternary weight unpacking.
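To see why VRAM is not the bottleneck, a back-of-envelope calculation of the raw weight footprint helps. The figure below assumes ideal 1.58-bit packing with no metadata; the shipped GGUF is larger (~1.1 GB) because of embeddings, quantization scales, and format overhead:

```shell
# Ideal-packing weight footprint for a 2B-parameter ternary model:
# 2e9 parameters x 1.58 bits, converted to megabytes.
awk 'BEGIN { printf "%d MB\n", 2e9 * 1.58 / 8 / 1e6 }'   # prints "395 MB"
```

Either way, the model fits comfortably in 24 GB of VRAM with room for KV cache and activations.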

Setup Steps

1. Create a RunPod Instance

  1. Sign up at runpod.io
  2. Create a new GPU pod with your chosen GPU type
  3. Select a PyTorch or Ubuntu template (provides CUDA and Python)
  4. Ensure at least 20 GB of disk space for the model and build tools

2. Connect and Run the Benchmark Script

Trinity includes a pre-built benchmark script for H100 instances:

```bash
# SSH into your RunPod instance, then:
git clone https://github.com/gHashTag/trinity.git
cd trinity

# Run the automated benchmark script
bash scripts/runpod_h100_bitnet.sh
```

The script (scripts/runpod_h100_bitnet.sh) automates the entire process:

  • Verifies hardware (CPU features, GPU type, AVX-512 support)
  • Installs dependencies (clang, cmake, Python packages)
  • Clones and builds Microsoft's bitnet.cpp with optimized kernels
  • Downloads the BitNet b1.58-2B-4T GGUF model from HuggingFace
  • Runs thread scaling tests (1, 2, 4, 8, 16, max threads)
  • Executes 12 diverse prompts with 500-token generation each
  • Produces a results report and JSON metrics file
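The thread scaling tests can be sketched as a simple loop over thread counts. This is an illustration only; the binary path, model path, and flags are assumptions, so consult scripts/runpod_h100_bitnet.sh for the actual commands:

```shell
# Sketch of the thread-scaling loop: run the same generation at each
# thread count and compare tokens/sec. The commented-out invocation
# shows the assumed shape of the real benchmark command.
MAX_THREADS=$(nproc)
for t in 1 2 4 8 16 "$MAX_THREADS"; do
  echo "benchmarking with $t threads"
  # ./build/bin/llama-cli -m model.gguf -p "$PROMPT" -n 500 -t "$t"
done
```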

3. Manual Setup (Alternative)

If you prefer manual control:

```bash
# Install Zig
curl -LO https://ziglang.org/download/0.13.0/zig-linux-x86_64-0.13.0.tar.xz
tar -xf zig-linux-x86_64-0.13.0.tar.xz
export PATH="$PWD/zig-linux-x86_64-0.13.0:$PATH"

# Clone and build Trinity
git clone https://github.com/gHashTag/trinity.git
cd trinity
zig build firebird

# For bitnet.cpp inference, follow the benchmark script steps
```
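For the bitnet.cpp half of the manual setup, the steps below are a sketch based on the upstream Microsoft BitNet repository's documented workflow; repository layout, model name, and flags may have changed, so verify against the benchmark script before relying on them:

```shell
# Clone and set up Microsoft's bitnet.cpp (upstream repo: microsoft/BitNet)
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
pip install -r requirements.txt

# Download the GGUF model and build with the i2_s ternary kernel
huggingface-cli download microsoft/BitNet-b1.58-2B-4T-gguf \
  --local-dir models/BitNet-b1.58-2B-4T
python setup_env.py -md models/BitNet-b1.58-2B-4T -q i2_s

# Smoke-test inference
python run_inference.py -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Hello" -n 64
```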

4. Server Mode for API Access

To expose inference as an HTTP API on your RunPod instance:

```bash
./bin/vibee serve --port 8080
```

Configure the RunPod instance to expose port 8080 via the RunPod proxy URL, which provides HTTPS access to your running service.
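Once the port is exposed, you can exercise the API from your workstation. The request below is a hypothetical shape for illustration only; the endpoint path and JSON fields are assumptions, so check the server's own documentation (e.g. `./bin/vibee serve --help`) for the actual routes:

```shell
# Hypothetical request against the RunPod proxy URL for your pod.
# Replace <pod-id> with your pod's ID; the /generate route and the
# prompt/max_tokens fields are assumed, not confirmed.
curl -s -X POST "https://<pod-id>-8080.proxy.runpod.net/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello", "max_tokens": 64}'
```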

Cost Considerations

  • RunPod charges by the hour for GPU instances. Stop your pod when not in use.
  • The H100 SXM is the most expensive but provides the best throughput.
  • For cost-effective testing, the RTX 3090 delivers comparable per-token throughput at a fraction of the cost.
  • A single benchmark run (12 prompts, 500 tokens each) typically completes in under 5 minutes on H100 hardware.
  • The benchmark script reminds you to stop the pod when finished.
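Since billing is hourly and a benchmark run finishes in minutes, a quick estimate shows why short runs are cheap. The hourly rate below is a placeholder, not a real RunPod quote:

```shell
# Illustrative cost of one benchmark run; RATE is a hypothetical $/hour price.
RATE=3.00      # assumed hourly price for the pod (placeholder)
MINUTES=5      # approximate benchmark wall time on H100
awk -v r="$RATE" -v m="$MINUTES" 'BEGIN { printf "$%.2f\n", r * m / 60 }'   # prints "$0.25"
```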

Output Files

After running the benchmark script, results are saved to:

| File | Contents |
|------|----------|
| /root/bitnet_h100_results.txt | Human-readable results with all prompts and outputs |
| /root/bitnet_h100_metrics.json | Machine-readable JSON with per-test timing and throughput |

Troubleshooting

  • Missing AVX-512: Some RunPod instances use older CPUs. The script detects this and falls back to a manual cmake build without TL2 optimizations.
  • Build failures: Ensure clang and cmake are installed. The script handles this automatically.
  • Tokenizer warnings: The GGUF model may show a "missing pre-tokenizer type" warning. The benchmark script overrides this with --override-kv "tokenizer.ggml.pre=str:llama-bpe".
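If you run inference by hand rather than through the script, the override is applied on the command line. The sketch below assumes a llama.cpp-style `llama-cli` binary and an i2_s GGUF; the binary and model paths are illustrative:

```shell
# Suppress the pre-tokenizer warning by overriding the GGUF metadata key.
# --override-kv takes KEY=TYPE:VALUE, a standard llama.cpp option.
./build/bin/llama-cli -m models/ggml-model-i2_s.gguf \
  --override-kv "tokenizer.ggml.pre=str:llama-bpe" \
  -p "Hello" -n 32
```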