
Native FP16/BF16 Support Across Languages — Deep Analysis

Comparative analysis of half-precision floating-point support in Zig, Rust, Go, C/C++, and LLVM for positioning zig-half, GF16, TF3, and the Sensation System.


Executive Summary

| Language | f16 Type | Native SIMD | ML Features | Verdict |
|---|---|---|---|---|
| Zig | ✅ Built-in | ✅ Adaptive @Vector | ❌ Manual (zig-half) | Best foundation |
| Rust | ⚠️ RFC 3453 | ⚠️ Nightly intrinsics | ❌ External crates | Progress |
| Go | ❌ No std type | ❌ Manual ASM | ❌ External packages | Weakest |
| C/C++ | ⚠️ __fp16 storage | ⚠️ Intrinsics | ❌ Manual | Fragmented |
| LLVM | ⚠️ half IR type | ✅ Backend | ❌ Format-agnostic | Foundation |

Key insight: No language natively provides an "ML-grade half-precision stack". Trinity's zig-half + GF16/TF3 + Sensation System is a unique vertical integration from the language level (Level 0) down to FPGA RTL (Level 6).


Phase 1: Language-by-Language Analysis

Zig — Built-in f16 with Adaptive SIMD

| Feature | Status | Details |
|---|---|---|
| f16 type | ✅ Built-in | IEEE 754 binary16, native type since 0.11.0 |
| Native arithmetic | ⚠️ Target-dependent | ARM NEON has full f16; x86 requires AVX-512 FP16 |
| std.simd | ✅ Adaptive | @Vector(N, f16) compiles to platform SIMD |
| Math routines | ⚠️ Limited | Most ops promote: f16 → f32 → compute → f16 |
| ML-specific | ❌ None | No ternary/sparse natively (zig-half provides) |

Strengths:

  • Comptime SIMD: @Vector(8, f16) → AVX2/NEON automatically
  • Zero-cost abstraction: inline functions compile to optimal machine code
  • Explicit memory: no hidden allocations, perfect for ML workloads
  • Target features: std.Target.features for runtime CPU detection

Weaknesses:

  • f16 arithmetic promotes to f32 on x86 (pre-AVX-512)
  • No standard library ML ops (matmul, softmax, attention)
  • No ternary/sparse support (zig-half fills this gap)
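
The promote-compute-demote pattern mentioned above (f16 → f32 → compute → f16) can be sketched outside Zig as well. Here is a minimal Python illustration using the standard struct module's binary16 support; the function names are illustrative, not any zig-half API:

```python
import struct

def to_f16_bits(x: float) -> int:
    """Round a float to IEEE 754 binary16 and return its 16 raw bits."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def from_f16_bits(bits: int) -> float:
    """Decode 16 raw binary16 bits back to a Python float."""
    return struct.unpack('<e', struct.pack('<H', bits))[0]

def f16_mul(a_bits: int, b_bits: int) -> int:
    """The promotion pattern: no f16 hardware, so widen, compute, narrow."""
    # 1. promote f16 -> wider float (f32 on real hardware)
    a, b = from_f16_bits(a_bits), from_f16_bits(b_bits)
    # 2. compute at the wider precision
    # 3. demote the result back to f16, rounding again
    return to_f16_bits(a * b)
```

This double conversion per operation is exactly the overhead that native f16 arithmetic (AVX-512 FP16, ARMv8.2-A) eliminates.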


Verdict: Zig provides an excellent SIMD foundation plus basic f16; zig-half adds ML-grade ops (ternary, sparse, shadow storage) on top.


Rust — RFC 3453, Crate Ecosystem

| Feature | Status | Details |
|---|---|---|
| f16 type | ⚠️ RFC 3453 | In progress, software-float fallback |
| half crate | ✅ Popular | f16/bf16 wrapper types with ops |
| SIMD | ⚠️ Nightly intrinsics | std::simd or the packed_simd crate |
| bf16 | ⚠️ Via crates | bf16 → f32 → compute → bf16 promotion |
| ML-specific | ❌ No native | External crates (ndarray, candle) |

Strengths:

  • Type safety: f16/bf16 as distinct types
  • Crate ecosystem: half, half-bf16 well-maintained
  • Nightly SIMD: std::simd improving

Weaknesses:

  • f16 not stabilized (RFC 3453 ongoing)
  • ML features scattered across crates
  • No unified "ML stack" in std


Verdict: Rust is moving toward native f16, but ML-specific features still require external crates. zig-half is comparable to the half crate plus additional ML ops.


Go — Weakest FP16 Stack

| Feature | Status | Details |
|---|---|---|
| float16 type | ❌ No std support | Only external packages |
| float16 package | ⚠️ External | github.com/shogo82148/float16 |
| SIMD | ❌ No built-in | Pure Go (limited auto-vectorization) |
| SIMD via ASM | ⚠️ Manual | CPU-feature branches handwritten |
| ML-specific | ❌ None | No ecosystem |

Strengths:

  • Simple GC model (good for some ML workloads)
  • External float16 package works

Weaknesses:

  • No native f16 type in standard library
  • No SIMD support without assembly
  • ML ecosystem fragmented compared to Python/C++


Verdict: The Go stack is the weakest for fp16 ML workloads; zig-half significantly surpasses the typical Go solutions.


C/C++ / LLVM — Storage-Only Pattern

| Feature | Status | Details |
|---|---|---|
| __fp16 | ⚠️ Storage-only | Promotes to float32 for arithmetic |
| _Float16 | ⚠️ In development | Native arithmetic (ARMv8.2-A) |
| x86 fp16 | ❌ AVX-512 FP16 only | Sapphire Rapids+ (2024+) |
| bf16 | ⚠️ Dot-product only | VDPBF16PS-style instructions |

Strengths:

  • LLVM half type well-defined in IR
  • Extensive intrinsics for all platforms

Weaknesses:

  • fp16/bf16 primarily "storage formats" on most CPUs
  • General fp16 arithmetic requires new hardware (AVX-512 FP16, ARM SVE2)
  • C++ std::float16_t only arrived in C++23 (<stdfloat>); compiler/library support still patchy


Verdict: Standard C/C++ treats fp16 as "storage + matmul only", not full compute. zig-half provides higher-level abstractions.


Phase 2: The 8-Level Compilation Stack

Understanding where Trinity operates vs. native language stacks:

Level 0  Language (Zig/Rust/C)        ← YOU ARE HERE (GF16, TF3, Sensation)
Level 1  Frontend (AST → IR)          ← Zig compiler frontend
Level 2  LLVM IR (Middle-end)         ← Optimizations, SSA, vectorize
Level 3  SelectionDAG → MachineIR     ← WHERE fp16 "BREAKS"
Level 4  ISA (x86/ARM/RISC-V)         ← CPU instructions
Level 5  Microarchitecture (μarch)    ← Pipeline, execution units
Level 6  RTL / Gate-level (HDL)       ← YOU ARE ALSO HERE (FPGA!)
Level 7  Transistors / Physical       ← Silicon, lithography

Level 0 → Level 2: Language → LLVM IR

  • Zig/Rust/C compile to LLVM IR (SSA form)
  • f16 → half type in IR
  • Vector of 8 f16 → <8 x half>
  • Auto-vectorization happens here

Source: LLVM LangRef

Level 3: SelectionDAG → Machine IR (CRITICAL!)

This is where the fate of fp16/GF16 is decided:

  • CPU has fp16 hardware → direct instructions
  • CPU lacks fp16 → LLVM promotes half → float

GF16 CANNOT pass through natively: LLVM has no "6-bit exp + 9-bit mant" type, so encode/decode must be done manually at Level 0.

Source: LLVM SelectionDAG
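
Since LLVM cannot legalize a custom format, the Level 0 encode/decode can be sketched as follows. This is a minimal Python illustration assuming the "1 sign + 6 exp + 9 mant" layout with a conventional bias of 31; it handles normal values only and truncates the mantissa (rounding mode and special-value handling are left out, and this is not necessarily the project's actual implementation):

```python
import struct

GF16_EXP_BITS, GF16_MANT_BITS = 6, 9
GF16_BIAS = (1 << (GF16_EXP_BITS - 1)) - 1   # 31, assumed conventional bias

def _f32_bits(x: float) -> int:
    """Raw IEEE 754 binary32 bits of a float."""
    return struct.unpack('<I', struct.pack('<f', x))[0]

def gf16_encode(x: float) -> int:
    """Pack a normal float into 1 sign + 6 exp + 9 mant (truncating)."""
    b = _f32_bits(x)
    sign = (b >> 31) & 1
    exp = ((b >> 23) & 0xFF) - 127 + GF16_BIAS          # rebias 127 -> 31
    mant = (b >> (23 - GF16_MANT_BITS)) & ((1 << GF16_MANT_BITS) - 1)
    assert 0 < exp < (1 << GF16_EXP_BITS) - 1, "normal range only in this sketch"
    return (sign << 15) | (exp << GF16_MANT_BITS) | mant

def gf16_decode(g: int) -> float:
    """Unpack the 16-bit GF16 word back to a float."""
    sign = -1.0 if (g >> 15) & 1 else 1.0
    exp = ((g >> GF16_MANT_BITS) & ((1 << GF16_EXP_BITS) - 1)) - GF16_BIAS
    mant = g & ((1 << GF16_MANT_BITS) - 1)
    return sign * (1.0 + mant / (1 << GF16_MANT_BITS)) * 2.0 ** exp
```

Every GF16 operation on a CPU pays this encode/decode toll; only hardware (Level 6) can avoid it.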

Level 4: ISA — CPU Instructions

| ISA | fp16 Instructions | bf16 Instructions |
|---|---|---|
| x86 AVX2 | VCVTPH2PS/VCVTPS2PH (conversion only) | None |
| x86 AVX-512 FP16 | Full arithmetic (Sapphire Rapids+) | VDPBF16PS (dot only) |
| ARM NEON | FCVT half↔single | None |
| ARM SVE2 | Full fp16 arithmetic | BFDOT, BFMMLA |
| RISC-V Zfh | Full fp16 | Zfbfmin (conversion) |

Key fact: bf16 on most ISAs is dot-product/matmul only, not general-purpose. GF16/TF3 don't exist on any ISA.

Source: FP16 on x86-64
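
Because bf16 is simply the top 16 bits of an f32, the dot-product-only pattern is easy to model. The Python sketch below illustrates the VDPBF16PS-style semantics (bf16 inputs, products accumulated in f32); it uses truncating conversion and plain Python floats as the accumulator, so it is an approximation of the instruction, not a bit-exact model:

```python
import struct

def to_bf16_bits(x: float) -> int:
    """bf16 is the top 16 bits of IEEE 754 binary32 (truncated here)."""
    return struct.unpack('<I', struct.pack('<f', x))[0] >> 16

def from_bf16_bits(b: int) -> float:
    """Widen bf16 back to f32 by appending 16 zero bits."""
    return struct.unpack('<f', struct.pack('<I', b << 16))[0]

def bf16_dot(a_bits, b_bits) -> float:
    """VDPBF16PS-style pattern: bf16 in, wider-precision accumulate out."""
    acc = 0.0
    for a, b in zip(a_bits, b_bits):
        acc += from_bf16_bits(a) * from_bf16_bits(b)
    return acc
```

Note that only this dot/accumulate shape is hardware-accelerated on most ISAs; a general bf16 add or divide still goes through the widen-compute-narrow path.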

Level 5: Microarchitecture

  • Pipeline width: Apple M1 = 8-wide, Zen4 = 6-wide
  • Execution units: parallel FPU count
  • Cache hierarchy: data throughput

You control which instructions are generated via adaptive SIMD.

Source: Pipelined Processor

Level 6: RTL / Gate-Level — YOU ARE ALSO HERE! 🔥

The Verilog/VHDL level, where logic elements are described. This is where the FPGA (XC7A100T) lives.

At this level you CAN create native GF16/TF3 arithmetic:

module gf16_adder (
    input  [15:0] a,    // 1 sign + 6 exp + 9 mant
    input  [15:0] b,
    output [15:0] sum
);
    // ... native GF16 arithmetic in silicon!
endmodule

This is what NO CPU can provide — native GF16/TF3 compute in hardware.

Level 7: Transistors / Physical

  • CMOS, FinFET, lithography
  • FPGA Artix-7 uses 28nm TSMC
  • Each LUT = 6-input lookup table = ~dozens of transistors
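
A 6-input LUT is nothing more than a 64-entry truth table read by its six input signals, which is why so few transistors suffice. A minimal Python model (names are illustrative):

```python
def lut6(init: int, inputs: int) -> int:
    """Model an FPGA 6-input LUT: 'init' holds the 64 truth-table output
    bits, and the 6 input signals form the read address (0..63)."""
    assert 0 <= inputs < 64
    return (init >> inputs) & 1

# Example: program the LUT as a 6-input AND gate --
# only address 0b111111 (all inputs high) reads back a 1.
AND6 = 1 << 63
```

Any 6-input Boolean function, including one slice of a GF16 adder's datapath, is just a different 64-bit `init` pattern loaded at configuration time.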

Phase 3: Where Trinity Operates

| Level | What Trinity Does | File/Tool |
|---|---|---|
| 0 — Language | GF16, TF3, Sensation System | intraparietal_sulcus.zig, angular_gyrus.zig |
| 1 — Frontend | Zig compiler → ZIR | zig build |
| 2 — LLVM IR | Auto-vectorization of f16 | std.simd → <N x half> |
| 3 — SelectionDAG | fp16 legalization | Automatic by LLVM |
| 4 — ISA | Adaptive: AVX2 / NEON | adaptive_simd.zig |
| 5 — μarch | M1 Pro / Xeon throughput | Benchmarks: 1.09× / 2.06× |
| 6 — RTL | GF16/TF3 native arithmetic | FPGA XC7A100T (Vivado) |
| 7 — Physical | 28nm Artix-7 fabric | Hardware (fixed) |

Key insight: Trinity operates simultaneously on Level 0 (language/formats) AND Level 6 (FPGA RTL). All others (PyTorch, JAX, TensorRT) stop at Level 0–4.


Phase 4: zig-half vs Native Stacks — Feature Comparison

| Feature | Zig Native | Rust (half) | Go | C/LLVM | zig-half |
|---|---|---|---|---|---|
| f16 type | ✅ Built-in | ⚠️ Wrapper | ❌ External only | ⚠️ __fp16 | ✅ Built-in |
| f32→f16 conv | ⚠️ Promotion | ❌ f32 promotion | ⚠️ External | ⚠️ Native conv | ✅ Optimized |
| SIMD f16 vectors | ✅ @Vector(8, f16) | ⚠️ Intrinsics (nightly) | ❌ ASM | ⚠️ Intrinsics | ✅ Adaptive |
| Adaptive width | ✅ Comptime | ❌ Runtime | ❌ Native | ❌ Native | ✅ AVX/AVX-512/NEON |
| Ternary quant | ❌ None | ❌ None | ❌ None | ❌ None | ✅ {-1, 0, +1} |
| Sparse matmul | ✅ Via code | ❌ None | ⚠️ ASM | ❌ None | ✅ Zero-chunk skip |
| Ternary pack | ✅ Manual | ❌ None | ✅ Manual | ❌ None | ✅ 16 trits → 32 bit |
| Shadow weights | ❌ None | ❌ External | ❌ None | ❌ None | ✅ F16 sync |
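
The "16 trits → 32 bit" row follows from spending 2 bits per ternary weight. The sketch below uses an assumed 2-bit encoding ({0: 00, +1: 01, -1: 10}), not necessarily zig-half's actual bit layout:

```python
TRIT_CODE = {0: 0b00, 1: 0b01, -1: 0b10}        # assumed encoding
TRIT_DECODE = {v: k for k, v in TRIT_CODE.items()}

def pack_trits(trits) -> int:
    """Pack 16 ternary weights {-1, 0, +1} into one 32-bit word."""
    assert len(trits) == 16
    word = 0
    for i, t in enumerate(trits):
        word |= TRIT_CODE[t] << (2 * i)          # 2 bits per trit
    return word & 0xFFFFFFFF

def unpack_trits(word: int):
    """Recover the 16 trits from a packed 32-bit word."""
    return [TRIT_DECODE[(word >> (2 * i)) & 0b11] for i in range(16)]
```

An all-zero chunk packs to the word 0, which is what makes the "zero-chunk skip" in the sparse matmul row a single integer compare.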

Conclusion

Trinity's unique position:

  1. zig-half provides ML-grade f16 operations missing from all native language stacks
  2. GF16/TF3 are Level 0 formats that don't exist in any ISA
  3. Sensation System (IPS + Angular + Fusiform + OFC) adds semantic layer over raw numbers
  4. FPGA (Level 6) enables native GF16/TF3 hardware acceleration — impossible on CPUs

No other project operates across these levels simultaneously. PyTorch, JAX, and TensorRT are confined to Levels 0–4, relying on GPU vendors for Levels 5–7. Trinity owns the full stack from language to silicon.




φ² + 1/φ² = 3 | TRINITY