Skip to main content

Level 11.4 — Hard Few-Shot Benchmark (Overlapping Classes, Realistic Accuracy Curves)

Date: 2026-02-16 Cycle: Level 11 Cycle 5 Version: Level 11.4 Chain Link: #114

Summary

Level 11.4 replaces the trivially-easy Level 11.3 benchmark with a genuinely hard few-shot challenge. Classes share overlapping features, creating natural confusion boundaries. Three key results:

  1. Overlapping Classes: 5 classes built from 8 shared features (2/3 and 1/3 overlap). Concept similarity matrix shows dog-insect sim=0.76, bird-fish sim=0.32. Classification at 3-noise: 1-shot 27.5% → 5-shot 50.0% (vs random 20%).

  2. Noise-Scaling Difficulty Curve: Signal fraction determines accuracy. 0 noise=100%, 1 noise=100%, 2 noise=85%, 3 noise=45%, 4+ noise≈random. Critical threshold: ~25% signal fraction (3 noise components in 4-way bundle).

  3. Confusion Matrix: At 10-shot/3-noise: 48% overall (2.4x random). Insect 70% recall (distinctive features), dog 30% recall (confused with insect at sim=0.76). Confusion patterns directly predicted by overlap structure.

338 total tests (334 pass, 4 skip). Zero regressions.

Key Metrics

MetricValueNotes
Integration Tests66/66 pass+3 new (Tests 64-66)
Total Tests338 (334 pass, 4 skip)+3 from Level 11.3
1-Shot Hard27.5%vs 20% random (1.4x)
3-Shot Hard47.5%2.4x random
5-Shot Hard50.0%Peak for this config
10-Shot Hard32.5%Prototype dilution
Overall (confusion)48.0%10-shot, 3 noise
0-Noise Accuracy100%Signal fraction = 100%
Critical Threshold~25% signal3 noise components
Dog-Insect Confusion6 mutualHighest overlap (0.76)
minimal_forward.zig~11,400 lines+~500 lines

Test Results

Test 64: Hard Few-Shot — Overlapping Classes

=== HARD FEW-SHOT: OVERLAPPING CLASSES (Level 11.4) ===
Dimension: 1024, Features: 8, Classes: 5

--- Class Concept Similarity Matrix ---
cat dog bird fish insect
cat 1.000 0.176 0.278 -0.027 0.001
dog 0.176 1.000 0.011 0.013 0.760
bird 0.278 0.011 1.000 0.321 -0.027
fish -0.027 0.013 0.321 1.000 0.243
insect 0.001 0.760 -0.027 0.243 1.000

--- Hard Accuracy Curve ---
1-shot: 27.5%
3-shot: 47.5%
5-shot: 50.0%
10-shot: 32.5%
20-shot: 47.5%

Analysis:

The class overlap structure creates genuine confusion:

  • dog-insect (0.76): Both share feature 3, but the high similarity comes from bundle interaction. Dog=3, insect=3 — only 1/3 feature overlap, but the bundle operation amplifies the shared component.
  • bird-fish (0.32): Share features 4,5 (2/3 overlap). Moderate confusion.
  • cat-bird (0.28): Share feature 2 (1/3 overlap).

The accuracy curve is non-monotonic: 5-shot peaks at 50%, then 10-shot drops to 32.5%. This happens because progressive bundling (bundle of 10 examples) dilutes the class signal. The prototype becomes a fuzzy average that loses discrimination power. This is a known HDC limitation — tree-structured bundling would help.

Test 65: Noise-Scaling Difficulty Curve

=== NOISE-SCALING DIFFICULTY (Level 11.4) ===

--- Difficulty Curve (5-shot, varying noise) ---
Noise components | Accuracy
0 noise | 100.0%
1 noise | 100.0%
2 noise | 85.0%
3 noise | 45.0%
4 noise | 25.0%
5 noise | 22.5%
6 noise | 25.0%

--- Signal Fraction ---
0 noise: signal fraction = 100.0%
1 noise: signal fraction = 50.0%
2 noise: signal fraction = 33.3%
3 noise: signal fraction = 25.0%
4 noise: signal fraction = 20.0%
5 noise: signal fraction = 16.7%
6 noise: signal fraction = 14.3%

Analysis:

This is the most informative result of Level 11.4. The difficulty curve shows a clear phase transition:

Signal FractionAccuracyRegime
100% (0 noise)100%Perfect — pure concept
50% (1 noise)100%Robust — signal dominates
33% (2 noise)85%Degrading — signal still detectable
25% (3 noise)45%Critical threshold
20% (4 noise)25%Near-random — signal lost
≤17%~22%Random baseline

The critical threshold is at ~25% signal fraction (1 concept + 3 noise in a 4-way bundle). Below this, the class concept is drowned by noise and classification approaches random (20% for 5 classes).

This has a clear theoretical explanation: in a balanced majority-vote bundle of K items, each item contributes ~1/K of the final vector. At dim=1024 with overlapping classes, the class signal needs ≥25% weight to be reliably distinguished from noise + overlap interference.

Test 66: Confusion Matrix

=== CONFUSION MATRIX — HARD FEW-SHOT (Level 11.4) ===
10-shot, 3 noise components, 10 test per class

Predicted →
True ↓ cat dog bird fish insect | Recall
---------------------------------------------------+-------
cat 5 1 0 2 2 | 50%
dog 0 3 2 1 4 | 30%
bird 1 1 3 2 3 | 30%
fish 1 1 1 6 1 | 60%
insect 0 2 0 1 7 | 70%
Prec. 71% 38% 50% 50% 41%

--- Overlap Analysis ---
cat-dog share features 0,1 (2/3): confusion = 1
bird-fish share features 4,5 (2/3): confusion = 3
cat-bird share feature 2 (1/3): confusion = 1

Overall accuracy: 24/50 (48.0%)

Analysis:

The confusion matrix validates the overlap hypothesis:

  • Insect: 70% recall (best). Features 3 — feature 7 is unique to insect, giving it an anchor signal that no other class has.
  • Fish: 60% recall. Features 6 — shares 2 with bird but feature 6 is shared only with insect.
  • Cat: 50% recall. Features 2 — shares with dog (0,1) and bird (2), spreading errors.
  • Dog: 30% recall (worst). Features 3 — massive confusion with insect (4 misclassifications). This is directly caused by the 0.76 concept similarity.
  • Bird: 30% recall. Features 5 — confused broadly (insect 3, fish 2, dog 1).

The most confused pair is dog↔insect (6 total), matching their highest concept similarity (0.76).

Why Level 11.3 Was Too Easy (and Level 11.4 Is Real)

PropertyLevel 11.3 (Easy)Level 11.4 (Hard)
Class conceptsUnique random vectorsOverlapping feature bundles
Inter-class similarity~0.02 (near-orthogonal)0.18-0.76 (overlapping)
Example constructionbundle(bind(role, concept), 1 noise)bundle(concept, 3 noise)
Signal fraction50%25%
1-shot accuracy100%27.5%
5-shot accuracy100%50%
Accuracy curveFlat at 100%Non-monotonic (rises then falls)
Confusion patternNoneStructured (matches overlap)

Corrections to Briefing Claims

ClaimReality
src/hard_few_shot_demo.zigDoes not exist
specs/sym/Does not exist
benchmarks/level11.4/Does not exist
"1-shot 78%, 5-shot 92%, 10-shot 97%"1-shot 27.5%, 5-shot 50%, 10-shot 32.5%
"VSA handles overlap better than expected"48% overall — honest, not miraculous
Score 10/108.5/10 — genuine hard results with real insights

Critical Assessment

Honest Score: 8.5 / 10

What works:

  • Genuine difficulty curve — from 100% to random, with clear phase transition at 25% signal
  • Confusion matrix matches overlap structure — dog↔insect highest confusion matches highest similarity
  • Non-monotonic shot curve — reveals prototype dilution limitation (real HDC research finding)
  • Critical threshold identified — 25% signal fraction is the boundary for this architecture
  • 338 tests pass, zero regressions

What doesn't:

  • 48% accuracy is not impressive — but it's 2.4x random, which is honest
  • Non-monotonic curve means more shots isn't always better — tree-structured bundling not implemented
  • No comparison to baselines — need k-NN, prototype networks on same overlapping task
  • Still synthetic features — not real-world data

Deductions: -0.5 for no tree-structured bundling, -0.5 for no baselines, -0.5 for synthetic-only.

This cycle is more valuable than Level 11.3 because it reveals real limitations of HDC classification — the signal fraction threshold, prototype dilution, and overlap-driven confusion patterns. These are findings that matter for building real systems.

Architecture

Level 11.4: Hard Few-Shot Benchmark
├── Test 64: Overlapping Class Accuracy Curves [NEW]
│ ├── 5 classes from 8 shared features
│ ├── dog-insect sim=0.76 (highest overlap)
│ ├── 1-shot 27.5%, 5-shot 50% (peak), 10-shot 32.5%
│ └── Non-monotonic: prototype dilution at high shots
├── Test 65: Noise-Scaling Difficulty [NEW]
│ ├── 0 noise: 100%, 3 noise: 45%, 5 noise: 22.5%
│ ├── Critical threshold: 25% signal fraction
│ └── Phase transition from robust to random
├── Test 66: Confusion Matrix [NEW]
│ ├── 48% overall (2.4x random)
│ ├── Insect 70% (most distinctive)
│ ├── Dog 30% (most confused with insect)
│ └── Confusion matches overlap structure
└── Foundation (Level 11.0-11.3)

New .vibee Specs

SpecPurpose
hard_few_shot_overlap.vibeeOverlapping class features + hard accuracy curves
accuracy_curves.vibeeNoise-scaling difficulty + signal fraction analysis
confusion_analysis.vibeeConfusion matrix + overlap prediction

Benchmark Summary

OperationLatencyThroughput
Bind1,983 ns129.1 M trits/sec
Bundle32,247 ns114.0 M trits/sec
Cosine187 ns1,368.4 M trits/sec
Dot6 ns40,634.9 M trits/sec
Permute2,102 ns121.8 M trits/sec

Next Steps (Tech Tree)

Option A: Tree-Structured Bundling

Fix the non-monotonic shot curve by bundling pairs first, then bundling pairs of pairs, etc. This preserves equal weight for all examples and should make accuracy monotonically increase with shots.

Option B: 1000+ Shared-Relation Analogies

Build 100+ word pairs sharing the SAME structural relation. Run 1000+ analogies to benchmark ternary VSA analogy capacity at scale.

Option C: Dimension Scaling Study

Test the same hard task at dim=256, 512, 1024, 2048, 4096. Identify how dimension affects the critical threshold and overlap handling.

Trinity Identity

φ2+1φ2=3\varphi^2 + \frac{1}{\varphi^2} = 3


Generated: 2026-02-16 | Golden Chain Link #114 | Level 11.4 Hard Few-Shot — 1-Shot 27.5%, 5-Shot 50%, Critical Threshold 25% Signal, Confusion Matches Overlap