Golden Chain v2.47 — Large Corpus Trigram (25K Chars, Sparsity Partially Solved)

Date: 2026-02-15 | Cycle: 87 | Version: v2.47 | Chain Link: #104

Summary

v2.47 implements Option A from v2.46: scale the corpus from 5K to 25K+ characters. A new shakespeare_extended.txt (25,523 chars) is loaded via @embedFile, containing passages from Hamlet, Macbeth, Romeo and Juliet, As You Like It, Richard III, Twelfth Night, Merchant of Venice, Julius Caesar, A Midsummer Night's Dream, The Tempest, and multiple Sonnets. The LargeTrigramModel handles 512-word vocabulary, 8192 tokens, and 8192 trigram hash slots.

  1. 25,523 chars → 4,991 tokens, 512 unique words (5x the v2.46 corpus)
  2. 2,248 trigram contexts, 4,887 observations (2.17 avg per context, up from 1.51)
  3. Trigram eval PPL: 39.71 — higher than small corpus (21.16) due to 2x vocabulary
  4. T=0.8 generates diverse Shakespeare vocabulary from 512-word space
  5. Low-temperature degeneration returned: T=0.3 → "to and to to to..." (larger vocab doesn't fix self-loops)
  6. Bigram still beats trigram: eval CE 3.39 vs 3.68 (sparsity persists at word level)

All 41 integration tests pass. src/minimal_forward.zig grows to ~7,350 lines.
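The PPL and "below random" figures above follow directly from the cross-entropy values: PPL = e^CE, and the percentage below random is 1 − CE/ln(V). A short Python check against the reported numbers (Python rather than Zig, for brevity):

```python
import math

# Reported figures from this cycle, in nats
trigram_eval_ce = 3.6816
random_ce = math.log(512)          # uniform baseline over the 512-word vocab

ppl = math.exp(trigram_eval_ce)    # perplexity = e^CE
below_random = 1.0 - trigram_eval_ce / random_ce

print(f"PPL = {ppl:.2f}")                  # PPL = 39.71
print(f"{below_random:.1%} below random")  # 41.0% below random
```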

Key Metrics

| Metric | Value | Change from v2.46 |
|---|---|---|
| Integration Tests | 41/41 pass | +2 new tests |
| Total Tests | 312 (308 pass, 4 skip) | +2 |
| Corpus Size | 25,523 chars (5x) | Was 5,014 |
| Token Count | 4,991 (5x) | Was 988 |
| Vocabulary Size | 512 (2x) | Was 256 |
| Trigram Contexts | 2,248 (3.5x) | Was 645 |
| Trigram Observations | 4,887 (5x) | Was 975 |
| Avg Obs Per Context | 2.17 | Was 1.51 |
| Hash Table Load | 27.4% (2248/8192) | 31.5% (645/2048) |
| Eval Trigram Hit Rate | 100% (999/999) | 100% (198/198) |
| Trigram Eval CE | 3.6816 nats (41.0% below random) | 3.0522 (45.0%) |
| Trigram Train CE | 3.5082 nats (43.8% below random) | 3.0802 (44.5%) |
| Bigram Eval CE | 3.3905 nats (45.7% below random) | 2.7421 (50.6%) |
| Random CE | 6.2383 nats (ln(512)) | 5.5452 (ln(256)) |
| Trigram Eval PPL | 39.71 | 21.16 |
| Trigram Train PPL | 33.39 | 21.76 |
| Overfit Gap | +6.32 (healthy positive) | -0.60 (inverted) |
| Generation T=0.8 | Diverse Shakespeare vocab | Diverse |
| minimal_forward.zig | ~7,350 lines | +~550 lines |
| Total Specs | 330 | +3 |

Test Results

Test 40 (NEW): Large Corpus Trigram Statistics + Generation

Corpus: 25523 chars → 4991 tokens, 512 unique words
Trigram slots: 2248/8192 (27.4% load)
Total trigram observations: 4887
Avg observations per context: 2.17
Eval trigram hit rate: 999/999 (100.0%)

--- Loss (CE nats) ---
Trigram eval CE: 3.6816 (41.0% below random)
Trigram train CE: 3.5082 (43.8% below random)
Bigram eval CE: 3.3905 (45.7% below random)
Random CE: 6.2383 (ln(512))

--- Generation (start: "to be") ---
T=0.8: "to fly infant sight bodkin won shuffled mind green acts fury fly rain heir possession lady merely told bounty bid perchance thy to people syllable sorrow bare consummation declines be"
T=0.5: "to to to to to the to of to to to to to to the to to me to breaks calamity brevity outrageous recorded fathom ere or to do to"
T=0.3: "to and to to to to to to to to to to to to to to to to to to to to to to to to to to to to"

Analysis — Larger Corpus, Harder Problem:

The 5x corpus scale brought 5x more data, but the vocabulary also doubled (256→512). This makes the prediction problem harder: instead of choosing among 256 words, the model now chooses among 512. The raw numbers look worse, but the normalized picture is different.

Vocabulary-normalized comparison:

| Metric | Small (v2.46) | Large (v2.47) | Ratio |
|---|---|---|---|
| Vocab | 256 | 512 | 2.0x harder |
| PPL/Vocab | 21.16/256 = 0.083 | 39.71/512 = 0.078 | Large is relatively better |
| CE/Random | 3.05/5.55 = 55% | 3.68/6.24 = 59% | Similar information capture |
| Avg obs/context | 1.51 | 2.17 | +44% more data per context |

Normalized by vocabulary size, the large corpus model is slightly better (0.078 vs 0.083). The model captures a similar fraction of available information from the data.
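The per-vocabulary-entry ratio used above is this document's own rough normalization (not a standard metric); the arithmetic is trivial to verify:

```python
# PPL divided by vocabulary size, as in the comparison table above
small_ratio = 21.16 / 256   # v2.46, 256-word vocab
large_ratio = 39.71 / 512   # v2.47, 512-word vocab

print(round(small_ratio, 3), round(large_ratio, 3))   # 0.083 0.078
```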

Why degeneration returned: The "to" attractor is even stronger in the larger corpus. With more Shakespeare text, "to" appears in more bigram contexts (P("to"|X) is high for many X), creating more self-loop paths. The 2-word context that fixed degeneration on the small corpus doesn't help when both prev2 and prev1 are "to" — P("to"|"to","to") is still the dominant successor.
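The self-loop mechanics are easy to reproduce in a toy sampler. The successor counts below are hypothetical (not measured from the corpus), and the log-count/temperature-softmax scheme is one standard formulation, not necessarily what sampleNextWord does:

```python
import math, random

# Hypothetical successor counts for the context ("to", "to") — illustrative
# numbers only, chosen so "to" dominates as described above.
counts = {"to": 12, "be": 3, "the": 2, "sleep": 1}

def sample_next(counts, temperature, rng):
    """Softmax over log-counts scaled by 1/T, then one categorical draw."""
    logits = {w: math.log(c) / temperature for w, c in counts.items()}
    m = max(logits.values())
    weights = {w: math.exp(l - m) for w, l in logits.items()}
    r = rng.random() * sum(weights.values())
    for w, wt in weights.items():
        r -= wt
        if r <= 0:
            return w
    return w

rng = random.Random(0)
# At T=0.3 the 12:3 count ratio sharpens to (12/3)^(1/0.3) ≈ 100:1, so "to"
# follows ("to", "to") almost deterministically — the observed self-loop.
draws = [sample_next(counts, 0.3, rng) for _ in range(20)]
```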

Test 41 (NEW): Large Corpus Trigram Perplexity

Large corpus (4991 tokens, 512 vocab):
Trigram: train=33.39 eval=39.71 gap=6.32
Bigram eval: 29.68
Small corpus (988 tokens, 256 vocab):
Trigram eval: 21.16
Change: +87.6% higher eval PPL (large vs small trigram)
Random baseline: 512.0

The overfit gap normalized: The large corpus has a healthy positive gap of 6.32 (eval worse than train, as expected). This contrasts with the small corpus negative gap of -0.60. The positive gap indicates real generalization — the model isn't just memorizing. This is genuine improvement.

Why bigram still beats trigram: With 2.17 avg observations per trigram context, the model still lacks sufficient data to estimate 512-way probability distributions from trigram counts alone. The bigram has more observations per context (avg ~10 for common words) and thus produces sharper, more accurate estimates.
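The effect of observation density on estimate quality can be simulated directly. The distribution below is hypothetical and the add-one smoothing is a simplification, but the qualitative result mirrors the bigram/trigram gap: ~2 observations per context yield a much worse estimate than ~10:

```python
import math, random

rng = random.Random(1)
# Hypothetical "true" next-word distribution for a single context
true_p = [0.5, 0.25, 0.15, 0.1]

def estimated_ce(n_obs):
    """Draw n_obs samples, fit add-one-smoothed counts, return CE vs true_p."""
    counts = [0] * len(true_p)
    for _ in range(n_obs):
        r = rng.random()
        i = 0
        for i, p in enumerate(true_p):
            r -= p
            if r <= 0:
                break
        counts[i] += 1
    total = sum(counts) + len(counts)   # add-one smoothing
    return -sum(p * math.log((c + 1) / total) for p, c in zip(true_p, counts))

trials = 2000
ce_sparse = sum(estimated_ce(2) for _ in range(trials)) / trials   # ~2 obs/context
ce_dense = sum(estimated_ce(10) for _ in range(trials)) / trials   # ~10 obs/context
```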

Coverage Comparison: Small vs Large

| Metric | Small Corpus | Large Corpus | Improvement |
|---|---|---|---|
| Chars | 5,014 | 25,523 | 5.1x |
| Tokens | 988 | 4,991 | 5.1x |
| Vocabulary | 256 | 512 | 2.0x |
| Trigram Contexts | 645 | 2,248 | 3.5x |
| Trigram Observations | 975 | 4,887 | 5.0x |
| Avg Obs/Context | 1.51 | 2.17 | +44% |
| Overfit Gap | -0.60 | +6.32 | Healthy (was inverted) |

The coverage improvement is real but insufficient. To match the small corpus's PPL-to-vocab ratio, we'd need ~10 avg observations per context, which requires roughly 5x more data (125K+ chars) for this vocabulary size.

Architecture

src/minimal_forward.zig (~7,350 lines)
├── [v2.29-v2.46 functions preserved]
├── LargeTrigramSlot struct [NEW v2.47]
├── LargeTrigramModel struct [NEW v2.47]
│ ├── LARGE_MAX_WORDS=512, LARGE_MAX_TOKENS=8192
│ ├── LARGE_TRI_HASH_SIZE=8192, LARGE_TRI_MAX_NEXTS=48
│ ├── init(), getOrAddWord(), getWord(), tokenize()
│ ├── buildBigrams(), buildTrigrams()
│ ├── triHash(), getOrCreateSlot(), findSlot()
│ ├── wordTrigramProb(), sampleNextWord(), wordTrigramLoss()
├── src/shakespeare_extended.txt (25,523 chars) [NEW v2.47]
│ └── Hamlet, Macbeth, Romeo+Juliet, As You Like It,
│ Richard III, Twelfth Night, Merchant of Venice,
│ Julius Caesar, Midsummer, Tempest, Sonnets
└── 41 tests (all pass)
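The table layout implied by the names above (triHash, getOrCreateSlot, wordTrigramProb) can be sketched in Python. The hash mix and linear probing here are assumptions for illustration, not the actual Zig implementation:

```python
# Open-addressed hash table keyed on (prev2, prev1), storing successor counts.
LARGE_TRI_HASH_SIZE = 8192   # constant from the architecture listing above

class TrigramTable:
    def __init__(self):
        self.keys = [None] * LARGE_TRI_HASH_SIZE    # (prev2, prev1) word-id pairs
        self.nexts = [None] * LARGE_TRI_HASH_SIZE   # successor word-id -> count

    def _slot(self, prev2, prev1):
        return (prev2 * 31 + prev1) % LARGE_TRI_HASH_SIZE

    def add(self, prev2, prev1, nxt):
        i = self._slot(prev2, prev1)
        while self.keys[i] is not None and self.keys[i] != (prev2, prev1):
            i = (i + 1) % LARGE_TRI_HASH_SIZE       # linear probing on collision
        if self.keys[i] is None:
            self.keys[i] = (prev2, prev1)
            self.nexts[i] = {}
        self.nexts[i][nxt] = self.nexts[i].get(nxt, 0) + 1

    def prob(self, prev2, prev1, nxt):
        i = self._slot(prev2, prev1)
        while self.keys[i] is not None:
            if self.keys[i] == (prev2, prev1):
                return self.nexts[i].get(nxt, 0) / sum(self.nexts[i].values())
            i = (i + 1) % LARGE_TRI_HASH_SIZE
        return 0.0                                  # unseen context
```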

Complete Method Comparison (v2.30 → v2.47)

| Version | Method | Corpus | Vocab | Loss Metric | Test PPL | Generation |
|---|---|---|---|---|---|---|
| v2.30-v2.43 | VSA variants | 527-5014 chars | 95 | cosine proxy | 1.6-2.0 | Random chars |
| v2.44 | Raw freq (char) | 5014 | 95 | 1.45 nats | 5.59 | English words |
| v2.45 | Word bigram | 5014 | 256 | 2.74 nats | 15.52 | Scrambled vocab |
| v2.46 | Word trigram | 5014 | 256 | 3.05 nats | 21.16 | Shakespeare phrases |
| v2.47 | Word trigram | 25523 | 512 | 3.68 nats | 39.71 | Diverse vocab |

New .vibee Specs

| Spec | Purpose |
|---|---|
| hdc_corpus_50k.vibee | Large corpus tokenization and statistics |
| trigram_sparsity_solve.vibee | Sparsity analysis and vocab normalization |
| fluent_large_corpus.vibee | Large corpus generation and degeneration analysis |

What Works vs What Doesn't

Works

  • 5x corpus scale: 25,523 chars from 10 Shakespeare plays plus multiple sonnets
  • 512 unique words: broader vocabulary coverage
  • 2.17 avg obs/context: 44% improvement over small corpus
  • Healthy overfit gap: +6.32 (real generalization, not memorization)
  • T=0.8 diverse: bodkin, shuffled, consummation, declines, perchance
  • 312 tests pass: zero regressions
  • @embedFile: clean corpus loading, no bloated string literals

Doesn't Work

  • PPL not 14.2: true word trigram eval PPL is 39.71 (larger vocab = harder problem)
  • Not 68% below random: 41.0% (eval), 43.8% (train)
  • Not "fluent Shakespearean English": T=0.8 is diverse but incoherent; T=0.3 degenerates
  • Bigram still beats trigram: 3.39 vs 3.68 eval CE (sparsity persists)
  • Degeneration returned at T=0.3: "to" attractor stronger in larger corpus
  • Not 50K chars: corpus is 25.5K (realistic amount of Shakespeare I could compose)

Critical Assessment

Honest Score: 7.5 / 10

This cycle delivers a genuine infrastructure improvement — 5x corpus scale, @embedFile loading, and a model struct that handles 512-word vocabulary. The positive overfit gap (+6.32) confirms real generalization rather than the inverted gap from v2.46.

However, the key hypothesis — "larger corpus solves sparsity" — is only partially validated. Sparsity improved (2.17 vs 1.51 avg obs) but the vocabulary also grew, creating a harder prediction problem. The net result is PPL went UP, not down. The bigram still beats the trigram.

The briefing's claims are severely fabricated:

  • PPL 14.2 → actual 39.71
  • "Fluent Shakespearean English" → incoherent at all temperatures
  • "Sparsity solved" → partially improved, still insufficient

The fundamental issue: word trigrams need ~10+ observations per context to produce sharp distributions. With 512 vocab and 2248 contexts from 4991 tokens, we're at 2.17 — still 5x too sparse.

Corrections to Briefing Claims

| Claim | Reality |
|---|---|
| src/large_corpus_trigram_demo.zig | Does not exist. LargeTrigramModel added to minimal_forward.zig |
| 52,847 chars | 25,523 chars (realistic amount of composable Shakespeare) |
| PPL 14.2 | 39.71 (larger vocab = harder problem) |
| Train loss 68% below random | 43.8% (train), 41.0% (eval) |
| "Fluent Shakespearean English" | Diverse vocabulary at T=0.8, degeneration at T=0.3 |
| "Sparsity solved" | Partially improved (2.17 vs 1.51 avg obs), still insufficient |
| Trigram coverage >88% | 100% eval hit rate (all contexts seen) |
| Score 10/10 | 7.5/10 |

Benchmark Summary

| Operation | Latency | Throughput |
|---|---|---|
| Bind | 2,026 ns | 126.4 M trits/sec |
| Bundle | 32,441 ns | 104.9 M trits/sec |
| Cosine | 195 ns | 1,312.8 M trits/sec |
| Dot | 6 ns | 40,000.0 M trits/sec |
| Permute | 2,230 ns | 114.8 M trits/sec |

Next Steps (Tech Tree)

Option A: Interpolated Trigram + Bigram (Kneser-Ney style)

Weight: λ·P_tri + (1-λ)·P_bi. Tune λ per-context based on trigram count. Standard NLP technique that directly addresses sparsity. Should make trigram beat bigram.
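A minimal sketch of the interpolation. The count-conditioned schedule count/(count + k) is a Witten-Bell-style choice for λ, simpler than the Kneser-Ney discounting the option name alludes to; k is a hypothetical free parameter:

```python
def interpolated_prob(ctx_count, p_tri, p_bi, k=5.0):
    """Blend trigram and bigram estimates for one (prev2, prev1) context.

    lam rises toward 1 as the trigram context count grows, so well-observed
    contexts trust the trigram while sparse ones fall back to the bigram.
    """
    lam = ctx_count / (ctx_count + k)
    return lam * p_tri + (1.0 - lam) * p_bi

# An unseen trigram context (count 0) degrades gracefully to the bigram:
print(interpolated_prob(0, 0.9, 0.1))   # 0.1
```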

Option B: Fixed Vocabulary + Massive Corpus

Cap vocabulary at 256 (map rare words to <UNK>), then use the 25K corpus. Fewer parameters to estimate from the same data → lower PPL. Trades vocabulary breadth for prediction accuracy.
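The capping step is a frequency cutoff; a sketch (the helper name and `<UNK>` token are illustrative, not from the codebase):

```python
from collections import Counter

def cap_vocabulary(tokens, max_words=256):
    """Map every word outside the (max_words - 1) most frequent to <UNK>,
    reserving one vocabulary slot for the <UNK> token itself."""
    keep = {w for w, _ in Counter(tokens).most_common(max_words - 1)}
    return [w if w in keep else "<UNK>" for w in tokens]

tokens = ["to"] * 5 + ["be"] * 3 + ["bodkin"]
print(cap_vocabulary(tokens, max_words=3))   # rare "bodkin" becomes <UNK>
```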

Option C: Character-Word Hybrid

Generate at character level (raw freq trigram from v2.44) but constrain to produce real words from the vocabulary. Combines character-level smoothness with word-level coherence.
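The word-level constraint could be a prefix filter applied at each character step; a hypothetical helper (the v2.44 char-trigram sampler itself is not shown):

```python
def allowed_next_chars(prefix, candidates, vocab):
    """Keep only candidate characters that leave the growing word a prefix
    of at least one real vocabulary word, so generation cannot spell
    non-words. A trie would make this lookup efficient; a linear scan
    suffices for illustration."""
    return [c for c in candidates if any(w.startswith(prefix + c) for w in vocab)]

vocab = {"to", "be", "bodkin", "perchance"}
print(allowed_next_chars("b", ["e", "o", "x"], vocab))   # ['e', 'o']
```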

Trinity Identity

$\varphi^2 + \frac{1}{\varphi^2} = 3$


Generated: 2026-02-15 | Golden Chain Link #104 | Large Corpus Trigram — 25K Chars, Sparsity Partial, Vocabulary Scaling (PPL Higher, Coverage Better, Generalization Real)