engram — Training Status

Status — v14_branchb_learnable_vocab: Vocab Was the Bottleneck. Branch B Confirmed. First Recognizable Dialog Output.

Update 2026-05-05: v14_branchb_learnable_vocab deployed live — single architectural change from v13: vocab_matrix_global is now an nn.Parameter (~5.6M trainable parameters) trained alongside the brain via AdamW with lr=EMBED_LR/2=2.5e-4. Brain was warm-started from v13 final and frozen for epoch 1 so the vocab could adjust against fixed brain predictions, then unfrozen for epochs 2–5 (joint training). Architecture otherwise identical to v13: 12L / 384D / 12H / RoPE / Pre-LN / AdamW, vocab 14,704, context 32, 5.3 MB corpus. Trained 2026-05-05 on Modal L4, $8.00.
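
The freeze-then-unfreeze schedule described above can be sketched in a few lines. This is a framework-agnostic illustration, not the project's training code: the function name `trainable_groups` and the group labels are hypothetical, the 384-wide vocab row is an assumption (chosen because 14,704 × 384 matches the stated ~5.6M parameters), and EMBED_LR = 5e-4 is inferred from the stated lr = EMBED_LR/2 = 2.5e-4.

```python
# Hypothetical sketch of the v14-B schedule: epoch 1 trains only the vocab,
# epochs 2-5 train vocab and brain jointly.

VOCAB_SIZE = 14_704
VOCAB_DIM = 384            # assumption: one 384-D row per token
EMBED_LR = 5e-4            # inferred from lr = EMBED_LR / 2 = 2.5e-4
VOCAB_LR = EMBED_LR / 2
NUM_EPOCHS = 5


def trainable_groups(epoch: int) -> dict[str, bool]:
    """Return which parameter groups receive updates in `epoch` (1-based)."""
    if not 1 <= epoch <= NUM_EPOCHS:
        raise ValueError(f"epoch must be in 1..{NUM_EPOCHS}, got {epoch}")
    # The vocab always trains; the brain joins from epoch 2 onward.
    return {"vocab": True, "brain": epoch >= 2}


vocab_params = VOCAB_SIZE * VOCAB_DIM  # 5,646,336, i.e. ~5.6M

print(f"learnable vocab parameters: {vocab_params:,}")
for epoch in range(1, NUM_EPOCHS + 1):
    active = [name for name, on in trainable_groups(epoch).items() if on]
    print(f"epoch {epoch}: training {' + '.join(active)} (vocab lr={VOCAB_LR:g})")
```

In PyTorch, a plan like this would typically be applied by toggling `requires_grad` on the brain's parameters before epoch 2; the brain's own learning rate is not stated in the source, so it is left out here.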

Loss curve (the key evidence): v13 final plateau: 5.04 nats. Epoch 1 end (brain frozen, only vocab moving): 4.7854, already 0.25 nats below v13. Epoch 2 (brain unfrozen, joint training): 4.0939. Epoch 3: 3.5592. Epoch 4: 3.0563. Epoch 5 final: 2.7059, which is 2.33 nats below v13 and ~2.12 nats above the theoretical floor at INV_TEMP=30 (≈0.59).
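
The deltas in the loss curve can be recomputed directly from the reported values. The snippet below is only a bookkeeping check; the variable names are illustrative, and the floor value is taken as reported rather than derived.

```python
# Reported losses in nats, copied from the status text above.
V13_FINAL = 5.04
FLOOR_T30 = 0.59  # theoretical floor at INV_TEMP=30, as reported
epoch_losses = [4.7854, 4.0939, 3.5592, 3.0563, 2.7059]

previous = V13_FINAL
for epoch, loss in enumerate(epoch_losses, start=1):
    print(f"epoch {epoch}: {loss:.4f}  (drop {previous - loss:.4f}, "
          f"{V13_FINAL - loss:.2f} below v13)")
    previous = loss

final = epoch_losses[-1]
gain_over_v13 = V13_FINAL - final   # distance below the v13 plateau
headroom = final - FLOOR_T30        # distance above the reported floor
print(f"final: {gain_over_v13:.2f} nats below v13, "
      f"{headroom:.2f} nats above the floor")
```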

Smoking gun — epoch 1 with brain frozen. The fact that loss broke through v13's 5.04 plateau in epoch 1 with the brain still frozen is the clean ablation: vocab geometry was the bottleneck. The sentence-transformer initialization optimizes for semantic similarity ("cat" near "dog"), not syntactic prediction ("cat" followed by "is"). Letting the vocab learn under the cross-entropy gradient unlocked the descent.

Eval verdict: PARTIAL. Output is qualitatively much improved but not yet coherent. The model now produces real English fragments and recognizable conversational/scheduling patterns from the dailydialog corpus. Sample outputs: "hi how about tomorrow coming to account on friday", "okay how about one week user what time bot at ten to p after", "ten minutes walk", "i love you very long". Compare v13's token soup: "today we who had selling how me user". Replies do not yet hold together as multi-turn dialogue, but this is by far the cleanest signal so far: dialog-shaped text is appearing for the first time.

Significance of Branch B. Branch B is one of engram's core architectural bets: vocab and brain as separable, swappable components. v14-B is the first run to exercise the "learnable" half of "learnable 96-D coordinates" under cross-entropy training. The frozen sentence-transformer geometry, optimized for semantic similarity, turns out to be hostile to syntactic prediction. These are different tasks, and the mismatch only became visible once the gradient was allowed to move the vocab.

Next: Branches A, C, D from V14_CANDIDATES.md. Branch A (raise halt-gate cap from 3 → 5 and lower ponder cost) is most likely next: avg_ponder is still pegged at 3.00 (the cap), meaning the adaptive-compute lever is not yet engaged. Branch D (surprise-modulated gradient) and Branch C (episodic memory at training time) are the remaining axes. After A/C/D land, plans/FUTURE_RESEARCH.md picks up at v15+ (∇-Reasoner is the recommended first follow-on, zero training cost). Cumulative spend: ~$73 of $150 budget ceiling.

Server Status

Status	Live (HTTP 200)
Active model	v14_branchb_learnable_vocab
Architecture	12L · 384D · 12H · head_dim=32 · RoPE · Pre-LN (frozen since v9)
Loss function	cosine cross-entropy · tied output projection · INV_TEMP=30
Brain params	21,509,761
Engram module params	38,696,064
Vocab size	14,704; now a learnable nn.Parameter (~5.6M trainable params), warm-started from sentence-transformer init
Vocab training schedule	epoch 1: brain frozen, vocab only (lr=2.5e-4); epochs 2–5: joint training (brain + vocab)
Corpus	dailydialog_clean.txt + everyday_conversations.txt (5.3 MB combined)
Training	5 epochs · Modal L4 · 2026-05-05
Per-epoch loss	ep1 (brain frozen): 4.7854 · ep2: 4.0939 · ep3: 3.5592 · ep4: 3.0563 · ep5: 2.7059
Final loss	2.7059 nats (theoretical floor at INV_TEMP=30: ≈0.59; 2.12 nats headroom)
Training cost	$8.00 · cumulative ~$73 of $150 ceiling
Platform	Modal (L4 GPU)
Endpoint	http://108.181.97.223:5000
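
The loss function listed above (cosine cross-entropy with a tied output projection and an inverse temperature) can be illustrated with a small pure-Python sketch. This is a generic formulation under stated assumptions, not the project's implementation: logits are taken to be INV_TEMP times the cosine similarity between the predicted vector and each vocab row, and the toy vocab and prediction values are made up for illustration.

```python
import math

INV_TEMP = 30.0  # inverse temperature, as listed above


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_cross_entropy(pred, vocab, target_idx, inv_temp=INV_TEMP):
    """Cross-entropy in nats over logits = inv_temp * cos(pred, vocab_row).

    The same `vocab` matrix that embeds input tokens is reused as the
    output projection, which is what "tied" is assumed to mean here.
    """
    logits = [inv_temp * cosine(pred, row) for row in vocab]
    peak = max(logits)  # subtract the max for a stable log-sum-exp
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - logits[target_idx]


# Toy 3-token vocab in 2-D (hypothetical values for illustration only).
toy_vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
prediction = [0.9, 0.1]  # points mostly at token 0

loss_t10 = cosine_cross_entropy(prediction, toy_vocab, 0, inv_temp=10.0)
loss_t30 = cosine_cross_entropy(prediction, toy_vocab, 0, inv_temp=30.0)
print(f"INV_TEMP=10: {loss_t10:.6f} nats, INV_TEMP=30: {loss_t30:.6g} nats")
```

Because cosine similarity is bounded, the inverse temperature controls how sharp the softmax can get: for a prediction that is well aligned with its target, the larger INV_TEMP yields the lower loss, which is in line with the lower floor reported for INV_TEMP=30 than for INV_TEMP=10. Since the vocab rows appear in the logits, making them trainable also lets the loss gradient reshape the output geometry.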

OpenMythos Architecture Transfer — Ablation Results

Five architectural ideas from kyegomez/OpenMythos were ported and tested with strict pass/kill criteria from a benchmark harness (bench/run.py). Only one shipped.

Phase	Idea	Status	Why
0	Reproducibility harness + baseline	PASSED	0.0e+00 loss diff between identical runs; baseline locked
1	LTI residual injection	KILLED	grad_norm_p99 unchanged (0.572 vs 0.561); no eval gain
2	Loop-index sinusoidal embedding	KILLED	halt gate completely insensitive to loop signal
3	Inference-time depth extrapolation	SKIPPED	Phase 1+2 prereq chain broken
4	Per-loop LoRA	SKIPPED	Phase 1–3 prereq chain broken
5	RoPE positional encoding	PASSED · SHIPPED	grad_norm_p99 halved (0.561→0.280); zero quality cliff at 3× train context
6	Lock + document	PASSED	use_rope=True locked as default; README updated

What We've Learned — v8 through v14

Each run tested one hypothesis. Architecture has been stable since v6 (Pre-LN + AdamW + RoPE).

Model	Change tested	Final loss	Vocab	Coherent?	Conclusion
v8_clean	Corpus cleanup — strip numeric artifacts, merge rare tokens	1.0044	9,509	Real words, no grammar	Cleanup worked — output became recognizable English. Grammar still missing.
v9_dialog_big	Capacity — 384D / 12L / 12H, ~21.5M brain params (was ~6M)	~1.05	9,509	FAIL	Model capacity hypothesis refuted. Word salad at 21.5M same as at 6M.
v10_dialog_corpus	Corpus expansion — intended to add everyday_conversations.txt	~1.05	9,509	NOT TESTED	File not committed. Corpus expansion hypothesis was not tested — v9 re-run.
v11_dialog_2corpus	Corpus expansion — first clean test with both files committed	1.0480 (MSE)	14,704	FAIL	Corpus volume hypothesis refuted. +55% vocab, same coherence level as v9.
v12_xent	Loss function — MSE on embeddings → cosine cross-entropy, INV_TEMP=10	7.38 nats (x-ent)	14,704	REPLACED	Cross-entropy trains cleanly; dialog scaffolding tokens present. INV_TEMP=10 too flat — 5.6 nats above floor. Temperature calibration identified as next test.
v13_xent_temp30	Temperature — INV_TEMP 10 → 30; all else frozen	5.04 nats (x-ent)	14,704	REPLACED	Loss dropped 2.34 nats. avg_ponder saturated at 2.7–2.9 (cap=3). Output still token-soup. Temperature calibration hypothesis partially supported for loss, refuted for coherence. Frozen vocab embeddings identified as bottleneck.
v14_branchb_learnable_vocab	Branch B: vocab_matrix_global made nn.Parameter (~5.6M params); brain frozen ep1, joint training ep2–5	2.7059 nats (x-ent)	14,704 (learnable)	PARTIAL · BRANCH B CONFIRMED	Loss broke below v13's 5.04 plateau in epoch 1 with brain frozen: vocab geometry was the bottleneck. Final loss 2.7059 (2.33 nats below v13). Output shows real English fragments and dialog patterns for the first time. avg_ponder pegged at cap=3, so the adaptive-compute lever is still not engaged. Branches A/C/D queued.

Evaluation Report — Human Judgement

A 16-prompt evaluation across three difficulty buckets (greetings · chitchat · harder Q&A) was run against v4_rope. Result: all 32 generated replies were near-identical sequences of rare/odd words regardless of input prompt. Example prompt/reply pairs:

[USER] hello
[BOT] you for suavely user resigning to the for impressiveness resigning i you freshener that the

[USER] what is the capital of france
[BOT] that slighted resigning to impressiveness you the i and to resigning the defilements for you

[USER] tell me a story
[BOT] you for suavely user resigning to the for impressiveness resigning i you freshener that the

Diagnosis: the model isn't conditioning on input prompts. The output collapses to the same vocab cluster regardless of context. Two plausible causes:

Undertraining — 1 epoch on a corpus where ~50% is 19th-century novels means rare archaic vocab dominates the output distribution. Classic sign of insufficient gradient steps.
Eval bug — the bench harness used torch.randn() embeddings in its eval_cosine_top1, which never moved off 5.0% across any architectural change. The metric may be saturated / noise-dominated, but live eval on real ChromaDB embeddings still produces nonsense — so undertraining is the dominant explanation.
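
The "not conditioning on input" diagnosis can be quantified with a simple check over the example replies above. The helpers `distinct_ratio` and `token_overlap` are hypothetical names for this sketch and are not the project's own "distinct" criterion.

```python
# Prompt/reply pairs copied from the evaluation report above.
replies = {
    "hello": "you for suavely user resigning to the for impressiveness resigning i you freshener that the",
    "what is the capital of france": "that slighted resigning to impressiveness you the i and to resigning the defilements for you",
    "tell me a story": "you for suavely user resigning to the for impressiveness resigning i you freshener that the",
}


def distinct_ratio(outputs):
    """Fraction of unique replies; 1.0 means every prompt got its own reply."""
    outputs = list(outputs)
    return len(set(outputs)) / len(outputs)


def token_overlap(a, b):
    """Jaccard overlap between the word sets of two replies."""
    set_a, set_b = set(a.split()), set(b.split())
    return len(set_a & set_b) / len(set_a | set_b)


ratio = distinct_ratio(replies.values())
overlap = token_overlap(replies["hello"], replies["what is the capital of france"])
print(f"distinct ratio: {ratio:.2f}")
print(f"word overlap between two different replies: {overlap:.2f}")
```

Two of the three prompts produce byte-identical replies, and even the replies that differ share most of their vocabulary, which is the pattern expected when the output ignores the prompt.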

Roadmap to Coherent — Cost & Time Tiers

Modal pricing as of 2026-04-26 (Starter plan, $25 included credits). Each tier assumes restart from previous run + bug fixes. Recommendation: don't budget more than $5 until v5 confirms whether undertraining is really the bottleneck.

Tier	Change	Wall time	GPU	Cost	Expected quality
v4 (now)	19M params · 15 MB corpus · 1 epoch	~1.5h	L4 ($0.80/h)	$1.20	gibberish (current)
v5	+ full dailydialog · 5 epochs	~6h	L4	~$5	most greetings/chitchat coherent
v6	+ bigger model (50M params, 12 layers)	~12h	A10G ($1.10/h)	~$13	usually-coherent short replies
v7	+ 200 MB conversational corpus · 80M params	~30h	L40S ($1.95/h)	~$60	actual conversation, occasional weirdness
v8	+ 1 GB corpus · 100M params · 5 epochs	~80h	L40S	~$160	GPT-2-tier coherence
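
The cost column follows from wall time multiplied by the hourly GPU rate. The sketch below recomputes it from the table; the v5 and v8 rows list only a GPU name, so carrying over the L4 and L40S rates from the v4 and v7 rows is an assumption.

```python
# (tier, wall-time hours, GPU, hourly rate in USD), copied from the table above.
# Rates for v5 and v8 are carried over from v4 and v7 (an assumption).
tiers = [
    ("v4", 1.5, "L4", 0.80),
    ("v5", 6, "L4", 0.80),
    ("v6", 12, "A10G", 1.10),
    ("v7", 30, "L40S", 1.95),
    ("v8", 80, "L40S", 1.95),
]

# Cost per tier is simply hours times the hourly rate.
costs = {name: hours * rate for name, hours, _gpu, rate in tiers}

for name, hours, gpu, rate in tiers:
    print(f"{name}: {hours}h on {gpu} at ${rate:.2f}/h = ${costs[name]:.2f}")
```

The exact products land close to the rounded figures in the table (~$5, ~$13, ~$60, ~$160).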

Logical Next Steps (updated 2026-05-05)

Branch A — raise halt-gate cap from 3 → 5, lower ponder cost. avg_ponder is pegging at exactly 3.00 (the current cap) in v14-B evals, which means the adaptive-compute lever is not engaged — the model is always running to the cap rather than learning to halt earlier on easy tokens. Raising the cap to 5 and reducing the ponder cost will test whether the model can learn differentiated pondering depth. This is a relatively cheap single-variable change. See plans/V14_CANDIDATES.md.
Branches C and D — episodic memory and surprise-modulated gradient. Branch C adds episodic memory at training time; Branch D modulates the loss gradient by per-token prediction surprise. Both are independent axes that can be tested after Branch A. Order and priority are documented in plans/V14_CANDIDATES.md.
v15+ research backlog. Once Branches A/C/D are resolved, plans/FUTURE_RESEARCH.md picks up at the ∇-Reasoner follow-on (zero training cost) and other v15+ candidates. The core vocab/brain architecture is now validated and coherent dialog fragments are appearing — further work should stack on this foundation rather than revisiting architecture fundamentals.
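
Why a pegged avg_ponder hides the gate's behaviour, and what Branch A's higher cap could reveal, can be shown with a toy halting loop. The source does not specify the halting rule, so this sketch assumes an ACT-style gate that stops once accumulated halt probability reaches a threshold; `ponder_steps`, `avg_ponder`, the threshold, and the gate outputs are all hypothetical, and the ponder-cost penalty is left out.

```python
def ponder_steps(halt_probs, cap, threshold=0.99):
    """Number of loop steps taken before halting, never more than `cap`.

    `halt_probs` are per-step halt-gate outputs in [0, 1]; the loop stops
    once their running sum reaches `threshold` (an assumed ACT-style rule).
    """
    total = 0.0
    for step, p in enumerate(halt_probs[:cap], start=1):
        total += p
        if total >= threshold:
            return step
    return cap  # the gate never fired, so the cap decides


def avg_ponder(batch_halt_probs, cap):
    """Mean ponder steps over a batch of tokens."""
    steps = [ponder_steps(probs, cap) for probs in batch_halt_probs]
    return sum(steps) / len(steps)


# Hypothetical gate outputs: tokens that "want" 4, 5 and 6 steps.
tokens = [[0.25] * 8, [0.2] * 8, [0.17] * 8]

print(f"cap=3: avg_ponder = {avg_ponder(tokens, cap=3):.2f}")
print(f"cap=5: avg_ponder = {avg_ponder(tokens, cap=5):.2f}")
```

With the cap at 3, every token in this toy batch runs to the cap and the average is exactly 3.00, regardless of how the gate differs between tokens. Raising the cap lets some tokens halt on their own, so differences in pondering depth become measurable.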

Recent Runs

Timestamp	Model	Status	Notes
2026-05-05	v14_branchb_learnable_vocab	LIVE · PARTIAL · Branch B confirmed	vocab_matrix_global → nn.Parameter (~5.6M learnable); brain frozen ep1, joint ep2–5; ep1 loss 4.7854 (below v13's 5.04 with brain frozen — vocab was the bottleneck); final loss 2.7059 nats; distinct PASS, dialog fragments present, not fully coherent; avg_ponder 3.00 (cap); $8.00 · cumulative ~$73
2026-05-04	v13_xent_temp30	REPLACED	INV_TEMP 10 → 30; 5 epochs; final loss 5.04 nats (floor ≈0.59, headroom 4.45); distinct PASS, avg_ponder 2.7–2.9 (near cap saturation), coherent FAIL; $6.00 · cumulative ~$57; frozen vocab identified as bottleneck
2026-05-03	v12_xent	REPLACED	loss MSE → cosine x-ent tied projection; 5 epochs; final loss 7.38 nats (floor ≈1.77, headroom 5.6); distinct PASS, dialog scaffolding partial, coherent FAIL; $6.00 · commit e281458
2026-05-02 03:55	v11_dialog_2corpus	REPLACED	vocab grew 9,509 → 14,704 confirming both corpora ingested; final loss 1.0480; eval 1/3 — distinct PASS, english partial (proper-noun leakage), coherent FAIL; avg_ponder collapsed 3.0→1.0; 11h08m · ~$6; corpus-volume hypothesis refuted
2026-04-29 08:52	v10_dialog_corpus	REPLACED	corpus bug: everyday_conversations.txt not committed — trained on dailydialog only (same as v9); vocab=9,509 confirms no expansion; effectively a v9 re-run · ~$4 wasted
2026-04-28 19:50	v9_dialog_big	REPLACED	21.5M params · 12L · 384D · 5 epochs · ~$4 · eval 1/3 criteria: distinct PASS, english mostly, coherent FAIL
2026-04-28 02:16	v8_clean	REPLACED	cleaned corpus (9,509-token vocab) · 5 epochs · loss=1.0044 · real English words but word salad
2026-04-26 01:18	v4_rope	REPLACED	1 epoch, 15 MB corpus, loss=1.1347, eval failed (see report)
2026-04-25 22:31	phase5_5b	PASSED	RoPE extrapolation: 5.0% top1 maintained at 2× and 3× train context
2026-04-25 22:12	phase5_5a	PASSED	RoPE at-distribution: grad_norm_p99 halved (0.561→0.280)
2026-04-25 21:58	phase2_loopidx	KILLED	halt gate insensitive to loop signal
2026-04-25 21:47	phase1_lti	KILLED	no grad-norm or eval gain
2026-04-25 21:13	baseline	LOCKED	eval_cosine_top1=5.0%, grad_norm_p99=0.561, reproducible
2026-04-06 09:02	large_iter4	REPLACED	Pre-RoPE model (128D · 5L · ctx=16) — kept available for rollback