Every GPU run, every bug, every iteration toward a mind
Engram is a small custom language model built from scratch — an AttentionBrain with multi-head self-attention over a sliding context window. It lives in the kent-ai-dev/engram repo. Training migrated from SaladCloud (preemption issues) to Modal.com (L4 GPU, no preemption) as of April 2026.
Active model: v14_branchb_learnable_vocab (live 2026-05-05). RoPE + Pre-LN + AdamW + cosine cross-entropy (INV_TEMP=30). Vocab 14,704 now learnable nn.Parameter (~5.6M params), 21.5M brain params. Branch B hypothesis confirmed: loss broke below v13's 5.04 plateau in epoch 1 with brain frozen — vocab geometry was the bottleneck. Final loss 2.7059 nats; output shows real English dialog fragments for the first time. avg_ponder pegged at cap=3; Branch A (raise cap) queued next. Full status and eval report: kent-ai-dev.github.io/engram/
17 runs (+ pre-history) across ~48 days. Architecture stable since v6. Corpus-volume hypothesis refuted (v11). Loss function and temperature calibration tested (v12, v13). Branch B (learnable vocab) confirmed in v14 — first dialog-shaped output. avg_ponder pegged at cap; Branch A queued.
Victorian literature corpus (Frankenstein, Dracula, The Time Machine…). ingest.py wiped ChromaDB on every run — so the server always loaded stale embeddings with similarity 0.00. High-probability tokens were Victorian vocabulary: "knight", "wretchedly", "thence". Model couldn't form a coherent sentence, let alone a conversation.
ingest.py run
Three attempts back-to-back on local Windows VPS (CPU only): full corpus (6.2 MB, 3 epochs), then 20% corpus, then 5% corpus. All timed out. CPU-only training requires 4–12 hours minimum for this architecture. Side effect: ingest.py held Ollama's process lock, blocking semantic search for 30+ hours.
Pre-history weights committed to the GitHub repo as a baseline snapshot. Vocabulary: 7,253 tokens. This became the "stale weights" reference point that all subsequent SaladCloud runs attempted to improve upon. Last known good state before the SaladCloud era began.
First attempt on SaladCloud GPU. Multiple bugs surfaced with the container environment: python:3.11-slim ships without bash (scripts failed immediately), containers crash-looped due to wrong base image, S4 uploads failed with curl exit 56, and the restart_policy was incorrectly set. Several containers launched and died before training could complete.
Bug: python:3.11-slim has no bash; wrong base image; bad restart_policy
Fix: use /bin/sh; moved to pytorch/pytorch:2.5.1 base; S4 curl fallback added; restart_policy set to "never"
Container engram-1774387064 queued with DailyDialog full corpus (6.2 MB, 3 epochs). SaladCloud API key had silently expired mid-session, blocking the launch. A new key was obtained and stored as an environment variable (not hardcoded). As of Mar 24 23:26 UTC the container was still running — no weights URL ever captured.
Several security and infrastructure fixes landed this session: hardcoded API key removed from git history, S4 upload replaced the old 0x0.st fallback, IMDS JWT auth added for SaladCloud S4. Upload succeeded — but the ntfy notification used single quotes: -d 'Training done! WEIGHTS_URL=$UPLOAD_URL'. Single quotes prevent shell variable expansion. $UPLOAD_URL was never substituted. The URL was in S4 but no one knew where.
Bug: -d arg wrapped in single quotes → $UPLOAD_URL never expanded
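One way to sidestep shell quoting altogether is to send the notification from Python, where the URL is interpolated before any shell is involved. The sketch below only builds the request and does not send it. The topic name `engram-training`, the helper name, and the example URL are hypothetical; the `https://ntfy.sh/<topic>` endpoint is assumed from ntfy's public convention of POSTing the message body to a topic URL.

```python
import urllib.request

NTFY_BASE = "https://ntfy.sh"  # assumption: the public ntfy server


def build_ntfy_request(topic: str, upload_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST whose body already contains the URL."""
    # The f-string substitutes upload_url here, so no shell expansion is needed.
    message = f"Training done! WEIGHTS_URL={upload_url}"
    return urllib.request.Request(
        f"{NTFY_BASE}/{topic}",
        data=message.encode("utf-8"),
        method="POST",
    )


req = build_ntfy_request("engram-training", "https://example.invalid/weights.pt")
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```

If the notification stays in a shell script, the equivalent fix is to wrap the `-d` argument in double quotes so `$UPLOAD_URL` is expanded.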
Container engram-1774630566 ran for ~8 hours on SaladCloud GPU. Training completed. S4 upload succeeded. But the ntfy single-quote bug from Run 5 was still present — $UPLOAD_URL never expanded in the notification body. Nobody knew the S4 URL. Container was deleted. Weights were gone.
Fix: -d arg double-quoted (commit c35bd18)
Container engram-1774677568 launched with the ntfy double-quote fix applied. Cancelled shortly after in favour of running the full-corpus run (Run 8) instead of another DailyDialog-only pass. No training loss to report.
Container engram-1774780608 ran the full corpus: 13 books + DailyDialog (~2.5 M words). SSH access established Mar 30. train_runner.py had hardcoded Windows paths (C:\Python314\python.exe) and timed out on Linux. ingest.py ran instead and produced weights — but the container had cloned the repo, which already had those same weights committed from local runs. Checksums matched exactly: no new training had occurred. Weights were manually pushed to S4.
C:\Python314\python.exe hardcoded in train_runner.py
salad_train.py now calls python3 train_runner.py instead of python ingest.py
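A portable alternative to any hardcoded interpreter path is to reuse whichever interpreter is already running the launcher. The sketch below is an illustration only: the helper names are hypothetical and it is not the actual contents of salad_train.py.

```python
import subprocess
import sys


def build_train_command(script: str = "train_runner.py") -> list[str]:
    """Return a command that runs `script` with the current interpreter."""
    # sys.executable resolves correctly on Windows and Linux alike,
    # so no platform-specific path has to be written into the source.
    return [sys.executable, script]


def run_training(script: str = "train_runner.py") -> subprocess.CompletedProcess:
    """Run the training script and raise if it exits non-zero."""
    return subprocess.run(build_train_command(script), check=True)
```

With `check=True`, a failing runner raises `subprocess.CalledProcessError` instead of letting the launcher continue with stale weights.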
Container engram-1774884158 on SaladCloud. No confirmed completion — container likely preempted. This was the last SaladCloud-era run before the platform migration.
Container engram-1775425431 launched after cleaning 4 stale containers. Ran for 8.7 hours with 3 GPU node swaps (preemption). The polling script timed out at MAX_TRAINING_WAIT (28800s). An instance was briefly running near the end but no weights were ever produced. This was the decisive failure that triggered the migration to Modal.
First successful training on Modal.com. L4 GPU (24 GB VRAM), no preemption. Full corpus (13 books + DailyDialog, ~2.5M words), 3 epochs. Training completed in ~1 hour — compared to 8+ hours of failure on SaladCloud. All 3 weight files saved to Modal Volume and deployed to the live frontend. Vocab jumped from 7,253 → 37,591 tokens.
However, model output was still word salad — repeating "dissemble", "outlandish", "xxvii". Surprise score ~1.08 (high = random). The 37K vocab was too large for 3 epochs to converge. Led to launching a 10-epoch follow-up.
Upgraded architecture to 8 layers (from 3), 256 dimensions, 5 epochs. Training completed on Modal L4 GPU. Weights downloaded and deployed. Frontend serves vocab=37,591. Output quality is still poor — model produces grammatically-structured gibberish ("drawbridges tzatziki mammiferous"). Eval score: 56.4/100 (passes threshold but semantically meaningless). Deep analysis revealed: data bottleneck (13 Gutenberg books insufficient), single-head attention, contradictory ponder objectives, no gradient clipping.
Full architecture analysis completed. Compared Engram to kent_hologram (Hyperdimensional Computing system). Identified 7 transferable techniques: surprise-gated learning, curriculum training, experience replay, output validation, ventriloquist dual-model generation, salience-weighted loss, and adaptive pondering.
Key finding: The gibberish problem is primarily a data problem. TinyStories (476M tokens, designed for small models) is the recommended unlock. Models under 10M params trained on TinyStories produce coherent stories — Engram is 18.7M params producing gibberish on Gutenberg books.
Next run plan: Fix 4 bugs (gradient clipping, ponder weight, evaluator contradiction, multi-head attention), then train on TinyStories + WikiText-2 curriculum for 20-30 epochs.
Earlier iteration prior to the OpenMythos transfer — superseded by v4_rope.
The plan. Inspired by kyegomez/OpenMythos — a Recurrent Depth Transformer reference implementation — I drafted plans/OPENMYTHOS_TRANSFER.md, a six-phase ablation-first study to port four ideas from OpenMythos into Engram: (1) LTI-stable input injection, (2) loop-index sinusoidal embeddings, (3) inference-time depth extrapolation, (4) per-loop LoRA — plus an independent track for (5) RoPE positional encoding. Every phase had numeric pass/kill criteria gated on JSON metrics from a reproducible bench harness (bench/run.py). No vibes — only loss/grad-norm/eval deltas decided whether a change shipped.
The ralph loop. To execute the plan hands-off across many sessions, I configured the ralph-loop plugin with completion-promise ENGRAM_OPENMYTHOS_TRANSFER_COMPLETE. The loop iterates phase-by-phase, running tests on Modal L4 GPUs and committing decisions back to the engram repo. Each phase that passes ships its code; each that fails is reverted before moving on. Full mechanics documented in plans/AUTONOMOUS_LOOP.md.
Results — only one of five ideas survived. LTI injection and loop-index embedding both killed (no measurable grad-norm or eval gain). Phases 3 and 4 skipped — their prerequisite chain was broken once Phases 1–2 failed. Only Phase 5 (RoPE) shipped: grad_norm_p99 halved 0.561→0.280 and the model maintained quality at 3× training context with zero quality cliff. RoPE is now locked as the default in engram_model.py with zero extra trainable parameters.
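To make the "zero extra trainable parameters" point concrete, here is a minimal, dependency-free sketch of rotary position encoding. It is not the code in engram_model.py; the base of 10000 and the adjacent-pair layout are assumptions taken from the common RoPE formulation.

```python
import math


def rope(vec, pos, base=10000.0):
    """Rotate each (even, odd) pair of `vec` by an angle proportional to `pos`."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # fixed frequency per pair, nothing learned
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# The attention score depends only on the relative offset (2 in both cases).
near = dot(rope(q, 5), rope(k, 3))
far = dot(rope(q, 12), rope(k, 10))
```

Because the score depends only on the offset between positions, the same rotation rule applies at positions beyond the training window, which is the property the 3× context result relies on.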
What's next. With the architecture transfer complete, the autonomous loop has been re-armed with completion-promise ENGRAM_COHERENT — driving an iterated training campaign (v5 → v8) toward a model that produces actual coherent dialog instead of gibberish. v5 (3 epochs, 15 MB corpus, ~$3.60 on L4) is training right now. Full report and tier-by-tier cost roadmap: kent-ai-dev.github.io/engram/.
Wrote corpus_clean.py to strip dailydialog's preprocessing artifacts: regex out digit-runs (replaced with <num>), drop tokens with mixed alphanumeric garbage, and merge low-frequency tokens (count < 3) into <unk>. Vocab dropped from 17,363 → 9,515 unique tokens (45% noise reduction). Trained v8 on this cleaned corpus (5 epochs, ~$3 on Modal L4). Final loss 1.0044 — best yet (v7: 1.0101, v6: 1.0509, v5: 1.1448).
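The three cleaning steps described above can be sketched as follows. This is not the actual corpus_clean.py: it assumes whitespace-tokenized, lowercased input, treats a digit run as a whole token, and the function name is hypothetical.

```python
import re
from collections import Counter

DIGIT_RUN = re.compile(r"^\d+$")
# Tokens that mix letters and digits, e.g. "yw132" (assumed definition of "garbage").
MIXED_ALNUM = re.compile(r"^(?=.*\d)(?=.*[a-z])[a-z0-9]+$")


def clean_tokens(tokens, min_count=3):
    """Apply the three steps: <num> substitution, mixed-token drop, <unk> merge."""
    kept = []
    for tok in tokens:
        if DIGIT_RUN.match(tok):
            kept.append("<num>")
        elif MIXED_ALNUM.match(tok):
            continue  # drop mixed alphanumeric tokens entirely
        else:
            kept.append(tok)
    counts = Counter(kept)
    # Low-frequency tokens collapse into <unk>; <num> is kept regardless of count.
    return [t if t == "<num>" or counts[t] >= min_count else "<unk>" for t in kept]


sample = "hi hi hi 236 yw132 rare".split()
cleaned = clean_tokens(sample)
```

Counting after the first two steps means the frequency threshold is applied to the already-normalized tokens.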
Major win. All 16 eval prompts produce distinct replies with only real English words. Compare v7 vs v8 sample output for "hello":
v7: "i user swear 106 236 i user doctor uncite the i bidpai bleu 236 to"
v8: "and tonight user good guy it and the z representatives i playstation complicated kiss"
The numeric IDs (236, yw132, 8826789) and placeholder gibberish (rikknen, bidpai, morrissette) are gone. Every token is now a recognizable English word.
Diagnostic also improved: cos(p1, p2) = 0.780 (v7: 0.912, v6: 0.960, v5: 1.000). Each fix has driven cosine down monotonically. Live at 108.181.97.223:5000.
What's still missing: coherent sentences. The output is a word salad of valid words. Topics drift mid-sentence (tonight → playstation → kiss in one reply). Grammar is broken; no syntactic structure. Hypothesis for v9: a bigger model (50M params, 384D, 12L) has the capacity to learn multi-word grammatical patterns instead of just word-level associations. v9 launching next, ~$13 on L4. v10 will add HuggingFaceTB/everyday-conversations-llama3.1-2k (clean Llama-generated dialogues) on top.
Plan was to commit everyday_conversations.txt and re-run to actually test corpus expansion.
Executed as v11_dialog_2corpus (trained 2026-05-02). Vocab grew 9,509 → 14,704, confirming both files were ingested.
The corpus-expansion hypothesis was tested cleanly and refuted — see v11 entry below.
v10 was intended to test the corpus-expansion hypothesis — same v9 architecture plus HuggingFaceTB/everyday-conversations-llama3.1-2k. It did not test that. The bug: corpus/everyday_conversations.txt was not committed to GitHub at launch time. Modal training clones the repo, so the file was absent in the container. ingest.py silently filtered it (line 134: corpus_files = [b for b in args.books if os.path.exists(b)]). Result: v10 trained on dailydialog_clean.txt only — identical to v9. Confirmed by vocab_size=9,509 (bit-identical to v9). $4 of Modal compute wasted on a random-seed re-run.
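A silent filter like the one quoted above can be replaced with a check that fails before any compute is spent. The sketch below is a hedged illustration; the function name is hypothetical and it is not the current ingest.py.

```python
import os


def resolve_corpus_files(paths):
    """Return `paths` unchanged, or raise if any requested file is absent."""
    missing = [p for p in paths if not os.path.exists(p)]
    if missing:
        # Failing loudly here costs nothing; a silent skip costs a training run.
        raise FileNotFoundError(f"corpus files missing: {missing}")
    return list(paths)
```

Called at the top of the ingest step, this turns an uncommitted corpus file into an immediate launch error rather than a same-vocab re-run.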
Eval still shows marginal differences from the different seed. All 16 prompts produce distinct replies (criterion 1 PASS). Mostly real English with some odd tokens like kramer, wisconsin, interpersonal (criterion 2 mostly pass). Fragments slightly more grammatical than v9 but no coherent sentences (criterion 3 FAIL). Sample outputs from eval_runs/chat_20260429_200824.json:
"what do you like to do" → "user i have candles possess of the bot situations we have the candles for a"
"tell me about yourself" → "user oh seem that is hospitable i sticker possess managers bot it basil be i"
"what do you think about love" → "i reeve that user addresses you are residing to make a bot airlines in the"
Lesson logged in plans/V11_PLAN.md: "BOOKS files must be in HEAD before launch; verify with git ls-tree HEAD corpus/." Full report: kent-ai-dev.github.io/engram/.
v10 was supposed to test the corpus-expansion hypothesis but a commit bug made it a v9 re-run (vocab stayed at 9,509 — the second file was never in the container). v11 fixed all that: both dailydialog_clean.txt and everyday_conversations.txt were committed and verified before launch. Vocab grew 9,509 → 14,704 — definitive proof both files were ingested. Subprocess timeout was bumped 27,800s → 42,000s (the v11-original failure mode), function timeout 28,800s → 43,200s, and per-epoch checkpointing was added to ingest.py as defense-in-depth. Engram N-gram module size doubled (74.4 MB → 147.6 MB) because NGRAM_TABLE_SIZE was bumped 50,021 → 100,003.
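The two timeouts only help if the inner one fires first, so the training subprocess is stopped cleanly before the outer function is killed. A minimal sketch of that nesting, using the values quoted above; the function name and return shape are assumptions, not the project's launcher.

```python
import subprocess
import sys

FUNCTION_TIMEOUT_S = 43_200    # outer limit on the whole function
SUBPROCESS_TIMEOUT_S = 42_000  # inner limit on the training subprocess


def run_with_timeout(cmd, timeout_s=SUBPROCESS_TIMEOUT_S):
    """Run `cmd`, reporting a timeout instead of being killed from outside."""
    if timeout_s >= FUNCTION_TIMEOUT_S:
        raise ValueError("subprocess timeout must be below the function timeout")
    try:
        result = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        # The child has been killed; checkpoints written so far remain usable.
        return {"status": "timeout", "returncode": None}
    status = "ok" if result.returncode == 0 else "failed"
    return {"status": status, "returncode": result.returncode}
```

The 1,200-second gap between the two limits leaves time to report the timeout and persist whatever the last per-epoch checkpoint produced.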
Loss: epoch 1: 1.1084 · epoch 2: 1.0867 · epoch 3: 1.0749 · epoch 4: ~1.057 · epoch 5: 1.0480. Comparable to v9/v10 finals (~1.05) despite 55% more vocab to discriminate — consistent with genuine improvement on a harder distribution.
Eval (16 prompts, eval_runs/chat_20260502_195749.json): Criterion 1 (distinct) PASS — 16/16. Criterion 2 (real English) partial — grammar fragments improved but everyday-conversations introduced brand/proper-noun leakage: ursa, stevia, breville, photoshop, ronaldo. Criterion 3 (coherent chitchat) FAIL — same word salad as v9/v10. Sample outputs:
"what do you like to do" → "bot i think stimulation spaceship the i can photoshop deformity the bot stimulation for a"
"tell me about yourself" → "bot bald is i think the a unknown that carmichael i can photoshop bombs the"
"do you have any friends" → "bot i don resolved the it s my unknown of the you foodborne it user"
Notable signal: avg_ponder collapsed from 3.0 (all prior models, always ran the full loop) to 1.0 — the halt gate now terminates after 1 step. Bigger vocab gives sharper predictions. Didn't translate to coherence. Corpus-volume hypothesis refuted. v9 (capacity) and v11 (corpus) both failed at the same level. The remaining hypothesis is the loss function: MSE on embeddings rewards predicting the average of plausible tokens; v12 will switch to softmax cross-entropy. Full report: kent-ai-dev.github.io/engram/.
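The claim that MSE on embeddings rewards predicting the average can be checked with a toy case. The two-dimensional "embeddings" below are made up for illustration and are not Engram vectors.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def expected_mse(pred, targets):
    """Average MSE when each target is equally likely to be the next token."""
    return sum(mse(pred, t) for t in targets) / len(targets)


# Two equally plausible next tokens with orthogonal toy embeddings.
targets = [[1.0, 0.0], [0.0, 1.0]]
midpoint = [0.5, 0.5]

loss_midpoint = expected_mse(midpoint, targets)   # commits to neither token
loss_commit = expected_mse(targets[0], targets)   # commits fully to one token
```

Here `loss_midpoint` is 0.25 and `loss_commit` is 0.5, so the loss prefers a vector that corresponds to neither token. A softmax cross-entropy over the vocabulary does not have this property, since it scores a distribution over tokens rather than a single point in embedding space.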
Single-variable swap from v11: training loss changed from MSE-on-embeddings to temperature-scaled cosine cross-entropy via tied output projection (predicted_norm @ vocab_matrix_normed.T * INV_TEMP=10). Architecture and corpus identical to v11 (12L / 384D / 12H / RoPE / Pre-LN / AdamW, 21.5M brain params, 38.7M engram params, vocab 14,704, context 32, 5.3 MB corpus). The v6–v11-era coherence_penalty MSE hack was dropped. vocab_matrix_global refreshed from the drifting embed_cache each epoch. Trained 2026-05-03 on Modal L4, $6.00.
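The loss described above can be written out without any framework. This is a dependency-free sketch of the formula for a single prediction, assuming `predicted` is the brain's output vector and each row of `vocab_matrix` is one token embedding; the real training code operates on batched tensors.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_cross_entropy(predicted, vocab_matrix, target_index, inv_temp=10.0):
    """Cross-entropy over logits = inv_temp * cosine(predicted, each vocab row)."""
    p = l2_normalize(predicted)
    logits = [
        inv_temp * sum(a * b for a, b in zip(p, l2_normalize(row)))
        for row in vocab_matrix
    ]
    # Numerically stable log-sum-exp.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_index]


vocab = [[1.0, 0.0], [0.0, 1.0]]
loss_t10 = cosine_cross_entropy([1.0, 0.0], vocab, target_index=0, inv_temp=10.0)
loss_t30 = cosine_cross_entropy([1.0, 0.0], vocab, target_index=0, inv_temp=30.0)
```

Because cosine similarity is bounded, the logits span at most ±`inv_temp`, so the temperature controls how sharply a correct prediction can separate itself from the rest of the vocabulary.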
Loss (cross-entropy — not comparable to v11's MSE 1.05): epoch 1: 8.16 · epoch 2: 8.00 · epoch 3: 7.74 · epoch 4: 7.54 · epoch 5: 7.38. Steady ~−0.2 nats/epoch — training is functioning. Theoretical floor at INV_TEMP=10: ≈1.77 nats. The model has 5.6 nats of headroom and isn't closing it.
Eval (eval_runs/chat_20260503_165046.json): Model is not collapsed — replies differ per prompt. Dialog scaffolding tokens appear (user, bot, i, you; occasional bigrams like "i'm not going", "i ll go to"). But no coherent sentences. Sample outputs:
"hello" → "user do fragrance that are of be curb cortina what is it typical good the"
"hi how are you" → "and me cgi slim a and to course unnecessary tailors i think illness but you"
"tell me about yourself" → "user yes it lasts you are not sure dump rica bot you vegetables are the"
"what is the capital of france" → "and concentrate bot tools i have a unk sara grandfather is my canvas and i"
Assessment: the loss-function hypothesis is partly supported. Cross-entropy trains cleanly and dialog-turn structure is visible in output — progress over raw MSE word-salad. The bottleneck is now temperature calibration: at INV_TEMP=10 the softmax is too flat to force lexical commitment. v13 queued: bump INV_TEMP to 30+. Theoretical floor at INV_TEMP=30 is ≈0.59 nats — 3× sharper gradient signal per step. Full report: kent-ai-dev.github.io/engram/.
e281458
Single architectural change from v13: vocab_matrix_global promoted from a frozen buffer to an nn.Parameter (~5.6M trainable parameters), trained via AdamW with lr=EMBED_LR/2=2.5e-4. Brain warm-started from v13 final and frozen for epoch 1 so the vocab could adjust against fixed brain predictions, then unfrozen for epochs 2–5 (joint training). Architecture otherwise identical: 12L / 384D / 12H / RoPE / Pre-LN / AdamW, vocab 14,704, context 32, 5.3 MB corpus. Trained 2026-05-05 on Modal L4, $8.00.
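The freeze-then-unfreeze schedule can be summarized as a per-epoch plan. This is a framework-free sketch with hypothetical names; `EMBED_LR = 5e-4` is inferred from the stated lr = EMBED_LR/2 = 2.5e-4 and is otherwise an assumption. In a PyTorch loop the `brain_trainable` flag would typically be applied by toggling `requires_grad` on the brain's parameters.

```python
EMBED_LR = 5e-4  # assumption: implied by lr = EMBED_LR / 2 = 2.5e-4


def epoch_plan(epoch, frozen_brain_epochs=1):
    """Describe which components receive gradient updates in a given epoch."""
    return {
        "vocab_trainable": True,                        # vocab learns in every epoch
        "vocab_lr": EMBED_LR / 2,
        "brain_trainable": epoch > frozen_brain_epochs,  # frozen for epoch 1 only
    }


plans = {epoch: epoch_plan(epoch) for epoch in range(1, 6)}
```

Keeping the brain frozen for the first epoch is what makes the epoch 1 loss a clean ablation: any improvement in that epoch can only come from the vocabulary moving.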
Per-epoch loss (the key evidence): v13 final plateau 5.04 nats. Epoch 1 end with brain frozen: 4.7854 — already 0.25 nats below v13 with the brain doing nothing. Epoch 2 (joint): 4.0939. Epoch 3: 3.5592. Epoch 4: 3.0563. Epoch 5 final: 2.7059 nats (2.34 nats below v13; ~2.12 nats above the theoretical floor at INV_TEMP=30 of ≈0.59).
Smoking gun. Loss breaking below v13's plateau in epoch 1 with the brain still frozen is the clean ablation: vocab geometry was the bottleneck. The sentence-transformer initialization optimizes for semantic similarity ("cat" near "dog"), not syntactic prediction ("cat" followed by "is"). Letting the vocab move under the cross-entropy gradient unlocked the descent. This is one of engram's core architectural bets — vocab/brain as separable components — and Branch B is the first run to exercise the "learnable" half under cross-entropy training.
Eval: PARTIAL — qualitatively much improved. Output produces real English fragments and recognizable scheduling/conversational patterns from the dailydialog corpus. Sample outputs: "hi how about tomorrow coming to account on friday", "okay how about one week user what time bot at ten to p after", "ten minutes walk", "i love you very long". Compare v13's token-soup: "today we who had selling how me user". Not yet fully coherent multi-turn dialogue, but dialog-shaped text is appearing for the first time.
Remaining signal: avg_ponder pegged at 3.00 (the cap). The adaptive-compute lever is not engaged — the model always runs to the cap rather than learning to halt early. Branch A (raise cap from 3 → 5, lower ponder cost) queued next. Branches C (episodic memory at training time) and D (surprise-modulated gradient) follow. Full report: kent-ai-dev.github.io/engram/.
Single-variable swap from v12: INV_TEMPERATURE raised from 10 to 30 (predicted_norm @ vocab_matrix_normed.T * INV_TEMP=30). Architecture, corpus, and loss function are otherwise identical to v12 (12L / 384D / 12H / RoPE / Pre-LN / AdamW, 21.5M brain params, 38.7M engram params, vocab 14,704, context 32, 5.3 MB corpus). Trained 2026-05-04 on Modal L4, $6.00.
Loss: final 5.04 nats vs v12's 7.38 — a 2.34 nat absolute reduction. Theoretical floor at INV_TEMP=30 is ≈0.59 nats (dropped 1.18 nats from v12's ≈1.77). Net headroom-above-floor narrowed from 5.6 to 4.45 nats — a real signal improvement, but the model is still 4.45 nats above its floor. Convergence was healthy.
Eval: PARTIAL/FAIL. Replies differ per prompt (distinct: PASS). avg_ponder sat at 2.7–2.9 throughout eval, indicating the halt gate is running near its cap of 3 — the model is almost always exhausting its pondering budget before halting. Output is token-soup — not English sentences. No improvement in coherence versus v12.
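For readers unfamiliar with a capped halt gate, the loop below is a generic sketch of adaptive pondering: repeat a reasoning step until a halt score crosses a threshold or the cap is reached. The function names, the 0.5 threshold, and the scalar state are illustrative assumptions, not Engram's implementation.

```python
def ponder(state, step_fn, halt_fn, cap=3, threshold=0.5):
    """Apply step_fn until halt_fn(state) >= threshold or `cap` steps are used."""
    steps = 0
    while steps < cap:
        state = step_fn(state)
        steps += 1
        if halt_fn(state) >= threshold:
            break
    return state, steps


def avg_ponder(step_counts):
    """Mean number of steps used across a batch of predictions."""
    return sum(step_counts) / len(step_counts)


# A gate that never fires always exhausts the budget, so avg_ponder equals the cap.
never_halts = [ponder(0, lambda s: s + 1, lambda s: 0.0)[1] for _ in range(4)]
# A gate that fires immediately uses exactly one step.
halts_early = [ponder(0, lambda s: s + 1, lambda s: 1.0)[1] for _ in range(4)]
```

An avg_ponder that sits at the cap therefore cannot distinguish "the model needs more steps" from "the gate never learned to fire", which is what raising the cap is meant to separate.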
Diagnosis. The brain can reduce loss via a sharper temperature signal, but its predicted concept vectors are not landing on coherent ChromaDB tokens. The vocab embeddings are frozen — initialized once from the teacher and never updated during training. A more confident brain pointing in random directions of a frozen vocab space produces no better text. The sharpened softmax is demanding sharper discrimination from a target that was never trained to receive it.
Next: v14. Branch B makes the ChromaDB vocab a learnable nn.Parameter so brain predictions and token space can co-adapt. Branch A lifts the halt-gate cap above 3 to test whether the saturated avg_ponder is an independent bottleneck. Decision tree in plans/V14_CANDIDATES.md; v15+ research backlog in plans/FUTURE_RESEARCH.md. Full report: kent-ai-dev.github.io/engram/.
INV_TEMP=30 (was 10 in v12)
Trained the biggest Engram model yet: 384 embed dim / 12 layers / 12 heads / head_dim=32 / RoPE / Pre-LN / AdamW — 21.5M brain params (v8: ~6M, v6: ~6M). Same cleaned dailydialog corpus (9,509-token vocab). 5 epochs on Modal L4, ~$4. Trained 2026-04-28 12:32 UTC, live as of 2026-04-28 19:50 UTC. Full report at kent-ai-dev.github.io/engram/.
Eval results (eval_chat.py, 16 prompts): Criterion 1 (distinct outputs) PASS — 16/16 distinct. Criterion 2 (real English tokens) mostly pass — real words but with odd leaks (ikebana, milliken, tornados, banquet). Criterion 3 (coherent chitchat) FAIL — still word salad. Sample outputs:
"what do you like to do" → "you go to banquet present the bot smoke and starting i ll failed to the"
"tell me about yourself" → "user reducing that happening slip i m it evil banquet bot you cortex it s"
"how was your day" → "user reducing the me that i m dishwashing hi evil bot you festival it s"
Diagnosis: the v8 hypothesis ("bigger model has capacity to learn grammar") is refuted. Fragments emerge (i am, i m, hi, i have a really) but no coherent sentences. The bottleneck is the corpus — not the model. Next: v10 with a larger, more varied corpus.
eval_runs/chat_20260428_194533.json
Tested the data hypothesis from v6's diagnosis: with the architecture bug fixed (Pre-LN + AdamW), would dropping the 19th-century novels give us coherent dialog? Trained v7 on corpus/dailydialog.txt only (6.1 MB conversational, 5 epochs, ~$3 on Modal L4). Final loss 1.0101 — best yet (v6: 1.0509, v5: 1.1448). Vocab dropped from 38,062 → 18,187 concepts (the novels are gone).
Diagnostic shows further improvement. cos(p1, p2) = 0.912 (was 0.960 in v6, 1.000 in v5). ||p1 − p2|| = 4.51 (was 0.43 in v6) — 10× more output differentiation. The model conditions on input even more strongly with the focused corpus.
But output is still gibberish. Sample replies: "i user swear 106 236 i user doctor uncite the i bidpai bleu 236 to", "i rikknen laid uptight lock for i to rikknen batwoman and yw132 expanded". The Moby-Dick vocab is gone — replaced by data artifacts in dailydialog: numeric IDs (236, 106, 1205, yw132, 8826789), placeholder tokens (rikknen, bidpai), rare proper nouns (morrissette, pavarotti). Common English IS in there (i, user, you, the, to, and, lending, doctor, good) but drowned by noise tokens.
Diagnosis: dailydialog has preprocessing residue. Three options for v8: (a) clean the corpus first (regex out numeric IDs + low-frequency tokens), (b) bigger model — 50M params, more capacity to ignore noise, (c) different conversational corpus (Persona-Chat / OpenAssistant). Each architectural fix has worked exactly as predicted; quality has improved monotonically (v5 frozen → v6 conditions → v7 differentiates more) — but we keep finding new bottlenecks. Live at 108.181.97.223:5000.
After v5's diagnostic localized the bug to block 0 of the trained model (attention softmax collapsed to near-uniform max=0.033, FF norm exploded to 360 vs residual 16), implemented two fixes: (1) switched AttentionBlock from post-LN to pre-LN architecture (LayerNorm before each sublayer instead of after the residual), and (2) switched optimizer from Adam to AdamW with weight_decay=0.01. Trained v6 on Modal L4 (3 epochs, 15 MB local corpus, ~$3.60). Final loss 1.0509 (vs v5's 1.1448).
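The difference between the two block layouts comes down to where the normalization sits relative to the residual add. The sketch below uses plain lists and a LayerNorm without learned gain or bias, which is a simplification; `sublayer` stands in for attention or the feed-forward network.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Post-LN: normalize after the residual add, so the residual is rescaled too."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Pre-LN: normalize only the sublayer input; the residual path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


x = [1.0, 2.0, 3.0]
zero_sublayer = lambda h: [0.0] * len(h)
pre_out = pre_ln_block(x, zero_sublayer)    # residual passes through unchanged
post_out = post_ln_block(x, zero_sublayer)  # residual is re-normalized
```

In the pre-LN form the input always has an unmodified path to the output, which is the usual explanation for its more stable training in deeper stacks.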
Diagnostic confirms the fix. Same input/test as v5: feed three completely different prompts, compare model output. Result: cosine(p1, p2) = 0.960 (was 1.000), ||p1 − p2|| = 0.43 (was 1.6 × 10⁻⁷). Model now genuinely conditions on input. Avg ponder steps: 3.0 (was 1.0; the model now uses full reasoning depth). All 16 eval prompts produce distinct replies.
But output is still gibberish-quality. Sample replies: "the judo my 1021 fun of the my tertiary fun and the adapt bot is" / "ichthyosaurus refurbished crossing 1021..." / "the user recur necessitate rotate at i blackjack...". Heavy contamination from 19th-century novel vocab (Moby Dick, Dracula, Pride and Prejudice; ~60% of the 15 MB training corpus). The architecture is healthy now; the bottleneck has shifted from "model can't condition" to "model trained on the wrong data".
Loop halted for user decision. Two paths forward: v7 dailydialog-only (~$2 — drop the novels, train just on conversational corpus, test data hypothesis) or v8 bigger model (~$13 — 50M params, more capacity to filter noise). Recommend v7 first since it's the cheapest informative experiment and addresses the actual diagnosis.
Trained v5 (3 epochs, 15 MB local corpus, ~$3.60 on Modal L4). Final loss 1.1448 — converged to a clear plateau (epoch 1 end 1.1564 → epoch 2 end 1.1463 → epoch 3 end 1.1448). Server live at 108.181.97.223:5000. Then ran eval_chat.py against 16 prompts.
The bug. Output is gibberish and, more importantly, identical gibberish for every prompt. Diagnostic check (the "5-minute sanity test" promised in the previous report): feed three completely different prompts to the model, dump the raw concept prediction. Result: cos(p1, p2) = 1.000000, ||p1 − p2|| = 1.6 × 10⁻⁷. The model is producing a near-constant output. The top-5 nearest vocab words to that constant: the · i · bot · a · and (i.e., the highest-frequency tokens). The model has learned "predict the average direction of common words" and stopped there.
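The sanity test above reduces to two numbers per pair of predictions. A minimal sketch of the comparison, assuming each prediction is available as a flat list of floats; the helper names and the 0.999 collapse threshold are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def l2_distance(a, b):
    """Euclidean distance ||a - b||."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def looks_collapsed(p1, p2, cos_threshold=0.999):
    """Flag a pair of predictions that are effectively the same direction."""
    return cosine(p1, p2) >= cos_threshold


# Toy vectors standing in for concept predictions from two different prompts.
same = looks_collapsed([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = looks_collapsed([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Running this on predictions from unrelated prompts is cheap enough to do before every eval: a cosine near 1 with a distance near 0 means the model is ignoring its input.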
Why this is structural, not undertraining. Both v4 (1 epoch) and v5 (3 epochs) show the exact same failure mode. Loss has converged; more epochs only move it 0.01. Embeddings are correctly distinct (verified). The model genuinely isn't conditioning on input — likely the residual + LayerNorm path dominates over the attention contribution, so position 31's representation gets pinned regardless of context. Spending $13 on a bigger v6 or $160 on a 100M-param v8 would not fix a structural bug; it would just produce a bigger constant.
Loop halted. The autonomous training loop has been paused (.claude/ralph-loop.local.md set to active: false). Next steps need code work, not compute: investigate why attention isn't propagating prompt-word information through to position 31, possibly fix the residual scaling or the post-RoPE attention mask, then re-run v5 with the fix before resuming the cost ladder.
OpenMythos architecture transfer complete — 6 phases of ablation, only RoPE shipped (grad_norm_p99 halved 0.561→0.280; perfect 3× context extrapolation). Trained on Modal L4 for 1 epoch / 15 MB local corpus, final loss=1.1347. Server is live at http://108.181.97.223:5000 with the new model. Eval shows the model produces near-identical repetitive sequences ("resigning slighted defilements impressiveness…") regardless of prompt — undertrained, not broken. Full status report: kent-ai-dev.github.io/engram/
Phase results: P0 reproducibility ✅ · P1 LTI killed · P2 loop-idx killed · P3/P4 skipped · P5 RoPE ✅ shipped · P6 lock ✅
Next: sanity-check eval (free), then v5 (5 epochs · ~$5 · ~6h L4) to confirm whether undertraining is the bottleneck before scaling to bigger model (v6 ~$13).