Engram Training Log

Pre-History

Mar 18–20, 2026

❌ Word Salad

Victorian literature corpus (Frankenstein, Dracula, The Time Machine…). ingest.py wiped ChromaDB on every run — so the server always loaded stale embeddings with similarity 0.00. High-probability tokens were Victorian vocabulary: "knight", "wretchedly", "thence". Model couldn't form a coherent sentence, let alone a conversation.

Root Cause ChromaDB wiped on each ingest.py run

Fix Applied Killed Victorian corpus. Switched to DailyDialog (12,118 real conversations, 6.2 MB)

Run 1

Mar 20–21, 2026

❌ CPU Timeout Chain

Three attempts back-to-back on local Windows VPS (CPU only): full corpus (6.2 MB, 3 epochs), then 20% corpus, then 5% corpus. All timed out. CPU-only training requires 4–12 hours minimum for this architecture. Side effect: ingest.py held Ollama's process lock, blocking semantic search for 30+ hours.

Platform Local Windows VPS — CPU only

Root Cause No GPU on local hardware; Ollama blocked for 30 h

Fix Applied GPU required. All future training moved to SaladCloud exclusively

Run 2 — Baseline Commit

~Mar 21, 2026

⚠️ Baseline Only

Pre-history weights committed to the GitHub repo as a baseline snapshot. Vocabulary: 7,253 tokens. This became the "stale weights" reference point that all subsequent SaladCloud runs attempted to improve upon. Last known good state before the SaladCloud era began.

Vocab 7,253 tokens

Platform Local Windows VPS

Status Weights in repo — pre-SaladCloud era baseline

Run 3 — First SaladCloud

Mar 22, 2026

❌ Container Instability

First attempt on SaladCloud GPU. Multiple bugs surfaced with the container environment: python:3.11-slim ships without bash (scripts failed immediately), containers crash-looped due to wrong base image, S4 uploads failed with curl exit 56, and the restart_policy was incorrectly set. Several containers launched and died before training could complete.

Root Causes python:3.11-slim has no bash; wrong base image; bad restart_policy

Fixes Applied Switched to /bin/sh; moved to pytorch/pytorch:2.5.1 base; S4 curl fallback added; restart_policy set to "never"

Run 4

Mar 24, 2026

❌ API Key Expired

Container engram-1774387064 queued with DailyDialog full corpus (6.2 MB, 3 epochs). SaladCloud API key had silently expired mid-session, blocking the launch. A new key was obtained and stored as an environment variable (not hardcoded). As of Mar 24 23:26 UTC the container was still running — no weights URL ever captured.

Container engram-1774387064

Root Cause SaladCloud API key expired mid-session

Fix Applied New API key obtained; moved to env var, not hardcoded in source

Run 5

Mar 25, 2026

❌ ntfy Variable Bug

Several security and infrastructure fixes landed this session: hardcoded API key removed from git history, S4 upload replaced the old 0x0.st fallback, IMDS JWT auth added for SaladCloud S4. Upload succeeded — but the ntfy notification used single quotes: -d 'Training done! WEIGHTS_URL=$UPLOAD_URL'. Single quotes prevent shell variable expansion. $UPLOAD_URL was never substituted. The URL was in S4 but no one knew where.

Root Cause ntfy -d arg wrapped in single quotes → $UPLOAD_URL never expanded

Fix Applied Identified but not committed during this run; the fix landed after Run 6 (commit c35bd18) and first shipped in Run 7
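One way to sidestep shell quoting altogether is to build the notification body in Python. This is a minimal sketch, not the script the runs actually used: it assumes ntfy's HTTP publish interface (POST the message body to the topic URL), and the topic name and both helper names are hypothetical.

```python
import urllib.request


def build_notification(upload_url: str) -> bytes:
    """Format the message in Python, so no shell quoting rules apply."""
    if not upload_url:
        # Fail loudly rather than send a notification with no URL in it.
        raise ValueError("upload_url is empty; refusing to notify")
    return f"Training done! WEIGHTS_URL={upload_url}".encode("utf-8")


def build_request(upload_url: str, topic_url: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST that publishes to an ntfy topic."""
    return urllib.request.Request(
        topic_url, data=build_notification(upload_url), method="POST"
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = build_request(
        "https://example.invalid/weights.pth",
        "https://ntfy.sh/engram-training",
    )
    print(req.data.decode())
    # urllib.request.urlopen(req, timeout=10)  # uncomment to actually send
```

Refusing to send when the URL is empty means a failed upload cannot masquerade as a successful notification.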

Run 6

Mar 28, 2026 — ~01:00 UTC

❌ Weights Trained, URL Lost

Container engram-1774630566 ran for ~8 hours on SaladCloud GPU. Training completed. S4 upload succeeded. But the ntfy single-quote bug from Run 5 was still present — $UPLOAD_URL never expanded in the notification body. Nobody knew the S4 URL. Container was deleted. Weights were gone.

Container engram-1774630566

Runtime ~8 hours on GPU

Root Cause ntfy single-quote bug — S4 URL never surfaced

Fix Applied Single quotes → double quotes around ntfy -d arg (commit c35bd18)

Run 7

Mar 28, 2026 — 05:59 UTC

❌ Cancelled

Container engram-1774677568 launched with the ntfy double-quote fix applied. Cancelled shortly after in favour of running the full-corpus run (Run 8) instead of another DailyDialog-only pass. No training loss to report.

Container engram-1774677568

Reason Superseded by full corpus run (13 books + DailyDialog)

Run 8 — Full Corpus

Mar 28–30, 2026

❌ Trained but Not New

Container engram-1774780608 ran the full corpus: 13 books + DailyDialog (~2.5 M words). SSH access established Mar 30. train_runner.py had hardcoded Windows paths (C:\Python314\python.exe) and timed out on Linux. ingest.py ran instead and produced weights — but the container had cloned the repo, which already had those same weights committed from local runs. Checksums matched exactly: no new training had occurred. Weights were manually pushed to S4.
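The "checksums matched" finding suggests a guard worth automating: snapshot the weight checksums before training and refuse to upload if nothing changed. A minimal sketch, assuming the weights are `*.pth` files in one directory; the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(weights_dir: Path) -> dict[str, str]:
    """Map each weight file name to its checksum (take this before training)."""
    return {p.name: sha256_of(p) for p in sorted(weights_dir.glob("*.pth"))}


def weights_changed(before: dict[str, str], weights_dir: Path) -> list[str]:
    """Return names of files that are new or differ from the snapshot."""
    return [
        name
        for name, checksum in snapshot(weights_dir).items()
        if before.get(name) != checksum
    ]
```

An empty list from `weights_changed` after a run is the signal that the container only reproduced what was already committed.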

Container engram-1774780608

Corpus 13 books + DailyDialog (~2.5 M words)

Root Cause Windows path C:\Python314\python.exe hardcoded in train_runner.py

S4 Uploads baseline (553 KB), target_small_iter4 (4.2 MB), large_iter4 (3.5 MB), vocab_embeddings (9.7 MB)

Fix Applied salad_train.py now calls python3 train_runner.py instead of python ingest.py

Model State vocab=7,253 · 13 books · 5 iterations (from local)
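A portable alternative to any hardcoded interpreter path is to reuse whichever interpreter is already running. This is a sketch under that assumption, not the contents of salad_train.py; `runner_command` and `run_training` are hypothetical names.

```python
import subprocess
import sys
from pathlib import Path


def runner_command(script: str = "train_runner.py") -> list[str]:
    """Use the interpreter that is already running instead of a fixed path."""
    return [sys.executable, str(Path(script))]


def run_training(timeout_s: float) -> int:
    # check=True raises CalledProcessError on a non-zero exit status, and
    # timeout raises TimeoutExpired instead of hanging indefinitely.
    completed = subprocess.run(runner_command(), check=True, timeout=timeout_s)
    return completed.returncode
```

`sys.executable` resolves correctly on both Windows and Linux, so the same launcher works locally and in the container.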

Run 9 — SaladCloud (stale)

Mar 30, 2026 — 15:22 UTC

❌ Unknown

Container engram-1774884158 on SaladCloud. No confirmed completion; the container was likely preempted. One more SaladCloud attempt (Run 10) followed before the platform migration.

Platform SaladCloud (deprecated)

Outcome Unknown — no weights captured, container deleted

Run 10 — SaladCloud Timeout (final SaladCloud run)

Apr 5, 2026 — 21:44 UTC

❌ Timeout

Container engram-1775425431 launched after cleaning 4 stale containers. Ran for 8.7 hours with 3 GPU node swaps (preemption). The polling script timed out at MAX_TRAINING_WAIT (28800s). An instance was briefly running near the end but no weights were ever produced. This was the decisive failure that triggered the migration to Modal.

Platform SaladCloud (RTX 3060)

Duration 8.7 hours (timeout)

Node swaps 3 (GPU preemption)

Decision Migrate to Modal.com — no preemption, reserved GPU
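The polling behaviour described for this run can be sketched as a deadline loop. This is an illustration only: `MAX_TRAINING_WAIT` and its value come from the log, while `wait_for_weights`, the `check` callback, and the 60-second interval are assumptions.

```python
import time
from typing import Callable, Optional

MAX_TRAINING_WAIT = 28_800  # seconds (8 hours), the value quoted in this entry


def wait_for_weights(
    check: Callable[[], Optional[str]],
    max_wait_s: float = MAX_TRAINING_WAIT,
    poll_every_s: float = 60.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Poll check() until it returns a weights URL or the deadline passes."""
    deadline = clock() + max_wait_s
    while True:
        url = check()
        if url is not None:
            return url
        remaining = deadline - clock()
        if remaining <= 0:
            return None  # timed out; the caller decides whether to tear down
        # Never sleep past the deadline.
        sleep(min(poll_every_s, remaining))
```

Injecting `clock` and `sleep` keeps the loop testable without waiting in real time. A monotonic clock also means wall-clock adjustments cannot stretch or shorten the deadline.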

Run 11 — Modal 3-Epoch (first Modal success)

Apr 6, 2026 — 08:00–09:02 UTC

✅ Success

First successful training on Modal.com. L4 GPU (24 GB VRAM), no preemption. Full corpus (13 books + DailyDialog, ~2.5M words), 3 epochs. Training completed in ~1 hour — compared to 8+ hours of failure on SaladCloud. All 3 weight files saved to Modal Volume and deployed to the live frontend. Vocab jumped from 7,253 → 37,591 tokens.

However, model output was still word salad — repeating "dissemble", "outlandish", "xxvii". Surprise score ~1.08 (high = random). The 37K vocab was too large for 3 epochs to converge. Led to launching a 10-epoch follow-up.

Platform Modal.com (L4 GPU, 24 GB VRAM)

Duration ~1 hour

Epochs 3

Final loss 0.9751

Vocab 37,591 tokens (was 7,253)

Weights engram_weights.pth (1.6 MB), engram_memory_module.pth (18.4 MB), engram_word_to_id.pth (0.7 MB)

Deployed Yes — frontend at 108.181.97.223:5000 serving vocab=37,591

Cost ~$0.80 (L4 at $0.80/hr × 1h)

Run 12 — Modal 5-Epoch (8-layer upgrade)

Apr 6, 2026 — 09:34 UTC

✅ Complete

Upgraded architecture to 8 layers (from 3), 256 dimensions, 5 epochs. Training completed on Modal L4 GPU. Weights downloaded and deployed. Frontend serves vocab=37,591. Output quality is still poor — model produces grammatically-structured gibberish ("drawbridges tzatziki mammiferous"). Eval score: 56.4/100 (passes threshold but semantically meaningless). Deep analysis revealed: data bottleneck (13 Gutenberg books insufficient), single-head attention, contradictory ponder objectives, no gradient clipping.

Platform Modal.com (L4 GPU)

Epochs 5 (8-layer, 256-dim)

Timeout ~2 hours

Outcome Complete: deployed, but output quality needs data and architecture fixes

Deep Research — Architecture Review & Training Plan

Apr 8, 2026

🔬 Research

Full architecture analysis completed. Compared Engram to kent_hologram (Hyperdimensional Computing system). Identified 7 transferable techniques: surprise-gated learning, curriculum training, experience replay, output validation, ventriloquist dual-model generation, salience-weighted loss, and adaptive pondering.

Key finding: The gibberish problem is primarily a data problem. TinyStories (476M tokens, designed for small models) is the recommended unlock. Models under 10M params trained on TinyStories produce coherent stories — Engram is 18.7M params producing gibberish on Gutenberg books.

Next run plan: Fix 4 bugs (gradient clipping, ponder weight, evaluator contradiction, multi-head attention), then train on TinyStories + WikiText-2 curriculum for 20-30 epochs.
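Of the four fixes, gradient clipping is the easiest to show in isolation. The sketch below clips by global L2 norm in plain Python to make the arithmetic visible. In a PyTorch training loop the usual route is `torch.nn.utils.clip_grad_norm_`; the log does not state which threshold was chosen, so `max_norm` here is an assumption.

```python
import math


def clip_by_global_norm(grads: list[list[float]], max_norm: float) -> float:
    """Scale all gradients in place so their joint L2 norm is <= max_norm.

    Returns the norm measured before clipping, which is useful to log.
    """
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    if total > max_norm:
        scale = max_norm / total
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total


if __name__ == "__main__":
    grads = [[3.0], [4.0]]  # toy gradients for two parameters
    before = clip_by_global_norm(grads, max_norm=1.0)  # max_norm is an assumption
    print(before, grads)
```

Because every gradient is scaled by the same factor, the update direction is preserved and only its size is capped.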

Ralph Iteration 0 (historical)

April 22, 2026

📜 Archived

Earlier iteration prior to the OpenMythos transfer — superseded by v4_rope.

Iteration 0

Architecture 256dim × 8layer × 32ctx, LR=0.001

Evaluation N/A

Status Superseded by v4_rope (2026-04-26)

Plan & Implementation — OpenMythos Architecture Transfer + Autonomous Ralph Loop

April 25-26, 2026

📋 Plan

The plan. Inspired by kyegomez/OpenMythos — a Recurrent Depth Transformer reference implementation — I drafted plans/OPENMYTHOS_TRANSFER.md, a six-phase ablation-first study to port four ideas from OpenMythos into Engram: (1) LTI-stable input injection, (2) loop-index sinusoidal embeddings, (3) inference-time depth extrapolation, (4) per-loop LoRA — plus an independent track for (5) RoPE positional encoding. Every phase had numeric pass/kill criteria gated on JSON metrics from a reproducible bench harness (bench/run.py). No vibes — only loss/grad-norm/eval deltas decided whether a change shipped.
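A pass/kill gate of the kind described above might look like the following. This is a hedged sketch, not bench/run.py: the `decide` function, the lower-is-better convention, and the 10% threshold are all assumptions. The metric name and values are the RoPE figures reported in this entry.

```python
import json


def decide(baseline_json: str, candidate_json: str,
           metric: str, min_relative_gain: float) -> str:
    """Return "ship" if the candidate lowers `metric` enough, else "kill".

    Assumes lower is better for the metric (true for loss and grad norm).
    """
    baseline = json.loads(baseline_json)[metric]
    candidate = json.loads(candidate_json)[metric]
    relative_gain = (baseline - candidate) / baseline
    return "ship" if relative_gain >= min_relative_gain else "kill"


if __name__ == "__main__":
    verdict = decide(
        '{"grad_norm_p99": 0.561}',
        '{"grad_norm_p99": 0.280}',
        metric="grad_norm_p99",
        min_relative_gain=0.10,  # hypothetical threshold
    )
    print(verdict)
```

Fixing the threshold before the run is what keeps the decision mechanical: the phase either clears the number or is reverted.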

The ralph loop. To execute the plan hands-off across many sessions, I configured the ralph-loop plugin with completion-promise ENGRAM_OPENMYTHOS_TRANSFER_COMPLETE. The loop iterates phase-by-phase, running tests on Modal L4 GPUs and committing decisions back to the engram repo. Each phase that passes ships its code; each that fails is reverted before moving on. Full mechanics documented in plans/AUTONOMOUS_LOOP.md.

Results — only one of five ideas survived. LTI injection and loop-index embedding both killed (no measurable grad-norm or eval gain). Phases 3 and 4 skipped — their prerequisite chain was broken once Phases 1–2 failed. Only Phase 5 (RoPE) shipped: grad_norm_p99 halved 0.561→0.280 and the model maintained quality at 3× training context with zero quality cliff. RoPE is now locked as the default in engram_model.py with zero extra trainable parameters.
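For readers unfamiliar with RoPE, the sketch below shows why it adds no trainable parameters and why it can tolerate longer contexts: each pair of dimensions is rotated by a fixed, position-dependent angle, so the query-key dot product depends only on the relative offset. This is the textbook formulation with the conventional base of 10000, written as an assumption about how engram_model.py applies it.

```python
import math


def rope(vec: list[float], position: int, base: float = 10000.0) -> list[float]:
    """Rotate consecutive pairs of `vec` by position-dependent angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("RoPE needs an even dimension")
    out = []
    for i in range(0, dim, 2):
        # i is the element index of the pair, so the exponent is -2*pair/dim.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    q = [0.3, -1.2, 0.7, 2.0]
    k = [1.1, 0.4, -0.5, 0.9]
    # Both pairs of positions are 2 apart, so the scores agree up to rounding.
    print(dot(rope(q, 5), rope(k, 3)), dot(rope(q, 12), rope(k, 10)))
```

The rotation angles are computed from the position alone, which is why nothing here needs to be learned or stored.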

What's next. With the architecture transfer complete, the autonomous loop has been re-armed with completion-promise ENGRAM_COHERENT — driving an iterated training campaign (v5 → v8) toward a model that produces actual coherent dialog instead of gibberish. v5 (3 epochs, 15 MB corpus, ~$3.60 on L4) is training right now. Full report and tier-by-tier cost roadmap: kent-ai-dev.github.io/engram/.

v8_clean: Real English Words — Corpus Cleanup Worked

April 28, 2026

✅ Live · English Words

Wrote corpus_clean.py to strip dailydialog's preprocessing artifacts: regex out digit-runs (replaced with <num>), drop tokens with mixed alphanumeric garbage, and merge low-frequency tokens (count < 3) into <unk>. Vocab dropped from 17,363 → 9,515 unique tokens (45% noise reduction). Trained v8 on this cleaned corpus (5 epochs, ~$3 on Modal L4). Final loss 1.0044 — best yet (v7: 1.0101, v6: 1.0509, v5: 1.1448).
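The three cleaning rules can be sketched as follows. This is an illustration of the rules as stated, not the contents of corpus_clean.py: the regexes, lowercasing, token-level input, and function name are assumptions.

```python
import re
from collections import Counter

DIGIT_RUN = re.compile(r"^\d+$")
# Tokens that mix letters and digits, such as "yw132".
MIXED_ALNUM = re.compile(r"^(?=.*\d)(?=.*[a-z])[a-z0-9]+$")
SPECIAL = {"<num>", "<unk>"}


def clean_tokens(tokens: list[str], min_count: int = 3) -> list[str]:
    kept = []
    for tok in tokens:
        tok = tok.lower()
        if DIGIT_RUN.match(tok):
            kept.append("<num>")   # rule 1: digit runs become <num>
        elif MIXED_ALNUM.match(tok):
            continue               # rule 2: drop mixed alphanumeric tokens
        else:
            kept.append(tok)
    counts = Counter(kept)
    # Rule 3: merge tokens seen fewer than min_count times into <unk>,
    # leaving the special tokens themselves untouched.
    return [t if t in SPECIAL or counts[t] >= min_count else "<unk>" for t in kept]


if __name__ == "__main__":
    sample = ["hi"] * 3 + ["236", "yw132", "rikknen"]
    print(clean_tokens(sample))
```

Counting frequencies after the first two rules means the dropped tokens cannot influence which words survive the frequency cut.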

Major win. All 16 eval prompts produce distinct replies with only real English words. Compare v7 vs v8 sample output for "hello":
v7: "i user swear 106 236 i user doctor uncite the i bidpai bleu 236 to"
v8: "and tonight user good guy it and the z representatives i playstation complicated kiss"
The numeric IDs (236, yw132, 8826789) and placeholder gibberish (rikknen, bidpai, morrissette) are gone. Every token is now a recognizable English word.

Diagnostic also improved: cos(p1, p2) = 0.780 (v7: 0.912, v6: 0.960, v5: 1.000). Each fix has driven cosine down monotonically. Live at 108.181.97.223:5000.

What's still missing: coherent sentences. Output is a word salad of valid words. Topics drift mid-sentence (tonight → playstation → kiss in one reply). Grammar is broken; there is no syntactic structure. Hypothesis for v9: a bigger model (50M params, 384D, 12L) has the capacity to learn multi-word grammatical patterns instead of just word-level associations. v9 launching next, ~$13 on L4. v10 will add HuggingFaceTB/everyday-conversations-llama3.1-2k (clean Llama-generated dialogues) on top.

v10-redo Plan — Resolved by v11_dialog_2corpus

Apr 30, 2026

📜 Archived

Plan was to commit everyday_conversations.txt and re-run to actually test corpus expansion. Executed as v11_dialog_2corpus (trained 2026-05-02). Vocab grew 9,509 → 14,704, confirming both files were ingested. The corpus-expansion hypothesis was tested cleanly and refuted — see v11 entry below.

v10_dialog_corpus: Live — But This Was a v9 Re-Run Due to a Commit Bug

Apr 29, 2026

⚠ Live · Word Salad (v9 re-run)

v10 was intended to test the corpus-expansion hypothesis — same v9 architecture plus HuggingFaceTB/everyday-conversations-llama3.1-2k. It did not test that. The bug: corpus/everyday_conversations.txt was not committed to GitHub at launch time. Modal training clones the repo, so the file was absent in the container. ingest.py silently filtered it (line 134: corpus_files = [b for b in args.books if os.path.exists(b)]). Result: v10 trained on dailydialog_clean.txt only — identical to v9. Confirmed by vocab_size=9,509 (bit-identical to v9). $4 of Modal compute wasted on a random-seed re-run.
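The silent filter is what turned a missing file into an invisible one. A stricter variant raises instead of skipping. This is a sketch of that alternative, not the current ingest.py; `resolve_corpus_files` is a hypothetical name.

```python
import os


def resolve_corpus_files(requested: list[str]) -> list[str]:
    """Return the requested paths, raising if any of them are missing."""
    missing = [p for p in requested if not os.path.exists(p)]
    if missing:
        raise FileNotFoundError(
            "corpus files missing from the training container: "
            + ", ".join(missing)
        )
    return list(requested)
```

With this in place the run would have failed in the first seconds, before any GPU time was billed.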

Eval still shows marginal differences from the different seed. All 16 prompts produce distinct replies (criterion 1 PASS). Mostly real English with some odd tokens like kramer, wisconsin, interpersonal (criterion 2 mostly pass). Fragments slightly more grammatical than v9 but no coherent sentences (criterion 3 FAIL). Sample outputs from eval_runs/chat_20260429_200824.json:
"what do you like to do" → "user i have candles possess of the bot situations we have the candles for a"
"tell me about yourself" → "user oh seem that is hospitable i sticker possess managers bot it basil be i"
"what do you think about love" → "i reeve that user addresses you are residing to make a bot airlines in the"

Lesson logged in plans/V11_PLAN.md: "BOOKS files must be in HEAD before launch; verify with git ls-tree HEAD corpus/." Full report: kent-ai-dev.github.io/engram/.
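The pre-launch check from that lesson can be scripted. A minimal sketch, assuming it runs from a local checkout before the Modal job is submitted; both function names are hypothetical, and the git invocation is the recursive, names-only form of the `git ls-tree` command quoted above.

```python
import subprocess


def tracked_in_head(repo_dir: str, path_prefix: str = "corpus/") -> set[str]:
    """List files under path_prefix that exist in the HEAD commit."""
    out = subprocess.run(
        ["git", "ls-tree", "-r", "--name-only", "HEAD", path_prefix],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    ).stdout
    return set(out.splitlines())


def uncommitted(required: list[str], tracked: set[str]) -> list[str]:
    """Return the required corpus files that HEAD does not contain."""
    return [p for p in required if p not in tracked]


if __name__ == "__main__":
    required = ["corpus/dailydialog_clean.txt", "corpus/everyday_conversations.txt"]
    missing = uncommitted(required, tracked_in_head("."))
    if missing:
        raise SystemExit(f"not in HEAD, commit before launch: {missing}")
```

Checking HEAD rather than the working tree matters here, because the container clones the repository and never sees uncommitted local files.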

Architecture 384D / 12L / 12H / RoPE / Pre-LN / AdamW

Brain params 21,509,761

Vocab 9,509 — identical to v9 (corpus bug confirmed)

Training 5 epochs · Modal L4 · $4.00 · 2026-04-29T08:52:00Z

Corpus (intended) dailydialog_clean.txt + everyday_conversations.txt

Corpus (actual) dailydialog_clean.txt only — uncommitted file was absent from Modal container

Eval 1/3 criteria: distinct PASS · english mostly · coherent FAIL

Transcript eval_runs/chat_20260429_200824.json

v11_dialog_2corpus: Corpus Expansion Actually Tested This Time — Still Word Salad

May 2, 2026

📜 Replaced by v12

v10 was supposed to test the corpus-expansion hypothesis, but a commit bug made it a v9 re-run (vocab stayed at 9,509 because the second file was never in the container). v11 fixed that: both dailydialog_clean.txt and everyday_conversations.txt were committed and verified before launch. Vocab grew 9,509 → 14,704, definitive proof that both files were ingested. The subprocess timeout was raised 27,800s → 42,000s (the first v11 attempt had failed on that timeout), the function timeout 28,800s → 43,200s, and per-epoch checkpointing was added to ingest.py as defense-in-depth. The Engram N-gram module doubled in size (74.4 MB → 147.6 MB) because NGRAM_TABLE_SIZE was raised 50,021 → 100,003.

Loss: epoch 1: 1.1084 · epoch 2: 1.0867 · epoch 3: 1.0749 · epoch 4: ~1.057 · epoch 5: 1.0480. Comparable to v9/v10 finals (~1.05) despite 55% more vocab to discriminate — consistent with genuine improvement on a harder distribution.

Eval (16 prompts, eval_runs/chat_20260502_195749.json): Criterion 1 (distinct) PASS — 16/16. Criterion 2 (real English) partial — grammar fragments improved but everyday-conversations introduced brand/proper-noun leakage: ursa, stevia, breville, photoshop, ronaldo. Criterion 3 (coherent chitchat) FAIL — same word salad as v9/v10. Sample outputs:
"what do you like to do" → "bot i think stimulation spaceship the i can photoshop deformity the bot stimulation for a"
"tell me about yourself" → "bot bald is i think the a unknown that carmichael i can photoshop bombs the"
"do you have any friends" → "bot i don resolved the it s my unknown of the you foodborne it user"

Notable signal: avg_ponder collapsed from 3.0 (every model since the v6 Pre-LN fix ran the full loop) to 1.0; the halt gate now terminates after 1 step. Bigger vocab gives sharper predictions, but that didn't translate to coherence. Corpus-volume hypothesis refuted: v9 (capacity) and v11 (corpus) both failed at the same level. The remaining hypothesis is the loss function: MSE on embeddings rewards predicting the average of plausible tokens; v12 will switch to softmax cross-entropy. Full report: kent-ai-dev.github.io/engram/.
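The claim that MSE rewards predicting the average can be checked on a toy case. The two 2-D "embeddings" below are invented for illustration and stand in for two equally plausible next tokens.

```python
def mse(pred: list[float], target: list[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def expected_mse(pred: list[float], targets: list[list[float]]) -> float:
    """Average loss when each target is equally likely to be the true one."""
    return sum(mse(pred, t) for t in targets) / len(targets)


targets = [[1.0, 0.0], [0.0, 1.0]]  # toy embeddings of two plausible tokens
average = [0.5, 0.5]

commit_loss = expected_mse(targets[0], targets)  # commit to one real token
hedge_loss = expected_mse(average, targets)      # predict the midpoint

print(commit_loss, hedge_loss)  # 0.5 0.25
```

The midpoint halves the expected loss even though it matches neither token, which is the incentive the switch to cross-entropy is meant to remove.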

Architecture 384D / 12L / 12H / RoPE / Pre-LN / AdamW (frozen since v9)

Brain params 21,509,761

Vocab 14,704 (was 9,509 — corpus expansion confirmed)

Corpus dailydialog_clean.txt + everyday_conversations.txt (5.3 MB)

Training 5 epochs · Modal L4 · 11h08m · $6 · 2026-05-02T03:55:00Z

Final loss 1.0480

Eval 1/3 criteria: distinct PASS · english partial · coherent FAIL

Transcript eval_runs/chat_20260502_195749.json

v12_xent: Cross-Entropy Loss Swap — Training Works, Output Still Incoherent

May 3, 2026

📜 Replaced by v13

Single-variable swap from v11: training loss changed from MSE-on-embeddings to temperature-scaled cosine cross-entropy via tied output projection (predicted_norm @ vocab_matrix_normed.T * INV_TEMP=10). Architecture and corpus identical to v11 (12L / 384D / 12H / RoPE / Pre-LN / AdamW, 21.5M brain params, 38.7M engram params, vocab 14,704, context 32, 5.3 MB corpus). The v6–v11-era coherence_penalty MSE hack was dropped. vocab_matrix_global refreshed from the drifting embed_cache each epoch. Trained 2026-05-03 on Modal L4, $6.00.
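The loss described above can be written out in a few lines. This is a plain-Python sketch of the quoted formula for a single position, not the training code: the function names are hypothetical and the real implementation presumably operates on batched tensors.

```python
import math


def normalize(v: list[float]) -> list[float]:
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]  # assumes v is not the zero vector


def cosine_cross_entropy(predicted: list[float],
                         vocab_matrix: list[list[float]],
                         target_id: int,
                         inv_temp: float = 10.0) -> float:
    """Cross-entropy (in nats) over cosine-similarity logits times inv_temp."""
    p = normalize(predicted)
    logits = [
        inv_temp * sum(a * b for a, b in zip(p, normalize(row)))
        for row in vocab_matrix
    ]
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_id]


if __name__ == "__main__":
    vocab = [[1.0, 0.0], [0.0, 1.0]]  # toy two-word vocabulary
    print(cosine_cross_entropy([2.0, 0.0], vocab, target_id=0))
    print(cosine_cross_entropy([2.0, 0.0], vocab, target_id=1))
```

Because the logits are cosines scaled by `inv_temp`, they are bounded to the range ±inv_temp, so the temperature directly limits how confident the softmax can become.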

Loss (cross-entropy — not comparable to v11's MSE 1.05): epoch 1: 8.16 · epoch 2: 8.00 · epoch 3: 7.74 · epoch 4: 7.54 · epoch 5: 7.38. Steady ~−0.2 nats/epoch — training is functioning. Theoretical floor at INV_TEMP=10: ≈1.77 nats. The model has 5.6 nats of headroom and isn't closing it.

Eval (eval_runs/chat_20260503_165046.json): Model is not collapsed — replies differ per prompt. Dialog scaffolding tokens appear (user, bot, i, you; occasional bigrams like "i'm not going", "i ll go to"). But no coherent sentences. Sample outputs:
"hello" → "user do fragrance that are of be curb cortina what is it typical good the"
"hi how are you" → "and me cgi slim a and to course unnecessary tailors i think illness but you"
"tell me about yourself" → "user yes it lasts you are not sure dump rica bot you vegetables are the"
"what is the capital of france" → "and concentrate bot tools i have a unk sara grandfather is my canvas and i"

Assessment: the loss-function hypothesis is partly supported. Cross-entropy trains cleanly and dialog-turn structure is visible in output — progress over raw MSE word-salad. The bottleneck is now temperature calibration: at INV_TEMP=10 the softmax is too flat to force lexical commitment. v13 queued: bump INV_TEMP to 30+. Theoretical floor at INV_TEMP=30 is ≈0.59 nats — 3× sharper gradient signal per step. Full report: kent-ai-dev.github.io/engram/.

Architecture 384D / 12L / 12H / RoPE / Pre-LN / AdamW (frozen since v9)

Loss function cosine cross-entropy · tied output projection · INV_TEMP=10

Brain params 21,509,761

Vocab 14,704 (unchanged from v11)

Corpus dailydialog_clean.txt + everyday_conversations.txt (5.3 MB)

Training 5 epochs · Modal L4 · $6.00 · 2026-05-03

Final loss 7.379 nats (floor ≈1.77 at INV_TEMP=10)

Eval distinct PASS · dialog-scaffolding partial · coherent FAIL

Transcript eval_runs/chat_20260503_165046.json

Commit e281458

v14_branchb_learnable_vocab: Branch B Confirmed — Vocab Was the Bottleneck, First Dialog Fragments

May 5, 2026

✅ Live · Partial · Branch B confirmed

Single architectural change from v13: vocab_matrix_global promoted from a frozen buffer to an nn.Parameter (~5.6M trainable parameters), trained via AdamW with lr=EMBED_LR/2=2.5e-4. Brain warm-started from v13 final and frozen for epoch 1 so the vocab could adjust against fixed brain predictions, then unfrozen for epochs 2–5 (joint training). Architecture otherwise identical: 12L / 384D / 12H / RoPE / Pre-LN / AdamW, vocab 14,704, context 32, 5.3 MB corpus. Trained 2026-05-05 on Modal L4, $8.00.

Per-epoch loss (the key evidence): v13 final plateau 5.04 nats. Epoch 1 end with brain frozen: 4.7854 — already 0.25 nats below v13 with the brain doing nothing. Epoch 2 (joint): 4.0939. Epoch 3: 3.5592. Epoch 4: 3.0563. Epoch 5 final: 2.7059 nats (2.34 nats below v13; ~2.12 nats above the theoretical floor at INV_TEMP=30 of ≈0.59).

Smoking gun. Loss breaking below v13's plateau in epoch 1 with the brain still frozen is the clean ablation: vocab geometry was the bottleneck. The sentence-transformer initialization optimizes for semantic similarity ("cat" near "dog"), not syntactic prediction ("cat" followed by "is"). Letting the vocab move under the cross-entropy gradient unlocked the descent. This is one of engram's core architectural bets — vocab/brain as separable components — and Branch B is the first run to exercise the "learnable" half under cross-entropy training.

Eval: PARTIAL — qualitatively much improved. Output produces real English fragments and recognizable scheduling/conversational patterns from the dailydialog corpus. Sample outputs: "hi how about tomorrow coming to account on friday", "okay how about one week user what time bot at ten to p after", "ten minutes walk", "i love you very long". Compare v13's token-soup: "today we who had selling how me user". Not yet fully coherent multi-turn dialogue, but dialog-shaped text is appearing for the first time.

Remaining signal: avg_ponder pegged at 3.00 (the cap). The adaptive-compute lever is not engaged — the model always runs to the cap rather than learning to halt early. Branch A (raise cap from 3 → 5, lower ponder cost) queued next. Branches C (episodic memory at training time) and D (surprise-modulated gradient) follow. Full report: kent-ai-dev.github.io/engram/.

Architecture 384D / 12L / 12H / RoPE / Pre-LN / AdamW (frozen since v9)

Vocab 14,704 — learnable nn.Parameter (~5.6M params); epoch 1 brain frozen, epochs 2–5 joint

Brain params 21,509,761

Corpus dailydialog_clean.txt + everyday_conversations.txt (5.3 MB)

Training 5 epochs · Modal L4 · $8.00 · 2026-05-05

Per-epoch loss 4.7854 · 4.0939 · 3.5592 · 3.0563 · 2.7059

Final loss 2.7059 nats (floor ≈0.59 at INV_TEMP=30 — 2.12 nats headroom)

avg_ponder 3.00 (pegged at cap — adaptive-compute not engaged)

Eval distinct PASS · dialog fragments present · fully coherent FAIL

Budget $8.00 this run · cumulative ~$73 of $150 ceiling

v13_xent_temp30: Temperature Sharpened, Loss Dropped 2.34 Nats — Output Still Token-Soup

May 4, 2026

📜 Replaced by v14

Single-variable swap from v12: INV_TEMPERATURE raised from 10 to 30 (predicted_norm @ vocab_matrix_normed.T * INV_TEMP=30). Architecture, corpus, and loss function are otherwise identical to v12 (12L / 384D / 12H / RoPE / Pre-LN / AdamW, 21.5M brain params, 38.7M engram params, vocab 14,704, context 32, 5.3 MB corpus). Trained 2026-05-04 on Modal L4, $6.00.

Loss: final 5.04 nats vs v12's 7.38 — a 2.34 nat absolute reduction. Theoretical floor at INV_TEMP=30 is ≈0.59 nats (dropped 1.18 nats from v12's ≈1.77). Net headroom-above-floor narrowed from 5.6 to 4.45 nats — a real signal improvement, but the model is still 4.45 nats above its floor. Convergence was healthy.

Eval: PARTIAL/FAIL. Replies differ per prompt (distinct: PASS). avg_ponder sat at 2.7–2.9 throughout eval, indicating the halt gate is running near its cap of 3 — the model is almost always exhausting its pondering budget before halting. Output is token-soup — not English sentences. No improvement in coherence versus v12.

Diagnosis. The brain can reduce loss via a sharper temperature signal, but its predicted concept vectors are not landing on coherent ChromaDB tokens. The vocab embeddings are frozen — initialized once from the teacher and never updated during training. A more confident brain pointing in random directions of a frozen vocab space produces no better text. The sharpened softmax is demanding sharper discrimination from a target that was never trained to receive it.

Next: v14. Branch B makes the ChromaDB vocab a learnable nn.Parameter so brain predictions and token space can co-adapt. Branch A lifts the halt-gate cap above 3 to test whether the saturated avg_ponder is an independent bottleneck. Decision tree in plans/V14_CANDIDATES.md; v15+ research backlog in plans/FUTURE_RESEARCH.md. Full report: kent-ai-dev.github.io/engram/.

Architecture 384D / 12L / 12H / RoPE / Pre-LN / AdamW (frozen since v9)

Loss function cosine cross-entropy · tied output projection · INV_TEMP=30 (was 10 in v12)

Brain params 21,509,761

Vocab 14,704 (frozen embeddings — initialized from teacher, not updated)

Corpus dailydialog_clean.txt + everyday_conversations.txt (5.3 MB)

Training 5 epochs · Modal L4 · $6.00 · 2026-05-04

Final loss 5.04 nats (floor ≈0.59 at INV_TEMP=30 — 4.45 nats headroom)

avg_ponder 2.7–2.9 (cap=3, near saturation)

Eval distinct PASS · avg_ponder near-cap · coherent FAIL

Budget $6.00 this run · cumulative ~$57 of $150 ceiling

v9_dialog_big: More Params Did Not Fix Grammar

Apr 28, 2026

📜 Replaced by v10

Trained the biggest Engram model yet: 384 embed dim / 12 layers / 12 heads / head_dim=32 / RoPE / Pre-LN / AdamW — 21.5M brain params (v8: ~6M, v6: ~6M). Same cleaned dailydialog corpus (9,509-token vocab). 5 epochs on Modal L4, ~$4. Trained 2026-04-28 12:32 UTC, live as of 2026-04-28 19:50 UTC. Full report at kent-ai-dev.github.io/engram/.

Eval results (eval_chat.py, 16 prompts): Criterion 1 (distinct outputs) PASS — 16/16 distinct. Criterion 2 (real English tokens) mostly pass — real words but with odd leaks (ikebana, milliken, tornados, banquet). Criterion 3 (coherent chitchat) FAIL — still word salad. Sample outputs:
"what do you like to do" → "you go to banquet present the bot smoke and starting i ll failed to the"
"tell me about yourself" → "user reducing that happening slip i m it evil banquet bot you cortex it s"
"how was your day" → "user reducing the me that i m dishwashing hi evil bot you festival it s"

Diagnosis: the v8 hypothesis ("bigger model has capacity to learn grammar") is refuted. Fragments emerge (i am, i m, hi, i have a really) but no coherent sentences. The bottleneck is the corpus — not the model. Next: v10 with a larger, more varied corpus.

Architecture 384D / 12L / 12H / RoPE / Pre-LN / AdamW

Brain params 21,288,577 (~21.5M)

Vocab 9,509 (cleaned dailydialog)

Training 5 epochs · Modal L4 · ~$4 · 2026-04-28 12:32 UTC

Eval 1/3 criteria: distinct PASS · english mostly · coherent FAIL

Transcript eval_runs/chat_20260428_194533.json

v7_dialog: Cleaner Vocab, Still Gibberish — Data Artifacts Now Visible

April 27, 2026

📜 Superseded by v8

Tested the data hypothesis from v6's diagnosis: with the architecture bug fixed (Pre-LN + AdamW), would dropping the 19th-century novels give us coherent dialog? Trained v7 on corpus/dailydialog.txt only (6.1 MB conversational, 5 epochs, ~$3 on Modal L4). Final loss 1.0101 — best yet (v6: 1.0509, v5: 1.1448). Vocab dropped from 38,062 → 18,187 concepts (the novels are gone).

Diagnostic shows further improvement. cos(p1, p2) = 0.912 (was 0.960 in v6, 1.000 in v5). ||p1 − p2|| = 4.51 (was 0.43 in v6) — 10× more output differentiation. The model conditions on input even more strongly with the focused corpus.

But output is still gibberish. Sample replies: "i user swear 106 236 i user doctor uncite the i bidpai bleu 236 to", "i rikknen laid uptight lock for i to rikknen batwoman and yw132 expanded". The Moby-Dick vocab is gone — replaced by data artifacts in dailydialog: numeric IDs (236, 106, 1205, yw132, 8826789), placeholder tokens (rikknen, bidpai), rare proper nouns (morrissette, pavarotti). Common English IS in there (i, user, you, the, to, and, lending, doctor, good) but drowned by noise tokens.

Diagnosis: dailydialog has preprocessing residue. Three options for v8: (a) clean the corpus first (regex out numeric IDs + low-frequency tokens), (b) bigger model — 50M params, more capacity to ignore noise, (c) different conversational corpus (Persona-Chat / OpenAssistant). Each architectural fix has worked exactly as predicted; quality has improved monotonically (v5 frozen → v6 conditions → v7 differentiates more) — but we keep finding new bottlenecks. Live at 108.181.97.223:5000.

v6_preln: Pre-LN Fix Worked — Constant-Output Bug Resolved

April 27, 2026

📜 Superseded by v7

After v5's diagnostic localized the bug to block 0 of the trained model (attention softmax collapsed to near-uniform max=0.033, FF norm exploded to 360 vs residual 16), implemented two fixes: (1) switched AttentionBlock from post-LN to pre-LN architecture (LayerNorm before each sublayer instead of after the residual), and (2) switched optimizer from Adam to AdamW with weight_decay=0.01. Trained v6 on Modal L4 (3 epochs, 15 MB local corpus, ~$3.60). Final loss 1.0509 (vs v5's 1.1448).
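The difference between the two layouts is easiest to see side by side. The sketch below is a simplified illustration, not AttentionBlock itself: `layer_norm` omits the learnable gain and bias, and `sublayer` stands in for attention or the feed-forward network.

```python
import math
from typing import Callable

Vector = list[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # Post-LN: normalize after adding the sublayer output to the residual.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x: Vector, sublayer: Callable[[Vector], Vector]) -> Vector:
    # Pre-LN: normalize only the sublayer input; the residual path is untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]
```

In the pre-LN form the input passes through the residual connection unchanged, so a sublayer that contributes nothing leaves the representation exactly as it was.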

Diagnostic confirms the fix. Same input/test as v5: feed three completely different prompts, compare model output. Result: cosine(p1, p2) = 0.960 (was 1.000), ||p1 − p2|| = 0.43 (was 1.6 × 10^-7). Model now genuinely conditions on input. Avg ponder steps: 3.0 (was 1.0 — model now uses full reasoning depth). All 16 eval prompts produce distinct replies.

But output is still gibberish-quality. Sample replies: "the judo my 1021 fun of the my tertiary fun and the adapt bot is" / "ichthyosaurus refurbished crossing 1021..." / "the user recur necessitate rotate at i blackjack...". Heavy contamination from 19th-century novel vocab (Moby Dick, Dracula, Pride and Prejudice — ~60% of the 15 MB training corpus). The architecture is healthy now; the bottleneck has shifted from model can't condition to model trained on the wrong data.

Loop halted for user decision. Two paths forward: v7 dailydialog-only (~$2 — drop the novels, train just on conversational corpus, test data hypothesis) or v8 bigger model (~$13 — 50M params, more capacity to filter noise). Recommend v7 first since it's the cheapest informative experiment and addresses the actual diagnosis.

v5_rope: Diagnostic — Structural Bug Found (Now Fixed in v6)

April 26, 2026

📜 Superseded

Trained v5 (3 epochs, 15 MB local corpus, ~$3.60 on Modal L4). Final loss 1.1448 — converged to a clear plateau (epoch 1 end 1.1564 → epoch 2 end 1.1463 → epoch 3 end 1.1448). Server live at 108.181.97.223:5000. Then ran eval_chat.py against 16 prompts.

The bug. Output is gibberish — but more importantly, identical gibberish for every prompt. Diagnostic check (the "5-minute sanity test" promised in the previous report): feed three completely different prompts to the model, dump the raw concept prediction. Result: cos(p1, p2) = 1.000000, ||p1 − p2|| = 1.6 × 10^-7. The model is producing a near-constant output. The top-5 nearest vocab words to that constant: the · i · bot · a · and — i.e., the highest-frequency tokens. The model has learned "predict the average direction of common words" and stopped there.
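The sanity test reduces to two numbers per pair of prompts. A minimal sketch, assuming the raw concept predictions are available as plain float vectors; the function names and both thresholds are hypothetical.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def l2_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def looks_constant(p1: list[float], p2: list[float],
                   cos_threshold: float = 0.999999,
                   dist_threshold: float = 1e-5) -> bool:
    """Flag the failure mode where different prompts yield the same prediction."""
    return (cosine(p1, p2) >= cos_threshold
            and l2_distance(p1, p2) <= dist_threshold)
```

Checking the distance as well as the cosine guards against vectors that point the same way but differ in magnitude.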

Why this is structural, not undertraining. Both v4 (1 epoch) and v5 (3 epochs) show the exact same failure mode. Loss has converged; more epochs only move it 0.01. Embeddings are correctly distinct (verified). The model genuinely isn't conditioning on input — likely the residual + LayerNorm path dominates over the attention contribution, so position 31's representation gets pinned regardless of context. Spending $13 on a bigger v6 or $160 on a 100M-param v8 would not fix a structural bug; it would just produce a bigger constant.

Loop halted. The autonomous training loop has been paused (.claude/ralph-loop.local.md set to active: false). Next steps need code work, not compute: investigate why attention isn't propagating prompt-word information through to position 31, possibly fix the residual scaling or the post-RoPE attention mask, then re-run v5 with the fix before resuming the cost ladder.

v4_rope: Deployed · Live but Undertrained

April 26, 2026

⚠ Live · Gibberish

OpenMythos architecture transfer complete — 6 phases of ablation, only RoPE shipped (grad_norm_p99 halved 0.561→0.280; perfect 3× context extrapolation). Trained on Modal L4 for 1 epoch / 15 MB local corpus, final loss=1.1347. Server is live at http://108.181.97.223:5000 with the new model. Eval shows the model produces near-identical repetitive sequences ("resigning slighted defilements impressiveness…") regardless of prompt — undertrained, not broken. Full status report: kent-ai-dev.github.io/engram/

Phase results: P0 reproducibility ✅ · P1 LTI killed · P2 loop-idx killed · P3/P4 skipped · P5 RoPE ✅ shipped · P6 lock ✅

Next: sanity-check eval (free), then v5 (5 epochs · ~$5 · ~6h L4) to confirm whether undertraining is the bottleneck before scaling to bigger model (v6 ~$13).

Model v4_rope · 6.3M brain + 12.9M N-gram tables

Architecture 256dim × 8layer × 32ctx · RoPE · 8 heads · LR=0.001

Training 1 epoch / 15 MB / L4 · 90 min / $1.20

Final loss 1.1347 (avg) · grad_norm_p99=0.280

Vocab 38,062 concepts (ChromaDB-backed)

Status Server live · model undertrained, output is gibberish
