Engram Training Log

Every GPU run, every bug, every iteration toward a mind

Architecture

Engram is a small custom language model built from scratch — an AttentionBrain with multi-head self-attention over a sliding context window. It lives in the kent-ai-dev/engram repo. Training migrated from SaladCloud (preemption issues) to Modal.com (L4 GPU, no preemption) as of April 2026.

Model class AttentionBrain
Attention layers 4
embed_dim 96
context window 32 tokens
NGRAM_TABLE_SIZE 50,021
Training platform Modal (L4 GPU)

Weights stored on Modal Volume (persistent). Vocab: 37,591 tokens (full 13-book + DailyDialog corpus, trained Apr 6). Coherence penalty added in commit 5ae5950.
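The hyperparameters above can be sketched as a minimal PyTorch model. This is an illustrative reconstruction, not Engram's actual source: the class name and the numbers (4 layers, embed_dim 96, 32-token context, 37,591-token vocab) come from this log, but every layer choice and the head count are assumptions.

```python
import torch
import torch.nn as nn

EMBED_DIM, N_LAYERS, CONTEXT, VOCAB = 96, 4, 32, 37_591

class AttentionBrain(nn.Module):
    """Sketch of the shape described above: stacked self-attention
    layers over a 32-token sliding window. Layer types are assumed."""

    def __init__(self, vocab_size=VOCAB, n_heads=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, EMBED_DIM)
        self.pos = nn.Embedding(CONTEXT, EMBED_DIM)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=EMBED_DIM, nhead=n_heads,
                dim_feedforward=4 * EMBED_DIM, batch_first=True,
            )
            for _ in range(N_LAYERS)
        ])
        self.head = nn.Linear(EMBED_DIM, vocab_size)

    def forward(self, ids):  # ids: (batch, t) with t <= 32
        t = ids.shape[1]
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        # Additive causal mask: -inf above the diagonal blocks attention
        # to future tokens.
        mask = torch.full((t, t), float("-inf"), device=ids.device).triu(1)
        for layer in self.layers:
            x = layer(x, src_mask=mask)
        return self.head(x)  # (batch, t, vocab_size) next-token logits
```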

Run History

12 runs (+ pre-history) across roughly three weeks. Same core architecture until the Run 12 upgrade. Many bugs. Getting closer.

Pre-History
Mar 18–20, 2026
❌ Word Salad

Victorian literature corpus (Frankenstein, Dracula, The Time Machine…). ingest.py wiped ChromaDB on every run, so the server's semantic search always came back with similarity 0.00. High-probability tokens were Victorian vocabulary: "knight", "wretchedly", "thence". The model couldn't form a coherent sentence, let alone hold a conversation.

Root Cause ChromaDB wiped on each ingest.py run
Fix Applied Killed Victorian corpus. Switched to DailyDialog (12,118 real conversations, 6.2 MB)
Run 1
Mar 20–21, 2026
❌ CPU Timeout Chain

Three attempts back-to-back on local Windows VPS (CPU only): full corpus (6.2 MB, 3 epochs), then 20% corpus, then 5% corpus. All timed out. CPU-only training requires 4–12 hours minimum for this architecture. Side effect: ingest.py held Ollama's process lock, blocking semantic search for 30+ hours.

Platform Local Windows VPS — CPU only
Root Cause No GPU on local hardware; Ollama blocked for 30 h
Fix Applied GPU required. All future training moved to SaladCloud exclusively
Run 2 — Baseline Commit
~Mar 21, 2026
⚠️ Baseline Only

Pre-history weights committed to the GitHub repo as a baseline snapshot. Vocabulary: 7,253 tokens. This became the "stale weights" reference point that all subsequent SaladCloud runs attempted to improve upon. Last known good state before the SaladCloud era began.

Vocab 7,253 tokens
Platform Local Windows VPS
Status Weights in repo — pre-SaladCloud era baseline
Run 3 — First SaladCloud
Mar 22, 2026
❌ Container Instability

First attempt on SaladCloud GPU. Multiple bugs surfaced with the container environment: python:3.11-slim ships without bash (scripts failed immediately), containers crash-looped due to wrong base image, S4 uploads failed with curl exit 56, and the restart_policy was incorrectly set. Several containers launched and died before training could complete.

Root Causes python:3.11-slim has no bash; wrong base image; bad restart_policy
Fixes Applied Switched to /bin/sh; moved to pytorch/pytorch:2.5.1 base; S4 curl fallback added; restart_policy set to "never"
Run 4
Mar 24, 2026
❌ API Key Expired

Container engram-1774387064 queued with DailyDialog full corpus (6.2 MB, 3 epochs). SaladCloud API key had silently expired mid-session, blocking the launch. A new key was obtained and stored as an environment variable (not hardcoded). As of Mar 24 23:26 UTC the container was still running — no weights URL ever captured.

Container engram-1774387064
Root Cause SaladCloud API key expired mid-session
Fix Applied New API key obtained; moved to env var, not hardcoded in source
Run 5
Mar 25, 2026
❌ ntfy Variable Bug

Several security and infrastructure fixes landed this session: hardcoded API key removed from git history, S4 upload replaced the old 0x0.st fallback, IMDS JWT auth added for SaladCloud S4. Upload succeeded — but the ntfy notification used single quotes: -d 'Training done! WEIGHTS_URL=$UPLOAD_URL'. Single quotes prevent shell variable expansion. $UPLOAD_URL was never substituted. The URL was in S4 but no one knew where.

Root Cause ntfy -d arg wrapped in single quotes → $UPLOAD_URL never expanded
Fix Applied Identified but not yet committed this run — fixed in Run 6
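The quoting failure is easy to reproduce in isolation. A minimal sketch (the URL is a placeholder, not the real S4 location):

```shell
#!/bin/sh
UPLOAD_URL="https://s4.example.com/engram_weights.pth"

# Buggy: single quotes suppress parameter expansion, so the literal
# text "$UPLOAD_URL" is sent in the notification body.
echo 'Training done! WEIGHTS_URL=$UPLOAD_URL'

# Fixed: double quotes let the shell substitute the variable.
echo "Training done! WEIGHTS_URL=$UPLOAD_URL"
```

The same rule applies to the `-d` argument of the ntfy curl call: only double-quoted strings undergo expansion.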
Run 6
Mar 28, 2026 — ~01:00 UTC
❌ Weights Trained, URL Lost

Container engram-1774630566 ran for ~8 hours on SaladCloud GPU. Training completed. S4 upload succeeded. But the ntfy single-quote bug from Run 5 was still present — $UPLOAD_URL never expanded in the notification body. Nobody knew the S4 URL. Container was deleted. Weights were gone.

Container engram-1774630566
Runtime ~8 hours on GPU
Root Cause ntfy single-quote bug — S4 URL never surfaced
Fix Applied Single quotes → double quotes around ntfy -d arg (commit c35bd18)
Run 7
Mar 28, 2026 — 05:59 UTC
❌ Cancelled

Container engram-1774677568 launched with the ntfy double-quote fix applied. Cancelled shortly afterwards in favour of the full-corpus run (Run 8) rather than another DailyDialog-only pass. No training loss to report.

Container engram-1774677568
Reason Superseded by full corpus run (13 books + DailyDialog)
Run 8 — Full Corpus
Mar 28–30, 2026
❌ Trained but Not New

Container engram-1774780608 ran the full corpus: 13 books + DailyDialog (~2.5 M words). SSH access established Mar 30. train_runner.py had hardcoded Windows paths (C:\Python314\python.exe) and timed out on Linux. ingest.py ran instead and produced weights — but the container had cloned the repo, which already had those same weights committed from local runs. Checksums matched exactly: no new training had occurred. Weights were manually pushed to S4.

Container engram-1774780608
Corpus 13 books + DailyDialog (~2.5 M words)
Root Cause Windows path C:\Python314\python.exe hardcoded in train_runner.py
S4 Uploads baseline (553 KB), target_small_iter4 (4.2 MB), large_iter4 (3.5 MB), vocab_embeddings (9.7 MB)
Fix Applied salad_train.py now calls python3 train_runner.py instead of python ingest.py
Model State vocab=7,253 · 13 books · 5 iterations (from local)
Run 9 — SaladCloud (stale)
Mar 30, 2026 — 15:22 UTC
❌ Unknown

Container engram-1774884158 on SaladCloud. No confirmed completion — container likely preempted. This was the last SaladCloud-era run before the platform migration.

Platform SaladCloud (deprecated)
Outcome Unknown — no weights captured, container deleted
Run 10 — SaladCloud Timeout (final SaladCloud run)
Apr 5, 2026 — 21:44 UTC
❌ Timeout

Container engram-1775425431 launched after cleaning 4 stale containers. Ran for 8.7 hours with 3 GPU node swaps (preemption). The polling script timed out at MAX_TRAINING_WAIT (28800s). An instance was briefly running near the end but no weights were ever produced. This was the decisive failure that triggered the migration to Modal.

Platform SaladCloud (RTX 3060)
Duration 8.7 hours (timeout)
Node swaps 3 (GPU preemption)
Decision Migrate to Modal.com — no preemption, reserved GPU
Run 11 — Modal 3-Epoch (first Modal success)
Apr 6, 2026 — 08:00–09:02 UTC
✅ Success

First successful training on Modal.com. L4 GPU (24 GB VRAM), no preemption. Full corpus (13 books + DailyDialog, ~2.5M words), 3 epochs. Training completed in ~1 hour — compared to 8+ hours of failure on SaladCloud. All 3 weight files saved to Modal Volume and deployed to the live frontend. Vocab jumped from 7,253 → 37,591 tokens.

However, model output was still word salad — repeating "dissemble", "outlandish", "xxvii". Surprise score ~1.08 (high = random). The 37K vocab was too large for 3 epochs to converge. Led to launching a 10-epoch follow-up.

Platform Modal.com (L4 GPU, 24 GB VRAM)
Duration ~1 hour
Epochs 3
Final loss 0.9751
Vocab 37,591 tokens (was 7,253)
Weights engram_weights.pth (1.6 MB), engram_memory_module.pth (18.4 MB), engram_word_to_id.pth (0.7 MB)
Deployed Yes — frontend at 108.181.97.223:5000 serving vocab=37,591
Cost ~$0.80 (L4 at $0.80/hr × 1h)
Run 12 — Modal 5-Epoch (8-layer upgrade)
Apr 6, 2026 — 09:34 UTC
✅ Complete

Upgraded the architecture to 8 layers (from 3) and 256 dimensions, trained for 5 epochs on the Modal L4 GPU. Weights downloaded and deployed; the frontend serves vocab=37,591. Output quality is still poor — the model produces grammatically structured gibberish ("drawbridges tzatziki mammiferous"). Eval score: 56.4/100 (passes the threshold but is semantically meaningless). Deep analysis revealed four issues: a data bottleneck (13 Gutenberg books are insufficient), single-head attention, contradictory ponder objectives, and no gradient clipping.

Platform Modal.com (L4 GPU)
Epochs 5 (8-layer, 256-dim)
Duration ~2 hours
Outcome Complete — deployed, but output quality needs data + architecture fixes
Deep Research — Architecture Review & Training Plan
Apr 8, 2026
🔬 Research

Full architecture analysis completed. Compared Engram to kent_hologram (Hyperdimensional Computing system). Identified 7 transferable techniques: surprise-gated learning, curriculum training, experience replay, output validation, ventriloquist dual-model generation, salience-weighted loss, and adaptive pondering.
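Of those techniques, surprise-gated learning is the simplest to sketch. This is a hedged illustration of the general idea only — not kent_hologram's or Engram's implementation; the function name and threshold are invented:

```python
import torch
import torch.nn as nn

def surprise_gated_step(model, opt, loss_fn, x, y, threshold=0.5):
    """Apply an optimizer step only when the sample's loss ("surprise")
    exceeds a threshold, so updates are spent on surprising data.
    Returns (loss_value, whether_an_update_happened)."""
    loss = loss_fn(model(x), y)
    if loss.item() < threshold:   # unsurprising sample: skip learning
        return loss.item(), False
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item(), True
```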

Key finding: The gibberish problem is primarily a data problem. TinyStories (476M tokens, designed for small models) is the recommended unlock. Models under 10M params trained on TinyStories produce coherent stories — Engram, at 18.7M params, produces gibberish on Gutenberg books.

Next run plan: fix four bugs (add gradient clipping, rebalance the ponder weight, resolve the evaluator contradiction, switch to multi-head attention), then train on a TinyStories + WikiText-2 curriculum for 20–30 epochs.