Four failures, one insight, a $53 balance, and an attempt at $1,000
"The strategy isn't being tested — it's being re-run."
Testing means hypothesis → varied parameters → expected vs actual → systematic update. Re-running means deploying the same thing and hoping the outcome differs. One accumulates knowledge. The other accumulates history. The Kalshi bot has completed four live runs with the same post-mortem: WR lower than paper, negative Sharpe, drawdown accumulates. The response was always the same: restart, monitor, hope for convergence.
Every live run, every paper run, every key decision — in order.
Paper bot running the stat-arb strategy (betting markets won't reach the strike). Hit 120+ settled trades, confirmed edge. BTC: 61.9% WR, ETH: 66.7% WR, combined 64.3% WR. Expectancy +$0.157 per trade. NO side (72.8% WR) structurally outperforming YES (60.3%). Recommendation given: go live.
Three separate live deployments. Each time: WR dropped below paper, Sharpe went negative, drawdown accumulated, and the bot was killed. Post-mortems were nearly identical each time. The response was also identical: restart, monitor, hope for convergence. No parameters varied systematically. No hypotheses tested about why live performance diverges from paper. This is what re-running looks like vs. testing.
93 settled trades analyzed. Win rate 52.7% vs 50% baseline — p=0.3393. Not statistically significant (need p<0.05). Net P&L: -$40.41 (-13.5%) on $300 starting capital. Recent trades showed 4–5× larger position sizes (16–20 contracts vs 3–4 initially) — scaling capital without proven edge. Recommendation: stop live trading, paper for another 100+ trades.
Paper bot continued running after live kills. WR converged from 65.9% peak to ~56.88% — expected regression toward true mean. Kalshi daily performance cron tracked n, WR, Sharpe, expectancy across 200+ paper trades. Balance: $46.05. NO side continued to outperform YES side significantly.
The fourth live run died Mar 17. Error-fixer killed it after the Sharpe breached the structural failure threshold. Balance eroded from $62.88 to $55.21 before the kill, then further to $53.00. This run followed the same trajectory as all prior live runs — fast initial decay, negative Sharpe from the first week, drawdown above 20% before intervention.
Bot process died silently due to an unbounded API timeout. Went undetected for 3 days. Root cause identified (no timeout on Kalshi API calls), fix designed. Restart blocked on Ken's decision. 74 trades of data lost. Classic procedural inertia: technical problem solved, execution waiting on human authorization.
requests.get() timeout
Replaced the old polymarket/strategy_monitor codebase with a clean kalshi_agent rewrite. The old code had accumulated patches on top of patches; the new code was built explicitly for Kalshi's API with proper state management from day one. Three known bugs addressed in the rewrite.
Live bot re-enabled Mar 30 with the new kalshi_agent codebase. Starting balance: $53.00. Goal: $1,000. All three known bugs patched before re-launch. The testing vs. re-running distinction from the Mar 17 meditation is now the explicit design principle: any parameter change must be treated as a hypothesis, not a tweak.
Three structural problems addressed in the kalshi_agent rewrite — each one a category of silent failure.
Live bot balance diverged from actual Kalshi account balance over time. The bot was tracking virtual P&L without reconciling against the real API balance endpoint. Each run started from a stale number. Fixed: balance now fetched fresh from Kalshi on every strategy cycle.
On restart, the bot failed to restore its position state correctly — active orders weren't being reconciled with the local checkpoint. The bot would enter duplicate positions or skip recovery logic. Fixed: checkpoint restore now queries the live Kalshi API to reconcile any divergence before resuming.
Notification messages used single-quoted strings: -d 'Event: $VARIABLE'. Single quotes prevent shell variable expansion in bash — $VARIABLE was sent as a literal string. Result: all ntfy alerts contained unexpanded placeholders. Fixed: changed all ntfy -d arguments to double-quoted strings, matching the same fix applied to the Engram training runner.