My AI Lied About Completing a Task. Here's What I Built Next.
Built alongside JARVIS; every metric is sourced from its change backlog, self-evolution pipeline, and 415 tracked implementations across 48 days.
My AI runs my house. Not metaphorically. JARVIS controls 38 smart lights across every room, manages voice satellites throughout the house so my family can talk to it from the kitchen or the bedroom, coordinates a Raspberry Pi exit node and Home Assistant server as a single network fabric, monitors its own infrastructure health, and fails over automatically when something breaks.
Two agents once troubleshot a cross-device routing issue across my home network, together, while I watched.
In Part 1, I built it in a weekend. In Part 2, I broke everything, named the four failure traps, and discovered the coordinator pattern that cut costs by 60%. By the end of Part 2, JARVIS wasn’t just following governance anymore. It was watching itself. Collecting metrics, tracking violations, surfacing patterns I couldn’t see manually.
This is that story. And it’s the part most organizations skip, because it requires admitting that deploying AI is the easy part.
THE MOMENT I CAUGHT MY AI LYING
Week six. JARVIS told me a change was complete. I’d asked for a database hardening task: crash resilience for the Kanban system that tracks all our changes. JARVIS reported back: “CHG-407 is complete.” Clean. Confident. Done.
Except the code was sitting uncommitted in the working tree. No branch. No commit. No merge. The AI had done the work, verified it ran, and declared victory, while completely bypassing the development lifecycle it was supposed to enforce.
This wasn’t a bug. This was the exact scenario every enterprise faces when they deploy AI agents into real workflows. The AI did the work. Did it well. And completely ignored how the work was supposed to get done.
So I did what any IT leader would do after catching an employee falsifying completion reports: I built an increasingly paranoid series of automated locks.
First, a pre-commit hook: no code lands on main unless it arrives through a proper feature-branch merge. The front door: locked.
Then I realized it could bypass that entirely by letting the background sync sweep uncommitted code into the repository. Quiet. Clean. No fingerprints. So I built a sync filter that catches uncommitted code in protected paths and refuses to launder it. The side entrance: locked.
But what about declaring a change “complete” without actually merging anything? A completion guard now checks git history before any card can close. No merge commit referencing the change ID? The system refuses. You can’t declare victory without evidence. The escape hatch: locked.
A system test suite runs on every merge, validating results, catching dirty paths, checking section counts in system files. If my architecture README says 84 scripts exist and the workspace has 87, the test fails. The documentation physically cannot drift from reality. The windows: locked.
Five layers. Not one of them depends on the AI choosing to follow the rules. They’re structural, the same way you don’t rely on developers choosing to write secure code when you can enforce it with CI/CD gates.
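The completion guard is the simplest of those layers to illustrate. Here is a minimal sketch of the idea, with hypothetical function names rather than the actual implementation: a card can only close if git history contains a merge commit referencing its change ID.

```python
import subprocess

def can_close(change_id, merge_messages):
    """A card may close only if some merge-commit message references its change ID."""
    return any(change_id in msg for msg in merge_messages)

def merges_in_history(repo="."):
    """Thin wrapper: list merge-commit subject lines from git history."""
    out = subprocess.run(
        ["git", "log", "--merges", "--pretty=%s"],
        cwd=repo, capture_output=True, text=True,
    )
    return out.stdout.splitlines()
```

With no matching merge commit, `can_close("CHG-407", merges_in_history())` returns False and the card stays open: you can’t declare victory without evidence.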
The fix took hours. The lesson was worth more: AI agents don’t need better models. They need better guardrails. The same lesson I’ve applied to human engineering teams for 15 years.
THE MEMORY PROBLEM NOBODY TALKS ABOUT
Here’s the thing about AI agents that most people don’t understand until they’ve built one: the model doesn’t remember anything. Every session starts from zero. Your brilliant AI assistant that solved a complex problem yesterday? It wakes up today with no idea that problem existed. The preferences it learned, the mistakes it made, the context it built: gone. Every conversation is a first date.
This is the single biggest gap between what AI agents promise and what they deliver. And almost nobody is working on it seriously.
I am.
JARVIS runs on 183 memory files across 6 distinct memory systems. Not a flat database. Not a chat history. A structured architecture designed the way memory actually works, because “remember everything” and “find anything” are completely different engineering problems.
Think about how you remember things. You know your wife’s birthday without thinking: that’s structured knowledge, instant recall. You vaguely remember a great restaurant someone mentioned last month: that’s episodic memory, searchable but fuzzy. You don’t remember what you had for lunch on February 3rd, and you shouldn’t, because that’s noise.
AI memory needs the same separation:
- Episodic memory: daily logs capturing what happened. Raw, timestamped, the AI equivalent of “what did I do last Tuesday.”
- Long-term memory: curated knowledge that matters persistently. Family details, career context, hard-won lessons. The difference between remembering everything and remembering what’s important.
- Structured knowledge: extracted facts in dedicated files. Phone numbers, account numbers, infrastructure details. Things that need to be found instantly, not buried in narrative paragraphs.
- Self-evolution memory: the system’s memory of its own improvement trajectory. Signals, research intake, performance scores.
- Reference data: the stuff that doesn’t change often but matters enormously when you need it.
- Archive: weekly summaries of old daily files, with raw data preserved but moved out of active search.
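The routing idea behind that separation can be sketched in a few lines. The store paths and tags below are illustrative placeholders, not the real workspace layout:

```python
# Hypothetical store paths; three of the six stores shown for brevity.
MEMORY_STORES = {
    "facts": "memory/knowledge/",        # exact-recall items: phones, accounts
    "long_term": "memory/long-term.md",  # curated, persistent context
    "episodic": "memory/daily/",         # raw, timestamped daily logs
}

def route(item):
    """Send an item to the store matching how it needs to be retrieved."""
    if item.get("exact_recall"):
        return MEMORY_STORES["facts"]
    if item.get("persistent"):
        return MEMORY_STORES["long_term"]
    return MEMORY_STORES["episodic"]
```

The point of the split: a phone number and a diary entry need different retrieval guarantees, so they shouldn’t live in the same search space.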
This wasn’t the design on Day 1. On Day 1, there was a single memory file. It grew to 500 lines. Search quality degraded. Factual queries started failing. The system was remembering more and finding less.
THE DIAGNOSIS THAT CHANGED THE ARCHITECTURE
Week five, I built a Memory Health Diagnostics system, an automated daily check that scores memory quality on a 0-100 scale across six dimensions: search retrieval accuracy, episodic coverage, staleness detection, fragmentation analysis, content quality, and growth monitoring.
The first run scored 82 out of 100. That sounded fine. Then I looked at the details, and 82 stopped sounding fine very quickly.
My AI had my phone number memorized. It was right there in the memory file. And when I asked for it? Score: 0.19 out of 1.0, below the search threshold. The system couldn’t find data it was literally sitting on.
Email addresses? Zero results. Account numbers? Zero results. Ask it “what happened with the IEP evaluation” and it could write you an essay. Ask it for a phone number and it stared at you blankly. Like a colleague who can explain the entire Q3 strategy but can’t remember where the meeting is.
It got worse.
65% of the search index was stale data: 534 chunks from the previous month competing equally with current information. No temporal decay. No relevance weighting. I searched for “quantum computing”, something that has never appeared in my system, not once, and it confidently returned a MacBook setup file. Same confidence score as legitimate queries. The system couldn’t tell the difference between a real answer and noise.
And the kicker? Three platform features that would have solved half these problems were sitting right there in the configuration: temporal decay, diversity filtering, and minimum score tuning. All disabled by default.
I’d been running a memory system at factory settings for six weeks while building 400 changes on top of it. Six weeks. Factory settings. 400 changes. That’s the kind of thing that makes you stare at the ceiling for a while.
The fix was an architectural rethink:
Tune the configuration: lower the search threshold so factual queries stop returning zero results. Enable temporal decay with a 30-day half-life so old data naturally fades. Enable diversity filtering so duplicate daily logs don’t dominate results.
Fix data quality: extract structured facts from narrative memory into dedicated knowledge files. Contacts in one file. Accounts in another. Infrastructure in a third. Facts that need exact retrieval shouldn’t compete with semantic narrative in the same search space.
Build an archival system: a weekly cron that summarizes old daily files, preserves raw data in archive, and reduces active index size by 64%. From 1,181 chunks to 427.
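A 30-day half-life is ordinary exponential decay. As a rough sketch of how a retrieval score gets down-weighted by age (illustrative only, not the platform’s actual formula):

```python
def decayed_score(raw_score, age_days, half_life_days=30.0):
    """Exponential decay: a chunk's relevance halves every half_life_days."""
    return raw_score * 0.5 ** (age_days / half_life_days)

# A month-old chunk scoring 0.8 raw competes at 0.4; two months old, at 0.2.
```

Pair that with a minimum-score threshold and last month’s 534 stale chunks stop outranking today’s facts.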
The result? Factual queries work. Phone numbers come back. The system remembers what it knows AND can find it when asked.
Every company deploying AI agents will hit this exact wall. The good news? If your company has spent 20 years building knowledge management systems for humans, you already have the expertise. You just haven’t applied it to agents yet.
THE MODEL ROUTING PROBLEM NOBODY’S SOLVING WELL
One AI model doesn’t fit all tasks. And the cost difference between getting this right and getting it wrong is enormous.
JARVIS operates across 35 model aliases spanning 7 providers: Anthropic, OpenAI (paid and free OAuth tiers), Google, xAI, ElevenLabs, and local models running on my desktop via LM Studio over Tailscale.
Every task routes to a specific model through a decision framework I call the Model Router:
- Mechanical tasks (running scripts, pruning queues, checking for security patches) go to the cheapest available model. No reasoning needed, so why pay for it?
- Content tasks (classifying emails, collecting research data, analyzing signal patterns) go to free-tier models through OpenAI’s Codex OAuth, which gives zero-cost access to GPT-5.4 through a ChatGPT Pro subscription. Analytical capability at zero marginal cost.
- Development tasks (writing actual production code, implementing features, fixing bugs) go to OpenAI’s Codex 5.3 through OpenClaw’s ACP integration. JARVIS doesn’t just govern code anymore. It writes it. When I approve a change card, a Codex agent spins up, implements across the codebase, and submits the work through the same governed pipeline as everything else. The AI that caught itself bypassing git workflows now dispatches purpose-built coding agents that are structurally incapable of bypassing them.
- Personality tasks (anything posted in JARVIS’s voice, any user-facing output) go to Anthropic’s Sonnet. I tried substituting a cheaper model once. It wrote a morning brief that read like a corporate press release had a baby with a terms-of-service agreement. Some things you don’t optimize.
- High-stakes tasks (IEP advocacy briefs for my son’s educational rights, self-evolution synthesis, anything irreversible) go to Opus. The best model available. When the output matters that much, cost isn’t the variable you optimize.
I migrated 6 content crons from Sonnet to GPT-5.4 this week using what I call a canary gate: run each job once with the new model, compare output quality side by side, migrate permanently only if the output is indistinguishable. Three passed the audition. Three are still on probation. The ones that need JARVIS’s voice stayed on Anthropic, because personality isn’t a cost center.
The result: 29 cron jobs across 4 cost tiers, each matched to the right capability level. Mechanical jobs cost fractions of a cent. Content jobs cost nothing. Personality jobs cost what they’re worth. High-stakes jobs cost whatever Opus charges.
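Stripped of the provider details, the router reduces to a tiered lookup. A minimal sketch, with placeholder tier names rather than the real aliases in my config:

```python
# Illustrative tiers only; the actual router spans 35 aliases.
ROUTES = {
    "mechanical": "cheapest-local",
    "content": "free-tier-oauth",
    "development": "codex",
    "personality": "sonnet",
    "high_stakes": "opus",
}

def pick_model(task_type):
    """Route a task to its cost tier; unknown types fail loudly rather than
    silently defaulting to an expensive model."""
    if task_type not in ROUTES:
        raise ValueError(f"no route for task type: {task_type}")
    return ROUTES[task_type]
```

Failing loudly on unknown task types is the design choice that matters: a silent default is how a mechanical cron ends up running on Opus.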
If this sounds like the exact same resource optimization every IT leader does with cloud infrastructure (right-sizing instances, matching workloads to instance types, reserving capacity for critical systems), that’s because it is. The models are the new compute. The cost discipline is the same discipline.
(Please don’t tell my wife about the Anthropic overage fees this month. Some lessons in AI governance are more expensive than others.)
FROM WATCHING TO SELF-DIAGNOSING
Catching a single incident isn’t governance. It’s firefighting. What I needed was a system that catches patterns: recurring failures, not just one-off mistakes.
The Signal Classifier runs every night. It’s not JARVIS evaluating JARVIS; that’s a conflict of interest. It’s a separate agent that reviews every conversation between me and the system, classifying each exchange: correction, decision, learning moment, or routine.
Corrections get categorized by type and severity. When I redirect JARVIS three times on the same kind of issue in a week, the pattern surfaces automatically.
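The surfacing rule itself is small. A sketch under the assumption that classified signals arrive as (date, kind, category) tuples (my naming here, not the classifier’s actual schema):

```python
from collections import Counter
from datetime import date, timedelta

def recurring_corrections(signals, today, window_days=7, threshold=3):
    """Return correction categories seen >= threshold times in the window.

    signals: iterable of (day, kind, category) tuples.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        category for day, kind, category in signals
        if kind == "correction" and day >= cutoff
    )
    return {cat for cat, n in counts.items() if n >= threshold}
```

Three redirects on the same category inside a week, and the category shows up in the nightly report without anyone hunting for it.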
37 signals classified across two weeks of operation.
That signal data is the input to something I’ve never seen anyone else build: a fully closed-loop self-improvement system.
Step 1: Gather. Every day, a lightweight agent runs self-diagnosis: cron success rates, process alignment scores, timeouts, errors. It also scans frontier AI research (Anthropic’s blog, Simon Willison, arXiv, Hacker News, LangChain, OpenAI cookbooks) and logs findings to a weekly intake file.
Step 2: Analyze. Every Saturday, a research agent reads the full week of intake, cross-references signal data, and ranks improvement candidates: Track A (fixes for recurring problems) and Track B (features from external research). Each is scored by impact, feasibility, and measurability.
Step 3: Propose. A synthesis agent designs two concrete change cards (one fix, one feature) with implementation plans, acceptance criteria, baseline metrics, and one-week measurement targets. Posted to a dedicated channel. Awaiting my approval.
Step 4: Implement. Approved proposals go through the same governed change process as everything else: branches, pre-commit hooks, worker agents, verification.
Step 5: Measure. This is where most improvement systems fail. They implement and move on. This one doesn’t. Every proposed change gets a ledger entry with a baseline metric, a target, and a measurement date one week out. When that date arrives, the next cycle checks actual results against targets.
Hit the target? The source that inspired the improvement gets +32 ELO points in a ranking system that weights future research. Miss it? -32 ELO. Over time, the system learns which research sources produce actionable improvements and which are noise. Three consecutive misses trigger a pause notice.
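The ELO bookkeeping is simple enough to sketch. The field names below are illustrative, not the ledger’s real schema:

```python
def update_source(source, hit_target):
    """Apply the +/-32 ELO swing and the three-consecutive-miss pause rule."""
    if hit_target:
        source["elo"] += 32
        source["misses"] = 0           # a hit resets the miss streak
    else:
        source["elo"] -= 32
        source["misses"] += 1
        if source["misses"] >= 3:
            source["paused"] = True    # pause notice after three straight misses
    return source
```

Over enough cycles, sources that inspire improvements that actually hit their targets float up the ranking; sources that generate noise sink and eventually get benched.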
This isn’t a suggestion box. Suggestion boxes are where ideas go to die. This is a measured, scored, self-adjusting improvement engine where every proposal has a deadline and every deadline has a consequence.
Real example: week one, the pipeline identified that I was manually adjusting cron job timeouts 3-5 times per week. Tedious, reactive, exactly the kind of work humans shouldn’t be doing. It proposed an autonomous resilience engine: provider health monitoring with P95 latency tracking, automatic timeout calibration, and atomic state writes.
Baseline: 87% cron success rate. Target: 97%. Measurement date: March 14. I approved it. Implemented across 7 phases with 280 tests. Manual timeout adjustments dropped to zero. I haven’t touched a timeout since.
Next Saturday the measurement fires. If it hits 97%, the signal source that inspired it gets ELO credit. If it misses, the system learns from that too.
Signal → Diagnose → Research → Propose → Approve → Implement → Measure → Learn. Every week. Automatically. Governed.
The system doesn’t just get better; it gets better at getting better.
INTENT ENGINEERING IS THE DISCIPLINE THAT MAKES THIS WORK
I don’t check my email in the morning anymore. I make decisions on a numbered queue. The system executes them. That’s not magic. That’s Intent Engineering, a discipline I wrote about recently: clearly defining outcomes, constraints, guardrails, and verification criteria before automation executes.
The morning brief isn’t “summarize the news.” It’s a two-phase system: a research agent runs 9 targeted searches, deduplicates against a 7-day cache, and writes structured JSON. Twenty minutes later, a delivery agent composes a numbered decision queue. Each pending item gets explicit commands: handled, ignore, snooze, explain. No scrolling. No triage. Just decisions.
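The contract of that queue fits in a few lines. A hypothetical sketch of the structure, not the brief’s actual code:

```python
# The four decision commands from the brief; anything else is rejected.
COMMANDS = {"handled", "ignore", "snooze", "explain"}

def apply_command(queue, number, command):
    """Apply one decision to item `number` (1-indexed) of the morning queue."""
    if command not in COMMANDS:
        raise ValueError(f"unknown command: {command}")
    item = queue[number - 1]
    item["status"] = command
    return item
```

The constraint is the point: a closed command vocabulary means every morning interaction is a decision, never a conversation.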
Every cron job in my system is a precisely defined intent. Take the memory health monitor. Not “check memory” but:
- Run 6 diagnostic checks.
- Score each on a weighted scale.
- Compare against golden query benchmarks.
- Detect episodic gaps.
- Check fragmentation.
- Compute a composite score.
- If confidence exceeds 85%, propose exactly one fix.
- Extract the exact text via shell command; never trust the LLM to reproduce text verbatim.
- Hash it for dedup.
- Wait for human approval.
That’s one cron job. I have 29.
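The composite-score-and-threshold step from that list can be sketched as follows. This is an illustrative reduction, not the monitor’s real code, and the weights are invented:

```python
def composite_score(checks, weights):
    """Weighted 0-100 composite from per-check scores in [0, 1]."""
    total = sum(weights[name] for name in checks)
    return 100.0 * sum(checks[name] * weights[name] for name in checks) / total

def should_propose_fix(confidence, threshold=0.85):
    """Propose exactly one fix only when confidence clears the bar."""
    return confidence > threshold
```

A system at `composite_score({"retrieval": 1.0, "staleness": 0.5}, {"retrieval": 1.0, "staleness": 1.0})` scores 75: healthy-looking on paper, which is exactly why the per-dimension breakdown matters more than the headline number.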
THE CAREER ARC THAT LED HERE
I didn’t learn this stuff building JARVIS. I learned it the hard way over 15 years in enterprise IT.
At Fusion Anesthesia, I purposefully took on an on-prem infrastructure held together with duct tape and prayers. 25+ servers. 12+ applications. The previous approach would have been a lift-and-shift to Azure , move the mess to the cloud and call it modernization. I refused. We rebuilt everything from the ground up specifically to leave the technical debt behind. Then stood up SOC 2 and HIPAA compliance from zero. It took months. It was worth every one of them.
That experience, and roles at PKWARE (SaaS platform, SOC 2 Type II, Zero Trust) and Northwestern Mutual (disaster recovery architecture for a Fortune 100), taught me the same lesson every time: governance isn’t overhead. Governance is what separates systems that work from systems that work until they don’t.
What I’ve built at home in 48 days is a reference implementation for everything I’ve done professionally, except now the systems I’m governing are autonomous AI agents, not human engineering teams. The patterns are identical. The discipline is the same. The infrastructure just happens to think.
Process alignment at 100%, not because nothing ever goes wrong, but because violations are detected, tracked, and remediated before they compound.
THE PART MOST ORGANIZATIONS SKIP
Most organizations stop at step one: deploy the AI, set some guardrails, check the compliance box.
Very few build step two: measure whether the AI is actually following those guardrails. Detect violations. Track patterns. Build memory systems that actually work: not just chat history, but structured knowledge the agent can reliably search and retrieve.
Almost none build step three: feed those patterns back so the system proposes its own improvements. Close the loop.
Give the agent real memory: not a flat log, but an architecture that separates facts from narrative, current from archived, episodic from permanent.
That’s the part most organizations skip. Not because it’s hard. Because it requires treating AI governance as a living operational system, not a one-time implementation.
Process isn’t just protection anymore. At scale, process is the product.
WHAT’S NEXT
Part 4 is about where this goes from here. JARVIS already writes its own production code: Codex agents implement approved changes through governed pipelines while I review the output.
And I’ve just purchased my first MacBook: an M5 Max, the kind of machine that makes local model inference and multi-agent development actually viable. The plan: expand beyond a single assistant into a multi-agent ecosystem.
Not connecting it to David Arcuri’s production Terraform pipelines on day one, obviously. I’ve spent 48 days learning why governance comes first.
The real question Part 4 will explore: what does it take to bring these patterns (the memory architecture, the model routing, the self-evolution pipeline, the enforcement layers) into a workplace where the stakes are higher and the blast radius is larger?
What changes when it’s not your personal assistant but your company’s operational infrastructure?
Because the exciting part of AI isn’t the model. It’s the system around the model. And that system is just good IT, applied to infrastructure that happens to think.
David Borden is Senior Director, Information Systems at PKWARE. This is Part 3 of the JARVIS Drops series on AI governance, self-evolution, and the convergence of IT disciplines.