Norns and Context Management
Norns

We first have to talk about Norns and what it is. It’s a SaaS I’m building to prove out a theory of mine that the workflow I used with Claude (specifically Claude Code) could be agentic in nature, and cheaper in that fashion.
My approach with writing as mentioned in my other articles is very much about velocity. Provide the foundation, have the LLM generate, ask for changes, rinse, wash and repeat. In Claude Code this works well because you have 200,000 tokens into which you can cram a lot of crap. Claude Code masks a significant amount of this by being very efficient about how interacts with things, compaction and the like. But at $100 a month for max, it’s hard to swallow as a cost for a writer who would probably be using it for input / editing.
Formatting
The biggest benefit is actually knowing that Claude is better at the things you yourself are not good at or do not know you’re not good at. Examples of this were PDF formatting. I knew roughly what I had to have in order to get it published, but getting to that point was an unspecified collection of dark art rituals
It took weeks of working with Claude, but eventually we cracked the code for how to get a PDF that I could publish. I realized later that the approach was simpler than I had made it in my head:
- Where was the file going?
- What are that target’s specifications?
- How do I build to those specifications
- Build
- Verify against specifications
- Troubleshoot
- Repeat
This is agentic in nature, and was the fundamental idea behind Norns. The things that are typically onerous or outside of the domain of the author can be automated away via agency.
The problem
While that itself seems hard, it’s actually the least problematic thing I had. That’s a discrete scope with a tight loop. The context any agent will see is small (all things considered). What I really ran into issues with was the actual AI agent itself, which we lovingly call ghost in Norns. The very first time I tried it in earnest, I did the following:
- I copied the contents of my bible for The Chuin Cascade in (All characters, locations, etc)
- I created chapter 1
- I created an outline for it
- Told it to get writing!
And immediate failure. Context limit exceeded.
Beginning to Understand How Out of my League I was
I couldn’t really fathom what happened at first, because it worked in Claude Code. Fundamentally though, when you’re using the API, it’s a different beast. The Most context you can throw at a given request is 64K tokens. So I updated to that (globally at first. Just to see it through. For science!). I got a few seconds more in then it failed gracelessly.
Rate limited. 30K tokens. What? How did I hit 30K tokens? What I had done is what you see most AI systems do which is “Smash all the data into the context and it should work, right?” Nah it definitely does not. I then started going down a path of compaction, targeting and moving to something more akin to how Claude Code operates.
- When the agent needs data about a character, it may not need all of it.
- Provide tools to locate “windows” within data
- Provide tools for identifying what subsets of data the agent may need
- Just because it has it all doesn’t need it needs to set in memory forever
- I created an ephemeral storage space in MinIO for the agent to store plans
- Plans also store state (To Do Lists and whether they’re complete or not)
- The agent offloads things it doesn’t need right now into that layer to free up token space for other endeavors
- Compaction and Eviction of the context over time. Once we had finished a task we could empty the brain and go back to what was on paper in order to make forward progress without burning through unnecessary head space.
What I had stumbled onto was the painful nature of effective and efficient context management. This is where the real engineering happens.
The Challenge of AI Context Management
The realization was that throwing everything into Claude’s context window isn’t just inefficient, it’s fundamentally unsustainable for long-form writing. Just like with humans, filling the headspace with data leaves no room for reasoning and evaluation. That’s why we do things like annotate important concepts, highlight, set those aside, reconstruct, and synthesize.
That introduced its own problems. The agent needed:
- Ways to access memory very surgically rather than all at once.
- Mechanisms for storage
- Mechanisms for re-construction or synthesis
Even in that thought, “memory” was a loaded term. I needed to compartmentalize what “memory” meant and how it could be accessed.
The Four-Layer Memory System
Rather than treating all information equally, Norns organizes memory into four distinct layers, each with different retention and retrieval characteristics:
DOMAIN LAYER (Book-wide, Permanent)
↓ Example: "Magic requires verbal incantations"
↓ Use: Fundamental world rules, themes, plot arcs
CATEGORY LAYER (Entity-specific, Permanent)
↓ Example: "Alice has brown hair, works as detective"
↓ Use: Character profiles, locations, magic systems
TRACE LAYER (Session-specific, Temporary)
↓ Example: "User asked to revise chapter 3"
↓ Use: Current working session, editing history
EPISODE LAYER (Interaction-specific, Ephemeral)
↓ Example: "Last generation was 500 words"
↓ Use: Individual tool calls, immediate context
Why this matters: Instead of searching through 100,000 words of past conversations, we can precisely target the right layer. Need character details? Hit the Category layer. Need to understand the current editing session? Trace layer only.
Three Complementary Systems
The layered memory is just one piece. To make context truly directional, it took gluing three separate systems together:
1. Vector Search (The “Similar To” Engine)
All memories get converted to mathematical embeddings and stored in PostgreSQL with the pgvector extension. This enables semantic search:
- Query: “Scenes where Alice confronts authority”
- Returns: Relevant memories based on meaning, not just keywords
- Efficiency: Search 10,000+ memories in milliseconds
2. Knowledge Graph (The “Connected To” Engine)
Entities (characters, locations, events) are automatically extracted and stored with their relationships:
Alice --[knows]--> Bob
Alice --[works_at]--> Police Station
Alice --[participates_in]--> Murder Investigation
Murder Investigation --[occurs_at]--> Police Station
Instead of searching for facts, we traverse relationships. “What locations are connected to Alice?” becomes a simple graph query.
3. Narrative Structure (The “Supposed To” Engine)
Using Dramatica theory, Norns tracks:
- Character arcs and transformations
- Plot progression through signposts
- Thematic elements
- Story dynamics
This prevents Claude from accidentally writing Alice as “steadfast” when she’s supposed to “change,” or placing a key moment in the wrong signpost.
Context Assembly in Practice
When you ask Norns to write a chapter, here’s what happens behind the scenes:
USER REQUEST: "Write chapter 3 featuring Alice, Bob, and Carol"
CONTEXT BUILDER GATHERS:
├─ Platform prompts (behavior guidelines) ~5,000 tokens
├─ User preferences (style, tone) ~2,000 tokens
├─ Book bible (world rules, characters) ~8,000 tokens
├─ Chapter outline (from segments) ~3,000 tokens
├─ Narrative structure (storybeats, arcs) ~5,000 tokens
├─ Memory search ("chapter writing context") ~10,000 tokens
└─ Graph query (Alice-Bob-Carol relationships) ~5,000 tokens
───────────────
TOTAL ASSEMBLED CONTEXT: ~38,000 tokens
vs. naive approach of dumping entire conversation history: ~150,000+ tokens
The Validation Layer
After Norns generates content, three evaluators verify quality:
| Evaluator | What It Checks |
|---|---|
| SCORE | Semantic consistency, logical coherence, authorial intent preservation, canon alignment, character emotional arcs |
| Hallucination Detector | Character attributes match known facts, timeline ordering is correct, location details are accurate |
| NCP Alignment | Storybeats follow structure, character arcs progress correctly, themes are honored |
Each runs automatically in the background, scoring the output before it’s finalized.
The Compaction Strategy
Over time, conversations grow massive. Rather than deleting old context, Norns uses two strategies:
Reflection Compaction:
- Extracts key decisions, conclusions, and turning points
- Stores them as compact episodic memories
- Generates summaries for long exchanges
- Updates the knowledge graph with new entities
Entropy Pruning:
- Removes semantically redundant memories
- Ages out rarely-accessed information
- Eliminates low-information content
- Preserves high-value memories regardless of age
The Payoff
This architecture enables:
- Precision retrieval - Only relevant context enters the window
- Continuity - Facts stay consistent across 100+ chapters
- Efficiency - 60-70% reduction in context token usage
- Scalability - Handles novels 300,000+ words without degradation
- Quality - Multi-dimensional validation prevents drift and hallucination
For comparison, a typical Claude conversation might maintain 20-30 messages of history. Norns maintains comprehensive memory of an entire novel while using less context per interaction.
The key insight: Context efficiency isn’t about cramming more into the window, it’s about being surgical about what you retrieve and when. Every token should earn its place.
This is ironically to me a call back to old C and C++ development. Back in the 90’s, bytes in memory mattered. The amount of data to be transferred mattered. The idea of being frugal with every resource was engrained into developers.
When computing power exploded, a lot of that fell by the wayside, and we’re seeing that lack of mentality hurt as various people attempting to adopt AI just try to cram everything into context and hope for the best.
But doing the work to investigate how to surgically extract what is needed and injected it when needed, allowed me to generate a full chapter, contextually relevant and verified against the projected arc for $0.38.