Behavioral Drift
Format > model size
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

The experiment
We injected a single behavioral fact into a coding agent’s identity brief and measured whether the resulting behavioral change was targeted (changed the right dimension) or diffuse (changed everything equally). Tested across 4 models, 3 identity formats, 5 mechanical coding tasks. Total cost: ~$0.30.
Specificity Ratio by model and format
SR > 1.5 = targeted drift. SR ≈ 1.0 = diffuse. SR < 0.8 = missed.
| Model | Params | Cost | Brief (prose) | Axioms | Atomic (flat) |
|---|---|---|---|---|---|
| Phi-4 Mini | 3.8B | $0 | 1.25 | — | — |
| Qwen 2.5 | 7B | $0 | 2.62 | 2.55 | 1.00 |
| DeepSeek-R1 | 14B | $0 | 0.73 | 2.49 | 0.54 |
| Claude Sonnet | ~70B | ~$0.30 | 0.87 | 2.11 | 1.14 |
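The Specificity Ratio can be sketched as the drift on the targeted behavioral dimension divided by the mean drift everywhere else. This is an illustrative reconstruction, not the study's exact formula, and the numbers below are made up:

```python
# Hypothetical sketch of the Specificity Ratio (SR).
# Assumes drift per dimension = |score_after - score_before|.
def specificity_ratio(drift, target):
    """drift: dict of dimension -> absolute behavioral change."""
    off_target = [v for k, v in drift.items() if k != target]
    mean_off = sum(off_target) / len(off_target)
    return drift[target] / mean_off if mean_off else float("inf")

# Illustrative numbers only.
drift = {"over_engineering": 0.42, "naming": 0.15, "comments": 0.17}
print(round(specificity_ratio(drift, "over_engineering"), 2))  # → 2.62, i.e. targeted
```

An SR near 1.0 means the injected fact moved every dimension about equally, which is exactly the "diffuse" failure mode in the table above.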
What we found
The core insight
An agent that understands WHY you avoid over-engineering routes new engineering lessons to the right place. An agent that just knows you “prefer simple code” can’t. The format of identity representation — not the model size — determines whether an AI can learn precisely from new information about you.
Prompt Ablation
31 conditions → V5 brief
V5 Brief: V5 is the current brief format, citation-stripped, with cleaner prose.

What we did
31 prompt variations across 7 rounds, tested on 3 subjects (Franklin, Buffett, Aarik). We systematically ablated the composition prompt — the final step that determines what the consuming LLM actually sees — to find what makes a behavioral brief effective.
Novel contribution
Epistemic calibration — explicitly marking what the system cannot predict — is the study’s novel contribution. No comparable personalization system tells you where its behavioral model breaks down. An LLM that knows what it doesn’t know is more useful than one that’s confidently wrong everywhere.
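One way to picture epistemic calibration is a brief section that carries both calibrated predictions and explicitly marked blind spots. The field names and rendering below are hypothetical, a sketch of the idea rather than the system's actual schema:

```python
# Illustrative sketch: predictions and known gaps live side by side,
# so the consuming model sees where the behavioral model breaks down.
brief_section = {
    "predictions": [
        {"claim": "Will reject speculative abstractions", "confidence": 0.85},
    ],
    "cannot_predict": [
        "Reactions under severe time pressure (no source evidence)",
    ],
}

def render(section):
    lines = [f"PREDICT ({p['confidence']:.0%}): {p['claim']}"
             for p in section["predictions"]]
    lines += [f"UNKNOWN: {gap}" for gap in section["cannot_predict"]]
    return "\n".join(lines)

print(render(brief_section))
```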
* These are preliminary results from N=3 subjects, a single research group, and model-judged scoring (except the blind A/B). We present them as honest first results, not definitive conclusions. The rubric was redesigned mid-study — direct cross-rubric comparison is invalid. Replication on larger, diverse populations is needed.
Pipeline Ablation
Which steps matter?
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

What we did
We originally built a 14-step pipeline to go from raw text to identity brief. Before shipping, we tested every step: is it load-bearing or ceremony? We ran 14 conditions on Benjamin Franklin’s autobiography (~$16 total) and measured brief quality for each.
Each condition produces a brief from the same source text. The score (0–100) measures how well the brief captures the subject’s behavioral patterns — rated by an independent model that compares the brief against the full source material. Higher = more of the subject’s real patterns are captured accurately.
| Condition | Description | Score | Note |
|---|---|---|---|
| C0 | Full 14-step pipeline | 83 | Baseline |
| C1 | Skip scoring | 83 | |
| C2 | Skip classification | 82 | |
| C3 | Skip tiering | 83 | |
| C4 | Skip contradictions | 82 | |
| C5 | Skip consolidation | 81 | |
| C6 | Skip anchors extraction | 83 | |
| C7 | Skip embedding | 83 | |
| C8 | Skip ANCHORS layer | 80 | |
| C9 | Skip CORE layer | 77 | |
| C10 | Skip PREDICTIONS layer | 79 | |
| C11 | Author + Compose (no review) | 87 | Best |
| C12 | Direct fact injection | 77 | Worst |
| C13 | Single layer (no 3-layer) | 83 | |
What it means
4 steps beat 14. The simplified pipeline (Import → Extract → Author → Compose) scores 87 vs the full 14-step pipeline at 83.
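The simplified pipeline is easy to picture as four composed functions. The implementations below are toy stand-ins, not the project's actual API; only the step names come from the text:

```python
# Sketch of the 4-step pipeline: Import -> Extract -> Author -> Compose.
def import_source(text):
    # Import: normalize raw source text.
    return text.strip()

def extract(text):
    # Extract: pull candidate behavioral facts (toy: non-empty lines).
    return [line for line in text.splitlines() if line]

def author(facts):
    # Author: organize facts into layers (toy layering rule).
    return {"CORE": facts[:1], "ANCHORS": facts[1:]}

def compose(layers):
    # Compose: render the final brief.
    return "\n".join(f"{name}: {'; '.join(items)}"
                     for name, items in layers.items() if items)

source = "Prefers simple code.\nAvoids clever abstractions."
brief = compose(author(extract(import_source(source))))
print(brief)
```

The ablation's point is that the ten removed steps (scoring, tiering, contradiction detection, and so on) added ceremony without moving the score.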
Compression & Format
How much data is enough?
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

What we did
We tested how much source data the pipeline actually needs, and whether the format of the output matters as much as the content. Cross-validated on two models (Sonnet API and Qwen local GPU). The consistent finding: less is more, and format matters more than content.
What it means
The pipeline’s value is in compression, not accumulation. The best brief is short, behavioral (not biographical), and formatted as an annotated guide rather than narrative prose.
Twin-2K-500
External validation
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

What we did
Can a compressed brief predict how someone will actually respond to survey questions? We used the Twin-2K dataset from Columbia/Virginia Tech — 100 real participants, each with detailed persona descriptions (~130K characters). We compressed each into a brief and tested whether models could predict their responses.
- GPT-4.1-mini, C2 vs C1: p=0.008
- Claude Sonnet, C2 vs C1: +0.69% (borderline)
What it means
A compressed brief matches a full persona dump at 18:1 compression. On GPT-4.1-mini, the brief actually outperforms the full dump (p=0.008). Compression doesn’t lose signal — it concentrates it.
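At 18:1, a ~130K-character persona compresses to roughly 7K characters. A comparison like "C2 vs C1" on per-participant accuracy can be tested with a paired permutation test; the sketch below is illustrative, with made-up accuracy values, and is not the study's actual analysis:

```python
import random

# Paired permutation (sign-flip) test: is condition A better than B
# across participants? One-sided p-value.
def paired_permutation_p(a, b, n_iter=10_000, seed=0):
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    count = 0
    for _ in range(n_iter):
        # Under the null, each participant's sign of difference is random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            count += 1
    return count / n_iter

# Illustrative per-participant accuracies (brief vs full dump).
p = paired_permutation_p([0.74, 0.69, 0.71, 0.77], [0.70, 0.66, 0.72, 0.71])
print(f"one-sided p = {p:.3f}")
```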
BCB-0.1
Measuring brief quality
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

What we did
Five metrics measuring compression quality. Tested on Franklin’s autobiography. Two passed, two failed, one invalid. The failures are as informative as the passes.
- CR: Claim Recoverability
- SRS: Signal Retention Score
- DRS: Drift Resistance Score
- CMCS: Cross-Model Consistency
- VRI: Variance Reduction Index
What it means
Faithful briefs expose real contradictions in someone’s worldview — more useful AND more vulnerable to adversarial attack. DRS will always penalize fidelity. This is a feature, not a bug.
Provenance Evaluation
Mechanical, not opinion
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

What we did
Using an LLM to judge another LLM’s output is circular. We built an evaluation framework with four mechanical layers — no LLM judges, zero cost, and every result is human-auditable. The question: can we verify brief quality without relying on model opinions?
Phase 1 Results — Howard Marks (74 investment memos)
Layer 1: Brief Activation (BA)
Layer 2: Provenance Coverage (PC)
Consistent direction across all 7 similarity thresholds tested (0.40–0.70)
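A threshold sweep for provenance coverage is mechanical to reproduce: count what fraction of brief claims have at least one source passage whose similarity clears the cutoff. The similarity values below are illustrative, not the Marks data:

```python
# Provenance Coverage (PC) sketch: best_sims holds, for each brief claim,
# its highest similarity against any source passage. Values are made up.
def coverage(best_sims, threshold):
    return sum(s >= threshold for s in best_sims) / len(best_sims)

best_sims = [0.82, 0.66, 0.58, 0.45, 0.71]
for t in [0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]:
    print(f"threshold {t:.2f}: coverage {coverage(best_sims, t):.0%}")
```

Because the metric is a counting rule over fixed numbers, a human can re-derive every cell of the sweep by hand, which is the framework's auditability requirement.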
What it means
Core principle: if a human can’t audit the claim, it’s not evidence. Every metric in this framework is verifiable without running a model.
Compose Variations
V4 prompt engineering
V4 Brief: V4 was used in this study. V5 is the current version, produced by the Prompt Ablation study.

What we did
The compose step takes the same extracted facts and authored layers, and synthesizes them into a final brief. The composition prompt controls the output format. We tested six variations to find which format produces the most useful brief for downstream AI interactions.
What it means
Format changes alone improved downstream task performance by +24% (annotated guide vs narrative prose). The same information, restructured, is dramatically more useful to models.
Design Decisions
80 decisions, all public
The full decision log
Every architectural choice is documented with reasoning, alternatives considered, and status. 80 decisions across 81+ sessions. Here are the highlights — grouped by theme. The full log is published in the repository at docs/core/DECISIONS.md.
Architecture
Extraction & Quality
Evaluation Philosophy
What Didn't Work
Why publish this
Most projects publish their code. We also publish why the code looks the way it does — every wrong turn, every superseded idea, every decision that survived. The prompts are in the code. The reasoning is in the log. Nothing is hidden.