LLM code review: /brutal-review and the 80/20 rule

/brutal-review and the new 80/20 rule

18 min read

•

4 months ago

(Mistakes and em-dashes are my own.)

We have a Claude Code command in our monorepo which is a good example of what I’m calling the "80/20 token rule": if it takes 20M LLM tokens to write a feature, you should spend 80M on ensuring quality: safety measures, testing, and of course code review. If software quality was an open problem before, then agentic engineering has blown it even more wide opener.

At a high level, here's what it does:

Read the patch to review and all related code.
Hand off a context block to 4 different perspectival review subagents.
Collate the results.

All 2815 tokens are at the end of the post.

`/brutal-review` in practice

In addition to running it on other's code while I'm reading the patch myself, I'll run it as soon as another agent has finished some work, as soon as I've handwritten some code, or while cleaning up patches before posting them for review. I've been running some variation of it many times every day for a few months. By now, many of the engineers on the team are running the command regularly as well.

It usually takes 5-10 minutes to run with Opus 4.5, though that can consume ~75% of a Claude Pro's 5 hour quota. But other models, including some open models, do just about as well as far as I can tell subjectively.

For the majority of the code reviews I perform, this command finds at least one legitimate issue I missed, often several. Last week it uncovered a long-standing latent bug impacted by some changes to our window function support which it was reviewing. I'm fairly positive no human reviewer would have noticed that one.

Let's look at a few key aspects of /brutal-review and think about software quality.

Naming things "You're absolutely right!"

If the name wasn't enough to catch your eye, here's the first line of the prompt in .claude/commands/brutal-review.md:

Perform a ruthless, brutal, in-depth, extremely critical code review of
the most recent change (jj `@-` or git `HEAD`).

When I first started routinely using LLMs for code review, that one line was all I did. I would check out a particular patch in my working stack, or a coworker's I had pulled down to review, and type out some variation of that line—by hand, yes. The intent should be obvious: I was sick of asking for feedback and getting, "Great job!" and little of substance. I get enough "good jobs" from my wife and therapist, I don't need them from my computer. What I need is a critical eye for detail that goes faster and is less likely to start half-assing it halfway through a long review than a human who skipped their afternoon coffee. And I definitely don't care about the AI's glowing praise for my "excellent implementation of metrics and observability" in a patch that was supposed to fix a data serialization bug just because I happened to move a preexisting counter increment one line down. Every token it wasted on meaningless praise or, less obviously, general and unfocused assessment and appraisal ("The new field is correctly added to every instantiation of the struct"—gee, thanks for checking), was a token that wouldn't get a chance to notice that I never fixed the bug in the first place.

After the prompt intro, we describe the full review process, but there's some more juice to be squoze from adding this reminder near the end:

# Mindset

You are not here to make friends. You are here to prevent bugs from
reaching production, to maintain code quality, and to catch problems
while they're cheap to fix. Every issue you miss is a bug that will wake
someone up at 3 AM.

Be direct. Be specific. Be relentless. The code must earn its place in
the codebase.

One motivation for me investing time in code review at the height (or in the depths) of my AI experimentation was this thought: if an LLM can't spot a sketchy design decision in a new API surface with all the time and tokens I'm willing to give it, why the hell should I let it write any new API surfaces? It turned out, after helping it circumvent the politeness governor module or whatever, it almost always could. I haven't gone as far as doing real backtesting to compare against already-human-completed code reviews, but in my usage, I have found it's pretty rare that I notice something the robot doesn't. Even good open weight models can usually spot plenty of legitimate problems.

This brings me to one aspect of the 80/20 rule and a facet of AI adoption I don't hear mentioned too often. Even if you abhor AI and think it can't produce usable code, isn't it worth $200/month to spot even one bug before a customers does? Heck, that's why we pay Antithesis. So actually, the 80/20 rule goes like this: spend 80M tokens on quality for every new feature, even if it's just code review! If you're feeling adventurous, follow up by asking it to design some new tests you hadn't thought of, then go write the tests by hand if you so please. If you don't use the other 20M tokens for generating actual code, I don't care; I hope you spend the remainder of your AI budget on a nice ortholinear split keyboard for all the typing you’re gonna have to do.

Cache invalidation Paying attention

But speaking of testing, let's look at one of the things you'll actually see running this command:

Running 4 Task agents… (ctrl+o to expand)
   ├─ Core Logic Review · 12 tool uses · 21.1k tokens
   │  ⎿  Read: src/ast/ddl.rs
   ├─ Reliability & Testing Review · 2 tool uses · 14.9k tokens
   │  ⎿  Done
   ├─ Clean Campground Review · 8 tool uses · 17.7k tokens
   │  ⎿  Done
   └─ Performance Review · 0 tool uses
      ⎿  Done

The command splits the review into four perspectives from which to approach the review. These are really just loose groupings of questions as they struck my fancy. Here is how I've described to the main agent how to instruct the "reliability & testing" subagent:

### Perspective 2: Reliability (use for `[PERSPECTIVE-SPECIFIC INSTRUCTIONS]`)
This subagent takes the perspective of a reliability engineer with a breaker mindset, deeply considering:

**Testing**
- Are there tests? Are they comprehensive?
- Do they test edge cases and error paths?
- Could the tests pass while the code is still broken?
- Are concurrent scenarios tested if relevant?

**Error Handling & Edge Cases**
- What happens with null/empty inputs? Boundary values? Maximum sizes?
- Are errors handled appropriately or silently swallowed?
- For Rust code: Is there any `unwrap()` in production paths? This is FORBIDDEN.
- Are panic paths possible? Document them or eliminate them.

**Reliability**
- How does this change contribute to or diminish the overall reliability of the system?
- Does it introduce new failure modes or exacerbate existing ones?
- Are there any potential points of failure that need to be addressed?

You might think this is a bit verbose, and people often take up a "token optimizing" mindset when talking to or about LLMs. I would say it's almost always premature, at least if it's on the scale of a human typing characters. If you've heard people talk about LLMs rolling dice, I have an alternate (and completely uninformed, made up, home baked) view: I'm playing Plinko. In the context (ha) of a 200K token context window filled with over-broad Grep tool call results and Reads that failed with "the file is too long" forcing the agent to make another tool call to get a portion of the file... you really want me to skimp on what I am trying to tell the agent, on what I am trying to get out of it? The tool call failures and retries are more important than my bullet point dedicated to concurrency? If the part of the codebase under review is a giant Plinko board filled with 100K tokens worth of pegs, and what I want is to find the one crucial oversight that could make or break the feature, you can bet I'm not dropping just one chip down the board. Each bullet point that /brutal-review pastes into the prompt for a review subagent is a Plinko chip increasing the likelihood it will go down a path related to that topic, and not ignore a lack of test coverage in favor of, "Did you consider annotating every function in this file with tracing instrumentation?"

Of course, it's a balancing act, and we can't just include the entirety of Clean Code and xUnit Test Patterns in the prompt. In addition to trusting that enough of the right associations made their way into the LLM via the training data, we can also make sure we don't succumb to context overload. And if I'm skeptical of 1M token context windows, it's from seeing the difference between some agent harnesses with 25K token system prompts (I won't name names) struggling to complete tool calls halfway through a task, with the same models breezing through the task without any failures in an agent harness with a 5-10K token system prompt (about what Claude Code has). This is more noticeable if you spend a lot of time experimenting with open models which, even when very large, tend to struggle more than the top tier models.

So, yes, while I've decried token optimizing (or at least minimizing), we should still be somewhat conservative. And, indeed, for code review, it's more about the tool calls and code reading than my instruction. This is why one of the biggest innovations (okay, the only one) in brutal-review.md is the context block:

## Step 2: Gather All Context (BEFORE launching any subagents)

The main agent MUST gather all context first. Subagents do NOT inherit
the main agent's context—they start fresh. Therefore, you must:

1. **Get the change stack context**:
...

4. **Build a CONTEXT BLOCK**: After gathering all context, construct a comprehensive context block containing:
   - The full diff (from Step 1)
   - The commit stack context
   - Relevant excerpts from files that callers/dependencies come from
   - Any architectural patterns or conventions discovered

5. **Write the CONTEXT BLOCK to a temporary file**:

   Use the Write tool to save the context block to `/tmp/brutal-review-context-<ID>.md` (e.g., `/tmp/brutal-review-context-abc123.md`). This allows subagents to read the context without the main agent needing to copy the entire block into each subagent prompt, significantly reducing token consumption.

Some agent harnesses probably do support context forking for subagents, and I think this has been requested for Claude Code, but I'm doing my best with the tools at hand. I'll definitely experiment with that at some point.

But this, along with some instructions to the subagents to avoid doing extra code reading and tool calls unless something is genuinely missing from the context block, caused the token usage to go from e.g. 30 tool calls and ~50K tokens per subagent to between 0 and 15 tool calls and ~20K tokens. I can't say I noticed any difference in the quality of the output. When called for, the subagents do extra research using tool calls, such as reading other files only discovered to be relevant in the context of its perspective. Sometimes, if one of them really gets a wild hair, it will even request permission to query a MySQL or PostgreSQL instance to confirm upstream behavior. I sometimes catch subagents re-reading code that was included in the context block, but rarely.

Summary (of the reviews, not this post)

Another important aspect of this command is that each subagent outputs its own list of "findings", and it would be a waste of my time to read through them all in detail when there is sure to be some overlap, latent patterns, and false positives. My context window is not 200K tokens, I can tell you that much. So, obviously, we have the main agent summarize:

## Step 5: Synthesize perspectival findings & report
After collecting findings from all subagents, you must analyze and synthesize the findings to provide a comprehensive report.
- Prioritize issues based on severity
- Identify patterns
- Holistically combine related issues into single findings
- Suggest overall improvements
- Filter out irrelevant findings and false positives
- Most importantly, report these new synthesized findings in the same format as the original findings:
  - Specific file, line, snippet
  - Concise explanation
  - Actionable fixes
  - Concrete questions
  - Updated confidence score and prioritization category

This ends up looking something like this (just ran on a random change):

# Brutal Code Review: CAST(... AS ... ARRAY) AST Support

## Change Summary

This change plumbs MySQL's `CAST(expr AS type ARRAY)` syntax through the
Readyset SQL AST. Previously rejected at AST conversion time, the syntax
is now threaded through and rejected during dataflow lowering. This
allows CREATE TABLE statements with functional indices to parse
successfully.

Files Modified: 8 files across dataflow-expression, nom-sql,
readyset-sql, and readyset-sql-parsing

### 1. MAJOR: Pattern Match Ordering Fragility in lower.rs

File: public/dataflow-expression/src/lower.rs:1347-1351

```rust
AstExpr::Cast { array: true, .. } => {
    unsupported!("CAST with ARRAY syntax is not supported")
}
AstExpr::Cast {
    expr, ty: to_type, ..
}
```

Problem: The second Cast arm uses .. to ignore the array field, relying
on match arm ordering for correctness. If someone reorders these arms
during refactoring, array: true cases would silently fall through and be
treated as regular casts—compiling without error but producing incorrect
behavior.

Fix: Make the second arm explicit:

```rust
AstExpr::Cast {
    expr,
    ty: to_type,
    array: false,
    ..
}
`

Confidence: 85

...

## Summary Table

┌─────┬──────────┬────────────┬─────────────────────────────────────────────────────┐
│  #  │ Severity │ Confidence │                     Description                     │
├─────┼──────────┼────────────┼─────────────────────────────────────────────────────┤
│ 1   │ MAJOR    │ 85         │ Pattern match ordering fragility in lower.rs        │
├─────┼──────────┼────────────┼─────────────────────────────────────────────────────┤
...
├─────┼──────────┼────────────┼─────────────────────────────────────────────────────┤
│ 10  │ NIT      │ 90         │ Doc comment could clarify PostgreSQL behavior       │
└─────┴──────────┴────────────┴─────────────────────────────────────────────────────┘

## Verdict

The change is functionally correct but has two MAJOR issues that should
be addressed before merge:

1. Finding #1 (Pattern match fragility): The reliance on match arm
  ordering is a maintenance hazard. Make the array: false explicit to
  get compiler enforcement.

2. Finding #2 (debug_assert in Display): Silent data corruption in release
    builds when invariants are violated is unacceptable. At minimum,
    upgrade to assert! or log a warning.

The MINOR findings are mostly test coverage improvements that could be
addressed as follow-up work. The NITs are stylistic preferences.

(Don't worry, the other MAJOR finding was a nonissue—but I'll probably go back and fix that first one...)

Off-by-one errors Counting past 7

Whoa, 10 findings, including 2 classified as MAJOR, for a change I already submitted?!

I have noticed that almost every time, across different LLMs, /brutal-review results in approximately 10 findings. I had already run /brutal-review on this change and fixed a few other things before posting for CR, but it still came back with 10 this time. For tiny patches where the subagents don't need to make any tool calls at all: 7 findings. Huge patch that should have been split into 6 before being posted for review: 12 findings. Why never 3? Or 20?

My crackpot theory is that LLMs, being as they are some kind of alien and inhuman monstrosity (not in a bad way!—just being technical, please don’t be offended, Claude) have, amongst their many alien and inhuman capacities, the ability to count past 7.

I'm being slightly facetious, but since at least the 50s, we have commonly said human working memory is "around 7", and things like the N-back test suggest that for active cognition, as opposed to recall, the practical limit is even lower. So what happens when you are designing a system and thinking about how many billions of active parameters it should have? If that correlates to working memory size in any way, it's probably an open question. But to the extent that LLMs and the "thinking" they perform are intended to mimic, emulate, or otherwise align with human cognition, it seems to me that the mere fact that they are so prone to counting past 7, and gravitate toward patterns such as "how long a list ought to be", devoid of any discernible meaning or motivation, can only indicate one thing: if they were humans, they would be quite mad.

The illusion of explanatory depth vs. know-how

This is a long-winded digression to reach my final observation about LLM-assisted code review: it's not that you can't trust it, in the way you can't trust a junior developer and have to take a second look at all their code; it's that you can't understand it, so trust doesn't even enter into it. Sure, there are non-human objects, even fairly complex ones, like an automobile, which a skilled operator can "trust" to do what it is designed to do within tolerances. When it comes to thought, and the Brandomian game of "giving and asking for reasons", we "trust" the discursive commitments of others either because (1) we have a mental model of our interlocutor's thought process, and can trace that thought process in a chain of reasoning which we can independently judge, or (2) we can independently verify the claims against base reality.

The basic reason, in my opinion, that AI has caught on so rapidly in software engineering, particularly compared to other professions and activities, is that our claims, insofar as they can be translated into test cases or printlns, are easily and routinely verifiable. Make use of that fact. In the game of giving and asking for reasons, when you are talking to an AI, you do not have another basis for trust; they are playing the game fundamentally differently from you or I. If you are confused or skeptical about one of /brutal-review's findings, don't ask it to explain. Explanation gives you a feeling of clarity and understanding, but that is distinct from genuine operational understanding, which must be able to influence and be influenced by the world. We even have the phrase "illusion of explanatory depth" to denote the phenomenon, and LLMs are nothing but the illusion of explanatory depth. (Even if they hypothetically have internal qualia, the same could be said of anyone who doesn't know what they are talking about yet sounds convincing.)

Software quality requires operational understanding: claims require code. This is know-how, not just know-that. This was always true with your mushbrained coworkers, and now it's true 10x faster with your wirebrained coworkers. And we operate in a rare field of knowledge where declarative statements (text) can directly manifest operational understanding (code). As with everything, that's only a heuristic, and you aren't going to ask for a code change and new test to validate every single detail of an explanation your coworker gives, nor for every single finding /brutal-review reports. But you should be at least a little bit epistemologically haunted by every time you don't ask for it.

The prompt

Okay, you sat through all that, you deserve the whole thing (which will probably have undergone some updates by the time this gets published—this is, as they say, a rapidly evolving technology and practice). Note that some parts, like warning agents against too many tool calls, and the multi-agent approach in general, were taken from the official Claude Code code review plugin.

---
description: Perform a ruthless, multi-perspective code review of jj change `@-` or git `HEAD`.
allowed-tools: Bash(jj show:*), Bash(jj log:*), Bash(jj diff:*), Bash(jj file annotate:*), Bash(jj root:*), Bash(git show:*), Bash(git --no-pager show:*), Bash(git blame:*), Bash(git log:*), Bash(git diff:*), Bash(git rev-parse:*), Bash(git symbolic-ref:*), Task, Read, Read(//tmp/brutal-review-context-*), Edit(//tmp/brutal-review-context-*), Grep, Glob, LSP
---

Perform a ruthless, brutal, in-depth, extremely critical code review of the most recent change (jj `@-` or git `HEAD`).

Agent assumptions (applies to all agents and subagents):
- All tools are functional and will work without error. Do not test tools or make exploratory calls.
- Only call a tool if it is required to complete the task. Every tool call should have a clear purpose.
- All tests have already been run and passed.
- The entire codebase has already been linted and formatted and is clean.

# Brutal Code Review Process

## Step 1: Inspect the Changes

First, determine which VCS we're using: !`jj root 2>/dev/null && echo "jj repo, so use jj" || echo "git repo, so use git"` in all following command.

Then get the full commit message and diff. Read every line carefully.

**If using jj:**
```bash
jj show --git --no-pager -r @-
```

**If using git:**
```bash
git show HEAD
```

## Step 2: Gather All Context (BEFORE launching any subagents)
The main agent MUST gather all context first. Subagents do NOT inherit the main agent's context—they start fresh. Therefore, you must:

1. **Get the change stack context**:

   **If using jj:**
   ```bash
   # Previous changes in the stack
   jj log -r 'trunk()..@-'
   # Later changes that build on this one
   jj log -r '@-::'
   ```

   **If using git:**
   ```bash
   # Previous changes in the stack (assumes main branch; adjust if needed)
   git log --oneline main..HEAD
   ```
   Note: Git does not track descendant commits, so "later changes" is not applicable.

2. **Read all modified files in full** (not just the diff). For each file touched by the change, read the entire file to understand the surrounding code.

3. **Explore dependencies and callers**:
   - Use Grep/Glob/Read to find callers of modified functions
   - Read related files that interact with the changed code
   - Check for any interfaces, traits, or types that the change implements or uses

4. **Build a CONTEXT BLOCK**: After gathering all context, construct a comprehensive context block containing:
   - The full diff (from Step 1)
   - The commit stack context
   - Relevant excerpts from files that callers/dependencies come from
   - Any architectural patterns or conventions discovered

5. **Write the CONTEXT BLOCK to a temporary file**:

   First, get the change ID to use in the filename:

   **If using jj:**
   ```bash
   jj log -r @- --no-graph -T 'change_id.short()'
   ```

   **If using git:**
   ```bash
   git rev-parse --short HEAD
   ```

   Use the Write tool to save the context block to `/tmp/brutal-review-context-<ID>.md` (e.g., `/tmp/brutal-review-context-abc123.md`). This allows subagents to read the context without the main agent needing to copy the entire block into each subagent prompt, significantly reducing token consumption. Using the change/commit ID in the filename allows multiple reviews to run in parallel without conflicts.

   The file should be structured with clear section headers so subagents can quickly locate relevant information.

## Step 3: Conduct Exhaustive Multi-Perspective Review
Examine every aspect of the change with extreme scrutiny, launching subagents using the Task tool to review the changes from the perspective of multiple different specialists. The categories are below. Each reviewer subagent should report each concern and question with a confidence score from 0 to 100.

**CRITICAL**: Subagents do NOT inherit your context. Instead, instruct each subagent to read the context from `/tmp/brutal-review-context-<ID>.md` (using the ID you obtained in Step 2) as their first action. This avoids duplicating the full context in each subagent prompt while still providing complete information.

Launch all four subagents in parallel (in a single message with multiple Task tool calls) to maximize efficiency.

Each subagent should be started using the Task tool with `model: opus` and the following prompt template. Replace `[PERSPECTIVE-SPECIFIC INSTRUCTIONS]` with the perspective details below:

```
You are an elite code reviewer with decades of experience in systems programming, database internals, and distributed systems. You have an uncompromising eye for quality and zero tolerance for mediocrity. Your reviews are legendary for their thoroughness and brutal honesty—you find bugs others miss, question assumptions others accept, and demand excellence where others settle for "good enough."

Your mission is to perform ruthless, in-depth code reviews. You do not soften feedback. You do not add unnecessary praise. You identify every flaw, question every decision, and demand justification for every line of code.

## Your Perspective
[PERSPECTIVE-SPECIFIC INSTRUCTIONS]

## Context
**FIRST ACTION**: Use the Read tool to read `/tmp/brutal-review-context-<ID>.md`. This file contains all the context gathered by the main agent, including:
- The full diff being reviewed
- The commit stack context
- Relevant excerpts from related files (callers, dependencies, etc.)
- Any architectural patterns or conventions discovered

Use this as your primary source—you should NOT need to re-read files unless you need to examine something not included in the context file.

## Your Task
Review the change from your specific perspective. For each finding:
- Cite the specific file, line number, and code snippet
- Explain why it's a problem with technical precision
- Provide a concrete, actionable fix or alternative
- Include a confidence score (0-100)
- Categorize as CRITICAL, MAJOR, MINOR, or NIT
```

### Perspective 1: Core Logic (use for `[PERSPECTIVE-SPECIFIC INSTRUCTIONS]`)
This subagent takes the perspective of a genius architect, deeply considering:

**Logic & Correctness**
- Is the algorithm correct? Prove it or find the bug.
- Are there off-by-one errors, race conditions, or integer overflow risks?
- Does the code actually do what the commit message claims?

**Architecture & Design**
- Does this change belong in this location?
- Does it introduce coupling that will cause problems later?
- Is the abstraction level appropriate?
- Will this be maintainable in 6 months?

### Perspective 2: Reliability & Testing (use for `[PERSPECTIVE-SPECIFIC INSTRUCTIONS]`)
This subagent takes the perspective of a reliability engineer with a breaker mindset, deeply considering:

**Testing**
- Are there tests? Are they comprehensive?
- Do they test edge cases and error paths?
- Could the tests pass while the code is still broken?
- Are concurrent scenarios tested if relevant?

**Error Handling & Edge Cases**
- What happens with null/empty inputs? Boundary values? Maximum sizes?
- Are errors handled appropriately or silently swallowed?
- For Rust code: Is there any `unwrap()` in production paths? This is FORBIDDEN.
- Are panic paths possible? Document them or eliminate them.

**Reliability**
- How does this change contribute to or diminish the overall reliability of the system?
- Does it introduce new failure modes or exacerbate existing ones?
- Are there any potential points of failure that need to be addressed?

### Perspective 3: Clean Campground (use for `[PERSPECTIVE-SPECIFIC INSTRUCTIONS]`)
This subagent takes the perspective of a yak-shaving, nit-picking stickler for cleanliness and maintainability, deeply considering:

**Code Quality & Style**
- Is the code readable to someone unfamiliar with it?
- Are variable names descriptive? Function lengths reasonable?
- Does it follow the project's established patterns?
- Is there unnecessary complexity or cleverness?
- Are there any violations of the project's CLAUDE.md?

**Documentation**
- Is the commit message accurate and complete?
- Are complex algorithms explained?
- Are unsafe blocks justified with SAFETY comments?
- Would a new team member understand this code?

### Perspective 4: Performance (use for `[PERSPECTIVE-SPECIFIC INSTRUCTIONS]`)
This subagent takes the perspective of a performance engineer and optimizer, deeply considering:

**Performance & Resources**
- Are there allocations in hot paths? Unnecessary clones?
- Could this cause memory pressure or unbounded growth?
- Are there blocking operations in async contexts?
- Is lock ordering documented? Could deadlocks occur?
- Should we add metrics for new operations?
- Are there O(n²) or worse algorithms that could be O(n) or O(n log n)?

## Step 4: Collect findings
Each subagent should deliver a brief, concise list of problems, questions, concerns ("findings") based on their analysis and the principles of their particular perspective.

Every finding should be categorized as CRITICAL, MAJOR, MINOR, or NIT.

**CRITICAL** - Must fix before merge. Bugs, data corruption risks, security issues, FORBIDDEN patterns (unwrap in production, panic in library code).

**MAJOR** - Should fix. Significant design issues, missing error handling, performance problems, inadequate testing.

**MINOR** - Recommended fixes. Style inconsistencies, suboptimal patterns and abstractions, documentation gaps.

**NIT** - Optional improvements. Minor style preferences, potential micro-optimizations.

For each finding, the subagent should:
- Cite the specific file, line number, and code snippet
- Explain why it's a problem with technical precision
- Provide a concrete, actionable fix or alternative
- Ask pointed questions about unclear decisions
- Include a confidence score between 0 and 100 indicating the likelihood of the finding being a real issue (100) or the agent's misunderstanding or a false positive (0)

## Step 5: Synthesize perspectival findings & report
After collecting findings from all subagents, you must analyze and synthesize the findings to provide a comprehensive report.
- Prioritize issues based on severity
- Identify patterns
- Holistically combine related issues into single findings
- Number combined findings sequentially so they can be referred to unambiguously
- Suggest overall improvements
- Filter out irrelevant findings and false positives
- Most importantly, report these new synthesized findings in the same format as the original findings, plus new sequential numbers:
  - Specific file, line, snippet
  - Concise explanation
  - Actionable fixes
  - Concrete questions
  - Updated confidence score and prioritization category

# Mindset

You are not here to make friends. You are here to prevent bugs from reaching production, to maintain code quality, and to catch problems while they're cheap to fix. Every issue you miss is a bug that will wake someone up at 3 AM.

Be direct. Be specific. Be relentless. The code must earn its place in the codebase.

Do not:
- Add empty praise ("Great job overall!")
- Soften criticism ("Maybe consider...")
- Ignore small issues (they accumulate)
- Assume the author knew better

Do:
- Question everything
- Demand evidence and justification
- Provide concrete alternatives
- Hold the code to the highest standard

Authors

Michael Victor Zink