Judging

When you run multiple agents on the same task, judge agents can automatically evaluate the results and help you identify the best solution. Judges analyze code quality, correctness, and completeness, and their feedback reduces the time you spend reviewing variants by hand.

What Are Judge Tasks?

A judge task is a special task that evaluates other tasks in a group. Unlike regular tasks that modify your code, judges:

Run after primary tasks complete
Have read-only access to primary task results
Analyze code quality, correctness, and completeness
Produce evaluation notes and scoring
Do not modify your repositories

Judge tasks appear in the task group alongside other variants.

How Judges Evaluate

When a judge task runs, it:

Reads all primary task results: patches, summaries, exit codes, and logs
Analyzes the code changes each agent made
Reviews test results and error messages
Evaluates each variant on multiple dimensions
Generates a detailed report with scores and recommendations

Evaluation Dimensions

Judges score variants on:

Correctness: Does the code solve the problem? Are edge cases handled? Do tests pass?
Code quality: Is it readable, maintainable, and following good patterns?
Completeness: Are all requirements addressed? Is anything missing?
Performance: Is the implementation efficient? (when applicable)

Each dimension receives a score, and judges provide detailed notes explaining their reasoning.

Judges may also score additional dimensions when the task context calls for them.
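As a rough illustration, one judge's evaluation of a single variant could be modeled as a set of per-dimension scores plus notes. This is only a sketch: the `Evaluation` class, its field names, the 0-10 scale, and the unweighted average are all assumptions made for this example, not the actual report format.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Evaluation:
    """Hypothetical shape of one judge's evaluation of one variant."""

    variant: str
    scores: dict[str, float]  # dimension name -> score (assumed 0-10 scale)
    notes: str = ""

    def overall(self) -> float:
        # Unweighted mean over whichever dimensions the judge scored,
        # so optional dimensions such as performance can be left out.
        return mean(self.scores.values())


evaluation = Evaluation(
    variant="variant-a",
    scores={"correctness": 9, "code_quality": 7, "completeness": 8},
    notes="Tests pass; naming is inconsistent in two modules.",
)
print(evaluation.variant, evaluation.overall())
```

Averaging only the dimensions that were scored keeps the sketch compatible with judges that skip or add dimensions. A real report may weight dimensions differently.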

Automatic Judging

You can configure task groups to automatically launch judge tasks when primary agents finish.

Configuring Auto-Judge

When creating a task group, select which agents should serve as judges.

Multiple judges provide independent evaluations, reducing bias and increasing confidence in the results.

When Auto-Judge Launches

Judge tasks launch automatically when all of the following are true:

All primary tasks have completed
At least two variants finished successfully
Multiple variants made file changes
No follow-up instructions are pending

If any of these conditions isn't met, auto-judge is skipped, but you can still launch judges manually.
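The launch conditions above can be read as a single check. The sketch below is illustrative only: `PrimaryTask`, its fields, and `should_auto_judge` are hypothetical names, and it assumes that "multiple variants" means at least two and that every condition must hold.

```python
from dataclasses import dataclass


@dataclass
class PrimaryTask:
    # Hypothetical fields for illustration; not the product's task model.
    completed: bool
    succeeded: bool
    changed_files: bool


def should_auto_judge(tasks: list[PrimaryTask], pending_follow_ups: int) -> bool:
    """Return True only when every auto-judge condition is met."""
    all_completed = all(task.completed for task in tasks)
    successful = sum(1 for task in tasks if task.succeeded)
    with_changes = sum(1 for task in tasks if task.changed_files)
    return (
        bool(tasks)
        and all_completed
        and successful >= 2        # at least two variants finished successfully
        and with_changes >= 2      # assumed meaning of "multiple variants"
        and pending_follow_ups == 0
    )


group = [
    PrimaryTask(completed=True, succeeded=True, changed_files=True),
    PrimaryTask(completed=True, succeeded=True, changed_files=True),
    PrimaryTask(completed=True, succeeded=False, changed_files=False),
]
print(should_auto_judge(group, pending_follow_ups=0))
```

In this example one variant failed, but the remaining two still satisfy every condition, so the check passes.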

Manual Judge Launch

You can launch judge tasks at any time:

Open the task group
Click the Judge button
Select which agents to use as judges
Judge tasks are created and queued

This is useful when auto-judge conditions weren't met, or when you want additional evaluation after making changes.

Judge Consensus

When multiple judges evaluate the same variants:

Each judge scores independently
Results can be compared side-by-side
Consensus emerges when judges agree on the best variant
Disagreements highlight areas worth closer review

If two out of three judges recommend the same variant, that's a strong signal. If judges disagree significantly, you may want to review their reasoning before deciding.
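One simple way to think about consensus is a majority vote over each judge's recommended variant. The function below is a minimal sketch under that assumption. `majority_recommendation` and the judge and variant names are hypothetical, and nothing here implies that consensus is computed for you.

```python
from collections import Counter
from typing import Optional


def majority_recommendation(recommendations: dict[str, str]) -> Optional[str]:
    """Return the variant that a strict majority of judges recommend.

    `recommendations` maps a judge name to its recommended variant.
    Returns None when no variant has more than half of the votes.
    """
    if not recommendations:
        return None
    counts = Counter(recommendations.values())
    variant, votes = counts.most_common(1)[0]
    # Integer comparison avoids floating-point edge cases.
    return variant if votes * 2 > len(recommendations) else None


votes = {"judge-1": "variant-a", "judge-2": "variant-a", "judge-3": "variant-b"}
print(majority_recommendation(votes))
```

When the function returns `None`, the judges are split, which is the case where reading their notes matters most.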

Using Judge Feedback

Judge feedback isn't just for picking a winner; it also helps you improve the code.

Common Issues Judges Identify

Test failures: Some tests aren't passing
Edge cases: Boundary conditions not handled
Error handling: Missing validation or exception handling
Code style: Inconsistent naming or formatting
Incomplete implementation: Features not fully implemented

Feedback Loops

After reviewing judge feedback:

Identify specific issues mentioned in the evaluation
Send follow-up instructions to the winning variant addressing those issues
The agent resumes and implements improvements
Optionally re-run judges to verify the improvements

This creates a refinement cycle in which judges catch issues and agents then fix them.
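If you collect the issues from a judge's notes, turning them into a follow-up instruction can be as simple as a numbered list. This is a sketch only: `build_follow_up` is a hypothetical helper, and the wording of the instruction is just one possibility.

```python
def build_follow_up(issues: list[str]) -> str:
    """Format judge-identified issues as one follow-up instruction."""
    if not issues:
        raise ValueError("at least one issue is required")
    lines = ["Please address the following issues raised by the judges:"]
    lines += [f"{number}. {issue}" for number, issue in enumerate(issues, start=1)]
    return "\n".join(lines)


message = build_follow_up([
    "Handle empty input in the parser",
    "Add validation for negative quantities",
])
print(message)
```

Keeping each issue specific and separately numbered makes it easier to check, after the agent resumes, which items were actually addressed.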

Judges Don't Approve

Important: Judge tasks provide feedback and recommendations only. They do not:

Automatically approve changes
Commit or push code
Mark tasks as winners

You make the final decision on winner selection and approval. Judges inform your decision; they don't make it for you.
