Judging

When you run multiple agents on the same task, judge agents can automatically evaluate the results and help you identify the best solution. Judges analyze code quality, correctness, and completeness, giving you objective feedback that reduces the time you spend reviewing variants.

What Are Judge Tasks?

A judge task is a special task that evaluates other tasks in a group. Unlike regular tasks that modify your code, judges:

  • Run after primary tasks complete
  • Have read-only access to primary task results
  • Analyze code quality, correctness, and completeness
  • Produce evaluation notes and scoring
  • Do not modify your repositories

Judge tasks appear in the task group alongside other variants.

How Judges Evaluate

When a judge task runs, it:

  1. Reads all primary task results—patches, summaries, exit codes, logs
  2. Analyzes the code changes each agent made
  3. Reviews test results and error messages
  4. Evaluates each variant on multiple dimensions
  5. Generates a detailed report with scores and recommendations

Evaluation Dimensions

Judges score variants on:

  • Correctness: Does the code solve the problem? Are edge cases handled? Do tests pass?
  • Code quality: Is it readable, maintainable, and following good patterns?
  • Completeness: Are all requirements addressed? Is anything missing?
  • Performance: Is the implementation efficient? (when applicable)

Each dimension receives a score, and judges provide detailed notes explaining their reasoning.

Depending on the task context, judges may also score additional dimensions beyond these four.
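To make the idea of per-dimension scores concrete, here is a minimal Python sketch of how an evaluation might be recorded and variants ranked. The `Evaluation` class, the 0-10 scale, and the unweighted mean are all assumptions made for illustration; they are not the product's actual data model or scoring formula.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class Evaluation:
    """Hypothetical record of one judge's assessment of one variant."""
    variant: str
    # dimension name -> score (a 0-10 scale is assumed here)
    scores: dict[str, float] = field(default_factory=dict)
    # dimension name -> the judge's explanatory note
    notes: dict[str, str] = field(default_factory=dict)

    def overall(self) -> float:
        # Unweighted mean over whichever dimensions the judge scored,
        # so extra task-specific dimensions are included automatically.
        return mean(self.scores.values())


def rank(evaluations):
    """Return evaluations ordered from highest to lowest overall score."""
    return sorted(evaluations, key=lambda e: e.overall(), reverse=True)
```

Because `overall()` averages whatever keys are present, a judge that adds a dimension such as security needs no special handling in this sketch. A real judge may weight dimensions differently.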

Automatic Judging

You can configure task groups to automatically launch judge tasks when primary agents finish.

Configuring Auto-Judge

When creating a task group, select which agents should serve as judges.

Multiple judges provide independent evaluations, reducing bias and increasing confidence in the results.

When Auto-Judge Launches

Judge tasks launch automatically when:

  • All primary tasks have completed
  • At least two variants have finished successfully
  • At least two variants have made file changes
  • No follow-up instructions are pending

If any of these conditions isn't met, auto-judge is skipped, but you can still launch judges manually.
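The launch conditions above can be read as a single predicate. The following Python sketch shows one way to express it; the `PrimaryTask` fields and the `should_auto_judge` function are hypothetical names chosen for this example, and "multiple variants" is interpreted as two or more.

```python
from dataclasses import dataclass


@dataclass
class PrimaryTask:
    """Hypothetical summary of one primary task's state."""
    completed: bool
    succeeded: bool
    changed_files: bool
    pending_follow_up: bool = False


def should_auto_judge(tasks):
    """Return True only when every auto-judge condition holds."""
    all_done = all(t.completed for t in tasks)
    successes = sum(t.succeeded for t in tasks)
    with_changes = sum(t.changed_files for t in tasks)
    no_pending = not any(t.pending_follow_up for t in tasks)
    # "Multiple variants" is assumed to mean at least two.
    return all_done and successes >= 2 and with_changes >= 2 and no_pending
```

If the predicate is false, for example because only one variant produced changes, there is little to compare, which is why a manual launch remains available.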

Manual Judge Launch

You can launch judge tasks at any time:

  1. Open the task group
  2. Click the Judge button
  3. Select which agents to use as judges
  4. Judge tasks are created and queued

This is useful when auto-judge conditions weren't met, or when you want additional evaluation after making changes.

Judge Consensus

When multiple judges evaluate the same variants:

  • Each judge scores independently
  • Results can be compared side-by-side
  • Consensus emerges when judges agree on the best variant
  • Disagreements highlight areas worth closer review

If two out of three judges recommend the same variant, that's a strong signal. If judges disagree significantly, you may want to review their reasoning before deciding.
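As a rough illustration of the "two out of three" rule of thumb, the sketch below treats consensus as a strict majority of judge recommendations. The `consensus` function and the judge and variant names are assumptions for this example; the product does not necessarily compute consensus this way.

```python
from collections import Counter


def consensus(recommendations):
    """Return the variant a strict majority of judges recommend, else None.

    `recommendations` maps a judge name to the variant that judge prefers.
    """
    if not recommendations:
        return None
    variant, votes = Counter(recommendations.values()).most_common(1)[0]
    # Strict majority: more than half of all judges must agree.
    return variant if votes * 2 > len(recommendations) else None


picks = {"judge-1": "variant-a", "judge-2": "variant-a", "judge-3": "variant-b"}
winner = consensus(picks)  # two of three judges agree on "variant-a"
```

A `None` result corresponds to the disagreement case, where reading each judge's reasoning is more useful than counting votes.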

Using Judge Feedback

Judge feedback isn't just for picking a winner—it helps you improve the code.

Common Issues Judges Identify

  • Test failures: Some tests aren't passing
  • Edge cases: Boundary conditions not handled
  • Error handling: Missing validation or exception handling
  • Code style: Inconsistent naming or formatting
  • Incomplete implementation: Features not fully implemented

Feedback Loops

After reviewing judge feedback:

  1. Identify specific issues mentioned in the evaluation
  2. Send follow-up instructions to the winning variant addressing those issues
  3. The agent resumes and implements improvements
  4. Optionally re-run judges to verify the improvements

This creates an iterative refinement cycle in which judges catch issues, you relay them as follow-up instructions, and agents fix them.
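Step 2 of the loop above works best when the follow-up instructions are specific. Here is a small Python sketch that turns a list of judge findings into a numbered follow-up message; `build_follow_up` and the example findings are hypothetical, and you would still paste or send the resulting text yourself.

```python
def build_follow_up(issues):
    """Format (category, detail) pairs as a numbered follow-up instruction."""
    lines = ["Please address the following issues raised by the judges:"]
    for number, (category, detail) in enumerate(issues, start=1):
        lines.append(f"{number}. [{category}] {detail}")
    return "\n".join(lines)


message = build_follow_up([
    ("Edge cases", "Empty input is not handled"),
    ("Error handling", "Missing validation on user-supplied paths"),
])
```

Keeping each issue on its own numbered line makes it easier to check, after re-running judges, which items were actually resolved.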

Judges Don't Approve

Important: Judge tasks provide feedback and recommendations only. They do not:

  • Automatically approve changes
  • Commit or push code
  • Mark tasks as winners

You make the final decision on winner selection and approval. Judges inform your decision—they don't make it for you.