The integrated generation environment

Measuring and adjusting

Evaluation is not a one-off. You measure, adjust, and measure again. The human is the quality system.

Morgan Kavanagh · Published 2026-03-28

Why measurement matters

Without measurement, you are guessing. You think the model is doing a good job because the outputs look reasonable. But "looks reasonable" is not a quality standard. Measurement tells you how often the model agrees with human judgment, how many outputs need correction, which types of tasks produce the most errors, and whether quality is improving or degrading over time. It turns your intuition into evidence and your workflow into a system you can manage.

What to measure

The metrics depend on the task. For classification: accuracy, precision, and recall against human-labelled ground truth. For extraction: how many fields are correct, how many are missing, how many are fabricated. For generation: how often the first draft is accepted without changes, how many edits are needed per output, and what types of edits are most common. For scoring: how well the model's scores correlate with human scores. Start with the simplest metric that answers "is this good enough?" and add complexity only when needed.
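For the classification case, the core metrics can be computed directly against a human-labelled set. A minimal sketch, with no external dependencies (the function name and the example labels are illustrative, not from the original):

```python
def classification_metrics(predicted, truth, positive):
    """Accuracy over all labels, plus precision and recall for one
    class of interest, measured against human-labelled ground truth."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == positive and t == positive)
    fp = sum(1 for p, t in pairs if p == positive and t != positive)
    fn = sum(1 for p, t in pairs if p != positive and t == positive)
    accuracy = sum(1 for p, t in pairs if p == t) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall

# Example: model labels vs. human labels for four emails
predicted = ["complaint", "request", "complaint", "request"]
truth = ["complaint", "request", "request", "request"]
acc, prec, rec = classification_metrics(predicted, truth, "complaint")
# acc = 0.75, prec = 0.5, rec = 1.0
```

The same shape works for the extraction and scoring cases: swap the comparison function (field-by-field equality, or a correlation coefficient) while keeping the human labels as the reference.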

The adjustment cycle

Measurement leads to adjustment. If classification accuracy is 85% and you need 95%, you examine the errors: are they concentrated in one category? Is the prompt ambiguous? Is the model too small for the task? Based on the diagnosis, you adjust: refine the prompt, add examples, switch models, or restructure the categories. Then you measure again. This cycle of measure, diagnose, adjust, and measure again is the core discipline of working with AI in production. It is not a one-time setup; it is continuous.
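The "are errors concentrated in one category?" question in the diagnosis step is a simple count over the mismatches. A sketch (the function name and labels are illustrative):

```python
from collections import Counter

def error_breakdown(predicted, truth):
    """Count errors by (true label, predicted label) pair, most
    frequent first, to show where mistakes cluster."""
    errors = Counter(
        (t, p) for p, t in zip(predicted, truth) if p != t
    )
    return errors.most_common()

# Example: two of three errors confuse complaints with requests
predicted = ["request", "request", "feedback", "request"]
truth = ["complaint", "complaint", "complaint", "request"]
# error_breakdown(...) → [(("complaint", "request"), 2),
#                         (("complaint", "feedback"), 1)]
```

If one pair dominates the list, the fix is usually local: clarify the distinction in the prompt or add contrastive examples for exactly those two categories.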

The human as the quality system

In this framework, the human is not the person who does the work that the AI cannot. The human is the quality system. You define the criteria. You evaluate the output. You make the corrections. You decide when the quality is sufficient and when it is not. The AI handles volume and consistency; you handle judgment and standards. This division of labour is not a limitation of AI. It is the design of the system. The human in the loop is not a safety net; they are the quality control function.

Examples

Tracking correction rates

You implement an email classification workflow. In the first week, you correct 40 out of 200 classifications, a 20% error rate. You examine the errors and find that most involve distinguishing complaints from requests. You add two examples of each type to the prompt. In the second week, the error rate drops to 8%. You continue tracking. By week four, it stabilises at 5%, which you accept as sufficient for the task.
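The tracking in this example reduces to a running list of (corrections, total) pairs per week. A minimal sketch; the week-three figure below is hypothetical, since the text only gives weeks one, two, and four:

```python
def track_correction_rates(weekly_counts):
    """weekly_counts: list of (corrections, total_outputs) per week.
    Returns the error rate for each week as a fraction."""
    return [corrections / total for corrections, total in weekly_counts]

# Figures from the email-classification example
# (week 3 is an assumed interpolation):
weeks = [(40, 200), (16, 200), (12, 200), (10, 200)]
rates = track_correction_rates(weeks)
# rates = [0.2, 0.08, 0.06, 0.05]
```

A stabilising tail like this is the signal to stop adjusting: further prompt work buys little once the rate flattens at an acceptable level.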