How to Set Up an LLM-as-Judge Eval Harness for a Coding Agent
Build a working LLM-as-judge eval harness for a coding agent in Python: golden tasks, deterministic checks, a rubric judge on Claude Sonnet 4.6, calibration against human labels, and a CI gate that fails the build when scores regress.
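The pipeline described above (golden tasks → deterministic checks → rubric judge → CI gate) can be sketched as a minimal harness loop. This is an illustrative skeleton, not the article's implementation: `GoldenTask`, `run_harness`, and the stubbed `agent` and `judge` callables are hypothetical names, and a real judge would call the Claude API and return a rubric score.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenTask:
    """One curated task: a prompt plus deterministic pass/fail checks."""
    prompt: str
    checks: list  # each check: Callable[[str], bool] on the agent's output

def run_harness(
    tasks: list,
    agent: Callable[[str], str],
    judge: Callable[[str, str], float],  # returns a rubric score in [0, 1]
    threshold: float = 0.8,
):
    """Score every golden task; return (mean score, CI gate pass/fail)."""
    scores = []
    for task in tasks:
        output = agent(task.prompt)
        # Deterministic checks gate first: any failure zeroes the task score,
        # so the (more expensive, noisier) judge only sees plausible outputs.
        if not all(check(output) for check in task.checks):
            scores.append(0.0)
            continue
        scores.append(judge(task.prompt, output))
    mean = sum(scores) / len(scores)
    return mean, mean >= threshold  # gate: fail the build when mean regresses

# Usage with trivial stubs standing in for the real agent and judge:
tasks = [GoldenTask("Write an add(a, b) function.", [lambda out: "def add" in out])]
mean, passed = run_harness(
    tasks,
    agent=lambda prompt: "def add(a, b):\n    return a + b",
    judge=lambda prompt, output: 1.0,
)
```

Running deterministic checks before the judge keeps the harness cheap and reduces judge noise; the threshold is what the calibration step against human labels would tune.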