Breaking: Copilot Applied Science Researcher Automates Intellectual Toil with New 'Eval-Agents' Tool

A lead AI researcher on GitHub's Copilot Applied Science team has developed a new system called eval-agents that automates the tedious analysis of AI agent performance data, freeing developers to focus on creative problem-solving. The tool, built using GitHub Copilot, has already been adopted by the researcher's peers on the team.

"I may have just automated myself into a completely different job," the researcher said, highlighting a familiar pattern where engineers build systems to remove toil and then own those systems. "During this process, I learned a lot about how to effectively create and collaborate using GitHub Copilot. Applying these learnings has unlocked an incredibly fast development loop for myself as well as enabled my teammates to build solutions to fit their needs."

The impetus for the project came from the researcher's daily work analyzing coding agent trajectories, which are detailed JSON logs that each run to hundreds of lines. With dozens of tasks in a benchmark and multiple benchmark runs to review per day, analysts faced hundreds of thousands of lines. "It's an impossible task to do alone," the researcher explained. "I found that I kept repeating the same loop: I used GitHub Copilot to surface patterns then investigated them myself."

The engineer in the researcher saw the repetition and decided to automate it. Thus, eval-agents was born.

Background

The researcher designed eval-agents with three core principles: make agents easy to share and use, make it easy to author new agents, and make coding agents the primary vehicle for contributions. "Bullets one and two are in GitHub's lifeblood," the researcher noted of the first two principles, adding that these are skills honed while working as an open-source maintainer for the GitHub CLI.

Source: github.blog

The system uses GitHub Copilot to analyze agent performance across benchmarks such as TerminalBench2 and SWEBench-Pro. Instead of poring over raw JSON files, developers can now generate insights quickly and share them with the team.
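To give a concrete picture of the manual step being replaced, here is a minimal Python sketch of a first pass over a folder of trajectory files. It is not eval-agents code. The article does not describe the trajectory schema, so the directory layout, the function name summarize_trajectories, and the "status" field are all assumptions made for illustration.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_trajectories(directory):
    """Return (total line count, tally of statuses) for *.json files.

    Hypothetical schema: each file is a JSON object that may carry a
    "status" string. Real trajectory files may be shaped differently.
    """
    total_lines = 0
    statuses = Counter()
    for path in sorted(Path(directory).glob("*.json")):
        text = path.read_text(encoding="utf-8")
        # Measure how much raw text a person would otherwise read.
        total_lines += len(text.splitlines())
        record = json.loads(text)
        # Files without a "status" key are grouped under "unknown".
        statuses[record.get("status", "unknown")] += 1
    return total_lines, statuses
```

Calling summarize_trajectories("runs/") on a hypothetical runs/ folder returns the line total and a status tally. That is only the starting point for the kind of pattern-finding the researcher describes.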

What This Means

Eval-agents represents a shift toward automating not just rote tasks but also intellectual toil: the repetitive cognitive work that slows down AI research. If adopted more broadly, such tools could shorten the evaluation loop and let teams iterate faster on coding agents.

"Engineering and science teams work better together," the researcher said, emphasizing the collaborative nature of the project. By making it easy to author and share agents, the team hopes to foster a culture where everyone can contribute to improving the evaluation pipeline.

Industry observers note that this development could set a precedent for how AI research teams leverage generative AI itself to enhance productivity. As one analyst commented, "Using Copilot to improve Copilot's evaluation is a meta approach that could have far-reaching implications."
