Visualizing the Lifecycle of AI Models: A Live Tracker for ELO Ratings

Introduction

Have you ever tried a new flagship AI model and been impressed by its sharp reasoning and creative flair, only to feel weeks later that it has lost some of its magic? This phenomenon, often called "model degradation" or "nerfing," has puzzled users and developers alike. To explore whether this perception has a measurable basis, I built a live tracker that visualizes the entire lifecycle of flagship AI models using historical ELO ratings from Arena AI.

The Live Tracker: A Clear View of Model Performance

Instead of cluttering the chart with every model variant, the tracker plots a single continuous curve for each major AI lab. It dynamically follows the highest-rated flagship model over time, making it easy to spot both sudden generational leaps and gradual performance decays.
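To make the "single curve per lab" idea concrete, here is a minimal Python sketch of one way to pick the highest-rated model per lab at each snapshot date. It is an illustration under assumptions, not the tracker's actual code: the function name `flagship_curve` and the record layout `(date, lab, model, rating)` are hypothetical, and dates are assumed to be ISO strings so they sort chronologically.

```python
from collections import defaultdict


def flagship_curve(records):
    """Return {lab: [(date, model, rating), ...]} sorted by date.

    `records` is an iterable of (date, lab, model, rating) tuples.
    The layout is a hypothetical one chosen for this sketch; dates are
    ISO strings such as "2024-01-01" so string order is date order.
    """
    best = {}  # (lab, date) -> (model, rating) of the top model that day
    for date, lab, model, rating in records:
        key = (lab, date)
        if key not in best or rating > best[key][1]:
            best[key] = (model, rating)

    curves = defaultdict(list)
    for (lab, date), (model, rating) in sorted(best.items()):
        curves[lab].append((date, model, rating))
    return dict(curves)


if __name__ == "__main__":
    sample = [
        ("2024-01-01", "LabA", "a-1", 1200),
        ("2024-01-01", "LabA", "a-2", 1250),
        ("2024-02-01", "LabA", "a-1", 1190),
        ("2024-02-01", "LabA", "a-2", 1240),
        ("2024-01-01", "LabB", "b-1", 1100),
    ]
    for lab, points in flagship_curve(sample).items():
        print(lab, points)
```

Because the curve follows whichever model is on top at each date, a new release shows up as a jump on the same line rather than as a separate series.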

The chart went through many iterations to stay clean and responsive on mobile devices, and an optional dark mode is included for comfortable viewing at any hour.

Methodology

The data source is Arena AI, a platform that collects ELO ratings from model-against-model battles. The tracker applies a smoothing algorithm to reduce noise while preserving trend patterns. Each lab's curve is color-coded, and hovering over any point reveals the model name and rating at that time.
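The post does not say which smoothing algorithm is used, so the following is only a stand-in: a trailing moving average, one of the simplest ways to damp point-to-point noise while keeping the overall trend. The function name `moving_average` and the default `window` size are assumptions made for this sketch.

```python
def moving_average(values, window=3):
    """Trailing moving average over a list of ratings.

    Early points, where fewer than `window` values exist, are averaged
    over the values seen so far, so the output has the same length as
    the input. This is an illustrative choice, not the tracker's method.
    """
    if window < 1:
        raise ValueError("window must be at least 1")

    smoothed = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just slid out of the window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


if __name__ == "__main__":
    print(moving_average([1200, 1210, 1190, 1220], window=2))
```

A larger window gives a smoother line but delays and flattens genuine jumps, which matters for a chart meant to show generational leaps.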

Key Findings

Early observations from the tracker are consistent with what many suspect: top-performing models often show a noticeable dip in ELO within weeks of launch. The decline may stem from model updates, changed safety wrappers, or server-side optimizations that subtly reduce quality. Major version bumps, such as the move from GPT-3.5 to GPT-4, instead show sharp upward jumps.
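One way to put a number on a post-launch dip is to compare a model's first recorded rating with its last rating inside a fixed window after launch. The sketch below assumes a hypothetical per-model series of `(date, rating)` pairs sorted by date, with the first entry treated as the launch. The function name `post_launch_change` and the 28-day default are illustrative choices, not values taken from the tracker. Ratings of this kind are relative to the pool of competing models, so a negative number here does not by itself show that the model changed.

```python
from datetime import date, timedelta


def post_launch_change(series, days=28):
    """Rating change between launch and the last point within `days`.

    `series` is a list of (date, rating) pairs sorted by date; the first
    pair is treated as the launch observation (an assumption of this
    sketch). Returns None when there is no later point in the window.
    """
    if not series:
        return None

    launch_day, launch_rating = series[0]
    cutoff = launch_day + timedelta(days=days)
    in_window = [rating for day, rating in series[1:] if day <= cutoff]
    if not in_window:
        return None
    return in_window[-1] - launch_rating


if __name__ == "__main__":
    example = [
        (date(2024, 1, 1), 1300),
        (date(2024, 1, 15), 1290),
        (date(2024, 1, 29), 1280),
        (date(2024, 3, 1), 1270),
    ]
    print(post_launch_change(example, days=28))
```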

The Blindspot: API vs. Consumer Experience

Arena AI primarily tests models via their API endpoints. However, everyday users interact through consumer chat UIs, which often add heavy system prompts, safety filters, or silently switch to quantized versions under high load. These differences can lead to a significant gap between API benchmarks and real-world performance.

This blindspot means the tracker, while informative, may not fully capture the "nerfing" that web users experience. I'd like to integrate data that reflects the consumer UI experience more accurately.

Call for Data: Consumer Web UI Evaluations

If you know of any historical ELO or evaluation datasets built by scraping or testing outputs from consumer web interfaces (rather than raw APIs), please get in touch. I'm eager to incorporate such data for a more complete picture.

Open-Source and Community Feedback

The entire project is open-source, with the repository linked in the footer of the dashboard. I welcome any suggestions, bug reports, or pointers to datasets. The goal is to make this tracker a reliable resource for understanding how AI models evolve in the wild.

Feel free to explore the live dashboard and see for yourself the peaks and valleys of AI model performance.