This AI model evaluation platform hit $100M ARR in 8 months — its competitor just shut down

There’s building a product users like, and then there’s building a product enterprises pay millions for. Arena just proved how wide that gap can be.

The AI model evaluation platform, spun out of UC Berkeley in April 2025, announced Tuesday that its enterprise service, AI Evaluations, has hit $100 million in annualized recurring revenue just eight months after launch. The product uses human feedback data to help companies assess how AI models perform in real business environments, with traceable test samples and service-level agreements baked in.

Arena started as an academic project called LMArena at UC Berkeley in 2023. Its early strategy was simple: let anyone test AI models for free by comparing their outputs side by side. That hands-on approach built a loyal user base fast, and the resulting leaderboard — ranking models by human preference — became one of the industry’s most-watched benchmarks.

The company incorporated in April 2025 and launched its commercial product five months later. AI Evaluations builds on the same human-judgment data the platform has been collecting since its academic days, but packages it with formal SLAs and traceable testing. For companies choosing between dozens of competing models, or deciding whether to switch providers, that kind of ground-truth data is exactly what internal benchmarks can’t provide.

Not everyone figured this out. Yupp, another AI evaluation platform founded in 2024, shut down on March 31 this year. The company had attracted over 1.3 million users and signed paid partnerships with AI labs, but never found strong enough product-market fit. Its vision was a two-sided marketplace: let users test models for free, then sell the usage data back to AI companies. The users came. The data accumulated. The revenue didn’t follow.

Arena’s trajectory suggests the market for honest model evaluation is real, but only when the customer is the enterprise buying models, not the lab whose models are being tested. The platform says it will keep building more evaluation tools and AI collaboration features. With $100 million in ARR as validation and a failed competitor in the rearview, it has the runway to do it.