Bridgewater tested GPT, Claude, and Gemini on basic financial judgment tasks. None passed.

If you ask a hedge fund analyst whether a central bank document signals an upcoming rate change, they’ll tell you in seconds. Ask them to write down exactly how they know — the rules and heuristics behind that call — and most will struggle.

This gap between tacit expertise and explicit reasoning is what makes financial judgment hard to automate. And it’s exactly where today’s frontier AI models fall short.

Bridgewater’s AIA Labs, in collaboration with Thinking Machines Lab (the startup founded by former OpenAI CTO Mira Murati), put GPT, Claude, and Gemini through a battery of six tasks drawn from real analyst workflows. The jobs were deliberately basic: decide if a financial news article is worth an executive’s time, or determine whether a central bank release hints at a policy shift. The kind of thing a junior analyst handles before their first cup of coffee.

The frontier models averaged roughly 50 percent accuracy with standard prompts. Even after the researchers wrote detailed expert prompts and introduced a three-level classification system (“relevant and interesting,” “relevant but not interesting,” “not relevant”), accuracy rose to only 70 percent. Bridgewater’s internal bar for trustworthy deployment is 80 percent. None of the models cleared it.

Worse, newer versions didn’t help much. GPT-5.4 costs 43 percent more to run than GPT-5.2 but delivered only marginal accuracy gains.

So the team went a different route. They took Qwen3-235B, an open-source model from Alibaba, and fine-tuned it on the Tinker platform. The hardest part was building a clean training dataset. Annotation by domain experts is expensive, so the team first tried cheaper non-specialist labeling and found the resulting data riddled with errors.

Their workaround was clever: train a preliminary model on the noisy labels, then have it re-evaluate the same data. Wherever the model’s output disagreed with the original label, those cases were flagged for expert review. This cut annotation costs while catching exactly the ambiguous examples that matter most.
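To make the idea concrete, here is a minimal sketch of the disagreement check in Python. It is not the team's code: the `Example` type, the `flag_for_expert_review` helper, and the keyword-based `toy_predict` function are all hypothetical stand-ins, with `toy_predict` playing the role of the preliminary model trained on noisy labels.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Example:
    text: str
    noisy_label: str  # label from the cheaper, non-specialist annotation pass


def flag_for_expert_review(
    examples: Sequence[Example],
    predict: Callable[[str], str],
) -> List[int]:
    """Return indices where the model's prediction disagrees with the noisy label."""
    return [
        i for i, ex in enumerate(examples)
        if predict(ex.text) != ex.noisy_label
    ]


def toy_predict(text: str) -> str:
    # Hypothetical stand-in for the preliminary model (assumption, not the real model).
    return "relevant" if "rate" in text.lower() else "not relevant"


examples = [
    Example("Central bank hints at rate cut", "relevant"),
    Example("Local bakery wins award", "relevant"),        # disagrees with the model
    Example("Rate decision postponed", "not relevant"),    # disagrees with the model
    Example("Celebrity launches perfume", "not relevant"),
]

flagged = flag_for_expert_review(examples, toy_predict)
print(flagged)  # [1, 2]
```

Only the two flagged items would go to an expert; the other two keep their cheap labels. In a real pipeline the share of flagged items, and therefore the savings, would depend on how noisy the original labels are.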

After multiple training rounds (interleaved batch training, a CISPO loss, asymmetric clipping, and on-policy distillation at the best validation checkpoint), the fine-tuned model hit 84.7 percent accuracy. That beat the best frontier model tested, which maxed out at 78.2 percent, and reduced the error rate by 29.8 percent. Because the model was smaller, inference cost was roughly one-fourteenth that of the leading commercial APIs.
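The 29.8 percent figure is a relative reduction in errors, not a gap in accuracy points. A quick check in Python, using only the two accuracy numbers reported above (the function name is just illustrative):

```python
def relative_error_reduction(baseline_acc: float, new_acc: float) -> float:
    """Percent reduction in error rate when accuracy moves from baseline to new."""
    baseline_err = 100.0 - baseline_acc  # 21.8 percent of cases wrong
    new_err = 100.0 - new_acc            # 15.3 percent of cases wrong
    return (baseline_err - new_err) / baseline_err * 100.0


reduction = relative_error_reduction(78.2, 84.7)
print(f"{reduction:.1f}%")  # 29.8%
```

In other words, a 6.5-point accuracy gain removes nearly a third of the mistakes the best frontier model was still making.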

Bridgewater is already using the model in daily operations. The broader takeaway, the researchers argue, is that frontier models have blind spots wherever the valuable data is proprietary, since companies keep that data private precisely because it gives them a competitive edge. By fine-tuning open-source models with their own toolchains, organizations can retain control over weights, data, and compute infrastructure instead of handing proprietary information to frontier labs where it could become fodder for future competitors.

The research paper, “Learning to Replicate Expert Judgment in Financial Tasks” from Thinking Machines Lab, lays out the full methodology and benchmark results.