Sakana Fugu posts strong benchmarks but draws backlash over speed, cost, and wrapper claims

June 24, 2026

Sakana AI's new Fugu system is the loudest thing in the AI world this week, and not all the noise is good. The Tokyo startup says its multi-agent orchestration technology reaches frontier-level performance by coordinating several large language models behind a single OpenAI-compatible API. The published benchmarks look strong on paper, but early hands-on tests and industry commentary have flagged real problems with speed, pricing, and how the launch was marketed.

Fugu is not a single model. It works as a conductor, a learned orchestrator that breaks a complex task into parts, routes each part to a specialized model from a swappable pool, then stitches the results into one answer. That pool can include models from Anthropic, Google, and OpenAI, and Sakana can swap pieces in and out. The pitch is one unified API call, no juggling agents yourself, and resilience when a given model goes down or gets blocked by export controls.
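The decompose, route, and stitch pattern described above can be illustrated with a toy sketch. This is not Sakana's implementation: Fugu's orchestrator is learned, while the version below uses a hand-written rule, and every name in it (Orchestrator, Subtask, the "kind: prompt" input convention, the stand-in model functions) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" is just a function from prompt to text here; a real pool
# would hold API clients for different providers (assumption).
Model = Callable[[str], str]


@dataclass
class Subtask:
    kind: str    # label used for routing, e.g. "code"
    prompt: str  # the text handed to the chosen model


class Orchestrator:
    """Toy rule-based orchestrator over a swappable pool (hypothetical)."""

    def __init__(self, pool: Dict[str, Model], fallback: str) -> None:
        if fallback not in pool:
            raise ValueError("fallback model must be in the pool")
        self.pool = dict(pool)
        self.fallback = fallback

    def swap(self, kind: str, model: Model) -> None:
        # Add or replace the model that serves a given kind of subtask.
        self.pool[kind] = model

    def remove(self, kind: str) -> None:
        # Drop a model, e.g. when it becomes unavailable.
        if kind == self.fallback:
            raise ValueError("cannot remove the fallback model")
        self.pool.pop(kind, None)

    def decompose(self, task: str) -> List[Subtask]:
        # Stand-in for learned decomposition: one subtask per non-empty
        # line, written as "kind: prompt". Unlabeled lines use the fallback.
        subtasks = []
        for line in task.splitlines():
            if not line.strip():
                continue
            if ":" in line:
                kind, prompt = line.split(":", 1)
                subtasks.append(Subtask(kind.strip().lower(), prompt.strip()))
            else:
                subtasks.append(Subtask(self.fallback, line.strip()))
        return subtasks

    def route(self, subtask: Subtask) -> Model:
        # Use the specialist if present, otherwise the fallback model.
        return self.pool.get(subtask.kind, self.pool[self.fallback])

    def run(self, task: str) -> str:
        # Stitch the partial answers into one response, in order.
        parts = [self.route(s)(s.prompt) for s in self.decompose(task)]
        return "\n".join(parts)


pool = {
    "code": lambda p: f"[code-model] {p}",
    "general": lambda p: f"[general-model] {p}",
}
orchestrator = Orchestrator(pool, fallback="general")
result = orchestrator.run("code: write a parser\nmath: integrate x")

# Simulate a model disappearing from the pool: requests fall back.
orchestrator.remove("code")
result_after_removal = orchestrator.run("code: write a parser")
```

The caller only ever talks to run(), which is the point of the single-API pitch: the pool can change underneath without the calling code changing.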

There are two tiers. Fugu is the balanced option tuned for speed and everyday work. Fugu Ultra is tuned for maximum quality on hard problems. Both grew out of Sakana's research into learned orchestration, published as the TRINITY and Conductor papers at ICLR 2026. Sakana launched the system on June 22, 2026.

The published benchmarks are eye-catching, with one big caveat: every number is self-reported by Sakana. On SWE-Bench Pro, Fugu Ultra scored 73.7 and Fugu 59.0, beating Claude Opus 4.8 at 69.2 and GPT-5.5 at 58.6, while trailing the inaccessible Claude Fable 5 at 80.0. On LiveCodeBench both tiers landed near 93, ahead of Claude Fable 5 at 89.8. On GPQA-Diamond both hit 95.5, edging Mythos Preview at 94.6.

On TerminalBench 2.1, Fugu Ultra reached 82.1 and Fugu 80.2, ahead of Claude Opus 4.8 at 74.6 and GPT-5.5 at 78.2. On Humanity's Last Exam, Fugu Ultra scored 50.0 to Opus 4.8's 49.8, a narrow win. Independent verification is still thin, and Fable 5 and Mythos-level models are not in Fugu's current pool because of U.S. export controls.

Sakana also leaned on custom demos. Fugu Ultra generated a working mechanical iris in CAD where other models broke down structurally, built a Rubik's Cube solver that cleared all 300 test cases, and deciphered tangled 17th-century Japanese scattered writing, known as chirashigaki. It also posted strong results on automated research and long-horizon tasks.

Real-world reviews are split. Nearly 500 beta testers praised Fugu Ultra for code review that often surfaced more than 20 issues where GPT-5.5 caught about 3. Testers liked how it handled long, messy workflows like cybersecurity assessments, patent analysis, and reproducing research papers, and how it held a stable persona across long sessions.

The complaints are just as loud. Many independent testers call it extremely slow. AI researcher Ethan Mollick said his usual coding tests took around 30 minutes. One user burned an entire $20 plan quota on a single complex prompt. Frontend and visual coding came out weaker and more uneven than with top single models, and several testers said the output was fine but did not consistently match Claude Fable 5 in practice.

The backlash on X, Hacker News, Reddit, and tech media has been sharp. The loudest line is that Fugu is just a wrapper or router, an orchestration layer sitting on top of existing models rather than a new frontier model. Some called Sakana's framing that Fugu matches Fable- and Mythos-level performance misleading or even fraudulent.

Pricing drew heat too. The $200-per-month Max plan reportedly gives heavy users less than three hours of use a week, and Fugu Ultra costs about $30 per million output tokens at the base rate. Critics also note that Fugu does not disclose which underlying model handles a given task, and that it still leans heavily on U.S. frontier models, which undercuts the claim of independence from export controls. Some observers also criticized Sakana's aggressive hiring push around the launch.
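To put the per-token figure in perspective, here is a back-of-the-envelope calculation. It is a rough sketch that assumes only output tokens are billed at the reported base rate of about 30 dollars per million; input tokens, any internal tokens consumed by the underlying models, and the way subscription quotas are actually metered are not described in the reporting and are ignored here. The function names are illustrative.

```python
# Reported base rate for Fugu Ultra output tokens (approximate).
ULTRA_USD_PER_MILLION_OUTPUT = 30.0


def output_cost_usd(output_tokens: int,
                    usd_per_million: float = ULTRA_USD_PER_MILLION_OUTPUT) -> float:
    """Cost of a given number of output tokens at a flat per-million rate."""
    return output_tokens / 1_000_000 * usd_per_million


def tokens_for_budget(budget_usd: float,
                      usd_per_million: float = ULTRA_USD_PER_MILLION_OUTPUT) -> int:
    """Whole output tokens a budget buys at a flat per-million rate."""
    return int(budget_usd / usd_per_million * 1_000_000)


cost_half_million = output_cost_usd(500_000)  # 15.0 dollars
tokens_for_20_usd = tokens_for_budget(20.0)   # roughly 667 thousand tokens
```

Under those assumptions, 20 dollars buys roughly two thirds of a million output tokens, which gives a sense of how a long multi-model run could consume a small budget quickly.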

Sakana's answer is that benchmarks only tell part of the story. The company points to beta feedback showing strength in long-running real-world agentic work, where single models often drift or fail. It also stresses the strategic value of a swappable pool as export rules tighten. The plan is to add more open models, eventually its own models, and give users more control over coordination.

Fugu fits a growing shift toward orchestration and multi-agent systems instead of ever-bigger single models. When a top model can vanish overnight for geopolitical reasons, tools that smartly combine what is already available start to look valuable. Whether Fugu becomes mainstream or stays a niche high-end option depends on faster responses, better pricing, and independent proof of its claims. For now it delivers strong results on paper and shows promise for complex workflows, while early adopters hit real friction on latency and cost, and the orchestrator-versus-model debate stays wide open.