o3 Did Not Game the Benchmarks

AI寒武纪

2024-12-22 15:03

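OpenAI research scientist Nat stressed that OpenAI takes data contamination very seriously. To rule out training-data leakage, the team validated o3 on held-out datasets such as ARC and FrontierMath, which supports the reliability of the results: o3 did not game the benchmarks.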

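The latest takes on o3

Two researchers, Jim Fan of NVIDIA and OpenAI research scientist Nat McAleese, shared their views on o3 on social media. Their posts describe the model's large gains in general-domain reasoning, the technical logic behind it, and the outlook ahead. They also address a question you may care about: did o3 game the benchmarks?

Jim Fan: extending “single-point RL superintelligence”

Jim Fan argues that the essence of o3 is extending the idea of “single-point RL superintelligence” so that it covers a broader range of useful problems. He notes that it is not unusual for AI to achieve astonishing results in a specific domain through reinforcement learning (RL).

For example, AlphaGo in Go, AlphaStar in StarCraft, and Boston Dynamics' e-Atlas robot on specific movements all performed at a superintelligent level, beyond the average human.

Similarly, on AIME, SWE-Bench, and FrontierMath, o3 shows ability beyond that of the vast majority of people. Unlike AlphaGo, o3's breakthrough lies in tackling the hard problem of reward functions for complex mathematics and software engineering. As a result, o3 is no longer an RL specialist for a single-point task; it shows strong RL-driven ability across a larger set of useful tasks.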
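To make the reward-function contrast concrete, here is a minimal Python sketch. In his post, Fan describes AlphaGo's reward as almost trivially defined: winning gives 1, losing gives 0. The second function is purely an illustrative assumption of what a checkable reward for a coding task could look like (the fraction of unit tests a candidate solution passes). It is not OpenAI's method; how o3's rewards are actually built is not described in the source. The names `game_reward` and `unit_test_reward` are hypothetical.

```python
from typing import Any, Callable, Sequence, Tuple


def game_reward(won: bool) -> float:
    """Board-game style reward as Fan describes it: 1 for a win, 0 for a loss."""
    return 1.0 if won else 0.0


def unit_test_reward(
    candidate: Callable[..., Any],
    test_cases: Sequence[Tuple[tuple, Any]],
) -> float:
    """Hypothetical reward for a coding task: fraction of test cases passed.

    Each test case is (args, expected_output). A candidate that raises an
    exception on a case simply fails that case.
    """
    if not test_cases:
        raise ValueError("at least one test case is required")
    passed = 0
    for args, expected in test_cases:
        try:
            if candidate(*args) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(test_cases)


# Toy task: "add two integers".
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
correct_score = unit_test_reward(lambda a, b: a + b, tests)  # passes all 3
buggy_score = unit_test_reward(lambda a, b: a - b, tests)    # passes 1 of 3
```

Even this toy version hints at why such rewards are harder to define than a win/loss signal: the score is only as good as the test cases, and a solution can pass every listed test while still being wrong in general.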

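However, Jim Fan also points out that o3's reward engineering cannot cover the full range of human cognition. This explains why o3 can astonish top experts in some domains yet fail at some simple children's puzzles, just as we would not expect AlphaGo to win at poker.

Jim Fan sees o3 as a huge milestone with a clear roadmap, while stressing that much work remains.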

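Nat McAleese: enormous progress in general-domain reasoning

Nat McAleese stresses that o3 represents enormous progress for RL in general-domain reasoning. McAleese first recalls o1, a large language model trained with RL, and notes that o3 is the result of scaling up RL further beyond o1.

o3's results are impressive across several areas:

Competitive programming: On recent programming competitions, o3 performs on par with the world's top programmers, with an estimated CodeForces rating above 2700. The milestone arrived sooner than Nat had expected.
GPQA: o3 scores 87.7% on GPQA Diamond, well above any external LLM Nat is aware of (for example, Gemini Flash 2 at 62%) and above o1's 78%.
Software engineering: o3 scores 71.7% on SWE-bench Verified, a large improvement over previous models.
Hard math: On FrontierMath 2024-11-26, o3 raises accuracy from the previous state of the art of 2% to 25%.
ARC: o3 scores 87.5% on the semi-private test set and 91.5% on the public validation set.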
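The figures above can be read two ways: as an absolute gain in percentage points, or as the share of remaining errors that were eliminated. The short Python sketch below computes both for the two before/after pairs quoted in the posts (GPQA Diamond, o1 at 78% versus o3 at 87.7%; FrontierMath, 2% versus 25%). The helper name `compare` and the choice of metrics are illustrative assumptions, not something the source defines.

```python
def compare(old_pct: float, new_pct: float) -> dict:
    """Return the absolute gain (percentage points) and relative error reduction."""
    old_err = 100.0 - old_pct
    new_err = 100.0 - new_pct
    return {
        "gain_points": round(new_pct - old_pct, 1),
        "error_reduction": round((old_err - new_err) / old_err, 3),
    }


# Before/after scores as quoted in the article.
scores = {
    "GPQA Diamond (o1 -> o3)": (78.0, 87.7),
    "FrontierMath (previous best -> o3)": (2.0, 25.0),
}

results = {name: compare(old, new) for name, (old, new) in scores.items()}

for name, r in results.items():
    print(f"{name}: +{r['gain_points']} points, "
          f"{r['error_reduction']:.1%} of remaining errors removed")
```

On GPQA Diamond the gain is 9.7 points, which removes about 44% of o1's remaining errors. On FrontierMath the gain is 23 points, but about three quarters of the problems are still unsolved.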

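Nat emphasizes that OpenAI takes data contamination very seriously. To rule out training-data leakage, the team validated o3 on held-out datasets such as ARC and FrontierMath, which supports the reliability of the results: o3 did not game the benchmarks.

Original posts

Jim Fan: Senior Research Manager at NVIDIA and co-founder of the GEAR Lab (embodied AI). Leads Project GR00T, which works on general-purpose robotics. Stanford PhD; OpenAI's first intern.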

Thoughts about o3: I'll skip the obvious part (extraordinary reasoning, FrontierMath is insanely hard, etc). I think the essence of o3 is about relaxing a single-point RL super intelligence to cover more points in the space of useful problems.

The world of AI is no stranger to RL achieving god-level stunts.

AlphaGo was a super intelligence. It beats the world champion in Go - well above 99.999% of regular players. AlphaStar was a super intelligence. It bests some of the greatest e-sport champion teams on StarCraft. Boston Dynamics e-Atlas was a super intelligence. It performs perfect backflips. Most human brains don't know how to send such sophisticated control signals to their limbs.

Similar statement can be made for AIME, SWE-Bench, and FrontierMath - they are like Go, which requires exceptional domain expertise above 99.99....% of average people. o3 is a super intelligence when operating in these domains.

The key difference is that AlphaGo uses RL to optimize for a simple, almost trivially defined reward function: winning the game gives 1, losing gives 0.

Learning reward functions for sophisticated math and software engineering are much harder. o3 made a breakthrough in solving the reward problem, for the domains that OpenAI prioritizes. It is no longer an RL specialist for single-point task, but an RL specialist for a bigger set of useful tasks.

Yet o3's reward engineering could not cover ALL distribution of human cognition. This is why we are still cursed by Moravec's paradox. o3 can wow the Fields Medalists, but still fail to solve some 5-yr-old puzzles like the one below. I am not at all surprised by this cognitive dissonance, just like we wouldn't expect AlphaGo to win Poker games.

Huge milestone. Clear roadmap. More to do.

Nat McAleese: researcher at OpenAI, previously at DeepMind.

o3 represents enormous progress in general-domain reasoning with RL — excited that we were able to announce some results today! Here’s a summary of what we shared about o3 in the livestream

o1 was the first large reasoning model; as we outlined in the original “Learning to Reason” blog, it’s “just” an LLM trained with RL. o3 is powered by further scaling up RL beyond o1, and the strength of the resulting model is very, very impressive.

Firstly and most importantly: we tested on recent unseen programming competitions and find that the model would rank amongst some of the best competitive programmers in the world, with an estimated CodeForces rating over 2700.

This is a milestone (codeforces better than Jakub Pachocki) that I thought was further away than December ’24; these competitions are hard and extremely competitive; the model is absurdly good.

Scores are impressive elsewhere too. 87.7% GPQA diamond towers over any LLM I’m aware of externally (I believe non-o1 sota is gemini flash 2 at 62%?), as well as o1’s 78%. Unknown noise ceiling, so this may even understate o3 science improvements over o1.

o3 can also do software engineering, setting a new state of the art on SWE-bench verified with 71.7%, massively improving over o1.

With scores this strong, you might fear accidental contamination. Avoiding this is something OAI is obviously obsessed with; but thankfully we also have some test sets that are strongly guaranteed uncontaminated: ARC and FrontierMath… What do we see there?

Well, on FrontierMath 2024-11-26 o3 improves the state of the art from 2% to 25% accuracy. These are absurdly hard, strongly held-out math questions. And on ARC, the semi-private test set and public validation set scores are 87.5% (private) and 91.5% (public).

So at least in those cases, we know with true certainty that results are not due to memorization (and very sure in all the other evals I describe as unseen too; I'm just tremendously paranoid).

We’ve also found that we can use o3 to train faster and cheaper models without losing as much performance as you might expect: o3-mini is a mighty little beast, and I’m hopeful that Hongyu Ren will share a good thread on how it stacks up.

Are there any catches? Well, as the ARC team outlined in our release, o3 is also the most expensive model ever at test-time. But what that means is we’ve unlocked a new era where spending more test-time compute can produce improved performance up to truly absurd levels.

My personal expectation is that token prices will fall and that the most important news here is that we now have methods to turn test-time compute into improved performance up to a very large scale.

The models will only get better with time; and almost nobody (on a grand scale) can still beat them at programming competitions or math. Merry Christmas!

As Sam mentioned at the start of the stream: this is not a model that you can talk to yet... unless you sign up to red team it with us! https://openai.com/index/early-access-for-safety-testing/

Source: AI寒武纪. Original title: 《o3 没有“刷榜”》 ("o3 Did Not Game the Benchmarks").
