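The Problem of the Noisy Nudge

My understanding

The “noisy nudge” in training exposes a fundamental limitation: only knowledge that appears frequently in the training data can penetrate the noise floor and be truly internalized by the model; rare knowledge is drowned out by the nudges from the majority of samples and does not survive. This explains hallucination at the level of mechanism: the model is not recalling “facts” but predicting words that are consistent with the training data, with no built-in check on whether the knowledge is correct. Math is the most typical casualty: math problems make up a small share of the training data and are spread across many subfields, so they struggle to penetrate the noise floor. The practical takeaway for builders: for tasks that demand precision or involve rare knowledge (math calculations, private-data queries), rely on external tools rather than on what the model has stored internally.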

Original text

Lesson 54 of 68: Problem of the Noisy Nudge

English Original

Noisy nudging introduces an important caveat. Remember, the training process nudges the model to predict the correct next word for different input texts. This nudge can come from its direct example or other relevant examples. Consider our hot dog and basket example—it applies here too. If there are enough training examples for both “the sun rises from the east” and “on Venus, the sun rises from the west,” the model will learn both and grasp the nuances. But if there are relatively few examples about Venus, the nudges or guidance from the majority of the training data will dominate, and the model won’t learn that the sun actually rises from the west on Venus. So, when asked to predict the next word for “the sun rises from the,” the model will confidently give “east.”
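The Venus example can be made concrete with a toy calculation. The sketch below is not how an LLM is trained; it is a count-based stand-in that assumes the prediction after “on Venus, the sun rises from the” blends two sources of evidence: the Venus-specific examples and the shared pattern “the sun rises from the”, which every example nudges. The function names and the `prior_strength` parameter are hypothetical, chosen only for illustration.

```python
def p_west_on_venus(n_east: int, n_venus: int, prior_strength: float = 20.0) -> float:
    """Toy estimate of P("west") after "on Venus, the sun rises from the".

    n_east:  number of "the sun rises from the east" examples
    n_venus: number of "on Venus, the sun rises from the west" examples
    prior_strength: assumed weight of the shared pattern (hypothetical knob)
    """
    if n_east + n_venus == 0:
        raise ValueError("need at least one training example")
    # Shared pattern "the sun rises from the": every example nudges it,
    # so "west" only gets the Venus share of those nudges.
    p_west_shared = n_venus / (n_east + n_venus)
    # Venus-specific evidence: n_venus examples, all ending in "west".
    # Blend the two, trusting the specific evidence more as it accumulates.
    return (n_venus + prior_strength * p_west_shared) / (n_venus + prior_strength)


def predict(n_east: int, n_venus: int) -> str:
    """Pick the more probable next word under the toy estimate."""
    return "west" if p_west_on_venus(n_east, n_venus) > 0.5 else "east"


if __name__ == "__main__":
    print(predict(1000, 3))     # few Venus examples: prints "east"
    print(predict(1000, 1000))  # plenty of Venus examples: prints "west"
```

With only three Venus examples the shared pattern dominates and the toy model answers “east” even for the Venus prompt; with enough Venus examples the specific evidence wins.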

This is why we can only rely on the LLM’s knowledge when it appears many times in the training data. This reliance is rooted in the noisy nudging process. If the knowledge doesn’t appear frequently enough to penetrate the noise floor, it won’t survive the nudging process or be captured by the LLM. Ultimately, the LLM is trained to predict the next word consistent with the training data. There is no explicit concept of knowledge or any check on the correctness of the knowledge, which is why it suffers from hallucination. From the perspective of predicting the next word, the hallucinated content makes perfect sense.
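To see why a hallucination “makes perfect sense” from the next-word point of view, consider a deliberately crude sketch. It assumes a count-based predictor that falls back to the pooled pattern when it has never seen a prompt; the training pairs, `build_model`, and `predict` are all made up for illustration and do not describe any real model's internals.

```python
from collections import Counter, defaultdict


def build_model(examples):
    """Count next words per exact prompt, and pooled across all prompts."""
    exact = defaultdict(Counter)  # prompt -> counts of the word that followed
    pooled = Counter()            # counts of the following word, any prompt
    for prompt, next_word in examples:
        exact[prompt][next_word] += 1
        pooled[next_word] += 1
    return exact, pooled


def predict(model, prompt):
    """Return the most frequent continuation; note there is no truth check."""
    exact, pooled = model
    # Unseen prompt: fall back to whatever the overall pattern favors.
    counts = exact.get(prompt) or pooled
    return counts.most_common(1)[0][0]


# Hypothetical training data, for illustration only.
examples = (
    [("the capital of france is", "paris")] * 3
    + [("the capital of japan is", "tokyo")] * 2
)
model = build_model(examples)

print(predict(model, "the capital of japan is"))     # seen prompt: "tokyo"
print(predict(model, "the capital of atlantis is"))  # unseen prompt: "paris"
```

The second answer is fluent and consistent with the training data, yet nothing in `predict` ever asks whether it is true. That missing step is the point of the paragraph above.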

And it’s now natural to understand why AI models struggle with math. Math doesn’t dominate the training data, making it difficult to penetrate the noise floor. Additionally, math is a complex field with many subfields and types of questions, further thinning out the training data. However, for some subfields with sufficient training data coverage, AI models perform adequately, such as arithmetic questions involving numbers less than 100. This supports our previous theory.
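A rough count shows how quickly coverage thins out. The sketch below assumes the simplest possible setting: addition problems `a + b` with both operands below some limit, and a training set of distinct examples. The function name and the example sizes are hypothetical; real training data is not organized this way.

```python
def addition_coverage(num_examples: int, operand_limit: int) -> float:
    """Upper bound on the fraction of `a + b` problems a training set can cover.

    Assumes 0 <= a, b < operand_limit and that every example is distinct,
    so there are operand_limit ** 2 possible ordered problems.
    """
    total_problems = operand_limit ** 2
    return min(1.0, num_examples / total_problems)


if __name__ == "__main__":
    # One million distinct examples, operands below 100 vs. below one million.
    print(addition_coverage(1_000_000, 100))
    print(addition_coverage(1_000_000, 1_000_000))
```

With operands below 100 there are only 10,000 possible problems, so a modest number of examples can cover all of them. With operands below one million there are 10^12, and the same million examples cover at most one problem in a million.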

This issue has been a significant challenge in AI research. There are two main directions to address it. The first is to reconsider the need to solve math problems within the AI model itself. Exact memory recall and precise math operations are already well-solved problems. Why not use databases and calculators to support the AI model, allowing it to focus on what it does best: reasoning and coordination? This is partly why OpenAI integrated the Python environment and retrieval-augmented generation in ChatGPT. The other direction is to equip AI models with accurate memory and math capabilities. Achieving this would greatly simplify our lives, but no such product exists yet.
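The first direction can be sketched as a small router that hands exact work to exact tools. This is a minimal illustration under stated assumptions, not a description of how ChatGPT is wired: `calculate`, `FACTS`, and `answer` are hypothetical names, the “database” is a plain dictionary, and the step where a model decides which tool to call is replaced by an explicit `kind` argument.

```python
import ast
import operator

# Only these arithmetic operators are allowed in expressions.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculate(expression: str):
    """Calculator tool: evaluate + - * / on numbers, rejecting anything else."""
    def ev(node):
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)


# Hypothetical stand-in for a database of rare or private facts.
FACTS = {"sunrise direction on venus": "west"}


def answer(kind: str, payload: str):
    """Route a request to an exact tool instead of relying on recall."""
    if kind == "math":
        return calculate(payload)
    if kind == "lookup":
        return FACTS.get(payload.lower(), "not found")
    raise ValueError(f"unknown request kind: {kind}")


if __name__ == "__main__":
    print(answer("math", "123456789 * 987654321"))
    print(answer("lookup", "Sunrise direction on Venus"))
```

The design choice mirrors the paragraph above: the calculator and the lookup table are exact regardless of how often a fact or a sum appeared in any training data, which leaves the model with the part it is better at, deciding what to ask for.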