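Pre-Training an LLM

My Understanding

An LLM's knowledge, common sense, and reasoning ability come mainly from the pre-training stage: repeatedly predicting the “next word” over an enormous body of text (Llama 3, for example, was trained on about 15 trillion tokens). The task looks simple, but doing it well requires the model to implicitly track context such as the topic or the season, which the lesson argues touches on comprehension to some extent. The most important insight for builders is that the capabilities you activate through a prompt are already embedded in the pre-trained model. A prompt does not “teach” the model new knowledge; it triggers abilities the model already has. These abilities emerged as models and training data grew, which would also explain why large models have a much higher capability ceiling than small ones.

Related Links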


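Original Text

Lesson 53 of 68: Pre-Training an LLM

The AI model behind ChatGPT is also a machine learning model, trained on a vast amount of data. Its power comes from two main factors. First, the training data is enormous. For example, Llama 3, an open-source LLM from Meta, was trained on about 15 trillion tokens, roughly equivalent to 100 million books. Second, it follows a special training procedure: pre-training and fine-tuning. Fine-tuning is critical for making the AI genuinely useful, but pre-training is the key stage that gives an LLM its core capabilities, including knowledge, common sense, deduction, and reasoning. This lesson therefore focuses on pre-training.

In the pre-training stage, the training data is text. It could be, for example, the text of this lesson. We feed the LLM a piece of preceding text (such as the first paragraph of this lesson) and ask it to predict the next word [footnote 1, see the end of the lesson]. As described earlier, when the model predicts incorrectly, we adjust its internal parameters so that it predicts more accurately next time.

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/

That’s it. We repeat this process for the next word, the next paragraph, and all the text in 100 million books. Along the way, the LLM will (very likely) encounter the input “the sun rises from” and need to output “the” as the next word. If its output is wrong, the model is updated to steer it toward the correct answer, “the.” When the input is “the sun rises from the,” it likewise learns to output “east.” Across these 100 million books, instances of “the sun rises from the east” far outnumber “the sun rises from the west” or “the sun rises from an apple.”

This is where the magic starts. The training process only teaches the LLM to predict the next word, but in doing so the model actually learns a fact: the sun rises from the east. By encountering such patterns repeatedly, the LLM internalizes this information.

But the magic doesn’t stop there. Training data is often diverse. Some books, for example, may first mention that it is summer somewhere and only later say “the sun rises from the northeast,” or may write “the sun rises from the west” when discussing the planet Venus. In these cases, the LLM needs to learn that when it has seen “in the summer” earlier in the text (not necessarily right before the current words), it should lean toward predicting “northeast,” and that when the topic is Venus, “west” is the safer choice. This is why the simple task of predicting the next word can carry potentially complex knowledge. It even touches on comprehension to some extent, because to complete the task the model has to work out the season or topic of the text.

Once we have a pre-trained LLM, its seemingly simple ability to predict the next word can power complex applications. For example, given the input “Venus is an interesting planet. The sun rises from,” the LLM may predict “the” as the next word. We then append the newly predicted word to the input and ask the LLM to predict the word that follows “Venus is an interesting planet. The sun rises from the.” This time, the LLM predicts “west.” By repeating this cycle of predicting, appending, and predicting again, we can generate long texts.

This iterative process of prediction and generation lets the LLM produce coherent, context-appropriate text based on the patterns and knowledge it has internalized from its large-scale training data. This capability is what makes LLMs so versatile and powerful in applications such as writing assistance and question answering.

Our understanding of why predicting the next token gives rise to something as powerful as GenAI is still limited. One plausible view comes from Ilya Sutskever, a co-founder and former chief scientist of OpenAI, who argues that large language models (LLMs) are a compression of the world’s knowledge, and that this compression itself signifies intelligence (ref). What we do know is that this “reasoning” ability “emerged” as models grew larger and training data grew broader. For the practical purposes of this course, we will not dig into the details.

Footnotes:

[1] Strictly speaking, the next token. For ease of discussion, we say “word” here.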

English Original

The AI model underlying ChatGPT is also a machine learning (ML) model, trained on a vast amount of data. Its power comes from two main factors. First, the training dataset is enormous. For instance, Llama 3, an open-source LLM from Meta, was pre-trained on about 15 trillion tokens, roughly equivalent to 100 million books. Second, it’s trained following special procedures: pre-training and fine-tuning. While fine-tuning is critical for making AI helpful, pre-training is the key stage that grants LLMs their core capabilities, such as knowledge, common sense, deduction, and reasoning. So, we will focus on pre-training in this lesson.
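As a rough sanity check on the comparison above, the two figures imply an average of 150,000 tokens per book. The sketch below works this out. The words-per-token ratio is a common rule of thumb for English text and is an assumption, not a figure from the lesson.

```python
# Figures quoted in the lesson.
TOTAL_TOKENS = 15 * 10**12   # 15 trillion tokens
TOTAL_BOOKS = 100 * 10**6    # 100 million books

# Implied average book length, in tokens.
tokens_per_book = TOTAL_TOKENS // TOTAL_BOOKS

# ASSUMPTION (not from the lesson): roughly 0.75 English words per token.
WORDS_PER_TOKEN = 0.75
words_per_book = tokens_per_book * WORDS_PER_TOKEN

print(f"{tokens_per_book:,} tokens per book")
print(f"about {words_per_book:,.0f} words per book")
```

Under that assumption, the implied book is a little over 100,000 words, which is in the range of a full-length novel, so the “100 million books” comparison is plausible as an order-of-magnitude estimate.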

During the pre-training stage, the training data consists of text. For example, it could be the text of this lesson. We provide the LLM with some preceding text, such as the first paragraph of this lesson, and ask it to predict the next word [footnote 1, check the end of the lesson for footnotes]. As introduced before, when the model predicts the wrong word, we tweak its internal parameters so that it will be more accurate the next time.
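To make the “predict, then tweak” step concrete, here is a minimal sketch, assuming a four-word vocabulary and a single list of scores (logits) for one fixed context. The vocabulary, learning rate, and function names are hypothetical. A real LLM has billions of parameters and uses a more elaborate optimizer, so plain gradient descent on cross-entropy loss is used here only as a simplification.

```python
import math

VOCAB = ["the", "east", "west", "apple"]  # hypothetical toy vocabulary


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def training_step(logits, target, learning_rate=0.5):
    """One gradient-descent step on cross-entropy loss for a single example."""
    probs = softmax(logits)
    target_index = VOCAB.index(target)
    # Gradient w.r.t. each logit: (prob - 1) for the target word, prob otherwise.
    return [
        logit - learning_rate * (p - (1.0 if i == target_index else 0.0))
        for i, (logit, p) in enumerate(zip(logits, probs))
    ]


# Scores for the context "the sun rises from the"; all zeros means no preference yet.
logits = [0.0] * len(VOCAB)
before = softmax(logits)[VOCAB.index("east")]
for _ in range(20):
    logits = training_step(logits, target="east")
after = softmax(logits)[VOCAB.index("east")]
print(f"P(east) before: {before:.2f}, after: {after:.2f}")
```

Each step raises the score of the correct word and lowers the others in proportion to how much probability they were wrongly given, which is the sense in which the model is “tweaked” to be more accurate next time.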

https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/

That’s it. We repeat this process for the next word, the next paragraph, and across all the texts from 100 million books. Through this process, the LLM will (very likely) encounter cases where it faces the input “the sun rises from,” and it needs to output “the” as the next word. If it outputs something different, the model will be updated to encourage the correct output of “the.” When it faces the input “the sun rises from the,” it would also learn to output “east.” In these 100 million books, there are likely far more instances mentioning “the sun rises from the east” compared to “the sun rises from the west” or “the sun rises from an apple.”
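The effect of seeing “east” far more often than “west” or “an apple” can be illustrated with simple counting. This is a sketch under strong assumptions: the miniature corpus and its frequencies are made up, and an LLM does not store explicit counts. It only shows why the most frequent continuation ends up as the preferred prediction.

```python
from collections import Counter, defaultdict

# Hypothetical miniature corpus standing in for the 100 million books.
corpus = (
    ["the sun rises from the east"] * 50
    + ["the sun rises from the west"] * 2
    + ["the sun rises from an apple"] * 1
)


def count_next_words(sentences, context_size=4):
    """Count which word follows each run of `context_size` words."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i in range(context_size, len(words)):
            context = tuple(words[i - context_size:i])
            counts[context][words[i]] += 1
    return counts


def predict_next(counts, text, context_size=4):
    """Return the most frequent continuation of the last words, or None if unseen."""
    context = tuple(text.split()[-context_size:])
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]


counts = count_next_words(corpus)
print(predict_next(counts, "the sun rises from"))      # most frequent word after this context
print(predict_next(counts, "the sun rises from the"))  # most frequent word after this context
```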

This is where the magic starts. The training process simply teaches the LLM to predict the next word, but through this, it actually learns a fact: the sun rises from the east. By encountering such patterns repeatedly, the LLM internalizes this information.

But the magic doesn’t stop here. Training data is often diverse. For example, some books may say “the sun rises from the northeast” after mentioning, somewhere earlier in the text, that it is summertime, or “the sun rises from the west” when discussing the planet Venus. In this case, the LLM needs to figure out that when it sees “in the summer” somewhere before (not necessarily immediately before the current text), it should probably predict “northeast,” and that when the topic is Venus, “west” is a safer bet. This is how the simple task of predicting the next word can carry potentially complicated knowledge. It even touches on comprehension to some extent because, to accomplish the task, the model needs to figure out the season or topic of the text.
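A toy version of this context dependence is sketched below. It assumes a hand-picked list of cue words and made-up training sentences, and it conditions the prediction only on whether a cue appeared anywhere earlier in the text. A real LLM is not given such a list; it has to learn which earlier words matter. The point is only that the same final words can call for different predictions depending on distant context.

```python
from collections import Counter, defaultdict

CUES = ("summer", "venus")  # hypothetical hand-picked topic cues

# Hypothetical training sentences; the last word is the one to predict.
training_texts = (
    ["the sun rises from the east"] * 20
    + ["it was summer in oslo and each morning the sun rises from the northeast"] * 5
    + ["on venus the rotation is reversed so the sun rises from the west"] * 5
)


def cue_in(words):
    """Return the first cue found anywhere in the earlier text, else None."""
    for word in words:
        cleaned = word.strip(".,!?")
        if cleaned in CUES:
            return cleaned
    return None


def train(texts):
    """Count final words separately for each cue (or for no cue at all)."""
    counts = defaultdict(Counter)
    for text in texts:
        *earlier, target = text.lower().split()
        counts[cue_in(earlier)][target] += 1
    return counts


def predict(counts, prompt):
    """Pick the most frequent final word seen with the same cue."""
    cue = cue_in(prompt.lower().split())
    return counts[cue].most_common(1)[0][0]


counts = train(training_texts)
print(predict(counts, "the sun rises from the"))
print(predict(counts, "Last summer, far from home, the sun rises from the"))
print(predict(counts, "Venus is an interesting planet. The sun rises from the"))
```

All three prompts end with the same five words, yet the cue that appears earlier changes which continuation is most likely.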

When we have a pre-trained LLM, its seemingly simple capability of predicting the next word can indeed power complicated applications. For example, when given the input “Venus is an interesting planet. The sun rises from,” the LLM may predict the next word as “the.” Then we append this newly predicted word to the input and ask the LLM to predict the next word after “Venus is an interesting planet. The sun rises from the.” In this case, the LLM would predict “west.” By repeating the process of prediction, appending, and further prediction, we can generate long texts.
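The predict, append, and predict-again loop described above can be sketched as follows. The lookup table is a hypothetical stand-in for a pre-trained model: it hardcodes the two predictions from the example and looks only at the last two words, whereas a real LLM would use the whole preceding text (which is how it would know to say “west” for Venus).

```python
# Hypothetical lookup table standing in for a pre-trained model:
# it maps the last two words of the text to a predicted next word.
TOY_MODEL = {
    ("rises", "from"): "the",
    ("from", "the"): "west",
}


def predict_next_word(text):
    """Stand-in for an LLM call; returns None when the toy table has no entry."""
    last_two = tuple(text.lower().split()[-2:])
    return TOY_MODEL.get(last_two)


def generate(prompt, max_new_words=10):
    """Repeat predict -> append -> predict until the model stops or the limit is hit."""
    text = prompt
    for _ in range(max_new_words):
        next_word = predict_next_word(text)
        if next_word is None:
            break
        text = f"{text} {next_word}"
    return text


print(generate("Venus is an interesting planet. The sun rises from"))
# prints: Venus is an interesting planet. The sun rises from the west
```

The `max_new_words` limit is a design choice for the sketch: generation has to stop somewhere, either because the model signals an end or because a length cap is reached.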

This iterative process of prediction and generation allows the LLM to produce coherent and contextually accurate text based on the patterns and knowledge it has internalized from its extensive training data. This capability is what enables LLMs to be so versatile and powerful in various applications, from writing assistance to answering questions and more.

Humans still have a limited understanding of why predicting the next token can lead to something as powerful as GenAI. One plausible theory comes from Ilya Sutskever, a co-founder and former chief scientist of OpenAI, who suggests that large language models (LLMs) are a compression of the world’s knowledge, and that this compression signifies intelligence (ref). What we do know is that this “reasoning” ability has “emerged” with larger models and more extensive data. For the practical purposes of this course, we won’t delve too deeply into it.

Footnotes:

[1] Strictly speaking, the next token. For ease of discussion, we say “word” here.