理解常见陷阱——偷懒
我的理解
“偷懒”行为的根本原因在于 GPT 模型的训练数据中缺乏长输出样本,导致其在长文本生成任务中性能退化——这是模型层面的限制,而非 ChatGPT 产品设计造成的。验证思路本身极具价值:通过直接调用 GPT API 排除 ChatGPT 系统 prompt 的影响,用快速实验代替猜测,这是理解 AI 内部机制的核心方法论。解决方案十分简单:将长任务拆解为多个短任务,使每个子任务落在模型的“舒适区”内。这种“设计实验→验证假设”的思维方式,是整门课反复强调的工程心态,适用于遇到的任何非预期 AI 行为。
相关链接
- Ch02-L02 研究 LLM内部机制 记忆知识上下文 — 提供偷懒行为的理论基础:输出长度限制与上下文窗口机制
- Ch02-L04 理解常见陷阱 遗忘 — 平行的上下文陷阱,两者对比能更清晰地理解上下文窗口的双重限制
- Ch02-L06 应用 使用 ChatGPT 的新方式 编辑 而非对话 — 任务拆解与上下文管理的实践应用方法
- Ch02-L08 案例研究 周业务回顾 具体实现 — WBR 实现中实际运用了任务拆解策略来应对输出限制
原文
Lesson 10 of 68 理解常见陷阱——偷懒 / Make sense of common pitfalls - Laziness
在使用 ChatGPT 等生成式 AI 的过程中,我们常常会产生复杂的感受。有时,我们会惊叹于 AI 的聪明:它能像人类一样理解我们的表达,像人类一样对话,并凭借出色的推理能力完成各种高难度任务。但有时,我们也会因为它的“笨”而感到失望,尤其是在一些人类绝不会犯错的奇怪场景中。甚至在某些时候,这两种感受会同时出现。例如,我们经常会觉得 ChatGPT 变“懒”了,就像一个聪明人在偷工减料。又或者,它一开始表现得很好,严格遵守我们的所有要求;可随着对话推进,它会突然开始遗忘,不再满足许多要求。这种不可预期与困惑让人难以真正信任并充分发挥生成式 AI 的价值。但好消息是,这节课将帮助你理解所有这些出人意料的行为。你会发现,所有这些表现都是可以解释和理解的,我们也完全有办法绕开这些糟糕体验——这将是下一节课的主题。
偷懒:输出长度的限制
我们来看一个常见的例子。当我们让 AI 翻译或重写一段包含许多段落的长文本时,它在开头可能会非常严格地遵循要求,逐句翻译。但渐渐地,它可能会变得“不耐烦”,开始跳过句子。这种情况会逐步加剧,最终它会跳过大段文本,使整个翻译或重写结果变得毫无用处。
这种偷懒行为确实存在,从经验上看,原因在于当前底层的 GPT 模型并未针对“长输出”任务进行训练。这可能是因为这类长对话样本难以收集,且成本高昂。就像人类学习一样,如果模型在训练过程中从未见过这种长对话,它在真实世界中也不知道如何恰当应对。一个佐证是:最新的 GPT 模型(截至 2024 年 5 月为 GPT-4-Turbo)拥有 128k 的上下文窗口,但输出长度被硬性限制在 4k tokens 以内。
不过解决方法很简单:把任务拆分成更小的任务,让它们落在 GPT 的“舒适区”内。例如,与其让 GPT“重写以下文本以提升可读性:<5 页文本>”,我们可以将提示拆解为多个小段:“重写以下文本以提升可读性:<1 页文本>”,最后将结果合并。这样通常能得到好得多的结果。
你可能会问:这究竟是底层 GPT 模型本身的限制,还是 ChatGPT 的问题?前面的课程中提到,ChatGPT 是基于 GPT 构建的产品,OpenAI 在其中加入了额外的提示词,可能在“暗示”它不必那么卖力。要判断到底是哪种原因,看似困难,但其实有一个简单的办法:我们直接用相同的提示词调用 GPT API。如果偷懒源自 ChatGPT 的系统提示词,那么在调用 GPT API 时这些提示词并不存在,偷懒行为应该会消失。我们实际做了这个实验,结果发现 GPT API 同样会表现出偷懒。因此可以得出结论:偷懒行为来自底层 GPT 模型本身的局限。
上面的讨论其实非常具有启发性。虽然仅凭已有信息我们无法直接得出问题的答案,但我们通过设计一些快速实验来寻找答案。正如后文将展示的那样,这种思维方式对于理解 GPT 的内部工作机制、并为我们面临的问题找到解决方案非常有帮助。
English Original
In our journey of using GenAIs, such as ChatGPTs, it is often the case that we get a mixture of feelings. At some point, we are astonished by how smart AI could be. It understands what we say like humans, talks like humans, and can accomplish hard tasks with its impressive reasoning capabilities. In other cases, it is also easy to get disappointed by how dumb the AI could be, especially in some weird places where humans won’t make mistakes. Sometimes, we even get a mixed feeling of these two at the same time. For example, it is not uncommon to feel that ChatGPT gets lazy, just like a smart human to cut corners. Other times, it performs well at first, adhering to all our requests and requirements. When the conversation goes along, suddenly it begins to forget things, not meeting many requirements. This unexpectedness and confusion makes it hard to trust and take full advantage of GenAI. But fortunately, this lesson will help you make sense of all of these unexpected behaviors. It will turn out that all of those behaviors can be explained and understood, and it is also possible for us to navigate around those bad experiences, which will be the topic of the next lesson.
Laziness: Limit of Output Size
Let’s take a look at a common example. We ask the AI to translate or rewrite a long piece of text with many paragraphs. It may stick with the requirement really well at the beginning. Translate sentence by sentence. However, it may become “impatient,” and begin to skip sentences. This will gradually become more severe, until it reaches a point that it skips large segments of text and rendering the entire translation or rewriting not useful.
Such kind of lazy behavior indeed exists and that’s empirically because the current underlying GPT model is not trained for tasks with long outputs. This is potentially because those long conversations are hard and expensive to collect. Similar to human learning about something, when the model never saw such long conversations during training, it didn’t know how to respond properly in the real world. A side evidence is, for the latest GPT model (GPT-4-Turbo as of May 2024), it has a 128k context window, but the output size has a hard limit of 4k tokens.
The solution is simple though. Just split the task into smaller tasks, so they fall within the “comfort zone” of GPT. For example, instead of asking GPT “rewrite the following text to improve the readability: <5 pages of text>”, we could reorganize the prompt as small pieces of “rewrite the following text to improve the readability: <1 page text>”. And finally combine the results. This will likely give us a much better result.
And you may ask, is it really because of the underlying limit of the GPT model? Or is it because of ChatGPT? It was mentioned in previous lessons that ChatGPT is a product built upon GPT and OpenAI adds extra prompts, which potentially tells it not to work that hard. It may seem challenging to determine which one is the case, but there is actually an easy way to figure it out. We can simply invoke the GPT API with the same prompt. If the laziness is from ChatGPT’s system prompts, then because these prompts are not there when we use GPT API, we would expect the laziness will go away. We actually did the experiment and we found out the GPT API still shows laziness. Therefore, we could conclude that the lazy behavior is from the limitation of the underlying GPT model.
It is actually a quite inspiring discussion above. Although we don’t know the answer to the problem itself based on the available information, what we did was we designed some quick experiments to figure it out. As will be shown in the following part, this mindset is very helpful in understanding how GPT works internally and finding a resolution to the problems we face.