Revisiting Context Window Management: Automatically Managing Context Windows with Open WebUI
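My Understanding
The output context window is a long-overlooked LLM bottleneck: once the output text fills the window, the model gradually gets "lazy", skipping paragraphs in translations and wrapping up writing hastily, and the problem will not disappear as models iterate (even o1's output window is only 16K tokens). Open WebUI's pipe mechanism brings the solution to a no-code, ready-to-use level: it automatically splits the input, calls the LLM segment by segment, and stitches the results together, while Prefix KV Cache keeps the repeated prefix from adding much cost (Claude offers a 90% discount, GPT a 50% discount). This turns context management that used to require tedious manual work into a one-click deployable model feature, greatly lowers the practical barrier to long-document translation and in-depth writing, and shows the core customizability advantage of open tools over commercial products.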
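Related Links
- Ch02-L02 Studying LLM Internals (Memory, Knowledge, Context): this lesson's in-depth discussion of output context window management builds directly on the second module's foundational concepts of LLM internals and context.
- Ch02-L03 Understanding Common Pitfalls (Laziness): the "laziness" that appears once the output context fills up is a concrete manifestation, in long-output scenarios, of the LLM laziness pitfall described in the second module.
- Ch02-L04 Understanding Common Pitfalls (Forgetting): forgetting and laziness are both common pitfalls caused by poor context window management; reading the two lessons together gives a complete picture of context risks.
- Ch02-L09 The Power of Automation and the Role of Generative AI: automatically managing the context window is a model example of turning a tedious manual process into an automated capability, in line with the value proposition of automation.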
Original Text
Lesson 50 of 68: Revisiting Context Window Management: Automatically Managing Context Windows with Open WebUI
English Original
In Module 3, we introduced the core concept of the context window and noted that proper context window management is key to getting effective answers from an LLM. In that module, however, our focus was primarily on organizing the input context window so that it systematically includes all the background information, letting the LLM concentrate on answering rather than spending effort on making sense of a noisy context window or filtering out incorrect or useless information.
But context window management has another important aspect: the output context window. We only mentioned it briefly in Module 3, without detailed analysis. In this lesson, we first explore the output context window in more depth, look at its inherent pitfalls, and then propose simple solutions based on Open WebUI.
The Pain of the Output Context Window
The key to managing the output context window lies in the fact that, once the LLM's output text occupies a certain proportion of the output context window, its intelligence and instruction-following ability decline. For example, when translating a long document, the LLM initially produces a careful, sentence-by-sentence translation. However, as its output gradually fills the context window, it starts to slack off, for instance by shortening sentences or by skipping specific examples and translating only the general points. In the later stages, it may even skip entire paragraphs without translating them. This kind of slacking is the typical symptom that calls for output context window management.
Another related example is writing. If we ask the LLM for in-depth analytical writing on a particular topic, it struggles to exceed one or two thousand words even when the input prompt explicitly specifies, say, 4,000 words. The analysis is relatively thorough at first, but as the writing progresses it becomes increasingly cursory and eventually fizzles out.
Summarization, on the other hand, is a positive example. Even when the input is very long, the summary itself is usually short, and most LLMs have a much longer input context window than output window, so we rarely encounter such slacking in summarization tasks. The observations in both directions are caused by the characteristics of the LLM's output context window.
Solving these problems is also quite straightforward. As mentioned in Module 3, we can decompose the tasks assigned to the LLM so that each sub-question falls within a context window that it can easily handle. Finally, we can manually stitch these results together, thereby avoiding the problem of overly long output and completing our task with high quality.
Although this solution is effective, it has two issues. First, it is purely manual, which both wastes time and invites errors. Second, and more importantly, it relies on us to segment the input, and as humans we tend to segment quite arbitrarily. When we make the segments too long to save effort, the LLM still slacks off, but in a way that is more covert and more harmful: the manual work we have already put in can lull us into a false sense of diligence, so we lose the vigilance to check the quality of the results. This can lead to information being permanently lost during translation or writing.
Before exploring solutions to this problem, some students may ask: from a historical or long-term perspective, will LLMs solve this issue on their own? Consider the input context window: when GPT-4 was first released, we only had a 4K-token input window, but with the continuous development of new models we now have GPT-4o with 128K tokens, and even Claude with 200K tokens. Will the output context window also grow as LLMs develop, so that we no longer need this kind of optimization?
The answer is no. At least for now, although GPT-4 has made significant progress in its input context window, its output window has only increased from 4K to 8K tokens. Even the latest and most expensive models like o1 have only increased to 16K tokens. Moreover, even models like o1 with large context windows still encounter similar issues when dealing with long outputs. Therefore, managing the output context window is a long-standing problem that will not resolve itself with the passage of time. At least in the foreseeable future, we still need to consciously manage it.
Automated Context Window Management Without Coding
To eliminate the cumbersome, error-prone manual work described in the previous section, the simplest and most intuitive approach is automation, for example by writing code. However, even with AI assistance, writing such a program is challenging, and making it reusable and giving it a graphical interface is harder still.
The good news is that, as we demonstrated in the previous lesson, Open WebUI allows us to import personalized models with a single click. These models can contain predefined prompts, input and output filtering, tool calls, knowledge bases, and more. Specifically, we have already developed two tools that users can use directly to automate output context window management for LLMs. In this section, we use these two tools as examples to demonstrate how to download and deploy similar context management tools with a single click, and we share some development experience.
Downloading and deploying these two tools is very straightforward. We can directly go to the Open WebUI official community’s sharing page, click the download button, and enter our local Open WebUI URL.
https://openwebui.com/posts/diligent_llm_c6b8bcda
https://openwebui.com/posts/cot_augmented_smart_llm_d108aeb6
After downloading, similar to the tools demonstrated earlier, we also need to perform simple configurations. Specifically, in the Admin Panel’s Functions tab, enable the newly downloaded tool, then click the gear icon to configure its LLM backend address and API Key. Similar to what was introduced in the previous lesson, if you are using OpenAI’s API, you can simply paste the API Key; if you are using a local API backend compatible with OpenAI’s interface, you also need to change the first setting by pasting the local API URL, as shown in the figure below.
After this, we refresh the page, and when creating a new chat, we can see the new models in the model selection panel. Using them is also very simple. With the Long Context model, we can directly paste large blocks of text for translation or revision. With the CoT model, which is tailored for writing, we enter a general topic; the model first generates an outline, then writes a section for each outline point, and finally combines the results. I have shared some chat records to demonstrate how to use these two models: https://openwebui.com/posts/3d_printing_and_laser_cutting_insights_da3bdb89
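The outline-then-write flow of the CoT model can be sketched in a few lines. This is only a minimal illustration, not the actual code of the shared tool: the function name `write_long_article`, the prompt wording, and the `call_llm` callable (any function that takes a prompt string and returns the model's text) are all assumptions made for this example.

```python
from typing import Callable, List


def write_long_article(topic: str, call_llm: Callable[[str], str]) -> str:
    """Outline first, then write one section per outline point, then combine.

    `call_llm` is a hypothetical stand-in for whatever LLM backend is used.
    """
    # Step 1: ask for an outline, one point per line.
    outline_text = call_llm(
        f"Write a numbered outline, one point per line, "
        f"for an in-depth article on: {topic}"
    )
    points: List[str] = [
        line.strip() for line in outline_text.splitlines() if line.strip()
    ]

    # Step 2: write each section in its own request, so that every output
    # stays short relative to the output context window.
    sections: List[str] = []
    for point in points:
        prompt = (
            f"Topic: {topic}\n"
            f"Full outline:\n{outline_text}\n\n"
            f"Write only the section for this outline point, in depth:\n{point}"
        )
        sections.append(call_llm(prompt))

    # Step 3: stitch the sections together in outline order.
    return "\n\n".join(sections)
```

Because each section is generated by a separate call, no single response has to carry the whole article, which is the point of the approach described above.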
Overall, the idea behind these tools is to automatically split the input, then call the LLM for inference on each small segment separately, finally stitching the results together and displaying them within the program. This approach is difficult to implement and share in existing commercial products like ChatGPT or claude.ai, and can only be practically applied in open tools like Open WebUI.
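As a rough sketch of this split, call, and stitch idea (and not the tools' actual implementation), the following code packs whole paragraphs into chunks under a character budget, sends each chunk to the model, and joins the outputs. The names `split_into_chunks`, `process_long_text`, and `call_llm`, as well as the 2,000-character default budget, are assumptions chosen for illustration; a real tool would more likely budget in tokens.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_chars: int = 2000) -> List[str]:
    """Greedily pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole rather than
    being cut in the middle of a sentence.
    """
    chunks: List[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def process_long_text(
    text: str, call_llm: Callable[[str], str], max_chars: int = 2000
) -> str:
    """Run the LLM on each chunk separately and stitch the results in order."""
    outputs = [call_llm(chunk) for chunk in split_into_chunks(text, max_chars)]
    return "\n\n".join(outputs)
```

Splitting on paragraph boundaries instead of at fixed offsets is a deliberate choice here: it avoids the arbitrary cuts that make manual segmentation error-prone.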
More Implementation Details
If you are only interested in using these tools, you can skip this section. However, if you are interested in developing similar tools, you can learn more technical details here.
To develop similar tools, we need to clarify three concepts.
The first is the basic concept of a pipe. In Open WebUI, there is a tool-like code module called a pipe: https://docs.openwebui.com/features/extensibility/pipelines/. It is a very flexible, plugin-like mechanism: it can modify the LLM's input and output, for example for content moderation; it can post-process the LLM's output, for example for reflection and fact-checking; and it can even forward the user's input to other LLMs. For example, the Anthropic plugin we used earlier essentially forwards the user's input to Anthropic's Claude API and then returns the result to the user. It therefore lets us customize the entire LLM pipeline extensively and conveniently.
Second, how the prompts are actually constructed. It is important to note that, when splitting the prompt, it is far from sufficient to treat it as a single string and cut it into several pieces. We still need to manage the input context window in the process. In other words, we need to ensure that the LLM still has the global background knowledge when it answers, so that it can give correct answers, and we also need to keep the results for different segments reasonably uniform and consistent. For these two plugins, we therefore use a similar prompt construction strategy: we first give the LLM all of the user's requirements and prompts in full, and then tell it explicitly that we are adopting a divide-and-conquer strategy, so that from this point on it only needs to answer, analyze, or translate one small segment at a time. This ensures that it has enough information to give accurate, complete, and consistent responses. However, it also introduces a problem, which is the third point.
Third, because the input of each API request is now very long, token usage can become particularly large, multiplying both API latency and cost. Fortunately, both commercial and open-source LLMs support an important feature called Prefix KV Cache. We will not go into the technical details here, but the intuition is that when the input prompts of two LLM inference requests share a prefix, the second request can reuse a considerable part of the intermediate results of the first. In other words, the intermediate results of the first inference are cached for the next one, so the second inference is significantly faster and cheaper.
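To make the idea of a pipe more concrete, here is a simplified, offline sketch of its general shape. The class name `Pipe`, the `Valves` settings object, and the `pipe(body)` method follow the layout that Open WebUI function pipes use as far as I understand it, but treat the details as assumptions and check the linked documentation before relying on them. In particular, a real pipe typically declares `Valves` with pydantic and calls an HTTP API itself; the dataclass and the injected `backend` callable below are substitutions that keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Valves:
    # Hypothetical settings, analogous to what the gear icon configures.
    api_base_url: str = "https://api.openai.com/v1"
    api_key: str = ""


class Pipe:
    """Sits between the chat UI and the LLM backend."""

    def __init__(self, backend: Callable[[List[Dict]], str]):
        self.valves = Valves()
        # `backend` stands in for the real API call; it is injected here
        # only so that the sketch runs without network access.
        self.backend = backend

    def pipe(self, body: Dict) -> str:
        messages = body.get("messages", [])
        # Pre-processing hook: here we simply drop empty messages.
        cleaned = [m for m in messages if str(m.get("content", "")).strip()]
        # Forward the (possibly modified) input to the backend LLM.
        reply = self.backend(cleaned)
        # Post-processing hook: here we only trim whitespace.
        return reply.strip()
```

The two hooks are where a context management tool would do its real work: splitting the input before the backend call and stitching the partial outputs afterwards.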
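The prompt construction strategy described above might look roughly like the following. The function name `build_segment_prompts` and the exact wording of the instructions are assumptions for illustration, not the prompts used by the two plugins.

```python
from typing import List


def build_segment_prompts(
    user_request: str, full_text: str, segments: List[str]
) -> List[str]:
    """Build one prompt per segment, each starting with the same global context."""
    # The shared part comes first and is byte-for-byte identical in every
    # prompt: the full request, the full document, and the working rules.
    shared_prefix = (
        "You will help with the following request.\n\n"
        f"REQUEST:\n{user_request}\n\n"
        f"FULL DOCUMENT (for background only):\n{full_text}\n\n"
        "We will work through the document with a divide-and-conquer strategy. "
        "Handle ONLY the segment given below; do not skip or summarize "
        "any part of it.\n\n"
    )
    total = len(segments)
    # Only the tail of each prompt differs from one request to the next.
    return [
        f"{shared_prefix}SEGMENT {index} of {total}:\n{segment}"
        for index, segment in enumerate(segments, start=1)
    ]
```

Every prompt carries the complete request and document, so the model keeps the global picture, while the final lines narrow its job to a single segment.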
In inference engines such as vLLM, we can turn on an option such as Prefix KV Cache to enable this functionality. With it enabled, we often observe that input prompts are processed 5 to 8 times faster. Among commercial LLMs, for prompts that share the same prefix within five to ten minutes, Claude offers a 90% discount on the cached input tokens [doc], while GPT offers a 50% discount [doc]. With this cache optimization in place, there are no significant concerns about user experience or inference cost.
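A quick back-of-the-envelope calculation shows how much the discount matters. The sketch below is a simplified model under stated assumptions: the first request pays full price for the whole prompt, every later request pays the discounted rate for the shared prefix and full price for its own segment, and any cache-write surcharge, minimum cacheable length, or output tokens are ignored. The function name and the example numbers are made up for illustration.

```python
def billed_input_tokens(
    num_segments: int,
    prefix_tokens: int,
    segment_tokens: int,
    cache_discount_percent: int = 0,
) -> int:
    """Input tokens billed at full-price equivalents across all requests."""
    if num_segments <= 0:
        return 0
    # The first request has nothing cached, so it pays for everything.
    first_request = prefix_tokens + segment_tokens
    # Later requests pay a reduced rate for the shared prefix only.
    cached_prefix = prefix_tokens * (100 - cache_discount_percent) // 100
    later_requests = (num_segments - 1) * (cached_prefix + segment_tokens)
    return first_request + later_requests


# Hypothetical workload: 10 segments, a 20,000-token shared prefix,
# and 2,000 tokens per segment.
for discount in (0, 50, 90):
    print(discount, billed_input_tokens(10, 20_000, 2_000, discount))
```

With these made-up numbers, the billed input drops from 220,000 token-equivalents without caching to 130,000 at a 50% discount and 58,000 at a 90% discount.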
The only point to note is that when constructing the prompts, we need to ensure that their prefixes are identical. We need to place the longest background portion at the very beginning to ensure that the cache can be effectively applied.
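One simple way to sanity-check the prompt layout is to measure how much of each prompt is shared as a common prefix. The helper below is a hypothetical check written for this lesson; it compares characters rather than tokens, which is only a proxy for what a real cache matches on.

```python
import os
from typing import List


def shared_prefix_ratio(prompts: List[str]) -> float:
    """Fraction of the longest prompt that is covered by the common prefix."""
    if not prompts:
        return 0.0
    longest = max(len(p) for p in prompts)
    if longest == 0:
        return 0.0
    # os.path.commonprefix compares strings character by character.
    prefix = os.path.commonprefix(prompts)
    return len(prefix) / longest


background = "Long shared background. " * 50

# Background first: almost the entire prompt can be served from the cache.
good = [background + "Segment one.", background + "Segment two."]
# Segment first: the prompts diverge almost immediately.
bad = ["Segment one. " + background, "Segment two. " + background]

print(round(shared_prefix_ratio(good), 3), round(shared_prefix_ratio(bad), 3))
```

A ratio close to 1 suggests the layout is cache-friendly; a ratio close to 0 means something variable, such as the segment itself or a timestamp, has been placed ahead of the shared background.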