Make sense of common pitfalls - Files and webpages

My understanding

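Files and webpages must both be converted to text before they reach GPT, and that conversion is full of potential distortion: pandas truncating the columns of a wide table, or anti-bot mechanisms blocking a page fetch, can each lead GPT to make wrong assumptions about the content and then hallucinate. The fix is the same in both cases and simple: paste the first few lines of the file, or the webpage content, directly into the prompt. ChatGPT (which uses RAG for very large files) and Claude (which loads the full text into context) represent two different trade-offs: the former suits large files and code execution, while the latter is more reliable for complex analysis of small files. The lesson puts forward two important mindsets. The first is “treat the AI like an intern”: always ask whether you have given it information that is sufficient and convenient for it to use. The second is a “dynamic view”: today’s limits may disappear as models evolve, so regularly probing the capability boundary with small experiments is the approach that lasts.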


Original text

Lesson 12 of 68: Make sense of common pitfalls - Files and webpages

Files and webpages: how they enter the context window

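In its most recent updates, ChatGPT introduced several new features. For instance, users can now upload files for ChatGPT to use during coding sessions, and they can request summaries of web pages. These features are very useful for quick visualization and data processing, because they remove the tedious steps of copying, pasting, and running code locally.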

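However, these features can also cause trouble through unexpected behavior. For example, even when a CSV file with specific columns has been uploaded, ChatGPT may fail to recognize its contents and generate Python code based on wrong assumptions about the file’s structure. Similarly, when asked to summarize a web page, ChatGPT may confidently return a summary that has nothing to do with the page’s actual content, a phenomenon known as “hallucination”.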

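Understanding this behavior again comes down to the model’s context window. ChatGPT works by building a prompt for the underlying GPT model, which then produces a text response. Setting aside the multimodal capabilities of GPT-4o for a moment, GPT fundamentally processes and generates text. So whatever the input, file or web page, ChatGPT must convert it to text. Simply pasting the whole file into the prompt is often not feasible, though: ChatGPT accepts uploads up to a generous 512MB, and a file anywhere near that size would far exceed the context window. ChatGPT therefore has to use other methods to extract enough information from the file or web page to build an effective prompt.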

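In ChatGPT, information about a file typically comes from the user’s prompt. If the user gives no specifics, GPT tries to work them out on its own. That process is error-prone. For instance, when GPT lacks information about an uploaded file, it will write Python code to display basic details, such as the first few columns and rows of a CSV file.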

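This approach has limits. When a file has many columns, the pandas library that ChatGPT uses may skip some columns to fit the display, as shown by the ellipsis (…) after the “Postal Code” column in the screenshot. The truncation matters because GPT’s understanding of the file is limited to what the Python code prints. If many columns are omitted, GPT may make inaccurate assumptions about the file’s content. That leads to seemingly “dumb” questions like the one mentioned earlier, even though the “Company Name” column is plainly there in the file.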
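The truncation described above can be reproduced locally. The following is a minimal sketch, assuming pandas is installed; the 30-column table and its `col_N` names are made up for illustration and are not the file from the lesson. The display options are set explicitly so the result does not depend on the terminal size.

```python
import pandas as pd

# Hypothetical wide table: 30 columns, 3 rows of zeros.
columns = [f"col_{i}" for i in range(30)]
df = pd.DataFrame([[0] * 30 for _ in range(3)], columns=columns)

# With a small column limit, pandas prints only the first and last few
# columns and replaces the middle ones with "...".
with pd.option_context("display.max_columns", 6, "display.width", 80):
    truncated = repr(df.head())

# With the column limit lifted, every column is printed.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full = repr(df.head())

print(truncated)
print("col_15 visible in truncated output:", "col_15" in truncated)
print("col_15 visible in full output:", "col_15" in full)

# Listing the column names never depends on display settings.
print(df.columns.tolist())
```

A model that only sees the truncated output has no way to know that `col_15` exists, which is the same situation as the missing “Company Name” column.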

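The solution is straightforward. To make sure GPT understands the file’s structure and content, copy the first few lines of the file and paste them into the prompt. With that information in the prompt and the rest of the question unchanged, GPT can generate the correct response.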
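For anyone who assembles prompts in a script rather than by hand, the same idea might look like the sketch below. The function name `build_prompt_with_preview`, the prompt wording, and the sample rows are all assumptions made for illustration.

```python
import io
from itertools import islice


def build_prompt_with_preview(lines, question, n_lines=5):
    """Return a prompt that shows the first lines of a file before the question.

    `lines` is any iterable of text lines, such as an open file object.
    """
    head = list(islice(lines, n_lines))
    preview = "".join(head).rstrip("\n")
    return (
        f"Here are the first {len(head)} lines of the uploaded CSV file:\n"
        f"{preview}\n\n"
        f"{question}"
    )


# Hypothetical sample data standing in for a real upload.
sample = io.StringIO(
    "Company Name,City,Postal Code,Revenue\n"
    "Acme,Berlin,10115,1200\n"
    "Globex,Paris,75001,950\n"
    "Initech,Austin,73301,400\n"
)
question = "Which company has the highest revenue?"
prompt = build_prompt_with_preview(sample, question, n_lines=3)
print(prompt)
```

Because the header row is copied verbatim, every column name reaches the model regardless of how a display library would have abbreviated it.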

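Similar problems arise when ChatGPT tries to fetch content from a website. Anti-bot mechanisms, login requirements, and dynamically loaded content can all defeat the model’s basic scraping and parsing capabilities. When the prompt lacks adequate information, GPT is prone to “hallucinate”, a common problem in which a language model generates content from made-up details.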

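The solution is again simple. By visiting the website ourselves and pasting its content into ChatGPT, we give GPT complete information about the page. This bypasses the error-prone fetching and parsing step and lets GPT produce accurate responses.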
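When the page has been saved from the browser as HTML, copying the visible text by hand can be tedious. The sketch below is one possible way to pull the text out with Python’s standard `html.parser` module before pasting it into a prompt. The class name `VisibleTextExtractor`, the helper `html_to_text`, and the sample page are hypothetical, and real pages may need more careful handling.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.parts.append(text)


def html_to_text(html):
    """Return the text content of an HTML string, one text node per line."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical saved page.
saved_page = """<html><head><title>Pricing</title><style>p {color: red}</style></head>
<body><h1>Plans</h1><p>Basic plan: $5 per month.</p>
<script>console.log("tracking")</script></body></html>"""

page_text = html_to_text(saved_page)
prompt = f"Summarize the following page content:\n\n{page_text}"
print(prompt)
```

Since the text comes from a page we loaded ourselves, login walls and anti-bot checks have already been passed by the time the content reaches the prompt.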

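We now understand how ChatGPT processes files and webpages. There are other players in the AI landscape, such as Anthropic, whose Claude model takes a different technical approach to files. Because Claude does not yet support code execution or web browsing, it puts the content of the entire file directly into the context window. You can verify this by uploading a small plain-text file that slightly exceeds Claude’s context window limit: you will get an error message saying the file is too large.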
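To run the experiment above, you need a plain-text file of roughly the right size. The following sketch generates filler text under two loudly stated assumptions: a rule of thumb of about 4 characters per token, which is only a rough approximation for English text, and a hypothetical context limit (`ASSUMED_CONTEXT_TOKENS`) that you should replace with the figure from the product’s current documentation.

```python
def estimate_tokens(text, chars_per_token=4):
    """Very rough token estimate; real tokenizers give different counts."""
    return len(text) // chars_per_token


def make_filler_text(target_tokens, chars_per_token=4):
    """Build plain text whose estimated token count is at least target_tokens."""
    sentence = "The quick brown fox jumps over the lazy dog. "
    repeats = (target_tokens * chars_per_token) // len(sentence) + 1
    return sentence * repeats


# Hypothetical limit, used only to size the test file.
ASSUMED_CONTEXT_TOKENS = 200_000

# Aim about 10% above the assumed limit.
target = int(ASSUMED_CONTEXT_TOKENS * 1.1)
filler = make_filler_text(target)

print("estimated tokens:", estimate_tokens(filler))
print("exceeds assumed limit:", estimate_tokens(filler) > ASSUMED_CONTEXT_TOKENS)
print(f"size: {len(filler) / 1_000_000:.2f} million characters")
```

Writing `filler` to a `.txt` file gives an upload that is small in megabytes yet large in tokens, which is what makes the test informative.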

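By comparison, Claude typically shows a higher level of intelligence in file handling than ChatGPT. The reason is that ChatGPT uses a method called Retrieval Augmented Generation (RAG) for files that exceed the context window. RAG first extracts the relevant sections of the uploaded file to build a prompt, then uses that prompt to generate the final output. Claude, in contrast, builds the prompt from the entire file. This often yields better results, but it is also more costly, slower, and limited to files that fit within the context window. It also places greater constraints on the product and increases the load on the company’s inference infrastructure.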
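The difference between the two prompt-building strategies can be illustrated with a toy example. This is a sketch only: it ranks chunks by simple word overlap, which is a stand-in chosen for brevity and not how ChatGPT actually retrieves content. All function names and the sample document are hypothetical.

```python
import re


def split_into_chunks(text, lines_per_chunk=2):
    """Split text into chunks of a few non-empty lines each."""
    lines = [line for line in text.splitlines() if line.strip()]
    return [
        "\n".join(lines[i:i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]


def words(text):
    """Lowercased set of alphanumeric words in text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(chunks, question, top_k=1):
    """Rank chunks by how many question words they share; keep the top_k."""
    question_words = words(question)
    ranked = sorted(chunks, key=lambda c: len(question_words & words(c)), reverse=True)
    return ranked[:top_k]


def rag_prompt(document, question, top_k=1):
    """Retrieval-style prompt: only the best-matching chunks are included."""
    excerpts = "\n---\n".join(retrieve(split_into_chunks(document), question, top_k))
    return f"Relevant excerpts:\n{excerpts}\n\nQuestion: {question}"


def full_context_prompt(document, question):
    """Full-context prompt: the entire document is included."""
    return f"Full document:\n{document}\n\nQuestion: {question}"


document = """Refunds are processed within 14 days.
Contact support to start a refund.
Shipping takes 3 to 5 business days.
Express shipping is available at checkout.
Gift cards never expire.
Gift cards cannot be exchanged for cash."""

question = "How long does shipping take?"
print(rag_prompt(document, question))
print()
print(full_context_prompt(document, question))
```

The retrieval prompt is shorter, but the model can only answer from the chunks that were selected. If the ranking picks the wrong chunk, the relevant text never reaches the model, whereas the full-context prompt costs more tokens and cannot omit anything.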

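Understanding how different AI products handle files helps us choose the most appropriate tool for a given task. For complex tasks involving small files, Claude is the better choice; for tasks that need code execution or larger files, ChatGPT is more suitable.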

Treating ChatGPT like an intern

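In summary, the principle for making ChatGPT useful can be put as: treat it like an intern. Always ask yourself whether, given how it works internally, the information you have given ChatGPT is sufficient and convenient for solving your problem. After all, even with human interns we still need to do some hand-holding and summarize the requests for them, rather than expecting them to track every requirement on their own.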

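Here I also want to emphasize the last mindset this lesson shares: keep a dynamic view of the generative AI landscape. What we introduced above may hold today, but tomorrow, with more advanced models and longer context windows, it may become incorrect or unnecessary. So it is always worth exploring the boundaries of what generative AI can do and staying curious to learn and build. When in doubt, design a quick experiment to verify your assumption. These are the keys to really understanding and getting around the common pitfalls of using generative AI.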

English Original

Files and Webpages: How They Enter the Context Window

In the most recent updates, ChatGPT introduces several new features. For instance, users can now upload files for ChatGPT to use during coding sessions, and they can request summaries of web pages. This functionality is incredibly useful for quick visualization and data processing, eliminating the cumbersome steps of copying, pasting, and running code locally.

However, these new features can also present challenges due to unexpected behavior. For example, even if a CSV file is uploaded with specific columns, ChatGPT may not recognize its contents correctly, leading to Python code based on incorrect assumptions about the file’s structure. Similarly, when asked to summarize a web page, ChatGPT might confidently provide a summary that turns out to be completely unrelated to the actual content of the page—a phenomenon known as “hallucination.”

Understanding this behavior again boils down to the model’s context window. ChatGPT operates by generating prompts for the underlying GPT model, which then produces text-based responses. Setting aside the multi-modal capabilities of GPT-4o for a moment, GPT fundamentally processes and generates text. Thus, regardless of the input, be it files or web pages, ChatGPT must convert everything into text. However, simply pasting file content directly into the prompt often isn’t feasible: ChatGPT accepts uploads up to a generous 512MB, and a file anywhere near that size would far exceed the context window. Consequently, ChatGPT must employ other methods to extract and utilize enough information from the files or web pages to construct an effective prompt.

In ChatGPT, the information about the file typically comes from the user’s prompt. If specific details aren’t provided, GPT attempts to deduce them independently. However, this process can be tricky, with numerous obstacles and potential errors. For instance, if GPT lacks information about an uploaded file, it is smart enough to write Python code to display basic details like the first few columns and rows of a CSV file.

However, there are limitations. For example, when dealing with files that have numerous columns, the pandas library ChatGPT uses might skip some columns to fit the display, as indicated by the ellipsis (…) after the “Postal Code” column in the screenshot. This truncation creates challenges for GPT because its understanding of the file is restricted to what the Python code outputs. Consequently, if many columns are omitted, GPT might make inaccurate assumptions about the file’s content. This leads to seemingly “dumb” questions, like the one mentioned, even though the “Company Name” column is clearly present in the file for us to see.

The solution is straightforward. To ensure GPT understands the file’s structure and content, simply copy and paste the first few lines into the prompt. By including this information directly in the prompt and keeping the rest of the question unchanged, GPT will be equipped to generate the correct response this time.

Similarly, issues can arise when ChatGPT attempts to retrieve content from a website. Challenges such as anti-bot mechanisms, login requirements, and dynamically loaded content can thwart the model’s basic scraping and parsing capabilities. When faced with inadequate information in the prompt, GPT is prone to “hallucinate,” a common problem where language models generate content based on imaginary details.

The solution, once again, is simple. By visiting the website ourselves and pasting the content into ChatGPT, we can provide GPT with comprehensive information about the webpage. This approach bypasses the error-prone process of fetching and parsing a webpage, allowing GPT to produce accurate responses.

We now understand how ChatGPT processes files and webpages. It’s important to recognize that there are other players in the AI landscape, such as Anthropic. Anthropic’s Claude model uses a different technical approach to manage files. Since Claude does not yet support code execution or web browsing, it directly integrates the content of the entire file into the context window. This can be verified by uploading a small plain text file that slightly exceeds Claude’s context window limit, which will result in an error message indicating that the file is too large.

In comparison, Claude typically demonstrates a higher level of intelligence in file handling than ChatGPT. This is because ChatGPT employs a method known as Retrieval Augmented Generation (RAG) for files that exceed the context window size. RAG works by first extracting relevant sections of the uploaded file to construct a prompt, which is then used to generate the final output. On the other hand, Claude builds the prompt directly from the entire file’s content. While this approach often results in a superior level of intelligence, it is also more costly, slower, and limited to handling files that fit within the context window. This places greater constraints on the product and increases the strain on the company’s inference infrastructure.

Understanding these differences in how files are handled by different AI products guides us in selecting the most appropriate tool for specific tasks. For instance, Claude is the preferable choice for complex tasks involving small files. However, for needs that include code execution or processing of larger files, ChatGPT is the more suitable option.

Treating ChatGPT like an Intern

In summary, the principle for making ChatGPT work can be summarized as treating it like an intern. Always ask yourself: given how it works internally, is the information I gave ChatGPT sufficient and convenient for solving my problem? At the end of the day, even human interns need some hand-holding: we summarize the requests for them instead of expecting them to track every request by themselves.

Here I also want to emphasize the last mindset we want to share in this lesson: keep a dynamic view of the GenAI landscape. What we introduced above may be true today, but tomorrow, with more advanced models and longer context windows, it might become incorrect or unnecessary. Therefore, it’s always worth pushing the envelope of GenAI’s capabilities and staying curious to learn and build. When in doubt, design a quick experiment to verify your assumption. These are the keys to really understanding and getting around the common pitfalls of using GenAI.