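Learning 3: Assessment Mechanisms
My understanding
A good manager does not take part in every execution detail; they define and track success criteria. The same principle holds for AI projects, where it covers about 90% of the work of steering direction. Defining success criteria up front has two benefits. First, it forces the team to think hard about the task's boundaries and how to decompose it before starting, which often surfaces pitfalls or better approaches early. Second, it naturally produces a document that can double as test code, giving automated protection for the quality of AI output. Without tests, an AI-driven component can suffer "silent degradation": you have no way of knowing whether it is getting better or worse. With tests, debugging becomes cheap and building with several collaborators becomes feasible. An assessment mechanism is itself a form of delegation, and it is also the infrastructure that makes further delegation at scale possible.
Related links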
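- Ch04-L07, Step 3: Exploring AI feasibility. The assessment mechanism is first set up in Step 3; that lesson establishes it and this one deepens it.
- Ch04-L11, Technical insight: Think like a manager. An assessment mechanism is how a manager controls direction through criteria rather than execution details.
- Ch05-L06, Takeaway 1: Living with hallucinations. Tests are also a key means of detecting hallucinations and limiting their impact, complementing the "fail early" principle of risk management.
- Ch03-L07, Summary and reflection. The systematic review after the hands-on project complements, at the level of mindset, the after-the-fact verification that an assessment mechanism provides.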
Original text
Lesson 33 of 68: Learning 3: Assessment mechanisms
To control the direction of a project, a manager doesn’t need to participate in every single decision. Instead, defining and tracking success criteria can do 90% of the job. This is actually what we did in the previous lessons. We don’t really care whether our program uses JavaScript or Python, or whether it uses GPT-4-Turbo + JSON mode or GPT-3.5 + prompting. As long as it passes our tests, we are good. Setting up a proper assessment mechanism is a good way to delegate execution while still controlling the project’s direction. There are also a few best practices for this.
First, it requires our expertise, experience, and deep understanding of the problem. In this example, the criterion “related to GenAI” seems well-defined, mostly because it’s an imaginary example. If we put ourselves in the reader’s shoes and think about what kind of GenAI-related email we want to prioritize reading, it becomes clear that we need to be more specific. For example, you might be more interested in video generation models, while I might focus on open-source LLMs. So, while it’s okay to describe what we want to an AI and seek inspiration from it, it’s usually a bad idea to delegate this entirely to the AI. This is what we did wrong in the previous lesson.
Second, defining success criteria before a project starts has two benefits. On one hand, it forces us and the team to think deeply about the task, both in terms of what to do and how to do it. It’s not uncommon to get further inspired during this process. For example, we might realize that the ongoing project isn’t the best solution to reach the success criteria, or it might be better to decompose the task into several independent sub-problems first. This process often helps identify potential pitfalls before implementation begins, which is always preferable to discovering them later.
On the other hand, this approach naturally produces a document outlining the success criteria. As we discussed earlier, such a document is very valuable for better communication with the team and making the task more reusable. For programming tasks, this document can take the form of a program (e.g., test cases). Beyond serving as a means of communication, we can use this program to easily verify whether the AI generates correct solutions.
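As a minimal sketch of success criteria written down as a program, the snippet below encodes a few labeled examples and checks a classifier against them. The example subjects, the names `CASES`, `is_genai_related`, and `run_cases`, and the keyword-based classifier are all assumptions made for illustration; the lesson's actual component is an LLM prompt, which a keyword match merely stands in for here.

```python
# Hypothetical labeled examples: (email subject, should it count as GenAI-related?)
CASES = [
    ("New open-source LLM released under Apache 2.0", True),
    ("Video generation model demo this Friday", True),
    ("Quarterly budget review", False),
    ("Team lunch on Thursday", False),
]


def is_genai_related(subject: str) -> bool:
    """Stand-in classifier (assumption): a keyword match instead of an LLM call."""
    keywords = ("llm", "generation model", "genai", "diffusion")
    text = subject.lower()
    return any(keyword in text for keyword in keywords)


def run_cases(classify, cases):
    """Return (subject, expected, actual) for every case the classifier gets wrong."""
    failures = []
    for subject, expected in cases:
        actual = classify(subject)
        if actual != expected:
            failures.append((subject, expected, actual))
    return failures


if __name__ == "__main__":
    failures = run_cases(is_genai_related, CASES)
    print(f"{len(CASES) - len(failures)}/{len(CASES)} cases passed")
    for subject, expected, actual in failures:
        print(f"FAILED: {subject!r} expected {expected}, got {actual}")
```

Because `run_cases` only needs a callable, the same cases can be reused unchanged when the stand-in is swapped for a real model call, which is what makes the list of cases work as a shared description of "done".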
This verification is more important than it appears. Without the tests we introduced in the previous lesson, we would have no visibility into whether the prompt works well. We would simply put it into use, hoping it works somehow and planning to improve it over time. However, if it doesn’t work well, we might not realize it for a long time, until we are frustrated enough to conduct more thorough testing. This lack of visibility is also bad for maintenance. If we make changes to the component later, we might think we’re improving it by fixing a few bad examples, but the change could actually break other examples and reduce the overall effectiveness.
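The maintenance risk described here can be made concrete with a small comparison between two versions of a component. This is only a sketch under assumptions: `classify_v1`, `classify_v2`, `compare_versions`, and the example subjects are hypothetical, and simple keyword rules stand in for two revisions of a prompt.

```python
# Hypothetical labeled examples: (email subject, expected label)
CASES = [
    ("Open-source LLM weekly digest", True),
    ("Video generation model released", True),
    ("New pricing model for cloud storage", False),
    ("Office closed on Monday", False),
]


def classify_v1(subject: str) -> bool:
    """Original version (assumption): only recognizes the keyword 'llm'."""
    return "llm" in subject.lower()


def classify_v2(subject: str) -> bool:
    """Revised version (assumption): also matches 'model' to catch the video example."""
    text = subject.lower()
    return "llm" in text or "model" in text


def compare_versions(old, new, cases):
    """Split the cases into those the new version fixed and those it broke."""
    fixed, broken = [], []
    for text, expected in cases:
        old_ok = old(text) == expected
        new_ok = new(text) == expected
        if new_ok and not old_ok:
            fixed.append(text)
        elif old_ok and not new_ok:
            broken.append(text)
    return fixed, broken


if __name__ == "__main__":
    fixed, broken = compare_versions(classify_v1, classify_v2, CASES)
    print("fixed: ", fixed)
    print("broken:", broken)
```

In this toy setup the revision fixes the video generation example but starts flagging the pricing email, so the number of correct cases does not improve at all. Without the full set of cases, only the fixed example would have been visible.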
So the tests act as a guard against quality degradation or bugs. It’s a form of delegation and automation. Instead of manually checking, “Is the program written by AI correct?”, we now have an automatic way to check it. This allows us to identify issues quickly, making debugging and fixing extremely easy, since we know the problem stems from the most recent change. It also makes the entire building process more scalable. Without these tests, we would only trust ourselves to make changes. If someone else wanted to join us in building the tool, it would likely require a heavy onboarding process so they understand what they are doing. However, if we have an assessment mechanism (good tests) set up, we don’t need to worry about this. If someone makes a mistake, the system will catch it. This makes scaling up the building process much easier.
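One way to picture the guard is as a check that every proposed change must pass before it is accepted, whoever wrote it. The sketch below is illustrative only: `accuracy`, `check_change`, `current_version`, `contributed_version`, and the cases are hypothetical names and data, and the rule "accuracy must not fall below the current baseline" is one possible policy, not one the lesson prescribes.

```python
# Hypothetical labeled examples: (email subject, expected label)
CASES = [
    ("Open-source LLM weekly digest", True),
    ("Video generation model released", True),
    ("New pricing model for cloud storage", False),
    ("Office closed on Monday", False),
]


def accuracy(classify, cases) -> float:
    """Fraction of cases the classifier labels correctly."""
    correct = sum(1 for text, expected in cases if classify(text) == expected)
    return correct / len(cases)


def check_change(candidate, cases, baseline: float):
    """Accept a change only if its accuracy does not fall below the baseline."""
    score = accuracy(candidate, cases)
    return score >= baseline, score


def current_version(subject: str) -> bool:
    """Version currently in use (assumption)."""
    text = subject.lower()
    return "llm" in text or "generation model" in text


def contributed_version(subject: str) -> bool:
    """A collaborator's change (assumption) that matches too broadly."""
    text = subject.lower()
    return "llm" in text or "model" in text


if __name__ == "__main__":
    baseline = accuracy(current_version, CASES)
    accepted, score = check_change(contributed_version, CASES, baseline)
    print(f"baseline={baseline:.2f} candidate={score:.2f} accepted={accepted}")
```

Since the check runs against the most recent change, a rejection points directly at that change, and a new collaborator does not need deep knowledge of the tool to find out whether their edit is safe.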
As you can see, the assessment mechanism is a delegation in itself and helps further delegation. Therefore, it’s crucial to set this up early and keep it updated. That’s how you maintain control while delegating at scale.