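Step 3: Explore the intelligence feasibility

My understanding

Technical feasibility is not the same as intelligence feasibility: being able to call GPT does not mean GPT can actually do the task well, so the effectiveness of the prompt and the AI's real judgment ability have to be verified separately. The workflow in this lesson is to have GPT propose success criteria and test cases, then have GPT write test code for the existing core function, and finally run the tests to see whether the AI passes, removing uncertainty before building. The lesson also carries an important warning (flagged in the text as a "pop quiz"): handing the definition of success criteria entirely to AI is a mistake. Defining success requires domain expertise and personal judgment; AI can offer inspiration but cannot replace human decisions. Letting the system expose problems early, before implementation, costs far less than discovering defects in production.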

Original text

Lesson 32 of 68 Step 3: Explore the intelligence feasibility

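We can now call GPT programmatically. But there is one key factor we have not verified yet: can GPT really do this job well? Is our prompt optimal? Are there any hidden challenges? We can discuss this directly with ChatGPT:

Example prompt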

=======================================================

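We plan to write a JavaScript function that calls the GPT API to decide whether an email (with subject and content) is relevant to GenAI. Before we start, which metrics do you suggest as the success criteria? Can you give some test cases that the AI needs to satisfy?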

=======================================================

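Example output

=======================================================

Defining clear success criteria and test cases for the task of deciding whether an email is relevant to GenAI is crucial for evaluating the effectiveness of the function. Here is what you might consider:

Success criteria:

Accuracy: The function should accurately identify emails that are genuinely related to GenAI, which means striking a balance between sensitivity (true positive rate) and specificity (true negative rate).

Speed: The function should return results promptly, ideally within a few seconds, to avoid excessive delays.

Relevance: The function should correctly identify emails that touch on key GenAI topics such as artificial intelligence, machine learning, deep learning, neural networks, and specific generative models like GPT-3, GPT-4, DALL-E, etc.

False positives/negatives: The function should minimize false positives (marked as relevant but actually irrelevant) and false negatives (marked as irrelevant but actually relevant).

Usability: The function should be easy to integrate and should clearly output whether the email is relevant to GenAI.

Test cases:

Positive case (clear mention of GenAI):

Input: Subject: "Upcoming Generative AI Conference", Content: "Join us for a conference on generative AI…"

Expected output: True (the email is clearly related to GenAI)

Positive case (implicit mention of GenAI):

Input: Subject: "New ML Model Release", Content: "We have released a new machine learning model that generates images…"

Expected output: True (the email refers to GenAI indirectly, through a machine learning model that generates content)

Negative case (unrelated topic):

Input: Subject: "Office Picnic Next Week", Content: "Let's gather for a fun day at the park…"

Expected output: False (the email is clearly unrelated to GenAI)

Ambiguous case (vague subject, relevant content):

Input: Subject: "Exciting News", Content: "We have made significant advancements in deep learning…"

Expected output: True (despite the vague subject, the content is relevant to GenAI)

Ambiguous case (relevant subject, unrelated content):

Input: Subject: "AI Innovations", Content: "Join us for a team-building activity this weekend…"

Expected output: False (despite the relevant subject, the content is unrelated to GenAI)

Edge case (empty email):

Input: Subject: "", Content: ""

Expected output: False (the email contains no useful information)

These test cases can help you evaluate whether the function correctly identifies emails relevant to GenAI. As you refine the function and observe how it performs in real-world scenarios, you may need to adjust these criteria accordingly.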

=======================================================

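The result of this discussion is very useful. We can first set up a group of tests and then optimize the prompt based on the test results. Let's ask GPT to write test code for us based on these cases so that we can quickly check how the function performs. Because everything in the context window is useful, we deliberately continue the conversation instead of editing the original prompt.

Pop quiz: we actually did something wrong here. What is it? We will discuss it in the next lesson.

Example prompt

=======================================================

We already have the core function below. Can you write a Google Apps Script function that tests it using the test cases mentioned above?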

/**
 * Determines if a given email is related to Generative AI by using OpenAI's GPT-3.5-turbo model.
 *
 * @param {Object} email - The email object containing body as a string representing the email content.
 * @returns {boolean} - Returns true if the email is identified as related to Generative AI, otherwise false.
 *
 * - The email object should have a body field that contains the text content of the email.
 * - The function sends up to the first 10,000 characters of the email body to the OpenAI API for classification.
 * - The classification is done by prompting the API to return "yes" or "no" based on relevance to Generative AI.
 * - Ensure the YOUR_OPENAI_API_KEY placeholder is replaced with an actual API key before using this function.
 */
function isGenAIEmail(email) {
    // Function implementation
}

=======================================================

Example output (code only)

=======================================================

/**
 * Test suite for the isGenAIEmail function.
 */
function testIsGenAIEmail() {
    // Each case pairs an email with the classification we expect for it.
    const testCases = [
        {
            description: "Positive Case (Clear Mention of GenAI)",
            input: {
                subject: "Upcoming Generative AI Conference",
                body: "Join us for a conference on generative AI..."
            },
            expected: true
        },
        {
            description: "Positive Case (Implicit Mention of GenAI)",
            input: {
                subject: "New ML Model Release",
                body: "We have released a new machine learning model that generates images..."
            },
            expected: true
        },
        {
            description: "Negative Case (Unrelated Topic)",
            input: {
                subject: "Office Picnic Next Week",
                body: "Let's gather for a fun day at the park..."
            },
            expected: false
        },
        {
            description: "Ambiguous Case (Vague Subject, Relevant Content)",
            input: {
                subject: "Exciting News",
                body: "We have made significant advancements in deep learning..."
            },
            expected: true
        },
        {
            description: "Ambiguous Case (Relevant Subject, Unrelated Content)",
            input: {
                subject: "AI Innovations",
                body: "Join us for a team-building activity this weekend..."
            },
            expected: false
        },
        {
            description: "Edge Case (Empty Email)",
            input: {
                subject: "",
                body: ""
            },
            expected: false
        }
    ];

    // Run every case through the classifier and log pass/fail.
    testCases.forEach((test, index) => {
        const result = isGenAIEmail(test.input);
        const passed = result === test.expected;
        Logger.log(`Test Case ${index + 1} - ${test.description}: ${passed ? "Passed" : "Failed"}`);
    });
}

=======================================================

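If we actually run this code, we find that the current version of the function passes all the tests. This removes the uncertainty about whether GPT can effectively find the target emails. However, there is a deeper implication behind this, which we will analyze in the next lesson.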

English Original

We can call GPT programmatically now. But one key factor we haven’t checked yet is whether GPT can really do the job well. Is our prompt optimal? Are there any hidden challenges? We could simply discuss this with ChatGPT:

Example Prompt

=======================================================

We would like to write a JavaScript function that calls the GPT API to tell whether a given email (subject, content) is relevant to GenAI. But before we do that, what do you suggest as the success criteria? Can you come up with some test cases that the AI needs to satisfy?

=======================================================

Example Output

=======================================================

Creating a clear set of success criteria and test cases for determining whether an email is relevant to GenAI is crucial for evaluating the effectiveness of the function. Here’s what you might consider:

Success Criteria:

Accuracy: The function should accurately identify emails that are genuinely related to GenAI. This means achieving a balance between sensitivity (true positive rate) and specificity (true negative rate).

Speed: The function should respond promptly, ideally within a few seconds, to avoid excessive delays.

Relevance: The function should correctly identify emails containing key GenAI topics such as artificial intelligence, machine learning, deep learning, neural networks, and specific generative models like GPT-3, GPT-4, DALL-E, etc.

False Positives/Negatives: The function should minimize false positives (emails marked as relevant but aren’t) and false negatives (emails marked as irrelevant but are relevant).

Usability: The function should be easy to integrate and should provide clear output indicating whether an email is about GenAI or not.

Test Cases:

Positive Case (Clear Mention of GenAI):

Input: Subject: “Upcoming Generative AI Conference”, Content: “Join us for a conference on generative AI…”

Expected Output: True (The email is clearly related to GenAI)

Positive Case (Implicit Mention of GenAI):

Input: Subject: “New ML Model Release”, Content: “We have released a new machine learning model that generates images…”

Expected Output: True (The email indirectly refers to GenAI through machine learning models that generate content)

Negative Case (Unrelated Topic):

Input: Subject: “Office Picnic Next Week”, Content: “Let’s gather for a fun day at the park…”

Expected Output: False (The email is clearly unrelated to GenAI)

Ambiguous Case (Vague Subject, Relevant Content):

Input: Subject: “Exciting News”, Content: “We have made significant advancements in deep learning…”

Expected Output: True (Despite the vague subject, the content is relevant to GenAI)

Ambiguous Case (Relevant Subject, Unrelated Content):

Input: Subject: “AI Innovations”, Content: “Join us for a team-building activity this weekend…”

Expected Output: False (Despite the relevant subject, the content is unrelated to GenAI)

Edge Case (Empty Email):

Input: Subject: "", Content: ""

Expected Output: False (The email lacks any relevant information)

These test cases should help you evaluate whether the function can correctly identify emails relevant to GenAI. You might need to adjust the criteria as you refine the function and observe its performance in real-world scenarios.

=======================================================
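
The accuracy and false positive/negative criteria in the output above only become checkable once each test case yields an expected label and an actual label. The following is a minimal sketch of how those labels could be turned into numbers. It is not part of the lesson: `summarizeResults` and the shape of its return value are hypothetical names chosen for illustration.

```javascript
// Hypothetical helper (not from the lesson): turns labeled predictions into a
// confusion matrix plus the ratios named in the success criteria.
// Each entry is { expected: boolean, actual: boolean }.
function summarizeResults(results) {
  const counts = { tp: 0, tn: 0, fp: 0, fn: 0 };
  results.forEach(({ expected, actual }) => {
    if (expected && actual) counts.tp += 1;          // correctly flagged as GenAI
    else if (!expected && !actual) counts.tn += 1;   // correctly ignored
    else if (!expected && actual) counts.fp += 1;    // false positive
    else counts.fn += 1;                             // false negative
  });
  // Return null instead of NaN when a ratio has an empty denominator.
  const ratio = (num, den) => (den === 0 ? null : num / den);
  return {
    ...counts,
    accuracy: ratio(counts.tp + counts.tn, results.length),
    sensitivity: ratio(counts.tp, counts.tp + counts.fn), // true positive rate
    specificity: ratio(counts.tn, counts.tn + counts.fp), // true negative rate
  };
}

// Example: three relevant emails (one missed) and three irrelevant ones.
const summary = summarizeResults([
  { expected: true, actual: true },
  { expected: true, actual: true },
  { expected: true, actual: false },
  { expected: false, actual: false },
  { expected: false, actual: false },
  { expected: false, actual: false },
]);
// summary.tp === 2, summary.fn === 1, summary.tn === 3, summary.fp === 0
```

With a handful of test cases these ratios are coarse, so they are probably better read as a sanity check than as a benchmark.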

This discussion result is very useful. We could first set up a series of tests and then optimize our prompt based on the test results. Let’s ask GPT to write tests for us based on these cases, so we can quickly check how well our function performs. Since all the content in the context window is indeed useful, we intentionally chat instead of editing the prompt.

Pop Quiz: We actually did something wrong here. What is it? We will talk about it in the next lesson.

Example Prompt

=======================================================

We have our core function as below. Can you write a Google Apps Script function to test the function using the test cases mentioned above?

/**
 * Determines if a given email is related to Generative AI by using OpenAI's GPT-3.5-turbo model.
 *
 * @param {Object} email - The email object containing body as a string representing the email content.
 * @returns {boolean} - Returns true if the email is identified as related to Generative AI, otherwise false.
 *
 * - The email object should have a body field that contains the text content of the email.
 * - The function sends up to the first 10,000 characters of the email body to the OpenAI API for classification.
 * - The classification is done by prompting the API to return "yes" or "no" based on relevance to Generative AI.
 * - Ensure the YOUR_OPENAI_API_KEY placeholder is replaced with an actual API key before using this function.
 */
function isGenAIEmail(email) {
    // Function implementation
}

=======================================================
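
The docstring above describes two steps that can be tested without any network access: truncating the body to 10,000 characters and mapping a "yes"/"no" reply to a boolean. The following is a minimal sketch of those two steps only. The lesson does not show the real implementation, so `buildGenAIPrompt`, `parseYesNo`, `MAX_BODY_CHARS`, and the prompt wording are all assumptions made for illustration.

```javascript
// Limit taken from the docstring: only the first 10,000 characters are sent.
const MAX_BODY_CHARS = 10000;

// Hypothetical helper: builds the classification prompt for one email.
// The wording of the instruction is an assumption, not the lesson's prompt.
function buildGenAIPrompt(email) {
  const body = String((email && email.body) || "").slice(0, MAX_BODY_CHARS);
  return (
    'Is the following email related to Generative AI? Answer only "yes" or "no".\n\n' +
    body
  );
}

// Hypothetical helper: maps the model's free-text reply to a boolean.
// Anything that does not start with "yes" (ignoring case and leading
// whitespace) is treated as false, including empty or missing replies.
function parseYesNo(reply) {
  return String(reply || "").trim().toLowerCase().startsWith("yes");
}
```

Note that, as documented, the function reads only the `body` field, so a `subject` supplied by a caller would be ignored unless the implementation is extended to include it.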

Example output (code only)

=======================================================

/**
 * Test suite for the isGenAIEmail function.
 */
function testIsGenAIEmail() {
    const testCases = [
        {
            description: "Positive Case (Clear Mention of GenAI)",
            input: {
                subject: "Upcoming Generative AI Conference",
                body: "Join us for a conference on generative AI..."
            },
            expected: true
        },
        {
            description: "Positive Case (Implicit Mention of GenAI)",
            input: {
                subject: "New ML Model Release",
                body: "We have released a new machine learning model that generates images..."
            },
            expected: true
        },
        {
            description: "Negative Case (Unrelated Topic)",
            input: {
                subject: "Office Picnic Next Week",
                body: "Let's gather for a fun day at the park..."
            },
            expected: false
        },
        {
            description: "Ambiguous Case (Vague Subject, Relevant Content)",
            input: {
                subject: "Exciting News",
                body: "We have made significant advancements in deep learning..."
            },
            expected: true
        },
        {
            description: "Ambiguous Case (Relevant Subject, Unrelated Content)",
            input: {
                subject: "AI Innovations",
                body: "Join us for a team-building activity this weekend..."
            },
            expected: false
        },
        {
            description: "Edge Case (Empty Email)",
            input: {
                subject: "",
                body: ""
            },
            expected: false
        }
    ];

    testCases.forEach((test, index) => {
        const result = isGenAIEmail(test.input);
        const passed = result === test.expected;
        Logger.log(`Test Case ${index + 1} - ${test.description}: ${passed ? "Passed" : "Failed"}`);
    });
}

=======================================================

If we actually run the code, we can see our current version of the function passes all the tests. This resolves the uncertainty about whether GPT can effectively find the target emails. However, it has deeper implications, which we will analyze in the next lesson.
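
Because the generated test suite only logs "Passed" or "Failed", it can help to have a variant that returns the failing cases, so they can be inspected while tuning the prompt. The following is a minimal sketch under that assumption. `runTestCases` and `keywordStub` are hypothetical names, and the stub is a stand-in for a dry run of the harness, not the GPT-backed `isGenAIEmail` from the lesson.

```javascript
// Hypothetical runner: takes the classifier as a parameter so the same cases
// can be run against the real function or against a stub.
function runTestCases(classifier, testCases) {
  const failures = [];
  testCases.forEach((test) => {
    const actual = classifier(test.input);
    if (actual !== test.expected) {
      failures.push({ description: test.description, expected: test.expected, actual });
    }
  });
  return {
    total: testCases.length,
    passed: testCases.length - failures.length,
    failures,
  };
}

// Stand-in classifier for exercising the harness offline. It is NOT the
// lesson's function and says nothing about how well GPT performs.
function keywordStub(email) {
  return /generative ai|generates|deep learning/i.test(email.body || "");
}

// Three of the lesson's cases, enough to exercise the runner.
const sampleCases = [
  {
    description: "Positive Case (Clear Mention of GenAI)",
    input: { subject: "Upcoming Generative AI Conference", body: "Join us for a conference on generative AI..." },
    expected: true,
  },
  {
    description: "Negative Case (Unrelated Topic)",
    input: { subject: "Office Picnic Next Week", body: "Let's gather for a fun day at the park..." },
    expected: false,
  },
  {
    description: "Edge Case (Empty Email)",
    input: { subject: "", body: "" },
    expected: false,
  },
];

const report = runTestCases(keywordStub, sampleCases);
// report.total === 3, report.passed === 3, report.failures.length === 0
```

In Google Apps Script, the real function could be passed in the same way, for example `runTestCases(isGenAIEmail, testCases)`, and the returned `failures` array logged with `Logger.log`.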