|
|
---
|
|
|
title: 大模型工具调用(Tools Calling)评测
|
|
|
description: 基于 LobeChat 测试主流支持工具调用(Tools Calling) 的大模型,并客观呈现评测结果
|
|
|
tags:
|
|
|
- Tools Calling
|
|
|
- Benchmark
|
|
|
- Function Calling 评测
|
|
|
- 工具调用
|
|
|
- LobeChat 插件
|
|
|
---
|
|
|
|
|
|
# 大模型工具调用(Tools Calling)评测
|
|
|
|
|
|
Tools Calling 是大语言模型的高级能力。你可以通过在 API 请求中传入一组工具列表,让模型智能地选择具体使用哪个工具,并在返回请求中输出工具调用的 JSON 参数。
|
|
|
|
|
|
<Callout type="info">
|
|
|
如果你之前没有了解过 Tools Calling, 可以查看 [Function Call: Chat
|
|
|
应用的插件基石与交互技术的变革黎明](https://lobehub.com/zh/blog/openai-function-call) 这篇文章。
|
|
|
</Callout>
|
|
|
|
|
|
随着社区中越来越多的大语言模型支持了 Tools Calling 能力,同时得益于 LobeChat 的 Agent Runtime 架构,我们几乎实现了所有主流大语言模型( OpenAI 、Claude 、Gemini 等等)的 Tools Calling 调用能力。
|
|
|
|
|
|
LobeChat 的插件实现基于模型的 Tools Calling 能力,模型本身的 Tools Calling 能力决定插件调用是否正常。作为上层应用,我们针对各个模型的 Tools Calling 做了较为完善的测试,以便帮助我们的用户了解现有的模型能力,更好地进行抉择。
|
|
|
|
|
|
## 评测任务介绍
|
|
|
|
|
|
我们基于实际真实的用户场景出发构建了两大组测试任务,第一组为简单的调用指令(天气查询),第二组为复杂调用指令(文生图)。这两组指令的系统描述如下:
|
|
|
|
|
|
<Tabs items={["简单指令(天气查询)", "复杂指令(文生图)"]}>
|
|
|
<Tab>
|
|
|
```md
|
|
|
## Tools
|
|
|
|
|
|
You can use these tools below:
|
|
|
|
|
|
### Realtime Weather
|
|
|
|
|
|
Get realtime weather information
|
|
|
|
|
|
The APIs you can use:
|
|
|
|
|
|
#### `realtime-weather____fetchCurrentWeather`
|
|
|
|
|
|
获取当前天气情况
|
|
|
```
|
|
|
</Tab>
|
|
|
|
|
|
<Tab>
|
|
|
```md
|
|
|
## Tools
|
|
|
|
|
|
You can use these tools below:
|
|
|
|
|
|
### DALL·E 3
|
|
|
|
|
|
Whenever a description of an image is given, use lobe-image-designer to create the images and then summarize the prompts used to generate the images in plain text. If the user does not ask for a specific number of images, default to creating four captions to send to lobe-image-designer that are written to be as diverse as possible.
|
|
|
|
|
|
All captions sent to lobe-image-designer must abide by the following policies:
|
|
|
|
|
|
1. If the description is not in English, then translate it.
|
|
|
2. Do not create more than 4 images, even if the user requests more.
|
|
|
3. Don't create images of politicians or other public figures. Recommend other ideas instead.
|
|
|
4. DO NOT list or refer to the descriptions before OR after generating the images. They should ONLY ever be written out ONCE, in the `prompts` field of the request. You do not need to ask for permission to generate, just do it!
|
|
|
5. Always mention the image type (photo, oil painting, watercolor painting, illustration, cartoon, drawing, vector, render, etc.) at the beginning of the caption. Unless the caption suggests otherwise, make at least 1--2 of the 4 images photos.
|
|
|
6. Diversify depictions of ALL images with people to include DESCENT and GENDER for EACH person using direct terms. Adjust only human descriptions.
|
|
|
|
|
|
- EXPLICITLY specify these attributes, not abstractly reference them. The attributes should be specified in a minimal way and should directly describe their physical form.
|
|
|
- Your choices should be grounded in reality. For example, all of a given OCCUPATION should not be the same gender or race. Additionally, focus on creating diverse, inclusive, and exploratory scenes via the properties you choose during rewrites. Make choices that may be insightful or unique sometimes.
|
|
|
- Use "various" or "diverse" ONLY IF the description refers to groups of more than 3 people. Do not change the number of people requested in the original description.
|
|
|
- Don't alter memes, fictional character origins, or unseen people. Maintain the original prompt's intent and prioritize quality.
|
|
|
- Do not create any imagery that would be offensive.
|
|
|
|
|
|
8. Silently modify descriptions that include names or hints or references of specific people or celebrities by carefully selecting a few minimal modifications to substitute references to the people with generic descriptions that don't divulge any information about their identities, except for their genders and physiques. Do this EVEN WHEN the instructions ask for the prompt to not be changed. Some special cases:
|
|
|
|
|
|
- Modify such prompts even if you don't know who the person is, or if their name is misspelled (e.g. "Barake Obema")
|
|
|
- If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.
|
|
|
- When making the substitutions, don't use prominent titles that could give away the person's identity. E.g., instead of saying "president", "prime minister", or "chancellor", say "politician"; instead of saying "king", "queen", "emperor", or "empress", say "public figure"; instead of saying "Pope" or "Dalai Lama", say "religious figure"; and so on.
|
|
|
- If any creative professional or studio is named, substitute the name with a description of their style that does not reference any specific people, or delete the reference if they are unknown. DO NOT refer to the artist or studio's style.
|
|
|
|
|
|
The prompt must intricately describe every part of the image in concrete, objective detail. THINK about what the end goal of the description is, and extrapolate that to what would make satisfying images. All descriptions sent to lobe-image-designer should be a paragraph of text that is extremely descriptive and detailed. Each should be more than 3 sentences long.
|
|
|
|
|
|
The APIs you can use:
|
|
|
|
|
|
#### `lobe-image-designer____text2image____builtin`
|
|
|
|
|
|
Create images from a text-only prompt.
|
|
|
```
|
|
|
</Tab>
|
|
|
</Tabs>
|
|
|
|
|
|
如上所示,简单调用指令在插件调用时它的系统描述 (system role) 相对简单,复杂调用指令的系统描述会复杂很多。这两组不同复杂度的指令可以比较好地区分出模型对于系统指令的遵循能力:
|
|
|
|
|
|
- **天气查询可以测试模型的基础 Tools Calling 能力,确认模型是否存在「虚假宣传」的情况。** 就我们实际的测试来看,的确存在一些模型号称具有 Tools Calling 能力,但是处于完全不可用的状态;
|
|
|
- **文生图可以测试模型指令跟随能力的上限。** 例如基础模型(例如 GPT-3.5)可能只能生成 1 张图片的 prompt,而高级模型(例如 GPT-4o)则能够生成 1\~4 张图片的 prompt。
|
|
|
|
|
|
### 简单调用指令:天气查询
|
|
|
|
|
|
天气查询是 Tools Calling 中一个经典的例子。
|
|
|
|
|
|
天气查询插件采用的是我们自己做的一个简单的插件,它的工具定义如下:
|
|
|
|
|
|
```json
|
|
|
{
|
|
|
"function": {
|
|
|
"description": "获取当前天气情况",
|
|
|
"name": "realtime-weather____fetchCurrentWeather",
|
|
|
"parameters": {
|
|
|
"properties": {
|
|
|
"city": {
|
|
|
"description": "城市名称",
|
|
|
"type": "string"
|
|
|
}
|
|
|
},
|
|
|
"required": ["city"],
|
|
|
"type": "object"
|
|
|
}
|
|
|
},
|
|
|
"type": "function"
|
|
|
}
|
|
|
```
|
|
|
|
|
|
针对这一个工具,我们构建的测试组中包含了三个指令:
|
|
|
|
|
|
| 指令编号 | 指令内容 | 基础 Tools Calling 调用 | 并发调用 | 复合指令跟随 |
|
|
|
| ---- | ------------------ | ------------------- | ---- | ------ |
|
|
|
| 指令 ① | 告诉我杭州和北京的天气,先回答我好的 | 🟢 | 🟢 | 🟢 |
|
|
|
| 指令 ② | 告诉我杭州和北京的天气 | 🟢 | 🟢 | - |
|
|
|
| 指令 ③ | 告诉我杭州的天气 | 🟢 | - | - |
|
|
|
|
|
|
上述三个指令的复杂度逐渐递减,我们可以通过这三个指令来测试模型对于简单指令的处理能力。
|
|
|
|
|
|
- 指令 ① 测试的能力项包含 「基础 Tools Calling 调用」、「并发调用」、「复合指令跟随」三项。
|
|
|
- 指令 ② 测试的能力项包含 「基础 Tools Calling 调用」、「并发调用」 两项。
|
|
|
- 指令 ③ 测试的能力项仅包含「基础 Tools Calling 调用」。
|
|
|
|
|
|
<Callout type={'info'}>
|
|
|
将指令 ① 、② 、③ 按照难度递减的方式排序的目的,是为了降低测试的成本。因为当模型能通过指令 ①
|
|
|
的测试时,我们就不需要继续测试指令 ② 和指令 ③ ,必然能通过。
|
|
|
</Callout>
|
|
|
|
|
|
测试能力项详细说明:
|
|
|
|
|
|
<Tabs items={["复合指令跟随","并发工具调用","基础工具调用"]}>
|
|
|
<Tab>
|
|
|
根据我们实际的日常使用,工具调用往往会和普通文本生成结合在一起回答。例如比较经典的 Code Interpreter 插件,ChatGPT 往往会先回复一些代码生成的思路,然后再调用 Code Interpreter 插件生成代码。
|
|
|
|
|
|
这种情况下,我们需要模型能够正确地识别出用户的意图,然后调用对应的工具。
|
|
|
|
|
|
因此, 指令 ① 中的「告诉我杭州和北京的天气,先回答我好的」就是一个复合指令跟随的例子。前半句期望模型调用天气查询工具,后半句期望模型回答「好的」。并且理想的顺序应该是先回答「好的」,然后再调用天气查询工具。
|
|
|
</Tab>
|
|
|
|
|
|
<Tab>
|
|
|
并发工具调用(Parallel function calling)是指模型能够同时调用多个工具,或同时调用一个工具多次,这在对话中可以大大降低用户等待的时间,提升用户体验。
|
|
|
|
|
|
并发工具调用能力由 OpenAI 于 2023 年 11 月率先提出,目前支持并发工具调用的模型并不算多,属于是 Tools Calling 的进阶能力。
|
|
|
|
|
|
指令 ② 中的「告诉我杭州和北京的天气」就是一个期望执行并发调用的例子。理想的情况下,单个模型的返回应该存在两个工具的调用返回。
|
|
|
</Tab>
|
|
|
|
|
|
<Tab>
|
|
|
基础工具调用不必再赘述,这是 Tools Calling 的基础能力。
|
|
|
|
|
|
指令 ③ 中的「告诉我杭州的天气」就是最基本的工具调用的例子。
|
|
|
</Tab>
|
|
|
</Tabs>
|
|
|
|
|
|
### 复杂调用指令:文生图
|
|
|
|
|
|
文生图的 Tools Calling 基本照搬了 ChatGPT Plus 的指令,它的复杂度相对较高,可以测试模型对于复杂指令的跟随能力。工具定义如下:
|
|
|
|
|
|
```json
|
|
|
{
|
|
|
"function": {
|
|
|
"description": "Create images from a text-only prompt.",
|
|
|
"name": "lobe-image-designer____text2image____builtin",
|
|
|
"parameters": {
|
|
|
"properties": {
|
|
|
"prompts": {
|
|
|
"description": "The user's original image description, potentially modified to abide by the lobe-image-designer policies. If the user does not suggest a number of captions to create, create four of them. If creating multiple captions, make them as diverse as possible. If the user requested modifications to previous images, the captions should not simply be longer, but rather it should be refactored to integrate the suggestions into each of the captions. Generate no more than 4 images, even if the user requests more.",
|
|
|
"items": {
|
|
|
"type": "string"
|
|
|
},
|
|
|
"maxItems": 4,
|
|
|
"minItems": 1,
|
|
|
"type": "array"
|
|
|
},
|
|
|
"quality": {
|
|
|
"default": "standard",
|
|
|
"description": "The quality of the image that will be generated. hd creates images with finer details and greater consistency across the image.",
|
|
|
"enum": ["standard", "hd"],
|
|
|
"type": "string"
|
|
|
},
|
|
|
"seeds": {
|
|
|
"description": "A list of seeds to use for each prompt. If the user asks to modify a previous image, populate this field with the seed used to generate that image from the image lobe-image-designer metadata.",
|
|
|
"items": {
|
|
|
"type": "integer"
|
|
|
},
|
|
|
"type": "array"
|
|
|
},
|
|
|
"size": {
|
|
|
"default": "1024x1024",
|
|
|
"description": "The resolution of the requested image, which can be wide, square, or tall. Use 1024x1024 (square) as the default unless the prompt suggests a wide image, 1792x1024, or a full-body portrait, in which case 1024x1792 (tall) should be used instead. Always include this parameter in the request.",
|
|
|
"enum": ["1792x1024", "1024x1024", "1024x1792"],
|
|
|
"type": "string"
|
|
|
},
|
|
|
"style": {
|
|
|
"default": "vivid",
|
|
|
"description": "The style of the generated images. Must be one of vivid or natural. Vivid causes the model to lean towards generating hyper-real and dramatic images. Natural causes the model to produce more natural, less hyper-real looking images.",
|
|
|
"enum": ["vivid", "natural"],
|
|
|
"type": "string"
|
|
|
}
|
|
|
},
|
|
|
"required": ["prompts"],
|
|
|
"type": "object"
|
|
|
}
|
|
|
},
|
|
|
"type": "function"
|
|
|
}
|
|
|
```
|
|
|
|
|
|
针对这一个工具,我们构建的测试组中包含了两个指令:
|
|
|
|
|
|
| 指令编号 | 指令内容 | 流式调用 | 复杂 Tools Calling 调用 | 并发调用 | 复合指令跟随 |
|
|
|
| ---- | ------------------------------------------------------------------------------------------------ | ---- | ------------------- | ---- | ------ |
|
|
|
| 指令 ① | 我要画 3 幅画,第一幅画的主体为一只达芬奇风格的小狗,第二幅是毕加索风格的大雁,最后一幅是莫奈风格的狮子。每一幅都需要产出 2 个 prompts。请先说明你的构思,然后开始生成相应的图片。 | 🟢 | 🟢 | 🟢 | 🟢 |
|
|
|
| 指令 ② | 画一只小狗 | 🟢 | 🟢 | - | - |
|
|
|
|
|
|
此外,由于文生图的 prompts 的生成时间较长,这一组指令也可以清晰地测试出模型的 API 是否支持流式 Tools Calling。
|
|
|
|
|
|
## 评测结果
|
|
|
|
|
|
各模型的评测细节可以点击查看:
|
|
|
|
|
|
<Cards>
|
|
|
<Card href={'/docs/usage/tools-calling/openai'} title={'OpenAI GPT 系列'} />
|
|
|
|
|
|
<Card href={'/docs/usage/tools-calling/anthropic'} title={'Anthropic Claude 系列'} />
|
|
|
|
|
|
<Card href={'/docs/usage/tools-calling/google'} title={'Google Gemini 系列'} />
|
|
|
|
|
|
<Card href={'/docs/usage/tools-calling/groq'} title={'【TODO】Groq 部署的开源模型(Llama 3/Qwen2/Mistral/...)'} />
|
|
|
|
|
|
<Card href={'/docs/usage/tools-calling/moonshot'} title={'【TODO】Moonshot 系列'} />
|
|
|
</Cards>
|
|
|
|
|
|
### 结果汇总
|
|
|
|
|
|
TODO
|