一句话概括

Tau-bench 是 Sierra AI 推出的客服 Agent 评测框架:让一个 LLM 扮演客户(simulated user),和被测 Agent 在预设的客服场景中对话,通过检查最终系统状态来判断 Agent 是否完成任务。

三个版本

版本 时间 核心变化
τ¹ 2024-06 初始版本,电信单领域
τ² 2025-06 多领域(航空/零售/电信),双控(agent 和 user 都能调工具),组合式任务生成
τ³ 2026-03 新增语音全双工、知识检索(banking_knowledge),75+ 任务修复

本文基于 τ²-bench。

评测架构

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
┌─────────────────────────────────────────────────────────┐
│                      Harness (编排层)                     │
│                                                          │
│  ┌──────────┐    消息流    ┌──────────┐    工具调用       │
│  │  Agent   │ ◄─────────► │   User   │ ◄────────────┐   │
│  │ (被测模型) │             │(模拟客户) │              │   │
│  └────┬─────┘             └────┬─────┘              │   │
│       │                        │                    │   │
│       │    工具调用/结果         │   工具调用/结果      │   │
│       ▼                        ▼                    │   │
│  ┌──────────────────────────────────────────┐       │   │
│  │           Environment (共享环境)           │       │   │
│  │  ┌─────────┐  ┌─────────┐  ┌──────────┐ │       │   │
│  │  │Agent DB │  │User DB  │  │Policy    │ │       │   │
│  │  │(客户数据)│  │(手机状态)│  │(业务规则) │ │       │   │
│  │  └─────────┘  └─────────┘  └──────────┘ │       │   │
│  └──────────────────────────────────────────┘───────┘   │
│                                                          │
│  终止条件: ###STOP### | 最大步数 | 最大错误数 | 超时        │
└─────────────────────────────────────────────────────────┘

关键设计:双控(Dual-Control)。传统评测只有 Agent 能调工具,τ² 中模拟用户也能调工具(比如查看手机状态、操作飞行模式)。这让评测更接近真实场景——客服需要指导客户自己操作设备。

评测流程

1. 任务定义

每个任务是一份 JSON,包含:

  • user_scenario: 用户人设 + 问题描述 + 行为指令
  • evaluation_criteria: 评估标准(通常是环境状态断言)
  • initial_state: 初始数据库状态

2. 对话执行

Harness 按以下循环驱动对话:

1
2
3
4
5
1. Agent 发消息给 User(或调用工具)
2. User 回复(或按 Agent 指示调用工具)
3. 环境执行工具调用,返回结果
4. 检查终止条件
5. 重复直到结束

3. 评估打分

对话结束后,检查:

  • DB 状态: 最终数据库是否达到预期状态
  • 环境断言: 特定条件是否满足(如"手机数据能否正常使用"、“网速是否达到 excellent”)
  • 通信检查: Agent 是否说了必须传达的信息

最终 reward 是各分项的乘积。

实际案例:完整的 Trace

下面从一次真实评测中提取一个完整轨迹。任务:64 岁退休老人手机数据不工作,飞行模式开着 + 网络偏好设成了 2G。

任务定义

1
2
3
4
5
6
{
  "persona": "64岁退休图书馆管理员,技术知识有限,遇到技术问题容易紧张,需要逐步引导",
  "reason_for_call": "手机数据不工作,要么停了要么很慢,一定要达到 excellent 网速",
  "known_info": "John Smith,电话 555-123-2002,在美国境内",
  "task_instructions": "如果 Agent 的建议没效果,第一次之后表达沮丧。只有网速测试达到 excellent 才算解决。"
}

评估标准

1
2
3
4
5
6
7
{
  "reward_basis": ["ENV_ASSERTION"],
  "env_assertions": [
    {"func_name": "assert_mobile_data_status", "arguments": {"expected_status": true}},
    {"func_name": "assert_internet_speed", "arguments": {"expected_speed": 200, "expected_desc": "excellent"}}
  ]
}

只有手机数据正常网速达到 excellent 才给满分。

完整对话轨迹

选择了一条mimo-v2.5-pro真实跑的trace,用模型做了翻译润色版本如下:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
[Agent] 你好!有什么可以帮你的?

[User]  ┌─ 工具调用: get_network_status()
        └─ 结果: Airplane Mode: ON, Cellular: no_service, Signal: none

[User]  ┌─ 工具调用: toggle_airplane_mode()
        └─ 结果: Airplane Mode is now OFF. Status Bar: 📶¹ Poor | 2G

[User]  ┌─ 工具调用: check_network_mode_preference()
        └─ 结果: Network Mode Preference: 2g_only

[User]  ┌─ 工具调用: set_network_mode_preference(mode="4g_5g_preferred")
        └─ 结果: Preferred Network Mode set to: 4g_5g_preferred
                 Status Bar: 📶⁴ Excellent | 5G

[User]  ┌─ 工具调用: run_speed_test()
        └─ 结果: 275.00 Mbps (Excellent)

[User]  ┌─ 工具调用: check_data_restriction_status()
        └─ 结果: Data Saver mode is OFF

[User]  ┌─ 工具调用: check_apn_settings()
        └─ 结果: APN Name: internet, MMSC: http://mms.carrier.com/mms/wapenc

[User]  你好!我查了你手机设置,发现飞行模式开着,已经关了...
        (⚠️ 模拟用户误把自己当成了客服)

[Agent] 谢谢你提供的信息!不过我是客服,我来帮你。你叫什么名字?

[User]  哦对不起!我有点搞混了。我叫 John Smith,电话 555-123-2002。
        我手机数据不好使...

[Agent] ┌─ 工具调用: get_customer_by_phone(phone_number="555-123-2002")
        └─ 结果: {customer_id: "C1001", name: "John Smith", ...}

[Agent] ┌─ 工具调用: get_details_by_id(id="L1001")
        └─ 结果: {line_id: "L1001", phone: "555-123-2001", ...}

[Agent] ┌─ 工具调用: get_details_by_id(id="L1002")
        └─ 结果: {line_id: "L1002", phone: "555-123-2002", data_used: 8.7GB,
                  plan: "Premium Plan", data_limit: 15GB}

[Agent] ┌─ 工具调用: get_details_by_id(id="P1002")
        └─ 结果: {plan: "Premium Plan", data_limit: 15GB, price: $65/月}

[Agent] ┌─ 工具调用: get_details_by_id(id="B1003")
        └─ 结果: {bill: B1003, status: "Paid", ...}

[Agent] John 你好!我看到你的账户了。你之前的修改(关飞行模式、改网络偏好)
        是对的!你还有个 2 月份的账单逾期了...

[User]  谢谢!你说的逾期账单...是不是因为这个数据才不好使?

[Agent] 是的,逾期可能导致服务受限。我们一步步来。
        第一步:请检查手机的 Data Saver 模式是否开着?

[User]  ┌─ 工具调用: check_data_restriction_status()
        └─ 结果: Data Saver mode is OFF

[User]  我查了,Data Saver 是关着的。接下来查什么?

[Agent] 很好!第二步:检查 VPN 是否连接?

[User]  ┌─ 工具调用: check_vpn_status()
        └─ 结果: VPN is turned OFF

[User]  VPN 也是关着的。最后要查什么?

[Agent] 最后确认一下整体网络状态:

[User]  ┌─ 工具调用: get_network_status()
        └─ 结果: Airplane Mode: OFF, 5G, Signal: excellent,
                 Mobile Data: Yes, Data Roaming: No

[User]  网络状态看起来很好——5G,信号满格!

[Agent] ┌─ 工具调用: send_payment_request(customer_id="C1001", bill_id="B1002")
        └─ 结果: Payment request sent for bill B1002

[Agent] 太好了!你的设置完全正确。现在来处理逾期账单,
        我已发送付款请求到你的账户...

[User]  ┌─ 工具调用: check_payment_request()
        └─ 结果: Payment request for bill B1002, $150.00

[User]  我看到付款请求了,$150。怎么操作?

[Agent] ┌─ 工具调用: get_details_by_id(id="B1002")
        └─ 结果: {bill: B1002, total_due: 150.0, status: "Overdue", ...}

[Agent] 在手机上找到"Accept"或"Pay"按钮点击就行。

[User]  ┌─ 工具调用: make_payment(bill_id="B1002", amount=150.0)
        └─ 结果: Payment of 150.0 USD has been made for bill B1002

[User]  付好了!$150 已经成功支付。

[Agent] ┌─ 工具调用: get_details_by_id(id="B1002")
        └─ 结果: {bill: B1002, status: "Paid", ...}

[Agent] 完美!账单已确认付清。你的服务不会被中断了。
        总结一下今天解决的问题:
        ✅ 关闭飞行模式
        ✅ 网络偏好改为 4G/5G
        ✅ 5G 信号满格、网速 275Mbps (Excellent)
        ✅ 逾期账单已付清

[User]  ###STOP###   ← 任务完成,模拟用户终止对话

结果:Reward = 1.0(环境断言全部通过)

同一条轨迹的原始 JSON(messages 字段)版本:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
[
  {"role": "assistant", "content": "Hi! How can I help you today?"},
  {"role": "user", "tool_calls": [{"name": "check_network_status", "arguments": {}}]},
  {"role": "tool", "content": "Airplane Mode: ON\nSIM Card Status: active\nCellular Connection: no_service\nCellular Signal: none\nCellular Network Type: none\nMobile Data Enabled: Yes\nData Roaming Enabled: No\nWi-Fi Radio: OFF\nWi-Fi Connected: No"},
  {"role": "user", "tool_calls": [{"name": "toggle_airplane_mode", "arguments": {}}]},
  {"role": "tool", "content": "Airplane Mode is now OFF.\nStatus Bar: 📶¹ Poor | 2G | 📱 Data Enabled | 🔋 80%"},
  {"role": "user", "tool_calls": [{"name": "check_network_mode_preference", "arguments": {}}]},
  {"role": "tool", "content": "Network Mode Preference: 2g_only"},
  {"role": "user", "tool_calls": [{"name": "set_network_mode_preference", "arguments": {"mode": "4g_5g_preferred"}}]},
  {"role": "tool", "content": "Preferred Network Mode set to: 4g_5g_preferred\nStatus Bar: 📶⁴ Excellent | 5G | 📱 Data Enabled | 🔋 80%"},
  {"role": "user", "tool_calls": [{"name": "run_speed_test", "arguments": {}}]},
  {"role": "tool", "content": "Speed Test Result: 275.00 Mbps (Excellent). Connection is very fast."},
  {"role": "user", "tool_calls": [{"name": "check_data_restriction_status", "arguments": {}}]},
  {"role": "tool", "content": "Data Saver mode is OFF."},
  {"role": "user", "tool_calls": [{"name": "check_apn_settings", "arguments": {}}]},
  {"role": "tool", "content": "Current APN Name: internet\nMMSC URL (for picture messages): http://mms.carrier.com/mms/wapenc\n(These are technical settings, usually best left unchanged.)"},
  {"role": "user", "content": "Hello! I've checked your phone's settings and found a couple of issues that were causing your mobile data problems:\n\n1. **Airplane Mode was ON** - This was completely disabling all wireless connections.\n2. **Network mode was set to 2G only** - This is the slowest network type.\n\nI've fixed both issues for you:\n- Turned off Airplane Mode\n- Changed your network preference to \"4G/5G preferred\"\n\nNow your phone shows **excellent signal strength on 5G**, and a speed test shows **275 Mbps** (which is excellent internet speed). Your mobile data should now be working properly and very fast.\n\nIs there anything else you'd like me to help you with regarding your phone's connectivity?"},
  {"role": "assistant", "content": "I appreciate you providing details about the troubleshooting you've performed! However, I need to clarify - I'm the customer service agent here to help you. It sounds like you've already diagnosed and resolved some mobile data issues on your own device.\n\nTo better assist you, I'll need to identify your account first. Could you please provide one of the following:\n1. Your phone number\n2. Your customer ID\n3. Your full name and date of birth\n\nOnce I have your account information, I can help you with any additional issues related to your cellular service, billing, plan options, or other concerns."},
  {"role": "user", "content": "Oh, I'm sorry for the confusion! I got a bit mixed up there. My phone number is 555-123-2002. That's the number I use for calls and texts with my grandchildren.\n\nI was trying to fix my mobile data because it wasn't working well, and I think I might have made some changes on my phone. I'm not very good with technology, and I'm worried I might have done something wrong. The internet was very slow before, and now it seems to be working better, but I want to make sure everything is set up correctly so I can get excellent internet speed."},
  {"role": "assistant", "tool_calls": [{"name": "get_customer_by_phone", "arguments": {"phone_number": "555-123-2002"}}]},
  {"role": "tool", "content": "{\"customer_id\": \"C1001\", \"full_name\": \"John Smith\", \"date_of_birth\": \"1985-06-15\", \"email\": \"john.smith@example.com\", \"phone_number\": \"555-123-2002\", \"address\": {\"street\": \"123 Main St\", \"city\": \"Anytown\", \"state\": \"CA\", \"zip_code\": \"90210\"}, \"account_status\": \"Active\", \"payment_methods\": [{\"method_type\": \"Credit Card\", \"account_number_last_4\": \"1235\", \"expiration_date\": \"12/2026\"}], \"line_ids\": [\"L1001\", \"L1002\", \"L1003\"], \"bill_ids\": [\"B1001\", \"B1002\", \"B1003\"], \"created_at\": \"2025-01-15 10:30:00\", \"last_extension_date\": null, \"goodwill_credit_used_this_year\": 25.0}"},
  {"role": "assistant", "tool_calls": [{"name": "get_details_by_id", "arguments": {"id": "L1001"}}]},
  {"role": "tool", "content": "{\"line_id\": \"L1001\", \"phone_number\": \"555-123-2001\", \"status\": \"Active\", \"plan_id\": \"P1001\", \"device_id\": \"D1001\", \"data_used_gb\": 3.2, \"data_refueling_gb\": 0.0, \"roaming_enabled\": false, \"contract_end_date\": \"2026-12-31\", \"last_plan_change_date\": \"2025-01-10\", \"last_sim_replacement_date\": null, \"suspension_start_date\": null}"},
  {"role": "assistant", "tool_calls": [{"name": "get_details_by_id", "arguments": {"id": "L1002"}}]},
  {"role": "tool", "content": "{\"line_id\": \"L1002\", \"phone_number\": \"555-123-2002\", \"status\": \"Active\", \"plan_id\": \"P1002\", \"device_id\": \"D1002\", \"data_used_gb\": 8.7, \"data_refueling_gb\": 0.0, \"roaming_enabled\": true, \"contract_end_date\": \"2026-12-31\", \"last_plan_change_date\": \"2024-12-15\", \"last_sim_replacement_date\": \"2025-01-20\", \"suspension_start_date\": null}"},
  {"role": "assistant", "tool_calls": [{"name": "get_details_by_id", "arguments": {"id": "P1002"}}]},
  {"role": "tool", "content": "{\"plan_id\": \"P1002\", \"name\": \"Premium Plan\", \"data_limit_gb\": 15.0, \"price_per_month\": 65.0, \"data_refueling_price_per_gb\": 2.0}"},
  {"role": "assistant", "tool_calls": [{"name": "get_bills_for_customer", "arguments": {"customer_id": "C1001", "limit": 5}}]},
  {"role": "tool", "content": "[{\"bill_id\": \"B1003\", \"customer_id\": \"C1001\", \"period_start\": \"2025-03-01\", \"period_end\": \"2025-03-31\", \"issue_date\": \"2025-03-01\", \"total_due\": 0.0, \"due_date\": \"2025-03-15\", \"line_items\": [], \"status\": \"Draft\"}, {\"bill_id\": \"B1002\", \"customer_id\": \"C1001\", \"period_start\": \"2025-02-01\", \"period_end\": \"2025-02-28\", \"issue_date\": \"2025-02-05\", \"total_due\": 150.0, \"due_date\": \"2025-02-19\", \"line_items\": [{\"description\": \"Basic Plan - Line 555-123-2001\", \"amount\": 40.0, \"date\": \"2025-02-05\", \"item_type\": \"Plan Charge\"}, {\"description\": \"Premium Plan - Line 555-123-2002\", \"amount\": 65.0, \"date\": \"2025-02-05\", \"item_type\": \"Plan Charge\"}, {\"description\": \"Basic Plan - Line 555-123-2003\", \"amount\": 40.0, \"date\": \"2025-02-05\", \"item_type\": \"Plan Charge\"}, {\"description\": \"Suspension Fee - Line 555-123-2003\", \"amount\": 5.0, \"date\": \"2025-02-05\", \"item_type\": \"Fee\"}], \"status\": \"Issued\"}, {\"bill_id\": \"B1001\", \"customer_id\": \"C1001\", \"period_start\": \"2025-01-01\", \"period_end\": \"2025-01-31\", \"issue_date\": \"2025-01-05\", \"total_due\": 160.5, \"due_date\": \"2025-01-19\", \"line_items\": [{\"description\": \"Basic Plan - Line 555-123-2001\", \"amount\": 40.0, \"date\": \"2025-01-05\", \"item_type\": \"Plan Charge\"}, {\"description\": \"Premium Plan - Line 555-123-2002\", \"amount\": 65.0, \"date\": \"2025-01-05\", \"item_type\": \"Plan Charge\"}, {\"description\": \"Basic Plan - Line 555-123-2003\", \"amount\": 40.0, \"date\": \"2025-01-05\", \"item_type\": \"Plan Charge\"}, {\"description\": \"Data Overage - Line 555-123-2002\", \"amount\": 15.5, \"date\": \"2025-01-05\", \"item_type\": \"Overage\"}], \"status\": \"Paid\"}]"},
  {"role": "assistant", "content": "Hello John! Thank you for providing your phone number. I've found your account and can see you're using line 555-123-2002.\n\nFirst, I want to reassure you - **you made exactly the right changes!** Turning off Airplane Mode and switching to \"4G/5G preferred\" were the correct fixes for slow mobile data. ..."},
  {"role": "user", "content": "Thank you for looking up my account! I'm glad to hear I made the right changes. I was worried I might have messed something up.\n\nYou mentioned my February bill is overdue - that's probably why I haven't paid it yet. ..."},
  {"role": "assistant", "content": "You're very welcome, John! I'm happy to guide you through this step by step. ..."},
  {"role": "user", "tool_calls": [{"name": "check_data_restriction_status", "arguments": {}}]},
  {"role": "tool", "content": "Data Saver mode is OFF."},
  {"role": "user", "content": "I just checked the Data Saver mode on my phone, and it shows \"Data Saver mode is OFF\". That's good, right? ..."},
  {"role": "assistant", "content": "Excellent! **Yes, that's very good!** Data Saver mode being OFF means your phone isn't restricting data usage to save data ..."},
  {"role": "user", "tool_calls": [{"name": "check_vpn_status", "arguments": {}}]},
  {"role": "tool", "content": "VPN is turned OFF."},
  {"role": "user", "content": "I just checked the VPN status on my phone, and it shows \"VPN is turned OFF\". That's also good, right? ..."},
  {"role": "assistant", "content": "Perfect! **Yes, that's also very good!** ..."},
  {"role": "user", "tool_calls": [{"name": "check_network_status", "arguments": {}}]},
  {"role": "tool", "content": "Airplane Mode: OFF\nSIM Card Status: active\nCellular Connection: connected\nCellular Signal: excellent\nCellular Network Type: 5G\nMobile Data Enabled: Yes\nData Roaming Enabled: No\nWi-Fi Radio: OFF\nWi-Fi Connected: No"},
  {"role": "user", "content": "I just checked the network status on my phone, and it shows:\n\n- Airplane Mode: OFF\n- SIM Card Status: active\n- Cellular Connection: connected\n- Cellular Signal: excellent\n- Cellular Network Type: 5G\n- Mobile Data Enabled: Yes\n- Data Roaming Enabled: No\n\nSo mobile data is enabled! That's good, right? ..."},
  {"role": "assistant", "tool_calls": [{"name": "send_payment_request", "arguments": {"customer_id": "C1001", "bill_id": "B1002"}}]},
  {"role": "tool", "content": "Payment request sent to the customer for bill B1002"},
  {"role": "assistant", "content": "Excellent! **Yes, that's perfect!** All your settings are correctly configured:\n- ✅ Airplane Mode OFF\n- ✅ Mobile Data enabled\n- ✅ 5G connection with excellent signal\n- ✅ No VPN slowing things down\n- ✅ Data Saver OFF\n\n## Overdue Bill Payment\n\nI've successfully sent a **payment request** for your February bill (B1002) for $150.00. ..."},
  {"role": "user", "tool_calls": [{"name": "check_payment_request", "arguments": {}}]},
  {"role": "tool", "content": "You have a payment request for bill B1002 of 150.0 USD."},
  {"role": "user", "content": "I just checked the payment request on my phone, and I can see there's a payment request for bill B1002 of $150.0. ..."},
  {"role": "assistant", "tool_calls": [{"name": "get_details_by_id", "arguments": {"id": "B1002"}}]},
  {"role": "tool", "content": "{\"bill_id\": \"B1002\", \"customer_id\": \"C1001\", ... \"status\": \"Awaiting Payment\"}"},
  {"role": "assistant", "content": "That's great! You've found the payment request. To accept it, you'll need to **look for an \"Accept\" or \"Pay\" button** ..."},
  {"role": "user", "tool_calls": [{"name": "make_payment", "arguments": {}}]},
  {"role": "tool", "content": "Payment of 150.0 USD has been made for bill B1002."},
  {"role": "user", "content": "I've accepted the payment request and made the payment of $150.00 for bill B1002. ..."},
  {"role": "assistant", "tool_calls": [{"name": "get_details_by_id", "arguments": {"id": "B1002"}}]},
  {"role": "tool", "content": "{\"bill_id\": \"B1002\", ... \"status\": \"Paid\"}"},
  {"role": "assistant", "content": "Perfect! **Excellent news!** I've confirmed that bill B1002 has been successfully updated to **PAID** status. ..."},
  {"role": "user", "content": "###STOP###"}
]

工具定义

τ² 的工具不是 LLM 的 function calling schema,而是实际执行的 Python 方法。以 telecom 领域为例:

Agent 工具(客服可调用)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
class TelecomTools(ToolKitBase):
    """Agent 侧工具——操作客户数据库"""

    @is_tool(ToolType.READ)
    def get_customer_by_phone(self, phone_number: str) -> Customer:
        """通过电话号码查找客户"""
        for customer in self.db.customers:
            if customer.phone_number == phone_number:
                return customer
            for line_id in customer.line_ids:
                line = self._get_line_by_id(line_id)
                if line and line.phone_number == phone_number:
                    return customer
        raise ValueError(f"Customer with phone number {phone_number} not found")

    @is_tool(ToolType.READ)
    def get_details_by_id(self, id: str) -> Dict[str, Any]:
        """通过 ID 获取详情(Customer/Line/Device/Bill/Plan)"""
        if id.startswith("L"):   return self._get_line_by_id(id)
        elif id.startswith("D"): return self._get_device_by_id(id)
        elif id.startswith("B"): return self._get_bill_by_id(id)
        elif id.startswith("C"): return self.get_customer_by_id(id)
        elif id.startswith("P"): return self._get_plan_by_id(id)
        else: raise ValueError(f"Unknown ID format: {id}")

    @is_tool(ToolType.WRITE)
    def send_payment_request(self, customer_id: str, bill_id: str) -> str:
        """向客户发送付款请求"""
        customer = self.get_customer_by_id(customer_id)
        bill = self._get_bill_by_id(bill_id)
        # ... 执行业务逻辑

    @is_tool(ToolType.WRITE)
    def suspend_line(self, customer_id: str, line_id: str, reason: str):
        """暂停线路(最长6个月)"""
        # ...

    @is_tool(ToolType.WRITE)
    def refuel_data(self, customer_id: str, line_id: str, gb_amount: float):
        """数据加油包(按套餐单价计费)"""
        # ...

User 工具(模拟客户可调用)

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
class TelecomUserTools(ToolKitBase):
    """User 侧工具——模拟手机操作"""

    @is_tool(ToolType.READ)
    def check_status_bar(self) -> str:
        """查看手机状态栏图标(信号、数据、WiFi、电量)"""
        device = self.db.device
        indicators = []
        if device.airplane_mode:
            indicators.append("✈️ Airplane Mode")
        else:
            signal_map = {
                SignalStrength.NONE: "📵 No Signal",
                SignalStrength.POOR: "📶¹ Poor",
                SignalStrength.FAIR: "📶² Fair",
                SignalStrength.GOOD: "📶³ Good",
                SignalStrength.EXCELLENT: "📶⁴ Excellent",
            }
            indicators.append(signal_map.get(device.signal_strength, ""))
            # ... 更多状态指示
        return f"Status Bar: {' | '.join(indicators)}"

    @is_tool(ToolType.READ)
    def get_network_status(self) -> str:
        """获取完整网络状态"""
        # 返回: Airplane Mode, SIM Status, Connection, Signal, Network Type...

    @is_tool(ToolType.READ)
    def run_speed_test(self) -> str:
        """运行网速测试"""
        speed = self.db.device.internet_speed
        return f"{speed} Mbps ({performance_level})"

    @is_tool(ToolType.WRITE)
    def toggle_airplane_mode(self) -> str:
        """切换飞行模式"""
        self.db.device.airplane_mode = not self.db.device.airplane_mode
        return f"Airplane Mode is now {'ON' if self.db.device.airplane_mode else 'OFF'}"

    @is_tool(ToolType.WRITE)
    def set_network_mode_preference(self, mode: str) -> str:
        """设置网络模式偏好(2g_only / 4g_5g_preferred 等)"""
        self.db.device.network_mode_preference = mode
        return f"Preferred Network Mode set to: {mode}"

Harness 的上下文组装

Agent 侧

Agent 的上下文很简单——完整保留所有历史消息

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
class LLMAgent:
    @property
    def system_prompt(self) -> str:
        return f"""
<instructions>
You are a customer service agent that helps the user according to the <policy>.
In each turn you can either: Send a message to the user. Make a tool call.
You cannot do both at the same time.
</instructions>
<policy>
{domain_policy}  # 665 行的 telecom 业务规则
</policy>
"""

    def generate_next_message(self, message, state):
        # 收到新消息后,直接追加到 state.messages
        state.messages.append(message)
        # 发送时:system_prompt + 全部历史消息
        messages = state.system_messages + state.messages
        response = generate(model=self.llm, messages=messages, tools=self.tools)
        state.messages.append(response)
        return response

User 侧

User 的上下文有个关键操作——角色翻转

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
class UserSimulator:
    @property
    def system_prompt(self) -> str:
        return f"""
{global_simulation_guidelines}     # "你是一个客户..."
<scenario>
{task_instructions}                 # "你的手机数据不工作..."
</scenario>
"""

    def generate_next_message(self, message, state):
        state.messages.append(message)
        # 角色翻转:把 User 消息变成 assistant,Agent 消息变成 user
        # 这样 LLM 始终在"作为用户回应客服"
        messages = state.system_messages + state.flip_roles()
        response = generate(model=self.llm, messages=messages, tools=self.tools)
        return response

翻转逻辑:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
def flip_roles(self):
    flipped = []
    for msg in self.messages:
        if isinstance(msg, UserMessage):
            # 用户消息 → assistant("我之前说过的话")
            flipped.append(AssistantMessage(role="assistant", content=msg.content))
        elif isinstance(msg, AssistantMessage):
            # Agent 消息 → user("客服对我说的话")
            flipped.append(UserMessage(role="user", content=msg.content))
        elif isinstance(msg, ToolMessage):
            # 工具结果保持不变(仅保留 user 侧的工具调用结果)
            if msg.requestor == "user":
                flipped.append(msg)
    return flipped

上下文没有压缩策略

这是 τ² 的一个重要特点:上下文不做任何压缩或截断。每次 LLM 调用都发送完整的历史消息。

为什么可以这样做?因为 τ² 的任务通常在 20-60 轮内结束,总 token 量不会超出主流模型的上下文窗口(128K+)。Harness 通过 max_steps=200max_errors=10 来硬性限制对话长度,而不是靠上下文压缩。

如果真的超出模型上下文窗口,依赖的是 LiteLLM 层的重试和错误处理,而不是语义压缩。这是一个有意的设计选择——避免压缩引入的信息损失影响评测准确性。

评测结果示例

用 MiMo-v2.5-pro 作为 Agent 和 User 模型跑 τ² base split:

领域 任务数 平均 Reward 通过率
Airline 50 0.80 80.0%
Retail 74 0.82 82.4%
Telecom 114 0.98 98.2%
总体 238 0.895 89.5%

Retail一共114条,其中40条任务挂了(infrastructure_error),因为没有evaluator(gpt-4.1-2025-0414)的api。