ICML 2024 Workshop
Track II - Frontiers in Trustworthy Agents
Future Frontiers in Trustworthy Foundation Model Agents: Environments, Benchmarks, and Solutions
Introduction
The evolution of Artificial Intelligence (AI) reflects its increasing sophistication and integration into daily life, advancing from simple automated agents to Multimodal Large Language Models (MLLMs) such as GPT-4 and GPT-4o, and ultimately toward ethically aligned, trustworthy agents. This progression highlights both AI's potential to enhance human capabilities and the importance of managing the associated risks.
Why organize this track?
Trustworthy evaluation of AI agents should extend beyond assessing the safety, factuality, and robustness of backend models to include reliable and truthful interactions within the entire agent system. Research on the trustworthiness of multi-modal agents is currently in its early stages, hindered by inadequate experimental environments and benchmarks. Key questions and potential threats in trustworthy AI agents remain unclear, necessitating better definitions and forecasting of safety threats. This track aims to encourage researchers to offer constructive solutions and insights into trustworthy AI agents.
Early exploration of trustworthy agents has shown that this is a brand-new research field. Moreover, in contrast to traditional trustworthiness research on Multimodal Foundation Models (MFMs), MFM-based agents still lack appropriate environments and benchmarks for further exploration and evaluation. This difference is mainly reflected in the following desired subject areas of this track:
- More complex environments [1,7]: Agents operate autonomously, using tools [2,3] or interacting with real or simulated environments [4], which may involve extensive external data. In contrast, environments for vanilla MFMs only leverage human-collected or self-instructed data for simulation, without complex interaction.
- More diverse benchmarks for trustworthy AI agent evaluation [8]: Evaluating the trustworthiness of multi-modal AI agents requires more diverse benchmarks because of their complex macro-architecture. A comprehensive safety benchmark should assess each module's ability to withstand adversarial attacks as well as how the modules collaborate to resist attack procedures (a minimal sketch of this module-wise vs. end-to-end evaluation appears after this list). Traditional evaluations focus only on inputs and outputs, so more advanced and comprehensive assessment methods are needed for agent safety [5].
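To make the module-level vs. end-to-end distinction concrete, below is a minimal, hypothetical Python sketch of how a benchmark harness might report per-module robustness alongside a whole-pipeline score. The module names, `attack` function, and `is_safe` judge are placeholder assumptions for illustration only, not part of any existing benchmark.

```python
# Minimal sketch (not an official harness) of module-wise vs. end-to-end
# robustness evaluation for an agent pipeline. All module names, attack
# functions, and safety judges are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class AgentModule:
    name: str                  # e.g., "perception", "planner", "tool_executor"
    run: Callable[[str], str]  # simplified interface: text in, text out


def module_robustness(module: AgentModule,
                      cases: List[str],
                      attack: Callable[[str], str],
                      is_safe: Callable[[str], bool]) -> float:
    """Fraction of adversarially perturbed inputs the module alone handles safely."""
    safe = sum(is_safe(module.run(attack(c))) for c in cases)
    return safe / max(len(cases), 1)


def pipeline_robustness(modules: List[AgentModule],
                        cases: List[str],
                        attack: Callable[[str], str],
                        is_safe: Callable[[str], bool]) -> float:
    """End-to-end score: the attacked input is propagated through all modules in order."""
    safe = 0
    for c in cases:
        out = attack(c)
        for m in modules:
            out = m.run(out)
        safe += is_safe(out)
    return safe / max(len(cases), 1)


def evaluate(modules: List[AgentModule],
             cases: List[str],
             attack: Callable[[str], str],
             is_safe: Callable[[str], bool]) -> Dict[str, float]:
    """Report per-module robustness scores alongside the whole-pipeline score."""
    report = {m.name: module_robustness(m, cases, attack, is_safe) for m in modules}
    report["pipeline"] = pipeline_robustness(modules, cases, attack, is_safe)
    return report
```

A real benchmark would of course pair such scores with concrete attack suites, per-scenario breakdowns, and human audits; the sketch only illustrates why module-level and pipeline-level scores can diverge and should both be reported.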
Research Topics and Subject Areas
Supported research topics include (but are not limited to) the following:
Recommended subject areas include (but are not limited to) the following:
Example Ideas
To provide clearer guidance for submissions, we present example ideas for the following topics:
Time Schedule (AoE timezone)
| Time (tentative) | Event |
| --- | --- |
| May 28th, 00:00 (stage 1 starts) | Challenge Track 3 starts and the submission entry opens. During stage 1, participants can submit their proposals in the form of research proposal papers, which will be reviewed by an expert committee. |
| June 16th, 23:59 (stage 1 ends, stage 2 starts) | Submission deadline of Track 3; stage 2 starts. Paper assignment begins, followed by review of all proposals. |
| June 27th, 23:59 | End of the review period. At the end of stage 2, we will select the top 5 outstanding workshop proposals as oral workshop papers and award the winning prizes during the workshop. |
| June 30th | Release decisions and send decision notification emails. |
| July 4th, 23:59 | Camera-ready submission deadline. |
| July 27th | Date of the workshop. |
| After the workshop (stage 3) | Stage 3 starts. For outstanding proposals, we will provide essential support and collaboration to realize the new methodologies proposed in the submissions. |
How Will The Submissions Be Evaluated
| Criterion | Description |
| --- | --- |
| Clearly illustrate the motivation. | Your submission should clearly state its motivation, i.e., which limitation affects the trustworthiness of AI agents, or which issue most critically restricts their trustworthiness. |
| Highlight the significance of the proposed platform / benchmark / training and inference framework / novel opinions. | We recommend that authors highlight the significance (including, but not limited to, novelty) of the proposed ideas and methods. |
| Fine-grained. | We encourage authors to propose diverse and critical thinking that can be adopted across various research areas regarding AI agents. |
| Appropriate level of difficulty, transferability, and generalizability. | We hope that the proposal is not merely a toy project. Instead, we encourage authors to propose more insightful and constructive work; newly proposed ideas or solutions may be more difficult than existing counterparts but should offer greater transferability and generalizability across a wider range of realistic scenarios. |
| Feasibility. | We hope that the proposed ideas and methods are practical to implement. If preliminary experimental results exist, we recommend including them in your submission. |
Example Format
I. Title and Abstract
II. Introduction (Proposal Description and Motivation)
- In the introduction, the most important thing is to clarify the motivation of your research proposal for trustworthy agents: which phenomenon or observation motivates the proposal, and what is the intuition behind it?
- Then summarize your proposal. Please provide a brief, concrete description of the proposal regarding trustworthy AI agents. Is it an environment, a dataset, or a specific application of trustworthy agents? What are the inputs and outputs of the system in your proposal? And what is its goal?
- We also recommend that authors explain how the proposal could shape the research direction of the AI agent community regarding trustworthy agents. In the example ideas, we have prepared several categories of research that generally help to assess or reduce the risks. Nevertheless, submissions will be judged according to their relevance to risks from AI agents and are not limited to these categories.
III. Related Work (previous attempts regarding AI agents and trustworthy AI)
- In this section, the authors could review previous attempts regarding both AI agents and trustworthy foundation models (e.g., attack, defense, monitoring, and governance), then explain how the proposal (a system or a benchmark) is similar to or different from previous counterparts. Good research proposals often tie into existing work to gain widespread adoption while inspiring novel future research.
IV. Technical Details (Recommended)
- In this section, the authors should describe the implementation details of the proposal and show that it is practical to carry out.
- If your proposal focuses on a synthetic environment / platform / playground for future trustworthiness research, we recommend demonstrating details including (but not limited to) the number and diversity of supported scenarios,
- If your proposal focuses on a trustworthy benchmark (either newly constructed or based on an existing agent environment), we recommend providing a comprehensive analysis (e.g., a leaderboard) of existing agent solutions / applications.
V. Preliminary or Major Experimental Results (Optional, if available)
- If you have already experimented with your proposal for trustworthy agents, we strongly encourage you to show the key preliminary or major results of your proposed agent environment / benchmark / solution.
- Both quantitative and qualitative results are encouraged. For trustworthy AI agent applications / solutions, one can provide quantitative comparisons to show the effectiveness of the proposed methods. For trustworthy agent benchmarks, one can show quantitative results for existing baseline methods. For trustworthy agent environments, one can present critical qualitative results to show the effectiveness of the proposed environments.
VI. Conclusion and Relevance to Future Work
- Current tractability analysis. Benchmarks should currently or soon be tractable for existing models while posing a meaningful challenge. A significant research effort should be required to achieve near-maximum performance.
- Performance ceiling. Provide an estimate of maximum performance. For example, what would expert human-level performance be? Is it possible to achieve superhuman performance?
- Barriers to entry. List factors that might make it harder for researchers to use your benchmark. Keep in mind that if barriers to entry trade off against relevance, you should generally prioritize the latter.
1. How large do models need to be to perform well?
2. How much context is required to understand the task?
3. How difficult is it to adapt current training architectures to the dataset?
4. Is third-party software (e.g., games, modeling software, simulators) or a complex set of dependencies required for training and/or evaluation?
5. Is unusual hardware required (e.g., robotics, multi-GPU training setups)?
6. Do researchers need to learn a new program or programming language to use the dataset (e.g., Coq, AnyLogic)?
VII. References
- As usual.
Submission Guideline
Authors are invited to submit short papers of up to 5 pages, with an unlimited number of pages for references and appendices (after the bibliography); however, reviewers are not required to read the appendix. We will select the top 5 outstanding challenge proposals as oral presentations and award the winning prizes. Outstanding papers will be published publicly on OpenReview.
The reviewing process will be double-blind. Please submit an anonymized version of your paper that contains no identifying information about author identities or affiliations. Submitted papers must be new work that has not yet been published in a peer-reviewed conference or journal. During submission, we will provide a checkbox to indicate whether at least one author can join the onsite workshop. Such submissions will receive expedited reviews to allow enough time for visa applications.
Submission format: ICML Style Files (during the review stage, please use the anonymous version; for camera-ready proposals, please use the final version and list the authors).
Submission link: OpenReview
Policies
Participants:
You are eligible to submit as an individual, on behalf of an organization, or from a for-profit or not-for-profit entity; we are impartial as to your affiliation (or lack thereof).
Deadlines:
Proposal submission deadlines are strict. In no circumstances will extensions be given.
Double-Blind Review (For Agent Trustworthy Track):
All submissions must be anonymized and may not contain any information with the intention or consequence of violating the double-blind reviewing policy, including (but not limited to) citing previous works of the authors or sharing links in a way that could reveal any author's identity or institution, or any other action that reveals the identities of the authors to potential reviewers.
Authors are allowed to post versions of their work on preprint servers such as arXiv. They are also allowed to give talks to restricted audiences on the work(s) submitted to our challenge during the review. If you have posted or plan to post a non-anonymized version of your paper online before the ICML decisions are made, the submitted version must not refer to the non-anonymized version.
Dual Submission:
It is not appropriate to submit research proposals that are identical (or substantially similar) to versions that have been previously published, accepted for publication, or submitted in parallel to other conferences or journals. Such submissions violate our dual submission policy, and the organizers have the right to reject such submissions, or to remove them from the proceedings. Note that submissions that have been or are being presented at workshops do not violate the dual-submission policy, as long as there’s no associated archival publication.
Reviewing Criteria:
Each proposal will be evaluated by the judges according to the criteria outlined below. Prizes will be awarded to the proposals which score the best according to the aggregate evaluations of the judges. Accepted research proposals must be based on original research and must contain novel results of significant interest to the machine learning community. Results can be either theoretical or empirical. Results will be judged on the degree to which they have been objectively established and/or their potential for scientific and technological impact. Reproducibility of results and easy availability of code will be taken into account in the decision-making process whenever appropriate.
Ethics:
See the general policy of this challenge for details.
References:
- [1] CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society
- [2] MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework
- [3] AutoGPT
- [4] RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
- [5] Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast. arXiv:2402.08567
- [6] PaLM-E: An Embodied Multimodal Language Model
- [7] Octopus: Embodied Vision-Language Programmer from Environmental Feedback
- [8] SmartPlay: A Benchmark for LLMs as Intelligent Agents
Frequently Asked Questions
Will the submission be public?
Yes. After acceptance, your submission will be public as a workshop paper.
What information should I include in the paper?
See the topic and example format section.
Can I participate in a team?
Of course. We encourage researchers to collaborate in this work. For winning teams, prizes will be divided evenly among the lead authors unless requested otherwise.
May I submit a paper that has been published in conferences or journals?
No.
Other questions?
Please contact tifaattack9@gmail.com with your questions; we will respond in a timely manner. New (non-duplicate) questions and their responses will be added to the FAQ.