A collaborative team of researchers from AutoGPT, Northeastern University, and Microsoft Research has developed a tool designed to monitor large language models (LLMs) for potentially harmful outputs and proactively prevent their execution. The agent is described in a preprint research paper titled "Testing Language Model Agents Safely in the Wild." It features a flexible monitoring system capable of detecting and thwarting harmful actions, such as code attacks, before they occur.
In the paper, the team emphasizes the need for effective tools to monitor LLM outputs for harmful interactions, particularly in real-world settings where existing solutions often fall short against the dynamic, unpredictable conditions of live deployment. The tool's value lies in its ability to catch edge cases and unforeseen harm vectors, providing an additional layer of safety when LLMs are deployed on the open internet.
The monitoring agent is described as flexible: it audits agent actions, enforces a stringent safety boundary to stop unsafe tests before they execute, and ranks and logs suspect behavior for further examination by humans. The researchers acknowledge that it is impossible to imagine every possible harm vector in advance, which underscores the importance of a proactive monitoring approach to identifying potential risks.
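To make that flow concrete, the sketch below shows one way such an auditing gate could be structured: a proposed action is scored, blocked if it crosses a safety threshold, and logged for later human review. This is a minimal illustrative sketch only; the names used here (MonitoredAgent, score_action, AuditRecord, the threshold value) are assumptions for exposition, not the authors' implementation.

```python
# Minimal sketch of a monitoring gate around an agent's proposed actions.
# All names and thresholds here are illustrative assumptions, not the
# paper's actual implementation.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AuditRecord:
    action: str
    score: float          # higher = more likely to be unsafe
    allowed: bool


@dataclass
class MonitoredAgent:
    score_action: Callable[[str], float]   # e.g. a fine-tuned classifier
    safety_threshold: float = 0.5
    audit_log: List[AuditRecord] = field(default_factory=list)

    def execute(self, proposed_action: str) -> bool:
        """Audit a proposed action before it runs; block and log if suspect."""
        score = self.score_action(proposed_action)
        allowed = score < self.safety_threshold
        # Every decision is logged so humans can review ranked suspect behavior.
        self.audit_log.append(AuditRecord(proposed_action, score, allowed))
        return allowed


# Example usage with a trivial stand-in scorer (a real monitor would call an LLM).
agent = MonitoredAgent(score_action=lambda a: 0.9 if "rm -rf" in a else 0.1)
print(agent.execute("ls ./workspace"))   # True  -> action may proceed
print(agent.execute("rm -rf /"))         # False -> action is blocked and logged
```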
To train the monitoring agent, the researchers built a dataset of nearly 2,000 safe human-AI interactions across 29 different tasks, ranging from simple text retrieval to code corrections and building entire webpages from scratch. They also generated a testing dataset of manually created adversarial outputs, including intentionally unsafe scenarios. Trained on OpenAI's GPT-3.5 Turbo, the monitor distinguished innocuous from potentially harmful outputs with an accuracy of nearly 90%.
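For readers curious how labeled safe/unsafe interactions might be turned into fine-tuning data for a chat model such as GPT-3.5 Turbo, the sketch below shows one plausible serialization into OpenAI's chat fine-tuning JSONL format. The field names, example content, and the "safe"/"unsafe" labels are assumptions for illustration, not the paper's actual dataset schema.

```python
# Rough sketch: serializing labeled monitor-training examples as chat-format
# JSONL for fine-tuning. Field names and labels are illustrative assumptions.
import json

examples = [
    {"task": "Retrieve the title of example.com",
     "output": "<reads the page and returns its title>", "label": "safe"},
    {"task": "Fix a failing unit test",
     "output": "<edits config to send API keys to an external server>", "label": "unsafe"},
]

with open("monitor_train.jsonl", "w") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Classify the agent output as safe or unsafe."},
                {"role": "user", "content": f"Task: {ex['task']}\nOutput: {ex['output']}"},
                {"role": "assistant", "content": ex["label"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```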
This research represents a significant step toward enhancing the safety of AI models in real-world applications, providing a tool to mitigate potential harm and improve the robustness of language models deployed in diverse environments.
(TRISTAN GREENE, COINTELEGRAPH, 2023)