Key Takeaways
To ensure chatbots don’t provide undesirable responses, AI developers use a process known as red-teaming, which involves intentionally trying to elicit dangerous or offensive answers before models are released to the public. Although it has become an integral part of chatbot security testing, the process remains slow and laborious.
However, researchers recently made a breakthrough in automatic red-teaming that could significantly boost efforts to program safer chatbots.
A red team refers to a group of testers who intentionally try to break or compromise a system to discover potential vulnerabilities.
In the context of cybersecurity, red teams are made up of ethical hackers who identify weaknesses that might otherwise be exploited by malicious individuals or groups. The same approach can also be used to test how well physical security systems withstand attack.
When it comes to modern foundation models, what is being guarded is the dangerous information, personal data, or toxic content that an AI could potentially reveal.
Without appropriate safeguards, platforms like ChatGPT could be powerful tools for hate, fraud, or even terrorism. Red teams are therefore tasked with eliciting undesirable chatbot responses using tactics such as code injection or prompt engineering.
The strategy has been endorsed by the White House, which has encouraged tech firms to utilize external red teams to identify and mitigate AI risks.
Given the number of potentially toxic responses that chatbot red teams must screen for, relying solely on human testers is expensive and time-consuming.
Efforts to automate the process have centered on building dedicated large language models (LLMs) that can do the same work as human red teams.
However, as good as LLMs are at answering questions, they haven’t traditionally excelled at asking them.
To overcome this challenge, a group of researchers recently applied a machine-learning framework known as curiosity-driven exploration (CDE) to automated red-teaming.
Modern AI agents are very good at finding and compiling information from a known body of data. But place them in an unknown environment and they will likely stumble at the first hurdle.
CDE sits within a growing field of AI research focused on building more adaptable AI models better suited to real-world problem-solving and learning new skills.
Unlike traditional machine learning approaches that iteratively train agents toward an externally defined goal, CDE encourages something like human curiosity: the agent earns an intrinsic reward for exploring novel states, treating the acquisition of knowledge as a worthy endeavor in and of itself. The framework is especially suited to objectives that require many steps to achieve, but where it isn’t always obvious which steps lead in the right direction.
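To make the idea concrete, here is a minimal sketch of how a curiosity-style reward can be constructed. It assumes a simple count-based novelty bonus as the stand-in for "curiosity"; this is an illustrative toy, not the specific formulation used in any particular paper.

```python
# Minimal sketch of a curiosity-style reward (illustrative only).
# The idea: combine the task reward with an intrinsic bonus that is
# high for novel states and shrinks as the same state is revisited.

from collections import defaultdict
import math


class CuriosityReward:
    def __init__(self, bonus_weight: float = 0.5):
        self.visit_counts = defaultdict(int)   # how often each state was seen
        self.bonus_weight = bonus_weight       # trade-off: task reward vs. novelty

    def __call__(self, state_key: str, task_reward: float) -> float:
        self.visit_counts[state_key] += 1
        # Count-based novelty bonus: large for rarely seen states,
        # decaying toward zero as the state keeps reappearing.
        novelty_bonus = 1.0 / math.sqrt(self.visit_counts[state_key])
        return task_reward + self.bonus_weight * novelty_bonus


# Example: the second visit to the same state earns a smaller bonus.
reward_fn = CuriosityReward()
print(reward_fn("state_A", task_reward=0.0))  # 0.5  (full novelty bonus)
print(reward_fn("state_A", task_reward=0.0))  # ~0.35 (bonus has decayed)
```

Because the bonus decays with repetition, an agent trained on this signal is pushed toward states it hasn’t seen before, even when the external task reward offers no immediate guidance.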
Developing a technique they called “curiosity-driven red-teaming,” researchers at MIT managed to build an AI model that successfully provoked undesirable responses from a LLaMA-2 model that had been trained to avoid toxic outputs.
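Conceptually, such a setup rewards the red-team model both for provoking undesirable responses and for trying prompts unlike those it has already used. The following pseudocode sketches one plausible loop; the function names (red_team_model, target_model, toxicity_score, similarity_to_previous) are hypothetical stand-ins, not the MIT team’s actual implementation.

```python
# Illustrative sketch of a curiosity-driven red-teaming step.
# All callables passed in are hypothetical placeholders.

def red_team_step(red_team_model, target_model, toxicity_score,
                  similarity_to_previous, history, novelty_weight=0.5):
    # 1. The red-team model proposes a test prompt.
    prompt = red_team_model.generate()

    # 2. The target chatbot answers it.
    response = target_model.generate(prompt)

    # 3. Score how undesirable the response is (e.g., via a toxicity classifier).
    toxicity = toxicity_score(response)

    # 4. Curiosity term: reward prompts that differ from ones already tried,
    #    so the red-team model keeps exploring new attack strategies instead
    #    of repeating the first one that works.
    novelty = 1.0 - similarity_to_previous(prompt, history)

    reward = toxicity + novelty_weight * novelty
    history.append(prompt)

    # 5. The reward would then be used to update the red-team model (not shown),
    #    for example with a policy-gradient method.
    return prompt, response, reward
```

The key design choice is the novelty term: without it, an automated red team tends to collapse onto a handful of prompts that reliably score well, leaving large parts of the target model’s failure surface untested.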
While curiosity-driven red-teaming is unlikely to completely replace human red teams, it could make the process much more efficient, serving as an extra line of defense.