Key Takeaways
To ensure chatbots don’t provide undesirable responses, AI developers use a process known as red-teaming, which involves intentionally trying to elicit dangerous or offensive answers before models are released to the public. Although it has become an integral part of chatbot security testing, the process remains slow and laborious.
However, researchers recently made a breakthrough in automatic red-teaming that could significantly boost efforts to program safer chatbots.
A red team refers to a group of testers who intentionally try to break or compromise a system to discover potential vulnerabilities.
In the context of cybersecurity, red teams are made up of ethical hackers who identify weaknesses that might otherwise be exploited by malicious individuals or groups. The same approach can also be used to test how well physical security systems withstand attack.
When it comes to modern foundation models, what is being guarded is the dangerous information, personal data, or toxic content that an AI could potentially reveal.
Without appropriate safeguards, platforms like ChatGPT could be powerful tools for hate, fraud, or even terrorism. Red teams are therefore tasked with eliciting undesirable chatbot responses using tactics such as code injection or prompt engineering.
The strategy has been endorsed by the White House, which has encouraged tech firms to utilize external red teams to identify and mitigate AI risks.
Given the number of potentially toxic responses that chatbot red teams must screen for, relying solely on human testers is expensive and time-consuming.
Efforts to automate the process have centered on building dedicated large language models (LLMs) that can do the same work as human red teams.
However, as good as LLMs are at answering questions, they haven’t traditionally excelled at asking them.
To overcome this challenge, a group of researchers recently applied a machine-learning framework known as curiosity-driven exploration (CDE) to automated red-teaming.
Modern AI agents are very good at finding and compiling information from a known body of data. But place them in an unknown environment and they will likely stumble at the first hurdle.
CDE sits within a growing field of AI research focused on building more adaptable AI models better suited to real-world problem-solving and learning new skills.
Unlike traditional machine learning approaches that iteratively train agents toward an externally defined goal, CDE encourages something like human curiosity: the agent earns an intrinsic reward for exploring novel states, treating the acquisition of knowledge as a worthy endeavor in and of itself. The framework is especially suited to objectives that require many steps to achieve, but where it isn’t always obvious which steps lead in the right direction.
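To make the idea concrete, here is a minimal sketch of how a curiosity-style reward can be constructed. It assumes a simple count-based novelty bonus as the stand-in for "curiosity"; this is an illustrative toy, not the specific formulation used in any particular paper.

```python
# Minimal sketch of a curiosity-style reward (illustrative only).
# The idea: combine the task reward with an intrinsic bonus that is
# high for novel states and shrinks as the same state is revisited.

from collections import defaultdict
import math


class CuriosityReward:
    def __init__(self, bonus_weight: float = 0.5):
        self.visit_counts = defaultdict(int)   # how often each state was seen
        self.bonus_weight = bonus_weight       # trade-off: task reward vs. novelty

    def __call__(self, state_key: str, task_reward: float) -> float:
        self.visit_counts[state_key] += 1
        # Count-based novelty bonus: large for rarely seen states,
        # decaying toward zero as the state keeps reappearing.
        novelty_bonus = 1.0 / math.sqrt(self.visit_counts[state_key])
        return task_reward + self.bonus_weight * novelty_bonus


# Example: the second visit to the same state earns a smaller bonus.
reward_fn = CuriosityReward()
print(reward_fn("state_A", task_reward=0.0))  # 0.5  (full novelty bonus)
print(reward_fn("state_A", task_reward=0.0))  # ~0.35 (bonus has decayed)
```

Because the bonus decays with repetition, an agent trained on this signal is pushed toward states it hasn’t seen before, even when the external task reward offers no immediate guidance.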
Developing a technique they called “curiosity-driven red-teaming,” researchers at MIT managed to build an AI model that successfully provoked undesirable responses from a LLaMA-2 model that had been trained to avoid toxic outputs.
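Conceptually, such a setup rewards the red-team model both for provoking undesirable responses and for trying prompts unlike those it has already used. The following pseudocode sketches one plausible loop; the function names (red_team_model, target_model, toxicity_score, similarity_to_previous) are hypothetical stand-ins, not the MIT team’s actual implementation.

```python
# Illustrative sketch of a curiosity-driven red-teaming step.
# All callables passed in are hypothetical placeholders.

def red_team_step(red_team_model, target_model, toxicity_score,
                  similarity_to_previous, history, novelty_weight=0.5):
    # 1. The red-team model proposes a test prompt.
    prompt = red_team_model.generate()

    # 2. The target chatbot answers it.
    response = target_model.generate(prompt)

    # 3. Score how undesirable the response is (e.g., via a toxicity classifier).
    toxicity = toxicity_score(response)

    # 4. Curiosity term: reward prompts that differ from ones already tried,
    #    so the red-team model keeps exploring new attack strategies instead
    #    of repeating the first one that works.
    novelty = 1.0 - similarity_to_previous(prompt, history)

    reward = toxicity + novelty_weight * novelty
    history.append(prompt)

    # 5. The reward would then be used to update the red-team model (not shown),
    #    for example with a policy-gradient method.
    return prompt, response, reward
```

The key design choice is the novelty term: without it, an automated red team tends to collapse onto a handful of prompts that reliably score well, leaving large parts of the target model’s failure surface untested.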
While curiosity-driven red-teaming is unlikely to completely replace human red teams, it could make the process much more efficient, serving as an extra line of defense.