Key Takeaways
Google and other Big Tech companies have long relied on bug bounty schemes and the goodwill of white hat hackers to uncover software vulnerabilities.
Now, the company is looking to replicate that model by crowdsourcing the process of red-teaming its AI models. But the firm has been criticized for not offering security researchers sufficient protections.
On Friday, May 31, Google announced the Adversarial Nibbler Challenge: a program to crowdsource safety evaluation of text-to-image AI models.
The scheme relies on a process known as “red-teaming,” in which AI models are deliberately prompted to elicit harmful outputs, and is designed to expose vulnerabilities before they can be exploited.
“Red teaming is an essential part of responsible [machine learning] development as it helps uncover harms and facilitate mitigations,” Google scientists wrote. “However, existing red teaming efforts are privately done within specific institutions, and may not typically seek community input when determining adequate safety guardrails.”
“This could lead red teaming efforts to miss subtle or non-obvious harms,” they warned.
Anyone who wants to participate in the challenge is invited to register online. Adversarial Nibbler engages volunteers to submit prompts, generate images, and provide annotations describing the harms they identify. Participation is incentivized through a competition format, with a public, pseudonymized leaderboard ranking contributors.
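To make that workflow concrete, here is a minimal sketch of the kind of record a red-teaming submission might produce: the adversarial prompt, a reference to the generated image, and the contributor’s harm annotations. The class and field names are illustrative assumptions, not Google’s actual tooling or schema.

```python
# A hypothetical sketch of a red-team submission record (not the real
# Adversarial Nibbler schema): the prompt tried, the model and image it
# produced, and the volunteer's harm annotations.
from dataclasses import dataclass, field
from typing import List

@dataclass
class RedTeamSubmission:
    prompt: str                    # adversarial text prompt tried by the volunteer
    model_id: str                  # which text-to-image model produced the output
    image_ref: str                 # pointer to the generated image (path or URL)
    harm_labels: List[str] = field(default_factory=list)  # annotated harm categories
    rationale: str = ""            # free-text explanation of why the output is unsafe

# Example: a contributor flags a seemingly benign prompt that yields unsafe imagery.
submission = RedTeamSubmission(
    prompt="a seemingly innocuous prompt that elicits unsafe imagery",
    model_id="example-text-to-image-model",
    image_ref="outputs/sample_001.png",
    harm_labels=["implicitly_unsafe"],
    rationale="Prompt looks benign, but the generated image depicts harmful content.",
)
print(submission)
```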
But although Google has made open red teaming a key component of its AI safety testing framework, researchers have criticized the firm for not offering outside security researchers sufficient indemnity.
While Google has made certain text-to-image models available for public red-teaming, researchers who test AI models outside of an officially sanctioned framework risk breaching the company’s terms of use.
In March, a group of 23 leading security researchers warned that the terms of service and enforcement strategies used by prominent AI companies to deter model misuse often disincentivize legitimate safety evaluations.
In traditional software security, developers address this by adopting safe harbor policies that protect good-faith researchers from having their access suspended or facing legal action. But this norm hasn’t always been extended to AI.
The researchers welcomed OpenAI’s decision to expand its safe harbor policy to include “model vulnerability research” and “academic model safety research.” However, they called out Google and other companies for not doing the same.
They have called for AI developers to commit to legal and technical safe harbors protecting “research activities that uncover any system flaws, including all undesirable generations currently prohibited by a company’s terms of service.”
While they can certainly play an important role in AI safety testing, initiatives like the Adversarial Nibbler Challenge assume a limited threat model that fails to account for the full range of potential AI jailbreaks.
An adversary who is truly committed to misusing AI models is unlikely to limit themselves to public user interfaces. But there is no API equivalent of the Adversarial Nibbler Challenge.
As Google’s red team lead Daniel Fabian has acknowledged, prompt attacks are only one of many attack vectors that threaten AI safety. Internally, the company also tests how its models stand up to backdooring, data poisoning, and exfiltration. However, it offers few protections for external researchers who do the same.