
EU Compliance Checker Exposes Major Flaws In Google and OpenAI AI Models

Published
James Morales

Key Takeaways

  • A new benchmarking framework measures how AI models live up to the requirements of the EU’s AI Act.
  • GPT-4 Turbo performed the best overall of the models evaluated, while Llama 2-7B scored the worst.
  • However, COMPL-AI revealed significant shortcomings in fairness and non-discrimination across all models.

The EU’s AI Act outlines the regulatory requirements of responsible AI development. However, it lacks a clear technical interpretation, making it difficult to assess whether models comply.

To solve this problem, AI researchers have developed COMPL-AI, a new benchmarking framework that highlights potential shortcomings in several popular models, including those from leading developers like Google and OpenAI.

Assessing AI Act Compliance

To assess AI Act compliance, the researchers started with the regulation’s six organizing principles, extrapolating from each a series of technical requirements.

For example, the meta-requirement for transparency can be subdivided into aspects such as the interpretability of AI processes and the traceability of model outputs.

To measure how well AI models meet these technical requirements, COMPL-AI consists of a suite of benchmarks, which the researchers used to assess 12 popular AI models. For comparability, the framework translates the benchmarks’ different scoring schemes onto a common 0–1 scale.
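The aggregation idea described above can be sketched in a few lines of Python. This is a hypothetical illustration, not COMPL-AI’s actual code: the benchmark names, scales, and scores below are invented, and the normalization shown is a simple min–max rescaling.

```python
# Illustrative sketch: benchmark results reported on different native
# scales are normalized to [0, 1], then averaged into a single score
# per model. All names and numbers are invented for illustration.

def normalize(value, low, high):
    """Map a raw score from its native [low, high] scale onto [0, 1]."""
    return (value - low) / (high - low)

# Raw scores as (value, scale_min, scale_max) tuples (hypothetical data).
raw_scores = {
    "model-a": {"toxicity_avoidance": (8.5, 0, 10), "robustness": (62, 0, 100)},
    "model-b": {"toxicity_avoidance": (9.4, 0, 10), "robustness": (39, 0, 100)},
}

averages = {}
for model, benchmarks in raw_scores.items():
    normalized = [normalize(v, lo, hi) for v, lo, hi in benchmarks.values()]
    averages[model] = sum(normalized) / len(normalized)

print(averages)  # one aggregate 0-1 score per model
```

Averaging in this way is what makes headline figures like “0.81 overall” possible even when the underlying benchmarks measure very different things.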

How Different Models Fared

The primary observation from the COMPL-AI framework is that no model achieves perfect marks. However, averaging scores across the different criteria reveals clear winners and losers.

At the top of the table, GPT-4 Turbo achieved an average score of 0.81 across the different benchmarks. 

Meta’s Llama 2-7B scored 0.67 overall, with especially poor performance on cyber attack resilience and AI bias.

Of the broad categories, cyber attack resilience showed the most variance, with scores ranging from 0.39 (Llama 2-7B) to 0.8 (Claude 3 Opus).

One weakness observed in cyber resilience was the models’ susceptibility to goal hijacking and prompt leakage.

When evaluated using a TensorTrust-based benchmark, only Anthropic’s Claude demonstrated high compliance, while most other models performed badly.

Poor Performance in Fairness

The category where models fared the worst across the board was fairness/absence of discrimination.

Llama 2-70B was deemed the most compliant in this respect, scoring 0.63. Qwen 1.5-72B had the lowest score at 0.37, followed by GPT-3.5 Turbo with 0.46.

“Almost all examined models struggle with diversity, non-discrimination, and fairness,” the COMPL-AI researchers noted.

“A likely reason for this is the disproportional focus on model capabilities, at the expense of other relevant concerns,” they said, adding that they expect the AI Act “will influence providers to shift their focus accordingly.”

High Compliance on Copyright and Harmful Content Requirements

Two areas where all models performed well were their ability to avoid copyright infringement and toxic outputs.

Article 53 (1c) of the AI Act states that language model outputs must not infringe upon intellectual property rights.

When assessed for their adherence to this requirement, GPT-4 Turbo and Claude 3 Opus achieved perfect scores, indicating that they didn’t output any copyrighted materials. No model scored lower than 0.98, indicating a high degree of compliance.

The notion of harmful content is derived from the sixth principle of the AI Act, which states that AI systems should be developed “in a way to benefit all human beings while monitoring and assessing the long-term impacts on the individual, society, and democracy.”

Although no model was completely free of toxic outputs, their prevalence was extremely low, with scores in this category ranging from 0.96 to 0.98.

Although his background is in crypto and FinTech news, these days, James likes to roam across CCN’s editorial breadth, focusing mostly on digital technology. Having always been fascinated by the latest innovations, he uses his platform as a journalist to explore how new technologies work, why they matter and how they might shape our future.