
Growing Ethical Data Drought Could Derail the Next Leap in AI Development

By Max Li
Edited by Samantha Dunn

Key Takeaways

  • The future of AI hinges on access to high-quality, ethically sourced data.
  • Synthetic data is on the rise, but comes with risks like hallucinations and reliability concerns.
  • Decentralized data systems offer a transparent, scalable, and inclusive solution for the AI industry’s growing data needs.

The world is obsessed with the competition for AI chipsets. Nvidia is arguably the market leader, but competition from startups and other chip companies is growing as artificial intelligence continues to develop.

Yet a less talked-about and equally important front is opening up: the data drought. Elon Musk recently said AI companies have used up the “cumulative sum of human knowledge” to train their AI models.

This could mean that AI companies will have to rely on “synthetic data,” data created by AI models rather than gathered from the real world, to build and refine new systems.

As we enter 2025, a data war, not a chip war, will be the challenge for AI’s future.

The Scramble for Data and its Effects

In 2023, visual artists filed a major lawsuit against Stability AI, MidJourney, and DeviantArt, accusing them of using their artwork without permission to train AI models like Stable Diffusion.

The issue is that AI needs to learn. It needs reference points before it can create outputs for end-users, whether those outputs are images, voices, sounds, or information.

The easiest way to gain these reference points is to go into the wild and use what is readily available on the Internet.

Around the same time as the lawsuit, Elon Musk sued AI companies for “scraping” data from platforms like X.com without permission, resulting in tighter API access and higher costs.

Reddit also significantly raised its API prices, disrupting AI companies like OpenAI and Anthropic, which had been using Reddit’s vast user-generated content.

These are just a few examples of a bigger problem: the pool of legal and ethical data is drying up. So, how can AI systems grow and improve when the data they rely on becomes harder to get?

Unlike the hardware race, which is about building more powerful processors, the data war is about securing the right datasets to train AI. Without enough diverse, high-quality data, even the best AI hardware is useless.

Big companies like Google and Microsoft can afford to buy data from centralized entities, even if it’s expensive. Smaller companies can’t.

This is creating a growing gap between big corporations and smaller players in the AI industry, making it harder for smaller businesses to keep up.

So, how can data be collected in a way that is ethical, legal, and sustainable while still supporting AI innovation?

Ethical Data Collection Strategies

Data collection is challenging. It involves determining who controls the data pipelines and how to make the process fair and transparent. Over the years, many approaches have been tried, each with its pros and cons.

Some institutions, such as Harvard, have started projects to collect user data with explicit consent. These open-access datasets are available for research and public use. While these efforts are valuable, they don’t meet the scale of commercial AI development.

As outlined earlier, another approach is synthetic data, generated by AI itself, which has become a popular alternative.

Companies like Meta and Microsoft use synthetic data to fine-tune their AI models, such as Llama and Phi-4. Google and OpenAI also use synthetic data in their projects.

However, synthetic data is not a perfect solution. Issues such as model hallucinations, where AI generates inaccurate or misleading information, can affect the reliability of these datasets.

To illustrate this in a financial context, a hallucinating model might claim that a certain stock fell by 20% over the past month when its price is actually up 5%.

This is especially an issue when AI models don’t have access to the data they need and try to fill in the gaps with whatever information seems most plausible.
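To make this concrete, here is a minimal sketch of how such a claim could be checked against actual market data before it is trusted. The price history, claim format, and tolerance are hypothetical illustrations, not any particular provider’s API.

```python
# A minimal sketch of "grounding" a model's numeric claim against real data.
# The claim value and price history are made-up placeholders.

def actual_monthly_change(prices: list[float]) -> float:
    """Percent change from the first to the last price in the series."""
    return (prices[-1] - prices[0]) / prices[0] * 100

def check_claim(claimed_change_pct: float, prices: list[float], tolerance: float = 1.0) -> bool:
    """Return True if the model's claimed change matches the observed data."""
    return abs(claimed_change_pct - actual_monthly_change(prices)) <= tolerance

# The model claims the stock fell 20%, but the (made-up) price history shows
# it actually rose about 5%, so the claim is flagged as a likely hallucination.
price_history = [100.0, 98.5, 101.2, 103.9, 105.0]
print(check_claim(-20.0, price_history))  # False: claim contradicts the data
```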

A better alternative for collecting data? Decentralization.

In this approach, individuals voluntarily share their data, and the transactions are recorded on blockchain systems to ensure transparency and authenticity.

Contributors are rewarded with cryptocurrency, which is well suited to the small cross-border payments that traditional payment systems handle poorly.
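As a rough illustration of the mechanics described above, the sketch below packages a single contribution into a record that could be anchored on a ledger. The field names, reward value, and use of SHA-256 are assumptions for the example, not the schema of any real decentralized data network.

```python
# A sketch of a data-contribution record for a hypothetical decentralized
# data marketplace: the raw data stays with the contributor, only its
# fingerprint and provenance metadata are recorded.
import hashlib
import json
import time

def make_contribution_record(contributor_id: str, data: bytes, reward_tokens: float) -> dict:
    """Fingerprint the contributed data and package it with provenance metadata."""
    return {
        "contributor": contributor_id,
        "data_sha256": hashlib.sha256(data).hexdigest(),  # fingerprint, not the raw data
        "timestamp": int(time.time()),
        "reward_tokens": reward_tokens,
    }

record = make_contribution_record("contributor-42", b"consented sensor readings...", 1.5)
print(json.dumps(record, indent=2))
```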

Decentralized systems can address the problems of data diversity, quality, and trustworthiness. They also level the playing field, letting smaller companies access valuable data without needing massive budgets.

The Importance of High-Quality Data

Data quality is just as important as data quantity. Poor-quality data can lead to biased AI models, inaccurate predictions, and a loss of trust in AI systems. To address this, companies use several strategies to ensure data quality.

They often use advanced validation methods to remove errors and inconsistencies from datasets. This may involve a mix of human oversight and automated tools to make the data more reliable.
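A simplified sketch of what the automated side of such validation might look like: dropping exact duplicates, flagging missing fields, and rejecting implausible values. The schema (an “age” field bounded by 0 and 120) is a made-up example.

```python
# Basic automated dataset validation: deduplicate, catch missing values,
# and reject out-of-range entries before the data is used for training.

def validate_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (clean, rejected) based on simple quality checks."""
    seen, clean, rejected = set(), [], []
    for rec in records:
        key = tuple(sorted(rec.items()))
        if key in seen:                                  # exact duplicate
            rejected.append(rec)
        elif any(v is None for v in rec.values()):       # missing field
            rejected.append(rec)
        elif not (0 <= rec.get("age", 0) <= 120):        # implausible value
            rejected.append(rec)
        else:
            seen.add(key)
            clean.append(rec)
    return clean, rejected

clean, rejected = validate_records([
    {"age": 34, "country": "DE"},
    {"age": 34, "country": "DE"},    # duplicate
    {"age": None, "country": "US"},  # missing value
    {"age": 212, "country": "JP"},   # out of range
])
print(len(clean), len(rejected))  # 1 3
```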

Bias is another big problem. For example, in healthcare, datasets must include data from a wide range of demographics to avoid creating models that produce biased or unfair outcomes.

Many organizations also follow established standards such as ISO/IEC 27001, an information security management standard, to protect data and comply with global ethical guidelines.

Crowdsourcing platforms are used for tasks like data labeling and verification. However, these platforms need to be monitored carefully to ensure consistency and accuracy.

Decentralized verification methods such as blockchain records are gaining popularity as a way to certify data authenticity and prevent tampering. This builds trust in the origin and reliability of the data.
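The underlying idea is simple to illustrate: record a dataset’s cryptographic fingerprint once, then recompute and compare it before the data is used. In the sketch below the “ledger” is just an in-memory dictionary standing in for an on-chain record.

```python
# Tamper detection via fingerprint comparison. A real system would anchor
# the hash on a blockchain; here a plain dict plays that role.
import hashlib

ledger: dict[str, str] = {}

def register_dataset(name: str, content: bytes) -> None:
    ledger[name] = hashlib.sha256(content).hexdigest()

def is_untampered(name: str, content: bytes) -> bool:
    return ledger.get(name) == hashlib.sha256(content).hexdigest()

register_dataset("clinical-notes-v1", b"original, consented records")
print(is_untampered("clinical-notes-v1", b"original, consented records"))  # True
print(is_untampered("clinical-notes-v1", b"records with one edit"))        # False
```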

Governments also have a big role to play in ensuring data quality. They need to balance individual privacy rights with technological innovation.

This means addressing cybersecurity risks, protecting sensitive information, and preventing data misuse by foreign entities or adversaries.

Competition Shifting From Hardware to Data

The continued need for high-quality data will affect many industries. In healthcare, for example, better access to good patient data could lead to big breakthroughs in diagnostics and treatment planning.

However, strict privacy laws are a major obstacle. AI could also change everything in music, from composing new songs to enforcing copyright laws, but only if intellectual property is respected.

These challenges underscore how important decentralized systems that prioritize transparency, data quality, and accessibility will be.

By using decentralized models, the AI industry can create a more level playing field where individuals control their data, businesses have access to good datasets, and innovation can happen without compromising privacy or security.

The stakes will only increase as the scramble for viable data heats up. Ethical data collection, high data quality, and broad access to data are necessary for the future of AI.

Disclaimer: The views, thoughts, and opinions expressed in the article belong solely to the author, and not necessarily to CCN, its management, employees, or affiliates. This content is for informational purposes only and should not be considered professional advice.
About the Author

Max Li

Max Li is the founder and CEO of OORT. Li is also a faculty member in the department of electrical engineering at Columbia University and holds more than 200 international and U.S. patents. He has published many academic papers in top-ranking journals and is the author of the book Reinforcement Learning for Cyber-physical Systems.