Key Takeaways
After years of intensive development, the world’s preeminent AI developers have all but exhausted their supplies of training data. Having already chewed through nearly every publicly available resource they can get their hands on, companies including Meta, Microsoft and Google are increasingly turning to synthetic data to feed their models’ insatiable appetites.
While the technique is not without its risks, proponents of synthetic data argue that it is needed to keep the field moving forward without violating people’s privacy.
Protected by privacy policies, the personal data held by public and private organizations remains one of the greatest untapped data resources in the world.
Imagine, for example, an AI developer trying to build a system that could identify someone’s age by scanning their face. To do this, they would first need a database of photos tagged with each subject’s age. But for obvious privacy reasons, where such databases exist (in national driving license registries, for example), access is strictly controlled.
In the context of AI training, synthetic data reproduces the statistical properties of a real dataset without copying any individual record.
For the hypothetical face scanner, a synthetic dataset would need to be statistically similar enough to the real one to be useful, but different enough not to disclose anyone’s personal information.
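To make that idea concrete, here is a minimal sketch, not any company’s actual pipeline, of one simple statistical approach: learn only aggregate statistics from a sensitive dataset, then sample entirely new records from them. The fabricated “real” data, the column meanings and the Gaussian model are illustrative assumptions; production synthetic-data generators use far richer models and typically add formal privacy guarantees such as differential privacy.

```python
import numpy as np

# Hypothetical "real" records: pairs of (age, face-embedding score) that
# could not be shared directly for privacy reasons.
rng = np.random.default_rng(seed=0)
real_data = rng.multivariate_normal(
    mean=[35.0, 0.6],                 # average age and average embedding score
    cov=[[120.0, 1.5], [1.5, 0.04]],  # spread and correlation of the real records
    size=10_000,
)

# Step 1: learn only aggregate statistics from the real data.
mean = real_data.mean(axis=0)
cov = np.cov(real_data, rowvar=False)

# Step 2: sample brand-new records from those statistics. No synthetic row
# corresponds to any real person, but the overall distribution
# (means, variances, correlations) is preserved.
synthetic_data = rng.multivariate_normal(mean, cov, size=10_000)

print("real mean:     ", real_data.mean(axis=0).round(2))
print("synthetic mean:", synthetic_data.mean(axis=0).round(2))
```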
This approach has huge potential in fields like healthcare and education, where there are necessarily strict controls on who can process real data. But Big Tech AI developers are using it for other reasons too.
Having depleted many of their traditional data sources, AI developers are turning to alternative ways of acquiring training data.
Google and OpenAI, for example, have reportedly started transcribing YouTube videos to generate fresh text.
Synthetic data offers yet another way around the problem, helping AI developers multiply the resources they already have.
Meta reported that it used Llama 2 to create the text-quality classifiers for Llama 3. It also leveraged synthetic data to expand its training dataset in areas such as coding and reasoning. For its part, Google used synthetic data to train Gemma and AlphaGeometry.
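Meta has not published the exact prompts behind those classifiers, so the sketch below is only a hedged illustration of the general idea of LLM-assisted quality filtering. The prompt wording, the 1-to-5 scale and the `call_llm` stub are all assumptions made for the example.

```python
# A hypothetical sketch of LLM-assisted quality filtering, not Meta's pipeline.
QUALITY_PROMPT = (
    "Rate the following text for its value as language-model training data, "
    "from 1 (spam or boilerplate) to 5 (clear, informative prose). "
    "Reply with a single digit.\n\nText:\n{document}"
)


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a language model API or local endpoint."""
    return "4"  # stubbed reply so the sketch runs end to end


def keep_document(document: str, threshold: int = 3) -> bool:
    """Ask the model to score a document; keep it only if it clears the bar."""
    reply = call_llm(QUALITY_PROMPT.format(document=document))
    try:
        return int(reply.strip()) >= threshold
    except ValueError:
        return False  # unparseable reply: discard rather than risk low quality


corpus = [
    "Photosynthesis converts sunlight, water and carbon dioxide into glucose.",
    "CLICK HERE to WIN!!! limited offer buy now",
]
filtered = [doc for doc in corpus if keep_document(doc)]
```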
Although these uses are often about stretching existing resources further, a recent report by Microsoft points to another advantage of the technique that emphasizes quality over quantity.
In 2023, Microsoft machine learning researcher Ronen Eldan was reading bedtime stories to his daughter when a thought occurred to him: “How did she learn this word? How does she know how to connect these words?”
Inspired by Eldan’s storytime, Microsoft researchers built a dataset from a list of 3,000 words, roughly the vocabulary of a 5-year-old.
The team then prompted a large language model to create a children’s story using one noun, one verb and one adjective from the list. By repeating this process over several days, they generated millions of stories in a synthetic dataset they called TinyStories.
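The researchers have described the recipe only at that high level, so the loop below is a hedged reconstruction of the idea rather than their actual code; the tiny word lists, the prompt wording and the `call_llm` stub are placeholders for illustration.

```python
import random

# Tiny stand-ins for the real list of roughly 3,000 child-level words.
NOUNS = ["dog", "ball", "tree", "cake", "boat"]
VERBS = ["jump", "find", "share", "build", "sing"]
ADJECTIVES = ["happy", "tiny", "red", "brave", "sleepy"]

STORY_PROMPT = (
    "Write a short story for a young child using only simple words. "
    "The story must include the noun '{noun}', the verb '{verb}' "
    "and the adjective '{adjective}'."
)


def call_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model."""
    return "Once upon a time..."  # stubbed reply so the sketch runs


def generate_stories(n: int) -> list[str]:
    """Draw a fresh word combination for each story, TinyStories-style."""
    stories = []
    for _ in range(n):
        prompt = STORY_PROMPT.format(
            noun=random.choice(NOUNS),
            verb=random.choice(VERBS),
            adjective=random.choice(ADJECTIVES),
        )
        stories.append(call_llm(prompt))
    return stories


synthetic_dataset = generate_stories(1000)
```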
Compared to the training sets behind large language models (LLMs), TinyStories is, well, tiny. Yet when they used it to train a new language model, the results were impressive.
That experiment helped Microsoft develop the Phi-3 family of small language models. Although the new models are just a fraction of the size of their larger cousins, they nonetheless approach (and sometimes even surpass) the capabilities of those bigger models.
“The power of the current generation of large language models is really an enabler that we didn’t have before in terms of synthetic data generation,” commented Microsoft Vice President Ece Kamar.
If LLMs have indiscriminately absorbed huge repositories of human knowledge, synthetic datasets like TinyStories essentially boil that information down into more precise, targeted resources.