What is synthetic data and how can it help us break the data wall?


In the rapidly evolving landscape of artificial intelligence (AI), training data is indispensable. The quality and quantity of the data used to train AI models directly impact their performance and capabilities. As we push the boundaries of AI technology, we come up against a significant challenge in the form of the so-called data wall: a shortage of training data that threatens to slow the progress of a technology widely seen as the future of many industries.

One proposed solution to this problem is generating synthetic data. The following article explores the concept of synthetic data, how data scientists believe it can break through the data wall, and the real-world implications for current machine learning models and the future of AI development.

 

The race for data

The success of AI models hinges on the quality of the data used to train them. Simply put, an AI model's output is only as good as its training data. Or, in blunter terms, "garbage in, garbage out." This reality has sparked a "race for AI data" among major tech companies, which have realized that achieving significant improvements in the performance and capabilities of future AI models will require even larger datasets. As companies compete for real-world data, some controversial practices have emerged.

First, companies are developing sophisticated algorithms to crawl the Internet on a massive scale and extract data points from websites, social media platforms, and online databases. Although this approach yields large volumes of original data, it raises concerns about copyright infringement, breaches of social media platforms' policies, and data privacy, as web scraping can surface extremely sensitive information. For instance, the New York Times report "How Tech Giants Cut Corners to Harvest Data for A.I." describes how OpenAI used its Whisper speech recognition tool to transcribe audio from YouTube videos, operating in a legal gray area and potentially violating the copyrights of YouTube creators. It also notes how Meta quietly changed its policies to make it easier to use user-generated content from its platforms for training AI models.

Second, the market for pre-compiled datasets is growing, as companies are willing to pay substantial sums for high-quality, curated information based on actual data. This has led to the emergence of data brokers from whom real datasets can be purchased. Buying data can prove valuable to any company, but it is costly and "kicks the ethical can down the road" by delegating responsibility to a third party. The same report highlights that Meta hired contractors in Africa to aggregate summaries of fiction and nonfiction, which included copyrighted content.

Third, tech giants are increasingly looking to expand their data holdings through strategic partnerships or outright acquisitions of data-rich companies. For example, Google’s 2021 acquisition of Fitbit for $2.1 billion provided it with a wealth of health and fitness data. This consolidation of data resources raises concerns about monopolistic practices and the concentration of power in the hands of a few large corporations, as well as multiplying the aforementioned copyright and privacy issues.

 

The data wall: a looming threat

The concept of the data wall, as highlighted by research from data scientists at Epoch AI, suggests that we may be approaching a point where high-quality, real-world data available for training AI models becomes scarce. This scarcity could potentially slow down or even halt progress in certain areas of AI development. The data wall presents several challenges.

The limited availability of diverse, high-quality data means increasing costs associated with acquiring new data. The aforementioned New York Times report exposed the sometimes ethically dubious lengths to which tech giants will go to obtain training data. At the same time, the declining availability of original data creates barriers to entry for smaller companies and researchers, thus limiting innovation in the field.

With growing awareness of data privacy issues and the implementation of stricter data protection regulations, such as the General Data Protection Regulation (GDPR) in Europe, the ethical and legal landscape surrounding data collection is becoming more complex. The European Parliamentary Research Service has published a study that examines the interplay between the GDPR and artificial intelligence (AI). It explores how AI technologies interact with personal data and how the GDPR regulates AI, concluding that, while AI can be GDPR-compliant, the regulation lacks sufficient guidance for controllers and requires further development to address AI-specific challenges. This can make it more challenging and costly to obtain certain types of data, particularly personal information, as the compliance costs and legal risks connected with collecting sensitive data increase.

Additionally, many existing datasets contain inherent biases that reflect historical inequalities or limitations in data collection methods. As we approach the data wall, there’s a risk that these biases could become more entrenched in AI systems, leading to unfair or inaccurate outcomes.

Furthermore, in rapidly changing fields, historical data may quickly become outdated or irrelevant. Maintaining up-to-date datasets that reflect current realities is an ongoing challenge that contributes to the data wall problem.

As we confront these challenges, the AI community is exploring solutions to overcome the data wall. One of these approaches is the use of synthetic data.

 

What is synthetic data generation?

Stanford University’s “Artificial Intelligence Index Report 2024” points to synthetic data as a potential solution to the data wall. Synthetic data is artificially generated information that mimics real-world data, reproducing its statistical properties and patterns. Unlike real-world data, which is collected from actual sources, synthetic data is created using algorithms and AI models. These algorithms analyze existing datasets and generate synthetic data points that share similar characteristics and relationships with the original data.

Microsoft has successfully employed synthetic data in training its Phi-3 family of small language models. By using carefully curated synthetic datasets, Microsoft was able to create highly capable models with significantly less training data compared to larger models like GPT-3.

How is synthetic data created?

The process of creating synthetic data involves several key steps: analysis of real data, model development, data generation, validation, and refinement. Researchers begin by studying existing datasets to understand their structure and patterns. They then develop a generative model to capture these characteristics, which is used to produce new data points. The synthetic data undergoes rigorous testing to ensure it accurately represents real data without compromising privacy. This process is iteratively refined to enhance the quality and utility of the generated data.
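As a rough illustration of these steps, the sketch below is a minimal pipeline in Python, assuming numpy and pandas; the column names ("age", "annual_income") and the choice of a simple multivariate normal as the generative model are assumptions made for the example, not a prescribed method. It fits a basic statistical model to a "real" tabular dataset, samples synthetic rows, and validates them by comparing summary statistics:

```python
import numpy as np
import pandas as pd

def fit_gaussian_model(real_df: pd.DataFrame):
    """Steps 1-2: analyze the real data and fit a simple generative model
    (here, a multivariate normal over the numeric columns)."""
    mean = real_df.mean().to_numpy()
    cov = real_df.cov().to_numpy()
    return mean, cov

def generate_synthetic(mean, cov, n_rows: int, columns, seed: int = 0) -> pd.DataFrame:
    """Step 3: sample new, artificial rows from the fitted model."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_rows)
    return pd.DataFrame(samples, columns=columns)

def validate(real_df: pd.DataFrame, synth_df: pd.DataFrame, tolerance: float = 0.1) -> bool:
    """Step 4: check that the synthetic data reproduces basic statistical
    properties of the real data (here, column means and standard deviations)."""
    mean_gap = (real_df.mean() - synth_df.mean()).abs() / real_df.std()
    std_gap = (real_df.std() - synth_df.std()).abs() / real_df.std()
    return bool((mean_gap < tolerance).all() and (std_gap < tolerance).all())

if __name__ == "__main__":
    # Hypothetical "real" dataset standing in for data a team already holds.
    rng = np.random.default_rng(42)
    real = pd.DataFrame({
        "age": rng.normal(45, 12, 5_000),
        "annual_income": rng.normal(60_000, 15_000, 5_000),
    })
    mean, cov = fit_gaussian_model(real)
    synthetic = generate_synthetic(mean, cov, n_rows=5_000, columns=real.columns)
    print("Validation passed:", validate(real, synthetic))
```

In practice, step 5 (refinement) would loop back: if validation fails or downstream models underperform, the generative model is adjusted and the data regenerated.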

Various techniques can be employed to create synthetic data, each with its own strengths and applications. Generative adversarial networks (GANs) use competing neural networks to produce increasingly realistic data. Variational autoencoders (VAEs) learn to encode and decode data, allowing for the generation of new samples. Statistical modeling and simulation rely on mathematical models to generate data with specific properties, while rule-based systems apply domain-specific logic to produce structured data. Researchers often combine these approaches to achieve optimal results, tailoring their methods to the specific requirements of each project.
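For readers curious what the GAN approach looks like in practice, here is a deliberately minimal sketch, assuming PyTorch; the two-column toy data, network sizes, and training settings are illustrative choices rather than a production recipe. A generator learns to turn random noise into samples that a discriminator can no longer tell apart from the real ones:

```python
import torch
import torch.nn as nn

# Toy "real" data: two correlated numeric features, standing in for a real dataset.
torch.manual_seed(0)
real_data = torch.randn(2_000, 2) @ torch.tensor([[1.0, 0.6], [0.0, 0.8]])

noise_dim = 8

# Generator: maps random noise to candidate data points.
generator = nn.Sequential(
    nn.Linear(noise_dim, 32), nn.ReLU(),
    nn.Linear(32, 2),
)

# Discriminator: estimates the probability that a sample is real.
discriminator = nn.Sequential(
    nn.Linear(2, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

loss_fn = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)

batch_size = 128
for step in range(2_000):
    # Train the discriminator on a real batch and a (detached) fake batch.
    idx = torch.randint(0, real_data.shape[0], (batch_size,))
    real_batch = real_data[idx]
    fake_batch = generator(torch.randn(batch_size, noise_dim)).detach()
    d_loss = loss_fn(discriminator(real_batch), torch.ones(batch_size, 1)) + \
             loss_fn(discriminator(fake_batch), torch.zeros(batch_size, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Train the generator to fool the discriminator.
    fake_batch = generator(torch.randn(batch_size, noise_dim))
    g_loss = loss_fn(discriminator(fake_batch), torch.ones(batch_size, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

# After training, the generator produces synthetic samples from pure noise.
synthetic = generator(torch.randn(1_000, noise_dim)).detach()
print("Synthetic sample mean:", synthetic.mean(dim=0))
```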

The benefits of synthetic data

Synthetic data offers several advantages that make it an attractive solution for breaking through the data wall:

 

1. Privacy and security

One of the most significant benefits of synthetic data is its ability to preserve privacy. Artificially generated data does not contain any real personal information or sensitive data. This characteristic makes synthetic data particularly valuable in industries dealing with confidential information, such as healthcare or finance.

For example, in healthcare, synthetic records can be created that mimic the structure and statistical properties of real patient data without including any actual patient information. This allows researchers and AI developers to work with realistic medical data and validate their models without the risk of violating data protection regulations. In the financial sector, synthetic tabular and transaction data can aid in the development and testing of fraud detection systems without exposing real customer data. This approach enables financial institutions to improve their security measures while keeping sensitive customer data strictly confidential.
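Purely as an illustration of the financial use case (the field names, value ranges, and the simple fraud rule below are invented for this sketch and are not drawn from any real system), a rule-based generator of synthetic transactions might look like this:

```python
import numpy as np
import pandas as pd

def generate_transactions(n: int, fraud_rate: float = 0.01, seed: int = 0) -> pd.DataFrame:
    """Generate synthetic card transactions with a small share of rule-based
    'fraudulent' rows, for testing a fraud detection pipeline without real data."""
    rng = np.random.default_rng(seed)
    df = pd.DataFrame({
        "amount": rng.lognormal(mean=3.5, sigma=1.0, size=n).round(2),
        "hour_of_day": rng.integers(0, 24, size=n),
        "merchant_category": rng.choice(["grocery", "travel", "electronics", "other"], size=n),
        "is_fraud": 0,
    })
    # Domain rule: a small fraction of transactions become high-value,
    # late-night electronics purchases and are labeled as fraud.
    fraud_idx = rng.choice(n, size=int(n * fraud_rate), replace=False)
    df.loc[fraud_idx, "amount"] = rng.uniform(2_000, 10_000, size=len(fraud_idx)).round(2)
    df.loc[fraud_idx, "hour_of_day"] = rng.integers(1, 5, size=len(fraud_idx))
    df.loc[fraud_idx, "merchant_category"] = "electronics"
    df.loc[fraud_idx, "is_fraud"] = 1
    return df

transactions = generate_transactions(100_000)
print(transactions["is_fraud"].value_counts())
```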

 

2. Unlimited supply

Unlike real-world data, which is finite and often challenging to collect, synthetic data generation allows for the creation of data in theoretically unlimited quantities. This abundance of artificial data lets researchers and developers build diverse datasets tailored to specific needs without the constraints of authentic data availability.

The ability to generate synthetic data in large volumes on demand is particularly valuable when dealing with rare events or edge cases. For instance, in autonomous vehicle development, synthetic data can stand in for real vehicle crash data and can be used to create scenarios that are uncommon in real-world driving but crucial for safety testing, such as various types of accidents or extreme weather conditions.

 

3. Cost-effectiveness

Acquiring real data can be expensive, involving costs related to collection, cleaning, and compliance with data protection regulations. Synthetic data, once the generation process is established, can be produced at a fraction of the cost, making it a more economical option for large-scale AI training.

While there are initial costs associated with developing and refining synthetic data generation models, the long-term savings can be significant. This cost-effectiveness can democratize access to high-quality training data, enabling smaller companies and researchers to compete in AI development without the need for massive data acquisition budgets.

 

4. Faster acquisition

The process of collecting and preparing real-world data can be time-consuming. Synthetic data generation, on the other hand, can be rapidly scaled up or down based on requirements, significantly reducing the time needed to obtain training datasets.

This speed advantage is particularly crucial in fast-moving fields where the ability to quickly iterate and test new ideas can provide a competitive edge. For example, in the development of natural language processing models, synthetic data can be used to rapidly generate diverse text datasets for training and testing, allowing for faster development cycles.
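As a toy sketch of how synthetic text data can be produced quickly (the intents, templates, and slot values below are made up for the example; real projects often rely on large language models or richer grammars), a simple template-based generator for an intent classification dataset could look like this:

```python
import random

# Hypothetical intents with fill-in-the-blank templates and slot values.
TEMPLATES = {
    "book_flight": [
        "I want to fly from {city_a} to {city_b} on {day}.",
        "Book me a flight to {city_b} next {day}, please.",
    ],
    "check_weather": [
        "What's the weather like in {city_a} on {day}?",
        "Will it rain in {city_a} this {day}?",
    ],
}
SLOTS = {
    "city_a": ["Warsaw", "Berlin", "Madrid", "Oslo"],
    "city_b": ["London", "Rome", "Lisbon", "Vienna"],
    "day": ["Monday", "Friday", "Sunday", "weekend"],
}

def generate_examples(n: int, seed: int = 0):
    """Produce n (text, intent) pairs by filling random templates with random slot values."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        intent = rng.choice(list(TEMPLATES))
        template = rng.choice(TEMPLATES[intent])
        text = template.format(**{slot: rng.choice(values) for slot, values in SLOTS.items()})
        examples.append((text, intent))
    return examples

for text, intent in generate_examples(5):
    print(f"{intent}: {text}")
```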

 

5. Customization and control

Synthetic data allows for precise control over the characteristics and distributions of the dataset. This control enables researchers to create balanced datasets, address specific edge cases, or simulate rare events that might be underrepresented in real-world data.

For instance, in developing AI systems for medical diagnosis, synthetic data can be used to ensure a balanced representation of different diseases, including rare conditions for which sufficient real-world data might not be available. This level of customization can lead to more robust, fairer AI models that perform well across a wide range of scenarios.
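One way to read "balanced representation" in code is the sketch below, where the column names ("biomarker_a", "biomarker_b", "diagnosis") and the per-class multivariate normal model are assumptions made for illustration: it fits a simple model to each class and generates extra synthetic rows only for underrepresented classes until every class reaches the same size.

```python
import numpy as np
import pandas as pd

def balance_with_synthetic(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Top up every class to the size of the largest class by sampling synthetic
    rows from a per-class multivariate normal fitted to the real rows."""
    rng = np.random.default_rng(seed)
    feature_cols = [c for c in df.columns if c != label_col]
    target_size = df[label_col].value_counts().max()
    parts = [df]
    for label, group in df.groupby(label_col):
        missing = target_size - len(group)
        if missing <= 0:
            continue
        mean = group[feature_cols].mean().to_numpy()
        cov = np.cov(group[feature_cols].to_numpy(), rowvar=False)
        samples = rng.multivariate_normal(mean, cov, size=missing)
        synth = pd.DataFrame(samples, columns=feature_cols)
        synth[label_col] = label
        parts.append(synth)
    return pd.concat(parts, ignore_index=True)

# Hypothetical, imbalanced "diagnosis" dataset: the rare condition has few rows.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "biomarker_a": np.concatenate([rng.normal(1.0, 0.2, 950), rng.normal(2.5, 0.3, 50)]),
    "biomarker_b": np.concatenate([rng.normal(0.5, 0.1, 950), rng.normal(1.2, 0.2, 50)]),
    "diagnosis": ["common"] * 950 + ["rare"] * 50,
})
balanced = balance_with_synthetic(data, label_col="diagnosis")
print(balanced["diagnosis"].value_counts())
```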

 

6. Bias mitigation

Real-world datasets often contain inherent biases that can be perpetuated or amplified by AI models. Synthetic data generation offers an opportunity to create more balanced and representative datasets, potentially reducing bias in AI systems.

By carefully designing the data generation process, researchers can create synthetic datasets that correct for known biases in historical data. For example, in developing AI for hiring processes, synthetic data can be used to create a balanced dataset that doesn’t reflect historical gender or racial biases in recruitment, leading to fairer AI-driven recruitment tools.

 

7. Facilitating innovation

By providing access to high-quality datasets in areas where real data is scarce or restricted, synthetic data can accelerate innovation and research in fields such as healthcare, autonomous vehicles, and financial modeling.

In emerging technologies like quantum computing or advanced materials science, where real-world data may be limited, synthetic data can provide a valuable resource for training AI models and exploring new possibilities. This can help bridge the gap between theoretical research and practical applications, driving innovation forward.

 

Drawbacks and limitations of synthetic data

While synthetic data offers many benefits, it is also crucial to recognize its limitations and potential drawbacks, including:

 

1. Fidelity to real-world complexity

Synthetic data may not always capture the full complexity and nuances of real-world phenomena. There’s a risk that important edge cases or rare events might be missed or underrepresented in any particular synthetic data set.

 

2. Model dependency

The quality of synthetic data is heavily dependent on the machine learning models and algorithms used to generate it. If these models are flawed or biased, those imperfections will be reflected in the synthetic data they produce.

 

3. The threat of fake data

Poorly validated synthetic data is, in effect, fake data: ensuring that synthetic data accurately represents real-world scenarios can be challenging. Rigorous validation processes and software testing are necessary to confirm that models trained on synthetic data perform well when applied to real-world situations.

 

4. Potential for amplifying biases

While using synthetic data can help mitigate some biases, it can also amplify existing biases if the generation process is not carefully designed and monitored.

 

5. Overreliance risk

There’s a potential risk of overreliance on synthetic data, which could lead to a disconnect from real-world data and its evolving patterns. Excessive dependence on synthetic data can result in models that fail to capture real-world nuances or become outdated as societal trends evolve. It might also create a false sense of data abundance, discouraging the collection and analysis of valuable real-world information.

 

6. Creativity doom loop

Many people in the creative industries fear that, as AI models are increasingly used to generate synthetic text and images, our media landscape will become more repetitive and less creative (a trend some argue was underway even before the introduction of machine learning models). Training a new machine learning model on synthetically generated data runs the risk that the newly generated output will simply recycle the same ideas in the same styles.

 

7. Model collapse

Model collapse occurs when models trained predominantly on synthetic data begin to lose their ability to generalize to real-world scenarios. As a model repeatedly learns from artificially generated patterns, it may overfit to the specific characteristics of the synthetic data, failing to capture the full complexity and variability of genuine data. This can result in degraded performance when the model encounters real-world inputs, potentially leading to unreliable outcomes.
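A toy numerical illustration of this effect follows; the choice of a simple Gaussian as the "model", the sample size, and the number of generations are arbitrary assumptions for the demo. Each generation is trained only on data sampled from the previous generation's fitted model, with no fresh real data added:

```python
import numpy as np

rng = np.random.default_rng(0)

# "Real" data: 50 samples from a distribution with mean 0 and standard deviation 1.
data = rng.normal(loc=0.0, scale=1.0, size=50)

print(f"generation  0: mean={data.mean():+.3f}, std={data.std():.3f}")
for generation in range(1, 41):
    # Fit a very simple "model" (a Gaussian) to the current training data...
    mu, sigma = data.mean(), data.std()
    # ...then train the next generation only on samples drawn from that model,
    # i.e. purely on synthetic data.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if generation % 5 == 0:
        print(f"generation {generation:2d}: mean={data.mean():+.3f}, std={data.std():.3f}")

# Over successive generations the fitted parameters drift away from the original
# distribution and the spread tends to decay: a simplified analogue of model
# collapse when models are trained mostly on their own synthetic output.
```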

 

Summary

As the demand for high-quality, diverse datasets grows, traditional data collection methods are struggling to keep pace. Synthetic data offers a promising solution to this problem, providing a potentially unlimited source of training data that can be generated quickly, cost-effectively, and with built-in privacy protections, enabling innovation across various industries. However, these benefits come at a price. The risk of model collapse, the varying quality of synthetic data, its potential lack of real-world complexity, and the ethical implications of its use are all valid concerns to weigh before choosing it as a solution.

Synthetic data represents a powerful tool in the quest to overcome the data wall and drive AI innovation forward. Although synthetic data is by no means a silver bullet and real-world data will be needed to offset its limitations, for IT managers and AI practitioners, staying informed about the potential of synthetic data and its evolving applications will be crucial in navigating the future landscape of AI development.
