The Promise of Synthetic Data

Synthetic data can augment existing engineering workflows where real-world data is scarce.

Software vendors are continually finding ways to integrate generative AI and other algorithm-based approaches into their design and simulation products.

Courtesy of Getty Images.


Machine learning and artificial intelligence (AI) are increasingly part of the design and engineering conversation, as software vendors find ways to integrate generative AI and other algorithm-based approaches into their design and simulation products. These solutions can be used to accelerate design space exploration, create reduced order models (ROMs) for rapid analysis, help quickly search existing designs or past simulation data for guidance in solving new problems, or build virtual worlds in which to test out everything from robots to autonomous aircraft concepts.

However, for some applications there is simply not enough real-world data to properly train an AI model, at least not for demanding engineering use cases. Synthetic data has emerged as a solution. It can mimic real-world inputs and help ensure there are enough diverse data points to generate reliable results. Training autonomous vehicle systems, for example, would otherwise require millions of hours of drive time.

“Synthetic data are datasets generated through algorithms or simulations to replicate the behavior and statistical properties of real-world data,” says Eric Vinchon, vice president of Product Strategy at Tech Soft 3D.

Synthetic data can be particularly valuable for modeling things like equipment failure conditions or other anomalous events that are rare and difficult to gather data from. “You can add sensors to a machine to get an idea of what is happening before it breaks, but the machine actually needs to break to get that data,” says Johanna Pingel, MathWorks AI Product Marketing Manager. “You could be waiting a long time.”

Synthetic data can help avoid model collapse (which can occur when an AI model ingests too much AI-generated content) and save time. There are risks, however: the synthetic data must be high quality, and engineers need to curate it and compare results to real-world, physics-based analysis.

Over the past year, there has been a lot of activity around synthetic data for engineering applications. NVIDIA has been a leader in this space, and at the Consumer Electronics Show (CES) in January 2025 the company announced the Cosmos World Foundation Model Platform, with state-of-the-art generative world foundation models (WFMs), advanced tokenizers, guardrails, and an accelerated video processing pipeline. According to NVIDIA, foundation models are neural networks trained on immense amounts of raw data, and they serve as the building blocks for generative AI.

At CES, NVIDIA announced new generative AI models and blueprints that expand NVIDIA Omniverse. Image courtesy of NVIDIA

As the company explained in its press release: “Cosmos WFMs are purpose-built for physical AI research and development, and can generate physics-based videos from a combination of inputs, like text, image and video, as well as robot sensor or motion data. The models are built for physically based interactions, object permanence, and high-quality generation of simulated industrial environments—like warehouses or factories—and of driving environments, including various road conditions.”

Cosmos can reduce the time and cost to develop physical AI models by generating large amounts of photo-real, physics-based synthetic data for training existing models. Robotics and automotive companies, including Uber, Agility Robotics, and Waabi, are already adopting it. Paired with NVIDIA Omniverse, the company says, Cosmos can act as a synthetic data-multiplication engine.

“Data scarcity and variability are key challenges to successful learning in robot environments,” says Pras Velagapudi, chief technology officer at Agility. “Cosmos’ text-, image- and video-to-world capabilities allow us to generate and augment photorealistic scenarios for a variety of tasks that we can use to train models without needing as much expensive, real-world data capture.”

Earlier, at SIGGRAPH 2024, NVIDIA demonstrated how an AI- and Omniverse-enabled workflow could generate a large amount of synthetic motion and perception data from a small amount of real-world data.

Siemens Digital Industries Software, meanwhile, has partnered with PhysicsX, a startup that uses generative AI and synthetic data for deep physics simulation. PhysicsX is building its latest pretrained deep physics model for aerodynamics on high-fidelity simulation data generated with the Siemens Xcelerator portfolio.

According to PhysicsX: “Delivered as a browser-based application, PhysicsX’s LGM-Aero gives you the ability to use the target payload as the starting point, then gradually shape the optimal geometry by considering thrust, cruising speed, wingspan, lift, and other parameters. There is also an upload button that lets users upload CAD geometry as STL.”

Synthetic Data in Action

Synthetic data is used in instances where data exists but has not been collected; where data is collected but not labeled; or where data can’t be collected easily using sensors or other technology.

In industrial or engineering applications, it can be used in cases where you need to test a system, like an autonomous vehicle, against a difficult-to-replicate scenario, like a squirrel passing in front of the car. You can use synthetic data to generate these scenarios without putting anyone in harm’s way.

“Synthetic data are used as substitutes for real-world data when capturing or labeling real data is challenging,” Vinchon says. “They can potentially preserve privacy and enable applications such as machine learning model training, testing, and simulations.”

There have been a few recent examples of this related to quality management and anomaly detection. Digital transformation specialist GFT Technologies is working with NVIDIA to use technology like Omniverse to develop AI applications for manufacturing, including inspection and quality management. GFT is using NVIDIA Replicator for synthetic data generation to virtually train the solution using 3D models and reduce the need for disruptive physical testing.

“The promise that we are offering the market is that instead of investing weeks or months in managing this data in the acquisition stage, we can do this pre-work offline,” says Ignasi Barri, global head for Data and AI at GFT. “We don’t have to be in the factory. We can use Replicator to train the model, then go to the factory, set up the hardware with sensors, and test the model.”

MathWorks has likewise been working on AI-based anomaly detection systems to help enable smart factory solutions, and that work includes some synthetic data generation. “We see a lot of interest in synthetic data generation because customers don’t have enough examples of anomalous data to train an accurate model,” says Rachel Johnson, MathWorks principal product manager. “The tie-in for us is that we are a company that works with a lot of customers doing model-based design. If engineers have models of their systems that they use in the design process, you can repurpose those models to generate physics-based data for model training. This is assuming that they have validated, physics-based models, and we have operational data that can be used.

“Once they have validated the accuracy of the model itself, you can be fairly sure that the data that comes out of that model will be accurate,” Johnson adds. “The use case for synthetic data generation from an engineering perspective is from physics-based models built in Simulink or Simscape.”
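As a rough illustration of the idea Johnson describes, the sketch below reuses a physics-based model to produce labeled training data for a fault that is rare in the field. It is written in Python rather than Simulink or Simscape, and everything in it is an assumption made for illustration: the mass-spring-damper model, the parameter ranges, the "worn_damper" fault, and the two features are hypothetical, not taken from MathWorks or any customer system.

```python
import math
import random


def simulate_vibration(stiffness, damping, mass=1.0, dt=0.001, steps=2000,
                       noise_std=0.0, rng=None):
    """Free response of a mass-spring-damper released from x = 1.

    Integrated with semi-implicit Euler; Gaussian noise stands in for a sensor.
    """
    rng = rng or random.Random(0)
    x, v = 1.0, 0.0
    trace = []
    for _ in range(steps):
        a = (-stiffness * x - damping * v) / mass
        v += a * dt
        x += v * dt
        trace.append(x + rng.gauss(0.0, noise_std))
    return trace


def rms(signal):
    """Root-mean-square amplitude of a signal."""
    return math.sqrt(sum(s * s for s in signal) / len(signal))


def make_dataset(n_per_class=50, seed=42):
    """Generate labeled feature rows for a healthy and a faulty damper.

    The damping ranges are hypothetical; in practice they would come from a
    validated model of the real equipment.
    """
    rng = random.Random(seed)
    classes = (("nominal", (4.0, 6.0)), ("worn_damper", (0.2, 1.0)))
    rows = []
    for label, damping_range in classes:
        for _ in range(n_per_class):
            damping = rng.uniform(*damping_range)
            trace = simulate_vibration(stiffness=400.0, damping=damping,
                                       noise_std=0.01, rng=rng)
            rows.append({
                "rms": rms(trace),
                "peak": max(abs(s) for s in trace),
                "label": label,
            })
    return rows


dataset = make_dataset()
```

The value of such data depends entirely on the underlying model having been validated against the real system, which is the condition Johnson attaches to the approach.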

Synthetic data for quality control/anomaly detection was also the subject of a Deloitte paper on quality management, where the data can help protect personally identifiable information (PII) or protected health information (PHI) that might be necessary for certain types of equipment analysis.

“Synthetic data is leveraged at scale by many organizations to reduce risks of PHI/PII infringement in lower environments and to leverage datasets that more holistically mimic their production datasets,” says Rohit Pereira, principal and quality engineering practice leader, Deloitte Consulting LLP. “This enables more efficient and effective test execution at scale across teams with a reduction in data contention across teams operating in the same non-production environments. Additionally, this removes the need for test data de-identification while generating higher volumes of test data for transaction validation at scale.”

Vinchon at Tech Soft 3D provided some additional examples:

• Synthetic data is often used to train algorithms that detect and classify objects captured by cameras, depth sensors, or other devices. Applications include autonomous driving systems and digitalization from point clouds.

• Starting from a complete virtual 3D model of an environment or object, it’s possible to generate synthetic images, depth maps, or point clouds that mimic what a real-world device such as a camera, LiDAR sensor, or depth scanner would capture. From there, developers can simulate complex or rare scenarios with different conditions and validate that the object or property they classify/recognize from what’s captured by the sensor matches the properties stored or simulated in the 3D environment.
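The workflow in the second example (generate what a sensor would capture from a virtual model, then check the recognized property against the value stored in the model) can be sketched in a few lines of Python. This is a toy under stated assumptions: the "3D model" is a sphere, the "sensor" adds Gaussian range noise, and the "recognition" step is a mean-distance estimate. The function names and noise level are hypothetical and do not come from Tech Soft 3D.

```python
import math
import random


def scan_sphere(radius, n_points=500, range_noise_std=0.005, seed=0):
    """Sample noisy surface points of a sphere centered at the origin.

    Stands in for a depth scanner looking at a virtual object.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(n_points):
        # Pick a uniformly distributed direction on the unit sphere.
        z = rng.uniform(-1.0, 1.0)
        phi = rng.uniform(0.0, 2.0 * math.pi)
        r_xy = math.sqrt(1.0 - z * z)
        direction = (r_xy * math.cos(phi), r_xy * math.sin(phi), z)
        # Perturb the measured range to imitate sensor noise.
        measured = radius + rng.gauss(0.0, range_noise_std)
        points.append(tuple(measured * d for d in direction))
    return points


def estimate_radius(points):
    """Toy recognition step: mean distance of the points from the origin."""
    distances = [math.sqrt(x * x + y * y + z * z) for x, y, z in points]
    return sum(distances) / len(distances)


true_radius = 0.25  # property stored in the virtual model (ground truth)
cloud = scan_sphere(true_radius)
error = abs(estimate_radius(cloud) - true_radius)
```

Because the ground truth is known exactly, every synthetic scan arrives already labeled, which is the main appeal over captured data that must be annotated by hand.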

Trusting the Data

According to Pingel at MathWorks, you wouldn’t want to start from scratch with synthetic data; you use it to augment real-world data and physics.

“Regardless of the application, it comes down to testing the system. You have to verify that the output is what you expect, and test that against real-world data,” she says. “We’re not at the point yet where engineers can take synthetic data and build accurate models 100%, and then deploy them into the field.”

In other words, you can’t train a model on a fully synthetic data set and expect it to work perfectly on a real-world data set. That’s where the engineer plays a key role. “You can’t take the engineer out of the loop,” Pingel says. “You need to think like an engineer. If you don’t understand the problem you are trying to solve, you could build a model that doesn’t meet the design requirements. Engineers are essential for the generation and input of synthetic data into a model. They are the last line of defense that the model is working correctly.”

“Synthetic data needs to reflect realistic conditions and behaviors across diverse scenarios,” adds Vinchon. “Going back to the example of autonomous driving, a 3D scene would need to be rendered under different weather conditions: cloudy, raining, snowing, etc. [to be useful].”

Generating reliable synthetic data relies on having good real-world data and physics available. There are also tools like those offered by NVIDIA and others to help create that data quickly. Ansys offers Ansys AVxcelerate, developed specifically to test and validate sensor perception with physically accurate sensor simulation, as well as Ansys Perceive EM, which can simulate real-time radar and wideband 5G/6G channels. Ansys AVxcelerate Sensors is accessible within NVIDIA DRIVE Sim, a scenario-based AV simulator powered by NVIDIA Omniverse.

In simple applications, though, you can manually incorporate synthetic data or create some test data based on existing simulations and data. For example, you can manually create potential questions and responses for a simple customer service chatbot without the need for generative AI. For engineering use cases, Vinchon says users will need advanced simulation tools, physics engines, and photorealistic graphics engines.

Depending on the application, there are also some real challenges to creating and using synthetic data. Replicating a real-world environment can be difficult, with sensors and cameras affected by things like temperature, humidity, and lighting conditions. Simulating all of those combinations is a huge undertaking. Real-world data also includes unexpected anomalies or outliers that synthetic data generation might not anticipate.
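To give a sense of how quickly those combinations multiply, here is a minimal Python sketch that enumerates a scenario grid. The condition names and values are hypothetical placeholders chosen for illustration; a real test plan would use far more factors and finer steps.

```python
import itertools

# Hypothetical environmental factors and a coarse set of values for each.
conditions = {
    "weather": ["clear", "cloudy", "rain", "snow"],
    "lighting_lux": [10, 1_000, 50_000],
    "temperature_c": [-20, 0, 20, 40],
    "humidity_pct": [20, 50, 80],
}


def enumerate_scenarios(conditions):
    """Yield one dict per combination of condition values."""
    keys = list(conditions)
    for values in itertools.product(*(conditions[k] for k in keys)):
        yield dict(zip(keys, values))


scenarios = list(enumerate_scenarios(conditions))
print(len(scenarios))  # 4 * 3 * 4 * 3 = 144
```

Even this coarse grid yields 144 scenes to render per object or route, and each added factor multiplies the total again, which is why teams typically sample the space rather than sweep it exhaustively.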

“Creating realistic and representative 3D models is challenging and expensive to do manually,” Vinchon says. “Machine learning can assist in this process, but it requires large amounts of high-quality data for training. This creates a circular problem: You need ML to create data. You need data to create ML.”

Synthetic Data Is a Tool

The experts we spoke to recommended a number of best practices when it comes to using synthetic data. First, start with simple use cases focused on creating variants and development workflows where the data augments existing simulations. Keep the engineer in the loop, and use real-world data when available.

For example, Bharat Electronics worked with MathWorks to leverage synthetic data to fill in the gaps when analyzing various parameters to help develop its radar solutions. “They aren’t using it for the entire application, but when they don’t have data available, it’s a great opportunity to incorporate synthetic data on a small scale,” Pingel says.

It’s also important to continuously test the system and uncover any bias that may have been introduced in the simulated data, and ensure it covers all relevant edge cases.

“This is never a one-and-done scenario. Once an AI algorithm is trained and put into operation, you have to continuously monitor it and make sure that it is continuing to represent the system it is deployed on. Continuously monitoring and validating with new data and then deciding when to retrain the model is baked into these deployments,” says Johnson at MathWorks.
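One simple way to picture the monitoring loop Johnson describes is a drift check that compares a recent batch of field measurements with the data the model was trained on. The Python sketch below is an assumption-laden illustration, not a MathWorks method: the z-score test on the batch mean, the threshold of 3, and the sample values are all hypothetical placeholders.

```python
import statistics


def drift_score(baseline, recent):
    """Distance of the recent batch mean from the baseline mean,
    in units of the standard error expected for a batch of that size."""
    mu = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    standard_error = sigma / len(recent) ** 0.5
    return abs(statistics.fmean(recent) - mu) / standard_error


def needs_review(baseline, recent, threshold=3.0):
    """Flag the model for revalidation (and possibly retraining)."""
    return drift_score(baseline, recent) > threshold


# Placeholder values standing in for one feature of the training data
# and for two batches of later field measurements.
baseline = [0.21, 0.23, 0.22, 0.24, 0.20, 0.22, 0.23, 0.21]
steady_batch = [0.22, 0.21, 0.23, 0.22]
shifted_batch = [0.30, 0.31, 0.29, 0.32]

steady_flag = needs_review(baseline, steady_batch)
shifted_flag = needs_review(baseline, shifted_batch)
```

With these placeholder numbers the steady batch passes and the shifted batch is flagged. A flag only signals that the field data has moved away from the training data; deciding whether to retrain remains an engineering judgment.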

“Like any tool, you need to use it when it is appropriate, but not overuse it,” adds Erick Galinkin, AI security researcher at NVIDIA. “AI can be useful for data that is difficult to collect. It is better to have real-world data, because you cannot perfectly simulate it, but it is still worth augmenting the data if you are lacking certain cases. You should keep an eye on how the overall model performs on real-world data, as a kind of subset. That is something that I have done. I use my real-world data, not as a holdout set, but as a sanity check to compare to the performance on synthetic data.”
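The sanity check Galinkin describes, comparing performance on real-world data with performance on synthetic data, might look like the following Python sketch. The threshold model, the sample values, and the 10-point allowable gap are hypothetical placeholders written for illustration; they are not NVIDIA's procedure or real measurements.

```python
def accuracy(predict, samples):
    """Fraction of (feature, label) pairs that `predict` labels correctly."""
    correct = sum(1 for feature, label in samples if predict(feature) == label)
    return correct / len(samples)


def sanity_check(predict, synthetic_samples, real_samples, max_gap=0.10):
    """Compare accuracy on synthetic data with accuracy on real data."""
    synthetic_acc = accuracy(predict, synthetic_samples)
    real_acc = accuracy(predict, real_samples)
    gap = synthetic_acc - real_acc
    return {
        "synthetic_accuracy": synthetic_acc,
        "real_accuracy": real_acc,
        "gap": gap,
        "ok": gap <= max_gap,
    }


def predict(rms_value):
    """Placeholder model: a fixed threshold on a single vibration feature."""
    return "anomalous" if rms_value > 0.35 else "nominal"


# Hand-written placeholder samples, not measurements.
synthetic = [(0.22, "nominal"), (0.25, "nominal"),
             (0.50, "anomalous"), (0.60, "anomalous")]
real = [(0.24, "nominal"), (0.33, "anomalous"),
        (0.55, "anomalous"), (0.20, "nominal")]

report = sanity_check(predict, synthetic, real)
```

Here the placeholder model scores 1.0 on the synthetic samples but 0.75 on the real ones, so the check fails. A gap like that suggests the synthetic data is missing cases that occur in the field.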
