These creepy fake humans herald a new era in AI
Once viewed as less desirable than real data, synthetic data is now seen by some as a panacea. Real data is messy and riddled with bias. New data privacy regulations make it hard to collect. By contrast, synthetic data is pristine and can be used to build more diverse data sets. You can produce perfectly labeled faces, say, of different ages, shapes, and ethnicities, to build a face-detection system that works across populations.
But synthetic data has its limitations. If it fails to reflect reality, it could end up producing even worse AI than messy, biased real-world data, or it could simply inherit the same problems. “What I don’t want to do is give the thumbs up to this paradigm and say, ‘Oh, that will solve so many problems,’” says Cathy O’Neil, a data scientist and founder of the algorithmic auditing firm ORCAA.
Realistic, not real
Deep learning has always been about data. But in recent years, the AI community has learned that good data matters more than big data. Even a small amount of the right, properly labeled data can do more to improve an AI system’s performance than ten times that amount of uncurated data, or even a more advanced algorithm.
That is why companies are changing how they develop their AI models, says Datagen’s CEO and founder, Ofir Chakon. Today, they start by collecting as much data as possible and then tweak and tune their algorithms for better performance. Instead, they should be doing the opposite: use the same algorithm while improving the composition of their data.
But collecting real-world data for this kind of iterative experimentation is too costly and time-consuming. This is where synthetic data comes in: teams can generate and test dozens of new data sets a day to identify which one maximizes a model’s performance.
To ensure the realism of its data, Datagen gives its vendors detailed instructions on how many individuals to scan in each age bracket, BMI range, and ethnicity, as well as a set list of actions for them to perform, such as walking around a room or drinking a soda. The vendors send back both high-fidelity static images and motion-capture data of those actions. Datagen’s algorithms then expand this data into hundreds of thousands of combinations. The synthesized data is sometimes checked again: fake faces are pitted against real faces, for example, to see whether they look realistic.
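This kind of realism check can be framed as a two-sample test: train a small “discriminator” to tell real samples from synthetic ones, and if it can’t beat chance, the two sets are statistically hard to tell apart. Below is a deliberately minimal sketch of the idea using toy two-dimensional features and a hand-rolled logistic regression; it is an illustration of the principle, not Datagen’s actual pipeline.

```python
import numpy as np

def discriminator_accuracy(real, fake, epochs=500, lr=0.1):
    """Train a tiny logistic-regression 'discriminator' to tell real
    samples from synthetic ones. Accuracy near 0.5 means the two sets
    are statistically hard to distinguish -- a rough proxy for realism."""
    X = np.vstack([real, fake])
    y = np.concatenate([np.ones(len(real)), np.zeros(len(fake))])
    X = np.hstack([X, np.ones((len(X), 1))])  # add a bias column
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # sigmoid predictions
        w -= lr * X.T @ (p - y) / len(y)      # gradient step
    preds = (1.0 / (1.0 + np.exp(-X @ w))) > 0.5
    return (preds == y).mean()

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 2))       # stand-in "real" features
good_fake = rng.normal(0.0, 1.0, size=(500, 2))  # matches the real distribution
bad_fake = rng.normal(3.0, 1.0, size=(500, 2))   # obviously off

print(discriminator_accuracy(real, good_fake))  # near 0.5: hard to tell apart
print(discriminator_accuracy(real, bad_fake))   # near 1.0: easy to spot
```

In a generative adversarial network, this same adversarial idea is built into training itself: the generator improves until the discriminator can no longer win.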
Datagen now generates facial expressions to monitor driver alertness in smart cars, body motions to track customers in cashier-free stores, and iris and hand motions to improve the eye- and hand-tracking capabilities of VR headsets. The company says its data has already been used to develop computer-vision systems serving tens of millions of users.
Datagen is not the only company manufacturing synthetic data. Click-Ins is a startup that uses synthetic AI to perform automated vehicle inspections. Using design software, it re-creates all the car makes and models its AI needs to recognize and then renders them with different colors, damage, and deformations under different lighting conditions and against different backgrounds. This lets the company update its AI whenever automakers release new models, and helps it avoid data-privacy violations in countries where license plates are considered private information and thus cannot appear in photos used to train AI.
Mostly.ai works with financial, telecommunications, and insurance companies to provide spreadsheets of fake client data that let companies share their customer databases with outside vendors in a legally compliant way. Anonymization can reduce a data set’s richness yet still fail to adequately protect people’s privacy. But synthetic data can be used to generate detailed fake data sets that share the same statistical properties as a company’s real data. It can also be used to simulate data the company doesn’t yet have, including a more diverse client population or scenarios such as fraudulent activity.
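The core idea can be sketched in a deliberately simplified form: fit a generative model to the real table and then sample brand-new rows from it, preserving aggregate statistics without copying any customer. The example below uses a plain Gaussian fit on made-up age and income columns; real vendors such as Mostly.ai use far richer generative models.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical "real" customer table: two correlated columns, age and income.
real_customers = rng.multivariate_normal(
    mean=[40.0, 55_000.0],
    cov=[[100.0, 30_000.0], [30_000.0, 2.5e8]],
    size=2_000,
)

# Fit a simple generative model (here just a Gaussian) to the real data...
mu = real_customers.mean(axis=0)
cov = np.cov(real_customers, rowvar=False)

# ...then sample brand-new, fully synthetic rows from it.
synthetic_customers = rng.multivariate_normal(mu, cov, size=2_000)

# The synthetic table preserves aggregate statistics such as the
# age-income correlation, but no row is a copy of any real customer.
print(np.corrcoef(real_customers, rowvar=False)[0, 1])
print(np.corrcoef(synthetic_customers, rowvar=False)[0, 1])
```

The two printed correlations come out close, which is the sense in which the fake spreadsheet “behaves like” the real one for downstream analysis.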
Proponents of synthetic data say it can also help evaluate AI. In a recent paper published at an AI conference, Suchi Saria, an associate professor of machine learning and health care at Johns Hopkins University, and her colleagues demonstrated how data-generation techniques could be used to extrapolate different patient populations from a single set of data. This could be useful if, for example, a company only had data from New York City’s younger population but wanted to understand how its AI performs on an aging population with a higher prevalence of diabetes. She is now starting her own company, Bayesian Health, which will use this technique to help test medical AI systems.
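One simple way to simulate a population you don’t have from one you do is importance reweighting: weight each existing record by how likely it would be under the target population, then resample. The sketch below uses invented numbers and a basic reweighting scheme; the paper’s actual methods are more sophisticated, and this approach can only stretch as far as the source data’s coverage allows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical source data: a younger population in which diabetes
# (crudely simulated here) becomes more likely with age.
ages = rng.normal(35.0, 10.0, size=5_000).clip(18.0, 90.0)
diabetes = rng.random(5_000) < (0.02 + 0.004 * (ages - 18.0))

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Importance weights: how much more likely each record would be under
# an older target population than under the source population.
w = normal_pdf(ages, 55.0, 10.0) / normal_pdf(ages, 35.0, 10.0)
w /= w.sum()

# Resample with those weights to synthesize the older cohort.
idx = rng.choice(len(ages), size=5_000, p=w)
older_ages = ages[idx]

print(ages.mean(), diabetes.mean())             # younger cohort, lower prevalence
print(older_ages.mean(), diabetes[idx].mean())  # older cohort, higher prevalence
```

Because the simulated diabetes risk rises with age, the reweighted cohort shows both a higher mean age and a higher prevalence, letting you probe how a model would fare on a population you never collected.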
But is synthetic data overhyped?
When it comes to privacy, “just because the data is ‘synthetic’ and does not directly correspond to real user data does not mean that it does not encode sensitive information about real people,” says Aaron Roth, a professor of computer and information science at the University of Pennsylvania. Some data-generation techniques have been shown to closely reproduce images or text found in the training data, for example, while others are vulnerable to attacks that make them fully regurgitate that data.
This might be fine for a firm like Datagen, whose synthetic data isn’t meant to conceal the identities of the individuals who consented to be scanned. But it would be bad news for companies that offer their solution as a way to protect sensitive financial or patient information.
Research suggests that the combination of two synthetic-data techniques in particular, differential privacy and generative adversarial networks (GANs), can produce the strongest privacy protections, says Bernease Herman, a data scientist at the University of Washington eScience Institute. But skeptics worry that this nuance could get lost in the marketing lingo of synthetic-data vendors, which won’t always be forthcoming about which techniques they are using.
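Differential privacy, the first of those techniques, works by adding carefully calibrated noise so that no single person’s record can be inferred from a released statistic. Here is a minimal sketch of its classic building block, the Laplace mechanism, applied to a toy query over invented data; production systems layer this into model training itself.

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon, rng):
    """Release the mean of `values` with epsilon-differential privacy
    via the Laplace mechanism. Each value is clipped to [lower, upper],
    so one person can shift the mean by at most (upper - lower) / n;
    that sensitivity calibrates the noise scale."""
    values = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = rng.laplace(0.0, sensitivity / epsilon)
    return values.mean() + noise

rng = np.random.default_rng(7)
ages = rng.integers(18, 90, size=10_000)

print(ages.mean())                      # the true mean
print(dp_mean(ages, 18, 90, 1.0, rng))  # a noisy, privacy-preserving mean
```

With 10,000 records the noise barely moves the answer, yet the mathematical guarantee holds for every individual in the data, which is exactly the property a GAN trained under differential privacy inherits.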