Synthetic data could ease the burden of training data for AI models

Machine learning models are gluttonous. They need to consume a lot of training data, and the right training data, if they’re going to work properly. In fact, one of the hurdles to capitalizing on machine learning technology is collecting enough data to satisfy the models.

But new techniques could ease that burden, according to David Schatsky, managing director at Deloitte LLP, and Rameeta Chauhan, a senior analyst at the firm.

In research published in November, Schatsky and Chauhan cited reducing the need for training data as one of five areas of progress in machine learning that will lower the barrier to entry for the enterprise. One way to get enough data is to use synthetic data: artificially manufactured data that looks and acts enough like real-world data to train AI models effectively.

Synthetic data can be valuable in situations where data is restricted, sensitive or subject to regulatory compliance, said Schatsky, who specializes in emerging technology. And it can advance projects that are hindered by a too-arduous process of acquiring the necessary training data.

Indeed, one of the first synthetic data examples Schatsky encountered was for computer vision, technology that enables machines to recognize faces or identify objects in digital photos. Researchers today are building sophisticated computer vision features where the technology can follow an eye gaze or detect an emotion on someone’s face. But gathering the amount of data needed, and labeling it, is laborious.

“And, so, what researchers did is they took a 3D-digital model of a human face and then manipulated it,” Schatsky said. They can generate as many permutations of facial expressions or eye positions as they want, and they can do so “quickly and cheaply, compared to collecting a comparable number of images,” he said.

Another synthetic data use case is training robots to perform complex and agile tasks such as picking up or manipulating objects of different shapes and sizes, which is a big challenge for roboticists. “One approach is to generate an initial training data set by having a human being demonstrate what they want done, in virtual reality,” Schatsky said. The human model moves a hand, picks up an object and puts it down. The entire set of actions is captured digitally, which means the images can be easily manipulated.

“The digital model of that behavior can be rerendered in countless ways, with different backgrounds or at different angles and so forth, without having a human do it a thousand times,” he said.
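
To make the rerendering idea concrete, here is a minimal Python sketch of how one captured demonstration might be multiplied into many training samples. It is an illustration only, not the method used by the researchers Schatsky describes. The recorded hand positions in `DEMO`, the background names, the `pick_and_place` label and the helper functions `rotate_about_z` and `make_variants` are all hypothetical, and a real pipeline would render images rather than rotate a handful of points.

```python
import itertools
import math

# Hypothetical recorded demonstration: 3D hand positions (x, y, z) over time.
DEMO = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.2), (0.2, 0.1, 0.2), (0.3, 0.1, 0.0)]


def rotate_about_z(point, degrees):
    """Rotate a 3D point around the vertical (z) axis to mimic a new viewing angle."""
    x, y, z = point
    theta = math.radians(degrees)
    return (
        x * math.cos(theta) - y * math.sin(theta),
        x * math.sin(theta) + y * math.cos(theta),
        z,
    )


def make_variants(demo, angles, backgrounds):
    """Yield one synthetic sample per (angle, background) combination."""
    for angle, background in itertools.product(angles, backgrounds):
        yield {
            "angle_deg": angle,
            "background": background,  # assumed scene name, not real imagery
            "trajectory": [rotate_about_z(p, angle) for p in demo],
            "label": "pick_and_place",  # the label comes for free; no manual tagging
        }


# Twelve viewing angles times three backgrounds, all from a single demonstration.
variants = list(
    make_variants(
        DEMO,
        angles=range(0, 360, 30),
        backgrounds=["kitchen", "warehouse", "lab"],
    )
)
print(len(variants))
```

Every variant carries its label automatically because it was generated rather than collected, which is the same property that makes the manipulated 3D face model cheap to label.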