Why synthetic data may be better than the real thing

To deploy successful AI, organizations need data to train models.

That said, high-quality data isn’t always easy to access – creating a major hurdle for organizations in launching AI initiatives.

This is where synthetic data can be so useful.

Unlike data collected from and measured in the real world, synthetic data is generated digitally through computer simulations, algorithms, simple rules, statistical modeling, and other techniques. It is an alternative to real-world data that nonetheless reflects it mathematically and statistically.
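As a rough illustration of the statistical modeling route, the sketch below fits a normal distribution to a handful of measurements and then draws new values from it. The input numbers, the function names, and the choice of a normal distribution are all assumptions made for this example; the vendors named in this article do not describe their methods at this level of detail.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and sample standard deviation of the real data."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from a normal distribution with the fitted parameters."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements (illustrative numbers, not from the article).
real_amounts = [120.0, 95.5, 130.25, 110.0, 101.75, 125.5, 99.0, 115.0]

mean, stdev = fit_gaussian(real_amounts)
synthetic_amounts = sample_synthetic(mean, stdev, n=5000)

# The synthetic sample should track the real one statistically,
# even though none of its values were ever measured.
print("real mean:     ", round(mean, 2))
print("synthetic mean:", round(statistics.mean(synthetic_amounts), 2))
```

A single fitted distribution is the simplest possible case. Production systems would typically model many correlated fields at once, but the principle of learning the statistics and then sampling from them is the same.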

Some experts even contend that synthetic data is better than data collected from real-world people, places, and things when it comes to training AI models. Constraints on using sensitive and regulated data are removed or reduced; datasets can be tailored to conditions that might otherwise be unobtainable; insights can be gained much more quickly; and training is less cumbersome and much more effective.

To that point, Gartner projects synthetic data to completely overshadow real data in AI models by 2030.

“The fact is you won’t be able to build high-quality, high-value AI models without synthetic data,” according to the Gartner report.

Leaders in synthetic data

To support accelerating demand, a growing number of companies are offering synthetic data – top and emerging companies in the space include Mostly AI, AI.Reverie, Sky Engine, and Datagen. Leading data engineering company Innodata has also entered the market, today launching an e-commerce portal where customers can purchase on-demand synthetic datasets and immediately train models.

“The kind of datasets we’re going after reflect real-world problems that CIOs and customers have come back to us with,” said CPO Rahul Singhal. “We began looking at: How do we create large amounts of training data that machines need?”

The Innodata AI Data Marketplace has been developed by in-house experts specifically for building and training AI/ML models. The data packs are off-the-shelf, easily previewable, unbiased, diverse, thorough, and secure, according to Singhal. Innodata is initially releasing 17 data packs in four languages that home in on financial services. These packs are textual, meaning they include invoices, purchase orders, and banking and credit card statements.

“One of the big needs in AI is diversity of data,” said Singhal. “We need lots of diverse ways that invoice can be created, we need visibility. It seems very easy, but it’s actually really complicated.”

The marketplace complements Innodata’s open-source repository of more than 4,000 datasets. These help in the prototyping of supervised and unsupervised ML projects.

The new synthetic datasets take that to the next level based on real-world information. “Machines learn by seeing real-world examples,” Singhal said.

For instance, he pointed to the many ways in which a credit card statement could be structured – one could have names listed on the right side; another on the left; one could use a table format; another a column format. To be accurate, machines must be shown those variations, with examples of sufficient quality and quantity. Innodata models have been provided with hundreds of templates to allow for such variations and to replicate true scenarios.
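The template idea can be sketched in a few lines of code. The example below is a minimal illustration only, assuming just the two variations Singhal mentions (name on the left or right, table or column layout); the function names, the placeholder cardholder, and the merchant list are hypothetical and do not describe Innodata's actual tooling.

```python
import itertools
import random

# The two variations mentioned in the article; real template sets would be far larger.
NAME_POSITIONS = ["left", "right"]
LAYOUTS = ["table", "column"]
WIDTH = 40  # hypothetical page width in characters


def render_statement(name, transactions, name_position, layout):
    """Render one plain-text statement using the chosen template options."""
    header = name.ljust(WIDTH) if name_position == "left" else name.rjust(WIDTH)
    if layout == "table":
        # One row per transaction, description and amount side by side.
        rows = [f"{desc:<28}|{amount:>11.2f}" for desc, amount in transactions]
    else:
        # Description on one line, amount indented on the next.
        rows = [f"{desc}\n  {amount:.2f}" for desc, amount in transactions]
    return "\n".join([header, *rows])


def generate_statements(n_per_template, seed=0):
    """Produce n_per_template synthetic statements for every template combination."""
    rng = random.Random(seed)
    merchants = ["Grocer", "Fuel", "Books", "Cafe"]  # made-up merchant names
    documents = []
    for name_position, layout in itertools.product(NAME_POSITIONS, LAYOUTS):
        for _ in range(n_per_template):
            transactions = [
                (rng.choice(merchants), round(rng.uniform(1, 200), 2))
                for _ in range(3)
            ]
            text = render_statement("A. Sample", transactions, name_position, layout)
            documents.append({"template": (name_position, layout), "text": text})
    return documents


statements = generate_statements(n_per_template=2)
print(statements[0]["text"])
```

Because every document records which template produced it, the same generator yields both the training input and its label, which is one reason templated synthetic documents are convenient for training.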

“Machine learning (ML) depends on a diversity of datasets,” Singhal said. “We create real-world data sets as much as possible and replicate what real-world document types will look like.”

Why synthetic data?

Among their many advantages, synthetic datasets are free from personal data and therefore not subject to compliance restrictions or privacy protection laws, Singhal pointed out. This also shields against security breaches. Biases are removed to help automate workflows and enable predictive modeling. Singhal also noted that “things in the real world are not pristine,” and that people can smudge banking statements or accidentally or purposely obfuscate things.

Ultimately, synthetic data will be an important tool in driving the adoption of AI, Singhal said.

The eventual intent with Innodata’s marketplace is to expand to third-party AI training datasets, as well as beyond documents to images, video, audio, and speech (the latter in response to the growth in conversational AI). These datasets will also span industries – telecom and utilities, transportation and logistics, energy services, pharmaceuticals, hospitality, insurance, retail, healthcare – and will be provided in an expanding number of languages so that data scientists can build from a global perspective.

“Our goal is to create a vibrant marketplace where companies can contribute datasets and monetize data sets,” Singhal said. “This has the potential of democratizing data for AI.”
