Pros and cons of synthetic data

Dr. Winnie Tang January 13, 2023 10:41

Waymo has been using synthetic data to generate lifelike driving datasets, including complex and diverse scenarios, to test the reaction of its self-driving vehicles. Photo: Reuters

Artificial intelligence (AI) depends on the availability of massive amounts of data. As many countries have tightened privacy protection measures, synthetic data, which does not violate personal privacy regulations, has emerged as an alternative. Its cost is estimated at only 1% of that of real data, which makes it attractive to enterprises.

Real data may not reflect the truth because races and nationalities are unevenly represented in it, while synthetic data can reduce such bias. In addition, synthetic data can provide more diverse samples, including rare cases, making up for the difficulty of obtaining such information from real situations.

Synthetic data can take the form of text, media (video, image, sound) or tabular data. Depending on how much real data it contains, it can be roughly divided into three categories: fully synthetic, partially synthetic and hybrid.

Today, it is used in a variety of industries, ranging from banking and medicine to self-driving cars.

American Express is reported to have been experimenting for two years with deepfake videos and fake data, such as credit card transactions, in order to improve the ability of AI algorithms to detect fraudulent behaviour. JPMorgan Chase has also used synthetic data to detect money laundering, as well as to develop innovative products and services when historical data does not meet its needs.

In the medical field, Roche, the Swiss pharmaceutical company, working in partnership with a startup, uses synthetic data instead of patients' data in clinical research to improve its analytical ability. In Germany, the Charité Lab for Artificial Intelligence in Medicine (CLAIM), which is involved in stroke research, pointed out that each patient's brain structure is unique, so anonymising the images is of little significance. The lab therefore generated synthetic data that preserves the statistical and predictive properties of the real data.

For the past two years, the self-driving company Waymo, owned by Alphabet, has been using synthetic data to generate lifelike driving datasets, including complex and diverse scenarios, such as those involving cyclists or changes in the speed of approaching vehicles, to test how its vehicles react.

Synthetic data is better suited to straightforward problems, such as fraud detection or credit scoring, according to industry insiders. However, it cannot cope with complex and changeable situations. The Economist gave an example: in the past, the purchase of a one-way air ticket would be regarded by an automatic detection model as an obvious predictor of fraud, but during the COVID-19 pandemic many customers were forced to buy one. Another example is face recognition, which struggles to function when wearing a mask becomes the norm.

Further, synthetic data may not be adequate when accurate, real data is needed for detailed planning. In the U.S., the American Community Survey (ACS) is distributed to 1% of the population once a year to study the relationship between education, health, income, demographics and geography. The authority has been criticised for attempting to replace real data with synthetic data: while this may be good enough for creating large-scale estimates, poor and small communities with limited resources would suffer.

Synthetic data is an emerging industry, and it will take different industries and startups working together to explore its potential.

Dr. Winnie Tang

Adjunct Professor, Department of Computer Science, Faculty of Engineering; Department of Geography, Faculty of Social Sciences; and Faculty of Architecture, The University of Hong Kong
