Knowing when to limit your data dramatically affects the quality of your AI. How do you know that your AI data is enough?


Whether it’s due to a lack of funding, lack of know-how or censorship, some governments and entities are shrinking the amount of data that they incorporate into their AI. Does this compromise the integrity of AI results?

The case for shrinking the data

Intentional data shrinking is occurring as a matter of policy and expediency.

Roya Ensafi, assistant professor of computer science and engineering at the University of Michigan, discovered that censorship was increasing in 103 countries.

Most censorship actions “were driven by organizations or internet service providers filtering content,” Ensafi reported. “While the United States saw a smaller uptick in blocking activity, the groundwork for such blocking has been put in place.”

In other industry sectors, analytics providers and companies work hard to shrink the amount of data they admit into their processing and data repositories. They only want data that they deem relevant to the problem they are trying to solve.

In 2018, the U.S. Census Bureau moved to reduce the amount of data it was collecting on citizens, even if it meant less accurate data, in order to protect citizen privacy.

All of these use cases have clear-cut business objectives, but what is the net impact of their data exclusions on the quality of the AI that operates on the remaining data?


How AI “misses” when data is missing

Sanjiv Narayan, professor of medicine at Stanford University School of Medicine, explains how missing data can impact healthcare.

“Think of height in the U.S.,” said Narayan. “If you collected them and put them all onto a chart, you’d find overlapping groups or clusters of taller and shorter people, broadly indicating adults and children and those in between. However, who was surveyed to get the heights? Was this done during the weekdays or on weekends, when different groups of people are working? If heights were measured at medical offices, people without health insurance may be left out. If done in the suburbs, you’ll get a different group of people compared to those in the countryside or those in cities. How large was the sample?”
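Narayan's point about who gets surveyed can be made concrete with a small sketch. The heights below are invented for illustration, not survey data, and the helper name mean_height is hypothetical. The sketch compares the average height of a whole population with the average from a sample that reaches only adults.

```python
from statistics import mean

# Synthetic heights in centimeters, made up for illustration only.
population = {
    "adults": [160, 165, 170, 175, 180, 185],
    "children": [110, 120, 130, 140],
}


def mean_height(groups, included):
    """Average height across only the named groups."""
    heights = [h for name in included for h in groups[name]]
    return mean(heights)


# Everyone is measured.
full_mean = mean_height(population, ["adults", "children"])

# Only adults are measured, as might happen if the survey ran at workplaces.
adults_only_mean = mean_height(population, ["adults"])

# How far the narrow sample drifts from the whole population.
bias = adults_only_mean - full_mean

print(f"Full population mean: {full_mean} cm")
print(f"Adults-only mean:     {adults_only_mean} cm")
print(f"Overestimate:         {bias} cm")
```

The calculation is the same in both cases. Only the choice of who was included changes the answer, which is the risk when data is left out without anyone noticing.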

The Amazon hiring algorithm that attracted controversy in 2018 illustrates how unrepresentative data skews results.

Amazon’s AI-powered recruiting engine was trained on historical data about successful job candidates from a time when most candidates were male. Observing this pattern, the AI taught itself that male candidates were preferable to female candidates. Consequently, the company missed out on many qualified female applicants.

What companies can do

The cost of acquiring and processing data and the emphasis on faster times to insight have both driven companies to consider data exclusion.

This makes sense: The more data you can exclude upfront, the less time it will take to process results and the less compute you will consume. But how far should you dare to narrow the data lens?

Companies can make good decisions if they do these three things:

  1. Consider the tradeoffs. If you exclude data on customers who don’t live within 25 miles of your office, will you miss those who would come from farther away if they knew of your services?
  2. Be prepared to widen the data lens. You might discover that your data is inaccurate for the patients you are trying to analyze. Do you have enough data to draw a sound analytic conclusion? If the answer is no, widen the lens and check whether accuracy improves.
  3. Explain your data sources and limitations to users. Those who rely on the data you provide need an upfront understanding of your data and its limitations. For instance, if a user wants to look at transportation trends over the past 10 years, but you only have eight years of data, they need to know that.
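One way to put these three checks into practice is sketched below. Everything in it is an assumption made for illustration: the Customer record, the exclusion_report and coverage_gap helpers, and the sample values are hypothetical, not taken from any real system. The sketch reports what a distance cutoff leaves out, repeats the report with a wider cutoff, and lists the years a user asked for that the data does not cover.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    # Hypothetical record layout, for illustration only.
    customer_id: str
    miles_from_office: float
    made_purchase: bool


def exclusion_report(customers, max_miles):
    """Summarize what a distance cutoff keeps and what it leaves out."""
    dropped = [c for c in customers if c.miles_from_office > max_miles]
    total_buyers = sum(c.made_purchase for c in customers)
    dropped_buyers = sum(c.made_purchase for c in dropped)
    return {
        "max_miles": max_miles,
        "kept": len(customers) - len(dropped),
        "dropped": len(dropped),
        # Share of all buyers that the cutoff would hide from the analysis.
        "share_of_buyers_dropped": (
            dropped_buyers / total_buyers if total_buyers else 0.0
        ),
    }


def coverage_gap(requested_years, available_years):
    """Return the requested years that the data does not cover."""
    return sorted(set(requested_years) - set(available_years))


# Invented sample data.
sample = [
    Customer("a", 5.0, True),
    Customer("b", 30.0, True),
    Customer("c", 10.0, False),
    Customer("d", 40.0, False),
]

# Checks 1 and 2: see what the 25-mile cutoff costs, then widen the lens.
for radius in (25, 50):
    print(exclusion_report(sample, radius))

# Check 3: tell users which requested years are missing.
print("Missing years:", coverage_gap(range(2013, 2023), range(2015, 2023)))
```

A report like this does not say whether an exclusion is right or wrong. It only makes the cost of the exclusion visible, so the decision to narrow or widen the lens is a deliberate one.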