Data scientists have to make decisions about which data to include in data repositories. To make this decision-making process easier, learn tips for maintaining control of your data funnel.


As of 2022, 2.5 quintillion bytes of new data are created worldwide each day. While some of this data will be useful for analysis, sorting through it can be time-consuming and difficult. By creating an effective data funnel, you’ll be able to more easily filter for the data you need.


What is a data funnel?

A data funnel is the practice of narrowing how much data you allow into your master data repository.

A good way to think about a data funnel is to compare it to the hiring funnel that a human resources department applies when it uses software to screen job applicant résumés. HR enters the requirements for an open position into analytics software, which screens incoming résumés and narrows them down to a smaller pool of applicants for that position. This allows HR and interviewing managers to focus on more important tasks rather than sorting through the résumés manually.

Funneling works on data, too. In one case, a life sciences company studying a particular molecule for its disease-fighting potential eliminated all incoming research data sources that didn’t mention the molecule by name. The goals were to save storage and processing as well as to arrive at insights sooner. While filtering out all that extraneous data worked for this company, controlling a data funnel is a balancing act between how much data you need and how much data you can afford to store and process.
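To make the life sciences example concrete, here is a minimal Python sketch of a keyword-based funnel. It is an illustration only, not the company's actual pipeline: the function name funnel_by_keyword, the sample documents and the molecule name "resveratrol" are all hypothetical stand-ins.

```python
import re
from typing import Iterable, Iterator


def funnel_by_keyword(documents: Iterable[str], keyword: str) -> Iterator[str]:
    """Yield only the documents that mention `keyword` as a whole word.

    Hypothetical helper: matching is case-insensitive and purely textual,
    so synonyms or alternate spellings of the keyword are NOT caught.
    """
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    for doc in documents:
        if pattern.search(doc):
            yield doc


# Made-up incoming sources; "resveratrol" is an assumed example molecule.
incoming = [
    "Trial results for resveratrol in mouse models",
    "Quarterly facilities maintenance report",
    "Review: Resveratrol bioavailability",
]

kept = list(funnel_by_keyword(incoming, "resveratrol"))
print(f"Kept {len(kept)} of {len(incoming)} incoming documents")
```

A filter this strict is cheap to run, but it also shows the risk side of the balancing act: any source that refers to the molecule by another name never reaches the repository.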

How do you decide which data is important?

The sheer cost of storage and processing, whether it is internal or in the cloud, is forcing companies to evaluate just how much data they need for business analytics.

In some cases, deciding which data to throw away is easy. You probably don’t want the noise of network and machine handshakes in your data, but deciding which subject-related data to exclude is harder. There’s also the risk that analytics teams might miss an important insight because of excluded data.

For example, had it relied only on the data it would normally collect, a U.K. retailer might not have discovered that stay-at-home housewives made the bulk of their online purchases while their husbands were away at soccer games.

Unexpected but impactful insights like this are why IT and end business groups must be careful about how far they narrow the funnel for incoming data.

3 best practices for controlling a data funnel

Outline the use cases that your analytics are supporting and the data that you think they need

This should be a collaborative exercise between IT/data science and end users. Do you want to include social media product complaints when you are analyzing your sales and revenue data? And if you’re studying disease rates in your medical service area in New York, do you care about what’s going on in California?

Determine how accurate your analytics need to be

The gold standard for analytics accuracy is that analytics must reach at least 95% accuracy when compared to what human subject matter experts would conclude—but do you always need 95%?

You might need 95% accuracy if you are assessing the likelihood of a medical diagnosis based on certain patient health conditions, but you might need only 70% accuracy if you’re forecasting what climate conditions might be like 20 years from now.

Accuracy requirements have a bearing on the data funnel, and you might be able to exclude more data and narrow your funnel if you’re only looking for general, longer-term trends.

Test the accuracy of your analytics on a regular basis

If your analytics demonstrate 95% accuracy when first implemented but decline to 80% over time, it makes sense to recheck the data you’re using and to recalibrate the data funnel.
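A periodic check like this can be sketched in a few lines of Python. The sketch assumes that accuracy means the share of analytics outputs that agree with the conclusions of human subject matter experts, as described earlier; the function names, the labels and the 0.95 default target are illustrative assumptions rather than a standard API.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], expert_labels: Sequence[str]) -> float:
    """Return the fraction of predictions that match the expert conclusions."""
    if not predictions or len(predictions) != len(expert_labels):
        raise ValueError("Need two non-empty sequences of equal length")
    matches = sum(p == e for p, e in zip(predictions, expert_labels))
    return matches / len(predictions)


def needs_recalibration(current: float, target: float = 0.95) -> bool:
    """Flag the data funnel for review when accuracy falls below the target."""
    return current < target


# Made-up review sample: 10 cases labeled by experts and by the analytics.
expert = ["a", "b", "a", "a", "b", "a", "b", "b", "a", "a"]
predicted = ["a", "b", "a", "a", "b", "a", "b", "b", "b", "b"]

current = accuracy(predicted, expert)
if needs_recalibration(current):
    print(f"Accuracy is {current:.0%}; recheck the data sources and the funnel")
```

Running a check like this on a fixed schedule, with a fresh sample of expert-reviewed cases each time, is one way to notice a decline before it affects decisions. The target should come from the accuracy requirement you set for the use case, not from the default shown here.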

Perhaps new data sources that weren’t originally available are now available and should be used. Adding these data sources will widen the data funnel, but if doing so boosts accuracy levels, expanding the data funnel is worth the cost.