Bringing big data governance and security up to the level of practice applied to structured data is critical. Here are five ways to get there.

Businesses have to govern their data to keep it clean and organized for better use. Data governance is a collection of processes, roles, policies, standards and metrics that ensure the effective and efficient use of information, enabling an organization to achieve its goals with that data.

Organizations focus on data governance for their systems of record and structured data, but what about big, unstructured data like photos, videos, digitized hardcopy documents and continuous text messages from social media?

Ramesh Koovelimadhom of RCG Global Services pointed out several weaknesses in big data governance:

  • Relying on data scientists who lack IT’s skills in setting up standards and procedures for data.
  • A lack of discipline and process enforcement in the development of data schemas.
  • Not cleaning up bad data.
  • Not supporting people and processes with technology.

“Successful data governance solves business problems by identifying root causes of data problems that impede business effectiveness,” Koovelimadhom said.

So, how can we improve the governance of unstructured data, which now comprises roughly 80% of corporate data under management? Here are five ways to tackle the problem in the enterprise.

1. Use trusted data sources

The data that organizations have directly created and accumulated is trusted, but most organizations also acquire data from outside cloud sources as they build an aggregated data repository for analytics.

How do you know that data from these outside sources is trustworthy? You don’t—unless you vet the data provider, understand where the provider has gotten its data, and know how the provider has prepared and secured the data. If you are in a sensitive industry like healthcare, you’ll also want to know that data on individual patients has been anonymized to meet privacy requirements.

Checking vendor governance standards to ensure they align with your own should be a routine task performed before any contract is entered into with a vendor. Prior to signing a contract, you should also request the vendor’s latest IT audit so recent governance and security performance can be reviewed.

2. Establish unstructured data guidelines for user access and permissions

Structured, system-of-record data has firm rules in place for user access and permissions, but unstructured data may not. Unstructured data access should play by the same rules that structured data does.

In other words, access to unstructured data should be limited to those users who require the data. Within the category of access, there are also likely to be levels of permission, with some users getting more access to data than others, depending on job function or role.

These user access decisions should be made jointly by IT and end-user departments. At a minimum, access should be reviewed annually, and procedures should be in place so that if an individual leaves the company, access is immediately removed as part of the separation process.
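As a rough illustration of role-based access with levels of permission and removal at separation, here is a minimal Python sketch. The role names, data categories, permission levels and the `AccessPolicy` class are all hypothetical, invented for this example; a real deployment would rely on the organization's identity and access management system rather than an in-memory table.

```python
from dataclasses import dataclass, field

# Hypothetical permission levels, ordered from least to most access.
LEVELS = {"none": 0, "read": 1, "write": 2}


@dataclass
class AccessPolicy:
    # Maps a role name to the permission level granted per data category.
    role_grants: dict = field(default_factory=dict)
    # Maps a user ID to a role; users not listed here have no access.
    user_roles: dict = field(default_factory=dict)

    def grant(self, role, category, level):
        """Give a role a permission level on one category of data."""
        self.role_grants.setdefault(role, {})[category] = level

    def assign(self, user, role):
        """Attach a user to a role, based on job function."""
        self.user_roles[user] = role

    def offboard(self, user):
        """Remove all access as part of the separation process."""
        self.user_roles.pop(user, None)

    def is_allowed(self, user, category, needed):
        """Return True only if the user's role grants at least `needed`."""
        role = self.user_roles.get(user)
        granted = self.role_grants.get(role, {}).get(category, "none")
        return LEVELS[granted] >= LEVELS[needed]


policy = AccessPolicy()
policy.grant("claims_analyst", "scanned_documents", "read")
policy.grant("records_admin", "scanned_documents", "write")
policy.assign("alice", "claims_analyst")
policy.assign("bob", "records_admin")

print(policy.is_allowed("alice", "scanned_documents", "read"))   # True
print(policy.is_allowed("alice", "scanned_documents", "write"))  # False

policy.offboard("bob")  # bob leaves the company
print(policy.is_allowed("bob", "scanned_documents", "read"))     # False
```

The sketch denies by default: a user with no role, or a role with no grant on a category, gets no access.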

3. Secure all data

The basics of data security are trusted networks; strong user access methods and monitoring; perimeter monitoring that checks for vulnerabilities and potential breaches; and user habits that align with security best practices (such as not sharing passwords and not copying data to thumb drives that can be carried away). If data is stored on hardware at the edge of the enterprise, that hardware should be physically caged and secured when possible, so that only authorized personnel can gain access.

Most of these standards and practices are in place with structured data but not necessarily with data that is unstructured, such as Internet of Things data.

Unstructured data should be governed by the same security guidelines and practices as its structured counterpart.

4. Use logging and traceability

Robust logging and traceability software should be continuously at work where big data is concerned. Who or what is accessing the data? When and from where? If an issue arises, what event initiated it?

Logging, tracing and (in the future) observability all speed time to problem resolution and are integral to security.
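To make the who, what, when and where questions concrete, here is a minimal sketch of structured access logging using Python's standard `logging` and `json` modules. The logger name, field names, resource path and IP address are assumptions made for illustration, not a prescribed schema; a production setup would typically ship these records to a central log platform.

```python
import json
import logging
from datetime import datetime, timezone

# Hypothetical logger dedicated to data access events.
access_log = logging.getLogger("data_access")
access_log.setLevel(logging.INFO)
access_log.addHandler(logging.StreamHandler())


def build_access_record(actor, action, resource, source_ip, when=None):
    """Capture who or what touched which data, when, and from where."""
    when = when or datetime.now(timezone.utc)
    return {
        "actor": actor,          # user or service account
        "action": action,        # e.g. "read", "write", "delete"
        "resource": resource,    # the data object that was accessed
        "source_ip": source_ip,  # where the request came from
        "timestamp": when.isoformat(),
    }


def log_access(actor, action, resource, source_ip):
    """Emit one JSON line per access event so it can be searched later."""
    record = build_access_record(actor, action, resource, source_ip)
    access_log.info(json.dumps(record))
    return record


log_access("svc-ingest", "read", "iot/sensor-stream-01", "10.0.0.12")
```

Writing each event as a single JSON line keeps the log machine-readable, which helps when tracing an issue back to the event that initiated it.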

5. Dispose of bad data

As an upfront data cleaning practice, bad data should be eliminated as raw, incoming big data streams in. There is a lot of bad big data, whether it is documents that aren’t needed, IoT streams that contain as many device handshakes as salient information, or superfluous social media threads.

The data preparation process that is part of data ingestion should eliminate this data so it never takes up real estate in storage. Big data repositories should also be revisited and refreshed regularly, with data that is no longer needed discarded.
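The two practices above, filtering at ingestion and periodically discarding stale data, can be sketched in a few lines of Python. The record layout, the `handshake` type, the empty-payload rule and the retention window are all hypothetical; what counts as bad data is a decision for each organization's governance policy.

```python
from datetime import datetime, timedelta, timezone


def is_salient(record):
    """Hypothetical rule: drop device handshakes and empty payloads."""
    if record.get("type") == "handshake":
        return False
    return record.get("payload") not in (None, "", {})


def clean_stream(records):
    """Yield only salient records so bad data never reaches storage."""
    for record in records:
        if is_salient(record):
            yield record


def purge_expired(stored, max_age, now):
    """Keep only records stored within the retention window."""
    cutoff = now - max_age
    return [r for r in stored if r["stored_at"] >= cutoff]


incoming = [
    {"type": "handshake", "device": "sensor-1"},
    {"type": "reading", "device": "sensor-1", "payload": {"temp_c": 21.4}},
    {"type": "reading", "device": "sensor-2", "payload": None},
]
kept = list(clean_stream(incoming))
print(len(kept))  # 1

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
stored = [
    {"id": "old", "stored_at": now - timedelta(days=400)},
    {"id": "new", "stored_at": now - timedelta(days=10)},
]
fresh = purge_expired(stored, timedelta(days=365), now)
print([r["id"] for r in fresh])  # ['new']
```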