Mitigating the Risk to Personally Identifiable Information in Unstructured Data

September 30, 2022 Information Management
By Siobhan Fagan

The number of cyberattacks on both individuals and businesses has been on the rise. With the amount of data companies are collecting today — and the sheer amount of personal details users share on their social media accounts — hackers are finding different, often easier pathways in than in the past.

Earlier this year, Google announced it was expanding the types of personal information that users can request to have removed from the company's search results, particularly the kind of information that may pose a risk for identity theft, such as confidential log-in credentials.

The news was received with a certain level of enthusiasm, but the move is far from resolving the issue. While it may help Google Search users, there remains a large volume of unstructured data across company tools and technology systems that contains very sensitive personally identifiable information (PII).

The State of Data in 2022

According to San Jose, Calif.-based cloud provider NetApp, unstructured data consists of datasets (typically large collections of files) that are not stored in a structured database format. Unstructured data has an internal structure, but it is not predefined through data models. It might be human generated or machine generated in a textual or a non-textual format, and may consist of video, audio and images as well as data contained in the billions of forms that enter the workplace on a daily basis.

While it is impossible to confirm how much unstructured data containing PII enterprises hold, the amount is growing. Komprise's 2022 Unstructured Data Management report, published at the end of August, states that more than 50% of organizations are now managing 5PB or more of data, compared with less than 40% just one year prior, and most are spending more than 30% of their IT budget on storage and backups.

What About PII?

Nick Heudecker, lead for market strategy at data observability company Cribl, said there are two types of data containing PII. The first is data associated with some kind of business process, like booking a flight or hotel, or requesting a ride share. PII associated with business processes is easier to manage because it is visible to product managers and data stewards.

The second type of data, he said, is the exhaust from these processes. Things like application, system and security logs are the main culprits here. Log data can contain everything from PII to cleartext passwords, and it is especially challenging to deal with because of its volume.

“We regularly see companies ingesting 20TB or more of log data daily. Dealing with that kind of volume is a massive undertaking,” he said.

Heudecker added that to keep PII from contaminating downstream systems, organizations must protect the data before it lands in the various data silos. This means they need a method to redact, mask or encrypt any PII in flight, before distributing it to any other destination.

“Having a single ingest pipeline consolidates the rules and logic for managing PII and means you're not digging through data lakes and warehouses trying to clean things up,” he said.
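
To make the idea of a single ingest pipeline concrete, here is a minimal Python sketch of redacting PII from log lines in flight. The patterns, placeholder labels and function names are illustrative assumptions, not Cribl's product or API, and real PII detection would need far broader coverage than three regular expressions.

```python
import re

# Illustrative patterns only (assumed for this sketch); real log data
# contains many more kinds of PII than these three.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "password": re.compile(r"(?i)(password=)\S+"),
}


def redact(line: str) -> str:
    """Mask PII in one log line before it is forwarded anywhere."""
    line = PATTERNS["ssn"].sub("[SSN]", line)
    line = PATTERNS["email"].sub("[EMAIL]", line)
    # Keep the "password=" key so the log stays readable, drop the value.
    line = PATTERNS["password"].sub(r"\1[REDACTED]", line)
    return line


def pipeline(lines):
    """Single ingest point: every line passes through redact() once."""
    for line in lines:
        yield redact(line)


if __name__ == "__main__":
    sample = ["user=jane.doe@example.com password=hunter2 ssn=111-22-3333"]
    for clean in pipeline(sample):
        print(clean)
```

Because every destination reads from the output of `pipeline`, the redaction rules live in one place rather than being reapplied in each data lake or warehouse.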

The Role of Encryption

It comes as no surprise that for Shaun McBrearty, co-founder and chief scientist of encryption company Vaultree, protecting data, structured or not, should always be done by encryption. Organizations, however, are reluctant to take that route because of what they perceive as encryption’s failings, he said.

“Unfortunately, current encryption mechanisms have a number of shortcomings which make the use of encryption difficult in many use cases, with these issues being particularly prevalent in the unstructured data setting,” he said.

Adding to the challenge is the idea that IT professionals often associate encryption with complexity, poor usability, loss of functionality and poor or reduced performance. According to McBrearty, that's because only a small fraction of software applications that work with unstructured data formats support the ability to encrypt data at the file-level.

Instead, support for doing so is limited to software that is used across a broad range of markets, such as Microsoft Word and Adobe Acrobat, as well as other tools such as 7Zip. And to McBrearty, such applications encrypt the entire content of unstructured files, as opposed to the individual items of sensitive data, including PII, that may be contained within the file itself.

While extremely secure, encrypting data in this manner prevents unstructured files from undergoing any form of processing, such as searching the file for the occurrence of a word. Search is, of course, the most common way to access unstructured data.

“There is hope on the horizon, however,” he said. “Technologies such as Homomorphic Encryption and Searchable Encryption have been the focus of significant research and development in recent years, with commercial solutions appearing in recent times. These technologies are designed to enable data to be encrypted at the file-level, be it structured or unstructured, whilst retaining the ability to be processed.”
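
The general idea of searching without exposing plaintext can be illustrated with a much simpler stand-in: a keyed-hash keyword index, sometimes called a blind index. The Python sketch below is an assumption-laden illustration, not homomorphic encryption and not any vendor's searchable encryption scheme. It omits encrypting the file body itself, which would be done separately with a real cipher, and the function names are hypothetical.

```python
import hashlib
import hmac
import re


def keyword_tokens(text: str, key: bytes) -> set:
    """Build a searchable index of keyed hashes, one per distinct word.

    The index can be stored next to the (separately encrypted) file.
    Without the key, the tokens do not reveal the words directly.
    """
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    return {hmac.new(key, w.encode(), hashlib.sha256).hexdigest() for w in words}


def matches(index: set, query: str, key: bytes) -> bool:
    """Check for an exact word match by hashing the query with the same key."""
    token = hmac.new(key, query.lower().encode(), hashlib.sha256).hexdigest()
    return token in index


if __name__ == "__main__":
    key = b"demo-key-for-illustration-only"
    index = keyword_tokens("Invoice for Andy Rogers", key)
    print(matches(index, "rogers", key))
    print(matches(index, "smith", key))
```

A scheme like this only supports exact-word lookups and still leaks which files share a word, which is part of why the research McBrearty points to is more involved.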

Data Masking

Unstructured data that contains PII is a problem that mushrooms by the day in many enterprises. But Andy Rogers, senior assessor of Tampa, Fla.-based cybersecurity firm Schellman, said there is a solution to that problem: data masking.

Data masking can be used anywhere that testing or data anonymization is necessary, Rogers said. Some examples are databases, software testing, sales demos, user training and other situations that might require protecting real data by transforming it into formatted fake data (e.g., social security numbers, credit card data, addresses, names). These forms of PII are a major privacy concern, especially under regulations such as HIPAA and GDPR.

Using a data shuffling technique, a social security number might go from 111-22-3333 to 123-23-1133, and a name like Andy Rogers might become Aydn Rgeosr. The technique also removes the ties between the name and the SSN to further prevent someone from putting it all back together. The risk, however, is that the data could be unshuffled if the algorithm used to shuffle the data is broken.
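
As a rough illustration of the shuffling Rogers describes, the Python sketch below rearranges the characters of a value while keeping separators in place, and separately shuffles one column across records to break the tie between a name and an SSN. The function names are hypothetical, and Python's `random` module is not cryptographically secure, so this shows the idea rather than a production masking routine.

```python
import random


def shuffle_preserving_format(value: str, rng: random.Random) -> str:
    """Shuffle the letters and digits of a value, leaving separators fixed."""
    chars = [c for c in value if c.isalnum()]
    rng.shuffle(chars)
    shuffled = iter(chars)
    # Re-insert shuffled characters wherever the original had a letter/digit.
    return "".join(next(shuffled) if c.isalnum() else c for c in value)


def shuffle_column(rows, field, rng: random.Random):
    """Reassign one field's values across rows to break links between fields."""
    values = [row[field] for row in rows]
    rng.shuffle(values)
    return [{**row, field: v} for row, v in zip(rows, values)]


if __name__ == "__main__":
    rng = random.Random()
    print(shuffle_preserving_format("111-22-3333", rng))
    rows = [
        {"name": "Andy Rogers", "ssn": "111-22-3333"},
        {"name": "Jane Doe", "ssn": "444-55-6666"},
    ]
    print(shuffle_column(rows, "ssn", rng))
```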

Another method of masking data is substitution. This technique applies the format of the real data to a random generator, producing false data that looks comparable to the original. This can be a great way to protect PII because a compromise matters far less when the data accessed is not accurate.
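
A minimal Python sketch of substitution, under the assumption that "format" simply means the positions of digits, letters and separators: each digit is replaced by a random digit and each letter by a random letter of the same case. The function name is hypothetical, and a real tool would likely draw from realistic name and address lists instead.

```python
import random
import string


def substitute_like(value: str, rng: random.Random) -> str:
    """Generate fake data with the same shape as the real value."""
    out = []
    for c in value:
        if c.isdigit():
            out.append(rng.choice(string.digits))
        elif c.isalpha():
            pool = string.ascii_uppercase if c.isupper() else string.ascii_lowercase
            out.append(rng.choice(pool))
        else:
            # Separators such as "-" or " " are kept so the format survives.
            out.append(c)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random()
    print(substitute_like("111-22-3333", rng))
    print(substitute_like("Andy Rogers", rng))
```

Unlike shuffling, nothing in the output is derived from the original characters, so there is no algorithm to reverse.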

Some applications and databases can do the encryption piece or substitution on the fly in production. Anyone looking into a database or application that was doing something like this would only observe encrypted information. In these cases, Rogers said, when a query to the database or application asks for a set of information, it uses the reference data to map to the real data.
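
The mapping from reference data back to real data can be sketched as a small lookup table. Note that this Python example swaps in random tokens rather than encryption, and the `TokenVault` class and its `authorized` flag are hypothetical names for illustration only. In practice the vault itself would need to be stored and protected separately from the masked data.

```python
import secrets


class TokenVault:
    """Maps real values to opaque reference tokens and back."""

    def __init__(self):
        self._real_to_token = {}
        self._token_to_real = {}

    def tokenize(self, real: str) -> str:
        """Return the token for a real value, creating one on first use."""
        if real not in self._real_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._real_to_token[real] = token
            self._token_to_real[token] = real
        return self._real_to_token[real]

    def detokenize(self, token: str, authorized: bool) -> str:
        """Map a token back to the real value only for authorized callers."""
        if not authorized:
            return token  # unauthorized readers only ever see the reference
        return self._token_to_real[token]


if __name__ == "__main__":
    vault = TokenVault()
    stored = vault.tokenize("111-22-3333")  # what the database would hold
    print(stored)
    print(vault.detokenize(stored, authorized=True))
```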

“It's important to do this in all of the situations listed above to protect individuals' privacy and the primary reason that regulations like GDPR and HIPAA exist,” he said. “It's also another reason that hashes are leveraged to verify the integrity of production data.”

Other Data Protection Techniques

There are other ways of protecting PII data. Boris Jabes, CEO and co-founder of Washington DC-based Census, for instance, offers two other techniques to protect personal information in unstructured data:

1. Access Control

By controlling who has access to data, Jabes said organizations can reduce the risk of unauthorized individuals accessing it. This can be done through a number of methods, such as user authentication, role-based access control and privilege rules.

2. Activity Monitoring

This is a great way to detect if someone has accessed personal data without authorization. By monitoring activity, organizations can quickly identify when something suspicious is happening and thus take appropriate action before it's too late. Activity logs can also be used to help investigate after an incident has occurred.
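
The two techniques work naturally together, as the following Python sketch suggests: a role-based permission check that records every attempt in an activity log, plus a helper that counts denied attempts per user. The role names, actions and function names are assumptions made for illustration and do not describe Census's product.

```python
from collections import Counter
from datetime import datetime, timezone

# Hypothetical roles and permissions for this sketch.
ROLE_PERMISSIONS = {
    "analyst": {"read_masked"},
    "data_steward": {"read_masked", "read_pii"},
    "admin": {"read_masked", "read_pii", "export_pii"},
}


def access(log: list, user: str, role: str, action: str, resource: str) -> bool:
    """Check a role-based permission and record the attempt either way."""
    allowed = action in ROLE_PERMISSIONS.get(role, set())
    log.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "resource": resource,
        "allowed": allowed,
    })
    return allowed


def denied_counts(log: list) -> dict:
    """Count denied attempts per user, a simple signal for review."""
    return dict(Counter(e["user"] for e in log if not e["allowed"]))


if __name__ == "__main__":
    log = []
    access(log, "ana", "analyst", "read_pii", "customers")
    access(log, "sam", "data_steward", "read_pii", "customers")
    print(denied_counts(log))
```

Logging allowed requests as well as denied ones is what makes the log useful for the after-the-fact investigation mentioned above.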

What's Next?

As companies continue to collect and store massive amounts of data and PII, there is likely to be increased scrutiny into the processes and measures taken to protect that data. Governments are already stepping in with rules and regulations, and the trend is expected to intensify.

Companies would be well served to pay attention to new developments in that area to ensure they remain compliant, protect their most precious asset and avoid fines and lawsuits for lack of adequate data governance.

