
Overcoming Common Pitfalls to Data Discovery and Classification

by Ellen Zhang on Tuesday, June 13, 2023



18 Data Scientists & Security Pros Reveal the Most Common Pitfalls to Data Discovery and Classification

A data classification process makes it easier to locate and retrieve data, and it's an important process for any data security program, as well as for risk management and compliance. By leveraging tools that can automatically locate and identify your sensitive data, companies can gain a deeper understanding of what data they possess, where it exists within the organization, and how sensitive it is, allowing them to apply the appropriate level of security to protect the company's most sensitive information.
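To make the idea of automated discovery concrete, here is a minimal, illustrative sketch of a regex-based scan over a file share. The patterns and the /data/shares path are assumptions for the example; commercial discovery tools use far richer detection and validation logic.

```python
import os
import re

# Illustrative patterns only; real discovery tools use far more robust detection logic.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def scan_tree(root):
    """Walk a directory tree and report which files appear to contain which kinds of sensitive data."""
    findings = {}
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", errors="ignore") as handle:
                    text = handle.read()
            except OSError:
                continue
            hits = [label for label, pattern in PATTERNS.items() if pattern.search(text)]
            if hits:
                findings[path] = hits
    return findings

if __name__ == "__main__":
    # "/data/shares" is a hypothetical starting point for the scan.
    for path, labels in sorted(scan_tree("/data/shares").items()):
        print(path, "->", ", ".join(labels))
```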

Despite its importance, many companies struggle with the data discovery and classification process. To gain some insight into the most common pitfalls companies face when it comes to discovering and properly classifying data, we reached out to a panel of data scientists and security leaders and asked them to answer this question:

"What are the most common pitfalls to data discovery and classification and how can you avoid them?"

Meet Our Panel of Data Scientists & Security Leaders:

Read on to find out what our experts had to say about the most common pitfalls to data discovery and classification and how you can avoid them.


Brian Seipel

@GetSavings

Brian Seipel is a Consultant and the Spend Analysis Practice Lead at Source One. He is focused on helping companies implement innovative solutions for driving revenue and expanding market share through procurement and strategic sourcing best practices. Seipel played an instrumental role in developing the proprietary spend classification taxonomy and user experience for SpendConsultant.

"An issue I see organization run into all the time is..."

'Paralysis by analysis.' Too often, analysts get far too caught up in data. We spend time collecting, cleaning, and centralizing it...but then what? Data on its own is just raw information. Obsessing over it like this is a mistake - we need to shift toward an obsession with taking this information and transforming it into knowledge. Only then can we help organizations generate wisdom.


Arron Richmond

@hst

Arron Richmond is a Digital Marketing Executive for High Speed Training, one of the UK's leading providers of online Data Protection and GDPR training.

"The most common pitfall I see in data discovery and classification is..."

The lack of goal setting from the outset. Too often, the thought is to capture more data and that will help influence decision making, but the actual decisions that need further influence are not considered early enough. This leads to outcomes that may have no significant business value relative to the time spent on your data.

Before you set out on data discovery and classification, make sure you have a clear goal in mind of what the data is going to help you achieve.


Aidan Simister

@LepideSW

Aidan Simister is the CEO of Lepide.

"I think the main thing organizations struggle with is that..."

Data discovery and classification in itself holds no intrinsic value. You can't expect to adequately improve your data security and compliance standing solely through locating and labeling your data. You will only start to see real value from DDC if you use it in conjunction with other data security practices.

For example, once you've found out where your most at-risk data resides within your infrastructure, what do you do next? Are you able to determine who has access to that data, who's making changes to it, what those changes are, and whether the surrounding environment is secure?

Data discovery and classification is powerful and necessary, but not in silo. It's when you combine this functionality with permissions analysis, user and entity behavior analytics, and change auditing that you will see the true value.
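As an illustration of that "what next" step, the sketch below takes a list of files already flagged as sensitive and reports their owners and whether they are world-readable. The file list is hypothetical and the check is Unix-specific; it is a toy stand-in for the permissions analysis a dedicated tool would perform, not Lepide's own approach.

```python
import grp
import os
import pwd
import stat

def permission_report(flagged_files):
    """For each file flagged as sensitive, report owner, group, and whether it is world-readable (Unix-only)."""
    report = []
    for path in flagged_files:
        try:
            info = os.stat(path)
        except OSError:
            continue
        report.append({
            "path": path,
            "owner": pwd.getpwuid(info.st_uid).pw_name,
            "group": grp.getgrgid(info.st_gid).gr_name,
            "world_readable": bool(info.st_mode & stat.S_IROTH),
        })
    return report

# Hypothetical list produced by an earlier discovery scan.
for entry in permission_report(["/data/shares/payroll.csv", "/data/shares/contracts.docx"]):
    print(entry)
```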


Dr. David Bauer

@LucdDIS

Dr. David Bauer is the CTO at Lucd, Enterprise AI, and the foremost leader in Big Data/Distributed Computing in the U.S. intelligence community since 2005. He has pioneered code that has executed across two million processors and developed the first Cloud and Big Data Platform Certified & Accredited for use in the Federal Government for highly sensitive and classified data.

"Data discovery is a user driven and iterative process of discovering patterns and outliers in data..."

It is not a 'tool,' though tools may aid in the discovery of relationships and models within the data. One of the most common pitfalls in data discovery is tools that are not well suited for business experts. Traditional tools in this area may require coding, such as with Python notebooks, or overwhelm the user with many uncorrelated charts and graphs. Reporting dashboards today are overly focused on providing a set of graphs on a single web page, leaving it up to the user to determine whether there are any correlations in the data or between models. A term we coined for this in 2009 is widget vomit.

A successful approach to providing data discovery relies on providing capabilities for data fusion and unification across multiple sources of both internal and external enterprise data. Data fusion relies on entity resolution analytics to compose objects that are pre-integrated and give the user a more complete sense of the information available. Good entity resolution analytics are capable of determining relationships between data objects and models in structured, unstructured, and semi-structured data. Allowing the computer to precompile entities across these multiple data sources provides the business user with a much more complete view of the information, while alleviating the enormous task of disambiguating details across hundreds or even thousands of data sources.
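As a rough illustration of the entity-resolution idea (and not how the Lucd platform itself works), the sketch below merges records from two hypothetical sources into a single fused object using a simple key. Production entity resolution relies on probabilistic matching rather than exact keys.

```python
from collections import defaultdict

def entity_key(record):
    """Build a simple blocking key: prefer a normalized email, fall back to a normalized name."""
    email = (record.get("email") or "").strip().lower()
    if email:
        return email
    return " ".join((record.get("name") or "").lower().split())

def fuse(*sources):
    """Group records from multiple sources by key and merge them into one fused entity per key."""
    entities = defaultdict(dict)
    for source in sources:
        for record in source:
            fused = entities[entity_key(record)]
            for field, value in record.items():
                fused.setdefault(field, value)  # keep the first value seen for each field
    return list(entities.values())

# Hypothetical internal and external sources describing the same person.
crm = [{"name": "Ada Lovelace", "email": "ada@example.com", "phone": "555-0100"}]
support = [{"name": "ada lovelace", "email": "ada@example.com", "last_ticket": "2023-05-01"}]
print(fuse(crm, support))
```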

Data classification is made more effective with robust, fused data objects. Sorting fused entities becomes more effective when all of the available features are contained within more complete data objects. This eliminates the problem of having to construct complex queries and categorizations across multiple, disparate data sets.

The majority of data management platforms today do not implement data fusion, and so miss the opportunity to provide the business user with a more complete and constructed view of the data landscape - which places more of the work on the analyst. The task of extraction, correlation, categorization, and disambiguation across billions of records from thousands of sources is precisely what we have built at Lucd, to be able to generate analytic models with much greater accuracy. Data fusion improves model accuracy by reducing the amount of data that must be filled with synthetic values, aggregated away (losing resolution), or dropped due to empty values. In the Lucd platform, data fusion automatically creates more complete data objects, which leads to more accurate and complete analytic results.


Steve Dickson

@Netwrix

Steve Dickson is an accomplished expert in information security and the CEO of Netwrix, a provider of a visibility platform for data security and risk mitigation in hybrid environments.

"Data discovery and classification (DDC) is the critical component of data security strategies..."

Indeed, IT pros can't effectively protect data if they don't have knowledge about what data they have, where it resides, how valuable it is, and who has access to it. However, there are some pitfalls and challenges associated with DDC:

  1. Data classification policies are so complex and ambiguous that employees find them difficult to understand and implement.
  2. Data discovery and classification requirements do not take into account the possible changes in volumes of data or data sensitivity.
  3. Implementing a DDC solution can create a false sense of security, namely the assumption that, once deployed, the solution will fully protect against data loss or misuse.

To avoid the pitfalls, you need to consider several recommendations:

  1. Create an organization-wide data classification policy that is easy to understand, and communicate it to all employees who work with sensitive data. Make sure the policy is short and includes basic elements like objectives, workflows, and data owners (a minimal sketch of such a policy follows this list).
  2. Ensure that your DDC tools and practices take into account the full life cycle of your data, and that controls remain effective and appropriate even if volumes or sensitivity of data change.
  3. Perform data discovery and classification regularly and invest in training/education related to DDC to guarantee effectiveness of your security and compliance efforts. Also, you need to combine DDC with other security practices (e.g., risk assessment) to ensure that you have permanent control over your data.
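As a rough illustration of the first recommendation, here is what the skeleton of a short classification policy might look like when written down as data. The levels, owners, and handling rules are placeholders for the example, not Netwrix guidance.

```python
# Hypothetical skeleton of a short, organization-wide classification policy expressed as data.
CLASSIFICATION_POLICY = {
    "objective": "Apply handling controls proportionate to data sensitivity.",
    "levels": {
        "public": "no restrictions",
        "internal": "authenticated employees only",
        "confidential": "need-to-know access, encrypted at rest",
        "restricted": "named individuals only, encrypted at rest and in transit",
    },
    "data_owners": {
        "customer_records": "Head of Sales Operations",  # placeholder role names
        "employee_records": "Head of HR",
    },
    "workflow": ["classify on creation", "re-review annually", "re-classify when use changes"],
}

def handling_for(level):
    """Look up the handling rule for a classification level."""
    return CLASSIFICATION_POLICY["levels"][level]

print(handling_for("confidential"))
```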


Mark Wellins

@1touchIo

Mark Wellins is the CCO of 1Touch.io.

"One of the most common pitfalls to data classification is when..."

An enterprise relies on manually defined system interrogation and application detection. With this approach, they are only dealing with the storage and processing of personal data that is known to them. A network-based approach, on the other hand, enables enterprises to discover all storage and processing of personal data, whether known or unknown. Additionally, it gives a constantly-updated holistic view into the unknown uses and categories of personal data.


Vania Nikolova

@RunRepeatcom

Vania Nikolova, Ph.D., is the Head of Data Analysis at RunRepeat.com.

"The most common pitfalls to data discovery and classification are..."

1) Data discovery

For data exploration - the process of looking through the new data set and its preparation for analysis - the most common pitfalls here are:

  • Missing data - Dealing with this is an art and a science on its own, and it's up to the task and the analyst to decide whether to leave the missing values blank, perform mean or median imputation, delete the rows with missing data, or do something else.
  • Wrongly formatted data - Dates are a big issue. If your data set combines European and US data, there is usually a formatting mismatch, so establishing that this is the case and converting all the data to the same format can be a laborious task. Another case is when some data points are formatted with a comma and some with a decimal point. If this issue isn't resolved before you start the analysis, it can have catastrophic effects. Sometimes numbers are formatted as text, which can lead to lots of errors as well.
  • Wrong data - This is hard to fix, because errors sometimes happen on entry. Still, some of these errors are easy to spot if the entry is outside the acceptable range, so you have to identify such entries and decide what to do with them. Usually, they are removed and treated as missing data. (A minimal cleanup sketch follows this list.)
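Here is a minimal pandas sketch of those three cleanup steps, using made-up column names and values: parse each source's dates with its known format, normalize number formatting before conversion, and treat out-of-range entries as missing before imputing.

```python
import pandas as pd

# Made-up example: one extract uses European day-first dates and comma decimals,
# the other uses US month-first dates and point decimals.
eu = pd.DataFrame({"order_date": ["13/06/2023"], "amount": ["1.234,50"], "age": [34]})
us = pd.DataFrame({"order_date": ["06/13/2023"], "amount": ["1,234.50"], "age": [212]})

# Dates: establish which format each source uses and parse it explicitly.
eu["order_date"] = pd.to_datetime(eu["order_date"], format="%d/%m/%Y")
us["order_date"] = pd.to_datetime(us["order_date"], format="%m/%d/%Y")

# Numbers stored as text: normalize separators per source, then convert.
eu["amount"] = pd.to_numeric(eu["amount"].str.replace(".", "", regex=False)
                                         .str.replace(",", ".", regex=False))
us["amount"] = pd.to_numeric(us["amount"].str.replace(",", "", regex=False))

df = pd.concat([eu, us], ignore_index=True)

# Wrong data: entries outside the acceptable range are treated as missing,
# then imputed here with the median (one of several possible choices).
df.loc[~df["age"].between(0, 120), "age"] = None
df["age"] = df["age"].fillna(df["age"].median())

print(df)
```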

2) Data classification

Data classification can mean one of two things:

  • Labeling your variables in a meaningful way. Sometimes labeling your variables in a confusing manner could influence the takeaways you get from the data analysis.
  • In machine learning, this is the classification of new data. So, this is concerning the quality of your classification model, if it's working properly, and if it's giving good results. Here, you need good skills to evaluate the quality of your model.


Sachin Rege

@Citi

Sachin Rege is the former Global Chief Information Officer (CIO) for Citi Commercial Bank. Sachin has built a digital banking platform with successful implementations of Open Banking APIs, RPA, AI, big data and machine learning technologies.

"The most common pitfalls in data discovery and data classification are..."

1. Lack of a single source of truth:

Every organization has a massive amount of data created across various sources, both internal and external to the organization. Every organization needs to create a "single source of truth" for each of its critical data elements.

2. Incoherent enterprise-wide data management standards and governance:

Organizations need an enterprise-wide data governance policy; otherwise, they would be relying on inaccurate or incomplete information to create dashboards or generate insight. Data management and governance identify critical data elements, their source, usage, and define rules and processes on data creation and consumption.

"Bad data quality is like garbage-in, garbage-out and does more harm than good to the company."

3. Lack of data taxonomy:

Organizations should have a well-defined taxonomy for all their data elements. I have come across various organizations where the definition of an often-used data element such as "revenue" is ambiguous depending on the region. For example, in some regions it is based on a fiscal calendar, whereas in others revenue is extrapolated from the last quarter using complicated formulas.

4. Missing data fusion and correlation features across platforms or tools:

In the age of social media, unstructured data is prevalent in organizations. A successful approach requires "data fusion": the unification of data categories (such as structured and unstructured) and multiple disparate sources (internal and external) to generate a complete and meaningful representation, such as a 360-degree view of the client. Most data management platforms do not support this feature and so miss the opportunity to provide the business user with a more complete and constructed view of the data landscape. This gap forces the data analyst to construct the view by writing complex queries across multiple data sources.

Organizations should move away from rogue, disconnected spreadsheets and incomplete dashboards to a more comprehensive, holistic, enterprise-wide data management strategy and a business intelligence platform that can extract and glean business insights which are complete and meaningful to the business users.


Aparajeeta Das

@AparajeetaDas

Aparajeeta is Co-Founder & CDO of ThirdEye Data. She is also the Founder & CEO of ClouDhiti, an analytical insights company targeted at SMBs. Aparajeeta has 20+ years of hands-on experience in data warehousing, business intelligence, and historical, real-time, and predictive analytics, combined with project delivery and management skills.

"The number one issue is that..."

Companies don't spend enough time in data discovery profiling the data sets they deal with on a daily basis. Most of the time, we don't get the time to do it because businesses are over-demanding about getting high-level insights from the data sets quickly, which is understandable. The second pitfall is a lack of qualified data analysts or data scientists who love data more than algorithms or technologies. If one doesn't know how to profile the data, how will they know how to classify it? They blindly use some random processes and algorithms and typically miss out on 20 to 25% of the truth.


Erik Hatcher

@ErikHatcher

Erik Hatcher is the co-author of "Lucene in Action" as well as co-author of "Java Development with Ant." Erik co-founded and works as a Senior Solutions Architect at LucidWorks.

"The most common pitfalls to data discovery and classification are..."

  1. Bad or messy data
  2. Thinking your data is too structured (or too clean)
  3. Not learning more about your data and users along the way

The best ways to avoid these common pitfalls are:

  1. Unfortunately, you have to deal with the data you're dealt. But there are ways to be clever with cleanup and massaging of messy data to improve discovery and classification. By using smart tools that leverage machine learning techniques, even loosely structured data can be improved upon.
  2. Typos and messes happen, so being less strict in classification can help ensure that you're not mistakenly cutting out potential matches. Searching through data with fuzzy logic, phonetic spellings, and regex formatting are ways to avoid skipping over what you're looking for (see the sketch after this list).
  3. Consider the usage of our data systems: everything is a search-based app nowadays, and there's a lot to be gleaned from the queries and from where users are accessing them. We've gotten pretty good at search and logging, but the missing magic is marrying the two with machine-learned improvements that continually tune results.
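A minimal standard-library sketch of the fuzzy-matching idea from point 2: difflib tolerates typos when mapping values onto known classes, and a lenient regex avoids rejecting lightly malformed entries. The names, cutoff, and patterns are illustrative, and phonetic matching would typically require an additional library.

```python
import difflib
import re

KNOWN_VENDORS = ["Acme Corporation", "Globex", "Initech"]  # illustrative classes

def fuzzy_classify(value, candidates, cutoff=0.8):
    """Return the closest known label for a possibly misspelled value, or None if nothing is close enough."""
    matches = difflib.get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_classify("Acme Corporaton", KNOWN_VENDORS))    # typo still maps to "Acme Corporation"

# A tolerant regex for phone-like fields avoids rejecting lightly malformed entries.
phone = re.compile(r"\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}")
print(bool(phone.search("call 555.867.5309 after 5pm")))   # True
```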


Dr. Marko Petzold

@RecordEvolution

Dr. Marko Petzold finished his studies with a Ph.D. in mathematics and kept his love for theoretical challenges ever since. He learned the data craft in a diverse set of projects. Marko is a visionary innovator and founder and CEO of Record Evolution, a data science company providing the Repods Cloud Data Warehouse.

"Three of the most common pitfalls in data discovery are..."

The availability of the data itself, the data quality, and the representativeness. Or, to put it differently, the most common pitfall is to underestimate the data preparation effort. Classification algorithms like machine learning or clustering models usually are not the main issue in today's practical applications, even though their advances and power are most prominently featured in the media.

Data acquisition itself can be the most expensive and time-consuming task in data discovery. Depending on the use case at hand, it can be as complex as installing and reading sensor data from thousands of sensors, collecting surveys, or finding, compiling, and cleaning existing data. The data you collect also needs to contain a representative sample of the data you want to apply your classifier on in the future. In practical applications, that may require continuous data streams. If you collect weather data for your classifier only in summer, it won't work well in winter.

Once you have the basic data for your classification problem, you often need to additionally provide some manual information to enable your system to classify with high quality. This process is called labeling the data. This can mean having to manually inspect thousands of photos (e.g., to tell if there are bananas on them or not).

To avoid these pitfalls, the best you can do is to put enough emphasis on data preparation tasks in your project from the beginning and to clearly communicate this to all stakeholders. If you do so, then you can plan smart workarounds right from the start, and you avoid surprises in the middle of your project.


Nishank Khanna

@NishankKhanna

Nishank Khanna is the VP of Growth at Utility.

"The biggest pitfalls to data discovery and classification are..."

  1. Not limiting the overall scope: Without limiting the scope of the data, it's hard to focus infosec resources on the segments of data that are most important.
  2. Not properly defining a data classification policy: Before defining your policy, you should ask yourself: What are the key goals, objectives, and strategic intent? Clearly communicate how the policy can increase revenue, reduce costs, and eliminate risk.


Baruch Labunski

@Baruch_Labunski

Baruch Labunski is an entrepreneur, internet marketing expert, and author from Toronto, Canada. He currently serves as CEO of Rank Secure, an award-winning web-design and internet marketing firm.

"Data technology tools are not user friendly for the people who are best suited to analysis..."

For example, many require the ability to write code or queries, which isn't in the skill set of the person who is completing the research process.

Data is disparate and needs to be unified across different formats and platforms - The most powerful discoveries are made when different data sets are linked and analyzed together. This is a very difficult task, as data objects are not uniform across different systems, and the work to unify data sources is extremely time-consuming and difficult. If you are able to do this, there is a lot of insight and knowledge to be gained; however, most enterprise systems are not "talking" to each other in a way that makes this possible, so the opportunity is overlooked.

Limited capabilities of current data management platforms - We are only as good as our platforms allow. When it comes to large volumes of data, we must rely on our systems. Manual analysis is almost impossible, and it's risky to invest the time in exploring when we don't always know what we're looking for, or what we will find. There is no guaranteed gold nugget in the field of data we have, either, so in effect we don't know the potential benefit of discovery.

The overarching theme is we don't know what we don't know. When it comes to discovery and classification of data, it comes down to the fact that we don't know what's important until we find it, and by then we may already have in place a system that isn't built to unlock what we may find.


Kevin Turner

@kmtcrm

Kevin Turner is the Head of Strategic Partner Development at Nimble. Kevin has an extensive background in Strategic Partner/Account Management, Channel Management, CRM, and Cloud Computing. He has been involved in the start-up and growth of multiple CRM software and consulting firms, including Model Metrics.

"Poor data quality hinders your business teams' ability to deliver customer-centric value..."

To avoid this pitfall, plan your search and segmentation fields ahead of time to ensure the search criteria deliver the value you expect; develop a governance plan with a clear delineation of who is responsible for entering, validating, and maintaining the data; and establish user protocols regarding where and when data gets entered. I also recommend running exception reports to identify records with missing fields based on their relationship stage, and grooming the data by establishing a process to maintain and update it, in order to increase accuracy and keep it current.
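As an illustration of the exception-report idea, the pandas sketch below lists contacts missing fields required for their relationship stage. The field names, stages, and required-field rules are invented for the example, not Nimble's data model.

```python
import pandas as pd

# Hypothetical CRM export; field names and stages are illustrative.
contacts = pd.DataFrame([
    {"name": "Ada",   "stage": "lead",     "email": "ada@example.com", "phone": None,       "account_owner": None},
    {"name": "Grace", "stage": "customer", "email": None,              "phone": "555-0100", "account_owner": "Kay"},
])

# Fields that should be populated at each relationship stage.
REQUIRED = {
    "lead": ["email"],
    "customer": ["email", "phone", "account_owner"],
}

def exception_report(df):
    """Return one row per contact/field pair where a stage-required field is missing."""
    rows = []
    for _, row in df.iterrows():
        for field in REQUIRED.get(row["stage"], []):
            if pd.isna(row[field]):
                rows.append({"name": row["name"], "stage": row["stage"], "missing": field})
    return pd.DataFrame(rows)

print(exception_report(contacts))
```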

Finally, I recommend selecting a CRM tool that is easy to access and easy to use everywhere users communicate with customers and prospects (e.g., email, social, the web, and mobile), that consolidates all business contacts, and that automates data entry and data enrichment. Data enrichment can either be provided via a third-party tool like ZoomInfo or DiscoverOrg or offered in-the-box with a small business CRM like Nimble.


Anna Bergevin

@anna_bergevin

Anna Bergevin is the Director of Operations and Data Scientist at Nozzle. She is a data-oriented research professional and project manager. She enjoys the challenge of tackling complex problems and finding the most effective, efficient, and accurate methods to address them.

"Some of the most common pitfalls I see in data discovery and classification are..."

Ill-defined variables - Defining variables too broadly or ambiguously. When gathering and classifying data, evaluating the method of data collection and ensuring that the way the data is measured matches the underlying construct we are trying to capture are critical to later utilizing and interpreting the data correctly.

Not validating classification categories - Generating a complete set of data categories to capture the full spectrum of possible inputs from a data source is difficult. Initial data classes must be evaluated against a large enough dataset to ensure that a sufficient set of classes has been generated to cover most (ideally all) cases.
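A small sketch of that validation step: run the proposed categories against a reasonably large labeled sample and measure how much of it falls outside them. The category names and the 95% threshold are arbitrary placeholders.

```python
from collections import Counter

PROPOSED_CATEGORIES = {"invoice", "contract", "resume"}  # illustrative classes

def coverage(labels, categories, threshold=0.95):
    """Report whether the proposed categories cover enough of the sample, the covered share, and the leftovers."""
    counts = Counter(labels)
    covered = sum(count for label, count in counts.items() if label in categories)
    share = covered / len(labels)
    uncovered = {label: count for label, count in counts.items() if label not in categories}
    return share >= threshold, share, uncovered

sample = ["invoice", "invoice", "contract", "purchase_order", "resume", "invoice"]
ok, share, leftovers = coverage(sample, PROPOSED_CATEGORIES)
print(ok, round(share, 2), leftovers)   # False 0.83 {'purchase_order': 1}
```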

Combining data that is not comparable - When data is combined from various sources, it is critical that the units of measurement are compatible. What are the time frames and what are the units being measured (people, communities, organizations)? Either the data need to be utilizing compatible time frames and units, or the data needs to be transformable to become compatible.

Failing to thoroughly evaluate data consistency - Not all data sources are created equal. Nearly all datasets have weak points: missing data problems, inconsistent classification, etc. Thoroughly evaluating the extent and nature of those issues is crucial to correctly integrating a dataset and counteracting those issues when using the data.


Sid Mohasseb

@sidmohasseb

Sid Mohasseb is a serial entrepreneur, venture investor, business thought leader, educator, speaker, and author of the 2014 book, The Caterpillar's Edge. He currently teaches data science at the University of Southern California.

"There are three common pitfalls, and none are technology or tool related..."

They are all mindset driven. First - to simplify and save time, smart people (which includes most data science types) begin with an answer in mind: what model would work best, what data elements are relevant, and what they should see when the work is done.

Second - analysts approach the wrong problem and come up with an excellent and elegant solution. Often, discovery efforts are focused on sub-sections or a piece of a bigger situation. Data scientists are more fascinated with modeling and less informed about the connectivity of business functions - therefore, the business people who are less informed about the data make assumptions about the problem without real discovery and lead the analyst on a path to solving those problems.

Third - the failure of the analyst to change his or her frame of reference. By changing the way you approach the data and re-examining situations, new insights are almost always guaranteed. Too often, only one perspective, likely the one most in line with either the analyst's or the business partner's biases, is selected: a single road leading to the most expected outcomes. To avoid these pitfalls: i) change your frame of reference as you explore the data and approach discovery, ii) challenge the biases you hold as you approach insight extraction, and iii) don't start with the solved state in mind; question the problem you are solving and put the problem and the conceived solutions in the bigger business context.


Jason Cassidy

@Shinydocs

Jason Cassidy is the Founder and CEO of Shinydocs Corporation. Solving enterprise technology challenges for over 20 years, Jason's award-winning solutions are transforming the digital landscape from traditional ECM to modern Content Services.

"In the enterprise, these are problems simply beyond human scale..."

We're talking petabytes of data, and terabytes the organization doesn't even know it's sitting on - to say nothing of the volumes of data being created every day. Discovery and classification become impossible due to the sheer amount of data organizations have in their possession.

AI-powered auto-classification, trained on a small subset of properly recognized data, is possible today. Machine learning tools are the only way organizations can hope to make significant headway. It's not perfect, but it's a process that improves over time as the machine learns what defines a document specific to an organization. It means the companies that start the process now are in a better position to leverage more of their data in the future and provide cleaner data fuel for future predictive AI-powered analytics and decision making.
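A minimal scikit-learn sketch of that approach, with placeholder documents and labels: train a text classifier on a small, manually classified seed set, apply it to unclassified documents, and route low-confidence predictions back to humans for the next training round. This is a generic illustration, not Shinydocs' implementation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Small, manually classified seed set (placeholder text and labels).
seed_docs = [
    "invoice number 1042 total due 30 days",
    "employment agreement between company and employee",
    "quarterly revenue and expense summary",
    "purchase order for 12 units net 60",
]
seed_labels = ["finance", "legal", "finance", "finance"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(seed_docs, seed_labels)

# Apply the model to unclassified documents; low-confidence items would go back
# to humans, and their corrected labels feed the next training round.
unlabeled = ["statement of work and liability clauses", "receipt for office supplies"]
for doc, probs in zip(unlabeled, model.predict_proba(unlabeled)):
    label = model.classes_[probs.argmax()]
    print(doc, "->", label, round(float(probs.max()), 2))
```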


Sia Mohajer

@MohajerSia

Sia Mohajer started off as a scientist working on human cognition. He quit science and started building online businesses to help people better understand data privacy and web applications. He currently runs a 10-person company in the privacy and hosting space.

"As we've seen with multiple hacks this year and last..."

Data discovery eventually leads to data storage which creates an irresistible honeypot for malicious third parties. Combine that with chronic levels of employee-centric data breaches, and you have a recipe for disaster. The productivity trap - while doing data discovery looks like work and may seem like a productive use of time to the user, it is easy for casual users to spend time analyzing data without purpose. Data is the new oil. Don't give yours away.

Tags: Data Protection, Data Classification
