Data classification tools play an important role in enterprise data protection, tagging sensitive data in various formats to enable protective policies to be applied to different data types. As such, it's important that enterprises evaluate data classification options carefully and identify the best classification tools for their specific data protection needs. Choosing the right data classification tool means choosing one that integrates with your existing systems, infrastructure, software, and workflows; requires minimal manual intervention by users, particularly non-IT users; and offers classification capabilities sophisticated enough to classify data accurately and efficiently, so that it effectively helps to mitigate risk.
To gain insights into the key considerations in comparing and selecting data classification tools, we asked a panel of data classification experts to answer the following question:
"What's the most important thing to look for when comparing data classification tools?"
Find out what our experts had to say below.
Meet Our Panel of Data Classification Experts:
Paul Kubler
Paul Kubler is a Cyber Security and Digital Forensics Examiner at LIFARS LLC, an international cybersecurity and digital forensics firm. He formerly worked in the Global Network Architecture division at Boeing, the nation’s largest private cyberattack target. He previously worked at the Flushing Bank, in Network and Systems Infrastructure, protecting valuable financial data at various levels within the network and system. Paul has also performed forensic investigations into mobile devices, aiding in the prosecution of criminals.
With several years of experience in cybersecurity and digital forensics, he has conducted a wide range of investigations, including data breaches through computer intrusions, theft of intellectual property, and computer hacking. He has worked on hardening systems and deploying protection across an international organization. He has also created business networks with a defense-in-depth strategy and implemented firewalls on these networks.
"The most important thing to look for in a data classification tool is..."
Whether or not it supports everything in your environment. This includes the file types, storage media, and migration capabilities, as well as the total number of files. It is important that all file types be accounted for; if not, some important business data may be left unprotected. Likewise, if the product cannot handle the total number of files in an organization, its scope is too small to cover everything. Both of these are important unless an organization wishes to dedicate additional resources to fill in the gaps, which may or may not save some funding. Lastly, it is important that all media types be supported with migration capabilities. This allows an organization to identify all the data on its media and to reclassify and reorganize it based on security level. All of these are part of knowing your organization's environment, and that is the most important part of choosing a tool that will be most effective.
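As a rough illustration of the file-type check described above, the sketch below counts files by extension and separates the types a candidate tool covers from those it would leave unprotected. The supported-extension set and the sample paths are hypothetical; a real evaluation would use the vendor's published format list and an actual inventory of your storage.

```python
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical list of extensions a candidate tool claims to support.
SUPPORTED_EXTENSIONS = {".docx", ".xlsx", ".pdf", ".txt"}


def coverage_report(paths, supported=SUPPORTED_EXTENSIONS):
    """Count files per extension and split them into covered and uncovered."""
    counts = Counter(PurePosixPath(p).suffix.lower() for p in paths)
    covered = {ext: n for ext, n in counts.items() if ext in supported}
    uncovered = {ext: n for ext, n in counts.items() if ext not in supported}
    return covered, uncovered


# Illustrative inventory; these paths are made up for the example.
sample_paths = [
    "finance/q1.xlsx",
    "finance/q2.XLSX",
    "legal/contract.pdf",
    "design/part.dwg",
    "mail/archive.pst",
]
covered, uncovered = coverage_report(sample_paths)
print("covered:", covered)
print("uncovered:", uncovered)
```

Any extension that appears in the uncovered group is a gap that would need either a different tool or additional resources to fill.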
Michael Kummer
Michael Kummer is the President Americas for SECUDE and a technology and security expert. He has enjoyed a decade-long history within the IT industry, going back to his days in the Austrian Army. As an innovative and independent thinker with a broad knowledge of security-related technologies, he plays a key role in facilitating SECUDE's latest efforts in the field of data protection and classification for SAP.
"The most important thing to look for when considering data classification tools is..."
Traditional data classification solutions have to rely on user input or content analysis in an attempt to understand the context of the data. For that reason, data classification is often perceived as a long and painful process. Finding a tool that automates the often-challenging task of data classification with intuitive algorithms should be a top priority for any company looking to add structure to its data. Ideally, the tool should be aware of the context of the data (the user and her role, the data itself, and the technical environment) and be able to suggest best-matched classification labels to the user based on that knowledge. Context awareness makes a decision easy and efficient for the end user and ensures consistency over sensitive data handling across the entire organization. Another important thing to consider is whether the tool can seamlessly integrate into the existing IT infrastructure and fit into the company-specific classification and Data Loss Prevention (DLP) frameworks as well as integrate with ERP systems, where most sensitive data resides.
J. Wolfgang Goerlich
J. Wolfgang Goerlich is a Cyber Security Strategist with CBI.
"When comparing data classification tools, the most important consideration is..."
Ease of use is vital when comparing data classification products. Tools that do not fit employees’ workflows or that require too much interaction will be ignored or, worse, bypassed. The question to ask vendors is what adoption rate their tools achieve in similar businesses. A security control is only as good as the extent to which it is used. Are people using it?
Charles Foley
Mr. Foley has over 20 years' experience leading both private and public company teams to success. Prior to Watchful Software, Mr. Foley was the Chairman and CEO of TimeSight Systems, Inc., a developer of leading-edge storage and video management solutions for the physical security market. He also served as President of Tacit Networks, a leader in Wide Area Network acceleration systems, where he designed the marketing and business development strategies that led to their profitable acquisition by Packeteer (NASDAQ: PKTR).
"Without a doubt, the most important aspect of a successful data classification tool is to..."
Remove the variables and maintain programmatic application of policy. In a word: dynamism, or the ability to take steps to identify sensitive information, classify it according to the organization's policy, apply any classification characteristics (markings, taggings, etc.), and enforce any required protections without any user involvement required. The fact is that in any data classification paradigm, the weak link isn't the data, the policy, or the tool; it's the users. They are the only non-programmatic part of the equation, the wildcard variable. If you can remove the variables from the equation and allow the programmatic entities of the data, the policy, and the classification tool to do their job, the entire paradigm holds together tightly.
There are many data classification solutions on the market, and it's too easy to get into a matrix-comparison war of which one handles which application, on which device, how many concurrent users, what languages, etc. However, the simple fact is that any and all of them are poor solutions if they are a) not used or b) not used properly. Remove the variable of whether your users will actually use the tool and further remove the variable of whether your users will apply the proper classification, marks, etc., and you have a successful program with solid compliance.
How does a data classification program ensure that the wildcard is held in check? First, ensure that basic rulesets are applied programmatically, without user involvement. Catalysts such as PII in the data or basic organizational terms such as 'Company Top Secret,' etc. should trigger automatic classification and apply the proper markings, tags, and protections.
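A minimal sketch of such a programmatic ruleset follows, assuming simple pattern matching is enough to detect the triggers. The labels, the patterns (a US Social Security number shape and an email address shape standing in for PII), and the default label are all hypothetical and are not taken from any particular product.

```python
import re

# Hypothetical rules, ordered from most to least restrictive.
# The first rule whose pattern matches decides the label.
RULES = [
    ("Top Secret", re.compile(r"company top secret", re.IGNORECASE)),
    # Shape of a US Social Security number, used here as a stand-in for PII.
    ("Confidential - PII", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    # Shape of an email address, also treated as PII in this sketch.
    ("Confidential - PII", re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")),
]
DEFAULT_LABEL = "Internal"


def classify(text):
    """Return the label of the first matching rule, or the default label."""
    for label, pattern in RULES:
        if pattern.search(text):
            return label
    return DEFAULT_LABEL


print(classify("Company Top Secret: merger plan"))
print(classify("Employee SSN 123-45-6789"))
print(classify("Lunch menu for Friday"))
```

Because the rules run without any user decision, the label is applied the same way every time; a real deployment would also attach the markings and protections tied to each label.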
For more advanced users, if the tool allows user interaction, ensure that it is an in-workflow process, whereby they don't have to exit their normal applications or workflow to apply the classification paradigm. Every time they have to do this, it's another obstacle not only to their productivity but also to the integrity of the data classification paradigm.
The most important thing to look for in comparing data classification tools is the ability to remove the wildcard and allow the policy to work. Then your organization can rest easy at night.
Alice Zheng
Alice Zheng is the Director of Data Science for Dato.com. Dato accelerates development of intelligent apps by making sophisticated machine learning easy to build and deploy.
"The right data classification tool is the one that..."
Fits the rest of your pipeline. Here are a few questions to consider:
A) What does raw data look like? Does it include text, images, videos, or music files?
Whatever format the data comes in, the tool needs to have connectors to import data in that format. Otherwise, the user will need to understand how to write glue code to convert the data into the right format.
B) How much of it is there?
If there is a large amount of data to be classified, then the tool needs to be scalable and extremely fast. Additionally, the tool may need to handle streaming data and update machine learning models, if your application demands up-to-date models based on continuous data.
C) Who is using it?
Data scientists and machine learning experts require flexibility and power in the available classification methods. BI users require good visualization UI with plotting/charting functionalities.
D) What's the desired end result?
For business intelligence, it should feed into your reporting database. For applications, the classification tool should feed into your machine learning models to power capabilities like personalization, recommendations, alerts, or fraud detection.
David Thomason
David Thomason is the Founder and President of Thomason Technologies, specializing in securing your networks against internal and external attacks. With capabilities such as Next Generation Firewall, Next Generation Intrusion Prevention Systems, File Permissions, and File Trajectory, you will have more knowledge and control over the information on your network than ever before. Thomason Technologies has been consulting on cybersecurity since 2007. David has personally been in the cybersecurity space since 1986 when he was writing code for the United States Air Force Electronic Security Command (part of the intelligence community).
"When it comes to comparing data classification tools, companies should remember ..."
The ideal data classification solution can help you quickly identify where sensitive information is vulnerable and who is touching it. It also helps prioritize risk and remediation and should be capable of locking down the data without interrupting business.
Brian Media
Brian Media works at PolyVista, where he specializes in transforming the way organizations are turning data into actionable intelligence. He likes to spend time outdoors, snowboarding in the winter, and golfing in the non-snowy months.
"When discussing data classification tools, in my opinion you have two choices..."
Speed and accuracy. If the user has a need for the data to be classified highly accurately, it might take more time. If the user wants the data classified very quickly, then they will have to sacrifice accuracy. With the current set of tools out there today, the user has to decide if they want the data processed and classified near instantaneously but not at 100% accuracy or if they want the data processed closer to 100% accuracy and can wait the minutes/hours/days it takes to get the data processed and properly classified. Speed and accuracy are mutually exclusive, and it really boils down to the needs of the business, because each has both advantages and disadvantages.
Kevin Barnicle
Kevin Barnicle is the Founder/CEO of an Information Governance software and consulting company, Controle. His company consults and implements many of the different data classification tools out in the marketplace today.
"For businesses comparing data classification tools, I suggest..."
1. End user interaction: Some data classification tools interact with end users and some don't. For those that do, make sure the interaction is not intrusive (e.g., requiring the end user to classify every piece of content they create); otherwise, they will not do it. Period. We tested data classification tools internally at my company, and the ones that forced the end users to classify content didn't get used, so we scrapped them.
2. Practicality: A lot of data classification tools claim that they can auto-classify documents; in other words, that the technology will figure out what the document is. While that possibility is exciting to think about, it is extremely difficult for any existing technology to do it accurately. We have seen clients try these tools and then scrap them because of all the work associated with verifying and correcting the results. The technology is not there yet, and due to compliance and legal requirements it is not worth the risk.
3. Integration with other tools: Data classification tools need to be able to integrate with other tools that help manage the data after it is classified. If they do not, it is very hard to come up with a solution that will work for the business.
Alvaro Pleitez
Eliassen Group System Support Specialist Alvaro Pleitez serves as the internal system support specialist for one of the largest computer and life sciences consulting companies in the U.S. One of his assignments is overseeing the data classification program at Eliassen Group.
"Choosing the right data classification tool for your enterprise is dependent on..."
What the actual data is that you are looking to analyze as an entity. Is it financial data? Geographical data that reflects where your customers are located across the country? Or do you need a tool that can process language? NLTK, one of the best open source tools for processing language, is one of a plethora of examples. That particular tool can help you engage in data mining, machine learning, data scraping, and sentiment analysis. And since it's written in Python, you can build applications on top of it, which allows it to be customized for small tasks. The question for companies looking to compare data classification tools can then be further broken down by determining whether the enterprise needs something to be built on-site or off, uses a legacy system to operate, and has the financial wherewithal to purchase new tools when necessary that will maintain compatibility with its competitors in the marketplace.
Colum Devine
Colum Devine is the Digital Marketing Manager for Private Jet Services, responding on behalf of Reviewster.com.
"As a cloud storage provider, one of the most important things we recommend people consider while comparing data classification tools is..."
Whether the tool offers an effective way to tag your data properly. Most do, but learning how the process works in whatever tool is used is crucial. Before this, you should have an effective metadata strategy in place, which will help you get an overview of the data you want to source, organize the proper path to it, document the data structure and content, and then finally pass this information to the appropriate constituencies.
Once this metadata has been created and replicated to your other information sources, the company or individual in charge of the project can then establish a classification taxonomy which will tag the assets of varying types according to their own companies' needs.
Anatoly Bodner
Anatoly Bodner is an industry-recognized information and infrastructure security professional, subject matter expert, and event speaker. Anatoly currently serves as the Information Security Officer and Director of the Data Protection Practice for NTT Com Security – a global security consultancy organization.
"The most important thing to look for when comparing data classification tools is..."
A very large part of our practice is committed to helping organizations that were previously unsuccessful in achieving business buy-in, implementing enforcement controls, or are overwhelmed with the amount of events and incidents generated by these technologies.
The root cause of these issues is usually the same: Many organizations don’t realize that unlike AV, IPS, or other security tools, Data Protection and DLP are all about the business.
The key to successful enablement of the tools that can secure or break the business is appropriately preparing the business. Development of the data protection program and strategy needs to revolve around business unit and business process analysis and preparation, communications strategy, workflow, incident response planning, and other steps that are absolutely essential to establishing an effective, efficient data protection program.
Levent Gurses
Developer, hacker, speaker, community organizer, and entrepreneur, Levent is the founder of Movel, a mobile app design and development company in the Washington, DC area. He is actively engaged in several communities on mobile and full-stack development across the Mid-Atlantic region.
"Data classification tools need to have some key features..."
- The data to be classified needs to remain secure and confidential. The tool needs to have good protection against cybersecurity threats.
- Tagging and categorizing is expensive. A good tool will bring the costs down through better user experience and more efficient disk and CPU usage.
- The quality of the data is important. The tool needs to have filters for low-quality data.
- Data redundancy can be a real pain. A good classification tool would have sophisticated duplicate detection and removal capabilities.
- Monitoring and alerts are needed for cases where data quality crosses a certain threshold or a security alarm goes off.
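To make the duplicate-detection point in the list above concrete, here is a minimal sketch that groups documents by a hash of their content. It only catches byte-for-byte identical copies; the "sophisticated" detection mentioned above would also need near-duplicate techniques, which this example does not attempt. The document names and contents are made up for illustration.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents):
    """Group document names by the SHA-256 digest of their bytes.

    Returns a list of groups, each holding the names of identical documents.
    """
    groups = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(name)
    # Keep only digests shared by more than one document.
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Illustrative in-memory documents; real input would be read from storage.
docs = {
    "report_v1.txt": b"Quarterly revenue summary",
    "copy_of_report.txt": b"Quarterly revenue summary",
    "notes.txt": b"Meeting notes",
}
duplicates = find_duplicates(docs)
print(duplicates)
```

Classifying one representative per group and applying the result to the rest is one way such a step could reduce tagging cost.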
Berrin Sun
Berrin Sun works at Ragic, an online cloud database provider where businesses can design and use their own databases with no technical knowledge required.
"When comparing data classification tools, the most important thing for businesses to look for is..."
How compatible the tool is with their business. Usually, it's a bad idea to try to fit into a piece of software; changing how you run your business to fit a software program should be a red flag. While searching for the right tool, ask the software provider how customizable it is and if it can fit your own workflow so that you can make the transition in the best possible way for your company.
Andrew Whitmer
Andy is a Research Analyst at SecureState specializing in web application security and wireless penetration testing. Prior to SecureState, he was a Special Operations Linguist and Team Leader with the US Army.
"There are three important considerations for companies comparing data classification tools..."
Very few clients we see utilize data classification tools and most would rather just implement a data classification program. However, with all tools there are some universal principles that apply. Common questions someone needs to ask themselves when considering any tool are:
1. Do they meet the needs of the organization and any compliance frameworks the organization is subject to?
2. Are they scalable?
3. Can they be integrated with the current environment in terms of operations and security?
Joe Ramirez
Joe Ramirez is the Director of Data Analytics at Surgo Group and focuses on finding innovative ways to aggregate and interpret large data sets for the firm's clients. He has worked in quantitative research for the past 8 years and received his B.S. in Electrical Engineering from MIT in 2007.
"Data classification is one of those unique areas that..."
Walks the line between art and science. On the one hand, developers spend countless hours creating algorithms to appropriately categorize and tag data, but such a solution will never work for all datasets as there is no scientific way to address the unique characteristics of every dataset. On the other hand, there will always be some creative or artistic approach to each and every data classification tool that relies to some extent on the author's ability to interpret the various types and formats of data that will be input into the tool. As such, we believe it is important to compare data classification tools in both lights and find the one that strikes the right balance between art and science for our datasets. It's important to recognize that not all tools will provide the right solution for all datasets, but the most appropriate tool is one that is highly technical and robust yet takes the right creative approach for our projects.
HK Bain
HK Bain is the President and CEO of Digitech Systems, a provider of Enterprise Content Management software, and oversees the management and overall vision of the company.
"With classification tools, what matters at the end of the day are..."
The accuracy rates. To achieve the best rates, do not rely on older technology such as bag-of-words search, keyword search, zonal OCR, or x-y positioning. These older approaches are fallible and do not result in the highest possible accuracy rates. In addition, when documents are difficult to read or damaged, partial words or phrases may not result in classification. A better method with newer technology allows a multi-dimensional approach to classifying documents. These new technologies leverage artificial intelligence, a non-deterministic programming model that utilizes many more data points than the old technologies even recognize. The results are much higher accuracy rates (95% or better), which saves time in manually keying information or sorting documents by hand.
Fuk Yeung, Kela Roberts, and Korina Baraceros
Fuk Yeung, Kela Roberts, and Korina Baraceros are from Quantly, a quantitative consulting firm at Harvard Innovation Lab.
"When comparing data classification tools, the most important factor to consider is that..."
The newest, coolest tool is not always a good fit for your custom needs. The most important thing to look for when comparing data classification tools is a match between the tool and your use case. Too often, you will get into specifics like performance, security, or accuracy for a generic comparison, but the fact of the matter is that as soon as you have more than one use case, any single tool will lose on at least one of these criteria. It is often best to use the tool that best fits your needs. Are you trying to classify sensitive documents or some new biological data, or are you investigating some time-series data? In each of these cases, the algorithms and thus the software that you need will be different. A word about algorithms: In the case of open source software, you have more leverage over what particular classification algorithm you are using, and in many cases, the open source community will be at the forefront of what data scientists are currently working on since it is more available to the academic community.
Eric Ebert
Eric is a Communications Manager at Lookeen Desktop Search. Having worked with several B2B software companies across Germany, he has an understanding of which technologies are being utilized by large enterprises and how to demonstrate effective solutions that help increase worker productivity.
"The main thing you want to look for when comparing data classification tools is..."
User experience. Many IT admins will talk about unsuccessful adoption of new technology as the main hindrance to project success. VDI systems have suffered this same problem for years. If you have a familiar user interface that doesn't change the way your employees go about their ordinary day, you will have more success deploying a new data classification tool.
Steve Erickson
Steve is Senior Vice President of ForSite Managed Services, a division of Access Sciences. He has over 35 years of professional experience with information management consulting, operations, and managed services. His background includes leading corporate practice strategy activities, project management and control of multi-billion dollar capital asset projects, and developing the underlying governance, organizational, and technology framework for clients’ information management programs. Steve earned a Bachelor of Science in Computer Science and Engineering from the Massachusetts Institute of Technology in Cambridge, Massachusetts.
"Every organization's motive for leveraging data classification tools is different..."
Whether it be to support storage optimization, eDiscovery, search enhancement, or migration to the cloud. Although the goals may differ, the challenges are still the same: locating information in multiple places, classifying that information in accordance with its business value, and finally reorganizing in place or migrating to a new location. Tools help with that process by providing instant visibility into source repositories and facilitating application of business rules in broad brush strokes, while simultaneously enabling inspection at the item level. Given the gargantuan task of massaging often illogically grouped masses of content into logical structures, tools that can deduce that logic by layering multiple technologies, including zonal recognition and text analytics in both content and metadata, and that can be trained to extrapolate from prior rules are the most useful in reducing the time and effort of classification initiatives. Various tools boast these capabilities; however, the sophistication of their features and the ease of use in their application are highly subjective and should be tested on representative data sets.