Preparing Your Organization for GDPR Compliance
The threat of a $24 million fine is enough to make any organization sit up and pay attention to the changes it must make to comply with the European Union's new data protection law. But, in preparing for the General Data Protection Regulation (GDPR), are U.S. companies focused too much on the “data” in their big data clusters? David Dingwall, of Fox Technologies, believes so. He says getting these clusters through GDPR compliance depends on some fundamental technical setup decisions. Get the “plumbing” wrong and all that expensive compliance process review work can be bypassed, causing your organization to fail audit reviews.
The beauty of extra-large Linux clusters is that they are easy to build. Hadoop, OpenStack, hypervisor and HPC installers let you build on commodity hardware and deal with node failure reasonably simply. However, fines of up to €20 million (US$24 million) for a GDPR violation do make you focus on how auditors are going to treat their review of your organization’s people-related data storage and manipulation.
Most of the GDPR review articles you may have read in the last 12 months reinforce that privacy and encryption of people data are hugely important. Multiple layers of encryption for data at rest and in transit through your infrastructure are appropriate. However, when dealing with new big data infrastructures, the crucial audit concern is being clear about how the software manipulates, aggregates, anonymizes or de-anonymizes (soon to be illegal in the U.K.) people data.
There are some key lessons from the financial services marketplace, which has been using Linux-based HPC and blade clusters for data modelling and forecasting for the last 15 years, especially the operational planning and setup choices that make ongoing audit cycles easier to complete.
Big Data Cluster Fundamentals: The Large Sausage Machine Without Real People
There is a temptation to build a new data-processing cluster on a standalone network to constrict data movement, with supplemental admin access on a second corporate LAN interface. Once data is loaded, however, Hadoop and HPC clusters, much like an Oracle database in years past, tend to execute all running data-transforming tasks under a single account (e.g., “hadoop”) rather than the submitting user’s ID.
Audit needs to prove not just how personal data is stored, but also how it is manipulated. That includes understanding who on your staff can create, change or log in to these application-specific accounts or, worse, the operating system root account.
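A useful starting point is simply enumerating who can reach those shared accounts today. The following is a minimal sketch, assuming plain /etc/group and /etc/sudoers files on each node; the account names and file locations are illustrative, reading sudoers normally requires root, and a real estate would also need to cover LDAP/AD lookups and sudoers include files.

```python
# Minimal sketch: list local accounts that can reach a shared service
# account (e.g., "hadoop") or root via group membership or sudo rules.
# Assumes plain /etc/group and /etc/sudoers files; real deployments
# also need LDAP/AD lookups and sudoers include files.
import re

SERVICE_ACCOUNTS = {"hadoop", "root"}

def group_members(path="/etc/group"):
    """Return {group_name: [member, ...]} from a group file."""
    members = {}
    with open(path) as fh:
        for line in fh:
            name, _, _, users = line.strip().split(":")
            members[name] = [u for u in users.split(",") if u]
    return members

def sudo_rules(path="/etc/sudoers"):
    """Yield (who, run_as) pairs from simple sudoers entries."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # crude match for "<user or %group> ALL=(<run_as>) ..."
            m = re.match(r"^(\S+)\s+\S+=\((\S+?)\)", line)
            if m:
                yield m.group(1), m.group(2).split(":")[0]

if __name__ == "__main__":
    for grp, users in group_members().items():
        if grp in SERVICE_ACCOUNTS and users:
            print(f"members of group '{grp}': {users}")
    for who, run_as in sudo_rules():
        if run_as in SERVICE_ACCOUNTS or run_as == "ALL":
            print(f"sudo rule: {who} may run commands as {run_as}")
```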
#1: Too Many Setup Options, Not Enough Certified (People) Installers
Big data and HPC cluster software tools have specific setup and deployment models that suggest standard templates for installation. According to the 451 Group, fewer than 20 percent of Hadoop licences purchased worldwide so far have moved into live production, and sadly, the typical cluster installation tool models from commercial edition vendors like Cloudera, SAS and Hortonworks do not reflect the compliance regimes you are going to need in 2018. Frankly, unless members of your staff have worked for one of the internet giants like Google or Yahoo!, admin staff life cycle experience is very limited and we are all learning on the job.
#2: Ensure Your Administrators Are Real People
For traceability later, ensure your organization has a consistent user ID (UID)/group ID (GID) strategy for Linux. Your cluster software’s application user and group IDs need to fit into that matrix across the organization’s infrastructure, not just in your cluster. Each staff member’s ID likewise needs to be unique across your business, not just within the cluster, and best practice now says multifactor authentication challenges should be used when they log in and when they move from node to node in your infrastructure, to prove they are a real person rather than a stolen account-and-password pair. This is essential to implement early.
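One way to keep that UID/GID matrix honest is a periodic consistency check across nodes. Below is a minimal sketch, assuming each node’s passwd file has been collected into a local passwd/ directory; the file names and paths are illustrative, and a real deployment would pull the data over SSH or straight from LDAP/AD.

```python
# Minimal sketch: flag usernames whose UID differs between cluster
# nodes. Assumes each node's /etc/passwd has been copied to
# ./passwd/<nodename>; a real deployment would gather this over SSH
# or query the central directory (LDAP/AD) instead.
from collections import defaultdict
from pathlib import Path

def load_passwd(path):
    """Return {username: uid} for one node's passwd file."""
    mapping = {}
    for line in Path(path).read_text().splitlines():
        if line and not line.startswith("#"):
            fields = line.split(":")
            mapping[fields[0]] = int(fields[2])
    return mapping

def find_conflicts(passwd_dir="passwd"):
    """Print every username that maps to more than one UID."""
    seen = defaultdict(set)            # username -> {(node, uid), ...}
    for node_file in sorted(Path(passwd_dir).glob("*")):
        for user, uid in load_passwd(node_file).items():
            seen[user].add((node_file.name, uid))
    for user, entries in sorted(seen.items()):
        if len({uid for _, uid in entries}) > 1:
            print(f"UID mismatch for '{user}': {sorted(entries)}")

if __name__ == "__main__":
    find_conflicts()
```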
#3: Visibility Into Your Organization’s SIEM, and the Need to Track Correlated Events
Clusters can generate a large wave of log files. For example, the Hortonworks distribution of Hadoop generates hundreds or thousands of “su hadoop” messages in a few minutes. Security information and event management (SIEM) platforms (open source or not) are a fantastic way to make sense of correlated events. For example:
- David logged into the corporate network from home via a VPN using MFA
- David SSHed into the production jumpstart server
- David SSHed into cluster node 47, then SUed to root
- David changed the UID of the Hadoop account from 10011 to 13011
- The Cluster ran 138 SU jobs as the Hadoop account on node 47 until 18:00
An operating system, application or cluster manager’s log viewer may only show you slices of this picture. Sending all logs at all levels to your enterprise SIEM is safer, more complete and, frankly, makes reporting another team’s responsibility.
Ensuring your admin staff have unique account names and account IDs makes events like these simple to correlate across the network, operating system and software layers. Auditors and your business data owners actually prefer this hands-off model, where someone other than your Linux admin team is proving what happened.
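To make the correlation concrete, here is a minimal sketch that stitches a per-administrator timeline out of events exported from a SIEM. The JSON-lines export format and the field names (timestamp, user, host, action) are assumptions for illustration; map them to whatever your SIEM actually emits.

```python
# Minimal sketch: rebuild a per-administrator timeline from a SIEM
# export. The file format (one JSON event per line) and the field
# names ("timestamp", "user", "host", "action") are hypothetical;
# adjust to your SIEM's export schema.
import json
from collections import defaultdict

def load_events(path="siem_export.jsonl"):
    """Read one JSON event per line from a SIEM search export."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]

def timeline_by_user(events):
    """Group events by the unique admin account that triggered them."""
    per_user = defaultdict(list)
    for ev in events:
        per_user[ev["user"]].append(ev)
    for user, evs in sorted(per_user.items()):
        evs.sort(key=lambda e: e["timestamp"])  # ISO timestamps sort lexically
        print(f"--- {user} ---")
        for ev in evs:
            print(f'{ev["timestamp"]}  {ev["host"]}  {ev["action"]}')

if __name__ == "__main__":
    timeline_by_user(load_events())
```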
#4: Give Auditors the Right Tools to Do Their Jobs – Your Admin Staff Are Too Busy Running the Business!
A clear sign that your Linux admin team is overwhelmed is a team member spending more than four days per audit cycle helping auditors; if that is happening, something is broken or not obvious. The ideal is one to two days at most.
Keep in mind that answering “what actually happened?” from the SIEM, rather than by interrupting the operation of your big data cluster, is going to be essential. Unlike the data warehouses of, say, 10 years ago, and as a trade-off for 10x or 100x data-processing performance improvements, it is often impossible to get a time-based snapshot from your cluster of what your customer data looked like 45 days ago.
Thankfully, most open source and commercial SIEM systems have interactive reporting capabilities, and there are robust third-party report tool vendors, often specialising in specific market sectors. Training auditors on these reporting tools can take one to two days, a significant audit-cycle cost saving, compared with attempting to train them in the full operation of your cluster, which can take weeks (and, back to point one, that always assumes they have technical audit headcount with the appropriate admin and life cycle experience).
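In practice, the questions auditors ask tend to be narrow and time-bounded. The sketch below shows the shape of such a query run against a SIEM export rather than the live cluster; the export file, host label and field names are hypothetical.

```python
# Minimal sketch: answer an auditor's "who touched node 47 between
# these dates?" question from a SIEM export instead of the running
# cluster. The file, host label and field names are hypothetical.
import json
from datetime import datetime

def in_window(ev, start, end):
    """True if the event's ISO timestamp falls inside [start, end]."""
    ts = datetime.fromisoformat(ev["timestamp"])
    return start <= ts <= end

def who_touched(path, host, start, end):
    """Print every exported event on the given host in the window."""
    with open(path) as fh:
        events = [json.loads(line) for line in fh if line.strip()]
    hits = [ev for ev in events
            if ev["host"] == host and in_window(ev, start, end)]
    for ev in sorted(hits, key=lambda e: e["timestamp"]):
        print(f'{ev["timestamp"]}  {ev["user"]}  {ev["action"]}')

if __name__ == "__main__":
    who_touched("siem_export.jsonl", "node47",
                datetime(2018, 1, 1), datetime(2018, 1, 31))
```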
#5: Certify Your Organization, Not Just the Big Data Cluster
Whilst working on your 2018 operational plan for your organization, your big data clusters and GDPR, think carefully about how auditors will work their way through their checklists. One more international compliance regime with large potential fines can be quite distracting. With potential fines reaching 4 percent of your company’s total worldwide annual turnover or €20 million (whichever is the higher), your scope needs to be a whole-organization approach.
A focus on data privacy alone is going to be a problem – specifically, “user-less” big data software solutions are exposed to small teams of administration staff who can easily subvert the cluster’s technical platform. Luckily, international banks have been dealing with exactly these assurance issues on UNIX and Linux platforms for three decades, and for the last 15 years on data-forecasting clusters very similar to today’s big data systems, and they have been passing quarterly audit cycles with relative ease.
As Pablo Picasso once said, “Good artists copy, great artists steal.” There are far more UNIX and Linux staff with banking operations life cycle experience available on the market than there are people in the very small pool of big data cluster specialists. To get your organization’s big data clusters through GDPR audit, I suggest you “steal” one or two of these heads to supplement your data science and cluster admin geeks.