Need for Speed: Optimizing Data Masking Performance and Providing Secure Data for DevOps Users

Source – securityboulevard.com

Let’s start with a pretty common life experience — you identify a need (e.g., transportation), you evaluate your options (e.g., evaluate car manufacturers, various features, pricing, etc.), and you decide to purchase (e.g., vehicle X). This process repeats itself over and over again regardless of the purchase. What typically happens following the purchase decision is also equally likely and transferrable — that is: How do I improve it? Increase efficiency? Can I tailor it to my individual needs?

For most technology purchases, including those related to data security — and data masking in particular — the analogy holds equally true. For many of our data security customers, the desire to optimize the through-put, run-rate, or outputs from the solutions they invest in is becoming increasingly important as they race to achieve regulatory compliance with key data privacy requirements and regulations such as the European (EU)-wide General Data Protection Regulation (GDPR), HIPAA, or PCI-DSS. Equally important is that most organizations are looking to mitigate the risk of sensitive data exposure and optimize their DevOps function; allowing more end-users to access data for test and development functions, without the risk associated with using sensitive data. And, they want all this to be achieved FASTER.

Imperva offers a variety of data security solutions to support these increasingly common organizational challenges, including Imperva Camouflage. Our industry-leading data masking solution and best practice implementation process offer a one-stop means to achieve compliance support and reduce data risks in DevOps, while meeting end-user processing expectations. Simply put: the process involves the use of a production database copy, upon which the data is classified for masking, and transformation algorithms are then applied to produce fictional — but contextually accurate — data that is substituted for the original source data; all done in an expeditious manner to meet the need for speed.

You’ve decided you need data masking, but your end-users want more

In previous blogs and webinars, we highlighted the value that data masking provides for protecting data privacy and supporting industry and regulatory compliance initiatives. Industry analysts continue to see ongoing and expanding use cases and demand for the technology. This is largely due to the fact that organizational data capture and storage of data, and sensitive customer data, in particular, continues to grow. Further, changing data applications; database types, migration to the cloud (for DevOps), privacy regulations required for de-identified data, and the growth of big data applications and various data use cases all combine to drive the added need for data masking technologies and their diversification and advancement.

So, organizations are seeing the value of data masking, and many have implemented it into their overall data security strategies to provide yet another critical layer. That said, they are also demanding increased speed of masked copy processing and provisioning to ensure their DevOps teams continue to deliver on business-critical processes. How then can data masking be optimized? Let’s first take a look at typical performance considerations, and then review how the process can be optimized for end-users.

How do you measure data masking ‘performance’?

One of the most common questions asked during a sales cycle, POC or implementation relates to the length of time it will take to mask data, and how that performance is measured. The quick answer is ‘it’s complicated’. Data Masking run-rate performance is typically measured in rows per second. This is the metric most often cited by our customers, but the underlying size of a row can and does vary significantly depending on how wide the particular tables are, therefore impacting performance comparison.

Additionally, the variance in data volumes and data model complexity make it challenging to provide specific performance numbers; but these are batch processes that modify large amounts of data (excepting discovery) and therefore consume a significant amount of time for large amounts of data. That said, Imperva offers a number of avenues to optimize performance for a given customer/client’s requirements and can be reverse engineered in most cases to achieve the desired performance or run-time metric. We’ll get into the specifics on this shortly.

Aside from the inherent capabilities of the software solution itself, there are several factors that influence the performance of data masking that we discuss with our customers. We also explain that the various combination of these make it challenging for any vendor to pinpoint exact masking run times. We’ve consolidated the performance-impacting variables into three key variables, including database characteristics, hardware requirements, and masking configurations. Let’s review each of these:

Database characteristics:

In general, a large database takes longer to mask than a small database- pretty simple! To be more specific the height and width (row count and columns per row) of the tables in the database being masked directly impact the runtime. Tall tables, with high row counts, have more data elements to process compared to shorter tables.

In contrast, wide tables containing extraneous non-sensitive information introduces I/O overhead throughout the data transformation process because the non-sensitive information is included, in part, during the data transformation process. This makes the input of the DevOps SME’s key to assessing the underlying databases and helping scope the performance requirements.

Hardware Requirements:

Data masking can be a relatively heavy process on the database tier in that it copies and moves a significant amount of data in many cases. As we employ our best practice implementation process using a secure staging server, this introduces yet another variable that influences through-put- but also an opportunity. The processing power and I/O speed on the database staging tier greatly influence the performance- regardless of vendor or solution being deployed. When we provide base hardware specifications to customers for the staging server, we make this clear. We also help by providing a range of hardware options required depending on the underlying environment characteristics and end-user requirements- noting that where SLA windows are tight for masking then appropriate hardware should be provisioned to accommodate accordingly. The good news is that this is an easy configuration, and most customers already have access to all the tools they need to maximize their staging server to their specific requirements.

Masking Configurations within the projects:

The details of the security and data masking requirements, which are driven by the organization, also influence performance. In particular, the amount and complexity of the masking being applied have an impact on masking run times. From our experience, in cases where typical sensitive data element types require masking, the data resides in 15% – 20% of the tables. If higher security requirements are imposed and additional data elements are included, this could expand to include as many as 30% – 40% of the tables.

In addition to the volume of data being masked, the specific data transformations also influence the runtimes. There are different types of data transformers, and they each have different performance characteristics based on the manner in which they manipulate data. For example, ‘Data Generators’ synthesize fictitious numbers such as Credit Cards and Phone numbers, whereas, ‘Data Loaders’ load data such as shuffling names and addresses from defined sets. The business needs usually dictate which transformer should be used for a given data element, but sometimes the business requirement can be met with one or more transformers. In those cases, the option chosen can have an impact on performance.

For each of these key variables, it’s important to assess the business requirements and then balance appropriately with regards to the complexity of the underlying data model, the staging server horsepower, and the methods applied for the masking process. Imperva’s depth of experience in this regard provides additional value to customers when they are looking to understand the best implementation approach to meet both data security and end-user requirements. It’s a critical piece of the puzzle.

We know what impacts performance and what to consider beforehand. Now, how do we make it even better?

While we’ve focused on the more tool-agnostic variables that impact masking performance, there are also considerations within the tool itself that can help fine-tune the end result. For Imperva’s solution, there are a variety of levers that can be used to customize and optimize high-volume/high-throughput masking. For example, performance settings within the solution can be adjusted at multiple levels of the application stack including (a) the database server, (b) the Imperva Camouflage application server, and (c) within the masking engine itself to maximize performance. Settings for parallelization of operations, flexible allocation of hardware resources (RAM) and the use of bulk-SQL commands during masking operations, are some of the ways in which the performance and scalability of the Imperva Camouflage solution can also be configured.

A number of approaches for maximum scalability and performance are also available within the Imperva Camouflage solution that can be considered depending on the environment and requirements, including:

Multi-Threading – parallelization is used throughout Imperva Camouflage to enable masking to scale to the largest of databases and masking targets. This includes the capability to process many database columns at the same time while accounting for dependencies within certain databases.
Optimized SQL – although invisible to the user, Imperva Camouflage refines the SQL used to affect the masking depending on the database type as well as the particular masking operation being performed. No configuration changes are necessary to take advantage of bulk-SQL and other commands that minimize database logging overhead.
Execution on the Database Tier – many operations are performed directly on the database server(s) which has the effect of minimizing data movement over the network thereby maximizing performance. It also leverages the hardware resources that are typically dedicated to database servers.
Parallelization on the Database Tier – wherever possible, operations are performed in parallel using multiple instances on the database tier as well. For some environments, this is a combination of database engine settings as well as Imperva Camouflage configuration. By scaling up or down, the masking process can be made to conform to the needs/constraints of the given masking operation. This is one area that Imperva typically spends time with its customer’s and masking end users to ensure they are maximizing the tool’s performance.

It’s also important to reinforce that regardless of the solution, the storage architecture and configuration have a significant impact on performance. Faster storage with reads/writes spread across multiple disks will result in better performance overall. In many cases, database and storage tiers are configured for transactional workloads which are different from the bulk/batch workload that masking represents. Better performance will be found with faster storage that is in the same data center as the database server being masked, period!

Slow down to Speed up!

So, there are clearly a variety of factors that impact the run-rate of data masking, and yet there are a variety of levels (once understood) that can be unpacked of to optimize performance and achieve end-user expectations. Additionally, leveraging industry-recognized, purpose-built solutions and best-practice implementation expertise offers a much more efficient and effective way to optimize data masking run-rates; and offers a more scalable and sustainable process over the long-term.

The key is to slow down during the implementation phase. Understand your requirements. Understand your data model. Understand what resources you need to apply to your staging server and your masking processes, and you’re well on your way to optimizing the resulting output. At the end of the day, Imperva can in most cases reverse engineer a customer’s desired performance requirements and configure the solution, processes, and recommended hosting architecture to achieve the desired result.