
The Data Challenge: How to Map Data in a Distributed World

By Dotan Nahum, Developer First Security Lead, Check Point Software Technologies

Here’s a quick question. Do you know where all your sensitive data is? As businesses of all sizes create, accumulate, store, and process more data in more places than ever before, it becomes increasingly difficult to classify and track it all.

On the one hand, enterprises pursue digital transformation while weighed down by siloed data and outdated legacy code. On the other hand, 86% of developers admit that they do not consider application security a top priority when coding. Somewhere in between, CISOs are facing burnout as they try to apply code security best practices, privacy regulations, and compliance standards to the confusing process that is the software development lifecycle.

In this post, we’ll take a look at whether distributed data mapping is necessary, what challenges you’ll face along the way, and how you can overcome them.

Why is data distributed in the first place?

Like it or not, most of the data created, stored, and processed by business applications is inherently distributed. All applications require both logical and physical data distribution to scale functionality and performance. Organizations store different types of data in different files and databases for different purposes.

A typical example of data distribution within a company is buyer and customer data. A single SME can have lead, warehouse order, CRM, and social media monitoring data distributed across dozens of internally developed and third-party SaaS applications. These applications read and write data at different intervals and in different formats, to both owned and shared repositories. In most cases, each uses a different schema and field names to store the exact same data.
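As a purely hypothetical illustration (the field names and systems are invented for this example), the same customer record might be stored like this by two different applications:

```python
# Hypothetical example: the same customer stored by two different systems.
# Field names and schemas are invented to illustrate the mismatch.

crm_contact = {
    "contact_id": "c-1042",
    "email_address": "jane.doe@example.com",
    "full_name": "Jane Doe",
}

order_system_customer = {
    "customerId": 1042,
    "email": "jane.doe@example.com",   # same datum, different field name and schema
    "firstName": "Jane",
    "lastName": "Doe",
}
```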

The application development process itself distributes a large portion of the data within the application architecture, especially when it comes to serverless and microservices-based architectures, APIs, and third-party (open source) integrations. So the important question is not why your application’s data is distributed; it’s how to manage that data effectively and securely throughout the application’s lifecycle.

Distributed Data Mapping: Is the Effort Worth the Reward?

“Shift left” application security, big data security, code security, and privacy engineering are not new concepts. Yet software engineers and developers are only now beginning to adopt tools and methodologies to keep code and data safe from malicious actors, primarily because, until recently, security tools were designed and built for information security teams, not developers.

Privacy by design isn’t new either, but in today’s hectic, delivery-driven developer culture, data privacy tends to be ignored. Regulatory standards (such as GDPR, PCI DSS, and HIPAA) are often ignored until they become business priorities. Alternatively, in the aftermath of a data breach, the C-suite can require all relevant departments to take responsibility and introduce preventive measures.

It would be great if all software services and algorithms were developed according to privacy by design principles: we would plan and build systems in a way that makes data management easy, streamline access control across application architectures, and build compliance and code security into our products from the start. In short, it would be absolutely fantastic. But that’s not the reality for most development teams today. If you want to be proactive about data privacy, where do you start?

The first step in protecting your data is knowing where it resides, who is accessing it, and where it is moving. This seemingly straightforward process is called data mapping, and it includes discovering, evaluating, and classifying your application’s data flows.

Data mapping entails using manual, semi-automated, and fully automated tools to organize data processes and to research and enumerate all the services, databases, storage, and third-party resources that handle data records.
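To make the semi-automated side of this concrete, here is a minimal sketch of an inventory step. It assumes, purely for illustration, that each service declares its data stores in a `datastores.yml` file checked into its repository; that file layout and its keys are invented for this example.

```python
"""Minimal sketch of a semi-automated data-store inventory step.

Assumes (hypothetically) that each service keeps a `datastores.yml`
manifest in its repository; the layout and keys are invented here.
"""
from pathlib import Path
import yaml  # pip install pyyaml

def collect_datastores(repo_root: str) -> list[dict]:
    inventory = []
    for manifest in Path(repo_root).rglob("datastores.yml"):
        service = manifest.parent.name
        for store in yaml.safe_load(manifest.read_text()) or []:
            inventory.append({
                "service": service,
                "store": store.get("name"),
                "kind": store.get("kind"),          # e.g. postgres, s3, redis
                "contains_pii": store.get("pii", False),
            })
    return inventory

if __name__ == "__main__":
    # Walk the current checkout and print one row per declared data store.
    for row in collect_datastores("."):
        print(row)
```

Even a rough inventory like this gives the data map a starting point that can be refreshed on every build instead of rebuilt by hand.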

Mapping application dataflow gives you a holistic view of the moving parts of your app and helps you understand the relationships between the various data components, regardless of storage format, owner, or location (physical or logical).

Don’t expect an easy ride.

Mapping data for compliance, security, interoperability, or integration purposes is easier said than done. Obstacles you may face include:

Chasing a moving target

Depending on the overall size and complexity of your application, manual data mapping can take weeks or months. Because most applications that need data mapping are thriving, growing projects, the mapping effort is constantly chasing codebase growth and the deployment of additional data stores across microservices and distributed data processing tasks. Unless you keep iterating on it, the data map is outdated the moment it’s complete.

Ease of data distribution

Why do new data stores appear faster than you can map them? Because it’s so easy to deploy new data-driven features, microservices, and workflows using cloud-based tools and services. As the application grows, so does the number of services that handle data. Developers also love experimenting with new technologies and frameworks, so you may find yourself dealing with a complex containerized infrastructure (including Docker and Kubernetes clusters) that was easy to deploy but is a nightmare to map.

Fear of Legacy Code

When businesses undertake digital transformation of legacy systems, they must deal with the data used and generated by those systems. In many cases, especially in established companies, the person who originally wrote and maintained the legacy code is no longer with the company. So it’s up to you to navigate complex service interconnectivity and data standardization in an outdated environment with limited visibility or documentation.

Integrate security and privacy engineering into your application

It’s no secret that data is stolen every day. We can almost guarantee that your email address is included in one or more datasets sold on the dark web.

What can you do to protect your applications and data from the greed of cybercriminals and the scrutiny of regulators?

Scan the code to map the data.

Modern CI/CD pipelines and processes use Static Application Security Testing (SAST) tools to identify code issues, security vulnerabilities, and code secrets accidentally pushed to public repositories. You can use similar static code analysis techniques to discover and map data flows in your application.

This approach maps the code components that can access, process, and store data, thus mapping the flow of data without fully crawling the contents of a database or data store.
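As a rough sketch of what that static pass can look like: the snippet below simply scans source files for identifiers that look like personal data, so the matches can seed a manual data map. The keyword list, file layout, and the choice to scan Python sources are assumptions for the example; it is not a full data-flow analysis and not any particular vendor’s tool.

```python
"""Minimal sketch of a SAST-style pass that flags likely data-flow points.

Greps Python sources for identifiers that look like personal data
(the keyword list is an assumption) and reports file:line matches
that a reviewer can turn into entries in the data map.
"""
import re
from pathlib import Path

PII_HINTS = re.compile(r"\b(email|phone|ssn|address|birth_date|credit_card)\w*\b", re.I)

def scan_sources(root: str) -> list[tuple[str, int, str]]:
    findings = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if PII_HINTS.search(line):
                findings.append((str(path), lineno, line.strip()))
    return findings

if __name__ == "__main__":
    for path, lineno, line in scan_sources("src"):
        print(f"{path}:{lineno}: {line}")
```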

Enforce clear boundaries for microservices.

In a microservices architecture, each microservice should ideally be autonomous, for better or for worse. But when it comes to sensitive data, where does one microservice end and another begin?

By focusing on the application’s logical domain model and related data, you can identify each microservice’s boundaries along with its domain model and data. Then work to minimize coupling between those microservices.
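One way to picture such a boundary is to let a single service own the sensitive record and expose only a minimal summary to everyone else. The service names and fields below are hypothetical, just a sketch of the idea:

```python
"""Sketch of keeping sensitive data inside one service's boundary.

A hypothetical billing service owns the full payment record; other
services only ever receive a minimal, non-sensitive summary, which keeps
both coupling and data exposure between services low.
"""
from dataclasses import dataclass

# Owned exclusively by the billing microservice.
@dataclass
class PaymentRecord:
    order_id: str
    card_number: str      # sensitive: never leaves this service
    amount_cents: int

# The only shape other services are allowed to consume.
@dataclass
class PaymentSummary:
    order_id: str
    amount_cents: int
    status: str

def to_summary(record: PaymentRecord, status: str) -> PaymentSummary:
    # Deliberately drops the card number at the boundary.
    return PaymentSummary(record.order_id, record.amount_cents, status)
```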

Shift privacy left in a distributed world

Data security and privacy are rarely priorities for application developers. So it’s no surprise that application data can end up uncataloged, floating around unmanaged cloud properties and on-premise devices. But in 2023, data privacy laws and the potential data security threats lurking in code cannot be ignored.

Mapping the flow of data into and out of your application is the first step toward shifting privacy left and integrating privacy engineering, compliance, and code security into your CI/CD pipeline.

About the author

Dotan Nahum is the Developer First Security Lead at Check Point Software Technologies. Dotan was the co-founder and CEO of SpectralOps, which was acquired by Check Point Software. He is a hands-on technologist and major open source contributor with deep expertise in React, Node.js, Go, React Native, distributed systems, and infrastructure (Hadoop, Spark, Docker, AWS, etc.). Dotan can be reached at dotann@checkpoint.com, on Twitter at https://twitter.com/jodot, and via https://www.checkpoint.com/.
