Solving the data anonymization problem in Postgres

Evis Drenova

@evisdrenova

May 28th, 2024

Introduction

Postgres is one of the most commonly used databases in the world. It delivers strong performance and reliability, and it's extensible, scalable and open source. Even with all of these benefits, teams still face challenges when setting up and maintaining lower-level environments like development and staging/QA. The two we hear most often from customers are: 1. How do we make production data safe to use in lower-level environments, and 2. How do we create databases efficiently so that developers can have their own databases to build and test against?

At Neosync, we've been laser-focused on the first problem: helping customers make their production data safe to use in lower-level environments while delivering a great developer experience. At the same time, our friends at Neon have been laser-focused on making it ridiculously simple and fast to create database copies for development, previews, testing and staging environments with their branching workflows.

And today, we're excited to announce that we're bringing Neosync's anonymization and synthetic data generation capabilities to Neon. Together they give developers a complete solution for quickly and easily creating branches of their database with anonymized data, for a world-class developer experience.

Why do teams need data anonymization?

After talking to many teams using Postgres, we’ve learned that:

Developers want to develop against production-like data. Having access to realistic data helps to catch more edge cases and bugs which results in more reliable applications.
Some industries come with strict data privacy regulations. Companies handling sensitive information such as PII (personally identifiable information), PHI (protected health information), and PFI (personal financial information) must comply with data privacy regulations like GDPR and HIPAA. These companies can't use production data locally. As a result, teams have to spend a lot of time creating mock data and maintaining that data as their data models change.

As teams scale, these problems get worse. More databases are needed, which in turn means that more data is needed. Teams have to create more development instances, generate additional mock data, and constantly update this data to reflect changes in production. This increases the complexity and workload significantly, making the setup much harder (and more expensive) to maintain.

This is where using Neon with Neosync comes to the rescue.

Creating database branches with anonymized data

Now to the fun part: how does this actually work? And what does the developer experience look like?

We've spent a lot of time working on the developer experience and have a workflow we think is simple and intuitive. Here's a step-by-step overview (full guide with more details at the end):

As prerequisites, you need an account in both Neon and Neosync (we both offer generous free tiers).
Pick the database branch in Neon that you’d like to anonymize. Usually this is main but it can be anything. This will be your source database in this workflow.
Create a new branch with a new database in Neon to serve as your destination database. This destination will hold the anonymized data.
Log in to Neosync and create new connections to your source and destination databases. You can just copy and paste the connection strings from Neon into Neosync.
Once the connections are established, create a data synchronization job in Neosync. Define the job to sync data from your source database to the destination database, and configure the job to anonymize the data during the sync process. This involves selecting the tables and columns to anonymize and choosing the appropriate data transformers to generate anonymized data.
After setting up the job, run it to transfer and anonymize the data from the source to the destination database in Neon.
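Since the workflow involves pasting two connection strings, it can be worth a quick sanity check that the source and destination really point at different databases before running a job. Here is a minimal Python sketch of that check. It is not part of Neon or Neosync: the function names are hypothetical, the hostnames are made-up placeholders, and it assumes the two branches differ in host, port or database name.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical placeholder connection strings; real ones are copied from Neon.
SOURCE_URI = "postgresql://alex:secret@ep-source-000000.example.invalid/neondb?sslmode=require"
DEST_URI = "postgresql://alex:secret@ep-dest-111111.example.invalid/neondb?sslmode=require"


def describe_connection(uri: str) -> dict:
    """Split a Postgres connection URI into its parts, leaving the password out."""
    parts = urlsplit(uri)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"not a Postgres connection URI: scheme={parts.scheme!r}")
    query = parse_qs(parts.query)
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # Postgres default port when none is given
        "user": parts.username,
        "database": parts.path.lstrip("/"),
        "sslmode": query.get("sslmode", [None])[0],
    }


def same_target(source_uri: str, destination_uri: str) -> bool:
    """Return True when both URIs resolve to the same host, port and database."""
    src = describe_connection(source_uri)
    dst = describe_connection(destination_uri)
    return (src["host"], src["port"], src["database"]) == (
        dst["host"],
        dst["port"],
        dst["database"],
    )
```

If `same_target(SOURCE_URI, DEST_URI)` returns True, the destination connection most likely points back at the source, and it's worth re-checking which branch the string was copied from.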

That's the gist of it. You can obviously do a lot more with both Neon and Neosync, but to get started quickly, it's that simple.

What makes this workflow simple and intuitive is the combination of Neon's database branching and Neosync's transformers. Using Neon, you can create a branch off of your main branch in seconds. Using Neosync, you can define transformers that anonymize data or generate synthetic data with just a few clicks. Pretty awesome.
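To make the idea of a transformer concrete, here is a rough Python sketch of the two kinds mentioned above: one that anonymizes an existing value and one that ignores the original and generates a synthetic value. This is only an illustration of the concept, not how Neosync implements its transformers. Every name in it (`mask_email`, `make_name_generator`, `run_job`) is hypothetical, and the rows are in-memory dictionaries rather than real tables.

```python
import hashlib
import random

FIRST_NAMES = ["Avery", "Morgan", "Riley", "Quinn"]
LAST_NAMES = ["Hale", "Okafor", "Lindqvist", "Marsh"]


def mask_email(value: str) -> str:
    """Anonymize: swap the local part for a stable hash and keep the domain."""
    local, _, domain = value.partition("@")
    digest = hashlib.sha256(local.encode("utf-8")).hexdigest()[:10]
    return f"user_{digest}@{domain}"


def make_name_generator(seed: int):
    """Generate: build a transformer that returns a synthetic name each call."""
    rng = random.Random(seed)  # seeded so repeated runs give the same output

    def generate(_original: str) -> str:
        # The original value is deliberately ignored.
        return f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"

    return generate


def run_job(rows, column_transformers):
    """Copy rows, applying the transformer mapped to each listed column."""
    output = []
    for row in rows:
        new_row = dict(row)  # never mutate the source row
        for column, transform in column_transformers.items():
            if new_row.get(column) is not None:
                new_row[column] = transform(new_row[column])
        output.append(new_row)
    return output
```

Calling `run_job(rows, {"email": mask_email, "full_name": make_name_generator(7)})` leaves unlisted columns such as primary keys untouched. Note that an unsalted hash that keeps the email domain is far too weak for real anonymization; it is used here only to keep the sketch short.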

For more details and information, check out the docs and guide or watch the video tutorial.

What's next?

For a v2 of the workflow, we’re planning a tighter integration with Neon, incorporating webhooks and automated workflows. This would allow developers to create new branches in Neon without needing to manually configure them in Neosync. If this is something you would like to see, tell us on Twitter or Discord (and stay tuned for updates).