Generating Synthetic Data with LLMs
A guide to using AI to generate synthetic data for your database and application using any LLM that is available at an endpoint.
May 21st, 2024
Postgres is one of the most commonly used databases in the world. It delivers strong performance and reliability, and it's extensible, scalable, and open source. Despite all of these benefits, teams still face challenges when it comes to setting up and maintaining lower-level environments like development and staging/QA. The two most common ones that we hear from customers are: 1. How do we make production data safe to use in lower-level environments? and 2. How do we create databases efficiently so that every developer can have their own database to build and test against?
At Neosync, we've been laser-focused on the first problem: helping customers make their production data safe to use in lower-level environments while delivering a great developer experience. At the same time, our friends at Neon have been laser-focused on making it ridiculously simple and fast to create database copies for development, previews, testing, and staging environments with their branching workflows.
Today, we're excited to announce that we're bringing Neosync's anonymization and synthetic data generation capabilities to Neon, giving developers a complete solution: they can quickly and easily create branches of their database with anonymized data for a world-class developer experience.
After talking to many teams using Postgres, we’ve learned that:
Developers want to develop against production-like data. Access to realistic data helps catch more edge cases and bugs, which results in more reliable applications.
Some industries come with strict data privacy regulations. Companies handling sensitive information such as PII (personally identifiable information), PHI (protected health information), and PFI (personal financial information) must comply with data privacy regulations like GDPR and HIPAA. These companies can't use production data locally. As a result, teams have to spend a lot of time creating mock data and maintaining it as their data models change.
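To make the pain point concrete, here is a minimal sketch of the kind of hand-rolled mock-data script teams end up writing and maintaining themselves. It uses only the Python standard library; the table shape and names are illustrative assumptions, not anything from Neosync or Neon.

```python
import random

# Small, hard-coded name pools -- exactly the kind of thing that has to be
# maintained by hand as the data model grows.
FIRST_NAMES = ["Alice", "Bob", "Carol", "Dan"]
LAST_NAMES = ["Nguyen", "Smith", "Garcia", "Patel"]

def generate_mock_users(n: int, seed: int = 42) -> list[dict]:
    """Generate deterministic mock user rows for seeding a dev database."""
    rng = random.Random(seed)  # fixed seed so every developer gets the same data
    users = []
    for i in range(n):
        first = rng.choice(FIRST_NAMES)
        last = rng.choice(LAST_NAMES)
        users.append({
            "id": i + 1,
            "name": f"{first} {last}",
            # Appending the index keeps emails unique across rows.
            "email": f"{first.lower()}.{last.lower()}{i}@example.com",
        })
    return users
```

Scripts like this work at first, but every schema change means another round of edits, which is the maintenance burden described above.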
As teams scale, these problems get worse. More databases are needed, which in turn means that more data is needed. Teams have to create more development instances, generate additional mock data, and constantly update this data to reflect changes in production. This increases the complexity and workload significantly, making the setup much harder (and expensive) to maintain.
This is where using Neon with Neosync comes to the rescue.
Now to the fun part: how does this actually work? And what does the developer experience look like?
We've spent a lot of time working on the developer experience and have arrived at a workflow we think is simple and intuitive (the full guide at the end walks through each step in detail). You can do a lot more with both Neon and Neosync, but to get started quickly, it's that simple.
What makes this workflow simple and intuitive are Neon's database branching and Neosync's transformers. With Neon, you can create a branch off of your main branch in seconds. With Neosync, you can define transformers that anonymize data or generate synthetic data with just a few clicks. Pretty awesome.
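To give a feel for what a transformer does conceptually, here is a small sketch of column-level anonymization in plain Python. This is not Neosync's API; the function names and the email-masking strategy are assumptions made for illustration, and in practice you would configure equivalent transformers in Neosync with a few clicks.

```python
import hashlib

def anonymize_email(email: str) -> str:
    """Mask the local part of an email but keep the domain, so the
    anonymized value still looks realistic in a dev environment."""
    local, _, domain = email.partition("@")
    # Hashing makes the mapping deterministic: the same input always
    # produces the same anonymized output across runs.
    digest = hashlib.sha256(local.encode()).hexdigest()[:8]
    return f"user_{digest}@{domain}"

def anonymize_row(row: dict, pii_columns: set) -> dict:
    """Apply a transformer to each flagged PII column; pass the rest through."""
    out = {}
    for col, val in row.items():
        if col in pii_columns and col == "email":
            out[col] = anonymize_email(val)
        elif col in pii_columns:
            out[col] = "REDACTED"  # crude fallback for other PII columns
        else:
            out[col] = val
    return out
```

The deterministic hashing is a deliberate choice in this sketch: it keeps anonymized data consistent across tables and runs, which matters when foreign keys or joins reference the same value.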
For more details and information, check out the docs and guide or watch the video tutorial.
For a v2 of this workflow, we're planning a tighter integration with Neon that incorporates webhooks and automated workflows, allowing developers to create new branches in Neon without manually configuring them in Neosync. If this is something you'd like to see, tell us on Twitter or Discord (and stay tuned for updates).
If you’re interested in trying this workflow for free, follow the steps in the guide and get started in just a few minutes.
Looking forward to what you'll build!