New Feature: Synthetic Data Generator
You can now generate datasets directly from your workspace in Minibase.
Most machine learning projects fail because of data. It’s easy to imagine a task in your head (like “train a small model to summarize research paper abstracts”) but much more tedious to actually assemble the dataset needed to train that model. Collecting clean data can take days or weeks, and it’s the bottleneck that kills most ideas.
Our goal at Minibase.ai is to remove as many of those bottlenecks as possible. Small models are most useful if you can fine-tune them on the right data and ship them quickly. So today, we’re releasing a beta version of our synthetic data generator.
If you give this tool five to twenty high-quality “seed” examples, it will produce a thousand or more additional rows in minutes. All of its output data is automatically formatted, so it can be used to train any Minibase model. It won’t replace real data, but it will get you from “I have an idea” to “I have a training set” fast.
Getting started is easy. After logging into your account, navigate to the “Datasets” tab and click “Generate Dataset.” (It’s the big purple button.) Then follow the prompts: the tool will ask you a few questions and then have you enter several “seed” examples. Each seed example should have the same three columns: Instruction, Input, Response.
Instructions are rules your model must follow. If you’re building a small model to do email spam filtering, for example, then your Instruction might be “Classify this email as either spam or not spam.”
Inputs are the actual data points. In the email spam example, these would be real emails: some with subject lines, some without, messy forwards, terse internal notes, promotional blasts, and so on.
Responses are what the model should produce as output (such as “spam” or “not spam”). The model learns what “good” outputs look like from your Responses.
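For concreteness, here is what a handful of seed rows for the spam-filter example might look like if you draft them in Python before pasting them into the form. The keys mirror the three columns above; the lowercase names, the dict layout, and the email text itself are purely illustrative, not a required format.

```python
# Hand-written seed examples for the spam-filter task.
# The keys mirror the Instruction / Input / Response columns described above;
# the casing and layout here are illustrative, not a required format.
seeds = [
    {
        "instruction": "Classify this email as either spam or not spam.",
        "input": "Subject: You WON a $500 gift card!!! Click here to claim it now.",
        "response": "spam",
    },
    {
        "instruction": "Classify this email as either spam or not spam.",
        "input": "Hey, can you send me the Q3 budget sheet before our 2pm sync?",
        "response": "not spam",
    },
    {
        "instruction": "Classify this email as either spam or not spam.",
        "input": "FWD: FWD: RE: limited-time offer, act fast, unsubscribe below",
        "response": "spam",
    },
]
```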
When you click “Generate,” we send your seeds to a quorum of different large language models. We then take the outputs from one of those models and use them as seeds for the others. This way, we can generate synthetic datasets that are genuinely diverse and cover a broad set of examples. In practice, the tool generates anywhere from two to ten rows of data per second, depending on the length of each seed.
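To make the cross-pollination idea concrete, here is a minimal sketch of how such a loop could be structured. It is an illustration of the concept, not our actual pipeline: every name in it (call_model, cross_pollinate, the model identifiers, the round and batch sizes) is hypothetical, and call_model stands in for a real LLM API call. The usage line reuses the seeds list from the previous snippet.

```python
import itertools
import random

def call_model(model_name, examples, n):
    """Placeholder for a real LLM API call. In practice this would prompt
    `model_name` with the sampled examples and ask for `n` new rows in the
    same Instruction/Input/Response shape. Here it just copies examples so
    the sketch runs end to end."""
    return [dict(random.choice(examples)) for _ in range(n)]

def cross_pollinate(seed_rows, model_names, rounds=3, rows_per_round=50):
    """Minimal sketch of the cross-pollination idea: each round, one model
    expands the current pool, and its outputs join the seeds handed to the
    next model, which keeps the generated rows diverse."""
    pool = list(seed_rows)
    models = itertools.cycle(model_names)
    for _ in range(rounds):
        model = next(models)
        sampled = random.sample(pool, min(len(pool), 10))
        pool.extend(call_model(model, sampled, rows_per_round))
    return pool

synthetic_rows = cross_pollinate(seeds, ["model-a", "model-b", "model-c"])
```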
After your dataset has been generated, you can use it to train models immediately. The generator writes everything into Minibase’s standard format, masks targets correctly (so your model learns to produce the Response, not recite the Input), and sets aside a hold-out split by default so you can evaluate your model’s accuracy easily.
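If you are wondering what “masking targets” means in practice, below is a sketch of the standard technique used when fine-tuning causal language models, along with a simple hold-out split. This shows the general idea, not Minibase’s internal code; the function names and the 10% hold-out fraction are illustrative assumptions.

```python
import random

def build_training_example(tokenizer, instruction, input_text, response):
    """Standard target masking for causal-LM fine-tuning (a general sketch,
    not Minibase's internal code). Loss is computed only on Response tokens."""
    prompt_ids = tokenizer.encode(f"{instruction}\n\n{input_text}\n\n")
    response_ids = tokenizer.encode(response)
    # -100 is the conventional "ignore" index for cross-entropy loss, so the
    # model is never rewarded for reciting the Instruction or Input.
    return {
        "input_ids": prompt_ids + response_ids,
        "labels": [-100] * len(prompt_ids) + list(response_ids),
    }

def split_holdout(rows, holdout_fraction=0.1, seed=0):
    """Set aside a random hold-out slice for evaluation; the 10% fraction
    here is illustrative, not necessarily the platform's default."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]
```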
Of course, this tool doesn’t change the fact that the best data is specific to your task and, often, real. If you can gather and refine a thousand hand-labeled examples from your own workflow, do that; it will almost always beat a synthetic dataset. We recommend treating this tool as an assistant for real data collection, not a replacement: it’s most useful for building quick prototypes or getting a model deployed fast so you can run tests and benchmarks.
Still, this synthetic data tool is another step toward our ultimate goal of making model-building as frictionless as possible. If it sounds useful, sign up for a minibase.ai account today and start generating data and training models entirely for free. And get in touch with us on Discord to let us know what to build next.
— The Minibase Engineers