A tiny AI model for text de-identification.
Our latest model, DeId-Small, is available now on HuggingFace and Minibase.
TL;DR: We’re releasing a small text-to-text model that de-identifies personal information with strong performance. The model is available on HuggingFace, or can be accessed on Minibase.ai for immediate queries or API usage (no setup required).
De-identification algorithms are used in lots of places. Hospitals use them to scrub patient names and dates from medical notes, for example, and lawyers do the same to redact client identities. It’s not a good idea to just delete sensitive data, either, because the surrounding context is often useful. The best tools for de-identification, then, ought to remove identifying details without stripping away any other words.
Most existing de-identifiers, though, are either rule-based (meaning they follow a “hardcoded” script) or large, domain-specific models. These tools tend to either over-mask (removing too much) or under-mask (leaving sensitive information intact). Most are also too big to deploy locally with low latency.
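To make the rule-based case concrete, here is a minimal sketch of a hardcoded de-identifier. The two patterns and the `rule_based_mask` helper are our own illustration, not taken from any particular tool; they only show how a fixed script under-masks as soon as a format it did not anticipate shows up.

```python
import re

# Hypothetical hardcoded rules: each regex maps to a fixed replacement tag.
RULES = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "[DATE]"),           # ISO dates only
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"), "[EMAIL]"),   # simple emails
]

def rule_based_mask(text: str) -> str:
    """Apply every rule in order and return the masked text."""
    for pattern, tag in RULES:
        text = pattern.sub(tag, text)
    return text

# The ISO date is caught, but the name is not: there is no rule for names.
print(rule_based_mask("Patient John Smith, born 1985-03-15."))
# Patient John Smith, born [DATE].

# A differently formatted date slips through untouched.
print(rule_based_mask("The wedding is June 15, 2025. Write to david.wilson@gmail.com."))
# The wedding is June 15, 2025. Write to [EMAIL].
```

Every new format needs another rule, which is the brittleness a learned model is meant to avoid.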
At Minibase, we decided to train a small model that is fast, runs locally (even from a browser), and works across many text domains and in any language. We trained this latest model, called DeId-Small, in less than one hour with zero code.
There are three key metrics for ranking de-identification models: how well they detect personal information, how completely they remove it, and how much of the original meaning they preserve. On our benchmarks, DeId-Small achieved a 100% detection rate for texts containing personal information, completely sanitized about 65% of them, and retained over 80% of the original meaning. The model is only ~136 MB in size and runs in under half a second per request.
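As a rough illustration of the first two metrics, here is one way they could be computed over a set of model outputs. This is a sketch under our own assumptions, not the benchmark code behind the numbers above: it counts a text as "detected" if the output contains at least one bracketed tag, and as "completely sanitized" if none of the known PII strings survive. The `Example` class and its `pii` list are hypothetical. Meaning preservation is left out, since it needs a similarity measure this post does not specify.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

# Assumed tag shape, based on the examples in this post: [TYPE_N].
PLACEHOLDER = re.compile(r"\[[A-Z]+_\d+\]")

@dataclass
class Example:
    output: str      # model output for one input text
    pii: List[str]   # hypothetical gold list of PII strings in the input

def detected(ex: Example) -> bool:
    """True if the model masked at least one span."""
    return bool(PLACEHOLDER.search(ex.output))

def fully_sanitized(ex: Example) -> bool:
    """True if no known PII string remains in the output."""
    lowered = ex.output.lower()
    return not any(p.lower() in lowered for p in ex.pii)

def score(examples: List[Example]) -> Dict[str, float]:
    n = len(examples)
    return {
        "detection_rate": sum(detected(e) for e in examples) / n,
        "complete_rate": sum(fully_sanitized(e) for e in examples) / n,
    }

examples = [
    Example("Patient [FIRSTNAME_1] [LASTNAME_1], born [DATE_1], lives at [STREET_1].",
            ["John", "Smith", "1985-03-15", "123 Main St"]),
    Example("Patient [FIRSTNAME_1] Smith, born [DATE_1].",   # last name leaked
            ["John", "Smith", "1985-03-15"]),
]
print(score(examples))
# {'detection_rate': 1.0, 'complete_rate': 0.5}
```

The second example shows how a model can detect PII in every text and still fall short of complete sanitization.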
The inputs and outputs look like this:
IN: Patient John Smith, born 1985-03-15, lives at 123 Main St.
OUT: Patient [FIRSTNAME_1] [LASTNAME_1], born [DATE_1], lives at [STREET_1].
IN: My friend David Wilson is getting married June 15, 2025 in Napa. Reach him at david.wilson@gmail.com.
OUT: My friend [FIRSTNAME_1] [LASTNAME_1] is getting married [DATE_1] in [CITY_1]. Reach him at [EMAIL_1].
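If you want to post-process outputs like these, the tags are easy to pick out with a regular expression. The sketch below assumes every tag follows the `[TYPE_N]` shape seen in the two examples above; other outputs may not match that shape exactly. The `tag_counts` helper is our own name.

```python
import re
from collections import Counter

# Assumed tag shape: uppercase type, underscore, running index, e.g. [DATE_1].
TAG = re.compile(r"\[([A-Z]+)_(\d+)\]")

def tag_counts(masked: str) -> Counter:
    """Count how many tags of each type appear in a masked text."""
    return Counter(kind for kind, _index in TAG.findall(masked))

out = ("My friend [FIRSTNAME_1] [LASTNAME_1] is getting married [DATE_1] "
       "in [CITY_1]. Reach him at [EMAIL_1].")
counts = tag_counts(out)
print(sorted(counts.items()))
# [('CITY', 1), ('DATE', 1), ('EMAIL', 1), ('FIRSTNAME', 1), ('LASTNAME', 1)]
```

A summary like this is handy for spot-checking which kinds of information a batch of documents contained, without ever looking at the original values.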
Our results hold up well compared to other approaches on HuggingFace. Rule-based systems tend to break when formats change, and large multilingual models like mT0-XL are strong but weigh several gigabytes and are slow. We think DeId-Small strikes the right balance: it is open-source, compact, and fast. It is released under an Apache 2.0 license.
If you have any questions, or want to contribute datasets and ideas, come join us on the Minibase Discord.