
Building OpenRedaction: A Regex-First Open Source Story

By Sam Pettiford, December 4, 2025

Sam Pettiford

Founder of OpenRedaction, focused on building practical privacy infrastructure for developers shipping AI and data-heavy applications.


Most open-source stories start with a vague idea and end with a maintainer's backlog.

OpenRedaction started with something much more practical: a need for deterministic, self-hosted PII redaction that developers could trust in production. Not a black box. Not a model that behaves differently depending on prompt wording. Not a hosted service that quietly moves sensitive text through infrastructure you do not control. Just a library that takes text in, finds sensitive spans, and redacts them in a way you can inspect, test, and deploy on your own terms.

That sounds simple, and in many ways it is. But the moment you start applying it to real-world text, the simplicity gets tested. Production text is messy, jurisdiction-specific, full of edge cases, and often far more ambiguous than the clean examples that make it into docs. That tension, between a clean deterministic core and the messy reality of developer workflows, is where OpenRedaction became a real project.

What follows is the story of why we built it this way, what broke when we tried to apply it at scale, and why the combination of regex-first detection, validators, presets, and aggressive testing turned out to be the right foundation.

The starting point

The original problem was straightforward: teams needed a way to strip names, emails, phone numbers, addresses, account references, and other identifiers out of text before that text hit logs, exports, analytics pipelines, or external systems.

At first glance, this looks like a solved problem. In reality, most teams end up with one of three weak patterns:

  • They rely on manual review, which does not scale.
  • They use a third-party API, which introduces privacy, residency, and procurement friction.
  • They bolt together a few regexes, which works until the first false positive, formatting variation, or new jurisdiction-specific identifier.

The design goal for OpenRedaction was to avoid all three failure modes. We wanted something you could run locally, inside your own infrastructure, without shipping raw text to a vendor. We wanted deterministic output so the same input would always produce the same redacted result. And we wanted the codebase to be readable enough that a security engineer, privacy lead, or skeptical platform team could inspect the patterns and understand exactly what the system would do.

That is why the project is regex-first. Regex gives you a predictable detection layer. It is not magical, but it is auditable, fast, and easy to reason about. In security-adjacent tooling, that matters more than many people admit.

Why regex-first won

Regex is often dismissed as basic, but for PII detection it is the right primitive for a large part of the problem space.

Structured identifiers tend to have structure for a reason. Email addresses, phone numbers, bank identifiers, tax numbers, card formats, and many national ID types are not arbitrary free text. They follow patterns, include delimiters, and often have checksum or format constraints that can be validated without machine learning. That makes them ideal for rule-based detection.

The key advantage is not just accuracy. It is predictability.

If a pattern matches, it matches the same way every time. There is no model drift, no hidden inference layer, no surprise dependence on prompt formatting, and no need to explain why one deployment redacted a value while another did not. That matters in compliance reviews, incident response, audit trails, and internal architecture discussions. It also matters to developers who just need a tool they can trust under load.
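To make that predictability concrete, here is a minimal sketch of what a deterministic, regex-first detection pass can look like. The pattern set, the span shape, and the function names are illustrative assumptions for this post, not OpenRedaction's actual API:

```javascript
// Minimal sketch of a deterministic, regex-first detector.
// PATTERNS, detect(), and the span shape are illustrative assumptions,
// not OpenRedaction's actual internals.
const PATTERNS = [
  { type: "email", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g },
  { type: "uk_phone", regex: /\b(?:\+44\s?7\d{3}|07\d{3})\s?\d{3}\s?\d{3}\b/g },
];

function detect(text) {
  const spans = [];
  for (const { type, regex } of PATTERNS) {
    regex.lastIndex = 0; // reset sticky state so repeated calls stay deterministic
    let m;
    while ((m = regex.exec(text)) !== null) {
      spans.push({ type, start: m.index, end: m.index + m[0].length, value: m[0] });
    }
  }
  // Sort by position so the output ordering is stable across runs.
  return spans.sort((a, b) => a.start - b.start);
}
```

The same input always yields the same spans in the same order, which is exactly the property that makes the output easy to test and audit.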

That said, regex alone is not enough if you want the tool to survive in production. Real text does not arrive as tidy examples.

What production text actually looks like

Once OpenRedaction moved beyond simple demo cases, the edge cases showed up immediately.

Support tickets include broken formatting, copied signatures, accidental JSON fragments, concatenated messages, and quoted history. Logs contain escaped characters, stack traces, query strings, and partial payloads. CSV exports often blur together user input, internal metadata, and cells that are technically text but semantically sensitive. Chat transcripts can contain repeated turns, nested quotes, pasted documents, and partial redactions from upstream systems that need to be recognized rather than treated as fresh text.

This is where "just write a regex" stops being a complete answer.

The real work is in the layers around the patterns:

  • Pattern coverage. You need a broad, maintained library of patterns across multiple jurisdictions and data types, not a single mega-regex that claims to solve everything.
  • Validation. Some identifiers need checksum checks, context checks, or format-aware rules to avoid false positives.
  • Priority ordering. When two patterns overlap, the engine has to know which one wins.
  • Redaction modes. Teams need different output styles depending on whether they are masking for internal use, sanitizing logs, or preparing data for external models.
  • Test coverage. The test suite is not a side effect of the product. It is part of the product.

That last point became one of the strongest lessons from building the library. In privacy tooling, the test suite is the contract.
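Of those layers, priority ordering is the easiest to get subtly wrong. One rule that keeps it deterministic: when detected spans overlap, the higher-priority pattern wins, with ties broken by the longer match. A sketch, where the priority values and span shape are assumptions for illustration:

```javascript
// Sketch of deterministic overlap resolution: when spans overlap,
// keep the higher-priority one; break ties by preferring the longer match.
// Priority values and the span shape are illustrative assumptions.
function resolveOverlaps(spans) {
  // Order candidates: higher priority first, then longer, then earlier.
  const ordered = [...spans].sort((a, b) =>
    b.priority - a.priority ||
    (b.end - b.start) - (a.end - a.start) ||
    a.start - b.start
  );
  const kept = [];
  for (const span of ordered) {
    const clashes = kept.some(k => span.start < k.end && k.start < span.end);
    if (!clashes) kept.push(span);
  }
  // Return in document order for downstream redaction.
  return kept.sort((a, b) => a.start - b.start);
}
```

Because the tie-breaking rules are total, two deployments given the same spans will always keep the same winners.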

What we engineered around

The project matured by focusing on the boring things that make a privacy library usable in the real world.

Pattern breadth

Instead of relying on a handful of broad patterns, we expanded the library into a large set of maintained, categorized patterns. That gives teams coverage across common identifiers while still letting them tune what gets detected in their environment.

This matters because privacy use cases are rarely uniform. A support system in the UK may care about phone numbers, emails, addresses, NHS-style identifiers, and payment references. A healthcare workflow may care more about medical context and patient identifiers. A SaaS product processing enterprise documents may need a very different default profile. A practical redaction engine needs to support those differences without forcing one global opinion.

Context and validators

Pure pattern matching can overmatch. A number looks like an identifier until it is actually an invoice line item, an internal reference, or part of a benign code sample. That is why we invested in validators and contextual rules that reduce false positives without weakening the detection layer too much.
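A concrete example of this kind of validator is a checksum check: a 16-digit string only deserves to be treated as a card number if it passes Luhn. The sketch below shows the standard Luhn algorithm, not OpenRedaction's actual validator code:

```javascript
// Luhn checksum: cheaply rejects most random digit strings that merely
// look like payment card numbers. Standard algorithm, shown here as an
// illustrative sketch rather than the library's actual validator.
function luhnValid(candidate) {
  const digits = candidate.replace(/[\s-]/g, "");
  if (!/^\d{12,19}$/.test(digits)) return false;
  let sum = 0;
  // Walk right-to-left, doubling every second digit.
  for (let i = 0; i < digits.length; i++) {
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}
```

A detector can then require both the pattern match and the checksum before flagging a span, which cuts false positives on invoice numbers and internal references without loosening the pattern itself.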

This is especially important in developer tools, where false positives can break workflows and cause people to distrust the product. If a privacy tool redacts too aggressively, teams start disabling it. If it redacts too little, they stop relying on it. The balance is hard, but it is the difference between a library that gets installed and one that gets adopted.

Presets and modes

Another design decision was to make the tool feel deployable rather than theoretical. Teams do not want to assemble a privacy policy from scratch every time. They want sensible defaults they can understand and override.

That is why presets matter. Different redaction modes are useful in different environments: stricter modes for external outputs, more permissive masking for internal debugging, and compliance-oriented bundles for regulated workflows. The goal is to reduce decision fatigue while still allowing teams to adapt the engine to their risk model.
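In code, a preset is little more than a named bundle of detectors plus an output mode. The names and shapes below are illustrative assumptions, not the library's API:

```javascript
// Sketch of how presets might bundle detectors with a redaction mode.
// Preset names, detector names, and modes are illustrative assumptions,
// not OpenRedaction's actual configuration surface.
const PRESETS = {
  // Strict: for text leaving your infrastructure (exports, external models).
  external_strict: {
    detectors: ["email", "phone", "card", "national_id", "address"],
    mode: "replace",
  },
  // Permissive: keep enough shape for internal debugging.
  internal_debug: {
    detectors: ["email", "card", "national_id"],
    mode: "mask_partial",
  },
};

function applyMode(value, type, mode) {
  if (mode === "replace") return `[${type.toUpperCase()}]`;
  // mask_partial: keep the first character, mask the rest.
  return value[0] + "*".repeat(Math.max(value.length - 1, 0));
}
```

Teams start from a preset that matches their risk model and override individual detectors or modes, instead of assembling a policy from scratch.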

Testing as infrastructure

We also treated tests as part of the runtime architecture, not just as a CI checkbox.

If a pattern changes, the relevant test should fail immediately. If a new jurisdiction-specific format is added, the suite should show whether it creates regressions elsewhere. If a validator improves recall but hurts precision, the tradeoff should be visible in code review. This is how a redaction engine stays honest over time.

In a project like this, tests are not only about correctness. They are about preserving trust.

Why open source mattered

Open source was not a branding choice. It was a trust choice.

Privacy tools are evaluated differently from generic developer libraries. People want to know where the code runs, how the patterns behave, and whether the system introduces a hidden dependency on some external inference service. They want to read the implementation, inspect the rules, and reproduce the output locally.

MIT licensing, public tests, and visible documentation all support that expectation. If someone doubts a pattern, they can read it. If someone wants to contribute a new detector for a format in their region, they can submit it. If a team wants to run the library in a locked-down environment, they can do that without negotiating a vendor contract.

That visibility is not just philosophically nice. It is operationally useful. Trust scales when behavior is inspectable.

Developer experience became the product

As the project matured, it became obvious that correctness alone would not be enough. The library had to be easy to adopt.

That meant documentation that answered real implementation questions instead of only describing the API surface. It meant examples that showed how the library fits into Node apps, middleware, ingestion jobs, and data pipelines. It meant a browser playground where people could test the behavior instantly without creating an account or uploading data into a system they did not trust.

Developer tools are rarely won by raw capability alone. They are won by reducing friction at the exact moment someone is deciding whether to try them.

We also found that the more boring the setup looked, the more serious teams engaged with it. A library that can run locally, has clear usage examples, and exposes deterministic behavior is easier to defend internally than a tool that promises magic.

The enterprise path without breaking the core

One of the hardest balancing acts in open source is deciding where the project ends and the support layer begins.

The answer for OpenRedaction was to keep the core open and self-hostable, while making it possible for teams to get help when they need it. Some teams just want the library. Others need review support, rollout guidance, or help integrating it into larger workflows. The important thing is that the open core remains useful on its own.

That separation protects the integrity of the project. It also keeps the main promise intact: use it locally, understand it, and keep control of your data path.

What we learned

A few lessons became very clear along the way.

First, transparency beats hype. Security-adjacent libraries do not win by sounding clever. They win by being understandable.

Second, pattern depth becomes a moat only if you keep pruning false positives. Coverage without maintenance is just technical debt.

Third, documentation is part of the sales process whether you want it to be or not. If a tool is hard to evaluate, serious teams move on.

Fourth, limits matter. No pattern-based system will ever guarantee 100 percent detection of every possible PII instance. Saying that clearly builds more trust than claiming perfection.

Finally, maintenance is the product. Formats drift, regulations change, new identifier types emerge, and customers keep finding new ways to paste sensitive data into places they should not. A redaction engine has to keep pace with that reality.

Where it stands now

OpenRedaction today is a regex-first, open-source PII detection and redaction library built for developers who want deterministic, auditable, self-hostable behavior. It is designed to run locally, fit into real applications, and provide enough visibility that teams can explain and defend how it works.

That is the real story.

Not that PII is easy. Not that regex solves everything. But that a disciplined, inspectable, local-first approach can get you much farther than most people expect when they first start building privacy tooling.

If you are building developer infrastructure in the same lane, the lessons are probably similar: ship a deterministic core, invest early in tests and docs, be honest about limits, and let the code earn trust.

OpenRedaction is still evolving, but the principle has not changed. Make the safe path obvious. Make the behavior inspectable. And make privacy a property of the system, not a promise in the README.