- #01 · box-cox(price) · target_enc(country) · bin(age, 5) · 0.858 ± 0.011 · Best
- #02 · log(price) · freq_enc(country) · age², age · 0.852 ± 0.009
- #03 · log(price) · target_enc(country) · bin(age, 5) · 0.847 ± 0.012
- #04 · log(price) · onehot(country) · age · 0.821 ± 0.018
- #05 · price · count_enc(country) · bin(age, 10) · 0.804 ± 0.021
Stop hand-crafting features. Search for them.
Norma treats preprocessing as a search problem. Every combination of transform, encoding, and binning choices is a candidate pipeline. Each pipeline is scored by 5-fold XGBoost cross-validation. The best one wins.
Point it at a table. Name the target. Get back a model-ready dataset and the recipe that produced it.
- Eval engine: XGBoost · 5-fold CV
- Search space: 10⁶+
- Output: Parquet + recipe
- Works with: any tabular data
When this matters
If your features are the bottleneck and your intuition is the limit.
Norma is for teams whose data is messy, whose target is clear, and whose preprocessing is held together by convention.
You write the same preprocessing every quarter
Log the price, target-encode the country, bin the age. Your repos rhyme. The intuition is shared but never tested against a search.
Feature engineering takes longer than modelling
XGBoost is a one-liner. Getting to the inputs that make it work is two weeks. The bottleneck moved a long time ago.
You suspect a better encoding exists
You'd try mean-encoding with smoothing, or splines on tenure, or a ratio feature. There was never time to A/B every variant against CV.
Pipelines drift between notebook and production
The training transform was eyeballed in a cell. The serving transform was reimplemented in Python. The scores never quite line up.
How it works
Search beats intuition at the long tail of features.
Norma frames preprocessing as optimisation. The objective is held-out CV score. Every choice is a coordinate in the search space.
Point at a table
A Parquet file, a warehouse query, a CSV. Norma reads the schema, samples the rows, and infers types.
Name the target
One column. Classification or regression. Norma picks the right metric (AUC, RMSE, log-loss) and locks the folds.
Norma searches the pipeline space
It proposes candidate pipelines (encodings, scalers, bins, ratios, interactions) and scores each with 5-fold XGBoost CV. Bad branches die early.
Export the winner
You get the model-ready dataset and the recipe that built it. Same recipe runs at training and at serving. No drift.
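The four steps above can be pictured as a loop: enumerate candidate pipelines, score each one fold by fold, and abandon a candidate once its running mean falls too far behind the leader. The sketch below is a minimal illustration under stated assumptions, not Norma's implementation. `SEARCH_SPACE`, `TOY_GAIN`, `score_fold`, and the pruning `margin` are all hypothetical, and `score_fold` is a toy stand-in for fitting XGBoost on one fold.

```python
import itertools
import statistics

# Hypothetical search space: one option is picked per column.
SEARCH_SPACE = {
    "price": ["raw", "log", "box-cox"],
    "country": ["onehot", "freq_enc", "target_enc"],
    "age": ["raw", "bin5", "bin10"],
}

# Toy per-choice gains so the example runs without a model.
# A real scorer would fit XGBoost on the fold's training rows instead.
TOY_GAIN = {"raw": 0.00, "log": 0.02, "box-cox": 0.03,
            "onehot": 0.00, "freq_enc": 0.02, "target_enc": 0.03,
            "bin5": 0.01, "bin10": 0.00}


def score_fold(pipeline, fold):
    """Stand-in for one fold of CV: deterministic, higher is better."""
    base = 0.80 + sum(TOY_GAIN[choice] for choice in pipeline.values())
    return base + 0.001 * (fold - 2)  # small fold-to-fold spread


def search(space, n_folds=5, margin=0.02):
    """Score every candidate; prune one whose running mean trails the best."""
    columns = list(space)
    best_score, leaderboard, pruned = float("-inf"), [], 0
    for combo in itertools.product(*(space[c] for c in columns)):
        pipeline = dict(zip(columns, combo))
        scores = []
        for fold in range(n_folds):
            scores.append(score_fold(pipeline, fold))
            # Early stop: after two folds, drop candidates far behind the leader.
            if len(scores) >= 2 and statistics.mean(scores) < best_score - margin:
                break
        if len(scores) < n_folds:
            pruned += 1
            continue
        mean = statistics.mean(scores)
        best_score = max(best_score, mean)
        leaderboard.append((mean, statistics.stdev(scores), pipeline))
    leaderboard.sort(key=lambda row: row[0], reverse=True)
    return leaderboard, pruned


leaderboard, pruned = search(SEARCH_SPACE)
best_mean, best_std, best_pipeline = leaderboard[0]
```

Exhaustive enumeration only works for a toy space like this one. A space of 10⁶+ pipelines would need a smarter proposal strategy, which this sketch does not attempt.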
What ships
A dataset, a recipe, and a leaderboard you can read.
Three artefacts come out of every search.
- 01
A model-ready dataset
Parquet out, leakage-safe by construction. All target encodings are fit per fold; all imputers learn on the train side. Drop it straight into your trainer.
- 02
A reproducible recipe
Every transform, every parameter, every seed. The recipe is the pipeline — load it at serving time and you get the same features for the same row.
Winning recipe · churn_q3.parquet · 7 steps · CV AUC 0.872 ± 0.007
- numeric · log1p(price) · +0.011
- numeric · box-cox(tenure) · +0.004
- category · target_enc(country, smoothing=20, cv=5) · +0.018
- category · freq_enc(plan) · +0.002
- binning · bin(age, 7, strategy=quantile) · +0.006
- ratio · price / (tenure + 1) · +0.009
- interact · plan × country · +0.003
- 03
A leaderboard with provenance
Every trial scored, every CV fold logged, every feature attributed. You can read why the winner won and where the second-best fell off.
- #01 · box-cox(price) · target_enc(country) · bin(age, 5) · 0.858 ± 0.011 · Best
- #02 · log(price) · freq_enc(country) · age², age · 0.852 ± 0.009
- #03 · log(price) · target_enc(country) · bin(age, 5) · 0.847 ± 0.012
- #04 · log(price) · onehot(country) · age · 0.821 ± 0.018
- #05 · price · count_enc(country) · bin(age, 10) · 0.804 ± 0.021
Why Norma
Most preprocessing is folklore. Norma makes it falsifiable.
Two principles hold the system up.
01
The objective is the score on data you held out.
Five-fold cross-validation, stratified, with target encodings fit inside the fold. If a feature wins, it wins on data the model never saw. No leakage, no story.
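One way to picture stratified, locked folds: assign every row to a fold once, from a fixed seed, dealing rows out round-robin within each class so each fold keeps roughly the same class balance. This is a plain-Python sketch of the general technique. `assign_folds` and its parameters are illustrative names, not part of Norma.

```python
import random
from collections import Counter, defaultdict


def assign_folds(labels, n_folds=5, seed=0):
    """Return a fold index per row, stratified by label and fixed by seed."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row, label in enumerate(labels):
        by_class[label].append(row)
    folds = [0] * len(labels)
    for label in sorted(by_class):
        rows = by_class[label]
        rng.shuffle(rows)  # random order within the class, repeatable via seed
        for position, row in enumerate(rows):
            folds[row] = position % n_folds  # deal rows out round-robin
    return folds


# Toy data: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
folds = assign_folds(labels, n_folds=5, seed=42)
positives_per_fold = Counter(f for f, y in zip(folds, labels) if y == 1)
```

Because the assignment depends only on the labels and the seed, every candidate pipeline can be scored on identical splits, so differences in score come from the features and not from the fold draw.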
02
The recipe is the pipeline.
Reproducibility is a property of the artefact, not the notebook. The recipe Norma exports is what runs at serving — same code path, same parameters, same row-level features.
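To make "the recipe is the pipeline" concrete, here is a minimal sketch of a recipe stored as data and replayed by one function at both training and serving. The JSON schema, the `apply_recipe` helper, and the frequency table values are hypothetical. Only the step names (log1p on price, price / (tenure + 1), frequency encoding of plan) echo the example recipe shown earlier.

```python
import json
import math

# Hypothetical recipe format: an ordered list of steps with all fitted
# parameters stored inline, so nothing is re-learned at serving time.
RECIPE_JSON = """
[
  {"op": "log1p", "column": "price", "out": "price_log1p"},
  {"op": "ratio", "num": "price", "den": "tenure", "offset": 1,
   "out": "price_per_tenure"},
  {"op": "freq_enc", "column": "plan", "out": "plan_freq",
   "table": {"basic": 0.6, "pro": 0.3, "team": 0.1}, "default": 0.0}
]
"""


def apply_recipe(recipe, row):
    """Apply each step to one row (a dict) and return the feature dict."""
    features = {}
    for step in recipe:
        op = step["op"]
        if op == "log1p":
            value = math.log1p(row[step["column"]])
        elif op == "ratio":
            value = row[step["num"]] / (row[step["den"]] + step["offset"])
        elif op == "freq_enc":
            # Categories missing from the stored table get the default.
            value = step["table"].get(row[step["column"]], step["default"])
        else:
            raise ValueError(f"unknown op: {op}")
        features[step["out"]] = value
    return features


recipe = json.loads(RECIPE_JSON)
row = {"price": 99.0, "tenure": 10, "plan": "pro"}
training_features = apply_recipe(recipe, row)
# Round-trip through JSON to mimic loading the saved recipe at serving time.
serving_features = apply_recipe(json.loads(json.dumps(recipe)), row)
```

The design choice worth noting is that fitted state (here, the frequency table) lives inside the recipe itself, so the serving side has nothing to recompute and no second implementation to keep in sync.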
What Norma isn't
Three things people assume — and the actual answer.
The space is crowded with adjacent tools. Here's what makes Norma structurally different.
Not AutoML
Norma stops at the dataset. The model is yours — XGBoost, a linear baseline, a neural net, whatever. We optimise the inputs, you own the architecture.
Not a feature store
A feature store serves features you've already defined. Norma generates new ones from raw columns and proves which ones move the metric.
Not a notebook helper
No autocomplete, no in-cell suggestions. Norma runs as a non-interactive search — submit a job, get a leaderboard, ship the winner.
Common questions
What teams ask before they wire it in.
01 Why XGBoost as the evaluator?
XGBoost is a strong, fast, well-calibrated baseline on tabular data. Using it as the scorer means a feature that raises the score in Norma's search is a feature that helps almost any tree-based or linear model downstream. The signal generalises.
02 How do you avoid leakage in the search?
Every transform that learns from the target (target encoding, mean encoding, smoothed counts) is fit inside the cross-validation fold, never on the full set. The recipe Norma exports preserves the same fold-aware logic at serving time.
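The fold-aware fitting described in this answer is commonly called out-of-fold target encoding: each row is encoded with statistics learned from the other folds only, so a row's own label never feeds its own feature. The sketch below shows the general technique with additive smoothing toward the global mean. The function names, the smoothing formula, and the toy data are assumptions for illustration, not Norma's code.

```python
from collections import defaultdict


def fit_target_encoder(categories, targets, smoothing=20.0):
    """Learn smoothed per-category means from the rows given, and nothing else."""
    prior = sum(targets) / len(targets)
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the prior; rare categories shrink most.
    table = {cat: (sums[cat] + smoothing * prior) / (counts[cat] + smoothing)
             for cat in counts}
    return table, prior


def out_of_fold_encode(categories, targets, folds, smoothing=20.0):
    """Encode each row with statistics fit on the other folds only."""
    encoded = [0.0] * len(categories)
    for held_out in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != held_out]
        table, prior = fit_target_encoder(
            [categories[i] for i in train],
            [targets[i] for i in train],
            smoothing,
        )
        for i, f in enumerate(folds):
            if f == held_out:
                # Unseen categories fall back to the training-side prior.
                encoded[i] = table.get(categories[i], prior)
    return encoded


# Toy data: two folds, three countries.
countries = ["CA", "CA", "US", "US", "FR", "CA", "US", "FR"]
churned = [1, 0, 1, 1, 0, 1, 0, 0]
folds = [0, 1, 0, 1, 0, 1, 0, 1]
encoded = out_of_fold_encode(countries, churned, folds, smoothing=1.0)

# Leakage check: flipping row 0's label must not change row 0's encoding.
flipped = list(churned)
flipped[0] = 0
encoded_flipped = out_of_fold_encode(countries, flipped, folds, smoothing=1.0)
```

In this sketch, a serving-time table would be fit once with `fit_target_encoder` on all training rows, since serving rows have no label to leak.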
03 How big can my data be?
The search runs on a representative sample by default — typically a few hundred thousand rows is enough to score pipelines reliably. The winning recipe then runs over the full dataset to produce the final artefact.
04 Can I constrain the search space?
Yes. Whitelist or blacklist transforms, cap the number of generated features, fix specific encodings for compliance. The default search is broad; the constrained search is a flag.
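Conceptually, constraining the search amounts to filtering the options available per column before any pipeline is scored. The sketch below illustrates allow lists, deny lists, and pinned encodings. The `constrain` function, its arguments, and the option names are hypothetical and do not reflect Norma's actual flags. Capping the number of generated features is not covered here.

```python
def constrain(space, allow=None, deny=(), pinned=None):
    """Filter per-column options by allow/deny lists; pinned columns get one option."""
    pinned = pinned or {}
    constrained = {}
    for column, options in space.items():
        if column in pinned:
            # A pinned column (e.g. for compliance) is removed from the search.
            constrained[column] = [pinned[column]]
            continue
        kept = [o for o in options
                if o not in deny and (allow is None or o in allow)]
        if not kept:
            raise ValueError(f"no transforms left for column {column!r}")
        constrained[column] = kept
    return constrained


# Hypothetical default space, one list of options per column.
space = {
    "price": ["raw", "log", "box-cox"],
    "country": ["onehot", "freq_enc", "target_enc"],
    "age": ["raw", "bin5", "bin10"],
}
constrained = constrain(space, deny={"target_enc"}, pinned={"age": "raw"})
```

Here the deny list removes target encoding everywhere and `age` is fixed to its raw form, which shrinks the toy space from 27 candidate pipelines to 6.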
05 Where does it run?
Norma runs as a job — locally, in your VPC, or as a hosted service. The data never leaves the boundary you point it at. The output is a Parquet file and a serialisable recipe.
Pricing
Pay for the search, not the seat.
The engine is open. You pay when you want it to run faster, share more, or sit inside your perimeter.
Plan 01
Open
Free
Self-hosted, single-node
The Norma engine, the recipe runtime, and the CLI. Run it on your laptop or on a single box.
- Full search engine and recipe runtime
- Local datasets up to a few million rows
- Recipe import / export
- Community support
Plan 02
Team
From $399/mo
For data and ML teams
Hosted search workers, shared leaderboards, and warehouse connectors. Run searches in parallel and share recipes across the team.
- Parallel search workers
- Snowflake / BigQuery / Postgres connectors
- Shared leaderboards and recipe registry
- Email + Slack support
Plan 03
Enterprise
Custom
Single-tenant deployment
Self-hosted in your VPC, audit-grade lineage, SSO. For regulated environments and high-volume warehouses.
- Self-hosted or VPC-isolated
- Audit lineage on every recipe
- SSO, SAML, SCIM
- Dedicated solutions engineer
- Procurement-friendly contracts
§ 03 · Engagement intake
01 / Start a brief
Talk to the people who’ll do the work.
We staff small and senior, scope by phase, and end on a written deliverable. We don’t sell decks or hours.
If we’re not the right team for the job, we say so on the first call. The bar is production, not pitch.
02 / Where to find us
Montreal, Quebec
Satellite office
- Lat / Lng: 45.5089°N · 73.5542°W
- Local: EST · UTC−05