Norma / A product from GroupLabs

Automated feature engineering

Stop hand-crafting features. Search for them.

Norma treats preprocessing as a search problem. Each combination of transform, encoding, and binning choices is a candidate pipeline. Each pipeline is scored by 5-fold XGBoost cross-validation. The best one wins.
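As a rough illustration of that framing, the sketch below enumerates a tiny pipeline space and keeps the best-scoring candidate. The names `SEARCH_SPACE`, `candidates`, `search`, and `toy_score` are hypothetical, not Norma's API. The scorer is a stand-in with made-up weights; a real run would put 5-fold XGBoost cross-validation in its place.

```python
import itertools
from typing import Callable, Dict, Iterator, List, Tuple

Pipeline = Dict[str, str]

# Hypothetical search space: one list of options per preprocessing choice.
SEARCH_SPACE: Dict[str, List[str]] = {
    "price": ["raw", "log", "box-cox"],
    "country": ["onehot", "freq_enc", "target_enc"],
    "age": ["raw", "bin5", "bin10"],
}


def candidates(space: Dict[str, List[str]]) -> Iterator[Pipeline]:
    """Yield every combination of choices as one candidate pipeline."""
    keys = list(space)
    for combo in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))


def search(space: Dict[str, List[str]],
           score: Callable[[Pipeline], float]) -> Tuple[Pipeline, float]:
    """Score every candidate and return the best (pipeline, score) pair."""
    best, best_score = {}, float("-inf")
    for pipeline in candidates(space):
        s = score(pipeline)
        if s > best_score:
            best, best_score = pipeline, s
    return best, best_score


def toy_score(pipeline: Pipeline) -> float:
    """Stand-in scorer with made-up weights; NOT a real cross-validation score."""
    weights = {"log": 0.02, "box-cox": 0.03, "target_enc": 0.04,
               "freq_enc": 0.03, "bin5": 0.01}
    return 0.80 + sum(weights.get(choice, 0.0) for choice in pipeline.values())


best_pipeline, best_score = search(SEARCH_SPACE, toy_score)
print(best_pipeline, round(best_score, 3))
```

Exhaustive enumeration only works for toy spaces like this one. A space of a million or more pipelines needs a smarter proposal strategy and a trial budget.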

Point it at a table. Name the target. Get back a model-ready dataset and the recipe that produced it.

v1.0 · released

norma · search · churn_q3.parquet · Searching

Trial budget: 248 / 1,024
Best AUC: 0.858

#    Pipeline                                           CV AUC ± σ
#01  box-cox(price), target_enc(country), bin(age, 5)   0.858 ± 0.011  (Best)
#02  log(price), freq_enc(country), age², age           0.852 ± 0.009
#03  log(price), target_enc(country), bin(age, 5)       0.847 ± 0.012
#04  log(price), onehot(country), age                   0.821 ± 0.018
#05  price, count_enc(country), bin(age, 10)            0.804 ± 0.021

Eval engine: XGBoost · 5-fold CV
Search space: 10⁶+
Output: Parquet + recipe
Works with: any tabular data

When this matters

If your features are the bottleneck and your intuition is the limit.

Norma is for teams whose data is messy, whose target is clear, and whose preprocessing is held together by convention.

You write the same preprocessing every quarter

Log the price, target-encode the country, bin the age. Your repos rhyme. The intuition is shared but never tested against a search.

Feature engineering takes longer than modelling

XGBoost is a one-liner. Getting to the inputs that make it work is two weeks. The bottleneck moved a long time ago.

You suspect a better encoding exists

You'd try mean-encoding with smoothing, or splines on tenure, or a ratio feature. There was never time to A/B every variant against CV.

Pipelines drift between notebook and production

The training transform was eyeballed in a notebook cell. The serving transform was reimplemented by hand in a Python service. The scores never quite line up.

How it works

Search beats intuition at the long tail of features.

Norma frames preprocessing as optimisation. The objective is held-out CV score. Every choice is a coordinate in the search space.

Step 01

Point at a table

A Parquet file, a warehouse query, a CSV. Norma reads the schema, samples the rows, and infers types.
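To make the type-inference step concrete, here is a minimal sketch for the CSV case using only the standard library. The function names and the three-way integer / float / category split are assumptions for illustration; the source does not say which types Norma distinguishes or how many rows it samples.

```python
import csv
import io
from typing import Dict, List


def infer_type(values: List[str]) -> str:
    """Guess a column type from sampled strings (empty strings count as missing)."""
    present = [v for v in values if v != ""]
    if not present:
        return "unknown"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in present:
                caster(v)
            return name
        except ValueError:
            continue
    return "category"


def infer_schema(csv_text: str, sample_rows: int = 1000) -> Dict[str, str]:
    """Read up to `sample_rows` rows and infer one type per column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns: Dict[str, List[str]] = {name: [] for name in reader.fieldnames or []}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break
        for name in columns:
            columns[name].append(row[name] or "")
    return {name: infer_type(vals) for name, vals in columns.items()}


# Made-up sample rows, including one missing age.
sample = "price,country,age,churned\n19.99,CA,34,0\n5.00,US,,1\n12.50,CA,51,0\n"
print(infer_schema(sample))
```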

Step 02

Name the target

One column. Classification or regression. Norma picks the right metric (AUC, RMSE, log-loss) and locks the folds.

Step 03

Norma searches the pipeline space

It proposes candidate pipelines (encodings, scalers, bins, ratios, interactions) and scores each with 5-fold XGBoost CV. Bad branches die early.
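The source does not say how bad branches are cut, so the following is one possible rule, sketched under that assumption: score a candidate fold by fold and abandon it once its running mean trails the current best by more than a margin. `score_with_pruning`, `min_folds`, and `margin` are hypothetical names, and the fold scores are made up.

```python
from statistics import mean
from typing import Callable, List, Optional


def score_with_pruning(
    fold_scorer: Callable[[int], float],
    best_so_far: Optional[float],
    k: int = 5,
    min_folds: int = 2,
    margin: float = 0.02,
) -> Optional[float]:
    """Score folds one at a time; return the mean, or None if pruned early."""
    scores: List[float] = []
    for fold in range(k):
        scores.append(fold_scorer(fold))
        behind = best_so_far is not None and mean(scores) < best_so_far - margin
        if min_folds <= len(scores) < k and behind:
            return None  # abandon the candidate; remaining folds are never run
    return mean(scores)


calls: List[int] = []


def weak_candidate(fold: int) -> float:
    """Made-up fold scores for a candidate that trails the leader."""
    calls.append(fold)
    return [0.80, 0.81, 0.79, 0.80, 0.82][fold]


result = score_with_pruning(weak_candidate, best_so_far=0.858)
print(result, calls)
```

The saving comes from the folds that are skipped. The cost is that a candidate with two unlucky early folds can be dropped unfairly, which is why the margin and the minimum fold count are tunable here.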

Step 04

Export the winner

You get the model-ready dataset and the recipe that built it. Same recipe runs at training and at serving. No drift.

What ships

A dataset, a recipe, and a leaderboard you can read.

Three artefacts come out of every search.

01
A model-ready dataset
Parquet out, leakage-safe by construction. All target encodings are fit per fold; all imputers learn on the train side. Drop it straight into your trainer.
02
A reproducible recipe
Every transform, every parameter, every seed. The recipe is the pipeline — load it at serving time and you get the same features for the same row.
Winning recipe · churn_q3.parquet
7 steps · CV AUC 0.872 ± 0.007
- numeric   log1p(price)                              +0.011
- numeric   box-cox(tenure)                           +0.004
- category  target_enc(country, smoothing=20, cv=5)   +0.018
- category  freq_enc(plan)                            +0.002
- binning   bin(age, 7, strategy=quantile)            +0.006
- ratio     price / (tenure + 1)                      +0.009
- interact  plan × country                            +0.003
03
A leaderboard with provenance
Every trial scored, every CV fold logged, every feature attributed. You can read why the winner won and where the second-best fell off.
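A leaderboard of this kind reduces to a small aggregation over per-fold scores. The sketch below assumes the σ column is the population standard deviation across folds, which the source does not specify. The trial names and fold scores are illustrative, not Norma output.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def leaderboard(trials: Dict[str, List[float]]) -> List[Tuple[str, float, float]]:
    """Rank trials by mean fold score, descending, with the spread across folds."""
    rows = [(name, mean(scores), pstdev(scores)) for name, scores in trials.items()]
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Illustrative per-fold AUCs for two hypothetical trials.
trials = {
    "log(price), onehot(country), age": [0.80, 0.83, 0.82, 0.84, 0.81],
    "box-cox(price), target_enc(country), bin(age, 5)": [0.85, 0.87, 0.86, 0.84, 0.87],
}

board = leaderboard(trials)
for rank, (name, m, s) in enumerate(board, start=1):
    print(f"#{rank:02d}  {name}  {m:.3f} ± {s:.3f}")
```

Keeping the raw fold scores, not only the mean, is what lets you see whether a winner won everywhere or only on one lucky fold.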

Why Norma

Most preprocessing is folklore. Norma makes it falsifiable.

Two principles hold the system up.

The objective is the score on data you held out.

Five-fold cross-validation, stratified, with target encodings fit inside the fold. If a feature wins, it wins on data the model never saw. No leakage, no story.
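Stratified fold assignment can be sketched in a few lines. This version is an assumption about the general technique, not Norma's implementation: rows are grouped by class, shuffled with a fixed seed, and dealt round-robin so every fold keeps roughly the same class balance. `stratified_folds` is a hypothetical name.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_folds(target: Sequence, k: int = 5, seed: int = 0) -> List[int]:
    """Assign each row a fold id in [0, k), spreading every class across folds."""
    by_class: Dict[object, List[int]] = defaultdict(list)
    for i, y in enumerate(target):
        by_class[y].append(i)
    rng = random.Random(seed)  # fixed seed: the same folds on every run
    folds = [0] * len(target)
    offset = 0  # carries over so leftover rows do not all land in fold 0
    for y in sorted(by_class, key=repr):
        rows = by_class[y]
        rng.shuffle(rows)
        for j, i in enumerate(rows):
            folds[i] = (j + offset) % k
        offset += len(rows)
    return folds


# Made-up labels: 80 negatives, 20 positives.
labels = [0] * 80 + [1] * 20
folds = stratified_folds(labels, k=5, seed=42)
positives_per_fold = [sum(1 for f, y in zip(folds, labels) if f == k and y == 1)
                      for k in range(5)]
print(positives_per_fold)
```

Fixing the seed means every candidate pipeline is scored on identical splits, so score differences come from the features and not from the partition.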

The recipe is the pipeline.

Reproducibility is a property of the artefact, not the notebook. The recipe Norma exports is what runs at serving — same code path, same parameters, same row-level features.
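One way to picture a recipe that is also the pipeline: a serialised list of steps with their fitted parameters, plus a single function that applies it. The JSON layout, the op names, and the bin edges below are hypothetical and only illustrate the idea; the source does not describe Norma's recipe format.

```python
import json
import math
from bisect import bisect_right
from typing import Dict

# Hypothetical recipe: ordered steps with fitted parameters (edges are made up).
RECIPE_JSON = json.dumps({
    "steps": [
        {"op": "log1p", "col": "price", "out": "price_log1p"},
        {"op": "ratio", "num": "price", "den": "tenure", "out": "price_per_tenure"},
        {"op": "bin", "col": "age", "edges": [25, 35, 50], "out": "age_bin"},
    ]
})


def apply_recipe(recipe_json: str, row: Dict[str, float]) -> Dict[str, float]:
    """Apply every step to one row; training and serving call the same function."""
    out = dict(row)
    for step in json.loads(recipe_json)["steps"]:
        if step["op"] == "log1p":
            out[step["out"]] = math.log1p(row[step["col"]])
        elif step["op"] == "ratio":
            # Mirrors price / (tenure + 1): the +1 avoids division by zero.
            out[step["out"]] = row[step["num"]] / (row[step["den"]] + 1)
        elif step["op"] == "bin":
            # Bin index = number of stored edges at or below the value.
            out[step["out"]] = bisect_right(step["edges"], row[step["col"]])
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return out


row = {"price": 20.0, "tenure": 3.0, "age": 41.0}
features = apply_recipe(RECIPE_JSON, row)
print(features)
```

Because the bin edges are stored in the recipe and never recomputed at serving time, the same row produces the same features wherever the recipe is loaded.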

What Norma isn't

Three things people assume — and the actual answer.

The space is crowded with adjacent tools. Here's what makes Norma structurally different.

Not AutoML

Norma stops at the dataset. The model is yours — XGBoost, a linear baseline, a neural net, whatever. We optimise the inputs, you own the architecture.

Not a feature store

A feature store serves features you've already defined. Norma generates new ones from raw columns and proves which ones move the metric.

Not a notebook helper

No autocomplete, no in-cell suggestions. Norma runs as a non-interactive search — submit a job, get a leaderboard, ship the winner.

Common questions

What teams ask before they wire it in.

01 Why XGBoost as the evaluator?

XGBoost is a strong, fast baseline on tabular data. Using it as the scorer means a feature that raises the CV score is likely to help other tree-based models downstream, and often linear ones too. The signal tends to generalise.

02 How do you avoid leakage in the search?

Every transform that learns from the target (target encoding, mean encoding, smoothed counts) is fit inside the cross-validation fold, never on the full set. The recipe Norma exports preserves the same fold-aware logic at serving time.
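The fold-aware fitting described here is commonly done as out-of-fold encoding. The sketch below is an assumption about the general technique, using additive smoothing toward the training-side mean; the exact smoothing formula Norma uses is not stated. Function names are hypothetical and the data is made up.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def fit_target_encoding(cats: Sequence[str], ys: Sequence[float],
                        smoothing: float) -> Tuple[Dict[str, float], float]:
    """Return (mapping, prior): per-category mean of y, shrunk toward the prior."""
    prior = sum(ys) / len(ys)
    total: Dict[str, float] = defaultdict(float)
    count: Dict[str, int] = defaultdict(int)
    for c, y in zip(cats, ys):
        total[c] += y
        count[c] += 1
    mapping = {c: (total[c] + smoothing * prior) / (count[c] + smoothing)
               for c in count}
    return mapping, prior


def out_of_fold_encode(cats: Sequence[str], ys: Sequence[float],
                       folds: Sequence[int], smoothing: float = 20.0) -> List[float]:
    """Encode each row with statistics fit only on rows from the other folds."""
    encoded = [0.0] * len(cats)
    for k in sorted(set(folds)):
        train = [i for i, f in enumerate(folds) if f != k]
        mapping, prior = fit_target_encoding(
            [cats[i] for i in train], [ys[i] for i in train], smoothing)
        for i, f in enumerate(folds):
            if f == k:
                # Categories unseen on the training side fall back to the prior.
                encoded[i] = mapping.get(cats[i], prior)
    return encoded


# Made-up rows: category, label, and fold id per row.
cats = ["CA", "CA", "US", "US", "FR"]
ys = [1.0, 0.0, 1.0, 1.0, 0.0]
fold_ids = [0, 1, 0, 1, 2]
print(out_of_fold_encode(cats, ys, fold_ids, smoothing=0.0))
```

The property to check is that a row's own label never influences its own encoding: flip one label and that row's encoded value stays the same.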

03 How big can my data be?

The search runs on a representative sample by default — typically a few hundred thousand rows is enough to score pipelines reliably. The winning recipe then runs over the full dataset to produce the final artefact.
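The source does not define "representative", so the sketch below assumes one reasonable reading: a seeded sample stratified by the target, so that rare classes keep their share. `stratified_sample` is a hypothetical name and the labels are made up.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def stratified_sample(target: Sequence, n: int, seed: int = 0) -> List[int]:
    """Pick about n row indices, keeping each class's share close to the full data."""
    by_class: Dict[object, List[int]] = defaultdict(list)
    for i, y in enumerate(target):
        by_class[y].append(i)
    rng = random.Random(seed)  # fixed seed: the same sample on every run
    frac = min(1.0, n / len(target))
    picked: List[int] = []
    for y in sorted(by_class, key=repr):
        rows = by_class[y]
        take = max(1, round(len(rows) * frac))  # keep at least one row per class
        picked.extend(rng.sample(rows, take))
    return sorted(picked)


# Made-up target: 10% positives.
target = [0] * 900 + [1] * 100
sample_idx = stratified_sample(target, n=200, seed=1)
print(len(sample_idx), sum(target[i] for i in sample_idx))
```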

04 Can I constrain the search space?

Yes. Whitelist or blacklist transforms, cap the number of generated features, fix specific encodings for compliance. The default search is broad; the constrained search is a flag.
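Conceptually, a constrained search is a filter applied to the space before any pipeline is proposed. The sketch below assumes a dictionary-shaped search space and hypothetical `allow`, `deny`, and `fixed` arguments; it is not Norma's actual flag or configuration syntax, and it leaves out the cap on generated features.

```python
from typing import Dict, Iterable, List, Optional


def constrain(space: Dict[str, List[str]],
              allow: Optional[Iterable[str]] = None,
              deny: Iterable[str] = (),
              fixed: Optional[Dict[str, str]] = None) -> Dict[str, List[str]]:
    """Filter each column's options by allow/deny lists, then pin fixed choices."""
    allow_set = set(allow) if allow is not None else None
    deny_set = set(deny)
    out: Dict[str, List[str]] = {}
    for col, options in space.items():
        out[col] = [o for o in options
                    if o not in deny_set and (allow_set is None or o in allow_set)]
    for col, choice in (fixed or {}).items():
        out[col] = [choice]  # e.g. an encoding mandated by compliance
    empty = [col for col, options in out.items() if not options]
    if empty:
        raise ValueError(f"no transforms left for: {empty}")
    return out


# Hypothetical space, with target encoding denied and the age handling pinned.
space = {
    "price": ["raw", "log", "box-cox"],
    "country": ["onehot", "freq_enc", "target_enc"],
    "age": ["raw", "bin5", "bin10"],
}
constrained = constrain(space, deny=["target_enc"], fixed={"age": "raw"})
print(constrained)
```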

05 Where does it run?

Norma runs as a job — locally, in your VPC, or as a hosted service. The data never leaves the boundary you point it at. The output is a Parquet file and a serialisable recipe.

Pricing

Pay for the search, not the seat.

The engine is open. You pay when you want it to run faster, share more, or sit inside your perimeter.

Plan 01

Open

Free

Self-hosted, single-node

The Norma engine, the recipe runtime, and the CLI. Run it on your laptop or on a single box.

Full search engine and recipe runtime
Local datasets up to a few million rows
Recipe import / export
Community support

Recommended

Plan 02

Team

From $399/mo

For data and ML teams

Hosted search workers, shared leaderboards, and warehouse connectors. Run searches in parallel and share recipes across the team.

Parallel search workers
Snowflake / BigQuery / Postgres connectors
Shared leaderboards and recipe registry
Email + Slack support

Plan 03

Enterprise

Custom

Single-tenant deployment

Self-hosted in your VPC, audit-grade lineage, SSO. For regulated environments and high-volume warehouses.

Self-hosted or VPC-isolated
Audit lineage on every recipe
SSO, SAML, SCIM
Dedicated solutions engineer
Procurement-friendly contracts

§ 03 · Engagement intake

01 / Start a brief

Talk to the people who’ll do the work.

We staff small and senior, scope by phase, and end on a written deliverable. We don’t sell decks or hours.

If we’re not the right team for the job, we say so on the first call. The bar is production, not pitch.

→team@grouplabs.ca

02 / Where to find us

Calgary, Alberta

Studio HQ

+1 (587) 700-9968

Lat / Lng: 51.0486°N · 114.0708°W
Time zone: MST · UTC−07

Montreal, Quebec

Satellite office

+1 (825) 365-9891

Lat / Lng: 45.5089°N · 73.5542°W
Time zone: EST · UTC−05
