Norma / A product from GroupLabs
Automated feature engineering

Stop hand-crafting features. Search for them.

Norma treats preprocessing as a search problem. Each combination of transform, encoding, and binning choices is a candidate pipeline. Each pipeline is scored by 5-fold XGBoost cross-validation, and the best-scoring one wins.

Point it at a table. Name the target. Get back a model-ready dataset and the recipe that produced it.

v1.0 · released
norma · search · churn_q3.parquet · Searching
Trial budget: 248 / 1,024 (24%)
Best AUC: 0.858

#    Pipeline                                              CV AUC ± σ
01   box-cox(price) · target_enc(country) · bin(age, 5)    0.858 ± 0.011   Best
02   log(price) · freq_enc(country) · age², age            0.852 ± 0.009
03   log(price) · target_enc(country) · bin(age, 5)        0.847 ± 0.012
04   log(price) · onehot(country) · age                    0.821 ± 0.018
05   price · count_enc(country) · bin(age, 10)             0.804 ± 0.021

XGBoost · 5-fold CV · stratified
Eval engine: XGBoost · 5-fold CV
Search space: 10⁶+
Output: Parquet + recipe
Works with: any tabular data

When this matters

If your features are the bottleneck and your intuition is the limit.

Norma is for teams whose data is messy, whose target is clear, and whose preprocessing is held together by convention.

    01

    You write the same preprocessing every quarter

    Log the price, target-encode the country, bin the age. Your repos rhyme. The intuition is shared but never tested against a search.

    02

    Feature engineering takes longer than modelling

    XGBoost is a one-liner. Getting to the inputs that make it work is two weeks. The bottleneck moved a long time ago.

    03

    You suspect a better encoding exists

    You'd try mean-encoding with smoothing, or splines on tenure, or a ratio feature. There was never time to A/B every variant against CV.

    04

    Pipelines drift between notebook and production

    The training transform was eyeballed in a cell. The serving transform was reimplemented in Python. The scores never quite line up.

How it works

Search beats intuition in the long tail of features.

Norma frames preprocessing as optimisation. The objective is held-out CV score. Every choice is a coordinate in the search space.

    Step 01

    Point at a table

    A Parquet file, a warehouse query, a CSV. Norma reads the schema, samples the rows, and infers types.

    Step 02

    Name the target

    One column. Classification or regression. Norma picks the right metric (AUC, RMSE, log-loss) and locks the folds.

    Step 03

    Norma searches the pipeline space

    It proposes candidate pipelines (encodings, scalers, bins, ratios, interactions) and scores each with 5-fold XGBoost CV. Bad branches die early.

    Step 04

    Export the winner

    You get the model-ready dataset and the recipe that built it. Same recipe runs at training and at serving. No drift.
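
The four steps above can be sketched as a small search loop. This is a minimal illustration, not Norma's implementation: the function names, the `prune_margin` rule, and the toy scorer are all assumptions. In a real run, `score_fold` would fit XGBoost on the training folds and return the metric on the held-out fold. Here a lookup table stands in for it so the sketch runs with the standard library alone.

```python
import itertools
import statistics
from typing import Callable, Iterable, Sequence

Pipeline = tuple[str, ...]


def enumerate_pipelines(choices: dict[str, Sequence[str]]) -> Iterable[Pipeline]:
    """Yield one pipeline per combination of per-column transform choices."""
    columns = sorted(choices)
    yield from itertools.product(*(choices[c] for c in columns))


def search(
    pipelines: Iterable[Pipeline],
    score_fold: Callable[[Pipeline, int], float],
    n_folds: int = 5,
    prune_margin: float = 0.03,  # hypothetical pruning rule
) -> list[tuple[Pipeline, float, float]]:
    """Score pipelines fold by fold and return (pipeline, mean, sigma) rows.

    A pipeline is abandoned as soon as its running mean trails the best
    completed pipeline by more than prune_margin.
    """
    best_mean = float("-inf")
    board = []
    for pipe in pipelines:
        scores = []
        for fold in range(n_folds):
            scores.append(score_fold(pipe, fold))
            if statistics.mean(scores) < best_mean - prune_margin:
                break  # bad branch: stop spending folds on it
        else:
            # Only pipelines that survived every fold reach the leaderboard.
            mean = statistics.mean(scores)
            board.append((pipe, mean, statistics.pstdev(scores)))
            best_mean = max(best_mean, mean)
    return sorted(board, key=lambda row: row[1], reverse=True)


# Toy stand-in for "fit XGBoost, score the held-out fold" (made-up numbers).
TOY_GAIN = {
    "price": 0.0,
    "log(price)": 0.05,
    "onehot(country)": 0.01,
    "target_enc(country)": 0.04,
}


def toy_score_fold(pipe: Pipeline, fold: int) -> float:
    return 0.75 + sum(TOY_GAIN[step] for step in pipe) + 0.001 * fold


choices = {
    "price": ["price", "log(price)"],
    "country": ["onehot(country)", "target_enc(country)"],
}
board = search(enumerate_pipelines(choices), toy_score_fold)
for pipe, mean, sigma in board:
    print(" · ".join(pipe), f"{mean:.3f} ± {sigma:.3f}")
```

The scorer is passed in as a callable so the loop stays independent of the evaluator. The per-fold early exit is one simple way to let bad branches die early. Other pruning rules would fit the same structure.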

What ships

A dataset, a recipe, and a leaderboard you can read.

Three artefacts come out of every search.

  • 01

    A model-ready dataset

    Parquet out, leakage-safe by construction. All target encodings are fit per fold; all imputers learn on the train side. Drop it straight into your trainer.

  • 02

    A reproducible recipe

    Every transform, every parameter, every seed. The recipe is the pipeline — load it at serving time and you get the same features for the same row.

    Winning recipe · churn_q3.parquet
    7 steps · CV AUC 0.872 ± 0.007

    numeric    log1p(price)                               +0.011
    numeric    box-cox(tenure)                            +0.004
    category   target_enc(country, smoothing=20, cv=5)    +0.018
    category   freq_enc(plan)                             +0.002
    binning    bin(age, 7, strategy=quantile)             +0.006
    ratio      price / (tenure + 1)                       +0.009
    interact   plan × country                             +0.003
  • 03

    A leaderboard with provenance

    Every trial scored, every CV fold logged, every feature attributed. You can read why the winner won and where the second-best fell off.
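
To make "the recipe is the pipeline" concrete, here is a minimal sketch of a recipe stored as data and applied to a single row. The JSON layout, the op names, and the bin edges are hypothetical and are not Norma's actual recipe format. The three steps mirror entries in the winning recipe above: `log1p(price)`, `price / (tenure + 1)`, and a binned `age`.

```python
import json
import math
from bisect import bisect_right

# Hypothetical recipe format: an ordered list of steps with fitted parameters.
RECIPE_JSON = json.dumps([
    {"op": "log1p", "col": "price", "out": "price_log1p"},
    {"op": "ratio", "num": "price", "den": "tenure", "den_offset": 1,
     "out": "price_per_tenure"},
    # Edges are illustrative; a real recipe would store the learned quantiles.
    {"op": "bin", "col": "age", "edges": [25, 35, 50], "out": "age_bin"},
])


def apply_recipe(recipe_json: str, row: dict) -> dict:
    """Apply every recipe step to one row; return the row plus derived features."""
    out = dict(row)
    for step in json.loads(recipe_json):
        op = step["op"]
        if op == "log1p":
            out[step["out"]] = math.log1p(row[step["col"]])
        elif op == "ratio":
            out[step["out"]] = row[step["num"]] / (row[step["den"]] + step["den_offset"])
        elif op == "bin":
            # Edges were learned at training time; serving only looks them up.
            out[step["out"]] = bisect_right(step["edges"], row[step["col"]])
        else:
            raise ValueError(f"unknown op: {op}")
    return out


row = {"price": 99.0, "tenure": 10, "age": 41}
features = apply_recipe(RECIPE_JSON, row)
print(features)
```

Because every learned parameter lives in the serialised recipe, calling `apply_recipe` in training and in serving follows the same code path and returns the same features for the same row.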


Why Norma

Most preprocessing is folklore. Norma makes it falsifiable.

Two principles hold the system up.

    01

    The objective is the score on data you held out.

    Five-fold cross-validation, stratified, with target encodings fit inside the fold. If a feature wins, it wins on data the model never saw. No leakage, no story.

    02

    The recipe is the pipeline.

    Reproducibility is a property of the artefact, not the notebook. The recipe Norma exports is what runs at serving — same code path, same parameters, same row-level features.
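
The first principle, target encodings fit inside the fold, can be illustrated with a short out-of-fold encoder. This is a sketch under assumptions rather than Norma's code: the smoothing formula (category mean shrunk toward the global mean) is one common choice, the fold assignment is a plain unshuffled round-robin per class, and all function names are invented for the example.

```python
from collections import defaultdict


def stratified_folds(targets, n_folds=5):
    """Assign fold ids round-robin within each class so folds share the class mix.

    A production splitter would shuffle with a fixed seed first.
    """
    seen = defaultdict(int)
    fold_of = []
    for y in targets:
        fold_of.append(seen[y] % n_folds)
        seen[y] += 1
    return fold_of


def fit_target_encoding(categories, targets, smoothing=20.0):
    """Return ({category: smoothed target mean}, global mean) for the given rows."""
    prior = sum(targets) / len(targets)
    total, count = defaultdict(float), defaultdict(int)
    for c, y in zip(categories, targets):
        total[c] += y
        count[c] += 1
    table = {c: (total[c] + smoothing * prior) / (count[c] + smoothing) for c in count}
    return table, prior


def out_of_fold_encode(categories, targets, n_folds=5, smoothing=20.0):
    """Encode each row using statistics learned from the other folds only."""
    fold_of = stratified_folds(targets, n_folds)
    encoded = [0.0] * len(targets)
    for k in range(n_folds):
        train = [i for i, f in enumerate(fold_of) if f != k]
        table, prior = fit_target_encoding(
            [categories[i] for i in train],
            [targets[i] for i in train],
            smoothing,
        )
        for i, f in enumerate(fold_of):
            if f == k:
                # A category unseen in the training folds falls back to the prior.
                encoded[i] = table.get(categories[i], prior)
    return encoded


countries = ["CA"] * 10 + ["US"] * 9 + ["FR"]
churned = [0, 1] * 10
encoded = out_of_fold_encode(countries, churned)
print(encoded[-1])  # the lone "FR" row never sees its own label
```

The single `FR` row shows the point. Its encoding comes only from rows in other folds, so its own label cannot leak into its feature value.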

What Norma isn't

Three things people assume, and the actual answers.

The space is crowded with adjacent tools. Here's what makes Norma structurally different.

    01

    Not AutoML

    Norma stops at the dataset. The model is yours — XGBoost, a linear baseline, a neural net, whatever. We optimise the inputs, you own the architecture.

    02

    Not a feature store

    A feature store serves features you've already defined. Norma generates new ones from raw columns and proves which ones move the metric.

    03

    Not a notebook helper

    No autocomplete, no in-cell suggestions. Norma runs as a non-interactive search — submit a job, get a leaderboard, ship the winner.

Common questions

What teams ask before they wire it in.

    01

    Why XGBoost as the evaluator?

    XGBoost is a strong, fast, well-calibrated baseline on tabular data. Using it as the scorer means a feature that helps Norma is a feature that helps almost any tree-based or linear model downstream. The signal generalises.

    02

    How do you avoid leakage in the search?

    Every transform that learns from the target (target encoding, mean encoding, smoothed counts) is fit inside the cross-validation fold, never on the full set. The recipe Norma exports preserves the same fold-aware logic at serving time.

    03

    How big can my data be?

    The search runs on a representative sample by default — typically a few hundred thousand rows is enough to score pipelines reliably. The winning recipe then runs over the full dataset to produce the final artefact.

    04

    Can I constrain the search space?

    Yes. Whitelist or blacklist transforms, cap the number of generated features, fix specific encodings for compliance. The default search is broad; the constrained search is a flag.

    05

    Where does it run?

    Norma runs as a job — locally, in your VPC, or as a hosted service. The data never leaves the boundary you point it at. The output is a Parquet file and a serialisable recipe.
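
As an illustration of the constrained search described in question 04, the sketch below filters a per-column candidate list with allow and deny lists and pins one column to a fixed transform. The `constrain` function, its parameters, and the candidate names are hypothetical. Norma's real flags are not shown on this page.

```python
import math


def constrain(candidates, allow=None, deny=(), fixed=None):
    """Filter candidate transforms per column.

    candidates: {column: [transform names]}
    allow:      if given, only these transform names survive
    deny:       transform names to drop
    fixed:      {column: transform} pins exactly one choice for a column
    """
    fixed = fixed or {}
    out = {}
    for col, options in candidates.items():
        if col in fixed:
            out[col] = [fixed[col]]
            continue
        kept = [t for t in options if (allow is None or t in allow) and t not in deny]
        if not kept:
            raise ValueError(f"no transforms left for column {col!r}")
        out[col] = kept
    return out


def space_size(candidates):
    """Number of distinct pipelines: the product of per-column option counts."""
    return math.prod(len(options) for options in candidates.values())


candidates = {
    "price": ["identity", "log", "box-cox"],
    "country": ["onehot", "freq_enc", "target_enc", "count_enc"],
    "age": ["identity", "bin5", "bin10"],
}
constrained = constrain(candidates, deny={"target_enc"}, fixed={"age": "identity"})
print(space_size(candidates), "->", space_size(constrained))
```

Dropping one encoding and pinning one column shrinks this toy space from 36 pipelines to 9. The same multiplication explains why an unconstrained space grows so quickly as columns are added.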

Pricing

Pay for the search, not the seat.

The engine is open. You pay when you want it to run faster, share more, or sit inside your perimeter.

    Plan 01

    Open

    Free

    Self-hosted, single-node

    The Norma engine, the recipe runtime, and the CLI. Run it on your laptop or on a single box.

    • Full search engine and recipe runtime
    • Local datasets up to a few million rows
    • Recipe import / export
    • Community support

    Plan 02 · Recommended

    Team

    From $399/mo

    For data and ML teams

    Hosted search workers, shared leaderboards, and warehouse connectors. Run searches in parallel and share recipes across the team.

    • Parallel search workers
    • Snowflake / BigQuery / Postgres connectors
    • Shared leaderboards and recipe registry
    • Email + Slack support

    Plan 03

    Enterprise

    Custom

    Single-tenant deployment

    Self-hosted in your VPC, audit-grade lineage, SSO. For regulated environments and high-volume warehouses.

    • Self-hosted or VPC-isolated
    • Audit lineage on every recipe
    • SSO, SAML, SCIM
    • Dedicated solutions engineer
    • Procurement-friendly contracts

§ 03  ·  Engagement intake

01 / Start a brief

Talk to the people who’ll do the work.

We staff small and senior, scope by phase, and end on a written deliverable. We don’t sell decks or hours.

If we’re not the right team for the job, we say so on the first call. The bar is production, not pitch.

team@grouplabs.ca

02 / Where to find us

01

Calgary, Alberta

Studio HQ
+1 (587) 700-9968
Lat / Lng
51.0486°N · 114.0708°W
Local
MST · UTC−07
02

Montreal, Quebec

Satellite office
+1 (825) 365-9891
Lat / Lng
45.5089°N · 73.5542°W
Local
EST · UTC−05