MMM Dataset

A shared synthetic DTC subscription dataset with known ground truth for MMM lessons on adstock, saturation, validation, and budget decisions.

Published May 20, 2026

Why it exists

The MMM Dataset is a shared lab dataset for this MMM curriculum. It is synthetic on purpose: the data-generating process is known, so each lesson can compare a model, chart, or business claim against the source of truth.

Real marketing data almost never gives you that luxury. You usually observe spend, controls, and outcomes, then infer carryover, diminishing returns, and contribution under uncertainty. A transparent synthetic dataset makes the mechanics inspectable before we move into messier validation work.

Downloads

The public files live in /public/data/mmm/ in the repo and are served from /data/mmm/ on the website. The generator is scripts/generate-mmm-dataset.mjs.

Business scenario

The scenario is a DTC subscription brand with three years of weekly observations.

Field	Value
Cadence	Weekly
Rows	156
Outcome	New paid subscriptions
Spend unit	Thousands of dollars
Channels	Paid search, paid social, CTV/video, podcast/audio, influencer
Controls	Trend, seasonality, holiday season, promo flag, price index, competitor pressure

The observed modeling dataset is the subset a real analyst would plausibly have: date, controls, media spend, and observed subscriptions. The truth columns are included for teaching and validation.

Column groups

Group	Columns	Use
Keys	`week`, `date`	Time index
Controls	`trend`, `seasonality`, `holiday_season`, `promo_flag`, `price_index`, `competitor_pressure`	Base demand and business context
Observed media	`*_spend`	Inputs a normal MMM would use
Truth transforms	`*_adstock`, `*_contribution`	Synthetic source-of-truth for lessons
Outcome	`observed_subscriptions`	Noisy observed business outcome
Truth outcome	`base_demand`, `total_media_contribution`, `expected_subscriptions`	Decomposition used for validation

When a lesson wants a realistic modeling exercise, use the observed columns only. When a lesson wants to explain the mechanism, use the truth columns.
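
One way to enforce the observed-only rule is to filter columns by name before any modeling starts. The sketch below uses only the Python standard library and an in-memory stand-in for the CSV. The `paid_search` prefix, the sample date, and the sample values are hypothetical; only the column groups come from the table above.

```python
import csv
import io

# Column groups taken from the table above.
KEY_COLUMNS = ["week", "date"]
CONTROL_COLUMNS = [
    "trend", "seasonality", "holiday_season",
    "promo_flag", "price_index", "competitor_pressure",
]
OUTCOME_COLUMN = "observed_subscriptions"


def observed_columns(fieldnames):
    """Keep keys, controls, *_spend, and the observed outcome; drop truth columns."""
    keep = set(KEY_COLUMNS) | set(CONTROL_COLUMNS) | {OUTCOME_COLUMN}
    return [name for name in fieldnames if name in keep or name.endswith("_spend")]


# Hypothetical one-row sample; the real file has more columns and 156 rows.
sample = io.StringIO(
    "week,date,trend,paid_search_spend,paid_search_adstock,"
    "paid_search_contribution,base_demand,observed_subscriptions\n"
    "1,2023-01-02,0.0,40.0,40.0,55.3,900.0,958\n"
)
reader = csv.DictReader(sample)
cols = observed_columns(reader.fieldnames)
rows = [{name: row[name] for name in cols} for row in reader]
print(cols)
```

Filtering by suffix rather than by a hard-coded channel list means the same helper keeps working if the channel prefixes differ from the ones assumed here.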

Source-of-truth parameters

Channel	Lambda (λ)	Half-life (weeks)	Beta (β)	Half-saturation (K)	Slope (α)
Paid Search	0.18	0.40	310	135	1.25
Paid Social	0.42	0.80	250	165	1.35
CTV / Video	0.72	2.11	420	360	1.80
Podcast / Audio	0.62	1.45	210	175	1.55
Influencer	0.35	0.66	160	105	1.30
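
The half-life column appears to follow directly from lambda: for geometric decay, the effect falls to one half after ln(0.5) / ln(λ) periods. The sketch below recomputes it from the table's lambda values, assuming the period is one week because the data is weekly.

```python
import math

# Decay rates (lambda) copied from the parameter table above.
LAMBDA = {
    "Paid Search": 0.18,
    "Paid Social": 0.42,
    "CTV / Video": 0.72,
    "Podcast / Audio": 0.62,
    "Influencer": 0.35,
}


def half_life(lam):
    """Periods until a geometric decay lam**h has fallen to one half."""
    return math.log(0.5) / math.log(lam)


for channel, lam in LAMBDA.items():
    print(f"{channel}: {half_life(lam):.2f} weeks")
```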

The media transform is recursive geometric adstock, where M_{c, t} is spend for channel c in week t and λ_{c} is that channel's decay rate:

A_{c, t} = M_{c, t} + λ_{c} A_{c, t - 1}
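
The recursion above can be sketched in a few lines of Python. The starting carryover of zero before the first week is an assumption, since the page does not state the initial condition, and the spend series is a made-up single burst.

```python
def geometric_adstock(spend, lam, initial=0.0):
    """Apply A_t = M_t + lam * A_{t-1} across a spend series.

    `initial` is the carryover before the first week (assumed 0.0 here).
    """
    adstocked = []
    carry = initial
    for m in spend:
        carry = m + lam * carry
        adstocked.append(carry)
    return adstocked


# Hypothetical spend: one burst of 100 (thousand dollars), then nothing.
spend = [100.0, 0.0, 0.0, 0.0]
ctv = geometric_adstock(spend, lam=0.72)  # CTV / Video lambda from the table
print([round(a, 2) for a in ctv])  # roughly [100.0, 72.0, 51.84, 37.32]
```

With spend at zero after week one, the adstocked series is just the burst decaying by a factor of λ each week, which is why effects persist after spend stops.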

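The saturated contribution is generated with a Hill function, where β_{c} is the maximum contribution (Beta), K_{c} is the half-saturation point, and α_{c} is the slope:

\text{contribution}_{c,t} = \frac{β_{c} A_{c,t}^{α_{c}}}{K_{c}^{α_{c}} + A_{c,t}^{α_{c}}}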
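
A minimal sketch of the Hill function follows, using the Paid Search row from the parameter table. It assumes the table's Beta, Half-saturation, and Slope columns correspond to β, K, and α in the formula. One useful sanity check: when adstock equals the half-saturation point, the contribution is half of beta.

```python
def hill_contribution(adstock, beta, half_saturation, slope):
    """beta * A^slope / (K^slope + A^slope) for a single adstock value."""
    a = adstock ** slope
    return beta * a / (half_saturation ** slope + a)


# Paid Search parameters from the table above.
PAID_SEARCH = {"beta": 310.0, "half_saturation": 135.0, "slope": 1.25}

at_half_saturation = hill_contribution(135.0, **PAID_SEARCH)
print(round(at_half_saturation, 2))  # about 155.0, half of beta
```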

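The observed outcome adds base demand (driven by the controls) and noise ε_{t} to the summed media contributions, then rounds to a whole number of subscriptions:

\text{observed subscriptions}_{t} = \text{round}\left(\text{base demand}_{t} + \sum_{c} \text{contribution}_{c,t} + ε_{t}\right)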
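
The composition above can be sketched for a single week as follows. The page does not specify the noise distribution, so Gaussian noise is an assumption, and the seed, noise scale, base demand, and contribution values are all hypothetical.

```python
import random


def observed_subscriptions(base_demand, contributions, noise_sd, rng):
    """round(base demand + sum of channel contributions + noise) for one week.

    Gaussian noise is an assumption. Python's round() also breaks exact .5
    ties toward the even integer, which may differ from the generator's rule.
    """
    noise = rng.gauss(0.0, noise_sd)
    return round(base_demand + sum(contributions) + noise)


rng = random.Random(0)  # illustrative seed, not the generator's seed
week = observed_subscriptions(
    base_demand=900.0,                  # hypothetical
    contributions=[120.5, 80.2, 60.0],  # hypothetical channel contributions
    noise_sd=25.0,                      # hypothetical noise scale
    rng=rng,
)
print(week)
```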

How modules should use it

Adstock as Memory uses `*_spend`, `*_adstock`, and `*_contribution` to show why effects persist after spend stops.
Saturation as Diminishing Attention can use the same adstocked media and Hill parameters to show why the next dollar does not behave like the first.
Validation modules can fit models using observed columns only, then compare estimated adstock and contribution against the truth columns.
Decision Systems modules can use the known response curves for budget scenarios and marginal return exercises.
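
For the marginal return exercises, one option is to differentiate the Hill function with respect to adstocked media, which gives β α K^α A^(α-1) / (K^α + A^α)^2. The sketch below evaluates that derivative for the CTV / Video row of the parameter table, again assuming Beta, Half-saturation, and Slope map to β, K, and α. It measures return per unit of adstock in a single week, not per dollar of raw spend.

```python
def hill(adstock, beta, k, alpha):
    """Hill contribution for a single adstock value."""
    a = adstock ** alpha
    return beta * a / (k ** alpha + a)


def marginal_return(adstock, beta, k, alpha):
    """Derivative of the Hill contribution with respect to adstock."""
    ka = k ** alpha
    return beta * alpha * ka * adstock ** (alpha - 1) / (ka + adstock ** alpha) ** 2


# CTV / Video parameters from the table above.
CTV = {"beta": 420.0, "k": 360.0, "alpha": 1.80}

for level in (100.0, 360.0, 800.0):
    print(level, round(hill(level, **CTV), 1), round(marginal_return(level, **CTV), 3))
```

At the half-saturation point the derivative simplifies to β α / (4K), a quick check on any implementation.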

Reproducibility

The generator is deterministic with seed 20260520. Running it again writes the same CSV and parameter JSON:

node scripts/generate-mmm-dataset.mjs

The generator has no external package dependencies. It uses plain Node so the dataset can remain part of the static site workflow.
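
One way to confirm the determinism claim is to hash the output files before and after a rerun and compare the results. The helper below is a sketch, not part of the repo. It assumes the outputs land in public/data/mmm relative to the repo root, as described under Downloads, and it does nothing if that directory is absent.

```python
import hashlib
from pathlib import Path


def directory_digest(directory):
    """Map each file name in `directory` to the SHA-256 of its bytes."""
    return {
        path.name: hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(Path(directory).iterdir())
        if path.is_file()
    }


# Assumed output location, relative to the repo root.
data_dir = Path("public/data/mmm")
if data_dir.is_dir():
    for name, digest in directory_digest(data_dir).items():
        print(name, digest)
```

Capture the digests, rerun the generator, and capture them again. Identical dictionaries mean the files are byte-for-byte the same.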