MMM Dataset
A shared synthetic DTC subscription dataset with known ground truth for MMM lessons on adstock, saturation, validation, and budget decisions.
Why it exists
The MMM Dataset is a shared lab dataset for this MMM curriculum. It is synthetic on purpose: the data-generating process is known, so each lesson can compare a model, chart, or business claim against the source of truth.
Real marketing data almost never gives you that luxury. You usually observe spend, controls, and outcomes, then infer carryover, diminishing returns, and contribution under uncertainty. A transparent synthetic dataset makes the mechanics inspectable before we move into messier validation work.
Downloads
The public files are served from /public/data/mmm/ in the repo and from /data/mmm/ on the website. The generator lives at scripts/generate-mmm-dataset.mjs.
Business scenario
The scenario is a DTC subscription brand with three years of weekly observations.
| Field | Value |
|---|---|
| Cadence | Weekly |
| Rows | 156 |
| Outcome | New paid subscriptions |
| Spend unit | Thousands of dollars |
| Channels | Paid search, paid social, CTV/video, podcast/audio, influencer |
| Controls | Trend, seasonality, holiday season, promo flag, price index, competitor pressure |
The observed modeling dataset is the subset a real analyst would plausibly have: date, controls, media spend, and observed subscriptions. The truth columns are included for teaching and validation.
Column groups
| Group | Columns | Use |
|---|---|---|
| Keys | week, date | Time index |
| Controls | trend, seasonality, holiday_season, promo_flag, price_index, competitor_pressure | Base demand and business context |
| Observed media | *_spend | Inputs a normal MMM would use |
| Truth transforms | *_adstock, *_contribution | Synthetic source-of-truth for lessons |
| Outcome | observed_subscriptions | Noisy observed business outcome |
| Truth outcome | base_demand, total_media_contribution, expected_subscriptions | Decomposition used for validation |
When a lesson wants a realistic modeling exercise, use the observed columns only. When a lesson wants to explain the mechanism, use the truth columns.
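The observed/truth split can be sketched as a small column filter. This is a minimal sketch in plain JavaScript, assuming the rows have already been parsed into objects keyed by column name. The `paid_search` prefix and the numbers in `sample` are illustrative assumptions, not values from the dataset.

```javascript
// Truth outcome columns, as listed in the column-group table.
const TRUTH_OUTCOME = ["base_demand", "total_media_contribution", "expected_subscriptions"];

// A column is "truth" if it is a truth transform (*_adstock, *_contribution)
// or one of the truth outcome columns.
function isTruthColumn(name) {
  return (
    name.endsWith("_adstock") ||
    name.endsWith("_contribution") ||
    TRUTH_OUTCOME.includes(name)
  );
}

// Keep only the columns a real analyst would plausibly have.
function observedOnly(rows) {
  return rows.map((row) =>
    Object.fromEntries(Object.entries(row).filter(([name]) => !isTruthColumn(name)))
  );
}

// Hypothetical single row; the prefix and all values are made up for illustration.
const sample = [
  {
    week: 1,
    paid_search_spend: 42,
    paid_search_adstock: 42,
    paid_search_contribution: 58,
    base_demand: 900,
    observed_subscriptions: 961,
  },
];

console.log(Object.keys(observedOnly(sample)[0]));
// [ 'week', 'paid_search_spend', 'observed_subscriptions' ]
```

Filtering by suffix rather than by a hard-coded channel list keeps the sketch independent of the exact channel prefixes used in the CSV.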
Source-of-truth parameters
| Channel | Lambda | Half-life (weeks) | Beta | Half-saturation | Slope |
|---|---|---|---|---|---|
| Paid Search | 0.18 | 0.40 | 310 | 135 | 1.25 |
| Paid Social | 0.42 | 0.80 | 250 | 165 | 1.35 |
| CTV / Video | 0.72 | 2.11 | 420 | 360 | 1.80 |
| Podcast / Audio | 0.62 | 1.45 | 210 | 175 | 1.55 |
| Influencer | 0.35 | 0.66 | 160 | 105 | 1.30 |
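The Half-life column is consistent with geometric decay, where the half-life in weeks is ln(0.5) / ln(Lambda). A quick check in plain JavaScript, with the Lambda values copied from the table (the function name `halfLifeWeeks` is an illustrative choice):

```javascript
// Weeks until a geometrically decaying effect falls to half its starting size.
function halfLifeWeeks(lambda) {
  return Math.log(0.5) / Math.log(lambda);
}

// Lambda values from the source-of-truth table.
const lambdas = {
  "Paid Search": 0.18,
  "Paid Social": 0.42,
  "CTV / Video": 0.72,
  "Podcast / Audio": 0.62,
  "Influencer": 0.35,
};

for (const [channel, lambda] of Object.entries(lambdas)) {
  console.log(channel, halfLifeWeeks(lambda).toFixed(2));
}
// Prints 0.40, 0.80, 2.11, 1.45, 0.66, matching the Half-life column.
```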
The media transform is recursive geometric adstock:
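In its textbook unnormalized form, which this sketch assumes, each week's adstock is that week's spend plus Lambda times the previous week's adstock, starting from zero carryover. The generator may scale or initialize differently, so treat this as an illustration of the recursion rather than a copy of the generator's code.

```javascript
// Recursive geometric adstock: a[t] = x[t] + lambda * a[t - 1], with a[-1] = 0.
// Assumption: unnormalized form; the real generator may differ in scaling.
function geometricAdstock(spend, lambda) {
  const out = [];
  let carry = 0;
  for (const x of spend) {
    carry = x + lambda * carry;
    out.push(carry);
  }
  return out;
}

// One burst of 100 (thousand dollars) followed by three dark weeks,
// using the CTV / Video Lambda from the table.
const ctv = geometricAdstock([100, 0, 0, 0], 0.72);
console.log(ctv.map((v) => v.toFixed(1)));
// [ '100.0', '72.0', '51.8', '37.3' ]
```

The effect is still above half of its starting level after two weeks and below it after three, which lines up with the 2.11-week half-life listed for CTV / Video.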
The saturated contribution is generated with a Hill function:
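A common Hill parameterization, assumed here, is contribution = Beta * a^Slope / (a^Slope + HalfSaturation^Slope), where a is the adstocked spend. Under that form Beta is the ceiling on weekly contribution and the contribution equals Beta / 2 when adstock equals the half-saturation point. Whether the generator applies exactly this form is an assumption; the parameter values below are from the table.

```javascript
// Hill saturation applied to adstocked spend.
// Assumed form: beta * a^s / (a^s + k^s), with k the half-saturation point.
function hillContribution(adstock, { beta, halfSaturation, slope }) {
  if (adstock <= 0) return 0;
  const a = Math.pow(adstock, slope);
  const k = Math.pow(halfSaturation, slope);
  return (beta * a) / (a + k);
}

// Paid Search parameters from the source-of-truth table.
const paidSearch = { beta: 310, halfSaturation: 135, slope: 1.25 };

console.log(hillContribution(135, paidSearch)); // about 155, half of Beta
console.log(hillContribution(1e6, paidSearch)); // approaches Beta = 310
```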
The observed outcome adds controls and noise:
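Read against the Truth outcome columns, this suggests that expected_subscriptions is base_demand plus the summed channel contributions, and that observed_subscriptions is that expectation plus noise. The additive reading is an assumption, as are the channel prefixes and the sample numbers below. A minimal sketch for recovering the noise term from one row:

```javascript
// Split one row into media total, expected outcome, and residual noise.
// Assumes expected = base_demand + sum of *_contribution columns.
function decomposeRow(row, channels) {
  const media = channels.reduce((sum, c) => sum + row[`${c}_contribution`], 0);
  const expected = row.base_demand + media;
  return { media, expected, noise: row.observed_subscriptions - expected };
}

// Hypothetical row; prefixes and values are made up for illustration.
const row = {
  base_demand: 900,
  paid_search_contribution: 60,
  paid_social_contribution: 40,
  observed_subscriptions: 1012,
};

console.log(decomposeRow(row, ["paid_search", "paid_social"]));
// { media: 100, expected: 1000, noise: 12 }
```

On the real file, comparing `media` and `expected` with the total_media_contribution and expected_subscriptions columns is a quick way to test whether the additive assumption holds.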
How modules should use it
- Adstock as Memory uses `*_spend`, `*_adstock`, and `*_contribution` to show why effects persist after spend stops.
- Saturation as Diminishing Attention can use the same adstocked media and Hill parameters to show why the next dollar does not behave like the first.
- Validation modules can fit models using observed columns only, then compare estimated adstock and contribution against the truth columns.
- Decision Systems modules can use the known response curves for budget scenarios and marginal return exercises.
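For the marginal return exercises, the slope of a known response curve can be computed directly. The sketch below assumes the common Hill form Beta * a^Slope / (a^Slope + HalfSaturation^Slope) and differentiates it with respect to adstock a, not raw spend; the function name `hillMarginal` is illustrative, and the parameters are the Paid Search values from the table.

```javascript
// Derivative of beta * a^s / (a^s + k^s) with respect to adstock a:
//   beta * s * k^s * a^(s - 1) / (a^s + k^s)^2
// Returning 0 at a <= 0 is valid for slope > 1, which holds for every channel in the table.
function hillMarginal(adstock, { beta, halfSaturation, slope }) {
  if (adstock <= 0) return 0;
  const as = Math.pow(adstock, slope);
  const ks = Math.pow(halfSaturation, slope);
  return (beta * slope * ks * Math.pow(adstock, slope - 1)) / Math.pow(as + ks, 2);
}

// Paid Search parameters from the source-of-truth table.
const paidSearchCurve = { beta: 310, halfSaturation: 135, slope: 1.25 };

// Marginal subscriptions per extra unit of adstock at half, one, and two
// times the half-saturation point.
for (const a of [67.5, 135, 270]) {
  console.log(a, hillMarginal(a, paidSearchCurve).toFixed(3));
}
```

Across these three points the marginal return falls as adstock rises. Because every slope in the table exceeds 1, the curves are S-shaped, so the marginal return first rises at very low adstock before it starts to decline.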
Reproducibility
The generator is deterministic with seed 20260520. Running it again writes the same CSV and parameter JSON:
```sh
node scripts/generate-mmm-dataset.mjs
```

The generator has no external package dependency. It uses plain Node so the dataset can remain part of the static site workflow.
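Dependency-free determinism of this kind usually comes from a small seeded pseudo-random generator, since `Math.random()` cannot be seeded. The sketch below uses mulberry32, a widely circulated 32-bit PRNG, purely as a stand-in; which PRNG the generator script actually uses is not stated here, so this illustrates the idea rather than the script's internals.

```javascript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
// Stand-in for illustration; not necessarily what the generator uses.
function mulberry32(seed) {
  let a = seed | 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Seed documented for the dataset.
const SEED = 20260520;

// Two independent runs from the same seed produce identical draws.
const runA = Array.from({ length: 3 }, mulberry32(SEED));
const runB = Array.from({ length: 3 }, mulberry32(SEED));
console.log(JSON.stringify(runA) === JSON.stringify(runB)); // true
```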