Synthetic Audiences in Consumer Research

01 — The 1000:1 Question

The simulator runs a thousand miles for every one on the road. Research runs none.

Simulated miles (Waymo, cumulative)

~20 billion

Real road miles (Waymo, cumulative)

~20 million

1,000 : 1

The ratio that research has not built

Central Argument

Synthetic audiences are a simulator layer, not a sample replacement. The simulator compresses learning loops, exposes edge cases, and lets teams burn through hypotheses cheaply. It does not certify the vehicle.

Waymo's autonomous vehicles have driven roughly twenty million miles on public roads. The same vehicles have driven over twenty billion miles in simulation. The ratio is close to a thousand to one.

Engineers do not call this cheating. They call it how the system gets safer. Real roads are slow, expensive, and dangerous. Simulators are fast, cheap, and stress the system in ways the real world rarely will.

Consumer and market research operates at a very different ratio. Most studies still run at one-to-one. Real respondents, real fieldwork, real timelines. The idea of a simulator layer, where ideas, questions, and concepts get stress-tested at scale before any human gets a screener, barely exists in practice.

Where it does exist, it is mostly being sold as a replacement for the real thing rather than a complement to it. That framing is the problem. And it is the reason synthetic audiences are arriving in the industry with both more hype and more scepticism than the evidence supports.

The hard question is not whether synthetic respondents can be made. They can. The hard question is whether the inferences drawn from them are valid for the decision being made. That is a question about epistemology, not about model performance.

02 — Eighty Years of History

Synthetic data did not start with GPT. It started with a man playing solitaire.

The year was 1946. Stanislaw Ulam, a Polish mathematician convalescing from a near-fatal bout of viral encephalitis, was sitting up in bed playing Canfield solitaire. He wanted to know the odds of a winning deal. The combinatorics defeated him. So he tried something else: lay out a hundred hands, count the wins, take the ratio.

1946

Monte Carlo Method

Ulam and von Neumann at Los Alamos. Random sampling as substitute for analytic computation. Named after the casino where Ulam's uncle gambled.

1971

Schelling Segregation Model

Agent-based modelling as a discipline begins. Agents on a grid with mild preferences produce sharply segregated outcomes.

1985

Liew, Choi & Liew

Earliest published reference to fabricated population data for statistical privacy protection.

1993

Rubin & Little: Multiple Imputation Extended

Statistical Disclosure Limitation. Rather than imputing missing values, why not impute the entire population? The full synthesis framework is born.

1996

Epstein & Axtell: Growing Artificial Societies

Generative social science formalised. Simulate populations of agents with simple rules. Watch what emerges.

2000s

US Census Synthetic Public-Use Files

The Bureau releases OnTheMap and SIPP Synthetic Beta. Production-grade synthetic data carries real policy weight.

2007

RTI Synthetic Population of the US

Public health researchers run synthetic populations of entire cities. Influenza spread, vaccination strategy, transportation flow. Not toy systems.

2018

GAN Explosion

Generative Adversarial Networks make synthetic image generation credible. The unit cost of synthetic data drops by orders of magnitude.

2020

GPT-3 and Conversational Synthetic Data

LLM-driven output becomes conversational rather than tabular. The output feels like a person. The feel is misleading. The underlying logic is still simulation.

2022

Argyle: Algorithmic Fidelity

First peer-reviewed claim that LLMs can simulate sub-populations in aggregate. The practitioner conversation shifts permanently.

2023

ChatGPT Moment in Research

Synthetic respondent vendors proliferate. The rigorous case and the sceptical counter-case race to catch up with commercial adoption.

2026

SYN-DIGITS (Columbia)

Calibration as a discipline. Latent factor alignment from causal inference. 50% improvement in individual correlation. The frontier opens.

Key Shift

Generative AI did not invent synthetic data. It changed two things: the unit cost of generating a synthetic respondent dropped by orders of magnitude, and the resolution of the output became conversational rather than tabular.

The Monte Carlo method is the foundational logic of all synthetic data work that followed. Substitute random sampling for analytic computation. Generate populations of imagined cases. Observe what emerges. The method was secret for years. By the 1950s it was widely published.
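The substitution described above can be sketched in a few lines. This is a minimal illustration, not Ulam's Canfield calculation: it swaps in a simpler card question (the chance that at least one card lands back in its original position after a shuffle) because that question has a known analytic answer, about 0.632, to check the simulation against. The function name and the choice of question are assumptions made for this sketch.

```python
import random


def estimate_fixed_point_probability(trials: int, deck_size: int = 52, seed: int = 0) -> float:
    """Estimate P(at least one card ends in its original position after a shuffle)."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    hits = 0
    for _ in range(trials):
        deck = list(range(deck_size))
        rng.shuffle(deck)
        # A "win" here is any card sitting at the index it started from.
        if any(card == position for position, card in enumerate(deck)):
            hits += 1
    # The estimate is simply wins divided by deals, as in Ulam's solitaire count.
    return hits / trials


if __name__ == "__main__":
    estimate = estimate_fixed_point_probability(trials=20_000)
    print(f"simulated estimate: {estimate:.3f} (analytic value is about 0.632)")
```

More deals narrow the error, which is the whole trade: computation is spent in place of combinatorics.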

An LLM-driven synthetic respondent can now produce open-ended text that sounds like a transcript. An agent-based model in 2005 could produce a row of numbers. That shift is real, and it is also exactly where the trouble starts. The new resolution makes the output feel like a person rather than a statistical artefact.

The same questions that applied to a Monte Carlo run still apply to a transcript generated by an LLM. What is the simulator certified for? What is it not? How will the inference be validated? Those questions are the rest of this piece.

03 — Simulation Parallels

Other industries built qualified simulator layers. Research has not.

Three industries figured out how to use simulation alongside real-world testing long before research caught up. Each one offers a lesson that does not get talked about enough in market research conversations about synthetic respondents.

✈

Aviation

Level D Qualified — 14 CFR Part 60

Framework maturity: 95%

🚗

Autonomous Vehicles

1,000:1 ratio — Waymo Carcraft

Framework maturity: 78%

💊

Pharma In-Silico

ASME V&V 40 — ENRICHMENT

Framework maturity: 88%

📊

Consumer Research

No framework — No benchmark

Framework maturity: 12%

The Core Lesson

Qualification is use-case-bounded, not globally accurate. A Level D simulator for the Boeing 777 is qualified to train 777 pilots on 777 procedures. It is not qualified for anything else. Research lacks this vocabulary entirely.

The important word in aviation is qualified. Not certified. Not accurate. A Level D flight simulator has over a thousand individual test requirements. A pilot can complete an entire type-rating in the box. The simulator's authority is bounded by the conditions under which it was tested.

Most current synthetic audience products make global accuracy claims. Vendors quote single numbers: 85% match to real data, 92% thematic overlap. None of these claims specifies the envelope of qualification. None of them would survive a Part 60-style review.

The Waymo simulation is not used to replace road testing. It is used to do three things road testing cannot do well. First, scale: real roads cannot be driven a thousand times faster; simulators can. Second, edge-case generation: the interesting failures are rare on the road, but in simulation they can be triggered intentionally, repeated, and varied. Third, regression: every new software update gets driven through the entire library of past scenarios before it goes anywhere near a public road.

The FDA has been accepting computational modelling and simulation as supporting evidence for medical device approvals for years. The ASME V&V 40 standard sets out exactly what evidence the FDA expects to see: Verification. Validation. Applicability analysis. Uncertainty quantification. This is what proper synthetic respondent validation should look like. Not a single accuracy number. A documented credibility framework with named methods for each component of the claim.

Research has not built this for three reasons. First, the cost of failure is lower. Second, the field has been seduced by accuracy claims that do not specify the envelope of qualification. Third, the practitioner community has not yet developed a vocabulary for use-case-bounded validity. That vocabulary has to come from somewhere.

04 — Four Layers

The layer changes what the data is good for.

Synthetic respondents are not one thing. The methodologies differ enough that grouping them under a single label is the source of most confusion in the field. Four distinct layers exist, each with different inputs, different outputs, and different defensible uses.

L1

Prompt-conditioned LLM personas

Best for: hypothesis generation, fast screens — Avoid for: anything requiring variance

What it is

A demographic and psychographic profile is described in a prompt. A general-purpose LLM is asked to respond as if it were that person. Fast, plausible, essentially a stylised average of how the model has seen that kind of person described in its training data.

Tools & limits

Tools: ChatGPT, Claude, raw LLMs
Best for: hypothesis generation, fast screens, stimulus review
Avoid for: anything requiring variance or individual prediction
Produces: stylised averages, not distributions

L2

Structured persona simulation

Best for: concept testing, qualitative pre-fieldwork — Avoid for: decision-grade numbers

What it is

Personas become structured objects with stable attributes, often anchored to OCEAN (Big Five personality model), with explicit memory and continuity across multiple questions. A persona's answer to question seventeen is shaped by their answer to question two. Subgroup variance becomes recoverable.

Tools & limits

Tools: TinyTroupe, ASPIRE, Polypersona, Synthetic Users
Best for: concept testing, qualitative pre-fieldwork, theme generation
Avoid for: decision-grade numbers or individual-level prediction
Requires: thoughtful persona construction upfront
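What "structured object with memory" means in practice can be sketched as a small data class. This is a hypothetical illustration, not the API of TinyTroupe or any other tool named above: the `Persona` class, its fields, and the 0 to 1 trait scale are all assumptions. The point it shows is that earlier answers are carried into every later prompt, which is what lets question seventeen depend on question two.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """Hypothetical L2 persona: stable traits plus a running answer history."""
    name: str
    age: int
    ocean: dict[str, float]  # Big Five scores, assumed here to sit on a 0-1 scale
    history: list[tuple[str, str]] = field(default_factory=list)

    def record(self, question: str, answer: str) -> None:
        """Store a question/answer pair so later prompts can refer back to it."""
        self.history.append((question, answer))

    def build_prompt(self, question: str) -> str:
        """Render traits and all prior answers ahead of the new question."""
        traits = ", ".join(f"{k}={v:.1f}" for k, v in sorted(self.ocean.items()))
        lines = [
            f"You are {self.name}, aged {self.age}.",
            f"Personality (OCEAN): {traits}.",
        ]
        for asked, answered in self.history:
            lines.append(f'Earlier you were asked "{asked}" and answered "{answered}".')
        lines.append(f"Now answer in character: {question}")
        return "\n".join(lines)


if __name__ == "__main__":
    shopper = Persona(name="Asha", age=34, ocean={"openness": 0.7, "conscientiousness": 0.4})
    shopper.record("Do you cook at home?", "Most weekdays.")
    print(shopper.build_prompt("Would you pay for a meal kit?"))
```

The prompt text would then be sent to whichever model the team uses; that call is deliberately left out because it depends on the vendor.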

L3

Data-anchored synthetic panels

Best for: population-level patterns — Avoid for: individual-level prediction

What it is

Personas are constructed by conditioning the LLM on rich real-world data about specific individuals. Toubia's Twin-2K-500 dataset and Park et al.'s 1,000-person interview-grounded twins are the cleanest published examples. Captures central tendencies well, individuals poorly.

Tools & limits

Approach: twin construction from real panels
Best for: population-level patterns, segment forecasting
Avoid for: individual-level prediction (Toubia: r ≈ 0.2)
Requires: real anchor data to construct twins

L4

Calibrated simulation frameworks

Best for: prediction within calibrated envelope — Requires: real anchor data

What it is

A calibration layer sits on top of any of the above and adjusts the LLM output to align with human ground truth. SYN-DIGITS (Columbia, April 2026) uses latent factor alignment from causal inference. Reported 50% improvement in individual-level correlation, 50-90% reduction in distributional discrepancy.

Tools & limits

Methods: SYN-DIGITS, isotonic regression, Platt scaling
Best for: prediction within a calibrated envelope
Requires: real anchor data for calibration
Warning: calibration only generalises within anchor conditions
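Of the methods listed, isotonic regression is the simplest to show end to end. The sketch below is a plain pool-adjacent-violators implementation in standard-library Python, offered as an illustration of the baseline rather than of SYN-DIGITS, whose method is not reproduced here. The function names and the toy anchor data are assumptions. It fits a non-decreasing step function from synthetic scores to the human scores observed on a small anchor sample.

```python
from bisect import bisect_left


def fit_isotonic(synthetic: list[float], human: list[float]) -> tuple[list[float], list[float]]:
    """Pool-adjacent-violators fit. Returns (right_edges, fitted_values)."""
    pairs = sorted(zip(synthetic, human))
    # Each block is [sum of human scores, count, largest synthetic score in block].
    blocks: list[list[float]] = []
    for x, y in pairs:
        blocks.append([y, 1, x])
        # Merge backwards while a block's mean exceeds the mean of the block after it.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, right_edge = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = right_edge
    right_edges = [b[2] for b in blocks]
    fitted_values = [b[0] / b[1] for b in blocks]
    return right_edges, fitted_values


def calibrate(score: float, right_edges: list[float], fitted_values: list[float]) -> float:
    """Map a new synthetic score onto the human scale using the fitted steps."""
    # Scores beyond the anchor range are clamped to the last step. That value is
    # an extrapolation outside the calibrated envelope and should not be trusted.
    index = min(bisect_left(right_edges, score), len(fitted_values) - 1)
    return fitted_values[index]


if __name__ == "__main__":
    edges, values = fit_isotonic([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 3.0, 2.0, 4.0, 5.0])
    print(edges, values)
    print(calibrate(2.5, edges, values))
```

The clamping in `calibrate` is where the warning above becomes concrete: the function still returns a number for a score the anchor never covered, and nothing in the output flags it.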

Honest Mapping

L1 is useful for fast hypothesis generation. L2 is useful for exploration where the deliverable is themes. L3 is useful for understanding population-level patterns. L4 is useful for prediction within a defined envelope, once an anchor exists. Selling L1 as L4 is overclaiming. Using L4 where L1 would do is overspending.

05 — Human vs. Machine Cognition

Humans do not think the way language models simulate thinking.

This is the hardest section to write honestly because both sides of the conversation are overconfident. The synthetic respondent industry argues that synthetic respondents get past the narrative people give about themselves because they can be modelled at the behavioural level. That argument is correct about the say-do gap. It is incorrect about what synthetic respondents actually have access to.

Human Respondent only

Embodied experience
Emotional nuance
Cultural ground truth
Social dynamics
Longitudinal memory
Genuine ambivalence

Overlap

Surface language
Thematic responses
General reactions
Topic knowledge

LLM Simulation only

Training text patterns
Pattern recognition
Language generation
Infinite scalability
Zero fatigue

The Actual Gap

Current synthetic respondents do not have behavioural access. They have textual access to descriptions of behaviour. This is not the same thing. The industry's argument is that interviews capture narrative rather than navigation. That is true. But the model has access only to descriptions of navigation, which are themselves narrative. The gap has been moved one layer. It has not been closed.

Lived Experience

The model has not been hungry. It has not been the only one in a meeting whose accent is mocked. It has not had a parent die. It can produce plausible text. The text will not be what someone with that experience actually feels.

Cultural Ground Truth

Published fidelity research is overwhelmingly conducted on US and Western European populations. A vendor citing 90% parity on US data makes no claim at all about the same product's performance on a Saudi, Tier 2 Indian, or Indonesian sample.

Group Dynamics

Single-agent synthetic respondents cannot produce conformity effects, social proof, contagion, or polarisation. Multi-agent simulations are beginning to simulate these dynamics, but their validation against real group behaviour is still at a very early stage in the literature.

Binz and Schulz, in Nature Computational Science, ran classical cognitive psychology tasks on LLMs and human participants. They found that earlier LLMs showed many of the same reasoning errors humans do. More recent models are not converging on human-like cognition. They are diverging from it.

A 2025 paper analysing 170,000 reasoning traces from 17 models against 54 human think-aloud protocols found that humans use hierarchical nesting and meta-cognitive monitoring. Models use shallow forward chaining. The divergence is most pronounced on ill-structured problems, which is exactly the kind of problem most consumer decisions are.

The behavioural modelling framing the synthetic respondent industry has adopted is partially correct and substantially misleading. Synthetic respondents are excellent for the textual layer of behaviour. They are unreliable for the layers underneath it.

06 — State of the Field

The vendor landscape is moving faster than the methodological literature can validate it.

A working map matters because the labels in marketing material rarely correspond to what is happening underneath. The field divides clearly into three categories: academic researchers doing the rigorous work that practitioner conversations should be anchored to, an open-source stack for teams that want to build, and commercial vendors with validation claims that vary enormously.

Academic Core

Toubia et al. (Columbia): Twin-2K-500 dataset, r≈0.2 individual correlation finding.

Argyle et al. (BYU): Algorithmic fidelity 2022, foundational aggregate claim.

Bisbee et al. (Vanderbilt): The rigorous counter-case. Distributional fidelity breaks down.

Park et al. (Stanford / DeepMind): 1,000-person twin study. 85% GSS test-retest. Most optimistic finding.

Sarstedt et al. (LMU Munich): Closest thing to a usable guideline for marketing research specifically.

Open Source Stack

TinyTroupe (Microsoft): Most production-ready persona simulation library. Python-based.

PersonaHub (Tencent AI Lab): Billion-persona seed library curated from web text.

Polypersona: LoRA-adapted compact-model approach, runs locally.

ASPIRE: Implementable from ACM paper specification.

SYN-DIGITS (Columbia, 2026): No public repo yet, but calibration logic is implementable.

Commercial Vendors

Synthetic Users: UX/product focus. 85-92% thematic parity claim. Most transparent vendor.

Lakmoos: Market research focus. Maturity and validation claims vary.

Artificial Societies: Multi-agent ecosystem. Social dynamics, opinion cascades. Methodology is youngest.

What is missing. No open benchmark exists for non-US markets, B2B research, sensitive categories, or longitudinal validation. No credibility framework equivalent to the ASME V&V 40 standard exists either. Until one does, every commercial claim is essentially marketing. These gaps are not permanent. They are work that has not been done yet.

Vendor Evaluation Principle

Ask which layer the product actually sits in. Ask what calibration data the vendor uses. Ask for the validation methodology in writing. The serious vendors will answer these questions cleanly. The others will redirect to their accuracy headline.

07 — The Framework

Signal type and decision risk decide where synthetic fits.

Every other axis matters less than these two. Where the research question lands on signal type and decision risk determines whether a synthetic respondent is appropriate, where in the design it sits, and how heavily it needs to be calibrated against real humans.

Quadrant	Decision risk	Signal type	Recommended approach
Q1	Low	Directional	Synthetic-led
Q2	High	Directional	Synthetic-led with human gate required
Q3	Low	Parametric	Synthetic-led with calibration
Q4	High	Parametric	Humans primary, synthetic for augmentation only

Q1 — Synthetic-led

Low risk, directional. The sweet spot.

How to Use It

The framework is not a way to label a project once. It is a way to interrogate a project's components. A typical commercial research design contains questions from multiple quadrants. The hypothesis generation phase is Q1. The concept refinement is Q2. The internal forecasting is Q3. The launch sizing is Q4. Treat them as separate sub-projects with separate methodological burdens.
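The decomposition described above can be written down as a lookup. This is a minimal sketch: the quadrant labels and recommended approaches are taken from the framework, while the enum names, the `classify` function, and the `PROJECT` dictionary are assumptions introduced for illustration.

```python
from enum import Enum


class Signal(Enum):
    DIRECTIONAL = "directional"  # finding, theme, ranking
    PARAMETRIC = "parametric"    # number, share, forecast


class Risk(Enum):
    LOW = "low"    # reversible, internal
    HIGH = "high"  # externally consequential, hard to reverse


# Quadrant label and recommended approach, keyed by (risk, signal).
QUADRANTS = {
    (Risk.LOW, Signal.DIRECTIONAL): ("Q1", "Synthetic-led"),
    (Risk.HIGH, Signal.DIRECTIONAL): ("Q2", "Synthetic-led with human gate required"),
    (Risk.LOW, Signal.PARAMETRIC): ("Q3", "Synthetic-led with calibration"),
    (Risk.HIGH, Signal.PARAMETRIC): ("Q4", "Humans primary, synthetic for augmentation only"),
}


def classify(risk: Risk, signal: Signal) -> tuple[str, str]:
    """Return (quadrant, recommended approach) for one research question."""
    return QUADRANTS[(risk, signal)]


# One project, four sub-questions, each classified on its own.
PROJECT = {
    "hypothesis generation": (Risk.LOW, Signal.DIRECTIONAL),
    "concept refinement": (Risk.HIGH, Signal.DIRECTIONAL),
    "internal forecasting": (Risk.LOW, Signal.PARAMETRIC),
    "launch sizing": (Risk.HIGH, Signal.PARAMETRIC),
}

if __name__ == "__main__":
    for phase, (risk, signal) in PROJECT.items():
        quadrant, approach = classify(risk, signal)
        print(f"{phase}: {quadrant} ({approach})")
```

Keeping the classification per question rather than per project is the design choice that matters: the same study ends up with four different methodological burdens.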

08 — Six Use Cases

Six use cases, placed on the quadrant.

The framework only earns its keep when the actual use cases get placed on it. Six are common enough to anchor the conversation. The quadrant position for each use case reflects the typical deployment, not every possible deployment.

Q1

💡

Hypothesis Generation & Discovery

The deliverable is a list of candidate hypotheses and problem framings. The cost of generating a wrong hypothesis is essentially zero. L1 or L2 methods sufficient. Synthetic phase finds the questions; humans answer them.

Q1/Q2

🎯

Concept Testing & Stimulus Screening

Q1 for early screens, Q2 for screening tied to launch decisions. The synthetic phase is the funnel; a small human panel is the gate. Collapse of variance is the main risk: always look at the spread, not just the mean.

Q1/Q2

🔍

Edge-Case & Rare-Segment Exploration

A genuine structural advantage. Rare segments are expensive to recruit; edge cases by definition do not appear in normal samples. A simulator can be asked to produce respondents matching specifications that would take months to recruit. Caveat: edge cases bounded by training data.

Q3

🌧️

Counterfactual & Scenario Analysis

The synthetic respondent can be asked questions a real respondent cannot. "How would you react if the price doubled overnight?" The value is comparative, not predictive. Scenario A vs. scenario B. Absolute numbers should not be trusted. Relative rankings often can be, with calibration.

Q2

🗣️

Sensitive Topic Pre-Exploration

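A synthetic respondent will discuss financial stress, mental health, or discrimination without the face-saving hesitation a real respondent brings to an interview. That is a genuine methodological advantage. It is also a genuine trap: the answers are shaped by how these topics are written about, not by how they are actually felt, and they inherit whatever desirability bias sits in the training text (see Section 11). Use as reconnaissance, not evidence.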

Q3/Q4 ⚠

📊

Large-Scale Quantitative Work

Most contested use case. Synthetic panels compress variance, miss tail behaviour, and do not produce reliable point estimates. Toubia's r≈0.2 at the individual level is a property of the method, not poor implementation. Defensible as augmentation: instrument piloting, edge-case oversampling, calibrated forecasting with real anchor.

The framing that does not work: replace the real sample with a synthetic one. The framing that does work: use the synthetic capability to do things the real sample alone could not afford.

09 — Validation Taxonomy

Validation is not a single percentage. It is a structured set of separate claims.

When a vendor reports "85 percent accuracy", the question to ask is: accurate against what? Most synthetic respondent validation claims collapse several different types of evidence into one number that hides more than it reveals. A serious validation framework separates the claims.

Each method is listed with what it catches and what it misses.

Distributional Fidelity

What it catches: Whether synthetic data follows the same statistical distribution as real data. Kolmogorov-Smirnov (KS) test for univariate distributions, Wasserstein distance for higher-dimensional ones, Jensen-Shannon divergence for probability distributions, chi-squared for categorical variables.

What it misses: Correlations between variables. Joint structure. Tail behaviour. Subgroup-level variance. A panel can match every marginal while failing to capture the joint structure that matters for analysis.

Construct Validity

What it catches: Whether the synthetic respondent is measuring what it claims to be measuring. Confirmatory factor analysis (CFA), item response theory (IRT), measurement invariance testing against a real anchor sample.

What it misses: Whether the construct relates to outcomes the way theory predicts. Catches the deepest internal failures but not the external validity of the construct itself.

Convergent & Divergent Validity

What it catches: Whether the synthetic respondent's answers correlate with what they should and do not correlate with what they should not. The MTMM matrix (Campbell & Fiske, 1959) is the standard tool.

What it misses: Whether the magnitude of correlation is calibrated. It checks only the direction and presence of a correlation. Magnitude calibration requires predictive validity testing.

Predictive Validity

What it catches: Whether the synthetic respondent's outputs predict real-world outcomes within an acceptable error band. Set up pre-registered predictions, compare against outcomes when they arrive. The accumulating record is what makes the method credible over time.

What it misses: Nothing structural. This is the hardest and most consequential type. Its weakness is practical: it requires patience and a recorded outcome. Most teams skip it.

Calibration Methods

What it catches: Whether the synthetic respondent can be made accurate with a small real anchor. Baselines: isotonic regression, Platt scaling. Frontier: SYN-DIGITS latent factor alignment (Columbia, 2026), reporting 50% improvement in individual correlation.

What it misses: Calibration is only as good as the anchor and only generalises within the anchor's conditions. Outside that envelope, the numbers degrade quickly without warning.

The Defensible Validation Package

A distributional fidelity report covering marginal distributions and key joint distributions. KS or Wasserstein values, not just descriptive statistics.
A construct validity report for any panel intended for psychometric or attitudinal work. Factor structures and measurement invariance against a real anchor sample.
A convergent and divergent validity report showing that traits within the synthetic respondents correlate the way theory predicts they should.
A predictive validity record, accumulated over time, showing how the synthetic respondents' predictions compare to subsequent real-world outcomes.
A calibration approach, if L3 or L4 work is involved, with the calibration data and method fully documented.

This is not exotic. It is the same evidence package any psychometrically trained researcher would expect for a new survey instrument. The fact that most synthetic respondent vendors do not provide it is a statement about the field's current maturity, not about what the standard should be.
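The reason the package asks for joint distributions as well as marginals can be shown with a small constructed case. The sketch below is illustrative only: the data is simulated, the variable names are hypothetical, and the "synthetic" panel is built by shuffling one column of the "real" one so that every marginal matches exactly while the relationship between the two variables is destroyed. The KS statistic is computed by hand to keep the example dependency-free.

```python
import random
from bisect import bisect_right


def ks_statistic(a: list[float], b: list[float]) -> float:
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, value) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / (var_x * var_y) ** 0.5


def build_panels(n: int = 500, seed: int = 7):
    """Simulated 'real' panel plus a 'synthetic' one with shuffled intent scores."""
    rng = random.Random(seed)
    sensitivity = [rng.gauss(0, 1) for _ in range(n)]
    # Real intent falls as price sensitivity rises (true correlation near -0.8).
    real_intent = [-0.8 * s + 0.6 * rng.gauss(0, 1) for s in sensitivity]
    # Same intent values, reassigned at random: identical marginal, broken joint.
    synthetic_intent = real_intent[:]
    rng.shuffle(synthetic_intent)
    return sensitivity, real_intent, synthetic_intent


if __name__ == "__main__":
    sensitivity, real_intent, synthetic_intent = build_panels()
    print("KS on intent marginal:", ks_statistic(real_intent, synthetic_intent))
    print("real correlation:", round(pearson(sensitivity, real_intent), 2))
    print("synthetic correlation:", round(pearson(sensitivity, synthetic_intent), 2))
```

A report that stopped at the marginal KS value would pass this panel. Any analysis that depends on how intent moves with price sensitivity would be wrong.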

10 — The Contested Point

Replacement or augmentation: the honest reading.

The synthetic respondent industry is split on what it actually claims to do. Both positions are represented in the published research. Reading the evidence carefully is the only way to make sense of it.

Replacement Camp

Synthetic respondents are good enough for most commercial research questions.

Argyle et al.: LLMs can be conditioned to produce response distributions matching human sub-populations in aggregate
Park et al.: interview-grounded twins reproduced General Social Survey answers 85% as accurately as participants reproduced their own answers two weeks later
Synthetic Users: 85-92% thematic parity across structured measures
Real respondents lie about behaviour, post-rationalise, and respond to demand characteristics. Synthetic respondents are no worse on these dimensions and are far cheaper.
Reproducibility of real survey results is already weak. Synthetic adds little marginal unreliability.

VS

Augmentation Camp

Synthetic respondents are not adequate substitutes where variance matters.

Toubia et al.: r≈0.2 between twin and human answers across 164 outcomes. Better than random, not substitutable.
Bisbee et al.: aggregate accuracy is real, distributional fidelity is not. The model reproduces the mean but not the spread.
Variance compression is a property of the method, not an implementation flaw. Parametric estimates depend on the tails as much as on the centre.
LLMs are linguistic simulators, not cognitive models of humans. The bounds matter for research inference.
Reproducibility is fragile across model versions and prompt variations.
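The variance point is easy to check once a human anchor exists. The sketch below uses made-up scores on a 1 to 7 scale, chosen so the two panels share a mean; the function name, the top-box threshold of 6, and the data are all assumptions for illustration. It reports the gap in means alongside the ratio of standard deviations and the share of respondents in the top of the scale.

```python
from statistics import mean, pstdev


def spread_report(human: list[float], synthetic: list[float], top_box: float = 6.0) -> dict[str, float]:
    """Compare centre, spread, and tail share of a synthetic panel with a human anchor."""
    return {
        "mean_gap": mean(synthetic) - mean(human),
        # A ratio well below 1 means the synthetic panel is compressing variance.
        "sd_ratio": pstdev(synthetic) / pstdev(human),
        "human_top_box": sum(score >= top_box for score in human) / len(human),
        "synthetic_top_box": sum(score >= top_box for score in synthetic) / len(synthetic),
    }


if __name__ == "__main__":
    # Hypothetical 1-7 purchase-intent scores; both panels average 4.2.
    human = [1, 2, 2, 3, 4, 5, 5, 6, 7, 7]
    synthetic = [4, 4, 4, 4, 4, 4, 5, 4, 4, 5]
    for key, value in spread_report(human, synthetic).items():
        print(f"{key}: {value:.2f}")
```

In this toy case the means agree exactly, which is the kind of result a parity headline reports, while the synthetic panel has no one in the top box at all.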

Verdict

Neither universal claim holds. The answer is quadrant-dependent. For population-level central tendencies, in well-documented populations, on well-documented topics, the replacement camp is closer to right. For individual-level prediction, subgroup-specific behaviour, novel topics, sensitive topics, non-Western markets, and parametric estimates, the augmentation camp is closer to right. Q1 and Q2 favour the replacement camp's intuitions, with caveats. Q3 and Q4 require the augmentation camp's discipline.

The disagreement is not about the data. Both camps cite roughly the same studies. The disagreement is about what counts as "good enough". The replacement camp reads the parity findings (85-92% thematic alignment) as evidence that the threshold has been crossed for most use cases. The augmentation camp reads the variance findings as evidence that the parity numbers conceal individual-level and subgroup-level failures.

Both readings are internally coherent. Neither is dishonest. The difference comes down to what kind of research question is in front of the practitioner. A clean reading of the evidence supports a position that neither camp fully holds: synthetic respondents are not universally good enough, and they are not universally inadequate.

11 — Hard Limits

What synthetic respondents cannot do, no matter how good the model gets.

Some limitations are model-version-dependent. Better training data, larger models, and smarter calibration will close some gaps. Other limitations are structural. They will not be solved by scaling. Naming the second category honestly is essential for any practitioner using these methods seriously.

🧠

Lived Experience

A synthetic respondent cannot have lived in a household where the electricity is cut once a month for non-payment. It can produce articulate text describing what that might be like. The text will not be what someone with that experience would actually have said. This is not a calibration problem. It is structural.

💛

Emotional Nuance

A real respondent can be ambivalent, notice their own ambivalence, and change their stated position because they have surprised themselves. Synthetic respondents do not have these states. A synthetic respondent describing ambivalence is not ambivalent. It is producing the textual signature of ambivalence.

🌍

Cultural Ground Truth

Published fidelity research is overwhelmingly conducted on US and Western European populations. Outside that envelope, the model produces outputs with the same fluency but far less reliability, in ways that are not visible without external validation against those specific populations.

⏳

Longitudinal Attitude Drift

Real respondents change. Beliefs shift. The same person answers the same question differently in 2023 and 2026. Synthetic respondents produce outputs from a model trained on a fixed snapshot of human discourse. For trend analysis, synthetic respondents are not a substitute for real wave-on-wave data.

🔭

Genuinely New Categories

The model has not read text about something that does not exist yet. A novel product category, a new behavioural pattern. The model can be asked to simulate reactions to it, but it draws on adjacent categories. The reliability of those outputs cannot be assessed: the category is new, there is no anchor.

🎭

Social Desirability Dynamics

The claim that synthetic respondents bypass social desirability bias is incorrect. They inherit a layered version of the same bias from their training data. The simulator inherits whatever cultural biases shaped the corpus. They do not bypass the bias. They obscure it.

🔗

Group Dynamics

Single-agent synthetic respondents cannot produce conformity effects, social proof, contagion, or polarisation. Multi-agent simulations are beginning to simulate these dynamics, but their validation against real group behaviour is still at a very early stage in the literature.

The key distinction: A synthetic respondent is a text generator conditioned on a persona description. That sentence sounds reductive because it is. The reduction is the point. For uses that depend on what the respondent's text looks like, the simulator does serious work. For uses that depend on what the actual respondent is, the simulator does not do the work. It produces text that occupies the place where the work would be, which is not the same thing.

12 — Practitioner Playbook

A working discipline, concrete enough to use on Monday.

The framework, the validation taxonomy, and the limitations together imply a working discipline. Five steps make it concrete. Each step has a mini-visual anchoring the key decision.

1

Classify the question before touching any tool.

Place the research question on the quadrant: signal type (directional vs. parametric) and decision risk (low vs. high)
A real project contains multiple questions across multiple quadrants. Decompose them. Do not treat the whole project as one quadrant.
The hypothesis generation phase is one question. The concept refinement is another. The launch sizing is another.

Quick reference

Directional = finding, theme, ranking — Parametric = number, share, forecast
Low risk = reversible, internal — High risk = externally consequential, hard to reverse

2

Select the layer appropriate to the quadrant.

Q1: L1 or L2 methods are usually sufficient. Validation requirements are light.
Q2: L2 methods plus a human gate. The synthetic phase produces candidate findings. A small targeted human study confirms them.
Q3: L3 methods with calibration. The deliverable is a number; calibration starts to matter.
Q4: Real respondents are the primary evidence base. Synthetic plays a supporting role only.

Layer ramp: L1 → L2 → L3 → L4 (more rigour)

3

Define the validation regime before generating any data.

For Q1 and Q2: qualitative coherence and a small calibration anchor. Informal is fine.
For Q3: formal distributional fidelity tests (KS or Wasserstein on both marginal and key joint distributions). Construct validity if psychometric work is involved.
For Q4: the full validation taxonomy from Section 09. Predictive validity pre-registration is mandatory.

Validation checklist preview

Distributional fidelity → Construct validity → Convergent/divergent validity → Predictive validity record → Calibration documentation

4

Anchor synthetic work to real data in every quadrant.

For Q1: a few hours of qualitative work with real respondents to ground the synthetic exploration in actual category language
For Q2: a structured anchor of 30 to 50 real respondents covering the questions the synthetic phase identified as most consequential
For Q3: a calibration sample of 100 to 300 real respondents matched to the synthetic panel structure
For Q4: real respondents are the primary base. A small real anchor adds more value than a larger synthetic sample alone. Budget should reflect this.

Recommended anchor sample sizes

Quadrant	Anchor type	Size
Q1	Informal qualitative	5-10
Q2	Structured validation	30-50
Q3	Calibration sample	100-300
Q4	Primary fieldwork	Full design
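One way to read the sizes in the table is through the textbook margin of error for a proportion. The sketch below assumes simple random sampling and a 95% confidence level, neither of which the table itself claims, so it is a rough guide to the precision each anchor size buys rather than a justification of the exact ranges.

```python
from math import sqrt


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of a 95% confidence interval for a proportion (worst case at p=0.5)."""
    return z * sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Anchor sizes taken from the Q2 and Q3 rows of the table.
    for label, n in [("Q2 low", 30), ("Q2 high", 50), ("Q3 low", 100), ("Q3 high", 300)]:
        print(f"{label:8s} n={n:3d}  margin of error: +/- {100 * margin_of_error(n):.1f} points")
```

Under those assumptions an anchor of 30 carries a margin of roughly 18 points, enough to confirm or kill a directional finding, while 300 brings it to roughly 6 points, which is where calibrating a number starts to be meaningful.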

5

Pre-register the inference before generating the data.

Write down before generating the data what conclusion will be drawn under what conditions
Synthetic data is cheap enough that the temptation to keep generating until the numbers fit is unusually strong. Pre-registration is the only protection against motivated reasoning.
Example decision rule: "If concept A scores at least X points higher than concept B in the synthetic phase, the team will proceed with concept A to human validation."

Example pre-registration box

Decision rule: synthetic phase will be considered confirmatory only if (a) the top concept scores >15pts above the median, (b) variance across synthetic respondents exceeds 20% of the range, and (c) at least one finding conflicts with the team's prior hypothesis. All three conditions must hold.
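Writing the rule as code before any data exists makes it harder to renegotiate afterwards. The sketch below encodes the three conditions from the example box under stated assumptions: scores sit on a 0 to 100 scale, and "variance exceeds 20% of the range" is read as the standard deviation of respondent scores exceeding 20% of the scale range. The class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import median, pstdev


@dataclass(frozen=True)  # frozen so thresholds cannot be edited after registration
class PreRegistration:
    min_lead_over_median: float = 15.0  # condition (a), in score points
    min_spread_share: float = 0.20      # condition (b), as a share of the scale range
    scale_range: float = 100.0          # assumed 0-100 scoring scale


def is_confirmatory(
    rule: PreRegistration,
    concept_means: dict[str, float],
    respondent_scores: list[float],
    conflicts_with_prior: bool,
) -> bool:
    """Return True only if all three pre-registered conditions hold."""
    scores = list(concept_means.values())
    # (a) The top concept must clear the median concept by the registered margin.
    lead_ok = max(scores) - median(scores) > rule.min_lead_over_median
    # (b) Respondent-level spread must not have collapsed (std dev reading, see lead-in).
    spread_ok = pstdev(respondent_scores) > rule.min_spread_share * rule.scale_range
    # (c) At least one finding must contradict what the team expected going in.
    return lead_ok and spread_ok and conflicts_with_prior


if __name__ == "__main__":
    rule = PreRegistration()
    means = {"A": 72.0, "B": 50.0, "C": 48.0}
    respondents = [20.0, 45.0, 70.0, 95.0, 60.0, 30.0]
    print(is_confirmatory(rule, means, respondents, conflicts_with_prior=True))
```

Condition (c) is supplied as a flag because judging whether a finding conflicts with the prior hypothesis is a human call; the code only refuses to proceed without it.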

Where the Field Goes Next

Three developments are worth watching. Calibration as a discipline: SYN-DIGITS-style methods will become standard. Multi-agent ecosystems: group dynamics, conformity, and social influence will increasingly be addressed through multi-agent simulations. Hybrid designs as default: the most credible research workflows in five years will be neither all-synthetic nor all-organic, but designed from the start as hybrid systems with calibration anchors, validation routines, and pre-registered inference rules built in.
