Synthetic Audiences in Consumer Research
A working framework for senior practitioners. Where they came from, what the published evidence says, where they fit in a research design, and where they fail.
The simulator runs a thousand miles for every one on the road. Research runs none.
Waymo's autonomous vehicles have driven roughly twenty million miles on public roads. The same vehicles have driven over twenty billion miles in simulation. The ratio is close to a thousand to one.
Engineers do not call this cheating. They call it how the system gets safer. Real roads are slow, expensive, and dangerous. Simulators are fast, cheap, and stress the system in ways the real world rarely will.
Consumer and market research operates at a very different ratio. Most studies still run at one-to-one. Real respondents, real fieldwork, real timelines. The idea of a simulator layer, where ideas, questions, and concepts get stress-tested at scale before any human gets a screener, barely exists in practice.
Where it does exist, it is mostly being sold as a replacement for the real thing rather than a complement to it. That framing is the problem. And it is the reason synthetic audiences are arriving in the industry with both more hype and more scepticism than the evidence supports.
The hard question is not whether synthetic respondents can be made. They can. The hard question is whether the inferences drawn from them are valid for the decision being made. That is a question about epistemology, not about model performance.
Synthetic data did not start with GPT. It started with a man playing solitaire.
The year was 1946. Stanislaw Ulam, a Polish mathematician convalescing from a near-fatal bout of viral encephalitis, was sitting up in bed playing Canfield solitaire. He wanted to know the odds of a winning deal. The combinatorics defeated him. So he tried something else: lay out a hundred hands, count the wins, take the ratio.
Ulam took the idea to John von Neumann, and it became the Monte Carlo method, the foundational logic of all synthetic data work that followed. Substitute random sampling for analytic computation. Generate populations of imagined cases. Observe what emerges. The method was secret for years. By the 1950s it was widely published.
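Ulam's procedure (lay out hands, count wins, take the ratio) can be sketched in a few lines. This is a minimal illustration, not Ulam's own calculation: Canfield solitaire is too involved to simulate briefly, so the sketch uses a simpler stand-in question, how often a shuffled deck leaves at least one card in its starting position. That question has a known analytic answer to compare against. The function name and the stand-in question are assumptions made for illustration.

```python
import math
import random


def estimate_match_probability(trials: int, deck_size: int = 52, seed: int = 0) -> float:
    """Estimate P(at least one card stays in its starting position) by sampling."""
    rng = random.Random(seed)
    deck = list(range(deck_size))
    wins = 0
    for _ in range(trials):
        shuffled = deck[:]
        rng.shuffle(shuffled)
        # Count the deal as a "win" if any card sits at its original index.
        if any(card == position for position, card in enumerate(shuffled)):
            wins += 1
    # The ratio of wins to deals is the Monte Carlo estimate.
    return wins / trials


estimate = estimate_match_probability(trials=20_000)
analytic = 1 - 1 / math.e  # limiting value for large decks
print(f"simulated={estimate:.3f}  analytic={analytic:.3f}")
```

The simulated ratio approaches the analytic value as the number of deals grows, which is the whole method: sampling stands in for combinatorics.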
An LLM-driven synthetic respondent can now produce open-ended text that sounds like a transcript. An agent-based model in 2005 could produce a row of numbers. That shift is real, and it is also exactly where the trouble starts. The new resolution makes the output feel like a person rather than a statistical artefact.
The same questions that applied to a Monte Carlo run still apply to a transcript generated by an LLM. What is the simulator certified for? What is it not? How will the inference be validated? Those questions are the rest of this piece.
Other industries built certified simulator layers. Research has not.
Three industries figured out how to use simulation alongside real-world testing long before research caught up. Each one offers a lesson that does not get talked about enough in market research conversations about synthetic respondents.
The important word in aviation is qualified. Not certified. Not accurate. A Level D flight simulator has over a thousand individual test requirements. A pilot can complete an entire type-rating in the box. The simulator's authority is bounded by the conditions under which it was tested.
Most current synthetic audience products make global accuracy claims. Vendors quote single numbers: 85% match to real data, 92% thematic overlap. None of these claims specify the envelope of qualification. None of them would survive a review of the kind the FAA's Part 60 simulator qualification rules require.
The Waymo simulation is not used to replace road testing. It is used to do three things road testing cannot do well. First, scale: real roads cannot be driven a thousand times faster, simulators can. Second, edge case generation: the interesting failures are rare, but in simulation they can be triggered intentionally, repeated, varied. Third, regression: every new software update gets driven through the entire library of past scenarios before it goes anywhere near a public road.
The FDA has been accepting computational modelling and simulation as supporting evidence for medical device approvals for years. The ASME V&V 40 standard sets out exactly what evidence the FDA expects to see: Verification. Validation. Applicability analysis. Uncertainty quantification. This is what proper synthetic respondent validation should look like. Not a single accuracy number. A documented credibility framework with named methods for each component of the claim.
Research has not built this for three reasons. First, the cost of failure is lower. Second, the field has been seduced by accuracy claims that do not specify the envelope of qualification. Third, the practitioner community has not yet developed a vocabulary for use-case-bounded validity. That vocabulary has to come from somewhere.
The layer changes what the data is good for.
Synthetic respondents are not one thing. The methodologies differ enough that grouping them under a single label is the source of most confusion in the field. Four distinct layers exist, each with different inputs, different outputs, and different defensible uses.
Layer 1 (L1): What it is
A demographic and psychographic profile is described in a prompt. A general-purpose LLM is asked to respond as if it were that person. The output is fast, plausible, and essentially a stylised average of how the model has seen that kind of person described in its training data.
Tools & limits
- Tools: ChatGPT, Claude, raw LLMs
- Best for: hypothesis generation, fast screens, stimulus review
- Avoid for: anything requiring variance or individual prediction
- Produces: stylised averages, not distributions
Layer 2 (L2): What it is
Personas become structured objects with stable attributes, often anchored to OCEAN (Big Five personality model), with explicit memory and continuity across multiple questions. A persona's answer to question seventeen is shaped by their answer to question two. Subgroup variance becomes recoverable.
Tools & limits
- Tools: TinyTroupe, ASPIRE, Polypersona, Synthetic Users
- Best for: concept testing, qualitative pre-fieldwork, theme generation
- Avoid for: decision-grade numbers or individual-level prediction
- Requires: thoughtful persona construction upfront
Layer 3 (L3): What it is
Personas are constructed by conditioning the LLM on rich real-world data about specific individuals. Toubia's Twin-2K-500 dataset and Park et al.'s 1,000-person interview-grounded twins are the cleanest published examples. The approach captures central tendencies well and individuals poorly.
Tools & limits
- Approach: twin construction from real panels
- Best for: population-level patterns, segment forecasting
- Avoid for: individual-level prediction (Toubia: r ≈ 0.2)
- Requires: real anchor data to construct twins
Layer 4 (L4): What it is
A calibration layer sits on top of any of the above and adjusts the LLM output to align with human ground truth. SYN-DIGITS (Columbia, April 2026) uses latent factor alignment from causal inference. The authors report a 50% improvement in individual-level correlation and a 50-90% reduction in distributional discrepancy.
Tools & limits
- Methods: SYN-DIGITS, isotonic regression, Platt scaling
- Best for: prediction within a calibrated envelope
- Requires: real anchor data for calibration
- Warning: calibration only generalises within anchor conditions
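Of the methods listed, isotonic regression is the simplest to show end to end. The sketch below is a minimal pure-Python version of the pool-adjacent-violators algorithm, fitted on a small anchor of paired synthetic scores and human outcomes. The data values, function names, and the purchase-intent framing are hypothetical and chosen for illustration only. It is not the SYN-DIGITS method.

```python
from bisect import bisect_left


def fit_isotonic(synthetic_scores, human_outcomes):
    """Pool-adjacent-violators: fit a non-decreasing step function."""
    pairs = sorted(zip(synthetic_scores, human_outcomes))
    blocks = []  # each block holds [sum of outcomes, count, largest score in block]
    for score, outcome in pairs:
        blocks.append([outcome, 1, score])
        # Merge backwards while the previous block's mean exceeds the new one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    thresholds = [block[2] for block in blocks]
    levels = [block[0] / block[1] for block in blocks]
    return thresholds, levels


def calibrate(score, thresholds, levels):
    """Map a raw synthetic score to its calibrated level (flat outside the anchor range)."""
    index = bisect_left(thresholds, score)
    return levels[min(index, len(levels) - 1)]


# Hypothetical anchor: synthetic purchase-intent scores vs. observed human rates.
synthetic = [0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
human = [0.10, 0.30, 0.25, 0.35, 0.60, 0.55]

thresholds, levels = fit_isotonic(synthetic, human)
print(calibrate(0.45, thresholds, levels))
print(calibrate(0.95, thresholds, levels))  # beyond the anchor range: clipped, not validated
```

The second call shows the warning in practice. A score outside the range covered by the anchor still returns a number, but that number is only the nearest fitted level, and nothing in the anchor data supports it.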
Humans do not think the way language models simulate thinking.
This is the hardest section to write honestly because both sides of the conversation are overconfident. The synthetic respondent industry argues that synthetic respondents get past the self-reported narrative because they can be modelled at the behavioural level. That argument is correct about the say-do gap. It is incorrect about what synthetic respondents actually have access to.
Human Respondent only
- Embodied experience
- Emotional nuance
- Cultural ground truth
- Social dynamics
- Longitudinal memory
- Genuine ambivalence
Overlap
- Surface language
- Thematic responses
- General reactions
- Topic knowledge
LLM Simulation only
- Training text patterns
- Pattern recognition
- Language generation
- Infinite scalability
- Zero fatigue
Lived Experience
The model has not been hungry. It has not been the only one in a meeting whose accent is mocked. It has not had a parent die. It can produce plausible text. The text will not be what someone with that experience actually feels.
Cultural Ground Truth
Published fidelity research is overwhelmingly conducted on US and Western European populations. A vendor citing 90% parity on US data makes no claim at all about the same product's performance on a Saudi, Tier 2 Indian, or Indonesian sample.
Group Dynamics
Single-agent synthetic respondents cannot produce conformity effects, social proof, contagion, or polarisation. Multi-agent simulations are beginning to simulate these dynamics. Validation of those simulations against real group behaviour is still at a very early stage in the literature.
Binz and Schulz, in Nature Computational Science, ran classical cognitive psychology tasks on LLMs and human participants. They found that earlier LLMs showed many of the same reasoning errors humans do. More recent models are not converging on human-like cognition. They are diverging from it.
A 2025 paper analysing 170,000 reasoning traces from 17 models against 54 human think-aloud protocols found that humans use hierarchical nesting and meta-cognitive monitoring. Models use shallow forward chaining. The divergence is most pronounced on ill-structured problems, which is exactly the kind of problem most consumer decisions are.
The behavioural modelling framing the synthetic respondent industry has adopted is partially correct and substantially misleading. Synthetic respondents are excellent for the textual layer of behaviour. They are unreliable for the layers underneath it.
The vendor landscape is moving faster than the methodological literature can validate it.
A working map matters because the labels in marketing material rarely correspond to what is happening underneath. The field divides clearly into three categories: academic researchers doing the rigorous work that practitioner conversations should be anchored to, an open-source stack for teams that want to build, and commercial vendors with validation claims that vary enormously.
Academic Core
Open Source Stack
Commercial Vendors
Signal type and decision risk decide where synthetic fits.
Every other axis matters less than these two. Where the research question lands on signal type and decision risk determines whether a synthetic respondent is appropriate, where in the design it sits, and how heavily it needs to be calibrated against real humans.
Use cases
Practitioner mistakes to avoid
Six use cases, placed on the quadrant.
The framework only earns its keep when the actual use cases get placed on it. Six are common enough to anchor the conversation. The quadrant position for each use case reflects the typical deployment, not every possible deployment.
The deliverable is a list of candidate hypotheses and problem framings. The cost of generating a wrong hypothesis is essentially zero. L1 or L2 methods are sufficient. The synthetic phase finds the questions; humans answer them.
Q1 for early screens, Q2 for screening tied to launch decisions. The synthetic phase is the funnel; a small human panel is the gate. Collapse of variance is the main risk: always look at the spread, not just the mean.
A genuine structural advantage. Rare segments are expensive to recruit; edge cases by definition do not appear in normal samples. A simulator can be asked to produce respondents matching specifications that would take months to recruit. Caveat: the edge cases it can produce are bounded by its training data.
The synthetic respondent can be asked questions a real respondent cannot. "How would you react if the price doubled overnight?" The value is comparative, not predictive. Scenario A vs. scenario B. Absolute numbers should not be trusted. Relative rankings often can be, with calibration.
A synthetic respondent will discuss financial stress, mental health, or discrimination without the social desirability pressure a human feels in front of an interviewer. That is a genuine methodological advantage. It is also a genuine trap: the answers are shaped by how these topics are written about, not by how they are actually felt. Use as reconnaissance, not evidence.
The most contested use case. Synthetic panels compress variance, miss tail behaviour, and do not produce reliable point estimates. Toubia's r≈0.2 at the individual level is a property of the method, not of poor implementation. Defensible as augmentation: instrument piloting, edge-case oversampling, calibrated forecasting with a real anchor.
Validation is not a single percentage. It is a structured set of separate claims.
When a vendor reports "85 percent accuracy" the question to ask is: accurate against what? Most synthetic respondent validation claims collapse several different types of evidence into one number that hides more than it reveals. A serious validation framework separates the claims.
The Defensible Validation Package
- A distributional fidelity report covering marginal distributions and key joint distributions. Kolmogorov-Smirnov (KS) statistics or Wasserstein distances, not just descriptive statistics.
- A construct validity report for any panel intended for psychometric or attitudinal work. Factor structures and measurement invariance against a real anchor sample.
- A convergent and divergent validity report showing that traits within the synthetic respondents correlate the way theory predicts they should.
- A predictive validity record, accumulated over time, showing how the synthetic respondents' predictions compare to subsequent real-world outcomes.
- A calibration approach, if L3 or L4 work is involved, with the calibration data and method fully documented.
This is not exotic. It is the same evidence package any psychometrically trained researcher would expect for a new survey instrument. The fact that most synthetic respondent vendors do not provide it is a statement about the field's current maturity, not about what the standard should be.
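The distributional fidelity item can be made concrete with two standard measures. The sketch below computes a two-sample KS statistic and a one-dimensional Wasserstein distance in plain Python, on a hypothetical pair of 5-point-scale samples constructed so that the means match and the spreads do not. The sample values and function names are invented for illustration. A production report would more likely use an established statistics library such as SciPy's `ks_2samp` and `wasserstein_distance`.

```python
from bisect import bisect_right
from statistics import mean


def ecdf(sorted_sample, x):
    """Empirical CDF: share of observations less than or equal to x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def ks_statistic(a, b):
    """Largest vertical gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


def wasserstein_1d(a, b):
    """Area between the two empirical CDFs (1-D Wasserstein-1 distance)."""
    a, b = sorted(a), sorted(b)
    grid = sorted(set(a + b))
    area = 0.0
    for left, right in zip(grid, grid[1:]):
        # Both CDFs are flat between neighbouring observed values.
        area += abs(ecdf(a, left) - ecdf(b, left)) * (right - left)
    return area


# Hypothetical responses on a 5-point scale: same mean, different spread.
human = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
synthetic = [2, 3, 3, 3, 3, 3, 3, 3, 3, 4]

print("means:", mean(human), mean(synthetic))
print("KS statistic:", round(ks_statistic(human, synthetic), 3))
print("Wasserstein distance:", round(wasserstein_1d(human, synthetic), 3))
```

Both samples average 3, so a report based on means alone would show perfect agreement. The KS statistic and the Wasserstein distance are both clearly non-zero, which is the variance compression that descriptive statistics hide.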
Replacement or augmentation: the honest reading.
The synthetic respondent industry is split on what it actually claims to do. Both positions are represented in the published research. Reading the evidence carefully is the only way to make sense of it.
Synthetic respondents are good enough for most commercial research questions.
- Argyle et al.: LLMs can be conditioned to produce response distributions matching human sub-populations in aggregate
- Park et al.: interview-grounded twins reproduced participants' General Social Survey answers 85% as accurately as the participants reproduced their own answers two weeks later
- Synthetic Users: 85-92% thematic parity across structured measures
- Real respondents lie about behaviour, post-rationalise, and respond to demand characteristics. Synthetic respondents are no worse on these dimensions and are far cheaper.
- Reproducibility of real survey results is already weak. Synthetic adds little marginal unreliability.
Synthetic respondents are not adequate substitutes where variance matters.
- Toubia et al.: r≈0.2 between twin and human answers across 164 outcomes. Better than random, not substitutable.
- Bisbee et al.: aggregate accuracy is real, distributional fidelity is not. The model reproduces the mean but not the spread.
- Variance compression is a property of the method, not an implementation flaw. Parametric estimates depend on the tails as much as on the centre.
- LLMs are linguistic simulators, not cognitive models of humans. The bounds matter for research inference.
- Reproducibility is fragile across model versions and prompt variations.
Neither universal claim holds. The answer is quadrant-dependent. For population-level central tendencies, in well-documented populations, on well-documented topics, the replacement camp is closer to right. For individual-level prediction, subgroup-specific behaviour, novel topics, sensitive topics, non-Western markets, and parametric estimates, the augmentation camp is closer to right. Q1 and Q2 favour the replacement camp's intuitions, with caveats. Q3 and Q4 require the augmentation camp's discipline.
The disagreement is not about the data. Both camps cite roughly the same studies. The disagreement is about what counts as "good enough". The replacement camp reads the parity findings (85-92% thematic alignment) as evidence that the threshold has been crossed for most use cases. The augmentation camp reads the variance findings as evidence that the parity numbers conceal individual-level and subgroup-level failures.
Both readings are internally coherent. Neither is dishonest. The difference comes down to what kind of research question is in front of the practitioner. A clean reading of the evidence supports a position that neither camp fully holds: synthetic respondents are not universally good enough, and they are not universally inadequate.
What synthetic respondents cannot do, no matter how good the model gets.
Some limitations are model-version-dependent. Better training data, larger models, and smarter calibration will close some gaps. Other limitations are structural. They will not be solved by scaling. Naming the second category honestly is essential for any practitioner using these methods seriously.
A synthetic respondent cannot have lived in a household where the electricity is cut once a month for non-payment. It can produce articulate text describing what that might be like. The text will not be what someone with that experience would actually have said. This is not a calibration problem. It is structural.
A real respondent can be ambivalent, notice their own ambivalence, and change their stated position because they have surprised themselves. Synthetic respondents do not have these states. A synthetic respondent describing ambivalence is not ambivalent. It is producing the textual signature of ambivalence.
Published fidelity research is overwhelmingly conducted on US and Western European populations. Outside that envelope, the model produces outputs with the same fluency but far less reliability, in ways that are not visible without external validation against those specific populations.
Real respondents change. Beliefs shift. The same person answers the same question differently in 2023 and 2026. Synthetic respondents produce outputs from a model trained on a fixed snapshot of human discourse. For trend analysis, synthetic respondents are not a substitute for real wave-on-wave data.
The model has not read text about something that does not exist yet. A novel product category, a new behavioural pattern. The model can be asked to simulate reactions to it, but it draws on adjacent categories. The reliability of those outputs cannot be assessed: the category is new, there is no anchor.
The claim that synthetic respondents bypass social desirability bias is incorrect. They inherit a layered version of the same bias from their training data. The simulator inherits whatever cultural biases shaped the corpus. They do not bypass the bias. They obscure it.
Single-agent synthetic respondents cannot produce conformity effects, social proof, contagion, or polarisation. Multi-agent simulations are beginning to simulate these dynamics. The validation of those dynamics against real group behaviour is at the very early stages of the literature.
A working discipline, concrete enough to use on Monday.
The framework, the validation taxonomy, and the limitations together imply a working discipline. Five steps make it concrete. Each step has a mini-visual anchoring the key decision.
Classify the question before touching any tool.
- Place the research question on the quadrant: signal type (directional vs. parametric) and decision risk (low vs. high)
- A real project contains multiple questions across multiple quadrants. Decompose them. Do not treat the whole project as one quadrant.
- The hypothesis generation phase is one question. The concept refinement is another. The launch sizing is another.
Low risk = reversible, internal. High risk = externally consequential, hard to reverse.
Select the layer appropriate to the quadrant.
- Q1: L1 or L2 methods are usually sufficient. Validation requirements are light.
- Q2: L2 methods plus a human gate. The synthetic phase produces candidate findings. A small targeted human study confirms them.
- Q3: L3 methods with calibration. The deliverable is a number; calibration starts to matter.
- Q4: Real respondents are the primary evidence base. Synthetic plays a supporting role only.
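The first two steps amount to a lookup: classify each question, then read off the layer. The sketch below encodes that lookup. It assumes the quadrants are numbered Q1 = directional and low risk, Q2 = directional and high risk, Q3 = parametric and low risk, Q4 = parametric and high risk, which is a reading inferred from the text rather than stated in it. The names `Plan` and `plan_for`, and the classification of the three example questions, are likewise hypothetical.

```python
from typing import NamedTuple


class Plan(NamedTuple):
    quadrant: str
    synthetic_layer: str
    human_role: str


# Assumed quadrant numbering (see lead-in); layer guidance paraphrases the list above.
_PLANS = {
    ("directional", "low"): Plan("Q1", "L1 or L2", "light validation"),
    ("directional", "high"): Plan("Q2", "L2", "small targeted human study as gate"),
    ("parametric", "low"): Plan("Q3", "L3 with calibration", "calibration sample"),
    ("parametric", "high"): Plan("Q4", "supporting role only", "primary evidence base"),
}


def plan_for(signal_type: str, decision_risk: str) -> Plan:
    """Return the layer guidance for one research question."""
    key = (signal_type.lower(), decision_risk.lower())
    if key not in _PLANS:
        raise ValueError(f"unknown signal type / decision risk: {key}")
    return _PLANS[key]


# One project, decomposed into separate questions (classifications are illustrative).
project = [
    ("hypothesis generation", "directional", "low"),
    ("concept refinement", "directional", "high"),
    ("launch sizing", "parametric", "high"),
]

for question, signal, risk in project:
    plan = plan_for(signal, risk)
    print(f"{question}: {plan.quadrant}, synthetic={plan.synthetic_layer}, humans={plan.human_role}")
```

Keeping the mapping per question rather than per project is the point: the same study can legitimately use L1 prompts in one phase and full fieldwork in another.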
Define the validation regime before generating any data.
- For Q1 and Q2: qualitative coherence and a small calibration anchor. Informal is fine.
- For Q3: formal distributional fidelity tests (KS or Wasserstein on both marginal and key joint distributions). Construct validity if psychometric work is involved.
- For Q4: the full validation taxonomy from Section 09 (the Defensible Validation Package). Predictive validity pre-registration is mandatory.
Anchor synthetic work to real data at every quadrant.
- For Q1: a few hours of qualitative work with real respondents to ground the synthetic exploration in actual category language
- For Q2: a structured anchor of 30 to 50 real respondents covering the questions the synthetic phase identified as most consequential
- For Q3: a calibration sample of 100 to 300 real respondents matched to the synthetic panel structure
- For Q4: real respondents are the primary base. A small real anchor adds more value than a larger synthetic sample alone. Budget should reflect this.
| Quadrant | Anchor type | Size (real respondents) |
|---|---|---|
| Q1 | Informal qualitative | 5-10 |
| Q2 | Structured validation | 30-50 |
| Q3 | Calibration sample | 100-300 |
| Q4 | Primary fieldwork | Full design |
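One way to read the anchor sizes is through the precision each buys. The sketch below applies the standard margin-of-error formula for a proportion, z * sqrt(p(1 - p) / n), at a 95% level. It assumes simple random sampling and the worst case p = 0.5. Neither assumption comes from the framework itself, and the sizes in the table are not derived from this formula.

```python
import math


def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a proportion."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return z * math.sqrt(p * (1 - p) / n)


# Anchor sizes taken from the table above.
for label, n in [("Q2 lower", 30), ("Q2 upper", 50), ("Q3 lower", 100), ("Q3 upper", 300)]:
    print(f"{label}: n={n}, margin of error = +/-{margin_of_error(n):.1%}")
```

At 30 to 50 respondents the anchor can confirm or contradict a direction but cannot pin down a number. At 100 to 300 the interval narrows enough for calibration to be meaningful, which fits the shift from directional to parametric work between Q2 and Q3.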
Pre-register the inference before generating the data.
- Write down, before generating the data, which conclusion will be drawn under which conditions
- Synthetic data is cheap enough that the temptation to keep generating until the numbers fit is unusually strong. Pre-registration is the only protection against motivated reasoning.
- Example decision rule: "If concept A scores at least X points higher than concept B in the synthetic phase, the team will proceed with concept A to human validation."
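The example rule can be written down in a form that is hard to adjust after the fact. The sketch below freezes the threshold in an immutable object and produces a hash of the rule that can be recorded before any data is generated. The class name, the metric label, the threshold value, and the scores are all hypothetical. The source rule only states what happens when the threshold is met, so the fallback wording is also an assumption.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from statistics import mean


@dataclass(frozen=True)  # frozen: the rule cannot be edited once created
class DecisionRule:
    metric: str
    min_lead: float  # the "X points" in the rule, fixed before data generation

    def fingerprint(self) -> str:
        """Hash of the rule, to be logged before the synthetic phase starts."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def decide(self, scores_a, scores_b) -> str:
        lead = mean(scores_a) - mean(scores_b)
        if lead >= self.min_lead:
            return "proceed with concept A to human validation"
        # Assumed fallback: the synthetic phase alone settles nothing.
        return "no decision from the synthetic phase"


rule = DecisionRule(metric="purchase intent (5-point scale)", min_lead=0.5)
print("registered rule:", rule.fingerprint()[:12])

# Hypothetical synthetic-phase scores, generated only after the rule is logged.
print(rule.decide([4, 4, 5], [3, 3, 4]))
print(rule.decide([4, 4, 4], [4, 4, 3]))
```

Logging the fingerprint alongside a timestamp gives a simple audit trail: if the threshold changes after the numbers come in, the hash no longer matches.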