A thinking guide

Synthetic consumers, in plain sight.

A working note on what synthetic consumers are, what they aren’t, what the research is showing, and the questions worth sitting with before they show up in your next study.

May 2026 · Thirteen sections · Open access
I. A conversation underway

Something is shifting, and most teams can feel it before they can name it.

For most of the last decade, market research had a quiet, predictable shape. A question came in. A panel was sourced. Fieldwork ran for a few weeks. Findings landed in a deck. The cadence was slow because real people were the only reliable signal, and real people take time.

That shape is starting to bend. Brand and product teams are running concept tests overnight. Pricing studies that used to take six weeks are landing the next afternoon. The labels attached to this shift vary: synthetic consumers, synthetic audiences, silicon samples, AI personas. The marketing varies too. But the underlying claim is consistent. There is now a class of methods that can stand in for a research participant, at least some of the time, in some kinds of work, at a small fraction of the cost.

The conversation around these methods is loud, contested, and moving quickly. Some teams are leaning in. Others are pushing back. Most are somewhere in the middle, trying to work out what the methods actually are, what they actually do well, and where the limits sit.

Synthetic data could account for more than half of all market research inputs by 2027.

Qualtrics industry forecast, 2024

This guide draws on the PyMC Labs practical guide, the Maier purchase–intent paper, Forbes commentary from agency leaders, the Downey work on open–ended responses, the Park generative agents paper, and the Tirrel protocol on silicon sampling in HR research. The goal is not to recommend a tool or settle a debate. The goal is to make the shape of the conversation easier to read.

II. At a glance

The state of the field in four numbers.

None of these numbers is a warranty. Each is a useful anchor for thinking about where the methods sit in mid–2026.

90% of human test–retest reliability reproduced by well–elicited synthetic respondents on purchase intent
Maier et al., 2025 · 9,300 humans · 57 surveys

86% of recent studies show at least partial alignment between synthetic and human responses
PyMC review · 2023–2025 · up from ~35%

<24h cycle time for concept and pricing tests that traditionally took weeks of fieldwork
Pioneer brand case studies, 2024–25

>50% projected share of market–research inputs that will be synthetic by 2027
Qualtrics industry forecast, 2024

Where the gains came from

The improvement from ~35% to 86% partial alignment did not come only from better models. A meaningful portion came from better practitioner technique in eliciting responses. Section XI returns to this point. It is one of the more important and under–discussed findings in the recent literature.

III. What synthetic consumers are

A statistical artifact wearing a voice.

A synthetic consumer is an AI–generated participant designed to respond to research stimuli the way a real consumer in a defined segment might. Asked about a concept, a price, a tagline, a category, it produces something that looks like an answer—a number, a sentence, a paragraph of rationale.

The output is text. The thing that decides whether that text is useful is everything that sits behind it: the data the model was trained on, the persona scaffolding it was prompted with, the way the question was asked, and the validation routine the team has run against real human data.

Three observations follow, each of which keeps recurring across the source literature.

  • The model is not the product. Two systems built on the same underlying LLM can behave very differently depending on how they elicit responses and how they calibrate them.
  • The persona is a surface, not a soul. A thick demographic profile makes a synthetic consumer feel specific. It does not make it accurate. Specific–sounding does not mean correct.
  • The output is plural by design. A single synthetic respondent is not the unit of analysis. The unit is a distribution across hundreds or thousands of synthetic respondents, compared against a distribution of real responses.
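
To make the third point concrete, here is a minimal Python sketch of comparing a synthetic response distribution with a human one. The five-point scale, the ratings, and the choice of total variation distance and Jensen-Shannon divergence are illustrative assumptions; the source literature does not prescribe these particular metrics.

```python
from collections import Counter
from math import log2

SCALE = [1, 2, 3, 4, 5]  # assumed five-point purchase-intent scale


def to_distribution(ratings, scale=SCALE):
    """Turn a list of ratings into a probability mass function over the scale."""
    counts = Counter(ratings)
    total = len(ratings)
    return [counts.get(point, 0) / total for point in scale]


def total_variation(p, q):
    """Half the L1 distance: 0 means identical, 1 means no overlap."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits, bounded between 0 and 1."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero mass in x contribute nothing to the sum.
        return sum(a * log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up ratings, for illustration only.
human = [1] * 10 + [2] * 20 + [3] * 30 + [4] * 25 + [5] * 15
synthetic = [1] * 2 + [2] * 10 + [3] * 55 + [4] * 28 + [5] * 5

p_human = to_distribution(human)
p_synth = to_distribution(synthetic)
tv = total_variation(p_human, p_synth)
js = jensen_shannon(p_human, p_synth)
print(f"total variation = {tv:.3f}, JS divergence = {js:.3f}")
```

In this toy data the synthetic ratings pile up on the midpoint, so the two samples have similar means but clearly different shapes. That is the kind of gap a comparison of single averages would hide.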
If you remember one thing

What you see is text. What matters is the architecture behind it. Two synthetic–consumer products that look identical in the demo can be two completely different research instruments under the hood.

IV. The vocabulary in use

Four overlapping terms, each emphasising something slightly different.

The labels are not interchangeable, even though they are often used as if they were. The PyMC practical guide proposes a useful split, plotted below on two axes: how broad the application is, and how complex the underlying system tends to be.

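[Figure: the terms plotted by application breadth (horizontal axis, starting from narrow) against system complexity (vertical axis, starting from low). Plotted terms: synthetic respondents (general purpose), digital twins (CX · personalisation), human simulacra (academic / emergent).]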

A fifth term—silicon sampling—appears in the peer–reviewed literature. It is the rigorous label for the broader family, and it is the search term that will find serious validation studies rather than vendor marketing.

i
Synthetic respondents
The broadest term. General–purpose AI participants for surveys and social research. The umbrella the other terms sit under.
iii
Digital twins
Continuously updated models tied to live behavioural data. About personalisation and CX, not concept testing.
iv
Human simulacra
Generative agents that model group dynamics and emergent social behaviour. Mostly academic.
A small piece of literacy

Noticing which term a vendor uses, and which one they slide between when pressed, is a small but useful signal. The honest operators can describe the difference in a sentence. The ones who treat the terms as interchangeable are usually selling more than the method can carry.

V. How they actually work

A familiar architecture, sitting under the marketing surface.

Strip away the language and the demos and the vendor logos, and most credible synthetic–consumer systems share the same five–stage architecture.

i
Data foundation
Real behavioural and demographic data that sets the ceiling for everything else.
load-bearing
ii
Persona generation
Digital participants with demographics and attitudes. What the demo usually shows.
demo layer
iii
Simulation
Personas respond to stimuli. Elicitation method matters more than model size.
elicitation
iv
Validation
Synthetic outputs benchmarked against real human responses to test alignment.
load-bearing
v
Iteration
New data feeds in. Models recalibrate. A frozen panel ages badly.
ongoing

Insights are no longer one–off projects but continuous learning loops.

Rismal, Swadi K & Fiaschi, PyMC Labs (2026)

One observation worth holding onto. Stages one and four—data foundation and validation—are doing most of the load–bearing work. Stages two and three are doing most of the demonstrating. The mismatch between which stages get attention in a sales conversation and which stages decide whether the method works is one of the recurring features of the current moment.

VI. The four layers

Not one thing, but a spectrum.

The label “synthetic consumer” covers approaches that differ enough in methodology that grouping them under a single name is the source of most confusion in the field. Four layers exist, each with different inputs, different outputs, and different defensible uses. Mistaking one for another is where most procurement disappointments begin.

L1
Prompt–conditioned LLM personas
A demographic and psychographic profile is described in a prompt. A general–purpose LLM is asked to respond as if it were that person. Fast and plausible, but essentially a stylised average of how the model has seen that kind of person described in its training data.
Best for · hypothesis generation, fast screens
Watch · produces averages, not distributions
L2
Structured persona simulation
Personas become structured objects with stable attributes, often anchored to a personality model, with memory and continuity across questions. A persona’s answer to question seventeen is shaped by their answer to question two. Subgroup variance becomes recoverable.
Best for · concept testing, qualitative pre–fieldwork
Watch · not yet decision–grade numbers
L3
Data–anchored synthetic panels
Personas are constructed by conditioning the LLM on rich real–world data about specific individuals. Twin construction from real panels (Twin–2K–500, Park et al.’s 1,000–person interview–grounded twins) is the cleanest published example. The approach captures central tendencies well and individuals less reliably.
Best for · population–level patterns, segment forecasting
Watch · individual–level prediction is weak
L4
Calibrated simulation frameworks
A calibration layer sits on top of any of the above and adjusts the LLM output to align with human ground truth. Latent factor alignment from causal inference is one of the more promising approaches—published work reports meaningful improvement in individual–level correlation and reduction in distributional discrepancy, within the anchor envelope.
Best for · prediction inside a calibrated envelope
Requires · real anchor data for calibration
Reading the stack honestly

L1 is useful for fast hypothesis generation. L2 is useful for exploration where the deliverable is themes. L3 is useful for understanding population–level patterns. L4 is useful for prediction within a defined envelope, once an anchor exists. Selling L1 as L4 is overclaiming. Using L4 where L1 would do is overspending.
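
The difference between L1 and L2 can be sketched in a few lines of Python. The `Persona` class, its fields, and the prompt wording are hypothetical, and no model is called. The only point is that an L2 prompt carries earlier answers forward while an L1 prompt does not.

```python
from dataclasses import dataclass, field


@dataclass
class Persona:
    """Hypothetical structured persona: stable attributes plus a memory of answers."""
    name: str
    attributes: dict[str, str]
    memory: list[tuple[str, str]] = field(default_factory=list)

    def profile(self) -> str:
        traits = ", ".join(f"{key}: {value}" for key, value in self.attributes.items())
        return f"You are {self.name} ({traits})."

    def l1_prompt(self, question: str) -> str:
        """L1 style: profile plus question, with no history."""
        return f"{self.profile()}\nQuestion: {question}"

    def l2_prompt(self, question: str) -> str:
        """L2 style: earlier answers are replayed so later answers stay consistent."""
        parts = [self.profile()]
        for past_question, past_answer in self.memory:
            parts.append(f"Earlier, asked '{past_question}', you said: {past_answer}")
        parts.append(f"Question: {question}")
        return "\n".join(parts)

    def record(self, question: str, answer: str) -> None:
        """Store an answer so that later prompts can refer back to it."""
        self.memory.append((question, answer))


# Made-up persona and answers, for illustration only.
persona = Persona("Dana", {"age": "34", "household": "two adults, one child"})
persona.record("Do you cook at home?", "Most nights, but I am short on time.")

follow_up = "Would you try a meal kit at this price?"
print(persona.l1_prompt(follow_up))
print("---")
print(persona.l2_prompt(follow_up))
```

Neither prompt says anything about accuracy. Whether the replayed history produces answers that match real respondents is a question for validation (L3 and L4), not for the prompt structure.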

VII. Why the conversation is happening now

Four pressures, arriving at roughly the same time.

Synthetic methods are not new. Statistical agencies and epidemiologists have used them for decades. What is new is the combination of forces that brought the conversation into commercial market research at this particular moment.

Models crossed a usefulness threshold
LLMs can now produce responses that are not only fluent but contextually coherent across thousands of synthetic interviews. That reliability is what makes the method tractable as research.
The data–collection economy got harder
Privacy regulation is tightening. Third–party tracking is declining. Panel costs are rising. The economics of collecting fresh human data were shifting before AI arrived.
Survey fatigue became structural
Response rates have been falling for years. Synthetic methods sidestep fatigue entirely—part of their appeal, and part of why they have to be validated against something other than more surveys.
Cycle times reset across the business
Brand, product, and growth teams now run weekly or daily learning loops. Any method that compresses turnaround from weeks to hours is worth investigating.

Each force on its own would have produced a modest shift. The four together have produced something closer to a category change.

VIII. Simulation parallels

Other industries have already built simulator layers.


Three industries figured out how to use simulation alongside real–world testing long before market research caught up. Each one offers a lesson the synthetic–consumer conversation tends to skip past.

Aviation
Level D · FAA Part 60
Framework maturity ~95%

A Level D flight simulator has over a thousand individual test requirements. Pilots can complete an entire type–rating in the box. The simulator’s authority is bounded by the conditions under which it was tested.

Autonomous vehicles
1,000:1 · Waymo Carcraft
Framework maturity ~78%

Simulation is not used to replace road testing. It scales, generates edge cases on demand, and regresses every software update against the entire library of past scenarios before public roads see it.

Pharma in–silico
FDA · ASME V&V 40
Framework maturity ~88%

The FDA accepts computational modelling as supporting evidence. ASME V&V 40 specifies what the evidence has to look like: verification, validation, applicability, uncertainty quantification. Not one number. A documented credibility framework.

Consumer research
No framework yet
Framework maturity ~14%

No equivalent vocabulary exists in market research. Most current synthetic–consumer products make global accuracy claims. None of them specify the envelope of qualification. None of them would survive a Part 60 review.

The core lesson

Qualification is use–case bounded, not globally accurate. A Level D simulator for the Boeing 777 is qualified to train 777 pilots on 777 procedures. It is not qualified for anything else. The interesting question for synthetic consumers is not “is the method accurate?” but “qualified for what?”

There are three reasons the equivalent has not yet been built in research. First, the cost of failure is lower than in aviation or pharma, so the institutional pressure to qualify has not arrived. Second, the field has been drawn to single accuracy numbers that do not specify the envelope. Third, the practitioner vocabulary for use–case–bounded validity does not exist yet. That vocabulary has to come from somewhere. The interesting question is whether market research builds it itself, or imports it.

IX. Patterns in use

What teams are doing with them in real briefs.

Five patterns recur across vendor case studies, agency commentary, and the early commercial literature. None of them is treating synthetic consumers as a finished study on their own. Each is using them as a layer that changes what the rest of the research has to do.

i
Concept and pricing testing at speed
The most established commercial use. Hundreds of variants screened in a day.
In the wild: PyMC’s pricing study of an AI–generated concept vehicle, the “Ford Lumina,” tested with 1,000 synthetic US respondents, produced a price elasticity curve peaking near $20,000, closely matching real–world compact SUV averages.
Best for · stage–gate decisions
Watch · magnitudes not warranties
ii
Augmenting thin or under–represented samples
Synthetic respondents added to a small real sample to expand coverage of niche groups. The real anchor keeps the work grounded.
In the wild: Fairgen’s Fairboost generates synthetic respondents from deep learning models trained on demographic data, used to boost statistical power in subgroups that would otherwise be too thin to analyse.
Best for · rare populations
Watch · amplifies model biases too
iii
B2B and professional persona simulation
Synthetic personas of CTOs, CFOs, clinical leads—used to pressure–test value propositions before fielding with real professionals.
Worth noting: the Tirrel 2025 protocol on silicon sampling in HR and leadership research flags that B2B and employee–research contexts are much less validated than B2C consumer work.
Best for · message iteration
Watch · validation thin
iv
Sequential hybrid research
Synthetic respondents in the exploratory phase, real respondents in the confirmation phase. The synthetic round narrows the question. The human round answers it.
In the wild: Andrew Stuart, writing in Forbes, describes a pharma case where synthetic personas generated initial patient–journey hypotheses for acute myeloid leukemia, with a smaller human sample then validating and enriching the priorities.
Best for · hypothesis → confirmation
Watch · pressure to skip the human leg
v
Trend and demand forecasting
Synthetic methods used to surface signals about emerging preferences before real–world data catches up.
In the wild: AI Palette’s collaboration with Diageo on flavour forecasting, combining synthetic analysis of menus, reviews, and social media to inform NPD direction.
Best for · NPD direction
Watch · training–cutoff bias
The pattern across the patterns

Synthetic consumers are not being used to replace humans. They are being used to change which questions human respondents are asked, and when. The cost structure of inquiry shifts. The job of human fieldwork shifts with it.

X. What the evidence is showing

The evidence is improving fast, and is still incomplete.

Two patterns sit at the centre of the empirical picture, and both deserve to be held at the same time. The alignment between synthetic and human responses has improved meaningfully in a short window. The alignment is also uneven, and the gap between best–case and average–case performance is substantial.

Alignment between synthetic and human responses, across two review periods
2022–2023 (285 direct comparisons · Sarstedt review): strong alignment 25% · partial alignment 10% · minimal or none 65%
Late 2023–early 2025 (14 follow–up studies · PyMC review): strong alignment 50% · partial alignment 36% · minimal or none 14%

90%
Test–retest reliability reproduced
A 2025 study by Maier and colleagues, using Semantic Similarity Rating across 57 surveys and 9,300 human respondents, reports synthetic respondents achieving roughly 90% of human test–retest reliability and above 85% distributional similarity on purchase intent. One of the most concrete pieces of evidence in the current literature.

Alongside the empirical work, the conversation about how to interpret it splits into three recognisable positions. None is unreasonable on its own terms. Each is making a different bet about what the current numbers mean.

i
Cautious
LLMs are fundamentally different from human cognition. Strong alignment numbers risk being artefacts of narrow tasks rather than signs of underlying validity.
ii
Provisional
The technology is moving too quickly for a settled verdict. Use case–by–case validation, treat the methods as provisional, and revisit the evidence often.
iii
Confident
Modern reasoning–capable models already replicate human responses reliably across many practical tasks. The remaining work is about scope and calibration.

It remains difficult to provide a conclusive answer under which circumstances LLMs can mimic human response behavior.

Sarstedt, Adler, Rau & Schmitt, Psychology & Marketing, 2024

One caveat. The two review periods are not directly comparable. Elicitation methods improved between them. Most of the cited work has been conducted in English–language and Western consumer contexts. The trajectory is real. The exact slope is less settled.

XI. Strong signal, weak signal

Where the signal is strong, where it isn’t, and where it’s easy to be misled.

The evidence does not say synthetic consumers work, full stop. It says they work well for some tasks, less well for others, and there are a few places where the output is convincing enough to be misleading.

Where the signal is strong

Tasks with replicated alignment to human data

  • Ranking concepts, taglines, and pack options against each other
  • Purchase–intent distributions in well–studied consumer categories
  • Price–sensitivity curves where the category has rich public discussion
  • Open–ended rationale that is often more articulate than typical verbatims
  • Age and income segmentation, where conditioning reproduces real patterns
  • Continuous, always–on iteration without respondent fatigue
Where the signal is weak

Tasks where alignment is uneven or unproven

  • Gender, regional, and ethnic subgroup differences
  • Emotional, cultural, and identity–driven reasoning
  • Genuinely new categories or post–training–cutoff topics
  • Lived–experience research, where the texture is the value
  • Group dynamics outside controlled academic settings
  • B2B, employee, and leadership research (validation thin to date)

Two specific traps sit just under the surface. Both have caught experienced researchers who would have spotted any cruder mistake.

Trap one · The persona–surface illusion

A thick demographic profile makes a synthetic consumer feel specific and credible. The alignment evidence supporting broad age and income splits does not extend to gender, region, or ethnicity at the same level of confidence. Sounding specific is not the same as being accurate at the subgroup level.

Trap two · Temporal drift

Synthetic consumers live inside their model’s training window. They over–weight topics that became salient inside that window and under–weight topics that became salient after. For trend and foresight work, that drift is a structural bias that calibration does not fully repair.

Many of the shortcomings of prior attempts are not intrinsic limitations of LLMs, but rather artifacts of how responses were elicited.

Maier et al., 2025

The same large language model, asked the same question two different ways, can produce noise in one case and ~90% reliability in the other. Forcing a synthetic respondent to pick a number on a five–point scale tends to collapse the distribution toward the middle. Letting it write a free–text answer and then mapping that answer to a Likert scale via semantic similarity recovers most of the lost signal.
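
The free-text-then-map step can be sketched as follows. This is a simplified illustration, not the implementation from Maier et al. The anchor statements are hypothetical, and `embed` is a toy bag-of-words stand-in for a real sentence-embedding model (it ignores negation and word order, which a real model would not). Subtracting the lowest similarity before normalising is also an assumption of this sketch.

```python
from collections import Counter
from math import sqrt

# Hypothetical anchor statements, one per point on a five-point scale.
ANCHORS = {
    1: "i would definitely not buy this",
    2: "i would probably not buy this",
    3: "i might or might not buy this",
    4: "i would probably buy this",
    5: "i would definitely buy this",
}


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two count vectors (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def likert_pmf(response):
    """Map one free-text response to a probability distribution over scale points."""
    vector = embed(response)
    sims = {point: cosine(vector, embed(anchor)) for point, anchor in ANCHORS.items()}
    floor = min(sims.values())
    shifted = {point: sim - floor for point, sim in sims.items()}
    total = sum(shifted.values())
    if total == 0:  # no anchor stands out, so fall back to a uniform distribution
        return {point: 1 / len(ANCHORS) for point in ANCHORS}
    return {point: value / total for point, value in shifted.items()}


def expected_rating(pmf):
    """Mean scale point implied by the distribution."""
    return sum(point * prob for point, prob in pmf.items())


pmf = likert_pmf("i would definitely buy this for my family")
print({point: round(prob, 2) for point, prob in pmf.items()})
print(f"expected rating = {expected_rating(pmf):.2f}")
```

Each synthetic respondent contributes a distribution rather than a single forced number, which is one way to avoid the collapse toward the midpoint described above. With the toy embedding the "definitely not" anchor still scores well for a positive response, which is a reminder of why a real embedding model is needed in practice.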

The under–discussed insight

The question that gets the most attention—which underlying model does this product use?—is probably not the question that decides whether the output is useful. The elicitation layer is doing more of the work. “How do you ask” is a sharper procurement question than “which model do you use.”

XII. The decision frame

A working frame for where the work belongs.

Two questions, asked together, do most of the work of placing a study. What kind of signal does the decision need? Directional—a ranking, a theme, a hypothesis. Or parametric—a number that someone will commit capital or strategy against. And how reversible is the consequence if the answer is wrong? The two together produce four quadrants. Each one suggests a different shape of research, and a different role for synthetic methods inside it.

[Figure: the decision frame. Vertical axis: decision risk, from low to high. Horizontal axis: signal type, from directional to parametric.]
Q1 · Low risk, directional · Synthetic-led, the sweet spot for synthetic methods
Q2 · High risk, directional · Synthetic-led with a human gate
Q3 · Low risk, parametric · Synthetic-led with calibration
Q4 · High risk, parametric · Humans primary, synthetic augments


Most real projects are not located in a single quadrant. They are a portfolio of sub–questions, each sitting somewhere different on the map. The useful move is decomposition: take the brief apart into its sub–questions, place each one, and design each part accordingly. The synthetic phase carries the directional and low–risk parts. The human phase carries the parametric and high–risk parts. The seam between them is where the discipline lives.
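
The decomposition move can be written down as a small lookup. The quadrant labels come from the frame above. The `SubQuestion` class, the `place` function, and the example brief are hypothetical, and judging risk and signal type for each sub-question remains a human call.

```python
from dataclasses import dataclass

# Quadrant labels taken from the decision frame: (risk, signal) -> (quadrant, design).
DESIGNS = {
    ("low", "directional"): ("Q1", "Synthetic-led"),
    ("high", "directional"): ("Q2", "Synthetic-led with a human gate"),
    ("low", "parametric"): ("Q3", "Synthetic-led with calibration"),
    ("high", "parametric"): ("Q4", "Humans primary, synthetic augments"),
}


@dataclass
class SubQuestion:
    text: str
    risk: str    # "low" or "high": how costly and irreversible a wrong answer is
    signal: str  # "directional" or "parametric"


def place(sub_question):
    """Return the (quadrant, suggested design) for one sub-question."""
    return DESIGNS[(sub_question.risk, sub_question.signal)]


# A made-up brief, taken apart into sub-questions.
brief = [
    SubQuestion("Which of twelve taglines deserve a closer look?", "low", "directional"),
    SubQuestion("Which positioning should anchor the relaunch?", "high", "directional"),
    SubQuestion("Roughly how price-sensitive is the trial pack?", "low", "parametric"),
    SubQuestion("What launch price goes into the business case?", "high", "parametric"),
]

for sub_question in brief:
    quadrant, design = place(sub_question)
    print(f"{quadrant} | {design} | {sub_question.text}")
```

The lookup deliberately raises a `KeyError` for any value outside the four combinations, so a sub-question that has not been classified cannot slip through unplaced.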

Reading the frame

The quadrant is a thinking aid, not a verdict. The right answer for any specific study depends on the category, the maturity of the model in that category, and the anchor data available. Use the frame to start the conversation. Use the questions in the next section to keep it honest.

XIII. Questions to sit with

A small set of questions that travel across briefs.

For yourself, before you start
  • What kind of signal does this decision actually need: a direction, a ranking, or a precise number?
  • What is the consequence if the answer is wrong, and how reversible is it?
  • Is the question about what already exists, or about something genuinely new in the world?
  • Which subgroup splits are load–bearing, and are they the kind where synthetic methods have shown alignment?
  • What real anchor data do I already have, or could plausibly get, to ground the work?

For the team, while designing the study
  • Where in this project does synthetic data save time without compromising the answer the decision needs?
  • What part of the study explicitly needs real humans, and how small can that part be while still being credible?
  • How will we know if the synthetic phase and the human phase disagree, and what will we do if they do?
  • What does the final deliverable look like if we are honest about which findings rest on which kind of evidence?
  • Are we using synthetic data because it suits this question, or because it suits the timeline?

For the vendor or partner
  • What real data anchors your system, and how often is it refreshed?
  • How do you elicit responses (direct rating, free text, something else), and why?
  • How do you validate against real human data, and could we see the actual validation report?
  • What is the training cutoff of the underlying model, and how do you handle questions about topics that have shifted since?
  • How does your system perform on gender, regional, or ethnic subgroup splits in this category?
  • Where would you not recommend using this method, and why?

References & source set

This guide draws on a small set of recent papers and industry pieces. Each is worth reading in full if the topic is going to land on your desk often.

1. Synthetic Consumers in Market Research: A Practical Guide. Rismal, Swadi K & Fiaschi, 2026 · PyMC Labs. Structural spine of this guide.
2. LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation. Maier et al., 2025 · arXiv 2510.08338. Clearest case that elicitation matters more than model choice.
3. Synthetic Consumers: The Promise, The Reality, The Future. Rismal & Fiaschi, 2025 · PyMC Labs. Executive companion to the practical guide.
4. Can Synthetic Consumers Answer Open–Ended Questions? Downey, 2025 · PyMC Labs. Where the temporal–drift evidence shows up most directly.
5. The Power of Combining Real and Synthetic Respondents. Andrew Stuart, 2024 · Forbes Agency Council. Useful agency–side perspective on hybrid workflows.
6. Generative Agents: Interactive Simulacra of Human Behavior. Park et al., 2023 · arXiv 2304.03442. Architectural ancestor of most synthetic–respondent systems.
7. Silicon Sampling in HRM and Leadership Research. Tirrel, 2025. Recent academic protocol. Where validation is still thin.
8. Using LLMs to Generate Silicon Samples in Consumer & Marketing Research. Sarstedt, Adler, Rau & Schmitt, 2024 · Psychology & Marketing. Peer–reviewed reference behind the early alignment numbers.

The Research Edge Series publishes working notes on research methodology: on measurement, on sampling, on the design of instruments, and on the careful use of AI in qualitative analysis. Everything is open and free.