Synthetic consumers, in plain sight.
A working note on what synthetic consumers are, what they aren’t, what the research is showing, and the questions worth sitting with before they show up in your next study.
Something is shifting, and most teams can feel it before they can name it.
For most of the last decade, market research had a quiet, predictable shape. A question came in. A panel was sourced. Fieldwork ran for a few weeks. Findings landed in a deck. The cadence was slow because real people were the only reliable signal, and real people take time.
That shape is starting to bend. Brand and product teams are running concept tests overnight. Pricing studies that used to take six weeks are landing the next afternoon. The labels attached to this shift vary: synthetic consumers, synthetic audiences, silicon samples, AI personas. The marketing varies too. But the underlying claim is consistent. There is now a class of methods that can stand in for a research participant, at least some of the time, in some kinds of work, at a small fraction of the cost.
The conversation around these methods is loud, contested, and moving quickly. Some teams are leaning in. Others are pushing back. Most are somewhere in the middle, trying to work out what the methods actually are, what they actually do well, and where the limits sit.
Synthetic data could account for more than half of all market research inputs by 2027.
Qualtrics industry forecast, 2024
This guide draws on the PyMC Labs practical guide, the Maier purchase–intent paper, Forbes commentary from agency leaders, the Downey work on open–ended responses, the Park generative agents paper, and the Tirrel protocol on silicon sampling in HR research. The goal is not to recommend a tool or settle a debate. The goal is to make the shape of the conversation easier to read.
The state of the field in four numbers.
None of these numbers is a warranty. Each is a useful anchor for thinking about where the methods sit in mid–2026.
The improvement from ~35% to 86% partial alignment did not come only from better models. A meaningful portion came from better practitioner technique in eliciting responses. The section on elicitation, later in this note, returns to this point. It is one of the more important and under–discussed findings in the recent literature.
A statistical artifact wearing a voice.
A synthetic consumer is an AI–generated participant designed to respond to research stimuli the way a real consumer in a defined segment might. Asked about a concept, a price, a tagline, a category, it produces something that looks like an answer—a number, a sentence, a paragraph of rationale.
The output is text. The thing that decides whether that text is useful is everything that sits behind it: the data the model was trained on, the persona scaffolding it was prompted with, the way the question was asked, and the validation routine the team has run against real human data.
Three observations follow, each of which keeps recurring across the source literature.
- The model is not the product. Two systems built on the same underlying LLM can behave very differently depending on how they elicit responses and how they calibrate them.
- The persona is a surface, not a soul. A thick demographic profile makes a synthetic consumer feel specific. It does not make it accurate. Specific–sounding does not mean correct.
- The output is plural by design. A single synthetic respondent is not the unit of analysis. The unit is a distribution across hundreds or thousands of synthetic respondents, compared against a distribution of real responses.
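Comparing distributions rather than single answers can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: responses are integers on a five–point scale, and the comparison uses a KS–style similarity (1 minus the largest gap between the two cumulative distributions) plus a difference in mean rating. The function names and the toy ratings are hypothetical, not taken from any of the cited studies.

```python
from collections import Counter

LIKERT_POINTS = (1, 2, 3, 4, 5)


def likert_pmf(responses):
    """Share of responses at each point of a five-point scale."""
    counts = Counter(responses)
    total = len(responses)
    return [counts.get(point, 0) / total for point in LIKERT_POINTS]


def ks_similarity(pmf_a, pmf_b):
    """1 minus the largest gap between the two cumulative distributions."""
    cum_a = cum_b = 0.0
    largest_gap = 0.0
    for p_a, p_b in zip(pmf_a, pmf_b):
        cum_a += p_a
        cum_b += p_b
        largest_gap = max(largest_gap, abs(cum_a - cum_b))
    return 1.0 - largest_gap


def mean_rating(pmf):
    """Expected scale point under a probability mass function."""
    return sum(point * p for point, p in zip(LIKERT_POINTS, pmf))


# Toy inputs for illustration only; not data from any study.
synthetic = [3, 4, 4, 5, 3, 4, 2, 4, 5, 3]
human = [2, 4, 5, 5, 3, 4, 1, 3, 5, 4]

syn_pmf, hum_pmf = likert_pmf(synthetic), likert_pmf(human)
print(f"KS similarity: {ks_similarity(syn_pmf, hum_pmf):.2f}")
print(f"Mean rating gap: {mean_rating(syn_pmf) - mean_rating(hum_pmf):+.2f}")
```

With only ten responses per side, both numbers are noisy. Working with hundreds or thousands of synthetic respondents is partly about making the synthetic side of the comparison stable enough to be worth comparing.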
What you see is text. What matters is the architecture behind it. Two synthetic–consumer products that look identical in the demo can be two completely different research instruments under the hood.
Four overlapping terms, each emphasising something slightly different.
The labels are not interchangeable, even though they are often used as if they were. The PyMC practical guide proposes a useful split along two axes: how broad the application is, and how complex the underlying system tends to be.
A fifth term—silicon sampling—appears in the peer–reviewed literature. It is the rigorous label for the broader family, and it is the search term that will find serious validation studies rather than vendor marketing.
Noticing which term a vendor uses, and which one they slide between when pressed, is a small but useful signal. The honest operators can describe the difference in a sentence. The ones who treat the terms as interchangeable are usually selling more than the method can carry.
A familiar architecture, sitting under the marketing surface.
Strip away the language and the demos and the vendor logos, and most credible synthetic–consumer systems share the same five–stage architecture.
Insights are no longer one–off projects but continuous learning loops.
Rismal, Swadi K & Fiaschi, PyMC Labs (2026)
One observation worth holding onto. Stages one and four (data foundation and validation) are doing most of the load–bearing work. Stages two and three are doing most of the demonstrating. The mismatch between the stages that get attention in a sales conversation and the stages that decide whether the method works is one of the recurring features of the current moment.
Not one thing, but a spectrum.
The label “synthetic consumer” covers approaches that differ enough in methodology that grouping them under a single name is the source of most confusion in the field. Four layers exist, each with different inputs, different outputs, and different defensible uses. Mistaking one for another is where most procurement disappointments begin.
L1 is useful for fast hypothesis generation. L2 is useful for exploration where the deliverable is themes. L3 is useful for understanding population–level patterns. L4 is useful for prediction within a defined envelope, once an anchor exists. Selling L1 as L4 is overclaiming. Using L4 where L1 would do is overspending.
Four pressures, arriving at roughly the same time.
Synthetic methods are not new. Statistical agencies and epidemiologists have used them for decades. What is new is the combination of forces that brought the conversation into commercial market research at this particular moment.
Each force on its own would have produced a modest shift. The four together have produced something closer to a category change.
Other industries have already built simulator layers.
Three industries figured out how to use simulation alongside real–world testing long before market research caught up. Each one offers a lesson the synthetic–consumer conversation tends to skip past.
A Level D flight simulator has over a thousand individual test requirements. Pilots can complete an entire type–rating in the box. The simulator’s authority is bounded by the conditions under which it was tested.
In autonomous driving, simulation is not used to replace road testing. It scales testing, generates edge cases on demand, and regresses every software update against the entire library of past scenarios before public roads see it.
The FDA accepts computational modelling as supporting evidence. ASME V&V 40 specifies what the evidence has to look like: verification, validation, applicability, uncertainty quantification. Not one number. A documented credibility framework.
No equivalent vocabulary exists in market research. Most current synthetic–consumer products make global accuracy claims. None of them specify the envelope of qualification. None of them would survive the kind of qualification review that 14 CFR Part 60 requires of a flight simulator.
Qualification is use–case bounded, not globally accurate. A Level D simulator for the Boeing 777 is qualified to train 777 pilots on 777 procedures. It is not qualified for anything else. The interesting question for synthetic consumers is not “is the method accurate?” but “qualified for what?”
Three reasons the equivalent has not been built in research yet. First, the cost of failure is lower than in aviation or pharma, so the institutional pressure to qualify has not arrived. Second, the field has been drawn to single accuracy numbers that do not specify the envelope. Third, the practitioner vocabulary for use–case–bounded validity has not been developed yet. That vocabulary has to come from somewhere. The interesting question is whether market research builds it itself, or imports it.
What teams are doing with them in real briefs.
Five patterns recur across vendor case studies, agency commentary, and the early commercial literature. None of them treats synthetic consumers as a finished study in their own right. Each uses them as a layer that changes what the rest of the research has to do.
Synthetic consumers are not being used to replace humans. They are being used to change which questions human respondents are asked, and when. The cost structure of inquiry shifts. The job of human fieldwork shifts with it.
The evidence is improving fast, and is still incomplete.
Two patterns sit at the centre of the empirical picture, and both deserve to be held at the same time. The alignment between synthetic and human responses has improved meaningfully in a short window. The alignment is also uneven, and the gap between best–case and average–case performance is substantial.
Alongside the empirical work, the conversation about how to interpret it splits into three recognisable positions. None is unreasonable on its own terms. Each is making a different bet about what the current numbers mean.
It remains difficult to provide a conclusive answer under which circumstances LLMs can mimic human response behavior.
Sarstedt, Adler, Rau & Schmitt, Psychology & Marketing, 2024
One caveat. The two review periods are not directly comparable. Elicitation methods improved between them. Most of the cited work has been conducted in English–language and Western consumer contexts. The trajectory is real. The exact slope is less settled.
Where the signal is strong, where it isn’t, and where it’s easy to be misled.
The evidence does not say synthetic consumers work, full stop. It says they work well for some tasks, less well for others, and there are a few places where the output is convincing enough to be misleading.
Tasks with replicated alignment to human data
- Ranking concepts, taglines, and pack options against each other
- Purchase–intent distributions in well–studied consumer categories
- Price–sensitivity curves where the category has rich public discussion
- Open–ended rationale that is often more articulate than typical verbatims
- Age and income segmentation, where conditioning reproduces real patterns
- Continuous, always–on iteration without respondent fatigue
Tasks where alignment is uneven or unproven
- Gender, regional, and ethnic subgroup differences
- Emotional, cultural, and identity–driven reasoning
- Genuinely new categories or post–training–cutoff topics
- Lived–experience research, where the texture is the value
- Group dynamics outside controlled academic settings
- B2B, employee, and leadership research (validation thin to date)
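Ranking concepts sits at the top of the first list, and rank agreement between a synthetic panel and a human panel can be checked directly. The sketch below uses Spearman's rank correlation, a standard measure swapped in here for illustration; the cited studies may use other measures. The concept labels and mean scores are invented toy values, and ties receive their average rank.

```python
def average_ranks(values):
    """Rank values from 1 (lowest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share one average rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Toy mean purchase-intent scores for five concepts (illustrative only).
synthetic_means = {"A": 3.9, "B": 3.4, "C": 4.1, "D": 2.8, "E": 3.5}
human_means = {"A": 3.6, "B": 3.5, "C": 4.0, "D": 2.9, "E": 3.2}

concepts = sorted(synthetic_means)
rho = spearman([synthetic_means[c] for c in concepts],
               [human_means[c] for c in concepts])
print(f"Rank agreement (Spearman rho): {rho:.2f}")
```

A high rank correlation says the ordering holds up. It says nothing about whether the absolute levels match, which is a separate question from the ranking.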
Two specific traps sit just under the surface. Both have caught experienced researchers who would have spotted any cruder mistake.
The persona–surface illusion
A thick demographic profile makes a synthetic consumer feel specific and credible. The alignment evidence supporting broad age and income splits does not extend to gender, region, or ethnicity at the same level of confidence. Specific–sounding is not accurate–at–the–subgroup–level.
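One way to guard against the persona–surface illusion is to report alignment per subgroup, not only in aggregate. The sketch below uses hypothetical function names, a hypothetical tolerance, and toy ratings. It shows how a pooled comparison can look perfectly aligned while each regional subgroup is off by a full scale point.

```python
from collections import defaultdict


def pooled_mean(rows, source):
    """Mean rating across all rows from one source."""
    ratings = [rating for _, src, rating in rows if src == source]
    return sum(ratings) / len(ratings)


def subgroup_mean_gaps(rows):
    """Mean synthetic-minus-human rating gap per subgroup.

    Each row is (subgroup, source, rating), where source is "synthetic" or "human".
    """
    sums = defaultdict(lambda: {"synthetic": [0.0, 0], "human": [0.0, 0]})
    for subgroup, source, rating in rows:
        running = sums[subgroup][source]
        running[0] += rating
        running[1] += 1
    gaps = {}
    for subgroup, by_source in sums.items():
        (syn_sum, syn_n), (hum_sum, hum_n) = by_source["synthetic"], by_source["human"]
        if syn_n and hum_n:  # a gap is only defined when both sources are present
            gaps[subgroup] = syn_sum / syn_n - hum_sum / hum_n
    return gaps


def flag_subgroups(gaps, tolerance=0.25):
    """Subgroups whose absolute gap exceeds a tolerance (hypothetical default)."""
    return sorted(g for g, gap in gaps.items() if abs(gap) > tolerance)


# Toy rows (illustrative only): the pooled means match, the subgroups do not.
rows = [
    ("North", "synthetic", 4), ("North", "synthetic", 4),
    ("North", "human", 3), ("North", "human", 3),
    ("South", "synthetic", 3), ("South", "synthetic", 3),
    ("South", "human", 4), ("South", "human", 4),
]

print("Pooled gap:", pooled_mean(rows, "synthetic") - pooled_mean(rows, "human"))
gaps = subgroup_mean_gaps(rows)
print("Subgroup gaps:", gaps)
print("Flagged:", flag_subgroups(gaps))
```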
Temporal drift
Synthetic consumers live inside their model’s training window. They over–weight topics that became salient inside that window and under–weight topics that became salient after. For trend and foresight work, that drift is a structural bias that calibration does not fully repair.
Many of the shortcomings of prior attempts are not intrinsic limitations of LLMs, but rather artifacts of how responses were elicited.
Maier et al., 2025
The same large language model, asked the same question two different ways, can produce noise in one case and reach ~90% of human test–retest reliability in the other. Forcing a synthetic respondent to pick a number on a five–point scale tends to collapse the distribution toward the middle. Letting it write a free–text answer and then mapping that answer to a Likert scale via semantic similarity recovers most of the lost signal.
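The semantic–similarity mapping can be sketched as follows. This is a simplified illustration in the spirit of the method Maier et al. describe, not their implementation. The two–dimensional vectors are made–up stand–ins for what a sentence–embedding model would return for each anchor statement and for the free–text answer. The shift–then–normalise step with an optional temperature is an assumption about one reasonable way to turn similarities into a probability distribution.

```python
import math

# Made-up 2-d vectors standing in for a sentence-embedding model's output
# (an assumption): one anchor statement per point on a five-point scale.
ANCHOR_VECTORS = {
    1: (1.0, 0.0),   # "I would definitely not buy this"
    2: (0.7, 0.7),   # "I probably would not buy this"
    3: (0.0, 1.0),   # "I might or might not buy this"
    4: (-0.7, 0.7),  # "I would probably buy this"
    5: (-1.0, 0.0),  # "I would definitely buy this"
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def ssr_distribution(answer_vector, temperature=1.0):
    """Map an embedded free-text answer to a probability over scale points."""
    sims = {point: cosine(answer_vector, vec) for point, vec in ANCHOR_VECTORS.items()}
    floor = min(sims.values())
    # Shift so the least similar anchor gets zero weight; temperature < 1 sharpens.
    weights = {point: (s - floor) ** (1.0 / temperature) for point, s in sims.items()}
    total = sum(weights.values())
    if total == 0:  # answer equally similar to every anchor: fall back to uniform
        return {point: 1 / len(weights) for point in weights}
    return {point: w / total for point, w in weights.items()}


# Toy vector for a free-text answer such as "I'd pick it up next time I shop".
answer = (-0.8, 0.6)
pmf = ssr_distribution(answer)
print({point: round(p, 2) for point, p in pmf.items()})
print("Expected rating:", round(sum(point * p for point, p in pmf.items()), 2))
```

Because the output is a distribution rather than a single forced choice, an answer that sits between "probably" and "definitely" keeps that spread, which is the signal a direct five–point rating tends to collapse toward the middle.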
The question that gets the most attention—which underlying model does this product use?—is probably not the question that decides whether the output is useful. The elicitation layer is doing more of the work. “How do you ask” is a sharper procurement question than “which model do you use.”
A working frame for where the work belongs.
Two questions, asked together, do most of the work of placing a study. What kind of signal does the decision need? Directional—a ranking, a theme, a hypothesis. Or parametric—a number that someone will commit capital or strategy against. And how reversible is the consequence if the answer is wrong? The two together produce four quadrants. Each one suggests a different shape of research, and a different role for synthetic methods inside it.
Most real projects are not located in a single quadrant. They are a portfolio of sub–questions, each sitting somewhere different on the map. The useful move is decomposition: take the brief apart into its sub–questions, place each one, and design each part accordingly. The synthetic phase carries the directional and low–risk parts. The human phase carries the parametric and high–risk parts. The seam between them is where the discipline lives.
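The decomposition step can be made concrete with a small sketch. The two booleans encode the frame's two questions, and the "synthetic" and "human" labels follow the split this section describes (directional and low–risk parts to the synthetic phase, parametric and high–risk parts to the human phase). The class, the handling of the two mixed quadrants, and the example sub–questions are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SubQuestion:
    text: str
    parametric: bool  # True if someone will commit capital or strategy against a number
    reversible: bool  # True if a wrong answer is cheap to undo


def phase_for(question: SubQuestion) -> str:
    """Suggest which phase carries a sub-question, using the two framing questions."""
    if not question.parametric and question.reversible:
        return "synthetic"
    if question.parametric and not question.reversible:
        return "human"
    # The two mixed quadrants are left to judgement here (an assumption, not a rule).
    return "mixed: decide case by case"


# A hypothetical brief decomposed into sub-questions.
brief = [
    SubQuestion("Which of six tagline routes are worth developing?", parametric=False, reversible=True),
    SubQuestion("What price will the launch plan commit to?", parametric=True, reversible=False),
    SubQuestion("Which positioning goes into the national campaign?", parametric=False, reversible=False),
]

for question in brief:
    print(f"{phase_for(question):<28} {question.text}")
```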
The quadrant is a thinking aid, not a verdict. The right answer for any specific study depends on the category, the maturity of the model in that category, and the anchor data available. Use the frame to start the conversation. Use the questions in the next section to keep it honest.
A small set of questions that travel across briefs.
- What kind of signal does this decision actually need—a direction, a ranking, or a precise number?
- What is the consequence if the answer is wrong, and how reversible is it?
- Is the question about what already exists, or about something genuinely new in the world?
- Which subgroup splits are load–bearing, and are they the kind where synthetic methods have shown alignment?
- What real anchor data do I already have, or could plausibly get, to ground the work?
- Where in this project does synthetic data save time without compromising the answer the decision needs?
- What part of the study explicitly needs real humans, and how small can that part be while still being credible?
- How will we know if the synthetic phase and the human phase disagree, and what will we do if they do?
- What does the final deliverable look like if we are honest about which findings rest on which kind of evidence?
- Are we using synthetic data because it suits this question, or because it suits the timeline?
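For the question of how to tell whether the synthetic and human phases disagree, one simple rule, offered as an assumption rather than a standard, is to check whether the synthetic estimate falls outside a confidence interval around the human estimate. The sketch uses the Wilson score interval for a proportion; the function names and figures are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def phases_disagree(synthetic_share, human_successes, human_n):
    """True if the synthetic estimate falls outside the human sample's interval."""
    low, high = wilson_interval(human_successes, human_n)
    return not (low <= synthetic_share <= high)


# Toy figures (illustrative only): top-two-box purchase intent, 110 of 200 humans.
print(phases_disagree(0.65, human_successes=110, human_n=200))
print(phases_disagree(0.57, human_successes=110, human_n=200))
```

This rule treats the synthetic number as fixed and accounts only for sampling noise on the human side. It will not catch a systematic bias that happens to land inside the interval.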
Questions to put to a vendor
- What real data anchors your system, and how often is it refreshed?
- How do you elicit responses—direct rating, free text, something else—and why?
- How do you validate against real human data, and could we see the actual validation report?
- What is the training cutoff of the underlying model, and how do you handle questions about topics that have shifted since?
- How does your system perform on gender, regional, or ethnic subgroup splits in this category?
- Where would you not recommend using this method, and why?
This guide draws on a small set of recent papers and industry pieces. Each is worth reading in full if the topic is going to land on your desk often.
- Synthetic Consumers in Market Research: A Practical Guide. Rismal, Swadi K & Fiaschi, 2026 · PyMC Labs.
- LLMs Reproduce Human Purchase Intent via Semantic Similarity Elicitation. Maier et al., 2025 · arXiv 2510.08338.
- Synthetic Consumers: The Promise, The Reality, The Future. Rismal & Fiaschi, 2025 · PyMC Labs.
- Can Synthetic Consumers Answer Open–Ended Questions? Downey, 2025 · PyMC Labs.
- The Power of Combining Real and Synthetic Respondents. Andrew Stuart, 2024 · Forbes Agency Council.
- Generative Agents: Interactive Simulacra of Human Behavior. Park et al., 2023 · arXiv 2304.03442.
- Silicon Sampling in HRM and Leadership Research. Tirrel, 2025.
- Using LLMs to Generate Silicon Samples in Consumer & Marketing Research. Sarstedt, Adler, Rau & Schmitt, 2024 · Psychology & Marketing.
The Research Edge Series publishes working notes on research methodology—on measurement, on sampling, on the design of instruments, on the careful use of AI in qualitative analysis. Everything is open and free.