Research Area 05

AI Agents as Economic Actors: What Live Negotiation Reveals About Simulation

Anthropic's Project Deal experiment did not set out to validate buyer simulation, but it produced unusually direct empirical evidence that AI agents can carry out economic reasoning rather than merely imitate it. Sixty-nine employees handed their preferences to Claude agents, which then negotiated and executed 186 real trades in natural language across Slack: $4,000 in total value, with no human intervention. The findings speak directly to simulation methodology: what makes an AI agent a valid proxy for human economic decision-making, and what silently breaks that validity.

AI agents execute real economic transactions through natural language reasoning

Project Deal agents identified potential trading matches, proposed prices, negotiated counteroffers, and closed deals across a diverse transaction set (physical goods, services, and experiential trades like dog-sitting) without pre-defined negotiation protocols. The result was 186 closed deals worth $4,000 in total, with no human in the loop at any point. This was not a constrained laboratory task: the agents represented real people's economic interests in an open-ended, multi-party setting. For buyer simulation, this supports the foundational premise that an AI agent reasoning in natural language about a landing page is doing something economically real, not producing a plausible-looking facsimile of evaluation.

Simulation instruments fail silently — and users don't notice

Project Deal's most consequential finding is not the performance gap between Opus and Haiku agents. It is that participants represented by the weaker model rated deal fairness identically to Opus users (approximately 4 on a 7-point scale), despite completing objectively fewer deals and extracting less value per sale. The output looked right. The experience felt fair. The gap was invisible from inside. For DTC teams using AI simulation to evaluate landing pages, this is the critical calibration question: how would you know if your simulation instrument were systematically missing signal? A weak-model evaluation produces plausible findings, surfaces real-sounding friction, and returns a structured report, yet it may have missed the friction patterns that matter most. The failure mode does not announce itself.
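One way to make the calibration question concrete is to compare a candidate evaluator's findings against a trusted reference set, such as the output of a stronger model or a human audit. The sketch below is a minimal illustration under that assumption. The function name `missed_findings` and the friction labels are hypothetical, and nothing here describes how Project Deal or any specific product measures the gap.

```python
def missed_findings(reference: set[str], candidate: set[str]) -> tuple[float, set[str]]:
    """Return (recall, missed) for candidate findings against a reference set.

    recall is the share of reference findings the candidate also surfaced;
    missed is the set of reference findings the candidate did not report.
    """
    if not reference:
        raise ValueError("reference set must not be empty")
    missed = reference - candidate
    recall = 1 - len(missed) / len(reference)
    return recall, missed


# Hypothetical friction labels, for illustration only.
strong_model = {
    "price_before_value",
    "unclear_return_policy",
    "no_social_proof",
    "shipping_cost_hidden",
}
weak_model = {"price_before_value", "no_social_proof", "generic_headline"}

recall, missed = missed_findings(strong_model, weak_model)
print(f"recall={recall:.2f}, missed={sorted(missed)}")
```

The weak evaluator's report here still contains three plausible findings, which is why it would look complete on its own. Only the comparison against a reference exposes what it left out.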

The reasoning trace, not the verdict, is where buyer psychology lives

Project Deal agents negotiated in natural language: proposing, countering, reasoning about value. Prompting agents to negotiate aggressively had no statistically significant effect on outcomes. The model's embedded sense of value (what it understood about the transaction, the items, and the context) determined behavior far more than the instruction layer did. This maps directly to simulation methodology: the actionable output of an AI buyer evaluation is not the pass/fail verdict or the conversion probability. It is the reasoning trace: why the agent hesitated, what it weighted, what it misread. That trace is more useful for landing page copy decisions than any score. A finding like 'the agent paused because the price anchor appeared before the value justification, which it interpreted as an attempt to pre-empt objections' tells a copywriter exactly what to reorder. A conversion rate does not.
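To illustrate why a trace carries more than a score, the sketch below treats each step of an agent's reasoning as a small record and counts hesitations per page element. The `TraceStep` fields, persona names, and element labels are all hypothetical, chosen for illustration rather than taken from any real schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceStep:
    persona: str        # hypothetical buyer persona label
    page_element: str   # the part of the page the agent was reading
    hesitated: bool     # whether the agent paused or raised a doubt here
    reason: str         # the agent's own stated reasoning


def rank_friction(steps: list[TraceStep]) -> list[tuple[str, int]]:
    """Count hesitations per page element, most frequent first."""
    counts = Counter(s.page_element for s in steps if s.hesitated)
    return counts.most_common()


steps = [
    TraceStep("skeptic", "price_anchor", True,
              "price appeared before the value justification"),
    TraceStep("bargain_hunter", "price_anchor", True,
              "no comparison point for the listed price"),
    TraceStep("skeptic", "guarantee", True,
              "refund terms were not stated"),
    TraceStep("bargain_hunter", "hero_headline", False,
              "benefit was clear"),
]

ranking = rank_friction(steps)
print(ranking)
```

The ranking says where friction clusters, while the `reason` field on each step keeps the explanation a copywriter would act on.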

Methodology note

Built on the research. Designed for decisions.

eLLMo simulation surfaces ranked friction patterns across calibrated buyer personas: specific findings, traceable to buyer segments, actionable on the same day. The methodology is grounded in peer-reviewed research on AI agent behavior and OCEAN psychometrics. The output is a prioritized list of what to fix before your campaign launches, and why each fix matters for each buyer type.
