Research Area 07

Agent-to-Agent Commerce and Model-Dependent Outcomes

The buyer on the other side of your page is increasingly an agent — and which agent it is changes the outcome. A benchmark of agent-to-agent negotiation in consumer markets, where both the shopper and the merchant delegate to AI agents that negotiate price and close the deal without a human in the loop, found that the model behind the buyer is a first-order determinant of who captures value. Automated commerce is not a level playing field. It is an imbalanced game decided in part by model choice.


Key takeaways

Automated deal-making is imbalanced by model

Across a nine-model benchmark of buyer and seller agents negotiating consumer transactions, different models secured significantly different outcomes for the same user in the same scenario. Stronger reasoning models adapted to budget constraints and adjusted their deal rate to the negotiation dynamics; weaker models did not. The conclusion is uncomfortable for anyone assuming automation equalizes outcomes: when both sides delegate, the quality of your agent — not just your bargaining position — determines what you walk away with.

Weaker agents fail expensively

The failure modes are concrete and financial. Weaker models exceeded their stated budget in more than 10% of negotiations, with out-of-bounds rates reaching roughly 18.5% for the weakest agent in tight-budget scenarios. The benchmark also surfaced a profit-versus-completion trade-off: some agents closed more deals at thin margins, while others held out for better terms and closed fewer. A buyer agent that overspends and a seller agent that gives away margin are both leaking value their principal never agreed to lose.

Behavioral anomalies become real losses

Because the agents transact, their quirks carry prices. The study documents behavioral anomalies that translate directly into financial loss for both consumers and merchants — overspending against a budget, accepting unreasonable terms, conceding under pressure that a competent negotiator would resist. In an agent-mediated market, a model's behavioral tendency is not a curiosity to note. It is a line item that shows up in the final price.

Model choice is now a demand-side variable

As AI shopping agents move from demo to default — browsing, comparing, and buying on a consumer's behalf — the model behind the buyer reshapes the conversion surface a brand faces. The same page presented to a strong agent and a weak agent yields different selections and different willingness to pay. For marketers, this reframes an old question. It is no longer only who your buyer is, but which agent is reading your page on their behalf.

A simulation is only as valid as its benchmarked agent

If outcomes depend on the model, then the model is the instrument — and an uncalibrated instrument returns confident, plausible, wrong readings. eLLMo treats the simulating agent as exactly that: a measurement instrument, pinned to a specific model version, so a finding reflects buyer psychology rather than an artifact of which model happened to run. When the underlying model changes, the change is logged and reported as a shift in the instrument, not silently absorbed into the result.

The architecture of an automated deal

Most research on AI in commerce examines one agent in isolation — a buyer assistant searching for the lowest price, or a seller tool optimizing product descriptions. The A2A-NT benchmark by Zhu, Sun, Nian, South, Pentland, and Pei takes the next step: both sides delegate. A buyer agent and a merchant agent meet in natural language, negotiate price, and close the transaction with no human in the loop on either side. Researchers ran this across nine models spanning a wide range of reasoning capability — from strong, frontier-class reasoning models down to compact models at the lower bound of what a consumer product would realistically deploy — each under varying buyer budgets and merchant cost floors.

What the benchmark measured was not whether agents could negotiate. It measured whether they negotiated equally well, whether they stayed within their principals' stated constraints, and whether the deals they closed actually served the interests they were given. The answers, in order, are no, no, and often not.

Outcomes are decided by the model, not just the scenario

The nine agents were not interchangeable. Given an identical scenario (the same item, the same budget, the same wholesale floor), different models produced significantly different outcomes for the same user. Stronger reasoning models read the budget constraint, tracked the negotiation state, and adjusted their offers in response to what the counterpart signaled. They protected their principal's position across rounds. Weaker models did not adapt the same way. They pursued roughly the same strategy regardless of how the negotiation unfolded, conceding value under pressure and failing to hold ground when holding ground was the financially sound move.

The spread in outcomes was systematic, not random. It correlated with model capability across scenario types — making it a structural property of automated commerce rather than a one-off artifact of a particular negotiation. When both sides delegate, the quality of the agent behind the buyer is a first-order determinant of who captures surplus.

This reframes a foundational question for anyone selling into an agent-mediated market. The conversion surface a brand faces is not fixed when the buyer is an agent. It shifts with the model behind that agent.

The failure modes are concrete and financial

The benchmark made the cost of weak-model behavior measurable. Weaker models exceeded their stated budget in more than 10% of negotiations. In tight-budget scenarios, where the gap between the buyer's budget and the merchant's floor was narrow, the weakest agent in the study breached its constraint at a rate of roughly 18.5%. These are not modeling errors. They are the behavioral baseline of deploying a weaker model on a real economic task.
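The breach rate above is a simple ratio, and it helps to see what goes into it. The following is a minimal sketch, not the benchmark's own code: the Negotiation record, its field names, and the sample values are all hypothetical, and the denominator used here (every negotiation, closed or not) is an assumption that may differ from the study's definition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Negotiation:
    model: str                    # hypothetical model label
    budget: float                 # ceiling the buyer agent was told not to exceed
    final_price: Optional[float]  # None when the negotiation ended without a deal


def is_breach(run: Negotiation) -> bool:
    """A breach is a closed deal whose price exceeds the stated budget."""
    return run.final_price is not None and run.final_price > run.budget


def breach_rate(runs: list[Negotiation]) -> float:
    """Share of all negotiations (assumed denominator) that closed above budget."""
    if not runs:
        return 0.0
    return sum(1 for r in runs if is_breach(r)) / len(runs)


def mean_overspend(runs: list[Negotiation]) -> float:
    """Average amount paid above budget, taken over breaching deals only."""
    gaps = [r.final_price - r.budget for r in runs if is_breach(r)]
    return sum(gaps) / len(gaps) if gaps else 0.0


# Made-up records for illustration only; these are not benchmark data.
runs = [
    Negotiation("weak-model", 100.0, 95.0),   # closed inside budget
    Negotiation("weak-model", 100.0, 112.0),  # breach, 12 over
    Negotiation("weak-model", 100.0, None),   # walked away
    Negotiation("weak-model", 100.0, 104.0),  # breach, 4 over
]

print(breach_rate(runs))     # 0.5
print(mean_overspend(runs))  # 8.0
```

Tracking the overspend alongside the rate separates an agent that drifts slightly past its ceiling from one that ignores it.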

The mechanism is instructive. A stronger model, when its budget is tight, recalibrates — it holds its opening offer lower, reads the merchant's counter as a signal about the floor, and exits rather than overspend. A weaker model tracks the negotiation less precisely. It anchors to the last price named, follows the conversational momentum rather than the constraint, and drifts past the ceiling it was told not to cross. The output looks like a completed deal. The receipt tells a different story.

Alongside the budget-breach problem, the benchmark surfaced a distinct profit-versus-completion trade-off. Some agents closed more deals, but at margins thin enough to neutralize the gain. Others held firm on price and walked away from negotiations that would have ended in value loss, closing fewer deals but capturing more from the ones they did close. These are genuinely different objectives, and it is the model's disposition, not the instruction or the scenario, that decides which one an agent pursues.
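The trade-off is easier to see with numbers. The sketch below compares two hypothetical seller agents on deal rate and profit. The function name, the cost floor, and every price are invented for illustration and do not come from the benchmark.

```python
from typing import Optional


def completion_and_profit(
    prices: list[Optional[float]], cost_floor: float
) -> tuple[float, float, float]:
    """Return (deal rate, profit per closed deal, total profit) for a seller agent.

    prices holds one closing price per negotiation, or None when no deal closed.
    """
    closed = [p for p in prices if p is not None]
    deal_rate = len(closed) / len(prices) if prices else 0.0
    total_profit = sum(p - cost_floor for p in closed)
    profit_per_deal = total_profit / len(closed) if closed else 0.0
    return deal_rate, profit_per_deal, total_profit


COST_FLOOR = 80.0  # hypothetical merchant cost

# An agent that always closes, at prices barely above cost.
eager = completion_and_profit([82.0, 81.0, 83.0, 82.0], COST_FLOOR)

# An agent that walks away from half its negotiations but holds its price.
firm = completion_and_profit([95.0, None, 92.0, None], COST_FLOOR)

print(eager)  # (1.0, 2.0, 8.0)
print(firm)   # (0.5, 13.5, 27.0)
```

In this made-up case the agent that closes half as many deals captures more than three times the total profit, which is why deal rate alone is a misleading score.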

Consider a concrete case: a buyer agent given a budget ceiling and sent to purchase a consumer product. The stronger agent opens below market, reads the merchant's resistance as a signal about the floor, makes one calibrated concession, and closes inside the budget. The weaker agent, facing the same merchant and the same ceiling, follows the conversational pull — each round it concedes a little more, and it finalizes above what it was told to spend. The consumer sees the confirmation. They never see the gap between what they authorized and what their agent paid.

Automation does not remove the risk of a bad deal. It relocates the risk into the model's behavior, where the principal cannot watch it happen.

Behavioral anomalies as a line item

When agents negotiate but do not transact, behavioral quirks are academic. When agents close real deals, every quirk carries a price. The benchmark documents anomalies that translate directly into financial loss on both sides: overspending against a stated budget, accepting terms a human negotiator would recognize as unreasonable, and conceding under conversational pressure when a firm position was the sound move.

These anomalies appeared in standard negotiation scenarios across the nine-model range. For merchants, the implication is that buyer agents arriving at their product pages carry systematically different failure modes depending on their model. Some of those failure modes produce overpayment, which looks like revenue but represents a breach of the consumer's stated limit. For consumers, the implication is more direct: their agent is not a neutral proxy. It is a model with behavioral tendencies that determine, in part, what they pay.

Model choice becomes a demand-side variable

Put the findings together and the consequence for brand marketers is structural. As AI shopping tools move from demo to default — browsing, filtering, comparing, and executing purchases on a consumer's behalf — the model behind the buyer becomes a component of your demand. The same page, read by a stronger reasoning model and a weaker one, yields different selections, different willingness to pay, and different deal rates. Brands accustomed to asking 'who is my buyer' now face a second question that sits upstream: which agent is reading the page for them.

The implications extend beyond pricing. Value presentation, constraint legibility, and credibility signals all land differently on different models. A page structured to communicate value to a human buyer may not communicate the same value structure to a weak agent that follows conversational momentum rather than parsing the page's argument. How you sequence justification, how you handle price anchors, and how you present offer terms all affect what an agent buyer reads, and the agent reads it without a human in the room to correct a misread.

This is not a prediction about a future state. The nine-model benchmark runs on models already embedded in consumer shopping assistants. The heterogeneity in what those agents do is not random noise. It is a function of model choice.

eLLMo as a calibrated instrument against known agents

If outcomes depend on the model, the model is the measurement instrument. An uncalibrated instrument returns output that looks confident and is wrong in ways the user cannot detect. The weak-model failures in the A2A-NT benchmark — budget breaches above 10%, thin-margin closures, acceptance of unreasonable terms — are exactly the class of failure that passes as a plausible result. A deal closed. A page evaluated. A report returned. The value quietly missed.

eLLMo treats the simulating agent as the instrument it is: pinned to a specific model version, disclosed in the provenance of every finding, and re-benchmarked when the model changes. The version tag is not administrative housekeeping. It is the difference between a finding that reflects buyer psychology and one that reflects an artifact of which model happened to run. When the underlying model is updated, that shift is reported as a change in the instrument — not silently absorbed into the result.
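One way to picture version pinning is as a provenance record attached to every finding. What follows is a minimal sketch under stated assumptions, not eLLMo's actual implementation: the Instrument and Finding types, the instrument_shift helper, and the model names and version strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instrument:
    model_name: str     # hypothetical model identifier
    model_version: str  # exact pinned version string


@dataclass(frozen=True)
class Finding:
    summary: str
    instrument: Instrument  # provenance: which model produced this finding


def instrument_shift(previous: Finding, current: Finding) -> Optional[str]:
    """Return a note when two findings came from different instruments, else None."""
    if previous.instrument == current.instrument:
        return None
    old, new = previous.instrument, current.instrument
    return (
        f"instrument changed: {old.model_name}@{old.model_version} "
        f"-> {new.model_name}@{new.model_version}"
    )


# Invented example values for illustration.
baseline = Finding("price anchor read as a ceiling", Instrument("example-model", "2025-01"))
rerun = Finding("price anchor read as a ceiling", Instrument("example-model", "2025-06"))

print(instrument_shift(baseline, baseline))  # None
print(instrument_shift(baseline, rerun))
# instrument changed: example-model@2025-01 -> example-model@2025-06
```

Because the instrument travels with the finding, a comparison across runs can surface a model change explicitly instead of letting it pass as a change in buyer behavior.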

The benchmark design principle extends to what eLLMo simulates. An AI-shaped buyer is not a generic agent asked to evaluate a page. It is an agent conditioned on a specific buyer segment, operating under the budget and constraint logic that defines that segment, run against a model whose behavioral properties are known. The goal is not a verdict. It is a ranked account of where the page creates friction for that specific buyer profile, and why — output that is only as valid as the instrument producing it.

The forward commercial position

The brands that act on this now are building an advantage that compounds. Simulating against agent buyers before those buyers dominate traffic means reading the new conversion surface while competitors still assume the buyer is always human. It means testing how the page communicates value, constraints, and credibility to a model-mediated reader — and fixing the failures before the traffic that finds them costs anything.

The risk that automation relocates into model behavior does not announce itself. It shows up in the price paid, the deal closed at thin margin, the consumer who overspent without knowing it, the merchant who conceded value because the buyer agent never had to fight for it. Measuring that risk — running it through a calibrated instrument against a known agent population — is what makes the new conversion surface legible before the spend commits to it.

Methodology note

Built on the research. Designed for decisions.

eLLMo simulation surfaces ranked friction patterns across calibrated buyer personas — specific findings, traceable to buyer segments, actionable on the same day. The methodology is grounded in peer-reviewed research on AI agent behavior and OCEAN psychometrics. The output is a prioritized list of what to fix before your campaign launches — and why it matters for each buyer type.