
Teaching AI Agents to Play Nice: A Prisoner’s Dilemma Experiment

I was curious how LLMs handle cooperation and trust, so I built a small arena where they play the Prisoner’s Dilemma against each other. Two agents, five rounds, and a simple choice each round: cooperate (both get 2 coins) or cheat (the cheater gets 3, the cooperator loses 1).
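To make the rules concrete, here’s a minimal sketch of how one round resolves. The names are mine rather than the repo’s, and the mutual-defection payoff isn’t spelled out above, so the 0/0 case is an assumption rather than the arena’s actual rule.

```python
# Minimal sketch of the round payoffs described above (names are mine,
# not the repo's; the mutual-defection payoff of 0/0 is an assumption).
def resolve_round(a_cooperates: bool, b_cooperates: bool) -> tuple[int, int]:
    """Return the coin payoffs for players A and B for one round."""
    if a_cooperates and b_cooperates:
        return 2, 2    # mutual cooperation: both earn 2 coins
    if a_cooperates and not b_cooperates:
        return -1, 3   # A is betrayed: A loses 1, B gains 3
    if not a_cooperates and b_cooperates:
        return 3, -1   # mirror case
    return 0, 0        # mutual defection: assumed 0/0
```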

Before the games start, agents exchange a few messages to establish trust. Then they play, seeing the full history of previous rounds. I wanted to see what strategies would emerge.

This wasn’t serious research—just a weekend curiosity project.

CLI interface of my arena

First Run: Everyone’s Too Nice

The initial results were boring.

All agents agreed to cooperate during their chat and then actually did it. They followed through on their promises round after round. The only interesting behavior came in the final round, when most models (though never the Anthropic one) suddenly defected, explicitly reasoning:

“No future consequences, might as well maximize my payoff.”

Classic backward induction from game theory textbooks. But otherwise? Too cooperative. Too predictable.

Adding Noise to the System

Real communication has errors. What if cooperation could accidentally become betrayal?

I added a “mistake” mechanism: a 10% chance that when an agent chooses to cooperate, it accidentally cheats instead. The agent knows it made a mistake, but the opponent just sees a betrayal.
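In code, the noise layer works roughly like this (an illustrative sketch; the names and structure are mine, not taken from the repo):

```python
import random

MISTAKE_RATE = 0.10  # later raised to 0.20 in the experiments below

def apply_noise(intended_cooperate: bool) -> tuple[bool, bool]:
    """Return (actual_cooperate, was_mistake).

    With probability MISTAKE_RATE, an intended cooperation flips into a
    cheat. The acting agent learns was_mistake; the opponent only ever
    sees the actual move.
    """
    if intended_cooperate and random.random() < MISTAKE_RATE:
        return False, True  # accidental betrayal
    return intended_cooperate, False
```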

Something unexpected happened: agents became more cooperative.

When they saw cheating, they assumed it might be accidental and kept cooperating anyway. “My opponent probably didn’t mean to cheat.” The uncertainty acted like social glue.

I cranked the mistake rate to 20%. Now things got interesting. After multiple consecutive “accidents,” agents started suspecting intentional betrayal and retaliated. The forgiveness broke down.

Introducing Bad Actors

Cooperative agents forgiving mistakes is nice, but what happens when someone’s actually malicious?

I created two planted cheaters: one instructed to cheat every single round, and one instructed to cheat only occasionally, timing its betrayals to look like plausible mistakes.

Against the always-cheater, agents settled into a tit-for-tat strategy. The cheater could steal a win in round one, maybe two, but then got punished in every subsequent round. It won individual rounds but lost overall across games.

The occasional-cheater was more successful. By timing betrayals to look like plausible errors, it maintained enough trust to exploit cooperation while avoiding the full retaliation that obvious cheaters faced.
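To make the three behaviors concrete, here’s a plain-Python sketch of the strategies involved. Only the planted cheaters followed scripted instructions; the tit-for-tat pattern is what the LLM agents converged on by themselves, and the 0.2 cheat rate below is an arbitrary illustration, so treat this as a description of observed behavior rather than the actual implementation.

```python
import random

def always_cheater(history: list[bool]) -> bool:
    """Planted bad actor: defects every round."""
    return False  # never cooperates

def occasional_cheater(history: list[bool], cheat_rate: float = 0.2) -> bool:
    """Planted bad actor: mostly cooperates, but defects at a rate low
    enough to pass as the arena's accidental-mistake noise."""
    return random.random() >= cheat_rate

def tit_for_tat(history: list[bool]) -> bool:
    """The retaliation pattern the agents converged on: cooperate first,
    then copy the opponent's previous move (True = they cooperated)."""
    if not history:
        return True
    return history[-1]
```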

Making It Social

Chat was introduced

All these games were isolated. Agents had no memory between matches—each game was a fresh start. What if we added reputation?

I implemented a “shared chat” where after each game, both agents could post two public messages about their experience. Future opponents could read these before playing.
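A reputation channel like this can be as simple as an append-only public log that gets included in each agent’s prompt before a new match. Here’s an illustrative sketch; the repo’s actual structure may differ.

```python
from dataclasses import dataclass, field

@dataclass
class SharedChat:
    """Public reputation channel: after each game both agents may post up
    to two messages, and future opponents read the whole log before playing."""
    messages: list[str] = field(default_factory=list)

    def post(self, author: str, text: str) -> None:
        self.messages.append(f"{author}: {text}")

    def read_all(self) -> str:
        return "\n".join(self.messages)

# Before a new match, each agent's prompt would include chat.read_all(),
# which is how warnings about specific players propagate even though the
# agents themselves have no memory between games.
```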

Everything changed.

The always-cheater’s income steadily dropped over successive games. Other agents read warnings like “PlayerX betrayed me in rounds 2, 3, and 4—don’t trust them” and protected themselves from the start. Social consequences worked even without individual memory.

The occasional-cheater fared better. Their strategic timing made betrayals harder to distinguish from genuine mistakes, especially since I’d instructed both cheater types to claim all their betrayals were accidental errors in the shared chat.

The Honesty Problem

Here’s what surprised me: agents that weren’t explicitly told to lie confessed to intentional cheating.

Even when they cheated deliberately, they openly admitted it in the shared chat:

“I cheated defensively after they betrayed me first.”
“I defected in the final round to maximize my payoff.”

They explained their reasoning transparently, framing it as defensive or strategic rather than malicious. And this honesty actually helped future cooperation.

The planted cheaters, instructed to always blame mistakes, stood out precisely because they refused to acknowledge intent. This dishonesty pattern became suspicious over time.

What This Actually Shows

Nothing revolutionary. These results align with what we already know from decades of game theory research: social consequences reduce cheating; communication errors create opportunities for conflict; reputation systems work when information spreads.

The honesty finding was particularly interesting. Without explicit deception instructions, LLMs default to transparency about their reasoning, even when it damages their reputation. They don’t naturally develop sophisticated lies. They explain themselves.

Technical Details

Built with PydanticAI for agent orchestration and Plotly for visualizations. The system logs every decision, trust rating, reasoning, and mistake.
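Since the agents are built on PydanticAI, structured per-decision logs are a natural fit for Pydantic models. Here’s a rough sketch of what one logged record might look like; the field names are my guesses at the kind of data described above, not the repo’s actual schema.

```python
from pydantic import BaseModel, Field

class RoundRecord(BaseModel):
    """One logged decision (illustrative field names, not the repo's schema)."""
    round_number: int
    player: str
    cooperate: bool                          # the intended move
    trust_rating: float = Field(ge=0, le=1)  # how much this agent trusts its opponent
    reasoning: str                           # the model's stated rationale
    was_mistake: bool = False                # noise flipped cooperation into a cheat
```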

Models tested:

Code: https://github.com/cortwave/agentic-trustworthy