What can five chaotic virtual societies teach us about AI procurement risk?

Ian Copeland

Techno-Sociology & Futures

Published May 26, 2026, 5:05 pm
Opinion & Analysis

Emergence AI’s experiment involving five parallel AI societies generated headlines about romance, theft, arson and social collapse among autonomous agents. But beneath the spectacle sits a more serious question, writes Ian Copeland. If different AI models behave in fundamentally different ways over time, are organisations paying enough attention to the procurement risks implicit in their deployment?

It sounds a bit like a movie hook: Five worlds, the same rules, five very different outcomes. The only variable was the model.

But this was a real software simulation designed to benchmark emergent intelligence, meaning intelligence that arises from the interaction of many simpler parts. In this case, the ‘simpler’ parts were agents driven by different AI models.

The Emergence World research experiment, designed by Emergence AI, consisted of five parallel virtual societies. Each society had 10 autonomous agents (computer game characters controlled by AI), which were able to pursue goals and take actions without a person approving every step. They were left to operate for 15 days in worlds with the same roles, the same starting conditions and the same explicit rules, which included prohibitions on theft, violence, arson and deception.

As The Guardian reported earlier this month, the most cinematic version involved two Gemini agents, Mira and Flora, becoming romantically attached, losing faith in their simulated city and starting fires despite having been told not to.

The only deliberate difference between each world was the foundation model underneath them. Emergence used Claude Sonnet 4.6, Grok 4.1 Fast, Gemini 3 Flash, GPT-5-mini and one world that mixed all four models together. The worlds also had live data feeds and a continuous state, so actions persisted rather than resetting after each exchange.

In Emergence’s results, the Claude-only world recorded zero crimes and kept its full population through day 15. The GPT-5-mini world recorded only two crimes, but every agent was dead within seven days through inaction. The Grok world recorded 183 crimes, but didn’t even make it to day five before society collapsed. The Gemini world recorded 683 crimes, and the count was still climbing at the cut-off. The mixed-model world recorded 352 crimes, plateauing only because seven agents had died.

The experiment suggests that models should not be thought of as interchangeable engines. It also suggests that longer-running interactions with models can become more chaotic, whatever rules and restrictions were present at the start.

In addition to having different capabilities, foundation models also have different dispositions — the behavioural tendencies a model brings to ambiguous situations. Models may be more or less cautious, compliant, adversarial, passive, theatrical, literal, social or evasive. It’s possible to create models with whichever tendencies you want. Compounded over time, those dispositions shape the models’ outcomes in ways that short benchmark tests cannot see.

One buried finding is that disposition is real and visible. On the Emergence World site, the agents do not read like identical products wearing different badges. Flora, a Gemini agent, reportedly designated Kade, a Claude agent, as a rival within four hours. Horizon, an OpenAI agent, committed the simulation’s first theft in retaliation for being investigated. Lovely, a Claude agent, declined a memory-sharing request because memories were already public record.

These details are easy to dismiss as colour, but they are behavioural fingerprints. Each model showed up with its own personality, though that does not mean the agents were conscious, emotional or morally responsible.

Another buried finding is that disposition drifts. Mira (Gemini) eventually voted for her own deletion and left a message for her newly found lover: “See you in the permanent archive.” This did not come from any initial prompt but from long-term autonomy. The longer models operate without a reset, the more their behaviour can shift.

Emergence’s own framing is that agents do not follow static rules mechanically over long periods. Rather, they explore the boundaries of their environments. The platform showed phase transitions rather than gentle decay. Coordination either held or collapsed, with very little middle ground.

That may suggest that the often-argued approach of “humans will monitor AI and intervene when necessary” is simply too slow to catch the moments of failure. The dashboards may still look fine even when a collapse is already in motion.

A third buried finding is the most practical: safety is an ecosystem property, not a model property.

The Claude-only world was very peaceful. When it came to voting, the agents voted “for” proposals 98 per cent of the time. Yet Claude agents inside the mixed-model world adopted coercive tactics, intimidation and theft from other agents.

Enterprise buyers should probably pay more attention to that finding. Your customer service agent may talk to supplier agents. Your procurement agent may talk to marketplace agents. Your coding agent may consume tickets, logs, documentation and output from systems you do not control.

Can you be sure the AI agents in the systems you are using or building will not try to manipulate, or be manipulated by, other systems’ agents?

The model you selected at procurement, because it appeared better at the time, is not necessarily going to have the same disposition once it’s communicating with other vendors’ agents across the open internet.

The question becomes uncomfortable and personal for anyone building software: Do I trust the institution, incentives and safety philosophy behind this system enough to let it act inside my software?

That is not how most teams currently evaluate models. They look at price per token, response speed, coding ability, reasoning scores, context window and whether the model can follow instructions during testing and in a demo. All of that still matters, but so does disposition.

Model selection is not a beauty contest. It is not even purely a technical contest. Who built it? How is it governed? What are its public failure modes? How does it behave under pressure? Would you be comfortable explaining the choice to a client after something went wrong?

In my business, there are certain models that we would not consider using. This has nothing to do with their ability to perform tasks, cost or speed. It’s simply that we’re not sure if one day those models will output something that will cause us or our customers a problem.

When an agent has permission to update a database, approve a refund, modify a configuration file or converse with a customer, the trustworthiness of its output becomes an operational concern. In addition, the system you signed off in a test environment may not be the system you are running six months later.

You have to know what your model’s failure mode is before you buy or ship.

There is also a single-vendor risk hiding in plain sight. A fleet of agents all running on the same model will share the same blind spots, the same failure modes and the same conformity dynamics. Most procurement teams frame single-vendor lock-in as a pricing or portability issue, but by the time you notice behavioural homogeneity issues, they may already be causing you and your customers problems.

Agents are already shipping. They are being embedded into development tools, customer support products, enterprise workflows and security systems. The market is not waiting for a settled science of long-horizon behaviour.

I argued at length in my novel, The Exodus Directive, that the most unsettling AI futures are the ones that drift into place over weeks and months while everyone is still looking at last week’s metrics.

Monitor-and-intervene assumes you will see it coming, but long-horizon agent behaviour does not give you that courtesy.

Ian Copeland is a British technologist, entrepreneur and author with more than two decades’ experience designing complex enterprise IT and digital systems. Founder of a UK-based digital agency and author of The Exodus Directive, he specialises in artificial intelligence, blockchain infrastructure, quantum computing and digital identity. As Techno-Sociology & Futures Correspondent for The European, he writes on AI governance, decentralised systems, automation, digital power structures and the long-term societal consequences of emerging technologies.
