Almost all the energy in AI agents goes into the agent — the policy that decides what to do next. Qwen-AgentWorld (open-weight, Apache 2.0) is a bet on the other half of the loop: the world. It is a language world model (LWM) — a model trained to simulate the environment an agent acts in, predicting the next observation by reasoning through environment dynamics in long chain-of-thought. The payoff is concrete: a faithful, controllable, infinitely-replayable simulator to train and stress-test agents against — across seven domains, without wiring up the real systems.
(This is the first piece in our new Agents track. Figures are from Qwen's release; independent reproductions will follow.)
What a "language world model" actually is
A normal agent loop has two halves: the agent takes an action, and the environment returns an observation. The environment is the real OS, browser, API, or terminal.
A world model replaces (or augments) that environment with a model that predicts the observation. Qwen-AgentWorld is a language world model: given the history and an action, it reasons about how the environment should respond and emits the next state. The design choice that makes it different from bolt-on simulators: Qwen made environment modeling the training objective from the continued-pretraining (CPT) stage onward, not a fine-tune afterthought. World-modeling is native.
Why care? If you can simulate the environment faithfully, you can generate unlimited training trajectories, run reinforcement learning cheaply and safely, and probe agents with controlled scenarios — none of the cost, risk, or flakiness of driving the real thing.
Seven domains, one model
Qwen-AgentWorld covers seven agent environments in a single model:
| Domain | What it simulates |
|---|---|
| MCP | tool calls / Model Context Protocol environments |
| Search | information retrieval, query/response |
| Terminal | Linux / command-line execution and output |
| SWE | software-engineering tasks and code actions |
| Android | mobile app and device interaction |
| Web | browser navigation and web apps |
| OS | operating-system commands and state |
One model instead of a bespoke simulator per domain — and Qwen reports zero-shot generalization to out-of-distribution environments: agents trained on self-consistent fictional worlds transfer to real tasks.
The two models
| Model | Total params | Active | Context | License |
|---|---|---|---|---|
| Qwen-AgentWorld-35B-A3B | 35B | 3B | 256K | Apache 2.0 |
| Qwen-AgentWorld-397B-A17B | 397B | 17B | 256K | Apache 2.0 |
Both are Mixture-of-Experts — low active-parameter counts mean inference cost closer to a small dense model than their total size suggests (the same trick behind Kimi K2.6). The 256K context leaves room for long agent trajectories, and Apache 2.0 means genuinely open.
AgentWorldBench: does the simulation hold up?
A world model is only useful if its predicted observations are right. AgentWorldBench scores predictions on five axes — Format, Factuality, Consistency, Realism, and Quality — and the headline is that an open model edges the closed frontier at being the environment:
| Model | AgentWorldBench (overall) |
|---|---|
| Qwen-AgentWorld-397B-A17B | 58.71 |
| GPT-5.4 | 58.25 |
| Claude Opus 4.6 | 57.80 |
| Claude Opus 4.8 | 56.59 |
It is a narrow lead, but a notable one: simulating a faithful agent environment is exactly the kind of long-horizon, dynamics-heavy task you would expect a frontier proprietary model to own.
What you would actually use it for
- Simulation RL — generate synthetic environments and trajectories and train agents against them, far cheaper and safer than the live OS / browser / API.
- Controllable perturbation — inject targeted faults ("the API returns a 500", "the file is missing", "the page layout changed") to expose and harden agent weaknesses on demand.
- OOD generalization — train on fictional but self-consistent worlds, deploy to real tasks.
- Foundation warm-up — LWM-style RL on single-turn trajectories transfers to multi-turn tool-calling across benchmarks.
Running it
Open weights on Hugging Face, served behind an OpenAI-compatible API via SGLang or vLLM:
vllm serve Qwen/Qwen-AgentWorld-35B-A3B \
--tensor-parallel-size 4 \
--max-model-len 262144
The 35B-A3B (3B active) is the approachable one to start with. The 397B-A17B is the benchmark-topping flagship — and, like any ~400B MoE, a serious memory commitment (how much, and on what hardware).
Why this matters
The agent gold rush has been about smarter policies. Qwen-AgentWorld is a bet that the binding constraint is increasingly the environment — that to train, evaluate, and stress-test agents at scale you need a faithful, controllable, replayable world, and that a single language model can be that world across seven domains at once. If language world models keep climbing AgentWorldBench, "train your agent in a simulated world, then deploy" stops being a research curiosity and starts being the default agent pipeline.