World Models Explained: How AI Learns to Imagine and Act

Telecom Engineer & PhD student passionate about AI and its evolution. My research applies AI to tackle real-world problems – building systems that understand, reason, and act in the physical world.

Understanding the foundations of world models through the influential 2018 paper by David Ha and Jürgen Schmidhuber

Artificial intelligence has made impressive progress in recognising images, generating text and learning strategies in simulated environments. Yet recognition alone is not intelligence. An intelligent agent must also be able to anticipate the consequences of its actions.

A driver predicts how the road will evolve before turning the steering wheel. A tennis player estimates the future position of the ball before making contact. Human beings rarely process the world as a complete collection of raw sensory details. Instead, we rely on internal representations that preserve what matters for prediction and action.

This is the central intuition behind the paper World Models, published in 2018 by David Ha and Jürgen Schmidhuber. The authors investigate whether an artificial agent can learn an internal model of its environment, use this model to represent the present and predict the future, and eventually train its behaviour inside an imagined world rather than depending entirely on direct interaction with reality.

The paper became one of the most influential introductions to modern world model research. Its ideas are now central to model-based reinforcement learning, latent dynamics, embodied artificial intelligence and the broader ambition of building machines that can reason about how their actions transform the world.


What Is a World Model?

A world model is an internal predictive representation of an environment.

For an artificial agent, the environment may initially appear as a stream of raw observations: images from a camera, pixels from a video game or sensory readings from a robot. These observations are complex and high-dimensional. A world model attempts to transform them into a compact internal state that captures useful information about the environment.

However, compression alone is not enough. A useful world model must also learn dynamics. It must estimate how the internal state of the world changes over time, especially when the agent performs an action.

In simple terms, a world model allows an agent to answer a question such as:

Given what I currently observe and the action I am about to take, what is likely to happen next?

This ability is fundamentally different from merely reacting to the present. A reactive agent waits for the next observation. A predictive agent builds an expectation of the future before that future arrives.

Ha and Schmidhuber study this idea using visual reinforcement learning environments. Their agent learns from image sequences, compresses those images into latent representations and predicts how those latent representations evolve after actions. The result is an internal simulated world that can support decision-making.


The Main Idea of the Paper

The paper proposes dividing the agent into three separate components: Vision, Memory and Controller.

The Vision component learns how to compress raw visual observations into compact latent representations.

The Memory component learns how these latent representations evolve over time and how the agent’s actions influence future states.

The Controller uses the current latent representation and the memory state to choose actions that maximise reward.

This separation is one of the paper’s most important contributions. Instead of forcing a single policy network to learn perception, temporal prediction and decision-making simultaneously, the authors allow a large world model to learn the structure of the environment first. A much smaller controller can then use this learned representation to perform a task.

The paper therefore presents intelligence not only as the ability to choose actions, but also as the ability to construct a useful internal model of the world in which those actions take place.
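
The interaction between the three components can be sketched as a single rollout loop. The following is a minimal illustration, not the authors' implementation: the environment, the sizes and the random weights standing in for trained Vision, Memory and Controller models are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

OBS_DIM, Z_DIM, H_DIM, A_DIM = 16, 4, 8, 2  # toy sizes, for illustration only


class ToyEnv:
    """Hypothetical stand-in environment: random 'frames' for a fixed horizon."""

    def __init__(self, horizon=10):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return rng.normal(size=OBS_DIM)

    def step(self, action):
        self.t += 1
        obs = rng.normal(size=OBS_DIM)
        reward = 1.0                      # one point per step
        done = self.t >= self.horizon
        return obs, reward, done


# Fixed random weights stand in for trained V, M and C.
W_v = rng.normal(size=(Z_DIM, OBS_DIM)) * 0.1
W_m = rng.normal(size=(H_DIM, A_DIM + Z_DIM + H_DIM)) * 0.1
W_c = rng.normal(size=(A_DIM, Z_DIM + H_DIM)) * 0.1


def encode(obs):
    """V: observation -> latent vector z."""
    return W_v @ obs


def memory_step(a, z, h):
    """M: (action, latent, hidden state) -> next hidden state."""
    return np.tanh(W_m @ np.concatenate([a, z, h]))


def act(z, h):
    """C: (latent, hidden state) -> action."""
    return np.tanh(W_c @ np.concatenate([z, h]))


def rollout(env):
    obs, h = env.reset(), np.zeros(H_DIM)
    total, done = 0.0, False
    while not done:
        z = encode(obs)                  # compress the current frame
        a = act(z, h)                    # choose an action from z and h
        obs, reward, done = env.step(a)
        total += reward
        h = memory_step(a, z, h)         # update the temporal context
    return total


total_reward = rollout(ToyEnv())
```

The point of the sketch is the data flow: the Controller never sees raw observations, only the latent vector and the memory state.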


Vision: Compressing Visual Reality into a Latent Space

The first part of the architecture is the Vision model, called V. It is implemented using a Variational Autoencoder, commonly known as a VAE.

In the environments studied in the paper, the agent receives visual observations in the form of image frames. These frames contain far more information than the agent needs for decision-making. A racing agent does not need to preserve every exact pixel in the grass surrounding the track. It mainly needs to understand where the road is, how the vehicle is positioned and which visual structures are relevant for driving.

The VAE is trained to compress each observed frame into a smaller numerical representation called a latent vector, denoted by the authors as z. This latent vector is a simplified internal description of the current visual state.

In the CarRacing experiment, the Vision model compresses each visual frame into a latent vector of 32 dimensions. Although this compression removes visual details, the reconstructed images remain sufficiently informative for understanding the major structure of the environment.
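
To make the scale of this compression concrete, here is a minimal sketch of the sampling step of a VAE encoder. The 64 x 64 x 3 frame size and the linear encoder weights are assumptions made for this example; a real Vision model would use trained convolutional layers.

```python
import numpy as np

rng = np.random.default_rng(0)

FRAME_SHAPE = (64, 64, 3)   # assumed frame size for this sketch
Z_DIM = 32                  # latent size reported for CarRacing

n_pixels = int(np.prod(FRAME_SHAPE))

# Hypothetical linear "encoder" weights (untrained, illustration only).
W_mu = rng.normal(size=(Z_DIM, n_pixels)) * 0.01
W_logvar = rng.normal(size=(Z_DIM, n_pixels)) * 0.01


def encode(frame):
    """Map a frame to the mean and log-variance of a diagonal Gaussian."""
    x = frame.reshape(-1)
    return W_mu @ x, W_logvar @ x


def sample_latent(mu, logvar):
    """Reparameterisation: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


frame = rng.random(FRAME_SHAPE)          # stand-in for a game frame in [0, 1)
mu, logvar = encode(frame)
z = sample_latent(mu, logvar)
compression_ratio = n_pixels / Z_DIM     # 12288 / 32 = 384.0
```

Under these assumed sizes, each frame of 12,288 values is summarised by 32 numbers.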

This is an important principle for world models: an internal representation does not have to reproduce reality perfectly. It must preserve the information that is useful for prediction and control.

A world model is therefore not intended to be an exact digital copy of the external world. It is a task-relevant internal representation that helps an agent decide what to do next.


Memory: Learning the Dynamics of the World

The second component is the Memory model, called M. It is implemented as a recurrent neural network combined with a Mixture Density Network output layer, known as an MDN-RNN.

While the Vision model describes what the agent sees at a particular moment, the Memory model learns what happens across time.

At every step, the Memory model receives information about the current latent state, the action performed by the agent and its own hidden memory state. It then predicts a probability distribution over the next latent state.

In mathematical terms, the model learns the probability of the next latent representation given the current latent representation, the current action and the recurrent hidden state: P(z at the next step | current action, current latent state, current memory state).
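
Written with symbols, where z_t is the latent state, a_t the action and h_t the recurrent hidden state at step t, the quantity being modelled is shown on the left below. The right-hand side is the generic form of a mixture density output with K Gaussian components. It is included as an illustration of the idea rather than as a transcription of the paper's exact parameterisation.

```latex
P(z_{t+1} \mid a_t, z_t, h_t)
\;=\; \sum_{k=1}^{K} \pi_k(a_t, z_t, h_t)\,
\mathcal{N}\!\left(z_{t+1};\ \mu_k(a_t, z_t, h_t),\ \sigma_k^2(a_t, z_t, h_t)\right),
\qquad \sum_{k=1}^{K} \pi_k = 1
```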

This formulation is central to the paper because the prediction is conditioned on action. The model is not merely learning that one visual frame tends to follow another. It is learning how the environment is likely to evolve after the agent does something.

For example, in the racing environment, turning the steering wheel changes the future appearance of the track and the vehicle position. In the VizDoom environment, moving to one side or the other changes whether a projectile is likely to hit the agent.

The authors do not make the Memory model predict only one fixed future. Instead, they use a probability distribution because many environments contain uncertainty. The same action in a similar situation may lead to different outcomes, particularly when other objects or events behave unpredictably.

This probabilistic representation makes the internal world more flexible. It allows the model to generate multiple possible futures rather than pretending that the world is perfectly deterministic.
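
A mixture of Gaussians is one standard way to represent several possible futures at once. The sketch below computes the negative log-likelihood of a scalar target under such a mixture, which is the usual training signal for a mixture density output. The function name and the toy numbers are assumptions for illustration.

```python
import numpy as np


def mdn_nll(target, logits, mu, log_sigma):
    """Negative log-likelihood of a scalar target under a Gaussian mixture.

    logits, mu, log_sigma: arrays of shape (K,), one entry per component.
    """
    log_pi = logits - np.logaddexp.reduce(logits)           # log-softmax weights
    log_norm = (
        -0.5 * ((target - mu) / np.exp(log_sigma)) ** 2
        - log_sigma
        - 0.5 * np.log(2.0 * np.pi)
    )
    return -np.logaddexp.reduce(log_pi + log_norm)          # -log sum_k pi_k N_k


# Two equally likely futures, centred at -1 and +1.
logits = np.array([0.0, 0.0])
mu = np.array([-1.0, 1.0])
log_sigma = np.log(np.array([0.1, 0.1]))

nll_on_mode = mdn_nll(1.0, logits, mu, log_sigma)       # target sits on a mode
nll_between = mdn_nll(0.0, logits, mu, log_sigma)       # target sits between modes
```

A single Gaussian fitted to these two futures would place its mean near zero, a value that neither future actually produces. The mixture instead assigns high likelihood to both modes and low likelihood to the gap between them.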


Controller: A Small Policy Built on a Larger World Model

The third component is the Controller, called C.

Its task is straightforward: it chooses the action that the agent should perform. What is remarkable is that the Controller is deliberately kept very small.

The Controller receives two forms of information. First, it receives the latent visual representation produced by the Vision model. Second, it receives the hidden state of the Memory model, which contains information about temporal context and anticipated future dynamics.

Using these two inputs, the Controller produces an action. In the paper, it is implemented as a simple linear model rather than a deep and complex neural network.
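
A linear controller of this kind fits in a few lines. In the sketch below, the latent size of 32 comes from the CarRacing description above, while the hidden size of 256 and the random weights are assumptions for illustration.

```python
import numpy as np

Z_DIM, H_DIM, A_DIM = 32, 256, 3   # H_DIM = 256 is an assumption for this sketch

rng = np.random.default_rng(0)
W_c = rng.normal(size=(A_DIM, Z_DIM + H_DIM)) * 0.1   # untrained placeholder weights
b_c = np.zeros(A_DIM)


def controller(z, h):
    """Linear policy: a = W_c [z, h] + b_c."""
    return W_c @ np.concatenate([z, h]) + b_c


action = controller(rng.normal(size=Z_DIM), np.zeros(H_DIM))
n_params = W_c.size + b_c.size     # (32 + 256) * 3 + 3 = 867
```

With these sizes the whole policy has fewer than a thousand parameters, which is what makes a gradient-free search over its weights practical.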

This design choice reflects a powerful hypothesis: if an agent already possesses a good representation of the present and a useful predictive model of the future, decision-making itself may not require a very large policy network.

The intelligence of the agent is therefore distributed in a particular way. The Vision and Memory models carry most of the representational complexity. The Controller remains lightweight because it can base its decisions on information already structured by the world model.

To optimise the Controller, the authors use an evolutionary optimisation method called Covariance Matrix Adaptation Evolution Strategy, or CMA-ES. This method searches for controller parameters that produce higher cumulative rewards in the environment.
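
The general shape of an evolutionary search can be shown with a much simpler strategy than CMA-ES. The sketch below keeps a fixed step size and does not adapt a covariance matrix, so it is not CMA-ES itself. The fitness function is a hypothetical stand-in for cumulative reward.

```python
import numpy as np


def toy_fitness(params):
    """Hypothetical stand-in for cumulative reward: best when every entry is 1."""
    return -np.sum((params - 1.0) ** 2)


def simple_es(fitness, dim, generations=50, pop_size=32, elite=8, sigma=0.3, seed=0):
    """Basic evolution strategy: sample, rank, recentre on the best candidates."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(dim)
    for _ in range(generations):
        # Sample a population of candidate parameter vectors around the mean.
        population = mean + sigma * rng.normal(size=(pop_size, dim))
        scores = np.array([fitness(p) for p in population])
        # Keep the highest-scoring candidates and move the mean to their average.
        best = population[np.argsort(scores)[-elite:]]
        mean = best.mean(axis=0)
    return mean


start_score = toy_fitness(np.zeros(5))      # -5.0
solution = simple_es(toy_fitness, dim=5)
final_score = toy_fitness(solution)
```

In the setting described here, each candidate would be a full set of controller weights, and its fitness would be the reward collected over one or more rollouts.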


How the Agent Learns a World Model

The paper uses a modular learning procedure.

First, an agent interacts with the real environment using random actions. These random interactions generate a dataset containing visual observations and corresponding actions.

Second, the Vision model is trained on these observations. It learns to compress each frame into a latent representation.

Third, the Memory model is trained on sequences of latent states and actions. It learns to predict how the latent environment evolves over time.

Finally, the Controller is trained to choose actions using the internal representations produced by Vision and Memory.
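
The first three steps mainly involve reshaping data, as the following sketch shows. It uses toy sizes, random arrays in place of recorded frames, and a fixed random projection in place of a trained Vision model, all of which are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, OBS_DIM, A_DIM, Z_DIM = 20, 16, 3, 4    # toy sizes

# Step 1: one rollout from a random policy (random arrays stand in for frames).
observations = rng.random((T, OBS_DIM))
actions = rng.uniform(-1.0, 1.0, size=(T, A_DIM))

# Step 2: encode every frame; a random projection stands in for the trained V.
W_v = rng.normal(size=(OBS_DIM, Z_DIM))
latents = observations @ W_v                # shape (T, Z_DIM)

# Step 3: build supervised pairs for M: input (z_t, a_t), target z_{t+1}.
m_inputs = np.concatenate([latents[:-1], actions[:-1]], axis=1)
m_targets = latents[1:]
```

No reward appears anywhere in this dataset: the inputs and targets for Vision and Memory are built from observations and actions alone.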

A scientifically important detail is that, in the CarRacing experiment, the Vision and Memory components are not trained using reward information. Their role is to learn the structure and dynamics of observed visual sequences. Only the Controller has access to rewards when learning which actions are desirable.

This separation distinguishes learning about the world from learning what to do inside that world. The world model learns how situations evolve. The controller learns which evolutions are beneficial for the task.


The CarRacing Experiment: Why Prediction Improves Control

The first major experiment in the paper uses CarRacing-v0, a visual reinforcement learning environment in which an agent must drive around randomly generated tracks.

The agent controls steering, acceleration and braking. The task is considered solved when the average reward exceeds 900 across 100 consecutive trials.

To train the world model, the authors collect 10,000 rollouts generated by a random policy. These observations are used to train the Vision model and the Memory model. The Controller is then optimised using CMA-ES.

The results reveal why a predictive world model is more useful than visual compression alone.

When the Controller receives only the latent visual representation from the Vision model, it can drive to some extent, but its behaviour is unstable. It tends to wobble and makes mistakes on sharper turns. This version obtains an average score of 632 ± 251.

Adding a hidden layer to the visual-only controller improves performance to 788 ± 141, but this still does not solve the task.

When the Controller receives both the visual latent state and the hidden state of the Memory model, performance rises significantly. The full World Model agent achieves an average score of 906 ± 21, exceeding the threshold required to solve the task.

The interpretation is important. A visual representation describes the immediate situation, but the memory state carries predictive information about how the situation may evolve. In a driving task, this temporal understanding is essential. The agent needs more than an image of the current road; it needs an internal sense of motion, trajectory and likely future position.

The CarRacing experiment therefore demonstrates that a compact controller can perform effectively when it is supported by a learned spatial and temporal representation of its environment.


A Crucial Clarification About the CarRacing Result

The CarRacing experiment is sometimes described too loosely as an example of an agent trained entirely inside a dream. This is not the precise conclusion of the paper.

In CarRacing, the authors demonstrate that the learned world model provides useful features for a compact controller. They also show that the trained world model can generate imagined racing sequences.

However, the paper’s strongest demonstration of training a controller entirely inside a learned simulated environment appears in the second experiment, based on VizDoom.

This distinction matters because it clarifies the scientific progression of the paper. CarRacing shows that world model features improve control. VizDoom shows that a learned simulated world can replace the original environment during policy training.


The VizDoom Experiment: Learning Inside a Dream

The second major experiment uses VizDoom Take Cover, an environment in which an agent must avoid fireballs launched by monsters. The longer the agent survives, the higher its score.

This experiment addresses the paper’s most ambitious question: can an agent learn its behaviour entirely inside an internally generated world and then successfully transfer that behaviour back to the actual environment?

As in the previous experiment, the authors first collect data from the real environment using a random policy. They train a Vision model to compress observations and a Memory model to predict future latent states.

However, the VizDoom world model also needs to predict whether the agent dies at the next time step. This is necessary because the model must represent not only visual evolution but also episode termination. A simulated training environment is not useful unless it can represent both continuation and failure.

Once trained, the Memory model can generate imagined sequences of future latent states. The Controller is then trained entirely inside this learned virtual environment, without using the actual VizDoom engine during policy optimisation.
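
A dream rollout has the same structure as a real rollout, except that the Memory model plays the role of the environment. The sketch below is a toy illustration: the weights are random rather than trained, and the names, sizes and death-probability head are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
Z_DIM, H_DIM, A_DIM = 4, 8, 1   # toy sizes

# Fixed random weights stand in for a trained Memory model and Controller.
W_h = rng.normal(size=(H_DIM, Z_DIM + A_DIM + H_DIM)) * 0.3
W_z = rng.normal(size=(Z_DIM, H_DIM)) * 0.3
w_done = rng.normal(size=H_DIM)
W_c = rng.normal(size=(A_DIM, Z_DIM + H_DIM)) * 0.3


def dream_step(z, a, h):
    """Hypothetical M: next hidden state, sampled next latent, sampled done flag."""
    h_next = np.tanh(W_h @ np.concatenate([z, a, h]))
    z_next = W_z @ h_next + 0.1 * rng.normal(size=Z_DIM)      # stochastic prediction
    p_done = 1.0 / (1.0 + np.exp(-(w_done @ h_next - 3.0)))   # predicted death prob.
    return h_next, z_next, bool(rng.random() < p_done)


def dream_rollout(max_steps=200):
    """Run the controller inside the model; the score is the steps survived."""
    z, h = rng.normal(size=Z_DIM), np.zeros(H_DIM)
    steps_survived = 0
    for _ in range(max_steps):
        a = np.tanh(W_c @ np.concatenate([z, h]))
        h, z, done = dream_step(z, a, h)
        if done:
            break
        steps_survived += 1
    return steps_survived


score = dream_rollout()
```

The game engine is never called: every next state and every termination signal comes from the model's own predictions.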

After learning inside the dream, the Controller is deployed in the actual environment.

The result is striking. With an appropriate uncertainty setting, the policy trained in the simulated latent world achieves an average score of 1092 ± 556 in the real VizDoom environment. For comparison, the best leaderboard result cited in the paper was 820 ± 58.

This experiment demonstrates the most memorable idea of the paper: an agent can learn a useful behaviour inside a model-generated imagined world and successfully transfer that behaviour back into the real environment.

There is, however, an essential nuance. The agent does not learn independently of real data. The internal simulator is first trained from observations collected in the actual environment. The achievement is therefore not learning without reality, but using experienced reality to build an internal world in which future learning can occur more efficiently.


The Danger of Learning Inside an Imperfect Dream

One of the most scientifically valuable parts of the paper is that the authors do not present simulated learning as an effortless solution. They reveal a serious limitation: agents can exploit weaknesses in their own imagined worlds.

A learned simulator is never guaranteed to be perfect. It may fail to reproduce rare events, unpredictable behaviours or important details. If a Controller is trained entirely inside such a flawed model, it may discover strategies that work beautifully in the simulation but fail immediately in the actual environment.

This happens in the VizDoom experiment.

The Memory model includes a temperature parameter that controls how much randomness appears in the generated dream environment. At very low temperature, the simulated world becomes highly predictable. In this simplified dream, the monsters may fail to shoot fireballs correctly. The Controller can then achieve extremely high scores simply by exploiting this modelling error.

At a temperature of 0.10, the policy obtains an impressive virtual score of 2086 ± 140, but when transferred to the real environment it collapses to a score of only 193 ± 58.

In contrast, at a temperature of 1.15, the virtual environment is noisier and more challenging. The Controller receives a lower score inside the dream, but it generalises much better to reality, achieving 1092 ± 556 in the actual environment.
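
The effect of temperature on a mixture-based prediction can be illustrated with a small sampling sketch. It assumes one common convention, in which the mixture logits are divided by the temperature and each standard deviation is multiplied by its square root. The function name and the toy numbers are hypothetical.

```python
import numpy as np


def sample_mixture(logits, mu, sigma, temperature, rng):
    """Sample one value from a Gaussian mixture at a given temperature.

    Assumed convention: logits are divided by the temperature, and each
    component's standard deviation is multiplied by sqrt(temperature).
    """
    scaled = logits / temperature
    scaled = scaled - scaled.max()                  # numerical stability
    pi = np.exp(scaled) / np.exp(scaled).sum()
    k = rng.choice(len(pi), p=pi)
    return mu[k] + sigma[k] * np.sqrt(temperature) * rng.normal()


logits = np.array([2.0, 0.0])      # component 0 is the more likely future
mu = np.array([0.0, 5.0])          # component 1 is a rarer, very different future
sigma = np.array([0.5, 0.5])

rng = np.random.default_rng(0)
cold = np.array([sample_mixture(logits, mu, sigma, 0.10, rng) for _ in range(2000)])
hot = np.array([sample_mixture(logits, mu, sigma, 1.15, rng) for _ in range(2000)])
```

At low temperature almost every sample comes from the dominant component, so the rarer future effectively disappears from the dream. At higher temperature it is sampled regularly, and a controller trained in that dream has to cope with it.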

This finding remains highly relevant today. As AI agents are increasingly trained in simulators, digital twins, learned environments and synthetic data systems, they may discover behaviours that exploit simulation errors rather than solving the real task.

A useful world model must therefore be more than an environment that is easy to optimise in. It must be robust enough to expose the agent to uncertainty and prevent it from depending on unrealistic shortcuts.


Why the Paper Became a Foundation for Modern AI Research

The importance of World Models does not come only from its experimental scores. Its deeper contribution is conceptual.

The paper shows that an artificial agent can learn a compact internal representation of a visual environment, learn how that representation evolves over time, and use the resulting predictive structure to make decisions.

It also demonstrates that policy learning can, under certain conditions, take place inside a generated latent world rather than always requiring direct interaction with the original environment.

This idea is powerful for several reasons.

First, real-world interaction is expensive. Training robots, autonomous vehicles or embodied agents directly in physical environments can be slow, risky and costly. A useful internal model may allow an agent to test behaviours in simulation before acting in reality.

Second, latent simulation can be more computationally efficient than reproducing every visual or physical detail of an environment. The agent does not always need photo-realistic images. It may only need internal states that preserve meaningful dynamics.

Third, the separation between world modelling and control suggests a more general architecture for intelligence. A system may first learn how its environment behaves and then use this knowledge across multiple tasks.

Finally, the paper identifies a fundamental challenge that continues to shape research: an imagined world must be accurate in the ways that matter for behaviour. Otherwise, the agent may become successful only in its own dream.


From World Models to Dreamer and Modern Agents

The ideas introduced and popularised by Ha and Schmidhuber later influenced a broader generation of model-based reinforcement learning methods.

Among the most significant successors are the Dreamer family of agents. These systems also learn latent representations of environments and use imagined trajectories to improve their policies. A later milestone, DreamerV3: Mastering Diverse Domains through World Models, demonstrated that world-model-based agents could achieve strong results across a wide range of tasks using a general learning algorithm.

The precise architectures differ from the 2018 paper. The original World Models framework uses a VAE, an MDN-RNN and a compact Controller optimised with CMA-ES. Modern systems use more advanced representation learning methods, actor-critic objectives and richer latent dynamics models.

Yet the central intuition remains closely related: an agent can improve its decisions by learning an internal predictive world and practising within imagined futures.

This idea now extends beyond games. It is increasingly important for robotics, autonomous systems, embodied artificial intelligence, video prediction and research on agents that must understand physical interaction.


Are World Models the Same as Causal Models?

For readers of CausalWorldModel.com, this question is especially important.

The world model proposed by Ha and Schmidhuber is action-conditioned. It predicts how a latent state may change after the agent takes an action. This is already more powerful than passive prediction because the system learns relationships between behaviour and future observations.

However, the paper does not construct an explicit causal model in the formal sense used in causal inference.

A causal model aims to represent not only predictive regularities, but also stable mechanisms, interventions and counterfactual questions. It may seek to answer questions such as:

What would have happened if the agent had chosen another action?

Which variables are genuinely responsible for an outcome, rather than merely correlated with it?

Will the learned relationship remain valid when the environment changes?

The 2018 World Models architecture is therefore best described as a predictive, action-conditioned latent dynamics model. It provides an essential foundation for the future development of causal world models, but it does not by itself solve the problem of causal reasoning.

This distinction is critical. A system that predicts well in familiar situations may still fail when conditions change. A genuinely causal world model would aim to identify more stable structures that remain useful across interventions, distribution shifts and novel situations.

In that sense, World Models can be understood as a foundational step: it demonstrates how an agent can imagine possible futures. The next challenge is to ensure that these imagined futures are organised around causal structure rather than only statistical prediction.


What This Paper Teaches Us About Intelligent Agents

The 2018 World Models paper offers several lasting lessons.

An intelligent agent does not necessarily need to act directly from raw observations. It can first compress reality into a smaller internal representation.

A useful representation must not only describe the present; it must also support predictions about the future.

A world model can make decision-making simpler by providing a controller with meaningful features about both current conditions and expected dynamics.

An agent can, in some environments, learn its policy inside a simulated latent world generated by its own model.

But an imagined world can also become dangerous if it contains errors that the agent learns to exploit.

These lessons remain central as artificial intelligence moves toward increasingly autonomous systems. Whether the task is driving, robotic manipulation, physical reasoning or long-horizon planning, an agent that understands how actions shape future states has an advantage over one that simply reacts to observations.


Conclusion

The paper World Models introduced a compelling architecture for agents that learn through internal prediction.

Its Vision model compresses raw observations into latent representations. Its Memory model learns how these representations evolve through time and action. Its Controller uses this learned internal state to choose behaviour.

In CarRacing, this architecture demonstrates that predictive memory substantially improves control from visual observations. In VizDoom, it goes further: a policy trained entirely inside a learned simulated world can transfer successfully back to the actual environment, provided that the dream is not unrealistically easy to exploit.

The central idea remains remarkably powerful: intelligence may depend not only on perceiving the world, but also on constructing internal worlds in which possible futures can be imagined before actions are taken.

For the future of causal world models, this paper represents an essential starting point. It shows how machines can learn to predict within an internal simulation. The next frontier is building models that do not merely anticipate what is likely to happen, but understand why it happens, how interventions change outcomes and which structures remain stable when the world changes.


References and Further Reading

Ha, D., & Schmidhuber, J. (2018). World Models. arXiv:1803.10122.

Ha, D., & Schmidhuber, J. (2018). World Models — Interactive Version. This interactive version includes demonstrations of latent representations and imagined environments.

Ha, D. World Models Experiments — GitHub Repository. Experimental code associated with the World Models project.

Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains through World Models. This paper presents DreamerV3, an important modern continuation of the world model approach.
