Advanced Architectures and Methodologies in Visual Reinforcement Learning: A Technical Analysis of the ViZDoom Platform

Michael Kudlaty
December 1, 2025

1. Introduction: The Paradigm Shift to Embodied Perception

The progression of Artificial Intelligence, particularly within the domain of Deep Reinforcement Learning (DRL), has been inextricably linked to the complexity of the environments used to benchmark algorithmic performance. While the Arcade Learning Environment (ALE) and the suite of 2D Atari games served as the foundational crucible for early breakthroughs—demonstrating that convolutional neural networks could map high-dimensional pixel inputs to control policies—these environments fundamentally lack the geometric complexity and distinct cognitive challenges of the physical world.1 The transition from third-person, fully observable 2D state spaces to first-person, partially observable 3D navigation represents a critical inflection point in the pursuit of Artificial General Intelligence.

ViZDoom, a research platform built upon the engine of the seminal 1993 video game Doom, emerged to bridge this chasm. It offers a computationally efficient yet visually complex 3D world that forces agents to operate under the constraints of embodied perception.3 Unlike the "God-view" afforded by board games or top-down mazes, a ViZDoom agent perceives only the data contained within its immediate field of view (FOV). To succeed, algorithms must not merely approximate a function mapping pixels to actions; they must synthesize representations of 3D space, develop object permanence, manage long-horizon temporal dependencies, and navigate the "Deadly Triad" of reinforcement learning—instability arising from function approximation, bootstrapping, and off-policy learning.5

This report provides an exhaustive technical analysis of ViZDoom as a premier reinforcement learning research platform. It dissects the architectural mechanics of the ZDoom engine, the suite of cognitive scenarios designed to isolate specific neural capabilities, and the evolution of algorithmic approaches—from standard Deep Q-Networks (DQN) to advanced Transformer-based architectures and hierarchical control systems. Furthermore, it examines the infrastructure required for high-throughput training, such as Sample Factory, and the recent emergence of Safe Reinforcement Learning benchmarks like HASARD, which introduce constrained optimization to the chaotic dynamics of First-Person Shooters (FPS).

2. Architectural Framework and Environmental Dynamics

The utility of ViZDoom is derived from its dual nature: it serves as both a faithful recreation of a complex, physics-based game engine and a streamlined API designed for high-frequency interaction with modern machine learning libraries. This section explores the underlying state representations and control topologies that define the agent-environment interface.

2.1 The Engine and State Representation

Built upon the open-source ZDoom engine, ViZDoom allows Reinforcement Learning agents to interact with the simulation synchronously or asynchronously. The core state representation is the screen buffer, which provides raw visual data in RGB, Grayscale, or CRCGCB formats.3 However, the platform's primary advantage over standard video decoding is its access to auxiliary internal buffers, which allow researchers to decompose the visual learning problem.

Research consistently indicates that depth perception is a critical component for successful navigation in 3D space, particularly when monocular cues are insufficient. ViZDoom grants agents access to the depth buffer, a Z-buffer visualization of the scene geometry. This feature facilitates the development of agents capable of SLAM-like (Simultaneous Localization and Mapping) behaviors without the computational overhead of photogrammetry or external estimators.8 By correlating depth data with RGB inputs, agents can learn to distinguish between texture variations (e.g., a poster on a wall) and geometric obstacles (e.g., a pillar), a distinction often lost in pure pixel-based learning.8

Furthermore, the engine supports the extraction of high-level game variables—such as health metrics, ammunition counts, and weapon states—which are essential for defining the reward function $R_t$ and constructing auxiliary supervised learning tasks.8 Recent iterations have also introduced access to the audio buffer, enabling pioneering research into multi-modal learning where agents must utilize sound cues (e.g., the growl of an enemy around a corner) to augment visual limitations.9
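A minimal sketch of configuring these buffers through the ViZDoom Python API is shown below; the scenario config path and the chosen game variables are illustrative and should be adapted to the local installation.

```python
# Minimal sketch: enabling ViZDoom's auxiliary buffers alongside the screen buffer.
# The scenario config path and game-variable selection are illustrative.
import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/basic.cfg")          # any bundled scenario config
game.set_screen_format(vzd.ScreenFormat.RGB24)   # raw RGB screen buffer
game.set_depth_buffer_enabled(True)              # Z-buffer geometry
game.set_labels_buffer_enabled(True)             # semantic segmentation masks
game.set_automap_buffer_enabled(True)            # top-down "privileged" map
game.set_available_game_variables(
    [vzd.GameVariable.HEALTH, vzd.GameVariable.AMMO2]
)
game.init()

game.new_episode()
state = game.get_state()
rgb    = state.screen_buffer      # pixel array; layout depends on the screen format
depth  = state.depth_buffer       # per-pixel depth map
labels = state.labels_buffer      # per-pixel object labels
health, ammo = state.game_variables
game.close()
```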

Table 2.1: ViZDoom Observation Buffers and Research Utility

| Buffer Type | Data Description | Research Application & Cognitive Implication |
| --- | --- | --- |
| Screen Buffer | Raw RGB/Grayscale pixels | Simulates human-like visual processing; primary input for CNN feature extractors. |
| Depth Buffer | Z-buffer geometry visualization | Facilitates visual navigation, obstacle avoidance, and spatial reasoning. |
| Automap | Top-down map data | Used as "privileged information" in Teacher-Student models or for debugging navigation policies. |
| Labels Buffer | Semantic segmentation masks | Enables supervised learning of object detection and automatic labeling of game entities. |
| Audio Buffer | Raw sound wave data | Supports multi-modal fusion research; critical for locating out-of-sight threats. |
| Game Variables | Scalar values (Health, Ammo) | Essential for reward shaping and defining the agent's internal state. |

2.2 Action Space and Control Topologies

The control interface in ViZDoom is highly configurable, supporting both discrete and continuous action spaces, which allows it to simulate a wide range of robotic and virtual agents. The API maps actions to logical "buttons" such as MOVE_LEFT, ATTACK, TURN_RIGHT, and SELECT_WEAPON.14

In the context of Deep Q-Networks (DQN), the action space is typically flattened into a discrete set of legal button combinations (e.g., one-hot selections among MOVE_LEFT, TURN_RIGHT, and ATTACK). However, more sophisticated setups, particularly those utilizing Proximal Policy Optimization (PPO), employ MultiDiscrete or continuous action spaces via delta buttons (e.g., VIEW_PITCH, VIEW_ANGLE) to simulate the precision of mouse-look controls.15 This distinction is vital; discrete turning limits the agent to fixed angular increments, potentially hindering fine aiming, whereas continuous control allows for smooth tracking of targets, mimicking human motor control.

The simulation proceeds in "tics," with the engine running internally at 35 tics per second. RL agents typically utilize a technique known as "frame skipping," where an action selected at time $t$ is repeated for $k$ consecutive tics. This mechanism serves two purposes: it accelerates the training wall-clock time and effectively increases the agent's reaction horizon, allowing the consequences of actions to manifest more clearly in the next observation. Research suggests that an optimal frame skip parameter typically falls between 4 and 10, balancing fine-grained control with the need to propagate reward signals over longer temporal sequences.7
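The sketch below illustrates both points: a flattened one-hot action set over a small button list, and a frame skip of 4 passed directly to make_action. The config path and button choice are illustrative assumptions.

```python
# Minimal sketch: a flattened discrete action set plus frame skipping.
# The config path and button list are illustrative.
import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/defend_the_center.cfg")
game.set_available_buttons(
    [vzd.Button.TURN_LEFT, vzd.Button.TURN_RIGHT, vzd.Button.ATTACK]
)
game.init()

# One-hot button combinations form the discrete action set.
actions = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
FRAME_SKIP = 4  # each chosen action is repeated for 4 engine tics (engine runs at 35 tics/s)

game.new_episode()
total_reward = 0.0
while not game.is_episode_finished():
    state = game.get_state()              # observation before acting
    a = actions[0]                        # a trained policy would choose here
    total_reward += game.make_action(a, FRAME_SKIP)  # second argument = tics to repeat
game.close()
```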

2.3 The Gymnasium and Shimmy Integration

To standardize experimentation and facilitate interoperability with the broader RL ecosystem, ViZDoom has been integrated into the Gymnasium (formerly OpenAI Gym) interface. This integration is achieved through wrappers that translate the native ViZDoom API—whose Python bindings expose methods such as make_action() and get_game_variable()—into the standard env.reset() and env.step() interaction loop.19

The Shimmy library and various gymnasium_wrappers handle the crucial translation of observation spaces. While the raw ViZDoom engine returns a dictionary of buffers, standard RL libraries like Stable-Baselines3 expect specific tensor shapes, typically Channels-Last ($H \times W \times C$) or Channels-First ($C \times H \times W$) depending on the backend (TensorFlow or PyTorch). Wrappers are employed to resize frames (e.g., to $84 \times 84$ or $64 \times 64$), stack consecutive frames to encode motion, and normalize pixel values to the range $[0, 1]$.21 This standardization allows researchers to seamlessly transfer algorithms between ViZDoom, Atari, and MuJoCo environments without altering the underlying neural architecture.17
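A minimal preprocessing sketch is shown below. The environment ID ("VizdoomBasic-v0"), the registration module, and the "screen" observation key follow the ViZDoom Gymnasium documentation cited above but should be verified against the installed version; a frame-stacking wrapper can be chained afterwards.

```python
# Sketch: wrapping a ViZDoom Gymnasium env to produce normalized 84x84 grayscale frames.
import cv2
import gymnasium as gym
import numpy as np
from vizdoom import gymnasium_wrapper  # noqa: F401 (registers Vizdoom* env IDs)


class ScreenPreprocess(gym.ObservationWrapper):
    """Extract the screen buffer, convert to grayscale, resize, and scale to [0, 1]."""

    def __init__(self, env, size=84):
        super().__init__(env)
        self.size = size
        self.observation_space = gym.spaces.Box(
            low=0.0, high=1.0, shape=(size, size, 1), dtype=np.float32
        )

    def observation(self, obs):
        frame = obs["screen"]                          # H x W x 3 RGB frame
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        gray = cv2.resize(gray, (self.size, self.size), interpolation=cv2.INTER_AREA)
        return (gray[..., None] / 255.0).astype(np.float32)


env = ScreenPreprocess(gym.make("VizdoomBasic-v0"))
obs, info = env.reset()
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
```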

3. The Cognitive Gym: Scenarios as Behavioral Benchmarks

ViZDoom does not present a single monolithic problem but rather a diverse suite of scenarios, each rigorously designed to isolate and stress-test specific cognitive capabilities of an RL agent. These scenarios range from simple reactive combat to complex, long-horizon exploration tasks characterized by sparse rewards and deceptive local optima.

3.1 Reactive Control: Basic and Defend the Center

The entry-level benchmarks focus on immediate visual-motor mapping and target acquisition.

Basic: In this scenario, the agent is spawned in a simple room directly facing a monster. The objective is to align the crosshair and fire. While trivial for humans, this serves as a crucial sanity check for the RL pipeline, verifying that the agent can learn the causal link between the ATTACK action and the positive reward associated with a kill (typically +101).11 It isolates the fundamental mechanics of the Partially Observable Markov Decision Process (POMDP) without the noise of navigation.

Defend the Center: This scenario significantly escalates complexity by introducing omnidirectional threats. The agent occupies a stationary turret position in the center of a circular arena and must rotate to eliminate enemies spawning at the perimeter.11 The primary cognitive challenge here is "object permanence" and threat prioritization. The agent cannot see all enemies simultaneously; it must maintain a belief state about enemies behind it and prioritize targets based on proximity. The reward function is usually defined as $+1$ for a kill and $-1$ for death, forcing the agent to maximize the kill-death ratio while managing limited ammunition.25

3.2 Constrained Optimization: Health Gathering

The "Health Gathering" and "Health Gathering Supreme" scenarios introduce survival as the primary objective, distinct from combat. The agent is placed in a maze with a degrading health counter (simulating a poisoned environment or acid floor), and medkits are scattered stochastically throughout the map.26

The reward function is typically defined as $+1$ for every tick the agent remains alive, creating a dense reward signal. However, this introduces a complex navigation challenge. In "Supreme," the layout is a maze with walls and obstacles, requiring the agent to perform pathfinding under extreme time pressure. The stochastic spawning of medkits prevents simple trajectory memorization, requiring the agent to learn a generalized search policy and visual identification of resources. The "Supreme" variant is often used to benchmark the robustness of an agent's navigation module, as failure results in immediate termination.27

3.3 The Credit Assignment Problem: Deadly Corridor

The "Deadly Corridor" scenario is a canonical example of the "local optima" trap in Reinforcement Learning. The agent must navigate a winding corridor filled with aggressive enemies to reach a "Vest" at the far end.

  • The Trap: If the reward function is naively constructed based solely on killing enemies or moving forward, the agent frequently falls into a local optimum where it engages enemies aggressively but succumbs to damage before reaching the goal, or learns to oscillate back and forth to minimize risk without progressing.
  • The Solution: Successful agents must learn to suppress the impulse for immediate combat rewards in favor of the large, delayed reward of reaching the vest. Research in this scenario highlights the necessity of Reward Shaping—providing small auxiliary rewards for distance traveled—or Curriculum Learning, where the difficulty (damage scaling) is gradually increased.24 This scenario effectively tests an algorithm's ability to handle the temporal credit assignment problem.

3.4 Sparse Rewards and Exploration: My Way Home

"My Way Home" places the agent in a complex maze with a single goal item (a vest) located in a specific, randomly selected room. The reward signal is extremely sparse: $0$ (or a small step penalty) for all actions and $+1$ only upon reaching the goal.30

This scenario serves as a benchmark for exploration algorithms. A standard $\epsilon$-greedy policy typically fails here because the probability of randomly stumbling upon the goal decays exponentially with the maze's depth. Consequently, this environment has been pivotal in validating Intrinsic Motivation and Curiosity-Driven Learning methods, where the agent generates internal rewards based on the novelty of visited states or the error in predicting future states.31 The "My Way Home" scenario demonstrates the necessity of exploration strategies that go beyond random dithering.

3.5 Predictive Dynamics: Take Cover

In "Take Cover," the agent must dodge incoming fireballs from monsters without the ability to fight back. The critical research aspect here is Latent Overshooting and dynamics prediction. The agent must infer the trajectory of projectiles and the future state of the environment to survive. It serves as a primary testbed for model-based RL approaches like PlaNet (Deep Planning Network), which learn a latent dynamics model to plan actions in "imagination" rather than relying solely on reactive policy mappings. The agent must internalize the physics of the game engine to predict safe zones.33

4. Algorithmic Evolution in Visual Reinforcement Learning

The history of research utilizing ViZDoom mirrors the broader evolution of Deep Reinforcement Learning. The platform has hosted a progression of algorithms, each addressing specific limitations of its predecessors in the context of high-dimensional 3D observation.

4.1 The Baseline: DQN and its Variants

Deep Q-Networks (DQN) were among the first architectures applied to ViZDoom. By approximating the Q-value function (expected future rewards) using Convolutional Neural Networks (CNNs), DQN agents demonstrated the capacity to navigate basic scenarios.1 The standard architecture typically involves three convolutional layers followed by fully connected layers, processing a stack of the four most recent frames.
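A sketch of such a Q-network is given below in PyTorch. The layer sizes follow the common DQN recipe for $84 \times 84$ inputs with a stack of four frames; the exact dimensions used in individual ViZDoom papers may differ.

```python
# Sketch of the CNN Q-network described above (PyTorch). Layer sizes are illustrative.
import torch
import torch.nn as nn


class DoomDQN(nn.Module):
    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.head = nn.Sequential(
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions),   # one Q-value per discrete action
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, 84, 84) stack of grayscale frames scaled to [0, 1]
        return self.head(self.features(x))


q_values = DoomDQN(n_actions=3)(torch.zeros(1, 4, 84, 84))  # -> shape (1, 3)
```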

However, standard DQN struggles with the non-Markovian nature of Doom due to the limited Field of View (FOV). A single state $s_t$ often fails to fully describe the environment (e.g., it cannot capture the velocity of a rocket entering the frame or the presence of an enemy seen seconds ago). To mitigate this, researchers implemented Double DQN (DDQN) to reduce value overestimation and Prioritized Experience Replay to focus training on significant events (like dying or scoring a frag) rather than the mundane majority of frames.25 Despite these improvements, off-policy methods like DQN often exhibit instability in the stochastic, high-variance environment of FPS Deathmatch.
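The Double DQN correction mentioned above can be summarized in a short sketch: the online network selects the next action while the target network evaluates it. The function below is illustrative and pairs with the Q-network sketched earlier.

```python
# Sketch of the Double DQN target computation (PyTorch).
import torch


@torch.no_grad()
def double_dqn_target(online_net, target_net, rewards, next_states, dones, gamma=0.99):
    # rewards, dones: (batch,); next_states: (batch, 4, 84, 84)
    next_actions = online_net(next_states).argmax(dim=1, keepdim=True)   # action selection
    next_q = target_net(next_states).gather(1, next_actions).squeeze(1)  # action evaluation
    return rewards + gamma * (1.0 - dones) * next_q
```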

4.2 Policy Gradients and On-Policy Learning (PPO/A3C)

The introduction of Asynchronous Advantage Actor-Critic (A3C) marked a paradigm shift toward policy gradient methods in ViZDoom. A3C executes multiple agent instances in parallel environments, updating a global network asynchronously. This mechanism decorrelates the training data without requiring a massive experience replay buffer, which is highly effective for the diverse 3D navigation tasks in Doom.36

Currently, Proximal Policy Optimization (PPO) has superseded A3C as the standard baseline for ViZDoom research. PPO offers the stability of Trust Region methods but with a simpler, clipped objective function that prevents destructive policy updates. Studies comparing PPO and DQN in "Deadly Corridor" and "Health Gathering" consistently show that PPO converges to more robust policies, particularly when rigorous hyperparameter tuning (e.g., learning rate annealing, generalized advantage estimation $\lambda$) is applied.24 PPO's ability to handle continuous or multi-discrete action spaces makes it particularly suitable for the complex control schemes of FPS games, where an agent might need to aim, move, and switch weapons simultaneously.17

Table 4.1: Hyperparameter Sensitivity in PPO for ViZDoom

| Hyperparameter | Typical Value | Impact on Training |
| --- | --- | --- |
| Learning Rate | $1 \times 10^{-4}$ to $2.5 \times 10^{-4}$ | Crucial for stability; often annealed linearly to zero. |
| Gamma ($\gamma$) | 0.99 | Determines the horizon of the agent; higher values needed for "Deadly Corridor". |
| Batch Size | 2048 - 4096 | Larger batches stabilize the gradient estimate in high-variance scenarios. |
| Entropy Coeff | 0.01 | Encourages exploration; prevents premature convergence to local optima (e.g., camping). |
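As a concrete pairing of these values with a standard implementation, the sketch below configures Stable-Baselines3's PPO on a ViZDoom Gymnasium environment. The environment ID and the use of "MultiInputPolicy" (for the Dict observation of screen plus game variables) are assumptions based on the documentation cited elsewhere in this report, not a prescribed recipe.

```python
# Sketch: PPO with Stable-Baselines3 using values from Table 4.1.
import gymnasium as gym
from vizdoom import gymnasium_wrapper  # noqa: F401 (registers Vizdoom* env IDs)
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

venv = make_vec_env(lambda: gym.make("VizdoomBasic-v0"), n_envs=8)


def linear_schedule(initial_lr: float):
    # Anneal the learning rate linearly to zero over training.
    return lambda progress_remaining: progress_remaining * initial_lr


model = PPO(
    "MultiInputPolicy",     # ViZDoom's Gymnasium obs is a Dict (screen, game variables)
    venv,
    learning_rate=linear_schedule(2.5e-4),
    n_steps=512,            # per env: 8 * 512 = 4096 transitions per update
    batch_size=2048,
    gamma=0.99,
    gae_lambda=0.95,
    ent_coef=0.01,
    verbose=1,
)
model.learn(total_timesteps=5_000_000)
```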

4.3 Memory and Recurrence (LSTM/GRU)

To address the partial observability inherent in First-Person Shooters, simple frame stacking is often insufficient for long-term dependencies (e.g., remembering a visited room in a maze or the location of a health pack seen 10 seconds ago). Research has successfully integrated Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) into RL architectures.39

In architectures like DRQN (Deep Recurrent Q-Network) or A2C-LSTM, the output of the CNN feature extractor is fed into a recurrent layer before the policy head. This effectively provides the agent with a "short-term memory" or hidden state, allowing it to maintain a belief state about off-screen enemies or map geometry.25 This recurrence is critical for scenarios like "Deathmatch," where an enemy might disappear behind a pillar and reappear; a memory-less agent would treat the enemy's disappearance as a cessation of threat, whereas a recurrent agent retains the threat in its hidden state.
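A simplified sketch of this pattern is shown below, using a GRU in place of an LSTM; the layer sizes are illustrative rather than drawn from any specific paper.

```python
# Simplified sketch (PyTorch): CNN features feed a GRU whose hidden state serves as the
# agent's short-term memory, in the spirit of DRQN / A2C-LSTM. Sizes are illustrative.
import torch
import torch.nn as nn


class RecurrentDoomPolicy(nn.Module):
    def __init__(self, n_actions: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, hidden), nn.ReLU(),
        )
        self.gru = nn.GRU(hidden, hidden, batch_first=True)
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, frames, h0=None):
        # frames: (batch, time, 1, 84, 84) sequence of single grayscale frames
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, h_n = self.gru(feats, h0)        # h_n carries the belief state forward
        return self.policy_head(out), self.value_head(out), h_n
```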

4.4 The Transformer Era: StARformer

Recent advancements have introduced Transformer-based architectures to the ViZDoom domain, aiming to solve the "vanishing gradient" problem associated with RNNs over very long sequences. The StARformer (State-Action-Reward Transformer) moves beyond simple recurrence by modeling the RL problem as a sequence modeling task.6

The StARformer architecture explicitly models short-term representations of State, Action, and Reward (StAR) within a local window. It utilizes self-attention mechanisms to integrate these representations over long horizons. Unlike LSTM-based agents, which must compress all history into a fixed-size vector, the Transformer can "attend" to specific relevant past events (e.g., seeing a colored keycard) regardless of how many steps ago they occurred. This architecture has demonstrated superior performance on "Deadly Corridor" and "My Way Home," where the causal link between an action and a reward may be separated by hundreds of time steps.16 By introducing a Markovian-like inductive bias through local StAR representations, the model improves the stability of long-term modeling in visual RL.
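To make the sequence-modeling formulation concrete, the sketch below interleaves state, action, and reward embeddings into a causal Transformer. This is a highly simplified, Decision-Transformer-style illustration under assumed sizes; it is not the published StARformer architecture, which additionally builds local StAR representations with a ViT-style patch encoder.

```python
# Highly simplified sketch (PyTorch): (state, action, reward) tokens in a causal Transformer.
import torch
import torch.nn as nn


class StarSequenceModel(nn.Module):
    def __init__(self, n_actions: int, d_model: int = 128, context: int = 64):
        super().__init__()
        self.state_embed = nn.Sequential(      # flattened 84x84 frame -> token
            nn.Flatten(), nn.Linear(84 * 84, d_model)
        )
        self.action_embed = nn.Embedding(n_actions, d_model)
        self.reward_embed = nn.Linear(1, d_model)
        self.pos_embed = nn.Parameter(torch.zeros(1, 3 * context, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, states, actions, rewards):
        # states: (B, T, 1, 84, 84); actions: (B, T) long; rewards: (B, T, 1)
        b, t = actions.shape
        s = self.state_embed(states.flatten(0, 1)).view(b, t, -1)
        a = self.action_embed(actions)
        r = self.reward_embed(rewards)
        tokens = torch.stack([s, a, r], dim=2).reshape(b, 3 * t, -1)  # interleave s, a, r
        tokens = tokens + self.pos_embed[:, : 3 * t]
        mask = nn.Transformer.generate_square_subsequent_mask(3 * t)  # causal attention
        out = self.encoder(tokens, mask=mask)
        return self.action_head(out[:, 0::3])  # predict actions from the state tokens
```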

5. Solving the Sparse Reward Problem

One of the most significant challenges in ViZDoom—and Reinforcement Learning in general—is learning in environments where feedback is rare or non-existent for long durations. In the "My Way Home" scenario, a random agent might explore for thousands of episodes without ever finding the goal, receiving a reward signal of 0 continuously. This "vanishing reward" problem renders gradient-based learning ineffective.

5.1 Intrinsic Curiosity Modules (ICM)

To combat sparsity, researchers employ Curiosity-Driven Learning. The Intrinsic Curiosity Module (ICM) formulates curiosity as a prediction error. The agent is equipped with a forward dynamics model that attempts to predict the consequences of its actions—specifically, the feature representation of the next state $s_{t+1}$ given the current state $s_t$ and action $a_t$—alongside an inverse dynamics model that predicts the action taken between consecutive states and thereby shapes the feature space around aspects of the environment the agent can control.43

If the outcome of an action is unpredictable or novel—meaning the discrepancy between the predicted state and the actual state is high—the agent generates an intrinsic reward $r_i$. In ViZDoom, this mechanism compels the agent to seek out new rooms, activate switches, and interact with objects solely to reduce its prediction error, effectively exploring the maze without any external game score. Experiments demonstrate that ICM-equipped agents can solve "My Way Home" significantly faster than baseline A3C agents and can even generalize better to unseen maps, as the drive to explore is intrinsic to the agent rather than tied to a specific map layout.45
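A simplified ICM sketch is given below: the intrinsic reward is the forward model's prediction error in feature space, and the inverse model's cross-entropy loss shapes the encoder. The encoder, hidden sizes, and loss weighting are illustrative assumptions, not the exact configuration from the cited work.

```python
# Simplified ICM sketch (PyTorch). Feature gradients from the forward model are
# detached so that only the inverse model shapes the encoder, as in the ICM idea.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ICM(nn.Module):
    def __init__(self, n_actions: int, feat_dim: int = 256):
        super().__init__()
        self.phi = nn.Sequential(              # feature encoder phi(s)
            nn.Flatten(), nn.Linear(84 * 84, feat_dim), nn.ReLU()
        )
        self.forward_model = nn.Sequential(
            nn.Linear(feat_dim + n_actions, 256), nn.ReLU(), nn.Linear(256, feat_dim)
        )
        self.inverse_model = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions)
        )
        self.n_actions = n_actions

    def forward(self, s_t, a_t, s_next):
        f_t, f_next = self.phi(s_t), self.phi(s_next)
        a_onehot = F.one_hot(a_t, self.n_actions).float()
        f_pred = self.forward_model(torch.cat([f_t.detach(), a_onehot], dim=-1))
        a_pred = self.inverse_model(torch.cat([f_t, f_next], dim=-1))
        intrinsic_reward = 0.5 * (f_pred - f_next.detach()).pow(2).mean(dim=-1)
        fwd_loss = intrinsic_reward.mean()
        inv_loss = F.cross_entropy(a_pred, a_t)
        return intrinsic_reward.detach(), fwd_loss + inv_loss
```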

5.2 Reward Shaping and Novelty Estimation

Another prominent approach is Reward Shaping, where the researcher manually or automatically designs intermediate rewards to guide the agent. In "Deadly Corridor," simply adding rewards for "distance to target" or "picking up ammo" can create a dense signal. However, this introduces the risk of reward hacking, where the agent maximizes the auxiliary reward (e.g., spinning in circles to collect respawning ammo) rather than solving the main task.

Advanced methods propose automated reward shaping that combines game screen information with internal novelty estimations. These algorithms estimate the "rarity" of visited states and behaviors, incentivizing the agent to visit states that have low visitation counts in the replay buffer. This balances the exploration-exploitation trade-off by dynamically adjusting the weight of the intrinsic reward as the agent becomes more familiar with the environment.32
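A minimal stand-in for this idea is sketched below: a shaped progress term plus a count-based novelty bonus over discretized positions (e.g., from ViZDoom's POSITION_X and POSITION_Y game variables). The grid size and coefficients are illustrative, and this simple counter is a stand-in for the learned novelty estimators described above.

```python
# Minimal sketch: shaped progress reward plus a count-based novelty bonus.
from collections import defaultdict


class ShapedReward:
    def __init__(self, progress_coef=0.01, novelty_coef=0.1, cell=64.0):
        self.visit_counts = defaultdict(int)
        self.progress_coef = progress_coef
        self.novelty_coef = novelty_coef
        self.cell = cell
        self.prev_distance = None

    def __call__(self, env_reward, distance_to_goal, x, y):
        # Progress term: reward any reduction in distance to the goal.
        progress = 0.0
        if self.prev_distance is not None:
            progress = self.prev_distance - distance_to_goal
        self.prev_distance = distance_to_goal

        # Novelty term: decays with the visitation count of the current grid cell.
        cell = (int(x // self.cell), int(y // self.cell))
        self.visit_counts[cell] += 1
        novelty = 1.0 / (self.visit_counts[cell] ** 0.5)

        return env_reward + self.progress_coef * progress + self.novelty_coef * novelty
```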

6. High-Throughput Training Infrastructure

Training competent agents in 3D environments requires massive interaction counts—often in the billions of frames. This necessity has driven the development of high-throughput training frameworks specifically optimized for the ViZDoom platform.

6.1 Sample Factory

Sample Factory represents the current state-of-the-art in asynchronous RL for Doom. Standard Python-based implementations are often bottlenecked by the Global Interpreter Lock (GIL), which limits the efficiency of parallel environment execution. Sample Factory circumvents this by utilizing a fully asynchronous architecture known as "Double-Buffered Sampling".47

In this architecture, the GPU policy inference is decoupled from the CPU-based environment simulation. A shared memory buffer allows the environment workers to write observations and read actions without blocking the learner process. Benchmarks indicate that Sample Factory can achieve throughputs exceeding 100,000 frames per second on a single multi-core machine.49 This throughput enables the training of agents for complex tasks like "Deathmatch" in minutes rather than days, allowing for rapid hyperparameter sweeping and population-based training.50 The framework supports specific command-line arguments for ViZDoom, such as --env=doom_battle, --num_workers=20, and --env_frameskip=4, making it highly specialized for this domain.47

6.2 Comparison with Stable-Baselines3 and Tianshou

While Sample Factory focuses on raw throughput, Stable-Baselines3 (SB3) remains a preferred choice for educational purposes and rapid prototyping due to its code readability and strict adherence to algorithmic standards. SB3 utilizes synchronous vectorized environments (DummyVecEnv or SubprocVecEnv) which are easier to debug but generally slower than Sample Factory's asynchronous actor-learner model.19

Tianshou offers a middle ground, providing a highly modular component-based architecture that supports both synchronous and asynchronous execution. It integrates with EnvPool, a C++-based batched environment manager that significantly speeds up the interaction loop for ViZDoom compared to standard Python multiprocessing.52 EnvPool can interface directly with the C++ API of ViZDoom, bypassing Python overhead entirely during the stepping process.

Table 6.1: Framework Comparison for ViZDoom

| Framework | Architecture | Throughput (Approx. FPS) | Best Use Case |
| --- | --- | --- | --- |
| Sample Factory | Asynchronous Actor-Learner | >100,000 | Large-scale training, SOTA competition entries.47 |
| Stable-Baselines3 | Synchronous Vectorized | ~2,000 - 10,000 | Prototyping, education, standard baselines.48 |
| Tianshou + EnvPool | Modular / C++ Batched | ~20,000 - 50,000 | Custom algorithm research, high-performance synchronous training.52 |

7. Competitive Intelligence: The "Fittest Wins" and League Training

The pinnacle of ViZDoom research is often showcased in the Visual Doom AI Competition (VDAIC). The complexity of the "Full Deathmatch" track—where agents must hunt, navigate, collect resources, and fight simultaneously against hostile agents—exceeds the capacity of simple end-to-end RL models.

7.1 Adaptive Strategic Control (ASC)

The winning approach in recent competitions, detailed in the paper "The Fittest Wins," employs a Multi-Stage Framework centered on Adaptive Strategic Control (ASC). Instead of relying on a single monolithic neural network to handle all aspects of gameplay, the agent utilizes a hierarchical structure trained in successive stages:54

  1. Goal-Conditioned Navigation: Initially, the agent is trained solely to navigate the map efficiently, learning to reach specific coordinates or item locations without the noise of combat.
  2. Shooting Skills: A separate policy is fine-tuned specifically for aiming and firing accuracy, often using "Curriculum Learning" where targets become progressively harder to hit.
  3. Strategic Manager: A high-level controller acts as a selector, switching between these specialized behaviors based on the current context. For example, if the internal state indicates "Low Health," the manager triggers the navigation policy to seek medkits. If "Enemy Visible" is true, it triggers the combat policy. This hierarchical decomposition mirrors the "Subsumption Architecture" in robotics.54 A simplified selector sketch follows this list.
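The sketch below is purely illustrative: it hard-codes the switching rules to convey the idea of a strategic manager delegating to sub-policies, whereas the competition system described above learns this selection. The thresholds and policy names are hypothetical.

```python
# Illustrative sketch: a rule-based strategic manager over pre-trained sub-policies.
class StrategicManager:
    def __init__(self, navigation_policy, combat_policy, resource_policy):
        self.policies = {
            "navigate": navigation_policy,
            "combat": combat_policy,
            "gather": resource_policy,
        }

    def select(self, health: float, ammo: int, enemy_visible: bool) -> str:
        if health < 30:              # low health -> seek medkits
            return "gather"
        if enemy_visible and ammo > 0:
            return "combat"          # engage only when armed and a target is in view
        return "navigate"            # otherwise explore / reposition

    def act(self, observation, health, ammo, enemy_visible):
        mode = self.select(health, ammo, enemy_visible)
        return self.policies[mode](observation)   # delegate to the chosen sub-policy
```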

7.2 League Training and Hindsight Experience Replay

To prevent agents from converging to static strategies that are easily exploitable (e.g., camping in a specific corner), top-tier agents are trained using League Training. This involves maintaining a population of agents that train against each other. As one agent discovers a dominant strategy (e.g., circle-strafing), the opponents are forced to learn a counter-strategy (e.g., leading shots or predicting movement). This "arms race" leads to robust policies that generalize well against unseen opponents, including human players.56

Furthermore, techniques like Hindsight Experience Replay (HER) are adapted for FPS games. If an agent fails to kill an enemy but successfully survives, HER allows the agent to reinterpret that episode as a success for the goal of "survival," thereby extracting learning signal even from failed combat encounters.54 This multi-objective learning ensures that the agent extracts maximum utility from every frame of interaction.

8. Emerging Frontiers: Safe RL and Sim-to-Real

As ViZDoom matures as a platform, its utility is expanding beyond pure combat efficiency into the domains of safety and generalization, addressing critical bottlenecks in the deployment of RL to real-world systems.

8.1 The HASARD Benchmark

Released in 2025, HASARD (Benchmark for Vision-Based Safe Reinforcement Learning) introduces rigorous safety constraints to the Doom environment. In traditional RL, the agent is free to execute any action that maximizes the cumulative reward. In Safe RL, the agent must maximize reward subject to cost constraints (e.g., "Do not destroy friendly structures," "Do not step in lava").

HASARD leverages ViZDoom to create scenarios like "Collateral Damage" (where shooting civilians or friendly units incurs a heavy cost) and "Armament Burden" (where the agent must manage heavy weaponry without causing self-damage).50 The benchmark supports three difficulty levels and tests algorithms such as Lagrangian PPO, which dynamically adjusts the penalty for safety violations during training. This benchmark is pivotal for bridging the gap to real-world robotics, where safety violations are often unacceptable, and "trial-and-error" must be strictly constrained.50
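The Lagrangian mechanism behind such methods can be sketched compactly: a multiplier grows when the measured episode cost exceeds the budget and shrinks (never below zero) otherwise, scaling the cost penalty subtracted from the reward the policy optimizer sees. The sketch below illustrates this general mechanism with illustrative coefficients; it is not the HASARD or PPO-Lagrangian implementation itself.

```python
# Minimal sketch of a Lagrangian cost constraint for Safe RL.
class LagrangeMultiplier:
    def __init__(self, cost_budget: float, lr: float = 0.01):
        self.cost_budget = cost_budget
        self.lr = lr
        self.lam = 0.0

    def update(self, mean_episode_cost: float) -> float:
        # Dual ascent on the constraint J_cost(pi) <= budget.
        self.lam = max(0.0, self.lam + self.lr * (mean_episode_cost - self.cost_budget))
        return self.lam

    def penalised_reward(self, reward: float, cost: float) -> float:
        # Reward actually fed to the policy-gradient update.
        return reward - self.lam * cost
```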

8.2 Visual Navigation and SLAM Integration

ViZDoom is also serving as a testbed for Visual SLAM (Simultaneous Localization and Mapping) combined with RL. By utilizing the depth buffer and visual inputs, researchers are training agents to construct internal semantic maps of their environment. This hybrid approach combines classical robotics (mapping and path planning) with deep learning (visual recognition and policy execution).

Agents trained in this manner can navigate to semantic targets (e.g., "Go to the Red Room") rather than just geometric coordinates. The depth buffer provided by ViZDoom allows for the training of depth estimation networks in a supervised manner, which can then be transferred to real-world robots equipped with RGB-D cameras.59 This "Sim-to-Real" transfer is facilitated by the realistic texture rendering and complex geometry of the Doom engine, which provides a richer training ground than simple grid worlds.
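A data-collection sketch for such supervised depth training is shown below. Only the buffer access uses the real ViZDoom API; the config path, random-walk policy, and downstream network and loss are illustrative assumptions.

```python
# Sketch: collecting paired (RGB frame, depth buffer) samples from ViZDoom to
# supervise a monocular depth-estimation network.
import random
import numpy as np
import vizdoom as vzd

game = vzd.DoomGame()
game.load_config("scenarios/my_way_home.cfg")
game.set_screen_format(vzd.ScreenFormat.RGB24)
game.set_depth_buffer_enabled(True)
game.init()

rgb_frames, depth_maps = [], []
n_buttons = len(game.get_available_buttons())
game.new_episode()
while not game.is_episode_finished() and len(rgb_frames) < 1000:
    state = game.get_state()
    rgb_frames.append(np.array(state.screen_buffer))  # network input
    depth_maps.append(np.array(state.depth_buffer))   # supervision target
    action = [0] * n_buttons
    action[random.randrange(n_buttons)] = 1           # random walk for coverage
    game.make_action(action, 4)
game.close()

# The collected pairs can train a standard encoder-decoder depth network with an
# L1 or scale-invariant loss before attempting Sim-to-Real transfer.
```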

9. Conclusion

ViZDoom has established itself as a cornerstone of Deep Reinforcement Learning research. By providing a platform that is computationally efficient yet visually and cognitively complex, it successfully isolates the critical challenges of embodied intelligence: partial observability, 3D spatial reasoning, sparse reward structures, and multi-agent competition.

The progression of research within this platform—from simple DQN agents struggling to turn left, to Hierarchical Transformers executing complex strategic maneuvers in Deathmatch—mirrors the broader maturation of the field. The data suggests a definitive trend toward modular and hierarchical architectures 1, moving away from the "black box" end-to-end approach for complex tasks. Furthermore, the integration of high-throughput asynchronous training 47 and curiosity-driven exploration 43 appears to be the defining factor in achieving super-human performance.

For researchers entering this domain, the ecosystem offers a robust and tiered toolset: Gymnasium for standardization, Sample Factory for industrial-scale training, and PPO/StARformer as the algorithmic baselines. As the research focus shifts toward Safe RL (HASARD) and explainable AI, ViZDoom remains a relevant, rigorous, and indispensable proving ground for the next generation of intelligent agents.

Works cited

  1. Vizdoom - Nolan Winsman's Porfolio, accessed November 18, 2025, https://nolanwinsman.com/pdfs/VizDoom-Paper.pdf
  2. Deep Reinforcement Learning in VizDoom First-Person Shooter for Health Gathering Scenario - UPV, accessed November 18, 2025, https://personales.upv.es/thinkmind/dl/conferences/mmedia/mmedia_2019/mmedia_2019_4_30_50038.pdf
  3. Playing Doom Using Deep Reinforcement Learning. - IvLabs, accessed November 18, 2025, https://ivlabs.in/playing-doom-using-deep-reinforcement-learning/
  4. ViZDoom, accessed November 18, 2025, https://vizdoom.cs.put.edu.pl/
  5. [1809.03470] ViZDoom Competitions: Playing Doom from Pixels - ar5iv, accessed November 18, 2025, https://ar5iv.labs.arxiv.org/html/1809.03470
  6. The Innate Curiosity in the Multi-Agent Transformer - ResearchGate, accessed November 18, 2025, https://www.researchgate.net/publication/389573180_The_Innate_Curiosity_in_the_Multi-Agent_Transformer
  7. ViZDoom — EnvPool 0.8.4 documentation - Read the Docs, accessed November 18, 2025, https://envpool.readthedocs.io/en/latest/env/vizdoom.html
  8. ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning, accessed November 18, 2025, https://www.cs.put.poznan.pl/mwydmuch/pdf/2016_vizdoom.pdf
  9. Farama-Foundation/ViZDoom: Reinforcement Learning environments based on the 1993 game Doom :godmode - GitHub, accessed November 18, 2025, https://github.com/Farama-Foundation/ViZDoom
  10. Autonomous agents: Augmenting visual information with raw audio data | PLOS One, accessed November 18, 2025, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0318372
  11. Default scenarios/environments - ViZDoom Documentation, accessed November 18, 2025, https://vizdoom.farama.org/environments/default/
  12. Learning about SLAM and Reinforcement learning : r/reinforcementlearning - Reddit, accessed November 18, 2025, https://www.reddit.com/r/reinforcementlearning/comments/inspcf/learning_about_slam_and_reinforcement_learning/
  13. Natural Language State Representation for Reinforcement Learning - SciSpace, accessed November 18, 2025, https://scispace.com/pdf/natural-language-state-representation-for-reinforcement-3dqyhxy2fr.pdf
  14. Tutorial - ViZDoom - Poznan University of Technology, accessed November 18, 2025, https://vizdoom.cs.put.edu.pl/tutorial
  15. Gymnasium Env - ViZDoom Documentation, accessed November 18, 2025, https://vizdoom.farama.org/api/python/gymnasium/
  16. Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Action Spaces - arXiv, accessed November 18, 2025, https://arxiv.org/html/2407.01310v1
  17. Deep Reinforcement Learning for Playing Doom — Part 1: Getting Started - Leandro Kieliger, accessed November 18, 2025, https://lkieliger.medium.com/deep-reinforcement-learning-in-practice-by-playing-doom-part-1-getting-started-618c99075c77
  18. [PDF] ViZDoom Competitions: Playing Doom From Pixels | Semantic Scholar, accessed November 18, 2025, https://www.semanticscholar.org/paper/ViZDoom-Competitions%3A-Playing-Doom-From-Pixels-Wydmuch-Kempka/de0f96ebabb75700a9febcfe4ee0c5eb0adea7ba
  19. Stable-Baselines3 (SB3) Tutorial: Getting Started With Reinforcement Learning, accessed November 18, 2025, https://araffin.github.io/talk/sb3-gym-quickstart/
  20. APIs and wrappers - ViZDoom Documentation, accessed November 18, 2025, https://vizdoom.farama.org/main/introduction/apis_and_wrappers/
  21. Wrappers - Gym Documentation, accessed November 18, 2025, https://www.gymlibrary.dev/api/wrappers/
  22. Ray RLlib: how to train DreamerV3 on Vizdoom | by Kaige - Medium, accessed November 18, 2025, https://medium.com/@kaige.yang0110/ray-rllib-how-to-train-dreamerv3-on-vizdoom-and-atari-122c8bd1170b
  23. OpenAI Gym - Shimmy Documentation - The Farama Foundation, accessed November 18, 2025, https://shimmy.farama.org/environments/gym/
  24. Evaluating the effects of hyperparameter optimization in VizDoom - DiVA portal, accessed November 18, 2025, https://www.diva-portal.org/smash/get/diva2:1679888/FULLTEXT01.pdf
  25. Using VizDoom Research Platform Scenarios for Benchmarking Reinforcement Learning Algorithms in First-Person Shooter Games - IEEE Xplore, accessed November 18, 2025, https://ieeexplore.ieee.org/iel7/6287639/10380310/10413478.pdf
  26. Hands-on: advanced Deep Reinforcement Learning. Using Sample Factory to play Doom from pixels - Hugging Face Deep RL Course, accessed November 18, 2025, https://huggingface.co/learn/deep-rl-course/unit8/hands-on-sf
  27. Solving Partially Observable 3D-Visual Tasks with Visual Radial Basis Function Network and Proximal Policy Optimization - MDPI, accessed November 18, 2025, https://www.mdpi.com/2504-4990/5/4/91
  28. Automated Curriculum Learning by Rewarding Temporally Rare Events, accessed November 18, 2025, https://pure.itu.dk/files/83642393/Automated_Curriculum_Learning_by_Rewarding_Temporally_Rare_Events.pdf
  29. Using VizDoom Research Platform Scenarios for Benchmarking Reinforcement Learning Algorithms in First-Person Shooter Games - ResearchGate, accessed November 18, 2025, https://www.researchgate.net/publication/377671572_Using_VizDoom_Research_Platform_Scenarios_for_Benchmarking_Reinforcement_Learning_Algorithms_in_First-Person_Shooter_Games
  30. Using Infini-attention ViT with PPO to solve My Way Home | by Nolan Brady - Medium, accessed November 18, 2025, https://medium.com/correll-lab/using-infini-attention-vit-with-ppo-to-solve-my-way-home-7145b2bf1b99
  31. [2202.12174] Collaborative Training of Heterogeneous Reinforcement Learning Agents in Environments with Sparse Rewards: What and When to Share? - arXiv, accessed November 18, 2025, https://arxiv.org/abs/2202.12174
  32. Reward Shaping for Deep Reinforcement Learning in VizDoom - CEUR-WS.org, accessed November 18, 2025, https://ceur-ws.org/Vol-3094/paper_17.pdf
  33. Konferencijos „Lietuvos magistrantų informatikos ir IT tyrimai“ darbai, accessed November 18, 2025, https://epublications.vu.lt/object/elaba:37255249/37255249.pdf
  34. Towards More Generalisable Policies with Deep Reinforcement Learning and Learnt Forward Models - Queen Mary University of London, accessed November 18, 2025, https://qmro.qmul.ac.uk/xmlui/bitstream/handle/123456789/77307/RATCLIFFE_Dino_171033722_EECS_PhD_final.pdf?sequence=1
  35. Deep Reinforcement Learning with DQN vs. PPO in VizDoom - ResearchGate, accessed November 18, 2025, https://www.researchgate.net/publication/358122718_Deep_Reinforcement_Learning_with_DQN_vs_PPO_in_VizDoom
  36. mehdiboubnan/Deep-Reinforcement-Learning-applied-to-DOOM - GitHub, accessed November 18, 2025, https://github.com/mehdiboubnan/Deep-Reinforcement-Learning-applied-to-DOOM
  37. Playing Doom with Anticipator-A3C Based Agents Using Deep Reinforcement Learning and the ViZDoom Game-AI Research Platform - United Arab Emirates - Ministry of Health and Prevention, accessed November 18, 2025, https://nchr.elsevierpure.com/en/publications/playing-doom-with-anticipator-a3c-based-agents-using-deep-reinfor/
  38. Sample Factory: Egocentric 3D Control from Pixels at 100000 FPS with Asynchronous Reinforcement Learning - arXiv, accessed November 18, 2025, https://arxiv.org/pdf/2006.11751
  39. Long Short-Term Memory (LSTM) - NVIDIA Developer, accessed November 18, 2025, https://developer.nvidia.com/discover/lstm
  40. arXiv:1809.01999v1 [cs.LG] 4 Sep 2018, accessed November 18, 2025, https://arxiv.org/pdf/1809.01999
  41. StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning - ResearchGate, accessed November 18, 2025, https://www.researchgate.net/publication/364642891_StARformer_Transformer_with_State-Action-Reward_Representations_for_Visual_Reinforcement_Learning
  42. M-SAT: Multi-State-Action Tokenisation in Decision Transformers for Multi-Discrete Actions, accessed November 18, 2025, https://www.researchgate.net/publication/391769474_M-SAT_Multi-State-Action_Tokenisation_in_Decision_Transformers_for_Multi-Discrete_Actions
  43. Curiosity-driven Exploration by Self-supervised Prediction - Deepak Pathak, accessed November 18, 2025, https://pathak22.github.io/noreward-rl/resources/icml17.pdf
  44. Curiosity-driven Exploration by Self-supervised Prediction - arXiv, accessed November 18, 2025, https://arxiv.org/pdf/1705.05363
  45. Curiosity-driven Exploration by Self-supervised Prediction - Proceedings of Machine Learning Research, accessed November 18, 2025, https://proceedings.mlr.press/v70/pathak17a/pathak17a.pdf
  46. Curiosity-driven Exploration in VizDoom | IEEE Conference Publication, accessed November 18, 2025, https://ieeexplore.ieee.org/document/10036273/
  47. VizDoom - Sample Factory Documentation, accessed November 18, 2025, https://www.samplefactory.dev/09-environment-integrations/vizdoom/
  48. EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine - OpenReview, accessed November 18, 2025, https://openreview.net/pdf?id=BubxnHpuMbG
  49. Hadamax Encoding: Elevating Performance in Model-Free Atari - OpenReview, accessed November 18, 2025, https://openreview.net/pdf/d7f079dc090a77b543c19d2bfa41f3fb172e9174.pdf
  50. HASARD: A Benchmark for Vision-Based Safe Reinforcement Learning in Embodied Agents, accessed November 18, 2025, https://arxiv.org/html/2503.08241v1
  51. Stable Baselines3 Tutorial - Gym wrappers, saving and loading models, accessed November 18, 2025, https://colab.research.google.com/github/araffin/rl-tutorial-jnrr19/blob/sb3/2_gym_wrappers_saving_loading.ipynb
  52. thu-ml/tianshou: An elegant PyTorch deep reinforcement learning library. - GitHub, accessed November 18, 2025, https://github.com/thu-ml/tianshou
  53. Cheat Sheet — Tianshou 0.5.0 documentation, accessed November 18, 2025, https://tianshou.org/en/v0.5.0/tutorials/cheatsheet.html
  54. The Fittest Wins: a Multi-Stage Framework Achieving New SOTA in ViZDoom Competition, accessed November 18, 2025, https://www.researchgate.net/publication/369405085_The_Fittest_Wins_a_Multi-Stage_Framework_Achieving_New_SOTA_in_ViZDoom_Competition
  55. The Fittest Wins: a Multi-Stage Framework Achieving New SOTA in ViZDoom Competition, accessed November 18, 2025, https://ieeexplore.ieee.org/document/10077442/
  56. Players League Research Articles - Page 1 | R Discovery, accessed November 18, 2025, https://discovery.researcher.life/topic/league-of-players/6730922?page=1&topic_name=League%20Of%20Players
  57. Deep Reinforcement Learning-Based Multi-Agent System with Advanced Actor–Critic Framework for Complex Environment - ResearchGate, accessed November 18, 2025, https://www.researchgate.net/publication/389339264_Deep_Reinforcement_Learning-Based_Multi-Agent_System_with_Advanced_Actor-Critic_Framework_for_Complex_Environment
  58. HASARD: Safe Vision RL with Doom for Autonomous Systems - YouTube, accessed November 18, 2025, https://www.youtube.com/watch?v=-B26z9BM5FM
  59. [1612.00380] Playing Doom with SLAM-Augmented Deep Reinforcement Learning - ar5iv, accessed November 18, 2025, https://ar5iv.labs.arxiv.org/html/1612.00380
  60. System overview: (a) Observing image and depth from VizDoom. Running... | Download Scientific Diagram - ResearchGate, accessed November 18, 2025, https://www.researchgate.net/figure/System-overview-a-Observing-image-and-depth-from-VizDoom-Running-Faster-RCNN-b-for_fig1_311299501
  61. Benchmarking Deep Reinforcement Learning for Navigation in Denied Sensor Environments - arXiv, accessed November 18, 2025, https://arxiv.org/html/2410.14616v1
