
Engine optimizations, part 1


Introduction

Let’s talk about Doom 32X Resurrection and why it performs 2 to 4 times faster than the original 32X port of the game. But before we dive into D32XR itself, we need to know a few things about how the original Doom renderer from 1993 works.

To make a long story short, Doom uses what’s called Binary Space Partitioning (BSP) to divide the world map into progressively smaller regions, which at the bottom level store information on what pieces of geometry are “contained” within them. This classic example of the “divide and conquer” approach allowed for efficient realtime culling of world geometry, which was super important for a 3D game like Doom at the time.
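For illustration, here’s a minimal sketch of what such a front-to-back BSP walk looks like, modelled on the PC source (names like node_t, NF_SUBSECTOR and R_PointOnSide follow the PC code; the real renderer also culls nodes whose bounding boxes fall outside the view):

/* Recursive front-to-back BSP traversal: the side of the splitting
   line the camera is on gets visited first, so geometry comes out
   sorted by distance from the viewer. */
static void R_RenderBSPNode(int bspnum)
{
    if (bspnum & NF_SUBSECTOR)                    /* reached a leaf */
    {
        R_Subsector(bspnum & ~NF_SUBSECTOR);      /* emit its geometry */
        return;
    }

    node_t *node = &nodes[bspnum];
    int side = R_PointOnSide(viewx, viewy, node); /* which side is the camera on? */

    R_RenderBSPNode(node->children[side]);        /* near side first... */
    R_RenderBSPNode(node->children[side ^ 1]);    /* ...then the far side */
}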

If you want to learn more about the nitty-gritty details of the Doom renderer, I recommend reading the series of articles on Fabien Sanglard's website. Fabien also wrote a book dedicated to the Doom engine called “The Game Engine Black Book: DOOM” - go buy it now.

Doom 32X Resurrection descended from the port of Doom for the Atari Jaguar, which I will refer to as “JagDoom” for the rest of this article. The port was developed by John Carmack himself and made extensive use of the Jaguar’s multi-CPU architecture.

[Image: Atari Jaguar block diagram, taken from Atari Explorer Online Special Jaguar Edition]

The Atari Jaguar had two RISC chips known as Tom and Jerry, which were supposed to do most of the heavy lifting, and a Motorola 68000 CPU “used as a manager”. The Tom chip, also known as the GPU, performed most of the rendering for the current frame in parallel with the game logic for the next frame running on the 68000 and the Jerry chip.

Rendering pipeline

The rendering pipeline was very much the same as in the original revision of the engine developed for PCs, but was divided into so-called “phases”. The way the phases were conceived was very much dictated by the limitations of the GPU and how much code you could actually load into it for each phase.

Here’s how the rendering pipeline looks in JagDoom:

[Image: the JagDoom rendering pipeline]

Phase 1 “BSP”

Walks the BSP tree recursively and clips the world geometry.

[Image: BSP tree illustration. Public domain, https://commons.wikimedia.org/w/index.php?curid=641368]

The input for this phase is the player’s position and view angles and, of course, the BSP tree for the current map. The output of this stage is what’s called “wallcmds” (wall commands) - basically a list of horizontal spans for visible wall segments, along with some additional information. What’s important is that these spans are produced in front-to-back order, which allows for almost zero overdraw.

The other output is “vissubsectors”, or “visible subsectors” - visible convex map areas that may contain game objects. This is a pretty heavy stage which can take a lot of CPU time depending on the world geometry and how the BSP tree is laid out.
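To make the phase outputs a bit more concrete, here’s a rough sketch of the kind of data a wall command might carry; the field names below are illustrative, and the real structure in JagDoom/D32XR holds quite a bit more:

typedef struct
{
    seg_t   *seg;           /* the map segment this span came from  */
    int      start, stop;   /* first and last screen column covered */
    fixed_t  scalefrac;     /* wall scale at the start column       */
    fixed_t  scalestep;     /* scale increment per column           */
    fixed_t  ceilingheight; /* world-space top of the wall          */
    fixed_t  floorheight;   /* world-space bottom of the wall       */
    int      actionbits;    /* which parts (top/mid/bottom) to draw */
} viswall_t;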

Phase 2 “WallPrep”

Computes additional properties for the “wallcmds”; the output is what we’ll call “segments”. The segments now contain the information necessary to render vertical walls, along with the clipping information for sprite objects.

Phase 3 “SpritePrep”

Walks the list of visible subsectors prepared in Phase 1, checks visibility for each object in each subsector and produces the list of visible sprites, known as “vissprites”. Weapon sprites are also added to the “vissprites”. The input for this stage is the “vissubsectors” from Phase 1 and the global game state.

Phase 4 “LatePrep”

Computes additional information for wall segments and “vissprites” that will later be vital for rendering, and also checks whether any of the textures or sprites need to be cached into RAM. The output of this stage is, again, segments and vissprites - plus the cache flag.

Phase 5 “Cache”

Checks if the cache flag is set and, if so, copies the missing assets into main RAM, decompressing them along the way if needed. Note that the game can actually run out of free RAM at this stage, depending on how many different textures and sprites are on screen simultaneously.

Phase 6 “SegCommands”

This is where the actual drawing begins. The list of segments from Phases 1, 2, 3 and 4 is walked in sequential (front-to-back) order; for each X coordinate in each segment, additional clipping is performed, the perspective divide is done and the vertical columns for walls are rendered. We’ll refer to this part of the phase as Phase 6.a.

The other part is Phase 6.b, where rendering information for floors and ceilings is computed. The output of this stage is what’s called “visplanes”, plus the actual wall pixels that you see on screen.
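As a rough sketch of the 6.a column loop, assuming the hypothetical viswall_t from earlier (the perspective divide produces scalefrac and scalestep during wall preparation, and the scale is then simply stepped from column to column):

static void R_DrawSegColumns(viswall_t *seg)
{
    fixed_t scale = seg->scalefrac;
    int x;

    for (x = seg->start; x <= seg->stop; x++)
    {
        /* project the wall's world-space top and bottom to screen rows */
        int top    = CENTERY - (FixedMul(seg->ceilingheight, scale) >> FRACBITS);
        int bottom = CENTERY - (FixedMul(seg->floorheight,   scale) >> FRACBITS);

        R_DrawWallColumn(x, top, bottom, seg, scale); /* one vertical strip */

        scale += seg->scalestep; /* advance the scale to the next column */
    }
}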

Phase 7 “DrawPlanes”

Takes the “visplanes” array computed in Phase 6 and draws floors and ceilings as horizontal spans. No additional information is produced in this phase.

Phase 8 “Sprites”

Takes the list of visible sprites, aka “vissprites”, produced in Phases 3 and 4 as the input and draws them in back-to-front order, clipping them against wall segments along the way so that they are not seen through solid walls yet remain visible through holes in them.

No additional information is produced in this phase.

Phase 9 “Update”

The only thing this phase does is swapping the framebuffers.

Optimization opportunities

Now let’s talk about what Doom 32X Resurrection does differently in order to take advantage of the hardware architecture of Sega’s addon!

The 32X doesn’t have any specialised hardware that would facilitate the rendering of textured polygons or sprites, so we’re very much limited to what we can do using raw CPU power. And that’s exactly what the 32X has lots of in comparison to the Jaguar and the Mega Drive’s 68000, or even in comparison with i386 and some i486-based PCs of the era. The two Hitachi RISC CPUs known as the “SH-2” boasted a forward-thinking and robust design - so forward-thinking that the SH instruction set later served as the inspiration for ARM’s “Thumb” instruction set.

However, what the 32X didn’t have enough of was RAM. The 256KiB the system had was considered a puny amount even by 1994 standards. To compensate for that, it had 4MiB of cartridge ROM, which could act as slow read-only RAM of sorts.

Another drawback of the 32X being a Mega Drive/Genesis add-on is that it shares the same 16-bit bus as its host, making it extremely easy to slow things down if the code relies heavily on the Mega Drive’s Motorola 68000.

Anyway, back to Doom and designing a new rendering pipeline for the 32X. How do we play to the addon’s strengths? Well, since we have two fast CPUs on board, we need to somehow parallelize our pipeline in such a way that each CPU makes the most optimal use of its time at each stage, while not stepping on the toes of its partner by either hogging the bus or demanding exclusive access to shared resources for extended lengths of time.

Ideally we want the load to be split between the CPUs 50/50 in each phase, but unfortunately not every algorithm can be trivially parallelised this way. Quite often, upon completing such a redesign, you end up with a lot of overhead due to excessive locking or branch divergence. You can also end up passing a lot of data around, negating any potential gains.

Another approach is to analyse the data flow between phases, decoupling them and running some of them in parallel.

Most of the time, people end up combining both approaches by trying out different designs, profiling and seeing what works best. Rinse and repeat. And that is exactly what we’ve done with Doom 32X Resurrection.

The new pipeline

Primary CPU

Jumps to Phase 1 “BSP” as before, but with a new twist: upon adding a new segment to the list of “wallcmds”, the main CPU now increments a value in a shared hardware register, letting the second CPU know that there’s a new element in the list. Once the BSP phase completes, the main CPU writes a special value to that register, letting the other CPU know that no more segments are going to be appended to the list.

[Image: the new D32XR rendering pipeline]

Secondary CPU

The CPU will loop on the shared register value, waiting for more segments in the “wallcmds” list. Once the value is incremented by the primary CPU, the secondary CPU will proceed with Phase 2 and then jump straight to Phases 4 and 6.b for the new segment! This allows us to do some heavy lifting on the secondary CPU even though the scene isn’t fully built by the primary CPU yet! This saves precious CPU cycles for when we get to Phase 7, which involves rendering the floors and ceilings.
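Here’s a minimal sketch of that handshake from both sides, under the assumption that one of the 32X’s shared communication registers (called MARS_SYS_COMM4 below) carries the running wallcmd count, with a flag bit acting as the “no more walls” signal; the actual register choice, helper names and protocol in D32XR may differ:

#define WALLS_DONE_BIT 0x8000 /* "no more walls are coming" flag */

extern volatile unsigned short MARS_SYS_COMM4; /* shared comm register */

void R_WallPrep(unsigned i);  /* Phase 2 for one wallcmd   */
void R_LatePrep(unsigned i);  /* Phase 4 for one wallcmd   */
void R_PlanePrep(unsigned i); /* Phase 6.b for one wallcmd */

/* Primary CPU, during Phase 1: announce every freshly added wallcmd. */
static void R_AnnounceWall(unsigned count)
{
    MARS_SYS_COMM4 = count;
}

/* Primary CPU, end of Phase 1: the list is complete. */
static void R_AnnounceDone(unsigned count)
{
    MARS_SYS_COMM4 = count | WALLS_DONE_BIT;
}

/* Secondary CPU: pick up wallcmds as soon as they appear. */
static void R_SecondaryWallLoop(void)
{
    unsigned consumed = 0;

    for (;;)
    {
        unsigned reg   = MARS_SYS_COMM4;
        unsigned avail = reg & ~WALLS_DONE_BIT;

        while (consumed < avail)
        {
            R_WallPrep(consumed);
            R_LatePrep(consumed);
            R_PlanePrep(consumed);
            consumed++;
        }

        if (reg & WALLS_DONE_BIT)
            break; /* the list is complete and fully drained */
    }
}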

Primary CPU

Completes Phase 3 and proceeds with Phases 4 and 5 as normal. There are some notable changes and optimizations in these phases compared to the Jaguar version, but we’ll get to those some other time. All of this happens while the secondary CPU can still be busy with Phases 2, 4 and 6.b. The primary CPU waits for the secondary to catch up and then sends an instruction to join it in processing Phase 6.

Both CPUs are now running Phase 6, which is basically “let’s draw some walls!”. Each CPU consumes one entry from the “wallcmds” list and runs perspective projection and lighting computations independently of the other. The list is now a shared resource, and we need a safe way to access it from the two CPUs without running the risk of a bus conflict or of reading stale information. Thankfully, the SH-2 has an instruction called “TAS”, which stands for “Test-and-Set”.

Test-and-set

This instruction allows the issuing CPU to lock the bus, check a value in memory to see if it is equal to zero and, if it is, set it to a different value. The original value is reflected back to the CPU (via the T flag) and the bus is subsequently unlocked. Keeping the bus locked for the duration of the instruction ensures no other hardware can read or modify the value in between.

By using TAS we can build a locking mechanism that guarantees safe concurrent access to a shared resource - in this case, the list of walls. Here’s how it works: both CPUs issue a TAS instruction; one of them successfully grabs ownership of the bus and flips the lock value from 0 to a non-zero value, which basically tells everyone else “hey, I want the shared resource behind this lock exclusively for myself”, and then unlocks the bus. When the other CPU, as a result of its own TAS instruction, sees that the value is no longer 0, it assumes the lock is owned by someone else and will have to retry later.

static volatile char seg_lock = 0;

static void R_LockSeg(void)
{
    int res;
    do {
        /* TAS.B locks the bus, tests the byte at the given address,
           sets the T flag if it was zero, then sets the byte's MSB.
           MOVT copies T into res: 1 means we acquired the lock. */
        __asm volatile (
            "tas.b @%1\n\t"
            "movt %0\n\t"
            : "=&r" (res)
            : "r" (&seg_lock)
            : "t", "memory");
    } while (res == 0); /* spin while the lock is held by the other CPU */
}

static void R_UnlockSeg(void)
{
    seg_lock = 0; /* a plain byte store releases the lock */
}
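Note that releasing the lock doesn’t need TAS at all: a single byte store is atomic on the bus, and a failed TAS on the other CPU only re-sets the MSB of an already non-zero byte, so a plain write of zero by the owner safely hands the lock back.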

The new pipeline – continued

So the general outline of Phase 6 now looks like this: lock the “viswalls” list with R_LockSeg, get the next entry from it, unlock with R_UnlockSeg, check whether the acquired wall entry has already been drawn by the other CPU and, if it hasn’t, draw it. Repeat until the list is exhausted.
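Sketched in code, one CPU’s Phase 6 loop might look like this, assuming a hypothetical AC_DRAWN “already claimed” flag and list bounds viswalls/lastwallcmd:

static void R_SegCommands(void)
{
    viswall_t *wall;

    for (wall = viswalls; wall < lastwallcmd; wall++)
    {
        int mine;

        /* claim the wall under the lock, draw it outside of it */
        R_LockSeg();
        mine = !(wall->actionbits & AC_DRAWN);
        if (mine)
            wall->actionbits |= AC_DRAWN; /* it's ours now */
        R_UnlockSeg();

        if (mine)
            R_DrawWall(wall); /* clipping, perspective, lighting, columns */
    }
}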

This works fine because the Doom renderer has zero overdraw and walls do not overlap, avoiding a situation where one CPU could overdraw the work of the other.

Once the list is exhausted, the main CPU will again wait for the secondary CPU to catch up (if necessary), and the secondary CPU will then wait for any further commands.

Phase 7

The same approach as in Phase 6, but used for drawing floors and ceilings: lock, consume, unlock and draw. Again, the CPUs will then synchronise and will not move on to Phase 8 until both CPUs are done with Phase 7.

Phase 8

Okay, the sprites are a bit trickier. We can’t use the same approach here, as sprites can overlap and objects closer to the camera can overdraw objects further away. We need to somehow ensure that whatever one CPU is drawing, the other can’t overdraw, and we also need to split the load of drawing pixels at least roughly equally.

There are multiple ways to tackle this problem, but D32XR does it by finding the x coordinate on screen that splits the sprite workload evenly, so that the areas to the left and to the right of this coordinate each contain 50% of all sprite pixels, sans the weapon. It would be natural to assume that the coordinate in question lies in the middle of the screen, but often this is not the case.
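A minimal sketch of how such a split point could be computed; the vissprite_t fields x1/x2/height and SCREENWIDTH are stand-ins, and a real implementation would likely use a cheaper cost estimate:

static int R_FindSpriteSplit(const vissprite_t *spr, int count)
{
    unsigned colpixels[SCREENWIDTH] = {0};
    unsigned total = 0, running = 0;
    int i, x;

    /* tally the approximate pixel cost of every screen column;
       the weapon sprite is handled separately and isn't counted */
    for (i = 0; i < count; i++)
    {
        for (x = spr[i].x1; x <= spr[i].x2; x++)
        {
            colpixels[x] += spr[i].height;
            total += spr[i].height;
        }
    }

    /* walk left to right until half of all sprite pixels are behind us */
    for (x = 0; x < SCREENWIDTH; x++)
    {
        running += colpixels[x];
        if (running * 2 >= total)
            break;
    }

    return x;
}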

In this scene, for example, the split coordinate is located in the left part of the screen, dividing the tree sprite roughly in half:

[Image: example scene with the sprite split point in its left half]

The main CPU will then proceed with rendering the sprites to the left of the split point, and the secondary CPU - the sprites to the right, clipping them along the way to avoid overdraw. This ensures that each CPU does a roughly equal amount of work.
