The SNES has it. The Genesis now has it. But can it be done on the PCE? Or rather, a game based on that style of engine? The glorious answer is: yes.
If you look at this logically, the PCE has enough cpu resource to pull something off along those lines. But the devil is alwaaayyyss in the details. As the SNES proves, it’s not so much about the cpu resource as it is the support format. In this case, for the SNES, mode 7 allows byte pixel format (256 colors per pixel). Packed pixel format, to be more precise. The Genesis does as well, but that format is 4bits (16colors).
So what’s the PCE got? Planar graphics. Planar graphics is not Wolf 3D engine friendly. Ahh, but there’s a trick you can do. There’s always a trick, right? You see, it’s possible to setup the PCE in such a way that the tilemap becomes a bitmap display. The quick and dirty details: set the PCE res to 512×224. That’s roughly 64 tilemap entries wide. That’s a bit low, so what if a single byte in the lower LSB of the tilemap could be the index to two pixels? Two 4bit pixels to be exact. Now you have 128 “pixels” in that 64 wide tilemap screen/res. You need a total of 256 tiles in vram to show these pixels. Not bad, not bad at all. But there’s more to it.
A PCE tilemap has the largest size of 128×64. But it can be rearranged, assuming no scrolling, to a layout of 64×128. Not remember what I said above about the 64 tilemap entries equating to double the pixels? That means a 64×128 map gives a linear bitmap of 128×128. The 128 pixels wide are actually double wide, so they will fill the screen.
So now we have a linear bitmap, but zero room for double buffering. That sucks, because no one likes screen tearing. There are a few ways around this, but one very convenient way around this is the vram-vram DMA on the VDC side. According to some tests by another coder, which I haven’t varied yet, if you set the VDC in high res mode during vblank – the VDMA will transfer ~330 bytes per scanline. The understanding is that the VDC is transfer two words per 8 pixel dot clock. I still need to verify this, but if this is true – that means not only would you not need to keep a buffer in local ram, but you also don’t need to waste cpu cycles clearing that buffer or parts of it (render empty pixels to clear sections). Vram also provides a self incrementing pointer. With this kind of bitmap, this means you could do both vertical and horizontal sequential writing. This speeds up writing sequential data. Just to note, SATB vram DMA is 84 words per scanline in low res mode ( a little bit over 3 scanlines) – so it’s reasonable to think that it would be the same for vram-vram DMA as well (84 words is 168bytes in low res per scanline, and 336 in high res mode per scanline).
So now we have a fast bitmap and free clear routine and not need to transfer local to vram buffer. Now 2D raytracing is simple in design, but you still have an issue of pixel fill. You have to read a sliver of bitmap, at as specific scaled factor, and copy it to a pixel column of the bitmap. This is going to dictate the amount of cpu resource to draw the 3D view. The fastest method to write data to vram is embedded opcodes. But that immediate doubles the data in size. Looking at Wolf 3D as an example, the textures are 64×64 pixels. If you embedded them as opcodes, you would still need to have 32 pre-scaled vertical images. 64x64x2x32 is 256k of memory. Doesn’t leave a whole lot of room for textures, plus if the window height is 128 – you need 64 versions not 32 versions. To make matters worse, on a fresh/clean bitmap the first pass of pixels simply writes – but since this a double nibble format the second column write needs to be OR’d against the previous data. Doing that as embedded code against all the scaled frames.. is going to be a horrendous storage requirement.
The normal approach, is just to have the bitmap stored as normal bitmap data (nibble stored in byte format). So take a slower approach but with prescaled data. This is still going to be a fair amount of data, and it’s going to be slower. So what can be do to speed this up, but be reasonable when it comes to storage space? When in doubt, flip the problem on its head. What if the data, the bitmap texture, remained fixed in size… but the code that read it was different depending on the output size needed? Let that sink in…
If the bitmap data is aligned to specific bank boundaries and offsets, then you can create a series of pre-calculated code paths that look to read that data from the logical address (bank mapped) and write it directly to vram. No indexing needed. No indirection needed. Simple LDA addr / sta $0002 / st2 #Fade. There’s no looping. There’s no check system for skipping a read (pixel). Everything is hard coded as specific paths. It’s brilliant. No, I’m not the first to think of this idea (pre-calculated code paths), but it did occur to me that it would really benefit from this rendering style engine. Yes, the code is going to bloat in size, but now I can store lots of different textures in rom.
The catch here, is that you need two sets of code paths and two bitmaps. One bitmap has the nibble stored to the left side of the byte (bits 4-7), and the other bitmap has the nibble stored on the right side of the byte (bits 0-3). The reason for this is on the first past, the pixel data is just written as is to vram (bitmap buffer). All even columns of pixels are like this – nice and fast. But the odd columns need to OR together the second nibble with the even column. This is only +3 more cycles, thanks to the TSB opcode (TSB $0002). So the average is just +1.5 pixel overhead. That’s still really good. Not only do I have sequential access, vertical in this case, but I also have a fast means of READ-MODIFY-WRITE operations.
Did you notice that ST2 #fade opcode above? Since the pseudo bitmap is only 16 colors, I can use all 16 subpalettes for precalculated fades of those 16 colors; 16 fades. I already know the distance of the text from the camera, I can now use this to do 3D light shading. That’s pretty freaking cool. What about objects? I’m still in the planning stages for that, but I can treat them as simple texture overlays. And I can optimize for horizontal or vertical rendering – for whatever is faster for the object design. Also, the objects can overlaid with a fade distance subpalette as well on as per pixel basis. Oh, the weapon or hand can be a sprite.
I ran some numbers and a full texture read out (max height across the screen) to a 128×128 screen, is ~2.3 frames. So 20fps with room to spare in that last frame. To get a idea here, 3 frames is 20fps, 4 frames is 15fps, 5 frames is 12fps, 6 frames is 10fps. I think the ideal place to be would be between 15fps and 12fps, with decent amount of action and objects on screen. I should note here that the max height, player facing a wall up close, is pure pixel fill rate. An open, normal, area would actually yield a higher frame rate.. up to 30fps. (without objects). Another correction too; that 2.3 frames number assumes the wall texture is 128 real pixel tall. If the game was limited to 64 pixel tall textures (like the real Wolf 3D game), double pixel write mode kicks in and drops the overall cycles per pixel write at a much lower rate. It would be less than 2 frames per second (more like 1.33 frames) at 30fps. Double tall pixels get a boost (scaled up textures) in pixel fill rates. Of course that’s just pixel fill. That doesn’t include 2D raytrace or the small overhead of the hsync routine to reposition the map as a bitmap display – but given both to those, it should just about make it in 30fps.