Archive for November, 2015

Wolfenstein 3D on the PC-Engine

Posted in Uncategorized on November 25, 2015 by pcedev

The SNES has it. The Genesis now has it. But can it be done on the PCE? Or rather, can a game based on that style of engine be done? The glorious answer is: yes.

If you look at this logically, the PCE has enough CPU resource to pull off something along those lines. But the devil is always in the details. As the SNES proves, it's not so much about CPU resource as it is about the supporting pixel format. In the SNES's case, mode 7 gives you a packed-pixel format at one byte per pixel (256 colors). The Genesis is packed pixel as well, but at 4 bits per pixel (16 colors).

So what's the PCE got? Planar graphics. Planar graphics are not Wolf 3D engine friendly. Ahh, but there's a trick you can do. There's always a trick, right? You see, it's possible to set up the PCE in such a way that the tilemap becomes a bitmap display. The quick and dirty details: set the PCE res to 512x224. That's 64 tilemap entries wide. That's a bit low, so what if the low byte of each tilemap entry could be the index to two pixels? Two 4-bit pixels, to be exact. Now you have 128 "pixels" across that 64-entry-wide tilemap. You need a total of 256 tiles in VRAM, one per byte value, to show every two-pixel combination. Not bad, not bad at all. But there's more to it.

A PCE tilemap has a maximum size of 128x64. But it can be rearranged, assuming no scrolling, into a layout of 64x128. Now, remember what I said above about 64 tilemap entries equating to double the pixels? That means a 64x128 map gives a linear bitmap of 128x128. Those 128 pixels across are actually double wide, so they fill the screen.
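
To make the addressing concrete, here's a little C sketch of plotting one 4-bit pixel into that 64x128 BAT-as-bitmap layout. The vram_read/vram_write routines and the BAT/tile base addresses are hypothetical stand-ins, not the real setup:

    #include <stdint.h>

    #define MAP_W     64       /* BAT entries per row (512/8)       */
    #define BAT_BASE  0x0000   /* assumed BAT address in VRAM       */
    #define TILE_BITS 0x0100   /* assumed upper bits of a BAT entry */

    extern uint16_t vram_read(uint16_t addr);             /* stand-in */
    extern void vram_write(uint16_t addr, uint16_t data); /* stand-in */

    /* Plot one 4-bit pixel at (x, y) in the 128x128 pseudo-bitmap.
       The low byte of the BAT entry packs two pixels as two nibbles,
       selecting one of the 256 pre-built tiles. */
    void plot(uint8_t x, uint8_t y, uint8_t color)
    {
        uint16_t addr = BAT_BASE + (uint16_t)y * MAP_W + (x >> 1);
        uint8_t  b    = vram_read(addr) & 0xFF;

        if (x & 1) b = (b & 0xF0) | (color & 0x0F);        /* right pixel */
        else       b = (b & 0x0F) | ((color & 0x0F) << 4); /* left pixel  */

        vram_write(addr, TILE_BITS | b);
    }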

So now we have a linear bitmap, but zero room for double buffering. That sucks, because no one likes screen tearing. There are a few ways around this, but one very convenient one is the VRAM-to-VRAM DMA on the VDC side. According to some tests by another coder, which I haven't verified yet, if you set the VDC to high-res mode during vblank, the VDMA will transfer ~330 bytes per scanline. The understanding is that the VDC transfers two words per 8-pixel dot clock. I still need to verify this, but if it's true, then not only would you not need to keep a buffer in local RAM, you also wouldn't need to waste CPU cycles clearing that buffer or parts of it (rendering empty pixels to clear sections). VRAM also provides a self-incrementing pointer. With this kind of bitmap, that means you can do both vertical and horizontal sequential writing, which speeds up writing sequential data. Just to note, SATB VRAM DMA moves 84 words per scanline in low-res mode (the full 256-word SATB takes a little over 3 scanlines), so it's reasonable to think VRAM-to-VRAM DMA behaves the same (84 words is 168 bytes per scanline in low-res, and 336 bytes per scanline in high-res mode).
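
A quick back-of-the-envelope check of that claim, using the untested 336-bytes-per-scanline figure: does moving the whole 128x128 4-bit buffer around VRAM actually fit inside vblank? Trivial C, but it shows the margin:

    #include <stdio.h>

    int main(void)
    {
        const int buffer_bytes   = 128 * 128 / 2; /* 4bpp -> 8192 bytes  */
        const int bytes_per_line = 336;           /* high-res mode guess */
        const int vblank_lines   = 262 - 224;     /* 224-line display    */

        printf("lines needed: %d (of %d available)\n",
               (buffer_bytes + bytes_per_line - 1) / bytes_per_line,
               vblank_lines);                     /* -> 25 of 38 */
        return 0;
    }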

So now we have a fast bitmap, a free clear routine, and no need to transfer a local buffer to VRAM. Now, 2D raycasting is simple in design, but you still have the issue of pixel fill. You have to read a sliver of texture, at a specific scale factor, and copy it to a pixel column of the bitmap. This is what dictates the amount of CPU resource needed to draw the 3D view. The fastest method to write data to VRAM is embedded opcodes (the texture data stored as immediate-load instructions), but that immediately doubles the data in size. Looking at Wolf 3D as an example, the textures are 64x64 pixels. If you embedded them as opcodes, you would still need 32 pre-scaled vertical images: 64x64x2x32 is 256k of memory. That doesn't leave a whole lot of room for textures, and if the window height is 128, you need 64 versions, not 32. To make matters worse, on a fresh/clean bitmap the first pass of pixels simply writes, but since this is a double-nibble format, the second column's writes need to be OR'd against the previous data. Doing that as embedded code against all the scaled frames.. is going to be a horrendous storage requirement.

The normal approach is just to store the texture as normal bitmap data (one nibble per byte). So: a slower approach, but with prescaled data. This is still going to be a fair amount of data, and it's going to be slower. So what can be done to speed this up while staying reasonable on storage space? When in doubt, flip the problem on its head. What if the data, the bitmap texture, remained fixed in size... but the code that reads it differed depending on the output size needed? Let that sink in...

If the texture data is aligned to specific bank boundaries and offsets, then you can create a series of pre-calculated code paths that read that data from the logical address (bank mapped) and write it directly to VRAM. No indexing needed. No indirection needed. Simply LDA addr / STA $0002 / ST2 #fade. There's no looping. There's no check system for skipping a read (pixel). Everything is hard coded as specific paths. It's brilliant. No, I'm not the first to think of this idea (pre-calculated code paths), but it did occur to me that this style of rendering engine would really benefit from it. Yes, the code is going to bloat in size, but now I can store lots of different textures in ROM.

The catch here is that you need two sets of code paths and two copies of each texture. One copy has the nibble stored in the left side of the byte (bits 4-7), and the other has the nibble stored in the right side (bits 0-3). The reason: on the first pass, the pixel data is just written as-is to VRAM (the bitmap buffer). All even columns of pixels are like this, nice and fast. But the odd columns need to OR their nibble against the even column's data already there. That's only +3 cycles, thanks to the TSB opcode (TSB $0002), so the average overhead is just +1.5 cycles per pixel. That's still really good. Not only do I have sequential access, vertical in this case, but I also have a fast means of read-modify-write operations.
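
Here's roughly what a generator for those code paths might look like, as host-side C. This is just my sketch of the idea, not actual working tooling: emit one unrolled path per wall height, nearest-neighbor stepping through a bank-mapped 64-texel column, with the STA variant for even columns and the TSB variant for odd ones. The $6000 window and the printf-as-emitter are assumptions:

    #include <stdio.h>

    #define TEX_H 64  /* texels per texture column */

    /* Emit one hard-coded column path for a given on-screen height. */
    void emit_column_path(int height, int odd_column)
    {
        printf("path_%d_%c:\n", height, odd_column ? 'o' : 'e');
        for (int dst = 0; dst < height; dst++) {
            int src = dst * TEX_H / height;     /* nearest neighbor    */
            printf("    lda $6000+%d\n", src);  /* bank-mapped texel   */
            if (odd_column)
                printf("    tsb $0002\n");      /* OR into even column */
            else
                printf("    sta $0002\n");
            printf("    st2 #fade\n");          /* distance subpalette */
        }
        printf("    rts\n");
    }

Something like: run this once per height and per nibble alignment, assemble the output, and at render time jump to the right path with the texture column's bank mapped in. Since a column's height already implies its distance, the fade value could plausibly be baked into each path as well.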

Did you notice that ST2 #fade opcode above? Since the pseudo bitmap is only 16 colors, I can use all 16 subpalettes for precalculated fades of those 16 colors; 16 fade levels. Since I already know the distance of the wall slice from the camera, I can use this to do 3D light shading. That's pretty freaking cool. What about objects? I'm still in the planning stages for that, but I can treat them as simple texture overlays. And I can optimize for horizontal or vertical rendering, whichever is faster for the object design. The objects can be overlaid with a fade-distance subpalette as well, on a per-pixel basis. Oh, and the weapon or hand can be a sprite.

I ran some numbers, and a full texture read-out (max height across the screen) to a 128x128 screen is ~2.3 frames. So 20fps, with room to spare in that last frame. To get an idea here: 3 frames is 20fps, 4 frames is 15fps, 5 frames is 12fps, 6 frames is 10fps. I think the ideal place to be would be between 15fps and 12fps, with a decent amount of action and objects on screen. I should note that the max-height case, the player facing a wall up close, is pure pixel fill rate. An open, normal area would actually yield a higher frame rate.. up to 30fps (without objects). Another correction, too: that 2.3 frame number assumes the wall textures are 128 real pixels tall. If the game were limited to 64-pixel-tall textures (like the real Wolf 3D game), double-pixel write mode kicks in and drops the overall cycles per pixel written considerably. It would be less than 2 frames of render time (more like 1.33 frames), which brings 30fps within reach. Double-tall pixels (scaled-up textures) get a boost in pixel fill rate. Of course, that's just pixel fill. It doesn't include the 2D raycast or the small overhead of the hsync routine that repositions the map as a bitmap display, but even given both of those, it should just about make it at 30fps.

ADPCM this time..

Posted in Uncategorized on November 24, 2015 by pcedev

Black Tiger made the comment that if HuCards had had enough storage, they could have used streaming voices for cinemas, using ADPCM and 10-bit paired-channel output.

That got me thinking: is that really feasible? And the answer is yes, yes it is. I previously put out a demo playing two songs, one at 20khz and the other at 33khz, but neither was interrupt driven. So it got me wondering: what kind of acceptable ADPCM playback can I get from timed interrupts? What kind of resource am I looking at? Storage-wise, ADPCM is 4 bits per sample. 4 bits decompressing to a 13-bit output range is pretty decent IMO (clipped to 10 bits for the paired channels).

The mednafen author wrote the decompressor, and I've modified it slightly with a few case optimizations, but otherwise it's pretty fast. For 15.3khz (not 15.7khz) output, I'm looking at 50-55% CPU resource. And that's the normal, non-self-modifying code version. Everything is contained within the VDC interrupt routine, so it's self-managing. That's always nice, because the other option is buffer fill plus buffer read, and that gets tricky with timing.
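
For reference, this is the textbook OKI-style 4-bit ADPCM step that decoders like this are built around. It's a plain C model for clarity, not mednafen's code or my 6280 routine, and the exact accumulator width varies between chip variants (12-bit here). Note the saturation, which seems to be exactly the part the original CD hardware skips (more on that below):

    #include <stdint.h>

    static const int16_t step_table[49] = {
        16,17,19,21,23,25,28,31,34,37,41,45,50,55,60,66,73,80,88,97,
        107,118,130,143,157,173,190,209,230,253,279,307,337,371,408,
        449,494,544,598,658,724,796,876,963,1060,1166,1282,1411,1552 };
    static const int8_t index_adjust[8] = { -1,-1,-1,-1, 2, 4, 6, 8 };

    static int16_t sample = 0;  /* signed accumulator */
    static int8_t  index_ = 0;  /* 0..48 into step_table */

    int16_t adpcm_decode(uint8_t nibble)
    {
        int step  = step_table[index_];
        int delta = step >> 3;
        if (nibble & 1) delta += step >> 2;
        if (nibble & 2) delta += step >> 1;
        if (nibble & 4) delta += step;
        if (nibble & 8) sample -= delta; else sample += delta;

        if (sample >  2047) sample =  2047;  /* saturate, don't wrap */
        if (sample < -2048) sample = -2048;

        index_ += index_adjust[nibble & 7];
        if (index_ < 0)  index_ = 0;
        if (index_ > 48) index_ = 48;
        return sample;  /* top bits go out the paired 10-bit channels */
    }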

So this soft-playback ADPCM streaming sounds great at 20khz, but what does it sound like at 15khz? Hopefully pretty decent. From what I've heard, in comparison to ADPCM on the CD unit itself, this soft-playback routine actually seems to sound better. It might have to do with how the original ADPCM chip in the PCE CD unit is 10-bit output too, but wraps and overflows rather than saturating at the positive or negative amplitude limits (i.e. does it clip at 10 bits, or at 12 bits but output 10?). Or maybe it's something else, such as a filtering difference between the PCE audio circuit and the ADPCM output circuit of the CD unit.

Typically, CD games use 8khz ADPCM output for sound FX, and sometimes streaming.

So where is all this going? Well, I have an SF2 mapper and a flash card.. and if I reserve 2048k just for streaming audio, I can do a small demo (shmup) with streaming music. Reserving the lower 512k for the game/demo itself, that 2048k gives me only 274 seconds to work with (at 15.3khz, 4-bit ADPCM streams at roughly 7.5k per second, and 2048k / 7.5k is about 274 seconds). 274 seconds isn't a lot, but I can loop tracks. At a minimum, I would need two level tracks and a boss track. Optimally, though, I'd want a fourth ending track. So something like three 70-second tracks and one 64-second track. Or whatever; how it's divided up isn't really an issue.

I spent yesterday reworking the ADPCM routine into a VDC interrupt routine. I also picked out two levels from two shooter games on other consoles. The demo is going to be a simple vertical shmup/shooter. I was toying with the idea of the canyon level of Musha and the 3D fire level of Axelay, with the Axelay level preceding the Musha level (kinda makes sense). The graphics won't be exact, but the effects will be similar. I plan to rip other enemy sprites from vertical shmups too, and probably do a different boss for the Musha stage. I have 512k to work with for graphic assets. For both the Axelay 3D level and the Musha canyon stage, I spent quite a bit of time doing calculations for the effects, as well as redesigning the approach to those effects (with 60fps in mind). It'll be kinda tight, but I've worked with worse.

As for the PCM engines, I did some work on those as well. The first XM player is done; I'll probably release a very simple demo for it, and then one with a song demo afterwards.

But back to Black Tiger's ponderings: if you did 7khz ADPCM for voice, then that's 3.5k per second. If you reserved 512k of ROM for ADPCM, that gives 150 seconds of speech or other audio. If you used the PSG for music and some sound FX, you could easily put together cinema audio tracks. The silence between spoken lines or other audio parts doesn't need to be stored. Cinemas don't take a whole lot of resource; I could even do realtime linear interpolation to play that 7khz on a 15khz output.

But all this talk about compression makes me wonder how some other compression schemes out there would sound. Maybe something needing less CPU resource than ADPCM. Something like range-encoded delta PCM in block segments (kinda like the SNES).

More PCM player stuffs

Posted in Audio on November 16, 2015 by pcedev

The wave conversion tool is up and running, and looping support is working flawlessly. It's forward looping only, but I'm gonna add ping-pong loop support soon. Ping-pong support will be hard coded into the wave data itself, since it's too much CPU overhead, or work, to change how the PCM frequency driver works.

So the tool outputs a specialized format for the player, with a small header containing the loop points. Looping on the player side doesn't take any additional CPU resource, which is nice.

Did you know that Batman on the PCE uses precalculated frequency-scaled waves of a single bass guitar instrument, and actually has a loop-point section? There's the attack part and the loop part of the waveform, which means a tiny sample can be made out to be a really long sound. IIRC, there are 2 octaves, which means a total of 24 notes, or 24 samples. It gives the bass guitar instrument a nice punchy sound that the normal PCE channels just don't quite reach when emulating/modeling it (although they do a good job).

Anyway, more on the player itself. Octaves follow a formula of 2^a, with a being the octave, which means the rate of change increases between octaves. Notes subdivide the range between octaves, and they follow the same exponential form: the frequency becomes something along the lines of 2^(a+(b/12)), with b being the note (ranging from 0 to 11). The frequency distance between each note increases as the frequency increases. It's not linear.

Since octaves are an exponential function with a base of 2, I can simply binary-shift a frequency to move it between octaves. Therefore I only need to store 12 note frequencies in a table for one octave, and the rest can be derived from there. But I need more than just notes and octaves; I need to be able to slide a frequency up and down. So I expanded the table from 12 notes to 32 frequency steps between each note, for a total of 384 entries. Still not bad. Not only do I now have frequency-sliding control with precision that I can track, but I also now have a method of fine tuning.

It works like this: O:N:S. O is the octave, N is the note, and S is the step. S ranges from 0 to 31; any carry/borrow gets added to/subtracted from N. N ranges from 0 to 11, and its carry/borrow affects O. O ranges from 0 to 7, with 3 being 1:1, or rather no binary shifting. I build the frequency divider from the O:N:S number, but I only need to do this when there's a change in frequency. And when it is performed, it's pretty fast. There's no multiplication: only one table fetch, a few shifts, and one addition (finetune). The other nice thing is that the inter-note frequency steps are the same ratio for any octave. So if you do a vibrato effect, it has the same strength whether the note is high pitch or low pitch, unlike period-based music players, which rarely ever compensate for this. Under period-based players this presents a problem: if you have an instrument with a vibrato effect going and you slide (portamento-to-note) to a higher-frequency note, the vibrato that sounded perfect might sound too extreme at the higher frequency. You would have to compensate by having a function scale back the vibrato strength while going up in frequency, and that would be trial and error (I have yet to see a music driver do this, but it's doable). So this resolves that issue.
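
In C terms, the divider build looks something like this. A sketch only: I'm assuming the 384-entry table holds octave 3's values and that each octave up or down is one binary shift (the shift direction would flip if the table stored phase steps instead of period dividers). The real driver does this in a handful of 6280 instructions:

    #include <stdint.h>

    extern const uint16_t freq_table[12 * 32]; /* one octave: note*32+step */

    uint16_t build_divider(uint8_t o, uint8_t n, uint8_t s, int8_t finetune)
    {
        uint16_t f = freq_table[n * 32 + s] + finetune; /* fetch + one add */

        if (o > 3) f >>= (o - 3);  /* each octave up: halve the divider */
        else       f <<= (3 - o);  /* each octave down: double it       */
        return f;
    }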

Anyway, the player (driver) is done, but I'm building what amounts to a small music engine to demo it, so that takes time. Plus, I'm trying to modularize this into sections: the driver, interface support, and a music engine example. I need to keep them all separate so it's easy to pick and use just what you need (assuming anyone uses this).

Do I expect this to revolutionize PCE sound? No. This is more of a proof of concept. There are definitely strengths and weaknesses to this approach. Some samples will sound dirty or unpleasant, so it really depends on the sample itself, and I believe that puts a limit on how useful this is. The 5-bit resolution of the PCM is another issue; for some samples it's fine, and for others it can be hissy or noisy. Techniques like a 1- or 2-point (difference) volume map to keep noise at a minimum might help, but it really depends on the specific waveform. That said, I think the 6-bit PCM player (the other approach) has more promise than this one, but this approach, this player, is simpler in both execution and interfacing. I'm also reusing some stuff here for that other driver.

Scaling the input frequency

Posted in Uncategorized on November 11, 2015 by pcedev

So.. I'm writing this wave file converter, adding all kinds of support and options, when I came across the issue of storing multiple samples of the same origin but at different octaves. This is an optimization technique to help retain frequency ranges within an instrument.

So, to visualize this: the main driver always outputs at 7khz. It doesn't matter what the input frequency is; the output frequency is fixed. Now, to scale a frequency on the fly, the fastest way is nearest-neighbor scaling. If you think of it in terms of graphics, a waveform is just a single scanline. That's it. One scanline. The brightness of a pixel is the amplitude of the waveform. And how does nearest-neighbor scaling work on a fixed-resolution scanline? You either repeat pixels or skip pixels, depending on whether you want to inflate or shrink the image on that single scanline.
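
In code, that skip/repeat business is just a fixed-point phase accumulator. A minimal C sketch of the concept, not the actual driver (which does this incrementally, one output sample per timer tick):

    #include <stdint.h>
    #include <stddef.h>

    /* step is 8.8 fixed point: 0x0100 = 1:1, 0x0200 = 2:1 (skip every
       other source sample), 0x0080 = 1:2 (each source sample repeats). */
    void resample_nn(const int8_t *src, size_t src_len,
                     int8_t *dst, size_t dst_len, uint16_t step)
    {
        uint32_t phase = 0;
        for (size_t i = 0; i < dst_len; i++) {
            size_t s = phase >> 8;                /* nearest neighbor */
            dst[i] = (s < src_len) ? src[s] : 0;  /* pad past the end */
            phase += step;
        }
    }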

Anybody who's worked in photo editing software has seen this effect first hand. But there's an even simpler example: SNES mode 7. We're all aware of what happens when the SNES scales up: it gets blocky. But pay attention when the image is at a point smaller than 1:1, i.e. in a shrunken state. The pixels become distorted. This is because the pixels of the shrunken image cannot appear in between the real pixels of the 256-wide resolution. One option is to increase the horizontal resolution so the steps become finer. The SNES obviously doesn't have that option. But there still comes a point where the image shimmers as it moves in and out of zoom. That's where fancier techniques come into play, interpolating between pixels and distributing the difference.. etc.

Audio works in much the same way, but our brains are more forgiving of sound anomalies than of visual ones. So what's the issue here? And how can we solve it?

First, the issue is this: the output frequency of the PCE TIMER driver is 7khz MAX. If you scale a waveform to play at a frequency below 7khz, it'll play just fine with all the data intact (no missing samples). But here's the catch: you get no resolution benefit from those repeated samples. In other words, you are not working with the optimal frequency band of the 7khz output. If I scale a waveform that has a 1:1 rate of 7khz down to a 1:2 rate of 3.5khz.. that 7khz main driver gives it zero benefit. Quite the opposite; I'm losing potential frequency resolution on output.

Now, this needs to be understood in a larger context. The idea is to avoid issues when the input frequency is higher than the output frequency: anything greater than 1:1 (like 2:1, 4:1, etc). When this happens, you get all sorts of frequency anomalies, as well as unintended reflections back into the output (Nyquist-frequency aliasing artifacts). So one approach is to store an instrument sample at multiple octaves. When you move out of one octave range, you switch to a different sample in the group. It cleans up the sound and removes those audio artifacts from the scaling routine. There's also the fact that nearest-neighbor scaling takes nothing into account; you can be destroying potential frequency content simply by skipping <n> samples. If you properly resample with an external app, you can get those frequency ranges back. Well, to a point, but it's so much better than simply skipping samples. Think of this as mip-mapping of textures in 3D graphics. It's much the same application, although for different reasons.
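
Selecting from the mip set is then just a matter of halving the playback ratio until it's at or below 1:1. A hypothetical sketch, assuming each mip level is the instrument pre-resampled one octave down from the previous:

    #include <stdint.h>

    /* mips[k] is the instrument resampled one octave lower than
       mips[k-1]; ratio is 8.8 fixed point (0x0100 = 1:1). */
    typedef struct { const int8_t *data; uint16_t len; } Wave;

    const Wave *pick_mip(const Wave *mips, int count, uint16_t *ratio)
    {
        int k = 0;
        while (*ratio > 0x0100 && k < count - 1) {
            *ratio >>= 1;  /* next mip is pre-halved, so halve the ratio */
            k++;
        }
        return &mips[k];
    }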

Ok. So we have an approach that fixes upward scaling (shrinking) by mip-mapping the octave range of a given instrument. Now the second issue arises. If all samples in a mip-map set are 1:1, then the difference from (octave+1) to (octave) is the frequency divided by 2. So you start at 7khz from the top, and as you go down in notes (notes that approach the octave one step below the current one), the input frequency falls below 7khz. Normally, if the output driver were of high enough frequency, this wouldn't be that big of an issue. But 7khz is pretty low as it is, and you definitely want to keep every single Hz of that driver output at such a low rate. Linear interpolation is based on the slope between two points: (f(x+a) - f(x)) / a. It's the slope formula, the change in Y over the change in X. In this case the change in X is one, so the division is redundant. Using the delta symbol: (f(x+Δx) - f(x)) / Δx.

Doing linear interpolation on the PCE isn't difficult, but doing it in real time still requires additional steps per sample. If a sound engine is approaching 20%, 30%,.. 50% CPU resource, you want to save as many cycles as you can. The idea, instead, is to encode this one-sample interpolation into the waveform itself. Where would you put it? In between the two samples it's derived from. This automatically doubles the waveform in size, but more importantly, a waveform doubled in size and played back at a fixed frequency is the original waveform played back one octave lower (at half speed). That doesn't help us directly, but what if the input driver (the frequency scaler) always skipped two samples, regardless of the waveform? Or rather: if you want 1:1 playback of a waveform, you set the input driver to 2:1. Since the frequency content of the waveform is doubled, and the input driver now defaults to 2:1 for the top frequency, the original waveform plays back without any artifacts, even though we're above the Nyquist limit of the output driver.
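
Baking the interpolation in is a one-pass job in the converter. Something like this, in C (the loop-point handling at the end is glossed over here):

    #include <stdint.h>
    #include <stddef.h>

    /* Output is twice as long, with each linear midpoint stored
       between the two samples it is derived from. */
    void embed_lerp(const int8_t *src, size_t len, int8_t *dst /* 2*len */)
    {
        for (size_t i = 0; i < len; i++) {
            int a = src[i];
            int b = (i + 1 < len) ? src[i + 1] : src[i]; /* or loop point */
            dst[2 * i]     = (int8_t)a;
            dst[2 * i + 1] = (int8_t)((a + b) >> 1);     /* midpoint */
        }
    }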

The benefit is realized here: if the top limit of the input driver is n:1, with n=2, then as n approaches 1... the interpolated samples get played back instead of repeated samples. At 5-bit sample depth it might not have a huge impact on the output, but as you approach 8-bit resolution this becomes more significant. 6 bits is twice the resolution of 5 bits in audio output. That's a lot.

So anyway, this approach maximizes the output quality of an instrument sample as you work your way through the mip-map set. The only question now is: what's better than linear interpolation? What would make a smoother transition from one mip-map sample to another? Maybe actually embedding the sample below it into the gaps of the sample above it? Of course, those samples would have to be resampled separately before being inserted back into the "gaps", else it just becomes the original again. I'm not sure which method is better. Maybe a blend of the sample below with the linear interpolation, weighting the blend as it approaches the lower octave. I'd have to do some tests to see if there's an audible benefit.

This is one of the features I'm working on adding to my wave converter. I was going to do the resampling myself, on the input waveform, but Cool Edit Pro does such a nice job for me. It wins out on sheer laziness factor. I just added the option to either generate the linear interpolation or embed a second input wave file.

PCM engines

Posted in Uncategorized on November 9, 2015 by pcedev

I've already stated that I'm redoing the 4-channel XM engine, and it's up and running with a few looped notes until I finish parsing a particular mod file for a simple demonstration. But I also have another engine that I threw together over a couple of hours on the weekend: an 8-channel static PCM player. 8 PCM channels at 6 bits (higher than the native 5 bits), and it still leaves 4 normal PCE channels free. Yeah, a total of 12 channels.

The second engine requires more support, though. The first engine only needed a small 384-word table. The second engine, because all the PCM channels are mixed in software, requires volume tables; multiplying each sample is way outside the scope of the PCE, and tables do the work faster anyway. The PCM format also needs to be 2's-complement numbers. The PCM data might be 6-bit, but it's stored in 8-bit format as signed values. Maybe this is overkill, but it just feels cleaner than risking any side effects from data that isn't centered (on a relative centerline). I've done mixing in software before with unsigned samples, and the centerline moves around (the waveforms still accumulate the same). It just doesn't sit well with me, so 2's-complement signed format it is. But that means the volume table has to include all 256 entries even though only 6-bit values are used. At 32 levels of volume control, that's an 8k table. It doesn't need to be in RAM; it fits anywhere in ROM.
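
Generating that 8k table is straightforward converter-side work. A sketch, assuming 32 linear volume levels over the full signed byte range (the real data only ever holds 6-bit values, but indexing by the whole byte keeps the lookup a single operation):

    #include <stdint.h>

    int8_t vol_table[32][256];  /* 32 volume pages x 256 input bytes */

    void build_vol_table(void)
    {
        for (int v = 0; v < 32; v++)
            for (int s = 0; s < 256; s++) {
                int8_t in = (int8_t)s;  /* reinterpret byte as signed */
                vol_table[v][s] = (int8_t)((in * v) / 31);
            }
    }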

But yeah, it needs more support/tools around it. I had to write a wave file converter in C. I needed to make one anyway, so this isn't a total waste of time. So what does this engine eat up, CPU-wise? About 21-22% CPU resource. I kid you not. 8 PCM channels, at 6 bits instead of 5, with 4 PCE channels still left over, all cheaper than what Air Zonk spends to play a single PCM channel. Yeah, Air Zonk has a horrible PCM routine that eats up 30-33% CPU resource. I couldn't believe it, but I checked it about 20 times over, and each time it was 30 to 33% (33% when it has to fetch a new sample to bit-shift, 30% when it's just playing that sample).

Keep in mind, none of these 8 channels in the second engine scale in frequency like the first engine's do. It's actually nothing super special or radical: 8 channels are soft-mixed, with volume control for each channel, into a single buffer. That buffer is played using two PCE DDA channels to output 10-bit audio. That's it. The other downside is that it's mono. If you want stereo, you have to take away another two PCE channels for a second 10-bit paired output. Not enough channels, you say? Want stereo, you say? Well, bump that number up to 37.3% CPU resource and get 16 PCM channels, in stereo, with 2 regular PCE channels still left over. For an extra 5% CPU resource on top of that, I could make 4 of those stereo channels frequency-scale XM style. So many options...
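
For the curious, the per-sample work amounts to something like this C model: eight table lookups accumulated, then the sum split across the two DDA channels. The channel struct and the exact 10-bit split are my assumptions about the layout, not the engine's actual code:

    #include <stdint.h>

    typedef struct { const uint8_t *wave; uint16_t pos; uint8_t vol; } Chan;

    extern const int8_t vol_table[32][256];  /* see the table above */

    void mix_one_sample(Chan ch[8], uint8_t *dda_coarse, uint8_t *dda_fine)
    {
        int acc = 0;                        /* 8 x 6-bit signed: 9-bit sum */
        for (int i = 0; i < 8; i++)
            acc += vol_table[ch[i].vol][ch[i].wave[ch[i].pos++]];

        int out = acc * 2 + 512;            /* bias/scale into 0..1023     */
        *dda_coarse = (uint8_t)(out >> 5);  /* top 5 bits: full-volume ch  */
        *dda_fine   = (uint8_t)(out & 31);  /* low 5 bits: attenuated ch   */
    }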

That's ridiculous! What the hell would someone do with 16 6-bit stereo PCM channels!? But hey, 18 channels on the PCE would make a great demo, no? Bragging rights and all that sort of thing, I guess. Does the PCE have enough power to do more than 18 channels? Don't ask. But yes. It does.