Mask of Destiny wrote:Might be worth a try though. For now, I'm just going to try to modify what we already have. It might be possible to shift the FM or PSG over to the ARM, although Heliophobe said it was too slow to emulate the PSG (as found in the SMS), so...
I find it hard to believe that the ARM in the Dreamcast is so slow that it can't generate three square waves and a noise channel in real time. I suppose it might be too slow to handle some of the DAC stuff some SMS games used it for, but even then...
Well, let's put it this way:
When I had Smeg running PSG on the ARM CPU, it was able to do sample-accurate PSG emulation, but only barely. I had originally written it in C, and that turned out to be too slow to keep up, so I reimplemented it in ARM asm. While I'm not an ARM master by any means, it was decently well optimized.
I switched it over to the SH4 side after Smeg 0.80, and as it used less than 2% CPU time even when playing samples, I decided not to sweat the loss.
So how fast is the ARM in practical applications? Someone should work out some benchmarks for it someday, since the 'official' docs are all over the map and all seem to be wrong, but I estimated it at about 3 MHz. I do know that the Yamaha sound chip runs at about 45 MHz.
This was the theory I came up with about why the ARM is so much slower than advertised - and it's all just theory from this point on, so don't take this as gospel:
The Yamaha sound chip (AICA hereafter) has 64 sound channels, which it is able to play at CD quality (44.1 kHz, i.e. 44100 samples per second) when running at 45 MHz.
The sound channels in the AICA are not all calculated at once, rather each channel gets a 'turn at the mic' in order. For each sample, every channel is calculated and its final output is added to an accumulator. When all 64 channels have been calculated, the accumulator is sent to the DAC which produces the sound we hear, and the process begins again with channel zero and an empty accumulator.
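To picture the 'turn at the mic' idea, here's a rough Python sketch of that mixing loop. It's only a model of the description above, not anything taken from AICA documentation; the function name and the idea of passing in one precomputed output value per channel are made up for illustration.

```python
NUM_CHANNELS = 64  # channel count as described above


def mix_one_sample(channel_outputs):
    """Hypothetical model: sum one output value per channel into an accumulator."""
    if len(channel_outputs) != NUM_CHANNELS:
        raise ValueError("expected one output value per channel")
    accumulator = 0  # empty accumulator at the start of every output sample
    for channel in range(NUM_CHANNELS):  # each channel takes its turn in order
        accumulator += channel_outputs[channel]
    return accumulator  # the value that would be handed to the DAC


# One output sample where only channel 0 is making any sound.
frame = [100] + [0] * (NUM_CHANNELS - 1)
print(mix_one_sample(frame))
```

Note that the loop visits all 64 channels whether or not they are producing anything, which is the behaviour the rest of this theory leans on.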
So, we have 64 channels running at 44.1 kHz. That means 64 channels * 44100 samples/second = 2822400 channel calculations per second. But how many cycles does it take the AICA to calculate a channel? Well, we've got to figure they decided to use a 45 MHz crystal for the sound system for a reason. Let's say it's 45,000,000 cycles a second (though it's probably not that exactly) so
45,000,000 / 2822400 = 15.94 =~ 16 (say, that's a nice round number.. in binary at least)
So let's say the AICA runs at 16 * 2822400 = 45158400 cycles/second.
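The same arithmetic as a quick Python check, in case anyone wants to plug in other numbers. The 45,000,000 figure is just the nominal guess from above, and the variable names are mine.

```python
SAMPLE_RATE_HZ = 44_100          # CD-quality sample rate
NUM_CHANNELS = 64                # AICA channel count
NOMINAL_CLOCK_HZ = 45_000_000    # assumed round figure, probably not exact

# One "slot" = one channel calculated for one output sample.
channel_slots_per_second = NUM_CHANNELS * SAMPLE_RATE_HZ

# How many clock cycles fit in a slot at the nominal clock?
cycles_per_slot_estimate = NOMINAL_CLOCK_HZ / channel_slots_per_second

# Guess that the real figure is the nearest whole number of cycles.
cycles_per_slot = round(cycles_per_slot_estimate)

# Work backwards to the clock rate that guess implies.
derived_clock_hz = cycles_per_slot * channel_slots_per_second

print(channel_slots_per_second)            # 2822400
print(round(cycles_per_slot_estimate, 2))  # 15.94
print(cycles_per_slot)                     # 16
print(derived_clock_hz)                    # 45158400
```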
Okay, so now let's assume it's correct that the AICA takes 16 cycles to calculate a channel. For at least some of these cycles, the AICA will need to access memory. At the very least, it will need one of those 16 cycles to fetch the sample data.
However, it seems to take a lot more than that. For each channel there are 32 4-byte registers, but a number of them are unknown or unused (or it's unknown whether they're unused). I'll assume these are held in RAM and fetched by the AICA rather than being directly mapped to AICA registers, since they seem to be read/write (the ARM can write them at word or byte boundaries, implying a read/modify/write sequence) and there are so many of them.
The AICA docs I'm looking at (probably out of date) suggest 11 of these have known functions, so the AICA is likely to fetch at least these 11 every time a channel comes up while doing its calculations, possibly more. Add the sample fetch and that means 12 or more of the 16 cycles may be taken up by AICA memory reads. Since the ARM7 doesn't have an instruction cache (or, at least the one in the GBA doesn't), it is effectively frozen while waiting for the sound chip to release the memory bus (as the sound chip must always have higher priority).
So, the worst-case scenario is that the ARM7 only gets 1 out of 16 cycles (in case the AICA ties up the bus for other things), so it functions at 2.8 MHz, which is awfully close to my estimate. Maybe it gets two or three or more out of sixteen, increasing the available time in 2.8 MHz increments.
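Here's that last step as a small Python sketch: the effective ARM speed for a given number of bus cycles left over per channel slot. It assumes the 16-cycles-per-channel guess from earlier and that the ARM does nothing useful on cycles where the AICA holds the bus; the function name is made up for illustration.

```python
CYCLES_PER_SLOT = 16                     # guessed cycles per channel calculation
CHANNEL_SLOTS_PER_SECOND = 64 * 44_100   # 2,822,400 channel slots per second


def arm_effective_hz(free_cycles_per_slot):
    """Hypothetical model: ARM cycles per second if it only runs on free cycles."""
    if not 0 <= free_cycles_per_slot <= CYCLES_PER_SLOT:
        raise ValueError("free cycles must be between 0 and 16")
    return free_cycles_per_slot * CHANNEL_SLOTS_PER_SECOND


for free in (1, 2, 3, 4):
    mhz = arm_effective_hz(free) / 1e6
    print(f"{free} of {CYCLES_PER_SLOT} cycles free: {mhz:.2f} MHz")
```

With one free cycle per slot this gives about 2.82 MHz; with four free (the case where exactly 12 of the 16 are taken by memory reads) it gives about 11.29 MHz.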
An interesting note is that I had observed such slow speeds while only using a single channel, so the AICA seems to do memory fetches (or at least tie up the bus) and calculations even when a channel is not active. There might be a way to completely disable the channels, memory fetch and all, and get more ARM time when the unused channels are up, but I have no clue if/how that could work.
Like I said, it's all speculation, but it would explain why the ARM seems to go so slow even though it's supposed to be a 25 or 28 or 45 MHz CPU.