Genesis Plus/DC Emulation Discussion

Post by **BlackAura** » Tue Jul 19, 2005 11:40 pm

There are basically two things [preventing full speed Genesis emulation on the Dreamcast]...

First, the major thing holding us back at the moment is sound emulation. It's slow. Really slow. We're going for something that sounds like a real MegaDrive, unlike Sega's Smash Pack (which sounded more like a Master System). Enabling sound slows things down. A lot.

We have two versions of Genesis Plus. We have a more accurate version that draws the graphics in software mode, has far fewer graphics glitches, but runs slower. We also have a faster version, which uses the Dreamcast's 3D hardware to draw the graphics, but has major graphics glitches.

With sound disabled, both of those emulators run at full speed, or so close to full speed that you can't really tell the difference.

With sound enabled, the software version slows down a lot (which you can compensate for with frameskipping, but I find that unacceptable). The hardware version runs very close to full speed (generally 60FPS, but it has to drop the occasional frame).

So the second thing is that it has already been done, partially, and is being done.

And before someone starts jumping up and down, and shouting "Why don't you release it!", just get the latest version (the "Preview 3" version). It's not the same as the current version, but it's still pretty good. If you disable the sound, it runs full speed on just about anything you care to throw at it.

Post by **DcSteve** » Thu Jul 21, 2005 6:36 pm

A new Z80 core has been in the works by FOX68K and more information on it has been released today. I wonder if that core will be a necessity for genesisplusdc or if CZ80 is fast enough.

Post by **BlackAura** » Thu Jul 21, 2005 8:58 pm

DcSteve wrote: I wonder if that core will be a necessity for genesisplusdc or if CZ80 is fast enough.

In hardware rendering mode, CZ80 is fast enough. So is C68K, actually (using FAME makes virtually no difference). The sound emulation itself makes a much bigger difference, and the video rendering (specifically, the Dreamcast's video hardware) slows down sometimes.

Post by **Quzar** » Thu Jul 21, 2005 9:05 pm

BlackAura wrote: So is C68K, actually (using FAME makes virtually no difference).

exactly! and nobody believed me when i said that it didn't speed NeoDC up at all. It's probably because it's just a translated version of the x86 FAME. Same will probably be true for FAZE since it's gonna be an sh port of RAZE and not built from ground up for the SH4.

Post by **BlackAura** » Thu Jul 21, 2005 11:35 pm

It's not that FAME isn't fast - it is. However, so is C68k. The improvement FAME makes over C68k is very small, because the rest of the system emulation is taking up so much of the CPU time. An extra 10% speed increase doesn't make a lot of difference when your CPU emulator isn't the slowest part of the program anyway.

At the moment, the Z80 isn't really slowing things down - even underclocking the Z80 doesn't make as much difference as it used to. If cutting the clock speed in half doesn't speed things up, having a CPU emulator that's twice as fast isn't going to speed things up either.
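The reasoning here is essentially Amdahl's law. A minimal sketch, assuming purely illustrative time splits (neither fraction below was measured on GP/DC; `overall_speedup` is a hypothetical helper):

```python
def overall_speedup(fraction, component_speedup):
    """Amdahl's law: overall speedup when one component, taking `fraction`
    of the total time, becomes `component_speedup` times faster."""
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# Hypothetical split: the Z80 core takes 10% of each frame's CPU time.
print(round(overall_speedup(0.10, 2.0), 3))   # Z80 core made twice as fast

# Hypothetical split: sound emulation takes 50% of each frame's CPU time.
print(round(overall_speedup(0.50, 2.0), 3))   # sound made twice as fast
```

Under those assumed splits, doubling the speed of the small component barely moves the total, while doubling the speed of the large one helps much more.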

Still, I have GP/DC running with both C68k and FAME (and Musashi, and I even have the PC version using Starscream), and I'll probably have it running on MAME's Z80, C68k, and FAZE as well. It's just not going to make a huge difference. It might make a slight difference in software mode with sound enabled, but I don't think it's going to be much.

That reminds me... I need to work out some way of running it through a profiler. Or at the very least, add a load of profiling code to GP/DC to see what the slow parts are. There might be some parts of the core that are slower than I think they are. However, I'm 99% sure that both the video and sound emulation are causing the slowdowns, and I believe that they're both fixable.
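As a rough illustration of what "adding a load of profiling code" could look like, here is a sketch in Python rather than the emulator's own C. The section names and the stand-in workloads are assumptions; a real build would wrap its own CPU, video and sound steps and read a hardware timer instead.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)   # section name -> accumulated seconds

@contextmanager
def timed(section):
    """Accumulate wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[section] += time.perf_counter() - start

def report():
    """Return (section, share of measured time) pairs, slowest first."""
    grand_total = sum(totals.values()) or 1.0
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, seconds / grand_total) for name, seconds in ranked]

# Stand-in workloads for 60 "frames"; these are placeholders only.
for _ in range(60):
    with timed("cpu"):
        sum(range(1000))
    with timed("sound"):
        sum(range(5000))

for name, share in report():
    print(f"{name}: {share:.0%}")
```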

And no, I'm not going to do Smash Pack style fake sound. It sounded absolutely terrible. Even though doing that would pretty much solve our speed problems, I'm not even going to consider it. I'm already doing fake video, and you can see the kind of problems that approach has. I'm still not sure that's a workable solution, unless I start writing game-specific hacks (which I really don't want to do, but may have to if I want to make certain games playable at a decent speed).

Post by **Orange_Ribbon** » Fri Jul 22, 2005 5:10 am

I have a silly question. If you could answer in non-programmer lingo (I would like to understand the answer :-p ): why does the sound make it hard when (and I am guessing here) the audio hardware on the DC must be better by a significant amount than the Genesis's? I know there must be a reason; all the coders here seem to know their poo.

Post by **BlackAura** » Fri Jul 22, 2005 8:30 am

The Dreamcast's sound hardware is just completely different to the MegaDrive's sound hardware. They work on very different principles.

The MD's sound hardware uses a technique called FM synthesis, which creates sounds using mathematical functions (specifically, combinations of sine waves). The sounds it generates are actually quite complex, and it can generate six (I think) of them at once, in addition to a few simpler sound channels. An MD game just throws a few numbers into the sound hardware, and it starts playing sounds.
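To give a feel for "combinations of sine waves", here is a textbook two-operator FM sketch: one sine wave modulating the phase of another. It is an assumption-laden simplification, not the MegaDrive chip's actual algorithm, and the function names and parameter values are made up for illustration.

```python
import math

def fm_sample(t, carrier_hz, modulator_hz, index):
    """One sample of two-operator FM: a sine wave whose phase is
    modulated by a second sine wave."""
    modulator = math.sin(2.0 * math.pi * modulator_hz * t)
    return math.sin(2.0 * math.pi * carrier_hz * t + index * modulator)

def fm_tone(carrier_hz, modulator_hz, index, sample_rate=44100, count=441):
    """Generate `count` samples, each in the range [-1.0, 1.0]."""
    return [fm_sample(n / sample_rate, carrier_hz, modulator_hz, index)
            for n in range(count)]

tone = fm_tone(carrier_hz=440.0, modulator_hz=880.0, index=2.0)
print(len(tone), min(tone) >= -1.0, max(tone) <= 1.0)
```

Changing `index` or the frequency ratio changes the timbre, which is why a game only needs to send "a few numbers" to get a new sound.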

The Dreamcast's sound hardware plays back sampled sound effects, and that's all it does. It can apply all kinds of special effects to those sounds, like echo, reverb, or whatever (and we actually don't know how to do that anyway), but fundamentally it can only play back pre-recorded sound. It has no way to generate new sounds by itself - any sound it makes has to be loaded into it by the CPU.

Emulating the MD's sound hardware requires mimicking the maths it uses to generate sounds, generating sound samples in a format the DC hardware can understand, mixing all of the channels into one channel, and sending that to the Dreamcast's sound hardware. That's a lot of work that the MegaDrive could do instantly in hardware, and that the Dreamcast was never intended to do.
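The "mix all of the channels into one" step might look roughly like the sketch below. It assumes each emulated channel has already been rendered as floats in [-1.0, 1.0] and that the output is signed 16-bit PCM; `mix_to_pcm16` is a hypothetical helper, not a function from Genesis Plus.

```python
def mix_to_pcm16(channels):
    """Mix equal-length channels of floats in [-1.0, 1.0] into one list
    of signed 16-bit integers."""
    mixed = []
    for frame in zip(*channels):
        value = sum(frame) / len(channels)      # average so the sum cannot overflow
        value = max(-1.0, min(1.0, value))      # clip defensively
        mixed.append(int(round(value * 32767)))
    return mixed

square = [1.0, 1.0, -1.0, -1.0]
quiet = [0.0, 0.5, 0.0, -0.5]
print(mix_to_pcm16([square, quiet]))
```

Every output sample costs work per channel, and all of it lands on the main CPU.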

Basically, all the sound hardware in the Dreamcast is useless, because it can't do the same thing as the MegaDrive sound hardware. It may as well be an old Sound Blaster (or even worse - a modern PC on-board sound card - they're even worse). All of the work has to be done by the main CPU, which is already busy emulating the CPUs, the graphics, and the rest of the system.

Post by **Orange_Ribbon** » Fri Jul 22, 2005 2:43 pm

BlackAura wrote: The Dreamcast's sound hardware is just completely different to the MegaDrive's sound hardware. They work on very different principles.

Ahhh, I see. So the DC's sound hardware is a glorified CD player. Now I have another odd question. Why not break the ROMs up? Make a program that captures all the audio into WAVs or MP3s, and keeps it separate from the graphics part. Then you could just send the audio to the sound hardware. Wouldn't that be kind of like how the game originally worked? Something happens to trigger a sound, the game sends the mathematical signal, and the emulator just picks the right WAV file to send to the hardware. Okay, that seemed more like an idea than a question. But could something like that be done? Isn't that also what was being done to turn Neo Geo ROMs into Neo Geo CD ISOs?

Oh BA thanks for taking time to explain this to me.

Post by **Quzar** » Fri Jul 22, 2005 2:59 pm

Because the sound is generated by the software in real time, and the codes it sends to the sound hardware look like any other data. What you are suggesting, if I understand right, is to pre-process all the different sound codes a game generates and keep the results stored as WAVs. The problem with that is that the values used to create the sounds are sometimes not just lying around, but are computed at run time.

Post by **dr apocalipsis** » Fri Jul 22, 2005 6:22 pm

BlackAura wrote: The Dreamcast's sound hardware plays back sampled sound effects, and that's all it does. It can apply all kinds of special effects to those sounds, like echo, reverb, or whatever (and we actually don't know how to do that anyway), but fundamentally it can only play back pre-recorded sound. It has no way to generate new sounds by itself - any sound it makes has to be loaded into it by the CPU.

Hmm... A little off topic, not related to Genesis emulation, but
doesn't Ikaruga generate its music using synthesis?

As far as I know, the Dreamcast version uses synthesis and the GameCube version uses Red Book audio.

P.S. Sad to hear FAZE will not fix the sound problems.

Post by **BlackAura** » Fri Jul 22, 2005 8:28 pm

Orange_Ribbon wrote: Ahhh, I see. So the DC's sound hardware is a glorified CD player.

Pretty much. The MD is more like a 1980s synthesiser.

Orange_Ribbon wrote: Now I have another odd question. Why not break the ROMs up? Make a program that captures all the audio into WAVs or MP3s, and keeps it separate from the graphics part. Then you could just send the audio to the sound hardware.

That would be theoretically possible, I suppose. At least for the musical parts, you could sample what the tone is supposed to sound like, and play it back at the correct pitch and volume. However, that'd require each game to be sampled in that way, which would take a long time, it wouldn't sound very good (it'd sound better than the Sega Smash Pack did, but it'll still sound a bit crappy), and the code to recognise which samples to play would be pretty complex.

Games can also do pretty complicated effects, and often change the parameters of a note while playing it to achieve some cool sounding effects. You can't easily replicate those.

It's probably not worth it. I think it would be more worthwhile trying to get the existing sound generator working faster, and it'd certainly produce a better result.

Stef.D wrote: In an intensive raw benchmark (prime number calculation), FAME can be close to 3x the speed of C68K, which is already about 3x the speed of the Musashi core.

I haven't actually done any benchmarking on it...

I don't suppose anyone knows the MAME source code well enough to try sticking FAME into it as a replacement for Musashi? There are some systems that make heavy use of 68k CPUs, which might be a better test of FAME than anything we have at the moment.

dr apocalipsis wrote: Hmm... A little off topic, not related to Genesis emulation, but
doesn't Ikaruga generate its music using synthesis?

As far as I know, the Dreamcast version uses synthesis and the GameCube version uses Red Book audio.

Possibly.

There are ways to play music on the Dreamcast hardware without using sound streaming (CD Audio, ADX, MP3, or whatever). You can use a technique similar to the one the SNES used - you record a load of samples of instruments, and play them back to produce music. You can then do a lot of effects on the samples to make it sound different. The Dreamcast's web browser is able to play back MIDI music in this way.
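Playing one recorded instrument sample at different pitches comes down to stepping through it at a different rate. A minimal sketch, assuming a looping sample and linear interpolation (`play_at_pitch` and the tiny `ramp` sample are hypothetical, chosen so the result is easy to follow):

```python
def play_at_pitch(sample, ratio, count):
    """Resample `sample` by advancing `ratio` source samples per output
    sample, with linear interpolation and looping at the end."""
    out = []
    position = 0.0
    length = len(sample)
    for _ in range(count):
        i = int(position)
        frac = position - i
        a = sample[i % length]
        b = sample[(i + 1) % length]
        out.append(a + (b - a) * frac)
        position += ratio
    return out

ramp = [0.0, 1.0, 2.0, 3.0]
print(play_at_pitch(ramp, 2.0, 4))   # an octave up: every second sample
print(play_at_pitch(ramp, 0.5, 4))   # an octave down: interpolated in-between values
```

A ratio of 2.0 doubles the pitch and a ratio of 0.5 halves it. Sample-playback hardware can do this kind of stepping itself, which is why this approach is cheap compared with emulating FM synthesis on the CPU.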

I guess that's what they would have done. It was originally a NAOMI game, after all, and there wasn't an unlimited amount of ROM space available.

I don't know why they wouldn't do that on the GameCube though. It's just as capable as the Dreamcast is. More so, in fact.

Post by **DcSteve** » Sat Jul 23, 2005 1:13 pm

pvr3 does not, but does your current version have SRAM working, and the hero gfx bug layer fixed?

Post by **Quzar** » Sat Jul 23, 2005 4:27 pm

well because i had some very basic timing benchmarks in neodc and ran them between fame and c68k and the difference was at most 10%.

I got close to the same results before I upgraded from gcc 3.3.1 to 3.4.2. Afterwards though, c68k began to fly. There was almost a 10-15% overall increase in emulator speed when I upgraded (and added some compiler flags). So much so that I was able to have sound enabled and the speed stayed fairly consistent with what it was in previous versions built with gcc 3.3.

if you could email them to me quzar at screamcast.net, PM them to me, or send them to me over msn i would be very grateful (i've been wanting to try to put together a set of benchmarks for the three but i really didn't know where to begin. and this will help me when I try to do some modifications that i've been wanting to try...)

Post by **BlackAura** » Sat Jul 23, 2005 8:53 pm

DcSteve wrote: pvr3 does not, but does your current version have SRAM working, and the hero gfx bug layer fixed?

SRAM - Partially. Works with Musashi and C68k, doesn't work with FAME. Haven't copied that code into the Dreamcast version yet.

Hero... Probably not. I'm not even aware of it.

Post by **GPF** » Sun Jul 24, 2005 2:09 am

BlackAura wrote: That reminds me... I need to work out some way of running it through a profiler. Or at the very least, add a load of profiling code to GP/DC to see what the slow parts are.

It would be very interesting to discuss profiling. I have only played around with it a little under Cygwin for Windows code.

I am wondering if we could somehow alter gcc and/or gprof code to write to a different directory for the profiling output, ie /ram /vmu and then dump to /pc etc.

Or are you talking about just adding your own timing code?

Troy

Post by **Stef.D** » Sun Jul 24, 2005 5:44 am

Quotable Quzar wrote: well because i had some very basic timing benchmarks in neodc and ran them between fame and c68k and the difference was at most 10%.

I got close to the same results before I upgraded from gcc 3.3.1 to 3.4.2. Afterwards though, c68k began to fly. There was almost a 10-15% overall increase in emulator speed when I upgraded (and added some compiler flags). So much so that I was able to have sound enabled and the speed stayed fairly consistent with what it was in previous versions built with gcc 3.3.

if you could email them to me quzar at screamcast.net, PM them to me, or send them to me over msn i would be very grateful (i've been wanting to try to put together a set of benchmarks for the three but i really didn't know where to begin. and this will help me when I try to do some modifications that i've been wanting to try...)

As soon as I'm back home I'll send them to you.

I'm thinking about the small difference you observed in NeoDC. Did you try overclocking the 68000 a lot (something like 120 MHz) to see how many FPS you get with each core?
Maybe the memory access functions are really slow, maybe my GCC version is too old, or maybe my compilation flags aren't correct ^^

Post by **Tyne** » Sun Jul 24, 2005 2:07 pm

BlackAura wrote:
DcSteve wrote: pvr3 does not, but does your current version have SRAM working, and the hero gfx bug layer fixed?
SRAM - Partially. Works with Musashi and C68k, doesn't work with FAME. Haven't copied that code into the Dreamcast version yet.

Hero... Probably not. I'm not even aware of it.

Every so often the main character you control will have its graphics glitched / scrambled. I've tested this in Streets of Rage 2 (happens a lot), and Universal Soldier (also known as butchered Turrican 2... Turrican rocks).

Post by **BlackAura** » Sun Jul 24, 2005 7:43 pm

Tyne wrote: Every so often the main character you control will have its graphics glitched / scrambled.

If that's the same bug that affects most other games (where the Dreamcast is using graphics from one frame to draw another), I fixed that.

Post by **fox68k** » Thu Jul 28, 2005 6:58 am

There is something in FAME that has been driving me crazy these days while fixing some bugs. Perhaps you could help me with it.

Fiddling with the FAME code, I found that removing or leaving in some pieces of code makes FAME run faster or slower (both in raw tests and in GP/DC).

I think this has to do with code alignment and cache misses, because my opcode routines are not 32-byte aligned (only the API functions and data are).

Stef D, do you remember the weird results we got when comparing FAME and C68k speed on the DC? Sometimes FAME went slower, apparently without any important code having been modified. I remember you asked me about that, and my answer was that I did not know.

Maybe BlackAura could explain to us how code and data should be aligned on the DC in order to avoid cache misses and therefore optimize speed.

Thanks in advance.

Post by **BlackAura** » Thu Jul 28, 2005 8:22 am

fox68k wrote: Maybe BlackAura could explain to us how code and data should be aligned on the DC in order to avoid cache misses and therefore optimize speed.

I'm afraid not. I can tell you how the cache is rigged up though. The problems you're having sound a lot like cache problems.

The cache is split in two - instruction cache, and data cache. Cache line size is 32 bytes. The instruction cache is 8 KB (256 lines), while the data cache is 16 KB (512 lines). It's a simple direct mapped system, so memory line N (the address divided by 32) maps to cache line N mod 256, or N mod 512 respectively.
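The mapping can be modelled in a few lines. This is a simplified sketch of a direct-mapped cache using the sizes given above; it ignores any other details of the real SH-4, and the addresses are hypothetical values picked only to show a collision.

```python
LINE_SIZE = 32          # bytes per cache line
ICACHE_LINES = 256      # 8 KB instruction cache
DCACHE_LINES = 512      # 16 KB data cache

def cache_line(address, lines):
    """Index of the cache line an address maps to in a direct-mapped cache."""
    return (address // LINE_SIZE) % lines

def conflicts(addr_a, addr_b, lines):
    """True if two addresses in different memory lines share a cache line."""
    different_memory_lines = addr_a // LINE_SIZE != addr_b // LINE_SIZE
    same_slot = cache_line(addr_a, lines) == cache_line(addr_b, lines)
    return different_memory_lines and same_slot

# Hypothetical addresses, chosen only to illustrate the mapping.
handler = 0x8C010000
caller = handler + 8 * 1024
print(cache_line(handler, ICACHE_LINES), cache_line(caller, ICACHE_LINES))
print(conflicts(handler, caller, ICACHE_LINES))        # exactly 8 KB apart
print(conflicts(handler, caller + 64, ICACHE_LINES))   # two lines further on
```

Two pieces of code that land on the same slot evict each other every time control passes between them, even though the rest of the cache may be sitting idle.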

The only reliable way to make something run very quickly, regardless of where it's being put in memory, is to ensure that the entire set of code you're running fits within the instruction cache, the entire data set fits within the data cache, and you don't use any code or data that's external to that.

Obviously, there's not a lot you can do about the data cache. It needs to hold both 68k instructions and data, and the layout and access patterns of that code are determined entirely by the 68k code you're executing. I don't think there's much you can do about that.

The fact that it's happening when you change the surrounding program suggests that the slowdown is being caused by accessing code and data outside of FAME itself. That's probably the memory access handlers. When you change something outside of FAME, it's going to move the FAME code around relative to the 68k code, data, and the I/O handlers. If, for example, the I/O handlers and the bit of code that calls them happen to be almost 8 KB apart, you might suddenly find all I/O slows down dramatically.

The other thing - adding or removing code will move FAME itself around, and probably by less than 32 bytes. Suddenly, code and data that might have fit within a single cache line could be split over two cache lines, resulting in way more cache misses. Or the other way around - it could suddenly get faster because it has fewer cache misses.

If your opcode handlers are less than 32 bytes, it might be worth aligning them to 32-byte boundaries. If they're less than 16 bytes, you may as well align them to 16 byte boundaries (save a bit of space, and why not?). If they're more than 32 bytes, try as hard as you can to shrink them as far as possible, so each opcode fits in as few cache lines as possible. I don't know how the internal architecture of FAME works, but having the instructions regularly spaced might allow you to use a computed jump instead of a jump table, which might help reduce data cache misses a little bit.
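The effect of shifting a small handler by a few bytes, and of aligning it, can be checked with simple arithmetic. The handler size and addresses below are made-up examples, and `lines_touched` and `align_up` are hypothetical helpers.

```python
LINE_SIZE = 32   # bytes per cache line

def lines_touched(start, size):
    """Number of cache lines covered by `size` bytes starting at `start`."""
    first = start // LINE_SIZE
    last = (start + size - 1) // LINE_SIZE
    return last - first + 1

def align_up(address, alignment=LINE_SIZE):
    """Round `address` up to the next multiple of `alignment`."""
    return (address + alignment - 1) // alignment * alignment

# A hypothetical 24-byte opcode handler at three different start addresses:
# line-aligned, shifted by 16 bytes, and re-aligned after the shift.
for start in (0x1000, 0x1010, align_up(0x1010)):
    print(hex(start), lines_touched(start, 24))
```

The same 24 bytes occupy one line when aligned and two when shifted by 16 bytes, which is the kind of change that adding or removing unrelated code can cause.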

I know of no way to manually force something to align to a specific address in GCC. You can force data to align to a 32 byte boundary, for example, but not code, and not "1024 bytes from the nearest 16 KB boundary". That could possibly be useful, but that's somewhat irrelevant, since we can't do it anyway. Well, we could align something to an 8 KB block, then pad the data out to the appropriate position.

It's probably possible to force the FAME object file to align to an 8 KB boundary, or whatever. That (might) help it be a little more consistent, since the internal alignment won't suddenly change depending on what you're linking it to. It won't help external access at all though. Requires using a linker script. Might be worth aligning other object files that FAME interacts with as well, so you can get a reasonably consistent layout between them.

Oh yeah... Cache misses are more expensive than prefetching. A prefetch instruction costs 1 cycle only (I think), but a cache miss also stalls the pipeline while the appropriate data is fetched from main memory. If you're reasonably sure that data you're about to read is not in the cache, prefetch it. If it's probably in the cache, don't prefetch it. I don't know at what point to switch from prefetching to not prefetching - that would depend on exactly how many cycles a cache miss wastes, and what the probability of a cache miss is. There's probably a decent way to model that, but that would require:

1 - Looking at the SH-4 manuals to find out how expensive a cache miss is
2 - Maths, which I don't feel like doing at 11 PM

I think the appropriate maths would be something like this:

x = probability of cache miss (from 0 to 1)
c = number of cycles wasted
prefetch penalty = 1
no prefetch penalty = x * c

If you were to draw a graph of those, the turning point would be the point at which the line y=1 intersects with the line y=x*c, which can be determined by:

x = 1 / c

Actually, that was dead easy. Assuming I haven't made some glaringly obvious mistake that someone more awake would have noticed, of course.

So, if a cache miss had a penalty of 3 cycles, then prefetching would be faster if the probability of a cache miss were greater than 33%.
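The break-even rule above is easy to turn into a small helper. This sketch uses the same simple model (a fixed prefetch cost against an expected stall of x * c); the penalty values are placeholders, not figures from the SH-4 manual, and both function names are hypothetical.

```python
def break_even_miss_probability(miss_penalty_cycles, prefetch_cost_cycles=1.0):
    """Miss probability above which prefetching is expected to be cheaper."""
    return prefetch_cost_cycles / miss_penalty_cycles

def should_prefetch(miss_probability, miss_penalty_cycles, prefetch_cost_cycles=1.0):
    """Compare the expected stall (x * c) with the fixed prefetch cost."""
    return miss_probability * miss_penalty_cycles > prefetch_cost_cycles

print(round(break_even_miss_probability(3), 3))   # the 3-cycle example
print(should_prefetch(0.5, 3), should_prefetch(0.2, 3))
```

The larger the real miss penalty turns out to be, the lower the miss probability at which prefetching starts to pay off.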

That's per access, of course. If you're accessing data pretty much at random, prefetching will not help one bit. You need to be able to access the data once, and do something with all 32 bytes of data at once, or at least before they get knocked out of the cache by something else.

What we really need for this is something like Valgrind's cache profiler. Valgrind basically emulates an x86 processor, and one of the things it can do is point out where you're getting lots of cache misses or cache thrashing. An SH-4 code profiler that could do something like that would be pretty useful, although it would be rather strange using one CPU emulator to profile and optimise another...