Z80 emulation

Warmtoe · Post by **Warmtoe** » Sun Jul 04, 2004 3:08 am

The problem with Genesis Plus at the moment is with the speed of the z80 emulation - C68K did a lot for the emulation in general, but the z80 is now holding things down.

Anyone got any experience / ideas on how to speed up the z80 emulation?

I have been fiddling with 32-byte aligning everything and making everything a 32 bit value - but to no avail...

BlackAura · Post by **BlackAura** » Sun Jul 04, 2004 3:23 am

The Z80 emulator in GP has an accuracy switch. Disabling it disables support for undocumented flags and opcodes, and (supposedly) makes it a bit faster. I didn't notice any difference though.

Rev. Layle · Post by **Rev. Layle** » Sun Jul 04, 2004 3:43 am

what makes the z80 so difficult to emulate efficiently, or has no one tackled it in depth long enough to make any headway on it?

Post by **Quzar** » Sun Jul 04, 2004 4:01 am

that second thing you said... basically. Just like the M68k, we are the only people really being held back by this. On almost any PC there are assembly emulators, so there is no worry about speed, but we are stuck with a C emulator that just isn't built for speed. Frankly, what I would try to do with any emulator core would be dynarec, but that is just because I don't understand emulation very well

doragasu · Post by **doragasu** » Sun Jul 04, 2004 6:39 am

Dynarec is only useful when the CPU clock of the system you want to emulate is high (compared to the CPU clock you're using to emulate that system). For example, to emulate a PSX 33MHz CPU on the DC, dynarec is the only way if you want to get it fullspeed (along with SH4 ASM). To emulate an Atari 2600, though, dynarec is not a good choice, as it tends to break compatibility, and on some systems a dynarec emulator may even be slower than a normal emulator for a really slow CPU.

Warmtoe · Post by **Warmtoe** » Sun Jul 04, 2004 7:52 am

BlackAura wrote:The Z80 emulator in GP has an accuracy switch. Disabling it disables support for undocumented flags and opcodes, and (supposedly) makes it a bit faster. I didn't notice any difference though.

Hmmm - tried all of those - and makes no notable difference... it basically needs to be rewritten like C68K was - just I don't know how to start.

BlackAura · Post by **BlackAura** » Sun Jul 04, 2004 8:44 am

and makes no notable difference

I didn't notice any either...

it basically needs to be rewritten like C68K was - just I don't know how to start.

Yeah, it either needs to be rewritten from scratch in C, using all the insane optimizations that C68k uses (there's a reason nobody else had written a 68k emulator like that yet - requiring half a gig of RAM to compile is a little much), or written in SH-4 assembly. Emulators written in assembly tend to use the same kind of tricks that C68k uses (stuff like jump tables, not using real function calls, that kind of thing).

Ever written a CPU emulator before?

Most interpretative emulators follow this basic pattern:

nCycles += cycles_to_run_this_time;
while(nCycles > 0)
{
    /* Fetch */
    instruction = read_from_memory(registers.instruction_pointer);

    /* Decode */
    switch(instruction)
    {
        case 0x??:
            /* Execute */
            do_instruction_??();
            break;
        default:
            die_painfully();
            break;
    }

    /* Consume time */
    nCycles -= cycle_table[instruction];
}

Obviously that varies a lot, especially the decode phase, depending on the CPU you're emulating.
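
For anyone who wants something that actually compiles, here is a minimal sketch of the same fetch/decode/execute pattern. It implements only four real Z80 opcodes (NOP, INC A, LD A,n, HALT) with their documented cycle counts and ignores flags entirely. The `struct cpu` layout and the `cpu_run` name are made up for illustration and are not taken from Genesis Plus or the MAME core.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical minimal CPU state, for illustration only. */
struct cpu {
    uint8_t  a;          /* accumulator */
    uint16_t pc;         /* program counter */
    int      halted;     /* set by HALT or by an unimplemented opcode */
    int      cycles;     /* cycles left in this time slice (may go negative) */
    uint8_t  mem[65536]; /* flat 64KB address space */
};

/* Run for roughly `budget` cycles; returns the cycles actually consumed. */
static int cpu_run(struct cpu *c, int budget)
{
    int start;

    assert(budget >= 0);
    c->cycles += budget;
    start = c->cycles;

    while (c->cycles > 0 && !c->halted) {
        uint8_t op = c->mem[c->pc++];            /* fetch */

        switch (op) {                            /* decode */
        case 0x00:                               /* NOP */
            c->cycles -= 4;
            break;
        case 0x3C:                               /* INC A (flags omitted) */
            c->a++;
            c->cycles -= 4;
            break;
        case 0x3E:                               /* LD A,n */
            c->a = c->mem[c->pc++];
            c->cycles -= 7;
            break;
        case 0x76:                               /* HALT */
            c->halted = 1;
            c->cycles -= 4;
            break;
        default:                                 /* unimplemented: stop */
            c->halted = 1;
            break;
        }
    }
    return start - c->cycles;
}
```

Loading the bytes `3E 05 3C 3C 76` at address 0 and calling `cpu_run(&c, 100)` should leave 7 in `a` after 19 cycles.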

There's a load of documentation about this sort of thing out there. I'll see if I can dig some up...

Ian Micheal · Post by **Ian Micheal** » Sun Jul 04, 2004 11:12 am

I haven't found any change either with the Z80 with any of the flags; the speed change is about 1% on some MAME drivers using that define. Once the target system's Z80 runs at about 4 MHz, it's slow. Over many years of working on MAME drivers I've found the Z80 to be near the slowest CPU core. By itself it's fine, but as soon as you add one other core with it, you can never get it fast enough.

Real pain.

Post by **Quzar** » Sun Jul 04, 2004 11:38 am

Has anyone tried any other z80 emulators? Or rather, do they exist? I know there are a few others, but I mean any other portable ones (other than the MAME CPU core).

Mask of Destiny · Post by **Mask of Destiny** » Sun Jul 04, 2004 12:05 pm

Switch statements are rather inefficient for instruction decoding. Best way to do it is with an array of function pointers with the instruction being decoded as the index into the array.
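
A rough sketch of what that table-of-function-pointers decoder could look like, using the opcode byte directly as the index. The handler names, the `struct cpu` layout and the convention of returning the cycle count are assumptions made for this example, not the API of any existing core.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical minimal CPU state, for illustration only. */
struct cpu {
    uint8_t  a;
    uint16_t pc;
    int      halted;
    uint8_t  mem[65536];
};

/* Each handler executes one instruction and returns its cycle count. */
typedef int (*op_handler)(struct cpu *);

static int op_nop(struct cpu *c)    { (void)c; return 4; }
static int op_inc_a(struct cpu *c)  { c->a++; return 4; }          /* flags omitted */
static int op_ld_a_n(struct cpu *c) { c->a = c->mem[c->pc++]; return 7; }
static int op_halt(struct cpu *c)   { c->halted = 1; return 4; }
static int op_unimplemented(struct cpu *c) { c->halted = 1; return 0; }

static op_handler op_table[256];

/* Fill every slot first so no opcode can hit a null pointer. */
static void init_op_table(void)
{
    int i;
    for (i = 0; i < 256; i++)
        op_table[i] = op_unimplemented;
    op_table[0x00] = op_nop;
    op_table[0x3C] = op_inc_a;
    op_table[0x3E] = op_ld_a_n;
    op_table[0x76] = op_halt;
}

/* Fetch one opcode and dispatch through the table. */
static int cpu_step(struct cpu *c)
{
    assert(op_table[0] != NULL);   /* init_op_table() must have been called */
    return op_table[c->mem[c->pc++]](c);
}
```

The decode step becomes a single indexed load plus an indirect call. Whether that beats a switch on a given compiler and CPU is something to measure rather than assume.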

Flag calculation is probably the other expensive part, ASM won't help you very much there since the SH-4 doesn't do much flag calculation itself. Your best bet is to have tables of precalculated flags for at least the more common operations. I'm pretty sure only the 8-bit instructions change flags so it's only 64KB of data for each table.
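
As an illustration of the precalculated-flags idea, here is a sketch of one 64KB table for 8-bit ADD, indexed by both operands. The flag bit positions (S, Z, H, P/V, N, C) are the documented Z80 ones; the undocumented bits 3 and 5 are left at zero, and the `add_flags`, `init_add_flags` and `add8` names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Documented Z80 flag bits in the F register. */
#define FLAG_C  0x01  /* carry out of bit 7 */
#define FLAG_N  0x02  /* last operation was a subtraction (clear for ADD) */
#define FLAG_PV 0x04  /* signed overflow for arithmetic operations */
#define FLAG_H  0x10  /* half carry out of bit 3 */
#define FLAG_Z  0x40  /* result is zero */
#define FLAG_S  0x80  /* result bit 7 */

/* One byte of flags for every (a, b) operand pair: 256 * 256 = 64KB. */
static uint8_t add_flags[256 * 256];

static void init_add_flags(void)
{
    unsigned a, b;

    for (a = 0; a < 256; a++) {
        for (b = 0; b < 256; b++) {
            unsigned r = a + b;
            uint8_t f = 0;

            if (r & 0x80)                       f |= FLAG_S;
            if ((r & 0xFF) == 0)                f |= FLAG_Z;
            if (((a & 0x0F) + (b & 0x0F)) & 0x10) f |= FLAG_H;
            /* Overflow: operands share a sign that the result does not. */
            if (~(a ^ b) & (a ^ r) & 0x80)      f |= FLAG_PV;
            if (r & 0x100)                      f |= FLAG_C;

            add_flags[(a << 8) | b] = f;
        }
    }
}

/* ADD with flags fetched from the table instead of computed each time. */
static uint8_t add8(uint8_t a, uint8_t b, uint8_t *flags)
{
    assert(flags != 0);
    *flags = add_flags[((unsigned)a << 8) | b];
    return (uint8_t)(a + b);
}
```

The trade-off is memory traffic: a 64KB table is much larger than the SH-4's data cache, so whether the lookup beats computing the flags inline would need to be measured on the real hardware.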

BlackAura · Post by **BlackAura** » Sun Jul 04, 2004 12:13 pm

There are quite a few others. It's just a matter of finding one...

Marat Fayzullin's Z80 emulator is not going to be any better than the MAME one. It's slower, and has a lot more bugs.

Marcel de Kogel's Z80em (which I think is based on Marat's one above) is faster and less buggy, but I don't have a clue how well it might work on a Dreamcast. Has optional x86 assembly, which is obviously useless to us.

DOZE and RAZE are both assembly-only, so they're no good to us either. Both are really good emulators though.

Neil Bradley's MZ80 is (I think) both C and assembly, depending on what command-line options you give the generator program.

Other than that, we have the MAME one, which we're already using.

So there are quite a lot of them, but most are x86 assembly only, or are no better than the one we're using.

Switch statements are rather inefficient for instruction decoding. Best way to do it is with an array of function pointers with the instruction being decoded as the index into the array.

Unless the compiler's smart enough to convert the switch statement into a jump table. There's really no way to guarantee that when you're writing the code though, and I don't think GCC is that smart.

Ian Micheal · Post by **Ian Micheal** » Sun Jul 04, 2004 1:12 pm

MZ80 is what's used in Generator and NeoGeo CD. It crashes for me most times when running.

Rev. Layle · Post by **Rev. Layle** » Sun Jul 04, 2004 1:45 pm

Mask of Destiny wrote:Switch statements are rather inefficient for instruction decoding. Best way to do it is with an array of function pointers with the instruction being decoded as the index into the array.

Flag calculation is probably the other expensive part, ASM won't help you very much there since the SH-4 doesn't do much flag calculation itself. Your best bet is to have tables of precalculated flags for at least the more common operations. I'm pretty sure only the 8-bit instructions change flags so it's only 64KB of data for each table.

i was going to ask, do case statements get optimized by the compiler, or are function pointers a fairly good way to approach that?

i have never done emulation before, but i have done my share of software development (C + many other languages over the last 10-15 years); i mainly do IT-type work these days

quarn · Post by **quarn** » Sun Jul 04, 2004 2:00 pm

BlackAura wrote:Unless the compiler's smart enough to convert the switch statement into a jump table. There's really no way to guarantee that when you're writing the code though, and I don't think GCC is that smart.

c++ test code:
--------------8<---------------

switch (argc) {
                case 0: printf("0\n"); break;
                case 1: printf("1\n"); break;
                case 2: printf("2\n"); break;
                case 3: printf("3\n"); break;
--------------8<---------------

assembly output:
--------------8<---------------
        mova    .L10,r0
        shll2   r1
        mov.l   @(r0,r1),r1
        braf    r1
        nop
.L11:
        .align 2
.L10:
        .long   .L3-.L11
        .long   .L4-.L11
        .long   .L5-.L11
        .long   .L6-.L11
        .long   .L7-.L11
        .long   .L8-.L11
--------------8<---------------

It sure looks like a jump table to me.

Though, you should probably check the output for your own switch statements if it does it there too.
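
One way to run that check on your own code is to put a dense switch in a small file and compile it to assembly with GCC's `-S` option (for example `gcc -O2 -S`, or the same flags with your SH-4 cross compiler). The function below is only a hypothetical test case. Each case does something different so the compiler cannot collapse the switch into a plain data lookup, but what it actually emits depends on the compiler version and options.

```c
#include <assert.h>

/* Dense 8-way switch with a distinct action per case. Not static, so the
 * compiler has to emit it even though nothing in this file calls it. */
void step(unsigned op, unsigned *acc)
{
    assert(acc != 0);

    switch (op & 7) {
    case 0: *acc += 1;     break;
    case 1: *acc -= 1;     break;
    case 2: *acc <<= 1;    break;
    case 3: *acc >>= 1;    break;
    case 4: *acc = -*acc;  break;  /* unsigned negate, wraps around */
    case 5: *acc ^= 0x55u; break;
    case 6: *acc = 0;      break;
    case 7: *acc |= 1u;    break;
    }
}
```

In the generated `.s` file, look for the same shape as in the listing above: a table of label offsets plus an indexed load and an indirect branch, as opposed to a chain of compares and conditional branches.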

Rev. Layle · Post by **Rev. Layle** » Sun Jul 04, 2004 2:15 pm

well, it wouldn't be THAT hard for the compiler to make a jump table: case values must be constants of some sort, so the full set of switch values is known at compile time and there is only a finite, fixed set of jump targets.

i know function pointers are cool to use and easy to call in a loop, and i guess each opcode number can be an array index. however, calling a function means at least pushing and popping the processor's program counter/instruction pointer (as well as other data at times), which could waste clock cycles.

unless you can make calls to functions in an array of function pointers to be inline calls

Stef.D · Post by **Stef.D** » Mon Jul 05, 2004 2:06 am

The Z80 is an 8-bit CPU, so you can easily use a 256-case switch statement for instruction decoding... most of the time, C compilers are smart enough to build a 256-entry jump table...
The only reason I used computed labels in C68K is that most C compilers don't support a 65536-case switch statement (I tried, and it failed with an overflow on most compilers)... the fact that C68K takes so much memory to compile is just due to a GCC trick; I'm sure other compilers can compile it without much trouble as long as they support computed labels (Visual C++ doesn't support them).

Using a function pointer table is actually much slower than having a simple jump table... it's the main difference between Musashi and C68K... I'm also using a faster way of fetching instructions (reading data from PC); I could also do it for SP (A7), but the speed improvement isn't worth the effort imo...
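
For readers who haven't seen computed labels, here is a small sketch of the technique using GCC's "labels as values" extension (`&&label` and `goto *ptr`). It is not C68K's actual code. The `run` function, its four-opcode instruction set and the `NEXT` macro are assumptions made for the example, and a real core would build the label table once, not on every call.

```c
#include <assert.h>
#include <stdint.h>

/* Executes code from `mem` until an unknown opcode (including HALT, 0x76)
 * is fetched or the cycle budget runs out; returns the accumulator.
 * The caller must make sure `mem` is large enough for the program. */
static uint8_t run(const uint8_t *mem, int cycles)
{
    void *table[256];   /* label addresses, indexed by opcode */
    uint16_t pc = 0;
    uint8_t a = 0;
    int i;

    assert(mem != 0);

    for (i = 0; i < 256; i++)
        table[i] = &&op_stop;          /* default: stop the interpreter */
    table[0x00] = &&op_nop;
    table[0x3C] = &&op_inc_a;
    table[0x3E] = &&op_ld_a_n;

/* Fetch the next opcode and jump straight to its handler: no call,
 * no return, and no trip back through a central loop. */
#define NEXT() do { if (cycles <= 0) goto op_stop; \
                    goto *table[mem[pc++]]; } while (0)

    NEXT();

op_nop:                                /* NOP */
    cycles -= 4;
    NEXT();
op_inc_a:                              /* INC A (flags omitted) */
    a++;
    cycles -= 4;
    NEXT();
op_ld_a_n:                             /* LD A,n */
    a = mem[pc++];
    cycles -= 7;
    NEXT();
op_stop:
    return a;
#undef NEXT
}
```

Every handler lives in the same function, so the CPU state can stay in local variables, which the compiler is free to keep in registers.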

Actually, the Z80 C emulators are probably already optimised; I believe MZ80 is the fastest, though I never tried it... I wrote a Z80 core some time ago, but it was in pure x86 ASM :-/ At least I have some knowledge of it and of the undocumented opcodes, so I can quickly rewrite a new one...
I'll have a look at the current C Z80 cores and see if we can do really better, but I'm not sure...

Post by **Quzar** » Mon Jul 05, 2004 2:36 am

wow. that would be nice. a new z80 core to complement the new 68k core.

BlackAura · Post by **BlackAura** » Mon Jul 05, 2004 3:53 am

i believe MZ80 is the fastest, i never tried it though

I think it is, but only the assembly version. The C version's not very fast, and (apparently) doesn't work well on the DC anyway.

A 256-way switch statement would certainly be the best way to implement the Z80, but implementing the two-byte instructions would be a pain. As far as I can tell, the first byte usually acts as a modifier, but one of them changes the instruction set for the second byte... Messy. Doing it that way, it'd probably be simpler to implement it in assembly anyway.

Stef.D · Post by **Stef.D** » Mon Jul 05, 2004 4:42 am

BlackAura wrote:
i believe MZ80 is the fastest, i never tried it though
I think it is, but only the assembly version. The C version's not very fast, and (apparently) doesn't work well on the DC anyway.

A 256-way switch statement would certainly be the best way to implement the Z80, but implementing the two-byte instructions would be a pain. As far as I can tell, the first byte usually acts as a modifier, but one of them changes the instruction set for the second byte... Messy. Doing it that way, it'd probably be simpler to implement it in assembly anyway.

The 2-byte (or longer) instructions can cause some trouble with nested switch statements... because prefixes can appear as many times as we want... even if there is no point in doing that!
I guess computed labels will help here, letting me keep the same implementation I used in my ASM core.
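
To make the prefix problem concrete, here is a hedged sketch of a decoder front end that handles the Z80's four prefix bytes without nesting switches. It loops over any run of DD/FD bytes (the last one wins), then checks for CB or ED to pick the opcode table. The `decoded` struct and `decode_prefixes` function are hypothetical names, not the approach of any particular core, and the buffer is assumed to hold one complete instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum index_reg { USE_HL, USE_IX, USE_IY };      /* selected by DD / FD */
enum op_table  { TABLE_MAIN, TABLE_CB, TABLE_ED };

struct decoded {
    enum index_reg index;   /* register standing in for HL */
    enum op_table  table;   /* which 256-entry opcode table applies */
    uint8_t        opcode;  /* index into that table */
    size_t         length;  /* bytes consumed up to and including the opcode */
};

static struct decoded decode_prefixes(const uint8_t *code, size_t size)
{
    struct decoded d = { USE_HL, TABLE_MAIN, 0, 0 };
    size_t i = 0;

    assert(code != NULL && size > 0);

    /* DD and FD may repeat any number of times; only the last one counts. */
    while (i < size && (code[i] == 0xDD || code[i] == 0xFD)) {
        d.index = (code[i] == 0xDD) ? USE_IX : USE_IY;
        i++;
    }

    if (i < size && code[i] == 0xCB) {
        d.table = TABLE_CB;
        i++;
        if (d.index != USE_HL)
            i++;            /* DD CB d op: displacement precedes the opcode */
    } else if (i < size && code[i] == 0xED) {
        d.table = TABLE_ED;
        d.index = USE_HL;   /* ED instructions ignore a preceding DD/FD */
        i++;
    }

    assert(i < size);       /* assumption: a complete instruction is present */
    d.opcode = code[i++];
    d.length = i;
    return d;
}
```

The result can then drive a flat dispatch (a 256-case switch or a label table per `op_table` value), so the prefix loop stays separate from the instruction handlers.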

Stef.D · Post by **Stef.D** » Mon Jul 05, 2004 4:43 am