I wonder if you could use parts of your SH2 code to improve upon the Dreamcast port, Chilly... ? From what I heard the SH4 actually is compatible with SH2 code, aside from the little endian/big endian issue (not that I'd really know what that means).
Yes, there's not a lot of difference there. Mainly speed, endianness, and floating point support. Anyone who can do SH2 code can do SH4 code.
Endianness is just the order of bytes in memory - little endian means the least significant bytes come first, while big endian means the most significant bytes come first. The DC is little-endian, while the 32X (with the SH2) is big-endian. Such differences are easy to deal with... when the programmer thinks about that ahead of time. Part of the issue with Doom when the code first came out is that the programmers at id didn't consider endianness, so a lot of work (some of it by myself) went into finding all the places where the endianness mattered and fixing the code to deal with it.
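To make the byte order point concrete, here's a small sketch (not from Doom or Wolf32X; the names read_le32 and read_be32 are made up for illustration). It builds a 32-bit value from four bytes explicitly, so the result is the same no matter which CPU runs it:

```c
#include <assert.h>
#include <stdint.h>

/* Read a 32-bit value stored least significant byte first
   (the order the little-endian DC uses in memory). */
uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read a 32-bit value stored most significant byte first
   (the order the big-endian SH2 in the 32X uses in memory). */
uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

The same four bytes 78 56 34 12 come out as 0x12345678 through read_le32 and as 0x78563412 through read_be32. Code that casts a byte pointer to an int pointer instead gets whichever answer the host CPU happens to give, which is the kind of spot that had to be tracked down.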
In any case, there isn't a lot of assembly in Wolf32X - just two pieces:
multiplying two fixed point numbers
fixed FixedByFrac(fixed a, fixed b)
byte swapping a longword
static __inline__ uint32_t SwapInt32(uint32_t i)
//return ((uint32_t)(i & 0xFF000000) >> 24) |
// ((uint32_t)(i & 0x00FF0000) >> 8) |
// ((uint32_t)(i & 0x0000FF00) << 8) |
// ((uint32_t)(i & 0x000000FF) << 24);
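The commented-out lines are the plain C form of the swap. As a standalone sketch of just that C fallback (the name swap_int32_portable is made up here, it is not the Wolf32X function):

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the four bytes of a 32-bit value: 0xAABBCCDD -> 0xDDCCBBAA. */
uint32_t swap_int32_portable(uint32_t i)
{
    return ((i & 0xFF000000u) >> 24) |
           ((i & 0x00FF0000u) >> 8)  |
           ((i & 0x0000FF00u) << 8)  |
           ((i & 0x000000FFu) << 24);
}
```

Swapping twice gives back the original value, which makes for an easy sanity check when replacing a C version with an assembly one.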
Using uncached memory is actually pretty easy - just OR a value with the pointer (when the MMU is not in use for the DC). On the SH2, the value is 0x20000000. On the SH4, the value is 0xA0000000.
For example, here's how I got uncacheable memory for Doom for DC:
vid_size = SCREENWIDTH*SCREENHEIGHT*4+SCREENWIDTH*ST_HEIGHT;
vid_mem = malloc(vid_size);
if (vid_mem == NULL)
I_Error ("Couldn't allocate memory for screens\n");
vid_mem = (byte *)((int)vid_mem | 0xA0000000); // uncached access
Just allocate memory, flush it from the caches, then OR the magic value into the pointer you use to access the memory.
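As a rough sketch of only the address arithmetic, using the two values given above (the helper name and the macro names are hypothetical, and nothing here flushes a cache or touches memory):

```c
#include <assert.h>
#include <stdint.h>

#define SH2_UNCACHED_BITS 0x20000000u  /* value given above for the SH2 */
#define SH4_UNCACHED_BITS 0xA0000000u  /* value given above for the SH4 */

/* Return the uncached alias of a 32-bit address by ORing in the
   CPU-specific bits. The caller still has to flush the cached
   copy first, as described above. */
uint32_t uncached_alias(uint32_t addr, uint32_t bits)
{
    return addr | bits;
}
```

Assuming, purely for illustration, a DC buffer at cached address 0x8C010000, uncached_alias(0x8C010000, SH4_UNCACHED_BITS) gives 0xAC010000. Likewise an assumed 32X address of 0x06000000 becomes 0x26000000 with SH2_UNCACHED_BITS.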
Funny enough, I never got around to putting the assembly fixed multiply or byte swap into Doom for DC.
The current fixed mul is
( fixed_t a,
fixed_t b )
return ((long long) a * (long long) b) >> FRACBITS;
return (fixed_t)(((float)a * (float)b) * FRACINV);
and USE_FLOAT_FIXED is not defined, so of the two return lines it's the first one that gets compiled: it's using long long (64-bit) math.
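Here's a self-contained sketch of that 64-bit path, assuming 16.16 fixed point (FRACBITS of 16 is an assumption here, and fixed_mul64 is a made-up name, not the function in the port):

```c
#include <assert.h>
#include <stdint.h>

#define FRACBITS 16               /* assumed: 16.16 fixed point */
#define FRACUNIT (1 << FRACBITS)  /* 1.0 in fixed point */

typedef int32_t fixed_t;

/* Multiply in 64 bits so the intermediate product cannot overflow,
   then shift back down to 16.16 and narrow to 32 bits. */
fixed_t fixed_mul64(fixed_t a, fixed_t b)
{
    return (fixed_t)(((int64_t)a * (int64_t)b) >> FRACBITS);
}
```

With these definitions, 1.5 times 2.0 is fixed_mul64(3 * FRACUNIT / 2, 2 * FRACUNIT), which comes out as 3 * FRACUNIT. One likely reason to prefer this over the float line: a single-precision float only carries 24 bits of mantissa, so large operands can lose their low bits before the multiply even happens.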