KOS vs Ninja - simple test.


KOS vs Ninja - simple test.

Post by 111 »

Long story short: I made 3 simple programs to roughly compare the "polygon count" performance of KOS' PVR API to the Ninja library (from the official SDK; no Kamui2 version for now). The goal was to see which one is faster "out of the box", i.e. using only the high-level API and no SH-4 assembly (at least not explicitly).
The test model has 2088 vertices, 2608 "optimized" triangles, and 860 triangle strips (only the strips are used, for obvious reasons).

Here is a brief description for every program:
- "NINJA_default" - loads and displays a model ("num_bigs" times) exported with default settings (probably not the best ones, but it's not important this time).
- "KOS_pvr_prim_1024" - uses pvr_prim() for vertex submissions. Vertex buffer size is 1024 kilobytes.
- "KOS_DR_1024" - uses "direct render" for vertex submissions.

Use d-pad left and right to remove/add an instance.

The FPS counters are (supposedly) somewhat broken, so they are not really reliable, but they are there anyway.

Only test this on real hardware, since emulators are useless for that. But here is a pic anyway:
[screenshot]


I don't have a Dreamcast, but from what I was told (thanks, megavolt85!), the "direct render" version runs rather well and is comparable to Ninja. The KOS versions don't have near-z clipping, libparallax is used for matrix management, and vertices are transformed with the "mat_trans_single3" macro, so there is a notable amount of assembly involvement (it's not explicit though, so I won't consider this "cheating").
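
For reference, here is a minimal sketch of what the "direct render" submission path looks like with KOS, assuming a polygon header has already been sent, the list is open, and the combined matrix has already been loaded into XMTRX (e.g. via libparallax or mat_load()). The vertex struct and loop below are illustrative, not the benchmark's actual code:

Code:

#include <dc/pvr.h>
#include <dc/matrix.h>

/* Illustrative vertex format; the real benchmark uses its own model data. */
typedef struct { float x, y, z, u, v; } model_vert_t;

void submit_strip_dr(const model_vert_t *verts, int count, uint32 argb) {
    pvr_dr_state_t dr;
    pvr_dr_init(dr);

    for(int i = 0; i < count; i++) {
        float x = verts[i].x, y = verts[i].y, z = verts[i].z;
        mat_trans_single3(x, y, z);            /* transform by XMTRX (with divide) */

        pvr_vertex_t *v = pvr_dr_target(dr);   /* grab a 32-byte store queue slot */
        v->flags = (i == count - 1) ? PVR_CMD_VERTEX_EOL : PVR_CMD_VERTEX;
        v->x = x; v->y = y; v->z = z;
        v->u = verts[i].u; v->v = verts[i].v;
        v->argb = argb;
        v->oargb = 0;
        pvr_dr_commit(v);                      /* flush the slot straight to the TA */
    }
}
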
=======================

A preliminary verdict would be this: it IS possible to get decent "polycount" performance with KOS, BUT you absolutely have to use assembly (and do it "the right way"). So perhaps I was partially wrong about KOS' inferiority to the official tools and jumped to conclusions too quickly, BUT I was speaking from an "average hobbyist programmer" point of view (i.e. someone who mostly uses high-level APIs and is not a low-level hardware pro).

And if anyone here has an overclocked Dreamcast, I'd like to hear your feedback on this.

PS: performance is not the only problem I had with KOS, but that's another story for another thread.
Attachments
big_test_src.zip
(44.17 KiB)
big_test.zip
(1.41 MiB)

Re: KOS vs Ninja - simple test.

Post by Ian Robinson »

Really good test, thank you for doing this, 111 :)

Re: KOS vs Ninja - simple test.

Post by 111 »

Added the KOS version sources if anyone needs them.

Re: KOS vs Ninja - simple test.

Post by mrneo240 »

pvr_prim() is garbage.

Re: KOS vs Ninja - simple test.

Post by ThePerfectK »

^ So what's a smarter method of submitting vertex data? I've looked into DMA and using store queues for vertex submission, but doesn't pvr_prim() already use DMA for vertex submission?

EDIT: old reading: https://web.archive.org/web/20121022004 ... -dreamcast

Re: KOS vs Ninja - simple test.

Post by ThePerfectK »

The Dreamcast's SH4 has SIMD opcodes, correct? Anybody have any idea how useful these opcodes are for, say, vector transformations? Preferably with an example to demonstrate?

Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog »

ThePerfectK wrote: Tue Jan 21, 2020 10:52 pm the Dreamcast's SH4 has SIMD opcodes, correct? Anybody have any idea how useful these opcodes are for, say, vector transformations? Preferably with an example to demonstrate?
https://github.com/Moopthehedgehog/Drea ... sh4_math.h

Extremely. Let me know if it breaks.
These should run circles around the official stuff (currently this doesn't take advantage of ILP, but that's honestly just because I don't know what would be useful to do with integers while doing all the matrix float math).
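
For anyone curious what that looks like at the instruction level, here is a minimal sketch (not taken from sh4_math.h, just the underlying idea): the SH4's ftrv instruction multiplies the 4-component vector in fv0 by the 4x4 matrix held in XMTRX, so one instruction replaces 16 multiplies and 12 adds. The wrapper below is illustrative:

Code:

/* Transform the point (x, y, z, 1) by whatever matrix is currently loaded
   into XMTRX. fr0-fr3 together form the vector fv0. */
static inline void xmtrx_transform_point(float *x, float *y, float *z, float *w) {
    register float fx __asm__("fr0") = *x;
    register float fy __asm__("fr1") = *y;
    register float fz __asm__("fr2") = *z;
    register float fw __asm__("fr3") = 1.0f;

    __asm__ __volatile__("ftrv xmtrx, fv0"
                         : "+f"(fx), "+f"(fy), "+f"(fz), "+f"(fw));

    *x = fx; *y = fy; *z = fz; *w = fw;
}

KOS's mat_trans_single* macros in <dc/matrix.h> wrap the same pattern (plus the perspective divide).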

Re: KOS vs Ninja - simple test.

Post by mrneo240 »

Asking for the original model and sources for all 3.

Re: KOS vs Ninja - simple test.

Post by ThePerfectK »

Been playing around with the fast math portion of Moop's HAL and holy crap, there is some really great stuff in there for matrix math! I'm finding all sorts of SIMD uses for that stuff, thanks so much moop!

Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog »

Glad it's working for you!

I did recently update it to make it even better, too. Check the version at the top of the file to make sure you're using the latest version. :) I found ways to make use of parallelism for prefetching and added some other things (Sum of Squares, linear interpolation & spherical interpolation, example snippets for things that I can't specifically make a function for, etc.).

Re: KOS vs Ninja - simple test.

Post by ThePerfectK »

Yup, been playing with the latest HAL commit, it's genuinely useful. I'm putting together a benchmarking demo with all the wonderful new Dreamcast toys we've got in the last 6 months (like dcprof) to see how performant it is, but just general eyeballing shows a big improvement. Like, for vertex transformations for a single triangle, I can transform all 3 points at once in one operation! That means 1/3 the instruction fetches/decodes, 1/3 the opcode executions, etc.

I'm coming up with a bunch of useful situations for the packed matrix multiplication tools in HAL, the ones you wrote that let you extract a column vector from XMTRX (i.e. to treat it like 4 vectors at once). It seems pretty handy when A) you need to update a lot of data locations with the same type of math operation (i.e. multiply 300 vertices, 4 at a time, by the same value), or B) you need to total a lot of math operations into one data location (like iterating through some list of values and repeatedly adding them into a single vector, like a running total). It's also useful how you can manually load up the back-bank floating point registers and keep some values there almost like a cache.

This is all super useful stuff here!

EDIT: Other useful ways to use the available two matrix registers from tapamN here: https://dcemulation.org/phpBB/viewtopic ... 0#p1034105

Seems he still processes vertices in a polygon one at a time, but uses the four packed vectors in the matrix to do multiple components (position, normal, light vector, and UV) in a single operation.

Re: KOS vs Ninja - simple test.

Post by lerabot »

Keep up the good work folks!

Re: KOS vs Ninja - simple test.

Post by park »

111 wrote: Sat Jan 04, 2020 11:56 am
Wasted like 6 CD-Rs. None seem to boot on a real Dreamcast. Tried different settings too. Is there a specific setting I should take heed of?

Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog »

Burn the CDIs with IMGBURN with the CDI plugin, use a speed like 16x or 48x, whatever the burner and disc have in common.

Don't use the cheap, plain unlabeled Verbatims—use Verbatim "music" CD-Rs or Memorex or Maxell brands. I've burnt over 300 discs this way when testing my own stuff and have near 100% success with them. Verbatim Digital Vinyls are... decent, but not as good as these others. The "plain" Verbatims just don't work at all (I don't think they have enough dye or something; they're extremely cheaply made).

Worst case the CDIs are messed up and have unscrambled binaries, and are maybe being used with a GDEMU via some loader that can load unscrambled binaries in CDIs...

(Note: I haven’t tested them myself, but this is how one can rule out burning as an issue.)

Re: KOS vs Ninja - simple test.

Post by park »

Moopthehedgehog wrote: Tue Apr 21, 2020 8:42 pm
Thanks moop, yeah, it was the CD-Rs. Switched to a stack of Sonys I had left over and that fixed it. Interesting results. I ran them keeping in mind that the FPS counter might be broken, but I think it's just the Ninja counter that's broken, not the KOS one. I tested how far I could go while staying artifact-free and not dropping below 30 FPS. It seems pvr_prim can do 7 Big the Cats, direct render can do 9 (actually 10, but 10 seems to corrupt the graphics), and eyeballing Ninja at what looks like 30 FPS, it does 16 Big the Cats. So even with direct render there seems to be quite a disparity if each Big is 2,600 triangles.

Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog »

Code:

int pvr_prim(void * data, int size) {
    /* Check to make sure we can do this */
#ifndef NDEBUG
    if(pvr_state.list_reg_open == -1) {
        dbglog(DBG_WARNING, "pvr_prim: attempt to submit to unopened list\n");
        return -1;
    }

#endif  /* !NDEBUG */

    if(!pvr_state.dma_mode) {
        /* Send the data */
        sq_cpy((void *)PVR_TA_INPUT, data, size);
    }
    else {
        return pvr_list_prim(pvr_state.list_reg_open, data, size);
    }

    return 0;
}
Are you using DMA? It looks like pvr_prim() uses the SQ if not. DMA should be 2x faster, since the SQs get squeezed by a 32-bit path to the SH4's BSC before hitting the 64-bit bus; DMA just goes straight through the 64-bit bus.
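
For what it's worth, whether pvr_prim() takes the store queue path or the DMA path above is decided at init time. A minimal sketch of turning the DMA path on, assuming the pvr_init_params_t layout in KOS 2.x (the bin and buffer sizes here are just placeholders):

Code:

#include <dc/pvr.h>

/* Illustrative init parameters: opaque and translucent polygon bins enabled,
   1 MB vertex buffer, DMA mode on so pvr_prim() queues vertices in RAM and
   DMAs them to the TA instead of copying them through the store queues. */
static pvr_init_params_t params = {
    { PVR_BINSIZE_16, PVR_BINSIZE_0, PVR_BINSIZE_16, PVR_BINSIZE_0, PVR_BINSIZE_0 },
    1024 * 1024,    /* vertex_buf_size */
    1,              /* dma_enabled */
    0               /* fsaa_enabled */
};

int main(int argc, char *argv[]) {
    pvr_init(&params);
    /* ...load textures, then the usual pvr_scene_begin()/pvr_list_begin() loop... */
    return 0;
}
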

Re: KOS vs Ninja - simple test.

Post by TapamN »

Moopthehedgehog wrote: Fri Apr 24, 2020 3:39 pm Are you using DMA? Looks like it uses the SQ if not. DMA should be 2x faster since SQ get squeezed by a 32-bit path to the SH4's BSC before hitting the 64-bit bus. DMA just goes straight through the 64-bit bus.
That's not why you're getting better performance with DMA. The SQ operates at full bus width no problem. The reason is that the tile accelerator can only buffer so many vertices at a time while it writes the display list to video RAM. If the buffer gets full, because it's being fed vertices faster than it can write them, it tells whatever is feeding it to stop and wait. If the CPU is submitting, then the CPU stalls. If the DMA controller is submitting, then the DMA stalls while the CPU can still operate. This can make certain methods of CPU-based vertex submission slow.

The Big benchmark does all transformations ahead of time, then hammers the TA with precalculated vertices, which forces the TA to ask the CPU to stall. For this method of submitting polygons to the PVR, you will probably see better performance with DMA.

But it's possible to outspeed DMA submission under certain conditions. With DMA, you have to (1) calculate the vertices, (2) write them to memory, then (3) transfer them from RAM to the PVR. If you cut out (2), writing them to memory, you can save bandwidth and time. To actually get a speed increase out of this, you have to pace the rate at which you submit vertices to the tile accelerator (to give it time to operate without needing to stall the CPU) AND find useful work to do between each vertex submitted.

So instead of doing all the T&L in a batch and then submitting, you can transform, light, and submit each vertex one after another. The useful work done between each vertex submission is the T&L of the next vertex. In a way, it's less efficient, since you will need to retransform and light some vertices multiple times (and it's not really possible to software pipeline the code when you're doing so much), but it can be faster overall than DMA if you match the speed of the TA.

This has some other advantages. You don't need to reserve memory to buffer the TA commands if you submit them directly, so it saves memory. It also has more consistent performance; there's no threat of the source data and temporary buffers cache thrashing if there are no buffers to thrash.

However, if the work you're doing per-vertex is really complex, like matrix blending, environment mapping, or many complex lights, then DMAed indexed geometry will likely be faster; you spend so much extra time doing CPU work that the TA sits idle. It also takes the TA longer to process triangle strips that cover a large area of the screen than a small area, so if the model covers a lot of the screen, you might get some stalls anyway. The ideal method is probably a hybrid approach, using some heuristics based on screen size and the effects applied to decide whether to submit models directly or to buffer them and DMA them later. That's the method I'm planning to take in my rendering library (whenever I get around to working on it again).

I tried adapting the Big model to the format used by my old assembler direct rendering function. I just barely get 19 Bigs per frame at 60 FPS (DMA-less). I was expecting to hit the twenties, but the model has very short triangle strips. Better, longer strips would improve things.

Re: KOS vs Ninja - simple test.

Post by ThePerfectK »

The above makes me want to write a streaming demo to gauge raw primitive rasterization performance, no vertex calculation done per frame at all, just basically stressing the raw speed of vertex submission. I wonder if it'd be possible to write, like, a circular buffer to test all this. Anyone ever tried that?

Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog »

TapamN wrote: Sat Apr 25, 2020 10:57 pm
Moopthehedgehog wrote: Fri Apr 24, 2020 3:39 pm Are you using DMA? Looks like it uses the SQ if not. DMA should be 2x faster since SQ get squeezed by a 32-bit path to the SH4's BSC before hitting the 64-bit bus. DMA just goes straight through the 64-bit bus.
The SQ operates at full bus width no problem.
Nope! SQs flush cache, and there's an architectural bottleneck right between the CPU core and the BSC. See the diagram on page 9 of the SH7750 Group Hardware Manual (R01UH0456EJ0702), freely available from any SH7750 series hardware page: https://www.renesas.com/us/en/products/ ... 7750s.html (under Documents --> filter by "User's Manual: Hardware"; you may need to click "filter" twice to get it to actually filter). I can't post it here because the document states that posting snippets requires the explicit written permission of Renesas, which I don't have (see item number 11 in the NOTICE on Roman numeral page iii). Page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.

The document also describes how burst bulk transfers get split up and shuttled in bus-width-sized chunks, so the SQs would have this extra 8x sequential 4-byte transfer on the way out from the core before the BSC. I'm not sure if the BSC then reassembles it into 4x 8-byte transfers out to memory or if it just passes it straight through as 8x 4-byte transfers, however. I would guess it doesn't actually reassemble them, since it probably just sees 8x 4-byte transfers coming in as 8x 4-byte transfers.

And yes, I know the Dreamcast uses an SH7091, but as far as I've been able to tell it's identical--down to even the addresses and functions of the "undocumented" (Linux kernel term, not mine) performance counter registers. Actually, I did stumble across an SH7091 hardware manual, and it even had the same diagram in it. (Side note: the document was written in Japanese, even though, if I recall correctly, that diagram wasn't. Thankfully I can actually read Japanese, and I remember briefly checking the table of contents to see if it was identical to the SH7750 one. It was virtually identical, which is interesting because that would mean the SH7750 is just an "international release" of the SH7091 or something.) I can't seem to find it again to double-check, but I probably shouldn't have been able to stumble across it in the first place, so I guess it's better that I can't find it again, anyways, lol. It should be telling that I didn't save it anywhere because they're literally the same and the SH7750 one was last updated in 2013, haha. EDIT: Found the specific diagram in another place--it's buried in this service manual DCJY posted: http://www.thedreamcastjunkyard.co.uk/2 ... rvice.html

In any case, if you'd like to verify such architectural details yourself I have a performance counter module here, which uses the SH4's built-in performance counters: https://github.com/Moopthehedgehog/DreamHAL
TapamN wrote: Sat Apr 25, 2020 10:57 pm If the buffer gets full, because it's being fed vertices faster than it can write them, it tells whatever is feeding it to stop and wait. If the CPU is submitting stuff, then the CPU stalls. If the DMA controller is submitting, then the DMA stalls while the CPU can still operate. This can make certain methods of CPU based vertex submission slow.
Unrelated rant, skip this paragraph for related: I'm glad to know that there's someone here who actually knows how to use DMA. It's absurd how many places I keep seeing people initiate DMA and then keep the CPU sitting there uselessly in a spinloop waiting for it to complete. I kid you not, even Sega's BBA network driver got this wrong, which I calculated would have a devastating performance impact of up to 300 stall cycles (for the smallest size of 60 byte packets), or up to a whopping 7570 stall cycles (for the largest size of 1514 byte packets)--FOR EVERY SINGLE PACKET SENT AND PACKET RECEIVED. Even more unrelated, I think there's also another potentially devastating bug in their BBA driver, but I'm not sure about it yet. Need someone to ping the console in like an online Q3A match with a large packet and see if it makes the BBA freak out and reset itself.
/rant

Related: You may be seeing what looks like a buffer getting "full" because you're spamming the cache writeback buffer too frequently. I recently ran into this when using cache block writeback functions to send data over G2 for dcload-ip's BBA driver. But that's because the G2 bus has some kind of weird FIFO and even just looking at that thing the wrong way can cause 100 cycle delays (ugh!). Mostly solved this by only OCBP'ing one 32-byte chunk at a time as soon as the chunk was ready to go as part of a very funky variation of memcpy(). Got it to only stall for 236-ish cycles every 32-byte write, which is about 27MB/s or so. Actually, that was my old number. I got it faster now, but I don't remember what the new number was.

I'm also not sure what you mean by DMA stalling; the concept of DMA stalling just sounds kinda weird to me. Is that Cycle Steal Mode? IIRC that mode just means "please don't access external memory while DMA is in progress, even though you can," since bus arbitration releases and reacquires the bus each burst unit. You can still use data in the cache since the DMAC lives on the BSC side of the SH4. Just prefetch whatever data you need before initiating the DMA in that case. Although, I don't know why one would ever use Cycle Steal instead of Burst mode, since burst mode doesn't release the bus and just enforces the whole "don't access the BSC while DMA is in progress." Makes Cycle Steal seem kinda pointless, IMO, since it hides when someone does something they shouldn't be, but I guess I can understand why someone might default to it since it still allows access to the memory bus, which could come in handy in certain scenarios.

EDIT: Oh no, I just realized that the SH4's "always-running" 20-byte instruction prefetcher might get in the way of Cycle Steal. That would be... actually that would be really bad if that's the case...
TapamN wrote: Sat Apr 25, 2020 10:57 pm But it's possible to outspeed DMA submission under certain conditions. With DMA, you have to (1) calculate the vertices, (2) write them to memory, then (3) transfer them from RAM to the PVR. If you cut out (2) writing them to memory, you can save bandwidth and time. To actually get a speed increase out of this, you have pace the speed you submit vertices to the tile accelerator (to give it time to operate without needing to stall the CPU) AND find useful work to do in between each vertices submitted.
You can use OCBWB and OCBP with appropriate usage of movca.l to remove the memory-write bottleneck entirely. When using the write-back memory type, there's an intermediary write buffer that gets written to, eliminating performance penalties. It's like using the SQ to flush the cache and then DMAing from that spot, except this is the CPU doing it, so the SQ can be busy flushing memory elsewhere. I made a "cachefuncs.h" DreamHAL header for this.
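
(For anyone following along, the instructions being referred to look roughly like the wrappers below; these are just illustrative, not DreamHAL's actual cachefuncs.h API:)

Code:

#include <stdint.h>

/* ocbwb: write a dirty cache line back to RAM but keep it in the cache. */
static inline void cache_writeback_line(void *p) {
    __asm__ __volatile__("ocbwb @%0" : : "r"(p) : "memory");
}

/* ocbp: write the line back (if dirty) and invalidate it. */
static inline void cache_purge_line(void *p) {
    __asm__ __volatile__("ocbp @%0" : : "r"(p) : "memory");
}

/* movca.l: allocate the cache line containing p for writing without fetching it
   from RAM first. The source operand must be r0, hence the "z" constraint. */
static inline void cache_alloc_line(void *p, uint32_t first_word) {
    register uint32_t r0v __asm__("r0") = first_word;
    __asm__ __volatile__("movca.l r0, @%1" : : "z"(r0v), "r"(p) : "memory");
}
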
TapamN wrote: Sat Apr 25, 2020 10:57 pm AND find useful work to do in between each vertices submitted.
Yeah, this is the tough part :P
But the Dreamcast is meant to do a lot more than 2D and simple 3D, so a sufficiently complex 3D game can probably find plenty of stuff to do here. Things like extra effects, more complex AI, more complex physics calculations, fluid sims, whatever you want. So much more headroom, what are we ever going to do with it all? :P

If this old post is to be believed, with DOA2 only hitting 202 MIPS, that means games only ever really hit about 55% of what the console should be capable of. https://web.archive.org/web/20110406051 ... log/?p=143 (120-200 MIPS on average for games is really... bad, quite frankly. But that's probably because it was still the early days of 3D back then.)
202 MIPS means a lot of time was spent stalling out, when I would estimate numbers closer to 300 MIPS should be easily achievable with modern GCC 9 with like -O3 (which is a LOT better than the older versions many are using, particularly on the size and speed/instruction-level parallelism front; in fact KOS was just updated to be able to use GCC 9.3.0). I wouldn't be surprised if some dedicated demosceners could hit close to the full 400 MIPS, but that takes a lot of assembly and careful optimization, and of course a solid understanding of the hardware.

I can't really comment on the other things, as I'm not a game graphics dev and my OSdev experience to date has been pretty much exclusively on the CPU and MCU side of things (I could write a lot more here, but this post is already becoming quite a bit longer than I intended), but it all sounds good to me. If I may suggest, you might find the math header in DreamHAL useful; I've gotten at least one report of a 20% improvement just with naively replacing stuff here and there as opposed to really optimizing around it all--and if you're good with asm, you could potentially add some application-specific parallelism to some of the more intricate math functions like the matrix stuff for an even bigger boost.

Re: KOS vs Ninja - simple test.

Post by TapamN »

Sorry this reply took so long. Writing and cleaning up benchmarks took some time.
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pm Nope! SQs flush cache,
The SH4 does not have any sort of cache coherency. It's easy to check that SQs do not cause cache flushes. Disable interrupts (so that the timer IRQ handler does not accidentally flush the cache), write something to cache, and use the SQ to write to memory. Then reload the data you wrote and see whether you get the old data from the cache or the new RAM data. You will get the old cache data. In the benchmark mentioned below, there is a section that tests this and confirms SQs don't cause cache flushes.
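
A minimal sketch of that check, using KOS's IRQ and store queue helpers (the buffer layout and exact calls are illustrative):

Code:

#include <arch/irq.h>
#include <dc/sq.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t line[8] __attribute__((aligned(32)));   /* exactly one cache line */
static uint32_t fresh[8] __attribute__((aligned(32)));

void sq_coherency_check(void) {
    int old = irq_disable();        /* keep the timer IRQ from evicting the line */

    line[0] = 0x11111111;           /* dirty the line in the operand cache */
    for(int i = 0; i < 8; i++)
        fresh[i] = 0x22222222;
    sq_cpy(line, fresh, 32);        /* store queue write straight to RAM */

    uint32_t seen = line[0];        /* cached read */
    irq_restore(old);

    /* Reading 0x11111111 here means the SQ write did not flush or invalidate the cache. */
    printf("read back: %08lx\n", (unsigned long)seen);
}
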
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pm there's an architectural bottleneck right between the CPU core and the BSC. See the diagram on page 9, [...] page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.
The bus between the CPU core and the BSC runs at 200 MHz, so it has 200 M/s * 4 bytes = 800 MB/s bandwidth. The main RAM bus runs at 100 MHz, so it has 100 M/s * 8 bytes = 800 MB/s theoretical bandwidth. There's no bottleneck here; they have the same bandwidth.

If there WAS a bottleneck, the cache and SQs would be the same speed anyways. Both have to go through the BSC...
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pm Page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.
That says 64-bit accesses can only be done by the DMAC, while 8-bit, 16-bit, 32-bit, and 32-byte accesses can be done by the CPU as well. SQ and cache line loads/writebacks are 32-byte accesses, not 64-bit accesses. I wonder if this means that an uncached 64-bit floating point load/store is broken up into two 32-bit memory accesses?

Anyway, it's easy to test the bus by measuring its bandwidth. If the CPU or SQ can't use the full bus width, then with the DC's 100 MHz main RAM bus the absolute max bandwidth would be 400 MB/s (really less, due to communication overhead). The only way to go above 400 MB/s is for the CPU to use the entire bus. I wrote a benchmark to store-queue to RAM as fast as possible and see what its bandwidth was. I got 495 MB/s. A bit lower than I expected (when you factor in overhead, 533 MB/s would probably be the real absolute best speed for a 64-bit bus), but still more than 400 MB/s, so it's not limited to just 32-bit wide access. I also checked cache bandwidth by using a MOVCA, OCBWB sequence. This was 493 MB/s. Basically the same thing.
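
Roughly what that benchmark boils down to (the sizes, pass count, and MB/s math here are illustrative; the attached sqbench is more careful):

Code:

#include <dc/sq.h>
#include <arch/timer.h>
#include <malloc.h>
#include <stdio.h>

#define CHUNK  (4 * 1024 * 1024)   /* 4 MB target buffer in main RAM */
#define PASSES 16

void sq_fill_bandwidth(void) {
    void *buf = memalign(32, CHUNK);          /* SQ destination must be 32-byte aligned */

    uint64 start = timer_us_gettime64();
    for(int i = 0; i < PASSES; i++)
        sq_set(buf, 0, CHUNK);                /* fill RAM through the store queues */
    uint64 us = timer_us_gettime64() - start;

    /* bytes per microsecond == MB/s (decimal megabytes) */
    printf("SQ fill: %.1f MB/s\n", (double)((uint64)PASSES * CHUNK) / (double)us);
    free(buf);
}
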
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pmI'm not sure if the BSC then reassembles it into 4x 8-byte transfers out to memory or if it just passes it straight through as 8x 4-byte transfers, however. I would guess it doesn't actually reassemble them, since it probably just sees 8x 4-byte transfers coming in as 8x 4-byte transfers.
Well, since the SH4 gets more than 400 MB/s bandwidth, it has to recombine the data.
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pmAnd yes, I know the Dreamcast uses an SH7091, but as far as I've been able to tell it's identical
It seems like they can be treated as identical. As far as I can tell, the only possible difference between the SH7750 and SH7091 is that the semi-undocumented internal processor version register may be different between the two versions. (I think the register is mentioned in some supplementary documents, but not the main manuals.)
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pmIn any case, if you'd like to verify such architectural details yourself I have a performance counter module here, which uses the SH4's built-in performance counters: https://github.com/Moopthehedgehog/DreamHAL
I ran across the same Linux performance counters years ago when I was researching the SH4's ASERAM, and I already made my own library for them. (ASERAM is a special 1 KB on-chip protected memory intended to be used for debugging. I still haven't figured out if it can be accessed on the Dreamcast. It seems most of the ASE features are enabled by external signals, and I think it can't be accessed by the CPU without those signals, as a way to protect the debugger from being accidentally corrupted by user code.)
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pm Related: You may be seeing what looks like a buffer getting "full" because you're spamming the cache writeback buffer too frequently.
I'm not using the cache, I'm using the SQs. It takes time for the TA to write strip pointers to VRAM.

I decided to go and finally try benchmarking the TA's input bandwidth.

For submitting continuous off-screen vertices, I got a write performance of around 7.6 million 32-byte vertices per second on long strips (245 MB/s). DMA and SQ were pretty much the same speed. (These numbers go up if you use short strips, but the effective polygon throughput goes down.) For a full-screen quad (triangle pair), this fell to 152,000 quads/sec (128 bytes per quad: 19.5 MB/s). Again, DMA and SQ were tied, but with DMA the CPU would be able to operate in parallel.

The only benchmark where I found a real speed improvement using DMA over SQs is drawing large sprites, where SQs are 25% slower. Or it might be a bug in the benchmark?

I did have some problems getting the benchmarks to run correctly. At first, I was getting a bandwidth of 4.5 Mvert/s. I had written code that pushed 6 Mvert/s before, so I was confused why I was getting worse results. By accident, I figured out that for some reason the PVR runs slowly the first frame after it's initialized, so I modified the benchmark to output a dummy frame before doing the timing.

Another problem was that for the large poly benchmark, I was getting ~200 MB/s for SQs and ~9 MB/s for DMA. A copy-and-paste error had the SQ writing to RAM instead of the TA, and there was also a mistake in the bandwidth calculation.
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pmI'm also not sure what you mean by DMA stalling; the concept of DMA stalling just sounds kinda weird to me.
Well, whatever is writing to the TA has to stall when the TA's FIFO is full. Otherwise you'd get dropped vertices. I don't know exactly how it's implemented on the bus. I know that the TA area is set to MPX mode on the BSC, and MPX has a "!RDY" pin that the destination has to lower to signal that it's ready to receive data, and can raise to signal a wait.
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pm You can use OCBWB and OCBP with appropriate usage of movca.l to remove the memory-write bottleneck entirely. When using writeback memory type, there's an intermediary write buffer that gets written to, eliminating performance penalties. It's like using the SQ to flush the cache, then DMAing from that spot, except this is the CPU doing it so the SQ can be busy flushing memory elsewhere. I made a "cachefuncs.h" DreamHAL header for this.
Huh? I'm probably misunderstanding you, but the way you describe it kind of makes it sound like the SQ and write buffer can access memory at the same time. Only one thing can access memory at a time; if multiple things try to access RAM at the same time, something's going to stall. The purpose of the write buffer is to give priority to reads.

Say you cache miss. You have to load a new cache line from RAM, but the cache line it's replacing is dirty. Without the write buffer, the CPU would have to write the dirty cache line to memory (slow) and load the new cache line (slow) before it resumes execution. With the write buffer, the CPU can stuff the dirty cache line into the write buffer (quick), load the new cache line (slow), then resume execution. While the CPU is executing, the write buffer can be written to memory in the background. The write buffer allows the CPU to resume execution earlier and increases parallelization.

If the write buffer is full, and the CPU tries to stick something in it, the CPU will stall until the buffer is clear. Back-to-back OCBWB/OCBP can do this.

On a related note, if you try to PREF while there is already a memory access in flight (even another PREF), the CPU will stall until the PREF can start. Careless PREFs can actually make code run slower because of this.
Moopthehedgehog wrote: Mon Apr 27, 2020 5:31 pm I can't really comment on the other things, as I'm not a game graphics dev and my OSdev experience to date has been pretty much exclusively on the CPU and MCU side of things (I could write a lot more here, but this post is already becoming quite a bit longer than I intended), but it all sounds good to me. If I may suggest, you might find the math header in DreamHAL useful; I've gotten at least one report of a 20% improvement just with naively replacing stuff here and there as opposed to really optimizing around it all--and if you're good with asm, you could potentially add some application-specific parallelism to some of the more intricate math functions like the matrix stuff for an even bigger boost.
I've mainly focused on T&L and discovering ways to abuse the PVR. I have done some work on my own for-fun OS for the SuperH, which I intended to also be compatible with the HP Jornada 690 (a SuperH-3 DSP based pocket-sized computer that comes with Windows CE). I was able to get interrupts, gUSA atomics, timers, basic preemption, serial, and the UBC working before I stopped working on it.

I've already created something of my own similar to your DreamHAL, "libsh4", as well as a semi-tested matrix library and a bunch of headers with lists of hardware registers.

The problem with certain types of inline asm is that the compiler can't schedule around it well or do more advanced optimizations like software pipelining. I find it's better to just use full assembler than trying to stick asm in C/C++ code. You can write faster code, and it's cleaner than GCC's ugly, complex inline asm syntax.

Some of your DreamHAL floating point stuff looks slower than just letting GCC handle it.

Doing an approximate reciprocal with FSRRA(x*x) is faster than a divide (9 cycles versus 12 cycles) if you can guarantee x is positive, but the conditional to handle negative x would make it a couple of cycles slower due to the time it takes to do the floating point comparisons and branches, and GCC can't schedule the inline asm FSRRA as well as it can an FDIV. (GCC was able to generate FSRRA (and probably schedule it correctly) with the right options in older versions, but I can't get it to work on 9.3.)
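
For reference, the trick being discussed, as a minimal sketch (fsrra computes 1/sqrt(x), so fsrra(x*x) gives 1/|x|; the wrapper and sign fix-up are illustrative):

Code:

/* Approximate reciprocal via FSRRA: 1/x == 1/sqrt(x*x) for x > 0.
   The sign test is what costs the extra cycles mentioned above. */
static inline float recip_fsrra(float x) {
    float r = x * x;
    __asm__("fsrra %0" : "+f"(r));   /* r = 1/sqrt(x*x) = 1/|x| */
    return (x < 0.0f) ? -r : r;
}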

The cross product with FTRV seems like it would be slower than having GCC handle a straightforward C implementation because of all the FMOVs. Using individual FADD/FSUB/FMULs, you can issue a cross product in about 9 cycles and GCC can load data from RAM in parallel, but your FTRV cross product spends at least 13 cycles shuffling data into XMTRX (and that's not counting all the extra moves GCC may generate to satisfy all the float register location requests), and then you still have to do the actual FTRV.

I've included my memory and Big the Cat benchmark code, if anyone wants to see them. Maybe later I'll try to see if I can generate better triangle strips and get better results on the Big benchmark.

The memory benchmark runs and prints its results to the serial/ethernet console. On the Big benchmark, you can change the number of Bigs drawn with left and right on the d-pad. Pressing A will print some stats to the console. At the bottom of the screen, there are two bars that display the CPU and GPU load along with some ticks to measure what the load is. Each small tick is one millisecond, and each large tick is 1/60 of a second.
Attachments
sqbench.zip
(28.74 KiB)
bigbench.zip
(66.69 KiB)