KOS vs Ninja - simple test.


Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog » Sun May 10, 2020 4:23 pm

TapamN wrote:
Sun May 10, 2020 12:46 am
Sorry this reply took so long. Writing and cleaning up benchmarks took some time.
No worries, lol. Not in any rush here. :)
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
Nope! SQs flush cache,
The SH4 does not have any sort of cache coherency. It's easy to check that SQs do not cause cache flushes. Disable interrupts (so that the timer IRQ handler does not accidentally flush the cache), write something to cache, and use the SQ to write to memory. Then reload the data you wrote and see if you get the old data from cache, or the new RAM data. You will get the old cache data. In the benchmark mentioned below, there is a section that tests this and confirms SQs don't cause cache flushes.
Sorry, yeah, looks like I used poor word choice here. I had actually posted a way to make a kind of "fake" 2-way associativity by abusing 29-bit addressing that makes use of manual cache management in the Simulant Discord not too long ago. (Unrelated, but I gotta admit that it's nice having something like IRC that actually has conversation history.)

What I was getting at is that I was under the impression that SQs were only able to store data from the SH4's cache--not that they cause cache flushes. But that is useful to know that SQs don't explicitly trigger cache flushes! So by "cache flush" I had really meant to say "they write data out from the cache only." Which would imply that SQs don't really work right in write-through memory, which I think might be true but I'm not 100% sure.
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
there's an architectural bottleneck right between the CPU core and the BSC. See the diagram on page 9, [...] page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.
The bus between the CPU core and BSC runs at 200 MHz, so it has 200 M/s * 4 byte = 800 MB/s bandwidth. The main RAM bus runs at 100 MHz, so it has 100 M/s * 8 byte = 800 MB/s theoretical bandwidth. There's no bottleneck here, they have the same bandwidth.

If there WAS a bottleneck, the cache and SQs would be the same speed anyways. Both have to go through the BSC...
Yep, you're totally right. I forgot the frequency was different there, haha. Yes, the cache and SQ should be at the same speed. The weird part is only whether the BSC converts 8x 4-byte writes into 4x 8-byte writes. If it doesn't, then writing out to RAM would be only half-bandwidth due to the halving of bus frequency.
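The comparison above is easy to check by hand. A small C sketch of the arithmetic, using only the clock rates and bus widths quoted in the thread (the helper name `bus_bandwidth` is made up for illustration, MB means 10^6 bytes here, and protocol overhead is ignored):

```c
#include <assert.h>
#include <stdint.h>

/* Theoretical peak bandwidth in bytes per second, assuming one transfer
   of `width_bytes` per bus clock and no overhead of any kind. */
static uint64_t bus_bandwidth(uint64_t clock_hz, uint64_t width_bytes)
{
    assert(width_bytes > 0);
    return clock_hz * width_bytes;
}

/* Figures from the thread:
     bus_bandwidth(200000000, 4)  CPU core <-> BSC, 200 MHz x 4 bytes
     bus_bandwidth(100000000, 8)  main RAM bus,     100 MHz x 8 bytes
     bus_bandwidth(100000000, 4)  main RAM bus if only 4 bytes were used */
```

The first two calls give the same 800,000,000 bytes/s; the third gives 400,000,000 bytes/s, which is the half-bandwidth case that would apply if 4-byte transfers were passed through without being recombined.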
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
Page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.
That says 64-bit accesses can only be done by the DMAC, while 8-bit, 16-bit, 32-bit, and 32-byte can be done by the CPU as well. SQ and cacheline load/writebacks are a 32-byte access, not 64-bit access. I wonder if this would mean that an uncached 64-bit floating point load/store is broken up into two 32-bit memory accesses?
Yeah, I think that's why fmov.d does the "pair move" thing instead of a proper double-precision load in little-endian. Again causes me to wonder if the BSC does 8x4 to 4x8 conversion for 32-bytes. I don't know that it does.
TapamN wrote:
Sun May 10, 2020 12:46 am
Anyway, it's easy to test the bus by measuring its bandwidth. If the CPU or SQ can't use full bus width, with the DC's 100 MHz main RAM bus, the absolute max bandwidth would be 400 MB/s (it would really be less due to communication overhead). The only way to go above 400 MB/s would be for the CPU to use the entire bus. I wrote a benchmark to store queue to RAM as fast as possible and see what its bandwidth was. I got 495 MB/s. A bit lower than I expected (when you factor in overhead, 533 MB/s would probably be the real absolute best speed for a 64-bit bus), but still more than 400 MB/s, so it's not limited to just 32-bit wide access. I also checked cache bandwidth by using a MOVCA, OCBWB sequence. This was 493 MB/s. Basically the same thing.
I have a memset that I timed running in only 2 cycles (and I made variations that do 8, 4, 2, and 1 byte sizes). Using my similarly max-optimized memcpy/memmove functions in write-through memory, I've seen consistent behavior where the 8-byte one takes the same number of cycles to move the same quantity of data as the 4-byte, which would indicate that the BSC does not do 8x4 to 4x8 with the data. With write-back memory, the 8-byte function behaves more in line with expectation, since it's hitting the cache and there's a 64-bit path to the cache.

Also, being unnecessarily pedantic, the bus is not 800MB/sec unless you use "Apple" or "marketing" MB notation. It's really 760MiB/s and integral factors of it. :P I refuse to bow down to marketing pressures and continue to use MB to mean MiB, generally, so my numbers basically never mean marketing MBs.
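For reference, the conversion between the two notations is a one-liner. A minimal sketch, assuming MB means 10^6 bytes and MiB means 2^20 bytes (the helper name `mb_to_mib` is made up for illustration):

```c
#include <assert.h>

/* Convert a rate in decimal megabytes (10^6 bytes) per second into
   binary mebibytes (2^20 bytes) per second. */
static double mb_to_mib(double mb_per_s)
{
    assert(mb_per_s >= 0.0);
    return mb_per_s * 1000000.0 / 1048576.0;
}
```

`mb_to_mib(800.0)` comes out a little under 763 MiB/s, in the same ballpark as the rounded 760 figure above.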
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
I'm not sure if the BSC then reassembles it into 4x 8-byte transfers out to memory or if it just passes it straight through as 8x 4-byte transfers, however. I would guess it doesn't actually reassemble them, since it probably just sees 8x 4-byte transfers coming in as 8x 4-byte transfers.
Well, since the SH4 gets more than 400 MB/s bandwidth, it has to recombine the data.
Benchmark needs to be done with performance counter cycles to ensure that no CPU activity interferes with the counter.
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
And yes, I know the Dreamcast uses an SH7091, but as far as I've been able to tell it's identical
It seems like they can be treated as identical. As far as I can tell, the only possible difference between the SH7750 and SH7091 is that the semi-undocumented internal processor version register may be different between the two versions. (I think the register is mentioned in some supplementary documents, but not the main manuals.)
Is that PVR, CVR or PRR? PVR and PRR are documented in the appendix of the SH7750 manual, CVR is not and has something like info relating to the cache properties of the CPU.
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
In any case, if you'd like to verify such architectural details yourself I have a performance counter module here, which uses the SH4's built-in performance counters: https://github.com/Moopthehedgehog/DreamHAL
I ran across the same Linux performance counters years ago when I was researching the SH4's ASERAM and already made my own library for them. (ASERAM is a special 1 KB on-chip protected memory intended to be used for debugging. I still haven't figured out if it can be accessed on the Dreamcast. It seems most of the ASE features are enabled by external signals, and I think can't be accessed by the CPU without those signals as a way to protect the debugger from being accidentally corrupted by user code.)
...Yeah so Linux actually got a lot of that stuff wrong. The bit labels and their functions in that Linux code are mostly just wrong--I spent almost an entire month researching the things, including doing stuff like letting one of the 48-bit counters overflow to check for an overflow bit (there isn't one). I also came across the ASE stuff--some part of it IS accessible by the CPU, but I forget what. There's a section on ASE in one of the ST Micro SH4 docs on their site. Edit: Here, chapter 13: https://www.st.com/content/ccc/resource ... 153464.pdf
Great, they give a whole register map and no addresses, lol, but the fact that ASE mode jumps to code at 0xFC000000 should be some kind of a clue. So I would guess 0xFC000000 is the ASE RAM area. The footnote that various regs are in the CCN also indicates that they would be in the 0xFF000000 to 0xFF1FFFFF area, which is where the performance counters are. IIRC the performance counters are actually supposed to be part of ASE or something. There's also the AUD section in chapter 12, which may or may not be related. I know that H-UDI uses the /ASEBRK pin, too.

By the way, old demo versions of CodeScape actually have parts of Sega's SDK 10.1 in them--I don't know what they were thinking leaving that stuff in a freely available public demo, but it's there, including an entire instruction manual that includes info on how to use the various modes of the performance counters.
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
Related: You may be seeing what looks like a buffer getting "full" because you're spamming the cache writeback buffer too frequently.
I'm not using the cache, I'm using the SQs. It takes time for the TA to write strip pointers to VRAM.
I think you're getting hit by bus arbitration here. I dunno, something still doesn't seem right about this, but I haven't looked into the PVR much.
TapamN wrote:
Sun May 10, 2020 12:46 am
The only benchmark I found where there seems to be a real speed improvement using DMA over SQs is drawing large sprites, where SQs are 25% slower. Or it might be a bug in the benchmark?

I did have some problems getting the benchmarks to run correctly. At first, I was getting a bandwidth of 4.5 Mvert/s. I had written code that pushed 6 Mvert/s before, so I was confused why I was getting worse results. By accident, I figured out that for some reason, the PVR runs slowly the first frame after it's initialized. So I modified the benchmark to output a dummy frame before doing the timing.

Another problem was that for the large poly benchmark, I was getting ~200 MB/s for SQs and ~9 MB/s for DMA. An error with copy and paste had the SQ writing to RAM instead of the TA, and there was also a mistake with the bandwidth calculation.
Sounds like your benchmark might need some work... Or more research. I dunno, I find the sheer amount of REing necessary to do things properly tends to be pretty exhausting, and I'm just doing the GAPS bridge to make a proper driver for it, lol.
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
I'm also not sure what you mean by DMA stalling; the concept of DMA stalling just sounds kinda weird to me.
Well, whatever is writing to the TA has to stall when the TA's FIFO is full. Otherwise you'd get dropped vertices. I don't know exactly how it's implemented on the bus. I know that the TA area is set to MPX mode on the BSC, and MPX has a "!RDY" pin that the destination has to lower to signal that it's ready to receive data, and can raise to signal a wait.
Ah, ok. That makes more sense. It's kinda like the G2 FIFO, where the stall gets consumed by a write operation so you'd never see it unless you were timing with a cycle-counting performance counter. This is why audio can be such a massive performance killer on DC when not using G2 DMA, too, and no profiler would be able to detect the impact of it except the cycle counter. And the stall can be hundreds of cycles per write--it's terrible!
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
You can use OCBWB and OCBP with appropriate usage of movca.l to remove the memory-write bottleneck entirely. When using writeback memory type, there's an intermediary write buffer that gets written to, eliminating performance penalties. It's like using the SQ to flush the cache, then DMAing from that spot, except this is the CPU doing it so the SQ can be busy flushing memory elsewhere. I made a "cachefuncs.h" DreamHAL header for this.
Huh? I'm probably misunderstanding you, but the way you describe it kind of makes it sound like the SQ and write buffer can access memory at the same time. Only one thing can access memory at a time; if multiple things try to access RAM at the same time, something's going to stall. The purpose of the write buffer is to give priority to reads.
Yes! You can! movca.l doesn't do a cache read upon cache miss, it just allocates a cache block. So by using movca.l to allocate a cache block while using the SQ to write to RAM, you can effectively act like two memory writes are happening at once! (relevant functions in DreamHAL's cachefuncs.h header)
TapamN wrote:
Sun May 10, 2020 12:46 am
On a related note, if you try to PREF while there is already a memory access (even another PREF), the CPU will stall until the PREF can start. Careless PREFs can actually make code run slower because of this.
Ah! I was wondering about this. The manual says pref turns into a "nop," which shouldn't stall. Has this been experimentally verified? I don't believe I've seen behavior that would indicate pref was slowing things down versus turning into a nop if the bus is in use or a prior pref is already in flight.
TapamN wrote:
Sun May 10, 2020 12:46 am
Moopthehedgehog wrote:
Mon Apr 27, 2020 5:31 pm
I can't really comment on the other things, as I'm not a game graphics dev and my OSdev experience to date has been pretty much exclusively on the CPU and MCU side of things (I could write a lot more here, but this post is already becoming quite a bit longer than I intended), but it all sounds good to me. If I may suggest, you might find the math header in DreamHAL useful; I've gotten at least one report of a 20% improvement just with naively replacing stuff here and there as opposed to really optimizing around it all--and if you're good with asm, you could potentially add some application-specific parallelism to some of the more intricate math functions like the matrix stuff for an even bigger boost.
I've mainly focused on T&L and discovering ways to abuse the PVR. I have done some work on my own for-fun OS for the SuperH, which I intended to be compatible with the HP Jornada 690 as well (a SuperH-3 DSP based pocket sized computer that comes with Windows CE). I was able to get interrupts, gUSA atomics, timers, basic preemption, serial, and the UBC working before I stopped working on it.

I've already created something of my own similar to your DreamHAL, "Libsh4", as well as a semi-tested matrix library and a bunch of headers with lists of hardware registers.

The problem with certain types of inline asm is that the compiler can't schedule around it well or do more advanced optimizations like software pipelining. I find it's better to just use full assembler than trying to stick asm in C/C++ code. You can write faster code and it's cleaner than GCC's ugly, complex inline asm syntax.
That's neat!

DreamHAL is supposed to eventually be for the whole CPU--not an OS, just hardware-helper functions because the SH4 can be pretty maddening to deal with from scratch, let alone optimize for. Easier than x86, for sure, but a lot of things that are "automated" on x86 have to be done manually on SH4 (like cache management, TLB reloads, etc.). The math header is just kind of its initial "claim to fame" since it allows for doing trig, fast divides, roots, vector and matrix ops (and abusing them to do other things) and more that GCC just has no interface for. If you notice, for example, the various matrix ops store things like the cross product matrix and other things in XMTRX so that they can be reused in other matrix functions later. GCC doesn't touch XMTRX in m4-single-only, not just per the SH4 C ABI but also because XMTRX can't be accessed without the double-precision operations. :) (Aside: I think GDB might use XMTRX for something, but eh, I don't use GDB and I figure anyone using the matrix ops at a level where that would be a problem is "hardcore" enough to be able to debug without needing GDB... Or at least can find a way around that.)

While I agree that GCC's syntax is ugly, I know it, and I've seen other attempts just do it all wrong, which makes them slower than not doing anything in asm at all. For example, explicitly reserving registers instead of allowing GCC to pick registers can have a substantial hidden cost, where GCC needs to emit moves all over the place to get data into the right registers.
TapamN wrote:
Sun May 10, 2020 12:46 am
Some of your DreamHAL floating point stuff looks slower than just letting GCC handle it.
Oof, shots fired, lol--but I assure you it's not slower.
GCC really makes a mess of things sometimes--for example using software FPU emulation when the hardware can handle NaNs and infinities just fine (this one blew my mind when I got an error about it from trying to use __builtin_isnan() and __builtin_isinf()). GCC also can't do the vector and matrix operations at all. That said, I haven't gotten much feedback on the matrix operations, as most people thus far have been using the single-instruction wrappers and the like, which does parallelize around most things GCC does to it.

Actually, the one bit of matrix-related feedback I got was that my cross product emitted a negative zero where someone was expecting a positive zero, which was causing problems for them (even though I triple-checked and both my cross product and their math were doing the exact same mathematical operations in the same order, too)--but I believe that to be FTRV-rounding-related and they never even tried to look into it. It also doesn't make much sense when in SH4 hardware -0.0f == 0.0f, but that's pretty much the only feedback I've gotten about them from anyone. ¯\_(ツ)_/¯
TapamN wrote:
Sun May 10, 2020 12:46 am
Doing an approximate reciprocal with FSRRA(x*x) is faster than a divide (9 cycles versus 12 cycles) if you can guarantee x is positive, but the conditional to handle negative x would make it a couple cycles slower due to the amount of time it takes to do the floating point comparisons and branches, and GCC can't schedule the inline asm FSRRA as well as it can with a FDIV. (GCC was able to generate FSRRA (and probably schedule it correctly) with the right options in older versions, but I can't get it to work on 9.3.)
Hehe, this is where lots of research went. It's still faster than FDIV, and the combination of C and asm allows GCC to optimize things it can optimize. It's all about working *with* the compiler and not in spite of or against it.
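For anyone following along, the FSRRA(x*x) trick being discussed can be sketched in portable C. FSRRA is an SH4 instruction, so `fsrra_standin` below is a hypothetical host-side substitute (a bit-trick initial guess plus a few Newton-Raphson steps). It is not the hardware's algorithm and not DreamHAL's code, and all function names here are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for FSRRA: approximate 1/sqrt(x) for x > 0. */
static float fsrra_standin(float x)
{
    uint32_t bits;
    float y;
    memcpy(&bits, &x, sizeof bits);
    bits = 0x5f3759dfu - (bits >> 1);        /* rough initial guess */
    memcpy(&y, &bits, sizeof y);
    for (int i = 0; i < 3; i++)              /* Newton-Raphson refinement */
        y = y * (1.5f - 0.5f * x * y * y);
    return y;
}

/* 1/sqrt(x*x) == 1/|x|, so this is only a reciprocal when x > 0. */
static float recip_positive(float x)
{
    assert(x > 0.0f);
    return fsrra_standin(x * x);
}

/* Handles negative x by restoring the sign afterwards; this is the
   extra comparison and branch mentioned in the quote above. */
static float recip_signed(float x)
{
    assert(x != 0.0f);
    float r = fsrra_standin(x * x);
    return (x < 0.0f) ? -r : r;
}
```

`recip_positive` skips the sign check and is only valid when x is known to be positive. Neither function handles x == 0, and the cycle counts quoted above apply to the real instructions, not to this host-side sketch.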
TapamN wrote:
Sun May 10, 2020 12:46 am
The cross product with FTRV seems like it would be slower than having GCC handle a straightforward C implementation because of all the FMOVs. Using individual FADD/FSUB/FMULs you can issue a cross product in about 9 cycles and GCC could load data from RAM in parallel, but your FTRV cross product spends at least 13 cycles shuffling data into XMTRX (and that's not counting all the extra moves GCC may generate to satisfy all the float register location requests), then you still have to do the actual FTRV.
Those FMOVs do 2 floats per cycle; with dual-issue pipelining and FMOV's being 0-cycle LS group, it should be half of what you say, unless LS grouping takes precedence over the 0-cycle... In any case, not all float instructions are in the same group--that's what people who didn't read Ch. 8 of the manual say! So other things can be parallelized into those matrix functions... and the only reason they're not is because I simply don't know what to put in there that would be useful! :grin:
TapamN wrote:
Sun May 10, 2020 12:46 am
I've included my memory and Big the Cat benchmark code, if anyone wants to see them. Maybe later I'll try to see if I can generate better triangle strips and get better results on the Big benchmark.

The memory benchmark runs and prints its results to the serial/ethernet console. On the Big benchmark, you can change the number of Bigs drawn with left and right on the d-pad. Pressing A will print some stats to the console. At the bottom of the screen, there are two bars that display the CPU and GPU load along with some ticks to measure what the load is. Each small tick is one millisecond, and each large tick is 1/60 of a second.
Nice! Hopefully I'll have time to take a look at these at some point. The GAPS bridge nonsense is currently driving me up a wall and I don't know how long that's gonna go on for...

Re: KOS vs Ninja - simple test.

Post by TapamN » Wed May 13, 2020 12:34 pm

Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Sorry, yeah, looks like I used poor word choice here. I had actually posted a way to make a kind of "fake" 2-way associativity by abusing 29-bit addressing that makes use of manual cache management in the Simulant Discord not too long ago. (Unrelated, but I gotta admit that it's nice having something like IRC that actually has conversation history.)
The OCINDEX cache mode? A while ago, I tried combining OCINDEX mode and MOVCA to allocate cache above the TA submission area, and submitting vertices using cache writebacks. It functioned, but the advantages I thought it would have over SQs didn't work out.
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
What I was getting at is that I was under the impression that SQs were only able to store data from the SH4's cache--not that they cause cache flushes. But that is useful to know that SQs don't explicitly trigger cache flushes! So by "cache flush" I had really meant to say "they write data out from the cache only." Which would imply that SQs don't really work right in write-through memory, which I think might be true but I'm not 100% sure.
SQs store data from the CPU. There's no way to move stuff from cache to SQ, only CPU to SQ. They are a separate block of RAM from the cache. The writes to the SQs are as fast as writing to cache, though.
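The check described earlier in the thread (write something to cache, write to the same address through the SQ, then read it back) can be illustrated with a host-side toy model. This is only a sketch of the reported behavior, not Dreamcast code and not the benchmark itself; the struct and function names are made up, and it models a single write-back cache line with no coherency:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one RAM word, one write-back cache line covering it, and a
   store queue that writes to RAM without consulting the cache. */
struct toy_mem {
    uint32_t ram;
    uint32_t cache_data;
    bool     cache_valid;
};

/* CPU store through the cache (write-back: RAM is not updated yet). */
static void cached_store(struct toy_mem *m, uint32_t v)
{
    m->cache_data = v;
    m->cache_valid = true;
}

/* Store-queue write: goes straight to RAM. The cache line is neither
   flushed nor invalidated, because nothing keeps the two coherent. */
static void sq_store(struct toy_mem *m, uint32_t v)
{
    m->ram = v;
}

/* CPU load: a valid cache line takes priority over RAM. */
static uint32_t cached_load(const struct toy_mem *m)
{
    return m->cache_valid ? m->cache_data : m->ram;
}

/* Runs the check and returns what the CPU reads back afterwards. */
static uint32_t run_sq_check(void)
{
    struct toy_mem m = { 0, 0, false };
    cached_store(&m, 0x11111111u);   /* old data, held in cache     */
    sq_store(&m, 0x22222222u);       /* new data, written via SQ    */
    assert(m.ram == 0x22222222u);    /* RAM did receive the SQ data */
    return cached_load(&m);          /* read comes from the cache   */
}
```

In this model `run_sq_check()` returns the old cached value, matching the result reported for the real hardware. On an actual SH4 the test also needs interrupts disabled, as noted earlier, so that nothing else flushes the cache in between.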
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
I have a memset that I timed running in only 2 cycles (and I made variations that do 8, 4, 2, and 1 byte sizes). Using my similarly max-optimized memcpy/memmove functions in write-through memory, I've seen consistent behavior where the 8-byte one takes the same number of cycles to move the same quantity of data as the 4-byte, which would indicate that the BSC does not do 8x4 to 4x8 with the data. With write-back memory, the 8-byte function behaves more in line with expectation, since it's hitting the cache and there's a 64-bit path to the cache.
When you benchmarked your memcpy, was it for copies larger than cache? You were benchmarking how long it took to copy main RAM and not cache, right? According to this document, a row-hit read and a row-hit write would take 7+6=13 bus cycles, or 26 CPU cycles. You can easily copy 32 bytes using longs in that time. It seems like the reason both were the same speed is because the memory bus is too slow for doubles to make a difference. Nothing to do with the BSC.

If it takes 13 bus cycles to copy 32 bytes, we would predict a copy rate of 100 MHz / 13 cycles * 32 bytes = 246 MB/s. I updated the benchmark to see how fast the CPU can read a cacheline, and write it back. I got 223.5 MB/s best case doing reads with cache and writes with SQ. Using cache for both falls to 185 MB/s for some reason.
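The prediction above is just bus clock times bytes per block, divided by bus cycles per block. A minimal C sketch of that arithmetic (the function name is made up for illustration; the 13-cycle and 32-byte figures are the ones quoted above):

```c
#include <assert.h>
#include <stdint.h>

/* Predicted copy rate in bytes per second when every `bytes_per_block`
   bytes cost `bus_cycles_per_block` cycles on a bus clocked at `bus_hz`.
   Integer division, so the result is rounded down. */
static uint64_t predicted_copy_rate(uint64_t bus_hz,
                                    uint64_t bus_cycles_per_block,
                                    uint64_t bytes_per_block)
{
    assert(bus_cycles_per_block > 0);
    return bus_hz * bytes_per_block / bus_cycles_per_block;
}
```

`predicted_copy_rate(100000000, 13, 32)` gives 246,153,846 bytes/s, i.e. the 246 MB/s figure, so the measured 223.5 MB/s best case is a bit over 90% of the prediction.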
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Also, being unnecessarily pedantic, the bus is not 800MB/sec unless you use "Apple" or "marketing" MB notation. It's really 760MiB/s and integral factors of it. :P I refuse to bow down to marketing pressures and continue to use MB to mean MiB, generally, so my numbers basically never mean marketing MBs.
True. I usually go with MB==MiB as well, and I did that at first, but then I decided to go with 800 MB/s since that's what generally gets listed on Dreamcast tech specs, even though it's not 100% accurate.
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Benchmark needs to be done with performance counter cycles to ensure that no CPU activity interferes with the counter.
Why? It's using the hardware timer, which I've used to do cycle accurate timing before by running code*. CPU usage won't interfere with it, that's kind of the point of having separate hardware timers, getting accurate timing regardless of what the CPU does.

I modified the benchmark to allow selecting between using KOS timers, my perf counter code, and DreamHAL's perf counter code. All non-DMA tests generate the same values within <1%. DMA tests were way off at first; turns out CPU activity can interfere with the performance counter. I tested DMA bandwidth using blocking DMA. Blocking DMA ends up running the idle task, which executes the SLEEP instruction, causing the CPU and performance counter to stop! I modified the DMA tests to instead poll until DMA is done before stopping the timer, and this resulted in the perf counters generating the same results as the timer for DMA as well.

* The hardware timer does not have enough precision on its own to do cycle accurate timing, but by multiplying the time it takes for code to execute, you can bring it up to where the timer can measure it. Say you have a block of code you want to measure. If you run it in a loop 200,000,000 times, each cycle in the loop adds one second to the execution time, which can be easily measured. It's possible to do cycle accurate timing using sunrises if you loop enough. Doing multiple loops of the code being profiled also reduces the effect background threads or interrupt handlers might have on the measurement.
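The loop-multiplication idea in the footnote boils down to one division. A minimal sketch, assuming a 200 MHz CPU clock as in the 200,000,000-iteration example (the function name is made up for illustration, and loop overhead is not subtracted):

```c
#include <assert.h>

/* Cycles spent per loop iteration, recovered from a coarse measurement
   of the total time the whole loop took. */
static double cycles_per_iteration(double elapsed_s, double cpu_hz,
                                   double iterations)
{
    assert(iterations > 0.0);
    return elapsed_s * cpu_hz / iterations;
}
```

With 200,000,000 iterations at 200 MHz, a loop that takes 3 seconds works out to 3 cycles per iteration, so even a timer with poor resolution can resolve single cycles once the loop count is large enough.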
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Is that PVR, CVR or PRR? PVR and PRR are documented in the appendix of the SH7750 manual, CVR is not and has something like info relating to the cache properties of the CPU.
None of them were in the programming/hardware manuals I use. I first found out about them in MAME's source, and I only found official documentation about them when going through the application notes and errata. Which manual did you find them in?
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
...Yeah so Linux actually got a lot of that stuff wrong. The bit labels and their functions in that Linux code are mostly just wrong
Huh? I just rewrote the Linux stuff in my own style and it worked first try. Maybe what you looked at was for another model of SH4, SH4A, SHX or something?
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
I also came across the ASE stuff--some part of it IS accessible by the CPU, but I forget what. There's a section on ASE in one of the ST Micro SH4 docs on their site. Edit: Here, chapter 13: https://www.st.com/content/ccc/resource ... 153464.pdf
Great, they give a whole register map and no addresses, lol, but the fact that ASE mode jumps to code at 0xFC000000 should be some kind of a clue. So I would guess 0xFC000000 is the ASE RAM area. The footnote that various regs are in the CCN also indicates that they would be in the 0xFF000000 to 0xFF1FFFFF area, which is where the performance counters are. IIRC the performance counters are actually supposed to be part of ASE or something. There's also the AUD section in chapter 12, which may or may not be related. I know that H-UDI uses the /ASEBRK pin, too.
It's been years since I looked into the ASE stuff, so I've forgotten most of it.
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Sounds like your benchmark might need some work... Or more research. I dunno,
As far as I can tell, the benchmark is correct at this point. It's still possible I'm wrong, but that's why I'm uploading the source, to see if someone else could notice something I didn't.
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Ah! I was wondering about this. The manual says pref turns into a "nop," which shouldn't stall. Has this been experimentally verified? I don't believe I've seen behavior that would indicate pref was slowing things down versus turning into a nop if the bus is in use or a prior pref is already in flight.
Yes, I've tested it. The NOP conversion is if the PREF would cause an exception, like a TLB miss.
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Oof, shots fired, lol--but I assure you it's not slower.
GCC really makes a mess of things sometimes--for example using software FPU emulation when the hardware can handle NaNs and infinities just fine (this one blew my mind when I got an error about it from trying to use __builtin_isnan() and __builtin_isinf()).
Er, not trying to be rude or anything, just helpful. I think there's some kind of errata for the SH4's floating point, so GCC might be doing a software workaround to get true IEEE FP results? What optimization settings were you using when you benchmarked your math functions against the C versions? Did you have -ffast-math and -ffp-contract=fast turned on?
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
TapamN wrote:
Sun May 10, 2020 12:46 am
Doing an approximate reciprocal with FSRRA(x*x) is faster than a divide (9 cycles versus 12 cycles) if you can guarantee x is positive, but the conditional to handle negative x would make it a couple cycles slower due to the amount of time it takes to do the floating point comparisons and branches, and GCC can't schedule the inline asm FSRRA as well as it can with a FDIV. (GCC was able to generate FSRRA (and probably schedule it correctly) with the right options in older versions, but I can't get it to work on 9.3.)
Hehe, this is where lots of research went. It's still faster than FDIV, and the combination of C and asm allows GCC to optimize things it can optimize. It's all about working *with* the compiler and not in spite of or against it.
I added some DreamHAL math tests to the benchmark program. It took some work to stop GCC from optimizing out the C code, but I got something with a disassembly that looks good. I had to stick the math functions into noinline functions, to make it harder for GCC to optimize out what is being benchmarked, and when testing the FDIV instruction, I had to use inline assembly to stop GCC from precalculating the result, even with FDIV in its own noinline function. I also force the results to be immediately written to memory so that the latency of the operation is included in the benchmark.
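As a rough illustration of the two techniques just described (a noinline wrapper, and forcing each result out to memory), here is a minimal C sketch using GCC's `noinline` attribute and a `volatile` sink. It is not the actual benchmark source; the names `bench_sink`, `divide_c`, and `bench_divide` are made up, and as noted above a noinline function alone may not stop GCC from precalculating a result in every case:

```c
#include <assert.h>

/* Volatile sink: every result has to be stored to memory, so the
   compiler cannot throw the computation away. */
static volatile float bench_sink;

/* noinline makes it harder for GCC to fold the call into the loop. */
__attribute__((noinline))
static float divide_c(float a, float b)
{
    return a / b;
}

/* Calls the routine `calls` times; timing code would wrap this call. */
static void bench_divide(float a, float b, unsigned calls)
{
    assert(calls > 0);
    for (unsigned i = 0; i < calls; i++)
        bench_sink = divide_c(a, b);
}
```

Because the loop includes call and store overhead, timings taken this way are only useful for comparing two routines under the same harness, not as absolute instruction costs.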

For both cross product and division, in the benchmark plain C had better performance. The exact numbers don't really mean much, since there's stuff like loop and function call overhead that you wouldn't have when you're actually using them and the surrounding code would have an effect on performance. My goal is to see if the difference is more than what would be considered noise. To do 100,000 function calls to a division routine took 11.5 ms using plain C, and 12.5 ms using the approximate divide. The difference would be greater if I wasn't forcing a worst case latency stall. For cross products, C was 17.2 ms and FTRV was 23.2 ms.
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
Those FMOVs do 2 floats per cycle; with dual-issue pipelining and FMOV's being 0-cycle LS group, it should be half of what you say, unless LS grouping takes precedence over the 0-cycle... In any case, not all float instructions are in the same group--that's what people who didn't read Ch. 8 of the manual say! So other things can be parallelized into those matrix functions... and the only reason they're not is because I simply don't know what to put in there that would be useful! :grin:
Zero latency and instruction grouping are orthogonal. Two instructions can dual execute if they are in compatible groups, like LS and FE, and the second instruction does not depend on the result of the first (unless the first instruction is zero latency). FMOV and FNEG are both LS group, and LS instructions cannot execute simultaneously with each other, so much of the cross product setup is executing a single instruction at a time. The timing for the cross product moves looks like this:

Code:

fmov FR5, FR7
fschg
!FSCHG does not depend on FMOV and they are compatible instruction groups, so they can dual execute (Total 1 cycle elapsed)

fmov DR12, XD12
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 2 cycles elapsed)

fmov DR14, XD14
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 3 cycles elapsed)

fmov DR10, XD0
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 4 cycles elapsed)

fmov DR2, XD2
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 5 cycles elapsed)

fmov DR0, DR10
!FMOV and the next instruction are the same group (LS), so they cannot dual execute  (Total 6 cycles elapsed)

fmov DR0, DR12
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 7 cycles elapsed)

fmov DR0, DR14
fschg
!FSCHG does not depend on FMOV and they are compatible instruction groups, so they can dual execute (Total 8 cycles elapsed)

fneg FR9
!FNEG and the next instruction are the same group (LS), so they cannot dual execute (Total 9 cycles elapsed)

fmov FR8, FR2
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 10 cycles elapsed)

fneg FR2
!FNEG and the next instruction are the same group (LS), so they cannot dual execute  (Total 11 cycles elapsed)

fmov FR4, FR1
!FMOV and the next instruction are the same group (LS), so they cannot dual execute (Total 12 cycles elapsed)

fneg FR4
frchg
!FRCHG does not depend on FNEG, and they are compatible instruction groups, so they can dual execute (Total 13 cycles elapsed)

!Reviewing this, there's actually a stall here, since any LS instructions have their latency increased to 3 cycles if they
!are feeding into the FIPR unit, even a 0 latency register-to-register FMOV. The "fmov FR4, FR1" was two cycles ago,
!so there's a one cycle stall here. (Total 14 cycles elapsed)

ftrv XMTRX, FV0
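The cycle count in the listing above can be reproduced with a very small model. This is only a sketch under a stated simplification: two adjacent instructions pair when their groups differ, and data dependencies and extra latencies are ignored. The `GRP_OTHER` bucket is an assumption covering FSCHG/FRCHG, which the listing treats as compatible with LS.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified issue groups: LS, plus one bucket for anything that pairs with LS. */
enum group { GRP_LS, GRP_OTHER };

/* Counts issue cycles when two adjacent instructions pair only if their
 * groups differ. Data dependencies and extra latencies are ignored. */
static unsigned issue_cycles(const enum group *insn, size_t n)
{
    unsigned cycles = 0;
    size_t i = 0;

    assert(insn != NULL || n == 0);
    while (i < n) {
        if (i + 1 < n && insn[i] != insn[i + 1])
            i += 2;   /* compatible pair issues together */
        else
            i += 1;   /* same group (or last instruction): issues alone */
        cycles++;
    }
    return cycles;
}

/* The 16 instructions before FTRV, in listing order. */
static const enum group cross_setup[] = {
    GRP_LS, GRP_OTHER,                  /* fmov FR5,FR7 + fschg */
    GRP_LS, GRP_LS, GRP_LS, GRP_LS,     /* four fmov into XD registers */
    GRP_LS, GRP_LS,                     /* fmov DR0,DR10 / fmov DR0,DR12 */
    GRP_LS, GRP_OTHER,                  /* fmov DR0,DR14 + fschg */
    GRP_LS, GRP_LS, GRP_LS, GRP_LS,     /* fneg, fmov, fneg, fmov */
    GRP_LS, GRP_OTHER                   /* fneg FR4 + frchg */
};
```

The model yields 13 issue cycles for the setup; the one-cycle stall noted before FTRV is a latency effect that this sketch does not model, which is where the 14th cycle comes from.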
An expanded, improved version of the benchmark is included.
Attachments
sqbench2.zip
(52.99 KiB)
Moopthehedgehog
DCEmu Freak
Posts: 85
Joined: Wed Jan 05, 2011 4:25 pm
Has liked: 4 times
Been liked: 36 times

Re: KOS vs Ninja - simple test.

Post by Moopthehedgehog » Thu May 14, 2020 1:55 pm

TapamN wrote:
Wed May 13, 2020 12:34 pm
The OCINDEX cache mode? A while ago, I tried combining OCINDEX mode and MOVCA to allocate cache above the TA submission area, and submitting vertices using cache writebacks. It functioned, but the advantages I thought it would have over SQs didn't work out.
Not OCINDEX, no. Basically the trick is doing a write-through write to memory, and then reading back that data in write-back memory, doing all the operations on the write-back memory (which is only in the cache), and then invalidating the cache without allowing the data to get written back to the starting area (e.g. do a memcpy() to get the final data to its destination followed by an ocbi instruction). Then you effectively have used the same memory area as (pseudo-)two memory areas, which is kind of like a "fake" 2-way associativity in that the address for each fake "way" is an integer multiple of the "way" size--0 in this case. But I suppose a better way to refer to it is using a single memory address to simultaneously hold two different pieces of data, one temporary and the other more permanent.

It's a caching hack, to be sure, but effective use of it can save 32-byte memory fetches. You can take it a step further, especially if you have 32 bytes of data (or don't care about what's already at the destination's cache block), and replace the memcpy() I mentioned with a movca.l-optimized memcpy() to save a cacheline read of the destination before writing to it.
TapamN wrote:
Wed May 13, 2020 12:34 pm
SQs store data from the CPU. There's no way to move stuff from cache to SQ, only CPU to SQ. They are a separate block of RAM from the cache. The writes to the SQs are as fast as writing to cache, though.
Oh, I had assumed that "by way of the CPU's registers" was implied. I mean, how else is the data going to get there? The only way to write data to the SQs is from the CPU's cache by way of registers (or by directly storing whatever's in 8 of the registers per SQ).

Architecturally, the CPU stores its data in the cache and the SQs each just have a 32-byte storage buffer probably set up in the same way as the cache writeback buffer. The SQs are like the writeback buffer on steroids that the programmer has complete control over, and since the SQs are on the CPU itself there's no place from which they could source data outside of the CPU's registers (which either get their data from the data cache or they build it from scratch with 8-bit immediates, which actually come from the instruction cache--so no matter what you're getting data from a cache somewhere).

While it's on my mind, one can actually order data by its distance from the CPU core, which also corresponds directly with how fast each can be accessed. Perhaps this list may be useful to some:

(Closest, Fastest)
1. Registers
2. Instruction Cache
3. Data Cache (with modern CPUs, lower "L" means closer to the CPU)
4. Memory bus to main RAM
5. External busses to peripherals
6. External bus to permanent storage (HDD, CD, Tape, DNA, whatever)
(Furthest, Slowest)

Usually 5 and 6 are multiplexed, making them the same distance, with peripherals in most cases outspeeding permanent storage. Main RAM has a dedicated memory bus. On Dreamcast, main RAM is multiplexed with the Holly on the memory bus, so communication with that peripheral from the SH4 should be comparably quick. Other peripherals are directly connected to the Holly, so the Holly acts kind of like the Northbridge and Southbridge of PC motherboards and multiplexes 5 and 6.

On SH4, the following is true about transferring data:

SQ*: 1 --> 4,5,6
OCBP/OCBWB instructions: 3 --> 4,5,6
DMA**: 4 <--> 5,6, or 5 <--> 6

* Behind-the-scenes, source data for the SQs comes from 2,3.
** 5 <--> 6 requires an appropriate setup, e.g. using MMU address translation, or setting up the physical address space like the Dreamcast has it--which appears to be the Holly actually intercepting the memory bus for addresses that aren't destined for main RAM. Modern chips with DMACs like ARM just allow peripheral-to-peripheral DMA without needing to worry about the memory map.
TapamN wrote:
Wed May 13, 2020 12:34 pm
When you benchmarked your memcpy, was it for copies larger than cache? You were benchmarking how long it took to copy main RAM and not cache, right? According to this document, a row-hit read and a row-hit write would take 7+6=13 bus cycles, or 26 CPU cycles. You can easily copy 32 bytes using longs in that time. It seems like the reason both were the same speed is because the memory bus is too slow for doubles to make a difference. Nothing to do with the BSC.
Why would I benchmark copying to main ram for a cycle count measuring instruction latency? Cycle counts in documentation for things like mov.l and fmov.s only apply to reads and writes from the cache, as mov.l stalls the whole pipeline when there's a cache miss. Registers only operate on cache anyways--you can't get anything into or out of them without going through the cache (SQs aside--but those are special SH4 things and this is true of x86 and ARM, too).

I mean, the test is really easy: use the 64-bit memset on write-through memory (P1 in DreamHAL), you'll see it's exactly the same speed as memset 32-bit for the same amount of data. Then, switch to write-back memory (P0 in DreamHAL), and you'll see 64-bit memset fly 2x as fast as 32-bit memset. That means the BSC passes the outgoing 64-bits as 2x sequential 32-bit writes, and they have to be written one-by-one to the memory bus running at half the frequency of the internal bus. Since the write-through buffer is only 64-bits wide, the size of the cache buffer won't be a problem here. Interestingly, you can also see that the write-back and write-through buffers are diagrammed in terms of longwords on page 122 of the SH7750 manual, which is a hint that the writes that go out are in terms of 4 byte units.
TapamN wrote:
Wed May 13, 2020 12:34 pm
If it takes 13 bus cycles to copy 32 bytes, we would predict a copy rate of 100 MHz / 13 cycles * 32 bytes = 246 MB/s. I updated the benchmark to see how fast the CPU can read a cacheline, and write it back. I got 223.5 MB/s best case doing reads with cache and writes with SQ. Using cache for both falls to 185 MB/s for some reason.
This is potentially incorrect; writing to a cache-miss destination means the cacheline must first be read into the cache before writing to it. That's why you see a slowdown. If you used movca.l, which there's no way to make GCC emit from C as far as I know, you'd see what you expect. It's also possible you encountered cache thrashing if you made the copy too large.
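For reference, the prediction in the quoted text is plain arithmetic, sketched here. The helper name is hypothetical; the inputs (100 MHz bus, 13 bus cycles per 32-byte cacheline) are the figures from the quote.

```c
#include <assert.h>

/* Predicted copy rate in bytes per second, assuming one cacheline is
 * moved every `bus_cycles` bus clocks (row-hit read + row-hit write). */
static double predicted_copy_rate(double bus_hz, unsigned bus_cycles,
                                  unsigned bytes_per_line)
{
    assert(bus_cycles > 0);
    return bus_hz / bus_cycles * bytes_per_line;
}
```

`predicted_copy_rate(100e6, 13, 32)` comes to roughly 246 million bytes per second, so the measured 223.5 MB/s is about 91% of the prediction and 185 MB/s about 75%.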
TapamN wrote:
Wed May 13, 2020 12:34 pm
True. I usually go with MB==MiB as well, and I did that at first, but then I decided to go with 800 MB/s since that's what generally gets listed on Dreamcast tech specs, even though it's not 100% accurate.
Well, I learned something new today: https://physics.nist.gov/cuu/Units/binary.html
"Some designers of local area networks have used megabit per second to mean 1 048 576 bit/s, but all telecommunications engineers use it to mean 10^6 bit/s."

Advertised Wi-Fi speeds I knew were already bad enough since they're actually half-duplex, but this just makes things even worse.
TapamN wrote:
Wed May 13, 2020 12:34 pm
Why? It's using the hardware timer, which I've used to do cycle accurate timing before by running code*. CPU usage won't interfere with it, that's kind of the point of having separate hardware timers, getting accurate timing regardless of what the CPU does.

I modified the benchmark to allow selecting between using KOS timers, my perf counter code, and DreamHAL's perf counter code. All non-DMA tests generate the same values to within 1%. DMA tests were way off at first; turns out CPU activity can interfere with the performance counter. I tested DMA bandwidth using blocking DMA. Blocking DMA ends up running the idle task, which executes the SLEEP instruction, causing the CPU and performance counter to stop! I modified the DMA tests to instead poll until DMA is done before stopping the timer, and this resulted in the perf counters generating the same results as the timer for DMA as well.

* The hardware timer does not have enough precision on its own to do cycle accurate timing, but by multiplying the time it takes for code to execute, you can bring it up to where the timer can measure it. Say you have a block of code you want to measure. If you run it in a loop 200,000,000 times, each cycle in the loop adds one second to the execution time, which can be easily measured. It's possible to do cycle accurate timing using sunrises if you loop enough. Doing multiple loops of the code being profiled also reduces the effect background threads or interrupt handlers might have on the measurement.
Why would you benchmark raw performance using something other than an isolated, standalone test environment? That's not a scientific test.
I made such an environment in DreamHAL expressly for the purpose of doing isolation testing. The fact that you have threads going on, interrupts firing, and peripherals doing who-knows-what means this is absolutely not a controlled environment. Sorry that I don't have the timer module made yet, but I do have the performance counters, printf, and a dcload-ip interface. One just needs to clean out dc_main.c of all the graphics testing stuff to have a simple, clean, and isolated environment where P1 is write-through and P0/U0/P3 is write-back.
TapamN wrote:
Wed May 13, 2020 12:34 pm
None of them were in the programming/hardware manuals I use. I first found out about them in MAME's source, and I only found official documentation about them when going through the application notes and errata. Which manual did you find them in?
Renesas "SH7750, SH7750S, SH7750R Group User's Manual: Hardware"

There are really only 3 documents anybody needs to program SH4, as I explain on this wiki page:
https://dreamcast.wiki/Useful_programming_tips
TapamN wrote:
Wed May 13, 2020 12:34 pm
Moopthehedgehog wrote:
Sun May 10, 2020 4:23 pm
...Yeah so Linux actually got a lot of that stuff wrong. The bit labels and their functions in that Linux code are mostly just wrong
Huh? I just rewrote the Linux stuff in my own style and it worked first try. Maybe what you looked at was for another model of SH4, SH4A, SHX or something?
Nope, the Linux kernel just got stuff wrong. They don't mention the bus/ratio mode, for example, which leaves it to the whim of the CPU to decide that it wants to be in that mode, which runs about 11.96x too fast, or in cycle count mode. It took a long time to figure that one out, why my Dreamcasts would suddenly be running the counter 12x too fast (because the bit is undefined at boot). I think they also got the method to stop the counters wrong, and try to read back from a disabled counter rather than a stopped counter. I don't remember all the details, but the big takeaway was that if one derived a driver from the Linux kernel code, it simply wouldn't work as expected, so I ultimately wrote my own interface for the counters from the ground up. Plus that means I could make it non-GPL, too!
TapamN wrote:
Wed May 13, 2020 12:34 pm
It's been years since I looked into the ASE stuff, so I've forgotten most of it.
Yeah, I don't think we really need it for anything, anyways, other than the performance counters' being super handy.
TapamN wrote:
Wed May 13, 2020 12:34 pm
As far as I can tell, the benchmark is correct at this point. It's still possible I'm wrong, but that's why I'm uploading the source, to see if someone else could notice something I didn't.
Well, yeah, you're using KOS, which initializes peripherals even if you don't want them by default and has threads to deal with, as opposed to an isolated environment. Like I mentioned, I made DreamHAL's "standalone program" part because I needed such an environment.
TapamN wrote:
Wed May 13, 2020 12:34 pm
Yes, I've tested it. The NOP conversion is if the PREF would cause an exception, like a TLB miss.
Weird, I was wondering why using pref would sometimes slow down my code (although I was able to change things around to get it to speed things up). I thought pref was supposed to happen kinda "in the background," however...
TapamN wrote:
Wed May 13, 2020 12:34 pm
Er, not trying to be rude or anything, just helpful. I think there's some kind of errata for the SH4's floating point, so GCC might be doing a software workaround to get true IEEE FP results? What optimization settings were you using when you benchmarked your math functions against the C versions? Did you have -ffast-math and -ffp-contract=fast turned on?
Yeah, the "shots fired" thing was a joke. :P
As far as I know there isn't some kind of errata. I genuinely think they might have just screwed that one up. The SH4 is IEEE-754 compliant. I wasn't benchmarking those things--I was just using the builtin functions as part of a float to string converter, and it gave me that notice of software floats. It was really weird, but I didn't need to use the builtins for that anyways.

By the way, the DreamHAL math stuff is designed to be compiled with GCC -O3. It's fine on lower levels, but it takes advantage of some optimization tricks GCC does at higher levels. Isolation testing my math header doesn't really produce the same results that one would see in a real-world situation, which is what it's designed for.
TapamN wrote:
Wed May 13, 2020 12:34 pm
I added some DreamHAL math tests to the benchmark program. It took some work to stop GCC from optimizing out the C code, but I got something with a disassembly that looks good. I had to stick the math functions into noinline functions, to make it harder for GCC to optimize out what is being benchmarked, and when testing the FDIV instruction, I had to use inline assembly to stop GCC from precalculating the result, even with FDIV in its own noinline function. I also force the results to be immediately written to memory so that the latency of the operation is included in the benchmark.
...Yeah this isn't how those math functions are meant to be used: The whole point is that GCC is supposed to optimize out the C code, precalculate results, and do all the other things it does (e.g. optimizing out structs to return 4 floats in 4 registers per the generic C ABI instead of actually using a struct) otherwise you get some results that are totally bananas like ridiculous numbers of excess fmov and memory accesses. So I have no idea what "looks good" means to you, but based on this description it sounds to me like you weren't really using it as intended. This is an unusual case of synthetic benchmarking producing worse results than real-world use, and I'm optimizing for real-world use. It's much harder to benchmark it as it is actually intended to be used, but that's the idea.

This is especially true of the cross product function. If you want to use the cross product in a loop, you only need to use the cross product function once, and then in the loop use MATH_Matrix_Multiply() or MATH_Matrix_Transform() over and over. All of my matrix functions that end with frchg to save the result matrix into XMTRX are like that. That would also include the outer product function, for example.

Alternatively, you use the cross product function once to store a "global" matrix in XMTRX and use MATH_Matrix_Multiply(), et al. whenever you want to invoke it.
TapamN wrote:
Wed May 13, 2020 12:34 pm
Zero latency and instruction grouping are orthogonal. Two instructions can dual execute if the second does not depend on the result of the first (a dependency is fine when the first is zero latency) and both are in compatible groups, like LS and FE. FMOV and FNEG are both LS group, and LS instructions cannot execute simultaneously with each other, so much of the cross product setup is executing a single instruction at a time. The timing for the cross product moves looks like this:
Yeah, you're right. So, as I said, "other things can be parallelized into those matrix functions... and the only reason they're not is because I simply don't know what to put in there that would be useful! :grin:" Later matrix functions I do actually try to make better use of parallelism with prefetching, which is why I have those notes about putting the __builtin_prefetch() directive in a good spot. But an enterprising developer can weave in stuff that uses the integer registers (probably very application-specific stuff), since per SH4 C ABI we get 8 floats and 4 ints that can be passed in as function arguments without resorting to stack pushes (which then have a risk of spilling beyond a cacheline and stalling out). Of course, inlining makes function arguments moot, but I try to stick to the ABI's limits where I can in case a function can't be inlined for some reason.

One other thing: Don't use these: -ffast-math -funsafe-math-optimizations -ffinite-math-only
-ffast-math isn't necessary when using DreamHAL math functions, -funsafe-math-optimizations (implied by fast-math) can cause some very strange problems, and -ffinite-math-only (also implied by fast-math) is straight up broken on GCC 9 and can produce very strange errors. All those compile options I have in DreamHAL's Compile.sh aren't just for show (and the ones implied by O3 I have there because I sometimes test in Og and I want/need them always on). Edit: I also am not sure that -mrelax does anything on GCC 9, as I see in your makefile. I think it might just happen by default now (I often see "size before relaxing" in my SH4 binary output maps and I don't have any relaxation directives).

Side note: Ofast is also borked and is actually slower than O3 on GCC 9, instead of being "-O3 -ffast-math." mrneo240 actually found that Os and O3 were producing the same performance results on NuQuake, and both were several FPS faster than Ofast.

Edit: I forgot to mention that I did add the "positive only" invert to the math header, btw. In hindsight this was a silly thing not to have, lol.
Which reminds me, another use of DreamHAL's math header is that it provides building blocks for making more complex functions out of specific instructions. So you could chain the positive-only invert into an even faster divide than MATH_Fast_Divide() and would only work for positive numbers, for example.

That stated, I like your libsh4. I can't say I've ever seen 0xc0ffee used before, either, haha. I'll have to remember that one.
I also think GCC 9 does that addition with registers that you mention in sq.h automatically now, so I am not sure that you need a struct for that.
I'm sure Aleron Ives feels weird with his postcount back to <10668
:D
TapamN
DCEmu Freak
Posts: 54
Joined: Sun Oct 04, 2009 11:13 am
Has liked: 0
Been liked: 12 times

Re: KOS vs Ninja - simple test.

Post by TapamN » Sun May 24, 2020 5:22 am

Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
Not OCINDEX, no. Basically the trick is doing a write-through write to memory, and then reading back that data in write-back memory, doing all the operations on the write-back memory (which is only in the cache), and then invalidating the cache without allowing the data to get written back to the starting area (e.g. do a memcpy() to get the final data to its destination followed by an ocbi instruction). Then you effectively have used the same memory area as (pseudo-)two memory areas, which is kind of like a "fake" 2-way associativity in that the address for each fake "way" is an integer multiple of the "way" size--0 in this case. But I suppose a better way to refer to it is using a single memory address to simultaneously hold two different pieces of data, one temporary and the other more permanent.
Do you have some pseudo code that more precisely describes how the trick is performed? To my interpretation, that sounds more like a trick to get 16KB OCRAM rather than 2-way associativity. Multiway caches are for avoiding cache thrashing, not reusing part of the cache like that.

One issue with tricks relying on doing an OCBI to avoid writing cache to memory is that an interrupt can happen, and its handler has to flush some of the cache to RAM before you can invalidate it. You would get memory corruption, with RAM ending up as a mix of what was already there and part of the temporary data that got flushed by the IRQ handler. My OCINDEX trick relied on the fact that nothing ever uses an address with bit 25 set, so nothing would ever accidentally alias with what I had in that part of the cache. Using OCINDEX the way it is intended is a much more reliable, simple way to do pseudo 2-way caches.
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
Architecturally, the CPU stores its data in the cache and the SQs each just have a 32-byte storage buffer probably set up in the same way as the cache writeback buffer. The SQs are like the writeback buffer on steroids that the programmer has complete control over, and since the SQs are on the CPU itself there's no place from which they could source data outside of the CPU's registers (which either get their data from the data cache or they build it from scratch with 8-bit immediates, which actually come from the instruction cache--so no matter what you're getting data from a cache somewhere)
Oh, you meant, "The SQ gets the data from the CPU's registers, and the CPU's registers get their data from the cache"? Ok, it didn't occur to me that you meant it indirectly like that.
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
Why would I benchmark copying to main ram for a cycle count measuring instruction latency?
I thought you were talking about memcpy bandwidth when you were saying that double FMOVs weren't faster than 32-bit MOVs, not latency.
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
TapamN wrote:
Wed May 13, 2020 12:34 pm
If it takes 13 bus cycles to copy 32 bytes, we would predict a copy rate of 100 MHz / 13 cycles * 32 bytes = 246 MB/s. I updated the benchmark to see how fast the CPU can read a cacheline, and write it back. I got 223.5 MB/s best case doing reads with cache and writes with SQ. Using cache for both falls to 185 MB/s for some reason.
This is potentially incorrect; writing to a cache-miss destination means the cacheline must first be read into the cache before writing to it. That's why you see a slowdown. If you used movca.l, which there's no way to make GCC emit from C as far as I know, you'd see what you expect. It's also possible you encountered cache thrashing if you made the copy too large.
Using MOVCA.L to remove the read from a read/write bandwidth test kind of changes the test. The point of that benchmark was to measure how fast an ideal memcpy could go by performing a cacheline read, then a cacheline write (i.e. you have to read in the source cacheline and write a cacheline at the destination. Reading the destination would be avoided with MOVCA.L). Replacing the read with MOVCA.L for that benchmark would end up just testing write bandwidth, not read/write bandwidth, so it'd be cache_main_ram_write_test, which does use MOVCA.L with inline assembler.

Cache thrashing is what happens when alternating between two or more memory locations that alias in cache. For example, in the addresses 0x8c012020 and 0x8c01a020 bits 13-5 are the same, so they alias in the cache because they are stored in the same physical cacheline. If you access one (loading it into cache), then access the other, the second access will remove the first one from the cache because they overlap. If you then read the first again, it has to be reloaded. The slowdown from ping-ponging between locations is what's meant by cache thrashing. The benchmark never goes back to previous cachelines, it's accessing them linearly, so it's not really thrashing, just missing (on purpose).
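The aliasing in the example above can be checked with a few lines of C. This is a sketch assuming what the paragraph states: a direct-mapped operand cache with 32-byte lines whose line index is address bits 13-5 (i.e. OCINDEX mode off). The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: line index = address bits 13-5, giving 512 lines of 32 bytes. */
static unsigned oc_line_index(uint32_t addr)
{
    return (addr >> 5) & 0x1FFu;
}

/* Two addresses alias when they land on the same cacheline slot
 * but belong to different 32-byte lines of memory. */
static int oc_aliases(uint32_t a, uint32_t b)
{
    return oc_line_index(a) == oc_line_index(b) && (a >> 5) != (b >> 5);
}

/* Self-check using the two addresses from the paragraph above. */
static void oc_alias_selfcheck(void)
{
    assert(oc_aliases(0x8c012020u, 0x8c01a020u));
}
```

Both addresses map to line index 0x101, so alternating between them evicts the other each time, while a linear walk through memory only ever misses once per line.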
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
Why would you benchmark raw performance using something other than an isolated, standalone test environment? That's not a scientific test.
I made such an environment in DreamHAL expressly for the purpose of doing isolation testing. The fact that you have threads going on, interrupts firing, and peripherals doing who-knows-what means this is absolutely not a controlled environment.
KOS, in its initial boot state, is a controlled environment. You don't need an environment that controlled to get meaningful benchmarks. All KOS would do is add a touch more variance between runs, and slow the benchmark down by probably around 1%. So you might measure FSRRA's latency to be 5.99 or 6.02 once in a while instead of a totally consistent 6.00; it's no big deal. It's still clear what it is. Now that I think about it, I've done some cycle accurate measurements under SuperH Linux, with things like X, an SSH server, and text editors in the background. KOS is super stripped down and controlled compared to that.
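A figure like 5.99 or 6.02 comes from the loop-multiplication approach described earlier in the thread: time many iterations with a coarse timer, then convert back to cycles. A minimal sketch of that conversion follows, assuming a 200 MHz core clock (implied by the 200,000,000-iterations-per-second example); the function name is hypothetical.

```c
#include <assert.h>

/* Converts the wall time of a whole loop into cycles per iteration.
 * With cpu_hz == iterations (e.g. both 200,000,000), one cycle per
 * iteration corresponds to exactly one second of elapsed time. */
static double cycles_per_iteration(double elapsed_s, double cpu_hz,
                                   double iterations)
{
    assert(iterations > 0.0);
    return elapsed_s * cpu_hz / iterations;
}
```

With 200,000,000 iterations, a run that takes 6.02 s instead of 6.00 s reads as 6.02 cycles, so a small amount of background activity shifts the result in the second decimal place without obscuring the true latency.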

And some things, like the bandwidth tests, are going to be done while running under KOS anyways, so it could be argued that it makes the benchmark more practical than theoretical.

If it really causes that much worry, turning off IRQs would be basically identical to not using KOS. It also stops preemption, so any background threads can't affect measurements. AFAIK, none of the DC's hardware can access the main RAM bus without the SH4's permission, so the peripherals won't have any effect on the benchmark without the SH4 servicing their IRQs. The only possible source of randomness at that point would be RAM refresh, and maybe the phase of the CPU clock to the bus clock.

(From what I can tell, the BSC totally owns the main RAM bus. Peripherals can only read/write to the main bus through DMA (or have the CPU poll the hardware for data). Some on-chip devices like the SCIF can optionally directly request DMAs. Things like main RAM to video RAM DMA, despite being activated by writes to Holly and not the DMAC, are performed by the PVR sending a DMA request to the SH4 using the DDT mode of the SH4's DMA controller, which allows an external device to reprogram the DMAC.)

KOS doesn't use or enable DMA of on-chip peripherals, so that's not an issue. I don't think any of the DC's external hardware spontaneously requests DMA, it only happens in response to a request from the SH4. Disabling interrupts is equivalent to not initializing hardware in regards to benchmarking overhead.

It looks like turning off IRQs under KOS makes the benchmarks consistently run very slightly faster, but it's less than a one percent difference. Nothing significant. The included zip adds a #define that can disable IRQs before executing a benchmark. (DMA bandwidth benchmarks will hang without their completion IRQs signaling to KOS that they have finished, so they still have interrupts on.)
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
TapamN wrote:
Wed May 13, 2020 12:34 pm
None of them were in the programming/hardware manuals I use. I first found out about them in MAME's source, and I only found official documentation about them when going through the application notes and errata. Which manual did you find them in?
Renesas "SH7750, SH7750S, SH7750R Group User's Manual: Hardware"
Oh, I see. The Hardware Manual PDF I've been using is from 2002, and lacks the appendix describing the version registers. I think I stuck with an old version so I wouldn't have to figure out if I was looking at 7750R exclusive stuff. It looks like they added them to the main manual in some later version.
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
Nope, the linux kernel just got stuff wrong. They don't mention the bus/ratio mode, for example, which leaves it to the whim of the CPU to decide that it wants to be in that mode, which runs about 11.96x too fast, or in cycle count mode.
Really? I looked at my implementation and the current version of Linux's, and they always leave that bus/ratio bit clear. They don't OR the enable/start/mode bits to the control register, they set the control register to a value that always leaves the ratio bit clear, so it's always defined when a counter is started. Maybe an old version ORed the start bit on?
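The difference between assigning a full value and ORing bits on can be shown with a simulated register. This is a sketch only: the bit positions and names below are hypothetical placeholders for illustration, not the real SH4 performance counter layout, and the register is modelled as a plain value rather than a memory-mapped address.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit positions for illustration only; NOT the real SH4 layout. */
#define CTR_ENABLE     (1u << 15)
#define CTR_START      (1u << 14)
#define CTR_RATIO_MODE (1u << 8)

/* Plain assignment: every bit, including the ratio bit, gets a defined value. */
static uint16_t start_by_assign(uint16_t old, uint16_t mode)
{
    (void)old;  /* previous contents are discarded */
    assert((mode & CTR_RATIO_MODE) == 0);
    return (uint16_t)(CTR_ENABLE | CTR_START | mode);
}

/* ORing bits on: whatever the ratio bit held beforehand survives. */
static uint16_t start_by_or(uint16_t old, uint16_t mode)
{
    assert((mode & CTR_RATIO_MODE) == 0);
    return (uint16_t)(old | CTR_ENABLE | CTR_START | mode);
}
```

If the register happens to come up with the ratio bit set, `start_by_assign` clears it while `start_by_or` keeps it, which would match the symptom of a counter that sometimes runs in the wrong mode.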
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
TapamN wrote:
Wed May 13, 2020 12:34 pm
As far as I can tell, the benchmark is correct at this point. It's still possible I'm wrong, but that's why I'm uploading the source, to see if someone else could notice something I didn't.
Well, yeah, you're using KOS, which initializes peripherals even if you don't want them by default and has threads to deal with, as opposed to an isolated environment. Like I mentioned, I made DreamHAL's "standalone program" part because I needed such an environment.
That's not going to have a significant effect on the results.

Buuut, I really should make sure my old code compiles on new GCC, so this is as good a time as any to try to get my incomplete OS running on it (and I suspect you might still complain that disabling IRQs in KOS isn't good enough ;) ). So I've included a version of the benchmark that doesn't use KOS. Most results are within 1% of what the benchmarks do on KOS with IRQs disabled. The biggest difference is that the triangle normal calculating benchmark has a 4.5% speed increase running under KOS. I think I have all relevant compiler flags set to the same thing. I suspect it has to do with a different pattern of cache thrashing, since it has a lot of memory accesses, something like what this guy talks about (the talk is more than what the title implies. It's about benchmarking. I recommend watching the whole thing, but for what I'm talking about, try starting at 10:00 or 12:00).
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
Weird, I was wondering why using pref would sometimes slow down my code (although I was able to change things around to get it to speed things up). I thought pref was supposed to happen kinda "in the background," however...
From what I've seen, the issue is that the SH4 can only handle one active PREF/OCBWB/OCBP at a time. That one instruction does run in the background, but if another one is started before the first has completed, it stalls until the first is done.
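A rough sketch of what that means for loop structure, assuming GCC's `__builtin_prefetch` (I'd expect it to turn into a `pref` on SH4) and the SH4's 32-byte cache lines. The function name is made up for illustration. The idea is to issue one prefetch per cache line and put a full line's worth of work between prefetches, rather than firing several back to back:

```c
#include <stddef.h>

/* Sum an array while prefetching one cache line ahead. */
float sum_with_prefetch(const float *data, size_t count)
{
    enum { FLOATS_PER_LINE = 32 / sizeof(float) }; /* 32-byte cache line */
    float total = 0.0f;
    size_t i = 0;

    while (i < count) {
        size_t end = i + FLOATS_PER_LINE;
        if (end > count)
            end = count;

        /* One prefetch for the next line, only if there is a next line */
        if (end < count)
            __builtin_prefetch(&data[end]);

        /* A whole line of work before the next prefetch is issued */
        for (; i < end; i++)
            total += data[i];
    }
    return total;
}
```

Whether one line of work is enough to hide the previous prefetch depends on where the data comes from, so this would need measuring on real hardware.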
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
By the way, the DreamHAL math stuff is designed to be compiled with GCC -O3. It's fine on lower levels, but it takes advantage of some optimization tricks GCC does at higher levels.
Ok, I've changed it to O3.
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
This is an unusual case of synthetic benchmarking producing worse results than real-world use, and I'm optimizing for real-world use. It's much harder to benchmark it as it is actually intended to be used, but that's the idea.
I would think that the synthetic benchmark gives the FTRV cross product a better chance than a non-synthetic one. GCC doesn't know what the latencies of the instructions in the asm block are, so it won't try to schedule instructions to avoid stalls with it. With the C version, GCC would be free to schedule and rearrange all instructions in real code, but the synthetic benchmark prevents that, so it should have a greater handicap than the FTRV version.

Well, I can add a more realistic cross product benchmark. The benchmark calculates the normals of a bunch of triangles, one version with a C cross product, and one with a FTRV cross product. Like the artificial benchmarks, the exact execution time doesn't have much meaning, only whether they are the same speed or one is faster than the other (I based it off the Big the Cat benchmark, so it does quite a bit of pointer chasing through indexed geometry, so most of the execution time will be cache misses.) I got 1.11 ms for the C version, and 1.22 ms for the FTRV version on -O3 without KOS.
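For anyone who wants a picture of the C side without opening the archive, here is a minimal sketch of an indexed triangle normal calculation. This is not the code from the benchmark; the type and function names are made up, and it returns the unnormalized normal:

```c
typedef struct { float x, y, z; } vec3;

static inline vec3 vec3_sub(vec3 a, vec3 b)
{
    vec3 r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

/* Plain C cross product, which GCC is free to schedule and interleave */
static inline vec3 vec3_cross(vec3 a, vec3 b)
{
    vec3 r = {
        a.y * b.z - a.z * b.y,
        a.z * b.x - a.x * b.z,
        a.x * b.y - a.y * b.x
    };
    return r;
}

/* Normal of one triangle in indexed geometry. The three indexed vertex
   loads are the pointer chasing that is likely to miss cache. */
vec3 triangle_normal(const vec3 *verts, const unsigned short idx[3])
{
    vec3 e1 = vec3_sub(verts[idx[1]], verts[idx[0]]);
    vec3 e2 = vec3_sub(verts[idx[2]], verts[idx[0]]);
    return vec3_cross(e1, e2);
}
```

Both edges change for every triangle, so there is nothing that could be set up once outside the loop.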
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
The cross product function this is especially true of. If you want to use the cross product in a loop, you only need to use the cross product function once, and then in the loop use MATH_Matrix_Multiply() or MATH_Matrix_Transform() over and over. All of my matrix functions that end with frchg to save the result matrix into XMTRX are like that. That would also include the outer product function, for example.
I don't think that would help in the new benchmark, since both sides of the cross product change each call, or am I misunderstanding? I'm just calling MATH_Cross_Product in it.
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
One other thing: Don't use these: -ffast-math -funsafe-math-optimizations -ffinite-math-only
-ffast-math isn't necessary when using DreamHAL math functions, -funsafe-math-optimizations (implied by fast-math) can cause some very strange problems, and -ffinite-math-only (also implied by fast-math) is straight up broken on GCC 9 and can produce very strange errors. All those compile options I have in DreamHAL's Compile.sh aren't just for show (and the ones implied by O3 I have there because I sometimes test in Og and I want/need them always on). Edit: I also am not sure that -mrelax does anything on GCC 9, as I see in your makefile. I think it might just happen by default now (I often see "size before relaxing" in my SH4 binary output maps and I don't have any relaxation directives).
Ah, ok. I've been using GCC 4.7 until recently. I've finally updated (to 9), but I haven't done much coding with it, so I haven't had time to run across those problems yet. I'll keep that in mind.

The -mrelax was left over from the template I used for the makefile from some other project. You probably already know this, but for anyone reading this that doesn't, the -mrelax option is supposed to replace JSR instructions with faster, smaller BSR instructions. I forgot I had it there, and I don't think it would do anything where it was.

After putting the mrelax option where it's supposed to go (the final linking stage), it seemed to work at first, but after making some changes to the code, it started crashing, so I guess it's still broken. :-/ It's on by default and works flawlessly on the compiler I use for the SH3-DSP device, but it's some version of GCC 4 (I think 4.3?).
Moopthehedgehog wrote:
Thu May 14, 2020 1:55 pm
I also think GCC 9 does that addition with registers that you mention in sq.h automatically now, so I am not sure that you need a struct for that.
Well, that's good. But I've already designed a bunch of code around using structs like that for hardware access, so I might as well use them, I guess. I think it's still kind of nice to use the structs as a pseudo-namespace for registers.
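For anyone unfamiliar with the struct approach, a minimal sketch follows. This is not my actual sq.h; the names are made up, and the offsets (QACR0 at 0xFF000038, QACR1 at 0xFF00003C) are as I recall them from the SH7750 hardware manual, so double check them before relying on this:

```c
#include <stddef.h>
#include <stdint.h>

/* Part of the SH4 CCN register block. Only the store queue address
   control registers are named; everything before them is padding here. */
typedef struct {
    uint8_t           pad0[0x38]; /* other CCN registers, omitted */
    volatile uint32_t qacr0;      /* store queue 0 address control */
    volatile uint32_t qacr1;      /* store queue 1 address control */
} ccn_regs_t;

#define CCN_BASE 0xFF000000u
#define CCN      ((ccn_regs_t *)CCN_BASE) /* only valid on real hardware */

_Static_assert(offsetof(ccn_regs_t, qacr0) == 0x38, "QACR0 offset");
_Static_assert(offsetof(ccn_regs_t, qacr1) == 0x3C, "QACR1 offset");

/* Point both store queues at the 64 MB area containing 'dest'.
   Address bits 28..26 go into bits 4..2 of each QACR. */
static inline void sq_set_area(ccn_regs_t *ccn, uintptr_t dest)
{
    uint32_t area = (uint32_t)((dest >> 26) & 0x7u) << 2;
    ccn->qacr0 = area;
    ccn->qacr1 = area;
}
```

On hardware this would be called as sq_set_area(CCN, dest). With one base pointer in a register, both writes become base-plus-offset stores, and the struct name works as the pseudo-namespace.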

Any plans to make a full, fast vector library? I have a crappy, simple vector library that's intended for cases where execution time doesn't matter, but nothing suited for performance.

On your vector functions, I was surprised at the way GCC was able to optimize away the stores when returning the vectors. A long time ago, I wrote C++ sprite drawing code on GCC 3.4 that used an inlined, operator-overloaded vector library, but when I looked at the generated assembly the function had tons of unnecessary loads and stores, even though I followed all the advice I had found about the correct, efficient way of doing it. I rewrote it in C style, and the resulting code was half the size.

It's a lot cleaner to do vectors your way. With my libsh4 math stuff, you can't do anything like sh4Ftrv(a+b, c, 0, 1) since it will try to assign the results to a+b, 0, and 1, which isn't valid. It's possible to do that with your method. A full, fast, 2D, 3D, and 4D vector library done in your style, being able to pass elements and vectors around as a unit rather than only individual elements or pointers to floats, would be very useful.
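As a sketch of what I mean by passing vectors around as a unit, here is a plain C version with made-up names. It is neither DreamHAL's nor libsh4's API, and the transform is ordinary C standing in for what FTRV would do in hardware:

```c
typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4]; } mat4; /* row-major in this sketch */

static inline vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Matrix * vector, taking and returning the vector by value */
static inline vec4 mat4_transform(const mat4 *m, vec4 v)
{
    vec4 r;
    r.x = m->m[0][0] * v.x + m->m[0][1] * v.y + m->m[0][2] * v.z + m->m[0][3] * v.w;
    r.y = m->m[1][0] * v.x + m->m[1][1] * v.y + m->m[1][2] * v.z + m->m[1][3] * v.w;
    r.z = m->m[2][0] * v.x + m->m[2][1] * v.y + m->m[2][2] * v.z + m->m[2][3] * v.w;
    r.w = m->m[3][0] * v.x + m->m[3][1] * v.y + m->m[3][2] * v.z + m->m[3][3] * v.w;
    return r;
}
```

Because the inputs are values rather than lvalues that get written back, something like mat4_transform(&m, vec4_add(a, b)) is valid, and the result can be fed straight into the next call.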

New benchmarks included, the complete KOS suite with the ability to disable IRQs, and a smaller KOS-free version.
Attachments
benches.7z
(90.11 KiB) Downloaded 23 times
MetalliC
DCEmu Crazy Poster
Posts: 28
Joined: Wed Apr 23, 2014 3:04 pm

Post by MetalliC » Thu Jun 04, 2020 2:59 pm

As for the SH7750 and SH7091 differences:
At the hardware level, the main difference is that the 7750 has its GPIO Port A lines shared with the data bus, while the 7091 has them shared with the address bus. That's a pretty wise move, because the DC uses relatively few address bits, for local SDRAM access only, and accesses everything else in MPX mode (address and data multiplexed on the data bus).
So the 7091 allows Port A access to be enabled together with 64-bit SDRAM and/or MPX access, while the 7750 is restricted to a 32-bit maximum data bus while GPIO Port A is enabled.

More on MPX: all HOLLY accesses (e.g. TA, G2, VRAM, registers, etc.) are done in MPX mode, which means it's not a pure "single-cycle 64-bit bus" but a multiple-cycle one. The first bus cycle is the address phase (the address is put on the data bus) and the next cycle(s) are the actual data transfer(s), so there is always an additional bus cycle for address setup.
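To put a rough number on that overhead, here is a small sketch. It assumes exactly one address cycle per burst and no wait states, which is a simplification, and the function name is made up:

```c
/* Fraction of bus cycles that carry data when every burst costs one
   extra cycle for the address phase. Assumes no wait states. */
double mpx_efficiency(unsigned burst_bytes, unsigned bus_bytes)
{
    /* Data cycles needed for the burst, rounded up */
    unsigned data_cycles = (burst_bytes + bus_bytes - 1) / bus_bytes;

    /* One additional cycle for the address phase */
    return (double)data_cycles / (double)(data_cycles + 1);
}
```

Under these assumptions a 32-byte burst on an 8-byte-wide bus spends 4 of 5 cycles moving data, while a single 8-byte access spends only 1 of 2.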

PVR / version register: several revisions of the 7091 are known to exist, indicated by the last character printed on the chip label, something like HD6417091Y.
Here are the 7091 version numbers I've seen in various devices: 040205c1 - Dreamcast / NAOMI, 040205c5 - Atomiswave / Naomi2, 40206c8 - SystemSP. But I'd imagine even more revisions were made.

As for max transfer rates: Sega's docs claim DMA to Texture RAM via the TA FIFO is up to 700 MByte/sec.
Interesting, what would the SQ transfer rate to there be?
TapamN
DCEmu Freak
Posts: 54
Joined: Sun Oct 04, 2009 11:13 am

Post by TapamN » Mon Jun 15, 2020 7:04 pm

MetalliC wrote:
Thu Jun 04, 2020 2:59 pm
As for the SH7750 and SH7091 differences:
At the hardware level, the main difference is that the 7750 has its GPIO Port A lines shared with the data bus, while the 7091 has them shared with the address bus. That's a pretty wise move, because the DC uses relatively few address bits, for local SDRAM access only, and accesses everything else in MPX mode (address and data multiplexed on the data bus).
So the 7091 allows Port A access to be enabled together with 64-bit SDRAM and/or MPX access, while the 7750 is restricted to a 32-bit maximum data bus while GPIO Port A is enabled.
Huh, I have a Windows CE SuperH-4 based laptop (Compaq Aero 8000) and I tried to upgrade the RAM from 16 MB to 64 MB, the max it claims to support, using standard SO-DIMMs, but it only recognized 32 MB. I wonder if this is why: it's using the GPIO port, so the laptop is using a 32-bit data bus instead of the full 64-bit? (Or the RAM I put in is mislabeled, and it's really just 32 MB.) I'm going to order a 128 MB stick and see what happens if I put that in.

This is getting further off topic, but I have a bunch of Windows CE devices and pocket sized computers. If anyone's curious, I took pictures of them and wrote some descriptions.
MetalliC wrote:
Thu Jun 04, 2020 2:59 pm
As for max transfer rates: Sega's docs claim DMA to Texture RAM via the TA FIFO is up to 700 MByte/sec.
Interesting, what would the SQ transfer rate to there be?
I don't remember the exact speed, but SQ to 64-bit VRAM was much slower than DMA. My first guess would be something around a quarter of the speed or less? It would be easy to modify my benchmark to check it.

Edit: Regarding the Aero 8000, it looks like the 64MB piece of RAM detected as 32MB really is 64MB. It has eight D4564163G5 chips, each of which is 64 Mbit, which adds up to 64MB. I got a 128MB stick, put it in, and it was detected as 64 MB, so I guess it really is using a 32-bit data bus.