No worries, lol. Not in any rush here.
TapamN wrote: ↑Sun May 10, 2020 12:46 am
The SH4 does not have any sort of cache coherency. It's easy to check that SQs do not cause cache flushes. Disable interrupts (so that the timer IRQ handler does not accidentally flush the cache), write something to cache, and use the SQ to write to memory. Then reload the data you wrote and see whether you get the old data from cache or the new RAM data. You will get the old cache data. In the benchmark mentioned below, there is a section that tests this and confirms SQs don't cause cache flushes.

Sorry, yeah, looks like I used poor word choice here. I had actually posted a way to make a kind of "fake" 2-way associativity by abusing 29-bit addressing and manual cache management in the Simulant Discord not too long ago. (Unrelated, but I gotta admit it's nice having something like IRC that actually has conversation history.)
What I was getting at is that I was under the impression that SQs could only store data sourced from the SH4's cache--not that they cause cache flushes. It is useful to know that SQs don't explicitly trigger cache flushes, though! By "cache flush" I really meant "they only write data out from the cache," which would imply that SQs don't work right in write-through memory. I think that might be true, but I'm not 100% sure.
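For anyone wanting to reproduce that coherency check, the SQ side boils down to the SH7750 manual's store queue address mapping: a write into the SQ area at 0xE0000000 goes out to an external address whose bits [25:5] come from the SQ-area address and whose bits [28:26] come from QACR0/QACR1. A minimal sketch of that arithmetic (the helper names are mine; actually programming QACR and issuing pref is SH4-only):

```c
#include <stdint.h>

/* Hypothetical helpers (names are mine) for the SH7750 store queue
 * address mapping. A write to 0xE0000000-0xE3FFFFFF lands at an
 * external address built from SQ-area address bits [25:5] plus
 * bits [28:26] supplied by the QACR0/QACR1 registers. */
static uint32_t sq_area_addr(uint32_t dest)
{
    /* Keep destination bits [25:5]; bit 5 selects SQ0 vs SQ1. */
    return 0xE0000000u | (dest & 0x03FFFFE0u);
}

static uint32_t qacr_for(uint32_t dest)
{
    /* Destination bits [28:26] go into QACR bits [4:2]. */
    return ((dest >> 26) & 0x7u) << 2;
}
```

On real hardware you'd set QACR0/QACR1 with qacr_for(dest), store 8 longwords to sq_area_addr(dest), issue pref on that address to kick off the burst, then read the destination back through a cached pointer to confirm the stale cache line survives.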
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
there's an architectural bottleneck right between the CPU core and the BSC. See the diagram on page 9, [...] page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
The bus between the CPU core and BSC runs at 200 MHz, so it has 200 M/s * 4 bytes = 800 MB/s bandwidth. The main RAM bus runs at 100 MHz, so it has 100 M/s * 8 bytes = 800 MB/s theoretical bandwidth. There's no bottleneck here; they have the same bandwidth.

Yep, you're totally right. I forgot the frequency was different there, haha. Yes, the cache and SQs should be at the same speed. The only open question is whether the BSC converts 8x 4-byte writes into 4x 8-byte writes. If it doesn't, then writing out to RAM would only run at half bandwidth due to the halving of bus frequency.
If there WAS a bottleneck, the cache and SQs would be the same speed anyways. Both have to go through the BSC...
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
Page 367 of the document also notes that 64-bit transfer to external memory only applies to the DMAC.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
That says 64-bit accesses can only be done by the DMAC, while 8-bit, 16-bit, 32-bit, and 32-byte accesses can be done by the CPU as well. SQ and cacheline load/writebacks are 32-byte accesses, not 64-bit accesses. I wonder if this means that an uncached 64-bit floating point load/store is broken up into two 32-bit memory accesses?

Yeah, I think that's why fmov.d does the "pair move" thing in little-endian mode instead of a proper double-precision load. It again makes me wonder whether the BSC does 8x4-to-4x8 conversion for 32-byte transfers. I don't know that it does.
TapamN wrote: ↑Sun May 10, 2020 12:46 am
Anyway, it's easy to test the bus by measuring its bandwidth. If the CPU or SQ can't use the full bus width, with the DC's 100 MHz main RAM bus, the absolute max bandwidth would be 400 MB/s (it would really be less due to communication overhead). The only way to go above 400 MB/s would be for the CPU to use the entire bus. I wrote a benchmark to store queue to RAM as fast as possible and see what its bandwidth was. I got 495 MB/s. A bit lower than I expected (when you factor in overhead, 533 MB/s would probably be the real absolute best speed for a 64-bit bus), but still more than 400 MB/s, so it's not limited to just 32-bit wide access. I also checked cache bandwidth by using a MOVCA, OCBWB sequence. This was 493 MB/s. Basically the same thing.

I have a memset that I've timed running in only 2 cycles (and I made variations for 8-, 4-, 2-, and 1-byte element sizes). Using my similarly max-optimized memcpy/memmove functions, in write-through memory I've consistently seen the 8-byte version take the same number of cycles to move the same quantity of data as the 4-byte version, which would indicate that the BSC does not do 8x4-to-4x8 conversion with the data. In write-back memory the 8-byte function behaves more in line with expectations, since it's hitting the cache and there's a 64-bit path to the cache.
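To make the comparison concrete, here's the shape of the two copy loops being timed, written portably (the SH4 versions would use fmov.s versus paired fmov.d; the memcpy-through-a-temporary idiom below is just the aliasing-safe way to express fixed-width loads and stores in C):

```c
#include <stdint.h>
#include <string.h>
#include <stddef.h>

/* 4-byte-at-a-time copy: one 32-bit load + store per iteration.
 * Assumes len is a multiple of the element size. */
static void copy_4byte(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i + 4 <= len; i += 4) {
        uint32_t v;
        memcpy(&v, s + i, 4);  /* compiles down to one 32-bit load */
        memcpy(d + i, &v, 4);
    }
}

/* 8-byte-at-a-time copy: on SH4 this maps to paired-single fmov.d,
 * which only pays off where the 64-bit path to the cache is in play. */
static void copy_8byte(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i + 8 <= len; i += 8) {
        uint64_t v;
        memcpy(&v, s + i, 8);  /* one 64-bit load where available */
        memcpy(d + i, &v, 8);
    }
}
```

The write-through observation above is that loops of these two shapes end up taking the same cycle count for the same data, despite the wider element.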
Also, being unnecessarily pedantic: the bus is not 800 MB/s unless you use "Apple" or "marketing" MB notation. It's really 760 MiB/s, and integral factors of it. I refuse to bow down to marketing pressures and continue to use MB to mean MiB, generally, so my numbers basically never mean marketing MBs.
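The conversion in question, for the record (plain arithmetic, nothing SH4-specific):

```c
/* A decimal ("marketing") megabyte is 10^6 bytes; a mebibyte (MiB)
 * is 2^20 bytes. 800 MB/s works out to roughly 763 MiB/s, i.e. the
 * ~760 figure above. */
static double mb_to_mib(double mb)
{
    return mb * 1.0e6 / (1024.0 * 1024.0);
}
```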
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
I'm not sure if the BSC then reassembles it into 4x 8-byte transfers out to memory or if it just passes it straight through as 8x 4-byte transfers, however. I would guess it doesn't actually reassemble them, since it probably just sees 8x 4-byte transfers coming in as 8x 4-byte transfers.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
Well, since the SH4 gets more than 400 MB/s bandwidth, it has to recombine the data.

The benchmark needs to be done with performance counter cycles to ensure that no other CPU activity interferes with the measurement.
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
And yes, I know the Dreamcast uses an SH7091, but as far as I've been able to tell it's identical

TapamN wrote: ↑Sun May 10, 2020 12:46 am
It seems like they can be treated as identical. As far as I can tell, the only possible difference between the SH7750 and SH7091 is that the semi-undocumented internal processor version register may be different between the two versions. (I think the register is mentioned in some supplementary documents, but not the main manuals.)

Is that PVR, CVR, or PRR? PVR and PRR are documented in the appendix of the SH7750 manual; CVR is not, and it contains info relating to the cache properties of the CPU.
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
In any case, if you'd like to verify such architectural details yourself I have a performance counter module here, which uses the SH4's built-in performance counters: https://github.com/Moopthehedgehog/DreamHAL

TapamN wrote: ↑Sun May 10, 2020 12:46 am
I ran across the same Linux performance counters years ago when I was researching the SH4's ASERAM and already made my own library for them. (ASERAM is a special 1 KB on-chip protected memory intended to be used for debugging. I still haven't figured out if it can be accessed on the Dreamcast. It seems most of the ASE features are enabled by external signals, and I think it can't be accessed by the CPU without those signals, as a way to protect the debugger from being accidentally corrupted by user code.)

...Yeah, so Linux actually got a lot of that stuff wrong. The bit labels and their functions in that Linux code are mostly just wrong--I spent almost an entire month researching these counters, including doing things like letting one of the 48-bit counters overflow to check for an overflow bit (there isn't one). I also came across the ASE stuff--some part of it IS accessible by the CPU, but I forget what. There's a section on ASE in one of the ST Micro SH4 docs on their site. Edit: Here, chapter 13: https://www.st.com/content/ccc/resource ... 153464.pdf
Great, they give a whole register map and no addresses, lol--but the fact that ASE mode jumps to code at 0xFC000000 should be some kind of clue, so I would guess 0xFC000000 is the ASE RAM area. The footnote that various regs are in the CCN also indicates they would be in the 0xFF000000-0xFF1FFFFF area, which is where the performance counters are. IIRC the performance counters are actually supposed to be part of ASE or something. There's also the AUD section in chapter 12, which may or may not be related. I know that H-UDI uses the /ASEBRK pin, too.
By the way, old demo versions of CodeScape actually have parts of Sega's SDK 10.1 in them--I don't know what they were thinking leaving that stuff in a freely available public demo, but it's there, including an entire instruction manual that includes info on how to use the various modes of the performance counters.
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
Related: You may be seeing what looks like a buffer getting "full" because you're spamming the cache writeback buffer too frequently.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
I'm not using the cache, I'm using the SQs. It takes time for the TA to write strip pointers to VRAM.

I think you're getting hit by bus arbitration here. I dunno, something still doesn't seem right about this, but I haven't looked into the PVR much.
TapamN wrote: ↑Sun May 10, 2020 12:46 am
The only benchmark I found where there seems to be a real speed improvement using DMA over SQs is drawing large sprites, where SQs are 25% slower. Or it might be a bug in the benchmark?

Sounds like your benchmark might need some work... or more research. I dunno, I find the sheer amount of REing necessary to do things properly tends to be pretty exhausting, and I'm just doing the GAPS bridge to make a proper driver for it, lol.
TapamN wrote: ↑Sun May 10, 2020 12:46 am
I did have some problems getting the benchmarks to run correctly. At first, I was getting a bandwidth of 4.5 Mvert/s. I had written code that pushed 6 Mvert/s before, so I was confused about why I was getting worse results. By accident, I figured out that for some reason the PVR runs slowly for the first frame after it's initialized. So I modified the benchmark to output a dummy frame before doing the timing.
Another problem was that for the large poly benchmark, I was getting ~200 MB/s for SQs and ~9 MB/s for DMA. A copy-and-paste error had the SQ writing to RAM instead of the TA, and there was also a mistake in the bandwidth calculation.
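Both fixes generalize: discard an untimed warm-up pass before measuring, and compute bandwidth from bytes moved and cycles spent at a known clock. A hedged sketch of that shape (function names are mine; on the DC the stand-in workload would be real TA submissions timed with the performance counter):

```c
/* Generic shape of the corrected benchmark: one untimed warm-up
 * call (covering things like the PVR's slow first frame), then
 * timed runs. */
typedef unsigned long (*frame_fn)(void);

static unsigned long bench_cycles(frame_fn frame, int runs)
{
    unsigned long total = 0;
    (void)frame();                     /* dummy frame: not timed */
    for (int i = 0; i < runs; i++)
        total += frame();
    return total;
}

/* Bandwidth in MB/s from bytes moved, cycles spent, and the clock
 * the cycles were counted at. Keeping the units in one place helps
 * avoid the kind of calculation slip mentioned above. */
static double bandwidth_mb_s(double bytes, double cycles, double hz)
{
    return bytes / (cycles / hz) / 1.0e6;
}

/* Stand-in workload that reports its own "cycle" cost so the example
 * is self-contained: expensive first call, cheap afterwards. */
static int demo_calls;
static unsigned long demo_frame(void)
{
    return (++demo_calls == 1) ? 1000ul : 10ul;
}
```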
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
I'm also not sure what you mean by DMA stalling; the concept of DMA stalling just sounds kinda weird to me.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
Well, whatever is writing to the TA has to stall when the TA's FIFO is full. Otherwise you'd get dropped vertices. I don't know exactly how it's implemented on the bus. I know that the TA area is set to MPX mode on the BSC, and MPX has a "!RDY" pin that the destination has to lower to signal that it's ready to receive data, and can raise to signal a wait.

Ah, ok. That makes more sense. It's kinda like the G2 FIFO, where the stall gets consumed by a write operation, so you'd never see it unless you were timing with a cycle-counting performance counter. This is also why audio can be such a massive performance killer on DC when not using G2 DMA, and no profiler would be able to detect its impact except the cycle counter. And the stall can be hundreds of cycles per write--it's terrible!
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
You can use OCBWB and OCBP with appropriate usage of movca.l to remove the memory-write bottleneck entirely. When using the write-back memory type, there's an intermediary write buffer that gets written to, eliminating performance penalties. It's like using the SQ to flush the cache, then DMAing from that spot, except this is the CPU doing it so the SQ can be busy flushing memory elsewhere. I made a "cachefuncs.h" DreamHAL header for this.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
Huh? I'm probably misunderstanding you, but the way you describe it kind of makes it sound like the SQ and write buffer can access memory at the same time. Only one thing can access memory at a time; if multiple things try to access RAM at the same time, something's going to stall. The purpose of the write buffer is to give priority to reads.

Yes! You can! movca.l doesn't do a cache read on a cache miss, it just allocates a cache block. So by using movca.l to allocate a cache block while using the SQ to write to RAM, you can effectively make it look like two memory writes are happening at once! (Relevant functions are in DreamHAL's cachefuncs.h header.)
Ah! I was wondering about this. The manual says pref turns into a "nop," which shouldn't stall. Has this been experimentally verified? I don't believe I've seen behavior that would indicate pref slows things down (versus turning into a nop) when the bus is in use or a prior pref is already in flight.
Moopthehedgehog wrote: ↑Mon Apr 27, 2020 5:31 pm
I can't really comment on the other things, as I'm not a game graphics dev and my OSdev experience to date has been pretty much exclusively on the CPU and MCU side of things (I could write a lot more here, but this post is already becoming quite a bit longer than I intended), but it all sounds good to me. If I may suggest, you might find the math header in DreamHAL useful; I've gotten at least one report of a 20% improvement just from naively replacing stuff here and there, as opposed to really optimizing around it all--and if you're good with asm, you could potentially add some application-specific parallelism to some of the more intricate math functions like the matrix stuff for an even bigger boost.

TapamN wrote: ↑Sun May 10, 2020 12:46 am
I've mainly focused on T&L and discovering ways to abuse the PVR. I have done some work on my own for-fun OS for the SuperH, which I intended to be compatible with the HP Jornada 690 as well (a SuperH-3 DSP based pocket-sized computer that comes with Windows CE). I was able to get interrupts, gUSA atomics, timers, basic preemption, serial, and the UBC working before I stopped working on it.

That's neat!
TapamN wrote: ↑Sun May 10, 2020 12:46 am
I've already created something of my own similar to your DreamHAL: "Libsh4", along with a semi-tested matrix library and a bunch of headers with lists of hardware registers.
The problem with certain types of inline asm is that the compiler can't schedule around it well or do more advanced optimizations like software pipelining. I find it's better to just use full assembler than trying to stick asm into C/C++ code. You can write faster code, and it's cleaner than GCC's ugly, complex inline asm syntax.
DreamHAL is supposed to eventually cover the whole CPU--not an OS, just hardware-helper functions, because the SH4 can be pretty maddening to deal with from scratch, let alone optimize for. Easier than x86, for sure, but a lot of things that are "automated" on x86 have to be done manually on SH4 (like cache management, TLB reloads, etc.).

The math header is just kind of its initial "claim to fame," since it allows for doing trig, fast divides, roots, vector and matrix ops (and abusing them to do other things), and more that GCC just has no interface for. If you notice, for example, the various matrix ops store things like the cross product matrix in XMTRX so that they can be reused in other matrix functions later. GCC doesn't touch XMTRX in m4-single-only, not just per the SH4 C ABI but also because XMTRX can't be accessed without the double-precision operations. (Aside: I think GDB might use XMTRX for something, but eh, I don't use GDB, and I figure anyone using the matrix ops at a level where that would be a problem is "hardcore" enough to debug without GDB... or at least can find a way around that.)
While I agree that GCC's syntax is ugly, I know it, and I've seen other attempts that just do it all wrong, which makes them slower than not using asm at all. For example, explicitly reserving registers instead of letting GCC pick them can have a substantial hidden cost, where GCC needs to emit moves all over the place to get data into the right registers.
Oof, shots fired, lol--but I assure you it's not slower.
GCC really makes a mess of things sometimes--for example, using software FPU emulation when the hardware can handle NaNs and infinities just fine (this one blew my mind when I got an error about it while trying to use __builtin_isnan() and __builtin_isinf()). GCC also can't do the vector and matrix operations at all. That said, I haven't gotten much feedback on the matrix operations, as most people thus far have been using the single-instruction wrappers and the like, and GCC does parallelize around those pretty well.
Actually, the one bit of matrix-related feedback I did get was that my cross product emitted a negative zero where someone was expecting a positive zero, which was causing problems for them (even though I triple-checked, and both my cross product and their math were doing the exact same mathematical operations in the same order)--but I believe that to be FTRV-rounding-related, and they never tried to look into it. It also doesn't make much sense given that on SH4 hardware -0.0f == 0.0f, but that's pretty much the only feedback I've gotten about them from anyone. ¯\_(ツ)_/¯
TapamN wrote: ↑Sun May 10, 2020 12:46 am
Doing an approximate reciprocal with FSRRA(x*x) is faster than a divide (9 cycles versus 12 cycles) if you can guarantee x is positive, but the conditional to handle negative x would make it a couple cycles slower due to the amount of time it takes to do the floating point comparisons and branches, and GCC can't schedule the inline asm FSRRA as well as it can an FDIV. (GCC was able to generate FSRRA (and probably schedule it correctly) with the right options in older versions, but I can't get it to work on 9.3.)

Hehe, this is something a lot of research went into. It's still faster than FDIV, and the combination of C and asm allows GCC to optimize the things it can optimize. It's all about working *with* the compiler, not in spite of or against it.
TapamN wrote: ↑Sun May 10, 2020 12:46 am
The cross product with FTRV seems like it would be slower than having GCC handle a straightforward C implementation because of all the FMOVs. Using individual FADD/FSUB/FMULs you can issue a cross product in about 9 cycles and GCC could load data from RAM in parallel, but your FTRV cross product spends at least 13 cycles shuffling data into XMTRX (and that's not counting all the extra moves GCC may generate to satisfy all the float register location requests), then you still have to do the actual FTRV.

Those FMOVs do 2 floats per cycle; with dual-issue pipelining and FMOVs being in the 0-cycle LS group, it should be half of what you say--unless LS grouping takes precedence over the 0-cycle behavior... In any case, not all float instructions are in the same group--that's what people who didn't read Ch. 8 of the manual say! So other things can be parallelized into those matrix functions... and the only reason they're not is that I simply don't know what to put in there that would be useful!
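For reference, the straight-C cross product being weighed against FTRV is just this (the function name is mine):

```c
/* Plain-C cross product: three multiplies and a subtract per
 * component, no XMTRX setup, so the compiler is free to interleave
 * the loads from RAM with the arithmetic however it likes. */
static void cross3(const float a[3], const float b[3], float out[3])
{
    out[0] = a[1] * b[2] - a[2] * b[1];
    out[1] = a[2] * b[0] - a[0] * b[2];
    out[2] = a[0] * b[1] - a[1] * b[0];
}
```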
TapamN wrote: ↑Sun May 10, 2020 12:46 am
I've included my memory and Big the Cat benchmark code, if anyone wants to see them. Maybe later I'll try to see if I can generate better triangle strips and get better results on the Big benchmark.

Nice! Hopefully I'll have time to take a look at these at some point. The GAPS bridge nonsense is currently driving me up a wall, and I don't know how long that's gonna go on for...
TapamN wrote: ↑Sun May 10, 2020 12:46 am
The memory benchmark runs and prints its results to the serial/ethernet console. On the Big benchmark, you can change the number of Bigs drawn with left and right on the d-pad. Pressing A will print some stats to the console. At the bottom of the screen, there are two bars that display the CPU and GPU load, along with some ticks to measure what the load is. Each small tick is one millisecond, and each large tick is 1/60 of a second.