High Performance Rendering

User avatar
rpk
DC Developer
Posts: 16
Joined: Thu Dec 29, 2016 1:02 pm
Has thanked: 10 times
Been thanked: 12 times

High Performance Rendering

Post by rpk »

Hi,

I've been working on improving rendering performance in HarleQuest! and am trying to wrap my head around some of the different options.

I read this: http://yam.20to4.net/dreamcast/hints/index.html (specifically the Optimizing TnL Loops bit) and have a general plan, but wanted to check here in case anyone with more experience could point me in the right direction.

Here's the current gist:
- Vertices are split into separate arrays by attribute (positions, normals, colours, etc)
- Skinning loop reads positions and weights, applies up to three pose matrices, writes positions back
- Transform loop reads positions, applies a single MVP matrix, writes positions back
- Lighting loop reads positions, normals, colours and one light position, does NdotLs, scales colours, writes colours back
- Submission loop reads positions, uvs, and colours by tri-strip index, performs clipping and emits to PVR TA input using SQs with no explicit wait

The first step (splitting into arrays by attribute) would be done at load time, but all other steps would be done in batches of 32 to maximise chances of parts still being in cache.

If I understand things right, that'll be really fast because a) it improves register allocation in each stage meaning better pipelining, b) we only T&L a unique vertex *once*, not once per occurrence in the strip, c) we can make liberal use of prefetching and `movca.l` on write-back where possible.
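
To make this concrete, the transform stage would look something like this rough, untested sketch (mat_load()/mat_trans_single() being the KOS helpers that load XMTRX and push one xyz through it with FTRV; from memory, mat_trans_single also does the perspective divide, leaving z = 1/w). The real thing would run the lighting and submission stages on the same batch while it's still hot in cache:

Code: Select all

#include <dc/matrix.h>

#define BATCH 32

/* One stage of the plan above: positions in, transformed positions out,
   processed a batch at a time so later stages find the batch in cache. */
static void transform_positions(float (*pos)[3], int count, matrix_t *mvp) {
    mat_load(mvp);                      /* load MVP into XMTRX once per mesh */

    for(int base = 0; base < count; base += BATCH) {
        int n = (count - base) < BATCH ? (count - base) : BATCH;

        for(int i = 0; i < n; i++) {
            float *p = pos[base + i];
            /* FTRV via the KOS helper; also divides by w, so z ends up
               as 1/w ready for the TA */
            mat_trans_single(p[0], p[1], p[2]);
        }

        /* ...lighting and submission for this batch would go here... */
    }
}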

The choice of store queues instead of DMA is because:
- There's no setup time or persistent buffer needed
- The last stage can write data from cache -> registers -> TA via SQ without needing to write back to RAM
- The CPU can keep working while the previous SQ transfer is in progress (although I've heard it makes everything slow to a crawl?)

My current approach in the game is a lot simpler than this: one fat Vertex struct with all the attributes, which we loop over, doing all the necessary steps, then clipping & submitting at the end. This is basically the opposite approach to the one described (AoS instead of SoA), and according to the logic on that page it's going to be way slower.

Have I got anything fundamentally wrong? Does writing back to RAM and doing DMA actually save a bunch of cycles somewhere? Does the cache/pipelining work differently to how I'm thinking here? Wanted to check in before I go and rewrite my renderer :)

Any thoughts welcome. Thanks!
These users thanked the author rpk for the post:
Ian Robinson
User avatar
rpk
DC Developer
Posts: 16
Joined: Thu Dec 29, 2016 1:02 pm
Has thanked: 10 times
Been thanked: 12 times

Re: High Performance Rendering

Post by rpk »

I've had some feedback on the Simulant discord about how SQs don't execute in parallel the way I thought they did. This line from section 4.6.3 of the SH7750 CPU manual was confusing me:

"While the contents of one SQ are being transferred to external memory, the other SQ can be written to without a penalty cycle, but writing to the SQ involved in the transfer to external memory is deferred until the transfer is completed."
I thought that meant the CPU was free to go and do other things while the SQ write was in progress, but in fact it seems you're limited by the usual instruction group superscalar rules in that only certain instruction types can be executed in parallel with the SQ burst write.

You could technically do an FTRV to transform a vertex while a store queue PREF was in-flight, but AFAICT you'd need to have loaded the FR registers ahead of time because the FMOV is also in the LS group.

This makes DMA look a lot more appealing as it's truly asynchronous, so we could DMA finished vertices in batches while the CPU works on transforming the next batch.
These users thanked the author rpk for the post:
Ian Robinson
TapamN
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: High Performance Rendering

Post by TapamN »

That's not the right interpretation. It's saying that writing to an SQ that is currently being written out to RAM will cause a stall until the SQ has finished, but writing to the other, idle SQ will still run at full speed. Accessing cache is still fast while an SQ is being written to RAM, but accessing RAM (such as through a cache miss, or by bypassing the cache) will also stall the CPU because the SQ write-out keeps the bus busy. It has nothing to do with superscalar execution.

I've talked about stuff related to your question earlier in different places that might be helpful.

I don't have time right now to fully reply to your first post. I'll try to later.
These users thanked the author TapamN for the post (total 2):
rpk, Ian Robinson
User avatar
rpk
DC Developer
Posts: 16
Joined: Thu Dec 29, 2016 1:02 pm
Has thanked: 10 times
Been thanked: 12 times

Re: High Performance Rendering

Post by rpk »

I see!

So just to make sure I understand:
You can write to the first SQ, issue PREF, go do some other work like transforming a vertex that's already in cache (FMOV, FTRV, etc), then write to the second SQ, and it's only when you PREF the second SQ (perform the second burst write) that the CPU will stall until the first transfer has finished? This is what I thought initially (before my second post).
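
In code I'm picturing something like this (very rough and untested; it assumes the KOS PVR driver has already pointed QACR0/QACR1 at the TA input, and that the work done in between only touches registers and data already in cache):

Code: Select all

#include <stdint.h>

/* SQ store area starts at 0xe0000000. Bit 5 of the address picks SQ0 vs
   SQ1; the upper destination bits come from QACR0/QACR1 (pointed at the
   TA input by the KOS PVR driver). Each SQ holds one 32-byte vertex. */
static volatile uint32_t * const sq = (volatile uint32_t *)0xe0000000;

void submit_two_verts(const uint32_t v0[8], const uint32_t v1[8]) {
    for(int i = 0; i < 8; i++)          /* fill SQ0 (bit 5 clear) */
        sq[i] = v0[i];
    __asm__ __volatile__("pref @%0" : : "r"(&sq[0]) : "memory");   /* kick SQ0 */

    /* ...CPU keeps going here: FTRV the next vertex, clip, etc... */

    for(int i = 0; i < 8; i++)          /* fill SQ1 (bit 5 set) */
        sq[8 + i] = v1[i];
    __asm__ __volatile__("pref @%0" : : "r"(&sq[8]) : "memory");   /* kick SQ1:
        this is where we'd stall if SQ0's burst hasn't finished yet */
}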

If that's the case, would RAM access stall until the SQ transfer is finished even if the SQ transfer is writing to PVR memory rather than main RAM? I'd presume the answer is "yes" because they use the same bus.

Thanks for sharing your previous posts on the topic, they're really informative. I'm digesting the info in there now and figuring out how to write my own tests for this kind of thing.

No worries if you're too busy to reply to my first post, but any tips or corrections are very welcome. Thanks!
These users thanked the author rpk for the post:
Ian Robinson
User avatar
rpk
DC Developer
Posts: 16
Joined: Thu Dec 29, 2016 1:02 pm
Has thanked: 10 times
Been thanked: 12 times

Re: High Performance Rendering

Post by rpk »

I decided to make a benchmark to try and get a deeper understanding of all this. Starting with the KallistiOS 2ndmix example, I made some modifications so that it will render N cubes, letting the user switch between 3 different submission methods:

A) KOS PVR API (the pvr_prim() functions)
B) Store Queue burst transfer to Tile Accelerator input
C) Store Queue burst transfer to DMA input, then KOS' Scene DMA to TA input

The benchmark does all its transformations up front, then submits all the resulting vertices using the chosen submission method. I'm yet to experiment with doing work while the SQ burst transfers are in progress, and I haven't looked at doing the transform loops in small batches for cache friendliness yet.

However, right off the bat there were certain findings:

Method A (pvr_prim) was consistently slower than B and C (around 2ms to 3ms with 150+ cubes).

Methods B and C were roughly equivalent in terms of framerate, but I had a nagging feeling that DMA may have introduced a frame of latency because of the comments at the top of this file:
https://github.com/KallistiOS/KallistiO ... rnal.h#L64

Code: Select all

When vertex DMA is enabled, we go into a naive 3-stage setup. This can
   be improved later, but it's a start for now.

   In this mode, we augment the timing diagram above:

   VBlanks  SH4-to-RAM  DMA-to-TA   ISP/TSP     View
   0        ->R0        -           -           -
   1        ->R1        R0->T0      -           -
   2        ->R0        R1->T1      T0->F0      -
   3        ->R1        R0->T0      T1->F1      F0
   4        ->R0        R1->T1      T0->F0      F1
   ...

I added frame latency tracing to my local branch of KOS which showed that, regardless of how many cubes I was rendering or how much work I was doing, there was *always* an extra frame of latency when using DMA:
single_cube_2_frame_latency.png
I'll break down the "Frame time" values:
17 is the time between page flips
3 is the time between pvr_scene_start() and pvr_scene_finish() (transform, SQ to RAM)
0 is the time between DMA start and TA interrupts for all lists done
5 is the time between starting the queued render and 'render done' interrupt

My tracing works by keeping a ring buffer of 4 values. When a new frame is submitted to the pipeline we increment the head index and write the current VBlank counter into the buffer at that point. When the page flips at the end of the pipeline, we increment the tail index and read the buffer value, subtract it from the current VBlank counter, and the difference is the number of VBlanks since that frame was submitted.
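
In code the tracker is basically this (simplified sketch, not the exact patch):

Code: Select all

#include <stdint.h>

#define LAT_RING 4

static uint32_t submit_vblank[LAT_RING];
static int lat_head = 0, lat_tail = 0;

/* Called when a frame is submitted to the pipeline. */
void latency_on_submit(uint32_t vblank_counter) {
    lat_head = (lat_head + 1) % LAT_RING;
    submit_vblank[lat_head] = vblank_counter;
}

/* Called when the page flip for the oldest in-flight frame happens;
   returns how many VBlanks that frame spent in the pipeline. */
uint32_t latency_on_flip(uint32_t vblank_counter) {
    lat_tail = (lat_tail + 1) % LAT_RING;
    return vblank_counter - submit_vblank[lat_tail];
}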

I wanted to eliminate this frame of latency if possible, so I delved into the PVR driver code to try and understand what's going on.

The bulk of the work is done here:
https://github.com/KallistiOS/KallistiO ... _irq.c#L77

Code: Select all

void pvr_int_handler(uint32 code) {
    int bufn = pvr_state.view_target;

    // What kind of event did we get?
    switch(code) {
        case ASIC_EVT_PVR_OPAQUEDONE:
            //DBG(("irq_opaquedone\n"));
            pvr_state.lists_transferred |= 1 << PVR_OPB_OP;
            break;
        case ASIC_EVT_PVR_TRANSDONE:
            //DBG(("irq_transdone\n"));
            pvr_state.lists_transferred |= 1 << PVR_OPB_TP;
            break;
        case ASIC_EVT_PVR_OPAQUEMODDONE:
            pvr_state.lists_transferred |= 1 << PVR_OPB_OM;
            break;
        case ASIC_EVT_PVR_TRANSMODDONE:
            pvr_state.lists_transferred |= 1 << PVR_OPB_TM;
            break;
        case ASIC_EVT_PVR_PTDONE:
            pvr_state.lists_transferred |= 1 << PVR_OPB_PT;
            break;
        case ASIC_EVT_PVR_RENDERDONE:
            //DBG(("irq_renderdone\n"));
            pvr_state.render_busy = 0;
            pvr_state.render_completed = 1;
            pvr_sync_stats(PVR_SYNC_RNDDONE);
            break;
        case ASIC_EVT_PVR_VBLINT:
            pvr_sync_stats(PVR_SYNC_VBLANK);
            break;
    }
    ...

If you follow the body of that interrupt handler, you'll see that we only advance the rendering pipeline in the case of a VBlank interrupt. Any other interrupt type and we just return.

Code: Select all

    ...
    if(!pvr_state.to_texture[bufn]) {
        // If it's not a vblank, ignore the rest of this for now.
        if(code != ASIC_EVT_PVR_VBLINT)
            return;
    }
    else {
    ...
 
I also noticed in pvr_scene.c that it doesn't actually seem to start the next DMA immediately and just sets a flag for the next VBlank to pick up:
https://github.com/KallistiOS/KallistiO ... ene.c#L273

Code: Select all

        // Flip buffers and mark them complete.
        o = irq_disable();
        pvr_state.dma_buffers[pvr_state.ram_target].ready = 1;
        pvr_state.ram_target ^= 1;
        irq_restore(o);
Finally, I noticed that nothing tries to step the pipeline when DMA completes, and similarly just sets flags:
https://github.com/KallistiOS/KallistiO ... _irq.c#L65

Code: Select all

    // If that was the last one, then free up the DMA channel.
    if(!did) {
        //DBG(("dma_complete(buf %d)\n", pvr_state.ram_target ^ 1));

        // Unlock
        mutex_unlock((mutex_t *)&pvr_state.dma_lock);
        pvr_state.lists_dmaed = 0;

        // Buffers are now empty again
        pvr_state.dma_buffers[pvr_state.ram_target ^ 1].ready = 0;

        // Signal the client code to continue onwards.
        sem_signal((semaphore_t *)&pvr_state.ready_sem);
        thd_schedule(1, 0);
    }
 
Note that the sem_signal(...) and thd_schedule(...) calls are just there to wake up the user's thread if it's blocked on pvr_wait_ready(). They have nothing to do with the next stages in the pipeline (TA, ISP/TSP, Display).

I tried fixing each of these in turn (triggering the DMA immediately after the source buffer is full, stepping the pipeline on *any* PVR interrupt while only blocking the page flip on VBlank, and making DMA completion also attempt to step the pipeline), but there was still an extra frame of latency when using DMA.

I ran out of time trying to track it down but I will revisit it before HarleQuest! ships and possibly tidy up my frame latency tracing so it can be added into KallistiOS.

I know that this benchmark isn't properly representative of how a real game would work using DMA (you'd probably want to trigger DMA in batches with the CPU working on the next batch in advance), but it's the default way KallistiOS works and I wanted to keep it as one of the submission methods for comparison.

I'm mostly posting this to keep track of things, but if anyone has thoughts or suggestions, feel free to post a reply.
These users thanked the author rpk for the post (total 4):
|darc|, Ian Robinson, freakdave, GyroVorbis
User avatar
Ian Robinson
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 209 times
Been thanked: 41 times

Re: High Performance Rendering

Post by Ian Robinson »

TapamN wrote: Sun May 07, 2023 4:09 pm That's not the right interpretation. It's saying that writing to an SQ that is currently being written out to RAM will cause a stall until the SQ has finished, but writing to the other, idle SQ will still run at full speed. Accessing cache is still fast while an SQ is being written to RAM, but accessing RAM (such as through a cache miss, or by bypassing the cache) will also stall the CPU because the SQ write-out keeps the bus busy. It has nothing to do with superscalar execution.

I've talked about stuff related to your question earlier in different places that might be helpful.

I don't have time right now to fully reply to your first post. I'll try to later.


Just asking about this problem you noted with PVR and DMA in KOS:
TapamN wrote:
I got the following timings for copying a frame buffer from main RAM to video RAM for a 640x480 16-bit frame buffer:
  • memcpy: 20.80 ms
  • KOS sq_cpy: 8.24 ms
  • Modified sq_cpy: 3.89 ms
  • DMA (includes optimized cache flush, waits for DMA to complete): 1.98 ms
  • DMA (includes optimized cache flush, DMA works in background): 0.08 ms
For a 320x240 resolution screen, they would take about 1/4th the time. If you let the DMA work in the background, you wouldn't actually get all of the 1.90 ms saved when waiting, since the DMA will slow down CPU memory access. You might only save something like 1 ms total in a real game.

It looks like you tried to use DMA, but had trouble since KOS's DMA isn't designed to update the frame buffer. Video RAM DMA will only work if you enable the 3D driver... Also, KOS's cache flush function is partially broken; it does flush the cache, but it's much slower than it needs to be. Using KOS's dcache_flush_range would add an extra 1.84 ms to DMA timings I listed.

Specifically this part: "KOS's cache flush function is partially broken; it does flush the cache, but it's much slower than it needs to be. Using KOS's dcache_flush_range would add an extra 1.84 ms to the DMA timings I listed."

Could I ask for your help with this? The cache flush does work, but the way KOS does it is much slower than it needs to be, and that extra ~1.84 ms matters. Given your experience here, would you be willing to share your faster version or a fix? It would benefit a lot of KOS users.

Thanks for your time.
TapamN
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: High Performance Rendering

Post by TapamN »

Sorry my follow up has taken so long. I would have responded sooner, but a new Zelda game distracted me...
rpk wrote: Mon May 08, 2023 2:32 pmSo just to make sure I understand:
You can write to the first SQ, issue PREF, go do some other work like transforming a vertex that's already in cache (FMOV, FTRV, etc), then write to the second SQ, and it's only when you PREF the second SQ (perform the second burst write) that the CPU will stall until the first transfer has finished? This is what I thought initially (before my second post).

If that's the case, would RAM access stall until the SQ transfer is finished even if the SQ transfer is writing to PVR memory rather than main RAM? I'd presume the answer is "yes" because they use the same bus.
That sounds right. You can do anything you want on the CPU without stalling while an SQ is being flushed, except:

1. Access anything on the RAM bus (read/write RAM or external hardware). Accessing anything already in cache is fine. Writing to the other, idle SQ is fine. FLUSHING the idle SQ counts as a RAM access and will stall; I *THINK* performing a prefetch to load stuff into the cache will also cause a stall.
2. Write to the SQ in the process of being flushed, before it has finished (CPU stall is likely to avoid corrupting SQ transfer and writing some mashup of old and new data to the SQ's destination)

On the topic of SQs, do not use GCC's __builtin_prefetch to trigger SQ writes, since GCC might move the pref instruction to some place it thinks is "better" and would allow a real prefetch to have more time to run, which might end up being before your data has been written to the SQ. Always use inline asm for the prefetch.
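
Something along these lines is enough; the "memory" clobber is what stops GCC from moving the pref (or the SQ stores feeding it) around:

Code: Select all

/* Trigger the SQ burst exactly where this is called; the memory clobber
   keeps GCC from reordering the preceding SQ stores past the pref. */
static inline void sq_kick(volatile void *sq_addr) {
    __asm__ __volatile__("pref @%0" : : "r"(sq_addr) : "memory");
}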

I've found it to be tricky to get the prefetch instruction to actually help. Where exactly the PREF instruction is placed can make a big difference, and once I found it to be faster to not prefetch at all. You'll have to experiment and profile it.
rpk wrote: My current approach in the game is a lot simpler than this: one fat Vertex struct with all the attributes that we loop over, do all the necessary steps, then clip & submit at the end. This is basically the opposite approach to the one described (AoS instead of SoA) and according to the logic in that page is going to be way slower.
I don't think he's necessarily talking about AoS vs SoA. More about how not to run out of registers and have to constantly reload stuff.

I've mostly focused on using SQs to submit vertices, and the following information comes from the perspective of using SQs. I have done some testing on SQ vs DMA, but they were pretty flawed tests.

Breaking T&L up so that you do all calculations up front, then do a final submit pass, isn't always ideal because it can take the TA a while to process a vertex. When the TA writes a strip to memory, it writes the vertex data and also has to write a pointer to it for each tile that the strip might be in. (The TA doesn't rasterize the strip to figure out what tiles it might be in, it just uses the AABB of the strip.) The TA has a buffer to hold some commands so the CPU doesn't always stall when the TA is busy, but those random, single writes for the tile pointers can take a while on large strips, so there's a risk of the TA's buffer getting full. Very rapidly submitting everything in a batch is a good way to make the CPU stall on the TA.

For very simple T&L (prelit or one simple light), I've found using AoS and doing all T&L in one pass, recalculating shared vertices, to generally be a bit faster than going out of the way to save shared vertices and go back and reload them (when submitting directly via SQ). Even though it seems less efficient to recalculate shared vertices, the gains from having the CPU and TA working in parallel can outweigh it. It can still be worth it to do indexing, even when you recalculate vertices, because it reduces the size of the model and saves bandwidth/cache space.

Using DMA to submit instead of SQs is one way to prevent the CPU from stalling (DMA will stall instead), which might be why Sega's libraries seemed to favor DMA.

If you're doing expensive T&L, like skinning or complex lights, you would still need to break stuff up into passes (loading multiple matrices per vertex when doing skeleton animation is obviously pretty wasteful). I think a good way to avoid wasting a lot of time on submission is to do real work during the final submit pass, so instead of doing something like:

1. Transform vertex pass
2. Calculate vertex first light pass
3. Calculate vertex second light pass
4. Sum light results pass
5. Apply perspective pass
6. Submit vertex pass

...you instead combine steps 5 and 6 (maybe even step 4, too). The downside is potentially having to write and optimize multiple versions of the final submit pass. And you'd need versions with near clipping.
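
As a rough sketch of what the fused steps 5+6 might look like, using KOS's direct-render macros (from memory, so check dc/pvr.h for the exact pvr_dr_* forms; 16-bit indices with the high bit as an end-of-strip marker; no near clipping; untested):

Code: Select all

#include <dc/pvr.h>

/* Work buffers from the earlier passes (SoA): clip-space xyzw positions,
   summed vertex colours, UVs. Perspective is applied here, fused with
   the submit, instead of in its own pass. */
void perspective_and_submit(const float (*pos)[4], const uint32_t *argb,
                            const float (*uv)[2],
                            const uint16_t *idx, int idx_count) {
    pvr_dr_state_t dr;
    pvr_dr_init(dr);        /* newer KOS may want pvr_dr_init(&dr) */

    for(int i = 0; i < idx_count; i++) {
        int vi  = idx[i] & 0x7fff;
        int eol = idx[i] & 0x8000;

        pvr_vertex_t *out = pvr_dr_target(dr);
        float invw = 1.0f / pos[vi][3];     /* step 5, done here */

        out->flags = eol ? PVR_CMD_VERTEX_EOL : PVR_CMD_VERTEX;
        out->x     = pos[vi][0] * invw;
        out->y     = pos[vi][1] * invw;
        out->z     = invw;
        out->u     = uv[vi][0];
        out->v     = uv[vi][1];
        out->argb  = argb[vi];
        out->oargb = 0;
        pvr_dr_commit(out);                 /* step 6: SQ burst to the TA */
    }
}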

When you do multipass T&L, it's very important to be aware of how the cache works. It's easy for your vertex source data to cause your work buffer to get evicted from the cache and flushed to RAM, since the SH4 uses a direct mapped cache, so the next pass will cache miss when it reads or writes the work buffer. If the sum of the sizes of the source data and the work buffer is larger than 16KB, you are guaranteed to get some cache misses, but you can still get them at any size, just with a lower probability (whether it happens depends on the addresses of the source data and the work buffer, so you could get good performance when the model is loaded in one place in RAM, but worse when it's loaded in another).
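
To put numbers on that: the operand cache is 16KB of 32-byte lines, direct mapped, so the line an address lands in is just bits [13:5] of the address. Two buffers that sit a multiple of 16KB apart in memory hit the exact same lines:

Code: Select all

#include <stdint.h>

/* Which of the 512 lines a cacheable address maps to (normal 16KB mode). */
static inline unsigned oc_line(const void *p) {
    return ((uintptr_t)p >> 5) & 0x1ff;
}

/* Example: 0x8c100000 and 0x8c104000 are 0x4000 (16KB) apart, so
   oc_line() gives the same value for both -- a pass that streams one
   and writes the other keeps evicting its own data. */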

The first option someone might look at to help with this is OCRAM, which allows half of the SH4's cache to be used as fast RAM, so you don't have to worry about it being evicted, but the SH4 has a much better feature called OCINDEX. One way of looking at it is that it gives you two independent, software-controlled 8KB caches instead of one 16KB cache. (Another way of looking at OCINDEX is as a version of OCRAM that can use main RAM as swap memory if you need more than 8KB.) Which cache is used is selected by a bit in the address. You can access the source data through one bank, and store the work buffer in the second bank.

If your work buffer needs to store more than the 8KB half-cache, you can build up the work buffer in 8KB blocks efficiently without evicting the work buffer between passes. You'll want to make sure the first vertices you submit are still in cache (e.g., don't build the first 8KB of the buffer, then the second 8KB that evicts the first 8KB, and then start submitting from the now-cold first 8KB; either build the buffers in the other order, or reorder your indices so you start with what's already in cache).

One drawback with AoS in multipass T&L is bandwidth efficiency. Say you have a pass that reads a vertex position from RAM (not already in cache), then writes the transformed position to a buffer. It's possible to transform a vertex in about 4.5ish cycles if source and destination are in cache, but loading a cache line is something like 9-12 cycles. If you do AoS, with one vertex per line, you'll be memory bound and only accomplish one transform per 9-12 cycles. If you use SoA, you can fit two positions in one cacheline, halving the number of cache misses. AoS does have the advantage that after the first pass, following passes will have their data already in cache, but I'm not sure there are many T&L setups where that's significantly useful.
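
In struct terms, the difference looks something like this (sizes just for illustration):

Code: Select all

/* AoS: a 32-byte "fat" vertex fills a whole cache line, so a pass that
   only needs positions still pays one cache miss per vertex. */
typedef struct {
    float x, y, z;       /* 12 bytes */
    float nx, ny, nz;    /* 12 bytes */
    float u, v;          /*  8 bytes -> 32 bytes, one line per vertex */
} VertexAoS;

/* SoA: a position-only pass streams 16-byte xyzw entries, so two
   vertices share each 32-byte line and the miss count halves. */
typedef struct {
    float (*pos)[4];     /* xyzw per vertex */
    float (*nrm)[4];
    float (*uv)[2];
} MeshSoA;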

One idea I've had (but haven't tried out) would be a hybrid system. For example, you calculate lighting once per unique vertex, then do the simple AoS one-pass transform-and-submit, but instead of storing a normal or static lighting value in the source vertex, you store an index into the calculated lighting values.

Summary:

For simple T&L, doing T&L and submission in one pass is fine.
For complex T&L, do it in passes. Use OCINDEX and have source data and work buffer in different cache banks. Make sure the submit pass isn't too fast by having it do actual work as well.
Benchmark the effect of prefetch instructions.
These users thanked the author TapamN for the post (total 4):
Ian Robinson, |darc|, GyroVorbis, rpk
User avatar
rpk
DC Developer
Posts: 16
Joined: Thu Dec 29, 2016 1:02 pm
Has thanked: 10 times
Been thanked: 12 times

Re: High Performance Rendering

Post by rpk »

Thanks so much for this! :grin: A lot of your points confirm things I've already learned (or at least suspected) and some points are new to me.
TapamN wrote: Tue May 16, 2023 6:43 am Sorry my follow up has taken so long. I would have responded sooner, but a new Zelda game distracted me...
Understandable! I haven't tried it yet, but have heard it's better than BotW.
TapamN wrote: Tue May 16, 2023 6:43 am On the topic of SQs, do not use GCC's __builtin_prefetch to trigger SQ writes, since GCC might move the pref instruction to some place it thinks is "better" and would allow a real prefetch to have more time to run, which might end up being before your data has been written to the SQ. Always use inline asm for the prefetch.
Yep, I found that out the hard way as did some of the others on the Simulant discord. It seems GCC doesn't understand the dual purpose of the pref instruction on the Dreamcast and treats it more as a suggestion. In my tests on godbolt.org I found that GCC can actually ignore the builtin prefetch intrinsic completely, not just re-order it.
TapamN wrote: Tue May 16, 2023 6:43 am I've found it to be tricky to get the prefetch instruction to actually help. Where exactly the PREF instruction is placed can make a big difference, and once I've found it to be faster to not prefetch at all. You'll have experiment and profile it.
I can *sort of* see why prefetch could be slower than not using it for the next vertex (if RAM bus is busy on N-1, you've already got N in cache, trying to prefetch N+1 would be a bad idea).

I guess it would go something like this:

- Prefetch N
- Prefetch N+1
- Process N
- Submit N
- Process N+1
- Prefetch N+2
- Submit N+1
- Process N+2
- Prefetch N+3
- Submit N+2
...etc...
vs.
- Process N
- Submit N
- Process N+1
- Submit N+1
- Process N+2
- Submit N+2

or something like that. As you say, I'll have to profile it.
TapamN wrote: Tue May 16, 2023 6:43 am I don't think he's necessarily talking about AoS vs SoA. More about how not to run out of registers and have to constantly reload stuff.
Yep, my original text was a bit confusing there. What I meant was: the purpose of doing operations in separate stages is to improve register allocation, and the purpose of going through all the required stages on a batch of vertices at a time is to pull them into the cache and fully process them before moving on (avoiding repeatedly loading and evicting the same data to/from cache).

As an extrapolation of that second idea, if many verts simply don't need certain data (i.e. no vertex weights because they don't need skinning), breaking up the verts into SoA instead of AoS could mean that way more can fit into cache at the same time. I have lots of static geometry *and* lots of animated geometry, and I will probably have dynamic lighting enabled selectively as well. If every vertex was an AoS "fat" vertex carrying data it doesn't need, the batch size would have to be much lower. That said, the fact that each attribute is a different size might lead to weird patterns of cache thrashing as we work through an entire strip's worth of verts and the attribute addresses go in and out of phase. Again this is something I'll need to profile.
TapamN wrote: Tue May 16, 2023 6:43 am For very simple T&L (prelit or one simple light), I've found using AoS and doing all T&L in one pass, and recalculating shared vertices, to generally be a bit faster than going out of the way to save shared vertices and go back and reload them (when submitting directly via SQ). Even though it's seems less efficient to recalculate shared vertices, the gains from having the CPU and TA working in parallel can outweigh it. It can still be worth to it do indexing, even when you recalculate vertices, because it reduces the size of the model and saves bandwidth/cache space.
What do you mean by "going out of the way to save shared vertices and go back and reload them"? If indices are already being used, I'd have thought it would be more efficient to first do T&L on the unique verts, *then* go through the indices to build the strips, clip to znear and submit (reordering indices for spatial locality using NvTriStrip or what have you). That way we're not re-processing shared verts and only need to take the hit on jumping around in memory when building strips. I guess in this case, the CPU would be free to do the znear clipping on the next vertex while the SQ burst write to the TA is in progress for the current vertex, so they're still able to operate in parallel. Again, I'd need to profile this :)
TapamN wrote: Tue May 16, 2023 6:43 am Using DMA to submit instead of SQs is one way to prevent the CPU from stalling (DMA will stall instead), which might be why Sega's libraries seemed to favor DMA.
I did notice that they favour DMA. In my above post you'll see that DMA seems to introduce an extra VBlank of latency vs Store Queues in KallistiOS, but I think this is an implementation issue rather than a fundamental DMA one.
TapamN wrote: Tue May 16, 2023 6:43 am When you do multipass T&L, it's very important to be aware of how the cache works. It's easy for your vertex source data to cause your work buffer to get evicted from the cache and flushed to RAM since the SH4 uses a direct mapped cache...
Yeah, this is what I'm getting at with "the attribute addresses go in and out of phase" a couple points back.
TapamN wrote: Tue May 16, 2023 6:43 am The first option someone might look at to help with this is OCRAM, which allows half the of the SH4's cache to be used as fast RAM, so you don't have to worry about it being evicted, but the SH4 has a much better feature called OCINDEX. One way of looking at is that is gives you two independent software controlled 8KB caches instead of one 16KB. (Another way of looking at OCINDEX is as a version of OCRAM that can use main RAM as swap memory if you need more than 8KB.) Which cache is used is selected by a bit in the address. You can access the source data though one bank, and store the work buffer in the second bank.
This is excellent (OCINDEX). I'd heard of it before but your explanation really made it click.
TapamN wrote: Tue May 16, 2023 6:43 am One draw back with AoS in multipass T&L is bandwidth efficiency. Say you have a pass that reads a vertex position from RAM (not already in cache), then writes the transformed position to a buffer. It's possible to transform a vertex in about 4.5ish cycles if source and destination are in cache, but loading a cache line is something like 9-12 cycles. If you do AoS, with one vertex per line, you'll be memory bound and only accomplish one transform per 9-12. If you use a SoA, you can fit two positions in one cacheline, halving the number of cache misses. AoS does have the advantage that after the first pass, following passes will have their data already in cache, but I'm not sure are many T&L setups where that's significantly useful.
How about this with SoA? Some stages need the output of a previous stage as the input to the next stage, so you could just leave them in cache:
Pass 1 - Apply MVP Matrix: Positions in (left bank), Positions out (right bank)
Pass 2 - Lighting: Positions in (right bank), Colours out (left bank)
Pass 3 - Clip & Submit: Positions & Colours in (both banks), submit via SQ to TA knowing it's all in cache, doing clipping while SQ transfer is in progress

The idea would be to do all three passes per 8KiB chunk, meaning the time taken to begin the next 8KiB chunk would coincide with the time taken for the TA to deal with what we've just sent it.

I'll need to revise this and make sure I actually understand, but I think I'm starting to be able to reason about it at least.
TapamN wrote: Tue May 16, 2023 6:43 am One idea I've had (but haven't tried out) would be a hybrid system. For example, you calculate lighting once per unique vertex, then do the simple AoS one-pass transform-and-submit, but instead of storing a normal or static lighting value in the source vertex, stores an index into the calculated lighting values.
Is that mainly to reduce calculating lighting multiple times on shared vertices?
TapamN wrote: Tue May 16, 2023 6:43 am Summary:

For simple T&L, doing T&L and submission in one pass is fine.
For complex T&L, do it in passes. Use OCINDEX and have source data and work buffer in different cache banks. Make sure the submit pass isn't too fast by having it do actual work as well.
Benchmark the effect of prefetch instructions.
Yep, that makes sense. Again, thanks for clarifying this stuff. I'll post here when I get time to take another look at my benchmarks.
These users thanked the author rpk for the post:
Ian Robinson
TapamN
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: High Performance Rendering

Post by TapamN »

rpk wrote: Fri May 19, 2023 5:09 pmUnderstandable! I haven't tried it yet, but have heard it's better than BotW.
I definitely like it better than BotW.
rpk wrote: Fri May 19, 2023 5:09 pmI can *sort of* see why prefetch could be slower than not using it for the next vertex (if RAM bus is busy on N-1, you've already got N in cache, trying to prefetch N+1 would be a bad idea).
That time when prefetch was slowing things down was for a function that would FTRV an array of 4D vectors. Nothing else, read each vector from the array, FTRV it, then write it somewhere else. I was trying to optimize for reading from something not in cache, and writing to cache, so there were no writes to memory or TA. No matter where I put the PREF or how I did it (Like prefetch exactly what I access next, or prefetch a few cachelines ahead), it ended up slower than without prefetching.

I'm still not sure why it was so much slower. I would expect it to be slightly slower when reading something in cache, because the PREF instruction would be enough to make each vector one cycle slower, but somehow taking cache misses still seemed faster than trying to hide them with PREF.

Maybe there's a way to use PREF that didn't occur to me that helps, but what I checked didn't.
rpk wrote: Fri May 19, 2023 5:09 pmWhat do you mean by "going out of the way to save shared vertices and go back and reload them"? If indices are already being used, I'd have thought it would be more efficient to first do T&L on the unique verts, *then* go through the indices to build the strips, clip to znear and submit (reordering indices for spatial locality using NvTriStrip or what have you). That way we're not re-processing shared verts and only need to take the hit on jumping around in memory when building strips. I guess in this case, the CPU would be free to do the znear clipping on the next vertex while the SQ burst write to the TA is in progress for the current vertex, so they're still able to operate in parallel. Again, I'd need to profile this :)
I think I explained it poorly. As an example, for a cube, you have 8 unique vertices and 12 triangles, which might be two strips of 6 tris (like two U shapes put together), or 8*2=16 vertices that need to be submitted.

When doing direct SQ submission, the time taken to submit the cube is 16*TST, where TST is the vertex transform+submit time for one vertex. (TST can vary from vertex to vertex, depending on if the vertex is in cache or not, but that's not relevant for this example.)

If you transform each vertex once, then do a submit pass, the total time is 8*TT+16*ST. TT is the time to generate a vertex in this case, and ST is the time to submit a vertex. TT will be much less than TST, since you have more flexibility to optimize the code.

However, with simple T&L when submitting directly to the TA with SQs, I've found that the values of TST and ST can end up being quite close due to the limited speed of the TA. The 8*TT ends up costing more time than simplifying TST to TT+ST saves. So it's only worth it if TST is very expensive compared to TT and ST (because you have to do too much and spill registers a ton) or you can find a way to reduce the value of ST (maybe writing to RAM then doing DMA can do this?).

But this is all based on testing I did around 10 years ago. I might have done something wrong, so it might be worth double checking what I came up with. What kind of model is being rendered can affect the shared vertex count and triangle strip vertex counts (if you have a tree model made out of a bunch of disconnected quads, you will have very few shared vertices, but something like a racing game or Sonic level, with long winding paths, will have long strips with many shared vertices), so the importance/weights of the TST, TT, and ST values can change.
rpk wrote: Fri May 19, 2023 5:09 pmI did notice that they favour DMA. In my above post you'll see that DMA seems to introduce an extra VBlank of latency vs Store Queues in KallistiOS, but I think this is an implementation issue rather than a fundamental DMA one.
It's possible hardware-wise to use DMA without the extra frame of latency, but there could be efficiency problems. You wouldn't want to wait until the DMA buffer has been completely written before triggering a single large TA DMA, since you don't want DMA to cross a frame. So you would need to occasionally kick off smaller DMAs periodically while writing to the buffer, but PVR DMA really sucks at many small DMAs; you lose a lot of bandwidth somehow. It would need to be tested to find out what are good options.
rpk wrote: Fri May 19, 2023 5:09 pmThis is excellent (OCINDEX). I'd heard of it before but your explanation really made it click.

How about this with SoA? Some stages need the output of a previous stage as the input to the next stage, so you could just leave them in cache:
Pass 1 - Apply MVP Matrix: Positions in (left bank), Positions out (right bank)
Pass 2 - Lighting: Positions in (right bank), Colours out (left bank)
Pass 3 - Clip & Submit: Positions & Colours in (both banks), submit via SQ to TA knowing it's all in cache, doing clipping while SQ transfer is in progress
Did you mean "Normals in (right bank)" instead of positions in? Or were you generating the normals from scratch from the positions? If you meant normals, loading the normals would push the transformed positions out of that bank.

I think I should give a warning about how OCINDEX works, since I only gave an analogy to show how OCINDEX is useful. Normally, the cache is mapped over RAM, repeating every 16KB, but OCINDEX makes one 8KB bank map repeatedly over the first 32MB of RAM, then the second 8KB bank over the next 32MB of RAM, then back to the first bank for the next 32MB, and keeps alternating over the address space. But since the DC's main RAM is mirrored a few times, you can still access all of RAM through each bank. But if you try to access data from one bank, and it's not in that bank, it will load old data from RAM, and not the newly generated data from the other bank.

This is because all OCINDEX physically does is remap the way the cache is addressed. Normally, the SH4 uses the lowest 14 bits of the memory address to index into the cache. OCINDEX uses the lowest 13 bits and bit 25 (IIRC). Since main RAM is mirrored in that area, you can still access it mostly normally with the bit set, but the SH4's cache will still behave like there's potentially different, real RAM there, and won't share things between banks.

If you want to be able to move data from bank 0 to bank 1 (or vice versa), you have two options: manually copy the data over, or invalidate the cacheline in bank 1 and flush the cacheline in bank 0 to RAM, then access the flushed data through bank 1. If you don't go out of your way to flush caches, it's only safe to read something from a bank if it was written through that bank. KOS and GCC will only access bank 0, unless you pass a pointer to bank 1 (like for fread).
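
So in pointer terms it's just a matter of which address you use, roughly like this (assuming the bank select really is bit 25, which as I said is from memory):

Code: Select all

#include <stdint.h>

/* With OCINDEX enabled, address bit 25 picks which 8KB cache bank is
   used. Main RAM is mirrored, so ORing the bit in still reaches the same
   physical memory -- it just indexes the other bank. */
#define OCI_BANK_BIT  (1u << 25)
#define OCI_BANK0(p)  ((void *)((uintptr_t)(p) & ~(uintptr_t)OCI_BANK_BIT))
#define OCI_BANK1(p)  ((void *)((uintptr_t)(p) |  (uintptr_t)OCI_BANK_BIT))

/* e.g. read source vertices through bank 0 and keep the work buffer in
   bank 1 so they can't evict each other:
     const float *src  = OCI_BANK0(mesh_positions);
     float       *work = OCI_BANK1(work_buffer);
   Remember: data written through one bank is NOT visible through the
   other until the lines are flushed/invalidated, as described above. */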

I forgot to mention, there's a link in this post to a Pastebin that contains an asm function to enable and disable OCINDEX in KOS, plus some convenience macros.
rpk wrote: Fri May 19, 2023 5:09 pmThe idea would be to do all three passes per 8KiB chunk, meaning the time taken to begin the next 8KiB chunk would coincide with the time taken for the TA to deal with what we've just sent it.
The TA's buffer is nowhere near that large; it's probably more around the size of a few strips or vertices. I would guess something like 8 to 32 vertices? Rapidly writing 8KB of data will definitely outpace the TA.
rpk wrote: Fri May 19, 2023 5:09 pmIs that mainly to reduce calculating lighting multiple times on shared vertices?
The main reason is to avoid register pressure so that the lighting can be done more efficiently, but that is a nice bonus as well.

For example, if you want to do two lights, you need (ignoring the matrix) one vector register group for the position, one for the normal, two for the lights, and 6 registers for the light colors, or about 24 registers. You can't do that without spilling the SH4's 16 FP registers.

If you have XMTRX free, one great trick when doing multiple parallel/sun lights is to do lighting with FTRV. You might normally do something like this:

Code: Select all

for each vertex
    lightval0 = dot(light0dir, normal)
    lightval1 = dot(light1dir, normal)
    lightval2 = dot(light2dir, normal)
    //clamp lightvals here
    red = lightval0 * light0color.red + lightval1 * light1color.red + lightval2 * light2color.red
    //and do blue and green
You're doing a bunch of dot products. FTRV is just a bunch of dot products. So load the directions into XMTRX and replace it with FTRV.

Code: Select all

XMTRXrow0 = light0dir
XMTRXrow1 = light1dir
XMTRXrow2 = light2dir
XMTRXrow3 = zeroes
for each vertex
    lightvals = FTRV(normal)
That's probably about 50% faster than using three FIPRs. (Of course, that won't work too well for a true point/spot light, since lightdir is different for each vertex... maybe you could instead calculate some kind of rotated normal that's equivalent?) But we still need to add the lights together, and summing the values for red, green, and blue is also a dot product. Time for more FTRV.

Code: Select all

XMTRXcol0 = light0color
XMTRXcol1 = light1color
XMTRXcol2 = light2color
XMTRXcol3 = zeroes
for each vertex
    //clamp lightvals
    vertcolor = FTRV(lightvals)
Writing this has made me think about my TA DMA tests before. I used T&L code optimized for SQ TA access, and wasn't taking advantage of the faster write speed of RAM. I want to try to retest DMA sometime, with code better optimized for memory access, rather than TA SQ access, and see how that turns out.
These users thanked the author TapamN for the post:
rpk
TapamN
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: High Performance Rendering

Post by TapamN »

I managed to step away from TotK for a while to try redoing indexed rendering. I was wondering why my T&L code was so much slower than I thought it should have been. It was clearly limited by cache misses, but prefetching did not help, so I used the SH4's performance counter to try to figure out how prefetching works. Overview of what I've discovered so far:

There is a mandatory 2 cycle stall when loading something into cache, be it a normal cache miss or a prefetch. So a prefetch instruction will always stall for two cycles if it has to load anything. It is totally impossible to move something from RAM into cache without execution stopping for at least two cycles per cacheline. I don't know about how/if this interacts with store queues.

Actually, the 2 cycle stall might really be 1 or 2, depending on if the PREF/load executes on beat with the 100mhz bus clock, or off beat with the bus clock. In my testing, I always got a 2 cycle stall.

It seems like any write to cache during a prefetch will stall until the prefetch completes. This is so incredibly stupid. You basically cannot write to cache during a prefetch. It looks like you want about 18 cycles (!?!) between a prefetch and write to completely avoid a stall. At least it seems safe to read data already fully loaded in cache. I haven't checked if writing to an SQ during prefetch will stall.

I'll post my benchmark code and results later, so that others can double check my methods, once I finished. I'm hoping I got something wrong about the write-during-prefetch thing, and that it's not really that bad...

Edit: I remembered there was a paper by Hitachi about the design of the SH4's cache, "A 2-ns-Access, 285-MHz, Two-Port Cache Macro Using Double Global Bit-Line Pairs", which mentions (even in the title) that the cache is dual ported. I skimmed the paper because it was mostly about the silicon design and just assumed both ports were read/write, but going back to it, it does state that one port is for reads, and the other port is for writes.
These users thanked the author TapamN for the post:
Ian Robinson
User avatar
Ian Robinson
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 209 times
Been thanked: 41 times

Re: High Performance Rendering

Post by Ian Robinson »

TapamN wrote: Sun Jun 04, 2023 6:39 am I managed to step away from TotK for a while to try redoing indexed rendering. I was wondering why my T&L code was so much slower than I thought it should have been. It was clearly limited by cache misses, but prefetching did not help, so I used the SH4's performance counter to try to figure out how prefetching works. Overview of what I've discovered so far:

There is a mandatory 2 cycle stall when loading something into cache, be it a normal cache miss or a prefetch. So a prefetch instruction will always stall for two cycles if it has to load anything. It is totally impossible to move something from RAM into cache without execution stopping for at least two cycles per cacheline. I don't know about how/if this interacts with store queues.

Actually, the 2 cycle stall might really be 1 or 2, depending on if the PREF/load executes on beat with the 100mhz bus clock, or off beat with the bus clock. In my testing, I always got a 2 cycle stall.

It seems like any write to cache during a prefetch will stall until the prefetch completes. This is so incredibly stupid. You basically cannot write to cache during a prefetch. It looks like you want about 18 cycles (!?!) between a prefetch and write to completely avoid a stall. At least it seems safe to read data already fully loaded in cache. I haven't checked if writing to an SQ during prefetch will stall.

I'll post my benchmark code and results later, so that others can double check my methods, once I finished. I'm hoping I got something wrong about the write-during-prefetch thing, and that it's not really that bad...

Edit: I remembered there was a paper by Hitachi about the design of the SH4's cache, "A 2-ns-Access, 285-MHz, Two-Port Cache Macro Using Double Global Bit-Line Pairs", which mentions (even in the title) that the cache is dual ported. I skimmed the paper because it was mostly about the silicon design and just assumed both ports were read/write, but going back to it, it does state that one port is for reads, and the other port is for writes.
Any update on this? I hope it's not this bad :|
TapamN
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: High Performance Rendering

Post by TapamN »

I've been turning my benchmark code for the SH4's performance counter into a kind of library/framework, which is included in the attached file. I'm not going to spend the time to fully document and clean it up (this took longer than expected; I want to go back and finish the VGA patches. At least I'm mostly finished with Zelda for now...), but I think I've got it to a point where it might be usable by other people.

Not being able to write to cache during prefetch unfortunately seems correct. It also probably means you can't read the cache when a line is being flushed with OCBWB.

I also found some kind of delay when switching between read memory accesses and write accesses (I'm talking about real memory accesses, not cache accesses). Doing (RAM read, RAM write)x4 is much slower than (RAM read)x4, (RAM write)x4. Doing (Readx2, Writex2)x2 is in between. I got:

Code: Select all

RRRRWWWW 28.2 cycles
RRWWRRWW 32.2 cycles
RWRWRWRW 39.8 cycles
You can experiment with this by modifying the BenchSQ assembly function in bench_sq.S.

My code for transforming an array of 4D vectors, which could do ~4.8 cycles per vector if source and destination are in cache, would run at ~15.2 cycles per vector if the source was not in cache. I wrote a version that tries to do only cached reads during prefetch, and group prefetches/writes into 64 byte blocks instead of 32, and it manages ~11.4 cycles per vector with uncached source.

Another thing I noticed is that all memory stalls seem to stall the rest of the CPU, like FPU. So if you do a FDIV, then cache miss before the FDIV completes, the FDIV will also stall and make no progress during the cache miss. If you prefetch something, then do a slow instruction like an FDIV, and can't waste enough time before reading the prefetched data to ensure you won't stall, adding manual delays before the read might be helpful.

As for comparing once-per-vertex indexed T&L versus recalculating shared vertices...

I used a low-poly sphere with 330 triangles and 206 unique vertices. It has long strips and all vertices are shared, so it's biased towards indexed T&L. Something like a tree made out of scattered, disconnected quads would have fewer shared vertices and would be biased towards repeated T&L. The sum of the lengths of all strips for the sphere was 402 vertices.

For the indexed T&L, I transform the vertex position and write the UV coord to a buffer in cache. Position and UV were stored as SoA. Then from this buffer, each indexed vertex is fetched, perspective is applied (meaning perspective is calculated multiple times for shared vertices, but skipping perspective there doesn't significantly change the speed), then the TA command is written to the SQ and flushed. I used 16-bit indices, with the highest bit marking whether a vertex is the end of the strip or not. No lighting or clipping is done (it doesn't even set the lighting value; I used GL_REPLACE for the TexEnv).

For repeated T&L, a vertex is fetched, the position is transformed, perspective is applied, one light + ambient is calculated (disabling the lighting would not speed anything up), and it is submitted to the TA via SQs. The length of each strip is stored in a separate array as 16-bit values. No clipping is done.

In the end, the timing for both versions end up as...

Code: Select all

                 T&L     TA Submit   Total (All numbers in cycles)
              (206 vert) (402 vert)  (330 tris)
Indexed  T&L     5053     10986      16039 (fullbright lighting)
Repeated T&L        0     17456      17456 (includes one light + ambient)
For indexed T&L, about 3000 cycles were for transforming the position, and 1850 cycles were for setting the UV coordinate (Which works out to nearly 9 cycles to copy 8 bytes. Yikes, that stupid can't-write-during-prefetch thing is murder. Not using prefetch is only one cycle slower.) Not sure about the missing 150 cycles, maybe some of my notes are bad. The indexed T&L would be slower if any kind of lighting was added (even prelit), there's also going to be some additional overhead from having DMA in the background, and indexed's efficiency will vary depending on how cache friendly everything is arranged, so in a real-world scenario it wouldn't be winning.

Indexed's T&L drops to 1K cycles if the source is already in cache (submit time is unchanged), or about 12K total. Repeated T&L's total drops to 15K cycles if already in cache.

I think there's a bit of room to improve the indexed performance, but it's difficult to tell how much. Those numbers are just what I got with the time I spent, using what I already had available.

I'm not including a working rendering benchmark for indexed vs repeated T&L in the attachment because it relied on a lot of my existing code that would take even more time to separate out, but I included the assembly I used for the indexed T&L (transform_p.S, setuvs_p3.S, and submit.S). The repeated T&L code I used is what's here.

About the library for helping collect data with the SH4's performance counter...

The performance counter on the SH4 can only measure two events at a time. The library reruns the function multiple times measuring a different event each time so it can record all events. It can also measure multiple functions and print their results as CSV. (At first, I tried copy and pasting individual results into a spreadsheet, but that was very error prone.) There are examples in main.c.

There are two ways to measure a function. You can measure a function by passing a function pointer to the benchmark function, or inline the measurement code into an assembly function, which allows for more precise measurements or control over what is measured. You can run the benchmark multiple times so you can get an average to hide noise from anything like SDRAM refresh.

A pctTracker32 struct manages the measurement state, and a pctTrackerResults32 stores the accumulated perf. counter results. They use 32-bit values for the counter results, so some counters (like the cycle counter) could overflow if run for too long. I didn't bother implementing 64-bit versions, since I don't think you need to run the tests for that long.

For both regular functions and functions with inline measurements, you initialize pctTracker32 and pctTrackerResults32 structs, then call pctInitFunc to set up the function to be called (not necessary if measuring an asm function with an inline tracker) and an optional setup function, called before the measured function is called. You can use the setup function to set up the cache to a consistent state. There are some convenience functions for flushing/preloading the cache that can be used in your setup functions.

When you use the tracker, it adds the perf counter results to the struct, so you can do multiple runs and average things out later. Make sure to manually clear the pctTrackerResults32 when required.

You can set up the parameters passed to a regular function with the PCT_SET_? macros. They set up the r4-r7, fr4-fr11, and some stack parameters passed to the function, so you have to match the SH4 calling convention manually. If you use the inline tracker macros, the cache setup still gets called, but the rest of the setup has to be done yourself. You can still use the space in the pctTracker32 struct to pass data, though.

DoPrefFtrvBenchmark in main.c shows how to measure a single, regular function, via function pointer.

bench_sq.S shows how to set up an assembly function for inline measurement. DoSQBenchmark in main.c shows how to call it.

DoPrefBenchmark and DoTABenchmark in main.c show measuring an array of normal functions.

If you want to measure multiple inline measured functions, you have to do that yourself.

It's possible to enable/disable measuring certain events with the pctEvent* functions. For example, if you are measuring the TA by drawing polygons, and you don't want to overfill the vertex buffer, you can disable testing certain events to reduce the number of times your code is called.

pctPrint* functions print results stored in a pctTrackerResults32. You can scale values to print averages. If your function loops over 100 items, and you want to know the average number of events per item, you can pass 100 as the scale value.

There are more explanations in the code and headers, but if you have any questions about using it, just ask. Since I didn't originally plan ahead for others to use it, I had to go back and clean it up, but I don't really feel like putting the effort in to do it fully.

Oh, Ian, I just noticed that you asked about flushing the cache in one of your posts. I guess I missed it (despite being bolded) since it was still in the quote. I wrote something that was a direct replacement for KOS's dcache_flush_range, but looking at it again, I think there might be a problem with it, so I want to double-check it before I post it. This library has a function in pctracker.c for flushing and invalidating the entire cache, pctPurgeCache, which might help. It might do more than you need, but it's still probably faster than what KOS has.
Attachments
pctracker.7z
(22.99 KiB) Downloaded 65 times
These users thanked the author TapamN for the post (total 2):
|darc|, Ian Robinson
User avatar
Ian Robinson
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 209 times
Been thanked: 41 times

Re: High Performance Rendering

Post by Ian Robinson »

I was talking to a Sega dev from back in the day and they told me about this.
unknown.png
Also, the SuperH has a fantastic feature where you can turn half the cache into a 'zero page': the cache is addressable, and when you flip the bit you get about 8KB of super high speed memory to use. It's great for compression algorithms and similar things; in the context of games, this would be great for doing stuff like skinning.