Working through the code in sq.c for sq_cpy

Post by ThePerfectK

I'm trying to gain a better understanding of not how to use store queue commands (that's kind of obvious with KOS) but rather exactly what the inline assembly the KOS command is calling actually does. As in, how the SH4 itself can be controlled to perform a store queue manually. Just for knowledge, not for any real reason.

So I'm working through the sq_cpy function in sq.c and I have some questions regarding the actual loop. Let me post what I'm pretty sure the function is doing, line by line, and then I'll ask my question:

Code:

unsigned int *d = (unsigned int *)(void *)
                      (0xe0000000 | (((unsigned long)dest) & 0x03ffffe0));
That seems pretty obvious: it's setting the top 3 bits to put the address in the P4 region, then keeping the 32-byte-aligned offset of the destination within the 64 MB area it resides in according to the memory map, correct? I also notice that the area mapped from 0xe0000000 to 0xe3ffffff is known as the Store Queue region in the SH4 manual; I'll ask about that in a bit.
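To check my reading of that mask, here's a small host-side sketch in plain C (not SH4 code). `sq_window_addr` is a name I made up for illustration, not something in KOS, and it assumes the destination is 32-byte aligned, as the sq_clr comment in sq.c requires.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of KOS: returns the store queue window
 * address that the first line of sq_cpy computes for a destination. */
static uint32_t sq_window_addr(uint32_t dest)
{
    /* Assumption: dest is 32-byte aligned, as the sq_clr comment requires. */
    assert((dest & 0x1fu) == 0);

    /* Keep bits 25..5 of dest and force bits 31..26 to 111000. */
    return 0xe0000000u | (dest & 0x03ffffe0u);
}
```

So a destination of 0x04200020 would map to 0xe0200020: the offset inside the 64 MB area survives, and the area number itself is dropped from the pointer.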

Code:

const unsigned int *s = src;
creating a pointer for the source address to copy from

Code:

/* Set store queue memory area as desired */
    QACR0 = ((((unsigned int)dest) >> 26) << 2) & 0x1c;
    QACR1 = ((((unsigned int)dest) >> 26) << 2) & 0x1c;
So looking in the SH-4 manual, bits 2, 3, and 4 of QACR0/1 need to be set to the area that the destination address resides in, which is what the above does. It shifts the address right by 26 to isolate its top 6 bits, then left by 2, and the 0x1c mask keeps only address bits 28-26, now sitting in bits 4-2 of the Queue Address Control Register. These bits control which 64 MB area the store queue transfer will wind up in, given the mapping of the 32-bit logical address to the 29-bit physical one, correct?
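Here's how I picture the two pieces fitting back together, as a host-side sketch in plain C. Both function names are hypothetical (mine, not KOS's), and the recombination in `sq_target_addr` is my understanding of the manual for the case where the MMU is off, so treat it as an assumption rather than a reference.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value sq_cpy writes to QACR0 and QACR1. */
static uint32_t qacr_value(uint32_t dest)
{
    /* Address bits 28..26 end up in register bits 4..2. */
    return ((dest >> 26) << 2) & 0x1cu;
}

/* Hypothetical helper: the 29-bit external address a store queue burst
 * would target, rebuilt from the QACR value and the store queue address. */
static uint32_t sq_target_addr(uint32_t qacr, uint32_t sq_addr)
{
    /* Only addresses in the 0xe0000000..0xe3ffffff window make sense. */
    assert((sq_addr >> 26) == 0x38u);

    /* QACR bits 4..2 supply address bits 28..26, the store queue
     * address supplies bits 25..5, and bits 4..0 are zero. */
    return ((qacr & 0x1cu) << 24) | (sq_addr & 0x03ffffe0u);
}
```

For a destination of 0xa4000020 this gives a QACR value of 0x04 (area 1), and combining that with the store queue address 0xe0000020 gives back 0x04000020.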

Code:

/* fill/write queues as many times necessary */
    n >>= 5;
n is the number of bytes to transfer, and shifting it right by 5 turns it into our loop count: how many 32-byte blasts to do. So if we say 0-31 bytes, it doesn't loop at all; if we say 32-63, it loops once; 64-95, it loops twice; and so forth.
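A tiny host-side sketch of that arithmetic, in plain C. The helper names are hypothetical, and the sketch only handles non-negative counts.

```c
#include <assert.h>

/* Hypothetical helper: how many times the copy/clear loop runs. */
static int sq_block_count(int n)
{
    assert(n >= 0);   /* this sketch only handles non-negative counts */
    return n >> 5;    /* same as n / 32 for non-negative n */
}

/* Hypothetical helper: trailing bytes that the loop never touches. */
static int sq_leftover_bytes(int n)
{
    assert(n >= 0);
    return n & 31;    /* same as n % 32 for non-negative n */
}
```

So asking for 40 bytes would move one 32-byte block and leave the last 8 bytes alone, which suggests n should be a multiple of 32.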
Everything has been really straightforward so far; now comes the confusion with the actual loop.

Code:

while(n--) {
        asm("pref @%0" : : "r"(s + 8));  /* prefetch 32 bytes for next loop */
        d[0] = *(s++);
        d[1] = *(s++);
        d[2] = *(s++);
        d[3] = *(s++);
        d[4] = *(s++);
        d[5] = *(s++);
        d[6] = *(s++);
        d[7] = *(s++);
        asm("pref @%0" : : "r"(d));
        d += 8;
    }
To unpack this: the inline assembly itself isn't the problem, I'm just confused about how the prefetch is being used each time. Am I to understand that the pref instruction has two functions? The first inline assembly command is a pref. Is that throwing the content of the source address + 8 bytes into a general register, or...? The comment says it prefetches 32 bytes, but I don't see how that command actually does this; it looks more like it's pulling the next 8 bytes at the source into cache. Or am I wrong? The purpose of this prefetch is to pull the upcoming source data into cache so it's available faster when we store it through our d pointer, right?
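On the `s + 8` part, a quick host-side check in plain C might help. It assumes s points to 4-byte words (I use `uint32_t` to stand in for the SH4's unsigned int), and the helper name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: byte distance between s and s + 8 when s points
 * to 32-bit words (uint32_t stands in for the SH4's unsigned int). */
static size_t pref_lookahead_bytes(const uint32_t *s)
{
    assert(s != NULL);

    /* Pointer arithmetic scales by the element size, so + 8 means
     * 8 elements, not 8 bytes. */
    return (size_t)((const unsigned char *)(s + 8) -
                    (const unsigned char *)s);
}
```

If that's right, `s + 8` is the address 32 bytes ahead, i.e. the start of the block the next pass will read. As far as I can tell, pref only takes an address in a register and asks for the 32-byte cache line containing it to be loaded into the cache; nothing lands in a general register.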

Then the next 8 statements: they're setting individual bytes beginning at pointer d, each one set to the next element s points to. Straightforward, I guess? Our d pointer in this case is specially formatted to land in the "store queue region", correct?

Then the final asm command, another pref, this time one that signals the SH4 to actually perform the store queue transfer, right?

So my question: where exactly are we loading values into the store queue here? I don't see where we ever specified what is in the store queue. What determines what is and isn't in the store queue? I see that the memory area from 0xe0000000 to 0xe3ffffff is called the "store queue region". Is it that writing data in there, as we do through d, is what loads data into the store queue?

Just want to make sure I have all this correct for my own benefit.


EDIT: A bit more reading and I think I understand it now. I saw the store queue write section of the SH4 manual, which specifies how to write to the store queues by address:

Code:

A write to the SQs can be performed using a store instruction on P4 area 0xE000
0000 to 0xE3FF FFFC. A longword or quadword access size can be used. The
meaning of the address bits is as follows:

[31:26]  111000            Store queue specification
[25:6]   Don’t care        Used for external memory transfer/access right
[5]      0/1               0: SQ0 specification, 1: SQ1 specification
[4:2]    LW specification  Specifies longword position in SQ0/SQ1
[1:0]    00                Fixed at 0
Now the formatting of the d pointer makes more sense. 0xe0000000 sets the top 3 bits, then the 0x03ffffe0 mask ensures that the next 3 bits are 0s, and our destination address only occupies bits 25-5 (with bit 5 doubling as the SQ0/SQ1 select). Since bits 31-26 are set to 111000, the address itself is the code saying that we're filling the store queue, which is located at this address.

The middle 20 bits are the destination address, which, coupled with QACR0/1, tells the SH4 which 64 MB area of memory to transfer to; that frees up a few more bits for selecting which store queue spot we're writing to. Bits 2, 3, and 4 (representing 8 store queue bytes) are which byte in the store queue longword we're writing to, then bit 5 is the specific store queue we're writing to (either 0 or 1; 0 for the first block in sq_cpy, if dest has bit 5 clear).
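To keep the fields from the manual's table straight, here's a host-side decoder sketch in plain C. The struct and function names are hypothetical, and the field descriptions in the comments are taken from the table quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical struct holding the low fields of a store queue address. */
struct sq_fields {
    unsigned queue;     /* bit 5: 0 = SQ0, 1 = SQ1 */
    unsigned longword;  /* bits 4..2: longword position within the queue */
    unsigned low_bits;  /* bits 1..0: fixed at 0 per the manual */
};

/* Hypothetical helper: split a store queue address into those fields. */
static struct sq_fields sq_decode(uint32_t addr)
{
    struct sq_fields f;

    /* Bits 31..26 must be 111000 for a store queue write. */
    assert((addr >> 26) == 0x38u);

    f.queue = (addr >> 5) & 1u;
    f.longword = (addr >> 2) & 7u;
    f.low_bits = addr & 3u;
    return f;
}
```

For example, 0xe0000024 decodes to SQ1, longword position 1, with the low two bits at 0.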

We set up d so that d[0] points to the first byte in the longword of store queue 0. The following:

Code:

        d[0] = *(s++);
        d[1] = *(s++);
        d[2] = *(s++);
        d[3] = *(s++);
        d[4] = *(s++);
        d[5] = *(s++);
        d[6] = *(s++);
        d[7] = *(s++);
fills up each byte in the longword with a value, priming the store queue for transfer. Then a pref is issued on this special store queue region address, which tells the SH4 to begin a store queue transfer, using QACR0 to know which area to transfer to.

This makes more sense if you look at sq_clr as you can see the priming of the two store queues easier:

Code:

/* clears n bytes at dest, dest must be 32-byte aligned */
void sq_clr(void *dest, int n) {
    unsigned int *d = (unsigned int *)(void *)
                      (0xe0000000 | (((unsigned long)dest) & 0x03ffffe0));

    /* Set store queue memory area as desired */
    QACR0 = ((((unsigned int)dest) >> 26) << 2) & 0x1c;
    QACR1 = ((((unsigned int)dest) >> 26) << 2) & 0x1c;

    /* Fill both store queues with zeroes */
    d[0] = d[1] = d[2] = d[3] = d[4] = d[5] = d[6] = d[7] =
    d[8] = d[9] = d[10] = d[11] = d[12] = d[13] = d[14] = d[15] = 0;

    /* Write them as many times necessary */
    n >>= 5;

    while(n--) {
        __asm__("pref @%0" : : "r"(d));
        d += 8;
    }

    /* Wait for both store queues to complete */
    d = (unsigned int *)0xe0000000;
    d[0] = d[8] = 0;
}
The line in the middle setting consecutive bytes at d sorta makes sense to me now. The first 8 begin with the low 6 bits arranged so that the 3 bits controlling the byte destination in the store queue longword start at 0, with the store queue select bit (bit 5) set to 0. Since we're moving 16 bytes past d, that eventually fills all the bytes in the first store queue longword (8 bytes), then trips bit 5 to point to SQ1, which fills the next series of bytes in the second store queue (8 bytes), with all bytes being 0. When it prefetches, it's sending those bytes through the store queue, then incrementing to the next 8 bytes down the line for the next store queue to clear, until the loop ends.

Makes much more sense, but I'm a bit confused because the SH4 manual says bits 0 and 1 of the store queue region address must remain 0, which would make counting forward 16 bytes not work correctly. What am I missing that makes this all fit correctly?
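One thing that might resolve this (my own reading, so take it as an assumption): d is an unsigned int pointer, so d[i] sits 4*i bytes past d, not i bytes. A host-side sketch in plain C, with a made-up helper name and assuming 4-byte words:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: address of d[i] when d points to 4-byte words.
 * sq_clr only touches d[0] through d[15]. */
static uint32_t sq_elem_addr(uint32_t d_base, unsigned i)
{
    assert(i < 16);
    return d_base + 4u * i;
}
```

With a base of 0xe0000000, every d[i] address is a multiple of 4, so bits 0 and 1 stay 0. Bits 4..2 count through the positions 0 to 7 for d[0] to d[7], and bit 5 flips to 1 from d[8] onward. That would make d[0] to d[7] the eight longwords (32 bytes) of SQ0 and d[8] to d[15] the eight longwords of SQ1, and it would also mean `d += 8` steps 32 bytes, switching queues on each pass.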