How to define a variable at a specific memory location, and whether it's cached or not

If you have any questions on programming, this is the place to ask them, whether you're a newbie or an experienced programmer. Discussion on programming in general is also welcome. We will help you with programming homework, but we will not do your work for you! Any porting requests must be made in Developmental Ideas.
Post Reply
User avatar
ThePerfectK
Insane DCEmu
Insane DCEmu
Posts: 147
Joined: Thu Apr 27, 2006 10:15 am
Has thanked: 27 times
Been thanked: 35 times

How to define a variable at a specific memory location, and whether it's cached or not

Post by ThePerfectK »

I wanted to take a quick second to show, in an abridged manner, how you can make sure a variable is defined at an absolute memory address of your choosing. The advantage of being able to do this is that you can place multiple memory blocks at specific locations to keep them from evicting each other from the cache when accessed. An example of when you might want to do this: you have a memory block spanning multiple cache lines which you need to keep in cache, but you also need to access a second variable very quickly, and you must make sure that second variable does not fall into any of the cache lines the first block occupies.

One would usually learn about this kind of stuff in a college architecture course, when diving into the sections of an executable. To keep this short: there are multiple sections inside an executable binary where your compiler and linker place variables depending on their content. The linker uses a memory map to define these spaces in the executable. An example of such a space is the block starting symbol section, also called .bss. This is an area of the executable, mapped into memory, that holds statically allocated (zero-initialized or uninitialized) variables; so, for example, when you declare a static variable, this is where it resides once loaded. There are many such sections for other kinds of data, like constants, but that's not important here.

What is important is that we can use a linker script to define our own sections of the executable, map them to specific memory addresses, and then put variables into them. You can do this by simply adding a few options to your linker call in your makefile and annotating variables in your source code accordingly.

This is the line I use in my makefiles to call the linker when linking my Dreamcast .o objects into my .elf after compilation:

Code: Select all

$(KOS_CCPLUS) $(KOS_LDFLAGS) -o ./dreamcast-debug.elf $^ -T link.ld $(CXXFLAGS) $(LIBS) $(KOS_LIBS) 
The -T flag tells the linker to use a linker script, which will define additional sections of our binary for us. In this case it looks for a file called link.ld. That file contains the following:

Code: Select all

SECTIONS
{
  .mySegment 0x8C200000 : {KEEP(*(.Section1))}
  .mySegment 0x8C204000 : {KEEP(*(.Section2))}
}
This defines two additional sections of our binary: one called ".Section1", which begins at memory address 0x8C200000, and one called ".Section2", which begins at memory address 0x8C204000. According to the Dreamcast memory map (https://dreamcast.wiki/Memory_map), 0x80000000-0x9FFFFFFF is the cached, privileged-mode region, and within that area, offsets 0x0C000000 to 0x0FFFFFFF are where system RAM lives.

We tell GCC which section our variables go into using the section attribute, like so:

Code: Select all

int Variable1 __attribute__((section(".Section1"))) = 0x9ABCDEF0;
int Variable2 __attribute__((section(".Section2"))) = 0x01010101;
In this case, we have an integer called Variable1 defined in our .Section1, and another int called Variable2 in .Section2. We gave them unique initialization values so we can check that they reside in memory correctly, like so:

Code: Select all

printf("adr %p\n", (void*)&Variable1);
printf("val 0x%x\n", Variable1);
Variable1 = 0;
printf("val 0x%x\n", Variable1);

printf("adr %p\n", (void*)&Variable2);
printf("val 0x%x\n", Variable2);
Variable2 = 0;
printf("val 0x%x\n", Variable2);
When we compile and run our elf through dcload, we can see the new segments of our executable reported as the program loads, along with the byte count of the variables stored in each (4 in each case, because an int is 4 bytes):
segments.png
And when we run the program, it confirms our variables are in the right location:
values.png
You can actually take this a bit further to control caching: if you define a variable in the range 0xA0000000-0xBFFFFFFF instead of 0x80000000-0x9FFFFFFF, you get a variable which is not cached when accessed. The Dreamcast memory map defines this area as "Privileged mode only, no cache." Just like in the previous area, offsets 0x0C000000 to 0x0FFFFFFF within it are system RAM. Keep in mind that this is the same system RAM as before, just mirrored, so, for example, 0x8C200000 and 0xAC200000 point to the same physical memory. The only difference is that accessing the variable through 0xAC200000 skips putting it into cache. To place a variable at 0xAC200000, we again create a segment in our link.ld:

Code: Select all

  .mySegment3 0xAC206000 : {KEEP(*(.NonCachedSection))}
and then, like before, create a variable in that section:

Code: Select all

int NonCachedVariable __attribute__((section(".NonCachedSection"))) = 0xDEADBEEF;
Now we can access our NonCachedVariable without it affecting our cache, even if its address maps to the same cache line as some data block we want to keep cached!

Hopefully this helps someone out there more properly manage their cache cohesion!
These users thanked the author ThePerfectK for the post (total 2):
GyroVorbis, sleort
Still Thinking!~~
User avatar
GyroVorbis
Elysian Shadows Developer
Elysian Shadows Developer
Posts: 1874
Joined: Mon Mar 22, 2004 4:55 pm
Location: #%^&*!!!11one Super Sonic
Has thanked: 80 times
Been thanked: 61 times
Contact:

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by GyroVorbis »

What the hell!?! This was extremely useful. Thank you very much for this!!!
These users thanked the author GyroVorbis for the post:
sleort
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by TapamN »

I'm not sure that method is a good way to do it. Manually setting addresses of variables in the linker script seems error prone, and I'm worried that GCC will overlap variables in RAM this way. For example, if a large normal, cached buffer is declared and gets placed so it ranges from 0x8c100000 to 0x8c500000, and you have an uncached buffer that ranges from 0xac200000 to 0xac300000, is GCC smart enough to warn that they overlap the same area in RAM? It creates a stealth union between the buffers, which is probably not what you'd want.

I think a more reliable and simpler way to access uncached RAM is to use macros to convert between cached and uncached pointers.

Code: Select all

#define NONCACHED(a) (typeof (&(a)[0]))(((unsigned int)(a)) |  (1 << 29))
#define CACHED(a)    (typeof (&(a)[0]))(((unsigned int)(a)) & ~(1 << 29))
#define OCI_BANK0(a) (typeof (&(a)[0]))(((unsigned int)(a)) & ~(1 << 25))
#define OCI_BANK1(a) (typeof (&(a)[0]))(((unsigned int)(a)) |  (1 << 25))
Also, uncached RAM is very slow, and bypassing the cache only helps in very specific circumstances. If you're just writing RAM linearly, see if you can use store queues instead. If read thrashing is a problem, it's better to enable the OCINDEX mode of the data cache and then access some of your data through the second bank. KOS doesn't have a built-in way to enable OCINDEX, but an assembly routine can enable it. Just link the .s file into your program and call sh4EnableOCIndex(). (I can't list the assembly directly in this post because phpBB sucks and gives me a server error if I try.)
These users thanked the author TapamN for the post:
ThePerfectK
User avatar
ThePerfectK
Insane DCEmu
Insane DCEmu
Posts: 147
Joined: Thu Apr 27, 2006 10:15 am
Has thanked: 27 times
Been thanked: 35 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by ThePerfectK »

TapamN wrote: Mon Feb 27, 2023 10:16 pm I'm not sure that method is a good way to do it. Manually setting addresses of variables in the linker script seems error prone, and I'm worried that GCC will overlap variables in RAM this way. For example, if a large normal, cached buffer is declared and gets placed so it ranges from 0x8c100000 to 0x8c500000, and you have an uncached buffer that ranges from 0xac200000 to 0xac300000, is GCC smart enough to warn that they overlap the same area in RAM? It creates a stealth union between the buffers, which is probably not what you'd want.
You are very correct that GCC won't warn about this, and I actually ran into it when posting the example the very first time. This is very much manual memory management, with all the pitfalls that come with it, so definitely YMMV. For what it's worth, my primary use for this is to create memory pools that explicitly do not overlap, and to use it sparingly for very specific reasons. For example, I've been experimenting with creating cache-line-sized buffers that fall within different boundaries of the sense amplifiers that control memory access (i.e. 2048-byte row boundaries), carefully arranging the buffers to not only not overlap each other, but also not overlap the boundaries of each individual bank in memory, so I can read multiple blocks of memory concurrently without the performance penalty of a sense amplifier having to open another row. If you're using memory management like this, you're well into the weeds; this is not something somebody should be implementing without a good understanding of Dreamcast memory.

The impetus for this was specifically this article: https://dcemulation.org/index.php?title ... amcast_RAM

With careful management, you could theoretically access 4 banks of memory simultaneously with minimal performance penalty as each independent ram chip has a single sense amplifier that covers two 2048kb banks each.
I think a more reliable and simpler way to access uncached RAM is to use macros to convert between cached and uncached pointers.

Code: Select all

#define NONCACHED(a) (typeof (&(a)[0]))(((unsigned int)(a)) |  (1 << 29))
#define CACHED(a)    (typeof (&(a)[0]))(((unsigned int)(a)) & ~(1 << 29))
#define OCI_BANK0(a) (typeof (&(a)[0]))(((unsigned int)(a)) & ~(1 << 25))
#define OCI_BANK1(a) (typeof (&(a)[0]))(((unsigned int)(a)) |  (1 << 25))
This is a great alternative, though. There are also ways to create sections directly through linker options without an ld script.
Also, uncached RAM is very slow, and bypassing the cache only helps in very specific circumstances. If you're just writing RAM linearly, see if you can use store queues instead. If read thrashing is a problem, it's better to enable the OCINDEX mode of the data cache and then access some of your data through the second bank. KOS doesn't have a built-in way to enable OCINDEX, but an assembly routine can enable it. Just link the .s file into your program and call sh4EnableOCIndex(). (I can't list the assembly directly in this post because phpBB sucks and gives me a server error if I try.)
Much appreciated, not only the tip but also everything you've posted over the years. I can't say enough how many of your posts I've come across that are overflowing with great advice. Do you happen to reside on any chat servers, like IRC or Discord? There are many times when I'd love to pick someone's brain about whether what I'm doing is going to thrash my cache, and all I can really rely on is dcprof.
Still Thinking!~~
nymus
DC Developer
DC Developer
Posts: 968
Joined: Tue Feb 11, 2003 4:12 pm
Location: In a Dream
Has thanked: 5 times
Been thanked: 6 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by nymus »

ThePerfectK wrote: Mon Mar 06, 2023 2:21 am With careful management, you could theoretically access 4 banks of memory simultaneously with minimal performance penalty as each independent ram chip has a single sense amplifier that covers two 2048kb banks each.
Would you mind clarifying this? I thought that regardless of the physical ram arrangement/connection, the memory bus would always lock for each access?
behold the mind
inspired by Dreamcast
User avatar
ThePerfectK
Insane DCEmu
Insane DCEmu
Posts: 147
Joined: Thu Apr 27, 2006 10:15 am
Has thanked: 27 times
Been thanked: 35 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by ThePerfectK »

nymus wrote: Mon Mar 13, 2023 9:35 am
ThePerfectK wrote: Mon Mar 06, 2023 2:21 am With careful management, you could theoretically access 4 banks of memory simultaneously with minimal performance penalty as each independent ram chip has a single sense amplifier that covers two 2048kb banks each.
Would you mind clarifying this? I thought that regardless of the physical ram arrangement/connection, the memory bus would always lock for each access?
The memory in the Dreamcast is split between two chips, each holding 4 banks of 512K 32-bit cells, which are read in tandem, so it's easier to think of them as a single memory array of 4 banks of 512K 64-bit cells. 512K 64-bit cells equals 4MB, so 4 banks of 4MB = 16MB system RAM.

Each bank has its own circuitry to access the memory inside it; the circuitry chops each bank up into 2048 rows of 2KB worth of memory cells (each cell 64 bits big; 2048 rows * 2048 bytes = 4MB per bank). In order to access a row of memory, a sense amplifier must be activated and connected to the row. Each bank has its own sense amplifier, so there are 4 possible 2KB rows that can be open at once. If you try to access a cell outside the currently open row in a bank, there is a performance penalty while the active row is closed and a new row is opened by that bank's sense amplifier.

Since there are 4 banks, each with its own sense amplifier (4 sense amplifiers total), you can access 4 different 2KB regions of memory at once without having to move a sense amplifier. Each 2KB region equals 64 cache lines, with each cache line being 32 bytes big (coincidentally the same size as the average vertex type the Dreamcast normally operates on). 64 * 32 bytes = 2048 bytes = 2KB; * 2048 rows = 4MB; * 4 banks = 16MB system RAM.

Additionally, the Dreamcast's data cache is direct mapped, meaning memory maps onto the cache in repeating 16KB windows. In other words, there are 512 cache lines, each 32 bytes big (512 * 32 = 16384 bytes = 16KB), and those 512 lines are mapped to memory in repeating 16KB chunks. This means the first cache line maps to memory at 0x00000000-0x0000001F (the first 32 bytes), and also 0x00004000-0x0000401F (16KB further on), and 0x00008000-0x0000801F (32KB further on); the second cache line maps to 0x00000020-0x0000003F, and also 0x00004020-0x0000403F, and so forth. If you access areas of memory 16KB apart from each other, they map to the same cache line and will evict each other.

The purpose I'm using manual memory mapping for is to straddle these 2KB memory rows, making sure the buffers don't overlap in the cache so there's no cache line thrashing as I access each of them, while also making sure they fall on different banks so I can use all 4 sense amplifiers. In this way, I can access four 2KB memory buffers at once with the bare minimum performance penalty, avoiding sense amplifier activation/deactivation, while keeping all the buffers in cache.

My *actual* use for this? I've been porting Sonic CD to the Dreamcast for several years now, and part of the way Sonic 1 through Sonic & Knuckles work is that level layouts are a series of atlas lookups. A level layout is stored as a buffer of words that index into a chunk buffer, each chunk being an 8x8 array of pointers into a block buffer, each block being a 2x2 array of pointers into a tile buffer, and each tile being an 8x8 graphic. These individual buffer elements are small enough to fit into the boundaries I'm speaking of very easily, so you can render an entire level layout while keeping all the relevant lookup buffers in cache. The definition of Sonic's chunk buffer, for example, is 64 blocks at 32 bytes each, so it fits snugly into a 2048-byte boundary. I draw the background layers 128x128 pixels at a time, which is the same as one chunk, so I keep a chunk reference in one 2048-byte window, which leaves me three more for general level layout, block buffer lookup, and meta tile reference (for collision).

All that's to the best of my understanding; if anything doesn't line up with the above, please comment.
These users thanked the author ThePerfectK for the post:
nymus
Still Thinking!~~
User avatar
Ian Robinson
DC Developer
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 209 times
Been thanked: 41 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by Ian Robinson »

TapamN wrote: Mon Feb 27, 2023 10:16 pm I'm not sure that method is a good way to do it. Manually setting addresses of variables in the linker script seems error prone, and I'm worried that GCC will overlap variables in RAM this way. For example, if a large normal, cached buffer is declared, and gets placed to it ranges from 0x8c100000 to 0x8c500000, and you have a uncached buffer that ranges from 0xac200000 to 0xac300000, is GCC smart enough to warn that they overlap the same area in RAM? It's creates a stealth union between the buffers, which is probably not what you'd want.

I think a more reliable and simpler way to access uncached RAM is to use macros to convert between cached and uncached pointers.

Code: Select all

#define NONCACHED(a) (typeof (&(a)[0]))(((unsigned int)(a)) |  (1 << 29))
#define CACHED(a)    (typeof (&(a)[0]))(((unsigned int)(a)) & ~(1 << 29))
#define OCI_BANK0(a) (typeof (&(a)[0]))(((unsigned int)(a)) & ~(1 << 25))
#define OCI_BANK1(a) (typeof (&(a)[0]))(((unsigned int)(a)) |  (1 << 25))
Also, uncached RAM is very slow, and bypassing the cache only helps in very specific circumstances. If you're just writing RAM linearly, see if you can use store queues instead. If read thrashing is a problem, it's better to enable the OCINDEX mode of the data cache then access some of your data from the second bank. KOS doesn't have a way to enable OCINDEX built-in, but this assembly can enable it. Just link the .s file into your program and call sh4EnableOCIndex(). (I can't list the assembly directly in this post because phpBB sucks and gives me a server error if I try.)
Can you put a simple example to go along with this ? It looks to be what i needed and could not get to work on kos..
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by TapamN »

ThePerfectK wrote: Mon Mar 06, 2023 2:21 amI can't say enough how many posts from you I've come across over the years that are overflowing with great advice. Do you happen to reside on any chat servers, like IRC or discord? There are many times when I'd love to pick someones brains regarding if what I'm doing is going to thrash my cache or not, and all I can really rely on is dcprof.
Not at the moment, but maybe in the future. I think it's better to keep stuff like that someplace like here, since the Internet Archive can find it and it's less likely to get lost than a proprietary, closed site.
ThePerfectK wrote: Mon Mar 13, 2023 6:49 pm All that's to my best understanding, if there is anything that doesn't line up with the above, please comment.
I think the way it works is that random accesses are slower if they need to change rows. A DRAM chip is basically a 2D grid of memory, with each row containing 2KB. Changing just the column you access is faster than changing the row. The DC has 4 SDRAM chips, and each one tracks its own open row. Each chip covers a different 4MB block of RAM. The SH4 can only access one chip at a time.

It's hard for me to come up with an explanation without diagrams.

Let's say you want to do a large memcpy, with src/dst on different rows, reading and writing a cache line at a time. If the source and destination are in the same bank (same 4MB area), you will row miss on every cache line read because you have to change rows on every access, but if they are in different banks, you usually row hit because each bank's current row can be reused.
Ian Robinson wrote: Thu Mar 30, 2023 10:11 pm Can you put a simple example to go along with this ? It looks to be what i needed and could not get to work on kos..
Do you mean the NONCACHED() macro or the OCINDEX stuff? NONCACHED is like this:

Code: Select all

    extern short framebuffer[640*480];

    draw_something(x, y, NONCACHED(framebuffer));
For OCINDEX like this:

Code: Select all

    //Call sh4EnableOCIndex() ahead of time. Or don't. It'll still work, but will be slower.

    static char workbuff[10*1024];

    void * oci_workbuff = OCI_BANK1(workbuff);
    transform_vects(positions, vcnt, oci_workbuff);
    do_lighting(normals, lightx, lighty, lightz, vcnt, oci_workbuff);
    copy_uvs(uvs, vcnt, oci_workbuff);
    submit_verts(indices, tricnt, oci_workbuff);
These users thanked the author TapamN for the post:
Ian Robinson
User avatar
ThePerfectK
Insane DCEmu
Insane DCEmu
Posts: 147
Joined: Thu Apr 27, 2006 10:15 am
Has thanked: 27 times
Been thanked: 35 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by ThePerfectK »

TapamN wrote: Sat Apr 01, 2023 2:16 am Not at the moment, but maybe in the future. I think it's better to keep stuff like that someplace like here, since the Internet Archive can find it and it's less likely to get lost than a proprietary, closed site.
I get that too; the vast majority of help I've gotten from DCEmulation has been from relatively ancient posts. The only reason I ask is that sometimes I have questions that would really benefit from an immediate back-and-forth, and I'd love to pick your brain sometime. Admittedly that's selfish, though.
TapamN wrote: Sat Apr 01, 2023 2:16 am It's hard for me come up with an explaination without diagrams.

Let's say you want to do a large memcpy, with src/dst on different rows, and you read/write a cacheline at a time. If the source and destination are in the same bank (same 4MB area), you will row miss every cacheline read because you have to row change every access, but if they are in different banks, you usually row hit because you can reuse each bank's current row.
Yeah, when I was working out the layout of the SDRAM, I drew grids on my whiteboard to conceptualize it. I understand the structure, and I think you understand what I want to do with it: instead of strictly operating in buffers of 16KB or less, I prefer to work with four 2KB window buffers on separate SDRAM banks.

If I build 2KB buffers with 16KB alignment, I can actually create multiple buffers, or one large linked list of 2KB nodes, on the same SDRAM bank that will not overlap the cache used by buffers on other SDRAM banks. I would incur a sense amplifier penalty (and cache miss) only on that SDRAM bank as I move between 2KB nodes; the other banks would be unaffected. Additionally, if I offset the starting address of each buffer depending on the SDRAM bank (i.e. bank 0 begins at offset 0, bank 1 at 0x800, bank 2 at 0x1000, bank 3 at 0x1800) in addition to the 16KB alignment, the buffers remain accessible without a sense amplifier penalty, and all of them also stay in cache, since each 2KB buffer falls right beside the others.

I prefer working with buffers of this size and spacing because not only do I avoid a sense amplifier penalty when accessing the banks, it also forces me to limit my transformations and other operations, like DMA vertex submissions, to a small window buffer, which helps with parallelization in theory. I'm referring to a few posts you made a long while back regarding your ideal T&L loop.

Combined with using OCRAM as scratch RAM, this is, to me, the best use of the Dreamcast's cached memory. Four 2KB window buffers across four SDRAM banks with no sense amplifier hits use 8KB of cache; using the other 8KB of cache would incur a sense amplifier penalty when first accessed, so this is the maximum amount of RAM we can work with at once with the bare minimum penalty. That leaves the other 8KB for scratch RAM, which works out perfectly.

Ultimately, the memory usage I'm describing is best suited to DMA vertex submission, where I'm trying to submit vertices in small batches while simultaneously working on the next batch. This is opposed to, say, using store queues, where the best approach would be to do all your work up front in a large buffer, then submit all the worked-on data at once in rapid SQ bursts.
These users thanked the author ThePerfectK for the post:
Ian Robinson
Still Thinking!~~
User avatar
Ian Robinson
DC Developer
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 209 times
Been thanked: 41 times

Re: How to define a variable at a specific memory location, and whether it's cached or not

Post by Ian Robinson »

TapamN wrote: Sat Apr 01, 2023 2:16 am
ThePerfectK wrote: Mon Mar 06, 2023 2:21 amI can't say enough how many posts from you I've come across over the years that are overflowing with great advice. Do you happen to reside on any chat servers, like IRC or discord? There are many times when I'd love to pick someones brains regarding if what I'm doing is going to thrash my cache or not, and all I can really rely on is dcprof.
Not at the moment, but maybe in the future. I think it's better to keep stuff like that someplace like here, since the Internet Archive can find it and it's less likely to get lost than a proprietary, closed site.
ThePerfectK wrote: Mon Mar 13, 2023 6:49 pm All that's to my best understanding, if there is anything that doesn't line up with the above, please comment.
I think the way it works is that random accesses are slower if they need to change rows. A DRAM chip is basically a 2D grid of memory, with each row containing 2KB. Changing just the column you access is faster than changing the row. The DC has 4 SDRAM chips, and each one tracks its own open row. Each chip covers a different 4MB block of RAM. The SH4 can only access one chip at a time.

It's hard for me to come up with an explanation without diagrams.

Let's say you want to do a large memcpy, with src/dst on different rows, reading and writing a cache line at a time. If the source and destination are in the same bank (same 4MB area), you will row miss on every cache line read because you have to change rows on every access, but if they are in different banks, you usually row hit because each bank's current row can be reused.
Ian Robinson wrote: Thu Mar 30, 2023 10:11 pm Can you put a simple example to go along with this ? It looks to be what i needed and could not get to work on kos..
Do you mean the NONCACHED() macro or the OCINDEX stuff? NONCACHED is like this:

Code: Select all

    extern short framebuffer[640*480];

    draw_something(x, y, NONCACHED(framebuffer));
For OCINDEX like this:

Code: Select all

    //Call sh4EnableOCIndex() ahead of time. Or don't. It'll still work, but will be slower.

    static char workbuff[10*1024];

    void * oci_workbuff = OCI_BANK1(workbuff);
    transform_vects(positions, vcnt, oci_workbuff);
    do_lighting(normals, lightx, lighty, lightz, vcnt, oci_workbuff);
    copy_uvs(uvs, vcnt, oci_workbuff);
    submit_verts(indices, tricnt, oci_workbuff);
Yes thank you TapamN OCINDEX :)
Post Reply