32-bit memory alignment

User avatar
Quzar
Dream Coder
Dream Coder
Posts: 7497
Joined: Wed Jul 31, 2002 12:14 am
Location: Miami, FL
Has thanked: 4 times
Been thanked: 9 times
Contact:

32-bit memory alignment

Post by Quzar »

When is this needed and why?

I know that this can cause problems in many different instances, but I don't really know why, where it can be a problem, and how to solve it in all cases.

For example, if I have a large array, and am at some point casting it to a pointer, do I need to worry about the resulting data being aligned, or would it definitely be aligned already?

edit: whoops, changed byte to bit :oops:
Last edited by Quzar on Mon Jan 24, 2005 4:45 pm, edited 1 time in total.
"When you post fewer lines of text than your signature, consider not posting at all." - A Wise Man
User avatar
Fosters
DCEmu Respected
DCEmu Respected
Posts: 141
Joined: Fri Mar 19, 2004 6:28 pm
Has thanked: 0
Been thanked: 0
Contact:

Post by Fosters »

What you mean is 32-bit.

If you have 8 bytes in a row, e.g.
ABCDEFGH
you can read each byte safely using an 8-bit read,
or you can read the 32-bit value ABCD or the 32-bit value EFGH,
but you can't read CDEF in one 32-bit read.

Of course you want to read as much as you can in one go for speed.
If you know you are always going to read at 32-bit boundaries, then you are OK.
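
Something like this in C, just to illustrate (untested sketch; the buffer and values are made up):

Code:

#include <stdint.h>

/* Hypothetical 8-byte buffer "ABCDEFGH", forced onto a 4-byte boundary. */
static uint8_t buf[8] __attribute__((aligned(4))) =
    { 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H' };

void reads(void)
{
    uint8_t  one  = buf[2];                  /* any single byte: always fine */
    uint32_t abcd = *(uint32_t *)&buf[0];    /* offset 0: 4-byte aligned, OK */
    uint32_t efgh = *(uint32_t *)&buf[4];    /* offset 4: 4-byte aligned, OK */
 /* uint32_t cdef = *(uint32_t *)&buf[2]; */ /* offset 2: misaligned 32-bit read -
                                                crashes on SH-4, slow on x86 */
    (void)one; (void)abcd; (void)efgh;
}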

HTH.
nymus
DC Developer
DC Developer
Posts: 968
Joined: Tue Feb 11, 2003 4:12 pm
Location: In a Dream
Has thanked: 5 times
Been thanked: 6 times

Post by nymus »

When programming in an HLL you don't have to worry about memory alignment. The compiler will detect when you are making a cast and compensate. When programming in asm, however, depending on the CPU, you'll either cause a memory access error (e.g. SH4) or slow things down 4x (e.g. IA-32?).

Imagine a wall made from bricks and how they stagger (one brick = 4 bytes). Sometimes you end up with a half-brick close to the edge and you have to mentally connect it with the next row to get a 4-byte brick. Cheap/efficient CPUs like the SH4 don't bother with this calculation; they tell you it's wrong and you have to make sure your bricks are complete.
behold the mind
inspired by Dreamcast
BlackAura
DC Developer
DC Developer
Posts: 9951
Joined: Sun Dec 30, 2001 9:02 am
Has thanked: 0
Been thanked: 1 time

Post by BlackAura »

Well, that's not strictly true. You can run into alignment problems when using a high level language, but it's not very common. If you're dealing with data that's been packed together fairly tightly (maybe from a file or something), load it into some random block of memory, and then try accessing it using pointer arithmetic, it's going to cause problems. Similarly, if you're trying to copy memory from one place to another using 32-bit values at a time (which is faster) but don't take alignment into consideration, then that'll cause problems too (4x slowdown, or crash, as you said).

Actually, you can only run into those problems if you're doing pointer arithmetic, or being careless with casting (say, casting a void * to an int * without checking whether it's aligned correctly).

Anyway, the simple rule: 1-byte values can be stored anywhere. 2-byte values must be stored at an address that's a multiple of 2. 4-byte values must be stored at an address that's a multiple of 4. The compiler will handle alignment issues by placing things correctly in memory, and KallistiOS will always allocate memory on a 4-byte boundary anyway. So, unless you're doing insane levels of pointer arithmetic or accessing an arbitrary block of data as shorts or ints, you shouldn't need to worry about it too much.
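
For the careless-cast case, a guard looks roughly like this (untested sketch; the function names are made up, and the byte order assumes the Dreamcast's little-endian SH-4):

Code:

#include <stdint.h>

/* Returns non-zero if ptr can safely be read as a 32-bit value. */
static int is_aligned_4(const void *ptr)
{
    return ((uintptr_t)ptr & 3) == 0;
}

uint32_t read_u32(const void *ptr)
{
    if (is_aligned_4(ptr))
        return *(const uint32_t *)ptr;   /* aligned: one 32-bit load */

    /* Misaligned: fall back to byte loads and reassemble (little-endian). */
    const uint8_t *b = (const uint8_t *)ptr;
    return (uint32_t)b[0] | ((uint32_t)b[1] << 8) |
           ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}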

There is another kind of alignment, which involves aligning data structures on cache lines (32 bytes). If you are dealing with data that's aligned in 32-byte blocks, you can get away with only one cache miss for all the items in that block. If you're sure that block is not going to be cached when you access it the first time, you can also do a prefetch to fetch the data from memory before it's accessed. That can help speed things up quite a lot, because a prefetch is less expensive (in terms of CPU cycles) than a cache miss.
User avatar
Quzar
Dream Coder
Dream Coder
Posts: 7497
Joined: Wed Jul 31, 2002 12:14 am
Location: Miami, FL
Has thanked: 4 times
Been thanked: 9 times
Contact:

Post by Quzar »

When creating variables, is there any point where I would need to align them manually?

like this?

Code:

U8		subcpu_memspace[65536] __attribute__((aligned(32)));
?
"When you post fewer lines of text than your signature, consider not posting at all." - A Wise Man
BlackAura
DC Developer
DC Developer
Posts: 9951
Joined: Sun Dec 30, 2001 9:02 am
Has thanked: 0
Been thanked: 1 time

Post by BlackAura »

Yes. If you want to align them to a cache line for whatever reason. Genesis Plus currently does this. Video memory is aligned on a 32-byte boundary, and all tiles are exactly 32 bytes long, and start on a 32-byte boundary in video memory. At the end of each frame, I need to convert all modified tiles to PVR textures and upload them. I can then simply prefetch the appropriate tile, which is quicker than letting a cache miss happen (which pretty much stalls the CPU until the required data has been fetched).
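
Not the actual Genesis Plus code, but the pattern looks roughly like this (untested sketch; it uses GCC's __builtin_prefetch, which turns into an SH-4 pref instruction, and the names and tile size are made up):

Code:

#include <stdint.h>

#define TILE_BYTES 32   /* one tile == one cache line */

/* Prefetch the next dirty tile while converting the current one, so its
   data is already in the cache by the time we get to it. */
void convert_dirty_tiles(const uint8_t *tile_mem, const int *dirty, int count,
                         void (*convert)(const uint8_t *tile, int index))
{
    int i;

    for (i = 0; i < count; i++) {
        const uint8_t *tile = tile_mem + dirty[i] * TILE_BYTES;

        if (i + 1 < count)
            __builtin_prefetch(tile_mem + dirty[i + 1] * TILE_BYTES);

        convert(tile, dirty[i]);
    }
}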
nymus
DC Developer
DC Developer
Posts: 968
Joined: Tue Feb 11, 2003 4:12 pm
Location: In a Dream
Has thanked: 5 times
Been thanked: 6 times

Post by nymus »

Thanks for the correction/verification, Blackaura.

Does gcc align structs by the largest member or by the total size closest to a cache line? I've seen code in gcc dealing with the cache so I thought it was meant to relieve the programmer of the task of specifying alignment for user-defined (not hardware-imposed) data structures just to optimize access.

Continuing this discussion (feel free to correct/add etc)...

Alignment issues generally result from hardware limitations/design, and when programming in asm, every section of code, every section of data, every pointer operation and every branch needs to be checked for alignment. As such, we could probably just say that alignment is the largest factor determining whether a seemingly correct program runs, runs quickly, or crashes.

An HLL takes away the burden of aligning functions and user-defined structures (generic ones like "point"). When accessing hardware-defined structures, you cannot ignore alignment and have to tell the compiler to align the data according to the hardware's specifications, e.g. a store queue can only operate on 32-byte aligned blocks.

Specific user-defined structures generally specify alignment or packing to dictate arrangement in memory or in a file or, like BA said, to optimize specific cases (again, doesn't the compiler do this?). Taking your example of a tile, if you allocate memory for a tile dynamically and KOS gives you a 4-byte aligned block, doesn't this defeat the purpose of specifying 32-byte alignment in the code? Wouldn't you, in this case, have to call new/malloc and then manually derive an aligned pointer from the result, as you would for a store queue? Or alternatively have a statically allocated, 32-byte aligned block created by gcc and use that as a staging area...

As far as pointer casts are concerned, I think these should be viewed as programming errors because every pointer access raises the opportunity for a misaligned access.
Last edited by nymus on Wed Jan 26, 2005 8:53 am, edited 1 time in total.
behold the mind
inspired by Dreamcast
User avatar
fox68k
DC Developer
DC Developer
Posts: 49
Joined: Tue Aug 03, 2004 11:01 am
Has thanked: 0
Been thanked: 0
Contact:

Post by fox68k »

This topic is really interesting :D These guys are really cool!!

Well, I have a few questions:

1) When I want to align data to a 32-byte boundary in asm, I write this:

.align 32

And the assembler then replies: "Warning: alignment too large: 15 assumed".

I guess the value given to the .align directive is a byte count. What's wrong? Why?

2) How can I know whether a memory region on the DC is cached or not? That way I will know when I should use SQs.

3) If I know I will access some data frequently, should I load it into the cache before accessing it? Will it be worth it?

Thank you very much in advance!
- fox68k -
nymus
DC Developer
DC Developer
Posts: 968
Joined: Tue Feb 11, 2003 4:12 pm
Location: In a Dream
Has thanked: 5 times
Been thanked: 6 times

Post by nymus »

fox68k wrote: And the assembler then replies: "Warning: alignment too large: 15 assumed".
Could this be what is confusing me?
[edit] I seriously need to re-read the as manual concerning alignment...

The store queues do not care whether or not you are in a cached region. All that matters is that the region you are writing to is in their access range and that the cache is configured to allow access to store queues (from user mode, through the MMU, enabled/disabled, etc.).

As far as cache efficiency is concerned, I too have my concerns, as expressed in my earlier post, and if I recall correctly, Rand Linden mentioned something about it being pointless to try to optimize cache accesses in an HLL. Too complex??

If you can, then it will definitely help. That's why the cache exists and that's why the SH4 has the special cache instructions. However, I think these are meant for "occasional" access as opposed to frequent access.

My reasoning:
If your program is accessing a region of memory frequently, e.g. in a loop, then the first couple of accesses will have to create a "cache context" by fetching the data from memory. The CPU will do this automatically without your intervention (i.e. without a manual prefetch instruction), and from then on all your data will be in the cache and you can loop through it 10, 100 times without worrying/stalling.

However, if you need to access a single block of memory once before or after the loop, then your program will hit a wall at high speed and stall. The data you had in the cache would probably be sent back to main memory, and when it's time for your loop to start over, it will need to fetch the frequently accessed data once again.

What do you think?
behold the mind
inspired by Dreamcast
User avatar
fox68k
DC Developer
DC Developer
Posts: 49
Joined: Tue Aug 03, 2004 11:01 am
Has thanked: 0
Been thanked: 0
Contact:

Post by fox68k »

nymus wrote:
fox68k wrote: And the assembler then replies: "Warning: alignment too large: 15 assumed".
Could this be what is confusing me?
[edit] I seriously need to re-read the as manual concerning alignment...
Surprisingly enough, AS allows me to align to a 32-byte boundary with .balign but not with .align. I have looked at the AS manual and, as you pointed out, they are equivalent on SH platforms. :?:
nymus wrote: The store queues do not care whether or not you are in a cached region. All that matters is that the region you are writing to is in their access range and that the cache is configured to allow access to store queues (from user mode, through the MMU, enabled/disabled, etc.).

[...]

What do you think?
Yes, I think you are right. That is how caches work.
Anyway, there is a feature you do not talk about there: cacheable and non-cacheable areas. If I understand the SH4 docs correctly, some areas are not cacheable, that is, when the CPU accesses them, it does not fill the cache with their contents. This could be useful when we have memory-mapped hardware registers, but it could be a serious bottleneck if I do not take it into account. In this case, there will be a cache miss each time I access the data, right? So I should do a prefetch before accessing the data. So whether or not a memory region is cacheable is important.

I guess dealing with caches in an HLL is very hard since you do not have total control over the generated code. In effect, you have to trust your compiler most of the time.

I would like somebody to confirm this.
- fox68k -
BlackAura
DC Developer
DC Developer
Posts: 9951
Joined: Sun Dec 30, 2001 9:02 am
Has thanked: 0
Been thanked: 1 time

Post by BlackAura »

Oh, how I miss KDE (posting this from a Windows laptop...). I'm actually having to use Notepad to type this response, because I want to see what I'm typing. None of the standard keyboard shortcuts (like ctrl+backspace) work, and Windows doesn't let me set Notepad to always on top. And I'm missing select-copy, middle-click-paste. I've tried to do that at least four times already. And Konqueror's built-in spell checker. I feel completely crippled here.

Anyway, on with the show...
nymus wrote:Does gcc align structs by the largest member or by the total size closest to a cache line? I've seen code in gcc dealing with the cache so I thought it was meant to relieve the programmer of the task of specifying alignment for user-defined (not hardware-imposed) data structures just to optimize access.
As far as I'm aware, GCC doesn't know anything about the cache configuration of the target machine. It does know that unaligned 16- or 32-bit accesses will crash on an SH-4, or will run much slower on an x86, so it'll automatically pad data structures out to the appropriate size. For example, if you try this:

Code:

struct wossname
{
	char thing;
	int other_thing;
	short wossname;
	char foo;
	float bar;
};
Stuff should end up arranged like this:

Code:

struct wossname
{
	char thing;
	char padding[3];
	int other_thing;
	short wossname;
	char foo;
	char padding2;
	float bar;
};
Well, at least in theory. There is a standard way of padding things (it's part of the platform ABI, so the compiler takes care of it). It should do the same thing (probably resulting in the same layout) on x86 systems, but for speed reasons instead of crash reasons. That is, of course, unless you specify packing.
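
An easy way to check what GCC actually did is to print the offsets with offsetof (standard C); on SH-4 or 32-bit x86 it should report the padded layout shown above:

Code:

#include <stdio.h>
#include <stddef.h>

struct wossname
{
	char thing;
	int other_thing;
	short wossname;
	char foo;
	float bar;
};

int main(void)
{
	printf("sizeof      = %u\n", (unsigned)sizeof(struct wossname));
	printf("other_thing = %u\n", (unsigned)offsetof(struct wossname, other_thing));
	printf("wossname    = %u\n", (unsigned)offsetof(struct wossname, wossname));
	printf("foo         = %u\n", (unsigned)offsetof(struct wossname, foo));
	printf("bar         = %u\n", (unsigned)offsetof(struct wossname, bar));
	return 0;
}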
nymus wrote: Taking your example of a tile, if you allocate memory for a tile dynamically and KOS gives you a 4-byte aligned block, doesn't this defeat the purpose of specifying 32-byte alignment in the code? Wouldn't you, in this case, have to call new/malloc and then manually derive an aligned pointer from the result, as you would for a store queue? Or alternatively have a statically allocated, 32-byte aligned block created by gcc and use that as a staging area...
The memalign function does that for you. Something like:

Code:

buffer = memalign(32, 0x10000);
That's pretty much how I allocate the tile buffer in Genesis Plus. I'm not sure how it works now, but in older versions of KOS that function allocated a bit more memory than you told it to, and returned the first aligned pointer within that block. It's actually present in POSIX as well, although I think it has a different name. Can't look up the man pages at the moment, so...
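
If memory serves, the POSIX spelling of the same idea is posix_memalign; roughly (untested sketch, function name around it is made up):

Code:

#include <stdlib.h>

void *alloc_tile_buffer(void)
{
    void *buffer = NULL;

    /* 64KB, aligned to a 32-byte boundary; returns 0 on success. */
    if (posix_memalign(&buffer, 32, 0x10000) != 0)
        return NULL;

    return buffer;
}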
nymus wrote: As far as pointer casts are concerned, I think these should be viewed as programming errors because every pointer access raises the opportunity for a misaligned access.
For 99.9% of applications, I would agree with you there. Emulators are the only exception I can think of, and even then only on x86 systems. If you have a large block of memory, you could access any 16- or 32-bit values directly, either by pointer casting in C, or using the appropriate read instruction in assembly. For any other application, you shouldn't be needing pointer casts, and if you do then you need to fix your design. For emulators, you have little control over the data structures you're dealing with, so...
fox68k wrote: 2) How can I know whether a memory region on the DC is cached or not? That way I will know when I should use SQs.
Main memory is cached. Video memory isn't. Neither is sound memory.

Writing to video memory manually is a bad idea. To write a single value, the main CPU needs to read a 32-byte block from video memory, modify it, and write it back. Considering that VRAM is uncached, that's going to be slow. And it is. To write an entire store queue at once, the CPU only has to perform one write operation, and the actual burst happens in parallel with the CPU continuing to run. If you're generating data and writing it to VRAM, store queues are just about the most efficient way to do it. If your data is already lying around in main memory, DMA is better.

Sound RAM is a different beast. Writing single values is also very, very slow, for the same reasons, and on top of that sound RAM itself is very slow. If you try to use store queues, you'll end up blocking the main CPU. The best bet is to write to a buffer in main RAM, and DMA the entire thing over when you're done.

You can use store queues to write to main memory too. The store queue completely bypasses the cache, so if you're going to access that memory after writing it, you'll need to worry about cache consistency issues. I've not benchmarked this, so I don't know if it'd be any faster or slower than normal memory access. After all, main memory is cached. I'd guess that you'd make performance gains if using store queues would allow you to avoid clobbering the cache; say, copying data from one place to another that would end up in the same cache lines.
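
As a concrete sketch of the "generate data and push it to VRAM" case, assuming KOS's sq_cpy() from <dc/sq.h> and the vram_s pointer from <dc/video.h> (untested, and the exact constraints are from memory, so check the headers):

Code:

#include <dc/sq.h>
#include <dc/video.h>
#include <stdint.h>

/* Staging buffer in main RAM: 32-byte aligned, and 1280 bytes is a
   multiple of 32, which is what the store queues want. */
static uint16_t scanline[640] __attribute__((aligned(32)));

void push_scanline(int y)
{
    /* ... fill scanline[] with pixel data here ... */

    /* Burst the whole line to VRAM through the store queues. The
       destination must be 32-byte aligned and the byte count a
       multiple of 32. */
    sq_cpy(vram_s + y * 640, scanline, sizeof(scanline));
}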
fox68k wrote: 3) If I know I will access some data frequently, should I load it into the cache before accessing it? Will it be worth it?
The prefetch instruction always consumes the same number of cycles. If the data you're prefetching isn't already in the cache, it will be fetched, so it will be in the cache by the time you actually access it.

If the data is already in the cache, a prefetch instruction will uselessly consume CPU cycles. If the data isn't in the cache, a prefetch instruction will still consume the same number of CPU cycles, but it will prevent a stall the first time you try to access it. A cache miss would likely cause a pipeline stall, which wastes a lot more CPU cycles than the prefetch instruction does.

So, it's worth it if you're fairly confident that the data is not in the cache already. And if you know what you'll need before it's actually needed.
nymus wrote: As far as cache efficiency is concerned, I too have my concerns, as expressed in my earlier post, and if I recall correctly, Rand Linden mentioned something about it being pointless to try to optimize cache accesses in an HLL. Too complex??
Kinda... Attempting to optimize for the code cache is absolutely pointless in a high-level language, because you have very little control over where the compiler / assembler / linker puts things. It's possible to partially optimize for data cache access, but not a lot. All you can really do is make sure you're not doing anything seriously bad (like repeatedly clobbering the cache), align data structures to 32 byte boundaries, possibly make sure data structures fit within a single cache line, do some prefetching, and try not to do any more main memory access than is absolutely necessary. For anything else, you'd have to be writing in assembly, and manually fiddling around with things.
fox68k wrote: I would like somebody to confirm this.
Yep.
User avatar
Quzar
Dream Coder
Dream Coder
Posts: 7497
Joined: Wed Jul 31, 2002 12:14 am
Location: Miami, FL
Has thanked: 4 times
Been thanked: 9 times
Contact:

Post by Quzar »

Thanks to all of you. I hoped that I could get a good thread with consolidated info on the topic, so that when I get to a point where I can understand it, I will have a place to go.

That being said, continue with the discussion!
"When you post fewer lines of text than your signature, consider not posting at all." - A Wise Man