SH4 assembly function call in C

Post by Newbie »

Hi everyone,

I am playing around with calling SH4 assembly functions from C today.
My goal is simply to use inline assembly in a C program to call a KOS function.

Below is a sample that calls the function in pure C.
"_icache_flush_range" is used as the test function; calling it with 0 / 0 as parameters does nothing.

I use some pragmas so that optimization does not get in the way.

This C code works perfectly.

Code: Select all


	#pragma GCC push_options
	#pragma GCC optimize ("O0")

	int main(int argc, char* argv[])
	{	
		icache_flush_range(0, 0);
		
		return 0;	
	}

Now I try to execute the same function with the same parameters by inlining assembly in the C main function.

You can see a sample of my assembly code below.

I just use r4 and r5 to hold the parameters, then load the address of "_icache_flush_range" into r1 before jumping to it.

After the function returns, a BRA skips over the words that store the function address, so they are not executed by accident.

Code: Select all


	8c010252:	00 e4       	mov	#0,r4
	8c010254:	00 e5       	mov	#0,r5
	8c010256:	02 d1       	mov.l	8c010260 <_main+0x20>,r1	! 8c028720 <_icache_flush_range>
	8c010258:	0b 41       	jsr	@r1
	8c01025a:	09 00       	nop	
	8c01025c:	05 a0       	bra	8c01026a <END>
	8c01025e:	09 00       	nop	
	8c010260:	20 87       	.word 0x8720
	8c010262:	02 8c       	.word 0x8c02
	8c010264:	09 00       	nop	
	8c010266:	09 00       	nop	
	8c010268:	09 00       	nop	
	8c01026a <END>:

To use this code, I simply put it in my C main function as shown below.

Code: Select all


	#pragma GCC push_options
	#pragma GCC optimize ("O0")
	
int main(int argc, char* argv[])
{	
	__asm__ __volatile__(".BYTE 0x00");__asm__ __volatile__(".BYTE 0xe4");
	__asm__ __volatile__(".BYTE 0x00");__asm__ __volatile__(".BYTE 0xe5");
	__asm__ __volatile__(".BYTE 0x02");__asm__ __volatile__(".BYTE 0xd1");
	__asm__ __volatile__(".BYTE 0x0b");__asm__ __volatile__(".BYTE 0x41");

	__asm__ __volatile__(".BYTE 0x09");__asm__ __volatile__(".BYTE 0x00");
	__asm__ __volatile__("BRA END");

	__asm__ __volatile__(".BYTE 0x09");__asm__ __volatile__(".BYTE 0x00");
	__asm__ __volatile__(".BYTE 0x20");__asm__ __volatile__(".BYTE 0x87");
	__asm__ __volatile__(".BYTE 0x02");__asm__ __volatile__(".BYTE 0x8c");
	__asm__ __volatile__(".BYTE 0x09");__asm__ __volatile__(".BYTE 0x00");
	__asm__ __volatile__(".BYTE 0x09");__asm__ __volatile__(".BYTE 0x00");
	__asm__ __volatile__(".BYTE 0x09");__asm__ __volatile__(".BYTE 0x00");
	__asm__ __volatile__("END:");

	return 0;	
}

When I compile and run it, a fantastic exception occurs ...

Unhandled exception: PC 8c010278, code 1, evt 00e0
R0-R7: 00000000 00000000 00001fe0 fffffc00 00000020 00000000 f0000000 00000000
R8-R15: 00000000 00000000 00000000 00000000 00000000 00000000 00000007 00000007
SR 40000101 PR 8c01025c
Stack Trace: frame pointers not enabled!
kernel panic: unhandled IRQ/Exception
arch: aborting the system

I am certainly missing something important (one or several instructions), but I did not find any information about it in the Hitachi manuals.

Thanks for any kind of help.

Post by BlueCrab »

There's a number of problems with what you're trying to do in this code... Not trying to be mean, just pointing it out ahead of time.

You can't just inline bytes into the code and expect that to work... You're relying on everything to be exactly as it is in your compiled example that you disassembled, which is not all that likely to happen -- especially since it doesn't even know that you're trying to call the function you're trying to call...

Not only that, but GCC doesn't know that you're changing registers or anything else like that in the middle of the pre-assembled assembly you're trying to put in there, which will certainly break things if the example is any more than the trivial thing you're doing. You're definitely going to confuse GCC's code output with what you're doing.

If you're trying to inline some assembly, why not ... actually inline the assembly code?

Why not do something like this (I probably screwed something up in here, but it is close enough to demonstrate):

Code: Select all

__asm__ __volatile__(
    "mov #0, r4\n\t"
    "mov #0, r5\n\t"
    "mov.l label, r1\n\t"
    "jsr @r1\n\t"
    "nop\n\t"
    "bra out\n\t"
    "nop\n\t"
    ".balign 4\n"
    "label:\n\t"
    ".long _icache_flush_range\n"
    "out:\n\t"
    : /* no outputs */
    : /* no inputs */
    : "r1", "r4", "r5", "pr", "memory" ); /* clobbers go after the third colon; jsr also overwrites pr */

That all said, it is NOT safe to call functions with inline assembly in general. GCC won't know that those functions might clobber registers and won't back up the register values to the stack ahead of time, which will break things in all but the most trivial of situations. You could potentially make it safe by listing every single call-clobbered register in the clobber list, but by the time you've done that, you've given up any potential advantage you might have gained by using the assembly code in the first place.
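
If the goal is only to call a function through its address, one alternative is to let GCC emit the call itself through a C function pointer, so it knows a call happens and handles the call-clobbered registers and PR on its own. The sketch below is only an illustration: `flush_fn`, `fake_flush` and `call_through_pointer` are hypothetical names, and `fake_flush` is a stand-in for the real KOS function so that the code builds anywhere.

```c
#include <stdint.h>

/* Hypothetical signature, modelled on the icache_flush_range(0, 0) call above. */
typedef void (*flush_fn)(uint32_t start, uint32_t count);

int flush_calls;                 /* counts calls so the effect is observable */
uint32_t last_start, last_count; /* arguments seen by the last call */

/* Hypothetical stand-in for the real function. */
void fake_flush(uint32_t start, uint32_t count) {
    flush_calls++;
    last_start = start;
    last_count = count;
}

/* An ordinary indirect call: the compiler loads the address, sets up the
   arguments and saves whatever it needs around the call by itself. */
void call_through_pointer(flush_fn fn) {
    fn(0, 0);
}
```

Calling `call_through_pointer(fake_flush)` runs the stand-in once with both arguments set to 0; on the real target the pointer would be the address of the function you actually want to reach.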

Post by Newbie »

Hi,

There's a number of problems with what you're trying to do in this code... Not trying to be mean, just pointing it out ahead of time.
You are always welcome! :)
Thank you for responding to my questions; your remarks are always very useful.
It is brilliant to provide the flush, like the SH4A instruction, as a function in KOS: thank you very much.

That said, I just want to dig deeper into how it works (I am curious).

I know it may seem strange to ask for such a thing (calling a function directly with hard-coded opcodes).

Call it a sort of research or study: understanding and testing.

Call it "hacking" in the good sense of the word.

To that end, I ran some tests with the code I wrote above.

If I replace the "JSR @R1" instruction with a NOP, the code works and does not crash: this means that using registers R4 and R5 (even hard coded) is not the problem.

I also tested JSR with register R12 (instead of R1) and it crashes the same way: so it does not depend on which register is used for the jump (there is no specially reserved register).

I then examined the function prologue and epilogue of the two versions: the working one (the classical way to call a function in C) and the bad one (with the hard-coded opcodes).



This is what I found: the working code uses the STS / LDS instructions, which are not present in the crashing one.
STS
This instruction stores system register MACH, MACL, or PR in the destination.
STS.L PR,@-Rn    Rn - 4 -> Rn, PR -> (Rn)

LDS
Stores the source operand into the system registers MACH, MACL, or PR.
LDS.L @Rm+,PR    (Rm) -> PR, Rm + 4 -> Rm

In the prologue, it pushes PR onto the stack through R15.

In the epilogue, it pops that value from the stack back into PR.

This is the only difference between the crashing attempt and the good old working code.

Does this help explain what GCC does behind the scenes that I did not do in my code?

Thanks for help.
Attachments: PROLOG.JPG, EPILOG.JPG

Post by BlueCrab »

PR is the "procedure return" register. It stores where the rts opcode returns to at the end of the function. The jsr and bsr opcodes change that register to point to the opcode immediately after the delay slot. Saving and restoring it would be pretty important in general. :wink:
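
To make the PR behaviour concrete, here is a toy model in plain C (not SH4 code). It tracks only PR and a tiny stack, and the addresses 0x100 and 0x200 as well as all the `toy_*` names are made up for the illustration. It shows why a function that itself uses jsr has to save PR first.

```c
#include <stdint.h>
#include <stdbool.h>

/* Toy model: only PR and a small descending stack, as addressed through r15. */
typedef struct {
    uint32_t pr;        /* procedure return register */
    uint32_t stack[4];  /* memory reached through the stack pointer */
    int sp;             /* index of the current top of stack (grows downward) */
} toy_cpu;

/* jsr: record where execution resumes after the call. */
void toy_jsr(toy_cpu *cpu, uint32_t return_addr) { cpu->pr = return_addr; }

/* rts: the address execution continues at. */
uint32_t toy_rts(const toy_cpu *cpu) { return cpu->pr; }

/* Models caller -> outer -> inner and reports where outer's rts ends up. */
uint32_t outer_returns_to(bool save_pr) {
    toy_cpu cpu = { .pr = 0, .stack = {0}, .sp = 4 };

    toy_jsr(&cpu, 0x100);              /* caller calls outer; 0x100 is the way back */
    if (save_pr)
        cpu.stack[--cpu.sp] = cpu.pr;  /* like sts.l pr,@-r15 */

    toy_jsr(&cpu, 0x200);              /* outer calls inner; PR is overwritten */
    (void)toy_rts(&cpu);               /* inner returns to 0x200, inside outer */

    if (save_pr)
        cpu.pr = cpu.stack[cpu.sp++];  /* like lds.l @r15+,pr */
    return toy_rts(&cpu);              /* outer's own rts */
}
```

With `save_pr` set, outer gets back to 0x100 as intended. Without it, outer's rts lands on 0x200, its own return point from the inner call, which matches the crash dump above where PR still holds 8c01025c, the address right after the hard-coded jsr and its delay slot.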

That said, even with doing that, it is not safe (in general) to call functions via inline assembly, since GCC doesn't know that you're doing that. You will end up clobbering registers that GCC isn't expecting to be clobbered, which will break things in all but the most trivial of cases. This all boils down to the C calling convention that is employed.

Basically, things work out like this by default on SH4 (basic integer registers only -- not including floating point or system registers in this discussion):
Call clobbered registers: r0, r1, r2, r3, r4, r5, r6, r7
Call preserved registers: r8, r9, r10, r11, r12, r13, r14, r15
Temporary variables (within a function): r0, r1, r2, r3
Parameters (for passing to a function): r4, r5, r6, r7 (the rest go on the stack)
Return values: r0 (and r1, if you're returning a 64-bit value)

GCC expects that call-clobbered registers will potentially be different before and after a function call. So, if the values in them are important (and GCC knows that you're calling a function), then GCC will automatically back up those values (by pushing them onto the stack). If GCC doesn't know you're calling a function, it can't know that it is important to do that, unless you tell it that you're clobbering those registers.

Call preserved registers must be the same before a function call and after it returns. That is to say, if a function uses those registers, it is responsible for backing up and restoring the values in them.

In your example, you're calling icache_flush_range without backing up any of the call-clobbered registers. That function actually modifies every single call-clobbered register within its code.
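
As a small illustration of the register list above, the following plain C functions are annotated with where their arguments and results would live under that convention. The function names are hypothetical, and the register comments simply restate the list; this is a sketch, not compiler output.

```c
#include <stdint.h>

/* Six int arguments: following the list above, a..d would travel in r4..r7
   and e, f would be passed on the stack. The result comes back in r0. */
int sum6(int a, int b, int c, int d, int e, int f) {
    return a + b + c + d + e + f;
}

/* A 64-bit result: following the list above, it comes back in the r0/r1 pair. */
int64_t widen_mul(int32_t x, int32_t y) {
    return (int64_t)x * (int64_t)y;
}
```

Compiling functions like these with `-S` is an easy way to check the convention against what GCC really emits for your toolchain.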

Post by nymus »

*Oops! beaten by Bluecrab :)
Hi. I don't have much experience with assembly but I don't mind offering some input. The best way to learn how assembly and C interact is to code it in the "standard" way, i.e. using separate .c and .s files as follows:

Please note that this code "could" be incorrect. I'm just using it to show general concepts.

*correction: do not store pr in delay slot...

Code: Select all

//main.c

// our assembly function
extern void call_icache_flush(void);

int main(int argc, char *argv[])
{
    call_icache_flush();

    return 0;
}

Code: Select all

! call_icache_flush.s
.global _call_icache_flush ! make it visible to the linker
.balign 2 ! instructions must be 2-byte aligned
_call_icache_flush: ! note the underscore: the assembly symbol is the C name with a leading underscore
    mov.l    call_addr, r0    ! convention is to use r0-r3 for local and return variables, r4-r7 for parameters (I think)
    mov      #0, r4           ! we have no real parameters, so we zero the registers icache_... reads (r4, r5)
    mov      r4, r5           ! you could do it sequentially like Bluecrab did above; this just demonstrates first steps of optimization
    sts.l    pr, @-r15        ! this function was called by another using jsr so our caller's address is in 'pr'. we have to save it
    jsr      @r0              ! the instruction after a jsr sits in the sh4 "delay slot", which is executed after every branch/jump
    nop                       ! delay slot
jsr_return:                   ! jsr returns here, so the cpu saves this address in pr before calling r0 (icache_...) so it can come back
    lds.l    @r15+, pr        ! phew! since we saved our pr, we know the address of our caller so let's restore pr
    rts                       ! cpu returns to the address stored in pr
    nop
.balign 4                     ! data must be 4-byte aligned
call_addr: .long _icache_flush_range

r15 is used as a "stack pointer." It points to a reserved area of memory where any function can store its state if it wants to call another function or if it needs to reuse registers. The convention is to store with a pre-decrement, e.g. sts.l pr, @-r15, which decrements the address by 4 bytes and then stores the data, so that the next store does not overwrite it. When restoring, we use lds.l @r15+, pr, which does the opposite.

In our case, we don't want to pass parameters to icache_.... and we don't use any registers. If we did, we would save the registers (r0-r7) that we are using because we must assume that the function we call will use those registers and overwrite our data.

Code: Select all

// main.c
int add2(int a, int b) {
    int c = a + b;
    return c;
}
// add2 above will expect a and b in r4 and r5 respectively
// it will add r4 to r5 and store the result in r0 before returning

// our add3 is written in assembly and calls add2
extern int add3(int, int, int);

int main(int argc, char *argv[]) {
    add3(5, 6, 7);
    return 0;
}

Code: Select all

! avg.s
.global _add3
.balign 2
_add3: ! we expect 3 parameters (r4, r5, r6). we call add2 for the last two, then add the first
    mov.l   r4, @-r15         ! we don't want to lose our first parameter, so we save it
    mov     r5, r4            ! our second parameter (6) is the first parameter to 'add2' so we move it to r4
    mov     r6, r5            ! our third parameter (7) is the second parameter to 'add2' so we move it to r5
    mov.l   add2_addr, r0     ! get the address of 'add2'
    sts.l   pr, @-r15         ! our own jsr below overwrites pr, so we save our caller's return address first
    jsr     @r0               ! we are now calling add2(6, 7)
    nop
    ! note, we saved our r4, so our stack has pr (the return address) on top and r4 behind it.
    ! r0 has the answer of 6 + 7
    lds.l   @r15+, pr         ! first restore pr
    mov.l   @r15+, r4         ! restore our first parameter (5)
    rts                       ! we use the delay slot. note we restored our return address. if not we would be lost
    add     r4, r0            ! we add 5 to 13. r0 now has 18

.balign 4
add2_addr: .long _add2

If all the code above was in one assembly file and we knew that a "label" we call does not call another label itself, then that label would not need to save pr. Having lots of code in a single .s file allows us to use r0-r7 freely as long as we can keep track of the registers we use. Saving to memory is expensive, but we have to do it judiciously; otherwise functions from different files would not know where to return and would overwrite each other's data.
Last edited by nymus on Sat Jul 02, 2016 3:24 pm, edited 1 time in total.
behold the mind
inspired by Dreamcast

Post by nymus »

A quick note: You should experiment using library code, not kernel code. The icache flushing code likely requires privileged mode so calling it from your user code is not advisable.

Try using the examples by modifying them to use assembly calls instead of c. Try looking at the assembly output of gcc using simple functions (no optimization shows every step) (gcc -S file.c) or calling library functions like printf using assembly:

*correction: should send address of string to printf

Code: Select all

// main.c
#include <stdio.h>

int main(int argc, char *argv[]) {
    const char *name = "Newbie";
    printf("Hello. My name is %s\n", name);

    return 0;
}

Code: Select all

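! main.s
.global _main
.balign 2
_main:
    mov.l   printf_addr, r0
    mov.l   hello_addr, r4    ! first parameter: address of the format string
    mov.l   name_addr, r5     ! second parameter: address of the name string
    sts.l   pr, @-r15
    jsr     @r0
    nop

! we're back
    lds.l   @r15+, pr
    rts
    mov     #0, r0            ! return 0 (in the delay slot)
.balign 4
printf_addr: .long _printf
name_addr: .long name
hello_addr: .long hello
name: .asciz "Newbie"         ! .asciz appends the terminating NUL that printf expects
hello: .asciz "Hello. My name is %s\n"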
Last edited by nymus on Sat Jul 02, 2016 3:31 pm, edited 1 time in total.

Post by Chilly Willy »

Gcc can handle clobber registers and memory on the SH just like the x86. That said, inline assembly in gcc is not all that fun. It's much easier and cleaner to make an assembly FILE that holds the pure assembly functions. For example, make a test.s file with this

Code: Select all

        .text
        .align  4

! int SetSHSR(int level);
! On entry: r4 = new irq level
! On exit:  r0 = old irq level

        .global _SetSHSR
_SetSHSR:
        stc     sr,r1
        mov     #0x0F,r0
        shll2   r0
        shll2   r0
        and     r0,r1                   /* just the irq mask */
        shlr2   r1
        shlr2   r1
        not     r0,r0
        stc     sr,r2
        and     r0,r2
        shll2   r4
        shll2   r4
        or      r4,r2
        ldc     r2,sr
        rts
        mov     r1,r0
This gives you a single function with the following c declaration

extern int SetSHSR(int level);

Note that it's in the .code segment (placed in .text), and assembly entries meant to be accessed via C need a '_' in front. You get two ways of doing comments in SuperH assembly files: ! full-line comment, and /* comment */ the same as in c.

Note: if the file extension is .s, the file must be completely assembly. If you give it the extension .S, it will be passed through the CPP (preprocessor) allowing you to do things like #define symbol value and structs and similar things.

Post by BlueCrab »

nymus wrote: A quick note: You should experiment using library code, not kernel code. The icache flushing code likely requires privileged mode so calling it from your user code is not advisable.
KOS doesn't actually provide a "user" mode. Lots of things would break horribly if we did. :wink:

Post by Chilly Willy »

Almost all early consoles and many computers avoided user mode if such a mode was available. The Genesis ran in supervisor state on the 68000. So did the Atari ST. In fact, the user stack pointer on the ST was often used as a pseudo-DMA pointer for sound samples. The Apple Mac ran in supervisor mode, too. The Commodore Amiga was one of the few that ran user apps in user mode with supervisor mode reserved for interrupts and fundamental kernel operations.

Post by BlueCrab »

To add a bit about the Dreamcast specifically... KallistiOS (as well as all other homebrew SDKs for the Dreamcast) and Sega's own Katana library all use supervisor mode only. I believe that the Windows CE SDK for the Dreamcast had separate user and supervisor modes (games ran in user mode and made system calls down to the OS kernel running in supervisor mode).

Post by Chilly Willy »

Windows CE... ugh. The Dreamcast looks like the old Amiga line (1000/500/2000), and AROS would probably run really well on it. The DC looks more like a CD based Amiga than any of the CD based Amigas. :lol:

Post by Newbie »

Thanks all !

I have gathered very useful information that I will use in the future! I will use assembly instead of hard-coded inlining.

From BlueCrab :

PR is the "procedure return" register. It stores where the rts opcode returns to at the end of the function. The jsr and bsr opcodes change that register to point to the opcode immediately after the delay slot.

Basically, things work out like this by default on SH4 (basic integer registers only -- not including floating point or system registers in this discussion):
Call clobbered registers: r0, r1, r2, r3, r4, r5, r6, r7
Call preserved registers: r8, r9, r10, r11, r12, r13, r14, r15
Temporary variables (within a function): r0, r1, r2, r3
Parameters (for passing to a function): r4, r5, r6, r7 (the rest go on the stack)
Return values: r0 (and r1, if you're returning a 64-bit value)

GCC expects that call-clobbered registers will potentially be different before and after a function call.
So, if the values in them are important (and GCC knows that you're calling a function), then GCC will automatically back up those values (by pushing them onto the stack). If GCC doesn't know you're calling a function, it can't know that it is important to do that, unless you tell it that you're clobbering those registers.

Call preserved registers must be the same before a function call and after it returns. That is to say, if a function uses those registers, it is responsible for backing up and restoring the values in them.
From nymus :

r15 is used as a "stack pointer." It contains a reserved area of memory where any function can store its state if it wants to call another function or if it needs to reuse registers.

mov.l rn, @-r15 ! save register rn value

mov.l @r15+, rn ! restore a value in register rn

From Chilly Willy :

Note that it's in the .code segment (placed in .text), and assembly entries meant to be accessed via C need a '_' in front. You get two ways of doing comments in SuperH assembly files: ! full-line comment, and /* comment */ the same as in c.

By the way, I noticed that for very simple C code, GCC outputs very complicated ASM code.
For example :

this

Code: Select all


	int index = 24;

	do
	{
		--index;
		
	}while(index > 0);

produces

8c01028e: e3 61 mov r14,r1
8c010290: d0 71 add #-48,r1
8c010292: 18 e2 mov #24,r2
8c010294: 2f 11 mov.l r2,@(60,r1)
8c010296: e3 61 mov r14,r1
8c010298: d0 71 add #-48,r1
8c01029a: e3 62 mov r14,r2
8c01029c: d0 72 add #-48,r2
8c01029e: 2f 52 mov.l @(60,r2),r2
8c0102a0: ff 72 add #-1,r2
8c0102a2: 2f 11 mov.l r2,@(60,r1)
8c0102a4: e3 61 mov r14,r1
8c0102a6: d0 71 add #-48,r1
8c0102a8: 1f 51 mov.l @(60,r1),r1
8c0102aa: 15 41 cmp/pl r1
8c0102ac: 29 01 movt r1
8c0102ae: 1c 61 extu.b r1,r1
8c0102b0: 18 21 tst r1,r1
8c0102b2: f0 8b bf 8c010296

But in fact, it could simply be:

Code: Select all


	MOV #24,R5; 

	LOOP:

	DT R5;

	BF LOOP;


Second example :

Code: Select all


	uint8 * datas = (uint8 *) (&&label);

	label:
	__asm__ __volatile__("NOP");
	__asm__ __volatile__("NOP");
	__asm__ __volatile__("NOP");


8c010286: e3 61 mov r14,r1
8c010288: d0 71 add #-48,r1
8c01028a: e9 d2 mov.l 8c010630,r2 ! 8c01026a
8c01028c: 2e 11 mov.l r2,@(56,r1)

Well, I am very confused about how to read / write an address location in memory as if it were a byte array.

Something like this, but in ASM (a simple dummy example to illustrate):

Code: Select all


	uint8 * datas = (uint8 *) (&&label);
	
	datas[4] = datas[0];

	label:
	__asm__ __volatile__("NOP");
	__asm__ __volatile__("NOP");
	__asm__ __volatile__("NOP");


8c010252: e3 61 mov r14,r1
8c010254: cc 71 add #-52,r1
8c010256: 0c d2 mov.l 8c010288 <B+0x14>,r2 ! 8c01026e
8c010258: 2f 11 mov.l r2,@(60,r1)
8c01025a: e3 61 mov r14,r1
8c01025c: cc 71 add #-52,r1
8c01025e: 1f 51 mov.l @(60,r1),r1
8c010260: 04 71 add #4,r1
8c010262: e3 62 mov r14,r2
8c010264: cc 72 add #-52,r2
8c010266: 2f 52 mov.l @(60,r2),r2
8c010268: 20 62 mov.b @r2,r2
8c01026a: 2c 62 extu.b r2,r2
8c01026c: 20 21 mov.b r2,@r1
8c01026e: 09 00 nop
8c010270: 09 00 nop
8c010272: 09 00 nop

I suspect (as in the loop case) that there is a much simpler way.

Thanks for help.

Post by BlueCrab »

Well, if you turn off optimization, GCC is going to generate just plain ugly code. Take your simple do-while loop case... When compiled with KOS' default flags (with a minor change to the code of making the int "volatile" so that it doesn't optimize the loop away entirely), you get a much different piece of assembly:

Code: Select all

        mov     #24,r1
        mov.l   r1,@r15
.L2:
        mov.l   @r15,r1
        add     #-1,r1
        mov.l   r1,@r15
        mov.l   @r15,r1
        cmp/pl  r1
        bt      .L2
Ignoring the pieces about storing/retrieving the value to/from memory (which are caused by making it volatile), you get this:

Code: Select all

        mov     #24,r1
.L2:
        add     #-1,r1
        cmp/pl  r1
        bt      .L2
Which is almost the same as your case with dt. In fact, if it weren't for the fact that I made the variable volatile, GCC would use dt in place of the add+cmp/pl sequence, as making the variable volatile forces GCC to read it from memory each time it is used (although, you'd have to make the loop do something else, or it'll just optimize it away entirely since it is useless code).


Let's modify your code slightly to test my theory. I'll use this loop instead of what you have:

Code: Select all

void func(void);

void test(void) {
    int var = 24;
    do {
        func();
    } while (--var > 0);
}
Here's the entirety of what GCC generates for that with KOS' default non-debug flags (-O2 -fomit-frame-pointer):

Code: Select all

_test:
        mov.l   r8,@-r15
        mov     #24,r8
        mov.l   r9,@-r15
        mov.l   .L5,r9
        sts.l   pr,@-r15
.L2:
        jsr     @r9
        nop
        dt      r8
        bf      .L2
        lds.l   @r15+,pr
        mov.l   @r15+,r9
        rts
        mov.l   @r15+,r8
.L6:
        .align 2
.L5:
        .long   _func
Now, let's strip out all the bookkeeping work to leave us just with the loop itself (and not the function prologue/epilogue or the call to the useless function inside it):

Code: Select all

        mov     #24,r8
.L2:
        dt      r8
        bf      .L2
Exactly as you've suggested GCC should make the code. :wink:

Also, remember, GCC doesn't ever take a segment of a function to be a lone island when it's generating code. All those __asm__ __volatile__ segments force GCC to effectively make a barrier, so to speak. Everything in the code before that point must be finished, and nothing after it can start until they're done. But the main point is that with -O0, you get crap for code. :wink:
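
For anyone who wants to repeat the experiment, here is a hedged sketch of the two loop variants discussed above as complete C functions. The names `countdown_volatile`, `countdown_with_work`, `work_fn` and `count_work` are made up for this example, and the iteration counters are added only so the behaviour can be checked; the assembly you get will depend on your GCC version and flags.

```c
/* Variant 1: the counter is volatile, so every --index is a real load and
   store and the loop cannot be optimized away. */
int countdown_volatile(void) {
    volatile int index = 24;
    int iterations = 0;
    do {
        --index;
        ++iterations;
    } while (index > 0);
    return iterations;
}

/* Hypothetical callback type standing in for func() in the example above. */
typedef void (*work_fn)(void);

/* Variant 2: the loop body has a visible side effect (a call), so the
   counter can stay in a register and still not be optimized away. */
int countdown_with_work(work_fn work) {
    int var = 24;
    int iterations = 0;
    do {
        work();
        ++iterations;
    } while (--var > 0);
    return iterations;
}

int work_calls;                       /* incremented by the sample callback */
void count_work(void) { work_calls++; }
```

Both variants run their body 24 times. Compiling the file once with -O0 and once with -O2, each time with -S, shows the difference in generated code described in this post.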

Post by nymus »

Compiling without optimization is a nice way to learn basic steps and understand what optimization -O2 does. Assembly is more nuanced than just reducing the number of instructions. Manual inlining as you wanted to do might be faster because gcc would unroll the loop smartly for you.

A lot of the code you see is using r14 which is used as the "frame pointer." When you tell gcc to -fomit-frame-pointer like Bluecrab demonstrated, all that code goes away. The frame pointer is useful when debugging and is usually removed for product release.

Try and read the code and see if you can figure out what gcc is doing. This is actually one of the best ways to learn the fundamentals of assembly. You'll see gcc adding/subtracting the pointer while copying r14 to another register so that it can do the next add/sub without delay because "mov rn, rn" is free. When you enable optimization, the compiler will do some stuff better than most average assembly coders and you lose the opportunity to learn why it is doing these things.

A 3-instruction loop is not automatically fast on a simple RISC CPU like our sh4: the pipeline is 5 stages deep and every taken branch disturbs it, so a tiny loop pays that branch overhead on each iteration, on top of memory latency and compute latency. That long stretch of code with the frame pointer might take just a few cycles longer, because some of the instructions are run in parallel and are better pipelined.

Try just compiling with -fomit-frame-pointer and don't specify optimization. You will see that gcc actually does some instruction ordering for parallel and pipelined execution while leaving your basic code intact so you can follow the C and assembly more easily. Once you enable optimizations gcc will then restructure your program's logic and it will be harder to follow but more "efficient."

Post by nymus »

Here is a nice example you could use to see that looks can be deceiving...

C function:

Code: Select all

int test(int values[5]) {
	int total = 0;

	for(int i = 0; i < 5; ++i) {
		total += values[i];
	}

	return total;
}
When you compile using just "sh-elf-gcc -std=c99 -fomit-frame-pointer -S test.c" (no optimization), you will see gcc saving the stack pointer (r15) so that you might be able to debug the function by pausing after every instruction (about 52 total, you'll see).

-O1 and over automatically enables -fomit-frame-pointer. It's almost 1-1 with the C code.
#20 is 5*4 bytes (values[5]) in r3 and the loop counter (5) in r1. 10 instructions. 5 registers.
Note r3 is not even used. gcc just leaves it there so you can see your code translated correctly to assembly.

Code: Select all

_test:
	mov 	r4,r3
	add	#20,r3
	mov 	#0,r0
	mov	#5,r1
.L2:
	mov.l	@r4+,r2
	dt	r1
	bf.s	.L2
	add	r2,r0
	rts	
	nop
with -O2 gcc knows you understand that it can eliminate that unused r3. 8 instructions. 3 registers.

Code: Select all

_test:
	mov	 #0,r0
	mov	#5,r1
.L2:
	mov.l	@r4+,r2
	dt 	r1
	bf.s 	.L2
	add 	r2,r0
	rts	 
	nop
-O3... No loop!! (i.e. no pipeline reset!) 10 instructions. 3 registers. The (r0 += (r1= values[r4+4i]) is actually pipelined nicely because it takes a couple of cycles after reading r1 before the memory is ready to write to it. In those cycles the cpu will do the add.

Code: Select all

_test:
	mov.l	@(4,r4),r1
	mov.l	@r4,r0
	add	r1,r0
	mov.l	@(8,r4),r1
	add	r1,r0
	mov.l	@(12,r4),r1
	add	r1,r0
	mov.l	@(16,r4),r1
	rts	
	add	r1,r0
Note 10 instructions on sh4 is 20 bytes because our instructions are a compact 16-bits. A cache line is 32 bytes so we could add 6 more instructions and it "might" still be faster than looping.
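
The -O3 listing corresponds to unrolling the loop by hand. As a sketch (the function names `sum5_loop` and `sum5_unrolled` are made up for this example), the two C versions below compute the same total, the second one written in the same order as the unrolled assembly:

```c
/* The loop as written in the C source above. */
int sum5_loop(const int values[5]) {
    int total = 0;
    for (int i = 0; i < 5; ++i) {
        total += values[i];
    }
    return total;
}

/* The same sum with the loop unrolled by hand, mirroring the -O3 output:
   one load per element, no counter and no branch. */
int sum5_unrolled(const int values[5]) {
    int total = values[0];
    total += values[1];
    total += values[2];
    total += values[3];
    total += values[4];
    return total;
}
```

Writing the unrolled form by hand is rarely necessary, since the compiler does it for you at -O3, but comparing the two with -S is a good way to see what the optimizer changed.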

Post by Newbie »

Hi,

The "sh-elf-gcc -std=c99 -O1 -S test.c" command line is definitely awesome to understand the translation between C and ASM.

I will now use it massively.

Thanks