SH4 Contention between IF and MA

Olivier
DCEmu Newbie
Posts: 7
Joined: Thu Jul 29, 2021 8:26 am
Location: Aix-en-Provence, France
Has thanked: 4 times
Been thanked: 0

SH4 Contention between IF and MA

Post by Olivier »

Hello to all.
I have a question, coming from SH2.
The SH2 Programming Manual states (§7.4.3, p. 174):
When an instruction is located in on-chip memory (ROM/RAM) or on-chip cache, there are instruction fetch stages (‘if’ written in lower case) that do not generate bus cycles as explained in section 7.4.2 above. When an "if" is in contention with an MA, the slot will not split, as it does when an IF and an MA are in contention, because "if"s and MAs can be executed simultaneously. Such slots execute in the number of states the MA requires for memory access, as illustrated in figure 7.8.
When programming, avoid contention of MA and IF whenever possible and pair MAs with ifs to increase the instruction execution speed
That means:

Code:

	.align	4
	MOV.L	@R2,R3	IF ID EX MA WB
	ADD	R0,R1	   if ID EX
	MOV	R4,R5	      IF ID EX
	SHLR	R6	         if ID
Here the slot will not split at "SHLR", since this instruction was already on-chip.

But if I have :

Code:

	.align	4
	NOP		IF ID EX
	MOV.L	@R2,R3	   if ID EX MA WB
	ADD	R0,R1	      IF ID EX
	MOV	R4,R5	         if ID EX
	SHLR	R6	            -  IF ID
Here, "SHLR" will split because it still has to be fetched, and its IF coincides with the "MA" of the MOV.L instruction.

NB: I'm purposely leaving the superscalar aspect aside here.

I didn't find anything about the subject in the Hitachi manuals, in particular in the Programming Manual (figure 8.3, p. 166).
The instruction bus seems to be 32 bits wide, so it would allow fetching two instructions at a time.

So here is my question: does the SH4 have the same contention problem as the SH2, or did the smart engineers from Hitachi manage to get around it?

Re: SH4 Contention between IF and MA

Post by Olivier »

Addendum:
If we use the C code given by @nymus in the "SH4 assembly function call in C" thread:

Code:

int test(int values[5]) {
	int total = 0;

	for(int i = 0; i < 5; ++i) {
		total += values[i];
	}

	return total;
}
The -O3 compiled code is:

Code:

_test:
	mov.l	@(4,r4),r1	I D X M S

	mov.l	@r4,r0		  i D X M S

	add	r1,r0		    I D X N S
	mov.l	@(8,r4),r1	    I D X M S

	add	r1,r0		      i D X N S
	mov.l	@(12,r4),r1	      i D X M S

	add	r1,r0		        - - - I D X N S	; these lines
	mov.l	@(16,r4),r1	        - - - I D X M S

	rts			                i D X N S

	add	r1,r0		                  I D X N S
If we look at the lines marked "these lines", I think they are delayed by three cycles because the previous instructions are using Memory Access.

Is my understanding correct?
BlueCrab
The Crabby Overlord
Posts: 5652
Joined: Mon May 27, 2002 11:31 am
Location: Sailing the Skies of Arcadia
Has thanked: 9 times
Been thanked: 69 times

Re: SH4 Contention between IF and MA

Post by BlueCrab »

I'm pretty sure there's no contention between IF and MA stages on the SH4, as instruction fetch would be going to the icache and any memory access would be going to the ocache -- that is to say that they use two completely separate memory access paths as the SH4 is a "Harvard Architecture" CPU. I believe the earliest SuperH CPU that fits into the Harvard mold was the SH2A.

Re: SH4 Contention between IF and MA

Post by Olivier »

Totally makes sense. Thank you so much for the answer and the reminder about the Harvard / Von Neumann architecture ^^
TapamN
DC Developer
Posts: 104
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 88 times

Re: SH4 Contention between IF and MA

Post by TapamN »

Those instruction/data access issues only happen on SuperHs with unified caches, like the SH-2 and SH-3. All SuperHs read instructions by fetching a long's worth of instructions at a time (2 instructions). This fetch is always aligned to a long boundary.

On the SuperH 1 through 3, what this works out to is that load/store instructions on even word addresses (bottom two bits of the instruction address are 00) can take one cycle, while those on odd word addresses (bottom two bits of the instruction address are 10) take an extra cycle.
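
As a rough illustration of that alignment rule, the following C sketch works out which aligned longword a fetch covers and whether a load/store at a given instruction address would pay the extra cycle. The function names are made up for this example, and the model is only a simplification of the rule as stated in this post, not something taken from the Hitachi manuals.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helpers: a simplified model of the rule described above. */

/* Address of the aligned 32-bit fetch that contains the instruction at pc. */
static inline uint32_t fetch_base(uint32_t pc)
{
    return pc & ~UINT32_C(3);
}

/* True when pc is the second 16-bit instruction of its fetch pair
   (bottom two address bits are 10). */
static inline bool is_odd_word(uint32_t pc)
{
    return (pc & UINT32_C(3)) == 2;
}

/* Extra cycles for a load/store located at pc on a unified-cache SuperH,
   per the rule above: none on an even word address, one on an odd one. */
static inline unsigned load_store_penalty(uint32_t pc)
{
    return is_odd_word(pc) ? 1u : 0u;
}
```

For instance, under this model a MOV.L placed right after an `.align`ed NOP sits on an odd word address and gets a penalty of 1, which matches the second SH2 listing at the top of the thread.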

Since the SH-4 has separate instruction and data caches, it can read from both at the same time without the odd instruction alignment penalty.

On the SH-4, the first pair of instructions after a jump can never be dual executed if the address is word odd, since this would require an unaligned access. If you want to be able to dual issue the first two instructions at the start of a loop, you need to make sure it starts on a long aligned address.
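
A small C sketch of that check, assuming instruction addresses are always 2-byte aligned. The helper names are invented for illustration. Long alignment is only the condition named in this post; whether two instructions actually pair also depends on the SH-4's own grouping rules, which are not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when the first two instructions at a branch target can be fetched
   as one aligned pair (the condition given above for dual issue). */
static inline bool target_is_long_aligned(uint32_t target)
{
    return (target & UINT32_C(3)) == 0;
}

/* Bytes of padding needed in front of target to make it long aligned:
   0 or 2 for a word-aligned address, i.e. zero or one 16-bit nop. */
static inline uint32_t padding_to_long_align(uint32_t target)
{
    return (UINT32_C(4) - (target & UINT32_C(3))) & UINT32_C(3);
}
```

In practice the assembler's alignment directive does this for you; the arithmetic is only shown to make the "word odd" case concrete.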

On the SH-4, the timing of the assembly you listed would look like this (assuming no cache misses), with blank lines representing cycle boundaries:

Code:

mov.l	@(4,r4),r1	I D X M S

mov.l	@r4,r0		  i D X M S

[stall cycle (add's r0 not ready)]

add	r1,r0		      I D X N S
mov.l	@(8,r4),r1	      I D X M S

[stall cycle (add's r1 not ready)]

add	r1,r0		          i D X N S
mov.l	@(12,r4),r1	          i D X M S

[stall cycle (add's r1 not ready)]

add	r1,r0	                      I D X N S
mov.l	@(16,r4),r1	              I D X M S

rts			                i D X N S

add	r1,r0		                    I D X N S
If you graph it out, the result of a load is not ready until after the MA stage. If an instruction tries to use the result of a load before it is ready, it is delayed so that its EX stage is after the load's MA stage.
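
The stage arithmetic behind that rule can be written down as a short C sketch. It assumes the five stages from the listing (I, D, X, M, S) take one cycle each and that there are no cache misses; the names are invented for this example.

```c
/* Pipeline stages as drawn in the listing, as offsets from the issue cycle. */
enum stage { ST_I = 0, ST_D = 1, ST_X = 2, ST_M = 3, ST_S = 4 };

/* Cycle in which stage s runs for an instruction issued in cycle `issue`. */
static inline int stage_cycle(int issue, enum stage s)
{
    return issue + (int)s;
}

/* Earliest issue cycle for an instruction that reads a load's result:
   its X stage must come after the load's M stage. */
static inline int earliest_dependent_issue(int load_issue)
{
    int first_ok_x = stage_cycle(load_issue, ST_M) + 1;
    return first_ok_x - (int)ST_X;
}

/* Stall cycles if the dependent instruction would otherwise issue in `wanted`. */
static inline int load_use_stall(int load_issue, int wanted)
{
    int earliest = earliest_dependent_issue(load_issue);
    return wanted < earliest ? earliest - wanted : 0;
}
```

With the load of r0 issued in cycle 2, `earliest_dependent_issue(2)` gives 4, so an add that would otherwise issue in cycle 3 stalls for one cycle, as in the first stall of the listing.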

Your disassembly would take 11 cycles to execute. If the compiler were smarter, a more efficient way of writing that code would be like this:

Code:

mov.l	@r4,r0		I D X M S

mov.l	@(4,r4),r1	  I D X M S

mov.l	@(8,r4),r2	    I D X M S

add	r1, r0                I D X N S
mov.l	@(12,r4),r1	      I D X M S

add	r2, r0                  I D X N S
mov.l	@(16,r4),r2	        I D X M S

add	r1, r0                    I D X N S

rts			            i D X N S

add	r2,r0		              I D X N S
This would take 9 cycles.
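
To compare the two schedules mechanically, here is a toy C model. It takes the issue groups exactly as drawn in the two listings and inserts a stall whenever an add reads a register whose load issued less than two cycles earlier. Everything here (the struct, the function, the two-cycle figure read off the listings) is an assumption made for illustration. It counts issue slots only, so it gives 10 and 8 rather than the 11 and 9 quoted above; what it does reproduce is the three stall cycles in the compiler's version and the two-cycle gap between the schedules.

```c
#include <stddef.h>

#define NO_REG (-1)

/* One issue group: at most one ALU op (two source registers) and at most
   one load (destination register). Hypothetical type for this sketch. */
struct group {
    int src_a, src_b;   /* registers read by the ALU op, or NO_REG */
    int load_dst;       /* register written by the load, or NO_REG */
};

/* Count issue cycles, stalls included, assuming a loaded value becomes
   usable two cycles after its load issues. */
static inline int issue_cycles(const struct group *g, size_t n)
{
    int ready[16] = {0};   /* first cycle in which each register is usable */
    int cycle = 0;

    for (size_t i = 0; i < n; ++i) {
        int c = cycle + 1;   /* next free issue slot */
        if (g[i].src_a != NO_REG && ready[g[i].src_a] > c) c = ready[g[i].src_a];
        if (g[i].src_b != NO_REG && ready[g[i].src_b] > c) c = ready[g[i].src_b];
        /* Sources are checked first, so an add paired with a load of the
           same register still sees the old value. */
        if (g[i].load_dst != NO_REG) ready[g[i].load_dst] = c + 2;
        cycle = c;
    }
    return cycle;
}

/* The -O3 output, grouped as in the first timing listing. */
const struct group compiler_sched[7] = {
    {NO_REG, NO_REG, 1},        /* mov.l @(4,r4),r1 */
    {NO_REG, NO_REG, 0},        /* mov.l @r4,r0 */
    {1, 0, 1},                  /* add r1,r0 + mov.l @(8,r4),r1 */
    {1, 0, 1},                  /* add r1,r0 + mov.l @(12,r4),r1 */
    {1, 0, 1},                  /* add r1,r0 + mov.l @(16,r4),r1 */
    {NO_REG, NO_REG, NO_REG},   /* rts */
    {1, 0, NO_REG},             /* add r1,r0 */
};

/* The hand-scheduled version, grouped as in the second listing. */
const struct group manual_sched[8] = {
    {NO_REG, NO_REG, 0},        /* mov.l @r4,r0 */
    {NO_REG, NO_REG, 1},        /* mov.l @(4,r4),r1 */
    {NO_REG, NO_REG, 2},        /* mov.l @(8,r4),r2 */
    {1, 0, 1},                  /* add r1,r0 + mov.l @(12,r4),r1 */
    {2, 0, 2},                  /* add r2,r0 + mov.l @(16,r4),r2 */
    {1, 0, NO_REG},             /* add r1,r0 */
    {NO_REG, NO_REG, NO_REG},   /* rts */
    {2, 0, NO_REG},             /* add r2,r0 */
};
```

Calling `issue_cycles(compiler_sched, 7)` and `issue_cycles(manual_sched, 8)` gives the two counts. The point of the second schedule is that every load is issued at least two cycles before the add that consumes it.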

History pedantry: The SH-DSP (from ~1997) would be the first modified Harvard architecture SuperH. But only partially; the DSP instructions had a pair of on-chip RAM pools that it would read simultaneously from without interfering with instruction fetches, but regular loads and stores still used the shared cache, like the vanilla SH-2/3. The SH-4 was the first full modified Harvard. The SH-2A came out after the SH-4.

Re: SH4 Contention between IF and MA

Post by Olivier »

Thank you again @TapamN for your quick and complete answer ^^