What's the fastest implementation of the cross product on the SH4?

kazade · Post by **kazade** » Wed Nov 24, 2021 2:41 pm

I've recently hit a performance bottleneck in some physics code that makes heavy use of the cross product, and it got me thinking: what's the fastest cross-product implementation for the SH4?

DreamHAL makes use of XMTRX (via the FTRV instruction), but I don't think that's optimal. I know TapamN mentioned in another thread that a 9-cycle cross product is possible? Maybe?

I started playing with the FIPR instruction to see if we could abuse it, but the best I could come up with is around 18 cycles (assuming FIPR has a 5-cycle latency during which I can fmov/fneg things).
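
To make the FIPR idea concrete, here is a minimal C sketch of the arithmetic it would rely on: each output component is a 4-wide inner product of one vector with a shuffled, partly negated copy of the other, which is what the fmov/fneg shuffling would have to set up. The names `dot4` and `cross_via_dots` are made up for illustration; `dot4` only stands in for the inner product FIPR computes, and nothing here says anything about real cycle counts.

```c
/* Stand-in for the 4-element inner product that FIPR computes.
   Hypothetical helper in plain C, not an intrinsic. */
static float dot4(const float u[4], const float v[4])
{
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2] + u[3] * v[3];
}

/* a x b written as three inner products of a against
   shuffled/negated copies of b (fourth lane padded with 0). */
static void cross_via_dots(const float a[3], const float b[3], float out[3])
{
    const float a4[4] = {  a[0],  a[1],  a[2], 0.0f };
    const float rx[4] = {  0.0f,  b[2], -b[1], 0.0f };  /* a.y*b.z - a.z*b.y */
    const float ry[4] = { -b[2],  0.0f,  b[0], 0.0f };  /* a.z*b.x - a.x*b.z */
    const float rz[4] = {  b[1], -b[0],  0.0f, 0.0f };  /* a.x*b.y - a.y*b.x */

    out[0] = dot4(a4, rx);
    out[1] = dot4(a4, ry);
    out[2] = dot4(a4, rz);
}
```

The three shuffled rows are also the rows of the skew-symmetric matrix you would load for a matrix-times-vector approach, so the same arithmetic underlies both ideas.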

Any thoughts?

Twada · Post by **Twada** » Fri Nov 26, 2021 9:32 pm

Hello. Thank you for developing a great engine!

When I implemented the cross product with a combination of ordinary multiplications and subtractions, it came out like this.
The register allocation is as follows.
fr0-fr2: output
fr8-fr10: vec3f_0
fr4-fr6: vec3f_1
fr3, fr7: tmp

Code:

	fmov	fr6, fr3	!1
	fmul	fr9, fr3	
	fmov	fr5, fr0	!2
	fmul	fr10, fr0
	fmov	fr4, fr7	!3
	fmul	fr10, fr7
	fsub	fr3, fr0	!4
	fmov	fr6, fr1
	fmul	fr8, fr1	!5
	fmov	fr5, fr3
	fmul	fr8, fr3	!6
	fmov	fr4, fr2
	fmul	fr9, fr2	!7
	fsub	fr7, fr1	!8
	fsub	fr3, fr2	!9,10,11

I think fmov and fmul can be issued in the same cycle, but I'm not certain of the actual cycle count.
I'd also like to know the fastest way to compute the cross product!
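
As a sanity check on the listing above, here is a plain C model of the same multiply/subtract sequence, assuming the usual SH4 two-operand semantics (`fmul FRm, FRn` gives FRn = FRn * FRm, and `fsub FRm, FRn` gives FRn = FRn - FRm). The function name and argument layout are invented for illustration. Worked through this way, the sequence appears to produce vec3f_1 × vec3f_0, so swap the inputs or negate the result if vec3f_0 × vec3f_1 is what you need.

```c
/* C model of the register-level sequence. Variables are named after the
   SH4 registers in the post: v0 maps to fr8-fr10, v1 maps to fr4-fr6. */
static void cross_model(const float v0[3], const float v1[3], float out[3])
{
    float fr8 = v0[0], fr9 = v0[1], fr10 = v0[2];
    float fr4 = v1[0], fr5 = v1[1], fr6  = v1[2];
    float fr0, fr1, fr2, fr3, fr7;

    fr3 = fr6;  fr3 *= fr9;    /* fmov fr6,fr3 ; fmul fr9,fr3  */
    fr0 = fr5;  fr0 *= fr10;   /* fmov fr5,fr0 ; fmul fr10,fr0 */
    fr7 = fr4;  fr7 *= fr10;   /* fmov fr4,fr7 ; fmul fr10,fr7 */
    fr0 -= fr3;                /* fsub fr3,fr0 : fr0 = fr0 - fr3 */
    fr1 = fr6;  fr1 *= fr8;    /* fmov fr6,fr1 ; fmul fr8,fr1  */
    fr3 = fr5;  fr3 *= fr8;    /* fmov fr5,fr3 ; fmul fr8,fr3  */
    fr2 = fr4;  fr2 *= fr9;    /* fmov fr4,fr2 ; fmul fr9,fr2  */
    fr1 -= fr7;                /* fsub fr7,fr1 */
    fr2 -= fr3;                /* fsub fr3,fr2 */

    out[0] = fr0;
    out[1] = fr1;
    out[2] = fr2;
}
```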

GyroVorbis · Post by **GyroVorbis** » Tue Oct 24, 2023 2:05 am

Anyone got anything better than this? KOS is missing any sort of cross product. If there are no objections, I'll PR this.

TapamN · Post by **TapamN** » Thu Oct 26, 2023 3:26 pm

For a cross product, I think it would be better to just let GCC decide how to perform it than to use inline asm. That way GCC has more freedom to allocate and move registers around, rather than having to get things lined up the way the asm wants them.
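
A minimal sketch of the "let GCC do it" approach might look like the following. The `vec3f` type and the `vec3f_cross` name are placeholders, not part of KOS or DreamHAL. Marking the function `static inline` should let the compiler keep the operands in whatever registers are convenient at each call site.

```c
typedef struct { float x, y, z; } vec3f;  /* placeholder type for illustration */

/* Plain C cross product, a x b; register allocation and
   scheduling are left entirely to the compiler. */
static inline vec3f vec3f_cross(vec3f a, vec3f b)
{
    vec3f r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}
```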

And, if you were writing asm yourself, I think it could be made one cycle faster by using an FNEG/FMAC pair for one element. You would have to already have the right value in fr0 ahead of time from an earlier element.
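
The FNEG/FMAC idea can be sketched in C for a single element. This only shows the shape of the arithmetic (multiply, negate, multiply-add) with a hypothetical function name. Whether GCC actually emits FMAC for the last line depends on the compiler version and flags, and the result may round differently from a separate multiply and subtract.

```c
/* One cross-product element, p*q - r*s, arranged as
   multiply, negate, then multiply-add. */
static float cross_element_negmac(float p, float q, float r, float s)
{
    float t = r * s;   /* fmul */
    t = -t;            /* fneg */
    t = p * q + t;     /* fmac shape: fr0 * FRm + FRn, with p in the fr0 role */
    return t;
}
```

Since FMAC always takes one multiplicand from fr0, `p` has to be sitting in fr0 already, which matches the point above about having the right value there from an earlier element.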

Ian Robinson · Post by **Ian Robinson** » Wed Apr 24, 2024 9:55 am

TapamN wrote: ↑Thu Oct 26, 2023 3:26 pm For a cross product, I think it would be better to just let GCC decide how to perform it than using inline asm. That way GCC has more freedom to allocate and move registers around, rather than having to get things lined up for how the asm wants it.

And, if you were writing asm yourself, I think it could be made one cycle faster by using an FNEG/FMAC pair for one element. You would have to already have the right value in fr0 ahead of time from an earlier element.

well looks pretty bad here letting the compiler do it
https://godbolt.org/z/rvPc7hq44 x86_64

https://godbolt.org/z/qKGhKxqPf sh4

?

Toshiyasu Morita:

```
=> GCC code quality

   GCC produces code which is (IMHO) pretty good for SH4, but not great.
   GCC has very good target-independent optimization (common subexpression
   elimination, invariant loop expression hoisting, etc) but mediocre
   SH-specific optimizations. On complicated functions one can produce
   code which runs 50% faster than GCC output. Factors which affect
   GCC's code quality for complicated functions:

   1) Since the SH4 has only a "few" registers (compared to PPC/MIPS)
      the first scheduling pass is disabled, because it creates many
      register spills. The main effect of this is that code scheduling
      can be somewhat weak.
```

```
but mediocre SH-specific optimizations. On complicated functions one can produce
code which runs 50% faster than GCC output.
```

TapamN · Post by **TapamN** » Tue Apr 30, 2024 11:38 pm

Ian Robinson wrote: ↑Wed Apr 24, 2024 9:55 am well looks pretty bad here letting the compiler do it
https://godbolt.org/z/qKGhKxqPf sh4

The output for the first one is pretty stupid, but the second and third are perfectly acceptable. You would probably inline a cross product anyways, so all those loads from the first two wouldn't be happening.
