What's the fastest implementation of the cross product on the SH4?



Post by kazade » Wed Nov 24, 2021 2:41 pm

I've recently hit a performance bottleneck in some physics code that makes heavy use of the cross product, and it got me wondering: what is the fastest cross-product implementation for the SH4?

DreamHAL makes use of XMTRX (via the FTRV instruction), but I don't think that's optimal. I believe TapamN mentioned in another thread that a 9-cycle cross product is possible, though I'm not sure.

I started playing with the FIPR instruction to see if we could abuse it, but the best I could come up with is around 18 cycles (assuming FIPR has a 5-cycle latency during which I can issue fmov/fneg instructions).
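
To make the FIPR idea concrete, here is a rough C sketch (not SH4 assembly, and not the code described above) of how each cross-product component can be written as a 4-component inner product, which is the operation FIPR performs. The helper names `dot4` and `cross_via_dot4` are made up for illustration, and the negated lanes stand in for the fneg shuffling an assembly version would need. It says nothing about cycle counts.

```c
/* 4-component inner product: the arithmetic FIPR performs on two
   vector registers, written here in plain C for illustration. */
static float dot4(const float u[4], const float v[4])
{
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2] + u[3] * v[3];
}

/* Hypothetical helper: out = a x b using three 4-wide inner products. */
static void cross_via_dot4(const float a[3], const float b[3], float out[3])
{
    /* One operand pair per output component; unused lanes are zero.
       The negated lane provides the subtraction. */
    const float ux[4] = { a[1], -a[2], 0.0f, 0.0f };
    const float vx[4] = { b[2],  b[1], 0.0f, 0.0f };
    const float uy[4] = { a[2], -a[0], 0.0f, 0.0f };
    const float vy[4] = { b[0],  b[2], 0.0f, 0.0f };
    const float uz[4] = { a[0], -a[1], 0.0f, 0.0f };
    const float vz[4] = { b[1],  b[0], 0.0f, 0.0f };

    out[0] = dot4(ux, vx);  /* a.y*b.z - a.z*b.y */
    out[1] = dot4(uy, vy);  /* a.z*b.x - a.x*b.z */
    out[2] = dot4(uz, vz);  /* a.x*b.y - a.y*b.x */
}
```

Half of every inner product is wasted on zero lanes here, which may be part of why this approach struggles to beat plain multiplies and subtracts.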

Any thoughts?

Re: What's the fastest implementation of the cross product on the SH4?

Post by Twada » Fri Nov 26, 2021 9:32 pm

Hello. Thank you for developing a great engine!

When I implemented the cross product with plain multiplications and subtractions, I ended up with the following.
The register allocation is as follows.
fr0-fr2: output
fr8-fr10: vec3f_0
fr4-fr6: vec3f_1
fr3, fr7: tmp

Code:

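	! a = vec3f_0 (fr8-fr10), b = vec3f_1 (fr4-fr6)
	! fsub FRm, FRn stores FRn - FRm in FRn, so the result is b x a.
	fmov	fr6, fr3	!1  fr3 = b.z
	fmul	fr9, fr3	!   fr3 = a.y * b.z
	fmov	fr5, fr0	!2  fr0 = b.y
	fmul	fr10, fr0	!   fr0 = a.z * b.y
	fmov	fr4, fr7	!3  fr7 = b.x
	fmul	fr10, fr7	!   fr7 = a.z * b.x
	fsub	fr3, fr0	!4  fr0 = a.z*b.y - a.y*b.z (out.x)
	fmov	fr6, fr1	!   fr1 = b.z
	fmul	fr8, fr1	!5  fr1 = a.x * b.z
	fmov	fr5, fr3	!   fr3 = b.y
	fmul	fr8, fr3	!6  fr3 = a.x * b.y
	fmov	fr4, fr2	!   fr2 = b.x
	fmul	fr9, fr2	!7  fr2 = a.y * b.x
	fsub	fr7, fr1	!8  fr1 = a.x*b.z - a.z*b.x (out.y)
	fsub	fr3, fr2	!9,10,11  fr2 = a.y*b.x - a.x*b.y (out.z)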
I think fmov and fmul can issue in the same cycle, but the cycle numbers in the comments are only my estimate and I'm not sure they are right.
I'd also like to know the fastest way to compute the cross product!
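
For anyone who wants to check the listing above on a PC, here is a minimal C sketch. `cross3` is the textbook formula, and `cross3_as_listed` replays the listing's multiplies and subtracts in the same order, with locals standing in for the fr3/fr7 temporaries. Both function names are made up for illustration. Assuming `fsub FRm, FRn` computes FRn - FRm, the listing appears to produce vec3f_1 x vec3f_0, that is, the negation of vec3f_0 x vec3f_1, so swap the inputs (or the fsub operands) if you need the other order.

```c
/* Hypothetical reference helper: out = a x b (textbook formula). */
static void cross3(const float a[3], const float b[3], float out[3])
{
    out[0] = a[1] * b[2] - a[2] * b[1];
    out[1] = a[2] * b[0] - a[0] * b[2];
    out[2] = a[0] * b[1] - a[1] * b[0];
}

/* Replays the assembly listing step by step.
   v0 = vec3f_0 (fr8-fr10), v1 = vec3f_1 (fr4-fr6), out = fr0-fr2. */
static void cross3_as_listed(const float v0[3], const float v1[3], float out[3])
{
    float fr3, fr7;            /* temporaries, as in the listing */

    fr3    = v1[2] * v0[1];    /* fmov fr6, fr3 / fmul fr9, fr3  */
    out[0] = v1[1] * v0[2];    /* fmov fr5, fr0 / fmul fr10, fr0 */
    fr7    = v1[0] * v0[2];    /* fmov fr4, fr7 / fmul fr10, fr7 */
    out[0] = out[0] - fr3;     /* fsub fr3, fr0                  */
    out[1] = v1[2] * v0[0];    /* fmov fr6, fr1 / fmul fr8, fr1  */
    fr3    = v1[1] * v0[0];    /* fmov fr5, fr3 / fmul fr8, fr3  */
    out[2] = v1[0] * v0[1];    /* fmov fr4, fr2 / fmul fr9, fr2  */
    out[1] = out[1] - fr7;     /* fsub fr7, fr1                  */
    out[2] = out[2] - fr3;     /* fsub fr3, fr2                  */
}
```

Calling `cross3(v1, v0, ref)` and `cross3_as_listed(v0, v1, out)` on the same inputs should give matching results, which is a quick way to catch a sign or register mix-up before timing anything on hardware.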