What's the fastest implementation of the cross product on the SH4?

kazade
Insane DCEmu
Posts: 145
Joined: Tue May 02, 2017 3:11 pm
Has thanked: 3 times
Been thanked: 34 times

What's the fastest implementation of the cross product on the SH4?

Post by kazade »

I've recently hit a performance bottleneck in some physics code that makes heavy use of the cross product, and it got me wondering: what's the fastest cross-product implementation for the SH4?

DreamHAL makes use of XMTRX (via the FTRV instruction) but I don't think that's optimal. I know TapamN mentioned in another thread that a 9-cycle cross product is possible? Maybe?

I started playing with the FIPR instruction to see if we could abuse it, but the best I could come up with is around 18 cycles (assuming FIPR has a 5-cycle latency during which I can fmov/fneg other values).
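To make the FIPR idea concrete: each component of the cross product can be written as a 4-element inner product of the first vector with a shuffled, partly negated copy of the second. The sketch below is a plain C model of that arithmetic only, not SH4 code. The names `dot4` and `cross_fipr_model` are hypothetical, `dot4` merely stands in for one FIPR, and nothing here says anything about cycle counts.

```c
/* Hypothetical stand-in for one FIPR: a 4-element inner product. */
static float dot4(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/*
 * out = a x b, expressed as three inner products.
 * The w lane of each shuffled vector is zero, so a[3] only needs to be finite.
 */
static void cross_fipr_model(const float a[4], const float b[4], float out[3])
{
    /* Shuffled and negated copies of b; these are the fmov/fneg steps. */
    const float sx[4] = {  0.0f,  b[2], -b[1], 0.0f };
    const float sy[4] = { -b[2],  0.0f,  b[0], 0.0f };
    const float sz[4] = {  b[1], -b[0],  0.0f, 0.0f };

    out[0] = dot4(a, sx); /* a.y*b.z - a.z*b.y */
    out[1] = dot4(a, sy); /* a.z*b.x - a.x*b.z */
    out[2] = dot4(a, sz); /* a.x*b.y - a.y*b.x */
}
```

The three shuffled vectors are where the extra fmov/fneg work comes from, which is why the latency slots matter so much for this approach.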

Any thoughts?
These users thanked the author kazade for the post:
Ian Robinson
Twada
DC Developer
Posts: 42
Joined: Wed Jan 20, 2016 4:55 am
Has thanked: 18 times
Been thanked: 53 times

Re: What's the fastest implementation of the cross product on the SH4?

Post by Twada »

Hello. Thank you for developing a great engine!

When I implemented the cross product with a combination of ordinary multiplication and subtraction, it came out like this.
The register allocation is as follows.
fr0-fr2: output
fr8-fr10: vec3f_0
fr4-fr6: vec3f_1
fr3, fr7: tmp

Code:

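	! v0 = vec3f_0 (fr8-fr10), v1 = vec3f_1 (fr4-fr6); !N marks the expected cycle
	! As written, fr0-fr2 receives v1 x v0 (swap the inputs for v0 x v1)
	fmov	fr6, fr3	!1  fr3 = v1.z
	fmul	fr9, fr3	!   fr3 = v0.y * v1.z
	fmov	fr5, fr0	!2  fr0 = v1.y
	fmul	fr10, fr0	!   fr0 = v0.z * v1.y
	fmov	fr4, fr7	!3  fr7 = v1.x
	fmul	fr10, fr7	!   fr7 = v0.z * v1.x
	fsub	fr3, fr0	!4  fr0 = v0.z*v1.y - v0.y*v1.z
	fmov	fr6, fr1	!   fr1 = v1.z
	fmul	fr8, fr1	!5  fr1 = v0.x * v1.z
	fmov	fr5, fr3	!   fr3 = v1.y
	fmul	fr8, fr3	!6  fr3 = v0.x * v1.y
	fmov	fr4, fr2	!   fr2 = v1.x
	fmul	fr9, fr2	!7  fr2 = v0.y * v1.x
	fsub	fr7, fr1	!8  fr1 = v0.x*v1.z - v0.z*v1.x
	fsub	fr3, fr2	!9,10,11  fr2 = v0.y*v1.x - v0.x*v1.y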
I think an fmov and an fmul can be issued in the same cycle, so the cycle numbers in the comments are only my estimate and I'm not certain of the real count.
I also want to know the fast cross product calculation!
These users thanked the author Twada for the post (total 3):
freakdave, Ian Robinson, kazade
GyroVorbis
Elysian Shadows Developer
Posts: 1873
Joined: Mon Mar 22, 2004 4:55 pm
Location: #%^&*!!!11one Super Sonic
Has thanked: 79 times
Been thanked: 61 times

Re: What's the fastest implementation of the cross product on the SH4?

Post by GyroVorbis »

Anyone got anything better than this? KOS is missing any sort of cross product. If there are no objections, I'll PR this.
TapamN
DC Developer
Posts: 104
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 88 times

Re: What's the fastest implementation of the cross product on the SH4?

Post by TapamN »

For a cross product, I think it would be better to just let GCC decide how to perform it rather than using inline asm. That way GCC has more freedom to allocate and move registers around, rather than having to get things lined up the way the asm wants them.
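A minimal sketch of the "let GCC do it" route, for comparison: a plain C function with no inline asm, so the compiler is free to pick registers and interleave the multiplies with surrounding code. The `vec3f` type and the `vec3f_cross` name are hypothetical and not taken from KOS or DreamHAL.

```c
/* Hypothetical 3-component vector type, for illustration only. */
typedef struct {
    float x, y, z;
} vec3f;

/* Returns a x b; the compiler chooses the register allocation and scheduling. */
static inline vec3f vec3f_cross(vec3f a, vec3f b)
{
    vec3f r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}
```

Because the function is `static inline`, a call such as `vec3f n = vec3f_cross(e1, e2);` inside a hot loop can be folded into the caller, which is the freedom a fixed-register asm block takes away.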

And, if you were writing asm yourself, I think it could be made one cycle faster by using an FNEG/FMAC pair for one element. You would have to already have the right value in fr0 ahead of time from an earlier element.
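As I read the FNEG/FMAC suggestion, one element's multiply-then-subtract becomes: multiply one product, negate it, then accumulate the second product onto it with FMAC (which computes FR0 * FRm + FRn into FRn). The C model below shows only that arithmetic. Which element uses the pair, and which value sits in fr0, are my assumptions, and `fmac_model` is a hypothetical helper that does not reproduce the hardware's rounding behaviour.

```c
/* Hypothetical model of FMAC FR0,FRm,FRn: returns fr0 * frm + frn. */
static float fmac_model(float fr0, float frm, float frn)
{
    return fr0 * frm + frn;
}

/* out = a x b, with the z element built from an FNEG/FMAC-style pair. */
static void cross_fmac_model(const float a[3], const float b[3], float out[3])
{
    float t;

    /* x and y use the ordinary multiply/subtract form. */
    out[0] = a[1] * b[2] - a[2] * b[1];
    out[1] = a[2] * b[0] - a[0] * b[2];

    /* z: assume a.x is already in fr0 from an earlier element. */
    t = a[1] * b[0];                  /* fmul                          */
    t = -t;                           /* fneg                          */
    out[2] = fmac_model(a[0], b[1], t); /* fmac: a.x*b.y + (-(a.y*b.x)) */
}
```

The saving comes from replacing a separate fmul and fsub with one fmac, at the price of pinning one operand to fr0.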
These users thanked the author TapamN for the post (total 2):
GyroVorbis, Ian Robinson