I've recently hit a performance bottleneck in some physics code that makes heavy use of the cross product, and it got me wondering: what's the fastest cross-product implementation for the SH4?
DreamHAL's version goes through XMTRX, but I don't think that's optimal. I think TapamN mentioned in another thread that a 9-cycle cross product might be possible?
I started playing with the FIPR instruction to see if we could abuse it, but the best I could come up with is around 18 cycles (assuming FIPR has a 5-cycle latency during which I can fmov/fneg things).
Any thoughts?
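For readers following the FIPR idea, here is a minimal C model of it. It assumes FIPR behaves as a plain 4-element dot product, and the names `fipr4` and `cross_fipr` are made up for illustration. Each output component becomes one dot product with two lanes zeroed and one operand negated, which is where the fmov/fneg shuffling between the three FIPRs comes from. It models the arithmetic only and says nothing about cycle counts or the precision of the real instruction.

```c
typedef struct { float x, y, z; } vec3f;

/* Hypothetical stand-in for FIPR: dot product of two 4-element vectors. */
static float fipr4(const float a[4], const float b[4])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
}

/* Cross product a x b expressed as three 4-element dot products. */
static vec3f cross_fipr(vec3f a, vec3f b)
{
    /* x = a.y*b.z - a.z*b.y */
    const float ax[4] = { a.y, -a.z, 0.0f, 0.0f };
    const float bx[4] = { b.z,  b.y, 0.0f, 0.0f };
    /* y = a.z*b.x - a.x*b.z */
    const float ay[4] = { a.z, -a.x, 0.0f, 0.0f };
    const float by[4] = { b.x,  b.z, 0.0f, 0.0f };
    /* z = a.x*b.y - a.y*b.x */
    const float az[4] = { a.x, -a.y, 0.0f, 0.0f };
    const float bz[4] = { b.y,  b.x, 0.0f, 0.0f };

    vec3f r = { fipr4(ax, bx), fipr4(ay, by), fipr4(az, bz) };
    return r;
}
```

Half of every dot product is spent multiplying zeros, which may be part of why this approach struggles to beat plain fmul/fsub.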
What's the fastest implementation of the cross product on the SH4?
-
- Insane DCEmu
- Posts: 145
- Joined: Tue May 02, 2017 3:11 pm
- Has thanked: 3 times
- Been thanked: 34 times
- These users thanked the author kazade for the post:
- Ian Robinson
-
- DC Developer
- Posts: 42
- Joined: Wed Jan 20, 2016 4:55 am
- Has thanked: 18 times
- Been thanked: 53 times
Re: What's the fastest implementation of the cross product on the SH4?
Hello. Thank you for developing a great engine!
When I implemented the cross product with plain multiplications and subtractions, it came out like this.
The register allocation is as follows.
fr0-fr2: output
fr8-fr10: vec3f_0
fr4-fr6: vec3f_1
fr3, fr7: tmp
I think fmov and fmul can execute in the same cycle. I'm not sure about the cycle counts I expect.
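```
! Computes (fr0, fr1, fr2) = vec3f_1 x vec3f_0
fmov fr6, fr3 !1
fmul fr9, fr3      ! fr3 = v1.z * v0.y
fmov fr5, fr0 !2
fmul fr10, fr0     ! fr0 = v1.y * v0.z
fmov fr4, fr7 !3
fmul fr10, fr7     ! fr7 = v1.x * v0.z
fsub fr3, fr0 !4, out.x = fr0 - fr3
fmov fr6, fr1
fmul fr8, fr1 !5, fr1 = v1.z * v0.x
fmov fr5, fr3
fmul fr8, fr3 !6, fr3 = v1.y * v0.x
fmov fr4, fr2
fmul fr9, fr2 !7, fr2 = v1.x * v0.y
fsub fr7, fr1 !8, out.y = fr1 - fr7
fsub fr3, fr2 !9,10,11, out.z = fr2 - fr3
```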
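To check what the listing computes, here is a register-by-register C model of it. The `fr[]` array and the name `cross_listing` are hypothetical, and the only thing assumed about the hardware is that `fsub FRm,FRn` stores FRn - FRm in FRn. Traced this way, the result is vec3f_1 x vec3f_0, so swap the inputs if you want the opposite order.

```c
typedef struct { float x, y, z; } vec3f;

/* C model of the listing; fr[] is a hypothetical stand-in for fr0..fr10. */
static vec3f cross_listing(vec3f v0, vec3f v1)
{
    float fr[11];

    fr[8] = v0.x; fr[9] = v0.y; fr[10] = v0.z;  /* vec3f_0 in fr8-fr10 */
    fr[4] = v1.x; fr[5] = v1.y; fr[6]  = v1.z;  /* vec3f_1 in fr4-fr6  */

    fr[3] = fr[6]; fr[3] *= fr[9];   /* fmov fr6,fr3 ; fmul fr9,fr3  */
    fr[0] = fr[5]; fr[0] *= fr[10];  /* fmov fr5,fr0 ; fmul fr10,fr0 */
    fr[7] = fr[4]; fr[7] *= fr[10];  /* fmov fr4,fr7 ; fmul fr10,fr7 */
    fr[0] -= fr[3];                  /* fsub fr3,fr0                 */
    fr[1] = fr[6]; fr[1] *= fr[8];   /* fmov fr6,fr1 ; fmul fr8,fr1  */
    fr[3] = fr[5]; fr[3] *= fr[8];   /* fmov fr5,fr3 ; fmul fr8,fr3  */
    fr[2] = fr[4]; fr[2] *= fr[9];   /* fmov fr4,fr2 ; fmul fr9,fr2  */
    fr[1] -= fr[7];                  /* fsub fr7,fr1                 */
    fr[2] -= fr[3];                  /* fsub fr3,fr2                 */

    vec3f out = { fr[0], fr[1], fr[2] };  /* output in fr0-fr2 */
    return out;
}
```

With v0 = (1, 0, 0) and v1 = (0, 1, 0) the model gives (0, 0, -1), which is v1 x v0 rather than v0 x v1.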
I'd also like to know a fast way to compute the cross product!
- These users thanked the author Twada for the post (total 3):
- freakdave • Ian Robinson • kazade
- GyroVorbis
- Elysian Shadows Developer
- Posts: 1874
- Joined: Mon Mar 22, 2004 4:55 pm
- Location: #%^&*!!!11one Super Sonic
- Has thanked: 81 times
- Been thanked: 64 times
Re: What's the fastest implementation of the cross product on the SH4?
Anyone got anything better than this? KOS is missing any sort of cross product. If there are no objections, I'll PR this.
-
- DC Developer
- Posts: 108
- Joined: Sun Oct 04, 2009 11:13 am
- Has thanked: 2 times
- Been thanked: 92 times
Re: What's the fastest implementation of the cross product on the SH4?
For a cross product, I think it would be better to just let GCC decide how to perform it than to use inline asm. That way GCC has more freedom to allocate and move registers around, rather than having to get things lined up the way the asm wants them.
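A minimal sketch of the "let GCC decide" version might look like this. The `vec3f` type and the name `vec3f_cross` are placeholders, not an existing KOS or DreamHAL API.

```c
typedef struct { float x, y, z; } vec3f;

/* Plain C cross product a x b; register allocation and scheduling are left to the compiler. */
static inline vec3f vec3f_cross(vec3f a, vec3f b)
{
    vec3f r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}
```

Because it is `static inline`, the compiler is free to fold it into the caller and keep the operands wherever they already live.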
And, if you were writing asm yourself, I think it could be made one cycle faster by using an FNEG/FMAC pair for one element. You would have to already have the right value in fr0 ahead of time from an earlier element.
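The arithmetic behind the FNEG/FMAC idea can be sketched in C as follows. This assumes FMAC computes FR0 * FRm + FRn into FRn, and the helper names are hypothetical. It only shows that one element p*q - r*s can be formed as a multiply, a negate, and a multiply-accumulate. It does not demonstrate the cycle saving, and it ignores how the real instruction rounds.

```c
/* Hypothetical model of FMAC FR0,FRm,FRn: returns fr0 * frm + frn. */
static float fmac_model(float fr0, float frm, float frn)
{
    return fr0 * frm + frn;
}

/* One cross-product element, p*q - r*s, built from fmul, fneg, fmac. */
static float cross_element_fmac(float p, float q, float r, float s)
{
    float t = r * s;             /* fmul: t = r*s             */
    t = -t;                      /* fneg: t = -(r*s)          */
    return fmac_model(p, q, t);  /* fmac: p*q + t, p in "fr0" */
}
```

The first argument of `fmac_model` plays the role of fr0, which is why the right value has to be sitting in fr0 before the pair is issued.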
- These users thanked the author TapamN for the post (total 2):
- GyroVorbis • Ian Robinson
- Ian Robinson
- DC Developer
- Posts: 116
- Joined: Mon Mar 11, 2019 7:12 am
- Has thanked: 213 times
- Been thanked: 41 times
Re: What's the fastest implementation of the cross product on the SH4?
TapamN wrote: ↑Thu Oct 26, 2023 3:26 pm
For a cross product, I think it would be better to just let GCC decide how to perform it than to use inline asm. That way GCC has more freedom to allocate and move registers around, rather than having to get things lined up the way the asm wants them.
And, if you were writing asm yourself, I think it could be made one cycle faster by using an FNEG/FMAC pair for one element. You would have to already have the right value in fr0 ahead of time from an earlier element.
Well, it looks pretty bad here letting the compiler do it:
https://godbolt.org/z/rvPc7hq44 x86_64
https://godbolt.org/z/qKGhKxqPf sh4
?
Toshiyasu Morita:
```
=> GCC code quality
GCC produces code which is (IMHO) pretty good for SH4, but not great.
GCC has very good target-independent optimization (common subexpression
elimination, invariant loop expression hoisting, etc) but mediocre
SH-specific optimizations. On complicated functions one can produce
code which runs 50% faster than GCC output. Factors which affect
GCC's code quality for complicated functions:
1) Since the SH4 has only a "few" registers (compared to PPC/MIPS)
the first scheduling pass is disabled, because it creates many
register spills. The main effect of this is that code scheduling
can be somewhat weak.
```
```
but mediocre
SH-specific optimizations. On complicated functions one can produce
code which runs 50% faster than GCC output.
```
-
- DC Developer
- Posts: 108
- Joined: Sun Oct 04, 2009 11:13 am
- Has thanked: 2 times
- Been thanked: 92 times
Re: What's the fastest implementation of the cross product on the SH4?
Ian Robinson wrote: ↑Wed Apr 24, 2024 9:55 am
well looks pretty bad here letting the compiler do it
https://godbolt.org/z/qKGhKxqPf sh4
The output for the first one is pretty stupid, but the second and third are perfectly acceptable. You would probably inline a cross product anyway, so all those loads from the first two wouldn't be happening.
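As a rough illustration of the inlining point, here is a sketch of a caller where the cross product's operands are computed locally instead of being loaded through pointers. All names here (`vec3f`, `vec3f_sub`, `vec3f_cross`, `tri_normal`) are hypothetical, and whether GCC actually keeps everything in FP registers depends on the flags and the surrounding code.

```c
typedef struct { float x, y, z; } vec3f;

static inline vec3f vec3f_sub(vec3f a, vec3f b)
{
    vec3f r = { a.x - b.x, a.y - b.y, a.z - b.z };
    return r;
}

static inline vec3f vec3f_cross(vec3f a, vec3f b)
{
    vec3f r;
    r.x = a.y * b.z - a.z * b.y;
    r.y = a.z * b.x - a.x * b.z;
    r.z = a.x * b.y - a.y * b.x;
    return r;
}

/* Unnormalized face normal of the triangle (p0, p1, p2). */
static vec3f tri_normal(vec3f p0, vec3f p1, vec3f p2)
{
    vec3f e1 = vec3f_sub(p1, p0);  /* edge vectors are produced here...     */
    vec3f e2 = vec3f_sub(p2, p0);
    return vec3f_cross(e1, e2);    /* ...so an inlined cross can reuse them */
}
```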
- These users thanked the author TapamN for the post:
- Ian Robinson