How can I make this faster?

kazade
Insane DCEmu
Posts: 145
Joined: Tue May 02, 2017 3:11 pm
Has thanked: 3 times
Been thanked: 34 times

How can I make this faster?

Post by kazade »

https://github.com/Kazade/GLdc/blob/mas ... raw.c#L446

The above link shows the function in GLdc that takes the user-submitted pointers to vertex data and transforms that data into a buffer of vertices more suitable for the PVR.

When running the quadmark benchmark, which submits a load of vertices in bulk, this function takes 20ms (it was 30!)

I've tried various methods of looping and transforming this data and I can't seem to improve it. Anyone got any suggestions? Am I doing something obviously slow?

Re: How can I make this faster?

Post by kazade »

It's been pointed out to me that this might already be about as fast as it can go...

The quadmark benchmark... anyone know what frame/poly rates PH3NOM's libGL gets?
park
DCEmu Junior
Posts: 46
Joined: Tue Jun 24, 2008 10:46 am
Has thanked: 0
Been thanked: 0

Re: How can I make this faster?

Post by park »

kazade wrote: Fri Aug 17, 2018 2:50 pm It's been pointed out to me that this might already be about as fast as it can go...

The quadmark benchmark... anyone know what frame/poly rates PH3NOMs libGL gets?
Phenom has talked about a KOS benchmark a couple of times, but this is the only post I can remember off the top of my head. I remember it because I posted in reply that the author of the Makaron emulator said Trigger Heart on the DC peaked at 44000 triangles, 33000 verts, per frame at 60 fps:
Right, my initial tests using KGLX were maxing around 150k triangles / sec @ 60fps.

But I believe pvr_mark gets ~1.2 mil verts/sec, not triangles/sec. Triangles would be 400000/sec, or 6666 triangles/frame @ 60fps.

Using an early build of my GL library on DC, I made a benchmark of ~1.18 mil verts/sec @ 60fps.
In this benchmark, a bunch of "particles" are created dynamically, with a random lifespan, color, direction and velocity, translucent, untextured.
Each particle is 4 vertex quad, that is actually a triangle strip of 2 triangles for the PVR.
So, at 1.18 mil verts/sec @ 60fps is actually 590000 triangles / sec, or 9833 triangles/frame.
But as TapamN mentioned, this is certainly not an optimized benchmark, but rather a brute-force method.

Re: How can I make this faster?

Post by kazade »

Right, yeah I need to do some profiling - GLdc isn't managing nearly that much :(

Re: How can I make this faster?

Post by kazade »

OK, I'm starting to wonder if I'm missing something obvious here. I've spent about a week just profiling and optimising the draw code in GLdc. When running the first quadmark test (with clipping disabled) it spends roughly 11ms copying user-submitted vertex data, under 2ms transforming, and under 2ms performing the perspective divide. Then it spends just over 3ms submitting the OP, TR and PT lists.

Even though I've drastically reduced these times (probably by two thirds), I'm not seeing the poly-per-second gains I'd expect in the quadmark benchmark. It's hovering around 57000pps, just as it was before I started optimising.

If anyone with any knowledge of the SH4/PVR has any time, I'd appreciate someone looking over the GLdc code and seeing if there is anything obviously wrong with it.

Re: How can I make this faster?

Post by park »

Maybe it's the benchmark? I am not a programmer, but in an effort to help you out I'll post what other homebrewers have said about increasing geometry output on the Dreamcast. They pretty much say that stripped models, which approach a 1:1 vertex-to-polygon ratio, can massively increase how much you can display. I don't think this will help you much, but who knows. This stuff was written over at assembles years back.
My current transform and lighting code is all in SH4 asm and takes 40 cycles per vertex (loading of vertex data, transform, light, project, send to the PowerVR) and the pipeline stalls for 1-2 cycles only. The code is 95% free of any kind of stall, and manages to complete 2 instructions every cycle with loads of parallelism from cache prefetches from RAM and cache flushes to the tile accelerator. Achieving this level of performance with no hardware feedback on stalls and cache misses is virtually impossible, and the Katana devbox provides this vital information.
On my demo I worked with real meshes like characters, cars, etc., and my exporter cuts the geometry size in half with zero detail loss when stripping them. So I have 80000 verts / 3 = 26666 real triangles per frame with one light and one texture. If I enable stripping you can double this number, so it is 53333 "effective" triangles per frame. At 60 fps with stripping enabled you get 53333 * 60 = 3.2 million triangles per second. Along with a 640x480 24bpp framebuffer, you are left with 1.2 megabytes for textures, which is not much. So the whole concept is to make it superfast to free CPU time, since if you push too many vertices you won't have enough space for textures...
anonymous: I've done some homebrew for the DC. I can get the DC to max out at about 4.1 million polygons per second with ~30% CPU idle time left over. At that point, the PVR can't go any faster.

My test was drawing lit, transformed, textured, anisotropically filtered tori at 60 FPS. Lighting was a dot, clamp to >0.0, and ambient add. There were 144 tori, each was a single triangle strip containing 480 vertices, or 478 polygons. All of these tori were rotating around in front of the camera. You could probably get a higher polygons-sent-to-TA number if you were sending polygons that are off screen or something, and can be culled by the PVR.
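The figures in the quoted test can be sanity-checked with a little arithmetic. The sketch below is only an illustration (the function names are made up here, not from any library): a triangle strip of n vertices yields n - 2 triangles, so 144 strips of 480 vertices come to 68832 triangles per frame, or roughly 4.13 million per second at 60 FPS.

```c
#include <stdint.h>

/* A triangle strip of n vertices yields n - 2 triangles (none below 3). */
uint32_t strip_triangles(uint32_t verts)
{
    return verts < 3 ? 0 : verts - 2;
}

/* Triangles per frame for `strips` identical strips. */
uint32_t triangles_per_frame(uint32_t strips, uint32_t verts_per_strip)
{
    return strips * strip_triangles(verts_per_strip);
}

/* Triangles per second at a given frame rate. */
uint64_t triangles_per_second(uint32_t strips, uint32_t verts_per_strip,
                              uint32_t fps)
{
    return (uint64_t)triangles_per_frame(strips, verts_per_strip) * fps;
}
```

The same helper covers the particle case mentioned earlier in the thread: a 4-vertex quad sent as a strip gives strip_triangles(4) == 2 triangles.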
bogglez
Moderator
Posts: 578
Joined: Sun Apr 20, 2014 9:45 am
Has thanked: 0
Been thanked: 0

Re: How can I make this faster?

Post by bogglez »

Hi.

In terms of the hardware, you need to make efficient use of the instruction and data caches here. park quoted someone saying "I can get the DC to max out at about 4.1 million polygons per second with ~30% CPU idle time left over", but even the 70% of the time the CPU is busy could mostly be spent waiting for data, which doesn't actually count as "idle".

Instruction cache:
To optimize usage of the instruction cache you want to
- get as many ifs out of the loop as possible and
- split the loop into multiple loops after each other.

Most of the time you will be drawing hundreds of triangles, yet you're running the if(drawing_quads) code on every iteration. That not only messes with branch prediction but also needlessly clutters the icache with quad code although you're guaranteed to be drawing triangles. Use a specialized function each for triangles and quads to reduce the number of instructions in the loop, and reduce branching to improve instruction prefetching and cut down on RAM access.

You probably shouldn't do transformation, lighting, etc. together in one loop, because this will fill the icache too much, or throw lighting instructions out of the icache to make room for the transformation instructions and vice versa. The cache uses a direct mapping, so if two addresses modulo some value happen to land on the same cache line, they keep evicting each other and your performance will tank hard.
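To make the "get the ifs out of the loop" advice concrete, here is a minimal sketch in C. It is not GLdc's code: the vec3 type, the prim_mode enum and the function names are all hypothetical, and the quad reordering (0, 1, 3, 2) is just one common way to turn a quad into a two-triangle strip. The point is only that the primitive type is tested once per call, so each inner loop contains code for one primitive.

```c
#include <stddef.h>

typedef struct { float x, y, z; } vec3;                 /* hypothetical */
typedef enum { PRIM_TRIANGLES, PRIM_QUADS } prim_mode;  /* hypothetical */

/* Triangle-only loop: no per-vertex mode test in the loop body. */
static size_t emit_triangles(const vec3 *in, vec3 *out, size_t count)
{
    size_t n = count - count % 3;   /* ignore an incomplete last triangle */
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i];
    return n;
}

/* Quad-only loop: reorder each quad 0,1,2,3 into strip order 0,1,3,2. */
static size_t emit_quads(const vec3 *in, vec3 *out, size_t count)
{
    size_t n = count - count % 4;   /* ignore an incomplete last quad */
    for (size_t i = 0; i < n; i += 4) {
        out[i + 0] = in[i + 0];
        out[i + 1] = in[i + 1];
        out[i + 2] = in[i + 3];
        out[i + 3] = in[i + 2];
    }
    return n;
}

/* The branch happens once per call instead of once per vertex. */
size_t emit_vertices(prim_mode mode, const vec3 *in, vec3 *out, size_t count)
{
    switch (mode) {
    case PRIM_TRIANGLES: return emit_triangles(in, out, count);
    case PRIM_QUADS:     return emit_quads(in, out, count);
    }
    return 0;
}
```

Whether this helps in practice depends on the compiler and on how much code the per-primitive paths really contain, so it is worth measuring before and after.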

Data cache:
Make sure that the data in vptr, cptr, etc. doesn't fall into the same data cache line. I forgot the cache size on the DC, but let's say it was 16 KB. If your coordinates and color data are around 16 KB apart (that would be 1365 float vec3s), chances are they'll fall into the same data cache line and evict each other's data.
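A rough way to reason about this is to model a direct-mapped cache as "address modulo cache size, divided by line size". The sketch below does exactly that. Both constants are assumptions for illustration (16 KB as guessed in the post, and a 32-byte line), and the real hardware derives the index from specific address bits, so treat this as a simplified model rather than a description of the SH4.

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE 16384u  /* assumed cache size in bytes, as in the post */
#define LINE_SIZE  32u     /* assumed cache line size in bytes */

/* Slot a byte address maps to in a simple direct-mapped cache model. */
uint32_t cache_line_index(uintptr_t addr)
{
    return (uint32_t)((addr % CACHE_SIZE) / LINE_SIZE);
}

/* True if the two addresses compete for the same cache slot. */
bool same_cache_slot(uintptr_t a, uintptr_t b)
{
    return cache_line_index(a) == cache_line_index(b);
}
```

Under this model, two arrays whose start addresses differ by a multiple of 16 KB collide element for element when walked in lockstep, which is the situation described above. Padding one of the arrays by a cache line or more would shift it to a different slot.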

It all strongly depends on how your input data is laid out. If coordinates and normals are separate arrays, struct mesh { vec3 coords[]; vec2 uvs[]; vec3 normals[]; } for example, you will want to make a pass through all coordinates first, then all normals (your icache will only have coordinate transformation instructions, data cache only coordinates, etc.).

If the input data is interleaved struct vertex { vec3 coords; vec2 uv; vec3 normals }; struct mesh { vertex vertices[]; } then you'd throw out already loaded normal data by only applying coordinate transformation, so tiling is preferred.
Then, instead of only applying transformation to all 1k or so triangles (then lighting, etc), you will want to work on "tiles" of data https://en.wikipedia.org/wiki/Loop_nest_optimization
You'd work on as many vertices as fit into the dcache and apply all functions on that data, then move over to the next tile of data.
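A minimal sketch of that tiling idea for interleaved data, in C. Everything here is illustrative: the vertex layout, the tile size of 128 vertices (4 KB with this 32-byte struct, picked to sit well inside an assumed 16 KB data cache), and the two stand-in passes are assumptions, not GLdc's pipeline. Each pass is its own small loop, and both passes run over one tile before moving on, so the second pass finds the tile's vertices still cached.

```c
#include <stddef.h>

/* Hypothetical interleaved vertex: 8 floats = 32 bytes. */
typedef struct {
    float pos[3];
    float uv[2];
    float normal[3];
} vertex;

/* Assumed tile size: 128 vertices * 32 bytes = 4 KB per tile. */
#define TILE_VERTS 128

/* Pass 1: stand-in for a real transform, just translates positions. */
static void transform_pass(vertex *v, size_t n, const float offset[3])
{
    for (size_t i = 0; i < n; ++i)
        for (int k = 0; k < 3; ++k)
            v[i].pos[k] += offset[k];
}

/* Pass 2: simple diffuse term, dot(normal, light_dir) clamped to >= 0. */
static void lighting_pass(const vertex *v, float *intensity, size_t n,
                          const float light_dir[3])
{
    for (size_t i = 0; i < n; ++i) {
        float d = v[i].normal[0] * light_dir[0]
                + v[i].normal[1] * light_dir[1]
                + v[i].normal[2] * light_dir[2];
        intensity[i] = d > 0.0f ? d : 0.0f;
    }
}

/* Run every pass over one tile before touching the next tile. */
void process_tiled(vertex *verts, float *intensity, size_t count,
                   const float offset[3], const float light_dir[3])
{
    for (size_t start = 0; start < count; start += TILE_VERTS) {
        size_t left = count - start;
        size_t n = left < TILE_VERTS ? left : TILE_VERTS;
        transform_pass(verts + start, n, offset);
        lighting_pass(verts + start, intensity + start, n, light_dir);
    }
}
```

The right tile size is something to find by profiling; too large and the tile no longer fits in cache, too small and the per-tile call overhead starts to show.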

You could actually implement 2 different drawing functions depending on how the input data is laid out (https://en.wikipedia.org/wiki/AOS_and_SOA).
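One way to pick between two such functions is to look at the stride the caller passes in, in the spirit of OpenGL vertex arrays, where a stride of 0 means "tightly packed". The sketch below is an assumption-laden illustration, not GLdc's API: the function names are hypothetical, positions are assumed to be three floats, and the strided path assumes the stride is a multiple of sizeof(float) and the source pointer is float-aligned.

```c
#include <stddef.h>

/* Fast path: positions are tightly packed (x0 y0 z0 x1 y1 z1 ...). */
static void copy_positions_packed(const float *src, float *dst, size_t count)
{
    for (size_t i = 0; i < count * 3; ++i)
        dst[i] = src[i];
}

/* General path: positions sit inside larger vertices, `stride` bytes
 * apart. Assumes float-aligned data and a stride that is a multiple of
 * sizeof(float). */
static void copy_positions_strided(const unsigned char *src, size_t stride,
                                   float *dst, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        const float *p = (const float *)(const void *)(src + i * stride);
        dst[i * 3 + 0] = p[0];
        dst[i * 3 + 1] = p[1];
        dst[i * 3 + 2] = p[2];
    }
}

/* Choose the loop once per call, based on the layout of the input. */
void copy_positions(const void *src, size_t stride, float *dst, size_t count)
{
    if (stride == 0 || stride == 3 * sizeof(float))
        copy_positions_packed((const float *)src, dst, count);
    else
        copy_positions_strided((const unsigned char *)src, stride, dst, count);
}
```

For example, a caller with 8-float interleaved vertices would pass a stride of 8 * sizeof(float), while a caller with a separate coordinate array would pass 0.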

Also make sure that the output destination doesn't fall into the same data cache line as the input data, or your writes to the output will evict the input data, and later reads will go to RAM instead of using the much faster cache. You can bitwise-OR the output address in a special way so that writes don't update the cache (since you will not read from the output for now anyway and don't want it to evict input data).

This video may help you: https://www.youtube.com/watch?v=rX0ItVEVjHc
Wiki & tutorials: http://dcemulation.org/?title=Development
Wiki feedback: viewtopic.php?f=29&t=103940
My libgl playground (not for production): https://bitbucket.org/bogglez/libgl15
My lxdream fork (with small fixes): https://bitbucket.org/bogglez/lxdream
lerabot
Insane DCEmu
Posts: 134
Joined: Sun Nov 01, 2015 8:25 pm
Has thanked: 2 times
Been thanked: 19 times

Re: How can I make this faster?

Post by lerabot »

Super interesting information Bogglez!

Re: How can I make this faster?

Post by kazade »

Bogglez, that's amazing, thank you! And you, park!
BB Hood
DC Developer
Posts: 189
Joined: Fri Mar 30, 2007 12:09 am
Has thanked: 41 times
Been thanked: 10 times

Re: How can I make this faster?

Post by BB Hood »

This article is related and may help also:

http://dev.dcemulation.org/tutorials/me ... zation.htm

Re: How can I make this faster?

Post by bogglez »

BB Hood wrote: Fri Aug 31, 2018 11:12 am This article is related and may help also:

http://dev.dcemulation.org/tutorials/me ... zation.htm
Seems useful, so I added it to the wiki with better formatting: http://dcemulation.org/?title=Efficient ... amcast_RAM