bogglez wrote:
I was wondering whether you can replace those 8 dot products with two matrix multiplications to gain some speed. Put the vectors on the right side into the matrix and multiply it by cv, then do the if checks after the matrix multiplication (or just write "bbox_in = result[0] > 0 + result[1] > 0 ...). That should take 1/4th the time + some overhead to set up the matrix. It may also improve on the branching. Maybe that's faster?
For the bounding box algorithm, each matrix would only transform 1 vector.
I think you're missing something there. You have the following code:
Use cv as the vector to multiply with this matrix. Multiplying a matrix row with a vector is the same as a dot product. So you're doing 4 dot products. Or did I misunderstand you?
But I still don't think that would be the most efficient approach.
The problem is that the bounding box is per face, and there are several faces, i.e. several bounding boxes to be checked.
Therefore each bounding box is only used once, meaning that as a matrix it would be constructed and loaded into the matrix registers for the sole purpose of transforming a single vector. That is what I meant in my last post...
Thing is, I believe matrix transforms are most efficient when used as follows:
(pseudocode)
construct_matrix(m);
load_matrix(m);
for (i = 0; i < count; i++)
    transform_vector(v);
Basically, construct and then load the matrix, and then transform a batch of vertices.
However, using matrix transforms for the bounding box algorithm would look like this:
for (i = 0; i < count; i++)
{
    construct_matrix(m1);
    load_matrix(m1);
    transform_vector(vsrc, vdst); // Notice the input vector is not overwritten; the transform is stored to a separate output vector
    construct_matrix(m2);
    load_matrix(m2);
    transform_vector(vsrc, vdst);
}
As two separate matrices need to be constructed and then loaded into the matrix registers for each face, each face of the BSP requires re-loading the matrix registers twice. Looking at matrix.s in KOS, mat_load(...) requires at least 11 cycles; times two, we're obviously spending 22 cycles per face just loading matrices into the registers.
In the end, I think ftrv is just 4 calls to fipr in pipelined succession. I can imagine that 4 calls to fipr, after compiler optimizations, might produce throughput very close to ftrv, minus the time wasted loading the matrix registers. But it would be interesting to benchmark, just for kicks.
But thank you again and please don't hesitate to share your thoughts further!
I think if you factor in animations, sounds, game logic and physics (and texturing to a lesser extent), the CPU time can go up quite dramatically. As it is, you would only be allowed to add 12ms and still hit 30 FPS.
Where do the 16ms of CPU time come from, exactly? Are you by chance repeating the same transformations for all 4 players? I'm not sure right now whether walking the BSP 4 times results in redundancies that you could avoid.
EDIT:
Also, from the little code I can see in your last screenshot (which looks awesome): there is no need to recalculate the perspective matrix. It is constant (as long as you don't resize the window, which won't happen), so just store the 4 perspective matrices for the players and use glLoadMatrixf.
Likewise, don't recalculate the lookat matrix every frame; only do so when the view angle or position of the player changes. In a first-person shooter this will admittedly be the case most of the time, but there's still no harm in doing it: it improves the best-case scenario while the worst case stays the same.
The glLoadIdentity() call is also wasted if you just load the finished lookat matrix instead.
You make some good points there.
As I still need to finish my code implementing the PVS system using glDrawArrays (my initial pass at the PVS system was actually using immediate mode), things are probably not the best they can be.
For now, I am parsing every face of the BSP into an array that can be submitted in a single call to glDrawArrays().
Then, when rendering 4 viewports, the entire BSP is rendered 4 times.
Because the cameras are at separate positions, we cannot reuse the transformed vertices between players.
If you are curious about the render matrices, I construct them as follows:
void _glKosMatrixApplyRender() {
    mat_load(Matrix + GL_SCREENVIEW);  // matrix set by glViewport
    mat_apply(Matrix + GL_PROJECTION); // matrix set by gluPerspective
    mat_apply(&MatrixLookAt);          // matrix set by gluLookAt
    mat_apply(Matrix + GL_MODELVIEW);  // matrix set by the user when glMatrixMode == GL_MODELVIEW
    mat_store(Matrix + GL_RENDER);
}

void _glKosMatrixLoadRender() {
    mat_load(Matrix + GL_RENDER);
}
What you can't see in the code in that screenshot is this:
Because I want the user to be able to change the Display Aspect Ratio depending on the monitor they are using, the DAR is, for now, a variable that can be set in real time. This means the Perspective Matrix is not constant.
But your approach of pre-calculating matrices that do not strictly need to be recalculated each frame makes a lot of sense.
Another bump, I have finally gotten the Light Maps to render correctly.
However, this process is using a 2-pass render approach to achieve the Multi-Texture, meaning every vertex gets transformed / clipped twice.
This means the next step is for me to add a solid Multi-Texture system to the OpenGL API, where each vertex only needs to be transformed / clipped once.
Still not using VQ textures; I actually looked at using the VQ encoder posted in your thread here viewtopic.php?f=29&t=103369
but I am not a user of that Qt environment and have not had any luck getting that code to compile on Windows.
Can someone post a Windows executable of that VQ encoder that supports rectangular textures?
At any rate, I have made some updates to my OpenGL API to support a basic GL_ARB_multitexture. http://www.dei.isep.ipp.pt/~matos/cg/do ... RB.3G.html
The first thing I did was re-organize the clipping code and add support for clipping vertices that carry 2 sets of UV coordinates.
Next, I updated the texture binding code to allow 2 texture units to be bound, using glActiveTextureARB(...).
Finally, I added support for submitting multiple texture coordinate arrays when using glDrawArrays, via glClientActiveTextureARB(...).
So, my first pass at a working Multi-Texture system supporting the minimum requirement of 2 texture units is up and running; time for some testing...
In order to test things out, I had to pre-process the BSP faces into arrays, each array containing all of the vertices from every face that shares the same lightmap and texture ID.
As a result, the main draw subroutine looks like this:
This code is rendering every face of the BSP, without using the PVS system, and every vertex is being NearZ Clipped.
Test 1: Rendering using a 2-Pass approach. Result: 20msec/frame = ~51fps
Test 2: Rendering using a 1-Pass approach using OpenGL Multi-Texture. Result: 17msec/frame = ~58fps
We can see an increase of 7fps in this scenario.
Test 1.1: Bigger Map Rendering using a 2-Pass approach. Result: 34msec/frame = ~29fps
Test 2.1: Bigger Map Rendering using a 1-Pass approach using OpenGL Multi-Texture. Result: 26msec/frame = ~39fps
We can see an increase of 10fps in this scenario.
In conclusion, I have finished my investigation on the topic of this thread.
In closing, this is what it looks like to render Quake 3 BSP's without Light Maps:
And this is what it looks like to render Quake 3 BSP's with Light Maps using KGL Multi-Texture:
I was wondering how much texture memory and RAM you're using, because sounds, meshes, etc. need to be loaded for a full-blown game.
BTW, this looks like a great benchmark. If you were to move a camera through this scene on a fixed path and record performance statistics, the data could be used for profiling in the future, especially if some sections of the scene show special features (e.g. a high-poly Sonic model in one room with lighting, many small meshes or a particle system in another room, etc.).
Hmm, good idea about setting a fixed path for the camera to create a consistent benchmark.
These maps are using just under 2MB of RAM and ~1.75MB of VRAM, including the Light Map textures.
I have actually finished my 2nd pass at the Multi-Texture system, improving performance even further.
My first approach was the obvious one: for every vertex submitted, after all processing is done (lighting, transforming, clipping, etc.), copy the resulting vertex into the TR vertex buffer with the 2nd u/v set.
This requires storing each vertex twice in the vertex buffer, plus the memory time of copying each whole vertex.
My new approach does not consume any extra space in the vertex buffer: it modifies the existing vertices in place, as a post-process after the original vertices have already been submitted to the PVR. This only costs the memory time of copying each u/v set, rather than the entire vertex.
This map saves ~3msec/frame and now sails at 60fps with ZClipping and Multi-Texturing applied to every vertex submitted, running just over 2 million verts/sec.
The bigger map saves ~5msec/frame, again with ZClipping and Multi-Texturing every vertex submitted, running just over 2.5 million verts/sec.