bogglez wrote:
I was wondering whether you can replace those 8 dot products with two matrix multiplications to gain some speed. Put the vectors on the right side into the matrix and multiply it by cv, then do the if checks after the matrix multiplication (or just write "bbox_in = result[0] > 0 + result[1] > 0 ...). That should take 1/4th the time + some overhead to set up the matrix. It may also improve on the branching. Maybe that's faster?
For the bounding box algorithm, each matrix would only transform 1 vector.
I think you're missing something there. You have the following code:
Use cv as the vector to multiply with this matrix. Multiplying a matrix row with a vector is the same as a dot product. So you're doing 4 dot products. Or did I misunderstand you?
But I still don't think that would be the most efficient approach.
The problem is that the bounding box is per face, and there are several faces, i.e. several bounding boxes to be checked.
Therefore each bounding box is only used once, meaning that as a matrix it would be constructed and loaded into the matrix registers for the sole purpose of transforming a single vector. That is what I meant in my last post...
Thing is, I believe matrix transforms are most efficient when used as follows:
(pseudocode)
construct_matrix(m);
load_matrix(m);
for (i = 0; i < count; i++)
    transform_vector(v);
Basically, construct and then load the matrix, and then transform a batch of vertices.
However, using matrix transforms for the bounding box algorithm would look like this:
for (i = 0; i < count; i++)
{
    construct_matrix(m1);
    load_matrix(m1);
    transform_vector(vsrc, vdst); // Notice the input vector is not overwritten; the transform is stored to a separate output vector
    construct_matrix(m2);
    load_matrix(m2);
    transform_vector(vsrc, vdst);
}
As two separate matrices need to be constructed and then loaded into the matrix registers for each face, each face of the BSP requires re-loading the matrix registers twice. Looking at matrix.s in KOS, mat_load(...) requires at least 11 cycles; times two, we're obviously spending 22 cycles per face just loading matrices into the registers.
In the end, I think ftrv is just 4 calls to fipr in pipelined succession. I can imagine that 4 calls to fipr, after compiler optimizations, might produce throughput very close to ftrv, minus the time wasted loading the matrix registers. But it would be interesting to benchmark, just for kicks.
But thank you again and please don't hesitate to share your thoughts further!
I think if you factor in animations, sounds, game logic and physics (and texturing to a lesser extent), the CPU time can go up quite dramatically. As it is, you would only be allowed to add 12ms and still hit 30 FPS.
Where do the 16ms of CPU time come from, exactly? Are you by chance repeating the same transformations for all 4 players? I'm not sure right now whether walking the BSP 4 times results in redundancies that you could avoid.
EDIT:
Also, from the little code I can see in your last screenshot (which looks awesome): there is no need to recalculate the perspective matrix. It is constant (as long as you don't resize the window, which won't happen), so just store the 4 perspective matrices for the players and use glLoadMatrixf.
Likewise, don't recalculate the lookat matrix every frame; only do so when the view angle or position of the player changes. In a first-person shooter this will admittedly be the case most of the time, but there's still no harm in doing it: it improves the best-case scenario while the worst case stays the same.
The glLoadIdentity() call is also wasted if you just load the finished lookat matrix instead.
You make some good points there.
As I still need to finish my code implementing the PVS system using glDrawArrays (my initial pass at the PVS system was actually using immediate mode), things are probably not the best they can be.
For now, I am parsing every face of the BSP into an array that can be submitted in a single call to glDrawArrays().
Then, when rendering 4 viewports, the entire BSP is rendered 4 times.
Because the cameras are at separate positions, we cannot reuse the transformed vertices between players.
If you are curious about the render matrices, I construct them as follows:
void _glKosMatrixApplyRender() {
    mat_load(Matrix + GL_SCREENVIEW);  // matrix set by glViewport
    mat_apply(Matrix + GL_PROJECTION); // matrix set by gluPerspective
    mat_apply(&MatrixLookAt);          // matrix set by gluLookAt
    mat_apply(Matrix + GL_MODELVIEW);  // matrix set by the user when glMatrixMode == GL_MODELVIEW
    mat_store(Matrix + GL_RENDER);
}

void _glKosMatrixLoadRender() {
    mat_load(Matrix + GL_RENDER);
}
What you can't see in the code in that screenshot is this:
Because I want the user to be able to change the Display Aspect Ratio depending on the monitor they are using, the DAR is, for now, a variable that can be set in real time. This means the Perspective Matrix is not constant.
But your approach of pre-calculating matrices that do not strictly need to be recalculated each frame makes a lot of sense.
Another bump, I have finally gotten the Light Maps to render correctly.
However, this process is using a 2-pass render approach to achieve the Multi-Texture, meaning every vertex gets transformed / clipped twice.
This means the next step is for me to add a solid Multi-Texture system to the OpenGL API, where each vertex only needs to be transformed / clipped once.
Still not using VQ textures; I actually looked at using the VQ encoder posted in your thread here viewtopic.php?f=29&t=103369
but I am not a user of that Qt environment and have not had any luck getting that code to compile on Windows.
Can someone post a Windows executable of that VQ encoder that supports rectangular textures?
At any rate, I have made some updates to my OpenGL API to support a basic GL_ARB_multitexture. http://www.dei.isep.ipp.pt/~matos/cg/do ... RB.3G.html
The first thing I did was re-organize the clipping code and add support for clipping vertices that carry 2 sets of UV coordinates.
Next, I updated the texture binding code to allow 2 texture units to be bound, using glActiveTextureARB(...).
Finally, I added support for submitting multiple texture coordinate arrays when using glDrawArrays, via glClientActiveTextureARB(...).
So, my first pass at a working Multi-Texture system supporting the minimum requirement of 2 texture units is up and running; time for some testing...
In order to test things out, I had to pre-process the BSP faces into arrays, each array containing all of the vertices from every face that shares the same lightmap and texture ID.
As a result, the main draw subroutine looks like this:
This code is rendering every face of the BSP, without using the PVS system, and every vertex is being NearZ Clipped.
Test 1: Rendering using a 2-Pass approach. Result: 20msec/frame = ~51fps
Test 2: Rendering using a 1-Pass approach using OpenGL Multi-Texture. Result: 17msec/frame = ~58fps
We can see an increase of 7fps in this scenario.
Test 1.1: Bigger Map Rendering using a 2-Pass approach. Result: 34msec/frame = ~29fps
Test 2.1: Bigger Map Rendering using a 1-Pass approach using OpenGL Multi-Texture. Result: 26msec/frame = ~39fps
We can see an increase of 10fps in this scenario.
In conclusion, I have finished my investigation on the topic of this thread.
In closing, this is what it looks like to render Quake 3 BSP's without Light Maps:
And this is what it looks like to render Quake 3 BSP's with Light Maps using KGL Multi-Texture:
I was wondering how much texture memory and RAM you're using, because sounds, meshes, etc. need to be loaded for a full-blown game.
BTW, this looks like a great benchmark. If you were to move a camera through this scene on a fixed path and record performance statistics, the data could be used for profiling in the future, especially if some sections of the scene show special features (e.g. a high-poly Sonic model in one room with lighting, many small meshes or a particle system in another room, etc.).
Hmm, good idea about setting a fixed path for the camera to create a consistent benchmark.
These maps are using just under 2MB of RAM and ~1.75MB of VRAM, including the Light Map textures.
I have actually finished my 2nd pass at the Multi-Texture system, improving performance even further.
My first approach was the obvious one: for every vertex submitted, after all processing is done (lighting, transforming, clipping, etc.), copy the resulting vertex into the TR vertex buffer with the 2nd u/v set.
This requires storing each vertex twice in the vertex buffer, plus the memory time of copying each whole vertex.
My new approach does not consume any extra space in the vertex buffer: it modifies the existing vertices in place, as a post-process after the original vertices have already been submitted to the PVR. This only costs the memory time of copying each u/v set, rather than the entire vertex.
This map saves ~3msec/frame and now sails at 60fps with ZClipping and Multi-Texturing applied to every vertex submitted, running just over 2 million verts/sec.
The bigger map saves ~5msec/frame, again with ZClipping and Multi-Texturing every vertex submitted, running just over 2.5 million verts/sec.