OpenGL - New Build in the works

PH3NOM · Post by **PH3NOM** » Fri Feb 01, 2013 9:41 pm

So, I have started a new thread instead of hijacking this one
viewtopic.php?f=29&t=102181&start=60

As I mentioned, I have been working on a new build of OpenGL for the DC.
It is still an early work in progress, but it is already faster than KGL, and I am releasing the source in the hope of getting advice on improvements.

I am posting a small demo, with full source included along with a .cdi disc image to burn and test.

gl-particle-test.rar: (744.99 KiB) Downloaded 163 times

** This is a simple "Random Particle Generator" (C) Josh PH3NOM Pearson 2013.
** Written to test my GL API, this example demonstrates several things:
** -GL Pipeline Vertex Throughput ( Also PVR TR Poly Throughput )
** -GL Pipeline Mixed Submission of Opaque and Transparent Polys.
** -KOS C++ Functionality/Speed
** -KOS C++ Dynamic Memory Usage

Use d-pad to move cursor, and press 'start' to begin particle generation.

RyoDC · Post by **RyoDC** » Sat Feb 02, 2013 7:46 am

Cool stuff!
PH3NOM, so does OpenGL for the DC run as fast as the PVR API does?

PH3NOM · Post by **PH3NOM** » Sun Feb 03, 2013 9:24 pm

RyoDC wrote:Cool stuff!
PH3NOM, so does OpenGL for the DC run as fast as the PVR API does?

Well, it sends Vertex Data to the PVR directly ( via DMA or SQ, depending on how you decide to compile it ), instead of using the KOS PVR functions.
So, in that regard, yes. But we also have to transform each vertex using the SH4's matrix routines.
That said, GL must also perform certain things behind the scenes to make things easy for the user.
In the end, things are as fast as they can possibly be while conforming to the GL API standards. ( Unless TapamN can step in and help )

To show how Texture Binding works in my current GL API, I have created a new demo, "Kamikaze v.0.1".

I have included the full source code. Please note, my GL Library is a work in progress and not meant for outside use.

gltest02-kamikaze01.rar: (834.02 KiB) Downloaded 139 times

Bouz · Post by **Bouz** » Mon Feb 04, 2013 1:49 pm

Nice work! What solution did you finally choose to handle the three different poly list types?

GyroVorbis · Post by **GyroVorbis** » Mon Feb 04, 2013 4:26 pm

PH3NOM wrote:Well, it sends Vertex Data to the PVR directly ( via DMA or SQ, depending on how you decide to compile it ), instead of using the KOS PVR functions.

Does that mean this flag controls whether you use the intermediate RAM buffer + DMA approach (for being able to switch between PVR list types) or the direct rendering with the SQs (and having to submit one list at a time) approach?

PH3NOM · Post by **PH3NOM** » Tue Feb 05, 2013 6:51 pm

Kamikaze v.0.2 is Now Available, with source included.
Press Y to shoot, Use D-Pad to move ship.
When you shoot, you use energy, so keep your eye on the meter!

gltest-kamikaze-02.rar: (1.3 MiB) Downloaded 149 times

Bouz wrote:Nice work! What solution did you finally choose to handle the three different poly list types?

Well, I used the method I described in the other thread
viewtopic.php?f=29&t=102181&start=60
But I only handle OP and TR polys, not PT. What are they used for?

GyroVorbis wrote:
PH3NOM wrote:Well, it sends Vertex Data to the PVR directly ( via DMA or SQ, depending on how you decide to compile it ), instead of using the KOS PVR functions.
Does that mean this flag controls whether you use the intermediate RAM buffer + DMA approach (for being able to switch between PVR list types) or the direct rendering with the SQs (and having to submit one list at a time) approach?

No, I am not using the KOS DMA buffer functions; I manage all of that myself. Have a look at GL/gl-render.c in the code I uploaded above. If you look at the last function in that file, "RenderCallback()", you can see that if DMA is enabled, Vertex Data is sent directly to the TA with the function "pvr_dma_load_ta()". If DMA is not enabled, SQs are used instead:

Code: Select all

sq_cpy((pvr_vertex_t*)  0x10000000, (pvr_vertex_t*)VERT_LIST[OP][i].vertex, 0x20*VERT_LIST[OP][i].vertices );

If you look at pvr.h you will see

Code: Select all

#define PVR_TA_INPUT		0x10000000	/* TA command input */
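
The 0x20 multiplier in the sq_cpy call above is the size in bytes of one vertex. A minimal sketch of that arithmetic, assuming a stand-in struct that mirrors a 32-byte PVR vertex layout (demo_vertex_t and vertex_bytes are hypothetical names for illustration, not KOS identifiers):

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for a 32-byte PVR vertex; hypothetical name, not the KOS type. */
typedef struct {
    uint32_t flags;     /* vertex command flags */
    float    x, y, z;   /* position */
    float    u, v;      /* texture coordinates */
    uint32_t argb;      /* packed vertex color */
    uint32_t oargb;     /* packed offset color */
} demo_vertex_t;

/* Bytes to copy for a list of `count` vertices (0x20 bytes each). */
static size_t vertex_bytes(size_t count)
{
    return count * sizeof(demo_vertex_t);
}
```

Because each vertex is exactly 32 bytes, the total is always a multiple of 32, which matches the 32-byte granularity of the store queues.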

I am curious if you have made a benchmark of Vertex throughput using the KOS DMA set_vertbuf() stuff?

Post by **BlueCrab** » Tue Feb 05, 2013 8:50 pm

PH3NOM wrote:But I only handle OP and TR polys, not PT. What are they used for?

Punchthrus are polygons that have essentially one bit of alpha. Either a color is totally visible, or it is totally invisible (like, for instance, ARGB1555 textures). Punchthrus are much faster to render than translucent polygons. The fill-rate for punchthrus is essentially equivalent to opaque polygons.
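
To illustrate the "one bit of alpha" idea, here is a minimal sketch of packing an ARGB1555 texel, where the top bit alone decides whether the texel is visible. The helper names are hypothetical and not part of KOS:

```c
#include <stdint.h>

/* Pack a texel as ARGB1555: 1 alpha bit, then 5 bits each of R, G, B. */
static uint16_t pack_argb1555(int visible, unsigned r5, unsigned g5, unsigned b5)
{
    return (uint16_t)(((visible ? 1u : 0u) << 15) |
                      ((r5 & 0x1Fu) << 10) |
                      ((g5 & 0x1Fu) << 5)  |
                       (b5 & 0x1Fu));
}

/* A texel is either fully visible (1) or fully invisible (0). */
static int texel_is_visible(uint16_t texel)
{
    return (texel >> 15) & 1;
}
```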

I am curious if you have made a benchmark of Vertex throughput using the KOS DMA set_vertbuf() stuff?

I think there is a benchmark of vertex throughput with the dma in the KOS examples, if I'm not mistaken. I think it was called serpent_dma or something like that. I also remember there being a version of pvrmark with dma, although it may well not be in the examples.

PH3NOM · Post by **PH3NOM** » Fri Feb 08, 2013 12:20 pm

BlueCrab wrote:
PH3NOM wrote:But I only handle OP and TR polys, not PT. What are they used for?
Punchthrus are polygons that have essentially one bit of alpha. Either a color is totally visible, or it is totally invisible (like, for instance, ARGB1555 textures). Punchthrus are much faster to render than translucent polygons. The fill-rate for punchthrus is essentially equivalent to opaque polygons.

I am curious if you have made a benchmark of Vertex throughput using the KOS DMA set_vertbuf() stuff?
I think there is a benchmark of vertex throughput with the dma in the KOS examples, if I'm not mistaken. I think it was called serpent_dma or something like that. I also remember there being a version of pvrmark with dma, although it may well not be in the examples.

Oh ok thanks for the info.

But, looking at the "Serpent DMA" in KOS, it is using mat_transform_sq

/* Transform and write vertices to the TA via the store queues */

But that seems to contradict what you have said about mixing vertex submission modes while Vertex DMA is enabled.
So, I still don't understand...

Post by **BlueCrab** » Fri Feb 08, 2013 9:22 pm

The store queue stuff that is going on in the serpent_dma example is actually using the store queue to write to the DMA buffer in main ram. I'm guessing the comment was a remnant of some earlier iteration that did use the store queues to write directly to the TA.

pvr_vertbuf_tail() returns a pointer to the vertex buffer for the specified list in main RAM, and since that is being set as the SQ destination, it is definitely writing to main RAM and not to the TA directly.

Basically, the store queues are used there so that the mat_transform_sq() function can still be used to do the transformation, at least that is my guess. I didn't write the example, so I can't say for sure, but that's the only logical reason for it.

Jae686 · Post by **Jae686** » Sun Feb 10, 2013 4:36 pm

Looking forward to testing it (as soon as I have my DC fixed)

Anthony817 · Post by **Anthony817** » Mon Feb 11, 2013 7:58 pm

Wow! Keep up the great work phenom, always nice to see this site keeping the Dreamcast alive with new homebrew stuff!

PH3NOM · Post by **PH3NOM** » Wed Feb 20, 2013 8:42 pm

So, I have worked out a basic transform stack with my build of GL, and began some tests with rendering different scenes.

Immediately, I observed some strange behaviour of polygons behind other polygons showing through each other.

This was fixed by setting the poly context depth flag to PVR_DEPTHWRITE_ENABLE for opaque polygons.
This is done in my build by the call glEnable(GL_DEPTH_TEST);

Next, I noticed the PVR does not like receiving vertices that are outside of the view frustum; some very nasty results can occur, up to and including a drop in framerate down to 30fps with graphical glitches on screen.

So that brings me to this topic, Clipping polygons to the view frustum. KGL obviously had some problems with clipping, as seen in several demos ( anyone try that "open dynamics" buggy demo? http://www.boob.co.uk/devtools.html ) and some testing I did a while back viewtopic.php?f=29&t=102059

So, I really want to implement a nice but fast clipping algorithm.
I have read a basic outline of the Sutherland-Hodgman algorithm, and I have begun my own implementation loosely based on that outline.
My algorithm basically walks the vertices, determining whether each one is inside the view frustum, and then transforms the vertex if needed.
At first it seems simple, but then you realize vertices may also be added to the new clipped polygon...
For example, I have tested using a bounding box within the view frustum for visual confirmation.

In the initial scene, we see a triangle centered within a box.

Moving the triangle to the right without clipping it, the geometry continues beyond the wall of the box.

Now, with my clip algorithm enabled, the triangle becomes a quadrilateral and remains inside of the box.
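
The triangle-to-quadrilateral case above can be sketched as one pass of Sutherland-Hodgman clipping against a single plane, x <= xmax. This is a minimal sketch of the textbook algorithm rather than the implementation discussed in this thread; vec3, intersect_x and clip_poly_xmax are hypothetical names:

```c
#include <stddef.h>

typedef struct { float x, y, z; } vec3;

/* Point where edge a->b crosses the plane x = xmax. */
static vec3 intersect_x(vec3 a, vec3 b, float xmax)
{
    float t = (xmax - a.x) / (b.x - a.x);   /* 0.0 at a, 1.0 at b */
    vec3 p = { xmax, a.y + t * (b.y - a.y), a.z + t * (b.z - a.z) };
    return p;
}

/* Clip a convex polygon against x <= xmax.
 * `out` must have room for n + 1 vertices. Returns the new vertex count. */
static size_t clip_poly_xmax(const vec3 *in, size_t n, float xmax, vec3 *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        vec3 cur  = in[i];
        vec3 next = in[(i + 1) % n];
        int cur_in  = cur.x  <= xmax;
        int next_in = next.x <= xmax;

        if (cur_in)
            out[count++] = cur;                          /* keep inside vertex */
        if (cur_in != next_in)
            out[count++] = intersect_x(cur, next, xmax); /* edge crosses plane */
    }
    return count;
}
```

A triangle with one vertex outside the plane loses that vertex and gains two intersection points, which is why three vertices become four.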

I am posting mainly because I am curious what solutions other devs here have used to manage 3D clipping for the PVR. Any thoughts are welcome.

Oh yeah, here is the function I wrote to clip a vertex on the x axis, say to a view frustum x min or max
It calculates the point of intersection between the vertices to the frustum, and transforms the vertex accordingly

Code: Select all

static matrix_t TM __attribute__((aligned(32))) =
{
      { 1.0f, 0.0f, 0.0f, 0.0f },
      { 0.0f, 1.0f, 0.0f, 0.0f },
      { 0.0f, 0.0f, 1.0f, 0.0f },
      { 0.0f, 0.0f, 0.0f, 1.0f }
};

static vector4f dvt; // Displacement Transform Vector

void LineClipFrustumX3fv( vector3f v1, vector3f v2, float fx )
{
     /* Calculate Displacement Vector ( |Dv| ) As a Matrix */
     TM[0][0] = v2[0]-v1[0]; 
     TM[1][1] = v2[1]-v1[1];
     TM[2][2] = v2[2]-v1[2];
     TM[3][3] = 1.0f;
     mat_load(&TM);

     /* Transform Clip Point ( |Dv|*mag ) */
     dvt[0] = dvt[1] = dvt[2] = (fx - v1[0])/TM[0][0]; /* Magnitude */
     dvt[3] = 1.0f;
     mat_trans_nodiv( dvt[0], dvt[1], dvt[2], dvt[3] );
     
     v1[0] += dvt[0];      /* Update the Vertices to Transformed Clip Point */
     v1[1] += dvt[1];
     v1[2] += dvt[2];
}

TapamN · Post by **TapamN** » Fri Mar 01, 2013 1:56 pm

PH3NOM wrote:Next, I noticed the PVR does not like receiving vertices that are outside of the view frustum; some very nasty results can occur, up to and including a drop in framerate down to 30fps with graphical glitches on screen.

It's just polygons that cross the near plane that are the problem (i.e. polygons that are partly in front of the camera and partly behind it). You don't need to bother clipping anything else; the PVR handles XY clipping and far clipping for you.
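
A minimal sketch of that near-plane test for a single triangle. The depth convention (distance in front of the camera, negative meaning behind it) and the names tri_class and classify_near are assumptions for illustration only:

```c
typedef enum { TRI_VISIBLE, TRI_REJECT, TRI_NEEDS_CLIP } tri_class;

/* depth[i] is the distance of vertex i in front of the camera (assumed
 * convention: larger means farther away, negative means behind the camera). */
static tri_class classify_near(const float depth[3], float znear)
{
    int in_front = 0;
    for (int i = 0; i < 3; i++)
        if (depth[i] >= znear)
            in_front++;

    if (in_front == 3) return TRI_VISIBLE;     /* submit unchanged */
    if (in_front == 0) return TRI_REJECT;      /* entirely behind the near plane */
    return TRI_NEEDS_CLIP;                     /* straddles the near plane */
}
```

Only the TRI_NEEDS_CLIP case has to go through the clipper; everything else is either submitted as is or dropped.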

The depth write enable bit is actually equivalent to the glDepthMask call. An enabled GL_DEPTH_TEST is how things normally work, but a disabled depth test is equivalent to setting the depth compare to GL_ALWAYS and disabling depth writes (but doesn't actually change the glDepthFunc or glDepthMask values).
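
One way to picture that mapping is to keep the three GL values separately and derive the hardware settings from them at submission time. This is a minimal sketch under that assumption; all type and function names here are hypothetical stand-ins, not KOS or GL identifiers:

```c
#include <stdbool.h>

/* Stand-in values; a real build would map these to the PVR depth settings. */
typedef enum { CMP_LESS, CMP_ALWAYS } depth_cmp;

typedef struct {
    bool      depth_test;   /* glEnable/glDisable(GL_DEPTH_TEST) */
    depth_cmp depth_func;   /* value stored by glDepthFunc */
    bool      depth_mask;   /* value stored by glDepthMask */
} gl_depth_state;

typedef struct {
    depth_cmp compare;      /* depth compare mode sent to the hardware */
    bool      write;        /* depth write enable sent to the hardware */
} hw_depth;

/* Derive the hardware settings without modifying the stored GL state. */
static hw_depth resolve_depth(const gl_depth_state *s)
{
    hw_depth hw;
    if (s->depth_test) {
        hw.compare = s->depth_func;
        hw.write   = s->depth_mask;
    } else {
        hw.compare = CMP_ALWAYS;   /* every fragment passes */
        hw.write   = false;        /* and the depth buffer is left untouched */
    }
    return hw;
}
```

Keeping the stored values untouched means that re-enabling the depth test restores whatever glDepthFunc and glDepthMask were last set to.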

I use the clipping algorithm described in this paper, which is designed to clip triangle strips efficiently.

Also, generating a matrix for each edge you clip seems... uh... extremely inefficient. Normally, you just calculate where the plane intersects the edge (generating a value from 0.0 to 1.0), and then generate vertex data for the intersection point by linearly interpolating the vertex data between the two vertices. No need to mess with matrices.
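
A minimal sketch of that approach, treating a vertex as a flat array of floats so that position and UVs are interpolated by the same loop. The function names and the array layout are assumptions for illustration:

```c
/* Fraction along the edge a->b where component `c` equals `plane`.
 * The caller guarantees the edge actually crosses the plane (a[c] != b[c]). */
static float edge_fraction(const float *a, const float *b, int c, float plane)
{
    return (plane - a[c]) / (b[c] - a[c]);
}

/* Interpolate `count` float attributes (position, UV, ...) between two
 * vertices; t = 0.0 gives a, t = 1.0 gives b. */
static void lerp_attribs(const float *a, const float *b, float t,
                         float *out, int count)
{
    for (int i = 0; i < count; i++)
        out[i] = a[i] + t * (b[i] - a[i]);
}
```

For an edge from (0, 0, -2) with UV (0, 0) to (4, 0, 2) with UV (1, 1), clipped at z = 0, the fraction is 0.5 and every attribute lands halfway between the two vertices.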

PH3NOM · Post by **PH3NOM** » Fri Mar 01, 2013 9:22 pm

As always, thank you for your input TapamN.

The test I ran indicated X and Y Clipping was also needed, but I will look at it again to see if the said polygons were also actually crossing the Z Near Plane at some point off screen.

For now, calling glDepthMask has the same effect, but thank you for the clarification, as I have focused on sorting more immediate problems first.

About the clipping math, first I worked things out on paper. I think at its core, it is the same as you speak of.
I came up with a formula to determine where the edge between 2 vertices intersects a plane, when at least one component of that plane is known.
That known component represents the plane ( i.e. view frustum ) that the vertices are known to cross.

Let vertex 1 be |v1|, vertex 2 be |v2|, and the component be c ( 0=x, 1=y, 2=z ), where that component has a known value, val.

First, I determine the Displacement of the Vertices as ( vector - vector ) |D| = |v2| - |v1|

Next, I determine the Magnitude of Displacement (0.0->1.0) of the known component M = (val-|v1|[c]) / |D|[c]

From there, I multiply the Displacement Vector by the Magnitude of Displacement. (vector*scalar ) |D| *= M

The Clipping point is found by adding the Displacement Vector to the vertex that is outside of the clip region ( vector + vector ) |v1|+=|D|

That's how I figured things out; it can be implemented easily without using Matrix Math. Please let me know if I am doing more work than needed here.
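
Written out in vector notation, the four steps above amount to the following, where P is the resulting clip point (the symbol P, the subscript notation and the example numbers are added here for illustration):

```latex
\begin{aligned}
\mathbf{D} &= \mathbf{v}_2 - \mathbf{v}_1 \\
M &= \frac{val - v_{1,c}}{D_c} \\
\mathbf{P} &= \mathbf{v}_1 + M\,\mathbf{D}
\end{aligned}
```

For example, with v1 = (3, 0, -5), v2 = (1, 2, -5), c = 0 and val = 2: D = (-2, 2, 0), M = (2 - 3) / (-2) = 0.5, and P = (2, 1, -5), which lies on the plane x = 2 as expected.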

Regarding speed, I considered that mat_transform claims 15 million verts/sec throughput, and here it's being done without perspective division. Also, I only posted the version that does not apply U/V transformation. I was able to use the loaded matrix to assist in calculating the transformed U/V coordinates...

I had imagined applying the clip transform while applying the screen-space transform in the pipeline, but it is still too early yet

PH3NOM · Post by **PH3NOM** » Wed Mar 13, 2013 10:24 pm

TapamN wrote:Also, generating a matrix for each edge you clip seems... uh... extremely inefficient. Normally, you just calculate where the plane intersects the edge (generating a value from 0.0 to 1.0), and then generate vertex data for the intersection point by linearly interpolating the vertex data between the two vertices. No need to mess with matrices.

Thanks again, that motivated me to try out another method for calculation.
I decided to use the SH4's fmac thanks to info provided by this guy http://yam.20to4.net/dreamcast/hints/index.html
It wasn't until I worked out my function that I noticed you actually do use it in your code.

FMAC is an SH4 math function that is not provided by KOS, so I thought I would share my inlined implementation.

Code: Select all

/* SH4 fmac - floating-point multiply/accumulate */
/* Returns a*b+c at the cost of a single floating-point operation */
inline float FMAC( float a, float b, float c )
{
     register float __FR0 __asm__("fr0") = a; 
     register float __FR1 __asm__("fr1") = b; 
     register float __FR2 __asm__("fr2") = c;      
     
     __asm__ __volatile__( 
        "fmac   fr0, fr1, fr2\n"
        : "=f" (__FR0), "=f" (__FR1), "=f" (__FR2)
        : "0" (__FR0), "1" (__FR1), "2" (__FR2)
        );
        
     return __FR2;       
}

For versatility, it's not too hard to use that for a multiply/decrement:

Code: Select all

/* SH4 fmac - floating-point multiply/decrement */
/* Returns a*b-c at the cost of a single floating-point operation */
inline float FMDC( float a, float b, float c )
{
     register float __FR0 __asm__("fr0") = a; 
     register float __FR1 __asm__("fr1") = b; 
     register float __FR2 __asm__("fr2") = -c;      
     
     __asm__ __volatile__( 
        "fmac   fr0, fr1, fr2\n"
        : "=f" (__FR0), "=f" (__FR1), "=f" (__FR2)
        : "0" (__FR0), "1" (__FR1), "2" (__FR2)
        );
        
     return __FR2;       
}

Now the code to clip a vertex to another looks like this, including U/V correction, with no more wasteful matrix math:

Code: Select all

void LineClipFrustum3fvT( vector3f v1, vector3f v2, float v, BYTE c, float *uva, float *uvb )
{  
     float MAG = (v - v1[c])/(v2[c]-v1[c]); /* Magnitude */
     
     /* Use the SH4's FMAC operation to linear interpolate the U/V data */
     uva[0] = FMAC( ((v2[0]-v1[0])*MAG)/v2[0], uvb[0], uva[0] );    
     uva[1] = FMAC( ((v2[1]-v1[1])*MAG)/v2[1], uvb[1], uva[1] );
     
     /* Use the SH4's FMAC operation to linear interpolate the Vertex data */
     v1[0] = FMAC( v2[0]-v1[0], MAG, v1[0] );
     v1[1] = FMAC( v2[1]-v1[1], MAG, v1[1] );
     v1[2] = FMAC( v2[2]-v1[2], MAG, v1[2] );
}

SiZiOUS · Post by **SiZiOUS** » Thu Mar 14, 2013 4:37 am

Woah PH3NOM, your work is just very impressive!

Keep up the great work!

I'll try your 'Kamikaze' demo very soon

TapamN · Post by **TapamN** » Thu Mar 14, 2013 8:19 am

Uh, you don't really need any assembly to use FMAC instructions. You can have GCC automatically generate them for you for normal C math.

I think on older versions (GCC 3.4) all you had to do was specify -ffast-math. On more recent versions (GCC 4.7), you (also?) have to specify -mfused-madd.

Letting GCC use FMAC itself is more efficient than using inline assembly. With your assembly, it will always have to move things in and out of FR0-FR2, while when GCC can make its own FMACs it can use any registers. GCC also knows how long FMAC instructions take, and can reorder things to run faster, but it doesn't know how long asm blocks take, so it can't optimize the program as well.
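
For comparison, a minimal sketch of the same multiply-accumulate written as normal C math. Whether the compiler actually emits an FMAC for it depends on the GCC version and flags, as described above; madd and lerpf are hypothetical helper names:

```c
/* Plain C multiply-add: with suitable flags the compiler is free to emit
 * a fused multiply-accumulate and to pick the registers itself. */
static inline float madd(float a, float b, float c)
{
    return a * b + c;
}

/* Linear interpolation written in the a*b+c shape. */
static inline float lerpf(float from, float to, float t)
{
    return madd(to - from, t, from);
}
```

Checking the generated assembly is the only reliable way to confirm which instruction a given build really uses.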

But that seems like a much better clipping implementation overall. If you're using gouraud shading, you also have to calculate lighting like you do with UV and position.

RyoDC · Post by **RyoDC** » Fri Mar 15, 2013 12:13 pm

Reminds me of the cost that you pay when switching from managed to unmanaged code and vice versa.

Neoblast · Post by **Neoblast** » Mon Mar 18, 2013 8:34 pm

It is indeed impressive. I guess your GL lib is being put to good use right now in some games, isn't it?

Actually, I'd like to know the performance difference you get in benchmarks between KGL and your GL.

I guess your GL is faster, but how much?

Also Shenmue 2 was said to have near 6 million poly count, is that true?

PH3NOM · Post by **PH3NOM** » Fri Apr 05, 2013 8:57 pm

TapamN wrote:Uh, you don't really need any assembly to use FMAC instructions. You can have GCC automatically generate them for you for normal C math.

I think on older versions (GCC 3.4) all you had to do was specify -ffast-math. On more recent versions (GCC 4.7), you (also?) have to specify -mfused-madd.

Letting GCC use FMAC itself is more efficient than using inline assembly. With your assembly, it will always have to move things in and out of FR0-FR2, while when GCC can make its own FMACs it can use any registers. GCC also knows how long FMAC instructions take, and can reorder things to run faster, but it doesn't know how long asm blocks take, so it can't optimize the program as well.

But that seems like a much better clipping implementation overall. If you're using gouraud shading, you also have to calculate lighting like you do with UV and position.

Thank you again, TapamN, for your input.
I will test using standard C math and enable the -ffast-math flag in GCC. This is a good lesson for me on how to use the limited amount of registers efficiently.

Still using my inlined FMAC, it was easy to interpolate the vertex color ( or lighting ); the hard part was deciding to use the packed 32bit color format for implementation with the PVR. And now it seems there will have to be multiple versions of this function to accommodate all of the possible GL texture/color enabled/disabled configurations.
Also, I fixed a bug in the U/V interpolation in the last code I posted.

Code: Select all

#define ALPHA 0xFF000000 /* Color Components using PVR's Pack 32bit int */
#define RED   0x00FF0000
#define GREEN 0x0000FF00
#define BLUE  0x000000FF

inline void LineClipFrustum3fvTC1ui( vector3f v1, vector3f v2,
                                     float v, BYTE c,
                                     vector2f uva, vector2f uvb,
                                     uint32 *col1, uint32 *col2 )
{  
     float MAG = (v - v1[c])/(v2[c]-v1[c]); /* Magnitude */
     
     /* Extract Color Components, Apply Linear Interpolation, then Pack it up */
     BYTE a = SHFMAC( ((*col2 & ALPHA)>>24)-((*col1 & ALPHA)>>24), MAG, (*col1 & ALPHA)>>24 );
     BYTE r = SHFMAC( ((*col2 & RED)>>16)-((*col1 & RED)>>16), MAG, (*col1 & RED)>>16 );
     BYTE g = SHFMAC( ((*col2 & GREEN)>>8)-((*col1 & GREEN)>>8), MAG, (*col1 & GREEN)>>8 );
     BYTE b = SHFMAC( ((*col2 & BLUE)>>0)-((*col1 & BLUE)>>0), MAG, (*col1 & BLUE)>>0 );
     *col1 = ( (a<<24) | (r<<16) | (g<<8) | (b<<0) );
     
     /* Use the SH4's FMAC operation to linear interpolate the U/V data */
     uva[0] = SHFMAC( uvb[0]-uva[0], MAG, uva[0] );
     uva[1] = SHFMAC( uvb[1]-uva[1], MAG, uva[1] );
     
     /* Use the SH4's FMAC operation to linear interpolate the Vertex data */
     v1[0] = SHFMAC( v2[0]-v1[0], MAG, v1[0] );
     v1[1] = SHFMAC( v2[1]-v1[1], MAG, v1[1] );
     v1[2] = SHFMAC( v2[2]-v1[2], MAG, v1[2] );
}
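
One thing to watch in the color path: if uint32 is an unsigned type and the channel difference is computed before the conversion to float, a difference such as col2 - col1 wraps around instead of going negative whenever the second channel is the smaller one. A minimal sketch of a per-channel interpolation that converts to float before subtracting; lerp_channel and lerp_argb are hypothetical names, and plain C math stands in for SHFMAC:

```c
#include <stdint.h>

/* Interpolate one 8-bit channel located at bit offset `shift`.
 * Assumes 0.0 <= t <= 1.0 so the result stays within 0..255. */
static uint32_t lerp_channel(uint32_t c1, uint32_t c2, int shift, float t)
{
    float a = (float)((c1 >> shift) & 0xFFu);
    float b = (float)((c2 >> shift) & 0xFFu);
    float r = a + (b - a) * t;            /* signed difference, no wrap-around */
    return ((uint32_t)(r + 0.5f) & 0xFFu) << shift;   /* round, then repack */
}

/* Interpolate a packed 32-bit ARGB color; t = 0.0 gives c1, t = 1.0 gives c2. */
static uint32_t lerp_argb(uint32_t c1, uint32_t c2, float t)
{
    return lerp_channel(c1, c2, 24, t) |
           lerp_channel(c1, c2, 16, t) |
           lerp_channel(c1, c2,  8, t) |
           lerp_channel(c1, c2,  0, t);
}
```

Doing the masking and shifting on unsigned values and the arithmetic on floats also avoids shifting a small signed type into the sign bit when the color is repacked.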

Neoblast wrote:It is indeed impressive. I guess your GL lib is being put to good use right now in some games, isn't it?

Actually, I'd like to know the performance difference you get in benchmarks between KGL and your GL.

I guess your GL is faster, but how much?

Hard to say until I finish all of the clipping algorithm / implementation.
The quadmark example compiled against an earlier build increased from 584k verts/second using KGLX up to 0.96 million verts/second using my build of GL
viewtopic.php?f=29&t=102181&start=40#p1034323
