DCEmulation


All times are UTC - 8 hours [ DST ]




 Post subject: KGL rendering limits?
Posted: Tue Nov 04, 2014 6:54 am
DCEmu Junior

Joined: Wed Feb 05, 2014 4:58 am
Posts: 42
I'm working on some 3D models for some experiments with the KGL renderer and I wanted to know the rendering limitations. How many vertices/triangles can be displayed? I have some models with 1000~2000 tris.


Posted: Tue Nov 04, 2014 9:41 am
Insane DCEmu

Joined: Sun Apr 20, 2014 7:45 am
Posts: 135
PH3NOM should be able to tell you about the limits; he benchmarked his new KGL implementation quite rigorously. There are benchmarks in the example folder of KOS, and you can find threads about it on here.

You can also improve the poly count a lot by toggling GL_KOS_NEARZ_CLIPPING with glEnable/glDisable. Turn near z clipping off when you can guarantee that your meshes are entirely in front of the camera, and turn it on when they intersect the near z plane, or you will get glitches (this is a DC limitation).
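
As a rough illustration of that rule, the sketch below decides per mesh whether near-Z clipping has to stay on, using a bounding sphere in view space. The helper name mesh_needs_near_clip, the bounding-sphere inputs, and the assumption that the camera looks down the negative Z axis are illustrative and not something KGL provides; only GL_KOS_NEARZ_CLIPPING comes from this thread.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of KGL): returns true when a mesh's bounding
 * sphere touches or crosses the near plane, i.e. when near-Z clipping must
 * stay enabled for it. Assumes view space with the camera looking down -Z,
 * so the near plane sits at z = -near_dist. */
static bool mesh_needs_near_clip(float center_z, float radius, float near_dist)
{
    /* The point of the sphere closest to the camera has z = center_z + radius.
     * If even that point lies beyond the near plane, nothing can be clipped. */
    return (center_z + radius) >= -near_dist;
}

/* Intended use inside a KGL render loop (kept as a comment because it needs
 * the Dreamcast toolchain; draw_mesh is a hypothetical draw call):
 *
 *     if (mesh_needs_near_clip(cz, r, near_dist))
 *         glEnable(GL_KOS_NEARZ_CLIPPING);
 *     else
 *         glDisable(GL_KOS_NEARZ_CLIPPING);
 *     draw_mesh();
 */
```

A sphere test is conservative: it may keep clipping on for a mesh that does not actually cross the plane, but it should never turn clipping off for one that does.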

_________________
My libgl playground (not for production): https://bitbucket.org/bogglez/libgl15
My lxdream fork (with some small fixes): https://bitbucket.org/bogglez/lxdream


Posted: Tue Nov 04, 2014 12:05 pm
DCEmu Junior

Joined: Wed Feb 05, 2014 4:58 am
Posts: 42
Where can I find this new KGL implementation? Is it in the current SVN?


Posted: Tue Nov 04, 2014 1:09 pm
The Crabby Overlord

Joined: Mon May 27, 2002 9:31 am
Posts: 4504
bbmario wrote:
Where can I find this new KGL implementation? Is it in the current SVN?
It is in the libgl git repository, which can easily be fetched along with the rest of the kos-ports libraries.


Posted: Wed Nov 05, 2014 7:21 am
DCEmu Junior

Joined: Wed Feb 05, 2014 4:58 am
Posts: 42
Thanks! 8-)


Posted: Fri Nov 14, 2014 6:38 am
Insane DCEmu

Joined: Sat Sep 22, 2007 7:43 pm
Posts: 111
Location: Braga - Portugal
bogglez wrote:
PH3NOM should be able to tell you about the limits; he benchmarked his new KGL implementation quite rigorously. There are benchmarks in the example folder of KOS, and you can find threads about it on here.

You can also improve the poly count a lot by toggling GL_KOS_NEARZ_CLIPPING with glEnable/glDisable. Turn near z clipping off when you can guarantee that your meshes are entirely in front of the camera, and turn it on when they intersect the near z plane, or you will get glitches (this is a DC limitation).


That probably explains the glitches I'm having on real HW.

I will give it a shot.


Posted: Tue Dec 23, 2014 8:51 pm
DC Developer

Joined: Fri Jun 18, 2010 7:29 pm
Posts: 483
bbmario - Did you make any progress with your 3D modeling experiments?

KGL performs well in immediate mode, and even better when submitting arrays.

Tonight I ran a test rendering Quake 3 BSPs.

Test 1: Immediate Mode. Result: ~23.26 fps @ 0.76 mil verts/sec.
Code:
void Q3BSP_RenderImmediateTextured()
{
    GLuint i, f, faces = Q3BSP_Faces(), index, t;

    Q3_BSP_FACE * face = BSP_FACE;
   
    glEnable( GL_KOS_NEARZ_CLIPPING );
   
    glEnable( GL_TEXTURE_2D );
   
    glBlendFunc( GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA );

    for( t = 0; t < Q3BSP_Textures(); t++ )
    {       
        if( !memcmp("noshader", BSP_TEX[t].name, 8) )
            continue;
   
        if( BSP_TEX[t].flags )
            glEnable( GL_BLEND );

        glBindTexture( GL_TEXTURE_2D, GL_TEX[t].ID );
   
        glBegin( GL_TRIANGLES );
   
        for(i = 0; i < faces; i++)
        {
            if( face->textureID == t && ( face->type==1 || face->type==3 ) )
            {
                for(f = 0; f < face->faceIndices; f++)
                {
                    index = face->vertexIndex + BSP_INDEX[face->faceIndex + f];

                    glColor1ui( BSP_VERTEX[index].color );

                    glTexCoord2fv( &BSP_VERTEX[index].uv.x );

                    glVertex3fv( &BSP_VERTEX[index].pos.x );
                }
            }
            ++face;
        }
        glEnd();
   
        glDisable( GL_BLEND );
   
        face = BSP_FACE;
    }
    glDisable( GL_TEXTURE_2D );
   
    glDisable( GL_KOS_NEARZ_CLIPPING );
}

Image

Test 2: Pre-Processed Arrays Mode. Result: ~57.60 fps @ 2.01 mil verts/sec.
Code:
void Q3BSP_RenderArraysTextured()
{
    GLuint t;
   
    glEnable( GL_KOS_NEARZ_CLIPPING );
   
    glEnable( GL_TEXTURE_2D);
   
    glBlendFunc( GL_SRC_ALPHA, GL_ONE_MINUS_SRC_ALPHA );
   
    for( t = 0; t < Q3BSP_Textures(); t++ )
    {   
        if( !BSP_TEX_INDEX_COUNT[t] )
            continue;
       
        if( BSP_TEX[t].flags )
            glEnable( GL_BLEND );
       
        glBindTexture( GL_TEXTURE_2D, GL_TEX[t].ID );
       
        glColorPointer( 1, GL_UNSIGNED_INT, sizeof(Q3BSP_SimpleVertex), &BSP_TEX_VERTICES[t][0].color );
       
        glTexCoordPointer( 2, GL_FLOAT, sizeof(Q3BSP_SimpleVertex), &BSP_TEX_VERTICES[t][0].u );
       
        glVertexPointer( 3, GL_FLOAT, sizeof(Q3BSP_SimpleVertex), &BSP_TEX_VERTICES[t][0].x );

        glDrawArrays( GL_TRIANGLES, 0, BSP_TEX_INDEX_COUNT[t] );
       
        glDisable( GL_BLEND );
    }   
   
    glDisable( GL_TEXTURE_2D );
   
    glDisable( GL_KOS_NEARZ_CLIPPING );
}

Image


Posted: Thu Dec 25, 2014 7:42 pm
Insane DCEmu

Joined: Sun Apr 20, 2014 7:45 am
Posts: 135
Awesome results! :)
How about testing glDisable( GL_KOS_NEARZ_CLIPPING ) with such a scene for objects that are entirely in front of the near plane? Big gains, I bet.

Posted: Fri Dec 26, 2014 12:44 pm
DC Developer

Joined: Fri Jun 18, 2010 7:29 pm
Posts: 483
Thanks man!

The frustum culling is rolled out as a part of the PVS system for now; in this demo it has been disabled to test raw vertex throughput.

In this demo ( not using Light Maps ), I am rendering ~75 arrays ( one per texture ), each containing anywhere from 3 to ~7000 vertices, through OpenGL with GL_KOS_NEARZ_CLIPPING enabled and GL_LIGHTING disabled.

But you are right, the vertex throughput would certainly be higher if we glDisable( GL_KOS_NEARZ_CLIPPING ).
For example, a high polygon model, say the player model in a 3rd person game where the player is always at the center of the screen, could hit higher throughput since we could skip clipping the model.


Posted: Sat Dec 27, 2014 5:19 am
Insane DCEmu

Joined: Sun Apr 20, 2014 7:45 am
Posts: 135
Do you plan to add those projects to the examples as benchmarks? Would be very useful I bet.
I was also wondering how much of a performance hit clipping is, so vertex throughput with and without clipping could be interesting. I assume it's a big hit on performance though.

Posted: Tue Jan 20, 2015 2:56 pm
DC Developer

Joined: Fri Jun 18, 2010 7:29 pm
Posts: 483
So I have made some solid progress on optimizing the clipping algorithm, nearly re-writing the entire thing again :o

I have written an assembly routine that will transform a triangle and check vertices for clipping.
The routine takes an input vertex position array, its stride, a uv coord array, its stride, and a pvr_vertex_t output array as parameters.
The input parameters are built to handle glDrawArrays, where each component is stored in a separate array.
If the vertices are completely out, they won't even be written out of the registers.
If the vertices are completely in, perspective division will be applied before writing to output.
Currently, if the triangle crosses the clip plane, the vertices are written without perspective divide, to be handled outside of the assembly routine.
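
In plain C, the three outcomes described above could be classified roughly as follows. This is only a sketch: the -1.0f threshold and the bit values mirror the assembly listing that follows, while the function and enum names here are made up for illustration.

```c
/* Bit flags for which vertices of a triangle are outside the near plane. */
enum { TRI_CLIP_NONE = 0, TRI_CLIP_V1 = 1, TRI_CLIP_V2 = 2, TRI_CLIP_V3 = 4, TRI_CLIP_ALL = 7 };

/* Assumed rule: a transformed vertex counts as "out" when its z is greater
 * than the clip threshold of -1.0f. */
static int tri_clip_code(float z1, float z2, float z3)
{
    const float threshold = -1.0f;
    int code = TRI_CLIP_NONE;

    if (z1 > threshold) code |= TRI_CLIP_V1;
    if (z2 > threshold) code |= TRI_CLIP_V2;
    if (z3 > threshold) code |= TRI_CLIP_V3;
    return code;
}

/* Maps a clip code to the action taken for the whole triangle. */
static const char *tri_clip_action(int code)
{
    if (code == TRI_CLIP_ALL)  return "skip";              /* nothing is written  */
    if (code == TRI_CLIP_NONE) return "divide and write";  /* fully visible       */
    return "write undivided, clip in C";                   /* crosses the plane   */
}
```

Keeping the common cases (all in, all out) on the fast path is what lets the expensive intersection math run only for the few triangles that straddle the plane.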

For kicks, the assembly code looks like this:

Code:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

.globl __glKosTransformTri2

!r0 = [int] = return clip code for triangle
!r1 = [int] = vert stride
!r2 = [int] = uv stride
!r4 = [arg][float *] = vertex position pointer
!r5 = [arg][pvr_vertex_t *] = output vector
!r6 = [arg][float *] = uv pointer
!fr4 = [arg][float] vert stride (in bytes)(passed as float to prevent stack use)
!fr5 = [arg][float] uv stride (in bytes)(passed as float to prevent stack use)
!fv0 = vector 1
!fv4 = vector 2
!fv8 = vector 3
!fr12 = u component
!fr13 = v component
!fr15 = z clip threshold
 
.align 4

__glKosTransformTri2:

    fmov.s fr12, @-r15   ! push fr12 to stack ( callee save )
    ftrc fr4, fpul       ! floatToInt(vert stride)
    fmov @r4+, fr0       ! load vertex x
    sts fpul, r1         ! r1 now holds vert stride
    fmov @r4+, fr1        ! load vertex y
    add #-8, r1          ! adjust vert stride to offset read increment
    fmov @r4, fr2        ! load vertex z
    fldi1 fr3            ! load 1 for w

    ftrv xmtrx, fv0      ! transform first vector
 
    ftrc fr5, fpul       ! floatToInt(uv stride)
    add r1, r4           ! add vertex position stride
    fmov @r4+, fr4       ! load vertex x
    fldi1 fr7            ! load 1 for w
    fmov @r4+, fr5       ! load vertex y
    mov #0, r0           ! set 0 for clip code
    fmov @r4, fr6        ! load vertex z

    ftrv xmtrx, fv4      ! transform second vector

    add r1, r4           ! add vertex position stride
    fmov @r4+, fr8       ! load vertex x
    fldi1 fr11           ! load 1 for w
    fmov @r4+, fr9       ! load vertex y
    sts fpul, r2         ! r2 now holds uv stride
    fmov @r4, fr10       ! load vertex z
    add #-4, r2          ! adjust uv stride to offset read increment

    ftrv xmtrx, fv8      ! transform third vector

    fmov.s fr15, @-r15   ! push fr15 to stack ( callee save )   
    fldi1 fr15           ! load 1 to fr15 for clip threshold
    fneg fr15            ! clip threshold = -1.0f

    fcmp/gt fr15, fr2    ! check 1st vertex for z clipping
    bf .V1IN

    or #1, r0            ! clip code 1st vertex out

.V1IN:

    fcmp/gt fr15, fr6    ! check 2nd vertex for z clipping
    bf .V2IN

    or #2, r0            ! clip code 2nd vertex out

.V2IN:

    fcmp/gt fr15, fr10   ! check 3rd vertex for z clipping
    bf .V3IN

    or #4, r0            ! clip code 3rd vertex out

.V3IN:

    cmp/eq #7, r0        ! clip code (1|2|4)= 7 = all out - dont write output
    bt .RETURN
 
    cmp/eq #0, r0        ! clip code 0 = all in - write with perspective divide
    bf .WRITENODIVIDE    ! otherwise, write output with no perspective divide

!.WRITEWITHDIVIDE:

    fldi1 fr2
    fdiv fr3, fr2        ! perspective divide

    fmov.s fr13, @-r15   ! push fr13 to stack ( callee save )
    add #24, r5          ! move output vertex to v component
    fmov @r6+, fr12      ! load u to fr12
    fmov @r6, fr13       ! load v to fr13
    add r2, r6           ! add uv stride
   
    fmov fr13, @-r5      ! write v
    fmov fr12, @-r5      ! write u
    fmul fr2, fr1        ! 1 / w * y
    fmov fr2, @-r5       ! write z
    fmul fr2, fr0        ! 1 / w * x
    fmov fr1, @-r5       ! write y
    fldi1 fr6            ! load 1 to fr6 for next 1 / w op
    fmov fr0, @-r5       ! write x

    fdiv fr7, fr6        ! perspective divide

    add #52, r5          ! move output vertex to v component
    fmov @r6+, fr12      ! load u to fr12
    fmov @r6, fr13       ! load v to fr13
    add r2, r6           ! add uv stride

    fmov fr13, @-r5      ! write v
    fmul fr6, fr5        ! 1 / w * y
    fmov fr12, @-r5      ! write u
    fmul fr6, fr4        ! 1 / w * x
    fmov fr6, @-r5       ! write z
    fldi1 fr10           ! load 1 to fr10 for next 1 / w op
    fmov fr5, @-r5       ! write y
    fmov fr4, @-r5       ! write x

    fdiv fr11, fr10      ! perspective divide

    add #52, r5          ! move output vertex to v component
    fmov @r6+, fr12      ! load u to fr12
    fmov @r6, fr13       ! load v to fr13

    fmov fr13, @-r5      ! write v
    fmul fr10, fr9       ! 1 / w * y
    fmov fr12, @-r5      ! write u
    fmul fr10, fr8       ! 1 / w * x
    fmov fr10, @-r5      ! write z
    fmov fr9, @-r5       ! write y
    fmov fr8, @-r5       ! write x
   
    bra .RETURN          ! done!

    fmov.s @r15+, fr13   ! delay slot = pop stack back to fr13

.WRITENODIVIDE:

    fmov.s fr13, @-r15   ! push fr13 to stack ( callee save )

    add #28, r5          ! add vertex stride to next w component
   
    fmov fr3, @r5        ! write w component
    add #-4, r5          ! move output vertex to write v component

    fmov @r6+, fr12      ! read next u
    fmov @r6, fr13       ! read next v
    add r2, r6           ! add uv stride
   
    fmov fr13, @-r5      ! write v
    fmov fr12, @-r5      ! write u
    fmov fr2, @-r5       ! write z
    fmov fr1, @-r5       ! write y
    fmov fr0, @-r5       ! write x

    add #56, r5          ! add vertex stride to next w component

    fmov fr7, @r5        ! write w component
    add #-4, r5          ! move output vertex to write v component

    fmov @r6+, fr12      ! read next u
    fmov @r6, fr13       ! read next v
    add r2, r6           ! add uv stride

    fmov fr13, @-r5      ! write v
    fmov fr12, @-r5      ! write u
    fmov fr6, @-r5       ! write z
    fmov fr5, @-r5       ! write y
    fmov fr4, @-r5       ! write x

    add #56, r5          ! add vertex stride to next w component

    fmov fr11, @r5       ! write w component
    add #-4, r5          ! move output vertex to write v component

    fmov @r6+, fr12      ! read next u
    fmov @r6, fr13       ! read next v

    fmov fr13, @-r5      ! write v
    fmov fr12, @-r5      ! write u
    fmov fr10, @-r5      ! write z
    fmov fr9, @-r5       ! write y
    fmov fr8, @-r5       ! write x

    fmov.s @r15+, fr13   ! pop stack back to fr13

.RETURN:

    fmov.s @r15+, fr15   ! pop stack back to fr15
    rts
    fmov.s @r15+, fr12   ! pop stack back to fr12

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


And the code that invokes it is as follows:
Code:
typedef struct {
    uint32  flags;              /**< \brief TA command (vertex flags) */
    float   x;                  /**< \brief X coordinate */
    float   y;                  /**< \brief Y coordinate */
    float   z;                  /**< \brief Z coordinate */
    float   u;                  /**< \brief Texture U coordinate */
    float   v;                  /**< \brief Texture V coordinate */
    uint32  argb;               /**< \brief Vertex color */
    float   w;                  /**< \brief Vertex W coordinate (oargb gets ignored, use as w)*/
} pvr_cmd_clip_vertex_t;

#define CLIP_NONE 0
#define CLIP_1ST (1<<0)
#define CLIP_2ND (1<<1)
#define CLIP_3RD (1<<2)
#define CLIP_1ST_2 (1<<0 | 1<<1)
#define CLIP_1ST_AND_LAST (1<<0 | 1<<2)
#define CLIP_LAST_2 (1<<1 | 1<<2)
#define CLIP_ALL (1<<0 | 1<<1 | 1<<2)

static inline void _glKosVertexClipZNear1(pvr_cmd_clip_vertex_t *v1, pvr_cmd_clip_vertex_t *v2) {
    GLfloat MAG = ((-1.0f - v1->z) / (v2->z - v1->z));

    colorui *c1 = (colorui *)&v1->argb;
    colorui *c2 = (colorui *)&v2->argb;

    v1->x += (v2->x - v1->x) * MAG;
    v1->y += (v2->y - v1->y) * MAG;
    v1->w += (v2->w - v1->w) * MAG;
    v1->z = 1.0f / v1->w;
    v1->u += (v2->u - v1->u) * MAG;
    v1->v += (v2->v - v1->v) * MAG;
    c1->a += (c2->a - c1->a) * MAG;
    c1->r += (c2->r - c1->r) * MAG;
    c1->g += (c2->g - c1->g) * MAG;
    c1->b += (c2->b - c1->b) * MAG;

    v1->x *= v1->z;
    v1->y *= v1->z;
}

static inline void _glKosVertexCopyPUV(pvr_cmd_clip_vertex_t * src,
                                       pvr_cmd_clip_vertex_t * dst)
{
    dst->x = src->x;
    dst->y = src->y;
    dst->z = src->z;
    dst->w = src->w;
    dst->u = src->u;
    dst->v = src->v;
}

static inline void _glKosVertexClipPerspectiveDivide(pvr_cmd_clip_vertex_t *dst) {
    dst->z = 1.0f / dst->w;
    dst->x *= dst->z;
    dst->y *= dst->z;
}

static inline GLubyte _glKosClipTriAndTransform(GLfloat *vert_pos, GLint vert_stride,
                                                float *uv_coord, int uv_stride,
                                                pvr_vertex_t *src, pvr_cmd_clip_vertex_t *dst) {
    GLushort clip = 0; /* Clip Code for current Triangle */
    clip = _glKosTransformTri2(vert_pos, (float)vert_stride * 4.0f,
                               dst,
                               uv_coord, (float)uv_stride * 4.0f);
    if(clip == CLIP_ALL)
        return 0;
   
    if(clip == CLIP_NONE)
    {
        dst[0].argb = src[0].argb;
        dst[1].argb = src[1].argb;
        dst[2].argb = src[2].argb;
        dst[0].flags = dst[1].flags = PVR_CMD_VERTEX;
        dst[2].flags = PVR_CMD_VERTEX_EOL;

        return 3;
    }
   
    switch(clip) { /* Start by examining # of vertices inside clip plane */
        case CLIP_1ST: //0 1 0 2
            _glKosVertexCopyPUV(&dst[2], &dst[3]);
            _glKosVertexCopyPUV(&dst[0], &dst[2]);
 
            dst[0].argb = src[0].argb;
            dst[1].argb = src[1].argb;
            dst[2].argb = src[0].argb;           
            dst[3].argb = src[2].argb;   
           
            _glKosVertexClipZNear1(&dst[0], &dst[1]);
            _glKosVertexClipZNear1(&dst[2], &dst[3]);
            _glKosVertexClipPerspectiveDivide(&dst[1]);
            _glKosVertexClipPerspectiveDivide(&dst[3]);
           
            dst[0].flags = dst[1].flags = dst[2].flags = PVR_CMD_VERTEX;
            dst[3].flags = PVR_CMD_VERTEX_EOL;           
           
            return 4;

        case CLIP_2ND: //1 2 1 0
            _glKosVertexCopyPUV(&dst[0], &dst[3]);
            _glKosVertexCopyPUV(&dst[1], &dst[0]);
            _glKosVertexCopyPUV(&dst[2], &dst[1]);
            _glKosVertexCopyPUV(&dst[0], &dst[2]);
 
            dst[0].argb = src[1].argb;
            dst[1].argb = src[2].argb;
            dst[2].argb = src[1].argb;           
            dst[3].argb = src[0].argb;   
           
            _glKosVertexClipZNear1(&dst[0], &dst[1]);
            _glKosVertexClipZNear1(&dst[2], &dst[3]);
            _glKosVertexClipPerspectiveDivide(&dst[1]);
            _glKosVertexClipPerspectiveDivide(&dst[3]);
           
            dst[0].flags = dst[1].flags = dst[2].flags = PVR_CMD_VERTEX;
            dst[3].flags = PVR_CMD_VERTEX_EOL;           
           
            return 4;

        case CLIP_3RD: //2 0 2 1
            _glKosVertexCopyPUV(&dst[1], &dst[3]);
            _glKosVertexCopyPUV(&dst[0], &dst[1]);
            _glKosVertexCopyPUV(&dst[2], &dst[0]);
 
            dst[0].argb = src[2].argb;
            dst[1].argb = src[0].argb;
            dst[2].argb = src[2].argb;           
            dst[3].argb = src[1].argb;   
           
            _glKosVertexClipZNear1(&dst[0], &dst[1]);
            _glKosVertexClipZNear1(&dst[2], &dst[3]); 
            _glKosVertexClipPerspectiveDivide(&dst[1]);
            _glKosVertexClipPerspectiveDivide(&dst[3]);
           
            dst[0].flags = dst[1].flags = dst[2].flags = PVR_CMD_VERTEX;
            dst[3].flags = PVR_CMD_VERTEX_EOL;           
           
            return 4;
           
        case CLIP_1ST_2:
            _glKosVertexClipZNear1(&dst[0], &dst[2]);
            _glKosVertexClipZNear1(&dst[1], &dst[2]);
            _glKosVertexClipPerspectiveDivide(&dst[2]);
           
            dst[0].argb = src[0].argb;
            dst[1].argb = src[1].argb;
            dst[2].argb = src[2].argb;
            dst[0].flags = dst[1].flags = PVR_CMD_VERTEX;
            dst[2].flags = PVR_CMD_VERTEX_EOL;

            return 3;

        case CLIP_1ST_AND_LAST:
            _glKosVertexClipZNear1(&dst[0], &dst[1]);
            _glKosVertexClipZNear1(&dst[2], &dst[1]);
            _glKosVertexClipPerspectiveDivide(&dst[1]);
           
            dst[0].argb = src[0].argb;
            dst[1].argb = src[1].argb;
            dst[2].argb = src[2].argb;
            dst[0].flags = dst[1].flags = PVR_CMD_VERTEX;
            dst[2].flags = PVR_CMD_VERTEX_EOL;

            return 3;

        case CLIP_LAST_2:
            _glKosVertexClipZNear1(&dst[1], &dst[0]);
            _glKosVertexClipZNear1(&dst[2], &dst[0]);
            _glKosVertexClipPerspectiveDivide(&dst[0]);
           
            dst[0].argb = src[0].argb;
            dst[1].argb = src[1].argb;
            dst[2].argb = src[2].argb;
            dst[0].flags = dst[1].flags = PVR_CMD_VERTEX;
            dst[2].flags = PVR_CMD_VERTEX_EOL;

            return 3;
    }

    return 0;
}
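
The intersection step used by _glKosVertexClipZNear1 can be tried out on the host with a small stand-alone sketch. The clip_vert struct is a simplified stand-in for pvr_cmd_clip_vertex_t (no flags or color), the function name is hypothetical, and the perspective divide that the real routine applies afterwards is deliberately left out.

```c
/* Simplified stand-in for pvr_cmd_clip_vertex_t: position, w and uv only. */
typedef struct {
    float x, y, z, w, u, v;
} clip_vert;

/* Moves 'outside' along the edge towards 'inside' until it lies on the
 * plane z = -1.0f, interpolating every attribute by the same factor.
 * Assumes the two vertices have different z values. */
static void clip_to_near_plane(clip_vert *outside, const clip_vert *inside)
{
    /* Fraction of the edge to travel, computed before anything is modified. */
    float t = (-1.0f - outside->z) / (inside->z - outside->z);

    outside->x += (inside->x - outside->x) * t;
    outside->y += (inside->y - outside->y) * t;
    outside->w += (inside->w - outside->w) * t;
    outside->u += (inside->u - outside->u) * t;
    outside->v += (inside->v - outside->v) * t;
    outside->z += (inside->z - outside->z) * t;   /* ends up at about -1.0f */
}
```

For example, an outside vertex at z = 0 and an inside vertex at z = -3 give t = 1/3, so every attribute moves one third of the way along the edge.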


So, the results are good: perfect clipping at very high speed!
This map, using the old clipping code, ran at ~28 fps @ 2.14 mil verts/sec.
Now, using the new assembly clipping code, this map runs at ~45 fps @ 3.4 mil verts/sec.
Image


Posted: Wed Jan 21, 2015 7:28 am
Insane DCEmu

Joined: Sun Apr 20, 2014 7:45 am
Posts: 135
So you went from 36ms frame time to 22ms? That's a ridiculously huge optimization! Where do you think that comes from? Register pressure?
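
For reference, those frame times follow directly from the reported frame rates. The helpers below are a trivial sketch with made-up names, using only the figures quoted in the previous post.

```c
/* Frame time in milliseconds for a given frame rate. */
static double frame_time_ms(double fps)
{
    return 1000.0 / fps;
}

/* Average number of vertices submitted per frame, derived from the
 * reported throughput (vertices per second) and frame rate. */
static double verts_per_frame(double verts_per_sec, double fps)
{
    return verts_per_sec / fps;
}
```

frame_time_ms(28.0) is about 35.7 and frame_time_ms(45.0) is about 22.2. verts_per_frame(2.14e6, 28.0) and verts_per_frame(3.4e6, 45.0) both come out near 76,000, which suggests the per-frame workload was about the same in both runs.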
I didn't have much time recently, but I was getting unhappy with the way I submit vertices. I wanted to split up processing of coordinates, UVs, normals and colors, but I fear that would not make good use of the cache.

EDIT: btw I noticed that you didn't implement glMultiDrawArrays yet. May I suggest moving the current implementation (with a for loop) into glMultiDrawArrays instead, and calling glMultiDrawArrays from glDrawArrays? The advantage is that when you draw multiple objects from one vertex buffer, you only need to perform the init tasks once (load the transform matrix etc).
ref. http://programming4.us/multimedia/8302.aspx
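
A host-side sketch of that layering is shown below. It is not KGL code: begin_batch and submit_range are hypothetical stand-ins for the per-call setup (loading the transform matrix and so on) and the vertex submission, and the counters exist only to make the effect visible.

```c
#include <stddef.h>

static int g_setup_calls = 0;     /* how many times the per-batch init ran */
static int g_vertices_drawn = 0;  /* total vertices submitted              */

/* Hypothetical stand-in for the init work done once per draw call. */
static void begin_batch(void)
{
    g_setup_calls++;
}

/* Hypothetical stand-in for transforming and submitting one vertex range. */
static void submit_range(int first, int count)
{
    (void)first;
    g_vertices_drawn += count;
}

/* The loop lives here, so the setup cost is paid once for all ranges. */
static void sketch_multi_draw_arrays(const int *first, const int *count, size_t drawcount)
{
    begin_batch();
    for (size_t i = 0; i < drawcount; i++)
        submit_range(first[i], count[i]);
}

/* The single-range call becomes a thin wrapper around the multi version. */
static void sketch_draw_arrays(int first, int count)
{
    sketch_multi_draw_arrays(&first, &count, 1);
}
```

Drawing three ranges through sketch_multi_draw_arrays runs the setup once, whereas three separate sketch_draw_arrays calls would run it three times.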

Posted: Sat Jan 24, 2015 3:29 pm
DC Developer

Joined: Fri Jun 18, 2010 7:29 pm
Posts: 483
So, I am curious what the actual polygon limit is when using KOS directly, so that it can serve as a benchmark against what we can hit with KGL.

I think I have pretty much found the limit, both using mat_transform_sq, as well as a custom assembly routine I have written.

My assembly routine takes three input array pointers (position, uv, and argb color), the strides for those arrays, a pvr_vertex_t *dest, and the count of triangles to transform. Since the routine tested here is written specifically for triangles, it also writes the pvr_vertex_t flags to dest for you.
My code looks like this:
Code:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!void _glKosTransformTriSQ(float *vert_pos,
!                          float vert_stride,
!                          float * uv_coord,
!                          float uv_stride,
!                          unsigned int * argb,
!                          float argb_stride,
!                          float count);

.globl __glKosTransformTriSQ

!r0 = [int] = triangle count
!r1 = [int] = vert stride
!r2 = [int] = uv stride
!r3 = [int] = argb and pvr vertex flags
!r4 = [arg][float *] = vertex position pointer
!r5 = [arg][float *] = uv pointer
!r6 = [arg][uint32 *] = color (32bit int argb) pointer
!r7 = [arg][pvr_vertex_t *] = output vector
!r14 = [int] = argb stride

!fr4 = [arg][float] vert stride (in bytes)(passed as float to prevent stack use)
!fr5 = [arg][float] uv stride (in bytes)(passed as float to prevent stack use)
!fr6 = [arg][float] argb stride
!fr7 = [arg][float] count
!fv0 = vector 1
!fv4 = vector 2
!fv8 = vector 3
!fr12 = u component
!fr13 = v component

.align 4

__glKosTransformTriSQ:

    ftrc fr7, fpul       ! floatToInt(triangle count)
    sts fpul, r0         ! r0 now holds triangle count
    fmov.s fr12, @-r15   ! push fr12 to stack ( callee save )
    ftrc fr4, fpul       ! floatToInt(vert stride)
    sts fpul, r1         ! r1 now holds vert stride
    ftrc fr5, fpul       ! floatToInt(uv stride)
    sts fpul, r2         ! r2 now holds uv stride
    add #-8, r1          ! adjust vert stride to offset read increment
    fmov.s fr13, @-r15   ! push fr13 to stack ( callee save )
    add #-4, r2          ! adjust uv stride to offset read increment
    mov.l r14, @-r15     ! push r14 to stack ( callee save )
    ftrc fr6, fpul       ! floatToInt(argb stride)
    sts fpul, r14        ! r14 now holds argb stride
    !mov.l TA_ADDR, r7   ! (disabled) load TA address to r7

.LOADVERTEX:

    fmov @r4+, fr0       ! load vertex x
    fmov @r4+, fr1       ! load vertex y
    fmov @r4, fr2        ! load vertex z
    fldi1 fr3            ! load 1 for w

    ftrv xmtrx, fv0      ! transform first vector
 
    add r1, r4           ! add vertex position stride
    fmov @r4+, fr4       ! load vertex x
    fldi1 fr7            ! load 1 for w
    fmov @r4+, fr5       ! load vertex y
    fmov @r4, fr6        ! load vertex z

    ftrv xmtrx, fv4      ! transform second vector

    add r1, r4           ! add vertex position stride
    fmov @r4+, fr8       ! load vertex x
    fldi1 fr11           ! load 1 for w
    fmov @r4+, fr9       ! load vertex y
    fmov @r4, fr10       ! load vertex z

    ftrv xmtrx, fv8      ! transform third vector

    add r1, r4           ! add vertex position stride

    fldi1 fr2
    fdiv fr3, fr2        ! perspective divide
   
    mov.l @r6, r3        ! load input argb to r3
    add #28, r7          ! move output vertex to argb component
    fmov @r5+, fr12      ! load u to fr12
    add r14, r6          ! add argb stride
    fmov @r5, fr13       ! load v to fr13
    add r2, r5           ! add uv stride
   
    mov.l r3, @-r7       ! write argb
    fmov fr13, @-r7      ! write v
    mov.l CMD_VERT, r3   ! load PVR_CMD_VERTEX flag to r3
    fmov fr12, @-r7      ! write u
    fmul fr2, fr1        ! 1 / w * y
    fmov fr2, @-r7       ! write z
    fmul fr2, fr0        ! 1 / w * x
    fmov fr1, @-r7       ! write y
    fldi1 fr6            ! load 1 to fr6 for next 1 / w op
    fmov fr0, @-r7       ! write x
    mov.l r3, @-r7       ! write first vertex flag

    pref @r7             ! flush vertex via SQ to PVR

    fdiv fr7, fr6        ! perspective divide

    add #60, r7          ! move output vertex to argb component
    fmov @r5+, fr12      ! load u to fr12
    mov.l @r6, r3        ! load input argb to r3   
    fmov @r5, fr13       ! load v to fr13
    add r2, r5           ! add uv stride

    mov.l r3, @-r7       ! write argb
    fmov fr13, @-r7      ! write v
    mov.l CMD_VERT, r3   ! load PVR_CMD_VERTEX flag to r3
    fmul fr6, fr5        ! 1 / w * y
    fmov fr12, @-r7      ! write u
    fmul fr6, fr4        ! 1 / w * x
    fmov fr6, @-r7       ! write z
    fldi1 fr10           ! load 1 to fr10 for next 1 / w op
    fmov fr5, @-r7       ! write y
    add r14, r6          ! add argb stride   
    fmov fr4, @-r7       ! write x
    mov.l r3, @-r7       ! write second vertex flag

    pref @r7             ! flush vertex via SQ to PVR

    fdiv fr11, fr10      ! perspective divide

    add #60, r7          ! move output vertex to argb component
    fmov @r5+, fr12      ! load u to fr12
    mov.l @r6, r3        ! load input argb to r3 
    fmov @r5, fr13       ! load v to fr13
    add r2, r5           ! add uv stride

    mov.l r3, @-r7       ! write argb
    fmov fr13, @-r7      ! write v
    fmul fr10, fr9       ! 1 / w * y
    fmov fr12, @-r7      ! write u
    fmul fr10, fr8       ! 1 / w * x
    fmov fr10, @-r7      ! write z
    mov.l CMD_EOS, r3    ! load PVR_CMD_VERTEX_EOL flag to r3
    fmov fr9, @-r7       ! write y
    add r14, r6          ! add argb stride
    fmov fr8, @-r7       ! write x
    mov.l r3, @-r7       ! write last vertex flag

    pref @r7             ! flush vertex via SQ to PVR

    add #32, r7          ! move forward to next output vertex

    dt r0                ! decrement count, check for loop
    bf .LOADVERTEX       ! more triangles, run next loop
   
    mov.l @r15+, r14     ! pop stack back to r14
    fmov.s @r15+, fr13   ! pop stack back to fr13

    rts
    fmov.s @r15+, fr12   ! delay slot = pop stack back to fr12

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

.align 2
CMD_VERT:
    .long 0xe0000000
CMD_EOS:
    .long 0xf0000000

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

So, for a quick test, I create a vertex buffer in main RAM to serve as the input to the function:
Code:
    #define triangles 35000
   
    pvr_vertex_t vertex_array[triangles * 3];
    pvr_vertex_t dst[triangles * 3];
   
    float x = 0;
    float y = 0;
    GLfloat size = 3.75;
    int i = 0;
    for(y = 0; y < 480; y += size)
        for(x = 0; x < 640; x += size)
        {
            vertex_array[i].x = x + size/2;
            vertex_array[i].y = y;
            vertex_array[i].z = -3;

            vertex_array[i+1].x = x + size;
            vertex_array[i+1].y = y + size;
            vertex_array[i+1].z = -3;
           
            vertex_array[i+2].x = x - size/2;
            vertex_array[i+2].y = y + size;
            vertex_array[i+2].z = -3;           
           
            vertex_array[i].argb = 0xFFFF0000;
            vertex_array[i+1].argb = 0xFF00FF00;
            vertex_array[i+2].argb = 0xFF0000FF;
           
            vertex_array[i].flags = PVR_CMD_VERTEX;
            vertex_array[i+1].flags = PVR_CMD_VERTEX;
            vertex_array[i+2].flags = PVR_CMD_VERTEX_EOL;
           
            i += 3;
        }


So, this loop creates an array of exactly 65,664 vertices, or 21,888 triangles.

Testing transforming vertices in RAM ( to see only CPU time ), I find my code is actually faster:
mat_transform_sq run 100 times takes 1087 msec.
My assembly code run 100 times takes 956 msec.

Submitting directly to the PVR and running at the full 60 fps, we can hit an actual 1,313,280 polygons/sec.
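
The counts above can be double-checked by replaying the same nested loop on the host. This is just a sketch; the function name is made up, and the loop bounds (640 x 480, step 3.75) are the ones from the test code.

```c
/* Counts how many triangles the grid-filling loop emits: one triangle per
 * (x, y) step, with both coordinates advancing by 'size'. */
static int count_grid_triangles(float width, float height, float size)
{
    int triangles = 0;

    for (float y = 0.0f; y < height; y += size)
        for (float x = 0.0f; x < width; x += size)
            triangles++;

    return triangles;
}
```

With a step of 3.75 the loop covers 171 columns and 128 rows, giving 21,888 triangles or 65,664 vertices per frame. At 60 fps that is 1,313,280 triangles and 3,939,840 vertices per second.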

Here, my code uses 10msec/frame, where mat_transform_sq uses 11msec/frame.

However, this is a highly optimized example; it is done in a single draw call, and it only submits one pre-compiled pvr_poly_hdr_t.
These triangles are Gouraud shaded, but non-textured. I will need to make another test to see how texturing affects these numbers.

Also, I noticed that when I tested random polygons, the PVR began to choke down as triangles overlapped many times.

NullDC suggests 6.44mil verts/sec, but the real number is actually 3,939,840 vertices / sec.
Image

Edit: another test, bypassing KGL and using the SH4/PVR directly.
The Quake 3 vertices are converted to the pvr_vertex_t format, with color and flags pre-set.
The vertices are arranged in arrays, with each array containing all of the vertices that share the same texture.

It is very hard to hit that kind of vertex throughput in a realistic scenario:
(map from here: http://lvlworld.com/review/id:1743)
Image

Testing another BSP, it does in fact seem possible to hit that number:
Image


Posted: Tue Jan 27, 2015 5:50 pm
DCEmu Junior

Joined: Wed Feb 05, 2014 4:58 am
Posts: 42
I've been learning my way around "old" GL code, since I learned modern GL first (shaders, VBOs, etc.). But so far, so good! Thanks for asking, PH3NOM! By the way, what do you mean by pre-processed arrays?


Posted: Tue Jan 27, 2015 7:53 pm
DC Developer

Joined: Fri Jun 18, 2010 7:29 pm
Posts: 483
Shaders not really possible on DC... I have recently considered some sort of pre-set shader functionality based on a fixed set of shader operations, but a full-blown programmable pipeline is not realistic.
VBO's should be possible to an extent; I have recently added basic VAO functionality (glGenVertexArray, etc.). By the end of this week I should have a new commit ready to update the API implementing VAO's.

By pre-processed I mean that Quake 3 BSPs store their vertices in a different format than the DC's PVR uses, and the BSP vertices are stored in an indexed array that is intended for use with glDrawElements.
The Pre-Process that I refer to means that before making any render call, I convert the Quake3 BSP vertices into the DC's pvr_vertex_t vertex format.
This involves converting the color format from RGBA as used in the Q3 vertices into ARGB for use with pvr_vertex_t vertices.
In the process, I extract the indexed geometry into a linear array that can be rendered with glDrawArrays(...).
The reason for that is that glDrawArrays is faster than glDrawElements on DC due to the fact that the PVR does not directly support indexed geometry, so the geometry must be un-indexed ( in software by the API ) per frame before submission to the PVR on DC.
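
A minimal sketch of such a pre-process step is shown below. The two structs are simplified stand-ins (the real Quake 3 vertex and pvr_vertex_t have more fields), the function names are hypothetical, and the source color is assumed to be four bytes in R, G, B, A order.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified indexed source vertex, with color bytes in R, G, B, A order. */
typedef struct {
    float x, y, z, u, v;
    uint8_t rgba[4];
} src_vertex;

/* Simplified stand-in for pvr_vertex_t, with a packed 32-bit ARGB color. */
typedef struct {
    float x, y, z, u, v;
    uint32_t argb;
} flat_vertex;

/* Packs R, G, B, A bytes into a single 0xAARRGGBB value. */
static uint32_t rgba_bytes_to_argb(const uint8_t c[4])
{
    return ((uint32_t)c[3] << 24) | ((uint32_t)c[0] << 16) |
           ((uint32_t)c[1] << 8)  |  (uint32_t)c[2];
}

/* Expands indexed geometry into a linear array suitable for glDrawArrays.
 * 'out' must have room for 'index_count' vertices. Returns that count. */
static size_t unindex_vertices(const src_vertex *verts, const int *indices,
                               size_t index_count, flat_vertex *out)
{
    for (size_t i = 0; i < index_count; i++) {
        const src_vertex *s = &verts[indices[i]];

        out[i].x = s->x;
        out[i].y = s->y;
        out[i].z = s->z;
        out[i].u = s->u;
        out[i].v = s->v;
        out[i].argb = rgba_bytes_to_argb(s->rgba);
    }
    return index_count;
}
```

The trade-off is memory for speed: shared vertices are duplicated in the linear array, but the per-frame index lookup disappears.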


Posted: Wed Jan 28, 2015 10:59 am
Insane DCEmu

Joined: Sun Apr 20, 2014 7:45 am
Posts: 135
PH3NOM wrote:
Shaders not really possible on DC... I have recently considered some sort of pre-set shader functionality based on a fixed set of shader operations, but a full-blown programmable pipeline is not realistic.

That sounds like a fun project, but in practice I don't think it's useful. It's way too limited and guaranteed to be slow.

Quote:
VBO's should be possible to an extent; I have recently added basic VAO functionality (glGenVertexArray, etc.). By the end of this week I should have a new commit ready to update the API implementing VAO's.

I've already implemented that in my libgl as well. The advantages are not as big as on desktop platforms (since the vertex data cannot be put into VRAM due to transformation), but some exist:
VAO (with or without VBO): no need to set up the vertex attributes every time, saves function calls, branches, etc.
VBO: Basically this gives libgl ownership over the vertex memory. This could allow some optimizations like calculating a bounding volume after glBufferData and using that to improve near clipping?
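
To make the bounding-volume idea concrete, here is a sketch of what a library could compute when vertex data is handed over. Nothing here is existing libgl code: the aabb type and compute_aabb are hypothetical, and the box is in object space, so it would still have to be transformed before being compared against the near plane.

```c
#include <stddef.h>

/* Axis-aligned bounding box in object space. */
typedef struct {
    float min[3];
    float max[3];
} aabb;

/* Scans 'count' positions (x, y, z floats, 'stride_floats' apart) and
 * returns their bounding box. Assumes count >= 1. */
static aabb compute_aabb(const float *positions, size_t count, size_t stride_floats)
{
    aabb box;

    for (int axis = 0; axis < 3; axis++)
        box.min[axis] = box.max[axis] = positions[axis];

    for (size_t i = 1; i < count; i++) {
        const float *p = positions + i * stride_floats;

        for (int axis = 0; axis < 3; axis++) {
            if (p[axis] < box.min[axis]) box.min[axis] = p[axis];
            if (p[axis] > box.max[axis]) box.max[axis] = p[axis];
        }
    }
    return box;
}
```

The box only needs to be recomputed when the buffer contents change, so the cost would be paid at upload time rather than per frame.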
