mat_transform / pvr_prim vs mat_transform_sq

Bouz · Post by **Bouz** » Fri Jan 04, 2013 1:22 pm

Hello, sorry to ask the same question every week. I think mat_transform_sq is highly optimized for rendering strips. Did you only try rendering triangles or quads?

PH3NOM · Post by **PH3NOM** » Sun Jan 06, 2013 9:48 pm

Bouz wrote: Hello, sorry to ask the same question every week. I think mat_transform_sq is highly optimized for rendering strips. Did you only try rendering triangles or quads?

Well, I submit a quad as a triangle strip consisting of 4 verts in this order: lower left, upper left, lower right ( tri 1 ), then the upper right vert, which completes tri 2.

Am I right in believing that the pvr_vertex_t flags actually tell the PVR what type of primitive is being submitted?
I remember reading some documentation indicating that 0xe0000000 tells the PVR that the primitive is a triangle strip; that just so happens to be the value of PVR_CMD_VERTEX.
I don't know how to actually submit a quad to the PVR instead of triangle strips. Is it even possible?
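
For readers following along, here is a minimal sketch of the strip ordering and flags described above. The struct is a hypothetical stand-in for pvr_vertex_t, and the two command values are what I recall from KOS's dc/pvr.h, so treat both as assumptions rather than the library's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to match PVR_CMD_VERTEX and PVR_CMD_VERTEX_EOL in dc/pvr.h. */
#define CMD_VERTEX     0xe0000000u  /* vertex parameter, strip continues */
#define CMD_VERTEX_EOL 0xf0000000u  /* vertex parameter, last vertex of the strip */

/* Hypothetical stand-in for pvr_vertex_t, reduced to the fields used here. */
typedef struct {
    uint32_t flags;
    float x, y, z;
} strip_vertex_t;

/* Fill out[0..3] with an axis-aligned quad as one 4-vertex triangle strip,
   in the order lower left, upper left, lower right, upper right.
   Screen coordinates are assumed, so "bottom" is the larger y value. */
static void quad_as_strip(strip_vertex_t *out,
                          float left, float top, float right, float bottom,
                          float z)
{
    const float xs[4] = { left,   left, right,  right };
    const float ys[4] = { bottom, top,  bottom, top   };

    assert(out != NULL);
    for (int i = 0; i < 4; i++) {
        out[i].x = xs[i];
        out[i].y = ys[i];
        out[i].z = z;
        /* Only the fourth vertex closes the strip. */
        out[i].flags = (i == 3) ? CMD_VERTEX_EOL : CMD_VERTEX;
    }
}
```

The first three vertices form the first triangle, and the end-of-strip flag on the fourth tells the hardware that the strip stops after the second triangle.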

At any rate, when using mat_transform_sq, I gather the quad in RAM as a triangle strip before transforming it.

This still produces lower throughput than transforming each vertex with mat_transform_single then submitting the vertex via Direct Render.

To make a solid comparison, and a point of reference as to how much of the API I have completed, I have compiled the kgl quadmark example without modification using my build of OpenGL.
The results are good, although I was hoping for better, but I don't believe things can be done much more efficiently.

Using KGL, the quadmark example reaches 242k polys/sec, or 584.57k verts/sec according to NullDC

Using my build of the OpenGL API with mat_transform_sq, quadmark reaches 327k polys/sec, or .79mil verts/sec

Using mat_transform_single, quadmark reaches 398k polys/sec, or .96mil verts/sec

Post by **BlueCrab** » Sun Jan 06, 2013 10:23 pm

If you're using a relatively recent version of KOS, you do have the ability to submit quads as well as triangle strips. In the PVR's terminology, quads are simply called sprites (which can be either textured or non-textured).

There aren't any real examples of using sprites in the KOS examples (I should really put one in there at some point), but here's some code from CrabEmu that demonstrates them.

Sprites are somewhat nice to work with since they take up about half the vertex buffer space that submitting a quad as a triangle strip does. I dunno if it will help your framerate any, but it can't hurt to try.

GyroVorbis · Post by **GyroVorbis** » Mon Jan 07, 2013 3:29 pm

As BlueCrab said, I would be highly surprised if your performance wasn't improved by using hardware sprites/quads. Each quad is 64 bytes total, rather than 32 per vertex. Since you are submitting less vertex data overall like this, I would be very shocked if you didn't get another performance boost.

The downside is that a "quad" on the DC has only one color. The color is specified in the header, rather than on a per-vertex basis... So as far as OpenGL is concerned, it does not behave the same way as GL_QUADS.

PH3NOM · Post by **PH3NOM** » Tue Jan 08, 2013 10:09 pm

BlueCrab wrote:If you're using a relatively recent version of KOS, you do have the ability to submit quads as well as triangle strips. In the PVR's terminology, quads are simply called sprites (which can be either textured or non-textured).

There aren't any real examples of using sprites in the KOS examples (I should really put one in there at some point), but here's some code from CrabEmu that demonstrates them.

Sprites are somewhat nice to work with since they take up about half the vertex buffer space that submitting a quad as a triangle strip does. I dunno if it will help your framerate any, but it can't hurt to try.

Hi BlueCrab, thank you for the suggestion, I have taken a quick look at using PVR Hardware Sprites.
First off, it seems I am using a relatively older version of KOS, the one included in Dev Iso R4.
The first snag, this build of KOS does not have a structure pvr_sprite_hdr_t, so I had to define this first:

Code: Select all

#define pvr_sprite_hdr_t pvr_poly_ic_hdr_t

I don't think that should be a problem, right?

Then, to make sure I understand things correctly, I have opted to make a slight tweak and rename the structure pvr_sprite_txr_t

Code: Select all

typedef struct {
	uint32 flags;
	float ax, ay, az;
	float bx, by, bz;
	float cx, cy, cz;
	float dx, dy, dz; /* Last vertex's Z-component is ignored, and calculated by PVR */
	uint32 auv;
	uint32 buv;
	uint32 cuv;  	/* Last vertex's UV component will be calculated by the PVR */
} pvr_sprite_vertex_t;

Finally, when I compiled a first attempt, I got a run-time error about Sprites having to be textured.
I guess that is a caveat of using an older build of KOS?

GyroVorbis wrote:As BlueCrab said, I would be highly surprised if your performance wasn't improved by using hardware sprites/quads. Each quad is 64 bytes total, rather than 32 per vertex. Since you are submitting less vertex data overall like this, I would be very shocked if you didn't get another performance boost.

The downside is that a "quad" on the DC has only one color. The color is specified in the header, rather than on a per-vertex basis... So as far as OpenGL is concerned, it does not behave the same way as GL_QUADS.

Recently I read something you posted about DMA rendering with the PVR. I had some questions to ask you, but I will get back to that later.

I will start off by mentioning ( for those who don't already know ) that the PVR has 3 different types of parameters that it can process.
1.) Control Parameter - These are instructions like what type of list is being submitted ( opaque, alpha, punch-through, etc )
2.) Global Parameter - This contains information about the primitive type to render ( triangle, sprite, modifier, etc )
3.) Vertex Parameter - This is the vertex data
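
The three classes above can be told apart from the first 32-bit word of each parameter. The sketch below assumes the parameter type sits in the top three bits of that word (0 to 2 for control, 4 for polygon or modifier, 5 for sprite, 7 for vertex) and that bit 28 marks the end of a strip. The command values are what I recall from KOS's dc/pvr.h; the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Command words as I recall them from dc/pvr.h (assumed values). */
#define CMD_POLYHDR    0x80840000u
#define CMD_SPRITE     0xa0000000u
#define CMD_VERTEX     0xe0000000u
#define CMD_VERTEX_EOL 0xf0000000u

typedef enum { PARAM_CONTROL, PARAM_GLOBAL, PARAM_VERTEX } param_class_t;

/* Classify a parameter by the top three bits of its first word. */
static param_class_t classify_param(uint32_t cmd)
{
    uint32_t type = cmd >> 29;

    assert(type != 3 && type != 6);  /* unused in the assumed encoding */
    if (type == 7)
        return PARAM_VERTEX;
    if (type >= 4)
        return PARAM_GLOBAL;         /* polygon, modifier or sprite header */
    return PARAM_CONTROL;            /* end of list, user clip, ... */
}

/* Under the assumed encoding, bit 28 marks the last vertex of a strip. */
static int is_end_of_strip(uint32_t cmd)
{
    return (int)((cmd >> 28) & 1u);
}
```

Read this way, 0xe0000000 identifies a vertex parameter rather than a primitive type; whether the vertices belong to a polygon or a sprite is decided by the preceding global parameter.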

So, after a quick look into it, let me explain why I actually think this will be slower than the way I do things currently.

For now, I keep one generic pvr_poly_hdr_t in memory that is used for all un-textured polys.
I keep an internal array of pvr_poly_hdr_t's in memory that are used for each individual texture bound to the API.
From my testing, Poly Headers ( PVR Global Parameter ) need to be submitted as sparingly as possible, because of the time the PVR takes to process them.
So, I only submit Global Parameters when you make a call to glBindTexture.
For example, in this scene I quickly made for testing my OpenGL build, even though 48,000 Vertex Parameters are sent to the PVR each frame, only 2 Global Parameters are sent ( 1 per texture )

In GL, it is possible to render every quad with a different color, as seen in the KOS quadmark example

To render this scene using PVR sprites, every quad needs not only its Vertex Parameter, but also its own Global Parameter.
This means every quad sprite will need 96 bytes, instead of the 64 bytes you mentioned.
But, worse than the increased bandwidth overhead of 32 bytes per quad, the PVR has to process many, many Global Parameters, which, again from my testing, is slow.
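
The byte counts in this argument can be checked with a small calculation. This is only a sketch: it assumes 32 bytes per header, 32 bytes per strip vertex and 64 bytes per sprite, the figures quoted in this thread, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    HEADER_BYTES       = 32,  /* one global parameter (polygon or sprite header) */
    STRIP_VERTEX_BYTES = 32,  /* one vertex of a triangle strip */
    SPRITE_BYTES       = 64   /* one sprite vertex parameter (a whole quad) */
};

/* Bytes submitted for `quads` quads drawn as 4-vertex strips, with
   `headers` global parameters sent in total. */
static size_t strip_bytes(size_t quads, size_t headers)
{
    assert(quads <= SIZE_MAX / (4 * STRIP_VERTEX_BYTES));  /* no overflow */
    return headers * HEADER_BYTES + quads * 4 * STRIP_VERTEX_BYTES;
}

/* The same total for quads drawn as sprites. */
static size_t sprite_bytes(size_t quads, size_t headers)
{
    assert(quads <= SIZE_MAX / SPRITE_BYTES);              /* no overflow */
    return headers * HEADER_BYTES + quads * SPRITE_BYTES;
}
```

For 1000 quads that each have their own color, sprite_bytes(1000, 1000) gives 96000 bytes, while strip_bytes(1000, 1) gives 128032 bytes. By these numbers sprites still move less data, so the open question is the cost of processing 1000 headers instead of one, not the bandwidth.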

But, I could be wrong. It will be interesting to make some benchmarks in terms of speed.

Regardless of speed, I can tell now that PVR Sprites will not be able to implement GL_QUADS
1.) GL_QUADS support different colors per primitive, PVR Sprites do not ( the scene with Max Payne texture is not possible with sprites )
2.) PVR Sprites UV Coordinates are packed 16bit, this will lose accuracy from GL's floating point standard
3.) The 4th UV and Z value are determined by the hardware. This is probably best for 2-D rendering.

That said, I do think PVR Sprites will be a good way to implement GL_POINTS.
Since all points typically are the same color, we could get away with sending 1 Global Parameter per n number of Vertex Parameters

Oh yeah that leads me to a question I have been wondering. Is it possible to access the PVR's Registers Directly, instead of submitting a Parameter?

Post by **BlueCrab** » Tue Jan 08, 2013 11:57 pm

PH3NOM wrote:Hi BlueCrab, thank you for the suggestion, I have taken a quick look at using PVR Hardware Sprites.
First off, it seems I am using a relatively older version of KOS, the one included in Dev Iso R4.
The first snag, this build of KOS does not have a structure pvr_sprite_hdr_t, so I had to define this first:
Code: Select all
#define pvr_sprite_hdr_t pvr_poly_ic_hdr_t
I don't think that should be a problem, right?

The pvr_poly_ic_hdr_t is not the correct one to use with sprites, technically. It may well work (and it was the first polygon header that I managed to get to work with them and it was documented as the correct thing to use in KOS for a while, until I got the correct setup working with them), but it is not the correct way to do things. You'll want a structure like this instead:

Code: Select all

/** \brief  PVR polygon header specifically for sprites.

    This is the equivalent of a pvr_poly_hdr_t for use when a quad/sprite is to
    be rendered. Note that the color data is here, not in the vertices.

    \headerfile dc/pvr.h
*/
typedef struct {
    uint32  cmd;                /**< \brief TA command */
    uint32  mode1;              /**< \brief Parameter word 1 */
    uint32  mode2;              /**< \brief Parameter word 2 */
    uint32  mode3;              /**< \brief Parameter word 3 */
    uint32  argb;               /**< \brief Sprite face color */
    uint32  oargb;              /**< \brief Sprite offset color */
    uint32  d1;                 /**< \brief Dummy value */
    uint32  d2;                 /**< \brief Dummy value */
} pvr_sprite_hdr_t;

You should be able to find the code in current KOS that is used to calculate one of these relatively easily (look in kernel/arch/dreamcast/hardware/pvr/pvr_prim.c at the function pvr_sprite_compile).

Then, to make sure I understand things correctly, I have opted to make a slight tweak and rename the structure pvr_sprite_txr_t
Code: Select all
typedef struct {
	uint32 flags;
	float ax, ay, az;
	float bx, by, bz;
	float cx, cy, cz;
	float dx, dy, dz; /* Last vertex's Z-component is ignored, and calculated by PVR */
	uint32 auv;
	uint32 buv;
	uint32 cuv;  	/* Last vertex's UV component will be calculated by the PVR */
} pvr_sprite_vertex_t;
Finally, when I compiled a first attempt, I got a run-time error about Sprites having to be textured.
I guess that is a caveat of using an older build of KOS?

I think this might actually be a caveat of using the pvr_poly_ic_hdr_t with sprites. I know that I could never get untextured sprites working with that header type. As I said before, it is not the right way to do things and causes some interesting problems, even on the hardware and especially on NullDC, IIRC. I remember having some... interesting conversations with the main NullDC developer on the subject of KOS' PVR sprite support back then about how wrong KOS' behavior was.

My response was basically: "Well it works on the hardware, so your code is wrong, not mine."

It was all in good fun, of course.

I highly recommend that you look into updating your toolchain and KOS version however. There are a bunch of improvements to various parts of KOS since the SVN revision used in the DC Dev ISO r4, including some rather large changes to threading and other stuff (plus the sprite fixes for the PVR stuff, oh and recently read-only SD card support with ext2fs -- which I'm actively working on expanding to read-write).

For now, I keep one generic pvr_poly_hdr_t in memory that is used for all un-textured polys.
I keep an internal array of pvr_poly_hdr_t's in memory that are used for each individual texture bound to the API.
From my testing, Poly Headers ( PVR Global Parameter ) need to be submitted in a minimalistic fashion due to processing time by the PVR.
So, I only submit Global Parameters when you make a call to glBindTexture.
For example, in this scene I quickly made for testing my OpenGL build, even though 48,000 Vertex Parameters are sent to the PVR each frame, only 2 Global Parameters are sent ( 1 per texture )

You should be able to share the global parameter with quads just the same, if you're talking simple textured quads. Yes, if you want to use flat colored boxes, you'll have to submit a different one per quad color, but as GyroVorbis (and you) pointed out, there aren't really many good uses for untextured PVR sprites.

Regardless of speed, I can tell now that PVR Sprites will not be able to implement GL_QUADS
1.) GL_QUADS support different colors per primitive, PVR Sprites do not ( the scene with Max Payne texture is not possible with sprites )
2.) PVR Sprites UV Coordinates are packed 16bit, this will lose accuracy from GL's floating point standard
3.) The 4th UV and Z value are determined by the hardware. This is probably best for 2-D rendering.

While you're technically correct on all the points you made there, I don't think #2 is as big a problem as you might initially think.
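
To put a rough number on point #2, here is a sketch of 16-bit UV packing. It assumes each coordinate keeps only the upper 16 bits of its 32-bit float (sign, exponent and the top 7 mantissa bits), with u in the high half of the word and v in the low half. That layout is my understanding of the format and is not confirmed in this thread; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Pack u and v into one word by truncating each float to its top 16 bits. */
static uint32_t pack_uv16(float u, float v)
{
    uint32_t ub, vb;

    assert(sizeof(float) == sizeof(uint32_t));
    memcpy(&ub, &u, sizeof ub);   /* reinterpret the bits without aliasing issues */
    memcpy(&vb, &v, sizeof vb);
    return (ub & 0xffff0000u) | (vb >> 16);
}

/* Expand one 16-bit half back to a float so the truncation loss is visible. */
static float unpack_uv16_half(uint32_t half)
{
    uint32_t bits = (half & 0xffffu) << 16;
    float f;

    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Under that assumption, values such as 0.0, 0.5 and 1.0 survive exactly, and other values lose less than 1/128 of their magnitude. Between 0.5 and 1.0 the step size is 1/256, which works out to one texel on a 256-texel-wide texture.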

Oh yeah that leads me to a question I have been wondering. Is it possible to access the PVR's Registers Directly, instead of submitting a Parameter?

What would you be trying to do by doing so? There's a somewhat limited amount of stuff you can do by accessing the registers directly (that I know of), so I'd imagine the answer to your question would be that you probably can't do anything useful by doing so.

GyroVorbis · Post by **GyroVorbis** » Wed Jan 09, 2013 4:23 pm

Yeah, I can definitely see where the hardware sprites/quads are going to fall short for implementing GL_QUADS...

They can be EXTREMELY useful in certain scenarios though... Imagine rendering a tile-based map in a 2D game. Since the tiles are being rendered from the same sheet, you submit one header followed by one 64-byte sprite vertex parameter per tile.
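
A sketch of that tile-map case: one sprite per tile, all meant to follow a single shared header. The struct mirrors the 64-byte layout posted earlier in the thread and is not the KOS definition; the corner order (a = top left, b = top right, c = bottom right, d = bottom left) and the use of one flags value per sprite are assumptions, and the UVs are left at zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 64-byte sprite vertex, mirroring the layout posted above. */
typedef struct {
    uint32_t flags;
    float ax, ay, az;
    float bx, by, bz;
    float cx, cy, cz;
    float dx, dy, dz;        /* dz is a placeholder; the hardware derives it */
    uint32_t auv, buv, cuv;  /* packed UVs, left zero in this sketch */
} tile_sprite_t;

/* Fill `out` with one sprite per tile of a cols x rows grid and return the
   number of sprites written. The caller would submit one sprite header,
   then this whole array. */
static int build_tile_sprites(tile_sprite_t *out, int cols, int rows,
                              float tile, float z, uint32_t vertex_cmd)
{
    int n = 0;

    assert(sizeof(tile_sprite_t) == 64);
    for (int ty = 0; ty < rows; ty++) {
        for (int tx = 0; tx < cols; tx++) {
            tile_sprite_t *s = &out[n++];
            float x0 = (float)tx * tile, y0 = (float)ty * tile;
            float x1 = x0 + tile,        y1 = y0 + tile;

            s->flags = vertex_cmd;
            s->ax = x0; s->ay = y0; s->az = z;
            s->bx = x1; s->by = y0; s->bz = z;
            s->cx = x1; s->cy = y1; s->cz = z;
            s->dx = x0; s->dy = y1; s->dz = 0.0f;
            s->auv = s->buv = s->cuv = 0;
        }
    }
    return n;
}
```

With a shared header, a 3 x 2 map costs 32 + 6 * 64 bytes of submitted data.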

BlueCrab, what kind of threading changes have been implemented since DC Dev ISO r4?

Post by **BlueCrab** » Wed Jan 09, 2013 4:35 pm

GyroVorbis wrote:Bluecrab, what kind of threading changes have been implemented since DC Dev ISO r4?

The entirety of the threading system was pretty much overhauled. About the only thing that didn't change was the scheduler itself.

All the synchronization primitives have changed so that they aren't allocated on the heap in general (which allows static initialization of global sync primitives, like you have in POSIX with things like PTHREAD_MUTEX_INITIALIZER). You have a couple of different kinds of mutexes combined into one mutex type (the recursive mutex has been merged in with the normal mutex, and you have an "error checking" mutex variant as well). Mutexes are no longer semaphores under the covers -- they're two distinct and separate types now.
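
For comparison, this is the POSIX form of static initialization referred to above. It is plain pthreads, not KOS code; KOS's own mutex type and initializer are not shown here, and the counter is just a hypothetical piece of shared state.

```c
#include <assert.h>
#include <pthread.h>

/* Statically initialized: no init call and no heap allocation is needed
   before the first lock. */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static int counter = 0;

/* Increment the shared counter under the lock and return the new value. */
static int bump_counter(void)
{
    int rc = pthread_mutex_lock(&counter_lock);
    assert(rc == 0);

    int value = ++counter;

    rc = pthread_mutex_unlock(&counter_lock);
    assert(rc == 0);
    return value;
}
```

Because the mutex is a plain object with a constant initializer, it can live at file scope and be used from the very first call without any setup code.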

You can now return values from threads, like POSIX specifies. You have real joinable threads now, not just detached ones like you had before (and the default is to create threads as joinable, like POSIX). There was a major bug fixed in the thread reaping code (to clean up dead threads). Basically, the old code didn't work at all and dead threads never really got removed from the system entirely. A major memory leak in cleaning up old threads when the stdio functions were used was fixed as well.

I don't remember if the version of the threading stuff in the dev iso had this or not, so I'll throw it in here too. You now have thread-local storage provided via a very POSIX-like API as well.

That's all I can think of off the top of my head. Basically, all the big changes were to align KOS' internal threads a lot closer to the POSIX specification (and to fix really bad bugs).

Tvspelsfreak · Post by **Tvspelsfreak** » Wed Jan 09, 2013 5:02 pm

Unless it's been fixed recently, setting the textured sprite color in KOS won't actually have any effect. KOS sets the texture environment for sprites to PVR_TXRENV_REPLACE, which replaces any color with that from the texture. You'll need to use PVR_TXRENV_MODULATE or PVR_TXRENV_MODULATEALPHA instead. The pvr_sprite_cxt_t.txr struct lacks an env setting (as seen in pvr_poly_cxt_t) and there's no logic in pvr_sprite_compile for handling texture environment, so you'll have to modify the header after it's been compiled.

Code: Select all

pvr_sprite_cxt_txr(...);
pvr_sprite_compile(...);
hdr.mode2 |= (PVR_TXRENV_MODULATE << PVR_TA_PM2_TXRENV_SHIFT);
or
hdr.mode2 |= (PVR_TXRENV_MODULATEALPHA << PVR_TA_PM2_TXRENV_SHIFT);

Post by **BlueCrab** » Wed Jan 09, 2013 9:28 pm

Tvspelsfreak wrote:Unless it's been fixed recently, setting the textured sprite color in KOS won't actually have any effect. KOS sets the texture environment for sprites to PVR_TXRENV_REPLACE, which replaces any color with that from the texture. You'll need to use PVR_TXRENV_MODULATE or PVR_TXRENV_MODULATEALPHA instead. The pvr_sprite_cxt_t.txr struct lacks an env setting (as seen in pvr_poly_cxt_t) and there's no logic in pvr_sprite_compile for handling texture environment, so you'll have to modify the header after it's been compiled.
Code: Select all
pvr_sprite_cxt_txr(...);
pvr_sprite_compile(...);
hdr.mode2 |= (PVR_TXRENV_MODULATE << PVR_TA_PM2_TXRENV_SHIFT);
or
hdr.mode2 |= (PVR_TXRENV_MODULATEALPHA << PVR_TA_PM2_TXRENV_SHIFT);

Yeah, it was indeed missing. Thanks for pointing it out. I've fixed that in the git repository just now.

I think that may have been a vestige of when it was using the wrong header for sprites and the color calculation just didn't work as a result (so I just left it out from the struct entirely). Either way, it is fixed now. Thanks for pointing out that it wasn't right. I seem to remember a bunch of things not working right with that header type, but I can't recall exactly if that was one of them...

TapamN · Post by **TapamN** » Thu Jan 10, 2013 10:00 pm

A lot of the speed difference between your code using mat_transform_sq and using inline assembly with direct SQs probably comes from subroutine call overhead. Try having glVertex just set the vertex position and flags, and have glEnd set the end-of-strip and do the mat_transform_sq call, then see how it handles longer strips.

A more efficient way of doing the transformation inline would be like this:

Code: Select all

inline void glVertex3fv( float *v )
{
	register float __x __asm__("fr0") = (v[0]); 
	register float __y __asm__("fr1") = (v[1]); 
	register float __z __asm__("fr2") = (v[2]); 
	register float __w __asm__("fr3") = 1;
	
	__asm__ __volatile__( 
		"ftrv	xmtrx,fv0\n"
		: "=f" (__x), "=f" (__y), "=f" (__z), "=f" (__w)
		: "0" (__x), "1" (__y), "2" (__z) , "3" (__w)
		: );
    float invz = 1/z;
    DCE_DR_Vert->x = __x * invz; 
    DCE_DR_Vert->y = __y * invz;
    DCE_DR_Vert->z = invz;
    
    if(++DR_VERT%DR_PRIM==0)
        DCE_DR_Vert->flags = PVR_CMD_VERTEX_EOL;
    else
        DCE_DR_Vert->flags = PVR_CMD_VERTEX;
        
    __asm__ __volatile__("pref @%0" : : "r" (DCE_DR_Vert));
}

The compiler can optimize this better. I've never seen the compiler change code inside asm blocks (although I have seen it drop asm blocks), but by moving the code out of assembly and into C, the compiler can reorder things and make better optimizations. For example, the compiler should be able to tell that setting the flags can be done while the FPU is working on the division, so it could reorder the code to look like this...

Code: Select all

    float invz = 1/z;
    if(++DR_VERT%DR_PRIM==0)
        DCE_DR_Vert->flags = PVR_CMD_VERTEX_EOL;
    else
        DCE_DR_Vert->flags = PVR_CMD_VERTEX;

    DCE_DR_Vert->x = __x * invz; 
    DCE_DR_Vert->y = __y * invz;
    DCE_DR_Vert->z = invz;

...which is faster since the CPU doesn't idle completely during the slow division operation.

Also, while using quads is good when they are an option (since they require less CPU time to set up and submit), they don't actually seem to use less vertex buffer space than a two-triangle strip. You can check how much of the vertex buffer is used with pvr_get_stats. IIRC, it's 92 bytes either way (that's with texture, 16-bit UV, and no offset color on both). I don't know if the PVR, when actually rendering, handles them any faster than two triangles.

PH3NOM · Post by **PH3NOM** » Fri Jan 11, 2013 8:50 pm

TapamN, thank you for the advice on optimizing the routine.

The code you posted will not compile; I had to remove that last asm line.

Code: Select all

inline void glVertex3f( float x, float y, float z )
{
	register float __x __asm__("fr0") = (x); 
	register float __y __asm__("fr1") = (y); 
	register float __z __asm__("fr2") = (z); 
	register float __w __asm__("fr3") = 1;

	__asm__ __volatile__( 
		"ftrv	xmtrx,fv0\n" 
		: "=f" (__x), "=f" (__y), "=f" (__z), "=f" (__w)
		: "0" (__x), "1" (__y), "2" (__z), "3" (__w) );

    float invz = 1/z;
    DCE_DR_Vert->x = __x * invz;
    DCE_DR_Vert->y = __y * invz;
    DCE_DR_Vert->z = invz;
   
    if(++DR_VERT%DR_PRIM==0)
        DCE_DR_Vert->flags = PVR_CMD_VERTEX_EOL;
    else
        DCE_DR_Vert->flags = PVR_CMD_VERTEX;
       
    __asm__ __volatile__("pref @%0" : : "r" (DCE_DR_Vert));
}

But the transformations are still not correct.

About Quads/Sprites, my guess is that the PVR derives THEN stores the needed vertex data, simply reducing required data transfer.
That would explain why it uses the same amount of vertex buffer space, as you pointed out

PH3NOM · Post by **PH3NOM** » Fri Jan 11, 2013 9:20 pm

TapamN wrote:A lot of the speed difference between your code using mat_transform_sq and using inline assembly with direct SQs probably comes from subroutine call overhead. Try having glVertex just set the vertex position and flags, and have glEnd set the end-of-strip and do the m_t_s call, then see how it handles longer strips.

If I understand your advice here fully, you are saying we could throw the entire vertex array in between glBegin() and glEnd() as a single tristrip.

Problem is, in order to be compatible with OpenGL, say using GL_TRIANGLES, we must treat every 3 vertices received as a single triangle. Or using GL_QUADS every 4 vertices etc.

Consider the KOS quadmark example

Code: Select all

	glBegin(GL_QUADS);
	for (i=0; i<polycnt; i++) {
		x = rand() % 640;
		y = rand() % 480;
		z = rand() % 100 + 1;
		size = rand() % 50 + 1;
		col = (rand () % 255)*0.00391f;
	
		glColor3f(col, col, col);
		glVertex3f(x-size, y-size, z);
		glVertex3f(x+size, y-size, z);
		glVertex3f(x-size, y+size, z);
		glVertex3f(x+size, y+size, z);
	}
	glEnd();

If the End_of_strip was set only on the last vertex by calling glEnd, the result would be a strip that is, simply, wrong.

TapamN · Post by **TapamN** » Fri Jan 11, 2013 11:46 pm

PH3NOM wrote:Broken code

Well, it's not surprising the code I posted didn't work right, since I didn't test it. I also misread the asm you posted, and thought you were using 1/z instead of 1/w for depth. The correct way for the part after the asm should be this:

Code: Select all

    float invw = 1/w;
    DCE_DR_Vert->x = __x * invw;
    DCE_DR_Vert->y = __y * invw;
    DCE_DR_Vert->z = invw;

I'm not sure you would see a speed difference in an emulator with this, since emulators are more concerned with running the code quickly than with accurate timing. The emulator might assume that every floating point division takes X cycles, but on real hardware the surrounding code can make it cost more or fewer cycles, and the emulator skips that kind of close examination in order to run faster.

For example, floating point division can cost half a cycle to 12 cycles depending on what the surrounding code is like. If you need the division result immediately, it could cost 12 cycles, but if there's a lot of stuff you can do waiting for the division, it only costs one of the two execution slots available each cycle. Doing multiple divisions or mixed square roots and division can also cause extra slowdowns an emulator might not replicate.

PH3NOM wrote:OpenGL

Oh, I was thinking about triangle/quad strips, not individual triangles/quads. For those, try having glVertex set the position and flags, with vertex or end of strip as required (i.e. every 3rd vertex of a bunch of triangles would be a strip-end vertex, while the rest would be normal vertices), and just have glEnd call mat_transform_sq without messing with the existing flags. That should get correct triangle/quad flags while reducing the number of mat_transform_sq calls.
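
A minimal sketch of that suggestion, with hypothetical names throughout: the vertex call only stores the position and picks the flag from the vertex count, and the end call is the single place where the batched transform would run. The transform itself is replaced by a counter so the sketch runs off-target, and the flag values are assumed to match PVR_CMD_VERTEX and PVR_CMD_VERTEX_EOL.

```c
#include <assert.h>
#include <stdint.h>

#define CMD_VERTEX     0xe0000000u  /* assumed to match PVR_CMD_VERTEX */
#define CMD_VERTEX_EOL 0xf0000000u  /* assumed to match PVR_CMD_VERTEX_EOL */
#define MAX_VERTS      1024

typedef struct { uint32_t flags; float x, y, z; } buf_vertex_t;

static buf_vertex_t buf[MAX_VERTS];
static int buf_count;        /* vertices buffered since sketch_begin() */
static int verts_per_prim;   /* 3 for triangles, 4 for quads */
static int transform_calls;  /* counts the batched transform calls */

static void sketch_begin(int per_prim)
{
    assert(per_prim > 0);
    verts_per_prim = per_prim;
    buf_count = 0;
}

/* Store position and flags only; no transform happens here. */
static void sketch_vertex(float x, float y, float z)
{
    assert(buf_count < MAX_VERTS);
    buf_vertex_t *v = &buf[buf_count++];

    v->x = x; v->y = y; v->z = z;
    /* Every verts_per_prim-th vertex closes its own short strip. */
    v->flags = (buf_count % verts_per_prim == 0) ? CMD_VERTEX_EOL : CMD_VERTEX;
}

/* One batched call per begin/end pair. The real code would call
   mat_transform_sq on buf here and leave the flags untouched. */
static void sketch_end(void)
{
    transform_calls++;
}
```

Submitting six vertices in triangle mode marks the third and sixth as strip ends and still costs only one transform call.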

PH3NOM · Post by **PH3NOM** » Sun Jan 13, 2013 12:19 pm

TapamN wrote:Well, it's not surprising the code I posted didn't work right, since I didn't test it. I also misread the asm you posted, and thought you were using 1/z instead of 1/w for depth. The correct way for the part after the asm should be this:
Code: Select all
    float invw = 1/w;
    DCE_DR_Vert->x = __x * invw;
    DCE_DR_Vert->y = __y * invw;
    DCE_DR_Vert->z = invw;

Yes that is the correct transformation, just had to change first line there to

Code: Select all

float invw = 1/__w;
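
For anyone without the hardware at hand, the same math can be sketched in portable C. The 4x4 multiply below stands in for the FTRV step, with the matrix stored row-major for readability (the real XMTRX register layout is not modeled), and the type and function names are hypothetical.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } vec4_t;

/* Plain C stand-in for the FTRV step: returns m * v. */
static vec4_t transform(const float m[4][4], vec4_t v)
{
    vec4_t r;

    r.x = m[0][0] * v.x + m[0][1] * v.y + m[0][2] * v.z + m[0][3] * v.w;
    r.y = m[1][0] * v.x + m[1][1] * v.y + m[1][2] * v.z + m[1][3] * v.w;
    r.z = m[2][0] * v.x + m[2][1] * v.y + m[2][2] * v.z + m[2][3] * v.w;
    r.w = m[3][0] * v.x + m[3][1] * v.y + m[3][2] * v.z + m[3][3] * v.w;
    return r;
}

/* Perspective divide as in the corrected snippet: x and y are scaled by
   1/w, and 1/w itself is stored as the depth value. */
static vec4_t project(vec4_t clip)
{
    assert(clip.w != 0.0f);
    float invw = 1.0f / clip.w;
    vec4_t out = { clip.x * invw, clip.y * invw, invw, 1.0f };
    return out;
}
```

The divide has to use the transformed w, not the z that was passed in, which is the difference between the broken and the corrected versions above.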

Will doing the final multiplication / division outside of the registers slow the operation down?
Also, I would like to avoid creating local variables if possible.
Here, I am using register "fr4" set to 1 to apply the 1/__w operation.

Code: Select all

    __asm__ __volatile__( "fdiv fr3, fr4\n" );
    DCE_DR_Vert->z = __w;
    __asm__ __volatile__( "fmul fr3, fr0\n" );
    DCE_DR_Vert->x = __x;
    __asm__ __volatile__( "fmul fr3, fr1\n" );
    DCE_DR_Vert->y = __y;

One of the... uh, interesting things the library does is that it doesn't use SQs or DMA to submit data to the PVR. Since the library writes the intermediate data to a buffer between passes, you would either have to stop and copy the final results to the SQ manually, wasting CPU time, or dump everything to memory and then use DMA, which wastes bandwidth and still indirectly slows the CPU down. Instead, the library is set up so that the generated vertex data is sent directly from the cached buffer to the PVR. It uses the OCINDEX cache mode of the SH-4 to allocate one half of the cache on top of the TA input, and submits data with cache writeback instructions (OCBWB). It's kind of weird, but it's completely reliable on real hardware (although no emulator is accurate enough to support this).

Looking at KOS's cache.s code, I have just now realized what you are doing. Excellent idea.

TapamN · Post by **TapamN** » Sun Jan 13, 2013 2:22 pm

PH3NOM wrote:Will doing the final multiplication / division outside of the registers slow the operation down?
Also, I would like to avoid creating local variables if possible.
Here, I am using register "fr4" set to 1 to apply the 1/__w operation.
Code: Select all
    __asm__ __volatile__( "fdiv fr3, fr4\n" );
    DCE_DR_Vert->z = __w;
    __asm__ __volatile__( "fmul fr3, fr0\n" );
    DCE_DR_Vert->x = __x;
    __asm__ __volatile__( "fmul fr3, fr1\n" );
    DCE_DR_Vert->y = __y;     

The SH-4 does all operations in registers. The compiler will have the multiplication/division/etc done inside registers (because that's the only way the CPU can do them) and then write the results to memory. Trying to stick extra assembly in like that will probably be slower than what GCC can create with optimizations turned on. The only time I can think of when you should use inline assembly is when you want the CPU to run instructions that GCC itself isn't capable of generating, like the FTRV instruction. GCC can handle simple division and multiplication just fine.

Like I mentioned before, by having the multiplication and division done in C, GCC can figure out where good places to put the FMUL and FDIV instructions are. By sticking all of that in asm, you force GCC to use something slower that it cannot plan around. This is especially important with inline functions, like your glVertex, since it gives the compiler chances to further optimize the code based on where it is called.

PH3NOM · Post by **PH3NOM** » Thu Jan 17, 2013 11:05 pm

TapamN wrote:
PH3NOM wrote:OpenGL
Oh, I was thinking about triangle/quad strips, not individual triangles/quads. For those, try having glVertex set the position and flags, with vertex or end of strip as required (i.e. every 3rd vertex of a bunch of triangles would be a strip end vertices, while the rest would be normal vertices), and just have glEnd call mat_transform_sq without messing with the existing flags. That should get correct triangle/quad flags while reducing the number of mat_transform_sq calls.

To do what you mentioned here requires the use of a Vertex Buffer of significant size.
I have made an implementation that uses a Vertex Buffer of 3 MB, which is transformed using mat_transform_sq() with few calls.
You are right, we can achieve much higher throughput.
In this scene, every frame a bunch of triangles are created in real time.

GyroVorbis · Post by **GyroVorbis** » Fri Jan 18, 2013 1:32 pm

PH3N0M wrote:To do what you mentioned here, requires the use of a Vertex Buffer of significant size.

I'm confused as to why this is the case.

Don't you only need a vertex buffer as large as your longest strip with the method you're using? glEnd() is calling mat_transform_sq on a per-strip basis then copying them over to the PVR directly (direct rendering), right?

You're seeing these kinds of performance improvements calling mat_transform_sq on every 4 vertices?

PH3NOM · Post by **PH3NOM** » Fri Jan 18, 2013 9:02 pm

GyroVorbis wrote:
PH3N0M wrote:To do what you mentioned here, requires the use of a Vertex Buffer of significant size.
I'm confused as to why this is the case.

Don't you only need a vertex buffer as large as your longest strip with the method you're using? glEnd() is calling mat_transform_sq on a per-strip basis then copying them over to the PVR directly (direct rendering), right?

You're seeing these kinds of performance improvements calling mat_transform_sq on every 4 vertices?

Consider this code I wrote for testing purposes, as seen in the previous screen I posted.

Code: Select all

        GlBegin(GL_TRIANGLES);
   
        DWORD x,y;
        for(x=0; x<640;x+=PRIM_SIZE)
            for(y=0; y<480;y+=PRIM_SIZE)
            { 
                GlColor1ui( 0xFFFF0000 ); /* Red */
                GlVertex3f(x, y+PRIM_SIZE, 1.0f);
            
                GlColor1ui( 0xFF00FF00 ); /* Green */ 
                GlVertex3f(x+(PRIM_SIZE/2.0), y, 1.0f);

                GlColor1ui( 0xFF0000FF ); /* Blue */ 
                GlVertex3f(x+PRIM_SIZE, y+PRIM_SIZE, 1.0f);
            }

        GlEnd();

GlColor1ui is not a part of the official standard; for a speed increase, I wrote this specific to the Dreamcast API.
( GlColor1ui(0xFFFF0000) produces the same results as glColor4f(1.0f, 0.0f, 0.0f, 1.0f) without the need to pack 4 floats into an unsigned int )
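
For reference, this is roughly the packing work that a packed-color call skips. It is a sketch with hypothetical function names, assuming the 0xAARRGGBB layout implied by the color comments in the test code above; it is not the API's actual conversion routine.

```c
#include <assert.h>
#include <stdint.h>

/* Convert one channel from [0.0, 1.0] to 0..255, clamping out-of-range input. */
static uint32_t channel_to_byte(float c)
{
    assert(c == c);               /* reject NaN, which cannot be converted */
    if (c <= 0.0f)
        return 0;
    if (c >= 1.0f)
        return 255;
    return (uint32_t)(c * 255.0f);
}

/* Pack floats given in glColor4f order (r, g, b, a) into 0xAARRGGBB. */
static uint32_t pack_argb(float r, float g, float b, float a)
{
    return (channel_to_byte(a) << 24) | (channel_to_byte(r) << 16) |
           (channel_to_byte(g) << 8)  |  channel_to_byte(b);
}
```

With this layout, opaque red packs to 0xFFFF0000 and opaque blue to 0xFF0000FF. Each call costs four float conversions plus the shifts, which is the per-vertex work a pre-packed color avoids.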

Anyway, in this code, (640*480)/(5*5), or 12288 triangles are created every frame, which means 36864 vertices are submitted before a call to GlEnd.

Point being, when GlBegin(GL_TRIANGLES) is called, the pipeline has no idea how many vertices will be submitted before a call to GlEnd() is made.

Furthermore, TapamN advised to reduce the number of calls to mat_transform_sq.

In order to accommodate both needs, I have decided to use a Vertex Buffer of 3 MB ( static, size can be changed at compile time; using a dynamic array would be too wasteful ).
I have reduced the number of function calls to mat_transform_sq to 1 per call to glBegin/End, or 1 per 3 MB of vertex buffer ( the code checks for buffer overflow and flushes the buffer by calling mat_transform_sq, so the buffer will never be "full" ).

If I don't explain things clearly, perhaps looking at my code will make things more clear.

Code: Select all

#define GL_MAX_VERTICES (1024*96) /* 1024*96*32 = 3MB ( 98304 vertices ) */
static pvr_vertex_t VERTEX_BUFFER[GL_MAX_VERTICES] __attribute__((aligned(64)));
static DWORD        VERTICES = 0;

static uint32 GL_COLOR = 0xFFFFFFFF;
static float  GL_UV[2] = {0.0f,0.0f};
static int    GL_PRIMITIVE = -1;

inline void GlColor1ui( uint32 c )
{
    GL_COLOR = c;
}

inline void GlTexCoord2f( float u, float v )
{
    GL_UV[0] = u;
    GL_UV[1] = v;
}

inline void GlEnd()
{       
    mat_transform_sq( (pvr_vertex_t*)VERTEX_BUFFER, (pvr_vertex_t*)0xe0000000, VERTICES );
    VERTICES = 0;
}

inline void GlVertex3f( float x, float y, float z )
{
    VERTEX_BUFFER[VERTICES].x = x;
    VERTEX_BUFFER[VERTICES].y = y;
    VERTEX_BUFFER[VERTICES].z = z;
    VERTEX_BUFFER[VERTICES].u = GL_UV[0];
    VERTEX_BUFFER[VERTICES].v = GL_UV[1];
    VERTEX_BUFFER[VERTICES].argb = GL_COLOR;
    
    if(++VERTICES==GL_MAX_VERTICES) /* buffer full: flush it and keep going */
    {
        GlEnd();
        GlBegin(GL_PRIMITIVE);
    }    
}

Any advice on how to make things more efficient will be appreciated!

GyroVorbis · Post by **GyroVorbis** » Tue Jan 22, 2013 11:00 am

Aaaaaaaah, for some reason I thought you were making those triangles out of triangle strips between glBegin() and glEnd() calls... and I guess either way you would need a big ass buffer, because you have no idea how many vertices the user is going to submit before glEnd() anyway...

That's exactly how I would have done it, with a compile-time modifiable size for a static vertex buffer.

edit: Another thing I'm wondering is how you are planning to handle glEnable(GL_BLEND) and glDisable(GL_BLEND) to switch between TR and OP polys.

Since you are using direct rendering, you have to submit all OP polys then all TR polys and you can't mix them. If you want to be able to submit OP or TR polys whenever, aren't you also going to need to use a buffered approach anyway? Then you fill up two buffers in RAM (OP and TR lists) and have KOS DMA them over to the PVR when the scene ends.
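
The buffered approach described here can be sketched as follows. All names are hypothetical, the vertex struct is a reduced stand-in, and the final transfer to the PVR is replaced by a copy into an output array so the ordering can be checked off-target.

```c
#include <assert.h>
#include <stdint.h>

#define LIST_CAPACITY 256

typedef struct { uint32_t flags; float x, y, z; uint32_t argb; } list_vertex_t;

typedef struct {
    list_vertex_t verts[LIST_CAPACITY];
    int count;
} vertex_list_t;

static vertex_list_t opaque_list, translucent_list;
static int blend_enabled;  /* set by the GL_BLEND enable/disable calls */

static void sketch_set_blend(int on)
{
    blend_enabled = on;
}

/* Route each vertex to the list that matches the current blend state. */
static void sketch_submit(list_vertex_t v)
{
    vertex_list_t *dst = blend_enabled ? &translucent_list : &opaque_list;

    assert(dst->count < LIST_CAPACITY);
    dst->verts[dst->count++] = v;
}

/* At scene end, emit the whole opaque list and then the whole translucent
   list. The copy into `out` stands in for the transfer to the PVR.
   Returns the total vertex count; *opaque_count receives the OP count. */
static int sketch_finish(list_vertex_t *out, int *opaque_count)
{
    int n = 0;

    for (int i = 0; i < opaque_list.count; i++)
        out[n++] = opaque_list.verts[i];
    *opaque_count = n;
    for (int i = 0; i < translucent_list.count; i++)
        out[n++] = translucent_list.verts[i];

    opaque_list.count = 0;
    translucent_list.count = 0;
    return n;
}
```

The caller can toggle blending in any order during the frame; the lists are only separated out at the end, at the cost of holding both buffers in RAM.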

At least that's the approach I'm using. I am also writing an API kind of like this. It is not quite as multipurpose as yours, but it is an abstraction away from the PVR that our engine is using. I find this topic very interesting, because these are all things I'm considering/wondering in my own API.
