mat_transform / pvr_prim vs mat_transform_sq

Post by **PH3NOM** » Tue Jan 22, 2013 10:04 pm

GyroVorbis wrote: Another thing I'm wondering is how you are planning to handle glEnable(GL_BLEND) and glDisable(GL_BLEND) to switch between TR and OP polys.

Since you are using direct rendering, you have to submit all OP polys then all TR polys and you can't mix them. If you want to be able to submit OP or TR polys whenever, aren't you also going to need to use a buffered approach anyway?

Good question. As you guessed from my code, it is not set up for mixing the submission of opaque and transparent polys.
That is because I have been testing with code I wrote for the KGL API that ships with KOS, which expects the user to submit all OP polys first and then flush that list before submitting TR polys.
viewtopic.php?f=29&t=102059#p1032695

GyroVorbis wrote: Then you fill up two buffers in RAM (OP and TR lists) and have KOS DMA them over to the PVR when the scene ends.

At least that's the approach I'm using. I am also writing an API kind of like this. It is not quite as multipurpose as yours, but it is an abstraction away from the PVR that our engine is using. I find this topic very interesting, because these are all things I'm considering/wondering in my own API.

So that brings me to the question I mentioned I wanted to ask you, about a post you made here
http://elysianshadows.com/dev/community ... -dreamcast
Thank you for that information by the way.
My question is: how does that actually use the DMA channels to transfer the vertex data to the PVR?
From what I can see there, every vertex is transferred through the store queues, although I admit I have not looked at any of the pvr_set_vertbuf code in KOS yet.
Any further info on that would be appreciated.

So, to actually enable a GL API that can handle the submission of mixed lists, I don't think we can get away with using mat_transform_sq.
(But that is not a loss; in my testing it has the same throughput as mat_trans_single anyway.)
The first approach that comes to mind is to wait until the entire scene has been submitted to the vertex buffer before sending the vertices to the PVR.
The problem with mat_transform_sq is that if we wait until the entire scene has been submitted before transforming the vertices, the matrix in the registers will most likely have changed since the vertex was actually submitted to the GL pipeline.
So, we could mat_trans_single every vertex received by the GL pipeline into a Vertex Buffer, then wait until the scene is finished before transferring the Vertex Buffer via DMA.
It would be best to double-buffer with this approach, so that one buffer can be transferred to the PVR while the other is being filled.
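A minimal sketch of the double-buffered idea above, written as portable C so it can be tried off-target. Everything here is hypothetical: `Vec3`, `Mat4`, `DoubleBuffer`, `db_submit` and `db_swap` are illustration names, the software multiply stands in for mat_trans_single, and `db_swap` only marks the point where a DMA transfer would be started.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_VERTS 1024                     /* hypothetical capacity */

typedef struct { float x, y, z; } Vec3;    /* stand-in for a PVR vertex */
typedef struct { float m[4][4]; } Mat4;    /* m[column][row], stand-in for XMTRX */

typedef struct {
    Vec3   verts[2][MAX_VERTS];            /* one buffer fills while the other is in flight */
    size_t count[2];
    int    fill;                           /* index of the buffer being filled */
} DoubleBuffer;

/* Transform with the matrix that is current at submission time, then store.
 * Returns 0 on success, -1 if the fill buffer is full. */
static int db_submit(DoubleBuffer *db, const Mat4 *cur, Vec3 v)
{
    float out[4];
    float invw;
    Vec3 *dst;
    int r;

    assert(db != NULL && cur != NULL);
    if (db->count[db->fill] >= MAX_VERTS)
        return -1;

    for (r = 0; r < 4; r++)
        out[r] = cur->m[0][r] * v.x + cur->m[1][r] * v.y
               + cur->m[2][r] * v.z + cur->m[3][r];      /* w input is 1 */

    invw = 1.0f / out[3];                  /* perspective divide */
    dst = &db->verts[db->fill][db->count[db->fill]++];
    dst->x = out[0] * invw;
    dst->y = out[1] * invw;
    dst->z = invw;
    return 0;
}

/* End of scene: return the finished buffer (this is where a DMA would be
 * kicked off) and start filling the other one. */
static const Vec3 *db_swap(DoubleBuffer *db, size_t *n)
{
    int done = db->fill;

    *n = db->count[done];
    db->fill ^= 1;
    db->count[db->fill] = 0;
    return db->verts[done];
}
```

Because each vertex is transformed when it is submitted, later matrix changes do not affect vertices already sitting in the buffer, which is the property the deferred mat_transform_sq approach lacks.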

Post by **Bouz** » Wed Jan 23, 2013 2:03 pm

Couldn't you send the solid polys to the PVR as they are declared, and buffer only the TR and PT lists so that they can be DMAed to the PVR once the full scene has been submitted? Statistically, there are probably many more opaque polys than PT and TR.

Post by **BlueCrab** » Wed Jan 23, 2013 8:18 pm

Bouz wrote: Couldn't you send the solid polys to the PVR as they are declared, and buffer only the TR and PT lists so that they can be DMAed to the PVR once the full scene has been submitted? Statistically, there are probably many more opaque polys than PT and TR.

Actually, the way KOS' PVR API works with DMA, that's not possible at all.

Basically, the normal (store queue based) API collects data for one frame while rendering the frame before. With DMA, you collect data for one frame, submit the frame before that by DMA, and render the frame before that (so there are always 3 frames in progress with DMA versus 2 frames with store queues).

While I'm sure you could work around that issue somewhat, it isn't supported natively.

Post by **PH3NOM** » Thu Jan 24, 2013 8:38 pm

GyroVorbis wrote: edit: Another thing I'm wondering is how you are planning to handle glEnable(GL_BLEND) and glDisable(GL_BLEND) to switch between TR and OP polys.

Since you are using direct rendering, you have to submit all OP polys then all TR polys and you can't mix them. If you want to be able to submit OP or TR polys whenever, aren't you also going to need to use a buffered approach anyway? Then you fill up two buffers in RAM (OP and TR lists) and have KOS DMA them over to the PVR when the scene ends.

So, I have got around to making a first pass at my idea to allow Mixed List Submission.

With my approach, it is not strictly necessary to use two buffers as you mentioned.
I use the same Vertex Buffer I posted before, but now it is possible to mix the submission of Transparent and Opaque polys.

Code:

pvr_vertex_t VERTEX_BUFFER[GL_MAX_VERTICES] __attribute__((aligned(64)));

First, I created a structure to help organize things when it comes time to submit to the PVR:

Code:

typedef struct
{
    pvr_poly_hdr_t * hdr;
    pvr_vertex_t   * vertex;
    DWORD            vertices;
}RenderNode;

Pretty self-explanatory, just note that the *vertex is pointing to a location in the VERTEX_BUFFER.

For simplicity, I have created static arrays of these nodes to keep track of the Opaque and Transparent vertices submitted:

Code:

static RenderNode OP_LIST[1024*64];
static RenderNode TR_LIST[1024*64];
static DWORD      OP_SIZE = 0;
static DWORD      TR_SIZE = 0;

Next, I have re-worked the code to keep track of what blending mode is currently enabled

Code:

#define GL_BLEND_OPAQUE PVR_LIST_OP_POLY
#define GL_BLEND_TRANS  PVR_LIST_TR_POLY

static BYTE GL_BLEND_FUNC = GL_BLEND_OPAQUE;

From there, glBegin() and glEnd() will take care of manipulating the RenderNodes.
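The post does not show glBegin() and glEnd(), so the following is only a guess at how they could manage the RenderNodes, written as portable C with stand-in types. The `sketch_*` functions, the `NODES`/`NODE_COUNT`/`HDR` arrays, the `LIST_OP`/`LIST_TR` ids and the buffer sizes are all hypothetical; on the Dreamcast the two struct types come from KOS.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins so the sketch builds off-target. */
typedef struct { unsigned cmd[8]; } pvr_poly_hdr_t;
typedef struct { unsigned flags; float x, y, z, u, v; unsigned argb, oargb; } pvr_vertex_t;

enum { LIST_OP = 0, LIST_TR = 1 };          /* hypothetical list ids */
#define MAX_VERTS 4096
#define MAX_NODES 256

typedef struct {
    pvr_poly_hdr_t *hdr;
    pvr_vertex_t   *vertex;                 /* points into VERTEX_BUFFER */
    unsigned        vertices;
} RenderNode;

static pvr_vertex_t   VERTEX_BUFFER[MAX_VERTS];
static unsigned       VERTICES = 0;
static RenderNode     NODES[2][MAX_NODES];  /* one node array per list */
static unsigned       NODE_COUNT[2] = { 0, 0 };
static pvr_poly_hdr_t HDR[2];               /* one header per list, hypothetical */
static int            BLEND = LIST_OP;

/* glEnable/glDisable(GL_BLEND) equivalent: only selects the target list.
 * Assumed to be called outside a begin/end pair, as OpenGL requires. */
static void sketch_set_blend(int enabled) { BLEND = enabled ? LIST_TR : LIST_OP; }

/* glBegin equivalent: open a node at the current end of the vertex buffer. */
static void sketch_begin(void)
{
    RenderNode *n;

    assert(NODE_COUNT[BLEND] < MAX_NODES);
    n = &NODES[BLEND][NODE_COUNT[BLEND]];
    n->hdr = &HDR[BLEND];
    n->vertex = &VERTEX_BUFFER[VERTICES];
    n->vertices = 0;
}

/* glVertex3f equivalent without the transform: append and count. */
static void sketch_vertex(float x, float y, float z)
{
    pvr_vertex_t *v;

    assert(VERTICES < MAX_VERTS);
    v = &VERTEX_BUFFER[VERTICES++];
    v->x = x; v->y = y; v->z = z;
    NODES[BLEND][NODE_COUNT[BLEND]].vertices++;
}

/* glEnd equivalent: commit the node if it received any vertices. */
static void sketch_end(void)
{
    if (NODES[BLEND][NODE_COUNT[BLEND]].vertices > 0)
        NODE_COUNT[BLEND]++;
}
```

With this layout, opaque and transparent primitives can interleave freely in the single vertex buffer, because each node remembers its own start pointer and length.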

Now, glVertex3f will transform each vertex directly into the VERTEX_BUFFER

Code:

static float invw;
inline void GlVertex3f( float x, float y, float z )
{
    /* Pin the vertex to fr0-fr3 so that it forms the vector register fv0 */
    register float __x __asm__("fr0") = x;
    register float __y __asm__("fr1") = y;
    register float __z __asm__("fr2") = z;
    register float __w __asm__("fr3") = 1;

    /* fv0 = XMTRX * fv0, using whatever matrix is loaded in XMTRX right now */
    __asm__ __volatile__(
        "ftrv xmtrx,fv0\n"
        : "=f" (__x), "=f" (__y), "=f" (__z), "=f" (__w)
        : "0" (__x), "1" (__y), "2" (__z), "3" (__w) );

    /* Perspective divide; 1/w is stored as the depth value */
    VERTEX_BUFFER[VERTICES].z = invw = 1/__w;
    VERTEX_BUFFER[VERTICES].x = __x*invw;
    VERTEX_BUFFER[VERTICES].y = __y*invw;
    VERTEX_BUFFER[VERTICES].u = GL_UV[0];
    VERTEX_BUFFER[VERTICES].v = GL_UV[1];
    VERTEX_BUFFER[VERTICES].argb = GL_COLOR;
    VERTEX_BUFFER[VERTICES].oargb = GL_COLOR_OFFSET;
    VERTICES++;

    /* Count the vertex against the current node of the active list */
    if (GL_BLEND_FUNC == GL_BLEND_OPAQUE)
        OP_LIST[OP_SIZE].vertices++;
    else
        TR_LIST[TR_SIZE].vertices++;
}

Finally, when the scene is finished, we can process our RenderNodes

Code:

inline void RenderCallback()
{
    DWORD i;

    pvr_wait_ready();
    pvr_scene_begin();

    /* Opaque list first */
    pvr_list_begin(PVR_LIST_OP_POLY);

    for(i=0;i<OP_SIZE;i++)
    {
        /* Header (0x20 bytes) goes to the TA input at 0x10000000 via the store queues */
        sq_cpy((pvr_poly_hdr_t*)0x10000000, (pvr_poly_hdr_t*)OP_LIST[i].hdr,  0x20 );
        /* Write the node's vertices (0x20 bytes each) back to RAM so the DMA sees them */
        dcache_flush_range((pvr_vertex_t*)OP_LIST[i].vertex, 0x20*OP_LIST[i].vertices);
        /* Blocking DMA of the vertices to the TA */
        pvr_dma_load_ta((pvr_vertex_t*)OP_LIST[i].vertex, 0x20*OP_LIST[i].vertices, 1, NULL, NULL);
    }
    OP_SIZE = 0;

    pvr_list_finish();

    /* Then the transparent list, handled the same way */
    pvr_list_begin(PVR_LIST_TR_POLY);

    for(i=0;i<TR_SIZE;i++)
    {
        sq_cpy((pvr_poly_hdr_t*)0x10000000, (pvr_poly_hdr_t*)TR_LIST[i].hdr,  0x20 );
        dcache_flush_range((pvr_vertex_t*)TR_LIST[i].vertex, 0x20*TR_LIST[i].vertices);
        pvr_dma_load_ta((pvr_vertex_t*)TR_LIST[i].vertex, 0x20*TR_LIST[i].vertices, 1, NULL, NULL);
    }
    TR_SIZE = 0;

    pvr_scene_finish();
}

For now, the Global Parameters are submitted via the store queues, but the Vertex Parameters are sent through DMA.

But I still don't understand the KOS vertex DMA pvr_set_vertbuf stuff (that's why I did my own implementation of vertex DMA). Can someone please explain to me how it works?

Post by **BlueCrab** » Fri Jan 25, 2013 10:21 am

PH3NOM wrote: But I still don't understand the KOS vertex DMA pvr_set_vertbuf stuff (that's why I did my own implementation of vertex DMA). Can someone please explain to me how it works?

As mentioned earlier, the way vertex DMA works internally in KOS, you cannot possibly combine it directly with any other vertex submission methods (be it the direct render stuff or manually copying vertices over with store queues).

Basically, you set a buffer to store all primitives into with the pvr_set_vertbuf function, and you must set one for each list (OP, TR, PT, OP Modifiers, TR Modifiers) that you intend to use in your scene. When you do a pvr_scene_finish, those vertex buffers are flushed out to the TA in the correct order for rendering behind the scenes for you.

Thus, when you're actually submitting vertices and other parameters, you don't actually send them directly to the TA, as you would normally with pvr_prim, the direct render stuff, or by copying them yourself manually via the store queues or whatever. They get buffered in RAM into whatever buffer is needed for the list you specify to the pvr_list_prim function.
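As a rough mental model of the buffering described above (not the KOS implementation), the sketch below keeps one caller-supplied RAM buffer per list, appends primitives to it, and only copies everything out in list order when the scene is finished. All `model_*` names, the `L_*` list ids and the `out` array standing in for the TA are hypothetical; they are only analogous in spirit to pvr_set_vertbuf, pvr_list_prim and pvr_scene_finish.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { L_OP, L_OP_MOD, L_TR, L_TR_MOD, L_PT, L_COUNT };   /* hypothetical list ids */

typedef struct {
    unsigned char *base;    /* caller-provided RAM buffer, NULL if unused */
    size_t         cap;
    size_t         used;
} ListBuf;

static ListBuf lists[L_COUNT];

/* Register a RAM buffer for one list. */
static void model_set_vertbuf(int list, void *buf, size_t cap)
{
    assert(list >= 0 && list < L_COUNT);
    lists[list].base = (unsigned char *)buf;
    lists[list].cap  = cap;
    lists[list].used = 0;
}

/* Append a primitive to the list's buffer; nothing reaches the hardware yet.
 * Returns 0 on success, -1 if the list has no buffer or is full. */
static int model_list_prim(int list, const void *data, size_t size)
{
    ListBuf *l;

    assert(list >= 0 && list < L_COUNT);
    l = &lists[list];
    if (l->base == NULL || l->used + size > l->cap)
        return -1;
    memcpy(l->base + l->used, data, size);
    l->used += size;
    return 0;
}

/* Scene finish: flush every non-empty list in a fixed order.
 * Returns the number of bytes written to out. */
static size_t model_scene_finish(unsigned char *out, size_t out_cap)
{
    size_t n = 0;
    int i;

    for (i = 0; i < L_COUNT; i++) {
        if (lists[i].used == 0)
            continue;
        assert(n + lists[i].used <= out_cap);
        memcpy(out + n, lists[i].base, lists[i].used);
        n += lists[i].used;
        lists[i].used = 0;          /* ready for the next frame */
    }
    return n;
}
```

The order in which primitives are submitted across lists does not matter in this model; only the flush order does, which is why OP and TR submissions can be interleaved by the caller.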

Post by **GyroVorbis** » Mon Feb 04, 2013 4:20 pm

BlueCrab wrote:
PH3NOM wrote: But I still don't understand the KOS vertex DMA pvr_set_vertbuf stuff (that's why I did my own implementation of vertex DMA). Can someone please explain to me how it works?
As mentioned earlier, the way vertex DMA works internally in KOS, you cannot possibly combine it directly with any other vertex submission methods (be it the direct render stuff or manually copying vertices over with store queues).

Basically, you set a buffer to store all primitives into with the pvr_set_vertbuf function, and you must set one for each list (OP, TR, PT, OP Modifiers, TR Modifiers) that you intend to use in your scene. When you do a pvr_scene_finish, those vertex buffers are flushed out to the TA in the correct order for rendering behind the scenes for you.

Thus, when you're actually submitting vertices and other parameters, you don't actually send them directly to the TA, as you would normally with pvr_prim, the direct render stuff, or by copying them yourself manually via the store queues or whatever. They get buffered in RAM into whatever buffer is needed for the list you specify to the pvr_list_prim function.

^ This is how we're doing it. Although we are still using the store queues to commit vertices into the intermediate buffers in RAM.

Then when GL_BLEND (or equivalent) is toggled on and off, I'm just swapping which buffer (PT, OP, TR) we're submitting to.

Like Crabby said, when you call pvr_scene_finish(), KOS initiates the buffer transfers via DMA.

Apologies for taking so long to respond!

Post by **PH3NOM** » Thu Aug 01, 2013 11:10 pm

Bouz wrote: Hi,
The question, to summarize, is: does it take longer to compute matrix transforms or to submit vertices to the PVR?
- Solution one is to compute all vertex coordinates using mat_transform (avoiding cache thrashing problems), then submit strips based on those computations using pvr_prim calls (accessing RAM in a non-sequential way).
- Solution two is to compute strips and submit them immediately through the store queues using mat_transform_sq. It looks more efficient, but requires computing the same vertices multiple times when strips have vertices in common.

If you have any ideas, feel free to share.

Thanks in advance!!

I hope you are still working on your project!

Tonight, I have made another test due to frame rate issues with high-poly models using my build of OpenGL. This model is being rendered as GL_TRIANGLES because I have not yet had time to convert the TRIANGLE_FAN vertices into something the PVR can handle.

I made another test of using mat_transform_sq compared to using mat_trans_single with Direct Rendering.

Results of rendering this Sonic MD2 model with mat_transform_sq: 30 fps at ~550k verts/sec

Results of rendering the same model using mat_trans_single with Direct Render: 60 fps at ~1.13 million verts/sec

Although my testing is application-specific, I fully believe the results apply to all applications. mat_transform_sq is slower and should not be used.
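A quick way to read those two figures is to convert them to vertices per frame. The helpers below are hypothetical names used only for this arithmetic: 550,000 / 30 is roughly 18,300 vertices per frame and 1,130,000 / 60 is roughly 18,800, so both runs draw about the same workload per frame. If the frame rate is locked to the display refresh (an assumption, the post does not say), the 30 fps result means the mat_transform_sq path took longer than the 16.7 ms budget of a 60 fps frame, but the numbers alone do not show by how much.

```c
#include <assert.h>

/* Vertices drawn per frame, given throughput and frame rate. */
static double verts_per_frame(double verts_per_sec, double fps)
{
    assert(fps > 0.0);
    return verts_per_sec / fps;
}

/* Time available per frame, in milliseconds. */
static double frame_budget_ms(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}
```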

Post by **Ayla** » Wed Sep 10, 2014 1:56 pm

TapamN wrote: One of the... uh, interesting things the library does is that it doesn't use SQs or DMA to submit data to the PVR. Since the library writes the intermediate data to a buffer between passes, you would either have to stop and copy the final results to the SQ manually, wasting CPU time, or dump everything to memory and then use DMA, which wastes bandwidth and still indirectly slows the CPU down. Instead, the library is set up so that the generated vertex data is sent directly from the cached buffer to the PVR. It uses the OCINDEX cache mode of the SH-4 to allocate one half of the cache on top of the TA input, and submits data with cache writeback instructions (OCBWB). It's kind of weird, but it's completely reliable on real hardware (although no emulator is accurate enough to support this).

Any news about that? With all the work done recently by PH3NOM on the GL lib, I'm rather curious to know if this could speed up things even further.
