mat_transform / pvr_prim vs mat_transform_sq

TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by TapamN »

PH3NOM wrote:TapamN - Thanks for the info, it seems Bouz has the hard part already done for him!

Are the screens you posted running on DC? If so, would you mind uploading the binary to have a look?

Also, what license is that source code released under?
Yes, it's running on real hardware. At the moment, I don't have a clean binary I can upload as a demo. I'll try to get something ready.

The code that rendered those images was an assembly routine that can do between 4 and 6 million vertices per second (no clipping). IIRC, it's around 4.2 million if the source data is not in the cache (or is larger than the cache), and 6.4ish million if all of the source data is already in the cache (if you're rendering many copies of the same model, like drawing rings for a Sonic game). It was one of the first SH-4 assembly pieces I wrote. It reads the vertex data (position, normal, and UV), does the transform and a dot product for lighting, applies ambient lighting, writes the results to the SQs, and submits it, all in one pass. It's possible to speed it up by a couple of cycles per vertex, but it should be pretty close to the limit of what you can do when submitting vertices like that.

One of the problems with the SH-4 is the lack of registers; it's hard to do many things at once. There aren't any more floating point registers to really do anything more advanced. The SH-4 has four vector registers. In that routine, they were used like this:

1. Vertex position
2. Vertex normal
3. Light vector
4. Misc (Ambient, UV)

There's nothing left to do something like multiple lights, specular, or fresnel without having to waste a lot of time dumping and reloading registers. And doing vertex skinning in one pass is completely out of the question since you'd have to reload XMTRX multiple times per vertex.

I'm working on a 3D library (which does transformation, lighting, and near clipping) designed to run on multiple passes over the data in cache to be more efficient and flexible.

One of the... uh, interesting things the library does is that it doesn't use SQs or DMA to submit data to the PVR. Since the library writes the intermediate data to a buffer between passes, you would either have to stop and copy the final results to the SQ manually, wasting CPU time, or dump everything to memory and then use DMA, which wastes bandwidth and still indirectly slows the CPU down. Instead, the library is set up so that the generated vertex data is sent directly from the cached buffer to the PVR. It uses the OCINDEX cache mode of the SH-4 to allocate one half of the cache on top of the TA input, and submits data with cache writeback instructions (OCBWB). It's kind of weird, but it's completely reliable on real hardware (although no emulator is accurate enough to support this).

KOS needs some changes to get OCINDEX support working, but the one-pass render routine doesn't need OCINDEX, so I'll see if I can clean up the one-pass assembly and release it with the demo binary.

Which source code's license are you talking about? My rendering library will probably use the LGPL. I don't really know about the nVidia code, but they probably don't really care what you do with it.

I'm also planning to work on my own "strip" generator at some point as well. I'm going to try to use a format more efficient than strips (but not actually indexed geometry).

The number of vertices per triangle in a strip asymptotically reaches one as the length of the strip increases. But if you have a "2-D strip", you can asymptotically reach half a vertex per triangle, and this is much easier to make cache friendly on the SH-4 since it uses the vertices in a more predictable way.
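
To put rough numbers on that claim, here is a minimal sketch. The post does not describe the "2-D strip" format, so the regular grid of shared vertices used below is only an assumed stand-in for it, and both function names are hypothetical.

```c
#include <stddef.h>

/* Vertices submitted per triangle for one ordinary strip:
   a strip of n triangles needs n + 2 vertices. */
double strip_verts_per_tri(size_t tris)
{
    return (double)(tris + 2) / (double)tris;
}

/* Unique vertices per triangle for a rows x cols grid of vertices
   (assumed stand-in for a "2-D strip"): the grid contains
   (rows - 1) * (cols - 1) quads, i.e. twice that many triangles. */
double grid_verts_per_tri(size_t rows, size_t cols)
{
    size_t tris = 2 * (rows - 1) * (cols - 1);
    return (double)(rows * cols) / (double)tris;
}
```

For a 100-triangle strip the first function gives 1.02, while a 101 x 101 grid gives about 0.51, which is the "half a vertex per triangle" limit being approached.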
Bouz
DCEmu Junior
DCEmu Junior
Posts: 46
Joined: Mon May 10, 2010 3:42 pm
Location: St. Bauzille de Putois (France)
Has thanked: 0
Been thanked: 0

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Bouz »

Waho, we don't play the same game, here! I was trying to get the most out of the PVR API, and you come with custom SH4 code that also handles lighting! Good luck with all this, I can't wait to see the results!
Do you already plan to do something with that incredible engine? (A new Sonic 3D version?)
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

TapamN wrote:
PH3NOM wrote:TapamN - Thanks for the info, it seems Bouz has the hard part already done for him!

Are the screens you posted running on DC? If so, would you mind uploading the binary to have a look?

Also, what license is that source code released under?
Yes, it's running on real hardware. At the moment, I don't have a clean binary I can upload as a demo. I'll try to get something ready.

The code that rendered those images was an assembly routine that can do between 4 and 6 million vertices per second (no clipping). IIRC, it's around 4.2 million if the source data is not in the cache (or is larger than the cache), and 6.4ish million if all of the source data is already in the cache (if you're rendering many copies of the same model, like drawing rings for a Sonic game). It was one of the first SH-4 assembly pieces I wrote. It reads the vertex data (position, normal, and UV), does the transform and a dot product for lighting, applies ambient lighting, writes the results to the SQs, and submits it, all in one pass. It's possible to speed it up by a couple of cycles per vertex, but it should be pretty close to the limit of what you can do when submitting vertices like that.

One of the problems with the SH-4 is the lack of registers; it's hard to do many things at once. There aren't any more floating point registers to really do anything more advanced. The SH-4 has four vector registers. In that routine, they were used like this:

1. Vertex position
2. Vertex normal
3. Light vector
4. Misc (Ambient, UV)

There's nothing left to do something like multiple lights, specular, or fresnel without having to waste a lot of time dumping and reloading registers. And doing vertex skinning in one pass is completely out of the question since you'd have to reload XMTRX multiple times per vertex.
No sh!t. That sounds amazing, friend. Really looking forward to see this in action!

Just a small thought, each vector register is 4 floats? Let's call it VR[4][4].
Since Vertices and Normals only use 3 floats, that leaves some space that can be utilized.
For example, UV can be divided into Vertex and Normal Registers:

             X         Y         Z         Tex-U
Vertex Data: VR[0][0]  VR[0][1]  VR[0][2]  VR[0][3]
             X         Y         Z         Tex-V
Normal Data: VR[1][0]  VR[1][1]  VR[1][2]  VR[1][3]
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by TapamN »

Bouz: The PVR API should really only be used for managing the hardware (initialization, list management, IRQ handling). It doesn't really give you anything efficient for submitting polygons, other than the "direct render" store queue macros.

I'm working on a 3D platformer, which I am planning to try to turn into a commercial homebrew.

PH3NOM: That's actually how the vertex data is formatted in memory. It's compact and aligned for easy 64-bit loads. But you can't do that when working on things in registers. You need to set the W element of the vector to 1 (for positions) or 0 (for normals) for FTRV to generate correct results. FIPR returns the result in the destination W element, and the source W generally has to be 0 for correct results, so those are taken too.
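
The role of W can be shown with a small portable sketch. This is plain C standing in for what FTRV (matrix times vector) and FIPR (4-element dot product) compute, not SH-4 code; the type and function names are hypothetical and the row/column convention is chosen only for readability.

```c
typedef struct { float x, y, z, w; } vec4;
typedef struct { float m[4][4]; } mat4;   /* m[row][col], translation in column 3 */

/* Full 4x4 transform: every element of v, including w, takes part. */
vec4 transform(const mat4 *t, vec4 v)
{
    vec4 r;
    r.x = t->m[0][0]*v.x + t->m[0][1]*v.y + t->m[0][2]*v.z + t->m[0][3]*v.w;
    r.y = t->m[1][0]*v.x + t->m[1][1]*v.y + t->m[1][2]*v.z + t->m[1][3]*v.w;
    r.z = t->m[2][0]*v.x + t->m[2][1]*v.y + t->m[2][2]*v.z + t->m[2][3]*v.w;
    r.w = t->m[3][0]*v.x + t->m[3][1]*v.y + t->m[3][2]*v.z + t->m[3][3]*v.w;
    return r;
}

/* 4-element dot product: the w lanes are multiplied and added too. */
float dot4(vec4 a, vec4 b)
{
    return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}
```

With w = 1 a position picks up the translation column; with w = 0 a normal ignores it. If a texture coordinate were left in w, it would be scaled into the transform result and added into the dot product, which is why those slots cannot carry U and V while the vector is being worked on.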

Anyways, this file has some sample code showing the rendering code (with an example polygon model taken from this site), a precompiled ELF, the nVidia strip generator (precompiled for use with Cygwin in Windows), and a Blender script for exporting to the model format used by the demo.

The demo was made by stripping out chunks from what I have for my platformer, so if you wonder why some things seem over designed, messy, or out of place for this example, that's the reason why. Besides the rendering routine, there's some debugging and profiling stuff in it that might be useful to others.

The assembly uses C preprocessor macros to replace register numbers with text names (e.g. "vert_cnt" instead of "r7"). Getting this to work requires a change to the makefile rules so that the preprocessor runs. The makefile in the example doesn't use the normal KOS makefile definitions, so it should compile as is. But it's probably a good idea to adjust KOS to have the preprocessor run by default. You can do this by editing the file "Makefile.rules" in the KOS root directory. In it, you should find a line like this:

Code: Select all

%.o: %.S
	kos-as $< -o $@
That's the one with the capital S on the first line, not a lowercase. Replace the "kos-as" line with this:

Code: Select all

	kos-cc -c $< -o $@
With this, files with a .S extension get run through the preprocessor. (But not files with .s, those stay the same.) By running the preprocessor on the assembly, you not only get #defines but also #includes, so with the right #ifdefs, you can have constants shared between assembly and C/C++ without having to make duplicates in different files.
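
As a minimal sketch of that sharing, a header could look like the following. The file name, macro names, and struct name are hypothetical; the 32-byte layout follows the position/U and normal/V packing discussed earlier in the thread. GCC predefines `__ASSEMBLER__` while preprocessing assembly, which is what keeps the C-only part away from the assembler.

```c
#include <stddef.h>

/* --- vertex_consts.h (hypothetical shared header) --- */
#define VERT_SIZE     32   /* bytes per source vertex */
#define VERT_OFS_POS   0   /* x, y, z */
#define VERT_OFS_U    12
#define VERT_OFS_NRM  16   /* nx, ny, nz */
#define VERT_OFS_V    28

#ifndef __ASSEMBLER__
/* C-only part: skipped when a .S file includes this header. */
typedef struct {
    float x, y, z, u;
    float nx, ny, nz, v;
} mix_vertex;

/* Fail the build if the struct and the shared constants drift apart. */
_Static_assert(sizeof(mix_vertex) == VERT_SIZE, "vertex size mismatch");
_Static_assert(offsetof(mix_vertex, nx) == VERT_OFS_NRM, "normal offset mismatch");
#endif
```

The assembly side would then use `VERT_SIZE` and the offsets in its addressing, while the C side uses the struct, and both stay in sync from one file.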

The rendering routine has two versions, render_strip_mix and render_strip_array_mix. The first one takes a SQ destination, a vertex source array pointer, the length of the strip, light vector, and ambient. It outputs one strip to the PVR. The second, instead of taking a strip length, takes a pointer to an array of strip lengths and a strip count, and can output multiple strips in one call, making it a bit faster. You have to set up the store queues yourself before you call the functions. (The "mix" part of the names refers to the source vertex format used.) The vertex data should be aligned to a 32-byte boundary for best speed.

The vertex format used by the functions is defined in vertex.h. The output format generated for the PVR is called "type 7" in Maiwe's TA document. That's intensity color with 32-bit texture coords. So the polygon header sent before must specify the intensity color format, and not the packed integer or 4 floats format. Textures are optional. Offset color should be disabled.
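
For orientation, a textured intensity vertex of this kind can be pictured as the 32-byte record below. This is a sketch based on my reading of the commonly circulated TA documentation, with hypothetical field names; check it against that document before relying on the exact order.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of one "type 7" vertex as written to the TA. */
typedef struct {
    uint32_t flags;            /* parameter control word (vertex / end of strip) */
    float    x, y, z;          /* transformed screen position */
    float    u, v;             /* 32-bit texture coordinates */
    float    base_intensity;   /* the lighting result goes here */
    float    offset_intensity; /* unused when offset color is disabled */
} ta_vertex_type7;

/* One vertex fills exactly one 32-byte store queue. */
_Static_assert(sizeof(ta_vertex_type7) == 32, "expected a 32-byte vertex");
```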

The lighting vector W parameter of the functions should always be 0, otherwise the lighting will be wrong. The formula for lighting is: intensity = max(dot(vertex_normal, light_normal), ambient)
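
Written out in plain C, the formula above amounts to the following. This is a reference sketch with a hypothetical function name, not the assembly routine itself.

```c
/* intensity = max(dot(vertex_normal, light_normal), ambient) */
float light_intensity(const float normal[3], const float light[3], float ambient)
{
    float d = normal[0]*light[0] + normal[1]*light[1] + normal[2]*light[2];
    return d > ambient ? d : ambient;
}
```

A surface facing the light gets the dot product; one facing away (negative dot product) falls back to the ambient value rather than going dark.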

The code doesn't do near clipping. KGL, with clipping disabled, still checked what it was sending and would skip polygons that touch the near plane. This code doesn't, and will send everything to the PVR without checking it. You have to check beforehand if the strips intersect the near plane and use something else instead if they do. There will be some rather large graphical corruption otherwise. The demo doesn't bother checking, so if you want to see what it looks like, try it out.
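
One simple way to do that check is sketched below. It assumes you already have the post-transform W of each vertex (which, for a typical perspective matrix, is the view-space depth); the function name and the threshold parameter are hypothetical. In practice a cheaper test on a bounding volume of the whole model may be preferable, since this one needs every vertex transformed first.

```c
#include <stddef.h>

/* Returns 1 if any vertex lies closer than near_w, meaning the strip
   touches or crosses the near plane and should take a clipping path. */
int strip_needs_clipping(const float *w, size_t count, float near_w)
{
    size_t i;
    for (i = 0; i < count; i++) {
        if (w[i] < near_w)
            return 1;
    }
    return 0;
}
```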

I compiled this with GCC 4.7.0, but it should work with older versions.

The Blender export script isn't a real export menu script. Open it in Blender's text editor then hit the "Run Script" button (or press Alt-P). I wrote it for Blender 2.63, it probably will have problems with other versions.

You'll need a few changes before running it. The path to the stripifier needs to be changed to wherever you put the program, and the resulting model filename and path will probably need to be changed. The name of the model to export is also coded in near the end of the script, so you'll have to set that as well. You'll probably want to have the console open (accessible from Help, System Console) to see the output of any errors that might occur.

It seems that, when reading a model to export in Blender, you're supposed to make a temporary copy and use that. The script included doesn't bother with a copy and directly accesses the model. One side effect of this is that it can't read a model that is open in edit mode, so watch out for that.

In the demo program, you can use the controller to adjust the model and lighting. Hold X, Y, or A to select an axis to change, and use the analog stick to change it. Holding A results in faster movement. Normally, you adjust the position of the model, but holding L will change the rotation, and holding R will change the light direction. Hold L then press R to print out some timing information over the console. Pressing Start will exit.
Attachments
demo.zip
(2.67 MiB) Downloaded 187 times
Bouz
DCEmu Junior
DCEmu Junior
Posts: 46
Joined: Mon May 10, 2010 3:42 pm
Location: St. Bauzille de Putois (France)
Has thanked: 0
Been thanked: 0

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Bouz »

I have downloaded your demo and will try it ASAP! (Thanks for the pre-compiled elf!)
The assembly option seems very interesting. The problem is that I don't know the SH4 assembly at all. My only strong experience with assembly was for the HP48 (Saturn 4 bit processor, about 2MHz). The SH4 looks very different in its approach ;-)
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

Wow, great work man! Very impressive. When I get some time I will take a closer look at the source.
I believe this will be a great lesson to all of us, in SH4 assembly.

And Deadbolt, simply Marvelous! ( One of my favorites! )
Image
Bouz
DCEmu Junior
DCEmu Junior
Posts: 46
Joined: Mon May 10, 2010 3:42 pm
Location: St. Bauzille de Putois (France)
Has thanked: 0
Been thanked: 0

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Bouz »

I still did not look at the code on my Dreamcast :-(
Do you know if the current version of mat_transform_sq (also in SH4 assembly) in KOS could be upgraded by yours to handle lights without too much work? (not by you, of course :wink: )
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

TapamN -
Sorry, this question is off topic, but I figured I would ask while we have your attention :lol:

Do you know how to get the PVR to convert YUV420 texture data into YUV422?
viewtopic.php?f=29&t=101263
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by TapamN »

I haven't done anything with the YUV converter, so I don't really know. What does the output you currently get look like? The result SWAT gets on that page looks like it's getting close; just the macroblocks are in the wrong order and channels are offset somehow.

The source for nullDC's YUV420-to-YUV422 emulation is here, if you haven't looked at it. Maybe it would help? It shows how the macroblocks are processed. One thing that seems notable to me is that the source YUV420 blocks are stored in the order UVY, instead of YUV like one might first expect. It looks like each 16-by-16 pixel block is made out of 8x8 samples of U (64 bytes), then 8x8 samples of V (64 bytes), then 16x16 samples of Y (256 bytes).
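
The block layout described above can be expressed as byte offsets. This sketch assumes the macroblocks are packed back to back in the source buffer, which the post does not state, and it says nothing about how the samples are ordered inside each plane; the names are hypothetical.

```c
#include <stddef.h>

enum {
    MB_U_BYTES = 8 * 8,     /* 64 bytes of U samples */
    MB_V_BYTES = 8 * 8,     /* 64 bytes of V samples */
    MB_Y_BYTES = 16 * 16,   /* 256 bytes of Y samples */
    MB_BYTES   = MB_U_BYTES + MB_V_BYTES + MB_Y_BYTES
};

/* Offsets of each plane of macroblock mb_index, in U, V, Y order. */
size_t mb_u_offset(size_t mb_index) { return mb_index * MB_BYTES; }
size_t mb_v_offset(size_t mb_index) { return mb_index * MB_BYTES + MB_U_BYTES; }
size_t mb_y_offset(size_t mb_index) { return mb_index * MB_BYTES + MB_U_BYTES + MB_V_BYTES; }
```

Each 16x16 block therefore takes 384 bytes, with Y starting 128 bytes into the block.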
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

TapamN wrote:I haven't done anything with the YUV converter, so I don't really know. What does the output you currently get look like? The result SWAT gets on that page looks like it's getting close; just the macroblocks are in the wrong order and channels are offset somehow.

The source for nullDC's YUV420-to-YUV422 emulation is here, if you haven't looked at it. Maybe it would help? It shows how the macroblocks are processed. One thing that seems notable to me is that the source YUV420 blocks are stored in the order UVY, instead of YUV like one might first expect. It looks like each 16-by-16 pixel block is made out of 8x8 samples of U (64 bytes), then 8x8 samples of V (64 bytes), then 16x16 samples of Y (256 bytes).
Thank you for the response, I will look at the NullDC source as you suggested.

Inspired by this thread to find a faster method than KGL, I have written some code to use the hardware more directly.
Written in C, it uses the sh4 matrix operations and Direct Render API included with KOS.

The vert count per frame is 196608. NullDc suggests ~5mil verts/second. Can someone test on Real Hardware please?

Push Up/Down to move the matrix forward/backward. (each point is 4 verts, or a quad if you prefer )
Press start to exit

Image
Attachments
dc-engine-3d-r000.rar
(299.68 KiB) Downloaded 134 times
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by TapamN »

Bouz wrote:I still did not look at the code on my Dreamcast :-(
Do you know if the current version of mat_transform_sq (also in SH4 assembly) in KOS could be upgraded by yours to handle lights without too much work? (not by you, of course :wink: )
Sorry I forgot to reply to you. If you're intent on using mat_transform_sq, making efficient use of it would require some planning. Since mat_transform_sq expects the source data to have correct flags, lighting, and UV parameters but untransformed position, it's currently most natural for statically lit polygons and static UVs (so no scrolling textures or environment mapping).

To do more advanced things with m_t_s, you would have to have a temporary command buffer to store the generated PVR vertex commands. Since this would require you to read from the source vertex data (with position, normal, and UVs) and the command buffer at the same time, there would now be the potential for cache thrashing. The simplest way to avoid thrashing would be to turn on OCRAM in the KOS init variable and keep the command buffer in the on-chip RAM.

I would also recommend changing m_t_s so that it takes two source pointers, one for the vertex commands and one for the vertex positions. This way you can avoid unnecessary copying of the vertex position into the command buffer. (So instead of doing, for each vertex's position, "read source position, write buffer position, read buffer position, transform, write SQ", it would just be "read source position, transform, write SQ".) If you're not doing anything special with the UVs, you might also want a version of m_t_s to copy the UVs as well; there's plenty of time to do so without extra overhead during the perspective divide.
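
A portable sketch of that two-pointer idea follows. It is not the KOS routine: the names are hypothetical, the matrix is passed explicitly instead of living in XMTRX, the destination is an ordinary array rather than the store queues, and writing 1/w into z is my understanding of what mat_transform_sq does.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t flags; float x, y, z, u, v; uint32_t argb, oargb; } out_vertex;
typedef struct { float x, y, z; } position;
typedef struct { float m[4][4]; } mat4;   /* m[row][col] */

/* Flags, UV and colours come from cmds; positions come straight from pos,
   so they never have to be copied into the command buffer first. */
void transform_two_src(const out_vertex *cmds, const position *pos,
                       out_vertex *dst, size_t count, const mat4 *t)
{
    size_t i;
    for (i = 0; i < count; i++) {
        out_vertex o = cmds[i];
        const position *p = &pos[i];
        float x = t->m[0][0]*p->x + t->m[0][1]*p->y + t->m[0][2]*p->z + t->m[0][3];
        float y = t->m[1][0]*p->x + t->m[1][1]*p->y + t->m[1][2]*p->z + t->m[1][3];
        float w = t->m[3][0]*p->x + t->m[3][1]*p->y + t->m[3][2]*p->z + t->m[3][3];
        float inv_w = 1.0f / w;      /* perspective divide */
        o.x = x * inv_w;
        o.y = y * inv_w;
        o.z = inv_w;                 /* assumed: the PVR is given 1/w as depth */
        dst[i] = o;                  /* on the Dreamcast this would be the SQ write */
    }
}
```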

PH3NOM, your program crashes. It looks like you're using mat_load to load a matrix from an address that isn't aligned to an 8-byte boundary. The return address for the mat_load call is 0x8c013a8e, which looks like part of your rendering function (I can see FTRV and SQ usage in it).
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

TapamN wrote:PH3NOM, your program crashes. It looks like you're using mat_load to load a matrix from an address that isn't aligned to an 8-byte boundary. The return address for the mat_load call is 0x8c013a8e, which looks like part of your rendering function (I can see FTRV and SQ usage in it).
Yes, thank you for testing and your insight. Strange thing, the matrix in question was in fact aligned to at least 8 bytes

Code: Select all

static matrix4f rm __attribute__((alligned(32))); /* Render Matrix */
I have now established a DC Dev environment where I can finally test on Real Hardware.

I have reworked the code, now pushing over 1 million polys/sec while performing other tasks on real hardware (i.e. audio decompression and playback). Still some transformation issues to work out, but overall I believe things are going to be way faster than using KGLX.
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by TapamN »

PH3NOM wrote:

Code: Select all

static matrix4f rm __attribute__((alligned(32))); /* Render Matrix */
Uh, "aligned" is misspelled... You might want to turn on -Wall and -Werror in GCC, which would catch things like that (at least it does on GCC 4.7.0).
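
For reference, a sketch of the declaration with the attribute spelled correctly, plus a small runtime check that can be dropped into debug builds. The `matrix4f` typedef here is only a stand-in for the poster's type, and the two helper functions are hypothetical.

```c
#include <stdint.h>

typedef float matrix4f[4][4];   /* stand-in for the type used in the post */

/* GCC attribute, spelled "aligned": forces 32-byte alignment. */
static matrix4f rm __attribute__((aligned(32)));   /* Render Matrix */

/* Returns 1 if p is a multiple of the given byte count. */
int is_aligned(const void *p, uintptr_t bytes)
{
    return ((uintptr_t)p % bytes) == 0;
}

const void *render_matrix_addr(void)
{
    return rm;
}
```

Checking `is_aligned(render_matrix_addr(), 8)` before calling mat_load would have flagged the problem even without the compiler warning.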
Ayla
DC Developer
DC Developer
Posts: 142
Joined: Thu Apr 03, 2008 7:01 am
Has thanked: 0
Been thanked: 4 times
Contact:

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Ayla »

Does the 6.4 million vertices/s number you mentioned correspond to the multi-pass rendering method, or the one-pass version?

Anyway this is extremely interesting. I'm looking forward to your lib. How hard would it be to add an OpenGL-compatible API on top of it?
Bouz
DCEmu Junior
DCEmu Junior
Posts: 46
Joined: Mon May 10, 2010 3:42 pm
Location: St. Bauzille de Putois (France)
Has thanked: 0
Been thanked: 0

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Bouz »

Ayla wrote:How hard would it be to add an OpenGL-compatible API on top of it?
Still that OpenGL obsession ;-)

I am reading the SH4 specs and it is really interesting! I feel like I felt 20 years ago the first time I learnt assembly!!!
Ayla
DC Developer
DC Developer
Posts: 142
Joined: Thu Apr 03, 2008 7:01 am
Has thanked: 0
Been thanked: 4 times
Contact:

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Ayla »

Bouz wrote:Still that OpenGL obsession ;-)
It's not that I particularly like OpenGL, but using a standard API makes porting games and apps way easier. The same goes for SDL; while there are many things I don't like in SDL (although SDL2 is looking much better), it's still very precious to have it on DC, knowing the support it has on the homebrew community. And it helps game creators to achieve multi-platform without too much work needed.
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

TapamN wrote:
PH3NOM wrote:

Code: Select all

static matrix4f rm __attribute__((alligned(32))); /* Render Matrix */
Uh, "aligned" is misspelled... You might want to turn on -Wall and -Werror in GCC, which would catch things like that (at least it does on GCC 4.7.0).
:) I am glad I posted that, I have been so busy lately, that little syntax error completely passed by me.
I made that matrix to store the screenview matrix multiplied by the projection matrix.
My temporary fix was to skip storing that matrix, instead calculating each frame.
Thank you again, sir. I will also add those flags to GCC, thanks for the tip 8-)
Ayla wrote:
Bouz wrote:Still that OpenGL obsession ;-)
It's not that I particularly like OpenGL, but using a standard API makes porting games and apps way easier. The same goes for SDL; while there are many things I don't like in SDL (although SDL2 is looking much better), it's still very precious to have it on DC, knowing the support it has on the homebrew community. And it helps game creators to achieve multi-platform without too much work needed.
You read my mind. It makes a lot of sense to build a much faster OpenGL API for DC, all things considered. I have done some basic work in this way recently without much difficulty; it should be easy work for TapamN :P
Bouz
DCEmu Junior
DCEmu Junior
Posts: 46
Joined: Mon May 10, 2010 3:42 pm
Location: St. Bauzille de Putois (France)
Has thanked: 0
Been thanked: 0

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by Bouz »

I had a look at the demo. Nice to see lighting on the Dreamcast!
I keep reading the SH4 documentation and start understanding the internal cache thing.
By the way, do you have to handle the delay slot manually or are instructions automatically switched by the compiler?
TapamN
DC Developer
DC Developer
Posts: 105
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 90 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by TapamN »

Ayla:

The 6.4 million is with the one-pass version when the model fits entirely into cache (i.e. it's less than 16 KB) and you draw the model multiple times (so it stays in cache between each function call). So it's not really a useful number for much in the real world, as you typically have more than 16 KB worth of models. Although if you want to make your own Chu Chu Rocket clone and need to draw a bunch of mice with identical polygon models, it might be something meaningful. I have a version of the one-pass code that's even faster with small models (I think it could do about a million vertices per second more), but it uses a slightly larger vertex format (40 bytes instead of 32 bytes) and has even worse performance with large models.

The multipass code, when using similar effects, should actually have better performance than the one-pass code when it comes to larger polygon models. (While I have the parts of the library needed to generate data for the TA tested, I don't have the optimized "submit" parts done yet, so I can't be certain of the final performance.)

It's also worth noting that the PVR can't handle that many polygons. The most I've gotten the PVR to draw, with gouraud shaded, textured polygons, 32-bit UV, and no specular, is around 4.2 million. (They were also all tiny polygons that didn't overlap many tiles. I was drawing a bunch of tori, each made out of a single triangle strip. A screen from a real game probably won't be able to get that high.) Also, the vertex data the TA generates for that is so large that there's not much video RAM left for textures. I think it was less than half a megabyte. And that was at 60 FPS; if you wanted to try to increase the number of polygons each frame by reducing the framerate, you'd run out of video RAM almost immediately. You could cut down the size of the vertex data by using 16-bit UVs or untextured polygons, but they aren't always usable options.
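
For a sense of scale, the per-frame numbers can be worked out as below. The 32 bytes per vertex is the size of a vertex as submitted to the TA; what the TA then stores in video RAM is a different figure that the post does not give, so this is only a rough lower-bound style estimate with hypothetical function names.

```c
/* Vertices available per frame at a given rate and framerate. */
unsigned long verts_per_frame(unsigned long verts_per_sec, unsigned long fps)
{
    return verts_per_sec / fps;
}

/* Bytes submitted to the TA per frame for that many vertices. */
unsigned long input_bytes_per_frame(unsigned long verts, unsigned long bytes_per_vert)
{
    return verts * bytes_per_vert;
}
```

At 4.2 million vertices per second and 60 FPS that is 70,000 vertices per frame, or 2,240,000 bytes of submitted vertex data each frame.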

The multipass rendering library should be very easy to use for implementing display lists and the glDrawArrays function. It's not designed for glBegin/glEnd style rendering, but it should still work significantly faster than current KGL, even if you have to record the data in a temporary buffer first.

Bouz:

Well, technically, the compiler does handle delay slots for you... ;)

But you do have to handle delay slots yourself in SH series assembly.
PH3NOM
DC Developer
DC Developer
Posts: 576
Joined: Fri Jun 18, 2010 9:29 pm
Has thanked: 0
Been thanked: 5 times

Re: mat_transform / pvr_prim vs mat_transform_sq

Post by PH3NOM »

So, to actually focus on the topic of Discussion :-)

To sum things up, mat_transform_sq seems surprisingly slower than mat_transform_single.

I am working on my own build ( from the ground up ) of OpenGL for DC, and I have tested between using mat_transform_single and _sq.

Using mat_transform_sq produces a lower vert count, both according to NullDC and on real hardware.
Furthermore, I contrived some scenes that caused the audio playback to skip when using mat_transform_sq, while using mat_transform_single did not produce the mentioned skipping ( due to CPU burden ).

In this demo, press up / down to move forward / backward. Press Y / X to enable / disable shading.
dc-engine-3d-OpenGL_Beta01.7z
(2.98 MiB) Downloaded 119 times
For implementing this as an OpenGL compatible glVertex3fv function, I decided to store an entire primitive ( triangle or quad ) in RAM before transforming the primitive with mat_transform_sq

Code: Select all

inline void glVertex3fv( float *v )
{
    dma_vert_arr[dma_verts].x = v[0]; 
    dma_vert_arr[dma_verts].y = v[1];
    dma_vert_arr[dma_verts].z = v[2];
    dma_vert_arr[dma_verts].u = dma_vert.u;
    dma_vert_arr[dma_verts].v = dma_vert.v;
    dma_vert_arr[dma_verts].argb = dma_vert.argb;
        
    if(++dma_verts%DR_PRIM==0)
    { 
        dma_vert_arr[dma_verts-1].flags = PVR_CMD_VERTEX_EOL;
        mat_transform_sq( (pvr_vertex_t*)&dma_vert_arr, (pvr_vertex_t*)0xe0000000, DR_PRIM );
    } 
    else
        dma_vert_arr[dma_verts-1].flags = PVR_CMD_VERTEX;  
}
I contrived a scene composed of 48000 vertices that are accumulated and transformed on the SH4 each frame before rendering with the PVR

Using this scene I contrived, NullDC suggests ~.86mil verts/sec when using glVertex3fv posted above ( not all vertices are drawn on-screen ).
Image

As an alternative, I implemented the same function using mat_transform_single and pvr_dr_commit to transform and send each vertex to the PVR.
To make things slightly more optimized, I pulled all of the inline asm into this function with parameters hard coded.
DCE_DR_Vert is set to the target of the DR sq registers.

Code: Select all

inline void glVertex3fv( float *v )
{
	register float __x __asm__("fr0") = (v[0]); 
	register float __y __asm__("fr1") = (v[1]); 
	register float __z __asm__("fr2") = (v[2]); 
	
	__asm__ __volatile__( 
		"fldi1	fr3\n" 
		"ftrv	xmtrx,fv0\n" 
		"fldi1	fr2\n" 
		"fdiv	fr3,fr2\n" 
		"fmul	fr2,fr0\n"
		"fmul	fr2,fr1\n"
		: "=f" (__x), "=f" (__y), "=f" (__z)
		: "0" (__x), "1" (__y), "2" (__z) 
		: "fr3" );
		
    DCE_DR_Vert->x = __x; 
    DCE_DR_Vert->y = __y;
    DCE_DR_Vert->z = __z;
    
    if(++DR_VERT%DR_PRIM==0)
        DCE_DR_Vert->flags = PVR_CMD_VERTEX_EOL;
    else
        DCE_DR_Vert->flags = PVR_CMD_VERTEX;
        
    __asm__ __volatile__("pref @%0" : : "r" (DCE_DR_Vert));
}
Rendering the exact same scene, NullDC suggests over 1.08mil verts/second
Image

You also mentioned pvr_prim. I have also explored its nature.
Calling pvr_prim, I believe, is submitting a Global Parameter to the PVR, and from my testing, this is the slowest parameter to process by the PVR.
Try to arrange your render stage to make as few calls as possible to pvr_prim.
Better yet, don't use pvr_prim at all.
After some checks, it boils down to a function call to sq_cpy to send the command to the PVR's TA. Might as well do that instead:

Code: Select all

sq_cpy((void *)PVR_TA_INPUT, (pvr_poly_hdr_t *)hdr_t, 0x20 );
Instead, I have made an implementation using DMA to send the Global Parameter to the PVR's TA

Code: Select all

/* Modified for inclusion into KOS by Dan Potter */
/* Modified for faster(simpler) vertex loading by PH3NOM 2013 */

/* DMA registers */
static vuint32	* const pvrdma = (vuint32 *)0xa05f6800;
static vuint32	* const shdma  = (vuint32 *)0xffa00000;

#define DMAC_SAR2	 0x20/4
#define DMAC_DMATCR2 0x28/4
#define DMAC_CHCR2	 0x2c/4
#define DMAC_DMAOR	 0x40/4

/* PVR Dma registers - Offset by 0xA05F6800 */
#define PVR_STATE	0x00
#define PVR_LEN		0x04/4
#define PVR_DST		0x08/4
#define PVR_LMMODE0	0x84/4
#define PVR_LMMODE1	0x88/4

/* Send to the Tile Accelerator */		
static uint32 ta_dest_addr = (((unsigned long)0x10000000) & 0xFFFFFF) | 0x10000000;
static uint32 val;
#define dest_addr ta_dest_addr

/* Source MUST be 32byte aligned, this routine does not check for you! */	
inline void pvr_dma_transfer_vertex(void * src, uint32 count )
{
	uint32 src_addr = ((uint32)src); 
	src_addr &= 0x0FFFFFE0;
	
	while(pvrdma[PVR_DST] != 0) ; /* Make sure we're not already DMA'ing */

	val = shdma[DMAC_CHCR2];
	if (val & 0x1) /* DE bit set so we must clear it */
		shdma[DMAC_CHCR2] = val | 0x1;
	if (val & 0x2) /* TE bit set so we must clear it */
		shdma[DMAC_CHCR2] = val | 0x2;
	
	shdma[DMAC_SAR2] = src_addr;
	shdma[DMAC_DMATCR2] = count/32;
	shdma[DMAC_CHCR2] = 0x12c1;

	val = shdma[DMAC_DMAOR];

	pvrdma[PVR_LMMODE0] = 1;
	pvrdma[PVR_STATE] = dest_addr;
	pvrdma[PVR_LEN] = count;
	pvrdma[PVR_DST] = 0x1;
}
From that, I use this code to replace pvr_prim

Code: Select all

#define DCE_RenderHdrSubmitDMA()  pvr_dma_transfer_vertex( (pvr_poly_hdr_t *)hdr_t, 0x20 );
But, here is my question: Could it be possible/faster to send Vertex Parameters via DMA instead of the store queues?
From my tests with KOS PVR DMA, I can only get a single primitive to submit per frame, very slow(:-()