PH3NOM wrote:
TapamN - Thanks for the info, it seems Bouz has the hard part already done for him!
Are the screens you posted running on DC? If so, would you mind uploading the binary to have a look?
Also, what license is that source code released under?

Yes, it's running on real hardware. I don't have a clean binary I can upload as a demo at the moment, but I'll try to get something ready.
The code that rendered those images was an assembly routine that can do between 4 and 6 million vertices per second (no clipping). IIRC, it's around 4.2 million if the source data is not in the cache (or is larger than the cache), and 6.4ish million if all of the source data is already in the cache (for example, if you're rendering many copies of the same model, like drawing rings for a Sonic game). It was one of the first pieces of SH-4 assembly I wrote. It reads the vertex data (position, normal, and UV), does the transform, does a dot product for lighting, applies ambient lighting, writes the results to the SQs, and submits them, all in one pass. It's possible to speed it up by a couple of cycles per vertex, but it should be pretty close to the limit of what you can do when submitting vertices like that.
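For a rough sense of scale, those throughput figures can be turned into a per-vertex cycle budget. This is only a back-of-the-envelope sketch: it assumes the Dreamcast's 200 MHz SH-4 clock, and the names `SH4_CLOCK_HZ` and `cycles_per_vertex` are made up for illustration.

```c
/* Assumption: the Dreamcast's SH-4 runs at 200 MHz. */
#define SH4_CLOCK_HZ 200000000.0

/* Average CPU cycles available per vertex at a given throughput. */
static double cycles_per_vertex(double vertices_per_second)
{
    return SH4_CLOCK_HZ / vertices_per_second;
}
```

Under that assumption, 4.2 million vertices per second works out to about 48 cycles per vertex and 6.4 million to about 31, which gives an idea of how much a saving of a couple of cycles per vertex is worth.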
One of the problems with the SH-4 is the lack of registers; it's hard to do many things at once. There aren't enough floating point registers left to do anything more advanced. The SH-4 has four vector registers. In that routine, they were used like this:
1. Vertex position
2. Vertex normal
3. Light vector
4. Misc (Ambient, UV)
There's nothing left to do something like multiple lights, specular, or fresnel without having to waste a lot of time dumping and reloading registers. And doing vertex skinning in one pass is completely out of the question since you'd have to reload XMTRX multiple times per vertex.
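To make the per-vertex work concrete, here is a plain C sketch of the same steps (transform, one dot product for diffuse lighting, add ambient). It is not the assembly routine: the struct layouts, the column-major matrix, the clamping, and the name `transform_and_light` are all assumptions for illustration, it assumes the light vector is already normalized and in model space, and it writes to an ordinary array instead of the SQs.

```c
#include <stddef.h>

/* Hypothetical input layout: position, normal, UV. */
typedef struct { float pos[3]; float normal[3]; float uv[2]; } VertexIn;

/* Hypothetical output layout: transformed position, UV, light intensity. */
typedef struct { float x, y, z, w; float u, v; float intensity; } VertexOut;

static float dot3(const float a[3], const float b[3])
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* m is a 4x4 matrix in column-major order (m[col * 4 + row]).
 * light_dir is assumed to be normalized and in model space, so the
 * normal does not need to be transformed before the dot product. */
void transform_and_light(const float m[16], const float light_dir[3],
                         float ambient, const VertexIn *in,
                         VertexOut *out, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        const float *p = in[i].pos;

        /* Transform the position (implicit w = 1). */
        out[i].x = m[0] * p[0] + m[4] * p[1] + m[8]  * p[2] + m[12];
        out[i].y = m[1] * p[0] + m[5] * p[1] + m[9]  * p[2] + m[13];
        out[i].z = m[2] * p[0] + m[6] * p[1] + m[10] * p[2] + m[14];
        out[i].w = m[3] * p[0] + m[7] * p[1] + m[11] * p[2] + m[15];

        /* Diffuse term: one dot product, clamped at zero. */
        float diffuse = dot3(in[i].normal, light_dir);
        if (diffuse < 0.0f)
            diffuse = 0.0f;

        /* Add ambient and clamp to 1 (the clamp is an assumption). */
        float intensity = diffuse + ambient;
        if (intensity > 1.0f)
            intensity = 1.0f;

        out[i].intensity = intensity;
        out[i].u = in[i].uv[0];
        out[i].v = in[i].uv[1];
    }
}
```

Even in this simple form, each vertex keeps a position, a normal, a light vector, and a few loose scalars live at once, which is the same pressure that fills the four vector registers in the list above.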
I'm working on a 3D library (which does transformation, lighting, and near clipping) designed to run multiple passes over data held in the cache, to be more efficient and flexible.
One of the... uh, interesting things the library does is that it doesn't use SQs or DMA to submit data to the PVR. Since the library writes the intermediate data to a buffer between passes, you would either have to stop and copy the final results to the SQ manually, wasting CPU time, or dump everything to memory and then use DMA, which wastes bandwidth and still indirectly slows the CPU down. Instead, the library is set up so that the generated vertex data is sent directly from the cached buffer to the PVR. It uses the OCINDEX cache mode of the SH-4 to allocate one half of the cache on top of the TA input, and submits data with cache writeback instructions (OCBWB). It's kind of weird, but it's completely reliable on real hardware (although no emulator is accurate enough to support this).
KOS needs some changes to get OCINDEX support working, but the one-pass render routine doesn't need OCINDEX, so I'll see if I can clean up the one-pass assembly and release it with the demo binary.
Which source code's license are you talking about? My rendering library will probably use the LGPL. The nVidia code I don't really know about, but they probably don't really care what you do with it.
I'm also planning to work on my own "strip" generator at some point. I'm going to try to use a format more efficient than strips (but not actually indexed geometry).
The number of vertices per triangle in a strip asymptotically reaches one as the length of the strip increases. But if you have a "2-D strip", you can asymptotically reach half a vertex per triangle, and this is much easier to make cache friendly on the SH-4 since it uses the vertices in a more predictable way.
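The two limits can be checked with a little arithmetic. In the sketch below, a regular grid of quads stands in for the "2-D strip"; that is an assumption, since the actual format isn't described here, and both function names are made up for illustration.

```c
/* A triangle strip with n triangles uses n + 2 vertices. */
static double strip_vertices_per_triangle(unsigned n)
{
    return (double)(n + 2) / (double)n;
}

/* Assumed stand-in for a "2-D strip": a grid of w x h quads.
 * It has (w + 1) * (h + 1) unique vertices and 2 * w * h triangles. */
static double grid_vertices_per_triangle(unsigned w, unsigned h)
{
    double vertices  = (double)(w + 1) * (double)(h + 1);
    double triangles = 2.0 * (double)w * (double)h;
    return vertices / triangles;
}
```

A 100-triangle strip comes out at 1.02 vertices per triangle, while a 100 x 100 grid comes out at about 0.51, so the grid needs roughly half as many transformed vertices for the same triangle count.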