[KOS] best way to display a fullscreen image

Post by **shazz** » Sun Mar 14, 2010 12:05 pm

Hello,

Dummy question (I searched the forum but... I did not find a clear answer). What is the best way to display a fullscreen (640x480) image ?
- Should I use a 1024x512 texture? Or split my image into multiple 2^n textures?
- Is it better to use one big sprite (pvr_sprite_cxt_txr) or to display it in bands to make the most of any texture cache? Or to use a poly (quad, triangles, grid...)?

any existing code benchmarks ?

thanks !

shazz

Post by **Quzar** » Sun Mar 14, 2010 6:11 pm

I don't know of anyone who has done benchmarks on such a thing. It really depends on what exactly your objective is. For example, if you're taking a single 640x480 image and trying to put it on the screen, it might be impractical or overly elaborate to do what you've said. A simple texture on a plain poly would suffice, or a direct write to the framebuffer. If you're generating the image dynamically it would all depend on the rest of your engine. I've done the video output of an emulator where the fastest way to transfer was store queues straight into the framebuffer. Other times, due to stretching or palette considerations, textures on polys make more sense. Never used pvr sprites, not sure if anyone (except bluecrab) has.

None of what you have said really matters for a single static image. More information on exactly what you're doing would let us have a better idea of what might be fastest.

Post by **shazz** » Mon Mar 15, 2010 3:06 am

Hello Quzar

I come from PS2/PSP development, and on the PSP for example, people have done benchmarks to determine the best way to blit fullscreen pictures (I don't mean expanding a 512x512 texture to 640x480, but really blitting a 640x480 image). For the moment, assume that this image is static and not generated (I'm not writing an emu).

And due to the PSP video chip, it was clear there are multiple ways to do it: blit directly to the framebuffer [very slow], use 2 big textured triangles [slow], use one sprite [slow], use a grid [fast], or use vertical sprite bands [very fast] (width = PSP texture cache size). There are also some differences depending on the texture color mode (32b, 16b, palette, compressed...), the texture format (swizzled or not), and where the texture is stored [SDRAM or VRAM]. And the differences were extremely huge! (see: http://tmpstore.free.fr/blog/BenchsSpeed.png, number of fullscreen blits per second)

So we don't care about the PSP here, but I'd like to know first how many methods there are (poly, sprite, direct framebuffer,...) and which one is the most efficient.

For the moment I use a 1024x512 texture and a 4 vertices poly but...
By the way it seems that writing a little benchmark can be useful

Thanks in advance! And the store queues stuff seems interesting, any docs about that?

Post by **Quzar** » Mon Mar 15, 2010 8:44 am

Well, if you look them up here you'll find ample information. The only real documentation I know of is the SH programming manual for the SH7750. Basically they're a feature of the SH4 (the CPU) of the Dreamcast. They allow for the fastest writing directly from the CPU out to other devices/memory. DMA is still faster for directly transferring blocks of memory from one memory region to another, but if the data has to be processed first, SQs are.

I see what you mean now, and no, nobody has ever done such benchmarks as far as I know. If you allow for the image to be stored in any way at all (so taking out the factor of performing whatever splitting/rearranging is required to get from the png/jpg/whatever to the super-speed screen format), then I think the fastest way would be to split it into small quads the size of the PVR's cache lines, transfer the lot of them via DMA, and then transfer the definition of the screen grid over as well. This kind of thing won't really work with updating images or whatever, because if everything isn't timed right you might get some weird grid looks.

I'm not sure how useful such a benchmark would be since it would *only* apply to fixed pre-formatted images. I can't think of any time when you'd need to display a lengthy amount of stills in rapid succession (unless I suppose you'd want to develop a fast, KOS-specific video format ...). Once you try to apply the same methods to an image where for example, you're generating one tile at a time but they're not the same size as an optimal tile, it will become very difficult to determine exactly which method would be the fastest. How exactly did these benchmark results inform coders for the PSP?

I suppose some of the different categories you'd want to test (mixed and matched): memory transfer (standard, SQ, DMA, maybe DMAC), transfer size (I can't think of this off the top of my head, but SQ work off 32byte values), and display device method (direct to framebuffer, standard KOS texture, standard KOS sprite, then various striped variations of the previous two). I'm not sure if there's other things. I'd love to hear from BlueCrab or BlackAura who generally have their thoughts more organized than myself.

Post by **shazz** » Mon Mar 15, 2010 9:08 am

Quzar wrote: How exactly did these benchmark results inform coders for the PSP?

Oh, that's simple. The difference was so huge that if you were using the bad method, it took more than one VBL to blit the screen, so your application was totally screwed up. So in the pspsdk there is sample code with the various techniques & results (http://svn.ps2dev.org/filedetails.php?r ... lit/blit.c)

And ok, let's talk about dynamic image generation such as an emulator's virtual framebuffer. It takes some time (the core emulation) to generate the virtual framebuffer. Then you have to display it. So if the final rendering alone takes most of the time, you're not going to have a nice emulation, are you?
But if the target rendering takes only a few cycles, you've got a lot of time left to do something else...

Then, you're right, in the case of a dynamic buffer, not all methods are applicable. For example, PSP swizzled textures need some CPU time to be computed (pixel reordering), so swizzling is only useful for static textures. But optimizing for the texture cache is always a good idea, static textures or not.

And to finish... it's nice to have a cool 3D object moving on the screen but that's better with a nice background behind

And ok, a 640x480 texture is a particular example, but if you're designing an interface for example with buttons & stuff... stretching a texture may not be so nice looking.

So to conclude... I'll write some benchmarks asap (as soon as I find out how to use SQ)

Post by **BlackAura** » Mon Mar 15, 2010 9:47 am

The Dreamcast's video hardware doesn't work like the PSP's. In fact, it doesn't really work like anything else.

Blitting first, using only the 2D hardware.

Your first option is directly writing to the framebuffer. This is incredibly slow. The SH-4 is connected to VRAM through a fairly slow bus, and all access is uncached. Writing a single byte (or 2 or 4) to VRAM requires the SH-4 to fetch an entire cache line (32 bytes) from VRAM, modify it, and write it back immediately. Worst case - it'll be doing that 8 times per cache line, transferring 16 times as much data as is needed.

That's the theory - I don't think anyone benchmarked this to determine if it actually does do a read-modify-write cycle for each write operation, because it's so obviously slow compared to store queues and DMA, neither of which are any more difficult.
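The arithmetic behind the "8 times per cache line, 16 times as much data" figure can be sketched as below. This is only a model of the unverified theory above (every write costs a full 32-byte read plus a full 32-byte write-back); the function names are hypothetical, and nothing here measures real hardware.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 32u /* SH-4 cache line size, as stated above */

/* Bytes moved over the bus to fill one cache line using writes of
 * `write_size` bytes, ASSUMING each write triggers a full line read
 * followed by a full line write-back (the untested theory above). */
size_t rmw_bus_bytes_per_line(size_t write_size)
{
    assert(write_size > 0 && CACHE_LINE_BYTES % write_size == 0);
    size_t writes_per_line = CACHE_LINE_BYTES / write_size;
    return writes_per_line * 2u * CACHE_LINE_BYTES;
}

/* Ratio of bus traffic to useful payload for one cache line. */
size_t rmw_amplification(size_t write_size)
{
    return rmw_bus_bytes_per_line(write_size) / CACHE_LINE_BYTES;
}
```

With 4-byte writes this gives 8 writes per line, 512 bytes of traffic for 32 bytes of payload, so a factor of 16. Under the same assumption, 2-byte writes would double that again.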

Second option is using store queues. With store queues, you can write an entire cache line at once, without needing to read it back first. It will simply overwrite the entire cache line with the contents of the store queue. This is going to be much, much faster, since it's transferring the bare minimum amount of data, and is never reading back from VRAM.

However, you still have to read the entire image from main memory using the SH-4. You can significantly optimize this, using prefetches and bits of inline assembly, so it'll be pretty fast. Likely as fast as the bus can handle it, in fact.
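To make the "whole cache line at once, no read-back" idea concrete, here is a host-side toy model in plain C. It is not Dreamcast code and does not touch real store queues: `burst_copy` is a hypothetical name, and the 32-byte staging array merely stands in for one store queue. On real hardware you would use the SH-4 store queues themselves (KOS ships helpers for this in its store queue header, as far as I recall).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SQ_BYTES 32u /* one store queue holds one 32-byte line */

/* Copy n bytes (n must be a multiple of 32) in 32-byte bursts.
 * Each iteration fills the staging "queue" from the source, then
 * writes the whole line to the destination without reading the
 * destination first. Returns the number of bursts issued. */
size_t burst_copy(uint8_t *dest, const uint8_t *src, size_t n)
{
    uint8_t queue[SQ_BYTES];
    size_t bursts = 0;

    assert(n % SQ_BYTES == 0);
    for (size_t off = 0; off < n; off += SQ_BYTES) {
        memcpy(queue, src + off, SQ_BYTES);  /* read from main memory */
        memcpy(dest + off, queue, SQ_BYTES); /* flush one full line   */
        bursts++;
    }
    return bursts;
}
```

The point of the model is the traffic count: a 640x480 16bpp image is 614,400 bytes, so 19,200 bursts and no reads from the destination at all.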

You can avoid all that work by using a third option - DMA transfer directly from main memory to the frame buffer. I doubt that it would actually be any faster than store queues - it's still shoving the same amount of data over the same bus after all - but it would free the SH-4 to do something else while the image is being transferred. It adds the overhead of setting up the DMA transfer in the first place, of course.

Then you get to the 3D hardware.

The obvious approach is to use a single full-screen tristrip (four vertices, two triangles). If you didn't care about pixel-accuracy, you could use a 512x256 or 512x512 texture instead of a 1024x512 texture. In either case, you could cram the image into the corner of the texture, or stretch it out.
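For the "cram the image into the corner" variant, the texture size and the texture coordinates of the far corner follow directly from the image size. A minimal sketch, with hypothetical names (`pvr_tex_dim`, `fit_in_corner`) and assuming power-of-two texture dimensions between 8 and 1024:

```c
#include <assert.h>

/* Smallest power of two >= n, starting at 8 and capped at 1024
 * (the assumed range of valid texture dimensions). */
unsigned pvr_tex_dim(unsigned n)
{
    unsigned dim = 8;
    assert(n >= 1 && n <= 1024);
    while (dim < n)
        dim <<= 1;
    return dim;
}

typedef struct {
    unsigned tex_w, tex_h; /* texture dimensions in texels        */
    float u_max, v_max;    /* UV of the image's bottom-right edge */
} corner_fit;

/* Place an img_w x img_h image in the top-left corner of the
 * smallest texture that can hold it. */
corner_fit fit_in_corner(unsigned img_w, unsigned img_h)
{
    corner_fit f;
    f.tex_w = pvr_tex_dim(img_w);
    f.tex_h = pvr_tex_dim(img_h);
    f.u_max = (float)img_w / (float)f.tex_w;
    f.v_max = (float)img_h / (float)f.tex_h;
    return f;
}
```

For 640x480 this yields a 1024x512 texture with the four strip vertices using UVs from (0, 0) to (0.625, 0.9375), which keeps the image at a 1:1 texel-to-pixel ratio.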

16bpp texture formats are the fastest. 8bpp and 4bpp texture formats can halve the fillrate, depending on which palette format you're using. You can also twiddle the texture (equivalent process to swizzling on the PSP, and an equally stupid name), which helps speed up texture access if you're using bilinear filtering. It shouldn't make any difference if you aren't using filtering, but I don't think anyone's tested that. For all we know, it might make it slower. Twiddling is mandatory for 8bpp and 4bpp texture formats.
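Twiddling is essentially a Morton (Z-order) reordering of texels, so that texels close together in 2D are also close together in memory. The sketch below computes the twiddled index for a square power-of-two texture. The function names are hypothetical, and the bit assignment (y in the even bits, x in the odd bits) is my assumption about the PVR's layout; check it against a known-good converter before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Spread the low 10 bits of v so that bit i moves to bit 2*i. */
static uint32_t spread_bits(uint32_t v)
{
    uint32_t out = 0;
    for (unsigned i = 0; i < 10; i++)
        out |= ((v >> i) & 1u) << (2 * i);
    return out;
}

/* Twiddled (Morton-order) texel index for a square power-of-two
 * texture up to 1024x1024.
 * ASSUMPTION: y occupies the even bits and x the odd bits. */
uint32_t twiddled_index(uint32_t x, uint32_t y)
{
    assert(x < 1024 && y < 1024);
    return spread_bits(y) | (spread_bits(x) << 1);
}
```

Under that assumption the first four texels in memory are (0,0), (0,1), (1,0), (1,1), that is, a 2x2 block, which is why a bilinear fetch of four neighbours tends to hit adjacent memory.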

You can also submit your display list using store queues, or DMA (write the display list to main memory, then DMA it across in one go). Your display list for this scene is going to be so tiny that it's not going to make a difference - it's dwarfed by the overhead of setting up the frame in the first place.

This approach will get you 60FPS immediately, no matter which combination of texture formats, filtering options, or display list transfer modes you might use.

The KOS PVR driver is basically limited to 60FPS, because it needs to maintain the triple-buffering chain (one frame being displayed, one being rendered, and one being submitted as a display list) and it only swaps the backbuffer during a vertical blanking period. You could probably remove that limit if you wanted to.

The alternatives are:

One 512x512 texture and one 128x512 texture, side-by-side. To save texture memory, you can simply truncate the memory allocations at 512x480 and 128x480, so your textures will use the minimum possible amount of video memory. Despite drawing four triangles per frame instead of two, it should be pretty much the same speed.
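The memory saving from the split can be worked out directly. A small sketch, assuming a 16bpp format (2 bytes per texel) and a row-major, non-twiddled layout so that unused trailing rows can simply be left unallocated; the function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for `rows` rows of a texture `width` texels wide. */
size_t tex_bytes(size_t width, size_t rows, size_t bytes_per_texel)
{
    assert(width > 0 && rows > 0 && bytes_per_texel > 0);
    return width * rows * bytes_per_texel;
}

/* One 1024x512 texture at 16bpp. */
size_t single_texture_bytes(void)
{
    return tex_bytes(1024, 512, 2);
}

/* 512-wide plus 128-wide textures, both truncated to 480 rows. */
size_t split_texture_bytes(void)
{
    return tex_bytes(512, 480, 2) + tex_bytes(128, 480, 2);
}
```

That comes to 1,048,576 bytes for the single texture versus 614,400 bytes for the split pair, which is exactly the size of the 640x480 image itself.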

Sprites. I don't think anyone's actually used these, and I don't think anyone knows how they're implemented. Sprites are actually triangles, so you'd need two of them. You could use any combination of texture formats, or arrangements of textures.

Only one other option comes to mind.

The PVR2DC uses tiled rendering. It breaks the scene down into a set of tiles (32x32 pixels), sorts all the geometry that intersects that tile into depth order, eliminates anything that's hidden behind an opaque pixel, and renders the tile into a 32x32 internal framebuffer, before copying it into VRAM.

It's possible that breaking the image up into 32x32 tiles might do something. I don't think it would improve rendering times at all, since it's still rendering the same pixels. It might actually make it slightly slower, since the display list is now much larger, and will take longer to process. More likely, it will make no measurable difference at all.
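To put a number on "the display list is now much larger", here is a rough count. It assumes one quad of four vertices per tile and 32 bytes per vertex entry, and ignores polygon headers and end-of-list commands; the names and the per-vertex size are assumptions for illustration, not figures taken from this thread.

```c
#include <assert.h>
#include <stddef.h>

#define TILE_SIZE 32u    /* PVR tile edge in pixels, as stated above */
#define VERTEX_BYTES 32u /* ASSUMED size of one vertex entry         */

/* Number of tiles needed to cover `pixels` along one axis. */
size_t tiles_across(size_t pixels)
{
    assert(pixels > 0);
    return (pixels + TILE_SIZE - 1) / TILE_SIZE;
}

/* Number of tiles covering a w x h screen. */
size_t tile_count(size_t w, size_t h)
{
    return tiles_across(w) * tiles_across(h);
}

/* Vertex data for `quads` independent quads of four vertices each. */
size_t quad_vertex_bytes(size_t quads)
{
    return quads * 4u * VERTEX_BYTES;
}
```

A 640x480 screen is 20 x 15 = 300 tiles, so the tiled version submits about 38,400 bytes of vertex data per frame against 128 bytes for the single quad. That is still small next to the 614,400 bytes of texels being read, which fits the guess that the difference would be hard to measure.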

Post by **BlueCrab** » Mon Mar 15, 2010 1:28 pm

BlackAura wrote: Sprites. I don't think anyone's actually used these, and I don't think anyone knows how they're implemented. Sprites are actually triangles, so you'd need two of them. You could use any combination of texture formats, or arrangements of textures.

PVR sprites are actually normal plain old quads, and can be done either textured or non-textured. KallistiOS has code for handling these, and has for a little while now (since I added it). They basically have the same limitations as using triangle strips, with slightly less vertex data transferred over to the TA. For an example of how to use it, take a look at http://crabemu.svn.sourceforge.net/view ... iew=markup (the font handling code on the Dreamcast for my emulator CrabEmu), specifically the function font_draw_char. That function does a bit more work than it needs to (remaking the pvr_sprite_cxt_t and pvr_sprite_hdr_t every pass), but it's relatively self-contained and shows how to make things at least work.

Post by **shazz** » Thu Mar 18, 2010 3:07 am

Thanks a lot guys !
There is a LOT of interesting things here.
So in a few words I understand that contrary to most of the "modern" consoles, it is common on the DC to use the framebuffer more or less directly (through memory mapping, DMA or SQ).

But I would rather avoid using only "the 2D hardware", I mean accessing the framebuffer even in an optimized way (SQ, DMAC), simply because it would require doing all my rendering in software first, and I guess I will have to do multiple blits. But it's interesting to know for some very specific points.

So using the 3D hardware, it seems the Sprite is a not-that-known land... a pity

On the PSx hardware, Sprites can usually be used to bypass the 3D transformations, and the primitives are sent directly to the rasterizer (so sprite coordinates are always expressed in screen coordinates, not space coordinates).

And I'll take a look at the TA yes... seems interesting.

So I'll write some benchmarks using all those methods.

By the way, I have tested some stuff, and "for fun" I changed vid_clear(0,0,0) to vid_empty(). If the first one looks relatively slow, the second one looks extremely slow. Is it the right way to clear the screen?
Moreover, I have finished my rendering loop with

// wait for VBL
vid_waitvbl();
// switch double buffer
vid_flip(0);

but there is still a "flashing" effect and some strange behavior... isn't this the right way to do it?

Post by **BlackAura** » Thu Mar 18, 2010 9:18 am

shazz wrote: So in a few words I understand that contrary to most of the "modern" consoles, it is common on the DC to use the framebuffer more or less directly (through memory mapping, DMA or SQ).

Not usually. It's generally only used in homebrew, as a quick-and-dirty way to display something on-screen.

shazz wrote: So using the 3D hardware, it seems the Sprite is a not-that-known land... a pity. On the PSx hardware, Sprites can usually be used to bypass the 3D transformations, and the primitives are sent directly to the rasterizer (so sprite coordinates are always expressed in screen coordinates, not space coordinates).

The Dreamcast's video hardware lacks any kind of hardware 3D transforms, so they have to be done in software. The software transforms everything into screen space, and sends those coordinates (x, y, and 1/z) to the 3D hardware in a display list.

The KOS PVR API is very low-level. It deals with display lists only, and does not contain a complete 3D pipeline. If you actually want 3D transformations, you have to do them yourself. You can always skip that, and shove the screen coordinates directly into the display list.

shazz wrote: By the way, I have tested some stuff, and "for fun" I changed vid_clear(0,0,0) to vid_empty(). If the first one looks relatively slow, the second one looks extremely slow. Is it the right way to clear the screen?

In 3D mode, no. The PVR clears the framebuffer automatically when it renders a frame - it renders each tile to a tiny embedded framebuffer, and then copies it to the framebuffer, overwriting whatever was there before.

In 2D mode... Yes, that's the simplest way to clear the screen. However, while vid_clear only clears the framebuffer (~600KB), vid_empty clears all of VRAM (8MB), so it'll be much slower.

Looking at the implementation of vid_clear, it appears to be using store queues under the hood. That's pretty much as fast as you're likely to get.

shazz wrote: Moreover, I have finished my rendering loop with

Is this in 2D?

As far as I remember, the default 2D video setup doesn't set up double buffering. Also, vid_flip(0) will always select the first framebuffer, rather than cycling through them. To cycle through them, the correct thing would seem to be vid_flip(-1);
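The difference between selecting a buffer and cycling through buffers can be shown with a toy model. This is not the KOS implementation, just a host-side sketch of the behaviour described above, with hypothetical names (`flip_model`, `model_flip`): a negative argument advances to the next framebuffer, anything else selects that index.

```c
#include <assert.h>

typedef struct {
    int fb_count; /* number of framebuffers configured */
    int fb_curr;  /* index of the one being displayed  */
} flip_model;

/* Toy model of the flip semantics described above (NOT KOS code):
 * fb < 0 cycles to the next framebuffer, fb >= 0 selects it. */
int model_flip(flip_model *m, int fb)
{
    assert(m->fb_count > 0);
    if (fb < 0)
        m->fb_curr = (m->fb_curr + 1) % m->fb_count;
    else
        m->fb_curr = fb % m->fb_count;
    return m->fb_curr;
}
```

In this model, calling `model_flip(&m, 0)` every frame leaves the display on buffer 0 forever, so drawing and display share one buffer, which would match the flashing reported above. Calling `model_flip(&m, -1)` alternates 1, 0, 1, 0 with two buffers configured.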

If you're doing software rendering, the fastest way is to render the entire thing to main memory. Then, call vid_waitvbl(), and during the vblank period, transfer the framebuffer over. DMA or store queues would both work well. Hardware page flipping is only really useful if you're using the 3D hardware.

If this is in 3D, the KOS PVR driver already handles page flipping for you. When a scene is rendered, it hooks the vblank interrupt. When the vblank interrupt is triggered, it calls vid_flip as appropriate. The call pvr_wait_ready is used instead of vid_waitvbl - it waits until the PVR is ready to accept another display list, which happens immediately after a vblank.

Post by **PH3NOM** » Sat Jan 05, 2013 4:38 pm

BlackAura wrote: The KOS PVR driver is basically limited to 60FPS, because it needs to maintain the triple-buffering chain (one frame being displayed, one being rendered, and one being submitted as a display list) and it only swaps the backbuffer during a vertical blanking period. You could probably remove that limit if you wanted to.

Hi BlackAura, I hope you're still around

Sorry to bump an old thread, but I just noticed you mentioned it should be possible to get around the 60fps limitation of the KOS PVR driver.
How would you go about doing that?

Thanks in advance.
