Screen render code speed up

JMD · Post by **JMD** » Sat Dec 11, 2004 5:38 am

Hi,

I've been working on a port of a CPC emulator for a few weeks.
I need to optimize my code to speed it up.
One of the speed bottlenecks is the screen render code.

Here is my function that renders one CPC line directly to the DC screen (each line is drawn twice because the CPC resolution is 640*240):

Code: Select all


void TraceLigne16B640_direct( int y )
{
    int x, adr = y * 80; 
    
    // render a line direct on DC screen
    
    USHORT  * p = (USHORT *)vram_s ;
    p += (USHORT) (y*2) * (USHORT)640;
    
    for ( x = 0; x < 80; x++, adr++ )
        {
        int adCrtc = TabAdrCrtc[ adr ];
        if ( adCrtc > -1 )
            {
            char * ad = TabPoints[ lastMode ][ MemCPC[ adCrtc + OfsEcr ] ];
            * (p+640) = ( USHORT )TabCoul[ * ad ] ;
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad++ ];
            * (p+640) = ( USHORT )TabCoul[ * ad ];
            * p++ = ( USHORT )TabCoul[ * ad ];
            }
        else
            {
            int Border = TabCoul[ 16 ];
            // write each border pixel to the current row and to the doubled row below
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            * (p+640) = ( USHORT )Border;
            * p++ = ( USHORT )Border;
            }
        } 
}

Can you help me speed it up ?

Thanks

NB : I use KOS 1.2

BlackAura · Post by **BlackAura** » Sat Dec 11, 2004 6:27 am

First off, consider using the 3D hardware instead of stretching the image manually. It makes the rendering code a little more complex (a lot more complex, actually), but it would probably be worth it. You'd only need to draw each line once then, because the hardware will scale it up for you.

Basically, you create a texture, and draw your image to that. I'd suggest a 1024x256 texture, which would be able to contain the entire screen easily. Then you just need to draw into that texture instead of the framebuffer (remembering to take the width of the texture into account), using the U and V texture coordinates to draw only the part of the texture containing the screen image.
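As a rough illustration of "taking the width of the texture into account": a pixel at (x, y) of the emulated screen lands at index y * texture width + x in the texture, not y * 640 + x. The sketch below assumes 16-bit texels and the 1024x256 texture suggested above; `texel_index`, `texture_bytes` and the constant names are made up for this example.

```c
#include <stddef.h>

/* Hypothetical names, chosen for this example only. */
#define TEX_WIDTH       1024  /* texture row length in texels */
#define TEX_HEIGHT      256
#define BYTES_PER_TEXEL 2     /* assuming a 16-bit format such as RGB565 */

/* Index of screen pixel (x, y) inside the texture: rows are TEX_WIDTH
   texels apart, even though only 640 of them hold image data. */
static size_t texel_index(size_t x, size_t y)
{
    return y * TEX_WIDTH + x;
}

/* Total size of the texture in bytes. */
static size_t texture_bytes(void)
{
    return (size_t)TEX_WIDTH * TEX_HEIGHT * BYTES_PER_TEXEL;
}
```

With these numbers the texture takes 512 KB of video memory, and the start of screen line 1 sits 1024 texels after the start of line 0.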

Second, writing directly to VRAM is really, really slow. It would be much faster using store queues.

To do that, have a look in the KOS source code for the store queue functions (kernel/arch/dreamcast/hardware/sq.c). Specifically, have a look at how the store queue copy function (sq_cpy) works. You can (probably) take that code, modify it to draw the CPC screen instead of just copying data, and use that to output the image to VRAM.

Just in case you need to know how store queues work... Each store queue is a 32-byte long buffer, which can be sent all in one go to external devices, such as VRAM. Because it's sending 32 bytes at once (the minimum amount of data that can be sent over the Dreamcast's data bus), it's much faster than trying to send only two bytes at once. There are two store queues, so you can be writing to one of them while the other one is being written to memory.

The copy function basically reads data from memory, stuffs 32 bytes of it (eight 32-bit words) into a store queue, and then tells the SH-4 to start writing that store queue to the destination address (the fragment of inline assembly at the bottom of the loop). It then starts filling up the second store queue, while the first one is being written.
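The address arithmetic at the top of that copy function can be tried out on its own. This is only a sketch of the two expressions as they are usually written for the SH-4 store queues (the same ones show up in the code posted later in this thread). The function names `sq_window_addr` and `sq_qacr_value` are invented here, and nothing below touches real hardware.

```c
#include <stdint.h>

/* Address inside the store queue window (0xe0000000 upwards) that maps
   to `dest`. The low 5 bits are dropped, so the target is 32-byte aligned. */
static uint32_t sq_window_addr(uint32_t dest)
{
    return 0xe0000000u | (dest & 0x03ffffe0u);
}

/* Value for QACR0/QACR1: bits 28..26 of the destination address,
   placed in bits 4..2 of the register. */
static uint32_t sq_qacr_value(uint32_t dest)
{
    return ((dest >> 26) << 2) & 0x1cu;
}
```

For a destination of 0xa5000000, for example, this gives a window address of 0xe1000000 and a QACR value of 0x04.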

Before you do that though, you'll probably need to do some minor reorganizing. First, fudge the main loop around so that it generates 32 bytes of data (16 pixels, 2 columns) at a time. That's exactly the length of one store queue, and it gives you a convenient place to switch between the two queues. Second, you'll need to modify the drawing code so that it generates 32 bits of data at a time (the store queues can only hold 32-bit values). Just combining adjacent pixels using bitshifts and the OR operator should do the trick.

Once you've done that, you should be able to integrate it into the store queue function fairly easily.
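Combining two adjacent 16-bit pixels into one 32-bit value could look something like the sketch below. `pack_pixels` is a made-up helper name, and the ordering assumes a little-endian CPU (which the SH-4 in the Dreamcast is), so the left-hand pixel goes in the low half of the word.

```c
#include <stdint.h>

/* Pack two 16-bit pixels into one 32-bit word. On a little-endian CPU
   the low half is stored at the lower address, so `left` ends up first
   in memory and `right` second. */
static uint32_t pack_pixels(uint16_t left, uint16_t right)
{
    return (uint32_t)left | ((uint32_t)right << 16);
}
```

Casting to an unsigned 32-bit type before shifting avoids shifting a bit into the sign position of a plain int.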

JMD · Post by **JMD** » Sun Dec 12, 2004 5:27 am

Thanks BA, the difference is AWESOME!

Here is my new code:

Code: Select all


void InitPlateforme( void )
{
  .........

    // init KOS
	pvr_init_defaults();
	text_render = pvr_mem_malloc(1024*256*2); 

...........
}


void TraceLigne16B640_TextureSQ( int y )
{
    int x, adr = y * 80; 
    
    int v;

    USHORT  * p = (USHORT *)text_render ;
    p += (USHORT) (y) * (USHORT)1024;

	/* Store queue window address for the (32-byte aligned) destination row */
	unsigned int *d = (unsigned int *)(void *)
		(0xe0000000 | (((unsigned long)p) & 0x03ffffe0));
    
	/* Set store queue memory area as desired */
	QACR0 = ((((unsigned int)p)>>26)<<2)&0x1c;
	QACR1 = ((((unsigned int)p)>>26)<<2)&0x1c;
       
    for ( x = 0; x < 80; x++, adr++ )
        {
        int adCrtc = TabAdrCrtc[ adr ];
        
        if ( adCrtc > -1 )
            {
            char * ad = TabPoints[ lastMode ][ MemCPC[ adCrtc + OfsEcr ] ];
 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[0] = v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[1] = v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[2] = v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[3] = v; 
            }
        else
            {
            int Border = TabCoul[ 16 ];
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[0] = v; 
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[1] = v; 
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[2] = v; 
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[3] = v;              
            }
        adr++;   
        x++;
        adCrtc = TabAdrCrtc[ adr ];
            
        if ( adCrtc > -1 )
            {
            char * ad = TabPoints[ lastMode ][ MemCPC[ adCrtc + OfsEcr ] ];
            int v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[4] = v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[5] = v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[6] = v; 
            v =( USHORT )TabCoul[ * ad++ ];
            v|=( USHORT )TabCoul[ * ad++ ]<<16;
            d[7] = v; 
            }
        else
            {
            int Border = TabCoul[ 16 ];
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[4] = v; 
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[5] = v; 
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[6] = v; 
            v = ( USHORT )Border;
            v|= ( USHORT )Border<<16;
            d[7] = v;              
            }
            
         // Start this store queue, and switch to the other one
         asm("pref @%0" : : "r" (d));
         d += 8; 
            
        }  
     
}


void UpdateScreen_Texture( void )
{

    // scene begin
    pvr_wait_ready();
    pvr_scene_begin();
    pvr_list_begin(PVR_LIST_TR_POLY);
 
 
    pvr_poly_cxt_t cxt;
	pvr_poly_hdr_t hdr;
	pvr_vertex_t vert;
	float flt_tmpx,flt_tmpy;
 
    pvr_poly_cxt_txr(&cxt, PVR_LIST_TR_POLY, PVR_TXRFMT_RGB565|PVR_TXRFMT_NONTWIDDLED , 1024, 256,  text_render, PVR_FILTER_BILINEAR); 
 	cxt.gen.culling = PVR_CULLING_NONE;
    pvr_poly_compile(&hdr, &cxt);
	pvr_prim(&hdr, sizeof(hdr));

	   	
	vert.argb = PVR_PACK_COLOR(1.0f, 1.0f, 1.0f, 1.0f);   
	vert.oargb = 0;
	vert.flags = PVR_CMD_VERTEX;
 	
	vert.x = 0;  
	vert.y = 0;
	vert.z = 1;  
	vert.u = 0;
	vert.v = 0;
	pvr_prim(&vert, sizeof(vert));

	vert.x = 1024;  
	vert.y = 0;	
	vert.z = 1; 	
        vert.u = 1;  
	vert.v = 0;
	pvr_prim(&vert, sizeof(vert));
	
	vert.x = 0;  
	vert.y = 480;  
	vert.z = 1;  
	vert.u =0 ;
	vert.v = 1;  
	pvr_prim(&vert, sizeof(vert));

	vert.x = 1024;  
	vert.y = 480;  
	vert.z = 1;  
	vert.u = 1; 
	vert.v = 1;  
	vert.flags = PVR_CMD_VERTEX_EOL;
	pvr_prim(&vert, sizeof(vert));
                            
    // scene end       
    pvr_list_finish();
    pvr_scene_finish(); 
 
}

This code works fine, but it's probably not perfect (I don't really know bit manipulation tricks).

Thanks again BlackAura, you are a really good teacher.

Can I use the SQ for sound streaming too ?

BlackAura · Post by **BlackAura** » Sun Dec 12, 2004 9:05 am

JMD wrote: This code works fine, but it's probably not perfect (I don't really know bit manipulation tricks).

No, you got the bit manipulation part correct, as far as I can see. If you had got it wrong, the two pixels in each pair would be swapped.

JMD wrote: Thanks again BlackAura, you are a really good teacher. Can I use the SQ for sound streaming too?

You're welcome...

You can use store queues for sound streaming, but that's probably not a good idea. The bus that connects to sound RAM is very, very slow. Using store queues would actually block the entire processor for quite a long time, which is not what you want. Your best bet would be to generate the sound into a buffer in main memory, and then transfer it over using DMA. I don't think KOS 1.2 supports sound DMA though, so you'd have to upgrade everything to KOS 1.3.x if you want to use that.

Oh yeah, one minor thing...

You're using the coordinates of the polygon to display things at the right size. It's probably easier to just use the texture coordinates. You're currently drawing something like this (X, Y)(U, V):

Code: Select all

(0,0)(0,0)
(1024, 0)(1, 0)
(0, 480)(0, 1)
(1024, 480)(1, 1)

Alternatively, you could do this:

Code: Select all

(0, 0)(0, 0)
(640, 0)(0.625, 0)
(0, 480)(0, 0.9375)
(640, 480)(0.625, 0.9375)

That should get the image scaled up properly.
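The texture coordinates are just the image size divided by the texture size. A small sketch of that arithmetic follows; `uv_extent` and `uv_for_image` are hypothetical names used only here, not part of KOS.

```c
/* Hypothetical helper: the fraction of a texture covered by an image
   that sits in its top-left corner. */
typedef struct {
    float u_max;  /* right edge of the image in texture space  */
    float v_max;  /* bottom edge of the image in texture space */
} uv_extent;

static uv_extent uv_for_image(int img_w, int img_h, int tex_w, int tex_h)
{
    uv_extent e;
    e.u_max = (float)img_w / (float)tex_w;
    e.v_max = (float)img_h / (float)tex_h;
    return e;
}
```

For the 640x240 CPC screen in a 1024x256 texture this gives 0.625 and 0.9375, the values used above. The four vertices then take (0, 0), (u_max, 0), (0, v_max) and (u_max, v_max).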

OneThirty8 · Post by **OneThirty8** » Tue Jan 18, 2005 6:30 pm

BlackAura wrote: Alternatively, you could do this:
Code: Select all
(0, 0)(0, 0)
(640, 0)(0.625, 0)
(0, 480)(0, 0.9375)
(640, 480)(0.625, 0.9375)
That should get the image scaled up properly.

Sorry to bump a month-old topic, but I have basically the same question and would like some clarification, and some other questions that are related.
Here's what I'm trying to do. I've been looking at the Dreamcast video driver for libmpeg2 that la cible wrote for lvfdc. It works fine for what he was doing (playing back a 320x240 mpeg-1 or mpeg-2 video by writing directly to the framebuffer), but I'd like to be able to display things at any resolution up to whatever size the DC can handle at an acceptable framerate (so, probably just VCD-ish resolutions). I can't make a texture that's 352x288, so the closest option that would fit an image that size would be 512x512. Here are my questions:

1) First, the easy one - I just want to be sure I understand how the u and v coordinates work. Let's say that I've drawn a frame of video that's 352x288, I've already called pvr_txr_load or whatever, it's all ready to get thrown up on the screen, except the texture is 512x512. I want my video to fill the whole screen, so my x and y coordinates are going to be like this:

Code: Select all

(0,0)
(640,0)
(0,480)
(640,480)

(that's correct so far, right?)
Now, since I don't want the whole 512x512 texture to be drawn, I would use these u and v coordinates(or something close)?

Code: Select all

(Basically, you give it the percentage of the width and height, so it would be 352/512 for the u and 288/512 for the v?)
(0,0)
(0.6875,0)
(0,0.5625)
(0.6875,0.5625)

2) I know that if you call pvr_txr_load_ex, it requires that the width and height of the textures you're loading are both a power of 2. Is the same true for pvr_txr_load, or can I give it 352x2 bytes at a time? I'd do it like so:

Code: Select all

int scanline_offset = 0;
 pvr_txr_load(scanline[0],some_pvr_ptr_t+scanline_offset,  width*2);
scanline_offset+=(512*2);
 pvr_txr_load(scanline[1],some_pvr_ptr_t+scanline_offset,width*2);
scanline_offset+=(512*2);
/*etc, only I'd use a loop to actually do it for real rather than type out every possible number of widths you could have, etc...*/

3) Assuming I can give it one line at a time, would I want to do that, or would I be much better off figuring an efficient way of writing to a 512x512 buffer in main memory and copying it over in one shot?

4) Would it be faster to draw to a buffer in main memory and then copy it to the pvr, or would it take me less time to draw directly to a texture and then just display it?

5) I noticed that one of the texture format flags in pvr.h is PVR_TXRFMT_YUV422. I have two questions about this:
a) Would this allow me to skip the yuv-to-rgb conversion step? (that might save a bit of time)
b) I'm reading that as y=4 bits, u=2 bits, v=2 bits (along the lines of RGB565, which is 5 bits red, 6 green, 5 blue). Is this correct? That would mean one byte per pixel if I use this texture format?

Thanks in advance for any help!

BlackAura · Post by **BlackAura** » Tue Jan 18, 2005 8:04 pm

1 - Correct
2 - The upload functions assume that you're uploading an actual texture, ready to use, in the same format that you want it to be in VRAM. Good for uploading textures, but not so good for uploading video buffers. You'd be better off uploading them manually, preferably using store queues (see above for how and why).
3 - Keep the buffer whatever size you like in main memory, and then copy it over manually. You can easily rearrange it as you're copying it.
4 - Drawing directly to VRAM is slow - the bus that connects VRAM to the CPU isn't as fast as the bus that connects main RAM to the CPU, and it's not cached, which means that each and every operation requires a read and a write to VRAM. That's very slow.

The ideal way is to process it in a main memory buffer, and then copy it over. In this case, you're going to have to do some colour space conversion, so you may as well do that while copying the texture. So you have a buffer with the appropriate format for decoding in main RAM, and a buffer in the appropriate display format in VRAM.

5 - If your decoder spits out frames in YUV422 format, then yes, you should be able to use those directly. If not, then... Well, video (YUV) image formats are a right pain in the backside, and are really nothing at all like RGB image formats. Converting between them is likely to be difficult, although not impossible, and might be as CPU intensive as converting to RGB (although probably not more).
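To make the "convert while copying" idea from point 4 a bit more concrete, here is a minimal sketch. It assumes the decoder hands over planar 4:2:0 data (a full-size Y plane plus quarter-size U and V planes), which may not match what your decoder produces, and it uses a common integer approximation of the BT.601 video-range conversion. All names are made up for the example, and it writes with plain stores so it runs anywhere; on the Dreamcast the inner loop would feed the store queues instead.

```c
#include <stdint.h>

static int clamp255(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

/* One YUV pixel to RGB565 (integer approximation, video-range input). */
static uint16_t yuv_to_rgb565(int y, int u, int v)
{
    int c = y - 16, d = u - 128, e = v - 128;
    int r = clamp255((298 * c + 409 * e + 128) / 256);
    int g = clamp255((298 * c - 100 * d - 208 * e + 128) / 256);
    int b = clamp255((298 * c + 516 * d + 128) / 256);
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Convert a w x h frame (w and h even) into a 16-bit destination whose
   rows are dst_stride pixels apart, e.g. 512 for a 512x512 texture.
   Pixels beyond column w in each destination row are left untouched. */
static void convert_frame(const uint8_t *yp, const uint8_t *up,
                          const uint8_t *vp, int w, int h,
                          uint16_t *dst, int dst_stride)
{
    for (int row = 0; row < h; row++) {
        const uint8_t *yrow = yp + row * w;
        const uint8_t *urow = up + (row / 2) * (w / 2);
        const uint8_t *vrow = vp + (row / 2) * (w / 2);
        uint16_t *out = dst + row * dst_stride;

        for (int col = 0; col < w; col++)
            out[col] = yuv_to_rgb565(yrow[col], urow[col / 2], vrow[col / 2]);
    }
}
```

The source buffer stays at its natural size (352x288, say) and only the destination uses the texture's row stride, so there is no need for a 512x512 buffer in main memory.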

OneThirty8 · Post by **OneThirty8** » Tue Jan 18, 2005 10:14 pm

Thanks for all your help, BlackAura. It's much appreciated.

I'm not 100% sure that I'm dealing with YUV422, so I'll have to try it and see what I end up with. Aside from that, it looks like I have a good idea of what needs to be done now. Thanks much! After I figure this bit out, I'm on to trying to decode MPEG audio. I don't think the decoding part should give me too much trouble because I'm just going to send all of the audio frames to an MP3 decoder, and I haven't had too much trouble with the sound code in KOS when I've used that before. Hopefully something good will come of this.

JMD · Post by **JMD** » Wed Jan 19, 2005 1:40 am

Oh, I never thanked you, BlackAura, for your last answer.
So thank you now.