Streaming Music Playback CPU Usage
-
- DC Developer
- Posts: 120
- Joined: Sun Oct 04, 2009 11:13 am
- Has thanked: 2 times
- Been thanked: 109 times
Streaming Music Playback CPU Usage
tl;dr: Sound decompression in KOS is much more CPU-intensive than it needs to be (spiking up to 60% CPU usage). I think it can be lowered to less than 15%.
When I was working on the sound for my version of Gens4All, I looked at the code for KOS's MP3 and Vorbis streamers to see how to push my own samples from the CPU. I was able to figure that out, but the way the decompressor was structured looked like it would have problems doing the decoding. It seemed like the MP3/Vorbis streamer would wait until a large chunk of the AICA buffer had been emptied, then decode a big block of the audio and send it to the AICA. This would cause a large spike in CPU usage whenever it decided to perform the decode. If you wanted a game to run without dropping frames or slowing down while streaming compressed audio, you would have to allocate enough CPU time every frame to absorb the worst-case spike. CPU time on frames without spikes would just be wasted.
To check if my understanding of the code was right, I measured how much CPU time was available to the main thread by continuously incrementing a global variable in an idle loop. On vblank, the variable would be reset. By looking at how high the variable reached, it's possible to measure idle time for that frame.
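The measurement amounts to something like this sketch. It assumes KOS's vblank_handler_add() with the old one-argument handler signature (newer KOS versions take an extra userdata parameter), so treat it as an outline rather than drop-in code.
Code: Select all
/* Rough idle-time meter: spin on a counter in the idle loop and
   sample/reset it on vblank. Sketch only; assumes KOS's
   vblank_handler_add() with the old one-argument handler signature. */
#include <arch/types.h>
#include <dc/vblank.h>

static volatile uint32 idle_count = 0;      /* incremented while idle */
static volatile uint32 idle_last_frame = 0; /* last frame's final count */

static void vbl_sample(uint32 code) {
    (void)code;
    idle_last_frame = idle_count;  /* how far we counted this frame */
    idle_count = 0;
}

void idle_meter(void) {
    /* Calibrate by running with no other load; idle_last_frame divided
       by that maximum gives the idle fraction for each frame. */
    vblank_handler_add(vbl_sample);
    for(;;)
        idle_count++;  /* only runs when nothing else wants the CPU */
}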
The CPU usage was as bad as I was afraid of. The average CPU usage was 16%, but there were large spikes that reached 60%. If you want a Vorbis file decoding in the background during a game, the game has to set aside enough CPU time for the worst case and ends up limited to using at most 40% CPU, otherwise you'll get stuttering. This was measured playing a file stored on a romdisk, so CPU usage would probably be higher reading from a CD, with syscall and inter-device communication overhead.
Reducing the sample rate of the audio does reduce the average CPU usage, but not the worst case CPU usage. The spikes are a bit less frequent, but still about as long.
I compared lowmem Tremor, used by KOS, and the stb_vorbis library. Tremor uses fixed point math only, while stb_vorbis uses floating point. When decoding a single test file, already loaded into RAM, entirely in one shot, stb_vorbis was around 10% faster than Tremor. I think this is because the SH4's integer multiply instructions aren't as fast as floating point ops, and using floating point gave the compiler access to more registers to use.
I also checked how well the library minimp3 worked for MP3 decoding. This was around 20% faster than stb_vorbis for one-shot decoding. Using MP3s might be a better option than Vorbis because of this (and the MP3 patents have expired), but for now I'm mainly focusing on Vorbis. Any improvements made to Vorbis decoding would still apply to MP3 decoding.
I modified some of the KOS streaming code to manually top up the AICA buffer each frame, never letting it get too empty. This significantly improved the worst-case per-frame CPU load, from 60% to 25%. The average CPU load was 12%, so the worst spikes are still pretty noticeable, though they don't happen as often. It mostly alternates between a frame with practically no CPU load and one with a minimal amount of CPU load, with the occasional double-load frame. Dropping to 32 kHz reduces the worst-case spike to 22%, with an average load of 10%.
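The change amounts to something like this sketch, where aica_buffer_free(), decode_into(), and push_to_aica() are hypothetical stand-ins for the streamer's internals, not real KOS calls.
Code: Select all
/* Per-frame top-up: each frame, decode only enough to keep the AICA
   ring buffer near full, instead of letting it drain and refilling it
   in one burst. The three extern functions are hypothetical stand-ins
   for the streamer's internals. */
#include <stddef.h>

#define TOPUP_GRANULE 2048  /* bytes decoded per step; tune to taste */

extern size_t aica_buffer_free(void);            /* free ring space */
extern void   decode_into(void *dst, size_t n);  /* Vorbis/MP3 step */
extern void   push_to_aica(const void *src, size_t n);

/* Call once per frame, e.g. right after vblank. */
void stream_topup(void) {
    static unsigned char tmp[TOPUP_GRANULE];

    while(aica_buffer_free() >= TOPUP_GRANULE) {
        decode_into(tmp, TOPUP_GRANULE);
        push_to_aica(tmp, TOPUP_GRANULE);
    }
}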
Vorbis (and MP3) files are split up into chunks called frames. To avoid confusion with video frames (periods between vblanks), I'm going to call audio frames "chunks" in this post, while video frames will just be "frames".
I think the no-load frames are ones where everything the AICA needs was already decoded in a previous frame and buffered, and the high-load frames are ones where the CPU needs to decode one or more chunks.
Vorbis decoding has occasional extra-large spikes. I'm not sure what causes the inconsistency. From what I've read, a Vorbis stream can have multiple chunk sizes, so maybe sometimes more than one chunk is needed, even though the number of samples needed is mostly consistent?
MP3s also alternate between frames without decoding and frames with decoding, but they don't have the occasional extra-long spikes that Vorbis has. The difference in average CPU usage while streaming to the AICA was smaller than in the one-shot decode, with an average CPU usage of 11.75%.
The graph is still jagged. Some frames don't need to decode anything, since we have enough left over from the last chunk. What matters most performance-wise is the worst case. Instead of having to decode an entire Vorbis/MP3 chunk in one video frame, it would be better to spread the decode time across frames. This could be done by putting the decoder in a thread and limiting how much time it can run per frame.
The thread scheduler in KOS doesn't seem suited for this. There doesn't seem to be a way to configure how much time a thread gets when it's scheduled, and there's no way to synchronize scheduling with vblank.
It's hacky, but it looks like it might be possible to override KOS's scheduling and achieve this without modifying the kernel itself. The decode thread would be marked as low priority, so that it won't normally interfere with the main thread's scheduling. A vblank handler would force-schedule the thread and set the thread reschedule timer to however long the thread should run that frame.
This would only work for one thread at a time. If two threads tried this, one thread would override the scheduling of the other. Ideally, some more robust way of doing this would be worked into KOS, but this might work for now.
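In outline, the hack might look like this. This is only a sketch: thd_schedule_next() and timer_primary_wakeup() are my assumptions about what KOS exposes for this, not a tested recipe.
Code: Select all
/* Sketch of the vblank force-schedule hack. thd_schedule_next() and
   timer_primary_wakeup() are assumptions about KOS internals here,
   not a tested recipe. */
#include <arch/types.h>
#include <arch/timer.h>
#include <kos/thread.h>
#include <dc/vblank.h>

static kthread_t *decode_thd;        /* created at low priority */
static uint32 decode_budget_ms = 3;  /* decode time allowed per frame */

static void vbl_force_decode(uint32 code) {
    (void)code;
    /* Hand the CPU to the decoder now, and have the preemption timer
       fire once its per-frame budget is used up. */
    thd_schedule_next(decode_thd);
    timer_primary_wakeup(decode_budget_ms);
}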
If/when KOS scheduling improves, I think instead of specifying how much time a thread gets per frame, it should be specified per second. That way, 60 Hz NTSC, 50 Hz PAL, and custom VGA modes (like 640x400 at 70 Hz) would all allocate the correct amount of time without extra handling.
Another issue with speeding up decoding is the lack of sound DMA. How much time does manually pushing samples to the AICA take? Disabling the writes of samples to sound RAM reduced the average CPU load of decompression by about 10%. But, obviously, you can't hear anything then. Replacing CPU sample pushing with DMA would get very close to that speedup.
I first tried blocking sound DMA as a first step toward getting background DMA working. The system would completely freeze every time I tried to start a blocking DMA. I guess I'll have to look into what's happening with G2 DMA. I have a fairly long (and heated) thread saved from the dcdev Yahoo Group about G2 DMA; M. R. Brown was very critical of how G2 DMA was handled in KOS at the time, and posted some documentation of how he thought G2 DMA worked.
The "mode" parameter passed to g2_dma_transfer by spu_dma_transfer is 6 in the current version of KOS. There's a comment saying that 6 might work better, even though that's what it is? Sounds like it was originally something else, and going through older revisions, it was originally 5. It looks like KOS's DMA originated from an example by bITmASTER, where the value was also 5. Using mode 5, the DMA started working again. Is something wrong with my code or Dreamcast, or has no one used KOS's sound DMA in 15+ years?
I looked at MAME's source code to see how it handles this register. If bit 1 is set (i.e. "(mode & 0x2)"), then ctrl2 won't start the DMA; instead, certain interrupts start it? I guess the idea is that the ARM could raise an IRQ to trigger a DMA, but a single DMA triggered from the ARM doesn't seem useful. It doesn't really make much sense to me, but using mode 5 got sound DMA working again.
Another thing is the way KOS's G2 DMA code configures the SH4's DMA controller. There shouldn't be any need to: the Dreamcast uses the "On-Demand Data Transfer" (DDT) feature of the SH4 DMA controller, which allows an external device to set DMAC registers, for all DMA except one type of PVR DMA, which uses a special mode that does require the CPU to set up channel 2. I think the reason SH4 DMA channels 1 and 3 are the only ones that work is that they are unused, and setting the other channels can interfere with running DMA. No emulator I looked at reads the SH4's DMA controller when performing G2 DMA. I tried removing the part of g2_dma_transfer that modifies the SH4 DMA registers, and AICA DMA still works fine without it.
One thing on Yahoo Groups that everyone agreed on was that writing to G2 while there was DMA going on was a bad idea, and that DMA should be stopped before the CPU touches G2.
The first way I thought of handling this was to stop the DMA by clearing the start bit, then setting it again to resume from where it left off. But that restarts the DMA from the beginning. You also can't tell how much DMA remains by looking at the source, destination, and count registers set by g2_dma_transfer; they don't change as the DMA progresses. So I looked at the register M. R. Brown called a status register, which KOS calls "u1".
Reading the status register before DMA returned 0x20. Starting DMA, then reading it immediately, returned 0. Starting DMA, performing a small delay, then reading the register returned either 0x20 or 0, depending on how long the delay was. It seemed like an inverted copy of the DMA completion flag you can get from ctrl2.
I looked into how different emulators handle the register. The MAME source names it "SB_ADSUSP", but it doesn't seem to be used anywhere. lxdream names the register "G2DMA0STOP". Well, that sounds like it might be the right register for pausing DMA.
I modified my Vorbis sample pusher to write 0xffffffff (one of these bits has to do something, right?) to G2DMA0STOP immediately after starting DMA. There was complete silence, so it did seem to stop the DMA. After commenting out the pause request and rerunning the program, there was still no DMA. It seems KOS's G2 initialization isn't enough to undo the pause command, and it stayed stopped across different runs. Writing 0 to G2DMA0STOP reenabled DMA.
When reading G2DMA0STOP immediately after writing all ones to it, I would get 0x17. If I delayed the read a bit, I would get 0x37. My guess is (G2DMA0STOP & 0x20) means the DMA channel is idle, while (ctrl2 & 0x1) means the DMA hasn't fully completed. Some of the 0x17 bits probably mean more than just "the channel is paused", but I don't know what; they always seem to be set or cleared together.
Setting only bit 0 in G2DMA0STOP seems to be enough to pause DMA. I haven't figured out what side effects bits 1 and 2 have when set.
But I have also seen ctrl2 and (G2DMA0STOP & 0x20) both read 0, which would imply that the channel is not idle but is complete?
KOS's G2 read/write functions don't seem to cause corruption, but when a DMA is running, they wait until it completes before operating. This explains the temporary lockup that happened when I tried a DMA with size 0: the DMA took a long time to complete, so I think the AICA channel read waited a while for it to finish.
After a bit of extra testing, it seems setting bit 0 is enough for bits 4 and then 5 to get set. I'm still not sure what bits 1 and 2 are for.
My final guess is that the correct way to read/write to G2 during G2 DMA is this:
1. Write 0xffffffff (or 0x1) to G2DMA0STOP.
2. Wait for (G2DMA0STOP&0x20) to be set
3. Wait for FIFO to clear
4. Perform read/write
5. Write 0 to G2DMA0STOP
Maybe we don't need both steps 2 and 3, but I guess it doesn't hurt to be safe.
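Here's that sequence as a sketch over raw register addresses (SB_ADSUSP/G2DMA0STOP for channel 0 at 0x5f781c, the G2 FIFO status register at 0x5f688c, both of which also appear in the code I post later in this thread). The 0x20 idle bit and the FIFO mask are my guesses from the testing above.
Code: Select all
/* Sketch of the suspend-access-resume sequence for G2 DMA channel 0.
   The 0x20 "idle" bit and the 0x11 FIFO mask are guesses from my
   testing; as noted below, KOS and SDL don't even agree on which
   FIFO bits to check. */
#include <arch/types.h>

#define G2DMA0STOP (*(volatile uint32 *)0xa05f781c) /* SB_ADSUSP */
#define G2_FIFO    (*(volatile uint32 *)0xa05f688c) /* FIFO status */

void g2_safe_write32(volatile uint32 *addr, uint32 val) {
    G2DMA0STOP = 1;                /* 1. request suspend (bit 0) */
    while(!(G2DMA0STOP & 0x20)) ;  /* 2. wait for channel idle */
    while(G2_FIFO & 0x11) ;        /* 3. wait for FIFO to drain */
    *addr = val;                   /* 4. perform the access */
    G2DMA0STOP = 0;                /* 5. resume the DMA */
}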
I'm working on tests to verify this. A test starts a long DMA to one part of sound RAM and, while it's going, writes values to another part of sound RAM with the SH4. When the DMA completes, it checks that all values written by DMA and CPU landed correctly. It will also test reads during DMA. I'll test a couple of different methods of pausing the DMA to see what works best.
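As a rough sketch of the write half: StartBackgroundAudioDMA() and G2DMAInProgress() come from the code posted further down this thread, g2_safe_write32() from the sketch above, and the offsets and patterns are arbitrary. Verifying the DMA-written region is left out.
Code: Select all
/* Sketch of the mid-DMA write test. Helpers are declared extern here;
   they come from the code posted later in the thread and the sketch
   above. Checking the DMA-written region is elided. */
#include <stdio.h>
#include <arch/types.h>

extern void StartBackgroundAudioDMA(void *src, uint32 dst, int size);
extern int  G2DMAInProgress(int channel);
extern void g2_safe_write32(volatile uint32 *addr, uint32 val);

#define TEST_WORDS 64

void mid_dma_write_test(void) {
    static uint32 src[4096] __attribute__((aligned(32)));
    volatile uint32 *cpu_region = (volatile uint32 *)0xa0900000;
    uint32 i, errors = 0;

    for(i = 0; i < 4096; i++)
        src[i] = i;

    /* Long non-blocking DMA into the start of sound RAM... */
    StartBackgroundAudioDMA(src, 0, sizeof(src));

    /* ...while the SH4 pokes a different part of sound RAM. */
    for(i = 0; i < TEST_WORDS; i++)
        g2_safe_write32(&cpu_region[i], 0xbeef0000 + i);

    while(G2DMAInProgress(0)) ;  /* let the DMA finish */

    for(i = 0; i < TEST_WORDS; i++)  /* check the CPU-side writes */
        if(cpu_region[i] != 0xbeef0000 + i)
            errors++;
    printf("CPU write errors: %lu\n", (unsigned long)errors);
}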
I've also noticed KOS and SDL check different bits when testing whether the AICA FIFO is clear. Any idea which is right? Just check all of them to be sure?
From the testing I've done, at first it appeared mid-DMA writes didn't corrupt or interrupt the current DMA, but the CPU write was dropped and sound RAM was unchanged. But after cold rebooting the system, the writes started working? I did still get some odd corruption afterwards, but I think it was because I left the ARM running while I overwrote sound RAM with test data.
When KOS initializes the AICA, it leaves the ARM running, but has it execute an endless loop (since stopping the ARM apparently prevents CD audio playback). My DMA tests would break the loop and cause the ARM to execute random code, overwriting some of the tests. At first, I just stopped the ARM, but then I changed my test to leave the ARM running with the idle loop KOS sets up intact, in case having the ARM stopped or running affects anything.
After doing that, all corruption stopped. ALL of it. I can't get any corruption even if I try to deliberately cause errors by ignoring the FIFO registers and just spamming the AICA with writes from the SH4 while DMA is occurring. This is pretty irritating. I want errors I can compare the error-free code against!
In the Yahoo Group thread, someone mentioned that different hardware revisions might be affected differently by this. I know that different revisions of the Genesis have bits where there are slight differences, but that's because Sega had to redesign parts of the Genesis to reduce chip count and save costs. The Dreamcast already had G2 combined into Holly at launch, and Sega would still have needed compatibility with older versions of the hardware, so I don't really see any benefit to Sega in redesigning and improving G2 afterwards. I think it's unlikely, although possible, that different revisions behave differently. But if anyone has older hardware, like the original Japanese systems, and can run homebrew, I'd appreciate help running the G2 DMA tester whenever it's finished. The manufacture date stored in flash on my main console is March 3, 2000 (revision VA1).
I haven't tried seeing how reads behave yet.
The G2 DMA code in KOS mentions G2 status registers. I hadn't looked at them much, since they weren't covered in emulators or the Yahoo Groups thread, but I went back and did a quick check. The length in the status registers does count down as the DMA progresses. But the length status register doesn't stay at 0 when the DMA completes; instead, it resets to its starting value.
It might be possible to "pause" the DMA by reading the in-progress counter, stopping the DMA completely, then restarting it after setting the length to the value read, but there's the possibility of a race condition depending on when the counter is read. That could probably be worked around by doing multiple reads, but using the pause register seems safer and simpler.
I haven't looked at how the status registers' source and destination values behave, but they probably update the same way as the length register.
Also, wow, bandwidth to sound RAM is really low. About 5 MB a second? I know a lot of sound RAM's bandwidth is reserved for the playback channels, DSP, and ARM, but that's still impressively low.
The final major issue with speeding up decoding: accessing the disc drive in KOS uses PIO (CPU reads) instead of DMA. My CPU usage tests decoded a file already loaded entirely into RAM (except for the current KOS playback library, which played from a romdisk), so DMA-less disc access would increase CPU load further.
By modifying KOS to use DMA for CD file system access, it should be possible to eliminate the CPU usage for this. But KOS doesn't really provide a way to prefetch CD data, so the decoder might still block occasionally. If the decoder blocks, it would need more CPU time afterwards to make up for the time lost. One way around this would be to have separate decode and I/O threads: the I/O thread would prefetch the compressed audio data into a buffer which the decoder consumes from.
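A sketch of that split, using KOS threads and semaphores with a simple ring of chunks. All the names are mine, and error handling and end-of-file are left out.
Code: Select all
/* Sketch of the decode/I-O split: an I/O thread prefetches compressed
   data into a ring of chunks that the decoder consumes. Names are
   hypothetical; error handling and end-of-file are elided. */
#include <stddef.h>
#include <fcntl.h>
#include <kos/thread.h>
#include <kos/sem.h>
#include <kos/fs.h>

#define CHUNK_SIZE 2048
#define NUM_CHUNKS 16

static unsigned char ring[NUM_CHUNKS][CHUNK_SIZE];
static semaphore_t free_slots, full_slots;

static void *io_thread(void *param) {
    file_t f = fs_open((const char *)param, O_RDONLY);
    int head = 0;

    for(;;) {
        sem_wait(&free_slots);               /* room in the ring? */
        fs_read(f, ring[head], CHUNK_SIZE);  /* blocking disc read */
        head = (head + 1) % NUM_CHUNKS;
        sem_signal(&full_slots);             /* hand chunk to decoder */
    }
    return NULL;
}

/* Decoder side: sem_wait(&full_slots); decode ring[tail]; advance
   tail; sem_signal(&free_slots). It only blocks if the ring runs dry. */
void start_prefetch(const char *path) {
    sem_init(&free_slots, NUM_CHUNKS);
    sem_init(&full_slots, 0);
    thd_create(1, io_thread, (void *)path);
}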
I had hoped converting KOS to use DMA would just be a matter of changing the PIO read call to a DMA read call, but while typing this it occurred to me that DMA alignment would probably be a problem. The DMA probably requires 32-byte alignment for the destination, and would only work in 32-byte blocks. KOS's existing file system I/O doesn't require any alignment or rounded sizes. Just converting all disc access to direct DMA probably wouldn't work as easily as I hoped.
Has anyone actually tried using CD DMA in KOS? Does it work? I would guess there's some syscall that needs to be run when a disc drive DMA completion IRQ happens, but I don't see anything like that in KOS. Are the syscalls still able to work correctly anyway? Marcus's site has very little information on disc drive syscalls. I have some disc drive syscall material saved from Yahoo Groups, but I didn't see anything about that either. It's possible that the "main loop" syscall (or the check command syscall) could poll the DMA register for completion, but then why have the IRQ?
I would personally prefer an asynchronous I/O system over another thread. You'd just ask, "Move bytes from here to here, and let me poll or set up a callback to find out when it's done." Maybe have a priority system, so a game can load level data (normal priority) while streaming music (high priority). A lot of the complexity a desktop or server OS has to deal with (like synchronization between reads and writes) wouldn't matter on the Dreamcast, which is pretty much read-only. Something like the hypothetical interface sketched below is what I have in mind.
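Code: Select all
/* Hypothetical asynchronous I/O interface; this is not an existing
   KOS API, just the shape of what I'd want. */
#include <stddef.h>

typedef enum {
    AIO_PRIO_NORMAL,  /* bulk loads, e.g. level data */
    AIO_PRIO_HIGH     /* latency-sensitive, e.g. music streaming */
} aio_prio_t;

typedef void (*aio_callback_t)(void *userdata, int result);

typedef struct aio_req aio_req_t;  /* opaque request handle */

/* Queue a read of size bytes at offset from file into dst. Completion
   is reported through the callback (if non-NULL) or by polling. */
aio_req_t *aio_read(int file, size_t offset, void *dst, size_t size,
                    aio_prio_t prio, aio_callback_t cb, void *userdata);

int aio_done(const aio_req_t *req);  /* nonzero once transfer finished */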
- These users thanked the author TapamN for the post (total 4):
- |darc| • T_chan • Ian Robinson • GyroVorbis
- Ian Robinson
- DC Developer
- Posts: 126
- Joined: Mon Mar 11, 2019 7:12 am
- Has thanked: 225 times
- Been thanked: 45 times
Re: Streaming Music Playback CPU Usage
I have known how bad the Ogg decoder is for a long time. DMA reads work in KOS; the code below is the mod from my DreamBOR project, thank you to megavolt.
I used SPU DMA in KOS in a number of old projects on KOS 1.3, so you're right, probably nothing in 15 years..
I used kazade's dcprof and found out how bad Ogg streaming is, and MP3 too; both are not suited to use in a 3D game.. CDDA is the best for homebrew right now, as it uses zero CPU time, and you can stop and resume it to work around loading level blocks. You do have 250 ms access time as well when it starts to read.
Josh posted a homebrew ADX example and API in the forum that uses only about 4%; I have compiled and tested it with dcprof. There is an open-source encoder/decoder now:
https://github.com/Isaac-Lozano/radx/releases
https://github.com/ianmicheal/Dreamcast ... w-opensrc-
Sound RAM is 40x slower than system RAM.
ADX streaming uses 2 to 4% at 44 kHz mono.
Code: Select all
README
LibADX (c)2012 Josh PH3NOM Pearson
decoder algorithm based on adx2wav (c) BERO 2001
LibADX is a library for decoding ADX audio files using the
Kallisti:OS development environment, intended for use only
on the Sega Dreamcast game console.
LibADX features full implementation of the ADX looping function.
The functions available include play, pause, stop, restart.
This library is completely free to modify and or redistribute.
LibADX is a static library, build the library with the makefile:
LibADX/LibADX/Makefile
Then build the example player:
LibADX/Makefile
The example player uses a hard-coded file name, so make sure
to include a "sample.adx" on the root of the /cd/, or modify
the source libADXplay.c to load a different file.
Low CPU time use, and you can use this encoder:
https://github.com/Isaac-Lozano/radx/releases
Compiled: just run make, add an ADX file called sample.adx, and make a selfboot CDI using the included compiled ELF, or compile your own.
Ian micheal
- Attachments
-
- Kos[gdrom][dmaread].rar
- (3.69 KiB) Downloaded 143 times
- Ian Robinson
- DC Developer
- Posts: 126
- Joined: Mon Mar 11, 2019 7:12 am
- Has thanked: 225 times
- Been thanked: 45 times
Re: Streaming Music Playback CPU Usage
Do you have SPU DMA working on KOS 2.0? I can't get it to work and it's greyed out; maybe a KOS patch? I had it working like you said 15 years ago, but what happened? I only just noticed it's missing from the new version of KOS..
-
- DC Developer
- Posts: 120
- Joined: Sun Oct 04, 2009 11:13 am
- Has thanked: 2 times
- Been thanked: 109 times
Re: Streaming Music Playback CPU Usage
I get an internal server error when I try to embed the code in this post. I'll put it on Pastebin. This is ripped out of my DMA testing code; I haven't tested whether it works by itself, but I think everything important is in there to get DMA running.
I can't really list everything I've tried so far at the moment, but the main issue is that I've had trouble getting reliable errors. I was hoping using DMA incorrectly would just result in either errors on every access, or a percentage of accesses resulting in errors. But it looks like whether you get errors at all is randomly determined at boot. I booted the system, had reliable mid-DMA errors once, rebooted the system, and could no longer get any errors with the same code.
I've been trying to modify an emulator to log what commercial games do, so I don't have to do this experimentation, but I've had problems getting the emulators I've tried to compile and run.
Trivia I've discovered: Audio DMA bandwidth drops by around a third when the AICA is playing 16-bit PCM samples on all channels.
Admin edit, code _should_ work now... -|darc|
Code: Select all
/* the address of the sound ram from the SH4 side */
#define G2_ARAM 0xa0800000
#define G2_DMA_TO_DEVICE 0
#define G2_DMA_FROM_DEVICE 1
typedef struct {
uint32 ext_addr; /* External address (SPU-RAM or parallel port) */
uint32 sh4_addr; /* SH-4 Address */
uint32 size; /* Size in bytes; all addresses and sizes must be 32-byte aligned */
uint32 dir; /* 0: cpu->ext; 1: ext->cpu */
uint32 mode; /* 5 for SPU transfer */
uint32 ctrl1; /* b0 */
uint32 ctrl2; /* b0 */
uint32 stop; /* ?? */
} g2_dma_ctrl_t;
typedef struct {
uint32 ext_addr;
uint32 sh4_addr;
uint32 size;
uint32 status;
} g2_dma_stat_t;
typedef struct {
g2_dma_ctrl_t dma[4];
uint32 u1[4]; /* ?? */
uint32 wait_state;
uint32 u2[10]; /* ?? */
uint32 magic;
g2_dma_stat_t dma_stat[4];
} g2_dma_reg_t;
#define G2_REGS (*(volatile g2_dma_reg_t *)0xa05f7800)
#define G2_FIFO (*(volatile int*)0xa05f688c)
static inline void G2PauseDMA(int channel) {
G2_REGS.dma[channel].stop = -1;
}
static inline void G2ResumeDMA(int channel) {
G2_REGS.dma[channel].stop = 0;
}
static inline int G2DMAInProgress(int channel) {
return G2_REGS.dma[channel].ctrl2 & 1;
}
static semaphore_t dma_done[4];
static int dma_blocking[4];
static g2_dma_callback_t dma_callback[4];
static ptr_t dma_cbdata[4];
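/* Note: dma_done[] must be initialized (sem_init(&dma_done[i], 0)) and
   G2DMAIRQHandler hooked to the G2 DMA completion events before use;
   that setup isn't shown in this excerpt. */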
static void G2DMAIRQHandler(uint32 code) {
int chn = code - ASIC_EVT_G2_DMA0;
if(chn < 0 || chn > 3) {
//panic("wrong channel received in g2_dma_irq");
//return;
//printf("wrong channel received in g2_dma_irq");
chn = 1;
}
/* VP : changed the order of things so that we can chain dma calls */
// Signal the calling thread to continue, if any.
if(dma_blocking[chn]) {
sem_signal(&dma_done[chn]);
thd_schedule(1, 0);
dma_blocking[chn] = 0;
}
// Call the callback, if any.
if(dma_callback[chn]) {
dma_callback[chn](dma_cbdata[chn]);
}
}
int G2DMATransfer(void *from, void * dest, uint32 length, int block,
g2_dma_callback_t callback, ptr_t cbdata, uint32 dir, uint32 g2chn) {
if(g2chn > 3) {
errno = EINVAL;
return -1;
}
/* Check alignments */
if(((uint32)from) & 31) {
dbglog(DBG_ERROR, "g2_dma: unaligned source DMA %p\n", from);
errno = EFAULT;
return -1;
}
if(((uint32)dest) & 31) {
dbglog(DBG_ERROR, "g2_dma: unaligned dest DMA %p\n", dest);
errno = EFAULT;
return -1;
}
if(((uint32)length) & 31) {
dbglog(DBG_ERROR, "g2_dma: unaligned length DMA %p\n", dest);
errno = EFAULT;
return -1;
}
dma_blocking[g2chn] = block;
dma_callback[g2chn] = callback;
dma_cbdata[g2chn] = cbdata;
/* Start the DMA transfer */
G2_REGS.dma[g2chn].ctrl1 = 0;
G2_REGS.dma[g2chn].ctrl2 = 0;
G2_REGS.dma[g2chn].ext_addr = ((uint32)dest) & 0x1fffffe0;
G2_REGS.dma[g2chn].sh4_addr = ((uint32)from) & 0x1fffffe0;
G2_REGS.dma[g2chn].size = (length & ~31) | 0x80000000;
G2_REGS.dma[g2chn].dir = dir;
G2_REGS.dma[g2chn].mode = 5;
G2_REGS.dma[g2chn].ctrl1 = 1;
G2_REGS.dma[g2chn].ctrl2 = 1;
/* Wait for us to be signaled */
if(block)
sem_wait(&dma_done[g2chn]);
return 0;
}
//Poll for DMA completion with G2DMAInProgress(0)
void StartBackgroundAudioDMA(void *src, uint32 dst, int size) {
G2DMATransfer(src, (void*)G2_ARAM + dst, size, 0,0,0, G2_DMA_TO_DEVICE, 0);
}
- Ian Robinson
- DC Developer
- Posts: 126
- Joined: Mon Mar 11, 2019 7:12 am
- Has thanked: 225 times
- Been thanked: 45 times
Re: Streaming Music Playback CPU Usage
[10:44 AM] MetalliC: so, 22.5M/2 / 2 = 5.6M of 16bit RAM accesses per second for ARM, but it is old stupid ARM7DI which always does 32bit mem reads, so one more /2 = ~2.8M
So, the worst-case scenario is that the ARM7 only gets 1 out of 16 cycles (when the AICA ties up the bus with other things), so it functions at 2.8 MHz, which is awfully close to my estimate. Maybe it gets two or three or more out of sixteen, increasing the available time in 2.8 MHz increments.
SPU DMA was nulled out; seems it's working with value 5 like you said.. this thing is super slow.
- Ian Robinson
- DC Developer
- Posts: 126
- Joined: Mon Mar 11, 2019 7:12 am
- Has thanked: 225 times
- Been thanked: 45 times
Re: Streaming Music Playback CPU Usage
Hi TapamN, a lot of work has been done to reconfigure the KOS scheduler, and other things. I would love to see this benchmark done again. Do you have the benchmark source we can run to check, or can you try the latest PRs and check?
-
- DC Developer
- Posts: 120
- Joined: Sun Oct 04, 2009 11:13 am
- Has thanked: 2 times
- Been thanked: 109 times
Re: Streaming Music Playback CPU Usage
Ian Robinson wrote: ↑Mon Apr 22, 2024 10:58 pm
Hi TapamN, a lot of work has been done to reconfigure the KOS scheduler, and other things. I would love to see this benchmark done again. Do you have the benchmark source we can run to check, or can you try the latest PRs and check?
What changes were made? The two most important changes would be fixing the streaming system to perform small per-frame requests, instead of large once-every-couple-of-frames requests, and a way for the scheduler to limit how much a thread can run in a single frame. Working AICA DMA would also be nice.
- These users thanked the author TapamN for the post:
- Ian Robinson
- SWAT
- Insane DCEmu
- Posts: 193
- Joined: Sat Jan 31, 2004 2:34 pm
- Location: Russia/Novosibirsk
- Has thanked: 1 time
- Been thanked: 1 time
- Contact:
Re: Streaming Music Playback CPU Usage
Most issues should be fixed by my PRs that were already merged to KOS master in 2023-2024:
https://github.com/KallistiOS/KallistiOS/pull/762
...other
- These users thanked the author SWAT for the post:
- Ian Robinson