Race condition in KOS timer_??_gettime*

If you have any questions on programming, this is the place to ask them, whether you're a newbie or an experienced programmer. Discussion on programming in general is also welcome. We will help you with programming homework, but we will not do your work for you! Any porting requests must be made in Developmental Ideas.
Post Reply
TapamN
DC Developer
DC Developer
Posts: 104
https://www.artistsworkshop.eu/meble-kuchenne-na-wymiar-warszawa-gdzie-zamowic/
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 88 times

Race condition in KOS timer_??_gettime*

Post by TapamN »

(I had a lot of trouble getting this post to work. I would get server errors when I tried to preview the post if I had too many/too large {code} sections. I ended up attaching the code.)

I've been using timer_us_gettime64 for profiling for years, and I've noticed the occasional hiccup with results from it. This post also applies to timer_ms_gettime and timer_ms_gettime64.

I had a bar graph showing how long different processes were taking, and for a single frame, the bar graph would sometimes have some insane value and go off the charts. At first, I thought it was a problem on my end, but eventually I noticed that something was supposedly taking multiple frames, but I was still having a new frame rendered every vblank, which made me think it there was a problem with the time KOS was returning.

It looks like the problem comes from a race condition between reading the seconds value KOS keeps track of, and reading the SH4 TMU value for more precision. For example, in timer_us_gettime64...

scnt = timer_ms_counter;
cnt = timer_count(TMU2);

...the TMU can underflow and trigger and IRQ between when timer_ms_counter is read and when timer_count(TMU2) is read. The events end up like this:
  1. timer_us_gettime64 is called
  2. Read 5 from timer_ms_counter
  3. Read value from timer_count(TMU2)
  4. timer_us_gettime64 returns 5.5 seconds since boot
  5. Almost half a second time passes and timer_us_gettime64 is called again
  6. Read 5 from timer_ms_counter
  7. Timer underflows, IRQ handler increments timer_ms_counter to 6 and TMU resets
  8. Read value from timer_count(TMU2)
  9. timer_us_gettime64 returns 5.0 seconds since boot (Should be 5.999 or 6.0)
  10. The first time is subtracted unsigned from the second time (5.0 - 5.5), resulting in 2**64-0.5 seconds having elapsed between calls to gettime
In the attachment, my fix works by disabling IRQs when reading the timer so that the IRQ can't increment timer_ms_counter behind the function's back. The IRQ handler was modified to clear the TMU's underflow bit (it just left it set before). When reading the timer, with IRQs disabled, the underflow bit is checked after reading the TMU. If it's set, it increments the return value by one second and rereads the TMU. When IRQs are re-enabled, the IRQ handler is responsible for incrementing timer_ms_counter.

I made a basic test to check for unusual timer behavior. It runs until a key is pressed on a keyboard. It rapidly polls the timer, and when there is a large difference between two calls to gettime it prints the difference. Using the original KOS timer, I would get about three timer errors detected each minute. With my fixes, I haven't found any errors after 10 minutes.

On a semi-related note, why is the seconds timer called a millisecond timer in the code? timer_ms_counter doesn't count milliseconds, it apparently actually counts seconds.
Attachments
timertest.7z
(3.79 KiB) Downloaded 85 times
These users thanked the author TapamN for the post (total 2):
Ian RobinsonGyroVorbis
User avatar
BlueCrab
The Crabby Overlord
The Crabby Overlord
Posts: 5652
Joined: Mon May 27, 2002 11:31 am
Location: Sailing the Skies of Arcadia
Has thanked: 9 times
Been thanked: 69 times
Contact:

Re: Race condition in KOS timer_??_gettime*

Post by BlueCrab »

I'll try to take a look at this more closely this weekend sometime. Thanks for the report of the issue.

As for why it is called "timer_ms_counter", I'd assume that is because it is an internal counter used for the "timer_ms_*" functions, but being that I didn't write that code I can't necessarily say exactly what the original author was thinking when they wrote the code. Considering that it is a static variable and all, it's not like most people are likely to ever see the code and question it too much. ;-)
User avatar
Ian Robinson
DC Developer
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 206 times
Been thanked: 41 times

Re: Race condition in KOS timer_??_gettime*

Post by Ian Robinson »

Any more on this BlueCrab ?
These users thanked the author Ian Robinson for the post:
GyroVorbis
User avatar
Quzar
Dream Coder
Dream Coder
Posts: 7497
Joined: Wed Jul 31, 2002 12:14 am
Location: Miami, FL
Has thanked: 4 times
Been thanked: 9 times
Contact:

Re: Race condition in KOS timer_??_gettime*

Post by Quzar »

I agree with BlueCrab's guess that the naming of timer_ms_counter is just due to it being a counter for the timer_ms_* functions. Something like 'timer_ms_secs' might have made sense too, but I think the core notion was to separate it from the other functions in timer.c.

Your changes make sense when compared against the SH manual's information on how the timers work. It's surprising that the built-in was never clearing/checking overflow. My one question about them is why do 'used=timer_count(TMU2)' in the overflow conditional branch? Was this just a redundancy or is it expected that checking the TCR for overflow would change the value? If it's a matter of timing, why not check overflow first then always set used directly after?

Hopefully it's just a matter of ordering, but I'm concerned I missed some fundamental aspect of how this works. I saw the same was done in timer_us_gettime64, so it didn't seem like just an oversight.

Thanks!
void timer_ms_gettime(uint32 *secs, uint32 *msecs) {
    uint32 used;
    
    int irq_status = irq_disable();

    /* Seconds part comes from ms_counter */
    if(secs)
        *secs = timer_ms_counter;

    /* Milliseconds, we check how much of the timer has elapsed */
    if(msecs) {
        assert(timer_ms_countdown > 0);
        /* Overflow is only notable if we have seconds we can overflow into
	   Avoid read of TCR if secs is null */
	if (secs && TIMER16(tcrs[TMU2]) & 0x100) 
                *secs += 1;        
        used = timer_count(TMU2);
        *msecs = (timer_ms_countdown - used) * 1000 / timer_ms_countdown;
    }
    irq_restore(irq_status);
}
"When you post fewer lines of text than your signature, consider not posting at all." - A Wise Man
TapamN
DC Developer
DC Developer
Posts: 104
Joined: Sun Oct 04, 2009 11:13 am
Has thanked: 2 times
Been thanked: 88 times

Re: Race condition in KOS timer_??_gettime*

Post by TapamN »

Quzar wrote: Tue Dec 14, 2021 5:04 amMy one question about them is why do 'used=timer_count(TMU2)' in the overflow conditional branch? Was this just a redundancy or is it expected that checking the TCR for overflow would change the value? If it's a matter of timing, why not check overflow first then always set used directly after?
I think I read it twice to avoid a race condition between reading the counter and checking TCR, with an overflow happening in-between? Your suggestion to read the counter once after checking for overflow seems good, I can't see any problems with it. I guess I just wasn't thinking hard enough when I tried to fix it!
These users thanked the author TapamN for the post:
Quzar
User avatar
GyroVorbis
Elysian Shadows Developer
Elysian Shadows Developer
Posts: 1873
Joined: Mon Mar 22, 2004 4:55 pm
Location: #%^&*!!!11one Super Sonic
Has thanked: 79 times
Been thanked: 61 times
Contact:

Re: Race condition in KOS timer_??_gettime*

Post by GyroVorbis »

JESUS CHRIST. Did this fix not make it into KOS!? Been doing some timer work lately, and I don't actually see that we're even checking for overflow or are disabling interrupts here.
These users thanked the author GyroVorbis for the post:
Ian Robinson
User avatar
Ian Robinson
DC Developer
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 206 times
Been thanked: 41 times

Re: Race condition in KOS timer_??_gettime*

Post by Ian Robinson »

I tested it with and with out the fix and it does solve the problem old has errors like TapamN said every three timer errors detected each minute.. And zero with out even after 30 mins.
These users thanked the author Ian Robinson for the post (total 2):
|darc|GyroVorbis
User avatar
GyroVorbis
Elysian Shadows Developer
Elysian Shadows Developer
Posts: 1873
Joined: Mon Mar 22, 2004 4:55 pm
Location: #%^&*!!!11one Super Sonic
Has thanked: 79 times
Been thanked: 61 times
Contact:

Re: Race condition in KOS timer_??_gettime*

Post by GyroVorbis »

Fantastic! I have this sitting in a branch locally right now. Only reason it's not in a PR yet is because I'm trying to also increase the resolution of the timers and get rid of the slow divisions in the interrupts while I'm already down there.
These users thanked the author GyroVorbis for the post:
Ian Robinson
|darc|
DCEmu Webmaster
DCEmu Webmaster
Posts: 16373
Joined: Wed Mar 14, 2001 6:00 pm
Location: New Orleans, LA
Has thanked: 103 times
Been thanked: 90 times
Contact:

Re: Race condition in KOS timer_??_gettime*

Post by |darc| »

When working on my benchmarks a couple months ago, I noticed every once in a while I'd get erratic jumps in the performance readings. I set the benchmark to discard readings that were out of expected range, which got the job done, but today I applied the solutions shared by TapamN and Quzar and it fixes the problem altogether.

I put it together in a pull request and the fix been merged into KOS now. Thanks TapamN and Quzar.
These users thanked the author |darc| for the post (total 2):
Ian RobinsonGyroVorbis
It's thinking...
User avatar
GyroVorbis
Elysian Shadows Developer
Elysian Shadows Developer
Posts: 1873
Joined: Mon Mar 22, 2004 4:55 pm
Location: #%^&*!!!11one Super Sonic
Has thanked: 79 times
Been thanked: 61 times
Contact:

Re: Race condition in KOS timer_??_gettime*

Post by GyroVorbis »

I think it got checked in with the extra TMU2 counter read, but it's fine, I'll clean it up in the follow-up PR which increases precision and (hopefully) removes the division.
These users thanked the author GyroVorbis for the post:
Ian Robinson
User avatar
Ian Robinson
DC Developer
DC Developer
Posts: 114
Joined: Mon Mar 11, 2019 7:12 am
Has thanked: 206 times
Been thanked: 41 times

Re: Race condition in KOS timer_??_gettime*

Post by Ian Robinson »

SO this is now in main kos and is it like TapamN's version as i have tested that and works flawless even on crusty kos from 2005 :)
These users thanked the author Ian Robinson for the post:
GyroVorbis
Post Reply