C/C++ Language-Level thread_local Storage Support

Post by GyroVorbis »

Hey guys, I just finished up something I have been working on with Colton for a very, very long time, that I have always wanted to see working on Dreamcast with KOS. I just opened the PR, and it's still in review, but I wanted to let you guys know about it, because depending on your threading needs and concurrency model, it can either make your life a lot easier or your performance a lot better.

https://github.com/KallistiOS/KallistiOS/pull/111

This is for when you want to give each thread its own copy of some piece of data. I see it used frequently for things like giving each thread its own error code or execution context, its own local buffer, its own custom allocator, or even its own Lua thread or other scripting state. It works in both C and C++.

Using it is also extremely simple and convenient. You just use the thread_local specifier in C++11 and beyond, the _Thread_local specifier in C11 and beyond, or __thread (a GCC extension) if you're using a dinosaur revision of either language that belongs in a museum:

#include <cstdint>
#include <string>

static thread_local uint32_t per_thread_counter = 0;
static thread_local std::string per_thread_error = "success";
static thread_local uint8_t per_thread_buffer[256] = { 0 };

Previously you were forced to handle this kind of stuff in one of two ways: either statically allocate multiple copies of something and index into them based on the current thread's ID, or use the OS-level pthread-style TLS API. Both of these are going to be much slower for reads/writes. (The OS-style API is still very useful when your TLS needs cannot be known at compile time and have to be fully dynamic.)
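To make the difference concrete, here is a minimal sketch that keeps the same per-thread counter both ways. It uses the standard POSIX pthread key calls as a stand-in for the "pthread-style" API mentioned above; the helper names (bump_with_key, bump_with_thread_local, count_on_new_thread) are made up for illustration and are not part of KOS. It does not measure anything, it only shows how much more bookkeeping the key-based version needs.

```cpp
#include <pthread.h>
#include <cassert>
#include <cstdint>
#include <cstdlib>

// OS-level TLS: the key is created at run time and every access
// goes through a function call plus a pointer dereference.
static pthread_key_t  counter_key;
static pthread_once_t counter_key_once = PTHREAD_ONCE_INIT;

void free_slot(void *p) { std::free(p); }   // runs when a thread exits
void make_key() { pthread_key_create(&counter_key, free_slot); }

uint32_t bump_with_key() {
    pthread_once(&counter_key_once, make_key);
    auto *slot = static_cast<uint32_t *>(pthread_getspecific(counter_key));
    if (slot == nullptr) {                   // first use on this thread
        slot = static_cast<uint32_t *>(std::calloc(1, sizeof(uint32_t)));
        assert(slot != nullptr);
        pthread_setspecific(counter_key, slot);
    }
    return ++*slot;
}

// Language-level TLS: the compiler and linker do all of the bookkeeping.
static thread_local uint32_t per_thread_counter = 0;

uint32_t bump_with_thread_local() { return ++per_thread_counter; }

// Hypothetical helper: bump both counters n times on a fresh thread
// and report the final values that thread observed.
struct Counts { uint32_t n; uint32_t with_key; uint32_t with_thread_local; };

void *worker(void *arg) {
    auto *c = static_cast<Counts *>(arg);
    for (uint32_t i = 0; i < c->n; ++i) {
        c->with_key          = bump_with_key();
        c->with_thread_local = bump_with_thread_local();
    }
    return nullptr;
}

Counts count_on_new_thread(uint32_t n) {
    Counts c{n, 0, 0};
    pthread_t t;
    int rc = pthread_create(&t, nullptr, worker, &c);
    assert(rc == 0);
    (void)rc;
    pthread_join(t, nullptr);
    return c;
}
```

Calling count_on_new_thread(3) and then count_on_new_thread(5) should report 3 and 5 for both counters, since each new thread starts from its own fresh copy either way.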

The way compiler-level TLS is handled is extremely fast, especially when everything is linked statically, like we do, so that all sizes can be calculated up front and stored within the ELF file.

What winds up happening is that, to access a thread's copy of a variable, the compiler just emits reads/writes at compile-time offsets from the GBR register. GBR is our "thread pointer," which now points to the current thread's static TLS data. That data is allocated up front, just once, as a single block whenever a thread gets created.
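If it helps to picture it, the "thread pointer plus fixed offset" idea can be modeled in plain C++. This is only a conceptual sketch, not how KOS implements it: the TlsLayout struct, the template values, and the helper names are all assumptions made up for illustration, and a byte pointer stands in for GBR.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical layout of every thread-local variable in the program.
// In the real thing the linker works this out and stores it in the ELF file.
struct TlsLayout {
    uint32_t counter;
    int32_t  error_code;
    uint8_t  buffer[16];
};

// Initial values that each new thread's block gets copied from.
static const TlsLayout kTlsTemplate = { 0, -1, { 0 } };

// Offsets are constants known before the program ever runs.
constexpr std::size_t kCounterOffset = offsetof(TlsLayout, counter);
constexpr std::size_t kErrorOffset   = offsetof(TlsLayout, error_code);

// Done once per thread, at thread creation: one block, copied from the template.
std::vector<unsigned char> make_tls_block() {
    std::vector<unsigned char> block(sizeof(TlsLayout));
    std::memcpy(block.data(), &kTlsTemplate, sizeof(TlsLayout));
    return block;
}

// An access is just "thread pointer + constant offset".
// tp plays the role of GBR here.
uint32_t load_u32(const unsigned char *tp, std::size_t offset) {
    uint32_t value;
    std::memcpy(&value, tp + offset, sizeof value);
    return value;
}

void store_u32(unsigned char *tp, std::size_t offset, uint32_t value) {
    std::memcpy(tp + offset, &value, sizeof value);
}
```

Two blocks made with make_tls_block() behave like two threads: storing through one block's pointer at kCounterOffset leaves the other block's counter untouched, and switching threads amounts to switching which block the pointer refers to.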

I'm not sure how much concurrency everyone is doing, but hopefully if you're doing some advanced things or find yourself using any sort of data pattern where a structure or variable is duplicated so that each thread gets its own copy, you will be able to leverage this for much better performance and ease-of-use.