I did the same for tinycc here https://github.com/mingodad/tinycc and used Netbeans IDE that has great refactoring options for C/C++/Java.
Benchmarking the reentrant result showed it to be around 5% slower.
Now I'm trying to redo it again but this time scripting the refactoring using sparse https://github.com/lucvoo/sparse to parse and using it's error messages with with line/column to guide the refactoring, I already got an initial script that performs some initial transformations and is repeatable, but more work need to be done, mainly enhance/extend the info that sparse provide while parsing the code.
thread_local is usually considered the hack to make unthreaded code littered with static variables useable from multiple thread contexts. It has overhead and reduces the compiler’s ability to optimize the code as compared to when parameters are used.
Also, until very recently, a lot compilers/platforms were unable to handle thread_local variables larger than a pointer size making it difficult to retrofit a lot of old code.
It's worth noting that `thread_local` does reduce register pressure. Unfortunately, almost no languages actually natively support the scoping that sane use of this requires.
I'm not aware of one, but it probably exists in some obscure language. I can describe how it should work though:
* ideally, thread-local variables should not be directly exposed at all
* variables intended to be dynamically scoped should be explicitly declared using the `dynamic` keyword at top level, using a thread-local variable (possibly of a different type) under the hood (contrary to languages that use dynamic scoping for all variables, this is opt-in, explicit, and more efficient. Note also that many languages do not have real thread-locals, they just pass the thread ID to a map, thus have atrocious performance.)
* reading a dynamically-scoped variable just accesses the thread-local, after unwrapping the container if needed. If the top level declaration didn't have an initializer, maybe this should throw an error?
* mutations to the dynamically-scoped variable require an explicit keyword, and involve saving the old value as if in a block-scoped local variable, and restoring it on unwind. If there's more than one mutation in a single block this can be optimized out.
* note that "assign a value wholesale" is only the most basic mutation; sometimes a stack is wanted (in which case the saved local value is just the depth) or a set (in which case the saved local value is the key to remove; this needs to be conditional)
* there should be escape hatches to manually do a save, restore, or write-without-saving. These should be used only to implement nontrivial mutations in the library or to port code using unsafe thread-locals.
It's possible to write a reasonable library implementation of this using macros in C++, GNU C, or maybe standard C23.
(an alternate approach would be to write code as if it were passing arguments to functions, but internally change them to writes to thread-local variables (only if not passed from a parameter of the same name) purely for register-pressure-optimization reasons)
Certain linker operations can be multi-threaded (not sure if this is specifically true for LLD). Particularly LTO in the GNU toolchain, but also there's been a lot of effort recently to make linking faster by actually having it use all available cores.
> Thread-local is way too magical for me. I wouldn't want to debug a system that made use of it.
There's a perfectly cromulent register just begging to be used; the circuitry has already been paid for, generating heat whether you like it or not, what magic are you afraid of here?
> Also, if you pass a param, then it can be shared.
Maybe, but if you design for sharing you'll never use your program might be bigger and slower as a result. Sometimes that matters.
> There's a perfectly cromulent register just begging to be used; [...] what magic are you afraid of here?
Most of the magic is not when using the thread-local variable, but when allocating it. When you declare a "static __thread char *p", how do you know that for instance this is located at the 123th word of the per-thread area? What if that declaration is on a dynamic library, which was loaded late (dlopen) into the process? What about threads which were started before that dynamic library was loaded, and therefore did not have enough space in their per-thread area for that thread-local variable, when they call into code which references it? What happens if the thread-local variable has an initializer?
The documentation at https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html links to a 81-page document describing four TLS access models, and that's just for Unix-style ELF; Windows platforms have their own complexities (which IIRC includes a per-process maximum of 64 or 1088 TLS slots, with slots above the first 64 being handled in a slightly different way).
When you declare a `static char *p;', how do you even know in which address of memory it is going to end up ?? How do you know what will happen if another compilation unit declares another variable of the same name? Another static library? Another dynamic library? What about initialization, what about other constructors that may read memory before main() runs? What about injected threads that are started before that? Madness, I tell you, absolute and utter madness.
> When you declare a "static __thread char p", how do you know that for instance this is located at the 123th word of the per-thread area?
Believe it or not, it's exactly the same way it knows that "static char p" is located at the 123rd word of the data section: The linker does it!
> What if that declaration is on a dynamic library, which was loaded late (dlopen) into the process?
Same thing, except the linker is dynamic.
> What about threads which were started before that dynamic library was loaded, and therefore did not have enough space in their per-thread area for that thread-local variable, when they call into code which references it? What happens if the thread-local variable has an initializer?
Seriously? I mean, you probably do have enough space because address space is huge, but dlopen should move TLS.
> links to a 81-page document describing four TLS access models, and that's just for Unix-style ELF
Just because someone can write 81-pages about something doesn't mean that it takes 81-pages to understand something.
Different people have different appetites for magic (and different definitions of what magic is).
For my first magic trick, I'd like to make something not equal to itself:
foo("bar") == foo("bar")
> false // Magic!
This is easy enough. You can make foo(..) do different things by giving it an implicit dependency on the state of the world. I'll notate implicit dependencies with [].
If you like magic, there's at least two fixes in Java land. 1: You can use some kind of mocking-framework magic to override what your actual clock does (in a way which is not visible in the non-test source code). 2: You can inject a fake clock service into your bean-wiring-framework so that you can control the clock during testing. Others seem to like these two, but they make me barf.
The way I fix it is to just make the implicit explicit instead, by turning them into parameters.
> Different people have different appetites for magic (and different definitions of what magic is).
Magic just means "I don't understand this"
And whilst I don't think it's that complicated, I also don't think you really need to understand how TLS works to use it any more than you need to understand auto mechanics to drive a car. It's faster than walking.
> I feel like thread-local takes this magic to the next level:
- TLS is indexed off of a base address, just like data.
- The thread "o" is going to be the same in each expression, so if thread[o] can live in a special-purpose register like %fs then it won't take a general-purpose register. If data can live in a special-purpose register...
Perhaps with a better picture you too will consider TLS is basically the same as non-TLS, except it's sometimes faster because, well, special circuitry.
If anything accessing via "data" is the weird one because even though it looks like you could just set data=0, you'd find almost everyone does it indirectly via %rip, and a variable "foo" is going to be %rip(-5000) in one place and %rip(-4000) in another, but would always be %fs:-26(0) if it were TLS.
I use thread_local a lot, but until recently, on Windows a delay-loaded dll with thread_local would've not worked, and the fix that is in place today is costly, okay that may not be the typical case, but it shows that support for such feature can create a lot of cost elsewhere.
Another pitfall with these is with thread-stealing concurrent schedulers - e.g. your worker thread now waits on something, and the scheduler decides to reuse the current thread for another worker - what is the meaning of thread_local there?
Another one would be coroutines (though haven't used them a lot in C/C++).
Separate compilation. Of course, if your compiler is fast enough to rebuild the whole universe in 6 seconds and then rest on the seventh — an approach Wirth advocated in one of his papers about an implementation of Pascal system — you won't need a linker. But most compilers are not that fast.
Besides, there is more than one programming language, so that's something we have to deal with somehow.
And to be fair, merging modules in the compiler, as you go by, while not that difficult, is just annoying. If you link them properly together, into big amalgamated text/rodata/data sections, then you need to apply relocations (and have them in the first place). If you just place them next to each other, then you have to organize the inter-module calls via some moral equivalent of GOT/PLT. In any case, all this logic really doesn't have much to do with code generation proper, it's administrativia — and logic for dealing with has already been written for you and packed in the so called "link editor".
All but the most trivial programs require a linker to resolve references among object files. And while "int main() {}" might seem like a trivial C program, it's not (by that definition, at least).
Your favorite toolchain will often include archives and objects that you might take for granted like crt0.o, init.o, fini.o, libgcc/clang_rt.builtins and more.
The compiler's design is simplified by not having to resolve references among symbols. The assembler can do this for references within a section and linkers can do it among sections. Linkers might have to add trampolines/thunks for relocations that span a distance longer than the opcode could reach.
Loaders do this symbol resolution and relocation at runtime.
Note that the nontrivial C program I showed above was not "hello world". Emitting output is yet more complicated still.
> you can easily generate e.g web assembly from your own lang input without linker
IIRC some things that generate wasm generate object files (e.g. C translation units). Those wasm objects require a linker to resolve symbol references to create the executable.
> it will run fine
...on a wasm virtual machine. IIRC Java doesn't have anything like a traditional linker either.
> Aint the same viable for x86?
I mean, you can do anything you want inventing a new compiler that does much more than traditional compilers. But consider perhaps that there's a lot of value in the status quo that you might appreciate more if you were more familiar with the design of a toolchain.
It's not that linkers are an absolute drop-dead necessity, it's that having one allows you to have distinct assembler and compiler functions that are much simpler to design by focusing on one kind of transformation.
When I first came to C++ from Rust I was surprised by the regularity of linker errors. Rust must be compiled with a linker as well but I don't think I've ever seen a linker error, except when doing exotic things far outside of my typical day-to-day.
I guess rustc detects the situations in which the linker would throw an error and then throws its own error preemptively. It leads to a much better user experience than the C++ one, since the error messages produced by the linker are always unnecessarily terrible
> I guess rustc detects the situations in which the linker would throw an error and then throws its own error preemptively.
Pretty much. The crucial difference between C and Rust which enables Rust to do this sort of detection is that in Rust, the extern things are anchored in modules (crates? whatever), and so when you import things, you have to say where you are importing them from.
It even enables one to actually type-check against the imported crate during the compilation (IIRC if Cargo can't locate the imported crate to look into it, it will refuse to build the project), as opposed to hoping that the headers you've included are correctly describing the object file you'll be linking against.
I did the same for tinycc here https://github.com/mingodad/tinycc and used Netbeans IDE that has great refactoring options for C/C++/Java.
Benchmarking the reentrant result showed it to be around 5% slower.
Now I'm trying to redo it again but this time scripting the refactoring using sparse https://github.com/lucvoo/sparse to parse and using it's error messages with with line/column to guide the refactoring, I already got an initial script that performs some initial transformations and is repeatable, but more work need to be done, mainly enhance/extend the info that sparse provide while parsing the code.
Also for C/C++ binaries with debug info gdb is one of the ingredients used to show where and how much globals exists:
gdb -batch -ex "info variables" -ex quit --args binary-to-examine
Why not use thread_local instead of passing a param everywhere? What's the drawback there?
thread_local is usually considered the hack to make unthreaded code littered with static variables useable from multiple thread contexts. It has overhead and reduces the compiler’s ability to optimize the code as compared to when parameters are used.
Also, until very recently, a lot compilers/platforms were unable to handle thread_local variables larger than a pointer size making it difficult to retrofit a lot of old code.
It's worth noting that `thread_local` does reduce register pressure. Unfortunately, almost no languages actually natively support the scoping that sane use of this requires.
I’m curious, what’s a language/compiler that does?
I'm not aware of one, but it probably exists in some obscure language. I can describe how it should work though:
It's possible to write a reasonable library implementation of this using macros in C++, GNU C, or maybe standard C23.(an alternate approach would be to write code as if it were passing arguments to functions, but internally change them to writes to thread-local variables (only if not passed from a parameter of the same name) purely for register-pressure-optimization reasons)
Certain linker operations can be multi-threaded (not sure if this is specifically true for LLD). Particularly LTO in the GNU toolchain, but also there's been a lot of effort recently to make linking faster by actually having it use all available cores.
Thread-local is way too magical for me. I wouldn't want to debug a system that made use of it.
Also, if you pass a param, then it can be shared.
> Thread-local is way too magical for me. I wouldn't want to debug a system that made use of it.
There's a perfectly cromulent register just begging to be used; the circuitry has already been paid for, generating heat whether you like it or not, what magic are you afraid of here?
> Also, if you pass a param, then it can be shared.
Maybe, but if you design for sharing you'll never use your program might be bigger and slower as a result. Sometimes that matters.
> > Thread-local is way too magical for me.
> There's a perfectly cromulent register just begging to be used; [...] what magic are you afraid of here?
Most of the magic is not when using the thread-local variable, but when allocating it. When you declare a "static __thread char *p", how do you know that for instance this is located at the 123th word of the per-thread area? What if that declaration is on a dynamic library, which was loaded late (dlopen) into the process? What about threads which were started before that dynamic library was loaded, and therefore did not have enough space in their per-thread area for that thread-local variable, when they call into code which references it? What happens if the thread-local variable has an initializer?
The documentation at https://gcc.gnu.org/onlinedocs/gcc/Thread-Local.html links to a 81-page document describing four TLS access models, and that's just for Unix-style ELF; Windows platforms have their own complexities (which IIRC includes a per-process maximum of 64 or 1088 TLS slots, with slots above the first 64 being handled in a slightly different way).
When you declare a `static char *p;', how do you even know in which address of memory it is going to end up ?? How do you know what will happen if another compilation unit declares another variable of the same name? Another static library? Another dynamic library? What about initialization, what about other constructors that may read memory before main() runs? What about injected threads that are started before that? Madness, I tell you, absolute and utter madness.
The initialisation model in c++ is totally and utterly broken and indecipherable. That doesn’t stop me from doing vector<int> foo = {1,2, 3};
> When you declare a "static __thread char p", how do you know that for instance this is located at the 123th word of the per-thread area?
Believe it or not, it's exactly the same way it knows that "static char p" is located at the 123rd word of the data section: The linker does it!
> What if that declaration is on a dynamic library, which was loaded late (dlopen) into the process?
Same thing, except the linker is dynamic.
> What about threads which were started before that dynamic library was loaded, and therefore did not have enough space in their per-thread area for that thread-local variable, when they call into code which references it? What happens if the thread-local variable has an initializer?
Seriously? I mean, you probably do have enough space because address space is huge, but dlopen should move TLS.
> links to a 81-page document describing four TLS access models, and that's just for Unix-style ELF
Just because someone can write 81-pages about something doesn't mean that it takes 81-pages to understand something.
Avoiding thread locals due to dynamic libraries being bad is justified but still doesn't feel like the right tradeoff.
Different people have different appetites for magic (and different definitions of what magic is).
For my first magic trick, I'd like to make something not equal to itself:
This is easy enough. You can make foo(..) do different things by giving it an implicit dependency on the state of the world. I'll notate implicit dependencies with []. Where does this pop up "in the real world"? Devs always hit the "what time is it" problem [1] If you like magic, there's at least two fixes in Java land. 1: You can use some kind of mocking-framework magic to override what your actual clock does (in a way which is not visible in the non-test source code). 2: You can inject a fake clock service into your bean-wiring-framework so that you can control the clock during testing. Others seem to like these two, but they make me barf.The way I fix it is to just make the implicit explicit instead, by turning them into parameters.
Not only is it the simplest (subjectively), the technique works in any language.I feel like thread-local takes this magic to the next level:
is really: [1] https://stackoverflow.com/questions/2425721/unit-testing-dat...> Magic just means "I don't understand this"
Does this mean I don't understand that objects can have private mutable variables, or that I don't understand random() or getCurrentTime() ?
> The thread "o" is going to be the same in each expression
I don't accept the premise that I'll always be on the same thread, and if I did, I wouldn't need thread-local in the first place.
> Different people have different appetites for magic (and different definitions of what magic is).
Magic just means "I don't understand this"
And whilst I don't think it's that complicated, I also don't think you really need to understand how TLS works to use it any more than you need to understand auto mechanics to drive a car. It's faster than walking.
> I feel like thread-local takes this magic to the next level:
> assertEquals ( three(), one() + two() )
> is really:
> assertEquals ( three[thread m](), one[thread n]() + two[thread o]() )
So I think it's more like:
versus: That is to say:- TLS is indexed off of a base address, just like data.
- The thread "o" is going to be the same in each expression, so if thread[o] can live in a special-purpose register like %fs then it won't take a general-purpose register. If data can live in a special-purpose register...
Perhaps with a better picture you too will consider TLS is basically the same as non-TLS, except it's sometimes faster because, well, special circuitry.
If anything accessing via "data" is the weird one because even though it looks like you could just set data=0, you'd find almost everyone does it indirectly via %rip, and a variable "foo" is going to be %rip(-5000) in one place and %rip(-4000) in another, but would always be %fs:-26(0) if it were TLS.
I use thread_local a lot, but until recently, on Windows a delay-loaded dll with thread_local would've not worked, and the fix that is in place today is costly, okay that may not be the typical case, but it shows that support for such feature can create a lot of cost elsewhere.
Another pitfall with these is with thread-stealing concurrent schedulers - e.g. your worker thread now waits on something, and the scheduler decides to reuse the current thread for another worker - what is the meaning of thread_local there?
Another one would be coroutines (though haven't used them a lot in C/C++).
they could adopt mold as a linker
Ive always struggled to understand the need to have linker
Like, you could easily write your compiler to do not have to rely on such machinery
Meanwhile they add complexity and decrease quality of error messages (in cpp)
Separate compilation. Of course, if your compiler is fast enough to rebuild the whole universe in 6 seconds and then rest on the seventh — an approach Wirth advocated in one of his papers about an implementation of Pascal system — you won't need a linker. But most compilers are not that fast.
Besides, there is more than one programming language, so that's something we have to deal with somehow.
And to be fair, merging modules in the compiler, as you go by, while not that difficult, is just annoying. If you link them properly together, into big amalgamated text/rodata/data sections, then you need to apply relocations (and have them in the first place). If you just place them next to each other, then you have to organize the inter-module calls via some moral equivalent of GOT/PLT. In any case, all this logic really doesn't have much to do with code generation proper, it's administrativia — and logic for dealing with has already been written for you and packed in the so called "link editor".
What are the bottlenecks that make this so slow
All but the most trivial programs require a linker to resolve references among object files. And while "int main() {}" might seem like a trivial C program, it's not (by that definition, at least).
Your favorite toolchain will often include archives and objects that you might take for granted like crt0.o, init.o, fini.o, libgcc/clang_rt.builtins and more.
The compiler's design is simplified by not having to resolve references among symbols. The assembler can do this for references within a section and linkers can do it among sections. Linkers might have to add trampolines/thunks for relocations that span a distance longer than the opcode could reach. Loaders do this symbol resolution and relocation at runtime.
So, if hello world is not trivial
Then you can easily generate e.g web assembly from your own lang input without linker and it will run fine
Aint the same viable for x86?
> So, if hello world is not trivial
Note that the nontrivial C program I showed above was not "hello world". Emitting output is yet more complicated still.
> you can easily generate e.g web assembly from your own lang input without linker
IIRC some things that generate wasm generate object files (e.g. C translation units). Those wasm objects require a linker to resolve symbol references to create the executable.
> it will run fine
...on a wasm virtual machine. IIRC Java doesn't have anything like a traditional linker either.
> Aint the same viable for x86?
I mean, you can do anything you want inventing a new compiler that does much more than traditional compilers. But consider perhaps that there's a lot of value in the status quo that you might appreciate more if you were more familiar with the design of a toolchain.
It's not that linkers are an absolute drop-dead necessity, it's that having one allows you to have distinct assembler and compiler functions that are much simpler to design by focusing on one kind of transformation.
> Like, you could easily write your compiler to do not have to rely on such machinery
You need a linker as soon as you are dealing with either multiple languages in one project (say, C++ and ASM) or if you include other libraries.
When I first came to C++ from Rust I was surprised by the regularity of linker errors. Rust must be compiled with a linker as well but I don't think I've ever seen a linker error, except when doing exotic things far outside of my typical day-to-day.
I guess rustc detects the situations in which the linker would throw an error and then throws its own error preemptively. It leads to a much better user experience than the C++ one, since the error messages produced by the linker are always unnecessarily terrible
> I guess rustc detects the situations in which the linker would throw an error and then throws its own error preemptively.
Pretty much. The crucial difference between C and Rust which enables Rust to do this sort of detection is that in Rust, the extern things are anchored in modules (crates? whatever), and so when you import things, you have to say where you are importing them from.
versus It even enables one to actually type-check against the imported crate during the compilation (IIRC if Cargo can't locate the imported crate to look into it, it will refuse to build the project), as opposed to hoping that the headers you've included are correctly describing the object file you'll be linking against.Only time I get linker errors in rust is when it's linking some dynamic library written in C.