Twenty or more years ago there was a peephole optimizer that took the code generated by Turbo Pascal (and other compilers that were not very advanced, even for their time) and rearranged instructions, removed holes, and replaced instructions with better sequences (it turned out some of the "macro" instructions were not that fast).
Nowadays, calling a function in a .so/.dylib/.dll, or accessing a thread-local variable, also generates a lot of cruft code that could possibly be optimized away once the data is loaded. It won't always work (shared libraries can be unloaded), but with enough hints, or on the assumption that unloading never happens, one can gain some benefit. On the negative side, this may reduce code sharing across processes.
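To make the thread-local case concrete, here is a rough sketch in C11 of doing by hand what such an optimizer might do: resolve the address once, then use a plain pointer. The names `counter`, `reset_counter`, `bump_naive` and `bump_hoisted` are made up for illustration, and a modern compiler may well do this hoisting on its own, so treat it as a picture of the idea rather than a guaranteed speedup.

```c
#include <assert.h>

/* Hypothetical per-thread counter, used only for illustration (C11). */
_Thread_local long counter;

/* Set the calling thread's counter back to zero. */
void reset_counter(void)
{
    counter = 0;
}

/* Naive form: the thread-local is named on every iteration, so a compiler
   that does not hoist the lookup may recompute its address each time. */
long bump_naive(int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        counter += i;
    return counter;
}

/* Hoisted form: take the address once, then work through a plain pointer.
   The pointer is only meaningful inside the thread that took it. */
long bump_hoisted(int n)
{
    assert(n >= 0);
    long *p = &counter;
    for (int i = 0; i < n; i++)
        *p += i;
    return *p;
}
```

Both functions add 0 through n-1 to the calling thread's counter, so after a reset either one returns 45 for n = 10. The only difference is where the address of the thread-local gets computed.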