Hacker News

It remains unclear what behaviour compilers currently provide (or should provide) for this.

It might be nice if future surveys explicitly asked a follow-up question: "Regardless of the standard or the behavior of existing compilers, is there one of these answers that is the 'obviously correct' manner in which compilers should behave? Which one?"

If practically all users believe that the same answer is 'obviously correct', compiler writers might want to take this into account when deciding which behavior to implement.

  For MSVC, one respondent said:

  "I am aware of a significant divergence between the LLVM 
  community and MSVC here; in general LLVM uses "undefined 
  behaviour" to mean "we can miscompile the program and get 
  better benchmarks", whereas MSVC regards "undefined 
  behaviour" as "we might have a security vulnerability so 
  this is a compile error / build break". First, there is 
  reading an uninitialized variable (i.e. something which 
  does not necessarily have a memory location); that should 
  always be a compile error. Period. Second, there is reading 
  a partially initialised struct (i.e. reading some memory 
  whose contents are only partly defined). That should give a 
  compile error/warning or static analysis warning if 
  detectable. If not detectable it should give the actual 
  contents of the memory (be stable). I am strongly with the 
  MSVC folks on this one - if the compiler can tell at 
  compile time that anything is undefined then it should 
  error out. Security problems are a real problem for the 
  whole industry and should not be included deliberately by 
  compilers."
I'm much less familiar with MSVC than the alternatives, but this is a refreshing approach. Yes, give me a mode that refuses to silently rewrite undefined behavior. Is MSVC possibly able to take this approach because it isn't trying to be compliant with modern C standards? Does it actually reduce the ability to apply useful optimizations? Or is it just a difference in philosophy?


While I am quite sympathetic to the "break the build on undefined behavior" position, I have a nit to pick in the quoted passage.

Initialization of a variable has no relation to whether it has a memory location. You can legitimately take the address of an uninitialized variable (that is often one step in initializing them, such as when you pass the address to memset), and even an initialized variable may not have a memory location (if it lives only in register, or is compiled away entirely).


To pick your nit, note that they specifically said reading an uninitialized or partially initialized variable, not reading the address of the same.


I didn't mean to imply (and don't think I did...) that reading an uninitialized variable is not a problem, just that the problem is entirely unrelated to whether it has a memory location.


Hate to ask the stupid question, but I've been wondering and it seems to be along the same lines... Why can't this be valid?

int f; f = 2;

To me this says, there is an int pointer f. Let f point to 2. Is this not possible only because 2 does not occupy memory? I don't see why this couldn't be valid.


Assuming you're asking about

    int *f; *f = 2;

The statement,

    *f = 2;
says "dereference f to get a location; set the contents of that location to 2". "Set f to the location of 2" would be spelled

    f = &2;
This is partly a matter of the difference between assignment and definition(/equality).

The first of these is invalid C (undefined behavior) because it uses a value (the contents of f) before initialization.

The second is invalid C (compile error) because 2 is not an l-value (that is, it does not have a location).

Edited to add: I will note that I don't think there is any reason

    *f = 2;
couldn't mean to give f the address of 2, by analogy to structural pattern matching, but that syntax is already taken for something else.


And you might legitimately want to access a memory location that was not written by this program, for instance a chunk of dual-port memory, a shared memory block, a piece of memory-mapped hardware and so on.

In those cases some compiler override could provide the solution, while still allowing the compiler to flag all the other cases as errors.


Your examples are not expected to work unless you specifically ask the compiler for it.

If you use the volatile keyword, then you are guaranteed that all reads and stores will actually happen and not be optimized out. But volatile still allows non-atomicity and reordering, so you probably want something which prohibits those too.


You mean volatile memory?


None of those are going to be autos though, so the behavior isn't undefined.


> If practically all users believe that the same answer is 'obviously correct', compiler writes might want to take this into account

The problem with this democratic approach is that most users are not qualified enough for their opinion to be particularly valuable.

The small minority who don't find it "obviously correct" may actually, in some cases, be the minority with a clue.

It could work if only those users are given a vote who pass a language lawyer exam.


> The problem with this democratic approach is that most users are not qualified such that their opinion is particularly valuable.

They aren't advocating a democratic approach, they are saying that the general expectations of users is valuable information.

You don't rely exclusively on what the users want, but users' expectations of how compilers behave are very valuable information, whether that means finding out how to inform users of the real behavior or changing the behavior.


"First, there is reading an uninitialized variable (i.e. something which does not necessarily have a memory location); that should always be a compile error. Period."

You cannot implement that efficiently without rejecting valid programs. Consider code like this:

  int x;
  if( f()) x = 1;
  if( g()) h(x);
For sufficiently complex functions f and g, there's no reasonable way to decide at compile time whether x will be set whenever g() returns true. For example, f() might always return false because Fermat's last theorem is true.

And the 'reasonable' likely isn't even a necessary part of that statement: deciding it in general is undecidable.


Yes, this is exactly the reason for undefined behavior. It is often overlooked in the "well, the compiler should just throw an error instead of invoking undefined behavior" debate: the compiler doesn't just "choose to invoke" undefined behavior: it relies on it to do program analysis.

For example, your example could legally be rewritten to:

    int x;
    f();
    x = 1;
    if (g()) { h(x); }
The only difference arises when f() is false and x is then accessed, yielding 1; but reading x in that case is undefined behavior, so the compiler can ignore it. In fact, assuming x is not accessible outside this block:

    f();
    if (g()) { h(1); }
These optimizations happen all over the place; it's not the compiler invoking or causing undefined behavior, but assuming that it won't ever happen.

EDIT: Note the further optimization that looms: if f() can be proven pure (no side effects), then it can be removed. This makes little sense for a function with no arguments (in which case it would just be a constant). If, however, f(y, ...) is some expensive but pure function, it can just be removed completely.


"Note the further optimization that looms: if f() can be proven pure (no side effects), then it can be removed. This makes little sense for a function with no arguments (in which case it would just be a constant)."

It could apply if f computes a value based on global state but doesn't change anything, which is slightly weaker than "pure".


Why does this have to be a valid program? There is obviously a chance that x is undefined at line 3 so why not allow the compiler to throw an error?


Not necessarily. Just because the compiler can't prove that g() returns true only when f() also returned true doesn't mean the programmer doesn't know it to actually be true.

So changing that behavior means that the compiler now rejects 40 years worth of correctly working legacy code (and some buggy code, as well). Newer languages (e.g. Java, C#) that don't have to support existing code can afford to do what you want and reject programs where a simple heuristic isn't enough to tell whether a variable is initialized or not.


You can already break a lot of programs with -Wall -Werror (I generally compile my code with both, but turn them off for libraries). Just hide the amazing optimisation behind an -foptimise-undefined-behaviour which you can only turn on if you've specified at least a significant portion of -Wall.


So don't raise an error, instead publish a warning.


I think the idea is to give an error when there is absolutely no way that variable could have been initialised when it's read, i.e. cases like this:

    int x;
    h(x);
The point is not to solve the Halting Problem, but to catch code that is so obviously wrong that no programmer would deliberately write it (unless they were testing the compiler's reaction.)


If so, "that should always be a compile error. Period." is IMO a poor way to express that idea.

On top of that, the C compilers I know more or less have that already: they give warnings for basically the cases where Java (with its stricter rules) would refuse to compile equivalent code, and they have a flag that turns warnings into errors.


The Java compiler WILL reject this program, because x is not initialised on all code-paths. This is not a valid program.


That should be rejected as an invalid program, you'd fix it by initializing x to a sane default.


Or by putting the second if inside the body of the first.

Often there is no sane default value.


Putting the second if inside the body of the first is not necessarily equivalent.

g() might have side-effects, and those side-effects have to happen regardless of whether or not f() returned true.


Then the correct code was something like:

    if (f()) {
        int x = 1;

        if (g()) {
            h(x);
        }

    } else {
        g();
    }
If g() can be called before f() then it could also be written as:

    int g_flag = g();

    if (f()) {
        int x = 1;

        if (g_flag) {
            h(x);
        }
    }
If you really only had one assignment to x, and both f() and g() have side effects that must be run in-order, this can be more clearly written as:

    int f_flag = f(); // force f() to run for ${reasons}
    int g_flag = g(); // force g() to run for ${reasons}

    if (f_flag && g_flag) {
        h(1);
    }
This is a lot clearer about what the code is doing, and the compiler is probably going to optimize away the two int flag variables anyway.

You should never rely on uninitialized values, not just because of the problems relating to undefined behavior, but also because you're adding assumptions about the runtime state. The code depended on the return value of f() but did not fully express that dependency in the code.


It's true that if the only change you make is wrapping braces from "x = 1;" through the end of the quoted code, you would need to be sure that g does not have desired side effects. Otherwise, you could lift both the f() and g() calls above the branching (which still leaves the second if inside the body of the first, as I described).


Valid as well.


My guess would be that LLVM is not so much motivated by better benchmarks. You could always make it a parameter. The bigger issue usually is: if we mark something as an error which is actually allowed by the C standard, then lots of big old libraries will not compile anymore and nobody is willing to fix all the old code.


If we miscompile (or outright delete) something which is marked "undefined behaviour" by the C standard, then lots of old libraries will build without errors, but fail in various fun ways during use, including having severe security issues. I think failing the build is better than giving me a library with deleted NULL checks and whatnot.


> then lots of big old libraries will not compile anymore and nobody is willing to fix all the old code.

This raises one question, in my opinion. Should we keep using old libraries that nobody is maintaining anymore? Isn't that a big security issue?


Can it be put behind a flag? If some library fails to compile, toss the flag into the toolchain invocation and try again?


MSVC will very confidently warn on cases like this by default:

    int foo() {
        int bar;
        return bar + 5; /* C4700: local variable 'bar' used without having been initialized */
    }
https://msdn.microsoft.com/en-us/library/axhfhh6x.aspx

In the same situation, GCC says "may be used uninitialized in this function" if you enable the warning (-Wmaybe-uninitialized), despite this being a trivial and certain case.


gcc 4.9.2 gives: warning: 'bar' is used uninitialized in this function

Maybe you are using an old version.

If you enable -Wall (which you always should), then this flag is automatically enabled.

https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#inde...


-Wuninitialized has been broken in GCC for over a decade: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=18501


As if we needed more proof that GCC is a fucking joke...


Hey Linus, how are you doing!


While I like the MSVC team's approach to this, I don't think it's necessarily worse to just warn, as long as that warning works consistently. I mean, just add some warning options to your compile script and you are good to go. I get this warning from GCC very consistently, and in practice it's as good as a compilation error.


"First, there is reading an uninitialized variable (i.e. something which does not necessarily have a memory location); that should always be a compile error. Period."

You're out of luck. You have to solve the halting problem to statically determine whether or not a variable will be read while it's uninitialized. The reason this is not solved well is that it's impossible to solve perfectly! Java took another approach, which I hate: for example, if you have

    Object a;
    for(int i = 0; i < 1;++i) a = new Object();
    a.toString();
then Java will give you

    error: variable a might not have been initialized  
even though it's plain and obvious that a is initialized. Things like that make me mad; it's a non-solution.


I'm not asking for the compiler to consistently identify undefined behavior, only to have a mode where it refuses to silently make 'optimizations' when it does identify UB.

The parallel to your example would be if the for loop was "for(int i = 0; i < j;++i)". If the compiler was able to determine that there is a code path whereby j might be undefined, should it be allowed to remove the body of the loop, even in those cases where the programmer knows by other means that "j >= 1"?

My request is that it either keep the loop body, or complain about the undefined behavior, but not silently make 'optimizations' based on the fact that it has identified the potential for undefined behavior to occur.

Note that I'm just using an 'uninitialized variable' as a hypothetical example. Given a chance, I always compile with -Wall -Wextra, and in practice, GCC, CLang, and ICC (the compilers I use) do a good job of issuing warnings for the use of uninitialized variables. I like this current behavior, but would prefer a philosophical approach that makes warnings like this more rather than less common.


I agree in cases where the compiler knows that undefined behavior is taking place. A lot of the silent optimizations LLVM and GCC make are in cases where the compiler isn't really sure it has identified undefined behavior, though.

To put it in classical logic terminology, one case is modus ponens reasoning. Undefined behavior implies the compiler can do whatever it wants. The compiler finds undefined behavior. Therefore it does whatever it wants. This is the case where it'd be better for the compiler to error out than do something nutty.

But many of the optimizations are doing modus tollens reasoning. If X were true, then the program would perform undefined behavior. Conforming programs do not perform undefined behavior. Therefore NOT-X must hold in conforming programs, and this fact can be used in optimizations.


> If the compiler was able to determine that there is a code path whereby j might be undefined, should it be allowed to remove the body of the loop, even in those cases where the programmer knows by other means that "j >= 1"?

No; rather, the correct logic is that the compiler must preserve the body of the loop if there is any possibility that it can be reached by a valid code path without undefined behavior (j is defined, and so forth). Only if the compiler can prove that no well-defined execution path can reach the body may it remove it.

(A bad idea to do without any warning, though. If undefined behavior is confirmed, it should be diagnosed.)


> only to have a mode where it refuses to silently make 'optimizations' when it does identify UB.

That's not always possible. It's not that it makes the optimization when it identifies UB. It's that it makes an optimization that is valid to make if UB doesn't occur, but if UB were to occur then that optimization could cause all kinds of unexpected problems. But the compiler can't necessarily identify those cases.

Please read the "what every C programmer should know about undefined behavior" series of articles from LLVM; they describe the reason why they can't, in general, provide warnings or errors for these cases in which optimizations rely on lack of undefined behavior:

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

http://blog.llvm.org/2011/05/what-every-c-programmer-should-...

The third article describes why the compiler can't, in general, warn about those cases in which it's relying on lack of UB, but you should read the first two as well.

Note that for some of those cases, clang and GCC have recently added undefined behavior sanitizers, invoked via "-fsanitize=undefined", which can help even more than the warnings they can add. However, what they do is add extra instrumentation to the executable, and then either log a warning or crash when you hit undefined behavior. The runtime approach sidesteps the "getting this right would involve solving the halting problem" obstacle to providing appropriate compile-time warnings, but it does mean that this is generally only appropriate in test builds, and that you will only find the undefined behavior that you can trigger during test, while there may be more hiding that only shows up in obscure circumstances.

If you really don't want undefined behavior, it's best to use a language, like Rust, which does not have any undefined behavior (outside of "unsafe" blocks). The problem with any kind of warnings that are tacked on after the design of the language is that you are either going to get lots of false positives, lots of false negatives, or both. With a language that is designed not to allow undefined behavior, you know that if the code compiles, it doesn't invoke UB.


Fyi Rust has solved this problem; the compiler forbids the use of uninitialized variables. Compile error example:

    let x;
    println!("x:{}", x);


Not a rust programmer so forgive me if this is a dumb question.

Sometimes in C one might initialize a variable by passing a pointer to it to an init function:

  void f( void )
  {
    int i;
    bool success;

    success = init( &i );

    if ( success )
      do_stuff( i );
  }
That "init" function might be located in a separate .c file, so there's no way for the compiler to know whether or not the memory whose address is passed to init gets initialized. So how can Rust "solve" the problem? Does Rust simply not allow taking addresses of variables? Or does it not use .o files, compile all code files at once and actually analyze globally for uninitialized variables?


The `init` function you use would not be valid. You would instead write something like this:

    fn f() {
        if let Some(i) = init() {
            do_stuff(i);
        }
    }
In this case, the `init` function would return an `Option<i32>`. In a failure state, this would return `None`, and the pattern match would fail. In a success state, this would return `Some(i)`, where i corresponds to the variable you describe.

The Rust pattern is not only safer, but briefer than yours. It describes the code flow such that you can't remove or repeat a part and end up with inadvertently broken code, and it's memory safe. There is no way for `init` to blow up the stack (whereas in your example, a malicious or buggy init can use the address of i to smash the stack.)


Rust doesn't allow this direct use-case. You can have conditionally-successful initialization by returning an `Option` or `Result`, however.


This should be an error and you should be required to write

    int i = 0; // or some other default value


And dozens of other languages.


I love Java's behaviour. It gives a few false positives as you've noticed, but you can fix that just by doing

    Object a = null;
and go on your way. You do run the risk of a null pointer exception if you don't assign it a valid object reference, though.


The main problem with breaking on undefined behavior is detecting it in the first place. It's not a choice not to warn or break; it's an implementation challenge to find those cases.


I think one could reliably break (more safely) on undefined behavior by inserting defensive run-time checks. Whether that's worth the performance hit is liable to be domain specific.



