Archive for March, 2021

How “undefined” is undefined behavior?

07/03/2021 Leave a comment

I was browsing StackOverflow yesterday, and encountered an interesting question.
Unfortunately, that question was closed almost immediately, with comments stating: “this is undefined behavior” as if there was no explanation to what was happening.

But here is the twist: the author of the question knew already it was “undefined behavior”, but he was getting consistent results, and wanted to know why.

We all know computers are not random. In fact, getting actual random numbers is so hard for a computer, some people result to using lava lamps to get them.
So what is this “undefined behavior”?
Why are novice programmers, especially in C and C++ are always thought that if you do something that leads to undefined behavior, your computer may explode?

Well, I guess because that’s the easier explanation?
Programming languages that are free for anyone to implement, such as C and C++, have very long, very complicated standard documents describing them.
But programmers using these languages seldom read the standard.
Compiler makers do.

So what does “undefined behavior” really mean?
It means the people who wrote the standard don’t care about a particular edge case, or consider it too complicated to handle in a standard way.
So they say to the compiler writers: “programmers shouldn’t do this, but if they do, it is your problem!”.

This means that for every so called “undefined behavior” a specific code will be produced by the compiler, that will do the same specific thing every time the program is ran.
And that means, that as long as you don’t recompile your source with a different compiler, you will be getting consistent results.
The only caveat being that many undefined behavior cases involve code that handles uninitialized or improperly initialized memory, and so if memory values change between runs of the program, results may also change.

The SO question

In the SO question this code was given:

#include <iostream>

using namespace std;

int main() {
   char s1[] = {'a', 'b', 'c'};
   char s2[] = "abc";
   cout << s1 << endl;
   cout << s2 << endl;

According to the person who posted it, when using some online compiler this code consistently printed the first string with an extra character at the end, and the second string correctly.

However, if the line return 0 was added to the code, it would consistently print both strings correctly.

The undefined behavior happens in line 8 where the code tries to print s1 which is not null terminated.

When I ran this code on my desktop machine with g++ 5.5.0 (yes, I am on an old distro), I got a consistent result that looked like this:


Adding and removing return 0 did not change the result.

So it seems that the SO crowd were right: this is some random, unreproducible behavior.
Or is it?

The investigation

I decided I wanted to know what the code was really doing.
Will the result change if I reboot the computer?
Will it change if I ran the binary on another computer?

It is tempting to say “yes” because I am sure that like me, at least some of you imagine that without the \0 character in the s1 array, cout will continue trying to print some random memory that may contain anything.

But, that is not at all what happened here!
In fact, I was surprised to learn that my binary will consistently and reliably print the output shown above, and even changing compiler optimization flags didn’t do anything to alter that.

To find out what the code was really doing, I asked the compiler to output assembly instead of machine code.

This reveled some very interesting things:

First, return 0 doesn’t matter!
Turns out, regardless of me adding the line, gcc would output the code needed to cleanly exit main with return value 0 every time:

	movl	$0, %eax
	movq	-8(%rbp), %rdx
	xorq	%fs:40, %rdx
	je	.L3
	call	__stack_chk_fail

Notice the first line: it is Linux x86-64 ABI convention to return the result in EAX register.
The rest is just cleanup code that is also automatically generated.
Interestingly enough, turning on optimization will replace this line with xorl %eax, %eax wich is considered faster on x86 architecture.

So, now we know that at least on my version of gcc the return statement makes no changes to the code, but what about the strings?

I expected the string literal "abc" to be stored somewhere in the data segment of the produced assembly file, but I could not find it.
Instead, I found this code for local variable initialization that is pushing values to the stack:

	movb	$97, -15(%rbp)
	movb	$98, -14(%rbp)
	movb	$99, -13(%rbp)
	movl	$6710628, -12(%rbp)

The first 3 lines are pretty obvious: these are ASCII codes for ‘a’, ‘b’ and ‘c’, so this is our s1 array.
But what is this 6710628 number?
Is it a pointer? Can’t be! It’s impossible to hardcode pointers on modern systems, and what would it point to, anyway?
The string “abc” is nowhere to be found in the compiler output… Or is it?

If we look at this value in hex, it all becomes clear:

                       +----+----+----+----+   +---+---+---+----+
6710628 = 0x00666564 = | 64 | 65 | 66 | 00 | = | a | b | c | \0 |
                       +----+----+----+----+   +---+---+---+----+

Yep, that’s our string!
Looks like despite optimizations being turned off, gcc decided to take this 4 byte long string and turn it in to an immediate 32 bit value (keep in mind x86 is little-endian).

So there you have it: the code the compiler generates pushes the byte values of both arrays on to the stack in sequence.
So no matter what, the resulting binary will find the null character at the end of the second string (s2) and will never try to print random values from memory it is not suppose to access.

Of course, it is possible that a different compiler, a different version of gcc, or just compiling for a different architecture will compile this code in a different way, that would result in a different behavior.
There is no guarantee these two local variables will wind up on the stack neatly in a row.

But for now, we seem to have a well defined behavior, dough it is not defined by the C++ standard, but by the generated code.

I didn’t write this post to encourage anyone to use code like this anywhere or to claim that we should ignore “undefined behavior”.
I simply want to point out that sometimes it can be educational to look at what your high level code is really doing under the hood, and that with computers, there is often order where people think there is chaos, even if this order isn’t immediately apparent.

I think this question would have made for an interesting and educational discussion for SO, and perhaps next time one comes up, someone will give a more detailed answer before it is closed.

I certainly learned something interesting today.
Hope someone reading this will too!


Categories: Code, Random and fun Tags: , , , ,
%d bloggers like this: