Code | Lev's developer blog

How “undefined” is undefined behavior?

07/03/2021 levdev

I was browsing StackOverflow yesterday, and encountered an interesting question.
Unfortunately, that question was closed almost immediately, with comments stating: “this is undefined behavior”, as if there were no explanation for what was happening.

But here is the twist: the author of the question already knew it was “undefined behavior”, but he was getting consistent results and wanted to know why.

We all know computers are not random. In fact, getting actual random numbers is so hard for a computer that some people resort to using lava lamps to get them.
So what is this “undefined behavior”?
Why are novice programmers, especially in C and C++, always taught that if you do something that leads to undefined behavior, your computer may explode?

Well, I guess because that’s the easier explanation?
Programming languages that are free for anyone to implement, such as C and C++, have very long, very complicated standard documents describing them.
But programmers using these languages seldom read the standard.
Compiler makers do.

So what does “undefined behavior” really mean?
It means the people who wrote the standard don’t care about a particular edge case, or consider it too complicated to handle in a standard way.
So they say to the compiler writers: “programmers shouldn’t do this, but if they do, it is your problem!”.

This means that for every case of so-called “undefined behavior”, the compiler will still produce some specific code, and that code will do the same specific thing every time the program is run.
And that means that as long as you don’t recompile your source with a different compiler, you will be getting consistent results.
The only caveat is that many undefined behavior cases involve code that handles uninitialized or improperly initialized memory, so if memory values change between runs of the program, results may also change.

The SO question

In the SO question this code was given:

#include <iostream>

using namespace std;

int main() {
   char s1[] = {'a', 'b', 'c'};
   char s2[] = "abc";
   cout << s1 << endl;   // s1 has no terminating '\0'
   cout << s2 << endl;   // s2 is "abc" plus the implicit '\0'
}

According to the person who posted it, when using some online compiler this code consistently printed the first string with an extra character at the end, and the second string correctly.

However, if the line return 0 was added to the code, it would consistently print both strings correctly.

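The undefined behavior happens on line 8, where the code tries to print s1, which is not null-terminated. The stream operator for a char pointer keeps reading bytes until it finds a \0, so it runs past the end of the 3-byte array.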
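For contrast, here is a minimal sketch of a version that stays within defined behavior by telling the stream exactly how many bytes of s1 to write. The function name print_both and the use of a string stream (so the result can be inspected) are illustrative assumptions, not part of the SO question.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: prints both arrays without relying on a '\0' after s1.
std::string print_both() {
    const char s1[] = {'a', 'b', 'c'};  // 3 bytes, no terminator
    const char s2[] = "abc";            // 4 bytes, including '\0'

    std::ostringstream out;
    out.write(s1, sizeof s1) << '\n';   // explicit length, never reads past s1
    out << s2 << '\n';                  // safe: s2 is null-terminated
    return out.str();
}
```

With the length passed explicitly, the result no longer depends on what happens to sit in memory after s1.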

When I ran this code on my desktop machine with g++ 5.5.0 (yes, I am on an old distro), I got a consistent result that looked like this:

abcabc
abc

Adding and removing return 0 did not change the result.

So it seems that the SO crowd was right: this is some random, unreproducible behavior.
Or is it?

The investigation

I decided I wanted to know what the code was really doing.
Will the result change if I reboot the computer?
Will it change if I run the binary on another computer?

It is tempting to say “yes” because I am sure that like me, at least some of you imagine that without the \0 character in the s1 array, cout will continue trying to print some random memory that may contain anything.

But, that is not at all what happened here!
In fact, I was surprised to learn that my binary will consistently and reliably print the output shown above, and even changing compiler optimization flags didn’t do anything to alter that.

To find out what the code was really doing, I asked the compiler to output assembly instead of machine code.

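This revealed some very interesting things:

First, return 0 doesn’t matter!
It turns out that whether or not I add the line, gcc outputs the code needed to cleanly exit main with return value 0 every time. This is no accident: in C++, reaching the end of main without a return statement has the same effect as return 0.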

	movl	$0, %eax
	movq	-8(%rbp), %rdx
	xorq	%fs:40, %rdx
	je	.L3
	call	__stack_chk_fail
.L3:
	leave

Notice the first line: under the x86-64 calling convention used on Linux, an integer result is returned in the EAX register.
The rest is the stack protector check and cleanup code, which is also automatically generated.
Interestingly enough, turning on optimization replaces this line with xorl %eax, %eax, which is a shorter instruction and is considered faster on the x86 architecture.

So, now we know that at least on my version of gcc the return statement makes no changes to the code, but what about the strings?

I expected the string literal "abc" to be stored somewhere in the data segment of the produced assembly file, but I could not find it.
Instead, I found this code for local variable initialization, which writes the values directly to the stack:

	movb	$97, -15(%rbp)
	movb	$98, -14(%rbp)
	movb	$99, -13(%rbp)
	movl	$6513249, -12(%rbp)

The first 3 lines are pretty obvious: these are the ASCII codes for ‘a’, ‘b’ and ‘c’, so this is our s1 array.
But what is this 6513249 number?
Is it a pointer? Can’t be! It’s impossible to hardcode pointers on modern systems, and what would it point to, anyway?
The string “abc” is nowhere to be found in the compiler output… Or is it?

If we look at this value in hex, it all becomes clear:

                       +----+----+----+----+   +---+---+---+----+
6513249 = 0x00636261 = | 61 | 62 | 63 | 00 | = | a | b | c | \0 |
                       +----+----+----+----+   +---+---+---+----+

Yep, that’s our string!
Looks like despite optimizations being turned off, gcc decided to take this 4-byte string and turn it into an immediate 32-bit value (keep in mind x86 is little-endian, so the first byte in memory is the lowest byte of the number).
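The conversion between a short string and such an immediate can be sketched in a few lines. This is only an illustration of little-endian packing; the names pack_le and unpack_le are hypothetical helpers, not anything gcc exposes.

```cpp
#include <array>
#include <cstdint>

// Pack four bytes into a 32-bit value the way a little-endian store sees them:
// the first byte in memory becomes the least significant byte of the integer.
std::uint32_t pack_le(const std::array<unsigned char, 4>& bytes) {
    return static_cast<std::uint32_t>(bytes[0])
         | static_cast<std::uint32_t>(bytes[1]) << 8
         | static_cast<std::uint32_t>(bytes[2]) << 16
         | static_cast<std::uint32_t>(bytes[3]) << 24;
}

// Reverse direction: split a 32-bit immediate back into memory-order bytes.
std::array<unsigned char, 4> unpack_le(std::uint32_t value) {
    return {{
        static_cast<unsigned char>(value & 0xFF),
        static_cast<unsigned char>((value >> 8) & 0xFF),
        static_cast<unsigned char>((value >> 16) & 0xFF),
        static_cast<unsigned char>((value >> 24) & 0xFF),
    }};
}
```

Calling pack_le on the bytes 'a', 'b', 'c', '\0' gives the number a compiler would use to store "abc" with a single 32-bit move, and unpack_le turns any such immediate back into its characters.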

So there you have it: the code the compiler generates writes the byte values of both arrays onto the stack in sequence, s1 at offsets -15 to -13 and s2 right after it at -12 to -9.
So no matter what, the resulting binary will find the null character at the end of the second string (s2) and will never try to print random values from memory it is not supposed to access.
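To see why the output comes out the way it does, the seven stack bytes can be modeled as one ordinary array, which is perfectly legal to read as a C string. This is a sketch of the layout described above, not the real stack frame; kFrameBytes and read_c_string are hypothetical names.

```cpp
#include <string>

// Hypothetical model of the bytes at -15(%rbp) .. -9(%rbp):
// s1 ('a','b','c') immediately followed by s2 ('a','b','c','\0').
const char kFrameBytes[] = {'a', 'b', 'c', 'a', 'b', 'c', '\0'};

// Reads characters starting at p until the first '\0', like operator<< does.
std::string read_c_string(const char* p) {
    return std::string(p);
}
```

Reading from the start of the model (where s1 lives) yields "abcabc", and reading from offset 3 (where s2 lives) yields "abc", which matches the two lines the binary printed.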

Of course, it is possible that a different compiler, a different version of gcc, or just compiling for a different architecture will compile this code in a different way, that would result in a different behavior.
There is no guarantee these two local variables will wind up on the stack neatly in a row.

But for now, we seem to have a well defined behavior, though it is not defined by the C++ standard, but by the generated code.

I didn’t write this post to encourage anyone to use code like this anywhere or to claim that we should ignore “undefined behavior”.
I simply want to point out that sometimes it can be educational to look at what your high level code is really doing under the hood, and that with computers, there is often order where people think there is chaos, even if this order isn’t immediately apparent.

I think this question would have made for an interesting and educational discussion for SO, and perhaps next time one comes up, someone will give a more detailed answer before it is closed.

I certainly learned something interesting today.
Hope someone reading this will too!

Cheers.

Categories: Code, Random and fun Tags: Assembly, C++, Code, gcc, Linux

Say hello to the new math!

27/04/2018 levdev

Yesterday at my company, some of my department stopped work for a day to participate in an “enrichment” course.

We are a large company that can afford such things, and it can be interesting and useful.

We even have the facilities for lectures right at the office (usually used for instructing new hires) so it is very convenient to invite an expert lecturer from time to time.

This time the lecture was about Angular.

As I am not a web developer, I was not invited, but the lecture room is close to my cubicle so during the lunch break I got talking to some of the devs that participated. And of course we got into an argument about programming languages, particularly JavaScript.

I wrote before on this blog that I hate JS with a vengeance. I think it deserves to be called the worst programming language ever developed.

Here are my three main peeves with it:

The == operator. When you read a programming book for beginners that teaches a specific language and it tells you to avoid using a basic operator because the logic behind it is so convoluted even experienced programmers will have a hard time predicting the result of a comparison, it kills the language for me. I know JS isn’t the only language to use == and ===, but that does not make it any less awful!
It has the eval built-in function. I already dedicated a post to my thoughts on this.
The language is now in its 6th major iteration, and everyone uses it as OO, but it still lacks proper syntax support for classes and other OO features.
This makes any serious piece of modern code written in it exceedingly messy and ugly.

But if those 3 items weren’t enough to prefer Brainfuck over JS, yesterday I got a brand new reason: it implements new math!

Turns out, if you divide by 0 in JS YOU GET INFINITY!!!

Yes, I am shouting. I have a huge problem with this. Because it breaks math.

Computers rely on math to work. Computer programs can not break it. This is beyond ridiculous!

I know JS was originally designed to allow people who were not programmers (like graphic designers) to build websites with cool looking features like animated menus. So the entire language was built upon the “on error resume next” paradigm.

And this was ok for its original purpose, but not when the language has grown to be used for building word processors, spreadsheet editors, and a whole ecosystem of complex applications billions of people use every day.

One of the first things every programmer is taught is to be ready to catch and properly handle errors in their code.

I am not a mathematician. In fact, math is part of the reason I never finished my computer science degree. But even I know you can not divide by zero.

Not because your calculator will show an error message. But because it breaks the basic rules of arithmetic.

Think of it this way:
If 5 ÷ 1 = 5 means that 1 × 5 = 5, then 5 ÷ 0 = ∞ must mean that 0 × ∞ = 5. Oops, you just broke division, multiplication and addition.

But maybe I am not explaining this right? Maybe I am missing something?
Try this TED-ed video instead, it does a much better job:

If JS wanted to avoid an exception and keep the math, they could have put a NaN there.

But they didn’t. And they broke the rules of the universe.

So JS sucks, and since it is so prevalent and keeps growing, we are all doomed.

Something to think about… 😛

Categories: Code, Rants Tags: Bad, Code, JavaScript, Languages, Math, Web

Beware Java’s half baked generics

13/10/2016 levdev

Usually I don’t badmouth Java. I think it’s a very good programming language.

In fact, I tend to defend it in arguments on various forums.

Sure, it lacks features compared to some other languages, but then again throwing everything including the kitchen sink into a language is not necessarily a good idea. Just look at how easy it is to get a horrible mess of code in C++ with a single operator doing different things depending on context. Is &some_var taking the address of a variable or declaring a reference? And what does &&some_var do? It has nothing to do with the boolean AND operator!

So here we have a simple language friendly to new developers, which is good because there are lots of those using it on the popular Android platform.

Unfortunately, even the best languages have some implementation detail that will make you want to lynch their creators or just rip out your hair, depending on whether you externalize your violent tendencies or not.

Here is a short code example that demonstrates a bug that for about 5 minutes made me think I was high on something:

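import java.util.HashMap;

// The class name is arbitrary; it only makes the snippet compile on its own.
public class AnswerLookup {
	public static void main(String[] args) {
		HashMap<Integer, String> map = new HashMap<>();

		// The same value, once as a byte and once widened to an int
		byte a = 42;
		int b = a;

		map.put(b, "The answer!");

		if (map.containsKey(a))
			System.out.println("The answer is: " + map.get(a));
		else
			System.out.println("What was the question?");
	}
}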

What do you expect this code to print?

Will it even compile?

Apparently it will, but the result will surprise anyone who is not familiar with Java’s generic types.

Yes folks – the key will not be found and the message What was the question? will be printed.

Here is why:

The generic types in Java are not fully parameterized. Unlike a proper C++ template, some methods of generic containers take parameters of type Object, instead of the type the container instantiation was defined with.

For HashMap, even though its put is properly parameterized and will raise a compiler error if the wrong type of key is used, the get and containsKey methods take a parameter of type Object and will not even throw a runtime exception if the wrong type is provided. They will simply return null or false respectively, as if the key was simply not there.

The other part of the problem is that primitive types such as byte and int are second class citizens in Java. They are not objects like everything else and can not be used to parameterize generics.

They do have object equivalents named Byte and Integer but those don’t have proper operator overloading so are not convenient for all use cases.

Thus in the code sample above the variable a gets autoboxed to Byte, which as far as Java is concerned is a completely different type that has nothing to do with Integer, and therefore a Byte key can never match anything in an Integer map.

A language that implements proper generics would have parameterized these methods, so that either a compilation error occurred or an implicit conversion was made.
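For comparison, here is a rough C++ counterpart of the Java snippet, using std::unordered_map, whose find takes the key type itself. It is only a sketch: signed char stands in for Java's byte, and the function name lookup_answer is a hypothetical choice for illustration.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical C++ counterpart of the Java example above.
std::string lookup_answer() {
    std::unordered_map<int, std::string> map;

    signed char a = 42;  // closest analogue of Java's byte
    int b = a;

    map.emplace(b, "The answer!");

    // find() expects an int key, so a is implicitly converted to int here.
    auto it = map.find(a);
    if (it != map.end())
        return "The answer is: " + it->second;
    return "What was the question?";
}
```

Because the lookup parameter has the key type, the narrower integer is converted before hashing and comparison, so the entry is found.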

In Java, it is up to you as a programmer to keep your key types straight, even between seemingly compatible types like integers of various sizes.

In my case I was working with a binary protocol received from an external device and the function filling up the map was not the same one reading from it, so it was not straightforward to align types everywhere. But in the end I did it and learned my lesson.

Maybe this long rant will help you too. At least until a version of Java gets this part right…

Categories: Code, Tips and tricks Tags: Code, Development, Generics, Java, Type safety

Are Google coders bored?

30/08/2013 levdev

I was browsing the Android source code to try and understand some things about ActionBar layout, when I ran into another little pearl showcasing the Android programmers’ sense of humor, or is it level of boredom?

You decide…

Looking at an older version of ActionBarView.java, I found a member variable called mUpGoerFive (look at line 104 in the link provided).

It held a ViewGroup, so it was important for the display part, but the name did not make sense at first.

Until I remembered this little beauty: http://xkcd.com/1133/

What’s even funnier, while I was looking for a way to link to the proper version of the source file (this variable is removed in the latest version), I ran into the following commit message:
“Invasion of the monkeys”

I know, these are not the first easter eggs of this kind found in code released by Google, and maybe I am not the first to find them (if you have seen this elsewhere, please leave a comment), but they did provide some entertainment during an otherwise tedious task, so I figured I’d mention them.

Categories: Random and fun Tags: ActionBar, Android, Code, Easter egg, Google, Humor, Java, Up goer five, xkcd
