Code | Lev's developer blog

How “undefined” is undefined behavior?

07/03/2021 levdev 1 comment

I was browsing StackOverflow yesterday, and encountered an interesting question.
Unfortunately, that question was closed almost immediately, with comments stating “this is undefined behavior”, as if there were no explanation for what was happening.

But here is the twist: the author of the question already knew it was “undefined behavior”, but he was getting consistent results and wanted to know why.

We all know computers are not random. In fact, getting actual random numbers is so hard for a computer that some people resort to using lava lamps to get them.
So what is this “undefined behavior”?
Why are novice programmers, especially in C and C++, always taught that if you do something that leads to undefined behavior, your computer may explode?

Well, I guess because that’s the easier explanation?
Programming languages that are free for anyone to implement, such as C and C++, have very long, very complicated standard documents describing them.
But programmers using these languages seldom read the standard.
Compiler makers do.

So what does “undefined behavior” really mean?
It means the people who wrote the standard don’t care about a particular edge case, or consider it too complicated to handle in a standard way.
So they say to the compiler writers: “programmers shouldn’t do this, but if they do, it is your problem!”.

This means that for every so-called “undefined behavior”, specific code will be produced by the compiler, and that code will do the same specific thing every time the program is run.
And that means that as long as you don’t recompile your source with a different compiler, you will be getting consistent results.
The only caveat being that many undefined behavior cases involve code that handles uninitialized or improperly initialized memory, and so if memory values change between runs of the program, results may also change.

The SO question

In the SO question this code was given:

#include <iostream>

using namespace std;

int main() {
   char s1[] = {'a', 'b', 'c'};
   char s2[] = "abc";
   cout << s1 << endl;
   cout << s2 << endl;
 }

According to the person who posted it, when using some online compiler this code consistently printed the first string with an extra character at the end, and the second string correctly.

However, if the line return 0 was added to the code, it would consistently print both strings correctly.

The undefined behavior happens in line 8 where the code tries to print s1 which is not null terminated.

When I ran this code on my desktop machine with g++ 5.5.0 (yes, I am on an old distro), I got a consistent result that looked like this:

abcabc
abc

Adding and removing return 0 did not change the result.

So it seems that the SO crowd were right: this is some random, unreproducible behavior.
Or is it?

The investigation

I decided I wanted to know what the code was really doing.
Will the result change if I reboot the computer?
Will it change if I run the binary on another computer?

It is tempting to say “yes” because I am sure that like me, at least some of you imagine that without the \0 character in the s1 array, cout will continue trying to print some random memory that may contain anything.

But, that is not at all what happened here!
In fact, I was surprised to learn that my binary will consistently and reliably print the output shown above, and even changing compiler optimization flags didn’t do anything to alter that.

To find out what the code was really doing, I asked the compiler to output assembly instead of machine code.

This revealed some very interesting things:

First, return 0 doesn’t matter!
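Turns out, regardless of me adding the line, gcc would output the code needed to cleanly exit main with return value 0 every time (the C++ standard actually requires this: reaching the end of main without a return statement is equivalent to return 0):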

	movl	$0, %eax
	movq	-8(%rbp), %rdx
	xorq	%fs:40, %rdx
	je	.L3
	call	__stack_chk_fail
.L3:
	leave

Notice the first line: it is Linux x86-64 ABI convention to return the result in the EAX register.
The rest is just cleanup code that is also automatically generated.
Interestingly enough, turning on optimization will replace this line with xorl %eax, %eax, which is considered faster on the x86 architecture.

So, now we know that at least on my version of gcc the return statement makes no changes to the code, but what about the strings?

I expected the string literal "abc" to be stored somewhere in the data segment of the produced assembly file, but I could not find it.
Instead, I found this code for local variable initialization that is pushing values to the stack:

	movb	$97, -15(%rbp)
	movb	$98, -14(%rbp)
	movb	$99, -13(%rbp)
	movl	$6513249, -12(%rbp)

The first 3 lines are pretty obvious: these are ASCII codes for ‘a’, ‘b’ and ‘c’, so this is our s1 array.
But what is this 6513249 number?
Is it a pointer? Can’t be! It’s impossible to hardcode pointers on modern systems, and what would it point to, anyway?
The string “abc” is nowhere to be found in the compiler output… Or is it?

If we look at this value in hex, it all becomes clear:

                       +----+----+----+----+   +---+---+---+----+
6513249 = 0x00636261 = | 61 | 62 | 63 | 00 | = | a | b | c | \0 |
                       +----+----+----+----+   +---+---+---+----+

Yep, that’s our string!
Looks like despite optimizations being turned off, gcc decided to take this 4-byte string and turn it into an immediate 32-bit value (keep in mind x86 is little-endian).
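The packing is easy to check outside the compiler. The sketch below is not part of the gcc output: the helper names pack_le and unpack_le are made up for illustration, and it assumes a little-endian target such as x86.

```python
def pack_le(text: str) -> int:
    """Pack an ASCII string plus its terminating NUL into one little-endian integer."""
    raw = text.encode("ascii") + b"\x00"
    return int.from_bytes(raw, byteorder="little")


def unpack_le(value: int, size: int = 4) -> bytes:
    """Split an integer back into the bytes a little-endian CPU would store in memory."""
    return value.to_bytes(size, byteorder="little")


if __name__ == "__main__":
    immediate = pack_le("abc")
    print(immediate, hex(immediate))   # decimal and hex form of the immediate
    print(unpack_le(immediate))        # the same value as bytes in memory order
```

Running it for "abc" should give the same immediate value that gcc emits for s2.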

So there you have it: the code the compiler generates pushes the byte values of both arrays onto the stack in sequence.
So no matter what, the resulting binary will find the null character at the end of the second string (s2) and will never try to print random values from memory it is not supposed to access.
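To see why the output is always abcabc, here is a toy model of that stack frame in Python. It is only a simulation under the assumptions visible in the assembly above (s1 at offset -15 and s2 at offset -12 from %rbp); the names frame, store and read_c_string are hypothetical, and a real cout obviously does not work on a bytearray.

```python
FRAME_SIZE = 16  # bytes below %rbp that we model; offset -k maps to index FRAME_SIZE - k


def store(frame: bytearray, offset: int, data: bytes) -> None:
    """Write data at a negative %rbp-relative offset, like movb/movl do."""
    start = FRAME_SIZE + offset
    frame[start:start + len(data)] = data


def read_c_string(frame: bytearray, offset: int) -> bytes:
    """Read bytes from the offset up to the first NUL, the way a C string is printed."""
    start = FRAME_SIZE + offset
    end = frame.index(0, start)  # ValueError if there is no NUL; real code would run off the frame
    return bytes(frame[start:end])


frame = bytearray(b"\xff" * FRAME_SIZE)   # 0xff stands in for unrelated stack contents
store(frame, -15, b"abc")                 # s1: three chars, no terminator
store(frame, -12, b"abc\x00")             # s2: string literal with its NUL

print(read_c_string(frame, -15).decode())  # s1 runs straight on into s2
print(read_c_string(frame, -12).decode())  # s2 stops at its own NUL
```

Because s2 sits directly after s1 in this layout, reading s1 always ends at the terminator of s2, whatever else happens to be on the stack.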

Of course, it is possible that a different compiler, a different version of gcc, or just compiling for a different architecture will compile this code in a different way, that would result in a different behavior.
There is no guarantee these two local variables will wind up on the stack neatly in a row.

But for now, we seem to have well-defined behavior, though it is defined not by the C++ standard but by the generated code.

I didn’t write this post to encourage anyone to use code like this anywhere or to claim that we should ignore “undefined behavior”.
I simply want to point out that sometimes it can be educational to look at what your high level code is really doing under the hood, and that with computers, there is often order where people think there is chaos, even if this order isn’t immediately apparent.

I think this question would have made for an interesting and educational discussion for SO, and perhaps next time one comes up, someone will give a more detailed answer before it is closed.

I certainly learned something interesting today.
Hope someone reading this will too!

Cheers.

Categories: Code, Random and fun Tags: Assembly, C++, Code, gcc, Linux

Say hello to the new math!

27/04/2018 levdev 1 comment

Yesterday at my company, some of my department stopped work for a day to participate in an “enrichment” course.

We are a large company that can afford such things, and it can be interesting and useful.

We even have the facilities for lectures right at the office (usually used for instructing new hires) so it is very convenient to invite an expert lecturer from time to time.

This time the lecture was about Angular.

As I am not a web developer, I was not invited, but the lecture room is close to my cubicle so during the lunch break I got talking to some of the devs that participated. And of course we got into an argument about programming languages, particularly JavaScript.

I wrote before on this blog that I hate JS with a vengeance. I think it deserves to be called the worst programming language ever developed.

Here are my three main peeves with it:

The == operator. When you read a programming book for beginners that teaches a specific language and it tells you to avoid using a basic operator because the logic behind it is so convoluted even experienced programmers will have a hard time predicting the result of a comparison, it kills the language for me. I know JS isn’t the only language to use == and ===, but that does not make it any less awful!
It has the eval built in function. I already dedicated a post to my thoughts on this.
The language is now in its 6th major iteration, and everyone uses it as OO, but it still lacks proper syntax support for classes and other OO features.
This makes any serious piece of modern code written in it exceedingly messy and ugly.

But if those 3 items weren’t enough to prefer Brainfuck over JS, yesterday I got a brand new reason: it implements new math!

Turns out, if you divide by 0 in JS YOU GET INFINITY!!!

Yes, I am shouting. I have a huge problem with this. Because it breaks math.

Computers rely on math to work. Computer programs can not break it. This is beyond ridiculous!

I know JS was originally designed to allow people who were not programmers (like graphic designers) to build websites with cool looking features like animated menus. So the entire language was built upon the “on error resume next” paradigm.

And this was ok for its original purpose, but not when the language has grown to be used for building word processors, spreadsheet editors, and a whole ecosystem of complex applications billions of people use every day.

One of the first things every programmer is taught is to be ready to catch and properly handle errors in their code.

I am not a mathematician. In fact, math is part of the reason I never finished my computer science degree. But even I know you can not divide by zero.

Not because your calculator will show an error message. But because it breaks the basic rules of arithmetic.

Think of it this way:
If 5 ÷ 1 = 5 because 1 × 5 = 5, then 5 ÷ 0 = ∞ would require 0 × ∞ = 5.
Oops, you just broke division, multiplication and addition.
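The same round-trip check can be written as code. This is a sketch in Python rather than JS (Python raises an error for 5 / 0, so the infinity has to be spelled out as math.inf), and the helper name round_trips is made up for illustration.

```python
import math


def round_trips(dividend: float, divisor: float, quotient: float) -> bool:
    """A quotient only makes sense if multiplying it back returns the dividend."""
    return divisor * quotient == dividend


print(round_trips(5, 1, 5))          # ordinary division survives the check
print(round_trips(5, 0, math.inf))   # pretending 5 / 0 is infinity does not
print(0 * math.inf)                  # the floating point result of 0 times infinity

try:
    5 / 0
except ZeroDivisionError as err:
    print("Python refuses:", err)
```

JS numbers are the same IEEE 754 floating point values, so multiplying zero by Infinity there does not get you back to 5 either.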

But maybe I am not explaining this right? Maybe I am missing something?
Try the TED-Ed video on the subject instead; it does a much better job.

If JS wanted to avoid an exception and keep the math, they could have put a NaN there.

But they didn’t. And they broke the rules of the universe.

So JS sucks, and since it is so prevalent and keeps growing, we are all doomed.

Something to think about… 😛

Categories: Code, Rants Tags: Bad, Code, JavaScript, Languages, Math, Web

Sneaking features through the back door

15/04/2017 levdev Leave a comment

Sometimes programming language developers decide that certain practices are bad, so bad that they try to prevent their use through the language they develop.

For example: In both Java and C# multiple inheritance is not allowed. The language standard prohibits it, so trying to specify more than one base class will result in a compiler error.

Another blocking “feature” these languages share is a syntax preventing creation of derived classes altogether.

For Java, it is declaring a class to be final which might be a bit confusing for new users, since the same keyword is used to declare constants.

As an example, this will not compile:

public final class YouCanNotExtendMe {
    ...
}

public class TryingAnyway extends YouCanNotExtendMe {
    ...
}

For C# just replace final with sealed.

This can also be applied to specific methods instead of the entire class, to prevent overriding, in both languages.

While application developers may not find many uses for this feature, it shows up even in the Java standard library. Just try extending the built-in String class.

But, language features are tools in a tool box.

Each one can be both useful and misused or abused. It depends solely on the developer using the tools.

And that is why as languages evolve over the years, on some occasions their developers give up fighting the users and add some things they left out at the beginning.

Usually in a sneaky, roundabout way, to avoid admitting they were wrong or that they succumbed to peer pressure.

In this post, I will show two examples of such features, one from Java, and one from C#.

C# Extension methods

In version 3.0 of C# a new feature was added to the language: “extension methods”.

Just as their name suggests, they can be used to extend any class, including a sealed class. And you do not need access to the class implementation to use them. Just write your own class with a static method (or as many methods as you want), that has the first parameter denoted by the keyword this and of the type you want to extend.

Microsoft’s own guide gives an example of adding a method to the built in sealed String type.

Those who know and use C# will probably argue that there are two key differences between extension methods and derived classes:

Extension methods do not create a new type.
Personally, I think that will only affect compile time checks, which can be replaced with run time checks if not all instances of the ‘base’ class can be extended.
Also, a creative workaround may be possible with attributes.
Existing methods can not be overridden by extension methods.
This is a major drawback, and I can not think of a workaround for it.
But, you can still overload methods. And who knows what will be added in the future…

So it may not be complete, but a way to break class seals was added to the language after only two major iterations.

Multiple inheritance in Java through interfaces

Java has two separate mechanisms to support polymorphism: inheritance and interfaces.

A Java class can have only one base class it inherits from, but can implement many interfaces, and so can be referenced through these interface types.

public interface IfaceA {
    void methodA();
}

public interface IfaceB {
    void methodB();
}

public class Example implements IfaceA, IfaceB {
    @Override
    public void methodA() {
        ...
    }

    @Override
    public void methodB() {
        ...
    }
}

Example var0 = new Example();
IfaceA var1 = var0;
IfaceB var2 = var0;

But, before Java 8, interfaces could not contain any code, only constants and method declarations, so classes could not inherit functionality from them, as they could by extending a base class.

Thus while interfaces provided the polymorphic part of multiple inheritance, they lacked the functionality reuse part.

In Java 8 all that changed with addition of default and static methods to interfaces.

Now, an interface could contain code, and any class implementing it would inherit this functionality.

It appears that Java 9 is about to take this one step further: it will add private methods to interfaces!

Before this, everything in an interface had to be public.

This essentially erases any differences between interfaces and abstract classes, and allows multiple inheritance. But, being a back door feature, it still has some limitations compared to true multiple inheritance that is available in languages like Python and C++:

You can not inherit from any arbitrary collection of classes together. The class writer must allow joint inheritance by implementing the class as an interface.
Unlike regular base classes, interfaces can not be instantiated on their own, even if all the methods of an interface have default implementations.
This can be easily worked around by creating a dummy class without any code that implements the interface.
There are no protected methods.
Maybe Java 10 will add them…
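For comparison, this is roughly what the “true” multiple inheritance mentioned above looks like in Python. The class names are invented for the example; the point is only that both bases are ordinary classes with working code, each can be instantiated on its own, and neither had to be written as an interface in advance.

```python
class Walker:
    """An ordinary class: usable on its own and as a base."""

    def walk(self) -> str:
        return f"{self._name()} walks"

    def _name(self) -> str:  # leading underscore: Python's convention for "protected"
        return "walker"


class Swimmer:
    """Another ordinary class, written without knowing about Walker."""

    def swim(self) -> str:
        return "swims"


class Duck(Walker, Swimmer):
    """Inherits working code from both bases at once."""

    def _name(self) -> str:  # overrides the "protected" helper from Walker
        return "duck"


duck = Duck()
print(duck.walk(), "and", duck.swim())
print(isinstance(duck, Walker), isinstance(duck, Swimmer))
print(Walker().walk())  # the base class can still be instantiated directly
```

All three limitations from the list above are absent here, which is a fair measure of how far the Java back door still is from the real thing.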

But basically, after 8 major iterations of the language, you can finally have full blown multiple inheritance in Java.

Conclusion

These features have their official excuses:
Extension methods are supposed to be “syntactic sugar” for “helper” and utility classes.
Default method implementation is supposed to allow extending interfaces without breaking legacy code.

But whatever the original intentions and reasoning were, the fact remains: you can have C# code that calls what look like instance methods on objects even though those methods are not part of the original class, and you can now have Java classes that inherit “is a” type and working code from multiple sources.

And I don’t think this is a bad thing.
As long as programmers use these tools correctly, it will make code better.
Fighting your users is always a bad idea, more so if your users are developers themselves.

Do you know of any other features like this that showed up in other languages?
Let me know in the comments or by email!

Categories: Code, Rants Tags: C, Development, Features, Java, Languages, OOP, Standards

Putting it out there…

03/04/2017 levdev Leave a comment

When I first discovered Linux and the world of Free Software, I was already programming in the Microsoft ecosystem for several years, both as a hobby, and for a living.

I thought switching to writing Linux programs was just a matter of learning a new API.

I was wrong! Very wrong!

I couldn’t find my way through the myriad of tools, build systems, IDEs and editors, frameworks and libraries.

I needed to change my thinking, and get things in order. Luckily, I found an excellent book that helped me do just that: “Beginning Linux Programming” from Wrox.

Chapter by chapter this book guided me from using the terminal, all the way up to building desktop GUI programs like I was used to on Windows.

But you can’t learn programming just by reading a book. You have to use what you learn immediately and build something, or you will not retain any knowledge!

And that is exactly what I did: as soon as I got far enough to begin compiling C programs, I started building “Terminal Bomber”.

In hindsight, the name sounds like malware. But it is not. It’s just a game. It is a clone of “Mine sweeper” for the terminal, using the ncurses library.

I got as far as making it playable, adding high scores and help screens. I even managed to use Autotools for the build system.

And then I lost interest and let it rot. With the TODO file still full of items.

I never even got to packaging it for distribution, even though I planned from the start to release it under GPL v3.

But now, in preparation for a series of posts about quickly and easily setting up an HTTP server for IPC, I finally opened a GitHub account.

And to fill it with something, I decided to put up a repository for this forgotten project.

So here we go: https://github.com/lvmtime/term_bomber

Terminal Bomber finally sees the light of day!

“Release early, release often” is a known FOSS mantra. This release is kind of late for me personally, but early in the app’s life cycle.

Will anything become of this code? I don’t know. It would be great if someone at least runs it.

Sadly, this is currently the only project I published that is not built for a dead OS.

Just putting it out there… 😛

Categories: Code, Projects Tags: Desktop, FOSS, git, github, mine sweeper, source code, terminal

Beware Java’s half baked generics

13/10/2016 levdev Leave a comment

Usually I don’t badmouth Java. I think it’s a very good programming language.

In fact, I tend to defend it in arguments on various forums.

Sure, it lacks features compared to some other languages, but then again throwing everything including the kitchen sink into a language is not necessarily a good idea. Just look at how easy it is to get a horrible mess of code in C++ with a single operator doing different things depending on context. Is &some_var trying to get the address of a variable or a reference? And what does &&some_var do? It has nothing to do with the boolean AND operator!

So here we have a simple language friendly to new developers, which is good because there are lots of those using it on the popular Android platform.

Unfortunately, even the best languages have some implementation detail that will make you want to lynch their creators or just rip out your hair, depending on whether you externalize your violent tendencies or not.

Here is a short code example that demonstrates a bug that for about 5 minutes made me think I was high on something:

HashMap<Integer, String> map = new HashMap<>();

byte a = 42;
int b = a;

map.put(b, "The answer!");

if (map.containsKey(a))
	System.out.println("The answer is: " + map.get(a));
else
	System.out.println("What was the question?");

What do you expect this code to print?

Will it even compile?

Apparently it will, but the result will surprise anyone who is not well familiar with Java’s generic types.

Yes folks – the key will not be found and the message What was the question? will be printed.

Here is why:

The generic types in Java are not fully parameterized. Unlike a proper C++ template, some methods of generic containers take parameters of type Object, instead of the type the container instantiation was defined with.

For HashMap, even though its put method is properly parameterized and will raise a compiler error if a key of the wrong type is used, the get and containsKey methods take a parameter of type Object and will not even throw a runtime exception if the wrong type is provided. They will simply return null or false respectively, as if the key was simply not there.

The other part of the problem is that primitive types such as byte and int are second class citizens in Java. They are not objects like everything else and can not be used to parameterize generics.

They do have object equivalents named Byte and Integer but those don’t have proper operator overloading so are not convenient for all use cases.

Thus in the code sample above the variable a gets autoboxed to Byte, which as far as Java is concerned is a completely different type that has nothing to do with Integer, and therefore there is no way to search for Byte keys in an Integer map.

A language that implements proper generics would have parameterized these methods so either a compilation error occurred or an implicit cast was made.

In Java, it is up to you as a programmer to keep your key types straight, even between seemingly compatible types like integers of various sizes.

In my case I was working with a binary protocol received from an external device and the function filling up the map was not the same one reading from it, so it was not straightforward to align types everywhere. But in the end I did it and learned my lesson.

Maybe this long rant will help you too. At least until a version of Java gets this part right…

Categories: Code, Tips and tricks Tags: Code, Development, Generics, Java, Type safety

Get XML element value in Python using minidom

29/07/2011 levdev Leave a comment

Finally, a “development” post for my “developer” blog.

Recently, I’ve been working on some XML processing programs in Python.

The minidom module is great if you want your XML in a tree, and want tag names and attributes easily accessible, but, what happens if you want the text content inside a tag?

DOM does not have a “tag value” concept. Instead, every bit of text in the XML, including the indentation, is a “text node”, which is parsed as a separate tree element.

That means, that if you have something like this:


<name>John Smith</name>

You will get a tree with two levels: the top level is the “name” element, for which nodeValue will be None. This element will have a child node (the second level of the tree) which will be of type TEXT_NODE, and its value will be the text “John Smith”.
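This is easy to verify in an interpreter. A minimal sketch using only the standard xml.dom.minidom module (the variable names are arbitrary):

```python
from xml.dom import minidom

doc = minidom.parseString("<name>John Smith</name>")
name = doc.documentElement          # top level: the <name> element

print(name.nodeValue)               # element nodes carry no value of their own
child = name.firstChild             # second level: the text inside the tag
print(child.nodeType == child.TEXT_NODE)
print(child.nodeValue)
```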

So far, so good, but, what if the value we want has some XML markup of its own?


<text>This text has <b>bold</b> and <i>italic</i> words.</text>

Now we have a complex tree on our hands with 3 levels and multiple branches.

It will look something like this:

<text>
   |______
          |-"This text has "
          |-<b>
          |  |_________
          |            -"bold"
          |-" and "
          |-<i>
          |  |_________
          |            -"italic"
          --" words."

As you can see, this is a big mess, with the text split into multiple parts on two separate tree levels.

There is no facility in minidom to get the value of our <text> tag directly.

There is, however, a way around it that is simple but not obvious: you need to “flatten” the desired tag into an XML string, then strip the tag itself from the string, and you will have a clean value.

Here is the code:

def get_tag_value(node):
    """retrieves value of given XML node
    parameter:
    node - node object containing the tag element produced by minidom

    return:
    content of the tag element as string
    """

    xml_str = node.toxml() # flattens the element to string

    # cut off the base tag to get clean content:
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''

    return xml_str[start + 1:end]

Just pass the node you want the value of to the function and it will give you back the value as a string, including any internal markup.

I place this code in the public domain, which means you can use it anywhere any way you want with no strings attached.
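If string slicing feels fragile, a variation on the same idea is to flatten the children instead of the tag itself, so there is nothing to cut off. This is an alternative sketch, not the function above; inner_xml is a made-up name, while parseString, childNodes and toxml are standard minidom.

```python
from xml.dom import minidom


def inner_xml(node) -> str:
    """Return the content of an element, internal markup included.

    Each child (text node or nested element) is serialized separately
    and the pieces are joined, so the enclosing tag never appears.
    """
    return "".join(child.toxml() for child in node.childNodes)


doc = minidom.parseString(
    "<text>This text has <b>bold</b> and <i>italic</i> words.</text>"
)
print(inner_xml(doc.documentElement))
```

For an empty element there are no children to join, so the result is simply an empty string.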

Categories: Code, Tips and tricks Tags: Development, minidom, Python, XML
