Saturday, March 23, 2013

Building a git network visualization tool


I always liked the Git network viewer (see this example, for instance). In the early Git days, I found it very useful to have visual feedback on the Git operations I was performing.
Even today I find it clearer than the regular vertical log viewer.
Unfortunately, it seems that only GitHub provides similar functionality, and of course it works only on remote repositories.

So, I decided to write a similar tool myself :) I picked up Qt 5.0 and libgit2 and built something from scratch in a few hours.
I thought it would be a relatively simple job, but in the end I spent a good amount of time fighting with QML performance. At the moment I have just reached a decent level of performance (at least for a first version), and after I complete the GUI with a basic set of options, I will publish it as OSS.

After the first version, I plan to add navigation features the GitHub viewer doesn't have, and also a "graphs" page. This kind of visualization isn't really nice in repositories where many branches are involved, but I think I can find a way to display those nicely too...

So, stay tuned :)



Monday, February 4, 2013

Fun with composition and interfaces

Since code reuse is one of the most important programming concepts, let's look at some nice ways to use C++ templates to make code reuse simple and efficient. What follows is a bunch of techniques that have proven useful in my real-life programming.

Suppose you have a set of unrelated classes, and you want to add a common behavior to each of them.

class A : public A_base
{
};

class B : public B_base
{
};
...
 
You want to add a set of common data and functions to each of these classes. These can be extension functions (like loggers, benchmarking helpers, etc.) or algorithm traits (see below).

class printer
{
public:
    void print ()
    {
        std::cout << "Print " << std::endl;
    }
};
The simplest way to accomplish this is using multiple inheritance:
 
class A : public A_base , public printer
{
};

That's easy, but the printer class won't be able to access any of A's members. This way, it is only possible to compose classes which are independent from each other.

Composition by hierarchy insertion

Suppose we want to access a class member to perform an additional operation. A possible way is to insert the extension class into the hierarchy.
template <class T>
class printer : public T
{
public:
    void print ()
    {
        std::cout << "Class name " << this->name() << std::endl;
    }
};

The printer class now relies on an "std::string name()" function being available in its base class. This kind of requirement is quite common in template classes, and until we get concepts, we must pay attention that the required methods exist in the classes we extend.
BTW, type traits could eventually be used in place of direct function calls.
The class can be inserted in a class hierarchy and the derived classes can access the additional functionality.
 
class A_base
{
public:
    std::string name () { return "A_base class"; }
};

class A : public printer<A_base>
{
};

int main ()
{
    A a;
    a.print();
    return 0;
}

This technique can be useful to compose algorithms that have different traits, and to share code between similar classes that don't have a common base.
This last example is not a really good one, for multiple reasons:
  • In case of a complex A_base constructor, its arguments should be forwarded by the printer class. C++11 inheriting constructors make things easier, but in C++98 you'll have to manually forward each A_base constructor, making the class less generic.
  • Inserting a class into a hierarchy can be unwanted.
  • If you want to access A (not A_base) members from the extension class, you need to add another derived class, deriving from printer<A>.
Anyway, this specific kind of pattern can still be useful to reuse virtual function implementations:
class my_interface
{
public:
   virtual void functionA () = 0;
   virtual void functionB () = 0;
};

template <class Base>
class some_impl : public Base
{
     void functionA () override;
};

class my_deriv1 : public some_impl<my_interface>
{
   void functionB() override;
};

In particular, since my_interface is an interface, some_impl can be used to reuse a piece of the implementation.

Using the Curiously Recurring Template Pattern

Now comes the nice part: to overcome the limitations of the previous samples, a nice pattern can be used: the Curiously Recurring Template Pattern (CRTP).

template <class T>
class printer
{
public:
   void print ()
   {
       std::cout << (static_cast<T*>(this))->name() << std::endl;
   }
};

class A : public A_base, public printer<A>
{
public:
   std::string name ()
   {
       return "A class";
   }
};

int main ()
{
     A a;
     a.print ();
     return 0;
}

Let's analyze this a bit: the printer class is still a template, but it doesn't derive from T anymore.
Instead, the printer implementation assumes that it will be used in a context where it is convertible to T*, and will have access to all of T's members.
This is the reason for the static_cast to T* in the code.

If you look at the code of the printer class alone, this question arises immediately:
how come a class unrelated to T can be statically cast to T*?

The answer is that you don't have to look at the printer class "alone": template classes and functions are instantiated on first use.
When the print() function is called, the template is instantiated. At that point the compiler already knows that A derives from printer<A>, so the static cast can be performed like any downcast.

As you can see, with this idiom you can extend any class and even access its members from the extension functions.
You may have noticed that the extension class can only access public members of the extended class. To overcome this, the extension class must be made a friend:
template <class T>
class printer
{
    T* thisClass() { return static_cast<T*>(this); }

public:
    void print ()
    {
        std::cout << thisClass()->name() << std::endl;
    }
};

class A : public A_base, public printer<A>
{
  friend class printer<A>;

private:

   std::string name ()
   {
       return "A class";
   }
};

I've also added a thisClass() utility function to simplify the code and keep the cast in one place (the const version is left to the reader).

Algorithm composition

This specific kind of pattern can be used to create what I call “algorithm traits”, and it’s one of the ways I use it in real code.

Suppose you have a generic algorithm which is composed of two or more parts. Suppose also that the data is shared and eventually stored in another class (as usual, endless combinations are possible). Here I'll make a very simple example, but I've used this to successfully compose complex CAD algorithms:
template <class T>
class base1
{
protected:
    std::vector<int> data;   // the shared data
    void fillData ();
};

template <class T>
class phase1_A
{
protected:
    void phase1();
};

template <class T>
class phase1_B
{
protected:
    void phase1();
};

template <class T>
class phase2_A
{
protected:
    void phase2();
};

template <class T>
class phase2_B
{
protected:
    void phase2();
};

template <class T>
class algorithm
{
    T* thisClass() { return static_cast<T*>(this); }

public:
    void run ()
    {
        thisClass()->fillData();
        thisClass()->phase1();
        thisClass()->phase2();
    }
};

// this would be the version using the "derived class" technique
// class UserA : public algorithm<phase2_A<phase1_A<base1>>> {};

class comb1 : public algorithm<comb1>, public phase1_A<comb1>, public phase2_A<comb1>, public base1<comb1> { friend class algorithm<comb1>; };
class comb2 : public algorithm<comb2>, public phase1_B<comb2>, public phase2_A<comb2>, public base1<comb2> { friend class algorithm<comb2>; };
class comb3 : public algorithm<comb3>, public phase1_A<comb3>, public phase2_B<comb3>, public base1<comb3> { friend class algorithm<comb3>; };
class comb4 : public algorithm<comb4>, public phase1_B<comb4>, public phase2_B<comb4>, public base1<comb4> { friend class algorithm<comb4>; };
...
This technique is useful when the algorithms heavily manipulate member data, and functional-style programming (input -> copy -> output) would not be efficient. Anyway... it's just another way to combine things.
A small note on performance: the static_casts in the algorithm pieces will usually require a displacement operation on this (i.e. a subtraction), while using a hierarchy usually results in a no-op.
This technique can also be mixed with virtual functions: the algorithm can eventually be implemented in a base class, while the function overrides are composed in the way I just showed.

Interfaces

As we saw, extension methods allow reusing specific code in unrelated extended classes. In high-level C++, the same thing is often accomplished with interfaces:
class IPrinter
{
public:
  virtual void print () = 0;
};

class A : public A_base , public IPrinter
{
public:
 void print () override { std::cout << "A class" << std::endl; }
};

In this case, each class that implements the interface has to re-implement the code in its own specific way. The (obvious) advantage of using interfaces is that instances can be used in a uniform way, independently of the implementing class.
void use_interface(IPrinter * i)
{
  i->print();
}

A a;
B b; // unrelated to a
use_interface(&a);
use_interface(&b);

This is not possible with the techniques of the previous sections, since even if the template class is the same, the instantiated classes are completely unrelated.
Of course, one could make use_interface a template function too. That can surely be a way to go, especially if you are writing code heavily based on templates. In this case, though, I would like to find a high-level way and reduce template usage (and consequently code bloat and compilation times).
class IPrinter
{
public:
  virtual void print () = 0;
};

template <class T>
class Implementor : public IPrinter
{
    T* thisClass() { return static_cast<T*>(this); }

public:
  void print () override
  {
      std::cout << thisClass()->name () << std::endl;
  }
};

The Implementor class implements the IPrinter interface using the composition technique explained before, and expects the name() function to be present in the user class.
It can be used in this simple way:
class A : public A_Base, public Implementor<A>
{
    friend class Implementor<A>;
    std::string name () { return "A"; }
};

int main ()
{
    A a;
    IPrinter * intf = &a;
    intf->print();
    use_interface(intf);
    return 0;
}
Some notes apply:
  • This kind of pattern is useful when you have a common implementation of an interface that depends only in part on the combined class (A in this case).
  • Since A derives from Implementor<A>, it's also true that A implements IPrinter; up-casting from A to IPrinter is allowed.
  • Even if Implementor doesn't have data members, A's size is increased due to the presence of the IPrinter vtable pointer.
  • Using interfaces allows reducing code bloat in the class consumers, since every Implementor-derived class can be used as an (IPrinter *).
    There's a small performance penalty though, caused by virtual function calls and the increased class size.
  • The benefit is that virtual function calls are used only when calling the print function through an IPrinter pointer. If called directly, static binding is used instead. This can be true even for references, if the C++11 final modifier is added to the Implementor definition:
 void print () final override;
A a;
A & ref = a;
IPrinter * intf = &a;
a.print ();        // static binding
ref.print();       // dynamic binding (static if declared with final)
intf->print (); // dynamic binding;

This kind of composition doesn't have limits on the number of interfaces.
class mega_composited : public Base , public Implementor<mega_composited>, public OtherImplementor<mega_composited> /* , ... */
{

};


Adapters

These implementors can be seen as a sort of adapter between your class and the implemented interfaces. This means that the adapter can also be an external class. In this case you will need to pass a pointer to the original class in the constructor.
template <class T>
class Implementor : public IPrinter
{
    T* thisClass;

public:
  Implementor (T * original) : thisClass(original)
  {
  }

  void print () override
  {
      std::cout << thisClass->name () << std::endl;
  }
};
Note that thisClass is now a member and is initialized in the constructor.
...
A a;
Implementor<A> impl(&a);
use_interface(&impl);
As you see, the implementor is used as an adapter between A and IPrinter. This way, class A won't contain the additional member functions.
Note: memory management has been totally omitted from this article. Be careful with these pointers in real code!
Eventually, one can make the object convertible to the interface while keeping the implementor as an object member (a sort of COM aggregation).
class A
{
public:

   Implementor<A> printer_impl;
   /*explicit*/ operator IPrinter * () { return &printer_impl; }

   A () : printer_impl(this) {}
};
or even lazier...
class A
{
public:
     std::unique_ptr<Implementor<A>> printer_impl;

     /* explicit */ operator IPrinter * () {
       if (printer_impl.get() == nullptr)
          printer_impl.reset(new Implementor<A>(this));
       return printer_impl.get();
     }
};


I will stop here for now. C++ lets programmers compose things in many interesting ways, and obtain a high degree of code reuse without losing performance.
This kind of composition is close to the original intent of templates, i.e. a generic piece of code that can be reused without having to resort to copy-and-paste! Nice :)

Friday, February 1, 2013

Write a C++ blog they said...

... it will be fun they said! :)
Indeed, I have already written two more articles, but it's the editing part that is time consuming:
  • Read the article over and over and make sure the English is good enough.
  • Do proper source code formatting.
  • Check that the code actually compiles and works.
  • Make sure that the whole article makes sense, and so do its smaller parts.
All of this can double the time required to write the original article text.
In the meantime I'll try to do smaller updates on smaller topics.

So, next time we'll have some fun with interfaces and class composition! Stay tuned!

Sunday, January 6, 2013

Unicode and your application (5 of 5)


Other parts: Part 1 , Part 2 , Part 3 , Part 4

Here comes the last installment of this series: a quick discussion on output files, then a summary of the possible choices using the C++ language.

Output files

What we saw for input files applies to output files as well.
Whenever your application needs to save a text output file for the user, the rules of thumb are the following:

  • Allow the user to select an output encoding for the files he's going to save.
  • Consider a default encoding in the program's configuration options, and allow the user to change it during the output phase.
  • If your program loads and saves the same text file, save the original encoding and propose it as the default.
  • Warn the user if the conversion is lossy (e.g. when converting from UTF-8 to ANSI).

Of course this doesn't apply to files stored in an internal application format. In that case it's up to you to choose the preferred encoding.

The steps of text-file saving are the mirrored steps of text-file loading:
  1. Handle the user interaction and encoding selection (with the appropriate warnings)
  2. Convert the buffer from your internal encoding to the output encoding
  3. Send/save the byte buffer to the output device
I won't bother with the pseudo-code for these specific steps; instead, I have updated the unicode samples on github with "uniconv", a utility that is an evolution of the unicat one: a tool that lets you convert a text file to a different encoding.
  uniconv input_file output_file [--in_enc=xxxx] --out_enc=xxxx [--detect-bom] [--no-write-bom]
It basically lets you choose a different encoding for both input and output files.

To BOM or not to BOM?


When saving a UTF text file, it's really appropriate to save the BOM in the first bytes of the file. This will allow other programs to automatically detect the encoding of the file (see the previous part).
The Unicode standard discourages the usage of the BOM in UTF-8 text files. Anyway, at least on Windows, using a UTF-8 BOM is the only way to automatically distinguish a UTF-8 text file from an "ANSI" one.
So, the choice is up to you, depending on how and where the generated files will be used.
Personally, I tend to prefer the presence of a BOM, to leave all the ambiguities behind.
Edit: since the statements above seem to be a personal opinion, and I don't have enough arguments either for or against storing a UTF-8 BOM, I'll let the reader find a good answer on his own! I promise I'll come back to this topic.

You have to make a choice

As we have seen in the previous article, there are a lot of differences between systems, encodings and compilers. Anyway, you still need to handle strings and manipulate them. So, what's the most appropriate choice for character and string types?
Posing such a question on a forum or StackOverflow could easily generate a flame war :) This is one of those decisions that depends on a wide range of factors, and there's no definitive choice valid for everyone.

Choosing a string and encoding type doesn't mean you have to use a Unicode encoding at all; it also doesn't mean that you have to use a single encoding for all the platforms you are porting your application to.

The important thing is that this single choice is propagated coherently across your program.

Basically you have to decide about three fundamental aspects:
  1. Whether using standard C++ types and functions or an existing framework
  2. The character and string type you will be using
  3. The internal encoding for your strings
Choosing an existing framework will often force choices 2) and 3).
For instance, by choosing Qt you will automatically be forced to use QChar and UTF-16.
Choices 2) and 3) are strictly related, and can be inverted. Indeed, choosing a data type will force the encoding you use, depending on the word size. Alternatively, one can choose a specific encoding, and the data type will be chosen as a consequence.

Depending on how portable your software needs to be, you can choose between:
  • a narrow character type (char) and the corresponding std::string type
  • a wide character type (wchar_t) and the corresponding std::wstring type
  • a new C++11 character type
  • a character type depending on the compilation system
Here follows a quick comparison between these various choices.

Choosing a narrow character type

This means choosing the char/std::string pair.
Pros:
  • it's the most supported at the library level, and widely used in existing programs and libraries.
  • the data type can be adapted to a variety of encodings, both fixed- and variable-length (e.g. Latin1 or UTF-8).
Cons:
  • On Windows you can only use fixed-length encodings; UTF-8 is not supported unless you do explicit conversions.
  • On Linux the internal encoding can vary between systems, and UTF-8 is just one of the choices. Indeed, you can get a fixed- or variable-length encoding depending on the locale.

Let me stress once again that choosing char on the Windows platform is a BAD choice, since your program will not support Unicode at all.

Choosing a wide character type

This means using wchar_t and std::wstring.

Pros:
  • wide character types are actually more Unicode-friendly and less ambiguous than the char data type.
  • Windows works better with (is built on!) wide-character strings.
Cons:
  • the wchar_t type has different sizes between systems and compilers, and the encoding changes accordingly.
  • library support is more limited; some functions are missing from various standard library implementations.
  • wide characters are not really widespread, and existing libraries that chose the "char" data type can require character type conversions and cause trouble.
  • Unixes "work better" with the "char" data type.

Choosing a character type depending on the compilation system

This means that the character type is #defined at compilation time and varies between systems.
Pros:
  • the character type will adapt to the "preferred" one of the system. For instance, it can be wchar_t on Windows and char on Unixes.
  • You are sure that the character type is well supported by the library functions too.
Cons:
  • you have to think in a generic way and make sure that all the functions are available for both data types.
  • Not many libraries support a generic data type, and the usage is not widespread on Unixes; it is more common on Windows.
  • You will have to map, with a define, all the functions you are using.
Have you ever met the infamous TCHAR type on Windows? It is defined as
#ifndef UNICODE
  #define TCHAR char
#else
  #define TCHAR wchar_t
#endif
The Visual C++ library also defines a set of generic "C" library functions that map to the narrow or wide version.
This technique can be used across different systems too, and indeed works well. The bad part is that you will have to mark all your literals with a macro that also maps to the char or wchar_t version.
#if !defined(WIN32) || !defined(UNICODE)
  #define tchar char
  #define tstring std::string
#else
  #define tchar wchar_t
  #define tstring std::wstring
#endif

tstring t = _T("Hello world");

I have never seen this approach used outside Windows, but I have used it sometimes and it's a viable choice, especially if you don't do too much string manipulation.

Choosing the new C++11 types

Hey, nice idea. This means using char16_t and char32_t (and the corresponding u16string and u32string).
Pros:
  • You will have data types with fixed sizes across systems and compilers.
  • You only write and test your code once.
  • Library support will likely improve in this direction.
Cons:
  • As of today, library support is lacking.
  • Operating systems don't support these data types, so you will need to do conversions anyway to make function calls.
  • UTF-8 strings still use the char data type and std::string, increasing ambiguity (see previous chapters).


Making a choice, the opposite way

As we have seen above, choosing the data type will lead you to different encodings in different runtimes and systems.
The opposite way to take the decision is choosing an encoding and then inferring the data type from it. For instance, one can choose UTF-8 and select the data type as a consequence.
This kind of choice goes "against" the common C/C++ design and usage because, as we have seen, encodings and data types tend to vary between platforms.
Still, I can say that it is probably a really good choice, and the choice that many "existing frameworks" took.
Pros:
  • You write and test your code only once (!)
Cons:
  • Forget about the C/C++ libraries, unless you are willing to do many conversions depending on the system (losing all the advantages).
  • This kind of approach will likely require custom data types.
  • Libraries written using standard C++ will require many conversions between string types.
In this regard, let's consider three possible choices:
UTF-8:
  • You could use the char data type and be compatible. I suggest not doing so, to avoid ambiguity.
  • On Windows it is not supported, so you will need to convert to UTF-16/wchar_t around your API calls.
  • On Unixes that support UTF-8 locales, it works out of the box.
UTF-16:
  • If you have a C++11 compiler you can use a predefined data type; otherwise you will have to invent one of your own that is 16 bits on all systems.
  • On Windows all the APIs work out of the box; on Linux you will have to convert to the current locale (hopefully UTF-8).
UTF-32:
  • Again, if you have C++11 you can use a predefined data type; otherwise you will have to invent one of your own that is 32 bits on all systems.
  • On any system you will have to do conversions to make system calls.
This approach, taken by frameworks such as Qt, .NET, etc., requires a good amount of code to be written: not only do they choose an encoding for strings, but they also contain a good number of supporting functions, streams, conversions, etc.

Finally choosing

All of this seems like a cul de sac :)
To sum up, one either has to choose a C++ way of doing things that varies a lot between systems, or a "fixed encoding" way, being forced to use existing frameworks.

Fortunately, C++ genericity can abstract the differences, and hopefully standard libraries will improve Unicode support out of the box. I don't expect too much change from existing OS APIs, though.
Still, I'm not able to give you a rule and the "correct" solution, simply because it doesn't exist.
Anyway, I hope to have pointed out the possible choices and scenarios that a programmer can face during the development of an application.

Now that you have read all of this, I hope that you are asking yourself these questions:
  • Do my programs correctly handle compiler differences in encodings and function behavior?
  • Do my programs correctly handle variable-length encodings?
  • Do my programs do all the required conversions when passing strings around?
  • Do my programs correctly handle input and output encodings for text files?

Personal choices

Personally, if the application I'm writing does any text manipulation, I prefer to use an existing framework, such as Qt, and forget about the problem.
I really prefer the maintainability and the reduced test burden of this approach.
Anyway, if the strings are just copied and passed around, I usually stick with the standard C/C++ strings, like std::string or "tstring" when necessary. This way I keep the program portable with a reduced set of dependencies.
Finally, when I write Windows-only programs I choose std::wstring, and then use the APIs to do everything.

C++ standard library status : again

As we have previously seen, standard C++ classes are not really Unicode-friendly, and the implementation level varies very much (too much!) between systems. Anyway, if you are using a decent C++11 compiler you will find some utility classes that let you do many of these operations using only standard classes.
I will write an addendum on this blog soon, and I will update the two unicode examples using C++11 instead of Qt, trying to write them in the most portable way possible.

Conclusions

I hope to get some feedback about all of this: I'm still learning it, and I think that shared knowledge always leads to better decisions. So, feel free to write me down an email or a comment below!

Monday, December 10, 2012

Unicode and your application : part 4 example

I just wrote a simple example with Qt on how to read (and print out) a text file. As I suggested in part 4, the program lets the user choose the default encoding and eventually auto-detect it.
In this github repository I added the unicat application, which is a console application that can print a single text file.
You can download the repository as ZIP file here: https://github.com/QbProg/unicode-samples/archive/master.zip

Usage

unicat [--enc=<encoding_name>] [--detect-bom] <filename>

  • If the application is launched without a filename, it prints a list of supported encodings.
  • If --enc is not passed, it uses the system default encoding.
  • If --enc is passed, it uses the selected encoding.
  • If --detect-bom is passed, the program tries to detect the encoding from the BOM. If a BOM is not found, it uses the passed (or default) encoding.
The file is then printed to stdout.

Note: on Windows, the console must have a Unicode font set, like Lucida Console, or you won't see anything displayed. Also, even with such a console font, you won't see all the characters.

The repository contains some text files encoded with various encodings. Feel free to play with these files and with the program options.

Building

To build the unicat sample, open the .pro file with QtCreator or run qmake from the sources dir. To build it, use make or nmake, or your preferred build command.

The code

The code is pretty simple. It first uses Qt functions to parse the program options.
It then reads the text file: Qt uses QTextCodec to abstract the encoding and decoding of a binary stream to a specific string encoding.

The function QTextCodec::availableCodecs() is used to enumerate all the encodings supported on the system.
Q_FOREACH(const QByteArray & B , QTextCodec::availableCodecs())
   {
       QString name(B);
       std::cout << name.toStdString() << std::endl;
   }

If the user passes an encoding, it tries to load the specific QTextCodec for it; otherwise it uses the default QTextCodec, which usually is set to "System":
if (encoding.isEmpty())
    userCodec = QTextCodec::codecForLocale();
else
    userCodec = QTextCodec::codecForName(encoding.toAscii());

After a file is open, the program uses a QTextStream to read it:
QTextStream stream(&file);
stream.setAutoDetectUnicode(detectBOM);
stream.setCodec(userCodec);
This is the initialization part, which specifies whether a BOM should be auto-detected and the default encoding to be used.
Reading the text and displaying it is just a matter of readLine and wcout.

"Why are you using wcout?"

If you remember the previous parts, the default std::string encoding is UTF-8 on Linux, but ANSI (i.e. the active 8-bit code page) on Windows. You can't do anything about this, since the CRT works that way.
In addition, std::string filenames won't work on Windows for the same reason. So, in this example, the most portable way to print unicode strings is by using wcout + QString::toStdWString().

There are also other problems with the Windows console: by default, Visual C++ streams and the Windows console don't support UTF-16 (or UTF-8) output. To make it work you have to use this hack:

 _setmode(_fileno(stdout), _O_U16TEXT);  

This allows printing unicode strings to the console. Keep in mind the considerations made in the previous chapter, since not all fonts will display all the characters correctly.
BTW, I didn't find any problems on Linux.

Other frameworks

It would be nice to have an equivalent of this program (portable in the same way) using other frameworks or only standard C++, to compare the complexity of each approach. Feedback is appreciated :)

Sunday, December 9, 2012

Unicode and your application (4 of n) : input files

Other parts: Part 1 , Part 2 , Part 3 , Part 5

I was already writing part 4, about choosing a good internal string encoding. Then I realized that a piece was missing: as a reader, I would like to know about all of these issues before making such a choice.
So here comes the most interesting part of all these disquisitions: input and output.

This part is mostly generic; you won't find too much code inside.
This is because the implementation varies between libraries and frameworks. As in the previous parts, standard C++ is not of much help here.

Reading a text file

Let's suppose for a moment that you have chosen an internal encoding for your strings, and you want to read a user text file and display it.

This is the example taken from cplusplus.com:

// reading a text file
#include <iostream>
#include <fstream>
#include <string>
using namespace std;

int main () {
  string line;
  ifstream myfile ("example.txt");
  if (myfile.is_open())
  {
    while ( myfile.good() )
    {
      getline (myfile,line);
      cout << line << endl;
    }
    myfile.close();
  }

  else cout << "Unable to open file"; 

  return 0;
}

Do you know what?

This is wrong. Really wrong!

Really, it is wrong:
  • The program completely ignores the text encoding of example.txt.
  • The program assumes that example.txt is in the same encoding as your C locale. Eventually this may be the same as your system default encoding.
  • As we saw in previous parts, this program behaves in a completely different way between systems (Windows, modern Linuxes, older Linuxes).
  • Even if the text file encoding matches the system and C locale one, you are implicitly forcing the user to use that specific encoding.
Given that one knows all the implications, this piece of code may be good enough to read internal configuration files.
But absolutely not to read user text files.
By user file I mean every file that is not under direct control of the application, for instance any file opened via the command line or common dialogs.

A matter of encodings

This leads to two different and fundamental questions:
  • What is the encoding of the user file?
  • How do I read a text file once I know its encoding?

If you take a look at the Wikipedia page on “character encodings”, you see that the list is very long.
Why should your program support all these encodings?
Could we force the user to use a specific encoding for the files passed to our application?

No! This is wrong (again).

Please don't do it: you have to let the user choose their preferred encoding.
Don't make assumptions and don't force them to use your preferred encoding just because you are not able to read the others! The user must have the freedom of choosing an encoding, mainly for two reasons:
  • The user has the right to choose their preferred encoding. They may choose an encoding that looks wrong to you, but remember: you are writing a tool, the user is using it! :)
  • Most users don't even know what encodings are, and they are not required to. I see that many programmers still don't know what encodings are, so why should end users?

rant-mode on
I see that this specific rule is often ignored in linux systems. At a certain point they decided to change the locale to UTF8, and by a strange transitive property, all the text files handed by command line utilities are required to be UTF8. Unfortunately, this is true even for commonly used tools, like diff (and consequently git). Bad, very bad. rant-mode off

Again: please let your users choose their preferred encoding whenever they need to pass textual input to your program.

When dealing with textual input files, consider these aspects:
  • Most text files in Windows systems are encoded with a fixed 8-bit encoding. So, these files look good only in systems with the same code-page.
  • Most text files in Linux are encoded in UTF8 nowadays, but older systems used different encodings.
  • Text files are likely to come from different systems, via the Internet, so assumptions that were good 10-15 years ago are not valid anymore. This may sound ridiculous to say, but it's not so obvious for many. XML and HTML files already allow specifying any encoding, and so do emails and many Internet-related formats.

Of course, if you need to load/store internal data files, you can choose the encoding you like, probably one that matches your internal string encoding. At least, in this case, document it somewhere in the program documentation.

The input file encoding

Once you are convinced that the input encoding choice belongs to the user, we can go on with the first question: how do we know the input file encoding?
(If you are still not convinced, please go back to the previous section and convince yourself! Or just try to diff a UTF16 file using git and come back here.)

There are three ways to obtain the input file encoding:
  • You let the user specify it
  • Your program tries to auto detect it
  • You assume that all the input files have a predefined-fixed encoding

To let the user specify it, you have to give them a way to pass the encoding: it can be a command line switch, an option in the common dialog, or a predefined encoding that can be changed later.
Keep in mind that each text file can have a different encoding, so having a program default is good, but you should allow changing it for the specific file.

Let's see some examples:
  • LibreOffice lets you choose the encoding when opening a text file.
  • Qt Creator lets you choose a default encoding in the program options, but lets you change it later if it is not compatible with the contents of the file or if you see strange characters.
  • GCC lets you choose the encoding of source files with a command line switch (-finput-charset=charset).
The number of character encodings is really high, and each of these programs uses a specific library or framework to support them. For instance, GCC uses libiconv, so the list of supported charsets is defined here.

Auto-detecting the encoding

There are some ways to auto-detect the encoding of a text file: all of these are empirical, though they provide different levels of uncertainty.

Byte order mask

There is a convention for Unicode text files that allows storing the encoding used directly inside the text file: a predefined byte sequence is saved at the top of the file.
This predefined byte sequence is called the Byte Order Mark, abbreviated BOM.
Different byte sequences specify different encodings, following these rules:
  • Byte sequence:
    0xEF 0xBB 0xBF
    the file is encoded with UTF8
  • Byte sequence:
    0xFE 0xFF
    the file is encoded with UTF16, big endian
  • Byte sequence:
    0xFF 0xFE
    the file is encoded with UTF16, little endian
  • Byte sequence:
    0x00 0x00 0xFE 0xFF
    the file is encoded with UTF32, big endian
  • Byte sequence:
    0xFF 0xFE 0x00 0x00
    the file is encoded with UTF32, little endian
(there are other sequences to define the remaining Unicode encodings here).

When you open a text file, you can try to detect one of these sequences at the top of it and read the rest of the file accordingly.
If you don't find a BOM, just start again and fall back to a default encoding.

If you look at the byte sequences, you see that the values used can also be valid values in other encodings.
For instance, the UTF8 BOM is a perfectly valid character sequence in CP1252 (ï»¿); what if the user's text file contained exactly those characters at the top?
You are unlucky, and you will interpret the text file in the wrong way. This is unlikely, but it is not impossible.

This is why I referred to these methods as uncertain: one can't be 100% sure of the result, so even if you choose this auto-detection method it is wise to let the user override the detected encoding.

BOM Fallback

If you don't find a BOM inside the text file, you can fall back to another detection method or use a default encoding. Keep in mind that the BOM is optional even for UTF16 and UTF32, and it is even discouraged for UTF8.

  • On Windows systems, you can fall back to the system default code page and read the file as ANSI.
  • On Linux systems, you can fall back to the system “C” locale.
  • You can use other auto-detection methods.


Other autodetection methods

There are some other encoding detection methods.
Some encodings have unused ranges of values, so one can exclude a specific encoding if an invalid value is found.
Unicode files contain code-point sequences; detecting an invalid code-point sequence lets you exclude that specific encoding in the detection algorithm.
The Windows API has the infamous IsTextUnicode function, which uses some heuristics to guess the possible file encoding.
Other frameworks may have different detection functions.

I advise against using these functions, since you are adding uncertainty to your input handling process. Nonetheless, the auto-detected encoding can be used as a proposed value in the user interface.

Since none of these methods is 100% safe (not even the BOM one), always give priority to the user's choice.

An interesting link in this regard is http://blogs.msdn.com/b/oldnewthing/archive/2007/04/17/2158334.aspx : how notepad tries to detect a text file encoding.

Conversion

So you got a possible encoding for your file. There are two ways to load it and convert it to your internal encoding:
  • A buffer based approach: you load the entire file as binary data and perform the conversion in memory.
  • A stream based approach: you read and convert the file incrementally.

Here I propose a piece of pseudo-code that illustrates the steps needed to handle textual input in your program, using a buffer based approach.
string_type read_text_file (const std::string & filename)
{
 std::vector<char> buffer = read_file_as_binary(filename);
 encoding_t  default_encoding = get_locale_encoding();
 encoding_t detected_encoding = try_detect_encoding(buffer);
 encoding_t     user_encoding = get_user_choice( /* detected_encoding */ );
 encoding_t internal_encoding = /* get_locale_encoding() or UTF8 or ASCII */
 encoding_t input_encoding;

 if (user_encoding == UNK_ENCODING)
 {
  if (detected_encoding == UNK_ENCODING)
   input_encoding = default_encoding;
  else
   input_encoding = detected_encoding;

  if (is_unicode(detected_encoding) && has_bom(buffer))
   strip_bom(buffer);
 }
 else
  input_encoding = user_encoding;

 string_type str = convert_encoding(input_encoding, internal_encoding, buffer);
 return str;
}

string_type convert_encoding ( encoding_t input_encoding , encoding_t internal_encoding , std::vector<char> & buffer )
{
 adjust_endianness(input_encoding, buffer);
 if (input_encoding != internal_encoding)
  convert_cvt(input_encoding, internal_encoding, buffer);

 return string_type(reinterpret_cast<const internal_char_t *>(buffer.data()),
                    buffer.size() / sizeof(internal_char_t));
}


This code performs these operations:
  • Load the text file as untyped byte buffer.
  • Try to detect the encoding, and eventually let the user choose it.
  • If the buffer has a BOM (and the encoding is an UNICODE one), strip the BOM from the buffer.
  • Call the convert function to obtain an internally-encoded string:
  • If the endianness of the input file doesn't match the endianness of your system (when using a 16 or 32 bit encoding), adjust it and swap the bytes of the input buffer accordingly.
  • If the input encoding doesn't match your internal encoding, convert it. This operation is complex to implement, and usually done by existing libraries.
  • Finally, interpret the input buffer as your preferred character type, since it is converted to your internal encoding. You can construct a string from the buffer data.

Some notes:
  • In this case the default encoding is taken from the system. The program may have a configuration option to specify the default encoding.
  • The detection routine may be simpler, and just detect whether the text is Unicode or not. In that case you may treat the text as a fixed 8-bit code page or as UTF8.
  • If you are using C++ data types and strings, keep in mind that the encoding used (the internal encoding) can vary between systems. As we saw in the previous part, in Visual C++ std::string is ANSI, while in Linux it is UTF8 nowadays. The same applies to wchar_t.
  • If your internal encoding is an 8-bit code page (rather than UTF8), the text file can contain non-representable characters, which your conversion will discard or replace.

Some frameworks will do all of this for you, by letting you choose a text file and encoding, and resulting in a compatible string.

Detecting the BOM is easy.

As stated before, I won't propose an implementation for this pseudo-code. Anyway, just to illustrate one of these functions, here is a possible BOM detection routine, still buffer based:
/* Tries to detect the BOM */
encoding_t try_detect_encoding ( const std::vector<unsigned char> & buffer )
{
 /* note: in a complete version, the 4-byte UTF32 sequences must be
    checked before the 2-byte UTF16 ones */
 if (buffer.size() >= 3 && buffer[0] == 0xEF && buffer[1] == 0xBB && buffer[2] == 0xBF)
  return UTF8;
 else if (buffer.size() >= 2 && buffer[0] == 0xFE && buffer[1] == 0xFF)
  return UTF16_BE;
 else if (buffer.size() >= 2 && buffer[0] == 0xFF && buffer[1] == 0xFE)
  return UTF16_LE;
 /* ... the remaining sequences ... */
 else
  return UNK_ENCODING;
}
As for the other functions, there are not many standard C++ utilities, at least not enough to handle all the required cases. In Windows, the convert functions may be implemented with MultiByteToWideChar; in Qt the QTextCodec class provides most of these. There are many toolkits, frameworks, and libraries that do this specific thing, so you just have to pick your preferred one.

Using streams

Loading the entire file into a byte buffer can be unwanted, or even impossible, in some cases, and you may want to use streams instead.
C++ streams don't support Unicode well. Some library implementations are better than others, but you'll end up with non-portable code.
Also, remember that the execution character set may differ between compilers, so streams potentially behave differently with different data types. For instance, to write UTF8 data in Windows you still need to use wchar_t, because char streams are ANSI anyway (even if your internal encoding is forced to UTF8).
Since I don't usually use standard C++ streams for textual input, I spent some time looking for examples. I mostly found platform specific code, workarounds, and custom implementations. I won't propose any of these in this post.

Using existing frameworks

By contrast, look at how easy it is to read a text file using Qt.
int main()
{
    QFile file("in.txt");
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
        return 1;

    QTextStream in(&file);
    in.setCodec(input_encoding);  // "UTF-8" or any other...
    // in.setAutoDetectUnicode(true);
    while (!in.atEnd()) {
        QString line = in.readLine();
        process_line(line);
    }
    return 0;
}
Qt also lets you do the auto-detection, with the setAutoDetectUnicode function.

After many tests, implementations, and experimentation, I've reached a conclusion: using existing frameworks for textual input (and eventually output) is a really good choice.
This can be a real time saver, as you call one function and get the result, or a specific stream that already works well with any text encoding.
Usually I just need to set or detect the encoding and let the library do all the hard work.
Why bother with something unportable and complex like C++ iostreams instead?

Conclusions

One may think that text files are the easiest way to handle data, but it is clear that this is not the case anymore. Even reading a text file can be difficult and error-prone.
Your users will thank you if you let them make a choice (explicit or implicit).
Existing frameworks will help you with this; standard C++ support is still way behind. So what are you waiting for? Go on and replace your textual file input routines with better ones :)

Next time: textual output.

Sunday, November 18, 2012

Unicode and your application (3 of n)


Other parts: Part 1 , Part 2 , Part 4 , Part 5

Let's go into the actual Unicode support among different compilers and systems. In these examples I mainly mention Visual C++ and GCC, just because I have them at hand.
I'm interested in doing a Clang+Mac comparison too, but I don't have such a system available. Comments are welcome in this regard :)

Character types

C++ defines two character types: char and wchar_t.

char is 1 byte as defined by the standard. wchar_t size is implementation dependent.
These two different types allow you to make a distinction between different types of strings (and eventually encodings).

The first important thing to remember: wchar_t can have different sizes on different compilers or architectures.

For instance, in Visual C++ it is defined as a 16-bit type, while in GCC it is defined as a 32-bit type. So, in Visual C++ wchar_t can be used to store UTF16 text, and in GCC it can be used to store UTF32 text.

C++11 defines two new character types, char16_t and char32_t, which have fixed sizes on all compilers. But at the moment there is no standard library function support for them. You can still use these two types to write more portable string and character functions.

String literals

C++98 compilers support narrow and wide string literals.
const char * narrowStr = "Hello world!";
const wchar_t * wideStr = L"Hello world!";
  • Question 1: what is the encoding of the text inside narrow string literals?
  • Question 2: what is the encoding of the text inside wide string literals?
The answer for both: it's compiler specific.

The encoding may be an extended-ASCII one with a specific code-page, it can be UTF16 , it can be anything else.

In Visual C++ (from MSDN): "For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII." In reality, it seems that the resulting string encoding is based on the compiling system's code page, even if the source file is saved in UTF8. So, it's absolutely not UTF8.

In GCC, on modern systems, the default encoding for char string literals is UTF8. This can be changed with the compilation option -fexec-charset.

 std::string narrow = "This is a string.";
 std::string narrowWithUnicode = "This is a string with an unicode character : Ц";

The first string compiles fine in both compilers; the second one is not representable in Visual C++, unless your system uses the Windows Cyrillic code page.

Let's consider wide strings.
 std::wstring wide = L"This is a wide string"; 
In Visual C++ wide strings are encoded using UTF16, while in GCC the default encoding is UTF32. Other compilers may even define wide strings with an 8-bit type.

Along with the new data types, C++11 introduces new Unicode string literals. This is the way to go if you have to represent Unicode strings and characters in a portable way.
 const char * utf8literal = u8"This is an unicode UTF8 string! 剝Ц";
 const char16_t * utf16literal = u"This is an unicode UTF16 string! 剝Ц";
 const char32_t * utf32literal = U"This is an unicode UTF32 string! 剝Ц";

In this way you are sure to get the desired encoding in a portable way. But, as we are going to see, you will lack portable functions to handle these strings.

You may have noticed that there is no specific type for UTF8 string literals: char is already good for that, since UTF8 code units fit in one byte. char16_t and char32_t are the corresponding portable fixed-size types.
Note: as of today, Visual C++ doesn't support the new UTF string literals.

As you can see, using char and wchar_t can cause portability problems. Just compiling with two different compilers can make your string encoding different, leading to different requirements when it comes to the conversions mentioned in the previous part.

Character literals

Like strings, character literals behave differently between compilers.
Since UTF8 (and also UTF16) is a variable length encoding, it is not possible to represent characters that require more than one code unit with a single character item.
  char a = 'a'; // ASCII ,ok
  char b = '€'; // nope, in UTF8 the character is 3 bytes

Keep in mind that you can't store Unicode character literals that require more than one byte/word. The compiler will complain, and depending on what you are using, the char literal will probably be widened from char to int.

  std::cout << sizeof('€') << std::endl;  // 4 in GCC
  std::cout << sizeof("€") << std::endl;  // 3 in GCC 

The value 4 comes out because the character constant is widened to an int (with a warning), while the value 3 is the effective size required by UTF8 to represent the "€" character.

Note that in Visual C++ all of this is not even possible, since the literal string encoding is the "ANSI" one. You'll have to use wide literals to obtain similar results.
Note also that C++11 didn't define a new specific UTF8 character type, probably because a single byte wouldn't be sufficient in most cases anyway.

The safest and most portable way to store a generic Unicode code point in C++ is by using uint32_t (or char32_t) and its hex representation (or \uxxxx escapes).

Wide literals have the same kind of issue, but with a lower probability, because BMP characters require a single 16-bit word. If you have to store a non-BMP character, you will require a surrogate pair (two 16-bit words).

Variable length encoding and substrings

All of this leads to an interesting rule about Unicode strings: you can't do substring and subscript handling like you did before with fixed-length encodings.
   std::string S = "Hello world";
   char first = S[0];
   std::cout << "The first character is : " << first << std::endl;

This code works correctly only with a fixed-length encoding.
If your UTF8 character requires more than 1 byte to be represented, this code does not work as expected anymore.
   std::string S = u8"€";
   char first = S[0];
   std::cout << "The first character is : " << first << std::endl;
You won't see anything useful here: you are just displaying the first byte of the 3-byte sequence required to represent the character.
The same applies if you are doing string validation and manipulation. Consider this example:
std::string cppstr = "1000€";
int len = cppstr.length();

In GCC len is 7 (because the string encoding is UTF8), while in MSVC it is 5 because the encoding is Latin1.
This specific aspect can cause much trouble if you don't explicitly take care of it.

bool validate ()
{
    std::string pass = getpass();
    if (pass.length() < 5)
    {
        std::cerr << "invalid password length" << std::endl;
        return false;
    }
    return true;
}

This function can fail with a UTF8 string, because it checks the "character length" with an inappropriate function: std::string::length (or strlen) is not UTF8 aware. We need to use a UTF8 aware function as a replacement.

So, how do we implement the pass.length() replacement? Sorry to say, but the standard library doesn't help here. We will need an additional library to do that (like Boost or UTFCpp).

Another problematic issue arises when doing character insertion or replacement.
std::string charToRemove("€"), the_string;
std::string::size_type pos;

if ((pos = the_string.find(charToRemove)) != the_string.npos)
  the_string.erase(pos, 1);

Now, with variable length characters, you'll have to do:

if ((pos = the_string.find(charToRemove)) != the_string.npos)
  the_string.erase(pos, charToRemove.length());
because charToRemove is no longer 1 byte long.
Note three things:
  • This is true even for UTF16 strings, because non BMP characters can take up to 4 bytes.
  • In UTF32 , you won't have surrogate pairs, but still code-points sequences.
  • You should not use sizeof('€'), because the size of character literals longer than one word is compiler defined. In GCC sizeof('€') is 4, while strlen("€") is 3.

Generally, when dealing with Unicode you need to think in terms of substrings instead of single characters.

I will not discuss correct Unicode string handling and manipulation techniques, both because the topic is huge and because I'm not really an expert in this field.
The important thing is knowing that these issues exist and cannot be ignored.

More generally, it is much better to get a complete Unicode string handling library and use it to handle all the problematic cases.
Really.


Wide strings

UTF16 is often seen and used as a fixed-length Unicode encoding. This is not true: UTF16 strings can contain surrogate pairs, and both in UTF16 and UTF32 you can find code-point sequences.

As of today the probability of encountering these two situations is not very high, unless you are writing a word processor or an application that deals heavily with text.
It's up to you to define a compromise between effort and Unicode compatibility, but before converting your app to wide chars, at least consider using specific Unicode library functions (and eventually staying UTF8).

Conversions

When using Unicode strings you will need to perform four types of encoding conversions:
  • From single-byte code-paged encodings to another single-byte code-paged encoding
  • From single-byte code-paged encodings to UTF8/16/32.
  • From UTF8/16/32 to single-byte code-paged encodings (lossy)
  • From UTF8/16/32 to UTF8/16/32

Be careful with the third kind of conversion, as it is lossy. In that case you are converting from a Unicode encoding (with over a million possible characters) to an encoding that only supports 255 characters.
This can end up corrupting your string, as the non-representable characters will be encoded with an "unknown" character (i.e. "?"), usually chosen from one of the allowed characters.
While knowing how to perform these conversions can surely be interesting, you are probably not willing to implement them :)

Unfortunately, once again, the C++ standard library doesn't help you at all.
You can use platform specific functions (such as MultiByteToWideChar), but the better option is to find a portable library that does it for you.

Even after C++11, the standard library seriously lacks Unicode support. If you go that way, you'll find yourself with portability problems and missing functions.
For instance, whether it is a good choice or not, if you end up using wide strings you'll find inconsistent support for the wide versions of the C functions, missing functions, and incomplete or buggy stream support.

Finally, if you choose to use the new char16_t or char32_t types, library support is still nonexistent today.


Windows specific issues

As we have seen, Visual C++ defines wchar_t as 16-bit and wide string encoding as UTF16. This is mostly because the Windows API is based on and works with UTF16 strings.
For compatibility reasons, there are two different versions of each API function: one that accepts wide strings and one that accepts narrow strings.

Here comes the ambiguity again: what is the encoding required by the Windows API for its char strings?

From MSDN : All ANSI versions of API functions use the currently active code page.

The program's active code page is inherited from the system default code page, which can be changed by the user. The active code page can also be changed by the program at runtime.
So, even if you internally enforce the usage of UTF8, you will need to convert your UTF8 strings to an "ANSI" encoding, and probably lose your Unicode data.

Visual C++ narrow strings are ANSI strings. This is probably because they wanted to keep compatibility with the ANSI version of the API.

Note: I don't really see this as a reason to complain. If you look back, this is a more compatible choice than converting the whole thing to UTF8, which is in fact a completely different representation.

This is a big portability issue, and probably the most ignored one.

In Windows, using wchar_t could be a good choice for enabling correct Unicode support. The natural consequence would be using wchar_t strings, but keep in mind that other operating systems and compilers may have more limited wide-char support.

In Windows, if you want to correctly support a minimum set of Unicode, I suggest the following:
  • Compile your program with UNICODE defined. This doesn't mean that you are forced to use wide strings, but that you are using the UNICODE version of the APIs
  • Use wide API functions, even if you are internally using narrow strings.
  • Keep in mind that narrow strings are ANSI ones, not UTF8 ones. This is reflected in all the standard library functions available in Visual C++.
  • If you need portability, don't use wchar_t as your string type, or be prepared to switch at compilation time. (Ironically enough, the infamous TCHARs could ease code portability in this case.)
  • Be careful with UTF8 encoded strings, because the standard library functions may not support these.
As of today, if you keep using ANSI functions and strings, you will limit the possibilities for your users.
int main (int argc, const char * argv[])
{
    FILE * f = fopen(argv[1], "r");
}
If your user passes a Unicode filename (i.e. one outside the current code page), your program simply won't work, because Windows will corrupt the Unicode data by converting the argument to an 8-bit "ANSI" string.

Note: in Windows it is not possible to force the ANSI API version to use UTF8.

A quick recap

I hope I have made it clear that there are many perils and portability issues when dealing with strings.
  • Take a look at your compiler settings and understand what is the exact encoding used.
  • If you write portable code, be careful about the different compiler and system behaviors.
  • Using UTF8 narrow strings can cause issues with existing non-UTF8 string handling routines and code.
  • Choose a specific unicode string handling and conversion library if you are going to do any kind of text manipulation.
  • In theory, every string instance and argument passing could require a conversion. The number and type of conversions can change at compile time depending on the compiler, and at runtime depending on the user's system settings.

What unicode support library are you using in your projects?