Sunday, November 18, 2012

Unicode and your application (3 of n)


Other parts: Part 1 , Part 2 , Part 4 , Part 5

Let's look at the actual Unicode support in different compilers and systems. In these examples I mainly mention Visual C++ and GCC, simply because I have them at hand.
I'm interested in doing a Clang+Mac comparison too, but I don't have such a system available. Comments are welcome in this regard :)

Character types

C++ defines two character types: char and wchar_t.

char is 1 byte, as defined by the standard. The size of wchar_t is implementation dependent.
These two different types make it possible to distinguish between different kinds of strings (and, potentially, encodings).

The first important thing to remember: wchar_t can have a different size on different compilers or architectures.

For instance, in Visual C++ it is defined as a 16-bit type, while in GCC it is defined as a 32-bit type. So, in Visual C++ wchar_t can be used to store UTF16 text, while in GCC it can be used to store UTF32 text.

C++11 defines two new character types, char16_t and char32_t, which have fixed sizes on all compilers. At the moment, however, there is almost no standard library support for them. You can use these two types to write more portable string and character functions.
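
For example, a quick illustrative check of the size guarantees on a typical platform (static_assert is itself a C++11 feature):

 static_assert(sizeof(char16_t) == 2, "char16_t is a 16-bit type");
 static_assert(sizeof(char32_t) == 4, "char32_t is a 32-bit type");
 // no such guarantee exists for wchar_t: it is 2 bytes with Visual C++ and 4 bytes with GCC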

String literals

C++98 compilers support narrow string literals and wide string literals.
const char * narrowStr = "Hello world!";
const wchar_t * wideStr = L"Hello world!";
  • Question 1 : what is the encoding of the text inside string literals?
  • Question 2 : what is the encoding of the text inside wide string literals?
Answer for both 1 and 2: it's compiler specific.

The encoding may be an extended-ASCII one with a specific code page, it may be UTF16, or it may be anything else.

In Visual C++ (from MSDN): "For the Microsoft C/C++ compiler, the source and execution character sets are both ASCII." In reality, the resulting string encoding seems to be based on the code page of the system doing the compilation, even if the source file is saved as UTF8. So it is definitely not UTF8.

In GCC, on modern systems, the default encoding for char string literals is UTF8. This can be changed with the compilation option -fexec-charset.
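
For example, an illustrative command line (both options exist in GCC; the file name is just a placeholder):

 g++ -finput-charset=UTF-8 -fexec-charset=UTF-8 main.cpp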

 std::string narrow = "This is a string.";
 std::string narrowWithUnicode = "This is a string with an unicode character : Ц";

The first string compiles fine with both compilers; the second one is not representable in Visual C++ unless your system uses a Windows Cyrillic code page.

Let's consider wide strings.
 std::wstring wide = L"This is a wide string"; 
In Visual C++ wide strings are encoded as UTF16, while in GCC the default encoding is UTF32. Other compilers may even define wchar_t as an 8-bit type.

Along with the new character types, C++11 introduces new Unicode string literals. These are the way to go if you have to represent Unicode strings and characters portably.
 const char * utf8literal = u8"This is an unicode UTF8 string! 剝Ц";
 const char16_t * utf16literal = u"This is an unicode UTF16 string! 剝Ц";
 const char32_t * utf32literal = U"This is an unicode UTF32 string! 剝Ц";

This way you are sure to get the desired encoding in a portable way. But, as we are going to see, you will still lack portable functions to handle these strings.

You may have noticed that there is no specific character type for UTF8 string literals: char is already good for that, since it is a fixed-size 1-byte type. char16_t and char32_t are the corresponding portable fixed-size types for UTF16 and UTF32.
Note: as of today, Visual C++ doesn't support these Unicode string literals.

As you can see, using char and wchar_t can cause portability problems. Just compiling with two different compilers can give your strings different encodings, leading to different requirements when it comes to the conversions mentioned in the previous part.

Character literals

As with strings, we have different behavior between compilers.
Since UTF8 (and also UTF16) is a variable-length encoding, it is not possible to represent characters that require more than one byte/word with a single character object.
  char a = 'a'; // ASCII ,ok
  char b = '€'; // nope, in UTF8 the character is 3 bytes

Keep in mind that you can't store, in a single char, a Unicode character literal that needs more than one byte/word. The compiler will complain and, depending on the context, the character literal will probably be widened from char to int.

  std::cout << sizeof('€') << std::endl;  // 4 in GCC
  std::cout << strlen("€") << std::endl;  // 3 in GCC

The value 4 comes out because a multi-byte character constant is widened to an int (with a warning), while the value 3 is the number of bytes UTF8 needs to represent the "€" character (note that sizeof("€") would be 4, since it also counts the terminating null).

Note that in Visual C++ none of this is even possible, since the narrow literal encoding is the "ANSI" one. You'll have to use wide literals to obtain similar results.
Note also that C++11 didn't define a new specific UTF8 character type, probably because a single byte wouldn't be sufficient in most cases anyway.

The safest and most portable way to store a generic Unicode code-point in C++ is to use uint32_t (or char32_t) and its hex representation (or \uXXXX escapes).
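
For example (a minimal illustrative snippet):

 char32_t euro = U'\u20AC';  // the '€' code-point, U+20AC
 uint32_t euro2 = 0x20AC;    // the same value as a plain hex constant (uint32_t comes from <cstdint>)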

Wide literals have the same kind of issue, although it shows up less often, because BMP characters fit in a single 16-bit word. If you have to store a non-BMP character, you will need a surrogate pair (two 16-bit words).
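
For example (illustrative, and remember that Visual C++ doesn't support these literals yet):

 const char16_t * bmpChar    = u"€";  // U+20AC is in the BMP: one 16-bit unit (plus the terminator)
 const char16_t * nonBmpChar = u"𝄞";  // U+1D11E is outside the BMP: two 16-bit units (a surrogate pair)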

Variable length encoding and substrings

All of this leads to an interesting rule about Unicode strings: you can't do substring and subscript handling the way you did with fixed-length encodings.
   std::string S = "Hello world";
   char first = S[0];
   std::cout << "The first character is : " << first << std::endl;

This code works correctly only with a fixed-length encoding.
If your UTF8 character requires more than 1 byte to be represented, this code does not work as expected anymore.
   std::string S = u8"€";
   char first = S[0];
   std::cout << "The first character is : " << first << std::endl;
You won't see anything useful here: you are just displaying the first byte of the 3-byte sequence required to represent the character.
The same applies if you are doing string validation and manipulation. Consider this example:
std::string cppstr = "1000€";
int len = cppstr.length();

In GCC len is 7 (because the string encoding is UTF8 and '€' takes 3 bytes), while in MSVC it is 5 (because the encoding is a single-byte code page, such as Windows-1252, where '€' takes one byte).
This specific aspect can cause much trouble if you don't explicitly take care of it.

bool validate ()
{
   std::string pass = getpass();
   if (pass.length() < 5)
   {
      std::cerr << "invalid password length" << std::endl;
      return false;
   }
   return true;
}

This function can misbehave with a UTF8 string, because it checks the "character length" with an inappropriate function.
std::string::length (or strlen) is not UTF8 aware: it counts bytes, not characters. We need a UTF8-aware function as a replacement.

So, how do we implement the pass.length() replacement? Sorry to say this, but the standard library doesn't help here. We will need an additional library to do that (like Boost or UTF8-CPP).
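
Just to give the idea, here is a minimal sketch of such a replacement (mine, not production code): it counts code-points by skipping UTF8 continuation bytes, does no validation, and still counts code-points rather than user-perceived characters:

 #include <string>
 #include <cstddef>

 std::size_t utf8_length(const std::string & s)
 {
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i)
       if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80)  // not a continuation byte (10xxxxxx)
          ++count;
    return count;
 }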

Another problem arises when doing character insertion, removal, or replacement.
std::string charToRemove("€"), the_string;
std::string::size_type pos;

if ((pos = the_string.find(charToRemove)) != the_string.npos)
  the_string.erase(pos, 1);

Now, with variable-length characters, you'll have to do:

if ((pos = the_string.find(charToRemove)) != the_string.npos)
  the_string.erase(pos, charToRemove.length());

because charToRemove is not 1 byte long anymore.
Note three things:
  • This is true even for UTF16 strings, because non-BMP characters take two 16-bit words (4 bytes).
  • In UTF32 you won't have surrogate pairs, but you can still have code-point sequences (combining characters).
  • You should not use sizeof('€'), because the size of a multi-byte character constant is compiler defined. In GCC sizeof('€') is 4 (an int), while strlen("€") is 3.

Generally, when dealing with Unicode you need to think in terms of substrings instead of single characters.
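
As an illustration (again a sketch of mine, not production code), extracting the "first character" of a UTF8 string means taking the whole byte sequence of its first code-point:

 std::string first_utf8_codepoint(const std::string & s)
 {
    if (s.empty())
       return std::string();
    std::string::size_type len = 1;
    while (len < s.size() && (static_cast<unsigned char>(s[len]) & 0xC0) == 0x80)
       ++len;                      // include the continuation bytes
    return s.substr(0, len);
 }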

I will not discuss correct Unicode string handling and manipulation techniques, both because the topic is huge and because I'm not really an expert in this field.
The important thing is knowing that these issues exist and cannot be ignored.

More generally, it is much better to get a complete Unicode string handling library and use it to handle all the problematic cases.
Really.


Wide strings

UTF16 is often seen and used as a fixed-length Unicode encoding. This is not true. UTF16 strings can contain surrogate pairs. Both in UTF16 and UTF32 you can find code-point sequences.

As of today the probability of running into these two situations is not very high, unless you are writing a word processor or another application that deals heavily with text.
It's up to you to find a compromise between effort and Unicode compatibility, but before converting your app to wide chars, at least consider using dedicated Unicode library functions (and possibly staying with UTF8).

Conversions

When using Unicode strings you will need to perform four possible types of encoding conversion:
  • From single-byte code-paged encodings to another single-byte code-paged encoding
  • From single-byte code-paged encodings to UTF8/16/32.
  • From UTF8/16/32 to single-byte code-paged encodings (lossy)
  • From UTF8/16/32 to UTF8/16/32

Be careful with the third kind of conversion, as it is lossy. In that case you are converting from a Unicode encoding (with over a million possible characters) to an encoding that only supports 256 characters.
This can end up corrupting your string, as the non-representable characters will be encoded with an "unknown" character (e.g. "?"), usually chosen from one of the allowed characters.
While knowing how to perform these conversions can surely be interesting, you probably don't want to implement them yourself :)

Unfortunately, once again, the C++ standard libraries don't help you at all.
You can use platform-specific functions (such as MultiByteToWideChar on Windows), but the better option is to find a portable library that does it for you.
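
For example, on Windows a UTF8 to UTF16 conversion looks more or less like this (a hedged sketch, with error handling omitted):

 #include <windows.h>
 #include <string>

 std::wstring utf8_to_wide(const std::string & utf8)
 {
    if (utf8.empty())
       return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), NULL, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
 }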

Even after C++11, the standard library seriously lacks Unicode support. If you rely on it, you'll find yourself with portability problems and missing functions.
For instance, good choice or not, if you end up using wide strings you'll find inconsistent support for the wide versions of the C functions, missing functions, and incomplete or buggy stream support.

Finally, if you choose to use the new char16_t or char32_t types, library support is still practically nonexistent today.


Windows specific issues

As we have seen, Visual C++ defines wchar_t as a 16-bit type and uses UTF16 as the wide string encoding. This is mostly because the Windows API is based on and works with UTF16 strings.
For compatibility reasons, there are two versions of each API function that takes strings: one that accepts wide strings and one that accepts narrow strings.

Here comes the ambiguity again: what is the encoding required by the Windows API for its char strings?

From MSDN : All ANSI versions of API functions use the currently active code page.

The program's active code page is inherited from the system default code page, which can be changed by the user. The active code page can also be changed by the program at runtime.
So, even if you internally enforce the usage of UTF8, you will need to convert your UTF8 strings to an "ANSI" encoding, probably losing your Unicode data.

Visual C++ narrow strings are ANSI strings. This is probably because they wanted to keep compatibility with the ANSI version of the API.

Note: I don't really see this as something to complain about. Historically, it is a more compatible choice than converting everything to UTF8, which is in fact a completely different representation.

This is a big portability issue, and probably the most ignored one.

In Windows, using wchar_t could be a good choice for enabling correct Unicode support. The natural consequence would be using wchar_t strings throughout, but keep in mind that other operating systems and compilers may have more limited wide-char support.

In Windows, if you want to correctly support a minimum set of Unicode, I'd suggest that you:
  • Compile your program with UNICODE defined. This doesn't mean that you are forced to use wide strings, but that you are using the UNICODE version of the APIs.
  • Use the wide API functions, even if you are internally using narrow strings.
  • Keep in mind that narrow strings are ANSI ones, not UTF8 ones. This is reflected in all the standard library functions available in Visual C++.
  • If you need portability, don't use wchar_t as your string character type, or be prepared to switch at compile time. (Ironically enough, the infamous TCHARs could ease code portability in this case.)
  • Be careful with UTF8-encoded strings, because the standard library functions may not support them.
As of today, if you keep using ANSI functions and strings, you will limit the possibilities for your users.
int main (int argc, const char * argv[])
{
    FILE * f = fopen(argv[1],"r");
}
If your user passes a Unicode filename (i.e. one containing characters outside the current code page), your program simply won't work, because Windows will corrupt the Unicode data by converting the argument to an 8-bit "ANSI" string.

Note: in Windows it is not possible to force the ANSI API version to use UTF8.
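
A minimal sketch of the wide-API alternative (Visual C++ specific; error handling kept to a minimum):

 #include <cstdio>

 int wmain (int argc, wchar_t * argv[])
 {
     if (argc < 2)
         return 1;
     FILE * f = _wfopen(argv[1], L"r");  // the UTF16 filename reaches the OS unmodified
     if (f)
         fclose(f);
     return 0;
 }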

A quick recap

I hope I have made it clear that there are many perils and portability issues when dealing with strings.
  • Take a look at your compiler settings and understand exactly which encodings are used.
  • If you write portable code, be mindful of the different compiler and system behaviors.
  • Using UTF8 narrow strings can cause issues with existing non-UTF8 string handling routines and code.
  • Choose a specific Unicode string handling and conversion library if you are going to do any kind of text manipulation.
  • In theory, every string instance and argument pass could require a conversion. The number and type of conversions can change at compile time depending on the compiler, and at runtime depending on the user's system settings.

Which Unicode support library are you using in your projects?

Wednesday, November 14, 2012

Unicode and your application (2 of n)


Other parts: Part 1 , Part 3 , Part 4 , Part 5

In this part I will speculate a bit about C++ string encodings and the conversions they may require.

Post-editing note: after re-reading this many times, I see that it probably ended up way more theoretical and preachy than practical; instead of throwing it away, I'll publish it anyway, since it raises some real issues.
Part 3, which I'm editing right now, is much more practical :)


Some clarifications

After writing the first article of this series, I got some feedback on reddit.
I'd like to clarify some points and make a quick recap:
  • ASCII is the original commonly used 7-bit encoding with 128 characters.
  • Extended-ASCII encodings define a maximum of 256 characters. ISO 8859-1 is an extended-ASCII encoding, also called Latin1. The common Windows-1252 encoding assigns printable characters to 32 values that Latin1 leaves unused, but is otherwise almost identical to Latin1.
  • Especially on Windows, the term ANSI is often used to refer to 8-bit single-byte encodings, even if this is not totally correct.
  • A code-page specifies the set of available characters and a way to encode them in a binary representation. "Character encoding" is a synonym for code-page. Each encoding or code-page has an identifier: UTF8, Windows-1252, OEM and ASCII are all encoding names.
  • A system default code page is used to interpret 8-bit character data. Each system has a default code-page, which is used to interpret and encode text files, user textual input, and any other form of text data. Applications can change the default encoding they use. C locales have a similar setting and behavior.
  • UTF8 is backwards compatible with 7-bit ASCII data. It is not backwards compatible with any other 8 bit encoding, even the Latin1 one.
  • UTF16 is a variable-length encoding. It requires a single 16-bit word for characters in the BMP (Basic Multilingual Plane), but it requires two 16-bit words for characters in other Unicode planes.
  • Characters and code-points are the same thing in Unicode. Multiple code-points can be combined to form a "glyph", i.e. a "character displayed to the user", also called a "grapheme cluster".
    For instance, a letter code-point followed by an accent code-point will result in an accented letter glyph.
  • I didn't mention normalization forms at all. That's another important concept, worth a read.
Thanks to everyone who commented.

The compatibility problem

In C/C++ the historical data type for string characters is char.
The char data type is defined by the standard as a 1-byte type. It may be signed or unsigned depending on the compilation settings.
As we have seen in the previous post, an 8-bit data type is used by various character encodings:
  • UTF8
  • ASCII and extended-ASCII encodings
  • (less common) multi-byte non-unicode
Note that the "extended-ASCII" item in reality groups a lot of 8-bit encodings.

In a piece of C/C++ software, if you see code like this:
const char * mybuf = getstring ();
how do you tell the actual encoding of the string?

You can't, unless the getstring() function explicitly documents it somewhere else.

In fact a char * or std::string can hold text in ANY 8-bit encoding you want, and there are plenty of 8-bit encodings. Moreover, you can have one std::string holding UTF8 text, another holding Windows-1252 text, and yet another holding 7-bit ASCII.

Let's consider two simple implementations of the getstring function

const char * getstring () 
{
   return "English!";
}
const char * getstring()
{
   return "Français!"
}

What is the encoding used in the returned strings?

The English version is composed of ASCII characters, so its representation would be the same in any ASCII-compatible encoding. The French version could result in a UTF8-encoded string or in an extended-ASCII one, perhaps using the Latin1 encoding.

At least, in this case, you can take a look at the functions and deduce the encoding (possibly by checking the compiler documentation).

Another example:

const char * getstring ()
{
    return external_lib_function();
}

In this case, you have to read the documentation and hope that it states the encoding used by the function.
Otherwise, one can only assume that it uses the default encoding of the C/C++ runtime.

In C++11 you can find
const char * getstring ()
{
    return u8"捯橎 びゅぎゃうク";
}

In this case it is certain that the returned string is encoded as UTF8.

But what if external_lib_function had returned a string like that?

The point is that char strings can hold a wide variety of encodings. For each instance of std::string in your program, you need to know exactly which encoding that particular instance is using.

Note: this holds true even for wide chars, even if it's mitigated by the relatively small number of 16-bit and 32-bit encodings.

Do these issues arise in reality?

At this point you may ask why bother with these issues at all. While the encoding in use may be easy to identify in a relatively simple piece of software, when you have a good mix of libraries and components things get more subtle.
In addition, UTF8 literals can effectively make this worse. What if one of your libraries starts using UTF8 literals and the rest of your system does not? You could end up having UTF8 std::strings and code-paged 8-bit std::strings in the same software. (Note: this heavily depends on the compiler and compilation settings.)

Just to name a few cases:
  • Your function (or compiler) generates UTF8 literals, while your locale is set to use a different encoding.
  • Your function (or compiler) generates UTF8 literals and you are using them as ANSI strings (say, passing them to an ANSI Windows API).
  • A library you use doesn't correctly handle ANSI file input and returns strings in ANSI format while your locale uses UTF8 (or vice versa).
  • You write functions to handle UTF8 strings, but the strings are in a different encoding because of the system settings.

Theoretical string handling and conversions

Theoretically, in function calls, each input string argument would require a conversion to the required input encoding, and each output string argument would require a conversion to the required output encoding.
To keep things consistent, you will need to keep in mind:
  • The string encoding used by your compiler.
  • The string encoding used by the libraries and external functions you are using.
  • The string encoding used by the system APIs you are compiling for.
  • The string encoding used by your implementation.

The conceptually correct way to move strings around would be using something like this:

typedef std::pair<encoding_t,std::string> enc_string_t;
i.e. bring the string encoding along with the string. The conversions would be done accordingly using a (conceptual) conversion like this:
enc_string_t convert (const enc_string_t & , encoding_t dest_encoding);
Fortunately you don't have to do all of these conversions every time, nor carry the string encoding along. Still, keeping this aspect in mind when writing code helps to avoid errors.
const char * provider_returning_windows1252();
void consumer(const char * string_utf8);
Proper string handling would require a conversion (pseudo code):
const char * ansiStr = provider_returning_windows1252();
consumer(convert(ansiStr,UTF8));
Of course, if you are sure that both the provider and the consumer use the same encoding, the conversion becomes an identity and can be removed.
const char * ansiStr = provider_returning_utf8();
consumer(ansiStr);
Unfortunately, the conversion is often omitted even when it is not an identity.
const char * ansiStr = provider_returning_windows1252();
consumer(ansiStr);
In this case the consumer is called with a string in the wrong encoding, which will probably lead to corruption, unless ansiStr contains only characters in the range 0-127.
So, how can we avoid this kind of problem?
Unfortunately the language doesn't help you.
std::string utf8 = convert_to_utf8(std::string & ascii);
std::string ascii = convert_from_utf8(std::string & utf8);
As we have seen, the data type is the same in both cases. Maybe this would be easier to understand:
std::basic_string<utf8char_t> utf8 = convert_to_utf8(std::string & ascii);
std::string ascii = convert_from_utf8(std::basic_string<utf8char_t> & utf8);
For the same reason, people tend to ignore the UTF8->ASCII conversion, which in the worst case is LOSSY.
I would really have liked to see a utf8_t type in the new C++, at least to give programmers a chance to explicitly mark UTF8 strings. Anyway, nothing prevents you from defining and using such a type.
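
For example, something as simple as this (purely illustrative, the names are mine) already makes the contract visible in function signatures:

 #include <string>

 struct utf8_string
 {
     explicit utf8_string(const std::string & bytes) : data(bytes) {}
     std::string data;   // assumed to always contain valid UTF8
 };

 void consumer(const utf8_string & s);   // the expected encoding is now part of the API
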
Lacking language support, to find the needed conversions one has to review and edit the code:
  • Review all library usage and calls, and determine the string type and encoding each one expects. This includes external libraries and operating system APIs.
  • Place appropriate conversions where needed or missing.
  • Enforce a single encoding in your application source code.
This last point is important to remove the ambiguities, at least inside the part of the source code you control, and to make it more readable.

Mitigation factors

All of these potential conversions may appear a bit scary and heavy to handle, especially for existing code. If you look at open source software (or even closed source), you will hardly find any explicit handling of this kind.
This is because some mitigating factors exist that, in practice, remove the need for most of the conversions.
  • If all the libraries use the same encoding and rules, there's no need for conversions at all. This is very common, for instance, in C/C++ programs. Modern Linux systems use UTF8 as the default locale encoding, so if you accept this, you just need to make sure that your input/output is UTF8 too, and that the libraries you use follow the same conventions.
    Unfortunately this heavily depends on the compiler and possibly on the system settings.
  • If you use an application framework that has already chosen a specific encoding and does everything for you, the need for conversions is limited.
  • If you don't explicitly generate or manipulate the strings of your program, these aspects can mostly be ignored.
    For instance,
    int main (int argc, const char * argv[])
    {
        FILE * f = fopen(argv[1],"r");
    ...
    
    In this case, you are just passing the input argv[1] string to the fopen function. Since both arguments follow the same "C" locale setting, they already match and no conversion is needed.

In the end you'll probably be able to drop about 90% of the theoretically required conversions.

Conclusions

In writing this article, to illustrate the points, I took the "always convert" approach and then excluded the identity conversions.
When coding, you usually take the opposite approach, because most of the conversions are not needed in practice. Still, considering and thinking about them is always important, especially when making inter-library or API calls.
In part 3 I will examine some peculiar aspects of Unicode string handling in C/C++, and some big differences between Linux and Windows compilers. You'll see that there are many cases where the encoding is not obvious, and possibly not coherent between compilers and systems.

Sunday, November 11, 2012

Unicode and your application (1 of n)

If you look around in the open source ecosystem, and also in the non-open-source one, you will find a good number of applications that don't support Unicode correctly.


I also find that many programmers don't know exactly what "supporting unicode" means, so they end up not supporting it at all.

Handling Unicode correctly doesn't mean knowing all of the standard and/or the transformation rules; it means making some choices and following them consistently across the application.

Ignoring it will cause frustration for your users, especially when you are dealing with textual input and output.

Some of the fundamental points of unicode support include:

  • Having a basic knowledge of what unicode means and of the commonly used encodings
  • Choosing a proper string representation to use inside your application.
  • Correctly handling textual input.
  • Correctly handling textual output.

I won't write about advanced Unicode text handling and transformations. At the end of this series you won't be able to write a Unicode text editor; but at least your application will stop corrupting the user's text files :)

Many concepts here apply to any programming language; anyway, I'll focus a bit more on C++ and in particular on the new C++11 Unicode features.
Also, I'll treat the topic from a high-level point of view, so I won't discuss the specific encoding representations, etc.

Understanding UNICODE


This is a really huge and confusing topic. But it's also a fundamental prerequisite for every programmer nowadays.

There is a lot of material on the Internet, both generic and focused on a specific programming language.
Wikipedia gives a good quick start.
Oh, and of course the Unicode Consortium site is the most reliable source of information about this topic.

Also, this article by "Joel on Software" gives a good introduction to Unicode and code points.
I suggest reading it; it overlaps with (and is more complete than) what I'm writing in this first part.

A quick overview of encoding and formats

Explaining this exactly and completely would take really long. I'll write a quick recap of the concepts, focusing on some facts that people usually tend to ignore.

From Wikipedia: "Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems." Unicode is a standard.
The standard defines a (big) set of characters that can be represented by a computer system: the code-point values currently range from 0 to 10FFFF (hexadecimal).


A code-point is any acceptable value in the Unicode character space, i.e. any value defined by the Unicode standard: an integer in the range from 0 to 10FFFF (hexadecimal).

Unicode code-points and characters are theoretical, defined by a set of tables inside an international standard. 

Not every Unicode character is composed of a single code-point: the standard defines code-point sequences that result in a single character. For example, a letter code-point followed by an accent code-point will result in an accented character.

Anyway, you will find that in real-world usage most code-points represent a single Unicode character. But please don't rely on this assumption, or your program will fail the first time it encounters a code-point sequence.



An encoding is defined as the way you choose to represent your set of characters inside computer memory.
An encoding defines how you physically represent the theoretical code-point.

Given that you have 10FFFF (hexadecimal) code-points to represent, you need to choose a way to store them using your preferred data type. For instance, you may choose an unsigned 32-bit integer: in this case you need just one word per code-point. By choosing a 16-bit data type you will need up to two words per code-point, and with an 8-bit type you'll need up to 4 bytes per code-point.

An encoding can use a fixed number of words per code-point (fixed-length) or a variable number (variable-length).

Fact #1: no matter which encoding you choose, the important thing to remember is that a single Unicode code-point can take more than one byte/word.

To recap:
- A character is fully determined by one or more code-points (theoretical level)
- A code-point can be composed of multiple bytes or words, depending on the encoding (physical level)


This has big implications, especially for C/C++ users, because special care is needed when doing string manipulation.



Let's take a quick overview of the commonly used unicode and non-unicode encodings.

Code-page 

The code page is a number that defines how a specific text is encoded.
You can see the code-page as a number that identifies the encoding.
Code pages usually indicate single-byte character sets, but they can also indicate multi-byte character sets and Unicode ones.

A code-page can identify a single-byte encoding (ANSI/ASCII), a non-Unicode multi-byte encoding, or a Unicode encoding.

Single-byte non-unicode encodings (ASCII,ANSI) 

In the early days computers used to represent strings as 7/8-bit characters. That meant you had at most 256 different characters. Usually the first 128 (7 bits) were defined as a common set of characters (the common English alphabetical and numeric characters).
The upper 128 characters actually differ between languages. For instance, French computer systems have a different 128-255 character set from Italian ones, and from English ones.

There's a lot of confusion about the terms ASCII, ANSI, and code pages. From a practical standpoint, we can say that:

Fact #2: as of today, the complete set of 256 characters allowed by an ASCII/ANSI encoding is defined by the code page. A single-byte code page binds each of the 0-255 values to a specific "theoretical" character.
Each system has one default code page. In Windows you can see your system default code page in the international/language options.


Using an ASCII encoding limits your software to at most 256 different characters inside a string. This also means that you will not be able to represent all the Unicode characters.

As stated above, code pages can even indicate a multi-byte or Unicode encoding: I will refer to single-byte encodings as ASCII code-paged encodings.

Fact #3: single-byte code-paged encodings are often referred to as ANSI or ASCII encodings. While these terms are not fully correct, they can usually be interpreted as "8-bit single-byte encodings". Each character is fully represented as an 8-bit value, i.e. you have a 1-byte-per-character encoding.

This kind of encoding is commonly used as it perfectly fits the C "char" data type and the stdlib functions.

Multi byte non-unicode encodings 

For completeness: some code pages refer to multi-byte encodings not defined by the Unicode standard. These were used for East Asian languages, which require more than 256 symbols.

Unicode encodings 


Unicode encodings represent each Unicode code-point with one or more words of a defined size.

UTF-8 encoding :


The encoding uses an 8-bit word and a variable-length scheme to represent a single code-point. Each code-point can use up to 4 bytes, but the plain ASCII characters (values 0-127) require just 1 byte.
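
For example, "A" (U+0041) takes 1 byte in UTF-8, "é" (U+00E9) takes 2 bytes, "€" (U+20AC) takes 3 bytes, and a character outside the BMP such as "𝄞" (U+1D11E) takes 4 bytes.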

UTF16 encoding:

It is still a variable-length encoding. The code-points of the Basic Multilingual Plane are represented exactly with a single 16-bit value. Code-points with a value greater than 0xFFFF use a two-word surrogate sequence.
Windows uses UTF16 as its internal representation.

UTF32 encoding:

Uses a 32-bit word (4 bytes) to represent each Unicode code-point exactly. While it can be used as an internal representation, it's not commonly used for text storage.

Unicode Endianness  

In non-8-bit encodings, each word can be stored as BIG-ENDIAN or LITTLE-ENDIAN. So you have UTF16-BE and UTF16-LE: they define the same encoding with different storage rules in computer memory. The same applies to UTF32 and any other encoding with words larger than 8 bits.
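
For example, the code-point U+20AC ('€') is stored as the byte sequence 20 AC in UTF16-BE and as AC 20 in UTF16-LE; a BOM (the code-point U+FEFF placed at the start of the text) is often used to detect which of the two variants was used.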
 --
As you can see, there are many ways to encode Unicode and non-Unicode text. Each encoding has its specific rules. This leads to a simple, yet fundamental fact:

Fact #4: every transition from one encoding to another requires a conversion.

I.e. you cannot interpret code-paged 8-bit text as UTF8 without doing a conversion. The same holds true for text encoded using two different code-pages, and for different endianness. A conversion is also required to go from UTF8 to UTF16 (you cannot just stick an additional 0 in front of each value).

So, even if you have an 8-bit ASCII-codepaged text, you cannot use it as UTF8.

Let me repeat that: UTF8 is different from ANSI/ASCII text, and unless you are in the lucky case of using only the first 128 values of the character set, you need a conversion for this case too.

All of this may appear obvious, but ignoring Fact #4 is the most common source of encoding problems (and also the motivation behind this article).

Unicode support in different operating systems


Windows supports two different sets of APIs: ANSI and UNICODE. Internally, the storage of choice is UTF-16, and ANSI strings get converted on the fly to UTF-16.
In reality, three kinds of encodings are supported: ANSI code-paged, multi-byte (through special multi-byte code pages) and Unicode UTF-16.
Also, the Microsoft compiler defines wchar_t as a 16-bit type, so applications using that type to store Unicode strings (std::wstring, for instance) gain the benefit of direct API usage.

Much Linux and FOSS software, being strongly based on C and the stdlib, supports UTF8.
This can be done, while maintaining most source compatibility, by enabling the correct locale in the C stdlib.
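
For instance, a program can adopt the user's (usually UTF8) locale with a single call; a minimal sketch:

 #include <clocale>

 int main ()
 {
     std::setlocale(LC_ALL, "");  // use the environment locale, e.g. "en_US.UTF-8"
     // from here on the C stdlib interprets multi-byte strings according to that locale
 }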

Unfortunately the level of support in applications is often limited to the internal representation: input and output issues are usually ignored.

The C confusion problem

In C we have the char type, commonly used as the string character type. The standard says it must have a size of 1 byte. The identity 1 byte = 1 character was true until the advent of Unicode, and it's still assumed in many places. C/C++ programmers still think in terms of 1 byte = 1 character, and may even believe this is true for wide chars.

I strongly suggest using a BYTE type instead of char when handling UTF8 Unicode strings, so that it is clear that 1 byte doesn't necessarily represent a character/code-point.
Edit: as this sentence is not clear (thanks zokier), I will go into more depth about this in the next articles.


If you want to support unicode, the C char type doesn't represent a single unicode character. "char" is in fact a misleading name nowadays.

Wide char 
C/C++ supports the wchar_t type. It is defined as a wide character type. Unfortunately its size is compiler dependent.

For example, you have:
MSVC = 2 bytes
GCC = 4 bytes
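
A trivial illustrative way to check this on your own compiler:

 #include <iostream>

 int main ()
 {
     std::cout << sizeof(wchar_t) << std::endl;  // prints 2 with MSVC, 4 with GCC
 }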

This can cause many troubles when choosing an internal encoding for your program.
Many C stdlib string functions are now available in a wide version, but this is not true for somewhat older compilers.

In C++11 there are more suitable character types (char16_t, char32_t), but you will have trouble finding string handling functions for them anyway.
We'll discuss these points in the next articles, but in the meantime remember that:

Fact #5: using wchar_t in your program doesn't make it correctly support Unicode.


--

In the next article, I will discuss how to choose an internal string representation and how to correctly handle it in C++.

I promise to include some source code too! 

Other parts: Part 2 , Part 3 , Part 4 , Part 5

Saturday, November 3, 2012

rvalue and universal references

Introductory note: this is the first article in a series about some aspects I don't like about C++. Please take this as the start of a discussion, and not as sterile complaining.

While writing down some considerations about the great Scott Meyers article Universal References in C++11, it came to my mind why I initially found the name "rvalue reference" at least a bit inappropriate.

One of the first counter-intuitive things I found when introduced to this kind of reference was the fact that named rvalue references can be used as lvalues. At the same time, unnamed rvalue references can only be used as rvalues.

A quick recap here:

int && a = 3;
a = 4;

is valid code. Unexpected, at least for a C++98 user, but valid :)

The temporary object "3" is bound to the rvalue reference and its lifetime is extended. Now, according to the rule above, that named reference can be used as if it were an lvalue.
In fact it can also be bound to an lvalue reference!

int & b = a;

Instead, if you use an unnamed rvalue reference (say, one returned from a function), you can only use it as an rvalue.
int && ret () { return 3; }
ret () = 4;  // ERROR

It doesn't compile because the rvalue reference is unnamed. It could be used as a function argument, instead.
All of this is pretty logical and useful: this way you can reuse temporary objects and extend their lifetime without making a copy.
--

Returning to the point: the "rvalue" part of the "rvalue reference" name is not related to rvalues in any special way. At least no more than it is related to lvalues.

"Temporary object reference",as name,can be an alternative to better explain the concept and not mix the names (but surely I won't discuss the standard committee difficult naming choices).
This is how I usually think rvalue references.

Now, the article above gave me another surprise about this C++11 topic : universal references.

(I admittedly don't do advanced template metaprogramming, so I haven't yet had to use auto and decltype with rvalue references; I didn't know these rules before.)

To quickly recap: using the && syntax with a deduced type doesn't necessarily produce an rvalue reference.

int value = 10; 
auto && a = value;  // a is an lvalue reference (int &)
int && b = 10;      // b is an rvalue reference (int &&)

a is actually an lvalue reference, while b is actually an rvalue reference.
This is because the type of a is deduced (from an lvalue), while the type of b is explicit. (Note that int && b = value; would not even compile: an rvalue reference cannot bind to an lvalue.)


This kind of usage of "universal references" has nothing to do with either rvalues or the other uses of rvalue references.

--
So in the end we have that:
- deduced types declared with && will become lvalue references or rvalue references depending on the initializer;
- named rvalue references (&&) will be used as lvalues;
- unnamed rvalue references will be used as rvalues;
and all of these have exactly the same appearance in the code.

Scott Meyers, about "universal references", writes the following in his blog article:
"I really think that the idea of universal references helps make C++11's new "&&" syntax a lot easier to understand.If you agree, please use this term and encourage others to do it, too. "

Sorry, I don't agree here.
Don't you see that you have two really different things, with different behavior and rules, that appear exactly the same in the code?

This seems terrifying to me. At least very counter-intuitive.

Why did these concepts get mixed up in that way?
What about maintainability and readability?
And, given all of these explanations, why the name "rvalue references"?


In the snippet above, you can't figure out the type of a unless you know the language rules exactly (or you have IntelliSense).
While ignorance is not an excuse, this will make programmers' lives difficult, especially for those who don't regularly do advanced things with templates and metaprogramming, or don't follow the language's evolution.
The end result will be that universal references either get mistaken for and used as rvalue references (possibly leading to bugs), or not used at all.

Friday, November 2, 2012

functional and warnings

I like and make heavy use of the "new" functional programming facilities of C++. If you still don't know what function, bind and lambdas do, you should really look into them.

Recently I ran into an issue when using this kind of paradigm.
Take this simplified code: it contains a stupid, yet subtle bug.

#include <functional>

std::function<bool()> Temp ()
{
     return [] { return true;};
}

int main(int argc, char* argv[])
{
 Temp ();
 return 0;
}

even worse:

int main(int argc, char* argv[])
{
      if (Temp ())
          return 1;
      else
          return 0;
}

If you compile this code with GCC or MSVC with all warnings enabled (-Wall, /Wall), you won't get any warning.

The problem is that the Temp () call returns a function object, which is not used at all. It's missing an extra () to actually invoke the returned function object.

int main(int argc, char* argv[])
{
      if (Temp ()())
          return 1;
      else
          return 0;
}

Even worse, the function<> class is convertible to bool: the conversion returns true if the function object is valid (non-empty).
So if you're using it inside a conditional statement you won't get any error either.
The compiler correctly doesn't generate a warning, because the statement is a valid Temp function call; it is just the return value that is unused.

You probably don't want a compiler that complains about every unused function return value.
Still, spotting this specific problem can be time-consuming, because Temp () looks like a regular function call and can go unnoticed even in a code review.

A static analyzer could probably catch this kind of error.
Otherwise, using a more verbose code style can avoid it.
int main(int argc, char* argv[])
{
        std::function<bool()> RetFn = Temp ();
        assert(RetFn);
        if (RetFn())
            return 1;
        else
            return 0;
}
This is just a simple example, easy to read and fix. But you'd better not forget a () when doing functional programming, or you'll get a problematic piece of code that can go unnoticed.
I could not find any idiom to detect this at compile time. Any ideas? :)

Thursday, November 1, 2012

About technical videos

One of the trends of the Web 2.0 and post-YouTube-boom era is surely the transition from text-based tutorials to webcasts.
As a perfect example of this, you can browse the new MSDN Channel9 site: it contains a good section of C++ videos covering various interesting topics.

The first video you get inside the C++ channel is this one:
 
"Stephan T. Lavavej, aka STL, is back on C9! This time, STL will take us on a journey of discovery within the exciting world of Core C++ (standard C++, the core language)."

Wow, this is something interesting. Part 1 was about C++ name lookup, something which is always a source of errors and portability issues.

I just opened the video and read:  44 minutes, 48 seconds









Do you really think I have all that time to dedicate to this topic? Should I really sit 45 minutes in front of my screen watching you talk about C++ name lookup?

Mmm, maybe there's a transcript somewhere? Nope.


This results in a simple thing: I'm not watching the video at all!

Note, I just picked this particular video as an example, but the pattern is repeated over and over, from the "How Do I videos" on MSDN, to the "lectures" and overviews of the C++ committee folks.

So, every time you post something technical in video-only format, I'm going to hate you.
Instead I'll just google the topic and find a written explanation of it. Usually I'll be able to understand the same thing in a fraction of the time, and if I don't understand something, I won't need to stop and rewind the video over and over.




All of this is a pity, because the covered topics can be very interesting.

Written material is easier to understand and follow. One can easily find the exact information one needs while filtering out all the irrelevant parts.
Also, listening to a non-native language is much harder than reading it.

I find watching technical videos very boring and frustrating. I prefer to spend my limited learning time in more productive ways.

Wow, a new C++ blog!

Hello Guys!

Since writing an introduction for a blog usually means writing the only post for years to come, I'll keep this short.
Just four points:

- I'll talk about C++ and programming topics and issues. C++ is my primary programming language and I've been using it for several years now, both at work and at home.

- English, on the other hand, is not my primary language, so please be tolerant if you find grammatical horrors here and there :) Just drop me a note if you do.
- As you can imagine from the title, I love the language but I almost hate its syntax, especially after C++11.
- The first posts will surely be complaints about the language :) Anyway, I'll try to bring reasonable arguments to my rants.

Let's go !