Charset vs Character Encoding

Charset and character encoding have always tripped me up in C#, Python, and other languages. I never had a firm grasp of what someone meant by charset or encoding, especially since the two terms are often used interchangeably. I decided it was time to wrap my head around them and really understand them. It turns out most programmers are in the same boat.

Character Set (charset)

When you wrote programs in the "good old days" there was no such thing as character encoding for strings. A string was literally an array of bytes that mapped directly to the characters (the alphabet, if you will) of the selected charset. For instance, in one charset the code 241 could be ñ, and in another charset it could be an entirely different symbol. The bytes in memory did not change at all; the only thing that changed was which character was mapped to each number, and since a byte holds values 0 through 255, each charset contained at most 256 characters.

This quickly becomes a problem when you want to support more than one language. The old way of handling it was to pass along the map (the charset) used to read the bytes and display the correct "alphabet" to the user. But this approach comes with lots of problems and limits which characters you can have in each alphabet. Consider Chinese or Japanese, which have far more than 256 characters. And then there is the situation where a single document needs multiple character sets at once.
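A quick Python sketch makes the problem concrete. The two code pages below are real legacy charsets (Latin-1 and the original IBM PC code page 437); the byte value 241 is the same example used above. The same raw byte decodes to two different characters depending on which charset you hand the decoder:

```python
# One raw byte, two charsets, two different characters.
raw = bytes([241])  # the byte value 241 (0xF1)

# Interpreted as ISO-8859-1 (Latin-1), code 241 is the letter ñ.
print(raw.decode("latin-1"))  # ñ

# Interpreted as IBM code page 437, code 241 is the ± sign.
print(raw.decode("cp437"))    # ±
```

Nothing about the byte changed; only the map from numbers to characters did.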

Encoding & Unicode

In all honesty, this one is still not fully concrete for me, but here is what I know. Unicode is actually a character set: a mapping of numbers (called code points) to characters. The full range of code points needs up to 21 bits, so the straightforward representation is one 32-bit number per character. This raises a new issue: we have quadrupled the size of our text by accounting for the characters of all known languages. With today's computing landscape this is not entirely a problem, but it is still wasteful, and when Unicode was created it would have horribly impacted the performance and memory footprint of applications. To deal with this, several encodings became commonplace. Windows has fully latched onto UTF-16, while most Unix and Linux (POSIX) systems embrace UTF-8.
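To see the difference between those two encodings, here is a small Python comparison (the sample string is my own) of how each one stores the same Unicode text:

```python
# The same Unicode string stored under the two common encodings.
s = "héllo"

utf8 = s.encode("utf-8")       # é takes 2 bytes, the other letters take 1
utf16 = s.encode("utf-16-le")  # every character here takes 2 bytes

print(len(utf8))   # 6
print(len(utf16))  # 10
```

For mostly-ASCII text UTF-8 is more compact, which is a big part of why it won on the web and on POSIX systems.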

An encoding is really just a way to represent the 32-bit charset without using an array of 32-bit elements; the values are packed into one or more bytes instead. I didn't really understand how this worked until I worked with UTF-8 in C. Actually having to manipulate the individual bytes and bits really helped my understanding. See the image below for the memory usage of UTF-32 (fixed-width, 4 bytes per character) vs UTF-8 (multi-byte encoding).

UTF-32 Encoding vs UTF-8 Encoding of Unicode
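You can reproduce the comparison from the image directly in Python (the sample strings here are mine):

```python
# UTF-32 always spends 4 bytes per character; UTF-8 spends 1 to 4.
ascii_text = "abc"
print(len(ascii_text.encode("utf-32-le")))  # 12 — 4 bytes each
print(len(ascii_text.encode("utf-8")))      # 3  — 1 byte each

cjk_text = "日本語"
print(len(cjk_text.encode("utf-32-le")))    # 12 — still 4 bytes each
print(len(cjk_text.encode("utf-8")))        # 9  — 3 bytes each
```

For plain ASCII the savings are 4x; even for CJK text UTF-8 stays smaller than UTF-32.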


The simplest way I have found to think of charset vs encoding is: the character set is the alphabet you are going to use, and the encoding is how you store that alphabet in memory or on disk.

Soon I will write an article about how UTF-8 is actually stored in memory, and why I like using it. Look for that link here sometime in the nearish future.