Encoding
A computer cannot store letters or numbers directly, so we need some kind of mapping from these letters/numbers to bit strings, the only thing a computer understands. One such well-known mapping is ASCII. If you are not sure about ASCII and Unicode, read on; otherwise you can skip to Strings and Encoding in Python-2.
ASCII
There are 95 human-readable characters specified in the ASCII table, including the letters A through Z in both upper and lower case, the digits 0 through 9, and a handful of punctuation marks and symbols such as the dollar sign, the ampersand and a few others.
Bits | Character |
---|---|
01000001 | A |
01000010 | B |
01000101 | E |
01000110 | F |
Shown above is just a small part of the 128 characters that ASCII has.
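Since the rest of this article uses Python, here is a small sketch of that mapping: ord() gives a character's ASCII value and format() renders it as eight bits (both are standard built-ins; the Python 2 print statement is assumed).

# Print each character next to its 8-bit ASCII pattern, e.g. "A 01000001".
for ch in ['A', 'B', 'E', 'F']:
    print ch, format(ord(ch), '08b')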
éh?
How do you write é in ASCII? You simply cannot, because there is no mapping from é to a bit string in ASCII. Thus the world needed more encoding schemes.
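You can watch this failure happen in Python: asking the ASCII codec to encode é gives it nowhere to go. A minimal sketch (é is written as the escape \xe9 so it works whatever the source file's encoding is):

# é (U+00E9) has no slot in the 128-character ASCII table,
# so encoding it with the ascii codec blows up.
u'\xe9'.encode('ascii')   # raises UnicodeEncodeError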
Suppose someone wanted to write a document in Klingon, found that there was no encoding scheme, so just went ahead and invented one. This is how we’ve ended up with a boat-load of schemes.
Unicode to the rescue
Someone finally had enough and decided to combine all of these encoding schemes into one standard, which they called Unicode. Unicode contains a repertoire of 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji. There is even room for Klingon in its Private Use Area :smile:.
If you want to have a look at just how big Unicode actually is, try googling Ian Albert's Unicode chart. He printed the entire Unicode chart and ended up with roughly 6 feet by 12 feet of chart.
If it has so many characters, you must be wondering how many bytes it uses. None, because Unicode is not an encoding scheme.
Confused? Well, you are supposed to be. Unicode is just a table of code points: it says 65 for A, 66 for B, and so on. It was left to us to decide how to turn that 65 into computer bits. And so we did.
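In Python, ord() shows you these code points directly; it returns the number from the Unicode table and says nothing about bytes. A small sketch:

# ord() gives the Unicode code point, independent of any byte encoding.
print ord(u'A')        # 65
print ord(u'\u3042')   # 12354 (0x3042), the code point of Japanese あ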
That many code points need at least three bytes each, but since three is awkward, the straightforward fixed-width approach uses four bytes per character. Now if you have a purely English file (a file with only English characters) encoded this way, it takes four times the space it needs. That is not so nice. To optimize this, there are several ways to encode Unicode code points into bits, such as UTF-8, UTF-16 and UTF-32. I have discussed UTF-8 below.
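As a rough illustration of that four-fold blow-up, here is a sketch comparing the byte counts of the same English text under a fixed four-bytes-per-character encoding (UTF-32) and under UTF-8 (the codec names 'utf-32-be' and 'utf-8' are Python's):

text = u'hello'
print len(text.encode('utf-32-be'))   # 20 bytes: 4 bytes for every character
print len(text.encode('utf-8'))       # 5 bytes: same size as plain ASCII here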
UTF-8
UTF-8 is a variable length encoding. If a character can be represented using a single byte, then it will be encoded using a single byte only. It has elaborate ways to use the highest bits in a byte to signal how many bytes a character consists of.
character | bits |
---|---|
A | 01000001 |
あ | 11100011 10000001 10000010 |
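The table above can be reproduced by encoding each character to UTF-8 and printing the resulting bytes as bit strings; the sketch below assumes Python 2, where iterating over the encoded bytes yields one-character strings:

for ch in (u'A', u'\u3042'):                          # 'A' and Japanese あ
    encoded = ch.encode('utf-8')                      # one byte for A, three for あ
    print len(encoded), ' '.join(format(ord(b), '08b') for b in encoded)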
Strings and Encoding in Python-2
We have two types of strings in Python: type 'str' and type 'unicode'. Python's default encoding is ASCII (check it using sys.getdefaultencoding()). So if we try to write a unicode string to a file, we get an error:
f = open("test.txt", 'w')
s = u'test\u0107'      # a unicode string containing the character ć (U+0107)
f.write(s)             # raises UnicodeEncodeError: implicit ASCII encoding cannot handle ć
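One way around the error, sketched here, is to encode the unicode string ourselves before handing it to write(), so the implicit ASCII conversion never happens (UTF-8 is just one possible choice of encoding):

f = open("test.txt", 'w')
s = u'test\u0107'
f.write(s.encode('utf-8'))   # we choose the encoding explicitly; no implicit ASCII step
f.close()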