A computer cannot store letters or numbers as such; it stores only bits. So we need some kind of mapping from letters and numbers to bit strings. One well-known mapping is ASCII. If you're not sure about Unicode, read on; otherwise you can skip to Strings and Encoding in Python-2.
There are 95 human-readable characters specified in the ASCII table, including the letters A through Z in both upper and lower case, the digits 0 through 9, a handful of punctuation marks, and characters like the dollar symbol, the ampersand and a few others.
These are only part of the 128 characters that ASCII defines; the rest are non-printing control characters.
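As a small illustration of the mapping, the sketch below prints the ASCII code and the 7-bit pattern for a few characters, and counts the printable ones. The helper name `ascii_bits` is made up for this example.

```python
def ascii_bits(text):
    """Return (character, code, 7-bit string) triples for ASCII text."""
    return [(ch, ord(ch), format(ord(ch), "07b")) for ch in text]

for ch, code, bits in ascii_bits("Az$"):
    print(ch, code, bits)
# A 65 1000001
# z 122 1111010
# $ 36 0100100

# Count the printable characters among the 128 ASCII codes.
printable_count = sum(1 for code in range(128) if chr(code).isprintable())
print(printable_count)  # 95
```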
How do you write é in ASCII? You simply cannot, because ASCII has no mapping from é to a bit string. Thus the world needed more encoding schemes.
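You can watch this fail in Python. The sketch below tries to encode a word containing é as ASCII, then with Latin-1, which is used here only as one example of the other schemes that do cover é.

```python
text = "caf\u00e9"  # "café"; \u00e9 is é

try:
    text.encode("ascii")
    encoded_ok = True
except UnicodeEncodeError as err:
    encoded_ok = False
    print(err.reason)  # ordinal not in range(128)

# Latin-1 has a slot for é, so this works.
latin1_bytes = text.encode("latin-1")
print(latin1_bytes)  # b'caf\xe9'
```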
Suppose someone wanted to write a document in Klingon, found that there was no encoding scheme for it, and just went ahead and invented one. This is how we ended up with a boatload of schemes.
Unicode to the rescue
Someone finally had enough and decided to combine all of these encoding schemes into one, which they called Unicode. At the time of writing, Unicode contains a repertoire of 137,439 characters covering 146 modern and historic scripts, as well as multiple symbol sets and emoji. Klingon, incidentally, did not make it in: its proposal was rejected, and it survives only in an unofficial private-use mapping :smile:.
If you want to see just how big Unicode actually is, try googling Ian Albert's Unicode chart. He printed the entire Unicode chart and ended up with a poster measuring 6 feet by 12 feet.
If it has so many characters, you must be wondering how many bytes it actually uses. None, because it is not an encoding scheme.
Confused? Well, you are supposed to be. Unicode is just a table of code points. A code point is simply a number assigned to a character: 65 for A, 66 for B, and so on. How to represent that 65 as bits in a computer was left to us. And so we did.
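In Python 3 you can look at code points directly with `ord()` and `chr()`. The sketch below uses a made-up helper, `describe`, and prints characters through `ascii()` so that it runs on any terminal.

```python
import unicodedata

def describe(ch):
    """Return (code point, U+ notation, official Unicode name) for one character."""
    cp = ord(ch)
    return cp, "U+%04X" % cp, unicodedata.name(ch)

for ch in ["A", "B", "\u00e9", "\u0107"]:
    print(ascii(ch), describe(ch))
# 'A' (65, 'U+0041', 'LATIN CAPITAL LETTER A')
# 'B' (66, 'U+0042', 'LATIN CAPITAL LETTER B')
# '\xe9' (233, 'U+00E9', 'LATIN SMALL LETTER E WITH ACUTE')
# '\u0107' (263, 'U+0107', 'LATIN SMALL LETTER C WITH ACUTE')

# chr() goes the other way: from code point back to character.
print(chr(65))  # A
```

Note that nothing here says how many bytes a character takes; a code point is only a number.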
The code points run up to U+10FFFF, which takes 21 bits, so a fixed-width scheme needs at least three bytes per character; since three is awkward, we actually use four. Now if you have a file with only English characters encoded this way, it takes four times the space it would in ASCII. That is not so nice. There are several ways to encode Unicode code points into bits: UTF-32 is the fixed four-byte scheme just described, while UTF-16 and UTF-8 spend fewer bytes on common characters. I discuss UTF-8 below.
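To make the space cost concrete, here is a rough comparison of one English-only string under three encodings. The `-le` codec variants are used only so that Python does not prepend a byte order mark, which would otherwise skew the counts.

```python
text = "hello world"  # 11 English characters

sizes = {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}
for enc, size in sizes.items():
    print(enc, size)
# utf-8 11
# utf-16-le 22
# utf-32-le 44

# The largest code point needs 21 bits, which is why two bytes are not enough.
max_bits = (0x10FFFF).bit_length()
print(max_bits)  # 21
```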
UTF-8 is a variable-length encoding. If a character can be represented using a single byte, it is encoded using a single byte only; other characters take two, three or four bytes. The highest bits of the first byte signal how many bytes the character consists of.
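The signalling is easier to see in binary. The sketch below encodes four sample characters and reads the sequence length back from the first byte alone. `utf8_sequence_length` is a hypothetical helper written for this post, not a standard library function.

```python
def utf8_sequence_length(lead_byte):
    """Read the length of a UTF-8 sequence from the high bits of its first byte."""
    if lead_byte >> 7 == 0b0:
        return 1  # 0xxxxxxx
    if lead_byte >> 5 == 0b110:
        return 2  # 110xxxxx
    if lead_byte >> 4 == 0b1110:
        return 3  # 1110xxxx
    if lead_byte >> 3 == 0b11110:
        return 4  # 11110xxx
    raise ValueError("not a valid UTF-8 lead byte")

samples = ["A", "\u00e9", "\u20ac", "\U0001F496"]
for ch in samples:
    encoded = ch.encode("utf-8")
    bits = " ".join(format(b, "08b") for b in encoded)
    print("U+%04X" % ord(ch), utf8_sequence_length(encoded[0]), bits)
# U+0041 1 01000001
# U+00E9 2 11000011 10101001
# U+20AC 3 11100010 10000010 10101100
# U+1F496 4 11110000 10011111 10010010 10010110
```

Every byte after the first starts with `10`, so a decoder can always tell a continuation byte from the start of a new character.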
Strings and Encoding in Python-2
Python 2 has two types of strings: type 'str', which holds bytes, and type 'unicode'. Its default encoding is ASCII (check it using sys.getdefaultencoding()). So if we try to write a unicode string containing a non-ASCII character to a file, we get an error:
    # Python 2: writing a unicode string triggers an implicit ASCII encode
    f = open("test.txt", 'w')
    s = u'test\u0107'
    f.write(s)

    ---------------------------------------------------------------------------
    UnicodeEncodeError                        Traceback (most recent call last)
    <ipython-input-6-6c2a70c7683b> in <module>()
          1 f = open("test.txt",'w')
          2 s = u'test\u0107'
    ----> 3 f.write(s)

    UnicodeEncodeError: 'ascii' codec can't encode character u'\u0107' in position 4: ordinal not in range(128)
This happens because a file stores bytes, not Unicode characters: Python 2 implicitly tries to encode the string as ASCII, and \u0107 has no ASCII encoding. So how do we solve this? We follow the strategy:
Decode early, Use Unicode Everywhere, Encode late.
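A minimal sketch of that strategy, assuming the incoming bytes are known to be UTF-8. The variable names are illustrative only.

```python
raw = b"test\xc4\x87"          # bytes as they might arrive from a file or socket

text = raw.decode("utf-8")     # decode early: bytes -> Unicode
shouted = text.upper()         # use Unicode everywhere: work on characters, not bytes
out = shouted.encode("utf-8")  # encode late: Unicode -> bytes, just before output

print(len(raw), len(text))     # 6 5  (six bytes, five characters)
print(out)                     # b'TEST\xc4\x86'
```

Uppercasing works on the character as a whole here, which is exactly what you lose if you manipulate the raw bytes instead.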
Note 1: The codecs module provides an open function that lets us specify the required encoding.
Note 2: The Python 2 csv module doesn't support Unicode, so we have to encode, do our work, then decode to Unicode again.
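For Note 1, a version of the failing example that goes through `codecs.open` might look like this. It writes to a temporary directory so it does not touch your files, and the choice of UTF-8 is an assumption; use whatever encoding your data actually needs.

```python
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "test.txt")
s = u"test\u0107"

# codecs.open encodes on write and decodes on read, using the encoding we name.
with codecs.open(path, "w", encoding="utf-8") as f:
    f.write(s)

with codecs.open(path, "r", encoding="utf-8") as f:
    read_back = f.read()

# Reading the same file in binary mode shows the bytes that were actually stored.
with open(path, "rb") as f:
    stored = f.read()

print(read_back == s)  # True
print(stored)          # b'test\xc4\x87' (shown as a plain str in Python 2)
```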
Summary of problems in Python-2:
- default Python 2 encoding is 'ascii'
- not all Python 2 internals support Unicode
- You can't reliably guess an encoding
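The last point deserves a demonstration: the same two bytes decode without complaint under two different encodings, and only one result is what the writer meant. Latin-1 is picked here purely as an example of a plausible wrong guess.

```python
raw = "\u00e9".encode("utf-8")     # é stored as UTF-8: b'\xc3\xa9'

as_utf8 = raw.decode("utf-8")      # the right guess: one character, é
as_latin1 = raw.decode("latin-1")  # the wrong guess: two characters, no error raised

print(ascii(as_utf8), len(as_utf8))      # '\xe9' 1
print(ascii(as_latin1), len(as_latin1))  # '\xc3\xa9' 2
```

Because the wrong guess raises no error, nothing in the bytes themselves tells you which encoding was intended.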
Unicode in Python-3
They fixed it!
- type 'str' is now what type 'unicode' used to be. Every string is now Unicode.
- We have a new string type, type 'bytes'. So encode to bytes and decode to Unicode.
- All built-in modules now support Unicode, so there is no more encoding and decoding in the middle of your code.
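- open() in Python 3 works like codecs.open(): it accepts an encoding argument. And yes, the default encoding reported by sys.getdefaultencoding() is now UTF-8. The default used by open() itself can still depend on the platform locale, so pass encoding explicitly when it matters.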
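Put together, the Python 3 behaviour can be sketched as follows. The temporary path and the explicit `encoding="utf-8"` are choices made for this example.

```python
import os
import sys
import tempfile

s = "test\u0107"               # a plain literal is already Unicode: type str
data = s.encode("utf-8")       # str -> bytes
back = data.decode("utf-8")    # bytes -> str

print(type(s).__name__, type(data).__name__)  # str bytes
print(sys.getdefaultencoding())               # utf-8

# str and bytes no longer mix silently.
try:
    s + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# The built-in open() takes an encoding, so the Python 2 failure does not recur.
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(s)
with open(path, encoding="utf-8") as f:
    read_back = f.read()
```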
Summary
- For computers, text is always some bits that need to be translated into human-readable form.
- The same bits can mean different things under different encoding schemes. Find out the encoding whenever you can; guessing, however intelligently, should be the last resort.
- Unicode is a table that assigns a code point to every character. It is not an encoding scheme in itself.
- There are different ways of going from Unicode to bytes; the important one is UTF-8.
- Python supports Unicode 💖
- The default encoding in Python-2 is ASCII; don't change it just to run your code.
- For portability, every string in Python (3+) is now Unicode, and we have a new type: bytes.