An old joke asks "What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American."
Now that I've successfully enraged all of my American readers, I can get to the point, which is that because so many computer technologies were developed in English-speaking countries—and particularly in the United States—the needs of other languages often were left out of those early technologies. The standard established in the 1960s for translating numbers into characters (and back), known as ASCII (the American Standard Code for Information Interchange), took into account all of the letters, numbers and symbols needed to work with English. And that's all that it could handle, given that it was a seven-bit (that is, 128-character) encoding.
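For example, ASCII maps the capital letter "A" to the number 65, something you can check from any Python prompt:
>>> ord('A')
65
>>> chr(65)
'A'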
If you're willing to ignore accented letters, ASCII can sort of, kind of, work with other languages, as well—but the moment you want to work with another character set, such as Chinese or Hebrew, you're out of luck. Variations on ASCII, such as ISO-8859-x (with a number of values for "x"), solved the problem to a limited degree, but there were numerous issues with that system.
Unicode gives each character, in every language around the globe, a unique number. This allows you to represent (just about) every character in every language. The problem is how you can represent those numbers using bytes. After all, at the end of the day, bytes are still how data is stored to and read from filesystems, how data is represented in memory and how data is transmitted over a network. In many languages and operating systems, the encoding used is UTF-8. This ingenious system uses different numbers of bytes for different characters. Characters that appear in ASCII continue to use a single byte. Some other character sets (for example, Arabic, Greek, Hebrew and Russian) use two bytes per character. Yet others, such as Chinese, use three bytes per character, and emojis use four.
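You can verify those sizes for yourself in a Python 2 shell. Here's a quick sketch, assuming your terminal uses UTF-8:
>>> len(u'a'.encode('utf-8'))     # ASCII character: 1 byte
1
>>> len(u'ש'.encode('utf-8'))     # Hebrew character: 2 bytes
2
>>> len(u'你'.encode('utf-8'))     # Chinese character: 3 bytes
3
>>> len(u'😀'.encode('utf-8'))     # emoji: 4 bytes
4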
In a modern programming language, you shouldn't have to worry about this stuff too much. If you get input from the filesystem, the user or the network, it should just come as characters. How many bytes each character needs is an implementation detail that you can (or should be able to) ignore.
Why do I mention this? Because a growing number of my clients have begun to upgrade from Python 2 to Python 3. Yes, Python 3 has been around for a decade already, but a combination of massive improvements in the most recent versions and the fact that only 18 months remain before Python 2 is deprecated is leading many companies to say, "Gee, maybe we finally should upgrade."
The major sticking point for many of them? The bytes vs. characters issue.
So in this article, I start looking into what this means and how to deal with it, beginning with an examination of bytes vs. characters in Python 2. In my next article, I plan to look at Python 3 and how the upgrade can be tricky, even when you know exactly when you want bytes and when you want characters.
Traditionally, Python strings are built out of bytes—that is, you can think of a Python string as a sequence of bytes. If those bytes happen to align with characters, as in ASCII, you're in great shape. But if those bytes are from another character set, you need to rethink things a bit. For example:
>>> s = 'hello'
>>> len(s)
5
>>> s = 'שלום' # Hebrew
>>> len(s)
8
>>> s = '你好' # Chinese
>>> len(s)
6
What's going on here? Python 2 allows you to enter whatever characters you want, but it doesn't see the input as characters. Rather, it sees them only as bytes. It's almost as if you were to go to a mechanic and say, "There's a problem with my car", and your mechanic said, "I don't see a car. I see four doors, a windshield, a gas tank, an engine, four wheels and tires", and so forth. Python is paying attention to the individual parts, but not to the character built out of those parts.
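You can ask Python 2 to show you those individual parts. Listing the elements of the Hebrew string, for example, produces eight single-byte strings rather than four characters:
>>> list('שלום')
['\xd7', '\xa9', '\xd7', '\x9c', '\xd7', '\x95', '\xd7', '\x9d']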
Checking the length of a string is one place where you see this issue. Another is when you want to print just part of a string. For example, what's the first character in the Chinese string? It should be the character 你, meaning "you":
>>> print(s[0])
�
Yuck! That was spectacularly unsuccessful and probably quite useless to any users.
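The reason becomes clear when you look at what s[0] actually contains. It's merely the first of the three bytes that make up 你, and on its own, it means nothing:
>>> s[0]
'\xe4'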
If you want to write a function that reliably prints the first character (not byte) of a Python 2 string, you could keep track of what language you're using and then look at the appropriate number of bytes. But that's bound to have lots of problems and bugs, and it'll be horribly complex as well.
The appropriate solution is to use Python 2's "Unicode strings". A Unicode string is just like a regular Python string, except it uses characters, rather than bytes. Indeed, Python 2's Unicode strings are just like Python 3's default strings. With Unicode strings, you can do the following:
>>> s = u'hello'
>>> len(s)
5
>>> s = u'שלום' # Hebrew
>>> len(s)
4
>>> s = u'你好' # Chinese
>>> len(s)
2
>>> print(s[0])
你
Terrific! Those are the desired results. You even can make this the default behavior in your Python 2 programs by using one of my favorite modules, __future__. The __future__ module exists so that you can take advantage of features planned for inclusion in later versions, even if you're using an existing version. It allows Python to roll out new functionality slowly and for you to use it whenever you're ready.
One such __future__ feature is unicode_literals. This changes the default type of string in Python to, well, Unicode strings, thus removing the need for a leading "u". For example:
>>> from __future__ import unicode_literals
>>> s = 'hello'
>>> len(s)
5
>>> s = 'שלום' # Hebrew
>>> len(s)
4
>>> s = '你好' # Chinese
>>> len(s)
2
>>> print(s[0])
你
Now, this doesn't mean that all of your problems are solved. First of all, this from statement means that your strings aren't actually strings any more, but rather objects of type unicode:
>>> type(s)
<type 'unicode'>
If you have code—and you shouldn't!—that checks to see if s is a string by explicitly checking the type, that code will break following the use of unicode_literals.
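If you genuinely need such a check, one sketch that survives the change is to test against basestring, the common parent of str and unicode in Python 2:
>>> from __future__ import unicode_literals
>>> s = 'hello'
>>> type(s) == str
False
>>> isinstance(s, basestring)
True
But other things will break as well.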
For example, let's assume I want to read a binary file—such as a PDF document or a JPEG—into Python. Normally, in Python 2, I can do so using strings, because a string can contain any bytes, in any order. Unicode strings, however, are quite strict about which byte sequences are legal, in no small part because in UTF-8, any byte whose eighth (highest) bit is set must be part of a larger, multibyte character and cannot stand on its own.
Here's a short program that I wrote to read and print such a file:
>>> filename = 'Files/unicode.txt'
>>> from __future__ import unicode_literals
>>> for one_line in open(filename):
...     for index, one_char in enumerate(one_line):
...         print("{0}: {1}".format(index, one_char))
When I run it, this program crashes:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd7 in position 0: ordinal not in range(128)
What's the problem? Well, open is still reading the file as bytes, rather than characters. Each line arrives as a byte string, and when Python tries to combine those bytes with a Unicode string (the format string, which unicode_literals has made Unicode), it must first decode the bytes. The default codec is ASCII, which has no idea what to do with byte 0xd7.
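A quick sketch in the same session shows the two types involved. The bytes arrive as a plain str, and only an explicit decode turns them into Unicode:
>>> data = open(filename).read()
>>> type(data)
<type 'str'>
>>> type(data.decode('utf-8'))
<type 'unicode'>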
In other words, while unicode_literals manages to make Python's strings Unicode-compliant, there are a bunch of things in the general Python environment that aren't Unicode-aware or friendly.
You can solve this problem by using the codecs module and the open function it provides, telling it which encoding you want to use when reading from the file:
>>> import codecs
>>> for one_line in codecs.open(filename, encoding='utf-8'):
...     for index, one_char in enumerate(one_line):
...         print("{0}: {1}".format(index, one_char))
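Incidentally, if you're already looking ahead to Python 3, the standard library's io module (available since Python 2.6) provides an open function that accepts the same encoding argument and behaves like Python 3's built-in open. The same loop works with it:
>>> import io
>>> for one_line in io.open(filename, encoding='utf-8'):
...     for index, one_char in enumerate(one_line):
...         print("{0}: {1}".format(index, one_char))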
To summarize, you can make all of Python's strings Unicode-compliant if you use unicode_literals. But the moment you do that, you run into the potential problem of getting data in bytes from the user, network or filesystem, and having an error. Although this seems like a really tempting way to deal with the whole Unicode issue, I suggest that you go the unicode_literals route only if you have a really good test suite, and if you're sure that all the libraries you use know what to do when you change strings in this way. You will quite possibly be surprised to find that although many things work fine, many others don't.
bytes Type
When talking about strings and Unicode, there's another type that should be mentioned as well: the "bytestring", aka the bytes type. In Python 2, bytes is just an alias for str, as you can see here in this Python shell that has not imported unicode_literals:
>>> s = 'abcd'
>>> type(s) == bytes
True
>>> str == bytes
True
>>> bytes(1234)
'1234'
>>> type(bytes(1234))
<type 'str'>
In other words, although Python strings generally are thought of as having type str, they equally can be seen as having type bytes. Why would you care? Because it allows you to separate strings that use bytes from strings that use Unicode already in Python 2, and to continue that explicit difference when you get to Python 3 as well.
I should add that a very large number of developers I've met (and taught) who use Python 2 are unaware that bytestrings even exist. It's much more common to talk about them in Python 3, where they serve as a counterpart to the Unicode-aware strings.
Just as Unicode strings have a "u" prefix, bytestrings have a "b" prefix. For example:
>>> b'abcd'
'abcd'
>>> type(b'abcd')
<type 'str'>
In Python 2, you don't need to talk about bytestrings explicitly, but by using them, you can make it very clear whether you're using bytes or characters.
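For example, here's a small sketch (the variable names are mine) showing how the prefixes document your intent, and hinting at a subtle upgrade trap:
>>> data = b'hello'    # explicitly bytes; means the same thing in Python 3
>>> text = u'hello'    # explicitly characters; valid again as of Python 3.3
>>> data == text       # True in Python 2, but False in Python 3!
True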
This raises the question of how you can move from one world to the other. Let's say, for example, you have the Unicode string for "Hello" in Chinese, aka 你好. How can you get a bytestring containing the (six) bytes? You can use the str.encode method, which returns a bytestring (aka a Python 2 string) containing six bytes:
s.encode('utf-8')
Somewhat confusingly (to me, at least), you "encode" from Unicode to bytes, naming the encoding that the resulting bytes should use. Regardless, you then get:
>>> s = u'你好'
>>> s.encode('utf-8')
'\xe4\xbd\xa0\xe5\xa5\xbd'
Now, why would you want to turn a Unicode string into bytes? One reason is that you want to write the bytes to a file, and you don't want to use codecs.open in order to do so. (Don't try to write Unicode strings to a file opened in the usual way, with "open".)
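For example, here's a minimal sketch, using a hypothetical filename, that writes those encoded bytes via a normally opened file:
>>> s = u'你好'
>>> with open('chinese.txt', 'w') as f:
...     f.write(s.encode('utf-8'))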
What if you want to do the opposite, namely take a bunch of bytes and turn them into Unicode? As you can probably guess, that's performed via the str.decode method:
>>> b = '\xe4\xbd\xa0\xe5\xa5\xbd'
>>> b.decode('utf-8')
u'\u4f60\u597d'
Once again, you indicate the encoding that should be used, and the result is a Unicode string, which you see here represented with Python's special \u syntax. This syntax allows you to specify any Unicode character by its unique "code point". If you print it out, you can see how it looks:
>>> print(b.decode('utf-8'))
你好
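By the way, you can move between a character and its code point with the built-in ord function and (in Python 2) its inverse, unichr:
>>> hex(ord(u'你'))
'0x4f60'
>>> unichr(0x4f60)
u'\u4f60'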
Python 2 is going to be deprecated in 2020, and many companies are starting to look into how to upgrade. A major issue for them will be the strings in their programs. This article looks at strings, Unicode strings and bytestrings in Python 2, paving the way to cover these same issues in Python 3, and how to handle upgrades, in my next article.