3.4.2. Strings and bytes
We have already been introduced to strings as a basic data type. Now we take a look at them again from a different point of view. Strings are containers. This means you can look at their insides and do things like check whether the first character is capitalized and whether the third character is “e”.
3.4.2.1. Indexing strings, string slices
To get at the inner components of strings, Python uses the same syntax and operators as it does for lists. The Pythonic conception is that both lists and strings belong to a ‘super’ data type, sequences. Sequence types are containers that contain elements in a particular order, so indexing by number makes sense for all sequences:
>>> X = 'dogs'
>>> X[0]
'd'
>>> X[1]
'o'
>>> X[-1]
's'
The following raises an IndexError, as it would with a 4-element list:
>>> X[4]
...
IndexError: string index out of range
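As a small sketch of how negative and nonnegative indices relate (the variable names n, pairs and message are only illustrative): for a sequence of length n, index -k refers to the same element as index n - k, and an index beyond either end raises IndexError.

```python
X = 'dogs'
n = len(X)

# For k = 1..n, X[-k] and X[n - k] name the same character.
pairs = [(X[-k], X[n - k]) for k in range(1, n + 1)]
print(pairs)

# Indexing at position n (one past the last character) raises IndexError.
try:
    X[n]
except IndexError as err:
    message = str(err)
print(message)
```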
Strings can also be one element long:
>>> Y = 'd'
Note
Unlike C, there is no special type for characters in Python. Characters are just one-element strings.
And they can be empty and still be strings, just as lists can:
>>> Z = ''
As with lists, you can check the contents of strings. So:
>>> 'd' in X
True
>>> 'do' in X
True
>>> 'dg' in X
False
So not just any character (like ‘d’) but any substring (like ‘do’) of a string is regarded as in the string. However, such a substring must contain all the characters starting at one index up to and including the character at some higher index, without skipping any. So ‘dg’ is not in X. Such contiguous substrings can always be retrieved by slicing.
Python implements slicing by index for strings just as it does for lists. The following examples illustrate some string slicing:
>>> X[0:2] # string of 1st and 2nd characters
'do'
>>> X[:-1] # string excluding last character
'dog'
>>> X[1:] # string excluding first character
'ogs'
>>> X[1:3] # string of 2nd and 3rd characters
'og'
Keep in mind the following rule when picking slices of a Pythonic sequence X. The slice X[i:j] will start at X[i] and it will have length j-i. Thus, it will not include X[j].
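The rule can be checked directly. The following is a minimal sketch, assuming indices i and j that lie within bounds; the names piece, i, j and k are chosen here for illustration.

```python
X = 'dogs'
i, j = 1, 3

piece = X[i:j]
assert piece[0] == X[i]        # the slice starts at X[i]
assert len(piece) == j - i     # and has length j - i

# Because X[j] is excluded, cutting at any index k splits X cleanly in two.
for k in range(len(X) + 1):
    assert X[:k] + X[k:] == X

print(piece)  # og
```

One consequence of excluding X[j] is that adjacent slices never overlap and never leave a gap.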
Guido van Rossum says: “The best way to remember how slices work is to think of the indices as pointing between characters, with the left edge of the first character numbered 0. Then the right edge of the last character of a string of n characters has index n”:
 +---+---+---+---+---+
 | H | e | l | p | A |
 +---+---+---+---+---+
 0   1   2   3   4   5
-5  -4  -3  -2  -1
The first row of numbers gives the position of the indices 0…5 in the string; the second row gives the corresponding negative indices. The slice from i to j consists of all characters between the edges labeled i and j, respectively.
For nonnegative indices, the length of a slice is the difference of the indices, if both are within bounds, e.g., the length of X[1:3] is 2.
Finally, slices may have a third component called the step size. A slice [start:stop:step] returns the elements that are step positions apart, starting at index start and going up to but not including index stop. For example:
>>> 'banana'[1:6:2]
'aaa'
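A few more step slices, as a sketch. Leaving out start or stop means "from the beginning" or "to the end", and a negative step walks through the string from right to left; the variable names are illustrative.

```python
word = 'banana'

every_other = word[1:6:2]   # indices 1, 3, 5
from_start = word[::2]      # indices 0, 2, 4
backwards = word[::-1]      # a negative step walks right to left

print(every_other, from_start, backwards)  # aaa bnn ananab
```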
The built-in function len() returns the length of a string:
>>> s = 'supercalifragilisticexpialidocious'
>>> len(s)
34
Strings can also be concatenated into longer sequences, just as lists can:
>>> X + Y
'dogsd'
Using the name of the type as a function gives us a way of making strings, just as it did with lists:
>>> One = str(1)
One is a string, not an int!
>>> One
'1'
Python reminds us of this when printing it by including string quotes. We can turn the string back into an integer just by using the official name of the integer type, int. Note that the string quotes disappear:
>>> I = int(str(1))
>>> I
1
And as with lists, calling the type with no arguments produces the empty string:
>>> Empty = str()
>>> Empty
''
There is one thing that can be done with lists that cannot be done with strings: assigning a value to a position.
>>> 'spin'[2] = 'a'
...
TypeError: 'str' object does not support item assignment
This can be worked around either by avoiding the assignment (building a new string instead) or by converting the string into a mutable sequence, such as a list, that contains the same information.
See also
Section Mutability (advanced).
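Both workarounds for the failed assignment to 'spin'[2] can be sketched as follows. The names spun, chars and via_list are illustrative, and ''.join(...) glues a list of one-character strings back into a single string.

```python
word = 'spin'

# Option 1: avoid assignment by building a new string from slices.
spun = word[:2] + 'a' + word[3:]

# Option 2: go through a mutable list of characters, then glue it back.
chars = list(word)
chars[2] = 'a'
via_list = ''.join(chars)

print(spun, via_list)  # span span
print(word)            # spin (the original string is unchanged)
```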
3.4.2.2. Splitting and Joining
One way to turn a string into a list is to call the split method, which returns the list obtained by splitting the string at a given separator. When no separator is given, split() splits at runs of whitespace. Thus:
>>> 'cats are fun'.split()
['cats', 'are', 'fun']
This operation, splitting a string into a list of strings at given split points, is important enough that it has been implemented as a string method, and the method allows any sequence of characters to be used as the split point:
>>> 'abracadabra'.split('bra')
['a', 'cada', '']
The inverse operation, concatenating a list of strings into a single string with a given join string between them, is also a method; it is called as a method of the join string. To undo the word split above we do:
>>> ' '.join(['cats', 'are', 'fun'])
'cats are fun'
To undo the ‘abracadabra’ split, we do:
>>> 'bra'.join(['a', 'cada', ''])
'abracadabra'
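The round trip can be checked in code. This sketch also shows one caveat, assuming a string with irregular spacing: split() with no argument collapses runs of whitespace, so joining with a single space normalizes the spacing rather than restoring the original exactly.

```python
text = 'abracadabra'
sep = 'bra'

parts = text.split(sep)
assert sep.join(parts) == text      # join undoes split for an explicit separator

# With no argument, split() treats any run of whitespace as one separator.
messy = 'cats   are\tfun'
words = messy.split()
rejoined = ' '.join(words)

print(words)      # ['cats', 'are', 'fun']
print(rejoined)   # cats are fun
```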
A common use for join is to take a list of strings and produce a single string with line breaks:
>>> print('\n'.join(['Roses are red.', 'Violets are blue.', 'Sugar is sweet', 'but not so pooh.']))
Roses are red.
Violets are blue.
Sugar is sweet
but not so pooh.
3.4.2.3. Unicode
Python 2.X maintained distinct unicode and string types. That distinction has been abolished in Python 3.X: every string is a Unicode string, and strings containing non-ASCII characters are written the same way as any other string. Everything that has been said about strings is also true of strings containing Unicode characters. They are still immutable sequences of characters. Indexing and slicing work the same. As we will see below, they also have the same methods.
What is the point, then? The point is that Unicode can represent characters (and writing systems) that the older 8-bit strings can’t. There are 128 official ASCII characters, extended to 256 in various semi-official standards. There are 1,114,112 possible code points in modern Unicode (17 times the original Unicode setup of 65,536 characters), and about 10% of this space has been assigned to characters in various writing systems, international symbols, and emoji.
Here’s one way to define a string that includes Cyrillic characters. Each Unicode character is associated with a number called its code point; we simply type the code points into the string, writing each one as “\u” followed by 4 digits (these are hexadecimal, that is base 16, digits, which is why they include characters like e and f):
>>> russia = '\u0420\u043e\u0441\u0441\u0438\u044f'
>>> russia
'Россия'
>>> russia[0]
'Р'
>>> len(russia)
6
Despite how much typing it took to enter, russia is 6 characters long, because that’s how many Unicode characters it contains (specifically, 6 characters from the Cyrillic blocks in Unicode).
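The link between characters and code points can be explored with the built-in functions ord() and chr(). The following is a small sketch; code_points and rebuilt are illustrative names.

```python
russia = '\u0420\u043e\u0441\u0441\u0438\u044f'

# ord() gives a character's code point; hex() shows it in base 16.
code_points = [hex(ord(ch)) for ch in russia]
print(code_points)

# chr() goes the other way, from a code point to a one-character string.
rebuilt = ''.join(chr(ord(ch)) for ch in russia)
assert rebuilt == russia

# The \u escape and chr() with the same number denote the same character.
assert '\u0420' == chr(0x420)
```

The hexadecimal values printed are the same digits typed after each \u in the string literal.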
3.4.2.4. Bytes and byte arrays
Missing from the inventory of basic sequence types so far is a sequence type for representing binary data, say, what you get when you read in a compiled C executable as binary data. Such sequences belong to a special type called bytes (yes, the type name is plural).
A more accessible example of where you might encounter a bytes object is what you get by encoding a string in bytes. Of course, what you get will depend on which way of encoding characters you choose. So given a particular (unicode) string like the one above, we can for example construct one bytes object representing its utf-8 encoding and another representing a utf-16 encoding:
>>> r1 = russia.encode('utf-8')
>>> r1
b'\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd1\x8f'
>>> r2 = russia.encode('utf-16le')
>>> r2
b' \x04>\x04A\x04A\x048\x04O\x04'
>>> r1 == r2
False
>>> len(r1) == len(r2)
True
The results here are often called bytes strings. Of course, since these bytes strings started out as sequences of characters, they are worse than useless if you don’t know which encoding was used to construct them:
>>> r1.decode('utf-8')
'Россия'
>>> r1.decode('utf-16le')
'ꃐ뻐臑臑룐近'
A badly decoded bytes string is worse than useless because you don’t always get an error; sometimes you get a message in a bottle from the uni(code)verse. This is what we got from the second decoding of r1 above: random gibberish, with no error message to tell us so.
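Both outcomes of decoding with the wrong codec can be seen in a short sketch. The choice of 'ascii' as the codec that fails is an assumption made for illustration; outcome and wrong are illustrative names.

```python
russia = '\u0420\u043e\u0441\u0441\u0438\u044f'
r1 = russia.encode('utf-8')

# A wrong codec may fail loudly...
try:
    r1.decode('ascii')
    outcome = 'decoded'
except UnicodeDecodeError:
    outcome = 'error'

# ...or succeed silently with the wrong characters.
wrong = r1.decode('utf-16le')

print(outcome)             # error
print(wrong == russia)     # False

# Only the codec used for encoding reliably recovers the original text.
assert r1.decode('utf-8') == russia
```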
The key point is that a bytes representation is a sequence of bytes (8-bit sequences), not a sequence of characters; or perhaps better in our context: a bytes representation is a sequence of character codes. You should always think of bytes as number sequences that encode something. Indeed when you index them at individual positions, you do get numbers:
>>> r1[0] # a number between 0 and 255
208
>>> len(r1)
12
The length of r1 as a byte sequence is 12 because each of the 6 Cyrillic characters we started with took up two bytes. Encountered out of context, those bytes might represent information having nothing to do with strings. For example, the first byte r1[0] might represent a setting for 8 binary switches, and the first two positions, 16 switches:
>>> print(f'{r1[0]:b}') #print this as a binary number
11010000
>>> print(f'{r1[0]:b} {r1[1]:b}')
11010000 10100000
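To make the "bytes are numbers" view concrete, here is a minimal sketch that takes the first two bytes of r1 apart into integers and puts them back together; values and first_char are illustrative names.

```python
r1 = '\u0420\u043e\u0441\u0441\u0438\u044f'.encode('utf-8')

# Iterating over a bytes object yields integers in range(256).
values = list(r1[:2])
print(values)  # [208, 160]

# bytes() accepts such integers and rebuilds the same byte sequence.
first_char = bytes(values).decode('utf-8')
assert first_char == '\u0420'

# Indexing returns an int, while slicing returns a bytes object.
assert isinstance(r1[0], int)
assert isinstance(r1[0:1], bytes)
```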
Akin to the immutable bytes object in many ways (they support almost all the same methods) is the mutable bytearray object:
>>> r1_ba = bytearray(russia,'utf8')
>>> r1_ba
bytearray(b'\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd1\x8f')
>>> r1_ba.decode('utf8')
'Россия'
>>> r1_ba[2:4]=[32,32]
>>> r1_ba
bytearray(b'\xd0\xa0  \xd1\x81\xd1\x81\xd0\xb8\xd1\x8f')
The third and fourth bytes have been replaced with two spaces (we replaced one 2-byte unicode character encoding with two 1-byte character encodings, leaving the bytes string the same length). When we decode now:
>>> russia_dec = r1_ba.decode('utf8')
>>> russia_dec
'Р  ссия'
>>> len(russia_dec)
7
we get a new character sequence with two spaces in it (32 is the ASCII code for the space character) replacing the Cyrillic o in Россия. So the length of the bytes sequence hasn’t changed but the length of the character sequence it represents has.
Meanwhile the original bytes instance is immutable:
>>> r1[2] = 32
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: 'bytes' object does not support item assignment
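When an immutable bytes object needs editing, one option is to copy it into a bytearray, change the copy, and convert back. This is a sketch of that pattern, not the only way to do it; editable and frozen are illustrative names.

```python
r1 = '\u0420\u043e\u0441\u0441\u0438\u044f'.encode('utf-8')

editable = bytearray(r1)     # a mutable copy; r1 itself is untouched
editable[2:4] = b'  '        # overwrite the third and fourth bytes with two spaces
frozen = bytes(editable)     # back to an immutable bytes object

print(len(r1), len(frozen))          # 12 12
print(len(frozen.decode('utf-8')))   # 7
```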
3.4.2.5. String/Unicode methods
We present a sampling of the most important string methods (omitting, among others, the dunder methods used for retrieval by index or for operators like in and +). All of these represent sensible and important operations on strings: title-casing, upper-casing, capitalization, counting occurrences of a given element, finding the index of the first occurrence of a given element, converting the string into a list by splitting it on a particular “separator” element (such as a space), or the inverse operation, using the string as the separator to glue a list of strings into a single string. Some of these we have already seen as methods on lists. Lists and strings share methods such as .index(…) and .count(…) because both are sequences. Because strings are immutable, they lack certain updating methods that lists have (.append(…), .extend(…), .remove(…)).
In all of the following examples, S is a string. This is just a sample. See the official Python docs for the complete list of string methods. Or just type help(str) at the Python prompt!
S.capitalize()
Return a copy of S with its first character upper-cased and the remaining characters lower-cased. If S already has that form, the result is equal to S.
S.count(x)
Return the number of (non-overlapping) times the substring x appears in the string S.
S.index(x)
Return the index in S of the start of the first substring identical to x. It is an error (a ValueError) if there is no such substring.
S.find(t)
Return index of first instance of t in S, or -1 if not found.
S.rfind(t)
Return index of last instance of t in S, or -1 if not found.
S.join(Seq)
Combine the strings of Seq into a single string using S as the glue. ' '.join(["See", "John", "run"]) produces "See John run".
S.replace(x,y)
Return a string in which every instance of the substring x in S is replaced with y:
>>> X = 'abracadabra'
>>> X.replace('dab','bad')
'abracabadra'
>>> X.replace('a','b')
'bbrbcbdbbrb'
S.split(t)
Split S into a list wherever t is found. If t is not supplied, split wherever a run of whitespace is found.
S.splitlines()
Split S into a list of strings, one per line.
S.strip()
Return a string just like S with initial and trailing white space removed. White space includes line breaks. One reason this is useful is because reading a file line by line often produces strings with trailing newline characters.
S.title()
Return a string just like S in which all words are capitalized:
>>> 'los anGeles'.title()
'Los Angeles'
S.istitle()
Return True if every word in S is capitalized (an upper-case letter followed only by lower-case letters). Otherwise, return False:
>>> 'los anGeles'.istitle()
False
>>> 'Los AnGeles'.istitle()
False
>>> 'Los Angeles'.istitle()
True
This is one of a whole family of Boolean test methods on strings, such as .isupper() and .isascii(), which test computationally important properties of strings.
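Several of the methods above can be tried together on one string. The sketch below uses an invented sample line with leading spaces and a trailing newline, such as might come from reading a file; line, clean and found are illustrative names. It also contrasts find, which returns -1 on failure, with index, which raises an error.

```python
line = '  the quick brown fox\n'

clean = line.strip()                      # drop leading and trailing whitespace
print(clean.capitalize())                 # The quick brown fox
print(clean.title())                      # The Quick Brown Fox
print(clean.title().istitle())            # True
print(clean.count('o'))                   # 2
print(clean.find('quick'))                # 4
print(clean.find('slow'))                 # -1

# index() behaves like find() but raises ValueError when nothing matches.
try:
    clean.index('slow')
    found = True
except ValueError:
    found = False
print(found)                              # False
```

Whether find or index is the better choice depends on whether a missing substring is an expected case or a genuine error in the program at hand.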