7. Classification of Text

In this chapter, we give a brief introduction to text handling tools available in Python.

The problem with text analysis is that there are too many words. Bill Paisley’s description of the problem is now nearly a half century old, but holds up pretty well:

Thanks to paper and ink, words are a durable human artifact ... words form the running records of civilization and also the episodic record of individual experience. Words are rich data for all social research, from psychiatry to cultural anthropology.

Unfortunately, there are always too many words. Words produced in minutes may justify hours of analysis. A set of psychiatric interviews, or editorials in Pravda, or a collection of folktales can occupy (and have occupied) researchers for months.

Thus, analysts faced with large quantities of text have historically sought either to filter it, running it through a strainer that captures the major strands of content, or to crop it, mining it for specific bits of content. Paisley’s complaint comes at the beginning of a review of the Harvard General Inquirer system, a prime example of the filtering approach, which seeks to boil a text down to a core set of high-content or high-significance words. But computational tools have come a long way since Paisley wrote these words, both in sophistication and in raw computational power: machine learning now offers a third alternative, representing texts as points in a high-dimensional space in which we can draw class boundaries that take into account the contributions of all words.
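The idea of representing a text as a point in a high-dimensional space can be sketched in a few lines of plain Python: one axis per vocabulary word, with word counts as coordinates. The two tiny example sentences are invented for illustration.

```python
# A minimal sketch (plain Python, no libraries) of a bag-of-words
# representation: each document becomes a vector of word counts,
# i.e. a point in a space with one dimension per vocabulary word.
from collections import Counter

docs = ["the cat sat on the mat", "the dog sat on the log"]

# The vocabulary defines the axes of the space.
vocab = sorted(set(word for doc in docs for word in doc.split()))

def to_vector(doc):
    """Map a text to its coordinates: a count for each vocabulary word."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocab]

vectors = [to_vector(doc) for doc in docs]
```

Even on this toy scale, the two sentences end up as distinct points whose distance reflects how many words they share.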

We start with a brief introduction to some basic ideas in machine learning, introducing the Python library sklearn, then move on to a discussion of some Python tools for text classification. To narrow the field, we focus on aspects of text processing that are best thought of as subfields of machine learning.

We also introduce some text handling facilities provided by NLTK, a toolkit specifically designed for natural language processing.
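As a small taste of those facilities, the sketch below uses NLTK's `FreqDist` to tabulate word frequencies; it assumes only that the nltk package is installed (no downloaded corpora are needed), and the sample sentence is invented for illustration.

```python
# A brief sketch of NLTK's FreqDist, a frequency table for words
# (a Counter subclass with convenience methods for text analysis).
from nltk import FreqDist

words = "the cat sat on the mat".split()
fdist = FreqDist(words)
print(fdist.most_common(2))  # the two most frequent words and their counts
```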

Finally, we introduce an indispensable low-level language-processing module that may end up being your most useful day-to-day tool: Python’s implementation of regular expressions (in the re package). Regular expressions provide a way of extracting selected bits of content by defining the text pattern common to all members of the content set.
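For instance, "all four-digit years" is a content set definable by a single pattern. A brief sketch, with an invented sample sentence:

```python
# Extracting content by pattern with Python's re module: find every
# four-digit year in a text. \b marks word boundaries, \d{4} matches
# exactly four digits.
import re

text = "The survey ran from 1969 to 1971, with a follow-up in 1980."
years = re.findall(r"\b\d{4}\b", text)
print(years)  # → ['1969', '1971', '1980']
```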