9. Visualization

In this section we discuss a set of Python tools for visualization.

Python provides access to a sophisticated set of plotting tools for standard mathematical plots, and we will take a brief look at the most complete of these, matplotlib, and its extension, pylab. We’ll also look at bokeh, a tool focused on creating interactive plots, especially plots that can be displayed in a browser. As anyone who’s done much plotting knows, plotting data is a complex task involving numerous decisions about the high-level goals of the visualization. We look at some examples of things that can be easily done with matplotlib: boxplots, violin plots, and some uses of color to grapple with higher dimensionality. However, we will do little more than scratch the surface of the subject here.
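As a rough illustration of the plot types just mentioned, here is a minimal sketch that draws a boxplot and a violin plot of the same data side by side. The helper name make_box_and_violin and the synthetic samples are assumptions made for this example only; they are not taken from later sections of the chapter.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt
import numpy as np


def make_box_and_violin(samples, labels):
    """Draw a boxplot and a violin plot of the same samples in one figure."""
    positions = list(range(1, len(samples) + 1))
    fig, (ax_box, ax_violin) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
    ax_box.boxplot(samples, positions=positions)
    ax_violin.violinplot(samples, positions=positions, showmedians=True)
    for ax, title in ((ax_box, "Boxplot"), (ax_violin, "Violin plot")):
        ax.set_xticks(positions)
        ax.set_xticklabels(labels)
        ax.set_title(title)
    return fig


# Synthetic data (illustrative only): three groups with different centers and spreads
rng = np.random.default_rng(0)
samples = [rng.normal(loc=mu, scale=sd, size=200)
           for mu, sd in [(0, 1), (1, 0.5), (-1, 2)]]
fig = make_box_and_violin(samples, ["A", "B", "C"])
```

The returned figure can be written to disk with fig.savefig or shown interactively with plt.show() once the Agg line is removed. The violin plot shows the shape of each distribution, which the boxplot summarizes with quartiles.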

The tools-oriented part of this chapter might be thought of as visualization for exploration. We have a dataset and we’re interested in gaining a deeper understanding of its properties. Various standard plot types and plotting tools can be of help. But another equally important visualization task arises when we have asked good questions about the data and the answers tell us something important about how the data works. We then need to use visualization for storytelling. Good visual storytelling often involves combining information from different sources or enriching the information from one source with statistical analysis. We have to present an explanatory pattern in the data in a clear and compelling way. We’ll take a look at the process of moving from raw data to storytelling with a very simple example involving automobile mileage. We’ll also look at the enrichment that’s possible when geographic information is available, which instantly adds a whole new storytelling dimension to the data.

It is also important to remember that good visualization involves much more than just simple plotting. It often requires significant computing to cast the data into a form suitable for graphical representation and exploration, and as a complete programming language, Python can help out there. We will look at two examples of this sort, social networks and dimensionality reduction. For visualization of social networks, Python offers a useful interface through the networkx package. For larger networks, other tools may be better. For dimensionality reduction, some of the standard tools in the numpy linear algebra package are of use.
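To give a flavor of the networkx interface mentioned above, the following sketch builds a tiny network and computes a few quantities that typically feed into a drawing. The names and ties are invented for illustration, and the choice of degree centrality and a spring layout is an assumption of this example, not necessarily what the chapter uses later.

```python
import networkx as nx

# Hypothetical friendship ties (names invented for illustration)
ties = [("Ann", "Bob"), ("Ann", "Cat"), ("Bob", "Cat"),
        ("Cat", "Dan"), ("Dan", "Eve")]

graph = nx.Graph()
graph.add_edges_from(ties)

# Number of ties per person, and the same quantity normalized by (n - 1)
degree = dict(graph.degree())
centrality = nx.degree_centrality(graph)

# A force-directed layout assigns each node a 2D position for plotting;
# the seed makes the layout reproducible
layout = nx.spring_layout(graph, seed=0)
```

With matplotlib installed, nx.draw(graph, pos=layout, with_labels=True) renders the network from these positions.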

More than just pretty pictures

  1. Good visualizations provide genuine insights into data, and can even promote the discovery of lawlike relations. All data graphs are visualizations; some are just more revealing than others. But the field of illuminating visualizations is not limited to graphs. For a small sample of the kind of imaginative spaces some visualizations explore, see the prefuse gallery.

  2. But in social science, a lot of our data is qualitative. We need to make use of categories like ethnicity, political persuasion, social ties, and layered concepts of social identity. It’s hard to draw graphs for these kinds of data, because they don’t easily map onto numbers. This is where the discipline of data visualization becomes interesting, because there are helpful techniques for visualizing all kinds of non-numerical data.

Data visualization is an interdisciplinary field that requires computational, mathematical, and social science knowledge, and we’re not going to try to cover even a significant fraction of the subject matter here. Instead we’ll try to give a couple of helpful examples that require two key concepts, multidimensional scaling and social networks, and we’ll point to some accessible Python tools to help with both. The first concept is much simpler than it sounds: sometimes complex data points involving many independent variables (many dimensions) can be thought of as having a basic quantifiable similarity relation (some points are more alike than others); multidimensional scaling seeks to construct a visualization that respects this similarity relation. The second concept is the social network. Many social processes, including the behavior of social groups in crisis, the diffusion of innovations, and the spread of disease and radicalization, can be understood by understanding the network of social ties of various kinds that bind groups together, often into a number of loosely connected networks. In this case a fundamental part of what needs to be visualized isn’t numerical: it’s a network of agent-to-agent ties. In the abstract, a network is what computer scientists call a graph. Visualizing graphs is a special kind of visualization problem of its own. But appropriate visualizations of social groups and an understanding of their properties can bring important insights into our studies of groups.
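The idea behind multidimensional scaling can be sketched with numpy’s linear algebra tools. The version below is classical (Torgerson) scaling, one of several variants and not necessarily the one used later in the chapter; the function name classical_mds and the toy data are assumptions made for this example. Given a matrix of pairwise distances, it recovers low-dimensional coordinates whose distances approximate the originals.

```python
import numpy as np


def classical_mds(dist, n_components=2):
    """Embed points in n_components dimensions from a symmetric distance matrix."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    # Double-center the squared distances to obtain a Gram (inner-product) matrix
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    # eigh returns eigenvalues in ascending order; keep the largest ones
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    # Clip small negative eigenvalues caused by rounding or non-Euclidean input
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Toy data: four points at the corners of a unit square
points = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

coords = classical_mds(dist)
# Distances between the recovered coordinates, for comparison with dist
recovered = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
```

The recovered coordinates may be rotated or reflected relative to the original points, since only the distances are preserved. For real data the input would be a dissimilarity measure chosen for the problem rather than distances computed from known coordinates.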

Finally, the awful truth is that Python is probably not the best language for visualization. For example, the consensus seems to be that the basic plotting tools in R are richer and better suited to data analysis needs than those in Python, or even those in Python with pandas. In addition, some R packages provide their own plotting add-ons to suit the needs of their particular data. Thus, visualization is a domain in which it is very natural to explore Python’s ability to interface with other programs.

We will look at an example of this by providing a Python script for constructing word clouds. A word cloud is a graphical representation of the vocabulary in a document or set of documents, which tries to represent the topical importance of words with font size. R provides a very good, very easy-to-use word cloud package. To build word clouds in Python, we will write a Python script that calls the R word cloud package, using the subprocess module.
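The general pattern of calling R from Python with subprocess might look like the sketch below. It assumes that R’s Rscript command-line program is installed and that you have written an R script (here given the hypothetical name make_wordcloud.R) that takes an input text file and an output image path as arguments. The function names and the argument order are assumptions of this example, not the chapter’s actual script.

```python
import shutil
import subprocess


def build_command(r_script, text_file, output_png):
    """Return the argument list for running an R script with two arguments."""
    return ["Rscript", r_script, text_file, output_png]


def run_wordcloud(r_script, text_file, output_png):
    """Run the (hypothetical) R word cloud script and return its standard output."""
    if shutil.which("Rscript") is None:
        raise RuntimeError("Rscript not found on PATH; is R installed?")
    result = subprocess.run(
        build_command(r_script, text_file, output_png),
        capture_output=True,  # collect stdout and stderr instead of printing them
        text=True,            # decode output as str rather than bytes
        check=True,           # raise CalledProcessError if R exits with an error
    )
    return result.stdout


# Example of the command that would be executed (file names are placeholders)
command = build_command("make_wordcloud.R", "speech.txt", "cloud.png")
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.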