7. Data and Data Frames

In this chapter we introduce Pandas.

Pandas is Python’s most popular toolset for manipulating data in tabular form (Excel sheets, data tables). This chapter has two main goals. The first is to introduce the two main pandas data types, DataFrame and Series.

A DataFrame is a table of data. Datasets at all levels of analysis can be represented as DataFrames.

You can think of a DataFrame as being organized in rows and columns, like a 2D numpy array, but differing from it in two important ways:

  1. A DataFrame primarily uses keyword (label-based) indexing rather than positional indexing. Columns and rows have names, which makes DataFrames a little more like spreadsheet tables.

  2. The data types of the columns typically differ from one another (Pandas columns may contain strings, numeric types, and datetime types, among others).
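The two differences listed above can be sketched in a few lines. The table contents, the column names (`name`, `age`, `joined`), and the row labels (`a`, `b`, `c`) are made up for illustration; they are not taken from any dataset used in this chapter.

```python
import numpy as np
import pandas as pd

# A 2D numpy array: one data type throughout, positional indexing.
arr = np.array([[1.0, 2.0], [3.0, 4.0]])
corner = arr[0, 1]  # row 0, column 1, selected by position

# A DataFrame: named rows and columns, one data type per column.
# All names and values here are hypothetical.
df = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Cy"],                 # strings
        "age": [34, 27, 45],                          # integers
        "joined": pd.to_datetime(
            ["2020-01-15", "2021-06-01", "2019-11-30"]
        ),                                            # datetimes
    },
    index=["a", "b", "c"],                            # row names
)

# Select a cell by row name and column name.
age_b = df.loc["b", "age"]

# Inspect the per-column data types.
print(df.dtypes)
```

Here `df.loc` selects by name. Pandas also keeps a positional accessor, `df.iloc`, for cases where position is more convenient.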

Despite these changes in how indexing and typing work, all the principles that apply to computing with numpy arrays carry over, with minor modifications, to computing with Pandas DataFrames. This is especially true of Boolean indexing, which will be your fundamental tool for selecting and reshaping data in pandas. Where a DataFrame is like a 2D array, a Series is like a 1D array; both the rows and the columns of Pandas DataFrames are Series objects.
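As a small illustration of how Boolean indexing carries over from numpy, and of rows and columns being Series objects, consider the sketch below. The data, the column names (`city`, `temp_c`), and the row labels are hypothetical.

```python
import numpy as np
import pandas as pd

# Boolean indexing on a numpy array: keep the elements where the mask is True.
arr = np.array([3, 8, 5, 10])
big_arr = arr[arr > 4]

# The same idea on a DataFrame (made-up data).
df = pd.DataFrame(
    {"city": ["Oslo", "Lima", "Pune"], "temp_c": [4, 19, 31]},
    index=["r1", "r2", "r3"],
)
mask = df["temp_c"] > 10   # a Boolean Series, one value per row
warm = df[mask]            # keeps only the rows where mask is True

# A single column and a single row are both Series objects.
col = df["temp_c"]
row = df.loc["r2"]
print(type(col).__name__, type(row).__name__)
```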

The second goal of the chapter is to introduce you to some of pandas' analytical tools, especially cross tabulation and grouping. We will also briefly examine a few basic uses of pivot tables.
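To give a first taste of these three tools, here is a minimal sketch using `pd.crosstab`, `DataFrame.groupby`, and `DataFrame.pivot_table`. The small table and its column names (`dept`, `level`, `salary`) are invented for illustration only.

```python
import pandas as pd

# Hypothetical data: five employees in two departments.
df = pd.DataFrame(
    {
        "dept": ["A", "A", "B", "B", "B"],
        "level": ["jr", "sr", "jr", "jr", "sr"],
        "salary": [50, 80, 55, 65, 90],
    }
)

# Cross tabulation: count rows for each (dept, level) combination.
counts = pd.crosstab(df["dept"], df["level"])

# Grouping: split the rows by dept, then average salary within each group.
mean_by_dept = df.groupby("dept")["salary"].mean()

# Pivot table: mean salary, with dept as rows and level as columns.
pivot = df.pivot_table(
    values="salary", index="dept", columns="level", aggfunc="mean"
)

print(counts)
print(mean_by_dept)
print(pivot)
```

The cross tabulation counts occurrences, while the pivot table aggregates a value column over the same row and column layout.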