This section introduces the course, discussing its goals and how to set up Python.

2. Introduction

Increasingly, social scientists find themselves facing ever larger data sets, available on the internet and elsewhere, without suitable tools to deal with them. Many end up using spreadsheet programs for their data-processing tasks, spending hours clicking around or copying and pasting, and then repeating the process for other data files. Not only is this a waste of time, but it often makes it hard to reproduce the steps that led to a particular result. Work that can’t be reproduced can’t be validated or improved, which makes it essentially useless.


This problem has a name. It is called big data. The emergence of big data has profound consequences for government, social policy, and business, but what we will focus on here is how big data affects research in social science, and here the perspective changes. For social scientists, big data is less of a problem than it is an unprecedented opportunity. Social science has already been and will continue to be revolutionized by big data.

The unprecedented wealth of social data available in the new information age stems in part from social changes brought about by new media, technical advances in data acquisition, and new hardware. Yet an equally important factor in the growing importance of social data is the development of data science. Data science is an interdisciplinary mix of statistics, computer science, and machine learning; its most promising feature is that its tools for data management, classification, clustering, and visualization build in a strong component of data exploration and knowledge discovery. Thus, data science promises to provide ways of coping with the overwhelming complexity of big data. This is not a textbook in data science or even data exploration, but it is motivated by the need to provide some entry-level skills that will open the door to using data science tools. 1

This course will show you how to use your data more powerfully and effectively via the scripting language Python. The course touches on many topics of theoretical interest in data science (such as social networks and data visualization), but the focus is on manipulating data so that you can tailor it to the needs of your particular project. The course targets social science students and assumes no prior programming knowledge. Although many of the techniques are relevant to linguistics, economics, and geography, the course focuses on techniques that are applicable to a wide range of data sources, including images, social network data, web pages, and blogs.


It may be helpful to look at a short history of data science.