8.6. Dimensionality reduction¶
Most social science data is high-dimensional. That is, we are looking at many variables at once. Visualization of high-dimensional datasets can be hard. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots can be much harder to understand. As we will show with House votes example below, reducing the dimensionality of the data can be very helpful in finding its structure.
The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this sometimes allows some degree of understanding of the data structure, the randomness of the choice may well lose the more interesting information within the data. In the figure below, due to Eamonn Keogh and Jessica Lin, a set of data points in 3 dimensions is projected into 2 dimensions along 3 different axes, yielding three very different shapes. Let’s say the triangle captures the structure you are interested in the data. But each projection yields a very different picture.
How will you know which projection to use? It may not be practical to do all possible projections when there are thousands of variables; nor is it necessarily the case the that you will know which picture is the right one just by looking at it. Finally, projecting is just selecting a few variables and ignoring all the others. There are techniques of dimensionality reduction that don’t work that way. Instead we try to find a single new dimension that captures as much information as possible from multiple dimensions in the original data set.
Many of the most important dimensionality reduction techniques are linear. That is, they involve linear equations that boil M numbers down to N numbers (N < M). These include Principal Component Analysis (PCA), Independent Component Analysis, Linear Discriminant Analysis, and others. These algorithms define specific ways to choose an “interesting” linear projection of the data. They can and do miss important non-linear structure in the data, but they can be very useful. We will look at one example of such a technique, Latent Semantic Analysis (LSA, which is very closely related to PCA) 1 .
8.6.1. An example¶
Below we see a simple two dimensional visualization of House of Representatives voting patterns, achieved by LSA. Essentially what happens is that we reduce the voting pattern from 16 dimensions to 2, and in doing so, we effectively cluster representatioves together in a way that seems to capture party affiliations (represented by color).
Data Set Characteristics: Multivariate Number of Instances: 435 Area: Social Attribute Characteristics: Categorical Number of Attributes: 16 Date Donated: 1987-04-27 Associated Tasks: Classification Missing Values? Yes Origin: Congressional Quarterly Almanac, 98th Congress, 2nd session 1984, Volume XL: Congressional Quarterly Inc. Washington, D.C., 1985. Donor: Jeff Schlimmer (Jeffrey.Schlimmer@a.gp.cs.cmu.edu) Data Set Information: This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition). Attribute Information: 1. Class Name: 2 (democrat, republican) 2. handicapped-infants: 2 (y,n) 3. water-project-cost-sharing: 2 (y,n) 4. adoption-of-the-budget-resolution: 2 (y,n) 5. physician-fee-freeze: 2 (y,n) 6. el-salvador-aid: 2 (y,n) 7. religious-groups-in-schools: 2 (y,n) 8. anti-satellite-test-ban: 2 (y,n) 9. aid-to-nicaraguan-contras: 2 (y,n) 10. mx-missile: 2 (y,n) 11. immigration: 2 (y,n) 12. synfuels-corporation-cutback: 2 (y,n) 13. education-spending: 2 (y,n) 14. superfund-right-to-sue: 2 (y,n) 15. crime: 2 (y,n) 16. duty-free-exports: 2 (y,n) 17. export-administration-act-south-africa: 2 (y,n) In R. > install.packages("mlbench") > data(HouseVotes84,package="mlbench") > HouseVotes84 Class V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 1 republican n y n y y y n n n y <NA> y y y n y 2 republican n y n y y y n n n n n y y y n <NA> 3 democrat <NA> y y <NA> y y n n n n y n y y n n 4 democrat n y y n <NA> y n n n n y n y n n y 5 democrat y y y n y y n n n n y <NA> y y y y 6 democrat n y y n y y n n n n n n y y y y 7 democrat n y n y y y n n n n n n <NA> y y y 8 republican n y n y y y n n n n n n y y <NA> y 9 republican n y n y y y n n n n n y y y n y 10 democrat y y y n n n y y y n n n n n <NA> <NA> 11 republican n y n y y n n n n n <NA> <NA> y y n n 12 republican n y n y y y n n n n y <NA> y y <NA> <NA> 13 democrat n y y n n n y y y n n n y n <NA> <NA> 14 democrat y y y n n y y y <NA> y y <NA> n n y <NA> 15 republican n y n y y y n n n n n y <NA> <NA> n <NA> 16 republican n y n y y y n n n y n y y <NA> n <NA> . . . Source: http://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records Note: http://archive.ics.uci.edu/ml/datasets.html is a wonderful dataset repository.
The same techniques can be applied to bills in an effort to find bill clusters, or natural classes of bills that characterize voting blocks. Looking at this visualization sheds some light on what such pictures show.
Party breakdown (yes votes):
Bill Republican Democrat ================================================ handicapped-infants 31 156 water-project-cost-sharing 75 120 budget-resolution 22 231 physician-fee-freeze 163 14 el-salvador-aid 157 55 religious-groups-in-schools 149 123 anti-satellite-test-ban 39 200 aid-to-contras 24 218 mx-missile 19 188 immigration 92 124 synfuels-corporation-cutback 21 129 education-spending 135 36 superfund-right-to-sue 136 73 crime 158 90 duty-free-exports 14 160 export-act-south-africa 96 173
8.6.2. Basic ideas¶
We will begin by trying to motivate some of the basic ideas from an MDS perspective, because MDS emphasizes preserving similarity relations and understanding how LSI preserves similarity relations is key to understanding its utility in other applications.
The structure is as follows. First we pose the problem LSI solves as an MDS problem of a specific kind, and then discuss how LSI is an optimal solution. This in turn gives a good first approximation of the conditions under which LSI works best.
An important class of techniques for doing non-linear dimensionality reduction is called manifold learning. Manifold learning tries to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Usually manifold learning teachniques learn the high-dimensional structure of the data from the data itself, without the use of predetermined classifications of examples. Such learning algorithms are called unsupervised. Manifold learning techniques include isomap, locally linear emmbedding, spectral embedding, and multideminsional scaling.
This little exposition of LSI contains nothing that is not already said in Deerwester ([DEERWESTER1990]), Bartell et al. (1992), and Manning et al. (2003). The main purpose of this document is to bring the insights of all three together in one place. For an excellent discussion of the relation of LSI to factoring, see Tiago Ramalho’s blog