12.12. Final projects

Following are some suggestions for class projects. You are also welcome to propose a project of your own, but do not begin one without prior approval from me; in particular, do not do a network project on a graph of your own choosing without getting prior approval.

  1. Part One: Go to the Social Security Administration US births website, select the births table there, and copy it to your clipboard. Use the pandas read_clipboard function to read the table into a pandas DataFrame, and use matplotlib to plot male and female births for the years covered in the data. Put this in an ipython notebook file. In the same notebook, use Python to get a list of male and female names from these files. The data is broken down by year of birth.

    Part Two: Aggregate the data for all years (see the examples in the Pandas notebooks). Find the set of all male names and the set of all female names. Use Python Counters to get separate letter frequencies for the male and female name sets. Use matplotlib to draw a plot that, for each letter (x-axis), shows the frequency of that letter (y-axis) as the last letter of a name, for both male and female names.
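    The last-letter tallies in Part Two can be sketched as follows. The helper function name and the tiny name sets here are illustrative stand-ins; the real project aggregates the full SSA name files.

```python
from collections import Counter
import string

def last_letter_freqs(names):
    # For each letter a-z, count how many names end in that letter.
    counts = Counter(name[-1].lower() for name in names)
    return [counts[c] for c in string.ascii_lowercase]

# Tiny illustrative name sets; the real project uses the SSA files.
male = {'John', 'James', 'Robert', 'Michael', 'William'}
female = {'Mary', 'Anna', 'Emma', 'Elizabeth', 'Linda'}
male_freqs = last_letter_freqs(male)
female_freqs = last_letter_freqs(female)

# To plot (with matplotlib imported as plt):
# xs = range(26)
# plt.plot(xs, male_freqs, label='male')
# plt.plot(xs, female_freqs, label='female')
# plt.xticks(xs, list(string.ascii_lowercase)); plt.legend(); plt.show()
```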

    Part Three: Now do just female names, but aggregate your data in decade (10-year) increments. Produce a plot that contains the 1880s line, the 1940s line, and the 1990s line, as well as the female line for all years aggregated together from Part Two.

    Part Four: Evaluate how stable this statistic is as follows. First, compute the variance for a random sample of 100 names over one of the three decades. For example, for each name x of your 100 names, compute the variance of the name counts for x over the 1940s decade. Sort the names from highest variance to lowest variance and plot the variances using matplotlib. Do the same for an adjacent decade (1930s or 1950s). When you plot the 1930s name variances, use the name order you decided on for the 1940s.
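    The Part Four computation can be sketched like this, assuming you have already built a mapping from each sampled name to its yearly counts within the decade; the function name and tiny data set are illustrative:

```python
import statistics

def name_variances(decade_counts):
    # decade_counts: {name: [count for each year in the decade]}
    return {name: statistics.pvariance(counts)
            for name, counts in decade_counts.items()}

# Tiny illustration; the real project samples 100 names from the SSA data.
sample = {'Linda': [100, 400, 900, 1600],   # rising fast: high variance
          'Mary': [500, 510, 490, 500]}     # stable: low variance
variances = name_variances(sample)

# Sort names from highest to lowest variance; reuse this order when
# plotting the adjacent decade.
order = sorted(variances, key=variances.get, reverse=True)
```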

    Speculate on why name popularities are stable, if they are, or on what demographic facts might explain any changes in name popularities, if there are any. Turn in your ipython notebook file, showing the code you used to complete Parts One through Four.

  2. Using regular expressions, extract all the proper names from the Project Gutenberg version of Pride and Prejudice or of Anna Karenina or any novel you have online access to. For better proper name extraction, distinguish capitalized first names from capitalized first words of sentences by using the regular expression for sentence endings provided in the regular expressions notebook. For still better proper name extraction (and a better grade), think about how to extract names like “Fred Flintstone”, which have spaces in them. For Anna Karenina, a special case of this is names that use Russian patronymics, like Alexey Alexandrovitch. Build a frequency table using a Python Counter and display the 50 most frequent names and their frequencies. Put this in an ipython notebook file. You may have to filter some non-names from the list by hand. Do so, then answer the following questions:

    1. Are all the top names names of people?
    2. Spoiler alert: Darcy is the man Elizabeth is going to marry at the end. What is his rank in the names ranking? What does this tell us about the kind of world Austen is writing about?

    Next construct a character plot by dividing the book up into 3000 word chunks and calculating how often each character occurs in each chunk. Plot a line for each of the top 20 characters showing how often each character’s name appears in each chunk. Choose the 4 most interesting characters and put them in one plot. Place the code for creating the “4 most interesting character lines” plot in your Python notebook and submit it.
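    The extraction and chunk counting can be sketched like this. The lookbehind pattern is a crude stand-in for the sentence-ending regular expression from the notebook, and all names here are illustrative:

```python
import re
from collections import Counter

def name_counts(text):
    # Capitalized words not immediately preceded by a sentence-ending
    # mark; a crude stand-in for the notebook's sentence-end regex.
    return Counter(re.findall(r'(?<![.!?]\s)\b[A-Z][a-z]+\b', text))

def chunk_counts(text, name, chunk_size=3000):
    # How often `name` occurs in each chunk_size-word chunk of the book.
    words = re.findall(r'[A-Za-z]+', text)
    return [words[i:i + chunk_size].count(name)
            for i in range(0, len(words), chunk_size)]

sample = "Elizabeth spoke to Darcy. Darcy smiled at Elizabeth and Jane."
counts = name_counts(sample)
```

Note the limitation this sketch shares with the simple approach: the sentence-initial "Darcy" is discarded along with genuine sentence-initial non-names, which is part of what makes the problem interesting.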

  3. Download the year 2000 PUMS data for some interesting state or states, for example, Alabama.

    Do one of the potential projects proposed in the census_data example notebook. Turn this in as an ipython notebook file.

  4. Participate (belatedly) in the 2012 Kaggle insult detection competition, which involves automatically detecting insults in an Internet corpus.

    All the data is in CSV files. You can satisfactorily complete the assignment by building two text classifiers along the lines of those demonstrated in the text classification notebook or the introduction to SVM notebooks. At least one of your classifiers should be a non-SVM; see the Scikit-learn (sklearn) docs for many other examples. Your classifiers should be trained on the Kaggle train files and tested on the Kaggle test files. You should also check your score on the Impermium verification set. On the Kaggle website, the problem is defined so that you can produce confidence labels between 0 and 1 indicating how confident your system is that the data item is an insult (where 1 is 100% confident). An SVM approach would actually allow you to do this, but implementing it would take a fair amount of application and mathematical knowledge, so you need only produce 0’s and 1’s. This strategy would get you a terrible competition score, but it will teach you a lot about how hard this problem is. Evaluate yourself by looking at accuracy, precision, and recall. Pay particular attention to the difference between your training and test scores. To turn in: the code you used to build your classifiers and a brief description of what you did. The description should give your best scores on training and test, as well as on the Impermium verification set. Note that you should run on the verification set only once, just before you turn in your best system. Your description should also indicate what kinds of learners you used, describe their differences, and state which learner you believe is better suited for this task. You should also describe the system parameters that gave you your best performance, if you deviated from the defaults.
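    The two-classifier setup can be sketched like this with scikit-learn. The toy texts stand in for the Kaggle CSV data, and this particular pairing of learners is just one reasonable choice, not the required one:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy stand-ins for the Kaggle train/test CSVs (text, is-insult label).
train_texts = ["you are an idiot", "have a nice day", "you idiot",
               "what a lovely idea", "total idiot", "nice work"]
train_labels = [1, 0, 1, 0, 1, 0]
test_texts = ["you are a total idiot", "a lovely day"]
test_labels = [1, 0]

# Bag-of-words features, fit on the training data only.
vec = CountVectorizer()
X_train = vec.fit_transform(train_texts)
X_test = vec.transform(test_texts)

# One non-SVM learner and one SVM learner, as the assignment asks.
scores = {}
for clf in (MultinomialNB(), LinearSVC()):
    clf.fit(X_train, train_labels)
    pred = clf.predict(X_test)
    scores[type(clf).__name__] = (accuracy_score(test_labels, pred),
                                  precision_score(test_labels, pred),
                                  recall_score(test_labels, pred))
```

On real data you would compute these scores for both the training and test sets to see the train/test gap the assignment asks about.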

  5. Do a community discovery study of some network. Best of all would be some network you construct yourself. For example, you can construct your real world friends graph (not your Facebook friends graph), a graph of the characters whose names co-occur in the same 1000 word chunks of some novel, or any of the following:

    1. Dolphin community. This is a graph of a dolphin community which is about to split up as a result of the loss of a keystone member (node label 'SN100'). The data is presented and described in [LUSSEAU2004]. The graph nodes have a 'label' attribute which allows you to match up the graph you read in with Lusseau and Newman’s discussion. To use this graph for a final project, you should download and read the paper. The dolphin graph.
    2. Goodreau’s Faux Mesa High School data. This graph is “a simulation of an in-school friendship network”. The network is named faux.mesa.high because the school community on which it is based is in the rural western United States, with a student body that is largely Hispanic and Native American. The data set is based upon a model to fit data from one school community from the National Longitudinal Study on Adolescent Health [RESNICK1997]. This network was originally distributed as part of the Statnet package. The nodes contain race data in the attribute "Race" and gender data in the attribute "Sex", and Grade data (7th, 8th, 9th, etc.) in the "Grade" attribute. The graph.
    3. Political books. This graph was constructed by Valdis Krebs. The page discussing the data has lots of interesting book-buying facts, such as the fact that Rules for Radicals is being bought by the right. The graph Krebs built was downloaded from Mark Newman’s network data page. The nodes in this graph are recently published political books; a link between books indicates that a significant number of buyers bought the same book. The graph also contains data about the political orientation of each book (in the value attribute). Worthy of note: Krebs has classified some of his books Neutral(!). The graph.
    4. A WordNet graph from some subpart of WordNet. For this, use the WordNet Networkx notebook which supplies some code for building interesting subgraphs. To evaluate your communities, you should use average pairwise similarity. For computing the semantic similarity of two concepts in the graph, see the NLTK Wordnet interface docs.

    Run Mark Newman’s community discovery algorithm [NEWMAN2006] on the graph you choose for this project. An implementation was supplied with the using_networkx notebook, and there are some demonstrations there of how to discover and draw communities. Whichever graph you choose, the project is to attempt some form of analysis to evaluate the communities you get. In the case of the dolphin community, you will be evaluating how well the communities the algorithm discovers predict the split that happens, as described in the Lusseau/Newman paper, especially the figures. In the case of the Faux Mesa High School and political book data, you will be looking at what’s called assortative mixing. High school: do the communities discovered by the algorithm have a high degree of homogeneity of race, grade, and gender? Compute the percentages of each race, gender, and grade in each discovered community, and compare them to the overall averages. Political books: do the communities have a high degree of homogeneity in political orientation? Hint: don’t prejudge this one. In the case of the WordNet graph, you’ll choose a word neighborhood based on some fairly general word. Let’s use dog as an example. Your community discovery algorithm will find word communities within the dog neighborhood (see the WordNet image on my web page, where colors represent discovered communities in the dog neighborhood). You will be looking at the degree of pairwise similarity within communities. Compare it to the average degree of pairwise similarity for dog-neighborhood words in general.
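    The assortative-mixing percentages can be computed with plain dictionaries and Counters. The function name is my own, and the attribute dicts below are toy stand-ins for the node data you get after running community discovery:

```python
from collections import Counter

def mixing_stats(nodes, attr):
    # nodes: {node: attribute dict}, where each dict has a 'Community'
    # key (added by the community discovery algorithm) plus attributes
    # like 'Race', 'Sex', or 'Grade'.
    by_comm = {}
    for data in nodes.values():
        by_comm.setdefault(data['Community'], Counter())[data[attr]] += 1
    per_comm = {c: {val: n / sum(cnts.values()) for val, n in cnts.items()}
                for c, cnts in by_comm.items()}
    overall_counts = Counter(data[attr] for data in nodes.values())
    total = sum(overall_counts.values())
    overall = {val: n / total for val, n in overall_counts.items()}
    return per_comm, overall

# Toy node attributes; real data comes from the Faux Mesa High graph.
nodes = {'a': {'Community': 0, 'Sex': 'F'},
         'b': {'Community': 0, 'Sex': 'F'},
         'c': {'Community': 1, 'Sex': 'M'},
         'd': {'Community': 1, 'Sex': 'F'}}
per_comm, overall = mixing_stats(nodes, 'Sex')
```

Comparing per_comm against overall tells you whether a community is more homogeneous than the graph as a whole.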

To turn in: the code you used to read in the graph, the code you used to do your community evaluation, and any images you found helpful. You can turn in an ipython notebook if you like, but you don’t need to.

Here is some technical help:

  1. Read graphs in using the readwrite_gml module supplied with your using_networkx notebook. Do not use the nx.read_gml function. So for example:

    import readwrite_gml
    pol_book_graph = readwrite_gml.read_gml('polbooks.gml')
  2. As demonstrated in the using_networkx notebook, community detection involves importing a module named community_jolleycraig, which is supplied in the notebook zipfile. That module has two community detection functions detect_communities and split_communities. Some projects will use detect_communities and some will use split_communities. The difference is that the former keeps splitting communities until modularity can’t be increased anymore. The latter only does one split, splitting the input graph into two large communities. Use your judgment.

  3. For help on computing average pairwise similarities for a group, see the Facebook assignments solution notebook.
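    A generic average-pairwise-similarity helper looks like this. The toy letter-overlap similarity is only for illustration; for WordNet communities you would pass in a WordNet-based measure instead (e.g. one built on NLTK’s path similarity, per the NLTK Wordnet interface docs):

```python
from itertools import combinations

def average_pairwise_similarity(items, sim):
    # Mean of sim(a, b) over all unordered pairs in `items`.
    pairs = list(combinations(items, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)

# Toy similarity (Jaccard overlap of letters), just for illustration.
def letter_sim(a, b):
    return len(set(a) & set(b)) / len(set(a) | set(b))
```

Compute this once per discovered community, and once over the whole neighborhood, to make the comparison the project asks for.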

  4. If you are using a graph other than the dolphin graph, the Faux Mesa High School graph, the polbooks graph, or a WordNet graph, you will have to come up with your own way of evaluating the communities. Be aware that this is actually the heart of the assignment, so deciding to do something lame on this part is not a good strategy. Talk to me if you are using any graph but the graphs suggested here, or if you are in doubt about how to evaluate your communities. I may try to talk you into using some other data.

  5. After laying out the graph as demonstrated in the using_networkx notebook, the easiest way to draw a graph colored according to some attribute is to do this (the drawing function’s name below is an assumption; use the helper supplied in the using_networkx notebook):

    # hypothetical name for the notebook's drawing helper:
    draw_graph(pol_book_graph, 'pol_book_orientation_graph.png',
               cls_attr='value', prog='fdp', pos=pos,
               val_seq=['l','c','n'], show = True, title = 'Pol Book Political Orientations')

    This saves a colored version of the graph to the file "pol_book_orientation_graph.png". The cls_attr argument specifies which attribute determines the color chosen for a node; in this case, it is the value attribute of a node. There are three one-letter values for this attribute: "l", "n", and "c". They don’t have to be supplied unless you want to assign specific colors to them, in which case you also supply a color sequence in the "color_seq" argument. So the above command assigns red to liberals, blue to conservatives, and white to neutrals. Adding a title argument will add a text title to the graph. Running the community discovery algorithm adds an attribute named "Community" to each node of a graph, so the communities discovered can be colored in a very similar way. Note that the attribute "Community" won’t exist until after you run the algorithm.

  6. For doing statistics relating node attributes to a node’s discovered community, you need to be able to get the community assigned to a node n in graph G:

    G.node[n]['Community']   # G.nodes[n]['Community'] in newer NetworkX versions

    The same is true for any other attribute, like "Sex" or "Race".

  7. The Faux Mesa High School graph is not connected and you will need to do all your work on its largest component, as follows:

    import community_jolleycraig
    import readwrite_gml
    from networkx_patch import get_biggest_component
    fmhs_nx_graph = readwrite_gml.read_gml('FauxMesaHigh.gml')
    fmhs_nx_graph_UG = fmhs_nx_graph.to_undirected()
    fmhs_big_component = get_biggest_component(fmhs_nx_graph_UG)