Word2Vec assignment 

    Note: This assignment is optional. I will be happy to comment on your efforts, or answer any questions.

    Prep

    Follow the easy install directions for Radim Rehurek's gensim module, which implements word2vec, the Deep Learning-inspired word embedding component. Those directions are here. The discussion below is based on Rehurek's word2vec tutorial.
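
    A minimal setup sketch, assuming you use pip and the NLTK downloader (your environment may differ; follow the linked directions if in doubt):

    # in a shell: pip install gensim
    import nltk
    # fetch the three NLTK corpora used below
    for corpus in ('brown', 'movie_reviews', 'treebank'):
        nltk.download(corpus)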

    Run the following code (which will take some time):

    from gensim.models import Word2Vec
    from nltk.corpus import brown, movie_reviews, treebank
    b = Word2Vec(brown.sents())
    mr = Word2Vec(movie_reviews.sents())
    t = Word2Vec(treebank.sents())
    

    You have built three word2vec models on very small amounts of data: one on the Brown corpus, one on the Movie Reviews corpus, and one on the Penn Treebank sample. They won't be very good, but they still contain useful information.
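
    Since training takes a while, you may want to save the models and reload them later rather than retrain. A sketch using gensim's save/load (the file names here are just examples):

    # save each model to disk
    b.save('brown.w2v')
    mr.save('movie_reviews.w2v')
    t.save('treebank.w2v')
    # later, reload without retraining
    mr = Word2Vec.load('movie_reviews.w2v')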

    Doing things

    Let's find the 5 nearest neighbors of the word man, based on the data in the Movie Review corpus:

    >>> mr_man_nns = mr.most_similar('man', topn=5)
    >>> mr_man_nns
    [(u'woman', 0.8220334053039551), (u'girl', 0.6817629933357239), (u'boy', 0.6774479150772095), (u'doctor', 0.6448361873626709), (u'calculating', 0.605032205581665)]
    >>> mr.similarity('man','woman')
    0.8220334001723657
    

    The result, mr_man_nns, is a list of five pairs: each pair is a nearest neighbor together with its similarity to man. These are the five words most similar to man according to the model. To find the similarity score of any two words, use similarity as shown.
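
    The same methods work on all three models, so you can compare neighbors across corpora in one loop. A sketch, using the variables b, mr, and t defined above (in recent gensim versions these methods live on model.wv rather than on the model itself):

    # compare nearest neighbors of the same word across the three small models;
    # a word that is too rare in a corpus (below min_count) raises a KeyError
    for name, model in [('brown', b), ('movie_reviews', mr), ('treebank', t)]:
        print(name, model.most_similar('man', topn=5))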

    To complete analogies like man:king::woman:??, you do the following:

    >>> mr.most_similar(positive=['woman', 'king'], negative=['man'])
    [(u'ark', 0.7656992077827454), ... ]
    

    A variant proposed in Levy and Goldberg (2014):

    >>> mr.most_similar_cosmul(positive=['woman', 'king'], negative=['man'])
    [(u'ark', 1.0164012908935547), ... ]
    

    This is a pretty weird answer.

    Of course, if you were using the Wikipedia-trained model described in Rehurek's tutorial and in Mikolov et al. (2013), you would instead get this as the completion of the analogy:

    [('queen', 0.50882536), ...]
    

    You can also use the model to find the outlier in a group of words. For this, we find the centroid of the word vectors and choose the word whose vector is farthest from the centroid as the odd one out:

    >>> mr.doesnt_match("breakfast cereal dinner lunch".split())
    'cereal'
    

    Models can be significantly improved by finding bigram words ("New York", "Press Secretary"):

    from gensim.models import Phrases
    # detect bigrams, then train word2vec on the transformed sentences
    bigram_transformer = Phrases(movie_reviews.sents())
    mr_bigrams = Word2Vec(bigram_transformer[movie_reviews.sents()])
    
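
    To see what the transformer does, you can apply it to a single tokenized sentence. A hedged sketch: whether a given pair actually gets merged depends on how often it co-occurs in the movie review data and on the Phrases threshold, so this particular example may or may not come back with joined tokens:

    # pairs that pass the Phrases threshold come back joined with "_"
    print(bigram_transformer["the press secretary flew to new york".split()])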

    Because a word vector model implicitly encodes co-occurrence probabilities, it can also be used to score the likelihood of a sentence:

    mr.score(["Colorless green ideas sleep furiously".split()])
    

    In other words, the model goes through each word and computes the probability of the other words occurring as neighbors, returning an overall likelihood score.
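
    Two caveats, hedged: score() expects a list of tokenized sentences and returns one log-likelihood per sentence, and gensim only implements it for models trained with hierarchical softmax, so depending on your gensim version's defaults you may need to retrain with hs=1 and negative=0. A sketch:

    # retrain with hierarchical softmax in case your defaults differ
    mr_hs = Word2Vec(movie_reviews.sents(), hs=1, negative=0)
    sentences = ["the film was great".split(),
                 "colorless green ideas sleep furiously".split()]
    # one log-likelihood score per input sentence
    print(mr_hs.score(sentences))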

    Finally, you can just look at a vector, which by default has 100 dimensions. It's an array of 100 floating-point numbers.

    >>> mvec =  mr['man']
    >>> len(mvec)
    100
    >>> mvec
    array([-0.10986061, -0.12899914, -0.01356757,  0.09450436, -0.04457339,
            0.1201788 ,  0.07375111,  0.00555919, -0.27961457, -0.04920399,
            0.21768376, -0.15391812, -0.07826918,  0.2606242 , -0.12305038,
           -0.0137245 ,  0.02650702, -0.01748919, -0.12054206, -0.13024689,
           -0.06372885, -0.23327361,  0.33404183,  0.22624712, -0.22911069,
           -0.12921527,  0.28556904, -0.23052499, -0.19462241,  0.26367468,
            0.25053203,  0.03706881,  0.12325867, -0.33901975, -0.02694192,
           -0.05358029, -0.00767274, -0.13719793,  0.00530363, -0.20839927,
           -0.03976608,  0.0351226 ,  0.18464074, -0.24369034,  0.15803961,
           -0.0514226 , -0.13602231,  0.25484481, -0.08208569,  0.06340741,
            0.21154985,  0.09053291,  0.13411717, -0.24650463,  0.2090898 ,
           -0.14951023, -0.02048201, -0.22660583,  0.04167137,  0.06884803,
            0.31761509, -0.1049979 ,  0.11771846,  0.11075541, -0.05071831,
            0.21371891,  0.12598746, -0.2079615 , -0.13616957, -0.01921517,
           -0.16636346,  0.1169065 ,  0.23653744,  0.31624255,  0.11505274,
           -0.09718218, -0.06874531,  0.10780501, -0.01663529, -0.10346226,
           -0.30455628,  0.00246542, -0.15952916, -0.01670113, -0.08883165,
            0.13546473, -0.39362314,  0.27298909, -0.08167259, -0.1424706 ,
            0.12223504,  0.18078147,  0.08870253,  0.15700033,  0.17984635,
            0.13593708, -0.43276551,  0.03234629, -0.16896026, -0.12703048], dtype=float32)
    

    The assignment

    1. Do the assignment here, which provides a link to download a full Wikipedia-trained set of word vectors. We'll refer to this as the wikipedia model.
    2. Find the nearest neighbors of the word computer in the Movie Review, Brown, Penn Treebank, and Wikipedia models. Show your code. Show the neighbors you found and their scores.
    3. Now that you know the 5 nearest neighbors of computer in the Brown model, would you expect the similarity score of program and computer to be greater or less than 0.75? Why? Verify. Show your code.
    4. Find the 10 nearest neighbors of great in the Movie Review model. Do the same with a negative sentiment word. Try some other techniques for collecting positive and negative evaluative words, such as changing the value of the "topn" parameter illustrated above and trying other seed words. Compare these results with the results using the Wikipedia model. Which dataset works best for collecting evaluative words? Speculate as to why.
    5. Complete the following analogy: great:excellent::good:??. Which model does this best?
    6. Find the outlier in the following set of words: great excellent terrible outstanding. Which model works best, or are they all equally good? Speculate as to why.
    7. Experiment with scoring sentences as done here

        mr.score(["Colorless green ideas sleep furiously".split()])
        

      to answer the following question: Does the probability of the sentence grow larger or smaller as the sentence grows longer? Either way, there is a kind of length bias showing up. Suggest a fairer way to judge likelihood of a sentence. Use it to decide which of the following sentences is more likely:

      1. My laptop goes to sleep easily.
      2. My child goes to sleep easily.

    8. Train a phrasal model. This will discover "words with spaces" or "phrasal words" or "collocations" like new york and compress them into one word units (new_york).

        from gensim.models import Phrases
        # learn phrase (bigram) detections from the movie review sentences
        phraser = Phrases(movie_reviews.sents())
        # transform the sentences, then train word2vec on the result
        mrs_phrased = phraser[movie_reviews.sents()]
        mr_p = Word2Vec(mrs_phrased)
        

      Compare the top 25 most similar words to great in this new model with the top 25 most similar words to great in the mr model. Report on any effects that including phrases has had. Are all the results good?