A Beginner's Word2vec Tutorial on a Large Text Corpus

When trained on a large text corpus, word2vec automatically captures relationships and similarities between words. For example, if you ask it for the capital of Germany, it will answer Berlin. Word2vec generates a vector for each word, and the cosine of the angle between two vectors shows how close the two words are.
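To make the cosine measure concrete, here is a minimal Python sketch. The three-dimensional vectors are invented for illustration; real word2vec vectors have hundreds of dimensions, and the function name is my own choice rather than part of any word2vec tool.

```python
import math

def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)

# Toy vectors, made up for this example only.
berlin = [0.9, 0.1, 0.3]
germany = [0.8, 0.2, 0.4]
banana = [-0.2, 0.9, -0.5]

print(cosine_similarity(berlin, germany))  # close to 1: similar direction
print(cosine_similarity(berlin, banana))   # negative: pointing apart
```

A value near 1 means the two vectors point in almost the same direction, which word2vec treats as "these words occur in similar contexts".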

I will use the Hillary Clinton emails corpus from Kaggle as input, and we will observe some interesting results. I will use the native C implementation of word2vec for speed. First we clean the raw data and remove some common English stop words.

The Hillary Clinton data comes as a SQLite database. This command dumps the RawText field of the Emails table into a file.

sqlite3 -header -csv database.sqlite "select RawText from Emails;" > rawtext.csv
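If you would rather do the export from a script, the same query can be run with Python's built-in sqlite3 module. This is only a sketch: the function name dump_raw_text is hypothetical, and the demo uses a tiny in-memory table with made-up rows instead of the real database.sqlite. Writing the text directly also avoids the CSV header and quoting that the command-line export adds.

```python
import io
import sqlite3

def dump_raw_text(conn, out):
    """Write every non-empty RawText value to `out`, one email after another.

    Returns the number of emails written.
    """
    written = 0
    for (raw_text,) in conn.execute("SELECT RawText FROM Emails"):
        if raw_text:  # skip NULL or empty bodies
            out.write(raw_text + "\n")
            written += 1
    return written

# Demo on an in-memory database; the rows are invented for illustration.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute("CREATE TABLE Emails (RawText TEXT)")
demo_conn.executemany(
    "INSERT INTO Emails (RawText) VALUES (?)",
    [("First email body",), (None,), ("Second email body",)],
)
demo_out = io.StringIO()
print(dump_raw_text(demo_conn, demo_out), "emails written")
```

For the real data you would open the file with sqlite3.connect("database.sqlite") and pass an open text file as `out`.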

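We clean up the text and keep only letters, digits and a few punctuation characters, using tr (which is very fast). Every other character is replaced by a space. The hyphen goes at the end of the set so that tr treats it as a literal character rather than as a range.

tr -c "A-Za-z0-9@._ \n-" " " < rawtext.csv > rawfiltered.txt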

Here we translate all uppercase letters to lowercase.

tr '[:upper:]' '[:lower:]' < rawfiltered.txt > data.txt
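The filtering and lowercasing steps above can also be sketched in Python, which is handy if you want to test the rules on a small string first. The function name clean_text is my own; the character set mirrors the one passed to tr (letters, digits, @, hyphen, dot, underscore, space and newline).

```python
import re

# Anything outside this set is replaced by a single space, as `tr -c` does.
# The hyphen is last in the class so it is taken literally.
_DISALLOWED = re.compile(r"[^A-Za-z0-9@._ \n-]")

def clean_text(raw):
    """Replace disallowed characters with spaces, then lowercase the result."""
    filtered = _DISALLOWED.sub(" ", raw)
    return filtered.lower()

print(clean_text("Hello, World!\nFoo_Bar@x.com"))
```

Note that each removed character becomes one space, so runs of punctuation leave runs of spaces; word2vec splits on whitespace, so this does no harm.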

This Perl script removes common English stop words from the text corpus.

./ stopwords.txt sample.txt
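The Perl script itself is not reproduced here, so the following is only a Python sketch of the same idea, not the author's script. It assumes the text is already lowercased and that stop words are matched as whole whitespace-separated tokens; the function name and the tiny stop word list are illustrative.

```python
def remove_stopwords(text, stopwords):
    """Drop every whitespace-separated token that appears in `stopwords`.

    Line breaks are preserved so that each email line stays on its own line.
    """
    kept_lines = []
    for line in text.split("\n"):
        kept = [token for token in line.split() if token not in stopwords]
        kept_lines.append(" ".join(kept))
    return "\n".join(kept_lines)

# A tiny illustrative list; a real stopwords.txt would hold many more entries.
stopwords = {"the", "a", "of", "and", "to"}
sample = "the secretary of state traveled to berlin\nthanks and regards"
print(remove_stopwords(sample, stopwords))
```

Using a set for the stop words keeps each lookup constant-time, which matters on a corpus of this size.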

Once everything is ready, we run word2vec from the command line to generate word vectors.

./word2vec -train data.cleaned -output hillary-vectors.bin -cbow 1 -size 300 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15

This generates vectors for single words only, so it does not handle phrases well. To generate vectors for phrases, first run word2phrase twice, feeding the output of the first pass into the second so that longer phrases can form, and then train word2vec on the result.

./word2phrase -train data.cleaned -output data.phrase0 -threshold 200 -debug 2
./word2phrase -train data.phrase0 -output data.phrase1 -threshold 100 -debug 2
./word2vec -train data.phrase1 -output hillary-vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15

This will take a few minutes. After that, use the cosine distance tool to explore the vectors.

./distance hillary-vectors-phrase.bin

Let us look at the results.

Word: hillary
Position in vocabulary: 255
Word Cosine distance
rodham 0.785763
clinton 0.770176
mrs. 0.668949
mazing 0.652066
clinton. 0.651392
0.601229
-over 0.595913
11-day 0.590282
secretary 0.588066
eback. 0.584084
mrs 0.569783
week– 0.564768
old-fashioned 0.552296
grueling 0.548099
congratulations 0.543035
thanks-iand 0.540592
umit 0.539003
unfolds. 0.537835
madam 0.533925
professional. 0.515469
ecial 0.515023
verde. 0.514702
secretary- 0.506147
shining 0.504990
…keep 0.504837
gratitude 0.503390
ugural 0.502349
d… 0.497087
madame 0.496499
routed 0.493445
youknow 0.492397
seven-nation 0.491600
pplauded 0.490920
grateful. 0.488790
ffair 0.488063
cape 0.486159
labott 0.486055
siders. 0.477942
dear 0.477759
blend 0.475713

Observations: The top three results are striking, because the vectors capture these relationships automatically from the corpus. For the query hillary, the nearest words are rodham (her maiden name, which she uses as a middle name) and clinton (her last name). The word secretary also appears among the top results. Her BlackBerry email address is among the top results as well. Hillary must love her BlackBerry a lot :)
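The ranking that the distance tool prints can be reproduced in a few lines of Python. The sketch below assumes the binary layout written by the C tool with -binary 1: a text header with the vocabulary size and vector size, then for each word the word itself, a space, and the vector as 32-bit floats. It also assumes little-endian floats, which holds on typical x86 machines. The function names and the toy vectors are my own; the demo builds a three-word file in memory rather than reading hillary-vectors-phrase.bin.

```python
import io
import math
import struct

def load_word2vec_binary(stream):
    """Read word vectors from a binary stream and normalise them to unit length."""
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for _ in range(vocab_size):
        chars = bytearray()
        while True:
            ch = stream.read(1)
            if ch == b"" or ch == b" ":
                break
            if ch != b"\n":  # skip the newline that ends the previous entry
                chars.extend(ch)
        # Assumption: little-endian 32-bit floats.
        values = struct.unpack("<%df" % dim, stream.read(4 * dim))
        norm = math.sqrt(sum(x * x for x in values)) or 1.0
        vectors[chars.decode("utf-8", errors="replace")] = [x / norm for x in values]
    return vectors

def nearest(vectors, query, top_n=5):
    """Return the top_n (word, cosine) pairs closest to `query`."""
    q = vectors[query]
    scored = [
        (sum(a * b for a, b in zip(q, v)), word)
        for word, v in vectors.items()
        if word != query
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:top_n]]

def build_demo_file():
    """Build a tiny in-memory vector file; the numbers are invented."""
    toy = {"hillary": [1.0, 0.0], "clinton": [0.9, 0.1], "blend": [0.0, 1.0]}
    data = b"3 2\n"
    for word, vec in toy.items():
        data += word.encode("utf-8") + b" " + struct.pack("<2f", *vec) + b"\n"
    return io.BytesIO(data)

demo_vectors = load_word2vec_binary(build_demo_file())
for word, score in nearest(demo_vectors, "hillary"):
    print(word, round(score, 6))
```

Because every vector is normalised to unit length when it is loaded, the dot product in nearest is exactly the cosine. The column the tool labels "Cosine distance" is this similarity value, so higher means closer.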


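Let us see how clustering works on the vectors. The -classes option makes word2vec run k-means on the word vectors and write a cluster id for each word instead of the vector. Here we group the vocabulary into 20 clusters.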

./word2vec -train data.cleaned -output hillary-classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 20
sort -k 2 -n hillary-classes.txt > hillary.sorted.txt

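Each line of the sorted file holds a word followed by its cluster id, so words that belong to the same cluster now appear next to each other.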
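To read the clusters more comfortably, the "word cluster-id" lines can be grouped with a short Python sketch. The function name is my own, and the sample lines and cluster ids below are invented for illustration; they are not taken from the real output.

```python
from collections import defaultdict

def group_by_cluster(lines):
    """Map each cluster id to the list of words assigned to it."""
    clusters = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, cluster_id = parts
        clusters[int(cluster_id)].append(word)
    return dict(clusters)

# Invented sample in the "word cluster-id" format.
sample_lines = ["hillary 3", "clinton 3", "berlin 7", "germany 7", "thanks 1"]
for cluster_id, words in sorted(group_by_cluster(sample_lines).items()):
    print(cluster_id, " ".join(words))
```

On the real file you would pass an open handle to hillary-classes.txt (or the sorted copy) instead of sample_lines, since iterating over a text file yields its lines.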