When run on a large text corpus, word2vec automatically captures relationships and similarities in the text. For example, if you ask for the capital of Germany, it will say Berlin. word2vec generates a vector for each word, and a simple cosine between two vectors shows how close the corresponding words are.
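As a toy illustration of that last point, the cosine similarity between two vectors can be computed like this (the vectors below are made up just for the example; real word2vec vectors have hundreds of dimensions):

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors; closer to 1.0 means more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# toy 3-dimensional vectors purely for illustration
berlin  = np.array([0.8, 0.1, 0.3])
germany = np.array([0.7, 0.2, 0.4])
print(cosine_similarity(berlin, germany))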
I will be using the Hillary Clinton email corpus from Kaggle as input, and we will observe some interesting results. I will use the native C implementation of word2vec for speed. First we clean the raw data and remove some common English stop words.
The Hillary data comes as an SQLite database; the following command dumps the RawText field from the database into a file.
sqlite3 -header -csv database.sqlite "select RawText from Emails;" > rawtext.csv
We clean up the text and keep only certain alphanumeric characters using tr (much faster than sed).
tr -c "A-Za-z0-9@._ \n-" " " < rawtext.csv > rawfiltered.txt
Here we translate all uppercase characters to lowercase.
tr '[:upper:]' '[:lower:]' < rawfiltered.txt > data.txt
This Perl script removes the common English stop words from the text corpus.
./remove.pl stopwords.txt data.txt
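The remove.pl script itself is not reproduced here; as a rough sketch of the idea (assuming stopwords.txt holds one stop word per line, and with a hypothetical script name), an equivalent filter in Python could look like this:

import sys

# usage: python remove_stopwords.py stopwords.txt data.txt > data.cleaned
stopwords = set(open(sys.argv[1]).read().split())
with open(sys.argv[2]) as infile:
    for line in infile:
        kept = [w for w in line.split() if w not in stopwords]
        print(" ".join(kept))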
Once everything is ready, we run word2vec on the command line to generate the word vectors.
./word2vec -train data.phrase -output hillary-vectors.bin -cbow 1 -size 300 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
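For reference, a roughly equivalent training run with the gensim library would be the following (a sketch only, assuming gensim 4.x; parameter names differ slightly from the C tool, and the file names simply mirror the command above):

from gensim.models import Word2Vec

# corpus_file expects one sentence per line, tokens separated by spaces
model = Word2Vec(corpus_file="data.phrase", vector_size=300, window=10,
                 negative=25, hs=0, sample=1e-5, sg=0,  # sg=0 selects CBOW
                 workers=20, epochs=15)
model.wv.save_word2vec_format("hillary-vectors.bin", binary=True)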
The command above generates vectors only for individual words, which does not work well for phrases. To generate vectors for phrases as well, run the word2phrase tool first.
./word2phrase -train data.cleaned -output data.phrase0 -threshold 200 -debug 2
./word2phrase -train data.phrase0 -output data.phrase1 -threshold 100 -debug 2
./word2vec -train data.phrase1 -output hillary-vectors-phrase.bin -cbow 1 -size 200 -window 10 -negative 25 -hs 0 -sample 1e-5 -threads 20 -binary 1 -iter 15
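The same two-pass phrase detection can also be sketched with gensim's Phrases class (an illustration only, not the author's exact pipeline; it assumes the cleaned corpus sits in data.cleaned as above):

from gensim.models.phrases import Phrases

# one tokenized sentence per line of the cleaned corpus
sentences = [line.split() for line in open("data.cleaned")]

# first pass joins frequent bigrams (e.g. new_york); a second pass over the
# bigrammed corpus can then form trigrams
bigrams = Phrases(sentences, threshold=200)
trigrams = Phrases(bigrams[sentences], threshold=100)
phrased_corpus = [trigrams[bigrams[s]] for s in sentences]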
The training will take a few minutes; after that, use the cosine distance tool to play with the vectors.
./distance hillary-vectors-phrase.bin
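The same nearest-neighbour lookup can also be done from Python with gensim's KeyedVectors (again just a sketch, assuming gensim is installed):

from gensim.models import KeyedVectors

# load the binary vectors written by the C tool above
vectors = KeyedVectors.load_word2vec_format("hillary-vectors-phrase.bin", binary=True)
for word, score in vectors.most_similar("hillary", topn=10):
    print(word, score)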
Let us see the results..
Word: hillary  Position in vocabulary: 255

Word                            Cosine distance
-----------------------------------------------
rodham                          0.785763
clinton                         0.770176
mrs.                            0.668949
mazing                          0.652066
clinton.                        0.651392
hr15@mycingular.blackberry.net  0.601229
-over                           0.595913
11-day                          0.590282
secretary                       0.588066
eback.                          0.584084
mrs                             0.569783
week–                           0.564768
old-fashioned                   0.552296
grueling                        0.548099
congratulations                 0.543035
thanks-iand                     0.540592
umit                            0.539003
unfolds.                        0.537835
madam                           0.533925
professional.                   0.515469
ecial                           0.515023
verde.                          0.514702
secretary-                      0.506147
shining                         0.504990
…keep                           0.504837
gratitude                       0.503390
ugural                          0.502349
d…                              0.497087
madame                          0.496499
routed                          0.493445
youknow                         0.492397
seven-nation                    0.491600
pplauded                        0.490920
grateful.                       0.488790
ffair                           0.488063
cape                            0.486159
labott                          0.486055
siders.                         0.477942
dear                            0.477759
blend                           0.475713
Comments
Observations: The top three results are truly spectacular, since the vectors automatically capture these relationships from the corpus. Rodham is, I suppose, the middle name and Clinton the last name that turn up when we ask about Hillary. Secretary is also captured in the top results, and her BlackBerry email address appears near the top as well. Hillary must love her BlackBerry a lot :)
K-means clustering
Let us see how clustering works on the vectors. We will cluster the word vectors into 20 classes using word2vec's built-in k-means option.
./word2vec -train data.cleaned -output hillary-classes.txt -cbow 1 -size 200 -window 8 -negative 25 -hs 0 -sample 1e-4 -threads 20 -iter 15 -classes 20
sort hillary-classes.txt -k 2 -n > hillary.sorted.txt
You will see words grouped into clusters.
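As a small sketch of how to inspect the result (assuming the layout the -classes option produces, one word and its class id per line), the sorted file can be grouped like this:

from collections import defaultdict

clusters = defaultdict(list)
with open("hillary.sorted.txt") as f:
    for line in f:
        parts = line.split()
        if len(parts) != 2:
            continue
        word, class_id = parts
        clusters[int(class_id)].append(word)

# print the first ten words of each cluster
for class_id, words in sorted(clusters.items()):
    print(class_id, " ".join(words[:10]))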