svn checkout http://word2vec.googlecode.com/svn/trunk/
Replace
#include <malloc.h>
with
#include <stdlib.h>
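All that really matters is ending up with a file named text8. The download step as written is:
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
Using a gzip obtained from elsewhere instead of the default Apple gzip would probably make that work, but that is a hassle, so I just run the following from the command line to get text8:
curl -O http://mattmahoney.net/dc/text8.zip
unzip text8.zip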
% cut -c1-300 text8 | fold -w 60
anarchism originated as a term of abuse first used against
early working class radicals including the diggers of the en
glish revolution and the sans culottes of the french revolut
ion whilst the term is still used in a pejorative way to des
cribe any act that used violent means to destroy the organiz
% ./demo-word.sh
make: Nothing to be done for `all'.
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121 Progress: 99.58% Words/thread/sec: 48.64k
real 1m46.381s
user 5m57.164s
sys 0m1.418s
Enter word or sentence (EXIT to break):
% ./distance vectors.bin
Enter word or sentence (EXIT to break): tokyo
Word: tokyo Position in vocabulary: 4909
Word Cosine distance
-----------------------------------
narita 0.662572
osaka 0.653032
incheon 0.607367
fukuoka 0.595367
beijing 0.571802
kansai 0.567351
seoul 0.564947
jiaotong 0.558960
sheremetyevo 0.558197
niigata 0.556976
Enter word or sentence (EXIT to break): sea of japan
Word: sea Position in vocabulary: 356
Word: of Position in vocabulary: 2
Word: japan Position in vocabulary: 582
Word Cosine distance
--------------------------------
senkaku 0.570105
vardar 0.568383
seto 0.560345
dangrek 0.542683
endorheic 0.525477
caspian 0.520474
shikoku 0.518149
blantyre 0.516904
westeros 0.505320
honshu 0.504499
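For reference, here is a minimal Python sketch of what a query like the ones above computes, assuming the tool ranks words by cosine similarity to the normalized sum of the query word vectors (which is how a multi-word query such as "sea of japan" would be handled). The `toy_vectors` values and the `normalize` and `nearest` helpers are made up for illustration; nothing here is read from vectors.bin, and this is not the word2vec code itself.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def nearest(vectors, query_words, topn=10):
    """Rank vocabulary words by cosine similarity to the summed query vector."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in query_words:
        for i, x in enumerate(normalize(vectors[word])):
            query[i] += x
    query = normalize(query)
    scores = []
    for word, vec in vectors.items():
        if word in query_words:
            continue  # never report the query words themselves
        similarity = sum(q * u for q, u in zip(query, normalize(vec)))
        scores.append((similarity, word))
    scores.sort(reverse=True)
    return [(word, score) for score, word in scores[:topn]]

# Hypothetical 3-dimensional toy vectors, not values from vectors.bin.
toy_vectors = {
    "tokyo": [0.9, 0.1, 0.0],
    "osaka": [0.8, 0.2, 0.1],
    "sea": [0.0, 0.9, 0.3],
    "lake": [0.1, 0.8, 0.4],
}

for word, score in nearest(toy_vectors, ["tokyo"], topn=2):
    print(f"{word}\t{score:.6f}")
```

With real vectors the dictionary would hold one entry per vocabulary word, so the loop over `vectors.items()` is the expensive part of each query.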
% ./demo-phrases.sh
make: Nothing to be done for `all'.
Starting training using file text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Words written: 17000K
real 0m56.124s
user 0m52.278s
sys 0m2.000s
Starting training using file text8-phrase
Vocab size: 84069
Words in train file: 16307293
Alpha: 0.000117 Progress: 99.60% Words/thread/sec: 21.08k
real 3m43.700s
user 13m6.280s
sys 0m2.559s
Enter word or sentence (EXIT to break):
% cut -c1-300 text8-phrase | fold -w 60
anarchism originated as a term of abuse first used against
early working class radicals including the diggers of the en
glish revolution and the sans_culottes of the french revolut
ion whilst the term is still used in a pejorative way to des
cribe any act that used violent means to destroy the organiz
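In text8-phrase, "sans culottes" has become the single token sans_culottes. A rough Python sketch of how such joining can work is below. It assumes a bigram score of (count(ab) - min_count) / (count(a) * count(b)) * total_words, with a pair joined when the score exceeds a threshold; that is my understanding of the approach, not a copy of the tool. The `join_phrases` helper, the sample sentence, and the `min_count` and `threshold` values are all hypothetical.

```python
from collections import Counter

def join_phrases(tokens, min_count=1, threshold=1.0):
    """Join adjacent tokens with '_' when their bigram score exceeds threshold."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    out = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Frequent pairs of otherwise rare words score highest.
            score = (bigrams[(a, b)] - min_count) / (unigrams[a] * unigrams[b]) * total
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # both tokens are consumed by the phrase
                continue
        out.append(tokens[i])
        i += 1
    return out

# Hypothetical sample text, chosen so that one pair repeats.
tokens = "the sans culottes met the other sans culottes near the river".split()
print(" ".join(join_phrases(tokens)))
# prints: the sans_culottes met the other sans_culottes near the river
```

On a corpus the size of text8 the counts are far larger, so the `min_count` and `threshold` values used here would be much too permissive.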
% ./distance vectors-phrase.bin
Enter word or sentence (EXIT to break): los_angeles
Word: los_angeles Position in vocabulary: 1680
Word Cosine distance
------------------------------------
california 0.625356
san_francisco 0.617207
san_diego 0.594875
taiko 0.555129
lakers 0.551306
san_jose 0.550566
oakland 0.546525
santa_monica 0.535825
beverly_hills 0.535591
% ./word-analogy vectors-phrase.bin
Enter three words (EXIT to break): japan tokyo korea
Word: japan Position in vocabulary: 547
Word: tokyo Position in vocabulary: 4715
Word: korea Position in vocabulary: 2559
Word Distance
----------------------------------------
seoul 0.485544
south_korea 0.438259
paekche 0.427070
chungcheong_south 0.418033
osaka 0.400935
Enter three words (EXIT to break): new_york los_angeles tokyo
...
Word Distance
--------------------------------
osaka 0.483591
taipei 0.479719
kaohsiung 0.446331
seoul 0.433824
akihabara 0.428942
Enter three words (EXIT to break): bread eat beer
...
Word Distance
-------------------------------------
drink 0.416754
custard 0.400807
keg 0.396400
eating 0.385646
milk_chocolate 0.378704
Enter three words (EXIT to break): cat lion dog
...
Word Distance
---------------------------------------
wolf 0.375175
belgian_shepherd 0.370185
keeshond 0.367813
hound 0.367517
bear 0.360686
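The analogy queries above ("japan tokyo korea" giving seoul) can be pictured as vector arithmetic. The Python sketch below assumes that for the input "a b c" the target vector is b - a + c on unit-length vectors, and that the three input words are excluded from the results. The `toy` vectors and the `unit` and `analogy` helpers are invented for illustration and have nothing to do with the values in vectors-phrase.bin.

```python
import math

def unit(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def analogy(vectors, a, b, c, topn=5):
    """Answer 'a is to b as c is to ?' by ranking words near b - a + c."""
    va, vb, vc = (unit(vectors[w]) for w in (a, b, c))
    target = unit([y - x + z for x, y, z in zip(va, vb, vc)])
    scored = [
        (sum(t * u for t, u in zip(target, unit(vec))), word)
        for word, vec in vectors.items()
        if word not in (a, b, c)  # skip the three input words
    ]
    scored.sort(reverse=True)
    return [(word, score) for score, word in scored[:topn]]

# Hypothetical toy vectors: axis 0 ~ "Japan", axis 1 ~ "capital", axis 2 ~ "Korea".
toy = {
    "japan": [1.0, 0.0, 0.0],
    "tokyo": [1.0, 1.0, 0.0],
    "korea": [0.0, 0.0, 1.0],
    "seoul": [0.0, 1.0, 1.0],
    "bread": [0.5, -1.0, 0.2],
}

for word, score in analogy(toy, "japan", "tokyo", "korea", topn=2):
    print(f"{word}\t{score:.6f}")
```

The scores in the real output are fairly low (around 0.4 to 0.5), which fits the picture: the target vector rarely lands exactly on a word, so the answer is only the closest candidate.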
% ./demo-classes.sh
make: Nothing to be done for `all'.
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121 Progress: 99.58% Words/thread/sec: 48.55k
real 2m32.658s
user 6m42.357s
sys 0m1.459s
The word classes were saved to file classes.sorted.txt
formerly 493
fort 493
founded 493
fountain 493
francisco 493
frankfurt 493
fredericton 493
freeway 493
frontage 493
galleries 493
gallery 493
salt_lake 478
sam_houston 478
samora 478
san_antonio 478
san_bernardino 478
san_diego 478
san_fernando 478
san_francisco 478
san_jacinto 478
san_joaquin 478
san_jose 478
san_juan 478
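To browse the classes more comfortably, the word/class pairs can be grouped by class number. This is a small Python sketch assuming each line has the form "word class_id" as in the excerpts above; the `group_by_class` helper is hypothetical, and the sample lines are copied from those excerpts rather than read from the file.

```python
from collections import defaultdict

def group_by_class(lines):
    """Map each class id to the list of words assigned to it."""
    classes = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, class_id = parts
        classes[int(class_id)].append(word)
    return dict(classes)

# A few lines taken from the excerpts above.
sample = [
    "formerly 493",
    "fort 493",
    "founded 493",
    "san_diego 478",
    "san_francisco 478",
    "san_jose 478",
]

for class_id, words in sorted(group_by_class(sample).items()):
    print(class_id, " ".join(words))
```

For the real file, passing an open file object (for example `open("classes.sorted.txt")`) in place of `sample` should work the same way, since the function only iterates over lines.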

