svn checkout http://word2vec.googlecode.com/svn/trunk/
Replace #include <malloc.h> with #include <stdlib.h>.
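Editing every source file by hand gets tedious, so here is a minimal sketch of how the substitution could be scripted. It assumes the sources are the .c files in a single directory (the checked-out trunk); the function name fix_malloc_include is made up for illustration. It writes through a temporary file instead of using sed -i, because the in-place flag takes different arguments in BSD (macOS) sed and GNU sed.

```shell
# Hypothetical helper: swap <malloc.h> for <stdlib.h> in every .c file
# of the given directory (defaults to the current directory).
fix_malloc_include() {
  dir="${1:-.}"
  for f in "$dir"/*.c; do
    # If there are no .c files the glob stays literal, so skip it.
    [ -e "$f" ] || continue
    # Rewrite via a temp file to avoid the BSD/GNU difference in `sed -i`.
    sed 's|#include <malloc.h>|#include <stdlib.h>|' "$f" > "$f.tmp" &&
      mv "$f.tmp" "$f"
  done
}
```

Something like fix_malloc_include . followed by make should then be enough, though I would look at the diff before building.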
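The demo scripts fetch the corpus like this:
wget http://mattmahoney.net/dc/text8.zip -O text8.gz
gzip -d text8.gz -f
In short, all that matters is ending up with a file named text8. Using a gzip obtained from elsewhere instead of the default Apple gzip would probably do the trick, but that is a hassle, so I just ran the following on the command line to get text8.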
curl -o text8.zip http://mattmahoney.net/dc/text8.zip
unzip text8.zip
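Since the corpus only needs to be fetched once, the two commands can be wrapped in a guard so that re-running them is harmless. This is only a sketch: fetch_text8 is a hypothetical name, and it assumes curl and unzip are on the PATH and that the archive unpacks to a file named text8 in the current directory.

```shell
# Hypothetical helper: download and unpack the corpus only when
# a file named text8 is not already present in the current directory.
fetch_text8() {
  if [ -e text8 ]; then
    echo "text8 already exists, skipping download"
    return 0
  fi
  # -o names the local output file; unzip then extracts text8 from it.
  curl -o text8.zip http://mattmahoney.net/dc/text8.zip &&
    unzip text8.zip
}
```

Calling fetch_text8 before any of the demo scripts should leave text8 in place for them to pick up.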
% cut -c1-300 text8 | fold -w 60
anarchism originated as a term of abuse first used against
early working class radicals including the diggers of the en
glish revolution and the sans culottes of the french revolut
ion whilst the term is still used in a pejorative way to des
cribe any act that used violent means to destroy the organiz
% ./demo-word.sh
make: Nothing to be done for `all'.
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121 Progress: 99.58% Words/thread/sec: 48.64k
real 1m46.381s
user 5m57.164s
sys 0m1.418s
Enter word or sentence (EXIT to break):
% ./distance vectors.bin
Enter word or sentence (EXIT to break):
% ./distance vectors.bin
Enter word or sentence (EXIT to break): tokyo
Word: tokyo  Position in vocabulary: 4909
Word            Cosine distance
-----------------------------------
narita          0.662572
osaka           0.653032
incheon         0.607367
fukuoka         0.595367
beijing         0.571802
kansai          0.567351
seoul           0.564947
jiaotong        0.558960
sheremetyevo    0.558197
niigata         0.556976
Enter word or sentence (EXIT to break): sea of japan
Word: sea  Position in vocabulary: 356
Word: of  Position in vocabulary: 2
Word: japan  Position in vocabulary: 582
Word         Cosine distance
--------------------------------
senkaku      0.570105
vardar       0.568383
seto         0.560345
dangrek      0.542683
endorheic    0.525477
caspian      0.520474
shikoku      0.518149
blantyre     0.516904
westeros     0.505320
honshu       0.504499
% ./demo-phrases.sh
make: Nothing to be done for `all'.
Starting training using file text8
Words processed: 17000K Vocab size: 4399K
Vocab size (unigrams + bigrams): 2419827
Words in train file: 17005206
Words written: 17000K
real 0m56.124s
user 0m52.278s
sys 0m2.000s
Starting training using file text8-phrase
Vocab size: 84069
Words in train file: 16307293
Alpha: 0.000117 Progress: 99.60% Words/thread/sec: 21.08k
real 3m43.700s
user 13m6.280s
sys 0m2.559s
Enter word or sentence (EXIT to break):
% cut -c1-300 text8-phrase | fold -w 60
anarchism originated as a term of abuse first used against
early working class radicals including the diggers of the en
glish revolution and the sans_culottes of the french revolut
ion whilst the term is still used in a pejorative way to des
cribe any act that used violent means to destroy the organiz
% ./distance vectors-phrase.bin
Enter word or sentence (EXIT to break):
Enter word or sentence (EXIT to break): los_angeles
Word: los_angeles  Position in vocabulary: 1680
Word             Cosine distance
------------------------------------
california       0.625356
san_francisco    0.617207
san_diego        0.594875
taiko            0.555129
lakers           0.551306
san_jose         0.550566
oakland          0.546525
santa_monica     0.535825
beverly_hills    0.535591
% ./word-analogy vectors-phrase.bin
Enter three words (EXIT to break):
Enter three words (EXIT to break): japan tokyo korea
Word: japan  Position in vocabulary: 547
Word: tokyo  Position in vocabulary: 4715
Word: korea  Position in vocabulary: 2559
Word                 Distance
----------------------------------------
seoul                0.485544
south_korea          0.438259
paekche              0.427070
chungcheong_south    0.418033
osaka                0.400935
Enter three words (EXIT to break): new_york los_angeles tokyo
...
Word         Distance
--------------------------------
osaka        0.483591
taipei       0.479719
kaohsiung    0.446331
seoul        0.433824
akihabara    0.428942
Enter three words (EXIT to break): bread eat beer
...
Word              Distance
-------------------------------------
drink             0.416754
custard           0.400807
keg               0.396400
eating            0.385646
milk_chocolate    0.378704
Enter three words (EXIT to break): cat lion dog
...
Word                Distance
---------------------------------------
wolf                0.375175
belgian_shepherd    0.370185
keeshond            0.367813
hound               0.367517
bear                0.360686
% ./demo-classes.sh
make: Nothing to be done for `all'.
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
Alpha: 0.000121 Progress: 99.58% Words/thread/sec: 48.55k
real 2m32.658s
user 6m42.357s
sys 0m1.459s
The word classes were saved to file classes.sorted.txt
formerly 493
fort 493
founded 493
fountain 493
francisco 493
frankfurt 493
fredericton 493
freeway 493
frontage 493
galleries 493
gallery 493
salt_lake 478
sam_houston 478
samora 478
san_antonio 478
san_bernardino 478
san_diego 478
san_fernando 478
san_francisco 478
san_jacinto 478
san_joaquin 478
san_jose 478
san_juan 478
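To pull out every word that landed in a given class, rather than paging through the whole file, something like the following should do. It is a sketch that assumes each line of classes.sorted.txt is a word followed by its class id, separated by whitespace, as in the excerpts above; words_in_class is a hypothetical helper name.

```shell
# Hypothetical helper: print the words assigned to one class id.
# Usage: words_in_class <class id> [file]   (file defaults to classes.sorted.txt)
words_in_class() {
  # Field 1 is the word, field 2 is the class id.
  awk -v id="$1" '$2 == id { print $1 }' "${2:-classes.sorted.txt}"
}
```

For example, words_in_class 478 | head would show the first few members of class 478.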
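"The biggest problem with 'poem-ification' is that the aims are abstract. For example, phrases such as 'standing close to the disaster victims' and 'wanting to convey our feelings', which politicians and the media all rushed to use after the earthquake, sound pleasant but are extremely abstract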