テキストから辞書にある文字列をすべて取り出す簡単なプログラム (Pure Perl) (たつをの ChangeLog)

テキストから辞書にある文字列をすべて取り出す簡単なプログラム (Pure Perl)

2014-05-09-1 [Programming][NLP]

機械学習のデータ加工用のスクリプト。辞書によりテキストから素性を抽出する簡単なプログラムです。先日ちょっと需要がありまして、Pure Perl で動くものを作りました。

あらかじめ用意された辞書（文字列とIDのペア）があり、テキスト中にその辞書の文字列が存在したら対応するIDを出力する、というタスクです。

辞書マッチは最長一致のみではなく、一部重なりや内包含むすべて。最長一致のみならば辞書エントリを長い順にならべてORでつなげた正規表現を作れば一発なのですが、すべて欲しいというわけで（疑似）TRIEを採用。ハッシュで TRIE という懐かしい実装です（初期のJUMAN風）。

■辞書の例 (fepp-dic.txt)。フォーマットは「^エントリ(タブ)ID$」。IDは1始まり。

abc	1
bc	2
c	3
xxx	4
あい	5
あいう	6
いう	7
いうえ	8
うか	9

■テキストの例 (fepp-test.txt)。フォーマットは「^ラベル(タブ)テキスト$」。

1	abcdbdb
0	bcxxx
1	あいうえお
0	そういうかな

■実行例：

% ./fepp.pl fepp-dic.txt < fepp-test.txt 
1:1 2:1 3:1
2:1 3:1 4:1
5:1 6:1 7:1 8:1
7:1 9:1

% ./fepp.pl -a fepp-dic.txt < fepp-test.txt
1 1:1 2:1 3:1
0 2:1 3:1 4:1
1 5:1 6:1 7:1 8:1
0 7:1 9:1

% ./fepp.pl -a -d 3 fepp-dic.txt < fepp-test.txt
[DIC] len=1  a:0, b:0, c:3, x:0, あ:0, い:0, う:0
[DIC] len=2  ab:0, bc:2, xx:0, あい:5, いう:7, うか:9
[DIC] len=3  abc:1, xxx:4, あいう:6, いうえ:8
[INPUT] 1	abcdbdb
[MATCH] abc (1)
[MATCH] bc (2)
[MATCH] c (3)
1 1:1 2:1 3:1
[INPUT] 0	bcxxx
[MATCH] bc (2)
[MATCH] c (3)
[MATCH] xxx (4)
0 2:1 3:1 4:1
[INPUT] 1	あいうえお
[MATCH] あい (5)
[MATCH] あいう (6)
[MATCH] いう (7)
[MATCH] いうえ (8)
1 5:1 6:1 7:1 8:1
[INPUT] 0	そういうかな
[MATCH] いう (7)
[MATCH] うか (9)
0 7:1 9:1

■コード (fepp.pl)：

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use open ':utf8';
binmode STDOUT, ":utf8";
binmode STDIN, ":utf8";

use Getopt::Long;
my $answer_mode = 0; # input with answer?
my $debug = 0;
GetOptions (
    "answer" => \$answer_mode,
    'debug=s' => \$debug,
    );

my $dat_fn = shift;

my @ndic;
open(my $fh, "<:utf8", $dat_fn) or die;
while (<$fh>) {
    chomp;
    next if not /^(.+?)\t(.+)$/;
    my $id = $2;
    my @c = split(//, $1);
    $ndic[@c]{$1} = $id;
    for (my $i = 0; $i < $#c; $i++) {
        $ndic[$i+1]{join("",@c[0..$i])} ||= 0;
    }
}
close($fh);

if ($debug =~ /(1|3)/) {
    for (my $i = 1; $i < @ndic; $i++) {
        print "[DIC] len=$i  ".join(", ", map {"$_:$ndic[$i]{$_}"} sort keys %{$ndic[$i]})."\n";
    }
}

while (<>) {
    print "[INPUT] $_" if $debug =~ /(2|3)/;
    chomp;
    my $ans = ($_ =~ s/^((.+?)\t)//) ? $2 : ""  if $answer_mode;

    my @c = split(//, $_);
    my %m;
    for (my $i = 0; $i < @c; $i++) {
        for (my $j = $i; $j < @c; $j++) {
            my $s = join("",@c[$i..$j]);
            last if not defined $ndic[$j-$i+1]{$s};
            next if $ndic[$j-$i+1]{$s} == 0;
            print "[MATCH] $s ($ndic[$j-$i+1]{$s})\n" if $debug =~ /(2|3)/;
            $m{$ndic[$j-$i+1]{$s}}++;
        }
    }

    print "$ans " if $answer_mode;
    print join(" ", map {"$_:$m{$_}"} sort {$a <=> $b} keys %m)."\n";
    # print join(" ", map {"$_:1"} sort {$a <=> $b} keys %m)."\n";
}

この記事に言及しているこのブログ内の記事