【Perl】ターミナル画面でテキスト検索結果を KWIC で表示する (たつをの ChangeLog)

【Perl】ターミナル画面でテキスト検索結果を KWIC で表示する

2014-06-24-2 [Programming]

先日書いた「文字列から半角N文字分取る方法」[2014-06-19-2]を使って、grep のようなテキスト走査による検索結果をターミナル画面に KWIC 形式で表示する Perl プログラムを書きました。

KWIC とは KeyWord In Context の略で、この場合は、中心に検索キー、左右にコンテキスト（前後の文字列）を配置するというものです。これらをターミナル画面にずれないように表示するためには、検索キーとコンテキスト文字列の正確な長さ（半角文字数）が必要になります。そのため前述の半角N文字分取る方法が必要になるのです。

何はともあれ実行例から。

■実行例：

ツイッターのログからキーワード「大喜び」で検索し、出てきた順に表示。

% ./kwicgrep.pl '大喜び' tweets.csv
_や器からタレや肉まで始終絶賛、[大喜び]でした。両親は国立新美術館をみ_
_がとうございました。とらちゃん[大喜び]でした。またご機嫌とらちゃんと_
_はサンタさんからのプレゼントに[大喜び]。「サンタさんもらった」「だい_
白優勝！ うちの白オタマトーンも[大喜び]（うそ）"______________________
_ーダーに登録してもらえると私が[大喜び]です。"________________________

右コンテキストでソート("-s r")。左でソートしたい場合は "-s l"、逆順ソートは "-r"。

% ./kwicgrep.pl -s l '大喜び' tweets.csv
_はサンタさんからのプレゼントに[大喜び]。「サンタさんもらった」「だい_
_がとうございました。とらちゃん[大喜び]でした。またご機嫌とらちゃんと_
_や器からタレや肉まで始終絶賛、[大喜び]でした。両親は国立新美術館をみ_
_ーダーに登録してもらえると私が[大喜び]です。"________________________
白優勝！ うちの白オタマトーンも[大喜び]（うそ）"______________________

大量の英文ファイルを集めてきて、このスクリプトで検索＆表示すれば、今は亡き「英語例文検索 EReK」のような英作文支援システムになります。

% ./kwicgrep.pl -s l 'enabled' /usr/share/doc/bash/bash.html | head
ll option in the list will be [enabled] before_______________________
__________________________are [enabled], non-zero otherwise.  When se
ble Completion</B> above) are [enabled]._____________________________
<B>Pathname Expansion</B> are [enabled]._____________________________
_____________<I>names</I> are [enabled].  For example, to use the____
ro if all <I>optnames</I> are [enabled]; non-zero____________________
___________________________If [enabled], history expansion will be pe
_____________shell option, if [enabled], causes the shell to attempt
_____________________Names of [enabled] shell builtins.______________
____A colon-separated list of [enabled] shell options.  Each word in_

■コード (kwicgrep.pl)：

#!/usr/bin/perl
use strict;
use warnings;
use utf8;
use Encode;
use open ':utf8';
binmode STDOUT, ":utf8";
binmode STDIN, ":utf8";

use Getopt::Long;
my $term_width = 70;
my $sort_lr_context = ""; # for sort
my $reverse_mode = 0; # for sort
GetOptions (
    'width=s' => \$term_width,
    "sort=s" => \$sort_lr_context,
    "reverse" => \$reverse_mode,
    );

my $key = shift @ARGV;
$key = Encode::decode_utf8($key) if not utf8::is_utf8($key);

my ($dummy, $klen) = get_first_n_hkchars($key, 100);
my $context_width = int(($term_width - ($klen + 2)) / 2);

my @matches;
while (<>) {
    chomp;
    next if not /^(.*?)($key)(.*)$/;
    my ($lc, $rc) = ($1, $3);
    my ($lstr, $llen) = get_last_n_hkchars($lc, $context_width);
    my ($rstr, $rlen) = get_first_n_hkchars($rc, $context_width);
    my $lfill = "_" x ($context_width - $llen);
    my $rfill = "_" x ($context_width - $rlen);
    my $ostr = "$lfill$lstr"."[".$key."]"."$rstr$rfill";
    if ($sort_lr_context) {
        if ($sort_lr_context =~ /^l/) {
            push @matches, [join("", reverse split(//, $lstr)), $ostr];
        } elsif ($sort_lr_context =~ /^r/) {
            push @matches, [$rstr, $ostr];
        }
    } else {
        print "$ostr\n";
    }
}

if ($sort_lr_context) {
    if ($reverse_mode) {
        @matches = sort {$b->[0] cmp $a->[0]} @matches;
    } else {
        @matches = sort {$a->[0] cmp $b->[0]} @matches;
    }
    foreach my $l (@matches) {
        print $l->[1]."\n";
    }
}

# 先頭から半角N文字分取る
# 取った文字列と実際の長さ(半角文字数)を返す
sub get_first_n_hkchars {
    my ($str, $n) = @_;
    my $s = "";
    my $slen = 0;
    foreach my $c (split(//, $str)) {
        my $clen = 2;
        if (
            $c =~ /\p{InBasicLatin}/
            or
            ($c =~ /\p{InHalfwidthAndFullwidthForms}/ and $c =~ /\p{Katakana}/)
            ) {
            $clen = 1;
        }
        last if $slen + $clen > $n;
        $slen += $clen;
        $s .= $c;
    }
    return ($s, $slen);
}

sub get_last_n_hkchars {
    my ($s, $n) = @_;
    my $sr = join("", reverse split(//, $s));
    my ($str, $slen) = get_first_n_hkchars($sr, $n);
    my $rv = join("", reverse split(//, $str));
    return ($rv, $slen);
}

この記事に言及しているこのブログ内の記事

【Perl関連記事リスト】Perl でテキスト処理やデータ処理 (2016-08-04)