たつをの ChangeLog : 2023-09-03

前後の日: 2023-09-02 < > 2023-09-05
この月: 2023-09
別の年の同じ日: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2013 2015 2016 2017 2018 2019 2021 2023

手軽に使える look コマンドの速度比較

2023-09-03-1 [Programming]

大きめな TSV ファイルが手元にあって、クエリを行頭からマッチさせて、マッチした行を取ってくる。
そんな状況で、複数のクエリに対してそれぞれにマッチする行を一気に全部取ってくるタスク。

例えば、こういう TSV ファイルがあって、

A102[tab]2022/01/01[tab]2022/02/11[tab]2387
A120[tab]2022/02/20[tab]2023/12/31[tab]100
A280[tab]2022/03/01[tab]2022/03/02[tab]89
B007[tab]2022/04/05[tab]2022/08/29[tab]980
B010[tab]2022/05/01[tab]2022/05/10[tab]12
C763[tab]2023/01/01[tab]2023/06/30[tab]7800
...

複数のクエリ "A120","A280","B020" が一度に与えられて、これらが行頭にマッチする行を全て取得する、という感じ。

これを手軽に手早く行いたい。

ということで、UNIX系OSのコマンドラインで使える look コマンドやそれ系の Perl モジュール「Search::Dict」、あと join コマンドや perl の正規表現での走査などで速度比較してみた。
（だいぶ前にやったのを今更ながらブログ記事としてまとめてみました）

実験環境

実験環境は、私が「さくらの500円サーバ」と呼んでいるこれ↓

WordPress、EC-CUBE、concrete5等CMSが使えるスタンダードプラン - レンタルサーバーはさくらインターネット

OS は FreeBSD 13.0 が入っている。

実験用データ

検索対象ファイル current.tsv
- 860Mバイト、120万行、TSV 形式
```
wc current.tsv
1207384 12954296 857036094 current.tsv
```
- ファイルの中身のイメージ
```
B00AE090AL[tab]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX[tab]XXXX...
B00J7A10QC[tab]XXXXXXXXXXXXXXX[tab]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
B00JJFB0EC[tab]XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX[tab]XXXXXXXXX...
```
- 行頭（第一カラム）はユニークなID文字列
  - ID文字列は[0-9A-Z]の10文字からなる (例: B08C538N3A)
- ソート済み（重要）
  - look や join で使うには対象がソート済みの必要あり
- データはこちらで入手可能
  - ttps://chalow.net/misc/data/current.tsv.gz (18M)

検索クエリの集合

cut -f1 current.tsv | shuffle.pl | head -10 | sort > sample-asins-10.txt
cut -f1 current.tsv | shuffle.pl | head -100 | sort > sample-asins-100.txt
cut -f1 current.tsv | shuffle.pl | head -1000 | sort > sample-asins-1k.txt
cut -f1 current.tsv | shuffle.pl | head -10000 | sort > sample-asins-10k.txt
cut -f1 current.tsv | shuffle.pl | head -100000 | sort > sample-asins-100k.txt

「shuffle.pl | head -1000」は「shuf -1000」に相当

SUFARY(sass)で使うためのインデックス current.tsv.ary
```
mkary -l current.tsv
```
SUFARYについてはここでは説明しない

実験用プログラム

look.sh,look-sass.sh:

#!/usr/bin/env zsh
while read line
do
    key=`echo $line | cut -f1`
    if [ $key ]; then
        up=`look $key $1`
        #up=`sass $key $1` # look-sass.sh
        if [ $up ]; then
            echo $up
            continue
        fi
    fi
done

look.pl: look コマンドと同等（と思われる）Search::Dict モジュールを使用

#!/usr/bin/perl
use strict;
use warnings;
use Search::Dict;
my $fn = shift;
open(my $fh, "<", $fn) or die "can't open [$fn]";
while (<>) {
    chomp;
    next if /^\s*$/;
    look $fh, $_;
    my $line = readline($fh);
    print $line if $line =~ /^\Q$_\E/;
}
close($fh);

追記231017:
looks.py: pure Python による実装 (buffer_size=10000)

実験結果

何回か測定したうちの中間値的なものを掲載。

評価対象ツール＼クエリ数	10	100	1000	1万	10万
look.sh (look)	0.04	0.35	3.6	35.1
look-sass.sh (SUFARY)	0.23	2.86	28.6	267.2
look.pl (Search::Dict)	0.01	0.03	0.1	0.9	8.7
looks.py (pure Python)	0.04	0.08	0.6	5.6	54.3
join	11.1	11.2	11.4	11.6	11.6
perl-regexp	1.3	1.3	1.3	1.4

(単位: 秒)

詳細

10クエリ

[look:10]
time (cat sample-asins-10.txt | ./look.sh current.tsv > a1)
0.01s user 0.04s system 132% cpu 0.038 total

[SUFARY(sass):10]
time (cat sample-asins-10.txt | ./look-sass.sh current.tsv > a2)
0.01s user 0.26s system 104% cpu 0.259 total

[look(Search::Dict):10]
time (cat sample-asins-10.txt | ./look.pl current.tsv > a3)
0.01s user 0.01s system 123% cpu 0.010 total

[join:10]
time (cat sample-asins-10.txt | join -t$'\t' - current.tsv > a4)
10.92s user 0.15s system 99% cpu 11.067 total

[perl-regexp:10]
PAT=`cat sample-asins-10.txt | xargs | sed 's/ /|/g'`
time (perl -nle 'print if /^('$PAT')/' current.tsv > a5)
1.09s user 0.21s system 99% cpu 1.307 total

100クエリ

[look:100]
time (cat sample-asins-100.txt | ./look.sh current.tsv > a1)
0.04s user 0.37s system 117% cpu 0.350 total

[SUFARY(sass):100]
time (cat sample-asins-100.txt | ./look-sass.sh current.tsv > a2)
0.12s user 2.82s system 103% cpu 2.857 total

[look(Search::Dict):100]
time (cat sample-asins-100.txt | ./look.pl current.tsv > a3)
0.02s user 0.02s system 143% cpu 0.026 total

[join:100]
time (join -t$'\t' sample-asins-100.txt current.tsv > a4)
10.98s user 0.19s system 99% cpu 11.173 total

[perl-regexp:100]
PAT=`cat sample-asins-100.txt | xargs | sed 's/ /|/g'`
time (perl -nle 'print if /^('$PAT')/' current.tsv > a5)
1.10s user 0.22s system 99% cpu 1.318 total

1000クエリ

[look:1000]
time (cat sample-asins-1k.txt | ./look.sh current.tsv > a1)
0.95s user 3.72s system 131% cpu 3.553 total

[SUFARY(sass):1000]
time (cat sample-asins-1k.txt | ./look-sass.sh current.tsv > a2)
0.98s user 28.41s system 102% cpu 28.585 total

[look(Search::Dict):1000]
time (cat sample-asins-1k.txt | ./look.pl current.tsv > a3)
0.09s user 0.03s system 105% cpu 0.109 total

[join:1000]
time (join -t$'\t' sample-asins-1k.txt current.tsv > a4)
11.25s user 0.18s system 99% cpu 11.440 total

[perl-regexp:1000]
PAT=`cat sample-asins-1k.txt | xargs | sed 's/ /|/g'`
time (perl -nle 'print if /^('$PAT')/' current.tsv > a5)
1.09s user 0.20s system 99% cpu 1.292 total

1万クエリ

[look:10000]
time (cat sample-asins-10k.txt | ./look.sh current.tsv > a1)
8.73s user 36.74s system 129% cpu 35.129 total

[SUFARY(sass):10000]
time (cat sample-asins-10k.txt | ./look-sass.sh current.tsv > a2)
9.68s user 265.59s system 103% cpu 4:27.22 total

[look(Search::Dict):10000]
time (cat sample-asins-10k.txt | ./look.pl current.tsv > a3)
0.58s user 0.32s system 100% cpu 0.906 total

[join:10000]
time (join -t$'\t' sample-asins-10k.txt current.tsv > a4)
11.45s user 0.17s system 99% cpu 11.623 total

[perl-regexp:10000]
PAT=`cat sample-asins-10k.txt | xargs | sed 's/ /|/g'`
time (perl -nle 'print if /^('$PAT')/' current.tsv > a5)
1.19s user 0.19s system 99% cpu 1.377 total

10万クエリ (look.pl vs. join)

[look(Search::Dict):100000]
time (cat sample-asins-100k.txt | ./look.pl current.tsv > a7)
6.49s user 2.24s system 99% cpu 8.734 total

[join:100000]
time (join -t$'\t' sample-asins-100k.txt current.tsv > a8)
11.32s user 0.23s system 99% cpu 11.562 total

追記231017: looks.py (ref. [2023-10-17-1])

( cat sample-asins-10.txt | ./looks.py current.tsv > a6; )
0.02s user 0.02s system 107% cpu 0.036 total
( cat sample-asins-100.txt | ./looks.py current.tsv > a6; )
0.05s user 0.03s system 105% cpu 0.082 total
( cat sample-asins-1k.txt | ./looks.py current.tsv > a6; )
0.44s user 0.16s system 100% cpu 0.600 total
( cat sample-asins-10k.txt | ./looks.py current.tsv > a6; )
4.25s user 1.33s system 99% cpu 5.582 total
( cat sample-asins-100k.txt | ./looks.py current.tsv > a6; )
41.67s user 12.60s system 99% cpu 54.311 total

考察と結論

やはり look 系が速い
- 今回の複数クエリタスクでは look コマンドよりも perl で Search::Dict を使った方が速い
- ファイルを毎回 open, close するか否かの違い
join は全体を走査しているのでクエリ数にかかわらず一定の速度
perl-regexp も全体走査で一定速度のうえ join よりも速いのだが、クエリ数が多いと使えない
- 10万クエリのときは長すぎてエラー出た
- 同じ方式（クエリを"|"で繋げた正規表現で検索）で egrep も試してみたが、10クエリの段階であまりに遅すぎて断念

ということで、私の第一選択肢は look 系の Search::Dict モジュールを使った Perl スクリプト！

もっと速くする方法がありそうだけど深入りはやめておく。そもそもメモリに全部載せてハッシュ使った方が速いだろうし、本格的にやるならマイクロ秒レベルの低レイテンシのシステム使うべきだし。知らんけど。

§

なお、検索対象データを公開しているので興味のある方はどうぞ。
実験用データの説明のところに URL あります。

また、PHP や Python で Search::Dict 相当のものがあれば知りたいです。
ご存知でしたら教えていただける幸いです。

この記事に言及しているこのブログ内の記事

ソート済みテキストを二分探索する look コマンドの pure Python 実装 (2023-10-17)

前後の日: 2023-09-02 < > 2023-09-05
この月: 2023-09
別の年の同じ日: 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2013 2015 2016 2017 2018 2019 2021 2023

たつをの ChangeLog