Overclock.net - An Overclocking Community

Thread: Benchmark 'Sherlock Holmes' - superfast grepping into 12GB English subtitles
05-22-2019 03:53 AM
Sanmayce
Benchmark 'Sherlock Holmes' - superfast grepping into 12GB English subtitles

Grepping time.

Until several days ago I didn't know 'ripgrep' existed; its author claims it is the fastest.
For 'Exact Matching', he benchmarked it on a corpus of plain English sentences (a rip from 400+ thousand subtitle files).
His console tool is written in Rust. I proposed a showdown with my Kazahana tool, written in C, and he agreed conditionally.

What confuses me is how a GitHub project like his, with 14,000+ stars, and its author can be reluctant to compare speeds in an open Rust vs C showdown.
The guy obviously is not a fan of heavy/exhaustive benchmarking, so I will do it myself; my wish was (and still is) to see which optimizations are left unused.

To reproduce his benchmark scenario, I use his pattern, "Sherlock Holmes".

Open a Command Prompt (you may double-click my wide-screen prompt shortcut 'MokujIN GREEN 224 prompt.lnk') and enter:

Code:
F:\grep_vs_ripgrep_vs_Kazahana>GREP_BENCHMARK.bat "Sherlock Holmes" "OpenSubtitle_corpus_en_2018_(441,450,449_lines_FROM_446,612_files).txt"
Code:
Pattern: "Sherlock Holmes"
Pattern Length: 15
Haystack: OpenSubtitle_corpus_en_2018_(441,450,449_lines_FROM_446,612_files).txt
Haystack Length: 13,113,340,782 (Cached in System RAM)
Testmachine: Windows 10, i7-3630QM (4cores/8threads) 6MB cache, 16GB DDR3 1600MHz(@800MHz)
---------------------------------------------------------------------------------------------------
| Searcher                                                            | Global  Time |       Hits |
|--------------------------------------------------------------------------------------------------
| Kazahana_Trolldom_Monad_GCC_472_SSE41_32bit.exe              520120 |        5.954 |      7,673 |
| Kazahana_Trolldom_Hexadecad_GCC_730_SSE41_64bit.exe          520120 |        5.083 |      7,673 |
| Kazahana_Trolldom_Hexadecad_IntelV15_SSE41_64bit.exe         520120 |        5.513 |      7,673 |
| grep-2.5.4.exe -F -c (LC_ALL=C)                                     |       20.165 |      7,673 |
| ripgrep-11.0.1-x86_64-pc-windows-gnu.exe                            |        4.817 |      7,673 |
---------------------------------------------------------------------------------------------------

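The times in the table are easier to compare as throughput. Below is a minimal sketch in C that divides the haystack length by each time. It assumes the 'Global Time' column is in seconds and takes 1 GB as 10^9 bytes; the names `throughput_gbps` and `print_throughput` are made up for this illustration and are not part of GREP_BENCHMARK.bat.

```c
#include <assert.h>
#include <stdio.h>

/* Haystack length in bytes, copied from the benchmark output above. */
#define HAYSTACK_BYTES 13113340782.0

/* Throughput in GB/s, with 1 GB = 10^9 bytes (assumption). */
static double throughput_gbps(double bytes, double seconds)
{
    assert(seconds > 0.0);
    return bytes / seconds / 1e9;
}

/* Print one line per searcher; times copied from the table,
   assumed to be in seconds. */
void print_throughput(void)
{
    static const struct { const char *name; double seconds; } rows[] = {
        { "Kazahana_Trolldom_Monad_GCC_472_SSE41_32bit.exe",       5.954 },
        { "Kazahana_Trolldom_Hexadecad_GCC_730_SSE41_64bit.exe",   5.083 },
        { "Kazahana_Trolldom_Hexadecad_IntelV15_SSE41_64bit.exe",  5.513 },
        { "grep-2.5.4.exe -F -c (LC_ALL=C)",                      20.165 },
        { "ripgrep-11.0.1-x86_64-pc-windows-gnu.exe",              4.817 },
    };
    for (size_t i = 0; i < sizeof rows / sizeof rows[0]; i++)
        printf("%-56s %6.2f GB/s\n", rows[i].name,
               throughput_gbps(HAYSTACK_BYTES, rows[i].seconds));
}
```

Under those assumptions ripgrep's 4.817 works out to roughly 2.7 GB/s and grep's 20.165 to roughly 0.65 GB/s.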

The ripgrep binaries are downloadable at: https://github.com/BurntSushi/ripgrep

The full benchmark package, grep_vs_ripgrep_vs_Kazahana.7z (1,411,102,425 bytes), is downloadable from my Internet Drive:
https://drive.google.com/file/d/1EZu...ew?usp=sharing

It contains:

Code:
F:\grep_vs_ripgrep_vs_Kazahana>dir
 Volume in drive F is Sanmayce_223GB_B
 Volume Serial Number is 443D-57A7

 Directory of F:\grep_vs_ripgrep_vs_Kazahana

05/21/2019  10:48 PM    <DIR>          .
05/21/2019  10:48 PM    <DIR>          ..
05/22/2019  12:21 AM           451,824 grep-2.5.4-bin.zip
05/22/2019  12:21 AM           898,241 grep-2.5.4-dep.zip
05/22/2019  12:21 AM         1,361,303 grep-2.5.4-src.zip
05/22/2019  12:21 AM            96,256 grep.exe
05/22/2019  12:21 AM            19,385 GREP_BENCHMARK.bat
05/22/2019  12:21 AM               502 Kazahana_compile_GCC_32bit.bat
05/22/2019  12:21 AM               502 Kazahana_compile_GCC_64bit.bat
05/22/2019  12:21 AM               617 Kazahana_compile_Intel12_32bit.bat
05/22/2019  12:21 AM         2,220,512 Kazahana_r1-++fix+nowait_critical_nixFIX_WolfRAM+fixITER+EX+CS_fix_DEFINE_Trolldom.c
05/22/2019  12:21 AM           431,699 Kazahana_Trolldom_Hexadecad_GCC_730_SSE41_64bit.exe
05/22/2019  12:21 AM           217,088 Kazahana_Trolldom_Hexadecad_IntelV15_SSE41_64bit.exe
05/22/2019  12:21 AM           251,390 Kazahana_Trolldom_Monad_GCC_472_SSE41_32bit.exe
05/22/2019  12:21 AM           165,782 Kazahana_Trolldom_Monad_GCC_730_SSE41_64bit.exe
05/22/2019  12:21 AM           195,584 Kazahana_Trolldom_Monad_IntelV15_SSE41_64bit.exe
05/22/2019  12:21 AM         1,008,128 libiconv2.dll
05/22/2019  12:21 AM           103,424 libintl3.dll
05/22/2019  12:21 AM         1,114,552 libiomp5md.dll
05/22/2019  12:21 AM            94,540 LineWordreporter.c
05/22/2019  12:21 AM            69,120 LineWordreporter.exe
05/22/2019  12:21 AM             1,633 MokujIN Amber 224 prompt.lnk
05/22/2019  12:21 AM             1,633 MokujIN GREEN 224 prompt.lnk
05/22/2019  12:21 AM    13,113,340,782 OpenSubtitle_corpus_en_2018_(441,450,449_lines_FROM_446,612_files).txt
05/22/2019  12:21 AM           140,288 pcre3.dll
05/22/2019  12:21 AM            94,300 pthreadGC2.dll
05/22/2019  12:21 AM            79,360 regex2.dll
05/22/2019  12:21 AM            87,572 RESULTS_CPU_i7-3630QM.txt
05/22/2019  12:21 AM           207,289 RESULTS_CPU_i7-3630QM_pattern-'gun'.png
05/22/2019  12:21 AM           208,156 RESULTS_CPU_i7-3630QM_pattern-'Sherlock '.png
05/22/2019  12:21 AM           208,205 RESULTS_CPU_i7-3630QM_pattern-'Sherlock Holmes'.png
05/22/2019  12:21 AM         6,923,495 ripgrep-11.0.1-i686-pc-windows-gnu.zip
05/22/2019  12:21 AM         1,595,373 ripgrep-11.0.1-i686-pc-windows-msvc.zip
05/22/2019  12:21 AM        27,519,247 ripgrep-11.0.1-x86_64-pc-windows-gnu.exe
05/22/2019  12:21 AM         7,084,974 ripgrep-11.0.1-x86_64-pc-windows-gnu.zip
05/22/2019  12:21 AM         1,767,433 ripgrep-11.0.1-x86_64-pc-windows-msvc.zip
05/22/2019  12:21 AM             4,096 timer32.exe
              35 File(s) 13,167,964,285 bytes
               2 Dir(s)   1,365,311,488 bytes free

F:\grep_vs_ripgrep_vs_Kazahana>
Q: Why is Kazahana revision 'Trolldom' not as fast as it should be?
A: Simple: Kazahana "loses time" splitting the buffer into 16 chunks. Also, Kazahana was intended as a PLAIN C tool, whereas patterns of length 2 or 3 should be searched with SIMD.
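
A 16-chunk split like the one mentioned in the answer could look roughly like the sketch below. It is not Kazahana's code: the names `chunk_t` and `split_chunks` and the overlap handling are assumptions for illustration. Neighbouring chunks overlap by (pattern length - 1) bytes, so a match that straddles a chunk border is still seen, and seen only once.

```c
#include <assert.h>
#include <stddef.h>

#define NCHUNKS 16  /* the post mentions 16 chunks; everything else is assumed */

/* Byte range [start, end) that one chunk searches. */
typedef struct { size_t start; size_t end; } chunk_t;

/* Split a buffer of len bytes into NCHUNKS ranges for a pattern of
   pat_len bytes. Each range is extended by pat_len - 1 bytes past its
   nominal end, so a match starting in chunk i fits entirely inside
   chunk i's range and cannot fit inside any other chunk's range. */
void split_chunks(size_t len, size_t pat_len, chunk_t out[NCHUNKS])
{
    assert(pat_len > 0);
    size_t base = len / NCHUNKS;          /* nominal chunk size */
    for (size_t i = 0; i < NCHUNKS; i++) {
        size_t start = i * base;
        size_t end = (i == NCHUNKS - 1)
                         ? len                          /* last chunk takes the rest */
                         : start + base + (pat_len - 1); /* overlap into the next */
        if (end > len)
            end = len;
        out[i].start = start;
        out[i].end = end;
    }
}
```

The bookkeeping itself is cheap; any real cost would come from whatever is done per chunk afterwards, which this sketch does not model.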

Of course, plain C is awesome for portability, but I have an idea for a SIMD revision of Exact Matching that will be beyond superfast. The idea is simplicity itself:

Code:
Pattern: "to"
Haystack:                           "otto...........toz"
HaystackVector1:                    "otto...........t"
HaystackVector2:                    "tto...........to"
Vector1:                            "tttttttttttttttt"
Vector2:                            "oooooooooooooooo"

Mask1=(HaystackVector1 == Vector1):  0110...........1
Mask2=(HaystackVector2 == Vector2):  001............1
Result=(Mask1 AND Mask2):            0010...........1
I intend to implement it pretty much as in my 'Kamboocha' LCSS tool.
If the Result is zero, skip 16 bytes. Of course YMM is more desirable, since then 32 'if' statements will be "folded" into one check.
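
The compare-and-AND idea above can be sketched with SSE2 intrinsics in C. This is only an illustration under stated assumptions, not the planned Kazahana revision: the function name `count_pair_sse2` is hypothetical, it handles a two-byte pattern only, and it counts hits rather than reporting lines.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>

/* Count occurrences of the two-byte pattern {b0, b1} in hay[0..len).
   Hypothetical helper for illustration. */
size_t count_pair_sse2(const unsigned char *hay, size_t len,
                       unsigned char b0, unsigned char b1)
{
    assert(hay != NULL);
    size_t hits = 0, i = 0;
    const __m128i v0 = _mm_set1_epi8((char)b0);  /* "tttttttttttttttt" */
    const __m128i v1 = _mm_set1_epi8((char)b1);  /* "oooooooooooooooo" */

    /* One step tests 16 start positions and reads bytes i .. i+16. */
    while (i + 17 <= len) {
        __m128i h0 = _mm_loadu_si128((const __m128i *)(hay + i));     /* HaystackVector1 */
        __m128i h1 = _mm_loadu_si128((const __m128i *)(hay + i + 1)); /* HaystackVector2 */
        __m128i eq = _mm_and_si128(_mm_cmpeq_epi8(h0, v0),            /* Mask1 */
                                   _mm_cmpeq_epi8(h1, v1));           /* Mask2 */
        unsigned mask = (unsigned)_mm_movemask_epi8(eq);              /* Result */
        /* mask == 0 means no match starts here: the loop below is
           skipped and we move on by 16 bytes. */
        while (mask) {
            mask &= mask - 1;  /* clear lowest set bit, one per match */
            hits++;
        }
        i += 16;
    }
    /* Scalar tail for the last few start positions. */
    for (; i + 1 < len; i++)
        if (hay[i] == b0 && hay[i + 1] == b1)
            hits++;
    return hits;
}
```

Called on the 18-byte haystack from the diagram with 't' and 'o', this finds the two matches at offsets 2 and 15. A YMM variant would use the 256-bit AVX2 counterparts of the same intrinsics and step by 32 bytes.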
