Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › Application Programming › FAST 'on the fly' fuzzy string matching console tool written in C
New Posts  All Forums:Forum Nav:

FAST 'on the fly' fuzzy string matching console tool written in C

post #1 of 32
Thread Starter 
When you need to get a suggestion list with strings similar to yours, who are you gonna call?

Galadriel, if you ask me.

I wrote it to be fast, its usefulness can be described in short as: a must-have.
Galadriel is a free fuzzy string suggester/searcher tool utilizing all threads in nowadays PCs - it is haxadecuple-threaded, yes 16.
The Galadriel's home page is here.

Just an example:
Code:
E:\_Kaze_Levenshtein_Galadriel>Galadriel.exe 3 reproducable MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1++, copyleft Sanmayce 2013-Jan-19.
Galadriel: Total/Checked/Dumped xgrams: 316,423/237,439/15
Galadriel: Performance: 316,423 xgrams/s

E:\_Kaze_Levenshtein_Galadriel>type Galadriel.txt
improducible
irreproducible
produceable
producible
reproachable
reproduce
reproduceable
reproducible
reproducibles
reproducibly
reproductive
reprovable
unproduceable
unproducible
unreproducible

E:\_Kaze_Levenshtein_Galadriel>

In example above I need all words that look like as reproducable since I didn't know the right one, so all similar words with Levenshtein Distance 3 are listed. In order to see how speedy Galadriel is in kind of worst case scenario TESTbigrams.bat comes in handy - it searches a phrase into 35,116,064 phrases for Levenshtein Distance 9 - this is morphy really.

The tool/benchmark (172 MB in size) is downloadable at: _Kaze_Levenshtein_Galadriel.zip

Test on my 'Bonboniera' laptop T7500 2200MHz, 2/2 cores/threads, 2x2GB dual channel DDR2 667MHz, Windows 7 64bit:
Code:
E:\_Kaze_Levenshtein_Galadriel>dir/og/oe
 Volume in drive E is SSD_Sanmayce
 Volume Serial Number is 9CF6-FEA3

 Directory of E:\_Kaze_Levenshtein_Galadriel

01/28/2013  07:51 PM    <DIR>          .
01/28/2013  07:51 PM    <DIR>          ..
01/28/2013  05:09 AM    <DIR>          Galadriel_logo
01/28/2013  07:52 PM               187 Galadriel_compile_Intel.bat
01/28/2013  07:52 PM             2,572 TESTbigrams.bat
01/28/2013  07:52 PM                26 makeEXE.bat
01/28/2013  07:52 PM        79,620,218 4andabove_Gamera.tar.2.sorted.bsc
01/28/2013  07:52 PM            26,593 Galadriel.c
01/28/2013  07:52 PM            87,524 Galadriel_r2-.c
01/28/2013  07:52 PM           744,167 Galadriel_r2-.cod
01/28/2013  07:52 PM            61,305 Galadriel.cod
01/28/2013  07:52 PM           387,072 Galadriel_r2-_HEXADECAD-Threads_IntelV12_32bit.exe
01/28/2013  07:52 PM           459,776 Galadriel_r2-_HEXADECAD-Threads_IntelV12_64bit.exe
01/28/2013  07:52 PM            90,112 Galadriel_r2-_MONAD-Thread_IntelV12_32bit.exe
01/28/2013  07:52 PM           100,352 Galadriel_r2-_MONAD-Thread_IntelV12_64bit.exe
01/28/2013  07:52 PM            58,880 Galadriel.exe
01/28/2013  07:52 PM           598,528 GRAFFITH_r2++_Graphein_2.3.0_Intel_12.1_32bit.exe
01/28/2013  07:52 PM             4,096 Timer.exe
01/28/2013  07:52 PM             1,566 KAZE prompt.lnk
01/28/2013  07:52 PM       889,537,624 4andabove_Gamera.tar.2.sorted
01/28/2013  07:52 PM               722 README.txt
01/28/2013  07:52 PM         3,869,529 MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd
01/28/2013  07:52 PM        53,460,640 googlebooks-eng-all-1gram-20120701_5038456_words.wrd
              20 File(s)  1,029,111,489 bytes
               3 Dir(s)  19,616,112,640 bytes free

E:\_Kaze_Levenshtein_Galadriel>Galadriel.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-21.
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 2,065,650 xgrams/s

E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_MONAD-Thread_IntelV12_32bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
Enforcing MONAD i.e. single-thread ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 48 KB/clock
Galadriel: Performance: 1,974 xgrams/clock

Kernel Time  =     0.655 =    3%
User Time    =    17.222 =   96%
Process Time =    17.877 =   99%
Global Time  =    17.915 =  100%

E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_HEXADECAD-Threads_IntelV12_32bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 73 KB/clock
Galadriel: Performance: 2,989 xgrams/clock

Kernel Time  =     1.357 =   11%
User Time    =    22.058 =  184%
Process Time =    23.415 =  195%
Global Time  =    11.948 =  100%

E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_MONAD-Thread_IntelV12_64bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
Enforcing MONAD i.e. single-thread ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 54 KB/clock
Galadriel: Performance: 2,202 xgrams/clock

Kernel Time  =     0.468 =    2%
User Time    =    15.631 =   96%
Process Time =    16.099 =   99%
Global Time  =    16.124 =  100%

E:\_Kaze_Levenshtein_Galadriel>Timer Galadriel_r2-_HEXADECAD-Threads_IntelV12_64bit.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
Galadriel: Performance: 81 KB/clock
Galadriel: Performance: 3,281 xgrams/clock

Kernel Time  =     1.294 =   11%
User Time    =    20.841 =  178%
Process Time =    22.136 =  189%
Global Time  =    11.683 =  100%

E:\_Kaze_Levenshtein_Galadriel>

EDIT (revision 2):
To reproduce the test in three steps:
Step1: Enter command prompt either by double-clicking the Windows 'C:\' shortcut or mine KAZE prompt.lnk.
Step2: Execute TESTbigrams.bat.
Step3: The file RESULTS.TXT is the needed one.

My humble laptop offers 3,281,000 phrases per second search speed. I am interested in how the benchmark (running TESTbigrams.bat) behaves on fast 8+ threads CPUs, please share your results with us.
Edited by Sanmayce - 1/31/13 at 11:20am
post #2 of 32
Interesting project, but Windows pretty much has this built in: findstr. It supports regex and while it doesn't (AFAIK) support multiple cores, I think the usefulness of that is limited anyway.

Have you or are you considering opening the source code? And if not, how does the fuzzy matching work (or are you keeping that a trade secret)?

the real crux of this is whether the fuzzy matching is more accurate and convinient that using your own regex matches in findstr
Edited by Plan9 - 1/30/13 at 2:31am
post #3 of 32
Thread Starter 
First off, I am happy to share my latest (last night made) revision 2 of Galadriel - IT/SHE SCREAMS!

Revision 1+++ is the base, it has parsing and searching done single-threadedly.
Revision 2- is the next step, it has single-threaded parsing and 16-threaded searching.
Revision 2 is the last step, it has both 16-threaded parsing and 16-threaded searching.

>... but Windows pretty much has this built in: findstr
Man, you knocked me down, please do not compare Galadriel with findstr, it offends me.

>Have you or are you considering opening the source code?
Don't you see the file Galadriel.c, I am still tuning the OPEN MP revisions (they use algorithm and heuristics exactly as Galadriel.c).

>And if not, how does the fuzzy matching work (or are you keeping that a trade secret)?
No no, I have no secrets much less trade ones - I am an OPEN-SOURCE C program-mess-er.
As for how the fuzzy matching work it is quite simple, the interesting thing is to tune it.

One very good resource:
The Wagner-Fischer Algorithm
http://shaunwagner.com/writings_computer_levenshtein.html

If you are really interested keep asking more specific questions - I will answer.

>the real crux of this is whether the fuzzy matching is more accurate and convinient that using your own regex matches in findstr
If you ask me, forget about findstr, its matching is no match to fuzzy matching ESPECIALLY Galadriel's.

And one more thing, I am very disappointed from the lack of elementary support, to have one's superfast tool benchmarked is like asking for gold - this tells me how little people are open, it is no good.
post #4 of 32
Noob here on this (though I do love what programming I'm capable of atm) but what is this for? and why?
post #5 of 32
Quote:
Originally Posted by th3m3nt4l View Post

Noob here on this (though I do love what programming I'm capable of atm) but what is this for? and why?

I was wondering the same thing!
Perpetual Upgrade
(17 items)
 
Server
(17 items)
 
Galago UltraPro
(9 items)
 
CPUMotherboardGraphicsRAM
i7-4770K MSI Z97M Gaming Zotac GTX 1080 AMP! Edition (2x4GB) Corsair DDR3-2000 
Hard DriveHard DriveCoolingCooling
128GB Crucial M4 (2x) 500GB RAID 0 Swiftech Apogee Black Ice GT Stealth 240 
OSKeyboardPowerCase
Windows 10 Pro 64bit Corsair K70 Vengence Seasonic X650 Aerocool DS Cube 
MouseAudio
Logitech G500 ASUS Xonar DX 
CPUMotherboardGraphicsRAM
Phenom II X4 965 MSI 870A-G54 nVidia 8400GS (2x2GB) Patriot DDR3-1600 
RAMHard DriveHard DriveCooling
(2x4GB) Patriot DDR3-1600 (3x) 320GB RAID 5 (1x) 1TB Backup Storage Coolermaster TX3 
OSPowerOther
Proxmox Hypervisor Antec TruePower 430W HP Smart Array P400 
CPUGraphicsRAMHard Drive
Intel i7-4750HQ Intel Iris Pro Graphics 5200  (2 x 4GB) DDR3-1600 90GB Intel mSATA SSD 
Hard DriveOSOSMonitor
500GB 5400RPM HDD Ubuntu Gnome 15.10 Windows 10 14" 1080p ColorPro IPS 
Case
Galago UltraPro 
  hide details  
Reply
Perpetual Upgrade
(17 items)
 
Server
(17 items)
 
Galago UltraPro
(9 items)
 
CPUMotherboardGraphicsRAM
i7-4770K MSI Z97M Gaming Zotac GTX 1080 AMP! Edition (2x4GB) Corsair DDR3-2000 
Hard DriveHard DriveCoolingCooling
128GB Crucial M4 (2x) 500GB RAID 0 Swiftech Apogee Black Ice GT Stealth 240 
OSKeyboardPowerCase
Windows 10 Pro 64bit Corsair K70 Vengence Seasonic X650 Aerocool DS Cube 
MouseAudio
Logitech G500 ASUS Xonar DX 
CPUMotherboardGraphicsRAM
Phenom II X4 965 MSI 870A-G54 nVidia 8400GS (2x2GB) Patriot DDR3-1600 
RAMHard DriveHard DriveCooling
(2x4GB) Patriot DDR3-1600 (3x) 320GB RAID 5 (1x) 1TB Backup Storage Coolermaster TX3 
OSPowerOther
Proxmox Hypervisor Antec TruePower 430W HP Smart Array P400 
CPUGraphicsRAMHard Drive
Intel i7-4750HQ Intel Iris Pro Graphics 5200  (2 x 4GB) DDR3-1600 90GB Intel mSATA SSD 
Hard DriveOSOSMonitor
500GB 5400RPM HDD Ubuntu Gnome 15.10 Windows 10 14" 1080p ColorPro IPS 
Case
Galago UltraPro 
  hide details  
Reply
post #6 of 32
Thread Starter 
>Noob here on this (though I do love what programming I'm capable of atm) but what is this for? and why?
>I was wondering the same thing!

Long story short:
http://forum.thefreedictionary.com/profile696240.aspx

The name of the game is word/phrase suggesting, having files with words/phrases you can search in them with my superfast Galadriel fuzzy match finder/suggester.

Just received the RESULTS.TXT on Q9550S 4threads:
Code:
r1+++, 3 00,000,00?_optimized_already: 
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-21.
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/11,432,185/1
Galadriel: Performance: 5,852,677 xgrams/s

r2- 1-thread, 32bit, 3 00,000,00?_optimized_already: 
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
Enforcing MONAD i.e. single-thread ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/11,432,185/1
Galadriel: Performance: 135 KB/clock
Galadriel: Performance: 5,493 xgrams/clock
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31

Kernel Time  =     0.406 =    6%
User Time    =     6.265 =   93%
Process Time =     6.671 =   99%
Global Time  =     6.675 =  100%

r2- 16-threads, 32bit, 3 00,000,00?_optimized_already: 
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2-, copyleft Sanmayce 2013-Jan-25.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/11,432,185/1
Galadriel: Performance: 189 KB/clock
Galadriel: Performance: 7,642 xgrams/clock
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31

Kernel Time  =     3.781 =   73%
User Time    =    13.968 =  269%
Process Time =    17.750 =  342%
Global Time  =     5.179 =  100%

r2 1-thread, 32bit, 3 00,000,00?_optimized_already: 
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2, copyleft Sanmayce 2013-Jan-30.
Enforcing MONAD i.e. single-thread ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/11,432,185/1
Galadriel: Performance: 157 KB/clock
Galadriel: Performance: 6,347 xgrams/clock
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31

Kernel Time  =     0.562 =    9%
User Time    =     5.296 =   90%
Process Time =     5.859 =   99%
Global Time  =     5.871 =  100%

r2 16-threads, 32bit, 3 00,000,00?_optimized_already: 
Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 2, copyleft Sanmayce 2013-Jan-30.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Enforcing HEXADECAD i.e. hexadecuple-threads ...
Allocating memory 8MB ... OK
Galadriel: Total/Checked/Dumped xgrams: 35,116,064/11,432,185/1
Galadriel: Performance: 347 KB/clock
Galadriel: Performance: 14,040 xgrams/clock
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31

Kernel Time  =     1.703 =   58%
User Time    =     8.390 =  286%
Process Time =    10.093 =  344%
Global Time  =     2.927 =  100%

The above fragment shows what is the performance for 32bit versions (the test was run under Windows XP 32bit) for Levenshtein Distance = 3 and string 00,000,00?_optimized_already, 64bit compiles are significantly faster.

As you can see 14,040,000 phrases per second is very promising, but I need at least 12 thread capable CPU tested.
Edited by Sanmayce - 1/31/13 at 10:18am
post #7 of 32
So this would be used for some sort of predictive text or spell checking? Or am I completely misunderstanding word/phrase suggesting?
Perpetual Upgrade
(17 items)
 
Server
(17 items)
 
Galago UltraPro
(9 items)
 
CPUMotherboardGraphicsRAM
i7-4770K MSI Z97M Gaming Zotac GTX 1080 AMP! Edition (2x4GB) Corsair DDR3-2000 
Hard DriveHard DriveCoolingCooling
128GB Crucial M4 (2x) 500GB RAID 0 Swiftech Apogee Black Ice GT Stealth 240 
OSKeyboardPowerCase
Windows 10 Pro 64bit Corsair K70 Vengence Seasonic X650 Aerocool DS Cube 
MouseAudio
Logitech G500 ASUS Xonar DX 
CPUMotherboardGraphicsRAM
Phenom II X4 965 MSI 870A-G54 nVidia 8400GS (2x2GB) Patriot DDR3-1600 
RAMHard DriveHard DriveCooling
(2x4GB) Patriot DDR3-1600 (3x) 320GB RAID 5 (1x) 1TB Backup Storage Coolermaster TX3 
OSPowerOther
Proxmox Hypervisor Antec TruePower 430W HP Smart Array P400 
CPUGraphicsRAMHard Drive
Intel i7-4750HQ Intel Iris Pro Graphics 5200  (2 x 4GB) DDR3-1600 90GB Intel mSATA SSD 
Hard DriveOSOSMonitor
500GB 5400RPM HDD Ubuntu Gnome 15.10 Windows 10 14" 1080p ColorPro IPS 
Case
Galago UltraPro 
  hide details  
Reply
Perpetual Upgrade
(17 items)
 
Server
(17 items)
 
Galago UltraPro
(9 items)
 
CPUMotherboardGraphicsRAM
i7-4770K MSI Z97M Gaming Zotac GTX 1080 AMP! Edition (2x4GB) Corsair DDR3-2000 
Hard DriveHard DriveCoolingCooling
128GB Crucial M4 (2x) 500GB RAID 0 Swiftech Apogee Black Ice GT Stealth 240 
OSKeyboardPowerCase
Windows 10 Pro 64bit Corsair K70 Vengence Seasonic X650 Aerocool DS Cube 
MouseAudio
Logitech G500 ASUS Xonar DX 
CPUMotherboardGraphicsRAM
Phenom II X4 965 MSI 870A-G54 nVidia 8400GS (2x2GB) Patriot DDR3-1600 
RAMHard DriveHard DriveCooling
(2x4GB) Patriot DDR3-1600 (3x) 320GB RAID 5 (1x) 1TB Backup Storage Coolermaster TX3 
OSPowerOther
Proxmox Hypervisor Antec TruePower 430W HP Smart Array P400 
CPUGraphicsRAMHard Drive
Intel i7-4750HQ Intel Iris Pro Graphics 5200  (2 x 4GB) DDR3-1600 90GB Intel mSATA SSD 
Hard DriveOSOSMonitor
500GB 5400RPM HDD Ubuntu Gnome 15.10 Windows 10 14" 1080p ColorPro IPS 
Case
Galago UltraPro 
  hide details  
Reply
post #8 of 32
Thread Starter 
>So this would be used for some sort of predictive text or spell checking?
Close, but not 'or' it is rather 'and' i.e. predictive text AND spell checking.
post #9 of 32
Quote:
Originally Posted by SectorNine50 View Post

So this would be used for some sort of predictive text or spell checking? Or am I completely misunderstanding word/phrase suggesting?

It's basically "grep" for windows so far as I can tell.
Handy tools if you spend a lot of time in the command line (like i do)
post #10 of 32
Quote:
Originally Posted by Sanmayce View Post


>... but Windows pretty much has this built in: findstr
Man, you knocked me down, please do not compare Galadriel with findstr, it offends me.
I appreciate that you're proud of your tool, but it's perfectly fair to compare one tool with another found preinstalled with an OS.
Quote:
Originally Posted by Sanmayce View Post

>Have you or are you considering opening the source code?
Don't you see the file Galadriel.c, I am still tuning the OPEN MP revisions (they use algorithm and heuristics exactly as Galadriel.c).

>And if not, how does the fuzzy matching work (or are you keeping that a trade secret)?
No no, I have no secrets much less trade ones - I am an OPEN-SOURCE C program-mess-er.
As for how the fuzzy matching work it is quite simple, the interesting thing is to tune it.

One very good resource:
The Wagner-Fischer Algorithm
http://shaunwagner.com/writings_computer_levenshtein.html
Interesting stuff. Thanks mate. smile.gif
Quote:
Originally Posted by Sanmayce View Post

>the real crux of this is whether the fuzzy matching is more accurate and convinient that using your own regex matches in findstr
If you ask me, forget about findstr, its matching is no match to fuzzy matching ESPECIALLY Galadriel's.
You say that, but regex can do some pretty advanced matching in the right hands smile.gif
Quote:
Originally Posted by Sanmayce View Post

And one more thing, I am very disappointed from the lack of elementary support, to have one's superfast tool benchmarked is like asking for gold - this tells me how little people are open, it is no good.
That doesn't make any sense.
New Posts  All Forums:Forum Nav:
  Return Home
  Back to Forum: Application Programming
Overclock.net › Forums › Software, Programming and Coding › Coding and Programming › Application Programming › FAST 'on the fly' fuzzy string matching console tool written in C