Overclock.net › Forums › Benchmarks › Benchmarking Software and Discussion › Benchmarking the fastest hash function
New Posts  All Forums:Forum Nav:

Benchmarking the fastest hash function - Page 2

post #11 of 22
Thread Starter 
>... I can run another test with 125MHz ...
No-No-No, please don't change the base clock it is enough, also I saw my mistake calling it FSB - since the memory controller was made internal I still don't know exactly how the CPU interacts with Main Memory, grrr.

Maybe XEONs and OPTERONs should enter and show why they are the real deal.

I added a result (3333 Latin Powers test) from the 100MB test on your Redemption at my 'Fastest Hash' page, thanks, a short dump is given:

Testing LP3.TXT 98,848 bytes long (one of tests included in '_KAZE_hash_Yorikke.7z' package) on i7 SB-E 3200MHz:
Code:
3333 Latin Powers 
3333 lines read
8192 elements in the table (13 bits)
           Jesteress:        430 [  576]
              Meiyan:        441 [  583]
             Yorikke:        424 [  579]
        x17 unrolled:       1112 [  564]
              FNV-1a:        821 [  604]
              Larson:        805 [  581]
              CRC-32:        868 [  613]
             Murmur2:        541 [  600]
                SBox:        759 [  576]
            Murmur2A:        575 [  576]
             Murmur3:        607 [  583]
           XXHfast32:        544 [  596]
         XXHstrong32:        564 [  571]
           iSCSI CRC:        464 [  594]

The 3333 keys in LP3.TXT are in range 7..43 bytes and look like:

thousand
million
billion
trillion
quadrillion
quintillion
sextillion
septillion
octillion
nonillion
decillion
undecillion
dodecillion
tredecillion
quattuordecillion
quindecillion
sexdecillion
septendecillion
octodecillion
novemdecillion
...
domilliaquadringentrequinquagintillion
domilliaquadringenquattuorquinquagintillion
domilliaquadringenquinquinquagintillion
...
tremilliatrecenoctovigintillion
tremilliatrecennovemvigintillion
tremilliatrecentrigintillion
tremilliatrecenuntrigintillion
tremilliatrecendotrigintillion

For these important keys (they resemble 2-grams, 3-grams lengths) FNV1A_Yorikke proves to be fastest.
The main target for FNV1A_Yorikke is English phrases (so called n-grams), she is the first line of offensive (not defensive, he-he) while millions of phrases are to be processed, that is, hashing precedes all the next stages, mainly searching.

I find next two (made with former Everest benchmarker) graphs most useful:

Cache READ bandwidth:


Main Memory bandwidth:


Looking at i7 3930K Main Memory READ bandwidth which is 17GB/s I see how wrong was I thinking Redemption's 8.639MB per clock i.e. 8.5GB/s being slow - in fact this is 50% bandwidth utilization with 32bit code (the test is 32bit).

Looking at i7 3930K L1 Cache Memory READ bandwidth which is 120GB/s I cannot answer why Redemption's 11.498MB per clock i.e. 11GB/s for FNV1A_Yorikke: (16KB block) is TEN TIMES WORSE?! Can anyone enlighten me here, please.

And some music to my ears:

"According to the company, future production processes down to 5 nm are on the horizon and will most likely be reached without significant problems. Following the current 22 nm process, Intel's manufacturing cadence suggests that the first 14 nm products will arrive in late 2013, 10 nm in 2015, 7 nm in 2017, and 5 nm in 2019. A slight adjustment has been made to include different production processes for traditional processors and now SoCs. The company previously indicated that SoCs will be accelerated to catch up with the process applied to Intel's main processor products.

According to reports, Intel does not see any reason to believe that Moore's Law, which is really more an accepted guideline and observation rather an actual "law", will be breached by the company within 10 years, which indicates that Intel has visibility even beyond 5 nm. At this time, Intel has 14 nm in development, and 10 nm manufacturing in its research phase."

/'Intel Has 5 nm Processors in Sight ' excerpt from http://www.tomshardware.com/news/intel-cpu-processor-5nm,17578.html/
post #12 of 22
Thread Starter 
Okay, thanks to tests performed on i7 3930K I was able to see FNV1A_Yoshimitsu's overkill i.e. using four hash lines is not suitable (YET) since the main loop is bigger than 64bytes and the loop counter is not in a register but in stack.

So, bye-bye to the 4 hash lines 4+4bytes each, that is, hashing 32bytes per cycle is slower than hashing 16bytes per cycle (FNV1A_Yorikke).

The obvious gap to be bridged is stride 24, which is 3 hash lines 4+4bytes each, this resulted in FNV1A_YoshimitsuTRIAD.
On my Core 2 FNV1A_YoshimitsuTRIAD is always (blocks >= 16KB) faster than FNV1A_Yorikke both with Intel 12.1 and Microsoft 16 compilers used.

The loop no longer exceeds 64bytes, it is 61bytes (291a-28df+2) and all memory accesses are 6x4bytes.
Interestingly Microsoft 16 generates 59bytes long main loop but the three IMULs are not grouped together (as Intel did), quick-tests confirm that Intel know what they are doing.

The main loop of FNV1A_YoshimitsuTRIAD follows (Intel 12.1, /Ox used):
Code:
;;;     for(; wrdlen >= 3*2*sizeof(uint32_t); wrdlen -= 3*2*sizeof(uint32_t), p += 3*2*sizeof(uint32_t)) {

  028d7 83 fa 18         cmp edx, 24                            
  028da 72 43            jb .B4.5 ; Prob 10%                    
                                ; LOE eax edx ecx ebx ebp esi edi
.B4.2:                          ; Preds .B4.1
  028dc 89 34 24         mov DWORD PTR [esp], esi               ;
                                ; LOE eax edx ecx ebx ebp edi
.B4.3:                          ; Preds .B4.2 .B4.3

;;;             hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;        

  028df 8b 31            mov esi, DWORD PTR [ecx]               
  028e1 83 c2 e8         add edx, -24                           
  028e4 c1 c6 05         rol esi, 5                             
  028e7 33 71 04         xor esi, DWORD PTR [4+ecx]             
  028ea 33 de            xor ebx, esi                           

;;;             hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+8),5) ^ *(uint32_t *)(p+12))) * PRIME;        

  028ec 8b 71 08         mov esi, DWORD PTR [8+ecx]             
  028ef c1 c6 05         rol esi, 5                             
  028f2 33 71 0c         xor esi, DWORD PTR [12+ecx]            
  028f5 33 fe            xor edi, esi                           

;;;             hash32C = (hash32C ^ (_rotl(*(uint32_t *)(p+16),5) ^ *(uint32_t *)(p+20))) * PRIME;        

  028f7 8b 71 10         mov esi, DWORD PTR [16+ecx]            
  028fa c1 c6 05         rol esi, 5                             
  028fd 33 71 14         xor esi, DWORD PTR [20+ecx]            
  02900 83 c1 18         add ecx, 24                            
  02903 33 ee            xor ebp, esi                           
  02905 69 db e7 d3 0a 
        00               imul ebx, ebx, 709607                  
  0290b 69 ff e7 d3 0a 
        00               imul edi, edi, 709607                  
  02911 69 ed e7 d3 0a 
        00               imul ebp, ebp, 709607                  
  02917 83 fa 18         cmp edx, 24                            
  0291a 73 c3            jae .B4.3 ; Prob 82%                   

Revision 4 of my test is HASH_linearspeed_YoshimitsuTRIAD_vs_CRC32_FURY.zip (670,475 bytes) at:
http://www.sanmayce.com/Fastest_Hash/HASH_linearspeed_YoshimitsuTRIAD_vs_CRC32_FURY.zip

It includes:
Code:
E:\HASH_linearspeed_YoshimitsuTRIAD_vs_CRC32_FURY>dir

10/29/2012  06:47 AM           208,515 cachemem_Core 2 Duo T7500.png
10/29/2012  06:47 AM           210,501 cpuid_Core 2 Duo T7500.png
10/29/2012  06:47 AM            88,116 HASH_linearspeed_FURY.c
10/29/2012  06:47 AM         1,416,029 HASH_linearspeed_FURY_Intel_IA-32_12.cod
10/29/2012  06:47 AM           112,640 HASH_linearspeed_FURY_Intel_IA-32_12.exe
10/29/2012  06:47 AM           309,257 HASH_linearspeed_FURY_Microsoft_32bit_16.cod
10/29/2012  06:47 AM            96,256 HASH_linearspeed_FURY_Microsoft_32bit_16.exe
10/29/2012  06:47 AM               314 KAZE_compile_Intel.bat
10/29/2012  06:47 AM               329 KAZE_compile_Microsoft.bat
10/29/2012  06:47 AM               117 RUNME.bat
10/29/2012  06:47 AM             1,590 Yorikke prompt.lnk

For those who want to compile the C source by themselves there are 'KAZE_compile_Intel.bat' and 'KAZE_compile_Microsoft.bat' batches.
For convenience I created 'RUNME.bat' which can be executed directly from Windows Explorer, it takes some 4-10 minutes to complete and loads automatically the results into NOTEPAD, thus the resultant file easily can be shared.

The result on my laptop:


Code:
Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15351 clocks or 4.269MB per clock
BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 15600 clocks or 4.201MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 16021 clocks or 4.091MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16442 clocks or 3.986MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71651 clocks or 0.915MB per clock

Fetching/Hashing a 10MB block 8*1024 times ...
BURST_Read_4DWORDS: (10MB block); 81920MB fetched in 18907 clocks or 4.333MB per clock
BURST_Read_3DWORDS: (10MB block); 81920MB fetched in 18767 clocks or 4.365MB per clock
FNV1A_YoshimitsuTRIAD: (10MB block); 81920MB hashed in 19812 clocks or 4.135MB per clock
FNV1A_Yorikke: (10MB block); 81920MB hashed in 20186 clocks or 4.058MB per clock
CRC32_SlicingBy8K2: (10MB block); 81920MB hashed in 89685 clocks or 0.913MB per clock

Fetching/Hashing a 5MB block 8*1024 times ...
BURST_Read_4DWORDS: (5MB block); 40960MB fetched in 8752 clocks or 4.680MB per clock
BURST_Read_3DWORDS: (5MB block); 40960MB fetched in 8876 clocks or 4.615MB per clock
FNV1A_YoshimitsuTRIAD: (5MB block); 40960MB hashed in 9438 clocks or 4.340MB per clock
FNV1A_Yorikke: (5MB block); 40960MB hashed in 9563 clocks or 4.283MB per clock
CRC32_SlicingBy8K2: (5MB block); 40960MB hashed in 44507 clocks or 0.920MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9547 clocks or 6.865MB per clock
BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 9688 clocks or 6.765MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10171 clocks or 6.443MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69217 clocks or 0.947MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9235 clocks or 7.096MB per clock
BURST_Read_3DWORDS: (128KB block); 65536MB fetched in 9594 clocks or 6.831MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 9687 clocks or 6.765MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 10078 clocks or 6.503MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 68967 clocks or 0.950MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 7862 clocks or 8.336MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 9048 clocks or 7.243MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9626 clocks or 6.808MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69779 clocks or 0.939MB per clock

As you can see there is a tiny function 'BURST_Read_3DWORDS' with following code:
Code:
.B3.3:                          ; Preds .B3.3 .B3.2
  028db 83 c6 f4         add esi, -12                           

;;;         hash32 = *(uint32_t *)(p+0);

  028de 8b 07            mov eax, DWORD PTR [edi]               

;;;         hash32B = *(uint32_t *)(p+4);

  028e0 8b 4f 04         mov ecx, DWORD PTR [4+edi]             

;;;         hash32C = *(uint32_t *)(p+8);

  028e3 8b 57 08         mov edx, DWORD PTR [8+edi]             
  028e6 83 c7 0c         add edi, 12                            
  028e9 83 fe 0c         cmp esi, 12                            
  028ec 73 ed            jae .B3.3 ; Prob 82%                   

This mini burst loader reads 3x4bytes per cycle, it gives some clue what are the upper limits for hasher with similar stride.

For Core 2:
16KB block gives L1 (32KB) speed, 128KB block gives L2 (2+MB) speed, 2MB block gives L2 (2+MB) speed.
For i7:
16KB block gives L1 (32KB) speed, 128KB block gives L2 (256KB) speed, 2MB block gives L3 (2+MB) speed.

The speed performance of one hasher (for chunks-checksuming) should be the L2 counterpart, here, 6.722MB per clock.
The speed performance of one hasher (for hash-table) should be the L1 counterpart, here, 7.243MB per clock.

As far as I can see there is no faster 32bit linear hasher than FNV1A_YoshimitsuTRIAD.
Testing the revision 4 on different FAST machines will show is this so, please share your Results.txt with us.
Edited by Sanmayce - 10/31/12 at 8:19am
post #13 of 22
Thread Starter 
Thanks to kzone75 here comes the Results.txt obtained on his AMD machine:
http://valid.canardpc.com/show_oc.php?id=2561519



Being an AMD fan since the amazing Barton I wanted to see how hashing is going on their chips, so here it goes:
Code:
Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 10000 clocks or 6.554MB per clock
BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 9094 clocks or 7.207MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 8234 clocks or 7.959MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 11453 clocks or 5.722MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 45610 clocks or 1.437MB per clock

Fetching/Hashing a 10MB block 8*1024 times ...
BURST_Read_4DWORDS: (10MB block); 81920MB fetched in 11203 clocks or 7.312MB per clock
BURST_Read_3DWORDS: (10MB block); 81920MB fetched in 7875 clocks or 10.403MB per clock
FNV1A_YoshimitsuTRIAD: (10MB block); 81920MB hashed in 9828 clocks or 8.335MB per clock
FNV1A_Yorikke: (10MB block); 81920MB hashed in 15047 clocks or 5.444MB per clock
CRC32_SlicingBy8K2: (10MB block); 81920MB hashed in 56593 clocks or 1.448MB per clock

Fetching/Hashing a 5MB block 8*1024 times ...
BURST_Read_4DWORDS: (5MB block); 40960MB fetched in 4516 clocks or 9.070MB per clock
BURST_Read_3DWORDS: (5MB block); 40960MB fetched in 3156 clocks or 12.978MB per clock
FNV1A_YoshimitsuTRIAD: (5MB block); 40960MB hashed in 4438 clocks or 9.229MB per clock
FNV1A_Yorikke: (5MB block); 40960MB hashed in 5531 clocks or 7.406MB per clock
CRC32_SlicingBy8K2: (5MB block); 40960MB hashed in 28094 clocks or 1.458MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 4000 clocks or 16.384MB per clock
BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 4000 clocks or 16.384MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 5797 clocks or 11.305MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 7218 clocks or 9.080MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 44500 clocks or 1.473MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 3797 clocks or 17.260MB per clock
BURST_Read_3DWORDS: (128KB block); 65536MB fetched in 3969 clocks or 16.512MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 5766 clocks or 11.366MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 7218 clocks or 9.080MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 44391 clocks or 1.476MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 3078 clocks or 21.292MB per clock
BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 3922 clocks or 16.710MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 5859 clocks or 11.186MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 6407 clocks or 10.229MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 44437 clocks or 1.475MB per clock

I guess the L1 16KB Data Cache cannot house the last 16KB block entirely, so it suffers in some degree.
Obviously Bulldozer/Zambezi loves FNV1A_YoshimitsuTRIAD (more than FNV1A_Yorikke): 11.186MB per clock for 16KB block.
Edited by Sanmayce - 10/29/12 at 9:52am
post #14 of 22
Thread Starter 
At last a more fledged hash test I made:
http://www.sanmayce.com/Fastest_Hash/index.html#NightLightSky

It is interesting what the face-offs Intel vs AMD and Intel 12.1 compiler vs Microsoft 16 (VS2010) compiler can reveal regarding the best overall performer in all the 4 situations.

The whole test took 67 minutes on my laptop which should be less than half an hour on any i7.
post #15 of 22
Thread Starter 
Wow, so glad I am, having the output of my 'Night-Light Sky' benchmark obtained on a really fast machine.

A good man made my month, thanks to Fantasy and his monstrous machine I am very happy to re-share the results:
Results.txt 119k .txt file

This night I will ponder on them but here some highlights are given:

HASH_linearspeed_FURY_Intel_IA-32_12:
Code:
Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 4665 clocks or 14.048MB per clock
BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 4744 clocks or 13.815MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 5728 clocks or 11.441MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 6309 clocks or 10.388MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 37544 clocks or 1.746MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 3253 clocks or 20.146MB per clock
BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 3343 clocks or 19.604MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 4560 clocks or 14.372MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 5283 clocks or 12.405MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 36394 clocks or 1.801MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 1964 clocks or 33.369MB per clock
BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 2593 clocks or 25.274MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 4390 clocks or 14.928MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 5123 clocks or 12.793MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 36210 clocks or 1.810MB per clock

HASH_linearspeed_FURY_Microsoft_32bit_16:
Code:
Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 4675 clocks or 14.018MB per clock
BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 4729 clocks or 13.858MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 5870 clocks or 11.165MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 5982 clocks or 10.956MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 33884 clocks or 1.934MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 3231 clocks or 20.284MB per clock
BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 3302 clocks or 19.847MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 4756 clocks or 13.780MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 4923 clocks or 13.312MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 33099 clocks or 1.980MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 1954 clocks or 33.539MB per clock
BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 2561 clocks or 25.590MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 4630 clocks or 14.155MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 4793 clocks or 13.673MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 33024 clocks or 1.984MB per clock

Obviously FNV1A_YoshimitsuTRIAD proves to be faster than (on linear hashing) FNV1A_Yorikke with 2GB/s with Intel's code and 300+MB/s with Microsoft's code.
Intel outspeeds Microsoft with 700+MB/s: 14.928MB/14.155MB per clock respectively with both Intel/Microsoft compilers.

The following excerpts show speed (first number) and collisions (the second).
The most important test (for me) is the n-gram-like one:
Code:
3333 Latin Powers: 
3333 lines read
8192 elements in the table (13 bits)
           Jesteress:        457 [  576]
              Meiyan:        480 [  583]
             Yorikke:        466 [  579]
          Yoshimitsu:        484 [  609]
     YoshimitsuTRIAD:        482 [  615]
              FNV-1a:        891 [  604]
              Larson:        874 [  581]
              CRC-32:        942 [  613]
             Murmur2:        588 [  600]
             Murmur3:        652 [  583]
           XXHfast32:        593 [  596]
         XXHstrong32:        614 [  571]
           iSCSI CRC:        500 [  594]

All my functions outspeed the second-best 'iSCSI CRC' (the CPU built-in CRC), especially my favorite hasheress FNV1A_Yorikke.

In next test YoshimitsuTRIAD is fastest:
Code:
100MB as one line (Microsoft_16): 
1 lines read
4 elements in the table (2 bits)
           Jesteress:     163305 [    0]
              Meiyan:     163300 [    0]
             Yorikke:     129557 [    0]
          Yoshimitsu:     134376 [    0]
     YoshimitsuTRIAD:     128540 [    0]
              FNV-1a:     832815 [    0]
              Larson:     832612 [    0]
              CRC-32:     787677 [    0]
             Murmur2:     268087 [    0]
             Murmur3:     320202 [    0]
           XXHfast32:     130332 [    0]
         XXHstrong32:     203735 [    0]
           iSCSI CRC:     207864 [    0]

In next test Yoshimitsu is fastest:
Code:
100MB as one line (Intel_12.1): 
1 lines read
4 elements in the table (2 bits)
           Jesteress:     108399 [    0]
              Meiyan:     108412 [    0]
             Yorikke:      79093 [    0]
          Yoshimitsu:      77660 [    0]
     YoshimitsuTRIAD:      80786 [    0]
              FNV-1a:     777742 [    0]
              Larson:     777634 [    0]
              CRC-32:     707218 [    0]
             Murmur2:     204159 [    0]
             Murmur3:     254884 [    0]
           XXHfast32:      79367 [    0]
         XXHstrong32:     146397 [    0]
           iSCSI CRC:     153574 [    0]

In next test Yorikke is fastest:
Code:
5,000,000 Knight Tours: 
5000000 lines read
16777216 elements in the table (24 bits)
           Jesteress:    4562409 [676877]
              Meiyan:    4560650 [676877]
             Yorikke:    4458158 [677478]
          Yoshimitsu:    4497790 [675312]
     YoshimitsuTRIAD:    4601132 [676878]
              FNV-1a:   11013126 [2080003]
              Larson:   26046348 [4475748]
              CRC-32:    8279700 [676997]
             Murmur2:    5339850 [675965]
             Murmur3:    5647758 [676857]
           XXHfast32:    4671484 [675637]
         XXHstrong32:    5043364 [675834]
           iSCSI CRC:    4695385 [676111]

In next two tests (64/128 bytes long HEX sequences) Meiyan is fastest, Yorikke is close:
Code:
Zarathustra BBs orders 32: 
515858 lines read
1048576 elements in the table (20 bits)
           Jesteress:     176050 [111346]
              Meiyan:     175913 [111346]
             Yorikke:     252470 [108144]
          Yoshimitsu:     265755 [108467]
     YoshimitsuTRIAD:     268524 [108605]
              FNV-1a:     451509 [108269]
              Larson:     449601 [108719]
              CRC-32:     444670 [107992]
             Murmur2:     305201 [108191]
             Murmur3:     322249 [108233]
           XXHfast32:     279688 [108792]
         XXHstrong32:     296693 [108703]
           iSCSI CRC:     250346 [108197]
Zarathustra BBs orders 64: 
520609 lines read
1048576 elements in the table (20 bits)
           Jesteress:     250191 [110190]
              Meiyan:     250158 [110190]
             Yorikke:     343299 [109926]
          Yoshimitsu:     349583 [110169]
     YoshimitsuTRIAD:     356380 [110354]
              FNV-1a:     758438 [109761]
              Larson:     758453 [110226]
              CRC-32:     745229 [110499]
             Murmur2:     435561 [110482]
             Murmur3:     466358 [110343]
           XXHfast32:     366001 [110172]
         XXHstrong32:     404415 [110020]
           iSCSI CRC:     368246 [110170]

Tests matter, they reveal the practical value of functions.
Edited by Sanmayce - 10/31/12 at 11:21am
post #16 of 22
I ran "Night_Light_Sky_hash_package_r1". Took 35 minutes.
Results.txt 119k .txt file
I might add that this test was boring. CPU was never really used (~8%).

Anyway, result is attached. But I dont think that they are too good. I guess some 6GHz CPU would score better since this test
is not multi threaded. The 3930K could run all these test parallel instead sequentially tongue.gif .
/* Redemption*/
(14 items)
 
  
CPUMotherboardGraphicsRAM
I7 3930K Asus Sabertooth Asus GTX 680 8x4GB G.Skill@1337MHz 
Hard DriveOptical DriveCoolingOS
2xM4 64GB/ / F3 - 1TB / 2x2TB Baracudas some LG Modified EK 360 HFX 2x(Win7 x64) 
MonitorKeyboardPowerCase
SyncMaster P2770HD and SyncMaster 940NW Roccat Isku Corsair Gold AX750 NZXT 810 Switch 
MouseMouse Pad
Rocat Kone[+] Razer exactmat X 
  hide details  
Reply
/* Redemption*/
(14 items)
 
  
CPUMotherboardGraphicsRAM
I7 3930K Asus Sabertooth Asus GTX 680 8x4GB G.Skill@1337MHz 
Hard DriveOptical DriveCoolingOS
2xM4 64GB/ / F3 - 1TB / 2x2TB Baracudas some LG Modified EK 360 HFX 2x(Win7 x64) 
MonitorKeyboardPowerCase
SyncMaster P2770HD and SyncMaster 940NW Roccat Isku Corsair Gold AX750 NZXT 810 Switch 
MouseMouse Pad
Rocat Kone[+] Razer exactmat X 
  hide details  
Reply
post #17 of 22
Thread Starter 
>But I dont think that they are too good.
Ha-ha, I like that, because I'm in constant search for better ones.
Your Redemption scores:
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 5195 clocks or 12.615MB per clock

>CPU was never really used (~8%).
Not so, now let's do a little dummy math:
The benchmark is single-threaded (there is no reason whatsoever to be multi-threaded), you have 6x2 threads, 100%/12 = 8.3333333333333333333333333333333%, got it?
If otherwise the CPU will fall to its knees instantly, i7 having MAX 4x12.8GB bandwidth and making it run all the 12 threads would result in 12x12.615MB per clock = 151.38MB per clock which is 147GB/s.

This test is all about utilizing the CPU throughput with few instructions used, the major factor in hashing is the code to be nimble (with little code overhead), thus it becomes easily blendable with the surrounding/main code.

I added your Results_Redemption.txt to my page, thanks.
post #18 of 22
RCPC#1
(17 items)
 
Professional
(13 items)
 
CPUMotherboardGraphicsRAM
AMD Phenom II X6 960T Asus M4A88T-VEVO GTX 650 SuperTalent Perfomance 
RAMHard DriveHard DriveOptical Drive
GSkill Snipers Monster Daytona Seagate Barracuda 500GB 7,200 RPM 16Mb cache Memorex DVD/RW 
CoolingOSMonitorKeyboard
Corsair H60 Windows 8N IBM 9494 19" LCD IBM 
PowerCaseMouseMouse Pad
Corsair GS500 In Win H-Frame Wolfking OCZ Behemoth 
Audio
JBL Creature 
CPUMotherboardGraphicsRAM
Phenom II X6 1100t MSI 890FX GD65 MSI Radeon HD5670 GSkill RipjawsX DDR3 PC3 12800 2x4GB CL8 
Hard DriveOptical DriveCoolingOS
WD Black 1TB SATA III Samsung BD Zalman 9900MAX Windows 7 64 Professional 
MonitorKeyboardPowerCase
AOC 22" LED Logitech Kingwin Lazer Platinum 500w Fractal Design R3 
Other
Samsung 470 SSD 128GB 
  hide details  
Reply
RCPC#1
(17 items)
 
Professional
(13 items)
 
CPUMotherboardGraphicsRAM
AMD Phenom II X6 960T Asus M4A88T-VEVO GTX 650 SuperTalent Perfomance 
RAMHard DriveHard DriveOptical Drive
GSkill Snipers Monster Daytona Seagate Barracuda 500GB 7,200 RPM 16Mb cache Memorex DVD/RW 
CoolingOSMonitorKeyboard
Corsair H60 Windows 8N IBM 9494 19" LCD IBM 
PowerCaseMouseMouse Pad
Corsair GS500 In Win H-Frame Wolfking OCZ Behemoth 
Audio
JBL Creature 
CPUMotherboardGraphicsRAM
Phenom II X6 1100t MSI 890FX GD65 MSI Radeon HD5670 GSkill RipjawsX DDR3 PC3 12800 2x4GB CL8 
Hard DriveOptical DriveCoolingOS
WD Black 1TB SATA III Samsung BD Zalman 9900MAX Windows 7 64 Professional 
MonitorKeyboardPowerCase
AOC 22" LED Logitech Kingwin Lazer Platinum 500w Fractal Design R3 
Other
Samsung 470 SSD 128GB 
  hide details  
Reply
post #19 of 22
Thread Starter 
Thank you very much Redwoodz, I needed AMD results.

I cannot believe my eyes, such a beauty to dare to address a monster like me (recently I face only soullessness and profanism), maman was over my shoulder when I asked her to see your sweet profile, she liked you a lot - and she has the eyes for nice people, believe me.

Seeing your wallpaper I hope you are gonna like my favorite nebula called 'butterfly' and one of my favorite songs ever Sam Obernik - Mr Butterfly.

I salute you with this cuore song:
http://www.allthelyrics.com/forum/lyrics-request/116708-sam-obernik-mr-butterfly.html
http://www.sanmayce.com/Fastest_Hash/Sam%20Obernik%20-%20Work%20Of%20Art/Sam%20Obernik.pdf
post #20 of 22
Thread Starter 
I am happy to share my latest 32bit fastest hash slasher.

A remainderless variant for 16[+] bytes keys appeared while I was playing with Yorikke and wanting to try an old idea of mine - to reduce the branching.
When the GOLDEN Yorikke was Interleaved&Interlaced a DIAMANT appeared - it's time for a new generation slasher: FNV1A_Yoshimura.

Up to now the TOP benchmark results I was able to gather (thanks to Redwoodz & Fantasy):
On AMD (Phenom II X6 1600T, 4000MHz) FNV1A_YoshimitsuTRIAD reigns with 11.360MB per clock or 11360/1024 = 11.093GB/s.
On Intel (i7-3930K, 4500MHz) FNV1A_YoshimitsuTRIAD reigns with 14.928MB per clock or 14928/1024 = 14.578GB/s.

The latest benchmark package is: HASH_linearspeed_Yoshimura.zip
The Results.txt on my laptop running 32bit code by Intel 12.1 compiler (/Ox used):
Code:
Fetching/Hashing a 64MB block 1024 times i.e. 64GB ...
BURST_Read_4DWORDS: (64MB block); 65536MB fetched in 15366 clocks or 4.265MB per clock
BURST_Read_3DWORDS: (64MB block); 65536MB fetched in 15178 clocks or 4.318MB per clock
FNV1A_YoshimitsuTRIAD: (64MB block); 65536MB hashed in 15959 clocks or 4.107MB per clock
FNV1A_Yorikke: (64MB block); 65536MB hashed in 16458 clocks or 3.982MB per clock
FNV1A_Yoshimura: (64MB block); 65536MB hashed in 14602 clocks or 4.488MB per clock
CRC32_SlicingBy8K2: (64MB block); 65536MB hashed in 71604 clocks or 0.915MB per clock

Fetching/Hashing a 10MB block 8*1024 times ...
BURST_Read_4DWORDS: (10MB block); 81920MB fetched in 18720 clocks or 4.376MB per clock
BURST_Read_3DWORDS: (10MB block); 81920MB fetched in 18689 clocks or 4.383MB per clock
FNV1A_YoshimitsuTRIAD: (10MB block); 81920MB hashed in 19640 clocks or 4.171MB per clock
FNV1A_Yorikke: (10MB block); 81920MB hashed in 20093 clocks or 4.077MB per clock
FNV1A_Yoshimura: (10MB block); 81920MB hashed in 17410 clocks or 4.705MB per clock
CRC32_SlicingBy8K2: (10MB block); 81920MB hashed in 89497 clocks or 0.915MB per clock

Fetching/Hashing a 5MB block 8*1024 times ...
BURST_Read_4DWORDS: (5MB block); 40960MB fetched in 8689 clocks or 4.714MB per clock
BURST_Read_3DWORDS: (5MB block); 40960MB fetched in 8736 clocks or 4.689MB per clock
FNV1A_YoshimitsuTRIAD: (5MB block); 40960MB hashed in 9142 clocks or 4.480MB per clock
FNV1A_Yorikke: (5MB block); 40960MB hashed in 9313 clocks or 4.398MB per clock
FNV1A_Yoshimura: (5MB block); 40960MB hashed in 7395 clocks or 5.539MB per clock
CRC32_SlicingBy8K2: (5MB block); 40960MB hashed in 44522 clocks or 0.920MB per clock

Fetching/Hashing a 2MB block 32*1024 times ...
BURST_Read_4DWORDS: (2MB block); 65536MB fetched in 9547 clocks or 6.865MB per clock
BURST_Read_3DWORDS: (2MB block); 65536MB fetched in 9672 clocks or 6.776MB per clock
FNV1A_YoshimitsuTRIAD: (2MB block); 65536MB hashed in 9750 clocks or 6.722MB per clock
FNV1A_Yorikke: (2MB block); 65536MB hashed in 10172 clocks or 6.443MB per clock
FNV1A_Yoshimura: (2MB block); 65536MB hashed in 10046 clocks or 6.524MB per clock
CRC32_SlicingBy8K2: (2MB block); 65536MB hashed in 69046 clocks or 0.949MB per clock

Fetching/Hashing a 128KB block 512*1024 times ...
BURST_Read_4DWORDS: (128KB block); 65536MB fetched in 9235 clocks or 7.096MB per clock
BURST_Read_3DWORDS: (128KB block); 65536MB fetched in 9609 clocks or 6.820MB per clock
FNV1A_YoshimitsuTRIAD: (128KB block); 65536MB hashed in 9688 clocks or 6.765MB per clock
FNV1A_Yorikke: (128KB block); 65536MB hashed in 10093 clocks or 6.493MB per clock
FNV1A_Yoshimura: (128KB block); 65536MB hashed in 11264 clocks or 5.818MB per clock
CRC32_SlicingBy8K2: (128KB block); 65536MB hashed in 68983 clocks or 0.950MB per clock

Fetching/Hashing a 16KB block 4*1024*1024 times ...
BURST_Read_4DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
BURST_Read_3DWORDS: (16KB block); 65536MB fetched in 7878 clocks or 8.319MB per clock
FNV1A_YoshimitsuTRIAD: (16KB block); 65536MB hashed in 9048 clocks or 7.243MB per clock
FNV1A_Yorikke: (16KB block); 65536MB hashed in 9610 clocks or 6.820MB per clock
FNV1A_Yoshimura: (16KB block); 65536MB hashed in 9765 clocks or 6.711MB per clock
CRC32_SlicingBy8K2: (16KB block); 65536MB hashed in 69826 clocks or 0.939MB per clock

Looking at above dump, the margin for non-cached MAIN RAM is GREAT, FNV1A_Yoshimura slashes lightningly fast at 4.488MB per clock (4.107MB per clock for FNV1A_YoshimitsuTRIAD), yet, for 16KB block
6.711MB per clock vs 7.243MB per clock is an ugly sight, I think this is a 'Core 2' drawback, I hope i7 to fix this atrocity.

It is worth the attempt to explore the Jesteress-Yorikke (i.e. 1 hash line 4+4 vs 2 hash lines 4+4) 8-16 GAP in order to lessen their collisions further more.
Simply put, 3 hash lines 4 bytes each, 12 bytes per loop.
The 'non power of 2' workaround I see as one MONOLITH function with no remainder mixing at all.
The idea #1 is to exterminate all nasty IFs outwith the main loop, I believe such branchless etude will outperform Jesteress.
The idea #2 is to STRESS memory by fetching not-so-adjacent areas.

For example:
Key: hash_with_overlapping_aye_aye
Key left-to-right quadruplets and remainder: 'hash', '_wit', 'h_ov', 'erla', 'ppin', 'g_ay', 'e_ay', 'e'
Key right-to-left quadruplets and remainder: 'h', 'ash_', 'with', '_ove', 'rlap', 'ping', '_aye', '_aye'
Key_Length: 29
Loop_Counter: 3 //if ( Key_Length%(3*4) ) Loop_Counter = Key_Length/(3*4)+1; else Loop_Counter = Key_Length/(3*4);

Loop #1 of 3:
Hash line 1: hash
Hash line 2: h_ov
Hash line 3: ping
Loop #2 of 3:
Hash line 1: _wit
Hash line 2: erla
Hash line 3: _aye
Loop #3 of 3:
Hash line 1: h_ov
Hash line 2: ppin
Hash line 3: _aye

Anyway the 3 lines are in stack, for the time being let's see how 'INTERLEAVED' Yorikke, which I called FNV1A_Yoshimura, behaves.
And the source of the 'awful" samurai:
Code:
// [North Star One-Sword School]
// - My name is Kanichiro Yoshimura.
//   I'm a new man. Just so you'll know who I am...
//   Saito-sensei.
// - What land are you from?
// - 'Land'?
// - Yes.
// - I was born in Morioka, in Nanbu, Oshu.
//   It's a beautiful place.
//   Please...
//   Away to the south is Mt Hayachine...
//   with Mt Nansho and Mt Azumane to the west.
//   In the north are Mt Iwate and Mt Himekami.
//   Out of the high mountains flows the Nakatsu River...
//   through the castle town into the Kitakami below Sakuranobaba.
//   Ah, it's pretty as a picture!
//   There's nowhere like it in all Japan!
// /Paragon Kiichi Nakai in the paragon piece-of-art 'The Wolves of Mibu' aka 'WHEN THE LAST SWORD IS DRAWN'/
// As I said on one Japanese forum, Kiichi Nakai deserves an award worth his weight in gold, nah-nah, in DIAMONDS!
uint32_t FNV1A_Hash_Yoshimura(const char *str, uint32_t wrdlen)
{
    const uint32_t PRIME = 709607;
    uint32_t hash32 = 2166136261;
    uint32_t hash32B = 2166136261;
    const char *p = str;
    uint32_t Loop_Counter;
    uint32_t Second_Line_Offset;

if (wrdlen >= 2*2*sizeof(uint32_t)) {
    Second_Line_Offset = wrdlen-((wrdlen>>4)+1)*(2*4); // ((wrdlen>>1)>>3)
    Loop_Counter = (wrdlen>>4);
    //if (wrdlen%16) Loop_Counter++;
    Loop_Counter++;
    for(; Loop_Counter; Loop_Counter--, p += 2*sizeof(uint32_t)) {
                // revision 1:
                //hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+4))) * PRIME;        
                //hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+0+Second_Line_Offset),5) ^ *(uint32_t *)(p+4+Second_Line_Offset))) * PRIME;        
                // revision 2:
                hash32 = (hash32 ^ (_rotl(*(uint32_t *)(p+0),5) ^ *(uint32_t *)(p+0+Second_Line_Offset))) * PRIME;        
                hash32B = (hash32B ^ (_rotl(*(uint32_t *)(p+4+Second_Line_Offset),5) ^ *(uint32_t *)(p+4))) * PRIME;        
    }
} else {
    // Cases: 0,1,2,3,4,5,6,7,...,15
    if (wrdlen & 2*sizeof(uint32_t)) {
                hash32 = (hash32 ^ *(uint32_t*)(p+0)) * PRIME;
                hash32B = (hash32B ^ *(uint32_t*)(p+4)) * PRIME;
                p += 4*sizeof(uint16_t);
    }
    // Cases: 0,1,2,3,4,5,6,7
    if (wrdlen & sizeof(uint32_t)) {
                hash32 = (hash32 ^ *(uint16_t*)(p+0)) * PRIME;
                hash32B = (hash32B ^ *(uint16_t*)(p+2)) * PRIME;
                p += 2*sizeof(uint16_t);
    }
    if (wrdlen & sizeof(uint16_t)) {
        hash32 = (hash32 ^ *(uint16_t*)p) * PRIME;
        p += sizeof(uint16_t);
    }
    if (wrdlen & 1) 
        hash32 = (hash32 ^ *p) * PRIME;
}
    hash32 = (hash32 ^ _rotl(hash32B,5) ) * PRIME;
    return hash32 ^ (hash32 >> 16);
}

To reproduce the quick-test below here comes: http://www.sanmayce.com/Fastest_Hash/DOUBLOON_hash_micro-package_r2.zip
The results on my 'Bonboniera' T7500, throwing mostly 16+ long keys at the "awful greedy country samurai":
Code:
E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r2>RUNME.BAT
E:\Night_Light_Sky_hash_package_r1+\DOUBLOON_hash_micro-package_r2>type Results.txt

Intel 12.1:
3333 lines read
8192 elements in the table (13 bits)
           Jesteress: 493 [  576]
              Meiyan: 515 [  583]
             Yorikke: 458 [  579]
           Yoshimura: 379 [  593] !!! SIGNIFICANTLY fastEST !!!
          Yoshimitsu: 497 [  609]
     YoshimitsuTRIAD: 489 [  615]
              FNV-1a: 969 [  604]
              Larson: 947 [  581]
              CRC-32: 894 [  613]
             Murmur2: 656 [  600]
             Murmur3: 711 [  583]
           XXHfast32: 504 [  596]
         XXHstrong32: 528 [  571]
1000 lines read
2048 elements in the table (11 bits)
           Jesteress:  268 [  205]
              Meiyan:  268 [  205]
             Yorikke:  224 [  207]
           Yoshimura:  235 [  187] ??? the slowest of all the four Yo*, something to ponder about ???
          Yoshimitsu:  225 [  225]
     YoshimitsuTRIAD:  221 [  219]
              FNV-1a: 1125 [  225]
              Larson: 1131 [  212]
              CRC-32:  919 [  230]
             Murmur2:  439 [  222]
             Murmur3:  497 [  223]
           XXHfast32:  250 [  223]
         XXHstrong32:  309 [  192]
32359 lines read
65536 elements in the table (16 bits)
           Jesteress: 12249 [ 6883]
              Meiyan: 12369 [ 6897]
             Yorikke: 11000 [ 6872]
           Yoshimura:  9876 [ 6908] !!! fastEST, yet with high collisions !!!
          Yoshimitsu: 11489 [ 6937]
     YoshimitsuTRIAD: 11094 [ 6843]
              FNV-1a: 39491 [ 6840]
              Larson: 39714 [ 6889]
              CRC-32: 34264 [ 6891]
             Murmur2: 17678 [ 6786]
             Murmur3: 19626 [ 6850]
           XXHfast32: 10383 [ 6859]
         XXHstrong32: 12708 [ 6887]

The two new techniques (Interleaving&Interlacing) are promising, nah-nah, my vocabulary is poor to put it right, they are ... paragon.
To explore and prove their might, though, I need benchmarking on very fast PC machines.
New Posts  All Forums:Forum Nav:
  Return Home
Overclock.net › Forums › Benchmarks › Benchmarking Software and Discussion › Benchmarking the fastest hash function