MokujIN - 16-threaded benchmarker that calculates 2^n

post #1 of 9
Thread Starter 
Well, I wrote a multi-threaded console benchmarker (it multiplies big natural numbers) which stresses the CPU mostly with the following instructions (meaning no FPU or main RAM loads):
Code:
movzx
jae
jne     
jbe     
jb      
lea
xor
sub
add
inc
cmp
dec
mov
It is compiled with the latest Intel(R) C++ Compiler XE for applications running on IA-32, Version 12.1, using maximum optimizations.

Thus a new CPU clock pseudo-measure has emerged - MokujINs.
MokujINs stand for the number of iterations of the main loop of the MUL function executed per second.
Each iteration performs one digit-by-digit multiplication.
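
To make the unit concrete, here is a minimal sketch of a digit-by-digit (schoolbook) multiplication that counts its inner-loop iterations. It is written in Python for brevity and is not the MokujIN C source; the function name `schoolbook_multiply` is made up for this illustration.

```python
def schoolbook_multiply(a: str, b: str) -> tuple[str, int]:
    """Multiply two non-negative decimal strings.

    Returns the product as a decimal string together with the number of
    digit-by-digit multiplications (inner-loop iterations) performed.
    """
    x = [int(ch) for ch in reversed(a)]  # least significant digit first
    y = [int(ch) for ch in reversed(b)]
    acc = [0] * (len(x) + len(y))        # product has at most len(a)+len(b) digits
    iterations = 0
    for i, dx in enumerate(x):
        carry = 0
        for j, dy in enumerate(y):
            t = acc[i + j] + dx * dy + carry  # one digit-vs-digit multiplication
            acc[i + j] = t % 10
            carry = t // 10
            iterations += 1
        acc[i + len(y)] += carry
    digits = "".join(str(d) for d in reversed(acc)).lstrip("0") or "0"
    return digits, iterations


product, count = schoolbook_multiply("99999", "12345")
print(product, count)
```

For two operands of n digits each the loop body runs n*n times, so under this reading the MokujINs figure is n*n divided by the elapsed time.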

I already gave the C source of MokujIN in the 'High-precision program that calculates 2^n' thread, but here comes the multi-threaded revision.

The logo of revision 4 (4 threads enforced) and the C source are given at:
http://www.sanmayce.com/Downloads/MokujIN_88-A4-pages.pdf

The package (Open Source) is freely downloadable at:
http://www.sanmayce.com/Downloads/MokujIN.zip

In the ZIP archive three folders are given:
- r. 3+, the single-threaded revision;
- r. 4, the 4-threaded revision;
- r. 5, the 16-threaded revision.

My ‘Bonboniera’ Core 2 T7500 2200MHz laptop gives 73/140 MegaMokujINs (1 thread/2 threads).

It will be interesting to see how the 16-threaded revision behaves on CPUs having only 12 threads; I guess it will choke the scheduler.
I ran the 16-threaded revision on my humble machine (2/2 cores/threads) and it was significantly slower than the 4-threaded revision.

Being an AMD fan since my last 'Barton' chip, I wonder how AMD's 16-thread-capable processors would run my bench:
Code:
MokujIN_r5_16-Threads.exe 2 1048576 /stats

To run it, just go to the 'MokujIN_r5' folder and start 'RUNME.bat'; the output looks like this:
Code:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

D:\WorkTemp>cd D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>runme
Revision 3 Single-Thread results:
Computing 2^1048576 took 0,454 seconds with '/TURBO' with Intel v12.1 on T7500 2200MHz.
Computing 2^1048576 took 1,856 seconds without '/TURBO' with Intel v12.1 on T7500 2200MHz.
Computing 2^1048576 took 0,426 seconds with '/TURBO' with Microsoft v16 on T7500 2200MHz.
Computing 2^1048576 took 1,678 seconds without '/TURBO' with Microsoft v16 on T7500 2200MHz.
SHA1 should be:
adebb3aac8ded6438719f8170a455f38dfebaae3
Computing 2^1048576 ...

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>time0<enter 1>TotalTime.txt

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>timer "MokujIN_r5_16-Threads.exe" 2 1048576 /stats
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
MokujIN, Multiplication of INtegers, an OpenMP (multi-threaded) string multiplier, 16 threads enforced, written by Kaze, 2012-Nov-11, revision 5.
omp_get_num_procs( ) = 2
omp_get_max_threads( ) = 2
Multiplying performance for operands 1 digits long: 1 MokujINs i.e. digits per second.
Multiplying performance for operands 1 digits long: 1 MokujINs i.e. digits per second.
Multiplying performance for operands 2 digits long: 4 MokujINs i.e. digits per second.
Multiplying performance for operands 3 digits long: 9 MokujINs i.e. digits per second.
Multiplying performance for operands 5 digits long: 25 MokujINs i.e. digits per second.
Multiplying performance for operands 10 digits long: 100 MokujINs i.e. digits per second.
Multiplying performance for operands 20 digits long: 400 MokujINs i.e. digits per second.
Multiplying performance for operands 39 digits long: 1,521 MokujINs i.e. digits per second.
Multiplying performance for operands 78 digits long: 6,084 MokujINs i.e. digits per second.
Multiplying performance for operands 155 digits long: 24,025 MokujINs i.e. digits per second.
Multiplying performance for operands 309 digits long: 95,481 MokujINs i.e. digits per second.
Multiplying performance for operands 617 digits long: 380,689 MokujINs i.e. digits per second.
Multiplying performance for operands 1234 digits long: 1,522,756 MokujINs i.e. digits per second.
Multiplying performance for operands 2467 digits long: 6,086,089 MokujINs i.e. digits per second.
Multiplying performance for operands 4933 digits long: 24,334,489 MokujINs i.e. digits per second.
Multiplying performance for operands 9865 digits long: 97,318,225 MokujINs i.e. digits per second.
Multiplying performance for operands 19729 digits long: 129,744,480 MokujINs i.e. digits per second.
Multiplying performance for operands 39457 digits long: 129,737,904 MokujINs i.e. digits per second.
Multiplying performance for operands 78914 digits long: 127,090,191 MokujINs i.e. digits per second.
Multiplying performance for operands 157827 digits long: 127,088,581 MokujINs i.e. digits per second.
Dumping the result to 'MokujIN.txt' ... OK
Total Time: 261 second(s).

Kernel Time  =     0.156 =    0%
User Time    =   495.000 =  189%
Process Time =   495.156 =  189%
Global Time  =   261.523 =  100%

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>time0<enter 1>>TotalTime.txt

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>sha1sum.exe MokujIN.txt
adebb3aac8ded6438719f8170a455f38dfebaae3  MokujIN.txt

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>type TotalTime.txt
The current time is: 17:05:57.20
Enter the new time:
The current time is: 17:10:18.78
Enter the new time:

D:\Downloads\_2012-Nov-12\_PATCH-Nov-11\D\MokujIN\MokujIN_r5>
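
A note on reading the /stats lines in the log above: every reported rate matches n*n divided by a whole number of seconds, where n is the operand length (for example 9865*9865 = 97,318,225). This is inferred from the printed numbers only, not from the MokujIN source. The second counts below (3, 12, 49, 196) are back-calculated, not printed by the program, and `mokujins` is a hypothetical helper name.

```python
def mokujins(digits: int, whole_seconds: int) -> int:
    """Digit-by-digit multiplications per second for one n-by-n multiply.

    Assumption (inferred from the log): elapsed time is counted in whole
    seconds, and anything shorter than one second is treated as one second.
    """
    return digits * digits // max(whole_seconds, 1)


# (operand length, back-calculated seconds) for the T7500 run
T7500_STEPS = [(9865, 1), (19729, 3), (39457, 12), (78914, 49), (157827, 196)]

for n, seconds in T7500_STEPS:
    print(f"{n} digits: {mokujins(n, seconds):,} MokujINs")
```

If the assumption holds, any step that finishes in under a second simply reports n*n, so only the longest steps of a run carry real timing information.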

Having read the article 'AMD Bulldozer 16-core server CPUs "trounce" Intel Xeon', I am eager to see its power in numbers.

"Trounce", ha-ha, I like that pun.

SOED says:

1. Afflict, distress; discomfit. M16–M17.
2. Beat, thrash, esp. as a punishment. M16.
3. Censure; rebuke or scold severely. E17.
4. Punish severely; (now dial.) punish by legal action or process; indict, sue. Also, get the better of, defeat heavily. M17.
...
2. verb trans. Cause to move rapidly; cause to go. rare. E19.

If that's not progress, I don't know what is. ... Interlagos promises to bring unbeatable price-performance to heavily multithreaded workloads. ... It costs considerably less than its closest Intel counterparts.

In my view the MokujIN benchmarker can say something on the Opteron vs Xeon topic.

Share your results with us, please.
post #2 of 9
Thread Starter 
Just received results on an old but fast CPU, the Core 2 Q9550S:


Code:
Microsoft Windows XP [Version 5.1.2600]
(C) Copyright 1985-2001 Microsoft Corp.

C:\WorkTemp>cd "C:\WorkTemp\MokujIN\MokujIN_r5"

C:\WorkTemp\MokujIN\MokujIN_r5>dir
 Volume in drive C is S1T_Vol2
 Volume Serial Number is 3061-B575

 Directory of C:\WorkTemp\MokujIN\MokujIN_r5

11/12/2012  11:32 AM    <DIR>          .
11/12/2012  11:32 AM    <DIR>          ..
11/12/2012  05:33 AM                 2 ENTER
11/12/2012  05:33 AM           565,392 MokujIN_16threads.c
11/12/2012  05:33 AM           886,515 MokujIN_16threads.cod
11/12/2012  05:33 AM               162 MokujIN_compile_Intel.bat
11/12/2012  05:33 AM           386,560 MokujIN_r5_16-Threads.exe
11/12/2012  05:33 AM           100,352 MokujIN_r5_One-Thread.exe
11/12/2012  05:33 AM               690 RUNME.bat
11/12/2012  05:33 AM            62,464 sha1sum.exe
11/12/2012  05:33 AM             4,096 Timer.exe
               9 File(s)      2,006,233 bytes
               2 Dir(s)  10,947,481,600 bytes free

C:\WorkTemp\MokujIN\MokujIN_r5>RUNME.bat
Revision 3 Single-Thread results:
Computing 2^1048576 took 0,454 seconds with '/TURBO' with Intel v12.1 on T7500 2200MHz.
Computing 2^1048576 took 1,856 seconds without '/TURBO' with Intel v12.1 on T7500 2200MHz.
Computing 2^1048576 took 0,426 seconds with '/TURBO' with Microsoft v16 on T7500 2200MHz.
Computing 2^1048576 took 1,678 seconds without '/TURBO' with Microsoft v16 on T7500 2200MHz.
SHA1 should be:
adebb3aac8ded6438719f8170a455f38dfebaae3
Computing 2^1048576 ...

C:\WorkTemp\MokujIN\MokujIN_r5>time0<enter 1>TotalTime.txt

C:\WorkTemp\MokujIN\MokujIN_r5>timer "MokujIN_r5_16-Threads.exe" 2 1048576 /stats
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
MokujIN, Multiplication of INtegers, an OpenMP (multi-threaded) string multiplier, 16 threads enforced, written by Kaze, 2012-Nov-11, revision 5.
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
Multiplying performance for operands 1 digits long: 1 MokujINs i.e. digits per second.
Multiplying performance for operands 1 digits long: 1 MokujINs i.e. digits per second.
Multiplying performance for operands 2 digits long: 4 MokujINs i.e. digits per second.
Multiplying performance for operands 3 digits long: 9 MokujINs i.e. digits per second.
Multiplying performance for operands 5 digits long: 25 MokujINs i.e. digits per second.
Multiplying performance for operands 10 digits long: 100 MokujINs i.e. digits per second.
Multiplying performance for operands 20 digits long: 400 MokujINs i.e. digits per second.
Multiplying performance for operands 39 digits long: 1,521 MokujINs i.e. digits per second.
Multiplying performance for operands 78 digits long: 6,084 MokujINs i.e. digits per second.
Multiplying performance for operands 155 digits long: 24,025 MokujINs i.e. digits per second.
Multiplying performance for operands 309 digits long: 95,481 MokujINs i.e. digits per second.
Multiplying performance for operands 617 digits long: 380,689 MokujINs i.e. digits per second.
Multiplying performance for operands 1234 digits long: 1,522,756 MokujINs i.e. digits per second.
Multiplying performance for operands 2467 digits long: 6,086,089 MokujINs i.e. digits per second.
Multiplying performance for operands 4933 digits long: 24,334,489 MokujINs i.e. digits per second.
Multiplying performance for operands 9865 digits long: 97,318,225 MokujINs i.e. digits per second.
Multiplying performance for operands 19729 digits long: 389,233,441 MokujINs i.e. digits per second.
Multiplying performance for operands 39457 digits long: 389,213,712 MokujINs i.e. digits per second.
Multiplying performance for operands 78914 digits long: 366,318,788 MokujINs i.e. digits per second.
Multiplying performance for operands 157827 digits long: 361,005,245 MokujINs i.e. digits per second.
Dumping the result to 'MokujIN.txt' ... OK
Total Time: 92 second(s).

Kernel Time  =     0.218 =    0%
User Time    =   353.984 =  386%
Process Time =   354.203 =  386%
Global Time  =    91.693 =  100%

C:\WorkTemp\MokujIN\MokujIN_r5>time0<enter 1>>TotalTime.txt

C:\WorkTemp\MokujIN\MokujIN_r5>sha1sum.exe MokujIN.txt
adebb3aac8ded6438719f8170a455f38dfebaae3  MokujIN.txt

C:\WorkTemp\MokujIN\MokujIN_r5>type TotalTime.txt
The current time is: 11:33:09.64
Enter the new time:
The current time is: 11:34:41.34
Enter the new time:

C:\WorkTemp\MokujIN\MokujIN_r5>

The timer written by Igor Pavlov gave 'Process Time = 386%', not bad for 16 threads fighting each other to get the job done on a "poor" 4-thread CPU, he-he.
Edited by Sanmayce - 11/12/12 at 7:55am
post #3 of 9
Hm, I am no mod but maybe this fits better in the programming section.
Since I am at work, I can't run that thing. I am working on a VM that is slower than your mobile.
/* Redemption*/
CPU: I7 3930K
Motherboard: Asus Sabertooth
Graphics: Asus GTX 680
RAM: 8x4GB G.Skill@1337MHz
Hard Drive: 2xM4 64GB/ / F3 - 1TB / 2x2TB Baracudas
Optical Drive: some LG
Cooling: Modified EK 360 HFX
OS: 2x(Win7 x64)
Monitor: SyncMaster P2770HD and SyncMaster 940NW
Keyboard: Roccat Isku
Power: Corsair Gold AX750
Case: NZXT 810 Switch
Mouse: Rocat Kone[+]
Mouse Pad: Razer exactmat X
post #4 of 9
Thread Starter 
>Hm, I am no mod but maybe this fits better in the programming section.
I disagree. MokujIN is a tiny-finy bench, but I see what made you say that: the C source is given, which is not at all usual for benches.

>Working on a VM that is slower than your mobile ...
Yes, it is useless. You see, 16 threads make the CPU run hot, that is, they use its resources tightly.

For more than a year I have wanted to see all the digits of the Mersenne prime 2^43,112,609-1, a mammoth 12,978,189-digit number!
Now, who can wait for 'timer MokujIN_r5_16-Threads.exe 2 43112609 /stats' to complete?!
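
The digit count quoted above can be checked without computing the number itself, since 2^n has floor(n*log10(2)) + 1 decimal digits. A small sketch (the function name is made up for this illustration):

```python
import math


def decimal_digits_of_power_of_two(exponent: int) -> int:
    """Number of decimal digits of 2**exponent, via logarithms."""
    return math.floor(exponent * math.log10(2)) + 1


# 2^p - 1 has as many digits as 2^p, because a power of two never ends in 0
# and so subtracting 1 cannot shorten it.
print(decimal_digits_of_power_of_two(43112609))
```

The same formula gives the length of the benchmark's own result, 2^1048576.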
Edited by Sanmayce - 11/13/12 at 8:34am
post #5 of 9
Man, your site does not work.
You hacked yourself by accident.

Besides, I LOVE the name mokujin.
post #6 of 9
Thread Starter 
Quote:
Originally Posted by Mr.Eiht

Man, your site does not work.
Sorry, it happens from time to time, but only for a few minutes; it needs some maintenance.
Now it works.

>Besides, I LOVE the name mokujin.
Me too. I was impressed by the movie 'Tekken: Blood Vengeance' (2011), where Mokujins played a key role; it has so much in common with the beloved 'Final Fantasy' core plot. Both sagas are very dear to me.
My desire was to have a measure of primitive digit-vs-digit operations (a 161-byte etude) to replace the boring multi-specs of CPU internals: turbo-boosting, number of this-and-that and whatnot.
Now, e.g., having the result on the Core 2 Q9550S, namely 360 MegaMokujINs, I do not care how many cores/threads/hertz/caches it has got. It is just a number, and it tells me how other similar (and common) etudes will behave on this machine.

The test computes 2^(2^20), whose final step multiplies two 157,827-digit numbers. Dividing each into 4 parts, it runs 4x4 threads, each multiplying numbers about 40,000 digits/bytes long. The interesting thing here is that the data of all 16 threads (40,000+40,000+2x40,000 bytes each) fits in the 16 L2 256KB data caches. I cannot predict what boost that can bring, but my dummy guess is that 16 threads at 2.8GHz should give 22 seconds or less.
It will also be interesting to see how well the CPU is loaded, since the weird number 12 is not a power of 2.
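
The sizes quoted above can be re-derived with a few lines of arithmetic. This is only a sketch under stated assumptions: the final step is taken to be the squaring of 2^(2^19), each digit is assumed to occupy one byte, and the per-thread working set is taken as two input chunks plus a double-length product, as in the post.

```python
import math


def decimal_digits_of_power_of_two(exponent: int) -> int:
    """Number of decimal digits of 2**exponent, via logarithms."""
    return math.floor(exponent * math.log10(2)) + 1


operand_digits = decimal_digits_of_power_of_two(2 ** 19)  # each factor of the last step
chunk_digits = math.ceil(operand_digits / 4)               # operand split into 4 parts
tasks = 4 * 4                                              # every part times every part
working_set_bytes = 2 * chunk_digits + 2 * chunk_digits    # two inputs + their product
L2_BYTES = 256 * 1024                                      # one 256KB L2 cache

print(operand_digits, chunk_digits, tasks, working_set_bytes)
print("fits in one L2:", working_set_bytes <= L2_BYTES)
```

The chunk length that comes out matches the 39457-digit step seen in the /stats output.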
Edited by Sanmayce - 11/13/12 at 8:45am
post #7 of 9
Hey mate, totally forgot this one. Since it was a bit cold I thought my CPU could warm the room a bit.
4-threaded version: [screenshot not preserved]

16 threads: [screenshot not preserved]

Edited by Mr.Eiht - 11/16/12 at 2:59pm
post #8 of 9
Thread Starter 
Thanks a lot Mr.Eiht,
your results confirm my expectations about thread scheduling: when 16 is not a multiple of the number of hardware threads, a wait-for-others-to-finish occurs.
Here I enforce 16 threads at each step: when 12 finish (I guess at 100% load) the remaining 4 take over (maybe 33% load), leaving 8 hardware threads unused, so not surprisingly my dummy tool behaves as if it were 8-threaded.
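
The reasoning above can be put into a tiny model. It is deliberately simplified: all 16 tasks are assumed to take equal time and to run to completion in rounds, with no time slicing by the OS scheduler, so real machines may behave differently. The function name is made up for this illustration.

```python
import math


def effective_parallelism(tasks: int, hw_threads: int) -> float:
    """Average number of busy hardware threads when equal-length tasks
    are executed in rounds of at most `hw_threads` tasks each."""
    rounds = math.ceil(tasks / hw_threads)
    return tasks / rounds


for hw in (2, 4, 8, 12, 16):
    print(f"16 tasks on {hw:2d} hardware threads -> {effective_parallelism(16, hw):.1f}")
```

With 12 hardware threads the 16 tasks need two rounds, which gives the same effective figure of 8 as the estimate in the post.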

I like the user time of the 4-threaded variant: 399%. MokujIN utilizes those 4 threads well:
I7 3930K - 498 MegaMokujINs vs Core 2 Q9550S' 360 MegaMokujINs

Useful info for me, thanks again.
post #9 of 9
Thread Starter 
I spent some time writing a simplistic GUI (a shell, in fact) for this purely console benchmark tool; now pressing one button and waiting 8-65 minutes is all that is needed.

In fact, MokujINs are simple byte lookup-table operations, resembling the lookups used in table-based search algorithms like Boyer-Moore.
Thus MokujINs are useful for estimating similar AND VERY WIDELY USED lookups into small tables residing in L1 cache.
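
As an illustration of what a byte lookup into a small table looks like, here is a sketch with a 100-byte table of single-digit products. The table layout and the names `PRODUCTS` and `digit_product` are assumptions made for this example; the actual tables in the MokujIN source may be organised differently.

```python
# Hypothetical 10x10 product table packed into 100 bytes; a table this small
# would sit comfortably in L1 data cache on any of the CPUs discussed here.
PRODUCTS = bytes(a * b for a in range(10) for b in range(10))


def digit_product(a: int, b: int) -> int:
    """Look up a*b for two decimal digits instead of multiplying them."""
    return PRODUCTS[a * 10 + b]


print(digit_product(7, 8), len(PRODUCTS))
```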

EDIT: Some color changes and a REFRESH button added; a screenshot from my computer: [screenshot not preserved]

Free download (3.51 MB) as always at:
www.sanmayce.com/Downloads/Mokujins_reporter.zip

My laptop (2 threads at 2200MHz) worked for 3924 seconds. Since MokujIN is 16-threaded and fully utilizes the cores, I expect HASWELL with 8 threads to finish 4*(3500/2200) = 6.4 times faster, i.e. in about 616 seconds.
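
The estimate above written out as a calculation. It assumes perfectly linear scaling with both thread count and clock speed and ignores any per-clock differences between the two CPUs, so it is a rough guess rather than a prediction; the function name is made up for this illustration.

```python
def scaled_runtime(base_seconds: float, base_threads: int, base_mhz: float,
                   new_threads: int, new_mhz: float) -> tuple[float, float]:
    """Return (speedup, estimated seconds) under linear thread/clock scaling."""
    speedup = (new_threads / base_threads) * (new_mhz / base_mhz)
    return speedup, base_seconds / speedup


speedup, seconds = scaled_runtime(3924, 2, 2200, 8, 3500)
print(f"{speedup:.2f}x faster, about {seconds:.1f} s")
```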

Please share your 'MokujIN_results.txt' or a screenshot of it here with the rest of us.
Edited by Sanmayce - 8/29/13 at 8:10am