CPU-RAM subsystem benchmark 'Freaky_Dreamer' reporting IPC (Instructions-Per-Clock) - Overclock.net - An Overclocking Community

post #1 of 27 (permalink) Old 07-13-2015, 04:20 PM - Thread Starter
Sanmayce
Integer Benchmarker
Join Date: Mar 2012
Location: Sofia
Posts: 404
Rep: 23 (Unique: 20)
Roughly every fifth post on OCN speaks of IPC, but I don't see any actual IPC numbers anywhere, just talk.

Today I wrote an integer CPU-RAM stressing benchmark which shows how single-threaded & 32-threaded decompression translates into IPC.
The beauty of this new benchmark is in its codepath: the 32 sections of code (which decompress) are executed either one-by-one (ST) or all-at-once (MT).

This allows us to safely compare ST vs MT.

I called the benchmark, poetically, 'Freaky_Dreamer'; it requires:
- AVX support: AMD Vishera/Zambezi, Intel 2500K and newer;
- 12GB of free RAM: 32 blocks (88MB each) of compressed data are decompressed to 32 blocks (260MB each) of uncompressed data.

What affects the IPC result?

1] CPU clock;
2] RAM clock;
3] RAM latency;
4] Cache sizes & levels.

The 46 instructions executed in each iteration of the decompression loop are:
Code:
.B30.3::                        
  00030 45 8b 38         mov r15d, DWORD PTR [r8]               
  00033 44 89 f9         mov ecx, r15d                          
  00036 83 f1 03         xor ecx, 3                             
  00039 41 bc ff ff ff 
        ff               mov r12d, -1                           
  0003f c1 e1 03         shl ecx, 3                             
  00042 bd 01 00 00 00   mov ebp, 1                             
  00047 41 d3 ec         shr r12d, cl                           
  0004a 45 23 fc         and r15d, r12d                         
  0004d 45 33 e4         xor r12d, r12d                         
  00050 45 89 fe         mov r14d, r15d                         
  00053 45 89 fb         mov r11d, r15d                         
  00056 41 83 e6 0f      and r14d, 15                           
  0005a 48 89 c1         mov rcx, rax                           
  0005d 41 83 fe 0c      cmp r14d, 12                           
  00061 44 0f 44 e5      cmove r12d, ebp                        
  00065 4c 89 c5         mov rbp, r8                            
  00068 41 c1 eb 04      shr r11d, 4                            
  0006c 49 ff cc         dec r12                                
  0006f 45 89 da         mov r10d, r11d                         
  00072 4d 89 e6         mov r14, r12                           
  00075 49 2b ca         sub rcx, r10                           
  00078 49 f7 d6         not r14                                
  0007b 48 ff c9         dec rcx                                
  0007e 49 23 ee         and rbp, r14                           
  00081 49 23 cc         and rcx, r12                           
  00084 41 ff c3         inc r11d                               
  00087 4d 23 d6         and r10, r14                           
  0008a 4d 23 de         and r11, r14                           
  0008d c5 fe 6f 44 29 
        01               vmovdqu ymm0, YMMWORD PTR [1+rcx+rbp]  
  00093 44 89 fd         mov ebp, r15d                          
  00096 83 e5 03         and ebp, 3                             
  00099 41 83 e7 0c      and r15d, 12                           
  0009d ff c5            inc ebp                                
  0009f 41 83 c7 04      add r15d, 4                            
  000a3 89 e9            mov ecx, ebp                           
  000a5 c1 e9 02         shr ecx, 2                             
  000a8 41 d3 e7         shl r15d, cl                           
  000ab 49 23 ec         and rbp, r12                           
  000ae 4d 23 fc         and r15, r12                           
  000b1 4c 03 dd         add r11, rbp                           
  000b4 4d 03 d7         add r10, r15                           
  000b7 4d 03 c3         add r8, r11                            
  000ba c5 fe 7f 00      vmovdqu YMMWORD PTR [rax], ymm0        
  000be 49 03 c2         add rax, r10                           
  000c1 4d 3b c1         cmp r8, r9                             
  000c4 0f 82 66 ff ff 
        ff               jb .B30.3 

This loop is run over each of the 32 blocks; the actual codepath (ST & MT) and the formula for IPC are given:
Code:
...
#ifdef Commence_OpenMP
                printf("Enforcing %d thread(s).\n", NumberOfThreadsToPlayWith);
#else
                printf("Enforcing 1 thread.\n");
#endif

#ifdef Commence_OpenMP
                printf("omp_get_num_procs( ) = %d\n", omp_get_num_procs( ));
                printf("omp_get_max_threads( ) = %d\n", omp_get_max_threads( ));
#endif

#if defined(_icl_mumbo_jumbo_)
ticksStart = GetRDTSC();
#endif

#ifdef Commence_OpenMP
#pragma omp parallel shared(TargetBlock, SourceBlock, TargetFileSize, SourceFileSize) private(TargetSize001,TargetSize002,TargetSize003,TargetSize004,TargetSize005,TargetSize006,TargetSize007,TargetSize008,TargetSize009,TargetSize010,TargetSize011,TargetSize012,TargetSize013,TargetSize014,TargetSize015,TargetSize016,TargetSize017,TargetSize018,TargetSize019,TargetSize020,TargetSize021,TargetSize022,TargetSize023,TargetSize024,TargetSize025,TargetSize026,TargetSize027,TargetSize028,TargetSize029,TargetSize030,TargetSize031,TargetSize032)
#endif
{
#ifdef Commence_OpenMP
  #pragma omp sections
#endif
    {

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 001:
                TargetSize001 = Decompress001(TargetBlock+(uint64_t)(1-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(1-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize001) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 002:
                TargetSize002 = Decompress002(TargetBlock+(uint64_t)(2-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(2-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize002) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 003:
                TargetSize003 = Decompress003(TargetBlock+(uint64_t)(3-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(3-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize003) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 004:
                TargetSize004 = Decompress004(TargetBlock+(uint64_t)(4-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(4-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize004) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }
/*
#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 005:
                TargetSize005 = Decompress005(TargetBlock+(uint64_t)(5-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(5-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize005) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 006:
                TargetSize006 = Decompress006(TargetBlock+(uint64_t)(6-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(6-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize006) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 007:
                TargetSize007 = Decompress007(TargetBlock+(uint64_t)(7-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(7-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize007) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 008:
                TargetSize008 = Decompress008(TargetBlock+(uint64_t)(8-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(8-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize008) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 009:
                TargetSize009 = Decompress009(TargetBlock+(uint64_t)(9-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(9-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize009) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 010:
                TargetSize010 = Decompress010(TargetBlock+(uint64_t)(10-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(10-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize010) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 011:
                TargetSize011 = Decompress011(TargetBlock+(uint64_t)(11-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(11-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize011) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 012:
                TargetSize012 = Decompress012(TargetBlock+(uint64_t)(12-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(12-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize012) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 013:
                TargetSize013 = Decompress013(TargetBlock+(uint64_t)(13-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(13-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize013) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 014:
                TargetSize014 = Decompress014(TargetBlock+(uint64_t)(14-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(14-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize014) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 015:
                TargetSize015 = Decompress015(TargetBlock+(uint64_t)(15-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(15-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize015) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 016:
                TargetSize016 = Decompress016(TargetBlock+(uint64_t)(16-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(16-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize016) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }


#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 017:
                TargetSize017 = Decompress017(TargetBlock+(uint64_t)(17-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(17-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize017) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 018:
                TargetSize018 = Decompress018(TargetBlock+(uint64_t)(18-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(18-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize018) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 019:
                TargetSize019 = Decompress019(TargetBlock+(uint64_t)(19-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(19-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize019) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 020:
                TargetSize020 = Decompress020(TargetBlock+(uint64_t)(20-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(20-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize020) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 021:
                TargetSize021 = Decompress021(TargetBlock+(uint64_t)(21-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(21-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize021) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 022:
                TargetSize022 = Decompress022(TargetBlock+(uint64_t)(22-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(22-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize022) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 023:
                TargetSize023 = Decompress023(TargetBlock+(uint64_t)(23-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(23-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize023) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 024:
                TargetSize024 = Decompress024(TargetBlock+(uint64_t)(24-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(24-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize024) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 025:
                TargetSize025 = Decompress025(TargetBlock+(uint64_t)(25-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(25-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize025) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 026:
                TargetSize026 = Decompress026(TargetBlock+(uint64_t)(26-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(26-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize026) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 027:
                TargetSize027 = Decompress027(TargetBlock+(uint64_t)(27-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(27-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize027) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 028:
                TargetSize028 = Decompress028(TargetBlock+(uint64_t)(28-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(28-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize028) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 029:
                TargetSize029 = Decompress029(TargetBlock+(uint64_t)(29-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(29-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize029) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 030:
                TargetSize030 = Decompress030(TargetBlock+(uint64_t)(30-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(30-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize030) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 031:
                TargetSize031 = Decompress031(TargetBlock+(uint64_t)(31-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(31-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize031) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }

#ifdef Commence_OpenMP
    #pragma omp section
#endif
        {
        // Thread 032:
                TargetSize032 = Decompress032(TargetBlock+(uint64_t)(32-1)*(TargetFileSize+512), SourceBlock+(uint64_t)(32-1)*SourceFileSize, SourceFileSize);
                if (TargetFileSize != TargetSize032) { printf("Lexx: Failure! Decompressed size mismatch!\n"); exit(13); }
        }
*/
    }
}// pragma

#ifdef Commence_OpenMP
                printf("All threads finished.\n");
#endif

#if defined(_icl_mumbo_jumbo_)
ticksTOTAL2 = ticksTOTAL2 + GetRDTSC() - ticksStart;
#endif

                printf("Decompression time: %s ticks.\n", _ui64toaKAZEcomma(ticksTOTAL2, llTOaDigits, 10));   
                printf("TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: %.3f\n", (float)(ticksTOTAL2) / (float)((float)46*loopcounterfor411*NumberOfThreadsToPlayWith) );
                printf("IPC (Instructions_Per_Clock_during_branchless_decompression) performance: %.3f\n\n", (float)((float)46*loopcounterfor411*NumberOfThreadsToPlayWith) / (float)(ticksTOTAL2) );
...

This time the results are dumped into 'Results.txt', so please share your CPU-RAM stats and 'Results.txt'.

'Freaky_Dreamer' contains:
Code:
D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>dir
 
07/13/2015  05:14 PM        91,964,279 Autobiography_411-ebooks_Collection.tar.Nakamichi
07/13/2015  05:14 PM               287 Get_IPC.bat
07/13/2015  05:14 PM         1,114,552 libiomp5md.dll
07/13/2015  05:14 PM             1,228 MakeEXEs_Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC.bat
07/13/2015  05:14 PM             1,632 MokujIN 224 prompt.lnk
07/13/2015  05:14 PM           129,024 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_1-thread.exe
07/13/2015  05:14 PM           345,439 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.c
07/13/2015  05:14 PM         2,054,019 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.cod
07/13/2015  05:14 PM           131,584 Nakamichi_Oniyanma_Monsterdragonfly_Lexx_IPC_32-threads.exe
07/13/2015  05:14 PM             6,144 timer64.exe

D:\_KAZE\Instructions_per_tick_during_branchless_decompression_32-threaded>

You may download the package here.

The 'Results.txt' looks like this on my Core 2:
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Current priority class is REALTIME_PRIORITY_CLASS.
Allocating 367,857,628 bytes...
Allocating 1,093,609,472 bytes...
Source&Target buffers are allocated.
Simulating we have 4 blocks for decompression...
Enforcing 1 thread.
Decompression time: 15,863,949,483 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 2.897
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 0.345


Kernel  Time =     0.889 =    5%
User    Time =     5.444 =   32%
Process Time =     6.333 =   38%    Virtual  Memory =   1398 MB
Global  Time =    16.615 =  100%    Physical Memory =   1399 MB

Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Current priority class is REALTIME_PRIORITY_CLASS.
Allocating 367,857,628 bytes...
Allocating 1,093,609,472 bytes...
Source&Target buffers are allocated.
Simulating we have 4 blocks for decompression...
Enforcing 4 thread(s).
omp_get_num_procs( ) = 4
omp_get_max_threads( ) = 4
All threads finished.
Decompression time: 7,282,914,164 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 1.330
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 0.752


Kernel  Time =     1.544 =   11%
User    Time =    10.015 =   76%
Process Time =    11.559 =   88%    Virtual  Memory =   1400 MB
Global  Time =    13.027 =  100%    Physical Memory =   1400 MB

Since I have neither AVX nor 12GB of free RAM, for the sake of the dump I reduced the 32 threads to 4.
Now I want to see the biggest IPC on this forum. My guess is that the CPU-RAM subsystem should feature both a high clock & low RAM latency, something like a 4790K @4.7 with RAM @3000.

Needless to say, I wrote 'Freaky_Dreamer' in order to compare the IPCs of the 5960X and the upcoming AMD 'Zen'. Also the 5775C (if they manage to tune its L4) can show some high IPC.

Get down get down get down get it on show love and give it up
What are you waiting on?
post #2 of 27 (permalink) Old 07-14-2015, 02:53 PM - Thread Starter
Sanmayce
Thanks to bonami2 here comes the current best IPC I have ever seen:

CPU: Intel i7-4790k 4.7GHz 1.3v
RAM: G.SKILL Trident X Series 16GB (4 x 4GB) 240-Pin DDR3 SDRAM DDR3 2400 (PC3 19200) Cas Latency: 10

His ST IPC is 0.754 whereas his MT IPC is just 2.322. This is strange: I expected the MT IPC to be at least 4 (8*0.754=6.032 at most).

I guess my hopes were too high; up until now I thought that fighting for RAM accesses was a cakewalk.
Let's see what decompression speed, in bytes/clock, that 2.322 equals:

(32 * 273,401,856) bytes / 18,867,608,955 ticks = 0.464 bytes/clock, ugh, I wanted much more.

From a different angle, that 2.322 is not so hopeless; it equals:

(32 * 273,401,856) bytes / (18,867,608,955 ticks / 4,700,000,000 ticks per second) = 2,179,377,325 bytes/second, or 2,078 MB/s decompression speed of English texts.

His 'Results.txt' contains:
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 1 thread.
Decompression time: 58,081,071,043 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 1.326
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 0.754



Kernel Time = 1.921 = 11%
User Time = 12.812 = 77%
Process Time = 14.734 = 89% Virtual Memory = 11173 MB
Global Time = 16.450 = 100% Physical Memory = 11152 MB
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 8
omp_get_max_threads( ) = 8
All threads finished.
Decompression time: 18,867,608,955 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.431
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 2.322



Kernel Time = 3.656 = 54%
User Time = 33.140 = 494%
Process Time = 36.796 = 549% Virtual Memory = 11175 MB
Global Time = 6.697 = 100% Physical Memory = 11153 MB

It would be interesting to see what an Intel Xeon E5-2698 v3 could offer with its 32 threads.
I saw one guy showing how in ST the Xeon lags behind but in MT smokes the 5960X:



CPU: Intel Xeon [email protected] Cooled with Corsair H80i
MB: Asus X99-E WS
RAM: 32Gb Crucial Vengeance LPX DDR4 2667Mhz CL16 (8x4Gb Fully Populated)

post #3 of 27 (permalink) Old 07-14-2015, 03:28 PM - Thread Starter
Sanmayce
I once asked which words map onto numbers, so that letters can be used instead of digits; just now I found some light on the topic:
https://en.wikipedia.org/wiki/IUPAC_numerical_multiplier

So,
1-threaded equals mono-threaded
2-threaded equals di-threaded
4-threaded equals tetra-threaded
8-threaded equals octa-threaded
12-threaded equals dodeca-threaded
16-threaded equals hexadeca-threaded
18-threaded equals octadeca-threaded
32-threaded equals dotriaconta-threaded
64-threaded equals tetrahexaconta-threaded
128-threaded equals octacosahecta-threaded
256-threaded equals hexapentacontadicta-threaded

Every number should be alphabet translatable.

post #4 of 27 (permalink) Old 07-19-2015, 02:43 PM
BinaryDemon
Linux Lobbyist
Join Date: Sep 2007
Posts: 6,636
Rep: 485 (Unique: 408)
My 5960x @ 4.2ghz:
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 1 thread.
Decompression time: 59,803,022,277 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 1.365
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 0.733



Kernel  Time =     2.328 =   10%
User    Time =    18.218 =   81%
Process Time =    20.546 =   92%    Virtual  Memory =  11173 MB
Global  Time =    22.273 =  100%    Physical Memory =  11152 MB
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 16
omp_get_max_threads( ) = 16
All threads finished.
Decompression time: 11,476,279,165 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.262
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 3.818



Kernel  Time =     7.156 =  109%
User    Time =    54.171 =  826%
Process Time =    61.328 =  935%    Virtual  Memory =  11176 MB
Global  Time =     6.553 =  100%    Physical Memory =  11154 MB

Checkout my DOSBox LIVEusb - https://sites.google.com/site/dosboxdistro/
Retrogaming made portable.

PassMark System Score: Passmark Rating 5,710, CPU Mark 19,985 CPU-Z Validation: LINK AIDA64: LINK Cinebench15: LINK Geekbench3 scores: LINK Geekbench4.1 scores: LINK UserBenchmarks: CPU: 105.5%, GPU: 111.7%, MEM: 127.9%
post #5 of 27 (permalink) Old 07-19-2015, 03:00 PM - Thread Starter
Sanmayce
Many thanks BinaryDemon, very much appreciated.

We have a new best IPC result, 3.818, obtained on:


CPU: i7-5960x @ 4.26 ghz core / 3.55 ghz uncore - 1.3v
RAM: 16gb (4x4gb) Crucial DDR4 2133 CL15

Seeing the CPU utilization being around 900% (of possible 1600%) tells me that threads are NOT-THAT-BADLY fed.

Are those 4 sticks in quad-channeled config?
Code:
Decompression time: 11,476,279,165 ticks.

The 4790K @4.7 with dual-channeled DDR3 @2400 gave:
Code:
Decompression time: 18,867,608,955 ticks.

I expected the 5960X to halve that; I see two main reasons for that not happening:
- lower CPU clock;
- higher RAM latency (for roughly 17MB of that 88MB compressed block, RAM reads outside a 1MB sliding window are used).

Anyway, great result, at last I can say 5960X decompresses at nearly 4 Instructions-Per-Clock.

post #6 of 27 (permalink) Old 07-19-2015, 03:10 PM
BinaryDemon
My memory is operating in Quad Channel, but given my conservative overclock and budget DDR4 I expect most other 5960x owners would best 3.818.

post #7 of 27 (permalink) Old 07-19-2015, 03:20 PM - Thread Starter
Sanmayce
Quote:
Originally Posted by BinaryDemon

My memory is operating in Quad Channel, but given my conservative overclock and budget DDR4 I expect most other 5960x owners would best 3.818.

In my eyes those who push the voltage above 1.2v are asking for premature wear. Your overclock is not reckless, I guess.
Please tell me, if you have tested it, what your RAM latency is. Somewhere in the DDR4 reviews I got scared by figures of some 70ns; they said it should get better with time since DDR4 is still young.

post #8 of 27 (permalink) Old 07-19-2015, 03:47 PM - Thread Starter
Sanmayce
The RAM latency of the 4790K was 49ns.


post #9 of 27 (permalink) Old 07-19-2015, 04:59 PM
BinaryDemon
Worse than you predicted, 83 ns. Guess I might want to try and tighten that up.



Oops, glad I looked. My DRAM timings were way conservative when set to AUTO. I'm getting 67ns now.

I should probably re-run the benchmark too.
Code:
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 1 thread.
Decompression time: 50,476,507,752 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 1.152
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 0.868



Kernel  Time =     2.359 =   12%
User    Time =    15.656 =   80%
Process Time =    18.015 =   92%    Virtual  Memory =  11173 MB
Global  Time =    19.480 =  100%    Physical Memory =  11152 MB
Nakamichi 'Oniyanma-Monsterdragonfly-Lexx_IPC', written by Kaze, based on Nobuo Ito's LZSS source, babealicious suggestion by m^2 enforced, muffinesque suggestion by Jim Dempsey enforced.
Allocating 2,942,857,440 bytes...
Allocating 8,748,875,776 bytes...
Source&Target buffers are allocated.
Simulating we have 32 blocks for decompression...
Enforcing 32 thread(s).
omp_get_num_procs( ) = 16
omp_get_max_threads( ) = 16
All threads finished.
Decompression time: 9,564,708,596 ticks.
TPI (Ticks_Per_Instructions_during_branchless_decompression) performance: 0.218
IPC (Instructions_Per_Clock_during_branchless_decompression) performance: 4.581



Kernel  Time =    10.125 =  178%
User    Time =    42.140 =  740%
Process Time =    52.265 =  918%    Virtual  Memory =  11176 MB
Global  Time =     5.688 =  100%    Physical Memory =  11154 MB

Wow, significant improvement, although I wouldn't have guessed it would exceed 4 IPC.

post #10 of 27 (permalink) Old 07-20-2015, 01:13 PM - Thread Starter
Sanmayce
You halved the 4790K result, very very good!
Quote:
Originally Posted by BinaryDemon

...
Wow, significant improvement, although I wouldn't have guessed it would exceed 4 IPC.

Yes, RAM latency is a key factor in 'Freaky_Dreamer'. In my view, when really modern CPUs with some sort of on-die "inlined" RAM start to hit the mainstream, we can expect even an eightfold increase in MT IPC, to 32 and above.

Algorithms similar to LZSS (with a 256MB sliding window), such as the BWT (Burrows–Wheeler Transform) used in bzip and bsc, and many other modern compressors that stress the RAM (they also do many reads/writes outside the LLC (Last-Level Cache)), fall into one category when we need a single number (like MT IPC) to estimate their performance.

Thanks for sharing this useful information, now we have a new best MT IPC: 4.581.

Also, the ST IPC vs MT IPC comparison is very informative: the 5960X executes one and the same code 4.581/0.868≈5.3x faster when multi-threaded!
