·
Integer Benchmarker
Joined
·
434 Posts
Discussion Starter #1 (Edited)
Speaking of virtual memory performance, I would like to collect useful statistics on random 4KiB read, write and mixed read/write IOPS.

The scenario that interests me is the actual performance of modern SSDs when a single-threaded application starts bombarding the drive with small requests (128B to 4KiB) across a pool of 48GB or more ('Test Size' in CrystalDiskMark terminology). Here is what Samsung claims for the 860 PRO:

CACHE MEMORY
512 MB Low Power DDR4 (256 GB, 512 GB)
1 GB Low Power DDR4 (1,024 GB)
2 GB Low Power DDR4 (2,048 GB)
4 GB Low Power DDR4 (4,096 GB)

RANDOM READ (4KB, QD1)
Up to 11,000 IOPS

RANDOM WRITE (4KB, QD1)
Up to 43,000 IOPS

Since the Japanese author of CrystalDiskMark didn't make it report IOPS, I have to compute them manually by dividing the 4KiB Q1T1 score by 4KiB.

This calls for a 'Test Size' much bigger than the default 1GiB; otherwise the test mainly stresses the cache, which is also useful to know.

I intend to buy the 256GB model soon and put it to the test. At the moment I can only test my latest SSD, the cheapest Kingston: the A400 2.5" 240GB SATA III TLC:



#1 Screenshot: 1GiB



So, (5.411*1000*1000)/(4*1024) = 1,321 IOPS (random reads within 1GiB)

#2 Screenshot: 16GiB



So, (5.465*1000*1000)/(4*1024) = 1,334 IOPS (random reads within 16GiB)

#3 Screenshot: 32GiB



So, (5.307*1000*1000)/(4*1024) = 1,295 IOPS (random reads within 32GiB)
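
The three divisions above can be wrapped in a small helper. This is only a sketch of the same arithmetic, assuming (as the formula in this post does) that the 4KiB Q1T1 score is in decimal MB/s; the function name `mbps_to_iops` is made up for illustration.

```python
def mbps_to_iops(mb_per_s, block_bytes=4 * 1024):
    """Convert a throughput in decimal MB/s into I/O operations per second."""
    return mb_per_s * 1000 * 1000 / block_bytes


# The three 4KiB Q1T1 read scores quoted in this post, keyed by Test Size.
scores = {"1GiB": 5.411, "16GiB": 5.465, "32GiB": 5.307}

for test_size, score in scores.items():
    # int() truncates, which matches how the figures above were written down.
    print(f"{test_size}: {int(mbps_to_iops(score)):,} IOPS")
```

Passing a different `block_bytes` (for example 128) gives the equivalent figure for smaller requests, provided the benchmark really issued requests of that size.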

Notice the inconsistency: write performance is halved and then restored.

By the way, I am currently writing my own I/O code that bypasses malloc() and does not rely on the OS virtual memory allocator. All small blocks/pools are to be handled by a single fread()/fwrite() each.
Naturally, an I/O-intensive bombardment of a 48+GB pool with chunks of about 128B requires much more than 10,000 IOPS. Optane is promising, but I want to dig into regular SSDs first.
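
To make the access pattern concrete, here is a rough sketch of such a bombardment: single-threaded, one seek plus one read per 128B chunk, at random offsets in a file. It is not the code mentioned above, only an illustration; the name `random_reads_per_second` and its parameters are made up, and the OS file cache is not bypassed, so a file smaller than RAM will mostly measure the cache.

```python
import os
import random
import tempfile
import time


def random_reads_per_second(path, chunk=128, reads=10_000, seed=1):
    """Time `reads` random `chunk`-byte reads from `path`; return reads/second."""
    size = os.path.getsize(path)
    rng = random.Random(seed)
    # buffering=0 turns off Python's own buffer; the OS cache still applies.
    with open(path, "rb", buffering=0) as f:
        start = time.perf_counter()
        for _ in range(reads):
            f.seek(rng.randrange(0, size - chunk + 1))
            data = f.read(chunk)
            assert len(data) == chunk
        elapsed = time.perf_counter() - start
    return reads / elapsed


if __name__ == "__main__":
    # Tiny throwaway file just to show the call; a meaningful run needs a
    # pool far bigger than RAM (this post talks about 48+GB).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(4 * 1024 * 1024))
    try:
        print(f"{random_reads_per_second(tmp.name):,.0f} reads/s")
    finally:
        os.remove(tmp.name)
```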

Incoming results for ...



In the meantime, you are welcome to share screenshots of your own.
 


·
Integer Benchmarker
Joined
·
434 Posts
Discussion Starter #2
Even though I used the latest downloadable version of CDM, I found it somewhat superficial. ATTO is another benchmark tool, far better in my view; I just downloaded the latest version and will be playing with it on my Kingston A400 240GB and the incoming Samsung 860 PRO 256GB.

Huge discrepancy between CDM and ATTO results:

For Kingston A400 240GB:
4KB QD1 3,470 IOPS/reads within 16GB

 


·
Integer Benchmarker
Joined
·
434 Posts
Discussion Starter #3 (Edited)
Okay, I want to share my experience with the 4KB performance of a RAMDISK and the Samsung 860 PRO (connected via a USB 3.0 port) on my laptop 'Compressionette': i5-7200U, 8GB DDR4 2133MHz, Windows 10.

For some reason unknown to me, the 4KB Queue Depth 1 discrepancy between the latest CrystalDiskMark and ATTO Disk Benchmark is unacceptable: 754.6MB/s vs 195.80MB/s. Why so?!

On the screenshot, drive D: is the 256GB Samsung 860 PRO and drive R: is the ImDisk RAMDisk:



I am beginning to develop a hate for those synthetic benchmarks. To counter them with a real-world benchmark I will make my own package, with C source included. It will report the actual performance of small requests (like 128B) while sorting millions or billions of 128B keys. It is very versatile and allows torturing the whole system (the OS cache as well)... The package's name is ... 'Sandokan'...

The showdown of Windows_10_sort vs External_QuickSort vs Niemann_replacement_selection+polyphase_merge_sort will answer the need for authentic benchmarking. Suggestions are welcome.
For a long time I have wanted to throw at any machine a text file with billions of lines like these to sort:
Code:
R:\Sandokan_r3_vs_Windows-sort_vs_TN-sort>type KT3M.txt|more
A8C7E8G7H5G3H1F2H3G1E2C1A2B4A6B8D7F8H7G5F7H8G6H4G2E1C2A1B3A5B7D8C6A7C8E7G8H6G4H2F1D2B1A3B5D6F5D4F3E5C4B2D3F4E6C5A4B6D5F6E4C3D1E3
A8C7E8G7H5G3H1F2H3G1E2C1A2B4A6B8D7F8H7G5F7H8G6H4G2E1C2A1B3A5B7D8C6A7C8E7G8H6G4H2F1D2B1A3B5D6F5D4F3E5C4B2D3F4E6C5A4B6D5E3D1C3E4F6
A8C7E8G7H5G3H1F2H3G1E2C1A2B4A6B8D7F8H7G5F7H8G6H4G2E1C2A1B3A5B7D8C6A7C8E7G8H6G4H2F1D2B1A3B5D6F5D4F3E5C4B2D3F4E6C5E4F6D5B6A4C3D1E3
A8C7E8G7H5G3H1F2H3G1E2C1A2B4A6B8D7F8H7G5F7H8G6H4G2E1C2A1B3A5B7D8C6A7C8E7G8H6G4H2F1D2B1A3B5D6F5D4F3E5C4B2D3F4E6C5E4F6D5E3D1C3A4B6
A8C7E8G7H5G3H1F2H3G1E2C1A2B4A6B8D7F8H7G5F7H8G6H4G2E1C2A1B3A5B7D8C6A7C8E7G8H6G4H2F1D2B1A3B5D6F5D4F3E5C4B6A4B2D3C5E6F4D5F6E4C3D1E3
...
Too often a nasty thought of helplessness strikes me: despite the constant boosts in hardware/software, I still cannot sort ... 7,000,000,000 names of people. Ugh, shame!
 


·
Integer Benchmarker
Joined
·
434 Posts
Discussion Starter #4
For a moment I thought of creating a dedicated thread on sorting 7,000,000,000 Knight-Tours (128 bytes each) with my amateurish ExternalQuickSort Linux/Windows tool. But it needs a machine with 64GB RAM (8 bytes x 7 billion = 52GiB for the pointers) and an SSD with 2 x 7,000,000,000 x 128 bytes = 1,669GiB of free space. When I get my hands on such a machine, I will definitely run it. For now, the package sorts 22,000,000 Knight-Tours; three sort tools are run on the RAMDISK and on the Samsung 860 PRO (connected externally via USB 3.0), thus 6 results.
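
The two capacity figures above follow from simple multiplication; the sketch below redoes them in binary gigabytes (GiB), which is the unit in which the 52 and 1,669 come out. The function name and its parameters are made up for illustration.

```python
GIB = 1024 ** 3


def external_sort_requirements(keys, key_bytes=128, pointer_bytes=8, copies=2):
    """Return (RAM for the pointer array, disk space for the data), in GiB."""
    pointer_ram_gib = keys * pointer_bytes / GIB
    disk_gib = copies * keys * key_bytes / GIB  # input file plus sorted output
    return pointer_ram_gib, disk_gib


ram, disk = external_sort_requirements(7_000_000_000)
print(f"pointers: {ram:.0f} GiB of RAM, data: {disk:.0f} GiB of disk")
```

Note that the count uses 128 bytes per key; the files in the log below are slightly larger because each line also carries a line terminator.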

As can be seen on the screenshot taken on the i5-7200U with 8GB DDR4, the real-world performance under a load of 128-byte random reads is 279,656 RandomReads_128B_long-Per-Second. Single-threaded, 64-bit code.
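
That figure can be re-derived from two numbers in the /slow run of the log: NumberOfComparisons and the sort time in clocks. The sketch assumes one clock is one millisecond (the log's total of 4,709,532 clocks sits next to a Global Time of 4709.734 seconds) and that each comparison costs two random reads, as the log's own "i.e." suggests. The function name is made up.

```python
def sort_throughput(comparisons, clocks_ms, reads_per_comparison=2):
    """Return (comparisons per second, random reads per second)."""
    seconds = clocks_ms / 1000
    comparisons_per_s = comparisons / seconds
    return comparisons_per_s, comparisons_per_s * reads_per_comparison


# Values taken from the /slow (external access) run on the Samsung 860 PRO.
cmp_s, reads_s = sort_throughput(597_577_524, 4_273_654)
print(f"{int(cmp_s):,} comparisons/s, {int(reads_s):,} random reads/s")
```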

Code:
D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>dir

08/20/2018  04:11 PM         2,824,696 atto-disk-benchmark-4000f2.zip
08/20/2018  04:11 PM               187 Compile_Sandokan_Intel_32.bat
08/20/2018  04:11 PM               187 Compile_Sandokan_Intel_64.bat
08/20/2018  04:11 PM         2,132,181 Compressionette.png
08/20/2018  04:11 PM             1,138 GENERATE_Xmillion_Knight-Tours_and_SORT_them.bat
08/20/2018  05:13 PM             1,138 GENERATE_Xmillion_Knight-Tours_and_SORT_them_64bit.bat
08/20/2018  04:11 PM           572,899 ImDiskTk-x64.exe
08/20/2018  04:11 PM           607,032 ImDiskTk.exe
08/20/2018  04:11 PM             1,569 KAZE prompt.lnk
08/20/2018  04:11 PM            25,166 Knight-tour_r8dump.c
08/20/2018  04:11 PM            77,312 Knight-tour_r8dump_32bit_Intel.exe
08/20/2018  04:21 PM     2,860,000,000 kt22.txt
08/20/2018  04:11 PM             1,631 MokujIN GREEN 224 prompt.lnk
08/20/2018  04:11 PM            15,832 Niemann_replacement_selection+polyphase_merge_sort.c
08/20/2018  04:11 PM            81,920 Niemann_replacement_selection+polyphase_merge_sort.exe
08/20/2018  05:58 PM            14,029 Results_i5-7200U.txt
08/20/2018  04:11 PM           553,472 Sandokan_Logo.doc
08/20/2018  04:11 PM           381,683 Sandokan_Logo.pdf
08/20/2018  04:11 PM           170,813 Sandokan_QuickSortExternal_4+GB.c
08/20/2018  04:11 PM           118,784 Sandokan_QuickSortExternal_4+GB_32bit_Intel.exe
08/20/2018  04:11 PM           128,000 Sandokan_QuickSortExternal_4+GB_64bit_Intel.exe
08/20/2018  04:11 PM             9,458 sha1sum.c
08/20/2018  04:11 PM            80,384 sha1sum_32bit_Intel.exe
08/20/2018  04:11 PM            49,152 sha1sum_Microsoft_V16_32bit_Ox.exe
08/20/2018  04:11 PM             4,096 Timer.exe

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>GENERATE_Xmillion_Knight-Tours_and_SORT_them_64bit.bat kt22.txt r: 11 11000 22000000

Microsoft Windows [Version 10.0.15063]
Sorting these 22000000 Knight-Tours ...

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>timer Sandokan_QuickSortExternal_4+GB_64bit_Intel.exe kt22.txt /slow /ascend 768
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 2,860,000,000
Counting lines ...
Lines encountered: 22,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 167
Assigning pointers ...
SLOW! Get on with slow external accesses.
Sorting 22,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
- RightEnd: 000,003,891,611; NumberOfSplittings: 0,002,454,015; Done: 100% ...
NumberOfComparisons: 597,577,524
The time to sort 22,000,000 items via Quicksort+Insertionsort was 4,273,654 clocks.
Performance: 139,828 Comparisons_128B_long-Per-Second i.e 279,656 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=0)...
\ Done 100% ...
Dumped 22,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 277,257 clocks.
Total time: 4,709,532 clocks.
Performance: 607 bytes/clock.
Done successfully.

Kernel Time  =  3508.390 =   74%
User Time    =   398.671 =    8%
Process Time =  3907.062 =   82%
Global Time  =  4709.734 =  100%

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>sha1sum_32bit_Intel.exe "QuickSortExternal_4+GB.txt"
7382033463276ef70213b52b18391ceede5fe7ef  QuickSortExternal_4+GB.txt

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>del "QuickSortExternal_4+GB.txt"

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>timer Niemann_replacement_selection+polyphase_merge_sort.exe kt22.txt kt22.txt.Niemann 11 11000
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
extSort: nFiles=11, nNodes=11000, lrecl=130

Kernel Time  =    62.250 =    8%
User Time    =   147.156 =   20%
Process Time =   209.406 =   28%
Global Time  =   728.065 =  100%

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>sha1sum_32bit_Intel.exe kt22.txt.Niemann
7382033463276ef70213b52b18391ceede5fe7ef  kt22.txt.Niemann

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>del kt22.txt.Niemann

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>timer sort /M 1048576 /T r: kt22.txt /O kt22.txt.Windows
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31

Kernel Time  =     3.421 =    1%
User Time    =   129.375 =   45%
Process Time =   132.796 =   46%
Global Time  =   283.895 =  100%

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>sha1sum_32bit_Intel.exe kt22.txt.Windows
7382033463276ef70213b52b18391ceede5fe7ef  kt22.txt.Windows

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>del kt22.txt.Windows

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>timer Sandokan_QuickSortExternal_4+GB_64bit_Intel.exe kt22.txt /fast /ascend 768
Timer 9.01 : Igor Pavlov : Public domain : 2009-05-31
Sandokan_QuickSortExternal_4+GB r.3+, written by Kaze, using Bill Durango's Quicksort source.
Size of input file: 2,860,000,000
Counting lines ...
Lines encountered: 22,000,000
Longest line (including CR if present): 129
Allocated memory for pointers-to-lines in MB: 167
Assigning pointers ...
sizeof(int), sizeof(void*): 4, 8
Trying to allocate memory for the file itself in MB: 2727 ... OK! Get on with fast internal accesses.
Uploading ...
Sorting 22,000,000 Pointers ...
Quicksort (Insertionsort for small blocks) commenced ...
- RightEnd: 000,003,891,611; NumberOfSplittings: 0,002,454,015; Done: 100% ...
NumberOfComparisons: 597,577,524
The time to sort 22,000,000 items via Quicksort+Insertionsort was 141,921 clocks.
Performance: 4,210,605 Comparisons_128B_long-Per-Second i.e 8,421,210 RandomReads_128B_long-Per-Second.
Dumping the sorted data (Regime=2)...
\ Done 100% ...
Dumped 22,000,000 lines.
OK! Incoming and resultant file's sizes match.
Dump time: 105,115 clocks.
Total time: 489,665 clocks.
Performance: 5,840 bytes/clock.
Done successfully.

Kernel Time  =    17.109 =    3%
User Time    =   192.281 =   39%
Process Time =   209.390 =   42%
Global Time  =   490.180 =  100%

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>sha1sum_32bit_Intel.exe "QuickSortExternal_4+GB.txt"
7382033463276ef70213b52b18391ceede5fe7ef  QuickSortExternal_4+GB.txt

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>del "QuickSortExternal_4+GB.txt"

D:\Sandokan_r3+_vs_Windows-sort_vs_TN-sort>
Despite being many times slower than the other tools, my Sandokan features a read-only mode, i.e. the drive is tortured with reads only. Along with that, the required output space is just N, the size of the input file.

I intend to write a 2-way Merge Sort dealing specifically with those haunting 7 billion keys; supersimplistic C code is given by this coolman:
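
The idea behind such a read-only mode can be sketched as follows: keep only the line offsets in memory, fetch the two lines from the file on every comparison, and write the output once at the end. This is not Sandokan's source; it is a minimal illustration that uses Python's built-in sort instead of Quicksort+Insertionsort, and the name `sort_by_offsets` is made up.

```python
import functools


def sort_by_offsets(src_path, dst_path):
    """Sort the lines of src_path into dst_path; src_path is only ever read."""
    # Pass 1: remember where each line starts (the in-memory "pointers").
    offsets = []
    with open(src_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)

    with open(src_path, "rb") as f:
        def line_at(offset):
            f.seek(offset)
            return f.readline()

        def compare(a, b):
            # Every comparison fetches both lines: two random reads.
            la, lb = line_at(a), line_at(b)
            return (la > lb) - (la < lb)

        offsets.sort(key=functools.cmp_to_key(compare))

        # Dump pass: the only writes, totalling the size of the input.
        with open(dst_path, "wb") as out:
            for offset in offsets:
                out.write(line_at(offset))
```

Calling `sort_by_offsets("kt22.txt", "sorted.txt")` would sort byte-wise, line by line; the memory cost is one offset per line rather than the lines themselves.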
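
For orientation, the core step of a 2-way merge sort is merging two already sorted files into one. The sketch below is not the code referred to above; it is only a minimal illustration of that step, in Python rather than C, with made-up names.

```python
def merge_two_sorted_files(path_a, path_b, path_out):
    """Merge two line-sorted files into path_out, reading both sequentially."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b, \
            open(path_out, "wb") as out:
        line_a, line_b = a.readline(), b.readline()
        # Repeatedly emit the smaller head line and advance that file.
        while line_a and line_b:
            if line_a <= line_b:
                out.write(line_a)
                line_a = a.readline()
            else:
                out.write(line_b)
                line_b = b.readline()
        # One input is exhausted; copy whatever is left of the other.
        while line_a:
            out.write(line_a)
            line_a = a.readline()
        while line_b:
            out.write(line_b)
            line_b = b.readline()
```

A full 2-way merge sort would first split the input into sorted runs that fit in RAM and then apply this step pass after pass until a single run remains; all of its reads and writes are sequential.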


Oh, and the package is capable of generating/sorting arbitrarily big Knight-Tours files; I successfully went beyond generating 1 trillion of them, that's right, 128 terabytes.
 
