Overclock.net banner
1 - 20 of 29 Posts

·
Registered
Joined
·
18 Posts
Discussion Starter · #1 · (Edited)
A tool written in Vulkan compute to stress-test video memory for correctness. Spiritual successor to memtestCL.
Releases at GitHub

Open-source, with prebuilt binaries available for Windows and Linux; also supports the aarch64 NVIDIA Jetson.
Simple to use - no parameters at all, except optional card selection.

Building the .exe via GitHub Actions is configured. So anybody can fork the repo, enable workflows on the Actions tab, make any small changes (even via the browser), commit, and GitHub will build the binary from your changes for you personally!
 

·
Registered
Joined
·
3,537 Posts
Very interesting... thank you for helping the community. Will try this out.
 

·
Iconoclast
Joined
·
33,727 Posts
Didn't see any security issues with the Windows version, though it accessed just about every .json file on C: that had 'vulkan' in it, for whatever reason.

Anyway, it seems to use bcrypt, and it loads the memory on my Navi21 parts fairly well, though not as heavily as things like ethash. The 100-second output interval is also a bit long. Most GPUs won't throw errors at borderline settings, so being able to see the bandwidth more frequently would help dial in settings faster.
 
  • Rep+
Reactions: z390e

·
Registered
Joined
·
18 Posts
Discussion Starter · #4 ·
About security: the code is open, but unfortunately I didn't provide build instructions usable on a Windows machine - the Windows binary was cross-built from Linux (and tested on Windows).
It seems I should switch to GitHub CI, so that downloaded files are directly traceable to the source code.

Accessing the .json files is done during Vulkan initialization, by the erupt library or the Windows Vulkan loader.

Displaying memory bandwidth is more of a debug feature. The tool uses a lot of memory bandwidth, but it does NOT try to achieve the maximum. The memory access pattern is "randomly-iteratively-designed": a mix of linear and random accesses that should be best for detecting errors, not for measuring peak memory bandwidth. bcrypt or any other crypto-based algorithm is not used in the testing or application logic; maybe it is imported by one of the compiled-in libraries. The tool is targeted as a "stability check tool", not a "memory bandwidth estimator".
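The "mix of linear and random accesses" idea can be sketched in Python. This is an illustration only - the function name, block size, and shuffling scheme are my own assumptions, not memtest_vulkan's actual pattern:

```python
import random

def mixed_access_offsets(total_words, block_words=256, seed=1):
    """Illustrative sketch (not the tool's real pattern): visit
    medium-sized sequential blocks in a shuffled order, so linear
    bursts exercise the data bus while random jumps between blocks
    exercise the address bus."""
    rng = random.Random(seed)
    starts = list(range(0, total_words, block_words))
    rng.shuffle(starts)  # random order of block starts
    for start in starts:
        # linear burst within each block
        for off in range(start, min(start + block_words, total_words)):
            yield off

offsets = list(mixed_access_offsets(1024, block_words=256))
assert sorted(offsets) == list(range(1024))  # every word visited exactly once
```

The point of the mix is that a purely linear sweep can hide address-bus faults, while purely random words waste bandwidth; shuffled medium-sized bursts get some of both.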

So, thanks for the notice - I'll add this to the README and make the stats output more frequent (30 seconds, I think).
 

·
Iconoclast
Joined
·
33,727 Posts
Tool is targeted as "stability check tool", not as "memory bandwidth estimator".
I understand this. I was just pointing out that, with memory EDC retrying transactions, looking at relative performance is often the only way to see VRAM stability issues. Performance will stop scaling with memory clock, or start to regress, often dozens or hundreds of MHz below where actual errors make it through.
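The symptom described here (throughput stops scaling with clock before hard errors appear) can be turned into a toy detector. Everything below is illustrative - the function, tolerance value, and data are my own, not part of memtest_vulkan:

```python
def first_scaling_break(clocks, bandwidths, tolerance=0.98):
    """Flag the first memory clock whose measured bandwidth falls below
    `tolerance` times what linear scaling from the previous point would
    predict - a rough stand-in for spotting EDC-retry slowdowns."""
    for i in range(1, len(clocks)):
        predicted = bandwidths[i - 1] * clocks[i] / clocks[i - 1]
        if bandwidths[i] < tolerance * predicted:
            return clocks[i]
    return None

# Bandwidth keeps scaling to 2200 MHz, then regresses at 2300 MHz:
print(first_scaling_break([2000, 2100, 2200, 2300],
                          [100.0, 105.0, 110.0, 104.0]))  # -> 2300
```

A more frequent bandwidth readout makes exactly this kind of per-step comparison practical while dialing in clocks.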
 
  • Rep+
Reactions: galkinvv

·
Registered
Joined
·
101 Posts
I understand this. I was just pointing out that, with memory EDC retrying transactions, looking at relative performance is often the only way to see VRAM stability issues. Performance will stop scaling with memory clock, or start to regress, often dozens or hundreds of MHz below where actual errors make it through.
Are you sure it's EDC and not because of changing memory table entries as clock is increased?
 

·
Iconoclast
Joined
·
33,727 Posts
Are you sure it's EDC and not because of changing memory table entries as clock is increased?
Yes, because the reduction can happen within timing straps, be delayed until errors are corrected at a sufficient rate, or countered with changes to voltage/cooling.
 

·
Registered
Joined
·
94 Posts
Found the time to try this tool out. It comes in pretty handy not only for finding VRAM errors, but also for pinpointing exactly where performance starts to degrade, even before errors occur.
I'd absolutely agree with @Blameless though, output should indeed be displayed more frequently, every 20-30 sec at most. Or maybe even make the interval user-configurable ;)

Anyway, nice work (y)
 

·
Registered
Joined
·
18 Posts
Discussion Starter · #9 ·
I understand this. I was just pointing out that, with memory EDC retrying transactions, looking at relative performance is often the only way to see VRAM stability issues. Performance will stop scaling with memory clock, or start to regress, often dozens or hundreds of MHz below where actual errors make it through.

Agreed.
I released a new version, v0.4.0, with an attempt to support AMD GPUs without validation errors; the stats printing period is also reduced to 30 seconds, as suggested in this thread.
The binaries can now also be downloaded from GitHub Actions: Screenshot added · GpuZelenograd/[email protected]
This makes the link between the source code and the binaries a bit more transparent.

However, action results are stored for only 90 days, so a standard release is also present: Release v0.4.0 - fixes for AMD and usability · GpuZelenograd/memtest_vulkan
 

·
Registered
Joined
·
101 Posts
That was quick, just started to try it (first linked version). Worked okay as-is with Fedora 36; W10 fails because the old NVIDIA driver isn't compatible with Vulkan, so it looks like I have to upgrade from 38x to something newer for it to work.
 

·
Registered
Joined
·
18 Posts
Discussion Starter · #11 ·
The tool requires only Vulkan 1.1, but most of the real testing was done with up-to-date drivers.
Since Vulkan is a relatively new technology, the goal was more "make a future-proof test" than "support old drivers/hardware".

I use dated drivers/hardware a lot, but for memory testing on those, older open-source tools already exist:
 

·
Registered
Joined
·
101 Posts
Updated to nvidia driver 516.

Running memtest_vulkan shows it is very strong in finding errors on my GTX 1050Ti.

Errors started showing at 2200MHz, no errors were seen below that although I only ran it for about 2 minutes each time.

Further up in frequency the errors come thick and fast, they scroll by too fast to read. Is it viable to have a reduced report to screen, maybe even one that doesn't scroll? The log file contains the details if needed.

Around 2350MHz the card collapses into a crash with horrible screen artifacts requiring a power reset. Didn't want to leave it in that state waiting to see if Windows would reset it.

Some description of what the output is showing with example would be helpful.

Maybe the "done_iter_on_err:" value could be signed.

It would be interesting to see the results on a 3080, which is reported by TechPowerUp to have infinite EDC replay. With multi-bit errors, do some still get through, or is it all quiet on the memtest_vulkan side?


Very nice functional app @galkinvv (y)

Note also that this 1050Ti's memory tables stop at 2250; it seems that shortly after that it picks some different timing, and while the line looks better it still has plenty of errors. Also, enabling/disabling EDC & EDC replay had no effect on this card.
 

·
Registered
Joined
·
18 Posts
Discussion Starter · #13 · (Edited)
Some description of what the output is showing with example would be helpful.

Maybe the "done_iter_on_err:" value could be signed.
Half of the output of v0.3.0 was actually developer-targeted debug. I've cut most of it in v0.4.0.

Here is a description of the output left in v0.4.0:



Further up in frequency the errors come thick and fast, they scroll by too fast to read. Is it viable to have a reduced report to screen, maybe even one that doesn't scroll? The log file contains the details if needed.
Noted this as a feature request, but can't promise anything about implementing it...

Also, for everybody: the tool is open-source and has building the .exe via GitHub Actions configured.
So you can fork the repo, enable workflows on the Actions tab, make any small changes you want (even via the browser), commit, and GitHub will build the binary from your changes for you personally!


I haven't tried overclocking a 3080 to see what kind of errors it would produce.

In theory there are a lot of different types of errors:
  • The single-bit errors, like in the image above. Such errors are counted in the ToggleCnt column 0x01, and the exact bit indices are counted in the SingleIdx column. Such errors may in theory be detected by EDC if they occur while being transmitted over the EDC-protected part of the GPU<->memory wires, but I'm not sure EDC helps if they occur during transfers between the GPU cache and the GPU core, or something like that.
  • The errors on the data-inversion bit (if not detected by EDC). Those should be counted in ToggleCnt columns 0x07/0x08, without SingleIdx info for them.
  • The multi-bit transmission errors. Those should be counted in ToggleCnt columns above 0x01, without SingleIdx info for them.
  • The errors flipped inside the memory chips themselves during data storage/"refresh cycles". This may be caused by too long a refresh period or other problems. memtest_vulkan uses a part of memory in a "write once at start but re-read every time" pattern - this is the reason the read GB is larger than the written GB. If data flips inside this part of memory, there would be an endless log of error messages marked with "Mode NEXT_RE_READ" (as opposed to Mode INITIAL_READ). Lowering the clocks without restarting the test doesn't get rid of such errors.
  • The errors on the address-transmission bus. memtest_vulkan is designed to perform reads on a non-sequential series of medium-sized sequential blocks, and if an address is wrongly interpreted by a memory chip, the result is complete garbage from the wrong cell. Data-bus EDC can't help here. These errors typically give completely random error patterns with a normal distribution of flipped-bit counts (so the typical number of flipped bits is 12-20 out of 32, and getting 1 flipped bit in this case is extremely unrealistic). The result looks like
    Code:
    Error found. Mode INITIAL_READ, total errors 0x2B788 out of 0x18000000 (0.04422069%)
    Errors address range: 0x6000E900..=0xBFDFF9FF  iteration:38
    values range: 0xFFFFA1A4..=0x0000166F   FFFFFFFF-like count:0    bit-level stats table:
             0x0 0x1  0x2 0x3| 0x4 0x5  0x6 0x7| 0x8 0x9  0xA 0xB| 0xC 0xD  0xE 0xF
    SinglIdx                 |                 |                 |                
    TogglCnt                2|   7  18   95 264| 8451786 40056770| 11k 15k  20k 23k
       0x1?  23k 21k  17k 12k|81944859 24701266| 486 248   62  29|   4   2        
    1sInValu                3|  19  66  223 700|17683704 6856 11k| 16k 21k  25k 26k
       0x1?  23k 17k  12k6327|2883 917  282  64|   9             |
  • Other critical errors inside the memory chips or the memory controller. This gives normal distributions for TogglCnt, but for 1sInValu the distribution may be different, since critical internal errors may be reported with some fixed patterns (0x00000000 or 0xFFFFFFFF for some EDC problems, 0x0BADAC?? for some NVIDIA problems).
  • Memory errors in the areas where the error counts themselves are stored)) This often shows as millions of errors in all table entries, typically with the total error count greater than the tested memory size. Such results are numerically garbage, but they mean that the GPU/memory is really mostly non-functional.
  • The errors in the GPU during calculation of addresses and expected values, or in the value comparison. This can lead to any reporting pattern at all, since the logic of the program is broken.
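My reading of the table columns above can be sketched as a small classifier. This is a hypothetical reimplementation based purely on the descriptions in this post, not the tool's actual code:

```python
def classify_error(expected, actual, toggle_cnt, single_idx, ones_in_value):
    """Update three histograms the way the output table is described:
    TogglCnt buckets by how many bits flipped, SinglIdx records which
    bit for single-bit flips, and 1sInValu buckets by the population
    count of the wrong value itself (my interpretation, not tool code)."""
    diff = (expected ^ actual) & 0xFFFFFFFF
    flipped = bin(diff).count("1")
    toggle_cnt[flipped] = toggle_cnt.get(flipped, 0) + 1
    if flipped == 1:
        bit = diff.bit_length() - 1  # index of the single flipped bit
        single_idx[bit] = single_idx.get(bit, 0) + 1
    ones = bin(actual & 0xFFFFFFFF).count("1")
    ones_in_value[ones] = ones_in_value.get(ones, 0) + 1

tc, si, ov = {}, {}, {}
classify_error(0x00000000, 0x00000400, tc, si, ov)  # single-bit flip at bit 10
```

With stats kept this way, a data-bus glitch shows up as a spike in low TogglCnt buckets with SinglIdx populated, while address-bus garbage spreads TogglCnt around 16 flipped bits with no SinglIdx entries, matching the cases listed above.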
 

·
Registered
Joined
·
101 Posts
Thank you for the nice explanation @galkinvv. I wouldn't think there would be many here who would fork; most probably just want something to click and run.

I have rolled back the NVIDIA driver to 378, and its Vulkan is 1.037, so memtest_vulkan exits after displaying the Ctrl-C info. The NVIDIA driver is too old, but it has worked well for me over the years, plus I'm not likely to upgrade to a newer card any time soon.

Below shows performance with a Pascal 1050Ti. The only regression seen is when a new table entry comes into effect. It starts giving some errors after a certain frequency, which with further increases will eventually cause a crash. There can be some throughput drop-off at the higher clocks if the GPU clock isn't high enough, such as with this card at a 1700MHz core clock; a 2000MHz core clock seems not too bad though.


 

·
Registered
Joined
·
18 Posts
Discussion Starter · #16 ·
Below shows performance with Pascal 1050Ti.
Thanks for sharing. It is useful for analyzing the relative importance of frequency vs. timings for video memory.

Btw, what is that upper window with the "RW BW ..." title - is it some GPU tool, or a plot created from text/table data?

About stable frequencies - your plot shows that memtest_vulkan finds errors at 2190MHz, and you also mention some other errors at 2240MHz. How were those measured? Visual artifacts or something else?


I wouldn't think there would be many here who would fork, probably just want something to click and run.
I understand this, it's fine :)

However, I think that the ability of GitHub to generate a "click-and-run" binary right after editing 1-2 lines in the browser is a much more accessible option than needing a full-fledged programming environment for even a small change.
 

·
Iconoclast
Joined
·
33,727 Posts
I've encountered an issue with ReBAR/SAM on AMD RDNA2 GPUs in Windows. With ReBAR enabled, the application will only allocate ~4GiB of VRAM, but with it disabled, it will allocate nearly all of it.
 

·
Registered
Joined
·
101 Posts
Kind of a crude tool that was written back in the early days of Pascal to try to see why performance would suddenly drop off at a specific memory clock. You can see the huge drop in the screenshot above. I had mostly forgotten about it until this thread. I did clean it up a little and added the memory check, which is almost as crude as the program itself and works with hard-coded values. It was strange using it on the newer NVIDIA driver, as admin privileges were required to change clocks - something that can be done from userland with the older driver!

The error checking just consists of two blocks of randomly generated data (1GiB each in the above example) that are copied to 2 of 3 blocks on the device with cudaMemcpy, then copy-shuffled among the 3 device blocks a number of times at each 10MHz memory clock increment, and used for bandwidth measurement. Actual checking isn't done until the full frequency span is completed: the first 2 device blocks are read back to the host and checked against the original data.
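That checking scheme can be simulated on the CPU. This is only a sketch of the logic as described (the real tool uses CUDA and cudaMemcpy; the function, block names, and sizes here are my own):

```python
import random

def crude_vram_check_sim(block_bytes=1024, shuffles=16, seed=7):
    """Pure-CPU sketch of the scheme described above: two random host
    blocks are copied to 2 of 3 "device" blocks, the device blocks are
    copy-shuffled repeatedly (standing in for device-to-device copies),
    and only at the end are the first two device blocks compared
    against the host originals."""
    rng = random.Random(seed)
    host = [bytes(rng.getrandbits(8) for _ in range(block_bytes)) for _ in range(2)]
    dev = [bytearray(host[0]), bytearray(host[1]), bytearray(block_bytes)]
    for _ in range(shuffles):
        # rotate data through the spare third block (swaps blocks 0 and 1)
        dev[2][:] = dev[0]
        dev[0][:] = dev[1]
        dev[1][:] = dev[2]
    # after an even number of swaps the data is back in its home blocks
    return bytes(dev[0]) == host[0] and bytes(dev[1]) == host[1]

print(crude_vram_check_sim())  # -> True on error-free "memory"
```

Deferring the comparison until the whole frequency span has run, as described above, trades early detection for uninterrupted bandwidth measurement: a flip at any clock step survives the shuffles and shows up in the final read-back.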

Errors appeared first with memtest_vulkan after 2190MHz, i.e. at 2200MHz and above. The errors after 2240MHz, i.e. at 2250MHz and above, were with my test method.
 

·
Registered
Joined
·
18 Posts
Discussion Starter · #19 ·
I've encountered an issue with ReBAR/SAM on AMD RDNA2 GPUs in Windows. With ReBAR enabled, the application will only allocate ~4GiB of VRAM, but with it disabled, it will allocate nearly all of it.
There is a known issue where "some Vulkan drivers on some GPUs under some conditions" fail to allocate a contiguous memory array larger than 4GB, for a reason not known to me. Maybe it can be solved by allocating several arrays, but for now the test just falls back to a smaller test area. It is not a perfect solution, but in practice the presence/absence of errors is most of the time identical for "all memory" and "only 4GB of video memory". This was verified by running the tool from a .cmd file with arguments:
Code:
memtest_vulkan 1 3000000000
pause
The first argument is the GPU index, the second is the memory size to test. Note that when executed this way it will not automatically pause at the end, so the pause command is needed.
 

·
Registered
Joined
·
301 Posts
Hi, thanks for creating this tool, will be interesting to see how accurate I was with my OC.

I've experienced a weird issue: the tool would allocate <1GB of VRAM, and the memory clocks would stay at 192MHz while the GPU was fully utilized. After restarting the tool a couple of times, it allocated ~4GB and the VRAM clocks were correct. As posted before, I have disabled ReBAR in the drivers and now I have the correct utilization. It seems the ReBAR issue can manifest in a few ways...
Here's the log from the weird run:
logging started at 2022-11-02T02:22:36.211457Z
Testing 1: Bus=0x2D:00 DevId=0x73FF 8GB AMD Radeon RX 6600 XT
1 iteration. Since last report passed 408.1136ms written 1.8GB, read: 3.5GB 12.9GB/sec
4 iteration. Since last report passed 1.2236566s written 5.2GB, read: 10.5GB 12.9GB/sec
16 iteration. Since last report passed 5.1014429s written 21.0GB, read: 42.0GB 12.3GB/sec
87 iteration. Since last report passed 30.1716933s written 124.2GB, read: 248.5GB 12.4GB/sec
159 iteration. Since last report passed 30.4022675s written 126.0GB, read: 252.0GB 12.4GB/sec
230 iteration. Since last report passed 30.1847944s written 124.2GB, read: 248.5GB 12.3GB/sec
301 iteration. Since last report passed 30.172516s written 124.2GB, read: 248.5GB 12.4GB/sec
372 iteration. Since last report passed 30.1921199s written 124.2GB, read: 248.5GB 12.3GB/sec
443 iteration. Since last report passed 30.178316s written 124.2GB, read: 248.5GB 12.4GB/sec
515 iteration. Since last report passed 30.3841395s written 126.0GB, read: 252.0GB 12.4GB/sec
etc.
 