
Memory leak ROCm 4.0 + Kernels>5.9.14

2408 Views 19 Replies 3 Participants Last post by  tictoc
This is a bit of a long shot, but before I start bisecting kernels, I thought I'd see if anyone else has experienced this issue.

When running OpenCL compute applications on ROCm 4.0 and recent kernels, the kernel fails to release all of the memory after a task is completed. Eventually the OOM killer is invoked, processes are killed, and a reboot is necessary.

If the application (like rendering) has sufficiently long run times, this issue might fly under the radar unless you are running 24/7 with long uptime between reboots. Generally the compute work that I am doing (whether it is a random BOINC project or my own private work) has short runtimes, so with 4x Radeon VIIs crunching away I will be out of memory (128GB of memory in the system) in less than 24 hours.
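To put a number on a leak like this, a minimal sketch follows. It assumes you sample `MemAvailable` from /proc/meminfo at intervals; the function names are made up for illustration, and a straight-line estimate between the first and last sample is only a rough guide.

```python
def parse_mem_available_kib(meminfo_text):
    """Return the MemAvailable value (in KiB) from /proc/meminfo-style text."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    raise ValueError("MemAvailable not found")


def leak_rate_gib_per_hour(samples):
    """Estimate the leak rate from (seconds, MemAvailable KiB) samples.

    Uses only the oldest and newest sample, so it assumes the workload
    was roughly constant in between.
    """
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    hours = (t1 - t0) / 3600
    return (m0 - m1) / (1024 * 1024) / hours


def hours_until_exhausted(available_kib, rate_gib_per_hour):
    """Hours left before available memory reaches zero at the given rate."""
    return (available_kib / (1024 * 1024)) / rate_gib_per_hour
```

For scale: running out of 128GB in under 24 hours works out to a leak of more than about 5.3GB per hour. To use the sketch on a live system, read /proc/meminfo every few minutes, pass the text to `parse_mem_available_kib`, and collect the results as samples.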

This issue didn't present itself until I started running kernel 5.10.7. Since then, it has persisted through all the 5.10 kernels, the 5.11-rc kernels, and the 5.11 stable kernels. The issue does not exist when running the amd-staging-drm-next kernel. I have tested both my normal setup, which involves building ROCm from source, and the OpenCL bits from the latest amdgpu-pro driver, which I believe, as of amdgpu-pro-20.45.1164792, uses ROCm as its upstream for OpenCL.

This leads me to believe that it is not a ROCm issue, but rather something in the amdgpu kernel driver.

@Diffident Are you building and running ROCm on a newish upstream kernel?
1 - 20 of 20 Posts

· BOINC Cruncher
Joined
·
1,972 Posts
I'm using the OpenCL bits from the pro driver with the 5.11 kernel and I've experienced the memory leak. I had to stop folding during the ExtremeHW folding event because of it. I've seen it happen twice in the past 2 weeks, on Fedora and Arch. I would have to drop down to a TTY cause my mouse wouldn't move. I opened htop and saw all 32GB of memory used and 2GB of my swapfile.
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #3 ·
Thanks for the report. The amd-staging-drm-next kernel does not have the leak, so I am going to start bisecting kernels to see when and what went into the stable kernel to cause the leak.

I'll probably submit the bug report here: Issues · drm / amd or here: RadeonOpenCompute/ROCm or to the kernel mailing list/bugzilla once I can figure out what broke.
 

· Registered
Joined
·
6 Posts
Just to let you know, I recently upgraded from kernel 5.8.18 to 5.10.19, and I too am seeing the memory leak problem. Let me know if there is anything I can do to help.
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #5 ·
I just finished stepping back through and testing all the 5.10 kernels, and the memory issue was actually introduced with the first 5.10 stable kernel. Kernel 5.9.14 does not have the memory issue. Back in December when the first 5.10 kernel was released I was using AMDGPU-Pro for OpenCL due to some compatibility issues with ROCm 3.9.

Now that I know when the bug was introduced, hopefully it won't be too hard to bisect and find the commit that is causing the kernel to not release all the memory.
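The idea behind bisecting can be sketched as a plain binary search. This is a simplified illustration that assumes a linear, ordered list of candidates with a known-good first entry and a known-bad last entry; `first_bad` and the `is_bad` callback (build, boot, run the workload, check for the leak) are hypothetical names. `git bisect` applies the same idea to real commit history.

```python
def first_bad(candidates, is_bad):
    """Return the first bad candidate using binary search.

    Assumes candidates are ordered oldest to newest, candidates[0] is
    known good, candidates[-1] is known bad, and every candidate after
    the first bad one is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # the culprit is at mid or earlier
        else:
            lo = mid  # the culprit is after mid
    return candidates[hi]
```

Each test halves the remaining range, so even several thousand candidates need only a dozen or so build-and-test cycles.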
 

· BOINC Cruncher
Joined
·
1,972 Posts
I don't know if it's related but, Milkyway is broken for me now. Everything fails with

Error creating command queue (-6): CL_OUT_OF_HOST_MEMORY
Error getting device and context (-6): CL_OUT_OF_HOST_MEMORY

It was working fine in Fedora with 5.10, now it's not in Arch with 5.11.
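For reading errors like the two above, here is a small sketch that splits such a line into its parts and checks the code against the standard OpenCL error values. The line format is assumed from these two messages only, and `parse_cl_error` is a made-up helper name.

```python
import re

# A few standard OpenCL error codes as defined in the OpenCL headers.
OPENCL_ERRORS = {
    0: "CL_SUCCESS",
    -1: "CL_DEVICE_NOT_FOUND",
    -2: "CL_DEVICE_NOT_AVAILABLE",
    -3: "CL_COMPILER_NOT_AVAILABLE",
    -4: "CL_MEM_OBJECT_ALLOCATION_FAILURE",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
}

# Assumed format: "<what failed> (<code>): <CL_NAME>"
_ERROR_LINE = re.compile(r"^(?P<what>.+?) \((?P<code>-?\d+)\): (?P<name>CL_\w+)$")


def parse_cl_error(line):
    """Return (what_failed, code, name) or None if the line does not match."""
    m = _ERROR_LINE.match(line.strip())
    if m is None:
        return None
    return m.group("what"), int(m.group("code")), m.group("name")
```

Note that CL_OUT_OF_HOST_MEMORY refers to a host-side allocation failure inside the OpenCL implementation, so it does not by itself say whether system RAM was actually exhausted.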
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #7 ·
I haven't had time to dig into this any further, but I hope to get to it in the next couple of days.

@Diffident The errors you are getting in MilkyWay are because the latest AMDGPU-Pro now uses the ROCm OpenCL driver for Vega and newer GPUs. You will have to roll back to an older version of the -pro driver, because MilkyWay will not run on the ROCm OpenCL driver.
 

· BOINC Cruncher
Joined
·
1,972 Posts
I guess I was using the older driver in Fedora. Fedora doesn't have a package for it so I installed it manually.

AMD is really trying its hardest to make their GPUs useless. It was last week or the week before that AMD said the ROCm driver was never intended to work with everything, so what do they do... they replace the driver that does work with everything with the one that doesn't.
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #9 · (Edited)
That was some serious foot in mouth in their GitHub repo. At least they walked it back, but OpenCL has been quite the mess since they EOL'd the fglrx driver. fglrx was also a mess, but at least I could always get it running one way or the other.

When the ROCm ocl runtime works it works well, but there doesn't seem to be any regression testing on older OpenCL code. That is kind of stupid since one of the big things with OpenCL is that it will run on everything. I have some ancient OpenCL 1.1 code that still happily runs on the NVIDIA closed driver and on Intel iGPUs, but falls on its face with ROCm.

Right now I have been testing some HIP code, but once that is done, I plan on digging into the memory leak and some of the issues with ROCm on older OpenCL projects.
 

· BOINC Cruncher
Joined
·
1,972 Posts
I've had enough. This weekend I'm taking the Radeon VII out...again..and putting in my 1070ti...again..but this time for good. When the waterblock dries out, it's being put in a box and hopefully I can sell it for at least the total price I paid for the GPU, waterblock and backplate. Maybe someday, when hell freezes over, I'll be able to buy a 3080.
 

· Registered
Joined
·
6 Posts
@tictoc Did you ever get a chance to look into this?

I tried to take a quick stab at it. Interestingly, I did notice some changes to the amdgpu driver code, but nothing stood out. I also noticed many changes to this same code with almost every kernel update up to 5.11.7. I'm wondering if they've managed to fix the memory leak with one of these updates. Has anyone tried the latest kernel?
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #12 ·

No, internet and power were down for multiple days here, so I have not really been able to do anything more with this.

Internet came back for a few hours yesterday, and I did test the latest stable kernel (5.11.7), the latest mainline kernel (5.12-rc3), and the latest amd-staging kernel. All three kernels now have the memory leak problem, so the bad code has now made its way into the amd-staging-drm-next kernel.

There is a commit from yesterday that addresses a memory leak: Commits · amd-staging-drm-next · Alex Deucher / linux
I haven't tested that yet, so I don't know if it addresses this issue. Provided that my internet stays up, I should be able to start digging into this tonight.
 

· Registered
Joined
·
6 Posts
Any news on this memory leak?

I'm surprised there are no other Google hits regarding this. Is it just a small number of us BOINCers who are affected by it?
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #14 ·
Nothing new as I've been out of town and too busy to do anything other than load up the latest stable (5.11.11) and rc (5.12-rc5) kernels, but if I don't have time to dig into this further, then I will at least submit a bug report or two to get a few more eyes on the problem.

I would imagine that more people are affected by this, since the bug shows up on all of the OpenCL projects that I tested. The percentage of people that are running new kernels is really pretty small in the Linux world. The vast majority of people are probably still running the older LTS kernels (4.4.xxx, 4.9.xx, 4.14.xx, etc.).
 

· Registered
Joined
·
6 Posts
Thanks for the update!
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #16 · (Edited)
The memory leak has been fixed upstream!! :)

I am running ROCm 4.1 and the latest amd-staging-drm-next kernel: Files · amd-staging-drm-next · Alex Deucher / linux
In addition to the memory leak being fixed the upstream amdgpu driver has also been fixed to allow the Radeon VII to work with ROCm 4.1. (y)

I am not sure when this will hit the mainline or stable kernels, but for now ROCm 4.1 + the AMD staging kernel should be a good combo for running OpenCL. Hopefully this makes it into the 5.12 kernel. I did a quick scan through 5.12-rc8 and I don't see anything there, but since I was never able to track down what caused the leak, it's possible that it is fixed. I'll be building 5.12-rc8 later, and I'll post here if the fix made it into the mainline kernel/driver.

EDIT
The latest staging kernel does address this issue, but unfortunately it comes with a bug of its own.
Code:
amdgpu: SDMA gets an Register Write SRBM_WRITE command in non-privilege command buffer
This error floods the journal and eventually leads to a hard lockup of the kernel that requires a hard power off to recover. Bug report filed here: SDMA Errors (#1576) · Issues · drm / amd
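To check whether a machine is being hit by the same flood, a minimal sketch follows. It counts lines containing the message above in text you have already captured (for example from `journalctl -k`); `count_sdma_errors` is a made-up helper name, and matching on the exact substring is an assumption that the wording is the same on your kernel.

```python
SDMA_MARKER = (
    "SDMA gets an Register Write SRBM_WRITE command "
    "in non-privilege command buffer"
)


def count_sdma_errors(log_lines):
    """Count log lines that contain the SDMA error message."""
    return sum(1 for line in log_lines if SDMA_MARKER in line)
```

A count that keeps climbing between two captures taken a few minutes apart would suggest the same flooding behavior.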
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #18 · (Edited)
That's good news, at least regarding the memory leak!
I never really had time to work on diagnosing the latest issue (SDMA errors), but it has been fixed as of 4/30/21 in the latest amd-staging-drm-next kernel. :) I have been running that kernel with ROCm 4.1 on a Radeon VII for the last 12 hours with zero errors or issues.

So far testing has been pretty limited in terms of OpenCL applications. Right now the only things that I can confirm are working without issue on ROCm 4.1 are [email protected] and the Geekbench OpenCL benchmark.

--Edit-- The BOINC [email protected] project also seems to work without errors, although I did have to rebuild boinc 7.16.16, or tasks would immediately segfault.
 

· Premium Member
Joined
·
5,120 Posts
Discussion Starter · #20 ·
This might be a moot point now, but after a month of using an older stable kernel, I recently upgraded to kernel 5.12.5-200 and I can confirm that the memory leak has been fixed.
Thanks for the update. I have been running the AMD staging kernel, because there was an issue with GPU detection for Radeon VIIs on ROCm 4.0/4.2.

Currently I'm running OpenCL work that won't run on ROCm, so I'm running an old OpenCL driver. Once that wraps up I'll jump back on a stable kernel and ROCm 4.2 to confirm. (y)
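As a rough summary of what was reported in this thread, here is a sketch that classifies a stable kernel release string. The version boundaries come only from the posts above (5.8.18 and 5.9.14 clean, 5.10 through 5.11.7 leaking, 5.12.5 fixed), treating everything older than 5.9.14 as clean is an assumption, and `leak_status` is a made-up helper name. Anything in between, and any rc or staging build, comes back as unknown.

```python
import re


def parse_kernel_version(release):
    """Turn '5.12.5-200' into (5, 12, 5); a missing patch level becomes 0."""
    m = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(m.group(1)), int(m.group(2)), int(m.group(3) or 0)


def leak_status(release):
    """Classify a stable release using only the reports in this thread."""
    if "rc" in release:
        return "unknown"  # rc and staging builds are not covered here
    version = parse_kernel_version(release)
    if version <= (5, 9, 14):
        return "no leak reported"  # assumption for anything older than 5.8.18
    if (5, 10, 0) <= version <= (5, 11, 7):
        return "leak reported"
    if version >= (5, 12, 5):
        return "fix reported"
    return "unknown"
```

For example, `leak_status("5.10.19")` gives "leak reported" and `leak_status("5.12.5-200")` gives "fix reported".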
 