This is a bit of a long shot, but before I start bisecting kernels, I thought I'd see if anyone else has experienced this issue.
When running OpenCL compute applications on ROCm 4.0 and recent kernels, the kernel fails to release all of the memory after a task is completed. Eventually the OOM killer is invoked, processes are killed, and a reboot is necessary.
If the application (rendering, for example) has sufficiently long run times, this issue might fly under the radar unless you are running 24/7 with long uptimes between reboots. The compute work I am doing (whether it is a random BOINC project or my own private work) generally has short run times, so with 4x Radeon VIIs crunching away, I run out of memory (128GB in the system) in less than 24 hours.
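For anyone who wants to check whether they are seeing the same thing, here is a minimal Python sketch that samples /proc/meminfo and estimates how fast a field is changing. The function names (parse_meminfo, growth_kb_per_hour, watch) are made up for illustration, and the choice of fields to watch (MemAvailable, Slab, SUnreclaim) is only an assumption; I have not confirmed which counter actually grows in this case.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip anything that is not "Name: value"
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def growth_kb_per_hour(samples):
    """Average change per hour between the first and last (seconds, kB) sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 == t0:
        return 0.0
    return (v1 - v0) * 3600.0 / (t1 - t0)


def watch(interval_s=60, keys=("MemAvailable", "Slab", "SUnreclaim")):
    """Print the selected fields every interval_s seconds; stop with Ctrl-C.

    The default keys are a guess at what might be interesting, not a diagnosis.
    """
    while True:
        with open("/proc/meminfo") as f:
            info = parse_meminfo(f.read())
        stamp = int(time.time())
        print(stamp, *(f"{k}={info.get(k, 'n/a')}" for k in keys), flush=True)
        time.sleep(interval_s)
```

Running watch() in a terminal while the GPUs are busy, then feeding the first and last MemAvailable readings to growth_kb_per_hour, gives a rough rate. A steadily negative number that does not recover once the tasks finish would match what I am describing.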
This issue didn't present itself until I started running kernel 5.10.7. Since then, it has persisted through all the 5.10 kernels, the 5.11-rc kernels, and the 5.11 stable kernels. The issue does not exist when running the amd-staging-drm-next kernel. I have tested both my normal setup, which involves building ROCm from source, and the OpenCL bits from the latest amdgpu-pro driver, which I believe as of amdgpu-pro-20.45.1164792 is using ROCm as its upstream for OpenCL.
This leads me to believe that it is not a ROCm issue, but rather something in the amdgpu kernel driver.
Are you building and running ROCm on a newish upstream kernel?