Troubleshooting errorless system hang on kernel 6.7 (AMDGPU)

0x0@social.rocketsfall.net · edit-2 11 months ago

Rockslide0482@discuss.tchncs.de · 11 months ago

TLDR: do memtest on your RAM

For quite some time I had an issue where my computer would occasionally just hard crash. When it first started happening I tried many of the common tests, including a memory test, but found nothing. For a while it wasn’t super common, so I just lived with it. I thought it was an OS thing, but it occurred on a different Linux distro and even on the ancient Windows 10 install I have but rarely use. I was just about to pull the trigger on replacing the mobo and maybe even CPU+RAM. Before I did that, I followed someone’s suggestion to run memtest. I could have sworn that I had already done that and it came up clean, but it was an easy enough test to run, so why not.

Sure enough, found an error. I isolated the faulted DIMM, pulled it out and I haven’t had a crash since. Crazy since I’m all but certain I did both memtest from a Linux live iso and the Windows memory checking utility.

In short, test your RAM. Do multiple passes. Maybe even try running on a single DIMM at a time for a reasonable amount of time to see if you can isolate a culprit. Bad RAM was my first thought when the issue first occurred, because it’s usually what causes stuff like that. When the tests originally came up clean, I assumed it had to be something else. I was wrong.

0x0@social.rocketsfall.net · 11 months ago

This is what I’ll try next. I do think memory is the problem now that I’ve had a few more hours of research. Kernel 6.7 has issues with elevated RAM usage, so it’s absolutely doing something funky with memory that might be exposing underlying hardware issues. I also realized my stable kernel was a version or two away from 6.6.13 (6.6.10), so I’m running it now to see if the issue was introduced late in the 6.6 release cycle, which would be easier to bisect than 6.7.

Shadow@lemmy.ca · 11 months ago

If you can’t sysrq, then you’re down to bisecting kernel releases to find the patch that introduced the issue. You could also check for any new features that are enabled by default in 6.7.

Have you upgraded all bios / fw versions?

0x0@social.rocketsfall.net · 11 months ago

I was afraid of that. Since I’m not the only one, maybe someone else is doing it already. But if it’s still an issue in a few weeks, maybe I’ll take it on as a weekend project. As for the motherboard, I believe the latest version is currently on it (2022 or 2023.)

Shadow@lemmy.ca · edit-2 11 months ago

Also, you can try moving the SSD to another PC to rule out hardware.

A Basil Plant@lemmy.world · 11 months ago

I recently had an issue with my computer freezing occasionally on Debian 12 (Linux 6.1), where no errors showed up in syslog after a forced reboot.

The way I finally found out about the issue was having dmesg open on a different monitor and waiting for the freeze to happen. Just before the freeze did happen, a number of error logs were spewed to dmesg - enough for me to catch a glimpse of the issue: intel WiFi.

I’m not saying that intel WiFi is your issue. I’m suggesting you keep dmesg -w open on another monitor (if you can) and go about your normal activity until a freeze happens.
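If a second monitor isn’t an option, a rough alternative is to stream the same output to a file and force every line to disk as it arrives, so the last messages before the freeze have a chance of surviving the reboot. This is only a sketch: `persist_lines`, `follow_dmesg` and the log file name are made up for illustration, `dmesg -w` may need root on some systems, and nothing gets written if the hang takes out storage first.

```python
import os
import subprocess
from typing import Iterable


def persist_lines(lines: Iterable[str], path: str) -> int:
    """Append each line to `path`, syncing it to disk before reading the next."""
    count = 0
    with open(path, "a", encoding="utf-8") as log:
        for line in lines:
            log.write(line if line.endswith("\n") else line + "\n")
            log.flush()             # push Python's buffer to the OS
            os.fsync(log.fileno())  # ask the OS to push it to the disk
            count += 1
    return count


def follow_dmesg(path: str) -> None:
    """Stream `dmesg -w` into `path` until interrupted (Ctrl+C) or a hang."""
    proc = subprocess.Popen(
        ["dmesg", "-w"],
        stdout=subprocess.PIPE,
        text=True,
        errors="replace",
    )
    try:
        persist_lines(proc.stdout, path)
    finally:
        proc.terminate()
```

Calling `follow_dmesg("dmesg-live.log")` in a spare terminal before starting normal activity would leave the tail of the kernel log in that file after the forced reboot.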

mvirts@lemmy.world · 11 months ago

I’ve never done it, but I would try reproducing this in a VM like qemu… I would be googling at this point but I think you can debug a kernel crash from there somehow.

0x0@social.rocketsfall.net · 11 months ago

That’s an interesting idea. I’ll have to look into whether it’s a viable option first, though.

Corngood@lemmy.ml · 11 months ago

I did this recently and it was extremely quick to bisect and debug, but I was lucky enough to have a simple repro that worked in the emulator.

I think if I were you I’d try to repro on bleeding edge first. Then if it’s still broken, I’d try to get the repro time down as much as possible and automate it. Then I’d either bisect on qemu if possible, or bare metal.
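For the "automate it" part, `git bisect run` decides good/bad from a script’s exit code: 0 means good, 125 means "can’t test this commit, skip it", and other codes from 1 to 127 mean bad. Below is a minimal sketch of such a wrapper, assuming the repro can be launched as a single command from a host that stays alive (for example, something that boots the test kernel in a VM). The repro command itself is hypothetical and not shown. On bare metal, where the whole machine hangs, this approach doesn’t apply as-is.

```python
import subprocess
from typing import Sequence

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(cmd: Sequence[str], timeout_s: float) -> int:
    """Run the repro command and map its outcome to a bisect exit code.

    A timeout is treated as a hang, i.e. a bad commit.
    """
    try:
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return BAD   # never finished: assume it hung
    except OSError:
        return SKIP  # repro could not even be started
    return GOOD if result.returncode == 0 else BAD
```

A small script that ends with `sys.exit(classify([...], timeout_s=...))` could then be handed to `git bisect run`. The timeout should sit comfortably above the longest time a good kernel needs to pass the repro.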

0x0@social.rocketsfall.net · 11 months ago

Yeah, the qemu idea was brought up earlier in the thread and it’s very interesting. Glad you confirmed you could repro real issues there in the test environment, so it’s at least a little likely I’ll be able to do the same. Makes sense that it would work and is way better than letting the real system crash and burn. My kernel compile time is pretty short so it shouldn’t be too bad to bisect, I’m just not sure how many commits separate my stable kernel from the bugged 6.7. TBH I’m not that familiar with kernel dev., so maybe it’s way simpler than that.

Corngood@lemmy.ml · 11 months ago

The one I was able to test on qemu was a reliable failure of memory management syscalls triggered by a certain usage pattern. Unfortunately yours sounds like it’s probably hardware dependent. People in that Reddit thread mentioned video decoding, so you could try hammering that.

The nice thing about bisecting is that it’s mostly logarithmic, so doubling the commits should only take one extra step. I’d be surprised if you had to do more than 10-12 steps.
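To put rough numbers on the logarithmic part: the sketch below treats history as a straight line of commits (real kernel history has merges, so `git bisect` can differ by a step or two) and counts how many builds a binary search needs. The commit counts are illustrative, not the actual distance between 6.6 and 6.7, and both function names are made up for this example.

```python
def estimated_bisect_steps(num_commits: int) -> int:
    """Worst-case number of test builds to isolate one bad commit.

    Equals ceil(log2(num_commits)), computed with integers only.
    """
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (num_commits - 1).bit_length()


def simulate_bisect(num_commits: int, first_bad: int) -> tuple[int, int]:
    """Binary-search commits 0..num_commits-1, where every commit at or
    after `first_bad` is bad. Returns (commit found, builds tested)."""
    lo, hi = 0, num_commits - 1
    steps = 0
    while lo < hi:
        mid = (lo + hi) // 2
        steps += 1             # one kernel build + test
        if mid >= first_bad:   # this build shows the hang
            hi = mid
        else:                  # this build is fine
            lo = mid + 1
    return lo, steps


print(estimated_bisect_steps(1000))  # 10
print(estimated_bisect_steps(2000))  # 11
print(estimated_bisect_steps(4000))  # 12
```

So a 10-12 step estimate corresponds to a range of roughly one to four thousand commits under this simplified model.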

You may already have a good kernel config, but for this sort of thing I usually use make localmodconfig. That’ll build only the modules that are loaded when you run it, which can cut down on compile time massively.

0x0@social.rocketsfall.net · 11 months ago

I’m fresh off ruling out the RAM via memtest. I’ll let it do a longer soak overnight to see if anything fails then, but I’m now on to bisecting the kernel from what I believe is the last release of 6.6 (6.6.13) to hopefully whatever the offending commit is. Been a while since I’ve had to mess around with manually building the kernel without the aid of linux-tkg, but I’m off to learn it anyway. Thanks for the help!

Corngood@lemmy.ml · 11 months ago

Good luck! Sounds like you got it under control, but I’m happy to help if you run into trouble. I’m curious what you’ll find.

lemmyreader@lemmy.ml · edit-2 11 months ago

In the comments of the web link you shared (the link you wrote didn’t work for me, but I looked up the original and am adding it here so that others can use their preferred libreddit or teddit), at least three comments mention that the 6.7 zen kernel works fine for them. Care to try that?

0x0@social.rocketsfall.net · 11 months ago

Fixed the link. Thanks!

I’ve also tried linux-tkg, which I believe rolls in the Zen patches. If it doesn’t, I’ll definitely try the zen kernel.

Pankkake@lemmy.world · 11 months ago

I used to have the same issue. Turns out, it was fixed by a firmware update on my motherboard.

Update (01-27-2024)

List of similar issues

Patched/Unpatched 6.8rc1 attempts

Bisecting 6.6 to 6.7

The state of AMDGPU in general