I have this Debian server that won’t stop crashing. It crashes once every 2 or 3 days. Everything’s up to date, the CPU is good and Prime95 never finds any problems. The RAM is good too: I’ve run every RAM test there is and never found anything wrong. I just can’t get it to stop crashing and it’s driving me insane.
I used to have an Arduino connected to the motherboard’s reset jumper, with a bash script set up as a systemd service that sent a signal to the Arduino every 10 seconds. If the Arduino didn’t receive a signal for 30 seconds, it forced a reboot. This doesn’t fully automate restarting after a crash, because too often the server crashes just lightly enough that everything except that auto-restart script stops working, so it won’t reboot. It does double the amount of time the server runs without manual intervention, which is better than nothing but not good enough.
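For anyone following along, the watchdog timing described above can be sketched roughly like this. It is only a model of the logic in Python, not the actual bash script or Arduino code; the function names are made up for illustration, and it assumes the reset fires once the silence reaches exactly 30 seconds.

```python
HEARTBEAT_INTERVAL_S = 10  # the server sends a signal this often (from the post)
RESET_TIMEOUT_S = 30       # the Arduino resets after this much silence (from the post)


def should_reset(last_signal_s, now_s, timeout_s=RESET_TIMEOUT_S):
    """Return True once no signal has arrived for timeout_s seconds."""
    return (now_s - last_signal_s) >= timeout_s


def first_reset_time(signal_times_s, end_s, timeout_s=RESET_TIMEOUT_S):
    """Step through a timeline one second at a time.

    signal_times_s: the seconds at which the server sent a heartbeat.
    Returns the second at which the reset would fire, or None if it
    never fires before end_s.
    """
    signals = set(signal_times_s)
    last_signal = 0
    for now in range(end_s + 1):
        if now in signals:
            last_signal = now
        elif should_reset(last_signal, now, timeout_s):
            return now
    return None


if __name__ == "__main__":
    # Heartbeats stop after t=60, as in a full hang.
    hard_crash = range(0, 61, HEARTBEAT_INTERVAL_S)
    print(first_reset_time(hard_crash, end_s=200))

    # Heartbeats never stop, as when only the heartbeat script survives.
    light_crash = range(0, 201, HEARTBEAT_INTERVAL_S)
    print(first_reset_time(light_crash, end_s=200))
```

The second case is the problem described above: as long as the heartbeat script keeps running, the watchdog has no way to tell that everything else has stopped.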
Other than just randomly installing different distros until I find one that doesn’t do this (reinstalling an OS and then setting all the server stuff back up is very time-consuming), what can I do to troubleshoot, solve, stop, or otherwise do anything about these crashes?
I’ve been using Debian for years without crashes, so I don’t think it’s a software issue. It sounds like a hardware issue to me; it could be your motherboard or power supply.
I’ve already replaced the CPU and the hard drives. I could try swapping out the RAM even though it’s never failed any kind of RAM check. I have an EVGA 450 BR PSU, which is supposedly a good budget PSU, but I guess I could try replacing it. When I’m done with my gaming watercooling build I’ll have a spare motherboard I could try, but if I change the motherboard, I’d likely have to reinstall just to get all the chipset drivers to work. And if I’m reinstalling an OS, I should choose something other than Debian, because I would be changing 2 things in the same amount of time and effort it takes to change 1 thing, which doubles the chances of arriving at a combination that doesn’t crash anymore.
If the only way to get this working is seriously to randomly replace more parts and hope something finally works, I might make a serious effort to go back to the known-stable Athlon XP I was using before I “upgraded” to this one. I’d have to install Gentoo and probably lose compatibility with a few things, but I am so sick and tired of dealing with this server crashing all the time that it might be worth it.
Yeah, unfortunately it’s damn near impossible to pin down the failing part exactly without a bunch of spare parts.
You could look around your mobo for bulging capacitors, but that could be a long shot.
You could also try sifting through your journalctl output, looking for warnings and errors from around the time of each crash.
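As a rough sketch of that last suggestion, something like the following would pull the warnings and errors from the boot that ended in the crash. The helper names are made up for illustration. It assumes the journal is stored persistently (otherwise nothing from the previous boot survives the reboot), and reading the system journal may need root or membership in the systemd-journal group.

```python
import subprocess


def build_journalctl_cmd(boot_offset=-1, priority="warning"):
    """Build a journalctl command for one boot, filtered by priority.

    boot_offset=-1 selects the previous boot, i.e. the one that crashed.
    priority="warning" shows warnings and anything more severe.
    """
    return [
        "journalctl",
        "--no-pager",
        f"--boot={boot_offset}",
        f"--priority={priority}",
    ]


def last_lines(lines, tail=20):
    """Keep the last non-empty lines, which are the ones closest to the crash."""
    kept = [line for line in lines if line.strip()]
    return kept[-tail:]


def read_previous_boot(tail=20):
    """Run journalctl and return the tail of its output, or [] if it is missing."""
    try:
        result = subprocess.run(
            build_journalctl_cmd(),
            capture_output=True,
            text=True,
            check=False,
        )
    except FileNotFoundError:
        return []  # journalctl is not installed on this machine
    return last_lines(result.stdout.splitlines(), tail)


if __name__ == "__main__":
    for line in read_previous_boot():
        print(line)
```

If the last entries before each crash keep pointing at the same device or driver, that narrows down which part to swap first. If the log just stops with nothing unusual, that fits the hardware theory (power supply or motherboard).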