The right FUCKING time to get TWO ram sticks damaged

7d 23h ago by lemmy.world/u/Wispy2891 in mildlyinfuriating

Now I need to take a loan in order to afford 32gb for replacement thanks to the ai bros hoarding all the chips...

Tried on three different PCs, both Intel and AMD, both sticks are damaged, somehow

If you haven't yet, I would try disabling the XMPP/DOCP profile to see if that passes a test. This will tell you if the RAM is just dead or if it's degraded a bit and can't hit the same speeds as it did before. If it does pass, then re-enable that profile and try downclocking or loosening the timings a bit to see if that'll work.

Failing that, you could try increasing the voltage slightly (like +0.05V, I wouldn't go above 1.4V), but I'd be careful on this front to not cause any more damage.

Sucks that this happened right now, but IMO it'd be better to sacrifice a slight hit in performance than to buy RAM by itself at these premiums.

This guy RAMs!

RAM has Jabber stuff now? /s

The universe: Fuck this guy in particular

A lot of RAM is under lifetime warranty; check the manufacturer's site (usually a serial lookup is enough).

Do they run stably if you downclock the memory in your BIOS? I'd at least try that first if replacing them is going to be a major problem.

No, even tried to run them at 1866...

Ah, fair enough. Long shot, but thought I'd at least mention it on the off chance that maybe it would work and maybe you hadn't yet tried it. Sorry.

tries to think of anything else that could be done

Are you using Linux? Linux has a patch that was added many years back with the ability to map around damaged regions in memory. I mean, if your memory is completely hosed and you can't even boot the kernel, then that won't work, but if you can identify specific areas that fail, you can hand that off to the kernel and it can just avoid them. Obviously decreases usable memory by a certain amount, but...shrugs

I've never needed to do it myself, but let me go see if I can find some information. Think it was the "badram" feature.

searches

Okay. You're running memtest86. It looks like that has the ability to generate the string you need, and you hand that off to GRUB, which hands it off to the kernel.

https://www.memtest86.com/blacklist-ram-badram-badmemorylist.html

MemTest86 Pro (v9 or later) supports automatic generation of BadRAM string patterns from detected errors in the HTML report, which can be used directly in the GRUB2 configuration without needing to calculate address/mask values by hand.

To enter the address ranges to blacklist manually, do the following:

Edit /etc/default/grub and add the following line:

GRUB_BADRAM=addr,mask[,addr,mask...]

where the list of addr,mask pairs specifies the memory ranges to block using address bit matching.
E.g. GRUB_BADRAM=0x7ddf0000,0xffffc000 shall exclude the memory range 0x7DDF0000-0x7DDF4000
Open a terminal and run the following command:

sudo update-grub

Reboot the system

If you can't even boot the system sufficiently to get update-grub to run, then you might need to do a fancier dance (swap drive to another machine or something), but that's probably a good first thing to try. I'd try booting to "rescue mode" or whatever if your distro has an option like that in GRUB, something that doesn't start the graphical environment, as it'll touch less memory.

EDIT: If your distro doesn't have something like that "rescue mode" set up --- all the distros I've used do, but that doesn't mean that all of them do --- or if it can't even bring "rescue mode" up, because your memory is too hosed for that --- then you probably want to do something like hit "edit kernel parameters" in GRUB and boot while adding "init=/bin/bash" to the end of the kernel command line. That'll start your system up in a mode where virtually nothing is running --- no systemd or other init system, no graphics, no virtual consoles, no anything. Bash running on a bare-metal Linux kernel. Control-C won't work because your terminal won't be in cooked mode, everything will be very super-duper minimal...but you should be able to bring up bash. From there, you'll want to manually bring your root filesystem, which the kernel will have mounted read-only, as it does during boot, up to read-write, with:

# mount / -o remount,rw

Once that's done, edit the grub config file in vi or whatever, then run the update-grub command.

Then run:

# sync

Because you don't have an init system running and it's not gonna flush the disk on shutdown and your normal power-down commands aren't gonna work because you have no init system to talk to.

Go ahead and manually reboot the system by killing its power, and hopefully that'll let it boot up with badram mapping around your damaged region of memory.

EDIT2: It occurs to me that someone could make a utility that can run entirely in Linux to do memory testing to the extent possible inside Linux using something like memtester instead of memtest86, generate the badram string and then write it out for GRUB. That's less bulletproof than memtest86 because memtester can't touch every bit of memory, but it's also easier for a user to do than the above stuff, and if you additionally added it to the install media for a distro, it'd make it easier to run Linux on broken hardware without a whole lot of technical knowledge. I guess it'd be pretty niche, though --- doubt that there are a lot of systems with damaged memory floating around.
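(For what it's worth, memtester itself is packaged on most distros and trivial to run --- the size and pass count here are just made-up examples:

sudo memtester 12G 3

The awkward part, as far as I know, is that memtester reports failures as offsets inside the chunk of memory it grabbed rather than as physical addresses, so translating its output into something GRUB_BADRAM or memmap can use is exactly the bit such a utility would have to solve.)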

EDIT3: Oh, that's only the commercial version of memtest86 that will auto-generate the string. Well, if you know how to do a bitmask and you can get a list of affected addresses from memtest86, then you can probably just do it manually. If not, post the list of addresses here and someone can probably do a base address and bitmask that covers the addresses in question for you. Stick the memory back into your computer first, though, since the order of the DIMMs is gonna affect the addresses.
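To illustrate with made-up addresses: say memtest86 flagged 0x7DDF0123 and 0x7DDF3FF0. Those two agree on every bit except the low 14, so a base of 0x7DDF0000 with a mask of 0xFFFFC000 (low 14 bits cleared) covers both, and GRUB_BADRAM=0x7ddf0000,0xffffc000 blacklists that whole 16 KiB block --- which is exactly the example from the memtest86 page above. Bits set in the mask are address bits that must match the base; cleared bits are "don't care".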

Wow, I'm running Linux, so it might be perfect.

Though I'm a bit scared that it will get worse over time. Today I got a freeze that forced me to test the RAM with memtest86, but since September I've gotten some random corruption in the btrfs filesystem (luckily always "useless" files like flatpak or docker stuff that I could delete and download again in seconds) and I assumed it was a btrfs bug, not a hardware problem.

If I were in this position I'd strongly consider using 16GB for the next year or two. Especially with an NVME SSD, good swap performance makes the impact of running out of memory much smaller than it used to be.
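If you don't already have swap configured, a swapfile is the quick way to get some. The size here is just an example, and note that on btrfs (which OP mentioned using) the swapfile has to be NOCOW and uncompressed, so it takes a couple of extra steps (newer btrfs-progs have a mkswapfile helper for that, IIRC):

sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Add it to /etc/fstab if you want it to persist across reboots.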

It's very strange both sticks failed at the same time, have you tried them in another motherboard?

I had to do this on my busted DDR4 two weeks ago. Badram didn't work, but memmap did. I had to do some bit flipping to get the translation from the BADRAM format, as explained here.

I think the latest memtest86+ has the option to report in memmap format. But you will need to take a photo of the screen, because it's FOSS and not as fancy as PassMark's MemTest86.

Edit: Adding badram to grub broke grub for me; I had to undo the grub config using a live-boot rescue thingamajig. Then I went hunting for why.
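Edit 2: For anyone going the memmap route, the kernel parameter format is memmap=<size>$<start> (reserve <size> of RAM starting at physical address <start>), so something like

memmap=16K$0x7ddf0000

would fence off the 16 KiB example block from the memtest86 page above (address made up to match that example). One gotcha: if you put it in /etc/default/grub instead of typing it at the boot menu, the $ usually needs escaping (e.g. memmap=16K\$0x7ddf0000), otherwise the parameter gets mangled during config generation.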

You can even make Linux run an automatic memtest on boot and reserve the bad areas it finds. This is done with the memtest=N kernel parameter, where N is the number of passes; memtest=17 tests all patterns. With this, the kernel will run an automatic test on every boot.
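In GRUB terms that just means appending it to the kernel command line. A rough example (keep whatever options are already in the variable; the pass count here is just an example), followed by update-grub:

GRUB_CMDLINE_LINUX_DEFAULT="quiet memtest=4"

The kernel needs to have been built with CONFIG_MEMTEST for the parameter to do anything.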

To add to what the above commenter said: afaik GRUB allows specifying kernel parameters at boot by pressing a hotkey (e on the highlighted entry to edit it, then Ctrl-X or F10 to boot). You could type in the string from memtest86 if you find out what the parameter should be called (or add the memtest parameter instead).

doubt that there are a lot of systems with damaged memory floating around.

Let's just say you would be surprised if we actually started checking this. I will not disclose my occupation, but there are thousands of pieces of critical telco infrastructure equipment that run not only non-ECC RAM because of cost cutting, but with actually broken DRAM modules, regularly rebooting at least a few times a day and causing local outages…

Back to the topic at hand - doesn’t it seem strange that only CPU4 finds issues in memtest86? It could be a CPU or even motherboard that got damaged and not the DRAM itself, no?

Back to the topic at hand - doesn’t it seem strange that only CPU4 finds issues in memtest86? It could be a CPU or even motherboard that got damaged and not the DRAM itself, no?

I noticed that, but OP said that he ran the thing in three different systems, so I'm assuming that he's seen the same problems with multiple CPUs. It may be --- I don't know --- that memtest86 doesn't, at least as he's running it, necessarily try to hit each byte of memory with each CPU, or at least that the order it does so doesn't have errors from other CPUs visible.

I also wondered if it might be a 13th or 14th gen Intel CPU, the ones that destroyed themselves over time. But (a) it's a mobile CPU, and only the desktop CPUs had the problem there, and (b) it's 11th gen.

The Tiger King/Joe Exotic meme where he says “I’m never going to financially recover from this.”

I have two sticks of RAM worth $850 now that went bad but I was able to successfully RMA them - can you do that?

On Linux you can mask out bad memory ranges. Don't know about Windows.

That's neat, I'll definitely have a look into the topic.

At least it's DDR4?

I'm grasping.

Bro I’ve had my ram for yeaaaars. Anytime my computer glitches I just think “yep it’s time” and my wallet sheds a tear

Does the BIOS support any overclocking/tweaking?

I'm not familiar with Rocket Lake (your CPU generation), but you may be able to bump the voltage or loosen the timings a bit to get it stable. Even without BIOS support, it's possible you could do this from your operating system, like you can with Ryzen.

Nightmare scenario. My condolences

RIP OP's Kidney

Might be cheaper to buy a pre built laptop at this point...

Sounds dumb but check craigslist

I'm eying FB Marketplace lately, not all users are aware of the RAM situation.

Could this possibly be caused by a bad connection of the ram contacts?

I'm grasping for ya.

If not.. F

Is this a laptop? Are you in the EU? Is 2x8 GB enough for your needs?

No, it's some kind of hybrid bastard mobo from AliExpress where they use a soldered mobile CPU but with desktop memory in a microATX form factor.

Unless it's DDR5, check your local e-waste recyclers; most have shops where you can buy used parts.

I wonder if it was the motherboard sending the wrong voltage or something like that. What are the chances of TWO modules failing AT THE SAME TIME? (Although it's the same kit, identical memory, so maybe it could be damaged silicon and I never noticed it before.)

Exactly my thoughts. Take your RAM and test it with another CPU + MoBo combo. Ask friends. I bet the RAM is good.

Try to single out the CPU cache with cacheless mode. Two sticks is a weird issue unless timing slipped on the memory controller or overvoltage was applied.

Yeah, two sticks at once says mobo issue to me, IF you tested each stick individually and they both failed separately. Maybe they're not fried - I'd be hesitant to try them in a better mobo, so as not to fry a slot too, but they might still be fine.

Did you test each stick individually to confirm both are dead? If two sticks are in there and it fails, all that means is "at least one failed". That's just an indicator to go one stick at a time to determine which one.

Maybe 1 is causing the other to fail?
Could try the sticks individually.

It is strange that 2 sticks fail at the same time. It smells like a symptom instead of the root issue.

In fact they should try. Due to dual/quad-channel mode, the only thing testing multiple sticks at once will tell you is whether any of the sticks have failed. Only going one by one will tell you which ones or how many; otherwise you'll get red herrings.

Yeah, the 16/32 in the screenshot and the fact that 2 sticks are dead suggest they have 4x 8 GB sticks, and lend credence to one channel being messed with.
They said they tested the RAM on multiple systems, but they might have just thrown both "dead" sticks in there at the same time - leading to a similar failure mode, as they'd both be on the same channel.

I bet 1 stick is dead, and they could probably get away with 24 GB of RAM in a 3/2 channel distribution.

Agreed. Luckily with RAM, you know pretty quickly if a stick is dead. Yeah the test can run for hours, but in my experience if a stick is dead, memtest will go red almost immediately, most of the time not even making it 2 minutes.

What kind of RAM? DDR4? I can sell you old G Skill DDR4.

I have an old Team Force DDR4 16 GB kit, not exactly top of the line but it should do the trick. I'll gladly sell it to you for a much fairer price than what's around these days, as long as you're in the US. DM me if interested, no worries if not.

Ah too bad, I'd have sold you the RAM at a great discount (~50€)

Is this a laptop?

I'm not OP, but an i7-11800H is a mobile processor, so while I'm sure that there are non-laptop PCs out there using laptop CPUs, I'd guess that it's probably a laptop.

Someone else may have said this, but try reseating the memory, making sure there isn't dust or anything in the slots.

I'm sorry. I hope you don't have kids who need(ed) to go to college.

Well luckily you're on the right generation of Intel that allows you to use DDR4. It'll probably be cheaper to buy a new motherboard than it would be to buy DDR5

They're already on DDR4, according to the screenshot

Then what's the fucking problem? Just buy more RAM.

Even DDR4 went from ~50€ for a 2x16 GB kit to 180€ - utterly insane.


I am so sorry that this happened to you. My last computer was failing, but it was failing on two different fronts: the power supply was dying and the main hard drive was dying. When I got a new computer I got a new backup HD, and my old HD gave its one last dying breath to transfer all the files before croaking.

it was a real hero

Looks like most of the nibbles are fine. Maybe something happened to the connectors or traces. At least you know it's the RAM, not your motherboard.

at least? wouldn't it be cheaper to replace the motherboard nowadays?

Idk, I just bought a 32 GB stick of DDR4 SO-DIMM for $60, and adapters are less than $10 each, so maybe not, if you stay away from scalpers and don't pay attention to RAM speed 😅

I was also thinking the soldered-CPU motherboard with 8 cores must cost at least $200, but maybe that's a bad assumption and I didn't look it up.

If I may ask: how?

(Background: always owned multiple PCs / built frequently / never had one stick of bad RAM over decades. Was it just luck, or better vendors, or good handling?)

The only RAM issue I ever had was like the 3rd PC I ever built. Using 2 modules in single channel mode worked fine. Putting them in for dual channel fried both the 3rd and 4th DIMM slots on the motherboard and the RAM that was in the 3rd slot.

I RMA'd both. It happened again.

When I sent in for a second RMA, I started wondering what the issue was: the board or the RAM? I never got an answer. Instead I got two companies blaming each other and starting a flame war in my email inbox. The board was from ASRock. I forgot who made the RAM. I just ran that thing in single channel and it was fine until it just got old and needed an upgrade.

It could be the board, it could be the RAM, and it could be a faulty memory controller on your CPU. Although, if this was a while ago (pre-2003 for AMD, pre-2008 for Intel), then the memory controller would be on the ASRock board.

In other words, a nightmare to diagnose.

Interesting. I actually read stuff like that related to Asrock boards and never used them. Sounds like I made the right choice.

I have used a handful of their mid-to-high end motherboards (relative to their product range) and they have never caused any major issues. This is obviously only anecdotal, though.

Yeah, and to be fair IIRC, this was one of their low-end boards. My dumbass built it on the fucking quadcore Celeron instead of the Core2Duo that just came out because I figured 4 has gotta be better than 2, right? But I was also, like, 14 and the most demanding game I played was Counter-Strike.

I wonder also. I'm guessing maybe a bad lot?

The story starts two years ago when I bought them in a kit of four 16 GB sticks from Micron. When I installed all four in the motherboard, I installed Linux and it crashed (froze) when running a VM with KVM. Tested with memtest86 and it would always fail at the 5th test (after around 20 mins of crunching), and at reboot the BIOS would reset to defaults. Because it was an AMD Ryzen and all the web results said so, I assumed it was some kind of incompatibility and removed two sticks. With two sticks, it passed the test. I swapped the two sticks and it passed the test again. So I left those 2 16 GB sticks in the Ryzen and used the other 2 16 GB sticks with the Intel. Both passed the test.

Fast forward 18 months, in the Intel I'm copying a file from the nvme to the HDD and it tells me Input/output error.

I start diagnosing the btrfs filesystem, find corruption reported in the error counters; a scrub finds uncorrectable errors in the virtio-win.iso file, the one I wanted to move. I assumed it was some btrfs bug, deleted the file as I could download it again, and moved on. After a few weeks a flatpak app wouldn't start. I read dmesg and saw a btrfs message about some corrupted inode or something like that. I used find to locate the file at that inode; it was the flatpak. Again I assumed it was a btrfs bug, reinstalled the flatpak and moved on.

Then yesterday the system froze. This time I tested with memtest86. It failed immediately, within seconds. Took out one stick, swapped them, no change. I went back and swapped them with the other two sticks bought in the same lot; those would pass the test.

Hm, sounds like it. Micron is certainly reputable, and the issues hint at memory, although I've had similar things that turned out not to be RAM. Certainly very weird!

On another note, I personally absolutely do not trust btrfs, due to its creator and its very long history of shaky, lossy RAID issues. Since there are enough performant and long-proven reliable filesystems available, I use others.

Anyhow, getting to the point - I'd not entirely rule out something with btrfs as a separate issue too. I've just seen too many cases where something that looked like RAM issues turned out to be a few freak issues combining to produce this.

Although I fully agree with your deduction!

100% consistent with static damage.

At least it took less than 2 minutes to find....

Is this your board? https://www.youtube.com/watch?v=K0Fh-VTAf3U

yes, and I agree with it being TOTAL TRASH

  • It takes two boots to start; the first boot can't see any drives connected (SATA, NVMe). If you boot some OS installer from USB, the installer won't see any drive until a reboot.
  • It has no support whatsoever. BIOS update? LOL, you get the alpha "if it compiles, it ships" version.
  • The BIOS is in engineering mode and has like 15 pages of incomprehensible options.
  • Even though the BIOS has 15 pages of ultra-detailed engineer-only options, it's missing basic options like "numlock on at start", "WOL", fan control, and other stuff that I forgot about (or maybe they're buried somewhere under some weird acronym).

This can happen due to static discharge. It's mandatory to follow static-safe handling procedures when installing PC parts.

The computer worked fine for at least two years, then today it started to randomly freeze without me touching any hardware...

That's the thing about ESD damage - it can reduce reliability or operating margins. It rarely shows up as "whole part entirely nonworking".

I'm not sure if "mandatory" is the right word for it. More like best practice or something.

It's mandatory if you don't want to damage CMOS gates.

I'm just saying that most of the time nothing bad will happen; it's not like you're constantly shooting static electricity bolts. But it's still a good idea.

Yeah, just best practice. I've built 5 desktops over the years, never cared about static discharge, and only ever had a mobo fail after a 4-5 year lifespan.

People think this and end up like OP.

Often the symptoms resulting from damage are subtle, irritating, and situationally intermittent.

Maybe you built your PC on a humid day or didn't happen to do anything to raise a charge, but I wouldn't go around telling other people that it's not necessary.

So what is best practice?

Only the other week some of my SSDs failed. So little in stock and such high prices too. 😔

I'd test each stick one at a time to confirm it's actually both sticks that are dead, if you haven't.

Could it be caused by a power surge, maybe? A lightning strike? Just thinking out loud. How could this happen if your RAM sticks were running fine for years?