Bad memory on node MOC-R4PAC10U13-S3 #1390

Open
larsks opened this issue Sep 24, 2024 · 4 comments
Labels
mghpcc MGHPCC related tasks

Comments

@larsks
Member

larsks commented Sep 24, 2024

ESI node MOC-R4PAC10U13-S3 (service tag J16N3Y2) failed earlier today with:

    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:44 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:44 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.
    Tue Sep 24 2024 12:55:44 Multi-bit memory errors detected on a memory device at location(s) DIMM_B2.

The errors did not repeat after a cold boot, but perhaps we should have the memory replaced.
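
To check that every logged event points at the same DIMM, lines in the format quoted above can be tallied with a short script. This is a minimal sketch, not a tool used on this node: the function name `count_multibit_errors` is hypothetical, and it assumes each line names a single location and follows exactly the timestamp and message layout shown in this comment.

```python
import re
from collections import Counter

# Assumption: each line looks like the ones quoted in this issue, e.g.
# "Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected on a
#  memory device at location(s) DIMM_B2."
LINE_RE = re.compile(
    r"^\s*(?P<ts>\w{3} \w{3} +\d{1,2} \d{4} \d{2}:\d{2}:\d{2}) "
    r"Multi-bit memory errors detected on a memory device "
    r"at location\(s\) (?P<dimm>[A-Za-z0-9_]+)\.?\s*$"
)


def count_multibit_errors(log_text):
    """Return a Counter mapping DIMM location to number of logged events."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:  # lines in any other format are ignored
            counts[match.group("dimm")] += 1
    return counts


if __name__ == "__main__":
    sample = (
        "    Tue Sep 24 2024 12:55:45 Multi-bit memory errors detected "
        "on a memory device at location(s) DIMM_B2.\n"
        "    Tue Sep 24 2024 12:55:44 Multi-bit memory errors detected "
        "on a memory device at location(s) DIMM_B2.\n"
    )
    print(count_multibit_errors(sample))
```

If more than one location shows up in the tally, that would suggest a problem beyond a single module.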

@schwesig

From a call today:
The ECC logic is probably no longer able to correct the errors on that DIMM. That might be why the counter clears out on reboot.

@larsks larsks added the mghpcc MGHPCC related tasks label Sep 25, 2024
@hakasapl

hakasapl commented Oct 9, 2024

@larsks Is this node currently in use? I see it as active in ESI.

@larsks
Member Author

larsks commented Oct 9, 2024

@hakasapl I thought we pulled it from the cluster, but @tssala23 is probably the best person to ask.

@tssala23

tssala23 commented Oct 9, 2024

@larsks @hakasapl We didn't pull it from the cluster. When it had the memory issue, it wouldn't boot until someone told it that it was "okay" to boot. Lars, you went in and did that, and it has been fine in terms of working with the cluster. So it is still active in the beta-test cluster.
