Wednesday, September 6, 2023

ChatGPT actually gave some useful suggestions!




Thought I'd share them here.

Physical impacts can cause a range of issues, from visible damage to subtle problems that only manifest under specific conditions. Here's a systematic approach to surveying the motherboard for damage:


1. **Visual Inspection**:

   - **Magnifying Glass or Loupe**: Use a magnifying glass or a loupe to inspect the motherboard closely. Look for any visible cracks, broken traces, or damaged components.

   - **Check Connectors**: Ensure that all connectors, including the PCIe slots, are intact and not bent or broken.

   - **Inspect Solder Joints**: Look for any solder joints that might have cracked or become loose due to the impact.

   - **Check for Debris**: Ensure there's no debris or foreign objects lodged between components or traces.


2. **Physical Inspection**:

   - **Press Gently**: Gently press down on various parts of the motherboard to see if there's any unusual movement, which could indicate a broken stand-off or mounting point.

   - **Check the Back**: Turn the motherboard over and inspect the back for any signs of damage or unusual marks.


3. **Functional Testing**:

   - **Power On**: If the visual and physical inspections don't reveal any issues, consider powering on the server. Listen for any unusual sounds and observe any irregular LED patterns.

   - **BIOS/UEFI Check**: Access the BIOS/UEFI and check for any hardware errors or warnings. Some systems run a self-test during startup and might report issues.

   - **PCIe Slot Test**: Since the impact dislodged a card from the PCIe slot, test that specific slot with a known working card to ensure it's still functional. Also, test the previously dislodged card in another slot or system to ensure the card itself isn't damaged.

   - **Stress Testing**: Use software tools to stress-test the system. Tools like Prime95, Memtest86, and FurMark can help identify issues with the CPU, memory, and GPU, respectively.


4. **Check Peripheral Components**:

   - **RAM**: Ensure that all RAM modules are seated correctly. Consider running a memory test to ensure they're functioning correctly.

   - **Storage**: Check that all storage devices (HDDs, SSDs) are securely connected and are recognized by the system.

   - **Cooling**: Ensure that all fans are operational and that heatsinks are securely attached.


5. **External Inspection**:

   - **Case and Chassis**: Check the server case and chassis for any signs of deformation or damage. Ensure that the motherboard is securely and correctly mounted within the case.


6. **Seek Expert Assistance**:

   - If you're unsure about any findings or if the server behaves erratically, consider seeking assistance from a professional or a specialized repair service. They might have specialized equipment, like multimeters or oscilloscopes, to diagnose issues more accurately.


7. **Document Everything**:

   - Take photos of any visible damage or irregularities. This can be useful for insurance claims, future reference, or if you decide to seek professional assistance.


Remember, even if the server appears to function normally after the impact, there could be latent issues that might manifest later. It's essential to be thorough in your inspection and testing to ensure the server's reliability and longevity.
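One part of the functional testing above that is easy to script is checking whether the cards still show up on the PCIe bus after reseating them. Below is a minimal sketch, assuming the server boots into Linux with `lspci` (from pciutils) installed. The helper names `parse_lspci`, `missing_devices` and `read_lspci`, the keywords, and the sample listing are all made up for illustration; none of it is output from the actual ML350.

```python
import subprocess


def parse_lspci(text):
    """Map PCI address -> (device class, description) from plain `lspci` output."""
    devices = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like: "03:00.0 Class name: Vendor and device"
        address, _, rest = line.partition(" ")
        dev_class, _, description = rest.partition(": ")
        devices[address] = (dev_class, description)
    return devices


def missing_devices(expected_keywords, devices):
    """Return the expected keywords that match no enumerated device."""
    found_text = [f"{c} {d}".lower() for c, d in devices.values()]
    return [kw for kw in expected_keywords
            if not any(kw.lower() in t for t in found_text)]


def read_lspci():
    """Run `lspci` on the local machine (assumes Linux with pciutils)."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout


# Illustrative sample only, not taken from the real server.
SAMPLE = """\
00:00.0 Host bridge: Example Corp Host Bridge
03:00.0 Serial Attached SCSI controller: Example Corp SAS HBA
04:00.0 Ethernet controller: Example Corp Gigabit NIC
"""

if __name__ == "__main__":
    devices = parse_lspci(SAMPLE)  # swap in read_lspci() on the real machine
    # Hypothetical keywords: one per card that should be present.
    print(missing_devices(["SAS", "Ethernet"], devices))
```

Any keyword that comes back in the list is a card the system did not enumerate, which would point at the card, the slot, or the seating. An empty list only says the card is visible, not that the slot is healthy under load.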


Tuesday, September 5, 2023

ML350 G9, the continuing saga.



Part I: I received the server, and the box was pretty 'bashed up'.

The case was pretty 'bashed up' too: it had taken a hard impact on the power supplies (the delivery guys probably rested the case on the ground on them).


Also, the PCIe storage card (for the tape drive and the CD-ROM drive) had 'jumped' out of its PCIe slot. Not good signs. I repaired the power board because the power supplies would not be recognised; in the meantime I had a new power board on the way ($20).

I've since replaced the power board too, with no luck so far. The same error keeps popping up. It's about an EFUSE (20h), but I have no idea where that is. I suspect it might be protecting the PCIe slots (maybe some of the pins have shorted?), but I don't know where to look.
A new motherboard is now on order (~$100; these older parts are getting quite cheap).

According to this post, it could be the PSUs, but they give a 'green light' when plugged in: https://community.hpe.com/t5/proliant-servers-ml-dl-sl/error-power-on-fault-system-board-aux-main-efuse-regulator-1-20h/td-p/7181745

So: Motherboard first, then some 'flex' power supplies. Let's see where this goes.

In the meantime, I also have a storj.io node now. I've already 'made' $0.07.

In other news, I also expanded my NAS by 8 TB, as I am now running Overseerr and people can request stuff.

Just to get it all linked back to one place, here is the link to the HPE forum thread with the same problem (no resolution): https://community.hpe.com/t5/proliant-servers-ml-dl-sl/ml350-gen9-not-booting-with-critical-error-aux-main-efuse/m-p/7180208/thread-id/180199
And my own post on Reddit describing my 'pains' with the server board: https://www.reddit.com/r/homelab/comments/168o7ib/help_me_resurrect_my_ml350_g9/