k1nfs0 stopped responding to requests from other nodes.
Since it also did not respond to commands from the physical console, I performed a forced reboot.
I also recovered the NFS mounts (/kagra and /users) on each NFS client.
If you find any unrecovered nodes (I may have missed some), please let me know.
-----
I noticed that a new console could not be launched on the workstations because the NFS area was disconnected. According to the logs of some processes that depend on the NFS area, it went down around 2:30am-2:40am. This occurred on multiple nodes, so the problem seemed to be on k1nfs0 or the core network switch rather than on the workstations. When I checked the console of k1nfs0, it did not respond to any command and showed the messages in Fig.1.
According to the messages, a CPU core apparently did not return to a state controllable by the kernel after it had been running some kind of task. In any case, the reboot and shutdown commands could not be executed, so I performed a forced reboot. The BMC interface was not available because the primary NIC port was used for the PICO network instead of the DGS network, so I rebooted the machine with the power switch instead of the BMC power control interface. When the BMC is used in shared LAN mode, the primary NIC must be assigned to the DGS network. For future troubleshooting, we must either change the NIC settings with NetworkManager or set the BMC to dedicated LAN mode. Otherwise, physical access will be required during an emergency recovery, making remote recovery impossible.
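Once the BMC is reachable over the DGS network, its availability could be checked from another node from time to time, so that a broken BMC path is noticed before the next emergency. The following is only a minimal sketch in Python, assuming that `ipmitool` is installed on the checking node and that the BMC accepts IPMI-over-LAN. The host name `k1nfs0-bmc` and the user name `admin` are placeholders, not actual settings.

```python
import subprocess


def parse_power_status(output):
    """Return the state word (e.g. 'on' or 'off') from the output of
    `ipmitool chassis power status`, or None if it is not found."""
    for line in output.splitlines():
        words = line.strip().lower().split()
        # Expected form: "Chassis Power is on"
        if words[:3] == ["chassis", "power", "is"] and len(words) >= 4:
            return words[3]
    return None


def bmc_power_status(host, user, timeout=10.0):
    """Ask the BMC for the chassis power state.

    None means the BMC could not be reached or gave an unexpected answer.
    ipmitool reads the password from the IPMI_PASSWORD environment
    variable because of the -E option.
    """
    cmd = ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
           "chassis", "power", "status"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    return parse_power_status(result.stdout)


if __name__ == "__main__":
    # Placeholder host and user names (assumptions for illustration).
    status = bmc_power_status("k1nfs0-bmc", "admin")
    if status is None:
        print("BMC unreachable")
    else:
        print(f"chassis power is {status}")
```

If this check reports the BMC as unreachable while the host itself is up, the shared/dedicated LAN setting or the NIC assignment is the first thing to look at.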
After the reboot, k1nfs0 started up. I checked the boot logs and found that the boot disk was first mounted in RO mode and then re-mounted in RW mode. On the other hand, there were no SMART reports for any of the disks, including the boot disk. Though k1nfs0 is running normally now, the SATA path (cable, controller, or traces on the motherboard) might be aging. It may be a good time to replace the hardware (and also the legacy OS).
Finally, I recovered the NFS mounts on all clients, and the system was restored. However, there are several tens of clients and I may have missed some of them. If you find any unrecovered nodes, please let me know.
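To check a client quickly, something like the following Python sketch could be run on it. It is not the procedure I actually used, only an illustration: it looks for /kagra and /users in /proc/mounts and then probes each one with `stat` in a child process with a timeout, because direct access to a dead NFS mount can block forever. The function names and the 5-second timeout are arbitrary choices.

```python
import subprocess

MOUNT_POINTS = ["/kagra", "/users"]  # NFS areas named in this report


def is_nfs_mounted(path, mounts_text):
    """True if `path` is listed as an NFS mount point in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, ...
        if len(fields) >= 3 and fields[1] == path and fields[2].startswith("nfs"):
            return True
    return False


def responds(path, timeout=5.0, probe=("stat",)):
    """True if `probe path` exits with status 0 within `timeout` seconds."""
    proc = subprocess.Popen([*probe, path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    try:
        return proc.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # Best effort only: do not wait again, the child may be stuck in I/O.
        proc.kill()
        return False


def check(path, mounts_text):
    """Classify a mount point as 'not mounted', 'not responding', or 'ok'."""
    if not is_nfs_mounted(path, mounts_text):
        return "not mounted"
    return "ok" if responds(path) else "not responding"


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        mounts = f.read()
    for mount_point in MOUNT_POINTS:
        print(f"{mount_point}: {check(mount_point, mounts)}")
```

A node that prints anything other than "ok" for either area probably still needs its NFS mounts recovered.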









