Abstract
As reported in klog#34806, k1dc0 hung up today. I tried to recover it, but it did not come back online, probably because of a hardware problem.
My impression is that we need to replace one of the Myricom PCIe cards on k1dc0 in the mine.
So I gave up on recovering it today.
Finally, I requested the DOWN state on the LSC_LOCK guardian.
Note that all fast channels are now unavailable.
So the IFO status can be checked only through EPICS channels on MEDM and StripTool (not on ndscope).
At this point, it is unclear whether the hardware trouble was the cause or the result of this issue.
Details
When I started investigating this issue, k1dc0 was powered OFF according to the BMC web interface (see also Fig.1), and it was the only node that was powered OFF. This is a slightly strange situation. If it were an OS hang-up such as klog#34618, the power itself should be ON. If it were a power source and/or UPS issue, other DAQ nodes should also be down. Because I couldn't find anything else strange on the BMC interface or in the network status on the web interface of the network switch (see Fig.2), I tried to turn k1dc0 ON. The OS then booted without any problem and the daqd process also came back online. But mx_stream, the data transfer process between the real-time models and daqd@k1dc0, didn't come back, and a manual restart of mx_stream failed with the following errors.
cat: /var/log/mx_stream.pid: No such file or directory
chrt: failed to get pid 99's policy: No such process .
The following error also appears in the mx_stream log file (this example is from k1lsc0).
OMX: Failed to find peer index of board 7f:8a:39:68:fc:50 (Peer Not Found in the Table)
mx_connect failed Nic ID not Found in Peer Table
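To check quickly whether several front-ends report the same missing peer, the log lines above can be scanned for the board address. The following is only a minimal sketch: the function name missing_peers is hypothetical, and it assumes the error text has exactly the form quoted above.

```python
import re

# Matches the Open-MX error quoted above and captures the board (MAC) address.
PEER_ERROR = re.compile(
    r"Failed to find peer index of board ((?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def missing_peers(log_text):
    """Return the sorted, de-duplicated board addresses reported as missing peers."""
    return sorted({m.group(1).lower() for m in PEER_ERROR.finditer(log_text)})


if __name__ == "__main__":
    # Sample text copied from the mx_stream log on k1lsc0 shown above.
    sample = (
        "OMX: Failed to find peer index of board 7f:8a:39:68:fc:50 "
        "(Peer Not Found in the Table)\n"
        "mx_connect failed Nic ID not Found in Peer Table\n"
    )
    print(missing_peers(sample))
```

If every front-end log yields the same single address, that points to one missing interface on the data concentrator side rather than to a problem on each front-end.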
So I stopped mx_stream@k1lsc0 once, restarted open-mx@k1lsc0, and then started mx_stream@k1lsc0 again. But this procedure didn't work either. According to /opt/open-mx/bin/omx_info, open-mx on the RTFE doesn't seem to find mx on k1dc0. Checking some logs on k1dc0, I soon found that only one Myricom PCIe card was detected by the OS. One of the two Myricom cards on k1dc0 might be broken for some reason.
In conclusion, what we must do is replace the Myricom card in the mine (I'm not sure where the spare Myricom cards are stored; maybe the SK building? Anyway, I cannot find one in the Mozumi building.). So I gave up on recovering it today. After replacing the Myricom card on k1dc0, rebooting all real-time front-ends may be required to reconstruct the MX peer table.
Finally, I requested the DOWN state on the LSC_LOCK guardian because it's unhealthy to keep repeating FIND_RESONANCE sweeps. If it's better to keep the ARM lock, please request it. But in that case, please note that fast channels are now unavailable.
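The card count can also be cross-checked from the PCI device list, for example after the replacement. This is only a sketch and not the procedure used above, which relied on the logs on k1dc0. It assumes that lspci is available on k1dc0 and that its description of these cards contains the string "Myri" (as in "MYRICOM" or "Myri-10G"); the function names are hypothetical.

```python
import subprocess


def count_myricom_cards(lspci_text):
    """Count lspci lines whose description mentions Myricom (assumed substring 'myri')."""
    return sum(1 for line in lspci_text.splitlines() if "myri" in line.lower())


def read_lspci():
    """Return the plain-text output of lspci on the local host."""
    result = subprocess.run(["lspci"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    expected = 2  # k1dc0 should have two Myricom cards according to this entry
    found = count_myricom_cards(read_lspci())
    print(f"Myricom cards detected: {found} (expected {expected})")
```

A result smaller than 2 on k1dc0 would be consistent with one card not being detected by the OS.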