DGS (General)
takahiro.yamamoto - 21:46 Monday 11 August 2025 (34808)
One of the network interfaces seems to be broken on k1dc0

Abstract

As reported in klog#34806, k1dc0 hung up today.
Although I tried to recover it, it did not come back online, probably because of a hardware problem.
My impression is that we need to replace one of the Myricom PCIe cards on k1dc0 in the mine.
So I gave up on recovering it today.

Finally, I requested the DOWN state on the LSC_LOCK guardian.
Note that all fast channels are now unavailable.
So the IFO status can be checked only through EPICS channels on MEDM and StripTool (not on ndscope).

By the way, at this point it is unclear whether the hardware trouble was the cause or the result of this issue.
 

Details

When I started investigating this issue, k1dc0 was powered OFF according to the BMC web interface (see also Fig.1), and k1dc0 was the only node that was powered OFF. This is a slightly strange situation. If it were an OS hang-up such as klog#34618, the power itself should be ON. If it were a power source and/or UPS issue, other DAQ nodes should also be down.

Because I couldn't find anything else strange on the BMC interface or in the network status on the web interface of the network switch (see Fig.2), I tried to turn k1dc0 ON. The OS then booted without any problem and the daqd process also came back online. But mx_stream, which is the data transfer process between the real-time models and daqd@k1dc0, didn't come back, and it could not be recovered by a manual restart of mx_stream, which failed with the following errors.
cat: /var/log/mx_stream.pid: No such file or directory
chrt: failed to get pid 99's policy: No such process

The following error can also be seen in the mx_stream log file (this example is from k1lsc0).
OMX: Failed to find peer index of board 7f:8a:39:68:fc:50 (Peer Not Found in the Table)
mx_connect failed Nic ID not Found in Peer Table
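
When many front-ends report this error at once, it can help to collect the board addresses they fail to find. The following is only a minimal sketch, assuming the log lines have exactly the wording quoted above; the names `PEER_ERROR` and `missing_peers` are hypothetical and not part of mx_stream or Open-MX.

```python
import re

# Pattern for the "peer not found" line quoted above (wording assumed stable).
PEER_ERROR = re.compile(
    r"OMX: Failed to find peer index of board "
    r"(?P<mac>(?:[0-9a-f]{2}:){5}[0-9a-f]{2})",
    re.IGNORECASE,
)


def missing_peers(log_text: str) -> list[str]:
    """Return the unique board addresses reported as missing, in first-seen order."""
    seen: list[str] = []
    for match in PEER_ERROR.finditer(log_text):
        mac = match.group("mac").lower()
        if mac not in seen:
            seen.append(mac)
    return seen
```

If every front-end reports the same single address, the problem is more likely on the data concentrator side than on the individual front-ends.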


So I stopped mx_stream@k1lsc0 once, restarted open-mx@k1lsc0, and then started mx_stream@k1lsc0 again. But this procedure didn't work either. According to /opt/open-mx/bin/omx_info, open-mx on the RTFE doesn't seem to find mx on k1dc0. Checking some logs on k1dc0, I soon found that only one Myricom PCIe card was detected by the OS. One of the two Myricom cards on k1dc0 might be broken for some reason.
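
The card count above was read from logs. As a different and only illustrative way to cross-check it, the sketch below counts the PCI devices that `lspci` lists for the Myricom vendor ID. It assumes `lspci` is installed on the host and that the cards in question use the vendor ID 14c1; the function names are hypothetical.

```python
import subprocess

# PCI vendor ID registered to Myricom (assumed to apply to the cards in k1dc0).
MYRICOM_VENDOR_ID = "14c1"


def count_devices(lspci_output: str) -> int:
    """Count non-empty lines, i.e. one per PCI function listed by lspci."""
    return sum(1 for line in lspci_output.splitlines() if line.strip())


def detected_myricom_cards() -> int:
    """Ask lspci for devices of the Myricom vendor ID and count them."""
    result = subprocess.run(
        ["lspci", "-d", f"{MYRICOM_VENDOR_ID}:"],
        capture_output=True,
        text=True,
        check=True,
    )
    return count_devices(result.stdout)
```

Calling `detected_myricom_cards()` on the data concentrator and comparing the result with the number of installed cards gives a quick check. A multi-function card may appear as more than one line, so the count is only a rough indicator.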

In conclusion, what we must do is replace the Myricom card in the mine (I'm not sure where the spare Myricom cards are stored; maybe the SK building? Anyway, I cannot find one in the Mozumi building.). So I gave up on recovering it today. After replacing the Myricom card on k1dc0, rebooting all real-time front-ends may be required to reconstruct the MX peer table.

Finally, I requested the DOWN state on the LSC_LOCK guardian because it's unhealthy to continue FIND_RESONANCE sweeps. If it's better to keep the ARM lock, please request it. But in that case please note that fast channels are now unavailable.
Images attached to this report
Comments to this report:
takahiro.yamamoto - 16:56 Tuesday 12 August 2025 (34809)
[Oshino, YamaT]

Abstract

All DAQ streams came back online.
Although we misunderstood this trouble because of abnormal BMC settings without proper logs, we found that it was a kernel panic issue, the same as klog#34618.
Also, fortunately, there is no broken Myricom card.

Details

At the beginning of the recovery work, we checked the indicators of the Myricom cards and both cards seemed to be alive, though the OS certainly found only 1 card. So we decided to check for a loose connection or a broken card. In order to remove the Myricom card, we shut down k1dc0. But the power of the computer chassis did not turn off, and the physical console was showing kernel hang-up messages.

So I soon suspected incomplete BMC settings, and this suspicion turned out to be correct. The BMC IP for k1dc0 listed in the DGS Wiki was still used by the old computer chassis, and the current k1dc0 used another IP address for its BMC. The BMC setup was probably skipped when the current k1dc0 was installed. Because this fact wasn't reported on klog, the wiki, etc., it took us overnight to notice. Anyway, the reason why k1dc0 found only 1 Myricom card yesterday was that the old k1dc0 had been booted via the wrong BMC interface. We rebooted the current k1dc0 after stopping the old one. Then we were able to confirm that 2 Myricom cards were available on k1dc0. Also, k1dc0 and the RTFEs found each other in the peer table of MX/Open-MX.
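
A mismatch like this could in principle be caught by comparing the BMC address a host reports with the address written in the wiki. The sketch below is only an illustration under assumptions: it parses text in the `Key : Value` layout printed by `ipmitool lan print`, and the function names and the idea of a documented address passed in by hand are hypothetical.

```python
from typing import Optional


def parse_bmc_ip(lan_print_output: str) -> Optional[str]:
    """Extract the value of the 'IP Address' line from 'Key : Value' text."""
    for line in lan_print_output.splitlines():
        key, sep, value = line.partition(":")
        # Match the key exactly so that 'IP Address Source' is not picked up.
        if sep and key.strip() == "IP Address":
            return value.strip()
    return None


def bmc_matches_inventory(lan_print_output: str, documented_ip: str) -> bool:
    """True when the BMC address reported by the host equals the documented one."""
    return parse_bmc_ip(lan_print_output) == documented_ip
```

Running such a check on the host itself, right after installation, would show whether the documented BMC address really belongs to the machine that is in service.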

If we face the unfortunate situation of needing to switch k1dc0 from the current machine to the old one, we will probably enter the mine to check the situation anyway. This means keeping the BMC interface of the old k1dc0 is unnecessary. So we unplugged the power cables from the old k1dc0 in order to stop its BMC. We then set the proper BMC IP address for the current k1dc0 on the console. After this setting, we tried to access the BMC web interface, but the login information was not the default and was not shared anywhere. In this situation, the BMC interface isn't available when trouble occurs. So we stopped k1dc0, unplugged all cables, and opened the chassis to see the BMC login information, which is written on the baseboard.

After confirming that the BMC was accessible, we reconnected all cables of the current k1dc0 and booted it up. Then the Myrinet connection between daqd@k1dc0 and the models on each RTFE was established properly. Finally, we restarted daqd from the DAQ guardian in order to clear undesirable old settings in running processes, and all DAQ streams came back online.

Timeline
1105 mine in
1108 server room in
1215 server room out
1218 mine out
takahiro.yamamoto - 20:21 Tuesday 12 August 2025 (34812)
Frames absent because of this issue are as follows.

full/science (NDS0)
[1438926048, 1438943776)
[1438945408, 1438945632)
[1438946688, 1438946976)
[1439000224, 1439000352)
[1439000576, 1439001888)
[1439002080, 1439003104)
[1439003232, 1439003328)


full/science (NDS1, Kashiwa)
[1438926048, 1438943776)
[1438945408, 1438945600)
[1438946688, 1438946976)
[1439000224, 1439000352)
[1439000576, 1439001856)
[1439002080, 1439003104)
[1439003232, 1439003328)


trend/second (NDS0, NDS1, Kashiwa)
[1438926000, 1438943400)
[1439000400, 1439002800)


trend/minute (NDS0, NDS1, Kashiwa)
[1438923600, 1438941600)
[1438999200, 1439002800)


LL (Kashiwa)
[1438926076, 1438945280)
[1438945419, 1438949376)
[1439000675, 1439002624)
[1439002625, 1439006720)
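
The lists above are half-open GPS intervals, so the lost time per stream is the sum of `stop - start`. The following minimal sketch totals the full/science gaps for NDS0 and NDS1 and converts a GPS time to UTC. It assumes a GPS-UTC offset of 18 leap seconds, which holds for dates since 2017; the helper names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)
LEAP_SECONDS = 18  # GPS minus UTC, assumed valid for August 2025

# full/science gaps copied from the lists above, as [start, stop) in GPS seconds.
FULL_NDS0 = [
    (1438926048, 1438943776), (1438945408, 1438945632),
    (1438946688, 1438946976), (1439000224, 1439000352),
    (1439000576, 1439001888), (1439002080, 1439003104),
    (1439003232, 1439003328),
]
FULL_NDS1 = [
    (1438926048, 1438943776), (1438945408, 1438945600),
    (1438946688, 1438946976), (1439000224, 1439000352),
    (1439000576, 1439001856), (1439002080, 1439003104),
    (1439003232, 1439003328),
]


def total_seconds(gaps):
    """Sum the lengths of half-open [start, stop) intervals."""
    return sum(stop - start for start, stop in gaps)


def gps_to_utc(gps_seconds):
    """Convert GPS seconds to a UTC datetime using the fixed offset above."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - LEAP_SECONDS)


if __name__ == "__main__":
    for name, gaps in (("NDS0", FULL_NDS0), ("NDS1", FULL_NDS1)):
        print(name, total_seconds(gaps), "s missing, first gap starts at",
              gps_to_utc(gaps[0][0]).isoformat())
```

The same two helpers can be applied to the trend and LL lists by copying their intervals in the same form.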