Abstract
We found that K1IOO1 (k1iopioo1, k1alsfib, and k1psliss) had been dead since around 13:00 JST on May 23rd. K1IOO1 was finally recovered by replacing the HIB host and adapter cards that connect the front-end computer and the IO chassis.
Due to an undesirable connection on the power line between the IO chassis of K1IOO0 and K1IOO1, we also had to reboot K1IOO0 after recovering K1IOO1.
An IRIG-B synchronization issue then occurred on K1IOO0, and it will probably take several hours to settle.
Please restart all models on K1IOO0 once after the IRIG-B timing has become stable, which should be the case by tomorrow morning.
Details
After the morning briefing, we found that K1IOO1 had been dead since around 13:00 JST on May 23rd. It was not the kernel trouble that sometimes occurs, and we were able to access K1IOO1 remotely. However, none of the PCIe cards in the IO chassis could be detected properly with the lspci command (e.g. 10b5:9056 is the ID of the General Standards product).
$ lspci -nvvv | grep 10b5:9056 -A1
14:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
!!! Unknown header type 7f
--
16:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
!!! Unknown header type 7f
--
For this reason, a model restart did not seem likely to resolve the issue, and either a system reboot (the better case) or a power cycle of the IO chassis (the worse case) seemed necessary. So we asked the commissioners to clear all SDF differences in the morning (klog#36948), and we started the recovery work this afternoon.
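The lspci check above could be automated along the following lines. This is only a minimal sketch, not part of the actual recovery procedure: the function name find_unresponsive and the sample text are hypothetical, and it assumes that a card showing "rev ff" or "Unknown header type 7f" is unreachable, as was observed here.

```python
import re

# Matches the first line of an `lspci -n` device entry, e.g.
# "14:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)"
DEVICE_LINE = re.compile(
    r"^(?P<slot>[0-9a-f:.]+)\s+(?P<cls>[0-9a-f]{4}):\s+"
    r"(?P<vendor>[0-9a-f]{4}):(?P<device>[0-9a-f]{4})"
    r"(?:\s+\(rev (?P<rev>[0-9a-f]{2})\))?"
)


def find_unresponsive(lspci_text, pci_id="10b5:9056"):
    """Return the slots of devices with the given ID that look unreachable.

    Hypothetical helper: a device is flagged when its revision reads back
    as ff or when lspci reports "Unknown header type 7f" for it.
    """
    bad = []
    current = None  # slot of the matching device whose lines are being read
    for line in lspci_text.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            current = None
            if f"{m['vendor']}:{m['device']}" == pci_id:
                current = m["slot"]
                if m["rev"] == "ff":
                    bad.append(current)
        elif current and "Unknown header type 7f" in line and current not in bad:
            bad.append(current)
    return bad


# Sample text copied from the output shown in this entry.
sample = """\
14:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
!!! Unknown header type 7f
--
16:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
!!! Unknown header type 7f
--
"""

print(find_unresponsive(sample))  # ['14:04.0', '16:04.0']
```

On a front-end computer, the input text would be the output of lspci -nvvv. An empty list would suggest that the cards with that ID respond normally.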
At first, we did a visual inspection around the IO chassis, and it seemed to be running properly (the LEDs on the baseboard and PCIe cards were blinking and the timing slave was synchronized). If this issue had been caused by a momentary power outage, both IOO0 and IOO1 should have died at the same time, but only K1IOO1 was dead this time. So it did not seem to be a momentary power outage upstream of the power distribution board in the IO chassis. Still, as a very rare case, we suspected some kind of problem on the power supply path inside the IO chassis. Just in case, we shut down the front-end computer after disabling the Dolphin connection, unplugged the Dolphin cable, and then booted the front-end computer again. However, the PCIe cards still could not be found by the real-time OS.
Because this issue did not seem to be an instantaneous power supply trouble, we suspected a malfunction of the IO chassis itself (in the past, some capacitors on the baseboard of an IO chassis broke due to aging, as in klog#16759). Such a case can be identified by a power cycle. However, the IO chassis was able to boot up without problems, except that the real-time OS still could not find any PCIe cards in the IO chassis.
After confirming that it was not a problem with the baseboard of the IO chassis, we tried swapping the HIB host and adapter cards that connect the front-end computer and the IO chassis. HIB card trouble was the only remaining cause that we had faced in the past (e.g. klog#6174). First, we replaced only the HIB adapter card with one that was used for EX0 before the replacement with the V2 IO chassis in klog#36654. Because the situation did not change, we also replaced the HIB host card with one that was used for IX1 before the replacement with the V2 IO chassis in klog#36572 (the one used in EX0 had already been brought back to Mozumi, so we were not able to use it today). After that, we remembered the compatibility issue (klog#33429), so we replaced the HIB adapter card with the one that had been used in IX1 and was still installed in the old V1 IO chassis. K1IOO1 was then recovered with the pair of HIB host and adapter cards that had been used in IX1.
During the work above, we turned off the power breaker of K1IOO1. Somehow the breaker switch affected the IO chassis of K1IOO0, and the front-end computer of K1IOO0 lost its IO chassis, so we also needed to reboot K1IOO0. It came back with just a reboot of the front-end computer after disabling the Dolphin connection. Unfortunately, an IRIG-B synchronization issue then occurred, and it seems that it will take several hours to settle. IRIG-B synchronization will probably become stable around 0-1 am. After that, all the real-time models on K1IOO0 must be restarted to recover them, so please restart all models tomorrow morning. Until then, we cannot lock PMC/IMC and so on.