KAGRA Logbook

DGS (General)

takahiro.yamamoto - 22:58 Monday 25 May 2026 (36952)

Lost PCIe connection on K1IOO1

[Ikeda, YamaT]

Abstract

We found that K1IOO1 (k1iopioo1, k1alsfib, and k1psliss) had been dead since around 13:00 JST on May 23rd.
K1IOO1 was finally recovered by replacing the HIB host and adapter cards that connect the front-end computer and the IO chassis.
Because of an undesirable coupling on the power line between the IO chassis of K1IOO0 and K1IOO1, we also had to reboot K1IOO0 after recovering K1IOO1.
An IRIG-B synchronization issue then occurred on K1IOO0, and it will probably take several hours to settle.
Please restart all models on K1IOO0 once the IRIG-B timing has become stable, i.e. tomorrow morning at the earliest.

Details

After the morning briefing, we found that K1IOO1 had been dead since around 13:00 JST on May 23rd. It was not the kernel trouble that sometimes occurs, and we were able to access K1IOO1 remotely. However, none of the PCIe cards in the IO chassis could be recognized properly by the lspci command (e.g. 10b5:9056 is a General Standards product).

$ lspci -nvvv | grep 10b5:9056 -A1
14:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
--
16:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
--
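
The "rev ff", "prog-if ff" and "Unknown header type 7f" fields above are what lspci typically prints when configuration-space reads return all ones, i.e. when the card no longer answers over the link. As a minimal sketch of how this check could be automated, the Python snippet below scans text in the same format as the output above and lists the slots that show this symptom. The function name find_unresponsive and the regular expression are assumptions made for this illustration and are not part of any DGS tool.

```python
import re

# Sample text copied from the lspci output shown above.
SAMPLE = """\
14:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
--
16:04.0 1180: 10b5:9056 (rev ff) (prog-if ff)
        !!! Unknown header type 7f
--
"""

# Matches a device line of `lspci -n`: [domain:]bus:dev.fn class: vendor:device ...
DEVICE_LINE = re.compile(
    r"^(?P<slot>(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7])\s+"
    r"\S+:\s+(?P<id>[0-9a-f]{4}:[0-9a-f]{4})"
)


def find_unresponsive(lspci_text):
    """Return (slot, vendor:device) pairs whose entry looks unreachable."""
    dead = []
    current = None
    for line in lspci_text.splitlines():
        m = DEVICE_LINE.match(line)
        if m:
            # New device entry; a revision of ff suggests all-ones reads.
            current = (m.group("slot"), m.group("id"))
            flagged = "(rev ff)" in line
        elif current is not None and "Unknown header type 7f" in line:
            # Continuation line that belongs to the current device.
            flagged = True
        else:
            continue
        if flagged and current not in dead:
            dead.append(current)
    return dead


if __name__ == "__main__":
    for slot, dev_id in find_unresponsive(SAMPLE):
        print(slot, dev_id)
```

On a real front-end, the text would come from the same lspci command as above instead of the SAMPLE string. A healthy card reports its real revision, so it is not listed.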

For this reason, a model restart did not appear likely to resolve the issue, and it seemed that either a system reboot (the better case) or a power cycle of the IO chassis (the worse case) would be necessary. So we asked the commissioners to clear all SDF differences in the morning (klog#36948), and we started the recovery work this afternoon.

First, we did a visual inspection around the IO chassis, and it seemed to be running properly (the LEDs on the baseboard and the PCIe cards were blinking and the timing slave was synchronized). If this issue had been caused by a momentary power outage, both IOO0 and IOO1 should have died at the same time. But only K1IOO1 was dead this time, so it did not seem to be a momentary power outage upstream of the power distribution board in the IO chassis. Still, as a very rare case, we suspected some kind of problem on the power supply path inside the IO chassis. So, just in case, we shut down the front-end computer after disabling the Dolphin connection, unplugged the Dolphin cable, and then booted the front-end computer again. However, the PCIe cards still could not be found by the real-time OS.

Because this issue did not seem to be an instantaneous power supply trouble, we suspected a malfunction of the IO chassis itself (in the past, some capacitors on the baseboard of an IO chassis failed due to aging, e.g. klog#16759). Such a case can be identified by a power cycle. However, the IO chassis booted up without problems, except that the real-time OS still could not find any PCIe cards in the IO chassis.

After confirming that the baseboard of the IO chassis was not the problem, we tried swapping the HIB host and adapter cards that connect the front-end computer and the IO chassis. HIB card trouble is the only remaining cause that we have faced in the past (e.g. klog#6174). First, we replaced only the HIB adapter card with the one that had been used for EX0 before its replacement with a V2 IO chassis in klog#36654. Because the situation did not change, the HIB host card was also replaced with the one that had been used for IX1 before its replacement with a V2 IO chassis in klog#36572 (the one used in EX0 had already been brought back to Mozumi, so we were not able to use it today). After that, we remembered the compatibility issue (klog#33429), so we replaced the HIB adapter card with the one that had been used in IX1 and was still installed in the old V1 IO chassis. K1IOO1 was then recovered with the pair of HIB host and adapter cards that had previously been used in IX1.
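
The swap sequence above can be written down as a small trial log, which makes it easier to see which card combination brought K1IOO1 back. This is only a sketch for bookkeeping: the labels "original", "ex-EX0" and "ex-IX1" are shorthand introduced here, and the names Trial, working_pairs and same_origin are hypothetical.

```python
from typing import NamedTuple


class Trial(NamedTuple):
    host: str        # HIB host card in the front-end computer
    adapter: str     # HIB adapter card in the IO chassis
    recovered: bool  # whether the real-time OS found the PCIe cards


# Shorthand labels for this sketch only, in the order the swaps were done.
TRIALS = [
    Trial(host="original", adapter="ex-EX0", recovered=False),
    Trial(host="ex-IX1", adapter="ex-EX0", recovered=False),
    Trial(host="ex-IX1", adapter="ex-IX1", recovered=True),
]


def working_pairs(trials):
    """Return the (host, adapter) combinations that recovered the system."""
    return [(t.host, t.adapter) for t in trials if t.recovered]


def same_origin(trial):
    """True when both cards were previously used together in one system."""
    return trial.host == trial.adapter


if __name__ == "__main__":
    print(working_pairs(TRIALS))
    print([same_origin(t) for t in TRIALS])
```

In this log, the only trial that recovered K1IOO1 is also the only one in which both cards came from the same system. That is consistent with the compatibility issue in klog#33429, although three trials are too few to prove it.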

During the work above, we turned off the power breaker of K1IOO1. Somehow, switching the breaker affected the IO chassis of K1IOO0, and the front-end computer of K1IOO0 lost its IO chassis. So we also needed to reboot K1IOO0. It came back with just a reboot of the front-end computer after disabling the Dolphin connection. Unfortunately, an IRIG-B synchronization issue occurred, and it seems that it will take several hours to settle. The IRIG-B synchronization will probably become stable around 0-1 am. After that, all the real-time models on K1IOO0 must be restarted to recover them, so please restart all models tomorrow morning. Until then, we cannot lock PMC/IMC and so on.

Comments to this report:

takahiro.yamamoto - 21:39 Tuesday 26 May 2026 (36954)

[Ikeda, Nakagaki, YamaT]

Abstract

A similar trouble occurred again on K1IOO1 around 1:15 JST on May 26th, and we were not able to resolve it today.
Based on the results of several investigations, the 100-meter HIB cable emerged as the prime suspect.
Two additional tests are needed to confirm the suspicion, and we plan to conduct them tomorrow.
If the HIB cable is indeed the cause of this issue, we will need to run a new cable from the server room to the REFL area.

Details

Because K1IOO1 was dead again, restarting the real-time models on K1IOO0 was postponed and the recovery work on K1IOO1 was resumed.
According to the server logs, it seemed to be the same issue as yesterday's. First, the K1IOO1 front-end was simply rebooted, and we confirmed that the IO chassis was missing from the real-time OS. After that, all cables were removed from the IO chassis (actually, we forgot to remove one DB37 cable for BIO). This was a different procedure, and an investigation under a different situation, from yesterday's, so, strictly speaking, today's investigation could not check the reproducibility or the identity of the problem. (Since we had already confirmed early this morning on the test stand that the HIB card that looked problematic in yesterday's work was not broken, there was no reason to hurry to disconnect the cable out of fear of a malfunction.) In any case, the IO chassis could not be found by the real-time OS after removing all signal cables.

Even with all signal cables removed, an electrical connection remains between the IO chassis and other equipment in the field racks, because the GND of the IO chassis power line (DC24V) is connected to that of DC18V and also to AC. So we prepared a dedicated power supply unit to float the IO chassis completely from the field racks. In this configuration, the IO chassis should work standalone. But the problem was not solved. From this result, we concluded that this issue is different from the trouble on MCF0, in which other equipment around the field racks affected the stability of the IO chassis.

Note that, at this point, we have not been able to determine whether the problem lies with the HIB cable itself or with the combination of the long cable and the power supply conditions around REFL. So we plan to connect the IO chassis of K1IOO1 and the server room with the spare cable via temporary cabling. The result of this test will tell us whether the HIB cable is really bad or not. Also, if the HIB cable is the cause of this issue, it might be worth checking whether cleaning the terminals improves the situation. (The IOO area gets incredibly dusty.) The standard approach is to run a new cable from the server room to the IOO1 rack. In that case, the action items are as follows:

Checking the stock of corrugated pipe (and purchasing more if necessary)
Checking the route of the new cable
Drilling work on the wall of the server room
Drilling work on the wall between the front room and the corner station
Assigning human resources for work at heights

takahiro.yamamoto - 16:50 Wednesday 27 May 2026 (36956)

[Ikeda, Nakagaki, Oshino, YamaT]

K1IOO1 was launched properly with a spare 100 m fiber laid on the floor.
So we finally concluded that a malfunction and/or aging of the HIB cable is the cause.
Because cleaning the cable terminals with contact cleaner did not fix this problem, we plan to lay a new HIB cable between the server room and the IOO1 rack.

----
Preparation status of a recovery work:

It appears we have a sufficient amount of corrugated tube in stock.
There is space to lay a new corrugated tube through the hole in the wall of the server room.
It looks like we won’t need to drill any holes in the wall this time.
(We need to consider the hole issue at the future DGS upgrade.)
We need to discuss the route of the new cable, the assignment of technical staff, and the use of an aerial work platform.
(Hopefully, we can discuss them with Hayakawa-san tomorrow.)

Prospect of recovery:
I expect the cabling work, excluding the work at heights, to take 0.5-1 day. It is still unclear when the technical staff will be available and when the aerial work platform can be used, but even if we can start on Thursday afternoon or Friday morning, the work will likely take until the end of Friday at the earliest, or until around noon on Monday.

Consideration about temporary measures until full restoration:
k1shutter (shutter control for the main IR), k1alsfib (fiber noise cancellation), and k1psliss do not work at all now, so the shutter cannot be opened. Also, according to Ushiba-kun, the green lasers cannot be aligned because the offsets of the woofer PZTs from the DAC are dead.

The IR laser shutter can be opened in the local operation mode of the laser shutter circuit. Although the shutter cannot be operated remotely via EPICS in this mode, IMC lock can be recovered. (Thanks to the hardware interlock, there should be no concern about laser safety.) We can also change the output power from the PSL room via the HWP, so the main IR beam can be used for some purposes in this way. But I am not sure about the stability and noise level because ISS is still unavailable.

Regarding the PZT offset, we can use the 75V output from the Thorlabs(?) PZT driver instead of the 5V output from the DAC. (In my understanding, we normally use the 5V offset output from the DAC without any offset output from the PZT driver.) Fine and remote alignment will not be available, but rough alignment of the green lasers should come back.