DGS (General)
takahiro.yamamoto - 14:47 Tuesday 02 December 2025 (35733)
hang-up of k1pr0

Abstract

The PR3 front-end computer (k1pr0) hung up around 9:31.
As in klog#34618, this trouble appeared to be a kernel issue,
but its behavior was slightly different from past cases.
After waiting for some activities in the mine to finish, we rebooted the PR3 front-end computer and it came back online.

Details

All EPICS records and fast channels related to the PR3 front-end became unreachable around 9:31, as seen on MEDM and ndscope, respectively. It looked like a kernel hang-up such as the one in klog#34618. The difference from past cases was that the IFO stayed locked, which could be confirmed with the GigE cameras, the ARM TRANS PDs, etc. This means that the CPU processes and PCI bus for the front-end models, the shared-memory communication between k1vispr3t and k1vispr3p, and the Dolphin communication between k1pr0 and the other front-end computers (k1asc0, k1lsc0, ...) were all alive. On the other hand, k1pr0 seemed to be disconnected from the DGS network, because the common network connections (ssh, ping, EPICS CA) were all unavailable. We recovered from this issue remotely, so I couldn't check the mouse/keyboard input on the physical console (in the past cases, these were also dead). These facts may suggest that the north-bridge side (CPU, RAM, PCI) stayed alive and only the south-bridge side (LAN, USB) was dead this time.
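The reasoning in this paragraph can be written down as a small decision rule. The following is only a minimal sketch of that rule, not an existing DGS tool: the names `Symptoms` and `classify` and the returned labels are hypothetical, and the inputs are assumed to be filled in by hand from what is seen on MEDM, ndscope, the cameras, etc.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symptoms:
    """Observations about one front-end computer (all names are hypothetical)."""
    ping_ok: bool
    ssh_ok: bool
    epics_ca_ok: bool
    # Confirmed independently of the DGS network, e.g. by GigE cameras or TRANS PDs.
    ifo_locked: bool


def classify(s: Symptoms) -> str:
    """Return a rough label for the state of the front-end computer."""
    if s.ping_ok or s.ssh_ok or s.epics_ca_ok:
        # At least one ordinary network path still works.
        return "network reachable"
    if s.ifo_locked:
        # The lock can only be kept if the real-time side
        # (models, shared memory, Dolphin) is still running.
        return "network side dead, real-time side alive"
    # Without the lock there is no independent evidence about the real-time side.
    return "network side dead, real-time side unknown"
```

With the observations of this event (ping, ssh and EPICS CA all unavailable, IFO locked), the sketch returns the second label, which corresponds to the "only the LAN/USB side was dead" hypothesis. The case without lock is deliberately left as "unknown" rather than "full hang", since loss of lock alone does not prove that the real-time processes stopped.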

Anyway, the only thing we could do was to reboot k1pr0 with the other front-ends in the SAFE state. Because the IFO lock could be kept, we performed the recovery after waiting for some mine work to finish this morning. The recovery procedure was as follows.
1. Request LOCKLOSS, then DOWN for LSC_LOCK in manual mode, because of an Ezca connection error for K1:VIS-PR3_TM_SET_{P,Y}
2. Request SAFE for all guardian and front-end nodes except VIS_PR3 and k1pr0
3. Clear all SDF differences on safe.snap except for PR3-related models
4. Disable the Dolphin connection of k1pr0
5. Reset k1pr0 on the BMC web interface

By the way, we found that a script for loading SDF tables stopped with a run-time error when some real-time model was dead. This is inconvenient for troubleshooting in situations like this one. So I added a try-except statement to the script so that it no longer stops on a run-time error caused by dead real-time models. If a real-time model is dead, loading the SDF table for that model is skipped with a warning message. Although this script is also used by the LSC_LOCK guardian, this should not be a problem for ensuring observing mode, because the SYS_SDF guardian checks not only the number of SDF differences but also the names of the loaded SDF tables.
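The actual script is not reproduced in this report, so the following is only a minimal sketch of the "skip with a warning" pattern described above. The function `load_sdf_tables`, its `load_one` callback and the logger name are hypothetical. In the real script the per-model step would be whatever call writes the SDF table request for that model, and the caught exception type would likely be narrower than `Exception`.

```python
import logging

logger = logging.getLogger("sdf_load")  # hypothetical logger name


def load_sdf_tables(models, load_one, table="safe"):
    """Load `table` on every model, skipping models whose load raises.

    `load_one(model, table)` is a caller-supplied function (hypothetical here)
    that performs the actual load and raises if the model does not respond.
    Returns two lists: the models that were loaded and those that were skipped.
    """
    loaded, skipped = [], []
    for model in models:
        try:
            load_one(model, table)
        except Exception as exc:  # assumption: a dead model shows up as an exception
            # Do not abort the whole loop; report the dead model and continue.
            logger.warning("skipping %s: %s", model, exc)
            skipped.append(model)
        else:
            loaded.append(model)
    return loaded, skipped
```

Returning the skipped models explicitly, instead of only logging them, lets the caller decide whether a partial load is acceptable. This is in the same spirit as relying on SYS_SDF to verify the names of the loaded tables afterwards.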