MIF (General)
takahiro.yamamoto - 17:35 Tuesday 10 September 2024 (31045) Print this report
Helper guardian for lockloss check

Abstract

I prepared a new guardian code so that the initial check of the lockloss investigation can be skipped.
Because it's currently running on an unused guardian node (CAL_PROC), I will move it to a new guardian node with a proper name on the next maintenance day.
Initial-check results are available in /users/Commissioning/data/lockloss/yyyy/mmdd/yyyy-mm-dd.json.

Details

This guardian provides the following information (see also the attachment for an example) whenever a jump to LOCKLOSS in the LSC_LOCK guardian is detected.
- GPS/UTC/JST times aligned with the DAQ sampling (not the guardian log time)
- The state of the LSC_LOCK guardian just before jumping to LOCKLOSS
- Guardian nodes which detect the lockloss earlier than LSC_LOCK (or at the same DAQ sample)
- Labels for known issues, such as klog#31032 and O4a experiences

This should reduce the effort of the lockloss check, because it narrows down the items to be checked by automatically labeling known issues and showing the behavior of the other guardian nodes.
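As a rough illustration of how the first two items can be derived, the following is a minimal sketch assuming the LSC_LOCK guardian state number is available as a 16 Hz DAQ time series (the function name and data handling are mine, not the actual guardian code; only the LOCKLOSS state number -1 is taken from this thread):

import numpy as np

DAQ_RATE = 16.0      # Hz; guardian state channels are recorded at 16 Hz
LOCKLOSS = -1        # LSC_LOCK state number of LOCKLOSS

def find_lockloss(state_ts, gps_start, rate=DAQ_RATE):
    """Return (gps_time, last_state) of the first jump into LOCKLOSS,
    aligned with the DAQ sampling, or None if no lockloss is found.

    state_ts  : array of LSC_LOCK guardian STATE_N samples (how they are
                fetched, e.g. via NDS, is not shown here)
    gps_start : GPS time of the first sample
    """
    state_ts = np.asarray(state_ts)
    idx = np.where((state_ts[1:] == LOCKLOSS) & (state_ts[:-1] != LOCKLOSS))[0]
    if len(idx) == 0:
        return None
    i = idx[0] + 1                      # first sample already in LOCKLOSS
    gps = gps_start + i / rate          # DAQ-aligned time, not the guardian log time
    return gps, int(state_ts[i - 1])    # state just before LOCKLOSS

# toy example: lock in state 9990, lockloss two samples later
demo = np.array([9990, 9990, -1, 1, 1])
print(find_lockloss(demo, gps_start=1409900000.0))   # -> (1409900000.125, 9990)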
Non-image files attached to this report
Comments to this report:
takahiro.yamamoto - 12:29 Wednesday 11 September 2024 (31052) Print this report
Recent (today's and yesterday's) locklosses are now also available on the Run Summary.
So you can find the initial-check results of locklosses more easily than by reading the json files.
takahiro.yamamoto - 15:35 Friday 20 September 2024 (31099) Print this report
I moved this function from the temporary guardian node (CAL_PROC) to a new node (SYS_LOCKLOSS).

Information on the two most recent locklosses (latest and previous) is shown on k1mon7. The info is automatically updated by the SYS_LOCKLOSS guardian when a new lockloss is detected. The EPICS channels for this info are still temporary ones, so I need to add new channels with proper names to k1grdconfig.

An ndscope plot around the latest lockloss time is also automatically launched by SYS_LOCKLOSS on k1mon8. For locklosses due to known issues (e.g. overflow, earthquake, etc.), the related channels are shown on that plot. For unknown issues, ndscope templates in /users/DET/tools/lockloss/share/LSC_LOCK/ are launched. Each template must have the same name as the corresponding LSC_LOCK guardian state so that the proper template is opened for each lockloss. (I haven't prepared templates for all states yet.)
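A minimal sketch of this template selection, assuming the templates are .yaml files and that ndscope accepts a template path as its argument (this is not the actual guardian code):

import os
import subprocess

TEMPLATE_DIR = '/users/DET/tools/lockloss/share/LSC_LOCK'

def launch_lockloss_scope(last_state, known_issue_template=None):
    """Open an ndscope around the latest lockloss.

    If the lockloss is tagged as a known issue, a dedicated template
    (known_issue_template) is used; otherwise the template named after the
    last LSC_LOCK state is opened.  Returns the Popen handle, or None if no
    template has been prepared for that state yet.
    """
    if known_issue_template is not None:
        template = known_issue_template
    else:
        # template file name must match the LSC_LOCK state name
        template = os.path.join(TEMPLATE_DIR, last_state + '.yaml')
    if not os.path.exists(template):
        return None
    return subprocess.Popen(['ndscope', template])

# example: lockloss from OBSERVATION with no known-issue tag
# launch_lockloss_scope('OBSERVATION')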
takahiro.yamamoto - 17:10 Thursday 26 September 2024 (31133) Print this report

Abstract

The lockloss guardian was stopped due to a coding bug when a lockloss occurred around 10am on 9/24.
The bug was fixed. In addition, the missed locklosses are now listed thanks to an offline execution of the lockloss check.

Details

The latest locklosses were not updated due to a coding bug in the lockloss guardian. The bug was introduced in the work reported in klog#31099 and was just a trivial one, so I fixed it right away and the lockloss check was resumed from the lockloss at 1411367342.6875 (= 2024-09-26 06:28:44.687500 UTC = 2024-09-26 15:28:44.687500 JST). The missed locklosses were also added to the list by an offline analysis. The last lockloss recorded before this problem was fixed had been the one at 1410219337.75 (= 2024-09-12 23:35:19.750000 UTC = 2024-09-13 08:35:19.750000 JST). I also checked for locklosses between 9/12 (the last recorded lockloss) and 9/24 (when the error occurred) just in case, but there was none because there was no work with the full IFO during this term. So I re-analyzed only the locklosses between 9/24 and today. I found 19 locklosses in total which had been missed because of the guardian bug. All of them are now available in the list on the web pages.
hirotaka.yuzurihara - 16:22 Monday 07 October 2024 (31224) Print this report

I added the SYS_LOCKLOSS guardian to the MEDM screen. See the bottom of the attached screenshot.

Images attached to this comment
takahiro.yamamoto - 11:50 Tuesday 08 October 2024 (31232) Print this report

I updated k1grdconfig to add the EPICS channels needed by the SYS_LOCKLOSS guardian, as shown in Fig.1 and Fig.2.
All added channels are StringIn blocks, so there is no change in the DAQ channel list.

The compile check has already been completed, but it's not installed yet.
It will be installed on the next maintenance day.

Images attached to this comment
takahiro.yamamoto - 19:10 Wednesday 09 October 2024 (31254) Print this report
As one of the lockloss situations in states beyond DC lock, Ushiba-kun suggested the case in which the amplitude of AS17Q crosses 0.
This situation corresponds to the DARM fluctuation around the operating point reaching the opposite side of the dark point.

I implemented a new function for checking this situation and checked all 40 locklosses from DC lock since last night.
This new function gave results consistent with my eye-check for all locklosses.
So I deployed this function in the SYS_LOCKLOSS guardian.
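For reference, the check amounts to looking for a sign change of AS17Q shortly before the lockloss flag. A minimal sketch, assuming the AS17Q samples are already extracted (the window length and the handling of small values around zero are my assumptions, not the deployed code):

import numpy as np

def as17q_zero_cross(as17q, ignore_below=None):
    """Return True if AS17Q crosses zero within the given pre-lockloss window.

    as17q        : array of AS17Q samples ending at the lockloss flag
    ignore_below : optional amplitude below which samples are treated as noise
                   around zero rather than a real crossing
    """
    x = np.asarray(as17q, dtype=float)
    if ignore_below is not None:
        x = x[np.abs(x) > ignore_below]
    if len(x) < 2:
        return False
    # a sign change between consecutive (significant) samples means a 0-cross
    return bool(np.any(np.sign(x[1:]) * np.sign(x[:-1]) < 0))

# toy example: the error signal swings through zero before the lockloss
print(as17q_zero_cross([0.8, 0.5, 0.1, -0.4, -0.9]))   # -> True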

The attached figures show 29 of the 40 locklosses from DC lock in which a 0-cross of AS17Q can be seen before the lockloss flag is raised.
The two numbers at the end of the file names correspond to the date (yymmdd) and the ID in the lockloss list.
(Because there are too many plots, the 11 of the 40 locklosses which seem to have other causes will be attached in the next post.)
Images attached to this comment
takahiro.yamamoto - 19:15 Wednesday 09 October 2024 (31258) Print this report
I attached the remaining 11 plots, which don't seem to be related to the 0-cross issue.
Some of them (e.g. the 4th plot) may be related to the 0-cross issue.
But it's difficult to judge whether they really are the 0-cross issue or not because of the poor sampling rate (16 Hz) of the guardian flag.
Images attached to this comment
takahiro.yamamoto - 14:41 Sunday 03 November 2024 (31493) Print this report
I made a minor update of SYS_LOCKLOSS guardian.

When I created this guardian node, unused EPICS channels which had been prepared for another purpose were used to record the lockloss information.
Although I prepared new EPICS channels for this purpose last month (see also klog#31232), the guardian code hadn't been updated yet.
So I modified the guardian code to use these new channels to record the lockloss info.
The MEDM screen showing the lockloss info was also updated.
takahiro.yamamoto - 21:49 Wednesday 04 December 2024 (31909) Print this report
Yuzurihara-kun pointed out the possibility of missing the human-request lockloss reason when a state other than DOWN is requested.
So I modified the LOCKLOSS guardian to detect such a case.

The reason is that the guardian checked whether the time-series array of REQUEST_N values includes 1.0 or not. With the old implementation the guardian misses the lockloss reason when, for example, someone changes the guardian request from DC lock to RF lock. In this case REQUEST_N is changed from 9990 to 1400, while the state transition is 9990 (DC lock) -> -1 (LOCKLOSS) -> 1 (DOWN) -> ... -> 1400 (RF lock). So I improved the implementation and now the guardian can detect the case above. Specifically, it was changed to check whether REQUEST_N was changed to a smaller value. This should work well because the state numbers are well ordered. If the LSC_LOCK guardian is ever modified without following this state-number ordering, the LOCKLOSS guardian must be changed as well; in that case the only solution is to describe all cases one by one.
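A minimal sketch of the improved check (how the REQUEST_N time series is obtained is assumed; the state numbers are the ones quoted above):

import numpy as np

def human_request_lockloss(request_ts):
    """Return True if the lockloss looks like a human request.

    Old logic: flag only when REQUEST_N contains 1.0 (a request to DOWN).
    New logic: flag whenever REQUEST_N was changed to a smaller value, which
    also catches e.g. a request from DC lock (9990) to RF lock (1400).
    Relies on the LSC_LOCK state numbers being well ordered.
    """
    req = np.asarray(request_ts, dtype=float)
    return bool(np.any(np.diff(req) < 0))

# example: request changed from DC lock to RF lock just before the lockloss
print(human_request_lockloss([9990, 9990, 1400, 1400]))   # -> True
# example: no request change, so not a human-request lockloss
print(human_request_lockloss([9990, 9990, 9990]))         # -> False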
takahiro.yamamoto - 11:38 Friday 13 December 2024 (31982) Print this report

A new function was deployed around 11:30 in the LOCKLOSS guardian for locklosses due to ISS saturation, as reported in klog#31974.
So such locklosses can be tagged automatically from now on.

-----
Figure 1 and Figure 2 show lockloss cases due to a non-ISS issue and an ISS issue, respectively. In the non-ISS case, the ISS guardian goes to DOWN after (or at the same 16 Hz sample as) the LSC_LOCK guardian goes to LOCKLOSS, and the AOM feedback also becomes large only after the lockloss. In the ISS case, on the other hand, we can see the ISS going down and the AOM output oscillating a few seconds before the lockloss.
Most known lockloss reasons can be identified by looking at the proper signals 0~2 seconds before the lockloss time. But in the ISS case, the problem occurs much earlier than the lockloss time and the AOM output has already been stopped by the ISS guardian just before the lockloss. So a longer duration must be checked to identify the ISS issue. Figure 3 is another ISS-related lockloss at a different time; the relative down time between ISS and LSC_LOCK differs from the case of Fig.2. From checking several cases, this issue seems detectable by looking at the AOM output 0~4 seconds before the lockloss time.
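A minimal sketch of this kind of check, assuming 16 Hz guardian state records covering roughly 4 s before the lockloss are available (the DOWN and LOCKLOSS state numbers used here are assumptions, not the actual implementation):

import numpy as np

def iss_lockloss(iss_state_ts, lsc_state_ts, iss_down=1, lockloss=-1):
    """Return True if the ISS guardian went DOWN before (not after) LSC_LOCK
    went to LOCKLOSS within the checked window.

    iss_state_ts / lsc_state_ts : 16 Hz guardian state records covering the
    0~4 s before the lockloss time (state numbers are assumptions).
    """
    iss = np.asarray(iss_state_ts)
    lsc = np.asarray(lsc_state_ts)
    iss_idx = np.where(iss == iss_down)[0]
    lsc_idx = np.where(lsc == lockloss)[0]
    if len(iss_idx) == 0 or len(lsc_idx) == 0:
        return False
    # ISS issue: ISS goes DOWN strictly earlier than the LSC_LOCK lockloss flag
    return bool(iss_idx[0] < lsc_idx[0])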

Note for future maintenance:
When the ISS goes to DOWN, BPC control doesn't work because the BPC lines are hidden by intensity noise. So the difference in the relative down time between ISS and LSC_LOCK may depend on the speed of the BPC control, and we may need to modify the lockloss detection scheme if the BPC control is drastically changed.

Images attached to this comment
takahiro.yamamoto - 15:08 Wednesday 18 December 2024 (32036) Print this report
The SYS_LOCKLOSS guardian seems to have been stopped since Monday morning due to trouble in the NDS connection combined with poor error handling.

The guardian code was modified to handle NDS-related errors, and a re-analysis of the missed period is almost done.
But it's difficult to deploy these changes remotely without careful checks, so I'll fix it after going back to Mozumi.

takahiro.yamamoto - 18:35 Thursday 19 December 2024 (32059) Print this report
Error handling for the NDS connection was added to the guardian code (a rough sketch is given after the list below) and SYS_LOCKLOSS was resumed.
All lockloss events which occurred while SYS_LOCKLOSS was stopped were analyzed offline:

- 12/15: 54 events (No. 0~27 had already been analyzed online, so only No. 28-53 were analyzed.)
- 12/16: 29 events
- 12/17: 26 events
- 12/18: 18 events
- 12/19: 8 events (until 9:20 UTC)
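The added error handling is essentially a retry around the NDS access so that a transient connection problem no longer stops the node. A minimal sketch (the retry count, wait time and the fetch callable are assumptions, not the actual code):

import time

def fetch_with_retry(fetch, *args, n_retry=3, wait=10.0, **kwargs):
    """Call an NDS fetch function, retrying on failure instead of letting the
    exception stop the guardian node.  Returns None if all attempts fail so
    that the caller can skip this lockloss and analyze it offline later.
    """
    for attempt in range(n_retry):
        try:
            return fetch(*args, **kwargs)
        except Exception as e:   # e.g. connection/timeout errors from NDS
            print('NDS fetch failed (%d/%d): %s' % (attempt + 1, n_retry, e))
            time.sleep(wait)
    return None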
takahiro.yamamoto - 20:18 Friday 17 January 2025 (32380) Print this report

Yuzurihara-kun requested adding the duration of staying in the last state to the lockloss list.
So I added a new function to compute this duration in the guardian code and added it to the lockloss table.
Durations for past locklosses haven't been computed.
If someone makes a list of durations for the past lockloss events, I can add them to the json files used to build the table on the web pages.

For this update, k1grdconfig was updated to add new channels as shown in Fig.1. All added channels are EPICS StringIn records, so daqd was not restarted for this work because there was no change in the DAQ list. k1grdconfig is the only process which was restarted.

-----
To compute an accurate duration, the guardian might have to read a very long stretch of data when a lockloss occurs after staying in some state for a long time. Such an implementation is not realistic for guardian, which is a semi-real-time process. To solve this, the LOCKLOSS guardian saves a time stamp whenever the LSC_LOCK guardian moves to a new state, and the duration is computed as the difference between the lockloss time and the saved time stamp. For this reason, the duration doesn't have the same accuracy as the DAQ time stamp. Please use it as a rough reference.
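A minimal sketch of this time-stamp approach (the class and method names are mine, not the actual guardian code):

class StateDurationTracker:
    """Remember when LSC_LOCK entered its current state and report how long
    it stayed there when a lockloss is detected.  The time stamp is taken when
    the state change is noticed, so the duration is only a rough reference and
    does not have DAQ time-stamp accuracy.
    """
    def __init__(self):
        self.entered_at = None

    def on_state_change(self, gps_now):
        # called when LSC_LOCK is seen to enter a new state
        self.entered_at = gps_now

    def duration_at_lockloss(self, gps_lockloss):
        if self.entered_at is None:
            return None   # no time stamp saved (e.g. tracker started mid-state)
        return gps_lockloss - self.entered_at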

Images attached to this comment
hirotaka.yuzurihara - 16:36 Thursday 23 January 2025 (32437) Print this report

Thank you for updating the SYS_LOCKLOSS guardian.

Is it possible to add a label indicating whether someone was injecting an excitation signal? In the lockloss at 2025-01-19 02:47:58 UTC, the commissioners were working on a transfer function measurement using an excitation.
Ushiba-san commented that when an excitation is running, any situation is possible. If you add such a label, we can ignore locklosses under excitation during the lockloss investigation.

Images attached to this comment
takahiro.yamamoto - 16:45 Friday 24 January 2025 (32458) Print this report

If it's only a matter of referring to the excitation-check guardian, it's easy, and it might be good practice for touching guardian code. Note that the reason why the lockloss check code is prepared as a guardian is that all on-site people can improve it by themselves (if I were the only person editing it, I would rather refactor it as non-guardian code for flexibility).

What we must seriously consider is that excitation channels are often left open even after the actual excitation is finished. Unfortunately, this seems to have happened today, as shown in Fig.1. Although some excitations from suspensions seem to have been present during these 4 hours, it's hard to separate actual excitations from data wrongly flagged due to unclosed excitations. This is a matter of the discipline of user codes and activities, which some people have already reported many times (so is it no longer possible for us to improve it?).

From the viewpoint of observation, we can simply eliminate all suspicious data; it just reduces the duty factor through our own fault. For the lockloss study, on the other hand, it is better to avoid false labeling, because workers would then stop checking those events. In this sense, labeling in a low-reliability way is worse than not labeling at all. Automatic labeling is meant to reduce our work, not to hide items that really need to be checked. We should decide whether to add the EXC flag now after considering whether most people prefer to increase the number of labeled events even if the reliability is low, or to label only reliable events.

Images attached to this comment
takahiro.yamamoto - 1:50 Tuesday 28 January 2025 (32468) Print this report

This is remaining work of klog#32380.

The lockloss guardian was updated on Jan 17 to add the information about the duration of staying in the last state. Although the lockloss information files stored in /users/Commissioning/data/lockloss/yyyy/ have the duration after this update, the files before Jan 16 didn't have it yet. So I re-analyzed all lockloss events (the first event handled by the lockloss guardian is on Sep 10, 2024) and re-generated the lockloss information files from Sep 10, 2024 to Jan 16, 2025. I checked `diff` between all old and new files, so there should be no difference other than the addition of the duration.

They are now available in /users/Commissioning/data/lockloss/yyyy/, and the lockloss table on the web is generated from these new files. The old files are kept in /users/Commissioning/data/lockloss/yyyy/__old/2025_0126_before_add_duration/.

hirotaka.yuzurihara - 13:14 Tuesday 18 February 2025 (32722) Print this report

Related to klog32695,

Ushiba-san requested that I add a label indicating whether the ISC watchdog was triggered before the lockloss. I implemented the function and tested it in the test guardian. The label name is ISC_WD_L.

If the LSC_LOCK guardian is in the DOWN state tomorrow morning (around 8:50~9:00), I will add this feature to the SYS_LOCKLOSS guardian. I will replace the json files of the past data (after 2/14) after that.

Note

  • The working directory: /users/yuzu/work/20250217_lockloss_guardian/ISC_WD
  • The condition to set the ISC_WD_L label is K1:VIS-ETMX_ISCWD_WD_AC_L_RMSMON > K1:VIS-ETMX_ISCWD_WD_AC_L_RMS_MAX (see the sketch after this list).
    • Example of the label.
    • If the LSC_LOCK guardian entered the lockloss state and the watchdog was triggered at the same time (Fig), the label is not set, to avoid false detection. (Memo) Ushiba-san's idea to flag this case is to check whether the RMSMON data just before the lockloss are approaching the threshold. I will add this feature when ISC_WD_P or Y is implemented.
  • This label is checked only when the LSC_LOCK guardian is beyond ENGAGE_ALS_DARM.
  • This kind of watchdog will also be implemented for pitch and yaw, and for the other Type-A suspensions. In the future, up to 12 watchdogs could be checked. As discussed with Ushiba-san, we will implement three labels (ISC_WD_L, ISC_WD_P, ISC_WD_Y) instead of preparing 12 labels.
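A minimal sketch of the labeling condition in the first bullet (only the comparison itself; how the data are read before the lockloss is not shown):

import numpy as np

def isc_wd_l_triggered(rmsmon, rms_max):
    """Return True if the ISC watchdog condition was met in the checked window,
    i.e. K1:VIS-ETMX_ISCWD_WD_AC_L_RMSMON exceeded
    K1:VIS-ETMX_ISCWD_WD_AC_L_RMS_MAX at some sample before the lockloss.
    """
    return bool(np.any(np.asarray(rmsmon, dtype=float) > float(rms_max)))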
Images attached to this comment
hirotaka.yuzurihara - 9:29 Wednesday 19 February 2025 (32736) Print this report

I finished implementing the function to apply the ISC_WD_L label in the SYS_LOCKLOSS guardian at 8:50. At that time, the LSC_LOCK guardian was in the DOWN state.

 

I also updated the function to handle the case where the LSC_LOCK guardian entered the lockloss state and the watchdog was triggered at the same time (Fig). I implemented a check of whether the average of the RMSMON data just before the lockloss is approaching the threshold, and set that threshold to 0.9 * RMS_MAX (a sketch is given after the event list below).
For the lockloss events of 02/14, 15 and 17, this threshold (0.9 * RMS_MAX) works well. If necessary, I will tune it in the future.
The related locklosses are:

  • 1423577533.0625 (Fig)
  • 1423582334.9375 (Fig)
  • 1423695458.3125 (Fig)
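A minimal sketch of this added check (the length of the window over which the data are averaged is an assumption; the 0.9 factor is the one described above):

import numpy as np

def isc_wd_l_near_threshold(rmsmon_before, rms_max, factor=0.9):
    """Handle the case where the watchdog trips at the same 16 Hz sample as the
    lockloss: flag ISC_WD_L if the average of RMSMON just before the lockloss
    is already approaching the threshold (>= factor * RMS_MAX).
    """
    avg = float(np.mean(np.asarray(rmsmon_before, dtype=float)))
    return avg >= factor * float(rms_max)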
Images attached to this comment
hirotaka.yuzurihara - 10:17 Wednesday 19 February 2025 (32742) Print this report

I replaced the json files for 02-14, 02-15, 02-17 and 02-18, stored in /users/Commissioning/data/lockloss/2025. The previous files are backed up to /users/Commissioning/data/lockloss/2025/__old/2025_0219_before_add_ISC_WD_L/.

hirotaka.yuzurihara - 14:43 Thursday 27 March 2025 (33128) Print this report

Locklosses due to the oscillation of the IMC control have been seen in recent lockloss investigations (klog33036, klog32903, klog32803). I implemented a new label to tag locklosses identified as being due to the oscillation of the IMC length control. The label name is IMCL.
In some cases, even though the 1 Hz seismic motion is below the threshold, the oscillation of the IMC control occurred and led to the lockloss (Fig1, Fig2). By adding this label, we can identify such locklosses.
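The detection criterion is not spelled out here, so the following is only a hypothetical sketch of one possible way to flag a growing oscillation of the IMC length control signal before the lockloss; the choice of signal, the window lengths and the ratio are all my assumptions, not the deployed code:

import numpy as np

def imcl_oscillation(ctrl, fs, quiet_sec=4.0, late_sec=1.0, ratio=5.0):
    """Hypothetical IMCL criterion: compare the RMS of the IMC length control
    signal in the last second before the lockloss with the RMS of an earlier
    'quiet' reference window; a large ratio suggests a growing oscillation.
    """
    x = np.asarray(ctrl, dtype=float)
    n_late = int(late_sec * fs)
    n_quiet = int(quiet_sec * fs)
    if len(x) < n_quiet + n_late:
        return False                      # not enough data to compare windows
    quiet = x[-(n_quiet + n_late):-n_late]
    late = x[-n_late:]
    rms = lambda v: np.sqrt(np.mean((v - np.mean(v)) ** 2))
    quiet_rms = rms(quiet)
    return bool(quiet_rms > 0 and rms(late) / quiet_rms > ratio)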

I tested the function in a local directory and with the CAL_PROC guardian. Between 2024/09 and 2025/03/24, 323 locklosses were tagged with this label. I performed an eye scan of all these locklosses and confirmed that the function works well. After the test, I added it to the SYS_LOCKLOSS guardian at 14:18.

I replaced the json files in /users/Commissioning/data/lockloss with the updated json files. The old json files are stored in /users/Commissioning/data/lockloss/2025/__old/2025_0327_before_add_IMCL.

Images attached to this comment