DGS (General)
takahiro.yamamoto - 14:11 Monday 10 July 2023 (25925)
IPC glitch comes again
IPC glitches started occurring again around 11:00 on Friday, July 7th.
They appear to be related to the work done on the morning of the last maintenance day.
I have already confirmed that the NIC settings on DC0 are correct.

The number of glitches per day is as follows:
55 2023-06-22
53 2023-06-23
23 2023-06-24
1 2023-06-25
0 2023-06-26
1 2023-06-27
0 2023-06-28
1 2023-06-29
0 2023-06-30
1 2023-07-01
0 2023-07-02
1 2023-07-03
1 2023-07-04
2 2023-07-05
1 2023-07-06
12 2023-07-07
51 2023-07-08
35 2023-07-09
28 2023-07-10
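A tally like the one above can be produced mechanically from a list of glitch timestamps. A minimal sketch, assuming one "YYYY-MM-DD HH:MM:SS" timestamp per line (the input format and the tally_per_day helper are illustrative, not the actual DAQ tooling):

```shell
#!/bin/sh
# Count glitches per day from a list of timestamps on stdin.
# Input format ("YYYY-MM-DD HH:MM:SS", one per line) is an assumption.
tally_per_day() {
  cut -d' ' -f1 | sort | uniq -c | awk '{print $1, $2}'
}

# Example with made-up timestamps:
tally_per_day <<'EOF'
2023-07-07 11:00:01
2023-07-07 11:00:02
2023-07-08 09:13:00
EOF
# prints:
# 2 2023-07-07
# 1 2023-07-08
```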
Comments to this report:
satoru.ikeda - 16:33 Friday 21 July 2023 (26058)

Overview

IPC glitches occurred again (K-Log#25925).
We checked whether this was caused by the changes made in K-Log#25892.

Result
IPC glitches still occurred.
We believe they were not caused by the model-file changes or by the resulting increase or decrease in the amount of data.

Procedure

1. Reverted to the state before the changes of K-Log#25892.

 k1visetmxt, k1visetmyt, k1visitmxt, k1visitmyt, TOWER_MASTER, k1aso, k1sdfmanage
 After the change, a DAQ restart was performed and the status was monitored until about 3 p.m.
 All of them were in the SAFE or DOWN state at the time of confirmation.

 Monitored from 11:00 to 15:00.
 An IPC glitch was detected at 14:30.
 
2. Reverted to the state after the changes of K-Log#25892.

 k1visetmxt, k1visetmyt, k1visitmxt, k1visitmyt, TOWER_MASTER, k1aso, k1sdfmanage
 

Non-image files attached to this comment
satoru.ikeda - 16:36 Friday 21 July 2023 (26060)

The order of entries in rtsystab was switched to check the effect.

Analysis of past data revealed that there are two types of IPC glitches:

1. Glitches where data are lost over consecutive timestamps.
2. Intermittent glitches that occur at the same fractional second.

In case 2, we found that the IOP model and the model next to it in the mx_stream transfer order showed no errors.
(In case 1, glitches occur in the IOP model as well.)

Therefore, we tested what happens when the order of models passed to mx_stream is changed.
We will check the result at the IPC glitch times from tomorrow onward.

Changes
1. Create a backup:
 cd /diskless/root/etc/
 cp rtsystab rtsystab.20230721
 Then make the following changes to rtsystab:

 Before change
 k1omc0 k1iopomc0 k1visommt1 k1visommt2 k1visostm k1omc k1aso k1ascbeacon
 k1ex0 k1iopex0 k1vistmsx k1calex k1pemex0 k1tmsx
 After change
 k1omc0 k1iopomc0 k1aso k1visommt1 k1visommt2 k1visostm k1omc k1ascbeacon
 k1ex0 k1iopex0 k1pemex0 k1vistmsx k1calex k1tmsx

 2. Restart the DAQ and check that the order of the mx_stream arguments for k1omc0 and k1ex0 has changed.
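Since the intent is to reorder models without adding or dropping any, one quick sanity check before the DAQ restart is to compare the sorted model lists of the old and new rtsystab lines. A sketch (the same_models helper is hypothetical, not part of the RTS tools; the two lines are the k1omc0 entry from this change):

```shell
#!/bin/sh
# Verify that an rtsystab edit only reorders models (same set before/after).
# same_models is a hypothetical helper written for this check.
same_models() {
  a=$(echo "$1" | tr ' ' '\n' | sort)
  b=$(echo "$2" | tr ' ' '\n' | sort)
  [ "$a" = "$b" ] && echo OK || echo MISMATCH
}

before='k1omc0 k1iopomc0 k1visommt1 k1visommt2 k1visostm k1omc k1aso k1ascbeacon'
after='k1omc0 k1iopomc0 k1aso k1visommt1 k1visommt2 k1visostm k1omc k1ascbeacon'
same_models "$before" "$after"   # prints OK
```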
 

Non-image files attached to this comment
satoru.ikeda - 15:03 Friday 28 July 2023 (26125)

The relationship between the IPC glitches and rtsystab was confirmed.

The rtsystab change in the previous comment, K-Log#26060, eliminated the errors in the changed entries.
time_datas_0721.pdf
However, the mechanism is still unclear.

1. Therefore, we decided to make the following additional changes and observe the result.

We made the following changes to rtsystab and checked the IPC glitch times.

Before change
k1lsc0 k1ioplsc0 k1lsc k1lsc2 k1alspdh k1calcs
k1omc0 k1iopomc0 k1aso k1visommt1 k1visommt2 k1visostm k1omc k1ascbeacon
k1bs k1iopbs k1visbst k1visbsp
k1sr2 k1iopsr2 k1vissr2t k1vissr2p
k1sr3 k1iopsr3 k1vissr3t k1vissr3p k1pemsr3
k1srm k1iopsrm k1vissrmt k1vissrmp
k1ex0 k1iopex0 k1pemex0 k1vistmsx k1calex k1tmsx
(The k1lsc0 and k1omc0 lines already include the previous changes.)

After change
k1lsc0 k1ioplsc0 k1lsc2 k1alspdh k1calcs k1lsc
k1omc0 k1iopomc0 k1aso k1ascbeacon k1visommt1 k1visommt2 k1visostm k1omc
k1bs k1iopbs k1visbsp k1visbst
k1sr2 k1iopsr2 k1vissr2p k1vissr2t
k1sr3 k1iopsr3 k1vissr3p k1vissr3t k1pemsr3
k1srm k1iopsrm k1vissrmp k1vissrmt
k1ex0 k1iopex0 k1pemex0 k1tmsx k1vistmsx k1calex

Result
time_datas_0728.pdf
k1lsc0 k1ioplsc0 k1lsc2 k1alspdh k1calcs k1lsc
=> k1lsc and k1lsc2 errors disappear
k1omc0 k1iopomc0 k1aso k1ascbeacon k1visommt1 k1visommt2 k1visostm k1omc
=> The k1aso and k1ascbeacon errors disappear, but new errors appear in the other models
k1bs k1iopbs k1visbsp k1visbst
k1sr2 k1iopsr2 k1vissr2p k1vissr2t
=> Payload errors disappear and all errors are gone
k1sr3 k1iopsr3 k1vissr3p k1vissr3t k1pemsr3
=> Both k1vissr3p and k1vissr3t now show errors
k1srm k1iopsrm k1vissrmp k1vissrmt
=> Payload error disappears and all errors are gone
k1ex0 k1iopex0 k1pemex0 k1tmsx k1vistmsx k1calex
=> all errors are gone

2. The following changes were made in preparation for next week.

The following hosts are newly added this time (the second group of lines below):
Before change
k1lsc0 k1ioplsc0 k1lsc k1lsc2 k1alspdh k1calcs
k1omc0 k1iopomc0 k1aso k1visommt1 k1visommt2 k1visostm k1omc k1ascbeacon
k1bs k1iopbs k1visbst k1visbsp
k1sr2 k1iopsr2 k1vissr2t k1vissr2p
k1sr3 k1iopsr3 k1vissr3t k1vissr3p k1pemsr3
k1srm k1iopsrm k1vissrmt k1vissrmp
k1ex0 k1iopex0 k1pemex0 k1vistmsx k1calex k1tmsx

k1ix1 k1iopix1 k1visitmxt k1visitmxp
k1iy1 k1iopiy1 k1visitmyt k1visitmyp
k1ex1 k1iopex1 k1visetmxt k1visetmxp k1sendbeacon
k1ey1 k1iopey1 k1visetmyt k1visetmyp
k1ey0 k1iopey0 k1tmsy k1vistmsy k1caley k1pemey0

After change
k1lsc0 k1ioplsc0 k1lsc k1alspdh k1calcs k1lsc2
k1omc0 k1iopomc0 k1ascbeacon k1aso k1visommt1 k1visommt2 k1visostm k1omc
k1bs k1iopbs k1visbst k1visbsp
k1sr2 k1iopsr2 k1vissr2t k1vissr2p
k1sr3 k1iopsr3 k1pemsr3 k1vissr3p k1vissr3t
k1ex0 k1iopex0 k1pemex0 k1vistmsx k1calex k1tmsx

k1ix1 k1iopix1 k1visitmxp k1visitmxt
k1iy1 k1iopiy1 k1visitmyp k1visitmyt
k1ex1 k1iopex1 k1visetmxp k1visetmxt k1sendbeacon
k1ey1 k1iopey1 k1visetmyp k1visetmyt
k1ey0 k1iopey0 k1tmsy k1vistmsy k1pemey0 k1caley

Result
To be confirmed at the next IPC glitch occurrence.

Non-image files attached to this comment
satoru.ikeda - 15:29 Friday 04 August 2023 (26208)

Overview

Results from last week:

We changed the model order on several FEs and checked the occurrence of IPC glitches after the changes.
The situation did not change much, although the changes caused some increases and decreases.

Next, we tried a configuration based on the amount of data per endpoint and on the number of CPU threads.

Procedure

Considering the amount of data transferred by each model, we distributed the models so that the total per endpoint is balanced.
K1LSC0, K1ASC0, and K1OMC0 were each assigned an endpoint of their own, since each exceeds 4000.

In addition, since the CPU of k1dc0 currently has 6 cores and 12 threads, the number of endpoints used was set to 10 for maximum CPU efficiency.

endpoint  sum
0      5884  K1LSC0 5884
1      4312  K1ASC0 4312
2      4182  K1OMC0 4182
3      4555  k1ioo 2640  k1sr3 967   k1bs 948
4      4240  K1IOO1 2204 K1SRM 1088  K1SR2 948
5      3898  k1als0 1793 k1imc0 1157 k1px1 948
6      3920  K1EX0 1786  K1IY0 1360  K1PRM 774
7      3865  K1EY0 1718  K1TEST0 1410 K1PR2 737
8      3896  K1EX1 1633  K1IY1 1541  K1PR0 722
9      4020  K1EY1 1607  K1IX1 1560  K1MCF0 663  K1OMC1 190
10-15  Not used
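The per-endpoint totals can be cross-checked mechanically. A sketch over two rows of the table (the whitespace-separated "endpoint, then model/size pairs" layout mirrors how the table is written here; it is not an actual file format):

```shell
#!/bin/sh
# Sum the model sizes on each endpoint row and print "endpoint total".
# Row layout: endpoint number, then alternating model/size pairs.
awk '{ s = 0; for (i = 2; i <= NF; i += 2) s += $(i + 1); print $1, s }' <<'EOF'
3 k1ioo 2640 k1sr3 967 k1bs 948
4 K1IOO1 2204 K1SRM 1088 K1SR2 948
EOF
# prints:
# 3 4555
# 4 4240
```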

Result
We will take data for a few days.

Non-image files attached to this comment
satoru.ikeda - 17:57 Friday 18 August 2023 (26381)

Verification Results

No improvement was seen.
Therefore, we reverted rtsystab to the backup taken on 20230721:
> 1. create backup
> cd /diskless/root/etc/
> cp rtsystab rtsystab.20230721

We will review the FEs' NIC settings next time.
 

Non-image files attached to this comment
satoru.ikeda - 17:34 Friday 25 August 2023 (26479)

The NIC settings of all FEs were changed in relation to the IPC glitches.

Background
In past DGS tests, tuning the NIC interrupt throttling (ITR: InterruptThrottleRate) on the front-end side improved performance.
No such setting is currently applied.
We will check whether changing the ITR this time reduces the IPC glitches.

Procedure
1. Log in to the FE.
2. sudo ethtool -C eth1 rx-usecs 0
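Applied to every front end, the two steps above amount to a loop. A dry-run sketch (the host list is an illustrative subset of the FEs, and running ethtool via ssh with sudo is an assumption about the setup):

```shell
#!/bin/sh
# Dry run: print the ITR change for each front end instead of executing it.
# The host list is an illustrative subset; extend it to cover all FEs and
# drop the echo (or pipe the output to sh) to actually apply the change.
for fe in k1lsc0 k1asc0 k1omc0 k1ex0 k1ey0; do
  echo "ssh $fe sudo ethtool -C eth1 rx-usecs 0"
done
```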

Result
We plan to check the occurrence of IPC glitches for a few days.
There seems to be no abnormal load as far as I can see with vmstat.
The in and cs values fluctuate, but the changes are acceptable (they vary each time data is acquired).
(k1px1 and k1test0 were changed on August 10 in advance for testing;
for k1ioo, we forgot to take the "before" data.)
         before                after
-        in   cs us sy  id wa   in   cs us sy  id wa
k1lsc0   20   52 26  2  71  0   46   33 26  2  71  0
k1asc0    3   31 29  2  69  0    7   30 29  2  69  0
k1als0   11    5  1  0  99  0    0   12  1  0  99  0
k1ioo   -                       59   14 13  1  86  0
k1ioo1   16    0  6  1  94  0   31   21  6  1  94  0
k1imc0    7    6 16  2  82  0   31   60 16  2  82  0
k1pr2    11    4  2  0  97  0   21   14  2  0  97  0
k1pr0    20   16  2  0  97  0    5    2  2  0  97  0
k1prm0   19   12  2  0  97  0    4   22  2  0  97  0
k1mcf0    0    0  0  0 100  0    0    0  0  0 100  0
k1bs      6    4  4  0  96  0   16   15  4  0  96  0
k1sr2     4    4  4  0  96  0   13   15  4  0  96  0
k1sr3    18   29  7  1  93  0   33   11  7  1  93  0
k1srm    16   15  5  0  95  0    2    2  5  0  95  0
k1omc0   60    7 18  2  80  0   10   53 18  2  80  0
k1omc1    1    1  0  0  99  0    0    0  0  0  99  0
k1ix1     1    0  6  0  94  0    0    0  6  0  94  0
k1iy1     0    2  7  1  92  0    3    1  7  1  92  0
k1ex1     0    1  9  1  91  0    0    1  4  1  95  0
k1ey1     1    0  6  0  94  0    0    1  6  0  94  0
k1ex0     1    0  4  1  95  0    0    1  4  1  95  0
k1ey0    33   25  5  1  94  0    1    0  5  1  94  0
*k1px1                           1    0  0  0 100  0
*k1test0                         0    1 16  2  82  0
k1iy0     0    0  1  0  99  0    0    0  1  0  99  0
in: interrupts per second, including clock interrupts
cs: context switches per second
us: time spent running non-kernel code (user time, including nice time) (%)
sy: time spent running kernel code (system time) (%)
id: idle time (%)
wa: I/O wait time (%)

k1boot:/diskless/root/etc/rc.Local
 Before: /usr/sbin/ethtool -C eth1 rx-usecs 1
 After:  /usr/sbin/ethtool -C eth1 rx-usecs 0
 

Non-image files attached to this comment
satoru.ikeda - 15:26 Friday 01 September 2023 (26583)

Results of changing the NIC settings on all FEs in relation to the IPC glitches.

8/26: 2 sets (14 times)
8/27: 0 sets (0 times)
8/28: 2 sets (14 times)
8/29: 2 sets (15 times)
8/30: 2 sets (15 times)
<< 8/31: DAQ restarted
8/31: 98 sets (100 times)
 => IPC glitches recurred at 17:00 (97 times)
9/1: 1 set (5 times), as of 9:00
* Sets: errors occurring within the same second are counted as one.
* Times: counted in 16 Hz units.
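The sets-vs-times distinction can be sketched as follows: collapse the 16 Hz samples that share the same integer second. The fractional-second timestamp format is an assumption for illustration:

```shell
#!/bin/sh
# "Times": every 16 Hz sample with an error. "Sets": samples within the
# same second collapsed into one. Timestamps below are made up.
samples='1693400000.0625
1693400000.1250
1693400007.5000'

echo "$samples" | wc -l                          # times: 3
echo "$samples" | cut -d. -f1 | sort -u | wc -l  # sets:  2
```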

The situation was improving, at least until the DAQ was restarted.
We restarted the DAQ again today, so we will see how it goes for another week.
 

Non-image files attached to this comment
satoru.ikeda - 16:09 Friday 08 September 2023 (26696)

In relation to the IPC glitches, I tried the following procedure for restarting the DAQ.

In K-Log#25698, the k1dc0 settings were changed to improve the situation.
I tried this again in order to finalize the procedure.

1. Restored the rx-usecs settings for the FEs and k1dc0:
 FE: 0 -> 1; k1dc0: 1 -> 10
 
 K-Log entries when these were set: FE, k1dc0

2. Restarted DAQ.

3. Changed the rx-usecs setting of k1dc0 again:
 k1dc0: 10 -> 1 (K-Log#25698)
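The three steps can be written down as a dry-run script so the procedure is reproducible. A sketch (the FE host list is an illustrative subset, and the DAQ restart step is only a placeholder, since the actual restart command is not given in this log):

```shell
#!/bin/sh
# Dry run of the restart procedure: echo each step instead of executing it.
# FE_HOSTS is an illustrative subset; the restart step is a placeholder.
FE_HOSTS='k1lsc0 k1asc0 k1omc0'
for fe in $FE_HOSTS; do
  echo "ssh $fe sudo ethtool -C eth1 rx-usecs 1"   # FE: 0 -> 1
done
echo "ssh k1dc0 sudo ethtool -C eth1 rx-usecs 10"  # k1dc0: 1 -> 10
echo '# restart the DAQ here'
echo "ssh k1dc0 sudo ethtool -C eth1 rx-usecs 1"   # k1dc0: 10 -> 1
```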

Results will be checked next week.

Non-image files attached to this comment
satoru.ikeda - 13:14 Friday 15 September 2023 (26774)

Here are the results of last week's setup.
The IPC glitches that occurred at regular intervals are gone.
However, they started to recur after the DAQ restart on the 12th (K-Log#26744).
Therefore, the DAQ was restarted again using the same procedure (k1dc0 rx-usecs: 1 -> 10 -> DAQ restart -> 1).
Testing will continue this week.
 

Non-image files attached to this comment