DGS (General)
takahiro.yamamoto - 0:47 Saturday 16 April 2022 (20498)
Comment to mx_stream hung up on K1PR2 and K1IY0 (20435)

Abstract

- The most significant cause is the network traffic on the DAQ switch in the C1 rack.
- The DAQ is now stable with all real-time models running, after connecting some RTFEs in the C1 rack directly to the DAQ switch in the B1 rack.
- As a long-term solution, we should replace the DAQ switch with a faster one so that the cabling in each rack can stay simple.
- HT was disabled on k1dc0. This did not fix the instability, but the resulting configuration seems better.

Details

At first, I disabled HT on k1dc0 in order to improve the load balancing of I/O interrupts reported in klog#20464.

This was not effective against the instability of the DAQ system, but I/O interrupts are now processed on multiple cores. This configuration seems worth applying to the other DAQ nodes as well. I did not have enough time today, so that will be done on the next maintenance day.
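As a rough way to check how interrupts are spread over cores, the per-CPU counts can be summed from text in the Linux /proc/interrupts layout. The following is only a sketch: the function names, the 5% threshold, and the SAMPLE text are all hypothetical and are not taken from k1dc0.

```python
# Sketch only: SAMPLE is made-up text in the /proc/interrupts layout,
# not data from k1dc0. Function names and the threshold are hypothetical.
SAMPLE = """
           CPU0       CPU1       CPU2       CPU3
  0:         40          0          0          0   IO-APIC   2-edge      timer
 24:     900000          0         10          0   PCI-MSI   eth0
 25:      50000     100000          0          0   PCI-MSI   eth1
ERR:          0
"""


def interrupts_per_cpu(text):
    """Sum the interrupt counts of every IRQ row for each CPU column."""
    lines = text.strip().splitlines()
    cpus = lines[0].split()  # header row: CPU0 CPU1 ...
    totals = {cpu: 0 for cpu in cpus}
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        counts = fields[1:1 + len(cpus)]
        # Rows such as ERR: carry a single count; skip rows without
        # one numeric column per CPU.
        if len(counts) < len(cpus) or not all(c.isdigit() for c in counts):
            continue
        for cpu, count in zip(cpus, counts):
            totals[cpu] += int(count)
    return totals


def busy_cpus(totals, threshold=0.05):
    """Return CPUs that handled at least `threshold` of all interrupts."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    return [cpu for cpu, n in totals.items() if n / grand_total >= threshold]


if __name__ == "__main__":
    totals = interrupts_per_cpu(SAMPLE)
    print(totals)
    print(busy_cpus(totals))
```

On a real node the text would be read with something like open("/proc/interrupts").read() before and after the HT change, and the two busy_cpus lists compared.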


We knew that, as a temporary solution, the DAQ became stable when k1ioo1, k1pr2, k1pr3, or k1prm was stopped. Today I enabled mx_stream only on k1ioo1, k1pr2, k1pr3, and k1prm, and the timing error did not occur. After that, when mx_stream on k1imc0 or k1ioo was enabled, the timing error appeared. On the other hand, the timing error did not appear even when mx_stream was enabled on k1lsc0 or k1asc0.
Note that all of these models are installed in the C1 rack and connected to the same network switch.

Because of this, I had thought that the problem was not the network traffic in the C1 rack. In fact, the total data rate of all models in the C1 rack is ~18Mbps (see also klog#20474). The LSC and ASC models are the largest and the second-largest models in KAGRA, respectively.

However, because the problem seemed to lie in the C1 rack, I tried disconnecting some RTFEs from the DAQ switch in the C1 rack and connecting them directly to the DAQ switch in the B1 rack, which is upstream of the C1 rack. The error rate then decreased drastically (once per minute -> once per a few tens of minutes) when k1ioo, k1ioo1, k1imc0, k1pr2, k1pr3, or k1prm was connected to the B1 rack. Connecting k1lsc or k1asc had no effect.

k1lsc and k1asc are currently not in use, and many of their channels are always 0. Data compression therefore works effectively, and the traffic from k1lsc and k1asc may be small relative to the model size.
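The effect of all-zero channels on compressed size can be illustrated with a small sketch. zlib is used here purely as a stand-in, because this report does not state which compression the DAQ stream uses. The sample count, the float32 packing, and the names below are also assumptions made for illustration.

```python
import random
import struct
import zlib


def compressed_ratio(samples):
    """Return compressed size / raw size for samples packed as float32.

    zlib is only a stand-in here; it is not claimed to be the
    compression used by the DAQ.
    """
    raw = struct.pack(f"<{len(samples)}f", *samples)
    return len(zlib.compress(raw)) / len(raw)


N = 16384  # hypothetical number of samples in one channel block

# A channel of an unused model: every sample is 0.
idle_channel = [0.0] * N

# A channel carrying a noisy signal (seeded for reproducibility).
rng = random.Random(0)
active_channel = [rng.gauss(0.0, 1.0) for _ in range(N)]

if __name__ == "__main__":
    print("idle  :", compressed_ratio(idle_channel))
    print("active:", compressed_ratio(active_channel))
```

The idle channel shrinks to a tiny fraction of its raw size while the noisy one barely shrinks, which is consistent with a large but unused model contributing little to the actual traffic.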

Finally, k1ioo, k1ioo1, and k1imc in the C1 rack and k1bs in the C2 rack were connected directly to the DAQ switch in the B1 rack.
There has been no error in the last several hours.

-----
HT status of DAQ nodes
  enabled  => FW0, NDS0, NDS1
  disabled => DC0, FW1, TW0, TW1, BCST0

DC0:/etc/intrrupt
  HT enabled  => Only on CPU0 (CPU0-11)
  HT disabled => Only on CPU0-2, 5 (CPU0-5)