DGS (General)
takahiro.yamamoto - 17:31 Friday 05 April 2024 (29110)
Hardware modification of k1dc0
[Ikeda, YamaT]

This work is a continuation of klog#28944.

Abstract

To check whether the IPC glitches are caused by NIC performance, we replaced the k1dc0 computer with one that has two Myrinet cards.
Now some of the real-time front-ends send data to the primary card on k1dc0 and the others send to the secondary card.
Because this configuration seemed to work stably during today's daytime, we will keep it through next week
in order to see whether the IPC glitch situation changes.
During this work, k1test0 was rebooted to recover from a problem caused by a bug we introduced in the boot script of mx_stream
(though the reboot might not have been necessary).

Details

If NIC performance is a cause of the IPC glitches, they might be reduced by distributing the load across two NICs. So we replaced k1dc0 with a computer that has two Myrinet cards. The other specifications (part numbers of the computer chassis, CPU, RAM, etc.) are the same as those of the original one.

At first, the RTFEs and the new k1dc0 couldn't communicate with each other because the RTFEs were still trying to send data to the hardware address of the old k1dc0. It seems that omx, not mx_stream, needs to be restarted to change the destination of the data stream. After restarting omx, the RTFEs started to communicate with the new k1dc0.

Next, we modified the boot script of mx_stream to distribute the data streams between the 1st NIC and the 2nd NIC, and then restarted mx_stream. Because we had introduced a bug in the boot script of mx_stream, only k1test0 didn't work, so k1test0 was rebooted once. After the reboot, k1test0 wasn't synchronized with TDS, and a cold boot was needed to fix the synchronization error.

Finally, all RTFEs started to communicate with the new k1dc0 and the DAQ stream seems to work stably. So we decided to keep this configuration for the next week. If the IPC glitch problem still remains, it should show up around midnight tonight. In that case, it may be better to distribute the data streams between the 1st and 2nd NICs taking the data size of each RTFE into account, because they are now just mechanically assigned in hostname order, so there may be an imbalance in the amount of data between the 1st and 2nd NICs.
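
As a rough illustration of the size-aware assignment mentioned above, the following Python sketch compares a size-blind assignment in hostname order with a greedy assignment that places the largest streams first on the less-loaded NIC. This is only a sketch under assumptions: the hostnames and data rates are hypothetical placeholders (not measured values), "hostname order" is assumed here to mean alternating NICs over the sorted hostnames, and the function names are made up for this example. It does not reflect the actual contents of the mx_stream boot script.

```python
def assign_by_hostname(rates):
    """Alternate between NIC 0 and NIC 1 over sorted hostnames (size-blind)."""
    nics = {0: [], 1: []}
    for i, host in enumerate(sorted(rates)):
        nics[i % 2].append(host)
    return nics


def assign_by_size(rates):
    """Greedy balancing: largest stream first, onto the less-loaded NIC."""
    nics = {0: [], 1: []}
    load = {0: 0.0, 1: 0.0}
    # Sort by descending rate; break ties by hostname for reproducibility.
    for host in sorted(rates, key=lambda h: (-rates[h], h)):
        target = 0 if load[0] <= load[1] else 1
        nics[target].append(host)
        load[target] += rates[host]
    return nics


def nic_loads(nics, rates):
    """Total data rate carried by each NIC."""
    return {nic: sum(rates[h] for h in hosts) for nic, hosts in nics.items()}


# Hypothetical hostnames and data rates (arbitrary units), illustration only.
example_rates = {
    "k1fe_a": 8.0, "k1fe_b": 1.0, "k1fe_c": 6.0,
    "k1fe_d": 1.0, "k1fe_e": 4.0, "k1fe_f": 2.0,
}

if __name__ == "__main__":
    for name, func in [("hostname order", assign_by_hostname),
                       ("size-aware", assign_by_size)]:
        nics = func(example_rates)
        print(name, nics, nic_loads(nics, example_rates))
```

The greedy scheme is not guaranteed to give the optimal split, but it is simple and usually removes most of the imbalance. Real per-RTFE data rates would have to be taken from the DAQ configuration before applying anything like this.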