DetChar (General)
shoichi.oshino - 10:34 Tuesday 14 March 2023 (24411)
SummaryPage trouble: process stalled since 3/13 21:00 UTC
The SummaryPage LSC process had been stalled since 3/13 21:00 UTC.
I checked the logs of this process and found that it had stopped while generating coherence plots.
I removed this process.
After that, SummaryPage started creating plots again.
Comments to this report:
takahiro.yamamoto - 18:35 Tuesday 14 March 2023 (24418)

How about reducing the timeout period?

The current timeout is set to 12 hours (43200 s).
The SummaryPages default (on LIGO git) is 3 hours (10800 s).
So someone intentionally(?) changed it at KAGRA.

Figure 1 shows the calculation time of each process over the past week.
(Logs are kept only for the most recent week.)
Because the amount of data used grows toward the end of the day, the calculation time increases with the time of day in UTC.
Judging from the calculation time around 23h UTC, reducing the timeout period seems to pose no problem,
though the number of outliers seems to be large during the daytime in JST.
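
A minimal sketch of the check described above: convert a timeout from hours to seconds and count how many jobs would exceed it. The run times in the list are hypothetical placeholders, not values read from Figure 1, and the function names are made up for illustration.

```python
def hours_to_seconds(hours):
    """Convert a timeout given in hours to seconds."""
    return int(hours * 3600)


def count_exceeding(run_times_s, timeout_s):
    """Count jobs whose run time (in seconds) is longer than the timeout."""
    return sum(1 for t in run_times_s if t > timeout_s)


# Hypothetical run times in seconds (NOT taken from the real logs).
run_times_s = [420, 950, 1800, 2600, 5400, 12000]

# Compare the current KAGRA setting (12 h) with the SummaryPages default (3 h).
for hours in (12, 3):
    timeout_s = hours_to_seconds(hours)
    n_over = count_exceeding(run_times_s, timeout_s)
    print(f"timeout {hours} h = {timeout_s} s: {n_over} job(s) would be cut")
```

Feeding the real per-process run times from the one-week log into `count_exceeding` would show how many jobs a shorter timeout would actually cut.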

By the way, is it possible that SummaryPages uses a lot of NDS resources?
Most of the outliers appear around 3h-12h UTC (= 12h-21h JST).
This is roughly the same time as commissioning activities (mainly by Ushiba-kun).
We also sometimes get "NDS overloaded" errors on ndscope even when few TestPoint channels are in use.
So SummaryPages and commissioning activities may be competing for NDS resources.

If so, this error may be mitigated by preparing a dedicated NDS server for SummaryPages.

Images attached to this comment
shoichi.oshino - 16:45 Wednesday 15 March 2023 (24430)
[Oshino, Yamamoto]

We checked the script and the HTCondor submit files and found that the "condor-timeout" parameter controls the timeout of each process.
We therefore set this parameter to 1 hour to limit the calculation time.

We also checked how SummaryPage decides which NDS server to use.
To do so, we configured a fake NDS server and checked whether the SummaryPage process still ran.
For some reason, SummaryPage still worked.
So SummaryPage is somehow getting the correct NDS information.
hirotaka.yuzurihara - 11:01 Wednesday 22 March 2023 (24484)

Related to this issue, I checked the recent history and found one example of a job halted by the timeout. So I guess this modification improved the stability of the summary page!

controls@k1sum0:~$ condor_history  -w | grep " X" | more
3711672.0   controls        3/19 14:06   0+01:00:02 X         ???  /home/controls/bin/miniconda2/envs/ligo-summary-3.7/bin/gw_summary day --multi-process 4 --verbose --ifo K1 --on-segdb-error warn --on-datafind-error warn --output-dir . --no-html --archive K1GIF --config-file /home/controls/public_html/summary/etc/defaults.ini,/home/controls/public_html/summary/etc/k1global.ini,/home/controls/public_html/summary/etc/k1gif.ini
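
As a rough illustration of how to read the line above: the status column "X" marks a removed job, and the run time column "0+01:00:02" is in days+HH:MM:SS, i.e. 3602 s, just over the 1-hour timeout. The sketch below parses these two columns. The helper names and the 60 s slack are assumptions made for illustration; they are not part of HTCondor or SummaryPages, and the sample command is shortened.

```python
import re

# Run time column as printed by condor_history, e.g. "0+01:00:02" (days+HH:MM:SS).
RUNTIME_RE = re.compile(r"(\d+)\+(\d{2}):(\d{2}):(\d{2})")


def runtime_seconds(line):
    """Return the run time found in a condor_history line, in seconds, or None."""
    m = RUNTIME_RE.search(line)
    if m is None:
        return None
    days, hours, minutes, seconds = map(int, m.groups())
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def removed_near_timeout(line, timeout_s=3600, slack_s=60):
    """True if the job has status X (removed) and ran for about timeout_s.

    slack_s is a hypothetical tolerance for the delay between reaching
    the timeout and the actual removal of the job.
    """
    fields = line.split()
    # Columns: ID, OWNER, date, time, RUN_TIME, ST, COMPLETED, CMD...
    if len(fields) < 6 or fields[5] != "X":
        return False
    run_s = runtime_seconds(line)
    return run_s is not None and timeout_s <= run_s <= timeout_s + slack_s


# Line copied from the output above, with the command shortened.
sample = ("3711672.0   controls        3/19 14:06   0+01:00:02 X         ???  "
          "/home/controls/bin/miniconda2/envs/ligo-summary-3.7/bin/gw_summary day")

print(runtime_seconds(sample), removed_near_timeout(sample))
```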

 

hirotaka.yuzurihara - 15:41 Tuesday 28 March 2023 (24569)

I checked the recent condor_history and found one example of a timeout. The timeout option is contributing to stable operation. Nice! (Has anyone seen the summary page stop?)

controls@k1sum0:~$ condor_history  -w | grep " X" | more
3723095.0   controls        3/26 03:06   0+01:00:02 X         ???  /home/controls/bin/miniconda2/envs/ligo-summary-3.7/bin/gw_summary day --multi-process 4 --verbose --ifo K1 --on-segdb-error warn --on-datafind-error warn --output-dir . --no-html --archive K1PSL --config-file /home/controls/public_html/summary/etc/defaults.ini,/home/controls/public_html/summary/etc/k1global.ini,/home/controls/public_html/summary/etc/k1psl.ini
 
