DetChar (General)
hirotaka.yuzurihara - 16:11 Friday 18 July 2025 (34595)
Duplication in the recent cache files at Kashiwa cluster

[Hido, Yuzurihara]

Hido-san reported that there was a duplication in the recent cache files at kmst-2 (the Kashiwa cluster). The times of the duplication are summarized in the attached txt file. I checked several things, but I'm not sure of the root cause yet. It's better to implement a countermeasure.

Details

  • As far as I checked the Condor submission history, the jobs to produce the cache files finished without errors.
  • I found that many jobs were running around the time of the duplication.
    • To isolate the cause of the duplication, I tried running many jobs to occupy the CPU resources while running the cache-production job under the detchar account. The last submitted job still ran on the dedicated CPU, so the large number of jobs is not the direct cause.
    • Note that the job to produce the cache file has been running on the dedicated CPU since 2025/05/07 (see klog33698).
  • I reproduced the cache files for the 14359, 14360, and 14366 directories. The cache files containing the duplication are stored in /home/detchar/work/20250718_HTcondor.
  • We will need to reproduce the segment files for these times after several checks.
  • As a countermeasure, it's better to update the script to remove the duplications from the cache file (for example with `uniq`); a minimal sketch is shown below. Before running the updated script regularly, I will test it.
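This is a minimal sketch of such a deduplication step, assuming the cache file is plain text with one entry per line; the file name is a placeholder. Note that plain `uniq` only drops adjacent duplicates, so the file either has to be sorted first or filtered in an order-preserving way:

```bash
# Hypothetical example; "K-K1_C.cache" is a placeholder file name.
# Option 1: sort and drop duplicates (reorders lines if they are not already sorted).
sort -u K-K1_C.cache > K-K1_C.cache.tmp
# Option 2: drop duplicated lines while preserving the original order.
awk '!seen[$0]++' K-K1_C.cache > K-K1_C.cache.tmp
mv K-K1_C.cache.tmp K-K1_C.cache
```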
Non-image files attached to this report
Comments to this report:
takahiro.yamamoto - 17:28 Friday 18 July 2025 (34596)
How about a delay in the IDLE-to-RUN transition in the Condor queue?

Though I don't remember the detailed implementation, the executed script probably decides the time span it should analyze at the beginning of the script. This happens after the transition from IDLE to RUN. So if the waiting time in IDLE is longer than the job submission interval and multiple makeCache jobs spooled in the queue transit to RUN at the same time, duplication can occur.
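A minimal sketch of this race, assuming (as described above) that the analysis span is computed when the job starts running rather than when it is submitted; the variable names and the 120-second span are illustrative only, not taken from the actual makeCache script:

```bash
# Hypothetical illustration only; the real makeCache script is not shown in this thread.
SPAN=120                              # assumed analysis span (= submission interval)
NOW=$(date +%s)                       # evaluated after the IDLE -> RUN transition
START=$(( NOW - NOW % SPAN - SPAN ))  # most recently completed span
END=$(( START + SPAN ))
echo "process span ${START} .. ${END}"
# If two queued jobs leave IDLE at the same moment, both evaluate NOW at the same
# time, compute identical START/END, and write the same entries to the cache file.
```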
hirotaka.yuzurihara - 12:43 Tuesday 22 July 2025 (34621)

I checked the HTCondor log file (/home/detchar/condor/jobs.log-20250713). For example, the jobs with ID = 875399 and 875400 were the processes that produced a duplication.

  • (875399.000.000) 2025-07-07 17:12:01 Job submitted from host:
  • (875399.000.000) 2025-07-07 17:15:03 Job executing on host:
  • (875400.000.000) 2025-07-07 17:14:01 Job submitted from host:
  • (875400.000.000) 2025-07-07 17:15:03 Job executing on host

After the initial submission of ID=875399, the IDLE state continued for over 3 minutes, while the crontab submits the process to HTCondor every 2 minutes. That is how the duplication was caused.
As another example, the same thing happened for the jobs with ID=876466 and 876467.
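For reference, the submission and execution timestamps above can be extracted from the HTCondor user log like this (a sketch; only the cluster ID and the log path are taken from this report):

```bash
# Show the "submitted" and "executing" events for one cluster ID.
grep '875399.000.000' /home/detchar/condor/jobs.log-20250713 | grep -E 'submitted|executing'
```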

So I think your intuition is correct.
The script that submits the job is /home/detchar/git/kagra-detchar/tools/Cache/Script/condor.sh. I'm not sure why the line `periodic_remove = (JobStatus == 1) && (time() - QDate) > 45` is commented out. I think uncommenting it can solve the issue of multiple processes running and creating the duplication in the cache file.
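For context, `periodic_remove` is a standard HTCondor submit command that is evaluated periodically while the job is in the queue: `JobStatus == 1` means the job is still IDLE and `QDate` is the submission time, so this expression removes any job that has waited longer than 45 seconds. A minimal sketch of how the line would sit in a submit description is below; whether condor.sh really writes the file inline like this, and the executable and log names, are assumptions:

```bash
# Hypothetical sketch; only the periodic_remove line is taken from this report.
cat > makeCache.sub <<'EOF'
executable      = makeCache.sh
log             = /home/detchar/condor/jobs.log
# Remove the job if it has stayed IDLE (JobStatus == 1) for more than 45 seconds.
periodic_remove = (JobStatus == 1) && (time() - QDate) > 45
queue
EOF
condor_submit makeCache.sub
```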

hirotaka.yuzurihara - 16:01 Wednesday 23 July 2025 (34639)

I edited the script to use `periodic_remove = (JobStatus == 1) && (time() - QDate) > 110`. I chose 110 because 45 seconds was not long enough for the transition from IDLE to RUN. I hope this modification solves the duplication issue.

I also reproduced the segment files of 2025-07-{07, 08, 09, 15, 16} in /home/detchar/Segments.

hirotaka.yuzurihara - 0:29 Monday 28 July 2025 (34677)

This is a follow-up confirmation. 

Thanks to the code update, multiple simultaneous runs producing the cache file no longer happened, as far as I checked.
For example, between 10:14 and 10:24 on July 24, the transition of the job from IDLE to RUN took a long time (more than 2 minutes). Once 110 seconds had passed, the submitted job was automatically removed to avoid a multiple run. This means that the script fix is working as intended.
(Command memo: `condor_history -constraint 'JobStatus == 3' | grep "detchar" | more`, where JobStatus == 3 selects removed jobs.)

Also, there were no duplicates in the cache file after July 23.
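A quick way to confirm this (a sketch; the file name is again a placeholder) is that `uniq -d` on the sorted cache prints nothing:

```bash
# Prints every line that appears more than once; empty output means no duplicates.
sort K-K1_C.cache | uniq -d
```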

In addition, I reproduced the cache files and segment files for 5/23, 6/9, 7/2, and 7/22.
