[Hido, Yuzurihara]
Hido-san reported duplicated entries in the recent cache files on kmst-2 (Kashiwa cluster). The times of the duplications are summarized in the attached txt file. I checked several things, but I have not identified the root cause. In the meantime, it would be better to put a countermeasure in place.
Details
- According to the condor submission history, the job that produces the cache file finished without errors.
- I found that many jobs were running around the times of the duplications.
- To isolate the cause, I submitted many jobs to occupy the CPU resources and then ran the job from the detchar account. The last submitted job ran on the dedicated CPU, so the large number of jobs is not the direct cause.
- Note that the job that produces the cache file has been running on the dedicated CPU since 2025/05/07 (see klog33698).
- I regenerated the cache files for the 14359, 14360, and 14366 directories. The original cache files that contain the duplications are stored in /home/detchar/work/20250718_HTcondor
- After several checks, we will need to regenerate the segment files for these times.
- As a countermeasure, it would be better to update the script so that it removes duplications from the cache file (for example, using uniq). I will test the script before running it regularly.
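
The countermeasure in the last item could be sketched as below. This is only a minimal illustration, not the actual production script: it assumes the cache file is plain text with one entry per line, and the function names (find_duplicates, dedupe_lines, dedupe_cache_file) are hypothetical. Note that uniq removes only adjacent repeated lines, so if the duplicated entries are not adjacent, an order-preserving approach like this one (or sorting first) would be needed.

```python
from collections import Counter
from pathlib import Path


def find_duplicates(lines):
    """Return a dict of entries that appear more than once, with their counts."""
    counts = Counter(lines)
    return {entry: n for entry, n in counts.items() if n > 1}


def dedupe_lines(lines):
    """Drop repeated entries, keeping the first occurrence and the original order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(lines))


def dedupe_cache_file(src, dst):
    """Write a deduplicated copy of src to dst and return the number of lines removed.

    Assumption: one cache entry per line. Writing to a separate file keeps the
    original available for comparison.
    """
    lines = Path(src).read_text().splitlines()
    unique = dedupe_lines(lines)
    Path(dst).write_text("".join(line + "\n" for line in unique))
    return len(lines) - len(unique)
```

Running find_duplicates on the stored files first would show which entries are repeated before anything is rewritten, which may also help when checking the affected times.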