Abstract
We took backups of the core servers of the digital system. Backups were completed for all servers whose maintenance conflicts with other activities that use the digital system.
Backups of the remaining servers will be taken after next week.
They can probably be done while sharing time with other activities.
This kind of maintenance should be done more frequently.
This time, many of the troubles we encountered came from the fact that backups had not been taken for a long time.
Details
We took backups of the following servers: k1boot, k1dc0, k1fw0, k1fw1, k1nds0, k1nds1, k1tw0, k1tw1, k1nfs0, k1grd0, and k1epics.
While taking backups of k1boot, k1dc0, and k1nfs0, we ran into some trouble.
k1boot
We could not copy the system disk because it had too many bad sectors.
This error could not be resolved by changing the settings of the disk-copy instrument.
We then put the HDD back into the server, but it could not boot because of SATA errors.
So we gave up on this HDD and tried to restore from the latest backup disk, taken in Jun. 2022.
Although this backup disk was still usable, we needed to re-apply the changes made between Jun. 2022 and Apr. 2023.
This time, re-applying those changes was not done completely (I missed a few changes at first).
Recovering the real-time models also took a long time.
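To catch missed changes more systematically next time, a comparison of the restored tree against a reference copy could be scripted. The following is only a rough Python sketch; the paths and the existence of a reference copy are assumptions for illustration, not part of our actual procedure.

#!/usr/bin/env python3
"""Rough sketch: report files that differ between a restored tree and a
reference tree, to catch changes missed during a restore.
The paths used are placeholders, not actual mount points."""
import hashlib
import os
import sys

def file_hashes(root):
    """Map path (relative to root) -> sha256 of file contents."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            hashes[os.path.relpath(path, root)] = h.hexdigest()
    return hashes

def main(restored, reference):
    a, b = file_hashes(restored), file_hashes(reference)
    for rel in sorted(set(a) | set(b)):
        if rel not in a:
            print(f"missing in restored: {rel}")
        elif rel not in b:
            print(f"only in restored:    {rel}")
        elif a[rel] != b[rel]:
            print(f"content differs:     {rel}")

if __name__ == "__main__":
    # e.g. ./compare_trees.py /mnt/restored /mnt/reference  (placeholder paths)
    main(sys.argv[1], sys.argv[2])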
After recovering k1boot, we rebooted all real-time front-ends.
At that point, a Dolphin glitch occurred and many front-ends went down.
Recovery was done with the same procedure as usual.
However, on some front-ends around the BS-SR area, whose IO chassis are driven by AC power, the ADC timing was out of sync.
I suspect electrical glitches on the AC power, caused by an instantaneous blackout
or by turning some instruments such as a vacuum pump ON/OFF around 15:00-16:00 (?).
Both problems (too many bad sectors, and only a very old backup being available) can be avoided by taking backups more frequently.
So it would be better to set aside time for this kind of maintenance.
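As a small step in that direction, the age of the newest backup could be checked automatically, for example with a sketch like the one below (the backup directory layout and the threshold are assumptions for illustration only).

#!/usr/bin/env python3
"""Sketch: warn if the newest backup image of a server is older than a
threshold. The directory layout (one file per backup under BACKUP_DIR)
is a hypothetical example."""
import os
import sys
import time

BACKUP_DIR = "/backup/k1boot"   # hypothetical location of backup images
MAX_AGE_DAYS = 90               # warn if the newest backup is older than this

def newest_mtime(directory):
    paths = [os.path.join(directory, n) for n in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(os.path.getmtime(p) for p in files) if files else None

def main():
    mtime = newest_mtime(BACKUP_DIR)
    if mtime is None:
        print(f"WARNING: no backups found in {BACKUP_DIR}")
        return 1
    age_days = (time.time() - mtime) / 86400
    if age_days > MAX_AGE_DAYS:
        print(f"WARNING: newest backup is {age_days:.0f} days old")
        return 1
    print(f"OK: newest backup is {age_days:.0f} days old")
    return 0

if __name__ == "__main__":
    sys.exit(main())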
k1dc0
The disk copy could not be done in strict-check mode due to bad sectors.
So the backup was taken by copying the bad sectors on the original disk as-is onto the new disk.
For this HDD, we will run 'fsck' after taking one more backup.
We also plan to take a cleaned backup (with the bad sectors removed) on a new HDD.
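Before running fsck, it may also help to have a rough map of the unreadable regions. The sketch below only illustrates the idea with a plain sequential read; in practice a dedicated tool such as ddrescue or badblocks would normally be used, and the device path shown is a placeholder.

#!/usr/bin/env python3
"""Sketch: read a disk (or disk image) sequentially and log the offsets
where reads fail, to get a rough map of bad regions before fsck.
Read-only; the device path is a placeholder and root access is needed."""
import os
import sys

CHUNK = 1 << 20  # step through the disk in 1 MiB reads

def scan(device):
    fd = os.open(device, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    bad = 0
    offset = 0
    while offset < size:
        try:
            os.pread(fd, min(CHUNK, size - offset), offset)
        except OSError:
            bad += 1
            print(f"read error at offset {offset}")
        offset += CHUNK
    os.close(fd)
    print(f"{bad} unreadable chunk(s) out of roughly {size // CHUNK + 1}")

if __name__ == "__main__":
    # e.g. sudo ./scan_badblocks.py /dev/sdX   (placeholder device name)
    scan(sys.argv[1])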
k1nfs0
When the server was booted up after the backup, it did not connect to the network.
This was solved by swapping the two LAN cables (for the DGS and PICO networks) with each other.
Since we did not touch any hardware such as the network switches during this maintenance,
this suggests that the NIC settings on the OS were swapped between before and after the reboot.
After the reboot, there is no conflict between the NetworkManager settings and the actual hardware connections.
Had someone changed the NetworkManager settings and left them in place without rebooting?
Are multiple network management tools running on k1nfs0?
If it is the former, this was just a human mistake and we can avoid it.
If it is the latter, it may occur again at the next reboot.
So we should find the cause of this problem.
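One simple check after the next reboot would be to compare the interface-to-MAC mapping against the expected one, to see directly whether the NIC assignment (DGS vs. PICO) swapped again. A rough Python sketch, with placeholder interface names and MAC addresses, might look like this.

#!/usr/bin/env python3
"""Sketch: compare the current interface -> MAC mapping on a Linux host
against an expected mapping, to detect whether the NIC assignment
(e.g. DGS vs. PICO) changed across a reboot. Values are placeholders."""
import sys

# Hypothetical expected mapping; fill in the real values for k1nfs0.
EXPECTED = {
    "eth0": "aa:bb:cc:dd:ee:00",   # DGS network (placeholder)
    "eth1": "aa:bb:cc:dd:ee:01",   # PICO network (placeholder)
}

def current_macs():
    macs = {}
    for name in EXPECTED:
        try:
            with open(f"/sys/class/net/{name}/address") as f:
                macs[name] = f.read().strip()
        except FileNotFoundError:
            macs[name] = None
    return macs

def main():
    actual = current_macs()
    ok = True
    for name, expected in EXPECTED.items():
        got = actual.get(name)
        if got != expected:
            ok = False
            print(f"MISMATCH {name}: expected {expected}, got {got}")
        else:
            print(f"OK       {name}: {got}")
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())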