Abstract
We took backups of the core servers of the digital system. Backups were completed for all servers whose maintenance conflicts with other activities that use the digital system.
Backups of the remaining servers will be taken after next week.
They can probably be done while sharing time with other activities.
This kind of maintenance should be done more frequently.
This time, many of the troubles we encountered came from the fact that backups had not been taken for a long time.
Details
We took backups of the following servers: k1boot, k1dc0, k1fw0, k1fw1, k1nds0, k1nds1, k1tw0, k1tw1, k1nfs0, k1grd0, and k1epics.
While taking backups of k1boot, k1dc0, and k1nfs0, we ran into some trouble.
k1boot
We could not copy the system disk because it had too many bad sectors.
This error could not be resolved by changing the settings of the disk-copy instrument.
We then put the HDD back into the server, but it could not boot because of SATA errors.
So we gave up on this HDD and tried to restore from the latest backup disk, taken in Jun. 2022.
Although this backup disk was still usable, we needed to re-apply the changes made between Jun. 2022 and Apr. 2023.
This time, re-applying those changes was not done completely (I missed a few changes at first).
Recovering the real-time models also took a long time.
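To catch missed changes more systematically next time, a comparison of the restored tree against a reference copy could be scripted. The following is only a rough Python sketch; the paths and the existence of a reference copy are assumptions for illustration, not part of our actual procedure.

#!/usr/bin/env python3
"""Rough sketch: report files that differ between a restored tree and a
reference tree, to catch changes missed during a restore.
The paths used are placeholders, not actual mount points."""
import hashlib
import os
import sys

def file_hashes(root):
    """Map path (relative to root) -> sha256 of file contents."""
    hashes = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(1 << 20), b""):
                    h.update(chunk)
            hashes[os.path.relpath(path, root)] = h.hexdigest()
    return hashes

def main(restored, reference):
    a, b = file_hashes(restored), file_hashes(reference)
    for rel in sorted(set(a) | set(b)):
        if rel not in a:
            print(f"missing in restored: {rel}")
        elif rel not in b:
            print(f"only in restored:    {rel}")
        elif a[rel] != b[rel]:
            print(f"content differs:     {rel}")

if __name__ == "__main__":
    # e.g. ./compare_trees.py /mnt/restored /mnt/reference  (placeholder paths)
    main(sys.argv[1], sys.argv[2])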
After recovering k1boot, we rebooted all real-time front-ends.
At that point, a Dolphin glitch occurred and many front-ends went down.
Recovery was done with the same procedure as usual.
However, on some front-ends around the BS-SR area, whose IO chassis are driven by AC power, the ADC timing was out of sync.
I suspect electrical glitches on the AC power, caused by an instantaneous blackout
or by turning some instruments such as a vacuum pump ON/OFF around 15:00-16:00 (?).
Both problems (too many bad sectors, and only a very old backup being available) can be avoided by taking backups more frequently.
So it would be better to set aside time for this kind of maintenance.
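As a small step in that direction, the age of the newest backup could be checked automatically, for example with a sketch like the one below (the backup directory layout and the threshold are assumptions for illustration only).

#!/usr/bin/env python3
"""Sketch: warn if the newest backup image of a server is older than a
threshold. The directory layout (one file per backup under BACKUP_DIR)
is a hypothetical example."""
import os
import sys
import time

BACKUP_DIR = "/backup/k1boot"   # hypothetical location of backup images
MAX_AGE_DAYS = 90               # warn if the newest backup is older than this

def newest_mtime(directory):
    paths = [os.path.join(directory, n) for n in os.listdir(directory)]
    files = [p for p in paths if os.path.isfile(p)]
    return max(os.path.getmtime(p) for p in files) if files else None

def main():
    mtime = newest_mtime(BACKUP_DIR)
    if mtime is None:
        print(f"WARNING: no backups found in {BACKUP_DIR}")
        return 1
    age_days = (time.time() - mtime) / 86400
    if age_days > MAX_AGE_DAYS:
        print(f"WARNING: newest backup is {age_days:.0f} days old")
        return 1
    print(f"OK: newest backup is {age_days:.0f} days old")
    return 0

if __name__ == "__main__":
    sys.exit(main())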
k1dc0
The disk copy could not be done in strict-check mode due to bad sectors.
So the backup was taken by copying the bad sectors on the original disk as-is onto the new disk.
For this HDD, we will run 'fsck' after taking one more backup.
We also plan to take a cleaned backup (with the bad sectors removed) on a new HDD.
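Before running fsck, it may also help to have a rough map of the unreadable regions. The sketch below only illustrates the idea with a plain sequential read; in practice a dedicated tool such as ddrescue or badblocks would normally be used, and the device path shown is a placeholder.

#!/usr/bin/env python3
"""Sketch: read a disk (or disk image) sequentially and log the offsets
where reads fail, to get a rough map of bad regions before fsck.
Read-only; the device path is a placeholder and root access is needed."""
import os
import sys

CHUNK = 1 << 20  # step through the disk in 1 MiB reads

def scan(device):
    fd = os.open(device, os.O_RDONLY)
    size = os.lseek(fd, 0, os.SEEK_END)
    bad = 0
    offset = 0
    while offset < size:
        try:
            os.pread(fd, min(CHUNK, size - offset), offset)
        except OSError:
            bad += 1
            print(f"read error at offset {offset}")
        offset += CHUNK
    os.close(fd)
    print(f"{bad} unreadable chunk(s) out of roughly {size // CHUNK + 1}")

if __name__ == "__main__":
    # e.g. sudo ./scan_badblocks.py /dev/sdX   (placeholder device name)
    scan(sys.argv[1])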
k1nfs0
When the server was booted up after the backup, it did not connect to the network.
This was solved by swapping the two LAN cables (for the DGS and PICO networks) with each other.
Since we did not touch any hardware such as the network switches during this maintenance,
this suggests that the NIC settings on the OS were swapped between before and after the reboot.
After the reboot, there is no conflict between the NetworkManager settings and the actual hardware connections.
Had someone changed the NetworkManager settings and left them in place without rebooting?
Are multiple network management tools running on k1nfs0?
If it is the former, this was just a human mistake and we can avoid it.
If it is the latter, it may occur again at the next reboot.
So we should find the cause of this problem.
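One simple check after the next reboot would be to compare the interface-to-MAC mapping against the expected one, to see directly whether the NIC assignment (DGS vs. PICO) swapped again. A rough Python sketch, with placeholder interface names and MAC addresses, might look like this.

#!/usr/bin/env python3
"""Sketch: compare the current interface -> MAC mapping on a Linux host
against an expected mapping, to detect whether the NIC assignment
(e.g. DGS vs. PICO) changed across a reboot. Values are placeholders."""
import sys

# Hypothetical expected mapping; fill in the real values for k1nfs0.
EXPECTED = {
    "eth0": "aa:bb:cc:dd:ee:00",   # DGS network (placeholder)
    "eth1": "aa:bb:cc:dd:ee:01",   # PICO network (placeholder)
}

def current_macs():
    macs = {}
    for name in EXPECTED:
        try:
            with open(f"/sys/class/net/{name}/address") as f:
                macs[name] = f.read().strip()
        except FileNotFoundError:
            macs[name] = None
    return macs

def main():
    actual = current_macs()
    ok = True
    for name, expected in EXPECTED.items():
        got = actual.get(name)
        if got != expected:
            ok = False
            print(f"MISMATCH {name}: expected {expected}, got {got}")
        else:
            print(f"OK       {name}: {got}")
    return 0 if ok else 1

if __name__ == "__main__":
    sys.exit(main())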