Planned service maintenance of the zigur and zapat clusters and the disk array /storage/jihlava1-cerit/
Due to hardware problems (currently being resolved with the original supplier), the zigur and zapat clusters will become available one month later than planned, i.e. in the second half of October.
Thank you for your understanding.
--
Dear users,
Starting August 18, the zigur and zapat clusters and the disk array /storage/jihlava1-cerit/ will be temporarily unavailable because of their relocation to Brno.
The clusters are covered by a maintenance contract, so the move will be carried out by the original supplier; the relocation is expected to take approximately one month (144 cluster nodes plus the disk array).
- The clusters will be unavailable for the whole period (approx. 1 month). The walltime limits in the queues will be lowered gradually so that no job is left running during the outage; any jobs still running when the machines are switched off will be killed.
- From August 14, 11 PM, the current data in /storage/jihlava1-cerit/ will temporarily be available read-only.
- The data will be moved to storage-brno4-cerit-hsm.metacentrum.cz (CERIT-SC's HSM); from August 18 they will be available for both reading and writing via the symlink /storage/jihlava1-cerit/home/$LOGIN.
- Once the data transfer is finished, the link /storage/jihlava1-cerit/home/$LOGIN will point to /auto/brno4-cerit-hsm/fineus/home/$LOGIN.
- Afterwards, the disk array will be available in Brno as /storage/brno7-cerit/ (fineus-home.cerit-sc.cz). PLEASE NOTE: the original data will not be copied back; they will remain accessible in CERIT-SC's HSM. Users are advised to move their data elsewhere (a sketch of such a transfer is shown right after this list). If you have a very large amount of data, please contact us at support@cerit-sc.cz to schedule an optimal transfer.
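For smaller amounts of data, the move can be scripted by the user. The following Python sketch is only an illustration: the destination path and directory names are assumptions, and tools such as rsync can of course be used instead.

  #!/usr/bin/env python3
  # Illustrative sketch: copy data out of the CERIT-SC HSM home directory
  # to another storage array. The destination path below is a hypothetical
  # example; replace it with the array you actually want to use.
  import os
  import shutil

  login = os.environ.get("LOGNAME", "your_login")

  # Where the former /storage/jihlava1-cerit/ data will live after the move.
  src = "/auto/brno4-cerit-hsm/fineus/home/" + login

  # Hypothetical destination on another storage array.
  dst = "/storage/brno2/home/" + login + "/migrated-from-jihlava1"

  # Copy the whole directory tree, preserving structure and timestamps.
  shutil.copytree(src, dst)
  print("Copied " + src + " -> " + dst)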
Impact on running jobs:
- jobs that work with data stored on (or that will save data to) a different disk array will not be affected
- jobs that perform their computations within the scratch space, check whether the copy-out of the resulting data succeeded (e.g., using the script skeleton available at https://wiki.metacentrum.cz/wiki/Running_jobs_in_scheduler#Recommended_procedures ), and try to save the resulting data to /storage/brno1 (/storage/home) during the outage will not be affected either; you will find the resulting data in the scratch space of the relevant nodes (an illustrative sketch of such a check is shown after this list)
- jobs that work directly with data stored in /storage/brno1 (/storage/home), or jobs that do not check whether the copy-out of their data to this array succeeded, will most probably crash. If you have critical or long-running computations that may be affected by the outage, let us know and we will try to suspend them for the duration of the outage (however, success of the suspension cannot be guaranteed)
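For illustration only, a copy-out check of the kind mentioned above could look roughly like the Python sketch below (the skeleton recommended on the wiki is a shell script; the result file name, target path, and clean-up step here are assumptions):

  #!/usr/bin/env python3
  # Illustrative sketch of a copy-out check: move results out of the scratch
  # directory and, if the copy fails (e.g., because the target array is
  # offline), keep the data in scratch so it can be retrieved from the node
  # later. File names and the target path are hypothetical examples.
  import os
  import shutil
  import sys

  scratch = os.environ.get("SCRATCHDIR", "/scratch/example")  # provided by the batch system
  result = os.path.join(scratch, "results.tar.gz")            # hypothetical result file
  target_dir = "/storage/brno1/home/your_login/job-results"   # hypothetical target path

  try:
      os.makedirs(target_dir)
  except OSError:
      pass  # the directory may already exist

  try:
      shutil.copy2(result, target_dir)
  except OSError as err:
      # Copy-out failed: do not clean scratch, report where the data stayed.
      sys.stderr.write("Copy-out to %s failed (%s); results remain in %s\n"
                       % (target_dir, err, scratch))
      sys.exit(1)

  # Copy-out succeeded, scratch can be cleaned safely.
  shutil.rmtree(scratch)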
Thank you for your understanding,
Ivana Krenkova
MetaCentrum & CERIT-SC
Ivana Křenková, Thu Oct 01 15:50:00 CEST 2015