/storage/brno12-cerit/ and frontend zuphux outage

18-19.8.2024 - /storage/brno12-cerit/ and frontend zuphux outage

update 26.8., 3 PM: the disk array is back in operation and the data should be readable. Please report any problems. Thank you for your understanding.

update 26.8., 10:30 AM: the disk array will be briefly unavailable this morning while we try to regain access to the unreadable data. We apologize for the inconvenience.

update 20.8.:

We regret to inform you that we have been experiencing significant hardware issues with the /storage/brno12-cerit/ storage since Sunday (18.8.).

A small part of the data in /storage/brno12-cerit/ is currently inaccessible due to a failure on one of the disk arrays; attempts to read it end with an Input/Output error. In terms of data blocks this is about 1.1%, but because large files (over 4 MB) are spread across multiple devices, such files are more likely to be at least partly affected. The manufacturer's support is addressing the fault. So far the data is not definitively lost, but we do not currently know when it will become available again, or whether all of it will be intact in the end. If you need some of the data quickly, it may be more efficient to reload it (if it was primary input) or to recompute what you need.
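To find out which of your files are affected, you can try to read them and watch for the Input/Output error. The following Python script is only a minimal sketch, not an official MetaCentrum tool: the function names `is_readable` and `find_unreadable` and the 4 MB read chunk are our own choices for illustration. Note that it reads every file in full, so please run it only on the directories you really need, to avoid unnecessary load on the array.

```python
import errno
import os
import sys

CHUNK = 4 * 1024 * 1024  # read in 4 MB pieces (illustrative choice)


def is_readable(path):
    """Return True if the whole file can be read, False on an Input/Output error."""
    try:
        with open(path, "rb") as f:
            while f.read(CHUNK):
                pass
    except OSError as e:
        if e.errno == errno.EIO:
            return False
        raise  # other problems (e.g. permissions) are not what we look for
    return True


def find_unreadable(root):
    """Walk the tree under root and return paths that fail with an I/O error."""
    bad = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                if not is_readable(path):
                    bad.append(path)
            except OSError as e:
                print(f"skipped {path}: {e}", file=sys.stderr)
    return bad


if __name__ == "__main__" and len(sys.argv) == 2:
    for p in find_unreadable(sys.argv[1]):
        print(p)
```

Save the script under any name and pass it the path to your own directory on /storage/brno12-cerit/ as the only argument; it prints one affected file per line.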

Otherwise, /storage/brno12-cerit/ is running normally right now, and there is no particular reason to assume that other data is more at risk than usual. However, given its size, this storage is not independently backed up, and it is certainly not intended for archival or otherwise irreplaceable data. There may still be some limitations on operation while the broken piece of hardware is being repaired.

Please note that because we prioritize increasing the maximum capacity offered, it is not possible to perform a full backup of all data on storage of this size.
To ensure full backups we would need to at least double the funding to purchase suitable HW. Since archival needs are covered by the disk arrays of the CESNET Data Care department, and branch repositories are also being prepared within the EOSC project, we back up our disk arrays only in the form of snapshots. These offer some protection in case a user inadvertently deletes some of their files. In general, data that existed some days before the incident can be restored. However, snapshots are stored on the same disk arrays as the data itself, so in the event of a hardware failure these backups may be lost :-(
https://docs.metacentrum.cz/data/metacentrum-backup/

We are very sorry. Together with the HW vendor, we are doing our best to recover the affected data.

If you need the results very urgently, please submit the jobs to the system once again. We can raise your priority (so that the jobs start as soon as possible) if needed.

Thank you for your understanding.

update 19.8.: the disk array is working only in limited mode, with short outages. If possible, limit your work on this array. We are trying to stabilize the situation.

update 18.8., 8 PM: the storage is back in operation

Dear users,

The disk array /storage/brno12-cerit/ is currently unavailable; we are working on fixing the problem. The zuphux frontend is also unavailable.

If possible, use other storage and frontends for now.

Thank you for your understanding,

your MetaCentrum Team

Ivana Křenková, Sun Aug 18 15:00:00 CEST 2024