Hippo (HPC Type 3): Operational

Past Events

Mar 22, 2023 06:54: The issue with jobs being slow to start appears to have improved. We will continue to monitor the situation.
Mar 21, 2023 21:31: Maintenance is complete. We will continue to monitor the system. We have noticed that some jobs are slower than usual to start; we will attempt to determine the root cause in the morning.
18:16: Maintenance has started on UCloud and will be ongoing for the next few hours. We will update when the maintenance is complete. We do not expect any disruption to user jobs.
Mar 15, 2023 14:55: The u1-standard machine is currently running at 100% utilization. As a result, you may have to wait before your job starts. Update: UCloud had resources available again around 18:00 Wednesday.
Mar 10, 2023 12:53: Nearly 100% of all UCloud resources are currently in use. This is causing some jobs to immediately transition to a failed state due to our scheduler refusing to accept any new jobs. Update: UCloud had resources available again at around 21:00 Friday.
Mar 3, 2023 10:02: UCloud is currently experiencing higher load than usual. This means that you may experience longer wait times than usual. In particular, it can be hard to start jobs which require large machine types. If you are having trouble starting a job, then try selecting a smaller machine type. The high load is also causing the output of applications to not always appear. We are working on a fix for this issue. If you are running one of the applications which depend on this output and it doesn’t appear, then please contact support. The support team should be able to retrieve any information you may need from the output. This particularly affects applications such as MinIO and Rsync Server, which depend on this output.
Feb 14, 2023 10:50: We have applied several patches since yesterday. Preliminary results suggest that the service is more stable now, but we will continue to monitor the situation.
Feb 13, 2023 15:01: We are still trying to solve stability issues with UCloud caused by the maintenance operation last week. Some apps are more affected by this issue, such as RStudio and JupyterLab. If you experience a disconnect (or 403 errors), please close the app and press the "Open interface" button again.
Feb 9, 2023 15:14: We are still experiencing issues with UCloud related to the maintenance earlier this week. We are aware of the problem and actively trying to find a solution.
Feb 8, 2023 14:21: We have deployed some mitigations against the observed errors. We will continue to observe the system for errors.
12:46: We are observing some errors related to the maintenance yesterday. We are aware of these problems and are monitoring the situation closely.
Feb 7, 2023 12:56: Access to UCloud has been restored. The cluster currently has fewer nodes than normal, as we are still in the process of moving some nodes to a new system. We expect this capacity to return to normal levels by the end of tomorrow. GPU machines in the Type 1 (SDU) system will remain unavailable until they have been fully migrated (expected by end of tomorrow).
12:04: The migration has been completed. We are running some additional tests to check the system.
07:58: We have started scheduled maintenance to migrate the DeiC Interactive HPC provider to a new and improved system.
Feb 1, 2023 11:07: Performance has stabilised but we are still seeing slower than usual queries. We are working on improving performance.
09:54: We are experiencing issues related to the compute sub-system of UCloud.
Jan 10, 2023 15:46: We have observed some sporadic file-system instability on UCloud. We are monitoring the situation, but we are still considering the system operational.
Dec 22, 2022 12:04: Around 11:31 we experienced around 10 minutes of downtime due to a network reconfiguration.
Dec 16, 2022 12:44: The Virtual Machines hosted at AAU may be slow and less responsive between Dec 14 and Dec 31 as AAU ITS is performing scheduled maintenance.
Dec 6, 2022 10:16: An issue with application output has been resolved.
09:13: The last application section has returned correctly to UCloud.
08:31: UCloud is now operational. We are investigating an issue with some applications not showing up correctly in the "apps" interface.
07:51: UCloud is currently experiencing some issues due to unforeseen problems during maintenance.
Dec 4, 2022 12:38: We experienced system-wide DNS issues from around 03:27 this morning until 12:11, when the problem was identified and solved.
Nov 18, 2022 15:22: The problem with the storage system has been solved.
14:30: Our Ceph storage system is experiencing problems, which is affecting the UCloud platform.
Nov 17, 2022 15:30: The problem with the storage system has been solved.
15:20: Our Ceph storage system is experiencing problems, which is affecting the UCloud platform.
Nov 7, 2022 14:00: The issue with the storage system caused a compute node to crash and consequently the jobs running on the node were cancelled. UCloud should otherwise be working again.
13:52: The problem with the storage system has been solved.
13:34: Our Ceph storage system is experiencing problems, which is affecting the UCloud platform.
Oct 24, 2022 14:10: The problem with the storage system has been solved.
13:42: Our Ceph storage system is experiencing problems, which is affecting the UCloud platform.
Oct 7, 2022 11:11: UCloud is operational again.
10:22: UCloud is experiencing some problems with applications.
Oct 3, 2022 12:12: All Hippo nodes are accepting jobs again.
08:58: The Hippo compute nodes are down for maintenance, but the frontend is available. System should be fully operational again around noon.
Sep 30, 2022 12:09: UCloud is operational again.
11:06: UCloud is experiencing issues across all services. We are working on resolving this issue.
10:48: UCloud files are accessible again.
10:37: UCloud is currently experiencing issues with file-related operations. We are working on resolving this issue.
Sep 23, 2022 12:22: The Hippo HPC system is back online. Please check the MOTD for relevant information about system changes.
Sep 21, 2022 16:06: All folders on UCloud should be accessible again.
15:30: We are aware that some users are currently experiencing issues accessing certain folders and files on UCloud. We are looking into this and expect it to be resolved before 6 PM.
08:34: The Hippo HPC system is down for maintenance. The system will be expanded and reinstalled during the next few days. The system will resume normal operation on Monday, September 26th.
Sep 8, 2022 08:40: Some of the UCloud compute nodes had to be rebooted, due to issues with the storage system. For this reason we had to terminate some of the currently running user jobs.
Jul 14, 2022 17:40: The UCloud service was unavailable between 16:30 and 17:40 due to a temporary problem with a metadata server in our infrastructure. The service should be fully operational again.
Apr 1, 2022 15:42: The filesystem is finally back online.
15:18: We are still working on recovering the filesystem. We hope to be back within an hour.
11:03: Our CephFS filesystem is currently unavailable, affecting several platforms. We are working on getting it back online, but it might take a few hours.
Mar 28, 2022 08:30: After rebooting our core switches and some of the nodes, the network problem causing packet loss should now be solved.