HPC System Reliability Engineering
Broadwing stabilizes a top-50, 10k+ node hybrid-cloud supercomputer
Situation
Seeking to deliver its most complex HPC environment ever, a globally recognized HPC manufacturer reached out to Broadwing for help.
Action
Broadwing engineers integrated seamlessly into the larger team, co-developing automated health checks, triaging the findings, resolving system configuration issues, and coordinating the replacement of hundreds of hardware components.
Result
Within six months of engaging Broadwing, the hardware manufacturer met all of its critical technical requirements for system integration, including benchmarks.
Action
Broadwing stepped in and overhauled the manufacturer's CI/CD pipelines and artifact warehousing, modernizing the system to handle hundreds of artifacts more efficiently. For the first time, the manufacturer could reliably recreate its product from its components. Broadwing also wrote hundreds of integration tests and set up cloud-based deployment targets, transforming the manufacturer's testing infrastructure.
Result
The impact was immediate and significant. Test iteration times dropped from hundreds of man-hours to fewer than ten, freeing the software engineering teams to focus on more important tasks. Each day, they consulted comprehensive CI/CD dashboards to quickly identify regressions and confirm integrated functionality, making their workflows smoother and more productive than before.
Related Solutions
- Four Signals Monitoring
- Single-Pane-of-Glass
- AIOps / Predictive Alerting
- Correlated Events
- Alert Thresholds
- Capacity Planning
- Performance Engineering
- Incident Command
- Rapid Triage
- Root Cause Analyses
- Continuous Improvement
- Automated Function Testing
- Automated Infra Testing
- Runbook Automation