We’ve got our heads in the cloud
In a world increasingly built on cloud computing and storage, combined with next-generation optical networks, successful delivery of services is measured in milliseconds, each just one one-thousandth of a second. And while that span of time is nearly impossible to contextualize, improving delivery of online services by 20 milliseconds is transformational at scale, and it is exactly what one of the largest and most innovative universities in the nation requires.
Arizona State University (ASU) is a leader in cloud acceleration. But how does a university that serves 173,000 students on campus and online, in addition to 20,000 faculty and staff, consistently improve performance for an ever-growing number of users across a variety of services?
One solution emerged through a new, ultra-high-bandwidth, low-latency approach to cloud storage. The strategic vision, execution model and implementation for this two-phase project were developed by ASU’s Enterprise Technology engineering group, led by executive directors Nate Wilken and Kate Giovacchini.
This cutting-edge approach aimed to optimize the university’s infrastructure for its growing user base, ensuring high-quality performance across services.
Part 1: A new Local Zone proves valuable
In the summer of 2022, a combination of latency, connectivity and bandwidth limitations led to an uptick in disruptions for desktop workflows. As a result, users experienced reduced performance speeds when trying to access cloud-based files and applications.
For example, one team within the university had grown to over 1,500 team members by mid-2022. Microsoft Word and Excel dominated their desktop usage. “We knew that the general public cloud solutions were unacceptable to our customers. The input/output was intensive enough to create noticeable lag and peak concurrent usage was affecting both performance and reliability,” said Wilken. “We knew it was time to reimagine a solution for cloud storage.”
One critical element had fallen into place the prior year, when Amazon Web Services (AWS) began spinning up a new and much-anticipated AWS “Local Zone” in Phoenix – in data-connectivity terms, right on the university’s doorstep. Years in the making, it had recently come online, one of more than 30 such facilities appearing around the world.
The concept of a Local Zone was exactly what the university needed: bring the cloud’s “edge” and enterprise compute closer to users, with the potential for single-digit-millisecond latencies compared with the roughly 40 milliseconds previously recorded.
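To make that gap concrete, here is a minimal sketch of how one might compare round-trip connection times to a nearby Local Zone endpoint versus a distant region. The hostnames are placeholders rather than ASU endpoints, and timing a TCP handshake is only a rough proxy for storage latency, not a formal benchmark.

    # Rough latency comparison: time a TCP handshake to a nearby vs. a distant endpoint.
    # Hostnames below are illustrative placeholders, not real ASU infrastructure.
    import socket
    import time

    def connect_latency_ms(host: str, port: int = 443, samples: int = 5) -> float:
        """Return the median TCP connect time to host:port, in milliseconds."""
        timings = []
        for _ in range(samples):
            start = time.perf_counter()
            with socket.create_connection((host, port), timeout=5):
                pass
            timings.append((time.perf_counter() - start) * 1000)
        timings.sort()
        return timings[len(timings) // 2]

    if __name__ == "__main__":
        for label, host in [("nearby Local Zone", "files-phx.example.edu"),
                            ("distant region", "files-pdx.example.edu")]:
            print(f"{label}: {connect_latency_ms(host):.1f} ms")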
The challenges of long-distance computing
Prior to 2021, ASU’s cloud share was hosted by AWS in Portland, Oregon. That roughly 1,000-mile connection imposed a hard floor on latency, in the 30-40 millisecond range. The long distance also exacerbated other challenges of maintaining the cloud cluster remotely.
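Physics accounts for much of that floor. As a back-of-the-envelope illustration (the figures are generic, not ASU measurements), light in optical fiber travels at roughly two-thirds of its speed in a vacuum, so a 1,000-mile path contributes about 16 milliseconds of round-trip propagation delay before any routing or processing is added:

    # Back-of-the-envelope propagation delay for a ~1,000-mile fiber path.
    # Illustrative only; real routes are longer and add switching and protocol overhead.
    MILES = 1000
    KM_PER_MILE = 1.609
    SPEED_OF_LIGHT_KM_S = 299_792      # speed of light in a vacuum
    FIBER_FACTOR = 0.67                # light in fiber travels ~2/3 as fast

    one_way_s = (MILES * KM_PER_MILE) / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    round_trip_ms = one_way_s * 2 * 1000
    print(f"Round-trip propagation alone: ~{round_trip_ms:.0f} ms")
    # ~16 ms; routing, queuing and protocol overhead push observed latency
    # into the 30-40 millisecond range the team was seeing.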
One particular integration challenge was that the sheer distance to the physical AWS cluster left connectivity vulnerable to any kinetic event in the real world that could slice through fiber or damage a junction box. Wholesale outages, unheard of in the pre-cloud days, were becoming more frequent.
Strategic planning pays off
Well before Phoenix Local Zone was ready, the technology team had begun procuring redundant private circuits between ASU and AWS. Good thing, too – lead times for provisioning some spans turned out to be upwards of six to eight months. Early procurement prevented that from affecting project timelines.
By design, ASU would now have redundant connections to the cloud and to internet service providers, with circuits allocated between Cox and Lumio.
Shaving 1,000 miles off the cable run with the new Phoenix Local Zone was the biggest win for lowering latency, but the last ten feet would also be critical. As the private circuits were coming online, the team was determined to identify reliable storage management that would scale and support automation.
Part 2: An advanced DevSecOps approach to culture, automation and platform design
The missing piece was a robust cloud storage manager to run on top of AWS metal.
The team identified Qumulo, which could run in the AWS environment and checked many boxes for automation support, scalability and performance – not to mention unique features like native MACsec encryption, which reduces the overhead of managing VPN connections.
“Because we’re a leading practitioner of DevSecOps, our team at Enterprise Technology was able to emphasize automation and orchestration at a level of sophistication that rivals the biggest tech and banking enterprises,” said Wilken.
By adding a dedicated orchestration engineer to support the Qumulo setup, the team built a push-test-deploy pipeline: Terraform for provisioning and lifecycle management, with Jenkins and GitHub for automation and source control – a typical DevOps stack for this kind of project.
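As a minimal sketch of what one stage of such a pipeline might look like, the snippet below assumes a CI job (Jenkins, for example) that shells out to the Terraform CLI; the working directory and workflow are hypothetical, since the actual ASU pipeline is not public.

    # Hypothetical plan/apply stage that a CI job such as Jenkins might run.
    # The directory name is made up for illustration.
    import subprocess
    import sys

    def run(cmd: list[str], cwd: str) -> None:
        """Echo a command, run it, and stop the pipeline on failure."""
        print("+", " ".join(cmd))
        subprocess.run(cmd, cwd=cwd, check=True)

    def plan_and_apply(workdir: str = "infra/storage-cluster") -> None:
        run(["terraform", "init", "-input=false"], workdir)
        # Save the plan so that exactly the reviewed changes are applied.
        run(["terraform", "plan", "-input=false", "-out=tfplan"], workdir)
        run(["terraform", "apply", "-input=false", "tfplan"], workdir)

    if __name__ == "__main__":
        try:
            plan_and_apply()
        except subprocess.CalledProcessError as err:
            sys.exit(err.returncode)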
The accelerated cloud storage initiative was, of course, all about the customer – which includes students, faculty, researchers and staff.
In test cases, the new stack ran with only 10 milliseconds of latency – some 30 milliseconds less than the previous share in Portland.
The first production environment was running by the second week of January 2023. And as more resources come online, the team expects to roll out the solution to more users in the coming weeks.
“We made decisions long ago and built and reinforced practices over time that allowed us to pivot quickly,” said Giovacchini. “By fostering a culture of digital innovation, we have promoted a willingness to take calculated risks and embrace new ideas to make these digital transformations possible.”