IT Capacity is Harder than it Seems


At first glance, the concept of IT capacity seems completely straightforward. One wants a certain amount of unused storage, processing power, or bandwidth in place, available if the load on the system grows in the near future. And it seems this should be an easy KPI to calculate, something like: unused horsepower ÷ total horsepower installed. Why is it, then, that so many organisations – even large companies with professional IT leaders and experts – fall short on capacity from time to time, even suffering downtime or poor customer experience?

Firstly, the concept is more complex than it appears, for several reasons:

  • Some dimensions of the “horsepower” required in modern systems are more flexible than others. There is a reasonable temptation to save money on those which can be “dialled up” quickly and easily on demand, rather than having unused capacity lying around. The cloud and serverless architectures have put storage and many kinds of processing power into this category. Product licences are generally in this category as well. “Flex” capacity is generally a blessing, but without protections and monitoring it can also become a very large single point of failure in your architecture.
  • There are limits to how large you can scale certain components. Ignoring this fact can place brick walls on your growth path – walls that are very far out in time, making them hard to see until it’s too late. As an example, if you target having 25% extra capacity above the trailing-twelve-month (TTM) peak in your transaction database, that can feel good all the way up to the point where your administrators inform you that the database can no longer be expanded. We recommend having two distinct metrics: “current” capacity, which is installed but as yet unused, and “architectural” capacity, which measures how much further your current platform or system can be expanded without requiring a complete forklift replacement. The time horizon for safety in the latter metric should be much longer – ideally multiple years, given the need for a capital event or major effort to overcome limits once they’re reached.
  • The cloud has taken a lot of focus off capacity in recent years, but it is not a silver bullet. We still see startups and growing organisations frequently hit a limit in one key resource: data and telecom bandwidth. Despite all of the advances in technology over the last few decades (Starlink, etc.), commercial data speeds, especially within office towers, still require a physical circuit to connect an organisation’s computers to the internet. If anything, the migration of data and applications to the cloud has put more load on that connection, because less traffic is now routed within the local network. Depending on the city and what is installed or nearby, the lead time for expanding data circuits can surpass 6 months. (Not the kind of thing you can “dial up” easily if everyone in your office suddenly suffers connection timeouts because of how many PCs are streaming the World Cup match.)
  • There are trade-offs involved in having equipment on-hand but not in use, including the downward slope of the cost/capability curve on most tech gear, and the start of the warranty “clock” before the unit is actually placed in production. Generally there is a balance here, again depending on which kinds of equipment can be safely considered “flex capacity”. The supply chain challenges and chip shortages suffered worldwide since Covid have made this calculus even more complex.
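
To make the difference between “current” and “architectural” capacity concrete, here is a minimal sketch in Python. It is an illustration only, not a planning tool: the Resource fields, the compound-growth assumption, and every figure in the example are hypothetical, and real planning would use your own measured peaks and vendor lead times.

```python
import math
from dataclasses import dataclass


@dataclass
class Resource:
    """One capacity dimension. All fields and figures are hypothetical."""
    name: str
    ttm_peak: float           # highest load in the trailing twelve months
    installed: float          # capacity in place today
    architectural_max: float  # ceiling before a forklift replacement is needed
    monthly_growth: float     # assumed compound monthly growth in peak load
    lead_time_months: float   # assumed time to add capacity or replace the platform


def unused_fraction(r: Resource) -> float:
    """The simple KPI: unused capacity / total installed, measured at peak."""
    return (r.installed - r.ttm_peak) / r.installed


def headroom_over_peak(r: Resource) -> float:
    """'Current' capacity: installed headroom as a fraction of the TTM peak."""
    return (r.installed - r.ttm_peak) / r.ttm_peak


def months_until(limit: float, r: Resource) -> float:
    """Months until peak load reaches `limit`, assuming compound growth."""
    if r.ttm_peak >= limit:
        return 0.0
    if r.monthly_growth <= 0:
        return math.inf
    return math.log(limit / r.ttm_peak) / math.log(1 + r.monthly_growth)


def architectural_runway_ok(r: Resource) -> bool:
    """True while the architectural ceiling is further away than the lead time."""
    return months_until(r.architectural_max, r) > r.lead_time_months


# Hypothetical transaction database, sized in TB.
db = Resource("transaction database (TB)", ttm_peak=8.0, installed=10.0,
              architectural_max=16.0, monthly_growth=0.03, lead_time_months=24)

print(f"{db.name}: headroom over peak = {headroom_over_peak(db):.0%}")
print(f"months to architectural limit = {months_until(db.architectural_max, db):.1f}")
print(f"runway longer than lead time? {architectural_runway_ok(db)}")
```

In this made-up case the database meets the 25% headroom target, yet at 3% monthly growth it reaches its architectural ceiling in a little under two years, which is shorter than the assumed 24-month replacement effort. The same check can be run against an office data circuit by setting lead_time_months to the provider’s quoted lead time.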

 

Secondly, and perhaps more importantly, it can be difficult for business executives to get behind investing in capacity rather than in new features and tangible system upgrades which will deliver benefits all day, every day. Making this case is often an education challenge for CIOs. We suggest using the analogy of an insurance policy. High-usage days are rare, yes, as are car accidents or medical emergencies; but having protections in place for when those exceptions DO occur is what lets everyone sleep at night. Murphy’s Law is a law for a reason.

Like so many other topics we write about, IT capacity can be managed with a proactive programme of metrics and forward planning. As an introduction, the concepts of Flex capacity, Current capacity, Architectural capacity, and Capacity as insurance can be useful aids to your planning. Of course, if outside expertise would help your organisation get a good handle on all the considerations surrounding IT capacity, we are here to help and would love to hear from you.