Are you constantly looking for cost-saving opportunities in your expense planning cycle? Have you looked in detail at the coverage levels of your vendor maintenance contracts and their associated costs? I’m sure some of you will say, yes, of course, it’s a big-ticket item, but we can’t cut those support levels from our vendors for fear of extended downtime to our systems.


In many cases that may be true: customer applications are hosted on data center infrastructure such as servers, storage, and switches, so you’ve got to ensure any failures are quickly resolved. I can see how you might quickly conclude that 24x7 vendor support goes without saying. As we know, this doesn’t come cheap, but is it a premium worth paying? To understand that, you need to take a deeper dive into your whole environment and review all the system assets that support your test/integration/staging/development environment (or whatever you call the environment housing the replica systems used to test releases and deployments before they go into production). Also, have you run the numbers and checked back over the past year or so to see how many times you actually had to invoke vendor support to return a mission-critical system to a normal state? I’d argue that IT shops could be paying across the board for premium vendor cover that isn’t actually required, particularly where redundancy is already built in. We assume it happens, but are you actually recording this data and reviewing it periodically?
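If that history lives in your ticketing system, a small script can answer the question. Below is a minimal sketch in Python, assuming a hypothetical CSV export of support tickets; the file name and column names (environment, contract level, whether the out-of-hours response was actually used) are illustrative, not from any particular tool.

```python
import csv
from collections import Counter

# Hypothetical export of vendor support tickets: one row per incident, with the
# environment it serves, the contract level, and whether the premium
# out-of-hours response was actually exercised.
with open("vendor_support_tickets.csv", newline="") as f:
    tickets = list(csv.DictReader(f))

by_env = Counter(t["environment"] for t in tickets)          # e.g. prod / test / dev
premium_used = [t for t in tickets
                if t["contract_level"] == "24x7" and t["invoked_out_of_hours"] == "yes"]

print("Support invocations per environment:", dict(by_env))
print(f"24x7 cover exercised out of hours: {len(premium_used)} of {len(tickets)} incidents")
```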


As you know, when you purchase an IT capital asset it generally comes with a multi-year vendor support contract built in, typically four years. When the asset enters year 5 and beyond, you obviously need to plan to raise your expense budget to cover the standard 9x5 NBD [next business day] support level, which comes at a cost; if you need to uplift that to 24x7 premium support, it’s an even larger cost to add to your expense plan. So you’ve got a couple of options…


1. Replace the asset after four years, something Intel would recommend from a server perspective for many reasons; the one relevant to this blog is that the server maintenance contract cost hits your expense budget and cannot be capitalized.
2. Continue to maintain the asset, but provision more expense budget to manage it going forward. Remember that as the asset ages, so does the likelihood of hardware faults and the need for vendor maintenance cover. A rough side-by-side of the two options is sketched below.
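Every figure in the sketch below is an illustrative assumption rather than a vendor or Intel price; substitute your own contract quotes and replacement costs.

```python
# Rough view of the two options above for a single server.
# All figures are illustrative assumptions, not vendor or Intel prices.
KEEP_NBD_PER_YEAR     = 3_000   # assumed 9x5 NBD contract cost per year, years 5+
KEEP_24X7_PER_YEAR    = 7_000   # assumed 24x7 premium contract cost per year, years 5+
NEW_SERVER_CAPEX      = 12_000  # assumed replacement cost (capitalized, new warranty included)
YEARS_BEYOND_WARRANTY = 3       # how long you would keep the aged asset running

keep_nbd  = KEEP_NBD_PER_YEAR * YEARS_BEYOND_WARRANTY
keep_24x7 = KEEP_24X7_PER_YEAR * YEARS_BEYOND_WARRANTY
print(f"Option 2, 9x5 NBD cover for {YEARS_BEYOND_WARRANTY} more years: ${keep_nbd:,} expense")
print(f"Option 2, 24x7 cover for {YEARS_BEYOND_WARRANTY} more years:    ${keep_24x7:,} expense")
print(f"Option 1, replace the asset now: ${NEW_SERVER_CAPEX:,} capital, minimal expense under warranty")
```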


From my experience, I’ve spotted several opportunities to reduce the cost burden of yearly maintenance contracts by weighing risk against reward. I’d like to share some examples of what to consider when you review your costs and weigh the risks against the savings. My theory is that many IT departments have been overpaying because of the fear of extended downtime.

 

Low risk item to consider:
1. Reduce all test/development/integration systems from 24x7 or 9x5 NBD [next business day] cover to ‘Time & Materials’ cover only. Generally you’ll find a substantial cost saving for little added risk, since these systems do not impact customer applications; they are only used for test/staging/development. However, you’ll have to estimate hardware failure rates, perhaps based on historical failure data, assuming you’re tracking it (a back-of-the-envelope check is sketched below).
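The fleet size, failure rate, and costs below are assumptions for illustration; plug in your own contract quotes and tracked failure history.

```python
# Back-of-the-envelope check: is Time & Materials cheaper than a contract for
# the test/dev fleet?  All figures are assumed, not real quotes.
SERVERS              = 40      # test/dev/integration fleet size
ANNUAL_FAILURE_RATE  = 0.05    # assumed chance of a hardware call-out per server per year
TM_COST_PER_INCIDENT = 1_500   # assumed average parts + labour per T&M call-out
CONTRACT_PER_SERVER  = 800     # assumed 9x5 NBD contract cost per server per year

expected_tm   = SERVERS * ANNUAL_FAILURE_RATE * TM_COST_PER_INCIDENT
contract_bill = SERVERS * CONTRACT_PER_SERVER
print(f"Expected T&M spend per year: ${expected_tm:,.0f}")
print(f"Contract spend per year:     ${contract_bill:,.0f}")
print(f"Estimated annual saving:     ${contract_bill - expected_tm:,.0f}")
```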


Medium/high risk options to consider:
1. Change maintenance cover for all clustered servers from 24x7 to 9x5 NBD. Being clustered means they already have redundancy built in, so if you lose one node you don’t impact the customer, and you have time to replace it within the next business day.
2. Change maintenance cover for some systems (e.g. non-mission-critical ones) from 24x7x4 to 9x5 NBD.

 

Higher risk option to consider:
1. Change maintenance cover for all servers [including standalone systems] from 24x7x4 to 9x5 NBD. This is higher risk than the clustered-only change, as it includes standalone systems that don’t have automatic failover redundancy. However, you can offset the risk by carrying some inventory of frequently swapped parts, like cache batteries, disk drives, memory modules, etc. (a simple sizing sketch follows below).
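The sketch below uses a simple Poisson model of failures during the vendor restock window; the fleet size, part failure rate, and lead time are assumptions to replace with your own part history.

```python
from math import exp, factorial

# Rough sizing of a shared spare-parts pool (e.g. disk drives) for standalone
# servers dropped to 9x5 NBD.  All inputs are assumed figures.
FLEET                    = 200    # servers covered by the shared spares pool
ANNUAL_FAILURES_PER_PART = 0.08   # assumed failure rate per server per year for this part
LEAD_TIME_DAYS           = 10     # assumed vendor restock lead time

# Expected failures while waiting for a restock.
demand = FLEET * ANNUAL_FAILURES_PER_PART * LEAD_TIME_DAYS / 365

def poisson_cdf(k: int, lam: float) -> float:
    return sum(lam**i * exp(-lam) / factorial(i) for i in range(k + 1))

spares = 0
while poisson_cdf(spares, demand) < 0.95:   # cover 95% of restock windows
    spares += 1
print(f"Expected failures during a restock window: {demand:.2f}; keep {spares} spare(s) on the shelf")
```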

 

I’m interested in hearing other opinions on how you’re improving value for money from vendor contracts. I’d also like to hear whether folks are already doing the above on a regular basis, or whether they’ve pushed vendor support to a minimum and noticed a degradation in service levels.

Does the risk of downtime to your production lines outweigh the rewards of server refresh?

As IT managers, one of our roles is to influence senior management to plan and approve capital funds for IT server upgrades where it makes sense. Needless to say, we always have to demonstrate, as best we can, the benefits of such a capital investment, ideally in Return-on-Investment (ROI) terms. Many of us use varying forms of cost-benefit analysis to come up with what we believe to be logical, data-based recommendations. Server technology, as with most other technologies, is advancing year on year, with improvements in many areas, most notably power efficiency, performance, and form factor.
My observation from conversations with peers in several manufacturing industries has been a reluctance to upgrade their environment’s server fleet at the same cadence as the broader enterprise fleet. I decided to write this blog to understand why. One argument may be that when the underlying manufacturing central systems and applications are working fine on the existing h/w, the attitude is ‘if it’s not broken, don’t fix it’. We don’t want IT in the spotlight for unscheduled downtime affecting company product commitments. Furthermore, there is the impact or loss of production when a refresh cycle occurs, i.e. in non-clustered environments where you have to bring your automation systems down in a scheduled manner to replace the servers. In a clustered environment, at least, you can stage the upgrades with less impact, as one set of nodes in the cluster manages the workload while the other half is being replaced, and vice versa. However, sometimes applications have not been developed for clustering/redundancy and you don’t have that option. Something for all of you to consider as you harden existing environments over time, when finance permits, is to influence your application development teams to ensure strong redundancy/fault tolerance is built in at the design phase.
Another regular problem exists when your production systems are not continually re-qualified on the latest server hardware at the same rate as the h/w is being offered by the OEMs. This places a larger burden on the software development teams and others to perform lengthy validations, which can take them away from their core job of working on the latest software enhancements for the production systems.
Server refresh also introduces unknown risks, as pre-production software testing and validation on the new platform doesn’t always find all the bugs. It goes without saying that the risk of introducing new issues into the production environment generally arises from any change made to any part of the production system, whether software, hardware, or human error during the upgrade process, all of which can cause unscheduled outages to factory production.
So, with all the potential risks and downsides, how do we weigh those against the longer-term benefits of an upgraded IT/automation server fleet? The answer isn’t always clear cut, in my experience, as you’re trying to compare factual data against many unknowns, e.g. how accurately can you put a cost on the risk of impact and the potential for production downtime? The data gathered here at Intel has consistently shown that our systems’ reliability, performance, and total cost of ownership [TCO] have improved with every refresh cycle. As a simple and very recent example (I could give more): for one of our production applications in a non-virtualized environment, my team recently replaced approximately 90 physical servers with 19 of the latest OEM models, based on IA x86 of course. The ROI payback was under two years, as we were able to reduce our TCO by removing the maintenance contract costs, reducing our power costs, and improving server reliability, which in turn reduced human intervention and freed our engineers to work on other value-add projects, and that’s the big intangible. An added bonus was that we freed up physical data center capacity, heading off the need for future DC expansions, which are a costly endeavor. We could have chosen to do nothing, as the customer wasn’t complaining of application performance degradation; however, as IT people we knew the benefit to our business in terms of cost savings in many different ways, some more easily quantifiable than others.
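For anyone who wants to sanity-check a similar proposal, a simplified payback model is sketched below in Python. The per-server maintenance, power, and purchase figures are illustrative assumptions, not Intel’s actual numbers; only the 90-to-19 consolidation ratio comes from the example above.

```python
# Simplified payback model for a 90-to-19 server consolidation.
# Per-server costs are illustrative assumptions, not Intel's actual figures.
OLD_SERVERS, NEW_SERVERS = 90, 19
MAINT_PER_OLD_SERVER     = 1_200   # assumed annual maintenance contract per aged server
POWER_PER_OLD_SERVER     = 900     # assumed annual power + cooling per aged server
POWER_PER_NEW_SERVER     = 600     # assumed annual power + cooling per new server
NEW_SERVER_CAPEX         = 10_000  # assumed purchase price per new server

# New servers are under warranty, so no maintenance contract in the payback window.
annual_saving = (OLD_SERVERS * (MAINT_PER_OLD_SERVER + POWER_PER_OLD_SERVER)
                 - NEW_SERVERS * POWER_PER_NEW_SERVER)
capex = NEW_SERVERS * NEW_SERVER_CAPEX
print(f"Annual operating saving: ${annual_saving:,}")
print(f"Capital outlay:          ${capex:,}")
print(f"Simple payback:          {capex / annual_saving:.1f} years")
```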
So my recommendation, from many years managing IT in an automated manufacturing / industrial environment, is that it does indeed make a lot of sense to upgrade the server fleet. However, my experience tells me the key drivers continue to be application performance improvements and End-of-Life / End-of-Support hardware triggers. You should be influencing your peer organizations to ensure solid re-qualification and test processes exist to manage the change, as this is an ongoing part of our jobs. After that, you can’t deny that you’re offering the business the very latest IT infrastructure, which should always offer more than its technology predecessors at a fraction of the ongoing maintenance costs. No pain, no gain!
I’d be interested to hear from those who are keeping their IT manufacturing systems up to the latest spec of h/w on an ongoing basis, and whether you’ve anything to add in terms of the benefits and/or pitfalls encountered.

By Joe Sartini

As both an automation engineer and an IT automation manager for many years, I’ve both contributed to and observed how many IT standard operating procedures [or SOPs] can introduce errors into a system. The challenge for many IT operations teams is how to eliminate human-induced errors and provide closed-loop feedback to process developers on how to create and maintain more robust project insertions. Every IT engineer I’ve ever known has great intentions of making SOP changes flawlessly; however, we tend to find that a fair proportion of our operational incidents are the result of human error during the change process. Intel Factory Automation has strict change control procedures to help engineers through the change process and protect them from human error. Aside from all the change control processes that exist in many organizations, I believe a key to success in this area is to automate as much as possible and, where that’s not feasible, to utilize an automated checklist.

Let me give you an example to illustrate the issues that many IT organizations can experience, and a way to avoid or mitigate them by putting more IT solutions into the manual processes that will always exist.

The Problem

Suppose you have an engineer performing a standard server build or decommission. In each case your engineer would deem this a fairly straightforward task, and you as an IT organization surely have a documented standard operating procedure per hardware model and O/S revision, right? The problem can arise when our engineers are multitasking on many projects at once, under time constraints. In their mind, the trivial server build/decom SOP needs to be completed before they rush to their next important meeting. So they’re in the data center [DC] with no access to the SOP instructions unless they print them out or log in to a PC in the DC to view them; needless to say, an engineer who is in a hurry and has performed this task many times in the past will proceed from memory. However, suppose something has changed in the process since they last performed the build/decom, or nothing has changed but they simply forget to perform a task, like disabling a SAN switch port for the decommissioned server. Down the road we run into SAN switch port capacity problems that shouldn’t exist. It’s possible that the IT organization needlessly purchases more switches to handle the perceived capacity problem, or another engineer performs a capacity analysis comparing server assets against active port usage and finds that something doesn’t add up. More time gets needlessly spent finding the unused ports and disabling them, because engineers in the past forgot to disable them during the server decommission SOP.
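That capacity reconciliation is itself easy to script. Below is a minimal Python sketch, assuming hypothetical CSV exports from your CMDB and your switch management tooling; the file names and column names are illustrative.

```python
import csv

# Compare the asset inventory against active SAN switch ports to spot ports
# left enabled by an incomplete decommission.  Both files are hypothetical
# exports: one from the CMDB, one from the switch management tooling.
with open("cmdb_active_servers.csv", newline="") as f:
    active_servers = {row["hostname"] for row in csv.DictReader(f)}

with open("san_switch_ports.csv", newline="") as f:
    ports = list(csv.DictReader(f))   # columns: switch, port, attached_host, state

orphans = [p for p in ports
           if p["state"] == "enabled" and p["attached_host"] not in active_servers]

for p in orphans:
    print(f"{p['switch']} port {p['port']}: enabled but '{p['attached_host']}' is decommissioned")
print(f"{len(orphans)} port(s) can be reclaimed before buying more switches")
```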

One Solution

From my experience as an IT engineer and manager, I focus on IT-automated checklists for SOPs. Using simple, easily configurable web-based tooling, the IT manager or engineer can develop checklists for all your SOPs, requiring engineers to check each box on an online form that is centrally tracked via standard, simple IT reports. The IT manager can then monitor the completion and success of SOPs via %PAS reports. Furthermore, the engineer knows that their name is tracked against each task with timestamps, so they are more inclined to follow the checklist and complete all tasks. The beauty of an online checklist is that the engineer can access it wherever they have a network connection, e.g. LAN, Wi-Fi, etc., and can use any form-factor device, e.g. PC, laptop, MID, iPhone, etc. The IT manager can also easily run reports on the average time it takes to perform each SOP, to help with resource allocation per task and to feed back to development teams on TTM for new project insertions. In the example above, the engineer who was in a rush to a meeting would have accessed the checklist in the data center via laptop or phone and clicked each box as they completed the task. Say, for example, they still forgot to de-assign the switch port, or, more typically, didn’t have time to complete all the tasks in one visit to the DC. In that case the checklist would not be 100% complete, the daily/weekly operational review would show this SOP still in flight, and someone would follow up with the engineer to complete the checklist, since it’s all centrally tracked and closed-loop until every task is actually complete.
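To make the idea concrete, here is a minimal sketch in Python of the underlying data model and the completion report. In practice this would sit behind a simple web form; the class names, task list, and fields are assumptions for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional

@dataclass
class ChecklistItem:
    task: str
    done_by: Optional[str] = None
    done_at: Optional[datetime] = None

@dataclass
class SopChecklist:
    sop_name: str
    engineer: str
    items: List[ChecklistItem] = field(default_factory=list)

    def complete(self, task: str, engineer: str) -> None:
        # Record who ticked the box and when, so the report is auditable.
        for item in self.items:
            if item.task == task:
                item.done_by, item.done_at = engineer, datetime.now()

    def percent_complete(self) -> float:
        done = sum(1 for item in self.items if item.done_at is not None)
        return 100.0 * done / len(self.items)

# Hypothetical decommission checklist: two of three tasks get ticked,
# so the weekly report flags the SOP as still in flight.
decom = SopChecklist("Server decommission", "engineer_a", [
    ChecklistItem("Power down and remove server"),
    ChecklistItem("Update asset inventory"),
    ChecklistItem("Disable SAN switch port"),
])
decom.complete("Power down and remove server", "engineer_a")
decom.complete("Update asset inventory", "engineer_a")
print(f"{decom.sop_name}: {decom.percent_complete():.0f}% complete")
```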

Let’s take the pen and paper, and the human guesswork, out of our IT operations and use our IT skills to develop foolproof solutions to our daily routines. In this way, we’ll have a better chance of removing human error. I’m sure you’d agree we need as much time as possible to handle the h/w and s/w errors that affect our operations’ availability and reliability.
