IT Peer Network

7 Posts authored by: dkamhout

Hi All,

 

Late last year, we made a decision in Intel IT to start up a greenfield cloud environment that was a closer match to our grid computing space, meaning it runs predominantly on open source.  At the time, the open source communities were really starting to move at a strong pace, beginning the same innovation curve we saw with Linux in its first few years of existence.  As with Linux, we decided the optimum approach was to help the community grow and innovate, which lets us keep working on the areas that give us the most value internally as enhancements rather than constantly working on the core improvements.

 

Over the last 6 months we have been busy integrating, refining, and finally implementing a solution robust enough for cloud-aware applications.  I have shared the specific open source solutions we chose in a few public forums; in this blog, however, I want to focus on the core of the environment: OpenStack.

 

Today my goal is to say that we have work to do to make this enterprise ready, and that the community can and should band together to make it so.  One of my goals is to be as open as possible about what we are doing with cloud, so I wanted to share the attached list with the entire interweb.  Don't get me wrong: I do believe the open source community is building solutions that work for numerous use cases.  However, not every app is built for the cloud, and those apps should still benefit from most of the attributes cloud can provide (self-service, measured services, and resource pooling).  I have no magic to make a vertically built app scale elastically, but I still want that app to run and stay up in my cloud infrastructure, and there are a few areas we need to solve to make that happen.

 

Let me know what you think, and whether you are also working on addressing some of these gaps.  We are deciding which ones we will code ourselves, but we would love to see the community of IT shops working on open source solutions chip in together to make this happen faster.  How quickly can we make these open source solutions ready for the enterprise?  6 months?  9 months?

 

Until next time....

-Das

Hello Again,

 

Today I am going to share my top 10 areas of focus for 2012.  Some of these are stretch goals, but I would like to share the ideas and see what others are doing in these areas.  Again, we are primarily focused on an internal enterprise private cloud for TCO, security, and performance reasons.  We do intend to use external capacity for specific use cases, but we can discuss that later…

 

First of all, as discussed in my last blog, we are going after three big business goals:

Business Goals 

1.) Achieve 80% Effective Utilization (CapEx Reduction)

2.) Velocity Increases at a Cadence for Service Provisioning and Maintenance activities (Agility and lower OpEx through more automation)

3.) Zero Business Impact (Resiliency)

 

Top 10 for IaaS and PaaS - we can cover SaaS in a future post...

1.)  Cloud bursting automatically, first from one Intel data center to a second Intel data center, then to the public cloud, all governed by controllable policy.

2.)  Automated sourcing at provision and runtime - as a consumer enters our portal or calls our APIs, the automation and business logic decide the type of service and the location (public, private, or hybrid cloud) based on the business requirements entered, security classification, capacity available, and workload characteristics.  Workloads are dynamically migrated to higher-performance infrastructure (and back) as demands change through the app's life cycle.  The end result is a dynamic infrastructure based on a hybrid cloud that adapts to consumer needs automatically (a toy sketch of this kind of placement logic follows after the list).

3.)  Automated end-to-end service monitoring - as the automated sourcing occurs, all components are dynamically added, as fast as provisioning happens, to an end-to-end service model representing the health, utilization, and usage of the deployed service.  Dynamic changes to the environment are handled through automation (add/remove nodes, etc.).  Key service level objectives (QoS) are exposed to the consumer (e.g. availability, performance, configuration compliance, associated service requests), giving the consumer a view of how their precise instance is performing against its SLA.

4.)  Automated component-based recovery - as specific components in the end-to-end service fail, automated remediation is completed, with 95% of situations rectified through destroy/create concepts and/or other immediate remediation solutions - the net effect is zero business impact.

5.)  Automated deployment of resilient services - nodes and components are deployed and managed through automation in a way that allows for 100% uptime (zero business impact), using methods such as affinity rules, striping across multiple points of failure, and active/active deployment across multiple data centers and disaster zones.  All of this is driven by the resiliency requirements the consumer chooses in the portal for the application, with minimal complexity.  Applications built through PaaS are always built with active/active resiliency and with Design for Failure elements enabled.

6.)  All aspects of the solution are available through open APIs and a rich but simple UI, or through an API layer that allows different service providers or platform solutions to be used, enabling write-once methods with backwards compatibility for the application layer.  Features are exposed via a control panel that lets the cloud consumer manipulate backup schedules, patch parameters, alerting thresholds, and other key elements of supporting a production service.  Integrated dashboard views are available for the different participants: operations, end users, and management.

7.)  Security - security assurance that provides trusted computing for the compute and data components of cloud hosting environments.  Levels of trust are available through programmatic queries and the UI, with configurable settings to establish the level of trust where security standards do not yet exist.  This configuration could include logical segmentation, physical segmentation, and authorized user roles, as well as elements such as encryption of data at rest, in motion, or in memory.

8.)  Exposure of scale-out data services (relational, structured, unstructured, file shares) through APIs, with replication between all necessary locations based on the placement of the nodes supporting the application.

9.)  A PaaS layer for both Java and .NET applications - with the associated IDE, manageability, data, and compute services exposed at the PaaS layer instead of IaaS.  The PaaS layer should automatically enable key design elements such as automated elasticity, automated deployment of resilient services, secure code on a trusted platform, and client awareness.

10.) Select-and-choose web services for consumption, with the appropriate interfaces exposed based on the portal choices about the business solution being developed - encouraging reuse of existing web service stores in both the public cloud and our private cloud.  This provides a community of mash-ups for specific business processes, with the underlying IaaS and PaaS technology exposed as needed for the use case described in the portal.  The net effect is the ability to take an innovative idea to a production service in under a day.
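
To make the automated sourcing idea (item 2) a little more concrete, here is a rough Python sketch of the kind of placement decision I mean.  The class names, rules, and thresholds are invented for illustration - this is not our production business logic, which comprehends far more inputs - but the shape of the decision is the point.

    # Illustrative sketch of policy-based placement at provision time.
    # Rules and thresholds are hypothetical examples, not Intel IT policy.
    from dataclasses import dataclass

    @dataclass
    class PlacementRequest:
        security_class: str        # e.g. "public", "internal", "restricted"
        needs_trusted_compute: bool
        bursty: bool               # highly variable demand?

    @dataclass
    class Zone:
        name: str
        kind: str                  # "private" or "public"
        free_capacity: float       # fraction of the pool still available
        trusted: bool

    def choose_zone(req: PlacementRequest, zones: list) -> Zone:
        """Pick a landing zone from business requirements plus current capacity."""
        candidates = zones
        # Restricted or trust-sensitive workloads never leave trusted private pools.
        if req.security_class == "restricted" or req.needs_trusted_compute:
            candidates = [z for z in candidates if z.kind == "private" and z.trusted]
        # Prefer a private data center that still has comfortable headroom.
        private = [z for z in candidates if z.kind == "private" and z.free_capacity > 0.2]
        if private:
            return max(private, key=lambda z: z.free_capacity)
        # Burst non-sensitive, bursty workloads to public capacity when private pools run hot.
        public = [z for z in candidates if z.kind == "public"]
        if req.security_class == "public" and req.bursty and public:
            return public[0]
        # Otherwise fall back to the least-loaded remaining candidate.
        return max(candidates, key=lambda z: z.free_capacity)

    zones = [Zone("DC1", "private", 0.15, True),
             Zone("DC2", "private", 0.40, True),
             Zone("ExtCloud", "public", 1.00, False)]
    print(choose_zone(PlacementRequest("internal", False, True), zones).name)  # -> DC2

The same kind of decision, re-run as capacity and demand change through the app's life cycle, is what makes the migration (and migration back) automatic.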

 

Are you doing similar things in your cloud environment, or are you doing these today already in your private cloud?  As usual, I would love to hear your thoughts on where you are going, as many of us are on this same journey to transform how IT is used to help things happen faster, better, and cheaper.

-Das
Intel IT Cloud Lead

Hello Again,

 

While my blogging has been infrequent, I do want to share a little bit now, and if I start doing small blogs, I am hoping I will do more…  So, a short blog tonight, again written on a plane - no clouds in sight though.

 

Ever since Intel IT started sharing what we pulled off with compute IaaS in our enterprise private cloud in late 2010, we have been called upon by many large enterprises to share how we did it.  The most fascinating thing I have found about cloud is that it is relevant in every single sector of business that has IT.  We are at an interesting moment of convergence in how people expect and want to work and live.  The move toward pervasive mobility and the desire to get and share information anywhere, everywhere, at any time is driving app developers to require scalable, accessible platforms to build on so they can focus on their end users.  IT in every single sector has a major role to play in making this successful; I have yet to find a single sector that doesn't want and need this to survive.  In the past, when I did grid computing, it was nearly impossible to find other IT shops we could discuss ideas with.  Now, with cloud, the concept of an accessible infrastructure and application platform is really sinking in, and everyone wants to go.

 

What we did at Intel IT isn't really that mind-blowing compared to some of the public cloud solutions out there, but it has set a trail for other large enterprises to embark on, and I personally really enjoy helping them go faster down the path to enable their employees with the best accessible solutions possible.

 

We are in an exciting time of transformation right now, where public and private clouds are really going to help us move faster in our technical hyper-evolution.  The power goes way beyond just growing businesses: the opportunities are endless, whether for anyone inside our companies with an innovative idea for a new productivity solution, or for a genius 12-year-old in Mozambique with an idea for making solar panels more efficient.

 

To close out the blog, I am going to share our 3 big business focus areas for cloud over the next year:

  1. Increase our capital utilization – through federation and larger pools, the same approach we took in driving our grid environment up to 80% utilization, all while maintaining strong quality of service.
  2. Increased velocity at a regular cadence – compute IaaS is just the start.  Next we need to ensure we have data services (structured, unstructured, file, object) exposed, and we need to tackle the time it takes to get new solutions out the door; we can bog ourselves down in our path to production, and my goal is to take an innovative idea to a production service in under a day.  A combination of PaaS and more IaaS will get us there, along with more automation to make scaling/functional testing and release management a non-laborious process.
  3. Zero business impact – no application or service downtime.  Embrace Design for Failure: this is how we manage our grid, and this is how the successful web software apps are run.  No matter how much money you spend on extra pipes, extra power, and extra servers, you will have a failure.  Assume it, build your software to deal with it correctly, go active/active across multiple data centers, and push your software vendors to think differently (a tiny sketch of what this looks like from the application side follows below).  At the same time, we know we have tons of legacy apps - some written 3 months ago - so we need to ensure we are resilient at the core without overspending.
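
As a deliberately tiny illustration of the Design for Failure point, this is the behavior we want application code to assume: every call can fail, and there is always another active site to try.  The endpoint names and retry counts below are made up for the example; they are not a prescription.

    # Sketch: a client that assumes any single site (or link, or host) can fail.
    import urllib.request
    import urllib.error

    ENDPOINTS = [
        "https://orders.dc1.example.com/api",   # active
        "https://orders.dc2.example.com/api",   # also active - not a cold standby
    ]

    def call_service(path: str, attempts_per_site: int = 2) -> bytes:
        """Try every active site before giving up; never depend on one data center."""
        last_error = None
        for endpoint in ENDPOINTS:
            for _ in range(attempts_per_site):
                try:
                    with urllib.request.urlopen(endpoint + path, timeout=5) as resp:
                        return resp.read()
                except (urllib.error.URLError, TimeoutError) as err:
                    last_error = err   # expected: failures are normal, plan for them
        raise RuntimeError(f"all sites failed: {last_error}")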

 

My next blog will be on our top 10+ goals for our cloud investments moving forward, and as usual I would love to hear what you are doing with cloud.

 

Cheers,

-Das

Intel IT Cloud Lead

http://www.intel.com/content/www/us/en/it-leadership/intel-it-it-leadership-cloud-computing-brief.html

Hello World,

 

I have been so busy with my normal work that I haven't had time to share what we are doing in a while.  So, on a recent flight home from Washington, D.C., where we just held our first IT@Intel Cloud Summit, I thought I could spend a little time sharing where we are and where we are going.

 

First of all, 2010 was a busy year for all of us working on introducing the cloud to the Office and Enterprise environment at Intel.  We took on some tough challenges and pulled most of them off.  Here is a recap…

 

1.)    Pervasive virtualization – our cloud foundation is moving forward fast.  We started with 18% of our environment virtual at the end of 2009, beat our goal of 37% by the end of 2010, and are now at around 45%.  We are starting to hit some of the tougher workloads, but we continue to move at a rapid pace here.

2.)    Elastic capacity and measured services – we made some great strides in ensuring all of our cloud components have instrumentation and in getting that data into our data layer so we can consume it.  Our Ops team is now starting to use the massive amount of data (from guests to hosts to storage) to look in aggregate at what is happening in our cloud, as well as to dig into the specifics where we are exceeding thresholds (a toy sketch of this kind of roll-up follows after the list).  We also run the massive database behind this - an ETL of around 40 million records a day - on a VM, just to make sure we walk the talk.

3.)    End-to-end service monitoring – we made a decision to tightly couple our cloud work with our move to a true ITIL service management environment.  This isn't a simple task, and we have lots more work to do here, but I think most of the peers I talk to in the industry agree that ITIL with cloud is a great way to combine the discipline of an enterprise IT shop with the dynamic nature of on-demand capacity.  We have completed end-to-end service monitoring for a few entire services and will make this the norm as we continue through 2011, eventually creating the service models automatically when self-service happens.

4.)    On-demand self-service – we took an extremely manual environment and made it automated, and we didn't do it in a pristine greenfield environment; we did it across our entire Office and Enterprise environment.  This means that across essentially all of our data centers and all of our virtual infrastructure, we can serve out infrastructure services on demand to entitled users.  We set a goal of under 3 hours, and we are doing a pretty good job of hitting it consistently.  This year we are going after the last piece of the environment, our DMZ and secure enclaves, and our teams are busy working through the business process automation as well as new connectors to automate some very laborious manual tasks.
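
On the measured services point above, the value is less in collecting the data than in rolling it up so the Ops team can look at the cloud in aggregate and then drill into threshold breaches.  Here is a toy Python sketch of that roll-up; the record layout and thresholds are invented, not our actual schema.

    # Sketch: roll up per-guest utilization samples and flag threshold breaches.
    from collections import defaultdict
    from statistics import mean

    samples = [
        # (host, guest, cpu_util, mem_util) - one row per collection interval
        ("host01", "vm-web-01", 0.42, 0.63),
        ("host01", "vm-web-02", 0.91, 0.70),
        ("host02", "vm-db-01",  0.55, 0.95),
    ]
    CPU_LIMIT, MEM_LIMIT = 0.85, 0.90

    by_host = defaultdict(list)
    for host, guest, cpu, mem in samples:
        by_host[host].append((cpu, mem))
        if cpu > CPU_LIMIT or mem > MEM_LIMIT:
            print(f"threshold breach: {guest} on {host} cpu={cpu:.0%} mem={mem:.0%}")

    for host, rows in by_host.items():
        print(host, "avg cpu", f"{mean(c for c, _ in rows):.0%}",
              "avg mem", f"{mean(m for _, m in rows):.0%}")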

 

Now nothing any of us do in IT is simple, and everything has challenges…  a few retrospective points I would like to share:

 

1.)    Know your workloads – with the data we are pulling from all of our OS instances, we can see what the workloads are doing to the most important components (CPU, memory, network, storage, and I/O).  In fact, we have so much data that sometimes it is tough to find the right data.  With this data, though, you can pick the top 2-3 counters per component and make sure you are optimizing the OS instance as it moves to the multi-tenant environment.  I like to think of what we are doing as moving families out of the suburbs and into high-rise, extremely efficient leased apartments.  Because we control the city, we can make these decisions, but as we do this we need to make sure each family has enough square footage to thrive; if we give them too little space, or we don't allow them to cool their apartment, we could end up with angry tenants.  Also, no one wants a rock band living next door, so we have to make sure those noisy neighbors keep the noise down, or give them a room away from the rest of the tenants.

2.)    Know your environment thresholds – most IT shops work in silos, and many of those silos make decisions about their specific component that may not comprehend the entire IT ecosystem; this can be as simple as how large a subnet range is, or how many spindles are provided to handle a handful of DB VMs.  In my Design background we would go in and break our infrastructure as a practice (not while we were using it, of course), then understand specifically how and why we were able to break it, and set a threshold.  That threshold also serves as a challenge: how do you take a 2x or even 10x goal to lift the threshold as you take on more business and as the business grows?  If you don't know how to break your environment, then when it does break you will struggle to figure out how to get it back to normal.

 

3.)    Don’t underestimate the cultural shift required to move from a manual environment to an automated environment – our factories and our design environment work extremely well due to our large investments we make on automation.  This isn’t the case for most traditional IT shops I talk too, and neither was it for ours.  We made huge strides of bringing in automation to this environment, but we have a long way to go still.  This isn’t just a technical challenge either, you need to help your organization and workers understand that just because we are automating their work, it doesn’t mean they are going away.  When I started at Intel one of the most valuable pieces of advice I got was to always seek to engineer myself out of a job.  This didn’t mean I was getting laid off, it meant that I could then apply my skills to a higher level task, we are constantly under headcount in IT, especially for those of us that are a cost center and not a profit center – however there is no shortage of valuable work we can do in IT to improve the business services and make evolutionary changes to help the bottom line and the top line.  Also, make automation a part of everyone’s job…  a script with good documentation in it, is always better than documentation with a pointer to a script.

4.)    Many years of manual environments mean that automation will hit walls – when someone takes a document and uses it to set something up in one data center, it is almost a given that someone in another data center is going to follow that doc slightly differently.  Configuration drift leads to some tough challenges, and automation will quickly find these problems and point them out to you - usually with a big red X (a toy version of that kind of check follows below).  Fortunately, we phased in the automation, so we were able to see a lot of the problems before we turned on self-service globally.  Now that we have self-service, we see configuration and performance issues almost immediately.
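
To make the drift point concrete, here is a toy Python version of the kind of check the automation effectively runs for us.  The settings, hosts, and "golden" values are invented for the example.

    # Sketch: compare each host's reported configuration against a golden standard.
    GOLDEN = {"ntp_server": "ntp.corp.example.com",
              "vlan_mtu": 9000,
              "syslog_target": "logs.corp.example.com"}

    hosts = {
        "dc1-esx-07": {"ntp_server": "ntp.corp.example.com", "vlan_mtu": 9000,
                       "syslog_target": "logs.corp.example.com"},
        "dc2-esx-11": {"ntp_server": "ntp.dc2.example.com", "vlan_mtu": 1500,
                       "syslog_target": "logs.corp.example.com"},
    }

    for host, config in hosts.items():
        drift = {key: (config.get(key), expected)
                 for key, expected in GOLDEN.items()
                 if config.get(key) != expected}
        if drift:
            # the "big red X" - fail loudly instead of papering over the difference
            print(f"{host}: drift detected -> {drift}")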

 

I am about to land home in Portland, and the captain just said it is cloudy with a chance of rain…  We still have a long path ahead of us as we continue to enable new businesses at Intel and the rapid growth of existing ones.  The last year of work took us a big leap forward, and I am excited about the coming year.

 

Where are you with your cloud efforts?  How have you handled the challenges and were yours similar or different?

 

Until next time,

-Das

Intel IT Cloud Engineering Lead

One of the core aspects of our Enterprise Private Cloud is on-demand self-service, and the only way to make this work is by instituting automation across the enterprise as the norm.  This means almost everything that happens today in the infrastructure environment needs a way to be automated, and as we have experienced over the last year... provisioning a virtual machine is easy, but getting all of the supporting aspects automated is a much greater challenge.  We also made a decision not to do our cloud as a "greenfield" install, meaning we are taking the approach of setting up our existing infrastructure to be part of our private cloud - which is a challenge, but necessary.

 

In our enterprise environment we have adopted ITIL like many other enterprise IT organizations; however, a lot of the process standardization is still very manual, so it is much easier to mask issues in the environment, or differences in configurations, by using humans (my peers and I) as glue to make it all work.  But once you send a computer system after another computer system with automation, you really start to see where things need to be improved.  Unlike humans, computers only think in binary unless you give them lots of complex logic to deal with ambiguity or problems they may encounter.  So when the automation hits a snag or something it wasn't programmed to deal with, it can't try to figure it out like you and I can (unless, of course, it is programmed to do some analysis)... it just fails.
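
Here is a trivial Python sketch of what I mean.  Unless you explicitly teach the automation which failures are safe to retry, the only honest behavior is to stop and report; the step and error names below are invented for illustration.

    # Sketch: an automation step either succeeds, retries on known-transient errors,
    # or fails loudly - it never quietly "figures it out" the way a person would.
    import time

    class TransientError(Exception):
        """An error we have explicitly decided is safe to retry."""

    def run_step(step, retries: int = 3, delay_s: float = 5.0):
        for attempt in range(1, retries + 1):
            try:
                return step()              # any unexpected exception propagates immediately
            except TransientError as err:
                print(f"attempt {attempt} hit a known-transient problem: {err}")
                time.sleep(delay_s)
        # Too many transient failures: stop the workflow loudly so a person
        # (or a smarter remediation flow) can root-cause it.
        raise RuntimeError(f"step {step.__name__} failed after {retries} attempts")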

 

The really great aspect of all of this is that we are finding where infrastructure components have been set up differently even though we have standards, and we can pinpoint problems globally with automation instead of waiting for someone to tell us something is broken.  As we move up the computing infrastructure stack in terms of what we offer through on-demand self-service, we will continue to find more issues with configurations or a lack of robustness in endpoints.  To me, though, this is the exciting part about automation: the real opportunity to attain quality and agility by implementing automation that keeps moving up the stack of IT workflow.  Where we previously masked the problems with people working around them, we now have to get in, really root-cause the issues, and improve the environment.

 

So yes, automating the enterprise infrastructure is a very challenging assignment, but it is very exciting to be part of making it happen.  As we focus our automation on something consumable like our Infrastructure as a Service, we are really starting to see the need to bring together the various teams that previously worked in silos and have them work together to establish robust, standardized infrastructure components that add up to a holistic and powerful service our customers can consume.

 

Are you doing this too?  If so, how have you handled this challenge?

 

-Das

In my introduction I talked about the main aspects of the Intel IT Enterprise Private Cloud, and I would like to take a deeper dive into one of these key aspects: how we have approached driving pervasive virtualization.  We chose the path of pervasive virtualization primarily because we have lots of existing legacy applications, and we needed a method to bring them into our overall automation scope and help resolve some core business challenges we were experiencing (lack of agility and low utilization).

When we started this journey about a year ago, around 12% of our server OS instances were running on virtual hardware.  Clearly, with our intent to drive pervasive virtualization, we needed to change that number quickly.

We took a number of paths to make sure we could make immediate gains in our virtual-to-physical ratio.  First of all, we got in front of new capacity demands, and we spread our net far: we took control of physical server purchases and drove purchasers toward our virtual platform as the default, we scrubbed everyone's capital plans and looked for opportunities to get in front of their purchases, and we analyzed IT projects to see who was going to need capacity and got in front of them.  All of these methods helped us steer new purchases to virtual machines instead of new physical servers - and they also helped us figure out the real barriers keeping us from 100% virtual.

Based on the data we collected and some additional analysis, we created our list of technical limiters to running 100% of our OS instances on virtual machines.  We use this list to track our engineering work, have discussions with our suppliers, measure what we still need to solve, and determine which physical servers are optimum candidates for virtualization.

Some of the top limiters we are dealing with are:

1.)    Virtualizing our big Tier 1 systems:

a.       We use load balancing pretty extensively for our web heads, with a mixture of software and hardware load balancers depending on the app.  Plenty of people are doing this today, but we needed to make the solution reproducible for our operations team so it could become the norm.

b.      Many of our applications use clustering for application-level failover.  We are putting the final touches on our implementation, but getting this working, and designed for operations, was not trivial.

c.       Many of our big Tier 1 servers have a significant number of LUNs (Logical Unit Numbers) attached, and with the 256-LUN limit per host we had a scaling problem on the clusters.  This required us to design smaller clusters for these apps and look at more scale-out options for the applications.

2.)    Virtualization of externally facing apps:

a.       We use a combination of security methods to secure our DMZ (demilitarized zone) environment for externally facing applications.  We recently completed the engineering and rollout on this and are now virtualizing our externally facing environments on multi-tenant infrastructure.

3.)    Data Classification Controls:

a.       Due to concerns with checking the integrity of the hypervisor and the potential for an unsecured guest to be an optimum attack surface, we are in the process of engineering a solution that allows us to have mixed multi-tenant clusters for our higher-priority servers while minimizing the risk to other guests or the host itself.

4.)    Mega Virtual Machines:

a.       Most of our VMs (virtual machines) fall into one of our 3 sizes (Small, Medium, and Large), and we rarely roll out VMs with more than 8 GB of memory or 4 vCPUs (virtual CPUs).  However, to cover the rest of the environment we have to deploy much larger VMs, and it seems some software just keeps asking for more and more memory in a scale-up fashion rather than scaling out.  We are now analyzing how to best handle 48 GB+ VMs and still have a well-functioning cluster from an operational perspective.

We then take each of these limiters (this is just part of the list), figure out how to detect whether a physical server is impacted by it, and therefore determine how much of our environment is unblocked when we fix that limiter.  This method has kept us very data driven and systematic.  Previously there were lots of open-ended opinions and FUD (fear, uncertainty, and doubt); by using this method and sharing the details extensively internally, we have made some pretty big leaps in making virtual the first choice for our application owners.  All the servers that are considered not limited are then fed into our operational Virtual Factory, which runs the process of analysis, scheduling, migration, testing, and end of life (EOL) of the hardware.  That is another interesting topic that I will write about in the future.
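
As a rough illustration of how the method stays data driven, imagine tagging every physical server with the limiters that apply to it and then asking how much of the fleet each fix would unblock.  The server names and limiter tags in this Python sketch are made up; the real analysis runs against our collected inventory data.

    # Sketch: estimate how much of the environment each technical limiter blocks.
    from collections import Counter

    servers = {
        "srv-erp-01": {"lun_density", "tier1_cluster"},
        "srv-web-04": {"dmz"},
        "srv-db-09":  {"mega_vm", "lun_density"},
        "srv-app-17": set(),   # no limiters -> feed it to the Virtual Factory
    }

    blocked_by = Counter()
    for limiters in servers.values():
        blocked_by.update(limiters)

    ready = sum(1 for limiters in servers.values() if not limiters)
    print(f"{ready}/{len(servers)} servers have no limiters today")
    for limiter, count in blocked_by.most_common():
        print(f"fixing '{limiter}' touches up to {count} more servers")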

What are your main limiters and how are you dealing with them?

 

-Das

Hello World,

 

I work in Intel IT Engineering on our Data Center Engineering team, and I am our enterprise private cloud engineering lead.  This means I focus across all aspects of our cloud engineering work to ensure that we are building optimum solutions, that our technical investments are meeting our overall architectural plans, and that our operations teams are getting "design to run" solutions.  Before this role I was heavily involved in various engineering and operational aspects of our Design Grid environment, and I decided to take on the challenge of bringing many of our grid computing concepts to the rest of our IT infrastructure.

 

 

I also spend time out talking to peers in the IT industry, and I realized it would be beneficial to share my personal perspective, as well as that of the many engineers I work with, on the challenges, successes, and failures we encounter on this cloud journey.  So now I am blogging to help increase that communication, and I hope I hear from some of you about what you are doing in this exciting space.

 

 

Cloud is a pretty broad topic, and I like to keep my blogs somewhat short and focused on a key point…  We have published a few IT@Intel whitepapers (http://download.intel.com/it/pdf/Entrprse_Priv_Cloud_Arch_final.pdf) that you can take a look at for more in-depth context on our journey, and in the future I will pick one or two areas that are either keeping me up at night or that I am proud of.  This first blog is my introduction and a few pointers to what we are doing.

 

At a high level, we are taking a pragmatic approach to shifting our IT Office and Enterprise infrastructure from a siloed, predominantly physical environment to an elastic, on-demand, multi-tenant infrastructure.  We made a decision last year to make pervasive virtualization a foundational aspect of our Enterprise Private Cloud, and we have gone from around 12% of our server OS instances running on virtual hardware in late 2009 to over 35% now.  This is another interesting topic that I will cover in a future blog, both from an operational perspective (how we move at that pace) and from a technical perspective (how we analyze technical limiters and address them systematically).

 

 

On top of virtualization we have introduced on-demand self-service…  This is also a very significant and complex area, relying heavily on multiple solutions such as capacity management, entitlement, and appropriate controls to make our infrastructure appear infinite to our consumers.  This isn't happening overnight, but I am excited about our progress.
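
To give a flavor of what "appear infinite" takes in practice: an entitlement and capacity check sits in front of every request, so the portal can say yes almost every time without actually promising infinite capacity.  A toy Python sketch, with quotas and pool numbers invented for the example:

    # Sketch: gate a self-service request on entitlement and remaining pool capacity.
    QUOTAS    = {"design-team": {"vcpus": 200, "memory_gb": 800}}
    USED      = {"design-team": {"vcpus": 150, "memory_gb": 512}}
    POOL_FREE = {"vcpus": 4000, "memory_gb": 16000}

    def approve(group: str, vcpus: int, memory_gb: int) -> bool:
        quota, used = QUOTAS.get(group), USED.get(group)
        if quota is None:
            return False   # not entitled to self-service at all
        within_quota = (used["vcpus"] + vcpus <= quota["vcpus"] and
                        used["memory_gb"] + memory_gb <= quota["memory_gb"])
        pool_has_room = vcpus <= POOL_FREE["vcpus"] and memory_gb <= POOL_FREE["memory_gb"]
        return within_quota and pool_has_room

    print(approve("design-team", vcpus=8, memory_gb=32))   # -> True in this example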

 

 

Let me know if there are specific things you are interested in.  I will try to get these out on a somewhat regular cadence, and thanks for reading my introduction.

 

 

-Das

 

 
