
The Data Stack


Today, most of the applications delivered through cloud infrastructure are consumer services, such as search, photo and video sharing, and on-demand services like Uber and Airbnb. But the cloud segment is diversifying as we see a greater push toward the enterprise deployment of more private and hybrid clouds.


This is a trend that will accelerate greatly over the next few years. Intel estimates that by 2020, up to 85 percent of applications will be delivered via cloud infrastructure. As this tectonic shift of workloads toward efficient cloud models takes place, we will all witness industries transform themselves as more consumer services and business services move to cloud-type architectures and new digital services emerge.


At Intel, we are working actively to enable this shift by removing barriers that block the road to enterprise cloud deployments. Together with a broad ecosystem, we are working to create an entirely new set of requirements for globally distributed and highly secured clouds.


The next wave of clouds will be based on off-the-shelf solutions and open source software, rather than build-it-yourself approaches. It will be focused on the convergence of services, to ultimately deliver better experiences to end users. And it will be more transparent about the infrastructure in the cloud environment—because what’s inside matters for performance, security, and reliability.


That’s all good news, but there is just one problem: We’re not moving fast enough. Solution stacks are fragmented. The process of deploying solutions is difficult, time consuming, and error prone. And in too many cases the features required for enterprise use cases are simply not there.


That’s why we created our Cloud for All initiative. This initiative, announced in July, is designed to accelerate cloud adoption by making public, private, and hybrid cloud solutions easier to deploy. Cloud for All is built on three pillars—industry investment, technology optimization, and industry alignment—with an ultimate goal of unleashing tens of thousands of new clouds.


This really starts with investment. Intel has a history of helping to lead the technology framework and drive industry standards, and it’s no different in the cloud space. As I noted in a talk today at the Structure 2015 conference in San Francisco, we are investing significant resources to support communities like the OpenStack Foundation to drive optimized open source solutions that the entire industry can innovate around. We are also investing in the Open Container Initiative and the Cloud Native Computing Foundation, where the industry is focused on ensuring that the standard framework for containers interoperates and that cloud-native applications interoperate across environments.


All of this work is great, but we’re not stopping there. We are simultaneously making significant investments with Mirantis and Rackspace. These collaborations augment our work within the OpenStack community by addressing key gaps in the enterprise readiness of OpenStack.


Now let’s talk about solving data center problems by optimizing solutions. Here are a few examples of how Intel drives that innovation in cloud infrastructure:


  • At Structure 2015, Diane Bryant announced that Intel will deploy software development platforms to select customers in Q1 2016. These development platforms will include an Intel® Xeon® processor with an Altera FPGA (field programmable gate array) in a multi-chip package, as well as a set of libraries to get started.


  • We are optimizing many Intel products and technologies for software-defined storage, one of the keys to new clouds. And just last week we launched the Intel® Xeon® processor D-1500 family optimized for storage and networking, a chip that will accelerate the move to more agile, cloud-ready communications networks.


  • Elsewhere on the innovation front, we recently announced a new memory technology called 3D XPoint™ technology. This technology will be used to create wicked fast SSDs that have, in early testing, shown a 5x–7x performance advantage over our current fastest data center SSDs. But we’re not stopping at SSDs and have announced Intel DIMMs based on this breakthrough technology. These DIMMs can deliver big memory benefits without any modification to the OS or the app, or can be used for non-volatile operations using the SNIA NVDIMM programming standards. Cloud architecture is all about delivering the best performance per TCO dollar, and the capability of these DIMMs removes long-standing workload bottlenecks – a chance to reimagine how memory and storage work together.


I hope this insight into our view of the next tens of thousands of new clouds inspires you to be part of the journey.  In many ways, we’ve only just begun to scratch the surface of what’s now possible.

Diane Bryant

Power of the Cloud

Posted by Diane Bryant Nov 18, 2015

Data is the most disruptive force in business and society today. In my conversation with Fortune’s Stacey Higginbotham at Structure 2015 today, I will assert that our industry’s overall view of the power of the cloud is too incremental, and that we ought to have much bigger aspirations.


Everyone knows that data, powered by the cloud, is fundamentally changing the way businesses compete. What’s less widely discussed is that data holds the power to tackle the greatest societal challenges of our time— from healthcare to hunger to climate change and beyond. The cloud is the delivery mechanism that will make transformative change possible, and Intel is laying the groundwork to shape the future in this regard.


Today’s view of the cloud is overly constraining. We think of it incrementally—one step beyond virtualization—as a way of optimizing a business and banking some efficiencies. It is too often viewed in binary terms, as either public or private, or only in the context of massive “hyperscale” capacity, a benefit within the reach of only the biggest cloud service providers. But imagine if, instead, we saw the cloud as something more. What if the cloud was something accessible to every business and institution, every entrepreneur and inventor?

A perfect example of the power of the cloud was demonstrated at Oregon Health & Science University this past summer, where Intel launched the Collaborative Cancer Cloud initiative—an endeavor that we hope will enable a cure for cancer. Exciting stuff, but merely the beginning, as we put the power of information to work in many other ways, such as analyzing crop yields, curbing greenhouse gas emissions, and tackling the vast field of precision medicine.

Contrary to current thought, these pioneering and transformational endeavors enabled by the disruptive force of cloud computing can be accessible to all. Our whole effort this year – under the moniker “Cloud for All” – has been to create clouds that can be deployed and accessed by businesses of all sizes, by entrepreneurs in the most remote parts of the world, and by researchers and scientists from institutions large and small. At Intel we believe that the ideas and inventions unleashed by the cloud can fundamentally alter the course of human history.

Thinking of the cloud as a tsunami-sized force of change, not just compute, storage and network efficiency, is truly one of the most exciting conversations in technology today.

Our world is confronted with immense challenges, from genetic diseases to natural disasters, and I believe that engineering and science hold the key to how we solve these problems. We’ve all experienced instances where technology has amplified, or could have better amplified, the quality of our lives or the lives of others.


Within High-Performance Computing (HPC), there are still many as-yet-undiscovered ways we could help researchers and scientists make progress faster. Accomplishing this, however, requires a healthy pipeline of computer scientists (i.e., code developers) and domain research experts (i.e., chemists, physicists, materials scientists, etc.) working together.


These big challenges in society, industry, and science can best be met by leveraging a diverse pool of ideas. We’ve all witnessed progress on these challenges and can attest to the value of having diverse thought, experiences, and mindsets applied to the problem.


In true Intel fashion, we are out front in helping to build an HPC talent pipeline that mimics the makeup of society. Monday, we announced two programs aimed at increasing the participation of underrepresented minorities in HPC. The first is an HPC internship program, which will offer students exposure to the various facets of HPC by rotating them through different computer science and research teams at Intel. We believe that broadening students’ exposure to HPC will encourage them to discover new ways of developing applications that ultimately help bring solutions to humanity.


The second program is a scholarship program. Intel is committing US$300,000 a year for the next five years to fund scholarships for underrepresented minorities who are pursuing graduate degrees in computational and data science. Intel will work with SIGHPC to award these scholarships beginning in 2016.


We encourage those attending SC15 to join the Diversity-focused discussions. Intel is hosting two of them: “Diversity and Innovation in HPC” and “Women Impacting HPC Tech Session.”


Collectively, we can make a deeper impact in technology through inclusiveness.

Last week I had the opportunity to host a panel discussion at New York University on the topic of data science as a career. Held at the NYU Center for Data Science, the panel also included Todd Lowenberg, group head of advanced analytics at MasterCard Advisors, and Kirk Borne, principal data scientist at Booz Allen Hamilton.


In front of a full house of students, we discussed everything from how we each got into the data science field to our “crystal ball” predictions of what the students might expect as they begin their own careers. It was a fantastic experience to get to speak to so many students on the cusp of the professional data science world, and to introduce them to what we’re doing at Intel to advance data analytics. I think it was an eye opener for the audience to hear that Intel is not just a chip maker; Intel is on the inside, but it’s also on the outside enabling companies around the world to use data analytics to become more agile, competitive and intelligent.


Part of what makes a discussion about fields such as data science so successful is personal stories about the inspiring moments we data scientists have had throughout our careers. Kirk shared an uplifting experience that brought him out of the world of astrophysics and into the world of data science. While at NASA, he worked with an IBM internship program that was teaching data analytics to inner city high school students. How, he asked, did they get these kids interested in such an advanced concept when so few of them regularly went to class?


They were interested in learning all about data because the program related it to something they cared about – sports analytics. These kids never knew that using math could influence the play of their favorite athletes, and they were hooked on learning more. The graduation rate at their high schools sat at around 47%, but after the internship, the graduation rate among these students soared to 93%. Kirk knew then that we were on to something special with data analytics, and he has spent his life dedicated to the field ever since.


Todd surprised the panel by introducing MasterCard as a technology company. Which, when you think about it, isn’t a surprise at all. With millions of customers each having a vastly different spending pattern, the data sets available for analysis are some of the most interesting and unique available – a veritable “kid in a candy store” situation for any data scientist.


Todd also outlined one of the most important concepts of the discussion – data science as a team sport. It’s true that having advanced knowledge of mathematics and programming is a fantastic background for a data scientist. But, in any company, you won’t find just one data scientist doing it all – just as Michael Jordan couldn’t have scored so many points without Scottie Pippen at his side, data scientists each bring their own skills to the table, and together they build an ideal team.


In fact, we’re looking for all kinds of skills and backgrounds as we look to build out our team at Intel – from programmers to those with creativity, curiosity, and great communications skills.  It’s rare to find a “data unicorn” that can do it all, and we’re not spending our time recruiting for such a talent. We build out teams to reflect a variety of backgrounds and experience, which brings greater insight to our data analytics work. In this spirit, we had a very diverse group in the room, with students majoring in physical science, math and statistics, and computer science. This is incredibly encouraging, since diverse backgrounds build a better data science team.


After our discussion, we had the opportunity to learn all about what was top of mind for the students. A theme that kept popping up in my conversations went something like this – “I’m really good at computer science, so how do I show my mettle as a data scientist?” My advice to them was to get their hands on a data set – whether it’s from Kaggle, DataKind, or the government – and build up a data analytics environment. Calculate something on it, whether it’s a correlation or Tableau visualization, and tell a story with that data. It’s great practice, and will show anyone interested in the field what data science is really like. It will also show future employers that you’ve done work in the field, and that you understand how to deal with messy data and think about these types of problems.
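
For readers who want to try exactly that, here is a minimal sketch of such a first project, assuming pandas and NumPy are installed; the file name and the helper function are hypothetical placeholders for whatever data set you pull down.

import numpy as np
import pandas as pd

def strongest_correlations(csv_path: str, top_n: int = 5) -> pd.Series:
    """Load a CSV, keep the numeric columns, and rank pairwise correlations."""
    df = pd.read_csv(csv_path)
    numeric = df.select_dtypes(include="number").dropna()
    corr = numeric.corr()
    # Keep only the upper triangle so each pair of columns appears once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs.sort_values(key=np.abs, ascending=False).head(top_n)

if __name__ == "__main__":
    # "city_data.csv" is a placeholder for a data set from Kaggle, DataKind, or a government portal.
    print(strongest_correlations("city_data.csv"))

The point is not the particular statistic; it is that you can show a future employer a small, end-to-end exercise in acquiring, cleaning, and narrating real data.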


I hope to be able to host more of these university discussions in the future. It’s been some time since I left the world of academia, and it’s invigorating for me to spend time with students, learn about what they’re working on, what’s challenging them, and help guide them on their path to data science. With the huge shortage of data scientists we’re faced with today, it’s fantastic to see so many great minds ready and willing to jump into the field. And maybe if I’m lucky, I’ll have a few of them on my team.

Data centers run 24x7, consuming power at high rates even when workload demand is relatively light.  Servers can be shut down completely when unneeded, but this can negatively impact performance if demand suddenly ramps up, and can still leave power and cooling resources out of balance with performance needs.  Intel Node Manager provides a solution with policy-based power management capabilities, already widely available in data centers today with servers based on the Intel Xeon Processor.


But what are the trade-offs?  Which workloads run most efficiently with power capping policies in place?  And what is the ideal level to limit power without negatively impacting performance?  Intel worked with Principled Technologies, an independent assessment firm, to conduct in-depth analysis and provide answers to these questions.


This is all about efficiency, finding the sweet-spot in performance per watt.  For instance, the study shows that a 65% power level cap produced the optimal performance per watt for database storage-intensive workloads, such as those found in OLTP-style e-commerce sites.  This is almost 20% more efficient than without Intel Node Manager.  CPU- and memory-intensive workloads, such as those found in virtualized desktop environments or in Java application servers, respectively, also showed efficiency improvements.  Mixed workload environments, such as Exchange, were able to achieve up to a massive 42% efficiency boost.  At a scale of thousands of servers in a datacenter, this adds up to significantly lower power consumption costs without unduly sacrificing performance.
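
The arithmetic behind that sweet spot is simple. Here is a toy sketch of the metric, with made-up throughput and power numbers used purely for illustration (they are not figures from the Principled Technologies study):

def perf_per_watt(throughput_ops: float, watts: float) -> float:
    """Performance-per-watt efficiency metric."""
    return throughput_ops / watts

# Hypothetical server: uncapped vs. capped at 65% of its 400 W ceiling.
uncapped = perf_per_watt(throughput_ops=10_000, watts=400)
capped = perf_per_watt(throughput_ops=7_800, watts=260)

gain = (capped - uncapped) / uncapped
print(f"Efficiency gain from the power cap: {gain:.0%}")  # ~20% with these inputs

If the workload loses only a little throughput while power drops sharply, efficiency rises; multiplied across thousands of servers, that is where the savings come from.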


Take a look at the detailed report, including the underlying methodologies, and consider taking advantage of Intel Node Manager to optimize power efficiency in your datacenter.

By Aaron Taylor, Senior Analytics Software Engineer, Innovation Pathfinding Architecture Group (IPAG), Data Analytics & Machine Learning, Intel Corporation


Analyzing Big Data requires big computers, and high-performance computing (HPC) is increasingly being pressed into service for the job. However, HPC systems are complex beasts, often having thousands to tens of thousands of computing nodes, each with associated processor, memory, storage, and fabric resources. Keeping all the moving pieces firing on all cylinders and balancing resource tradeoffs between performance and energy is a mammoth job.



Imagine the data traffic management job involved in collecting telemetry data on hundreds of thousands of processors, memory, and networking components every 30 milliseconds. Having compute node component failures every minute is not uncommon in such complex systems.


To stay ahead of failures, data center managers need automated monitoring and management tools capable of collecting, transmitting, analyzing, and acting on torrents of system health data in real time. There are simply no tools available today that can do this across the fabric, memory, and processor resources for an entire cluster.


A new approach to telemetry analytics


We in Intel Data Analytics and Machine Learning Pathfinding have come up with a new approach for managing Big Data analytics systems, called Data Center Telemetry Analytics (DCTA). It uses hierarchical telemetry analytics and distributed compression to move primary analytics close to the source of the raw telemetry data, doing the initial analysis there and then sending only summarized results to a central DCTA system for analysis.


Over time, with enough health monitoring data in hand, you can use machine learning to build predictive fault models that characterize the response of the entire HPC system, not just individual nodes. And you don’t have to store reams of raw telemetry data, because the algorithms learn what they need from incoming data, get smarter from it, then discard the data.


Our tests have demonstrated that DCTA lets data center operators engage in accurate predictive capacity planning; automate root-cause determination and resolution of IT issues; monitor compute-intensive jobs over time to assess performance trends; balance performance with energy constraints; proactively recommend processor, memory, and fabric upgrades or downgrades; predict system or component failures; and detect and respond to cyber intrusions within the data center.


The key: hierarchical data analytics


Key to the success of DCTA, and using HPC to analyze Big Data in general, is hierarchical data analytics. With this technique, raw telemetry data is collected at each node, and using digital signal processing (DSP), statistical and stochastic processes, and machine learning techniques, the data is compressed while still preserving the context of the data. The context of the data improves over time as more data is analyzed and new features are derived, yielding more information about what’s happening on each node.
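
As an illustration of that idea (a sketch of node-local summarization, not the DCTA implementation itself), the following shows each node reducing a window of raw readings to a few statistical features and forwarding only those; the window size and the feature set are assumptions made for the example.

import numpy as np

def summarize_window(samples: np.ndarray) -> dict:
    """Compress one window of raw sensor readings into a handful of features."""
    return {
        "mean": float(np.mean(samples)),
        "std": float(np.std(samples)),
        "p95": float(np.percentile(samples, 95)),
        "max": float(np.max(samples)),
    }

def node_agent(raw_stream, window_size: int = 1000):
    """Yield one small summary per window; the raw samples are then discarded."""
    window = []
    for sample in raw_stream:
        window.append(sample)
        if len(window) == window_size:
            yield summarize_window(np.asarray(window))
            window.clear()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = rng.normal(loc=55.0, scale=2.0, size=10_000)  # synthetic telemetry
    for summary in node_agent(iter(stream)):
        print(summary)  # only these summaries cross the fabric

Only the summaries travel to the central analytics tier, which is what keeps the transmission and storage overhead manageable at cluster scale.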


With enough information gathered over time, machine learning clustering and classification algorithms can characterize the system response at each node, and enable predictive fault detection and automated resource management to improve cluster resiliency and energy efficiency.


The ability to compact large amounts of raw data into a summary form greatly reduces the overhead of processing and transmitting enormous volumes of telemetry data across a data center fabric, which helps balance performance and energy consumption. The ability to tame telemetry data at its source essentially cuts system management down to size.


Consider the ripple effect: using DSP, initial raw telemetry data is compressed, which eliminates the need to store pure raw values. Over time, more context about system behavior is derived through the analysis of higher-level system features (e.g., statistical features). Data about these higher-level features can also be compressed using DSP techniques.


As the context is further built out over time, machine learning algorithms characterize the system responses at each level, yielding a small amount of data to store. There is no need to store the information-level features, as the localized system response has already been characterized.


With data thus shrunk, data center managers realize massive storage savings and can transmit far less data across the fabric, characterize the entire cluster response more effectively, and greatly improve fabric latency, which is a major bottleneck in HPC and cloud computing.


In summary, compute-intensive data compression algorithms (e.g., DSP and machine learning) can be applied in a hierarchical manner at each data source to greatly reduce storage requirements and the latency of transmitting data across the fabric. At the same time, system context is preserved and deepened over time to greatly improve resiliency and predictive capabilities.


Capabilities like these are key to cost-effectively meeting increasingly intensive compute and analytics requirements. We have developed working prototypes and demonstrated their effectiveness in Intel Data Analytics and Machine Learning Pathfinding and are working hard to bring DCTA to life.

What do public health, medicine, science, agriculture, and engineering have in common? Answer: All of these fields, and countless others, face a new class of problems that can be solved only with unique combinations of high performance computing, data analytics, complex algorithms, and skilled computational and data scientists.


Bringing these combinations together is a goal of a just-announced long-term strategic partnership between Intel and the Alan Turing Institute in the United Kingdom. Via this partnership, researchers from Intel and the institute will work together with teams of research fellows and software engineers to drive fundamental advances in mathematical and computational sciences.


A key mission for the partnership, as well as for the Alan Turing Institute itself, is to develop the algorithms that allow people to unlock insights buried in mountains of diverse data—such as weather forecasting models that consider the interactions of ocean temperatures, atmospheric conditions, solar flares, and more.


Of course, algorithms aren’t just the stuff of scientific investigations. They are part of everyday life. When you use your cell phone to call home or search for hotels close to an airport, you’re realizing the benefits of algorithms developed by data scientists. For this reason, some of the advocates for the Alan Turing Institute characterize the times we are living in as “the age of algorithms.”


While driving advances in fundamental research and the algorithms that empower our lives, the Intel-Alan Turing Institute partnership will train a new generation of data scientists through the institute’s doctoral program. This forward-looking training effort will help ensure that students are equipped with the latest data science techniques, tools, and methodologies.


The work done through the institute will also drive advances in the Intel Scalable System Framework for HPC. Intel will dedicate a hardware architecture team to the institute’s facilities so that new algorithms developed by the Alan Turing Institute will feed into the design of future generations of Intel microprocessors.


All of this work builds on the legacy of Alan Turing, whose pioneering work laid the theoretical foundations of the modern electronic computer. Turing is considered by many to be the founder of modern computer science.


For a closer look at the Institute and its strategic focus, visit the online home of the Alan Turing Institute.

Last night at SC15, Diane Bryant spoke about the future of HPC—new applications, new audiences, new architectures. Intel has been involved in high-performance computing for over 25 years, and today, nearly 89 percent of the world’s 500 largest supercomputers run on Intel® architecture.


But HPC is at an inflection point, with growing demand from existing users as they grapple with more data and more complex models, and from new classes of applications that are turning to HPC to gain insights from Big Data streaming in from our connected world. Increased data, complexity, and audiences require a complete re-examination of how systems are designed.


The challenges to continuing to exploit performance from HPC systems are well documented.  More than powerful processors are required to take HPC to the next level. We need leaps forward in memory, I/O, storage, reliability, and power efficiency, and we need innovations in these areas to work together in a scalable and balanced way.


Intel has been busy working on these next-gen challenges for decades. Our Intel® Xeon® processor E5 family and Intel® Xeon Phi™ coprocessors are designed for HPC. And our new Intel® Omni-Path Architecture is an HPC fabric that can scale to tens of thousands of nodes.


We’ve introduced innovative memory technologies like 3D XPoint technology, used to create fast, inexpensive, and nonvolatile storage memory. We’re also continuing to improve Lustre* software, the most widely used parallel file system software for HPC.


But next-gen HPC requires more than a collection of parts. The future will require a rethinking of the entire system architecture, a new system framework to ensure that all these parts work together seamlessly and efficiently.


That’s why we’ve developed the Intel® Scalable System Framework. It combines all the elements I just mentioned, and others, into a scalable system level solution that is more deeply integrated than ever before. It is a flexible blueprint for designing balanced and efficient HPC systems that can scale from small to large, address data and compute intensive workloads, and ease procurement, deployment, and maintenance, while being based on standard X86 programming models. Customizability is key. This framework will allow users to adapt their HPC system procurement to their application needs—to tune for high I/O or compute, for example.


Soon, Intel will publish a reference architecture and a series of reference designs for the Intel® Scalable System Framework that will simplify the development and deployment of HPC systems for a variety of industries.


We’ve got to make it easier to use these systems since HPC is moving beyond its traditional technical and scientific roots into business, education, even the world of dating. With the advent of Big Data analytics, everyone from retail chains to social media sites needs HPC-calibre systems to make sense of the reams of data cascading in. And in regard to machine learning, Andrew Ng, associate professor at Stanford University, chief scientist at Baidu, and chairman and co-founder of Coursera, has a great quote: “Leading technology companies are making more and more investments in HPC and supercomputers. I think the HPC scientist could be the future heroes of the entire field of AI.”


Intel is working closely with a number of partners to bring the Intel® Scalable System Framework to market in 2016. Many of our partners have opened Centers of Excellence, where customers can collaborate with experts from Intel and our partners to optimize their codes for HPC systems. The ability to buy easy-to-use HPC systems will make HPC practical for organizations that don’t have the technical staffing for the massive academic and research HPC systems of today, letting them solve new business and social problems.


Since Intel entered the HPC field more than 25 years ago, we’ve been out to democratize HPC. Now we’re transforming it to enable the next 25 years of innovation with the Intel® Scalable System Framework!



Learn more about Intel® Scalable System Framework

SAP TechED 2015 has now come and gone, but it was a great show for SAP—and Intel. Our joint innovation on such technologies as SAP HANA* was front and center during much of the show, and over the course of the event we revealed new co-engineered solutions in enterprise mobility, IoT and data analytics at the network edge. SAP TechED is truly the IT & developer conference for SAP professionals! Here are some of the highlights.


SAP TechED started off with Pat Buddenbaum, general manager of the enterprise segment of Intel’s Data Center Group, joining Steve Lucas, global president of SAP Platform Solutions, on the stage for the executive keynote address.


  • Watch this video clip to see Pat Buddenbaum and John Appleby, general manager of Bluefin Solutions, show Steve Lucas a live demonstration of Intel’s own SAP HANA infrastructure running at petabyte scale. The trio’s lighthearted banter belied this groundbreaking feat of database scale-up to one petabyte of tiered SAP HANA data storage, as on-stage servers based on Intel® Xeon® processor E7 v3 technology running SAP HANA spun through a trillion rows of data in just a matter of seconds.


  • For more information about Intel’s one petabyte deployment of SAP HANA, watch this video interview I conducted with Pat Buddenbaum. We discussed how Intel will use data analytics from the deployment to help refine our manufacturing processes.


  • Pat Buddenbaum also appeared on the HANA Effect Podcast with host Jeff Word from SAP. They discussed the business imperatives driving legacy enterprises to transform their business strategies and data center capabilities with real-time computing to remain relevant in the face of competition from young, upstart challengers.



SAP TechED also gave Intel the opportunity to highlight some of the cutting edge solutions that we are developing with SAP and our ecosystem of technology partners.


  • The IoT Opportunity: Near-Real-Time Analytics from Edge to Cloud: Bridget Karlin, Intel’s director of IoT strategy and technology, and SAP Vice President Irfan Khan discussed Intel and SAP collaboration on end-to-end IoT solutions. Watch this video to learn how these jointly optimized systems securely and seamlessly ingest data from Intel IoT gateways, move it to SAP’s SQL Anywhere Database, and transmit it to SAP HANA cloud platforms, providing near real-time analytics that can be tailored to a wide variety of vertical business needs.



  • How Digital is Actually Transforming Business - Intel + SAP: What do a worn out shoe, a fender bender, and a heart monitor have in common? Find out by watching this short animation that illustrates how SAP S/4 HANA* and Intel Xeon processors can infuse data from sensors and devices on the edge directly into your business.



  • Modernize Your SAP Environment with Improved Performance and a Lower TCO: Audio-equipment manufacturer Peavey Electronics set out to improve the performance of its SAP Business Warehouse platform to provide easier access to mobile reporting tools and add flexibility to respond to fast-changing business needs. Read this case study (https://www.necam.com/docs/?id=ea22aed2-2cc5-418a-82b6-fb85e7687c5a) to discover how Peavey addressed these challenges by moving from disk-based database software running on proprietary hardware and OSs to a modern, in-memory database platform based on Intel, SAP, and Red Hat technologies.


  • Enterprise 3-D Graphics Everywhere: 3-D visualization can provide a richer, more detailed view of complex data, technical drawings, and processes than traditional 2-D technologies, improving intuitive process learning, search capabilities, and spatial analysis. Read this solution brief to discover how a combined solution from Intel, SAP, and Citrix can deliver secure, real-time, and detailed 3-D information to virtually any device.


Intel also hosted a number of ongoing activities and events at its booth. One fun event was the HANA Challenge, an Oculus-based game in which participants used their knowledge of SAP HANA to compete for prizes. In this photo, you can see contestants using Oculus virtual-reality goggles to test their skills.


At trade shows and conferences, it’s usually me who’s interviewing industry experts, but at SAP TechED, I had the chance to sit on the other end of the couch. Diginomica’s Jon Reed interviewed me to discuss real-life scenarios for big data analytics. I also shared upcoming collaborations between Intel and SAP, including 3D XPoint™ non-volatile memory technologies, which have the potential to significantly drive down the cost of in-memory computing.


What were some of your key takeaways during the show this year?


Follow me at @TimIntel for my commentary on data analytics, and follow @IntelITcenter to join the dialogue with Intel’s IT experts.



Data centers today are undergoing arguably the largest change in IT history. Driven by application mobility and seamless scaling, cloud architectures are disrupting how data centers are designed and operated. Quickly disappearing are the traditional silo application architectures and overprovisioned networks.


HPC has always required the largest affordable systems, and was the first to adopt large scale clusters. HPC clusters have not only required large compute scaling, but also massive amounts of communication and storage bandwidth required to feed the compute.


HPC clusters use flat or near flat networks to deliver large amounts of bandwidth, along with fast protocols based on RDMA to minimize communication overhead and latency.  HPC clusters use distributed file systems like IBM GPFS* or Lustre* above highly available shared RAID storage to deliver the bandwidth and durability required.


Cloud architectures have many of the same requirements, largest affordable compute, flat or near flat networks, and scalable storage.  The requirements are so similar that Amazon*, Microsoft*, and Google* all support deployment of large-scale, virtualized HPC clusters over their respective IaaS offerings.


The storage platforms used in the deployments consists of a virtualized distributed file system such as NFS or Lustre. This file system is attached to/available on the virtual cluster interconnect to which the cluster's virtualized compute and head nodes are also attached. Delivering low latency and high bandwidth over the virtual interconnect is a real challenge.


Underlying the file system are virtualized block devices that provide the requisite strong consistency and high availability. However, unlike the use of highly available RAID storage in traditional HPC deployments, the durable storage in HPC cloud deployments use non-POSIX-compliant BLOB stores across multiple nodes in the cluster. Providing durability over the network and still meeting the aggressive latency targets required by HPC applications can be a daunting task.


The Storage Performance Development Kit (SPDK) is built on the Data Plane Development Kit (DPDK), which is used to accelerate packet processing. SPDK delivers high performance without dedicated hardware, in Linux* user space. SPDK employs user-level polled-mode drivers to avoid kernel context switches and interrupt overhead. Virtual function support for virtual machines also minimizes the overhead of hypervisor interaction.


SPDK has demonstrated large improvements in Intel® Ethernet DCB Service for iSCSI target and TCP/IP stack processing and significant latency and efficiency improvements with its NVMe driver, while reducing BOM costs in storage solutions.


Using storage nodes running SPDK, cloud systems can deliver higher-performance, lower-latency storage to HPC applications. With cloud deployments scaling larger and larger and storage media getting faster, the demand for high-throughput, low-latency storage processing will continue to grow. SPDK is a major step forward in reducing storage latency and increasing storage bandwidth.

Thinking of finally getting that space-age, curved, flat screen TV? If you’re in China, 11/11 was your lucky day. It’s Singles' Day – the largest online shopping event in the world. Hundreds of millions of people flocked to online Chinese portals this week, snapping up everything from umbrellas to phones to 2-in-1s.


CNBC reported each Singles' Day shopper was expected to spend an average of $287 this year. Just when you’re trying to reconcile that figure against a reported slowdown in the Chinese economy, read CNN Money’s take on Alibaba’s recent Q3 revenue, which rose 32%, leaving many investors happy and analysts confused.

Alibaba hit $14.3B in sales
For Alibaba, the Chinese equivalent of Amazon, Singles' Day is a huge deal.
According to the BBC, the company clocked $3B in sales in the first half hour, with an eye-watering $1.6B pouring in during the first 12 minutes after midnight.


Last year, the company topped over $9B in sales in a span of 24 short hours on 11/11. That’s more than the entire 2014 GDP of Malta!


Intel is at the heart of Singles' Day  

Here are a few ways Intel partnered with Alibaba to ensure a seamless Singles' Day this year:


  • More than 530 Alibaba CDN clusters now run on Intel® Xeon® processors, Intel® SSDs, and high speed Ethernet adapters globally, which dramatically speed up access times.
  • AliCloud, Alibaba’s cloud-computing subsidiary, processed a total of 140,000 transactions per second at peak, and it runs on Intel Xeon processors.



Today, Alibaba has 530 CDN data clusters, which all run on Intel Xeon processor-based servers. These servers drive the huge online traffic during today's Singles' Day shopping frenzy. Pictured above are Alibaba's Thousand Island data centers, which power Singles' Day within the Alibaba organization.


While Alibaba looks to us as one Intel team, behind the scenes, a small army of Intel employees in China and across the world are working with Alibaba. They hail from sales and marketing, data center, memory and storage, networking, software and security, IoT, and the Labs.


“Intel is a huge supporter of the 11/11 shopping experience,” says Rupal Shah, SMG VP and GM of Intel China, in a short video featured on online shopping site TMall.com. “We’ve been partnering very closely with Alibaba not only on the front end, but also behind the scenes, where the Alibaba data center is powered by Intel Xeon microprocessors, SSDs and networking gear.”





Raejeanne Skillern is General Manager, Cloud Service Providers for Intel.

Multi-Node Caffe* Training on Intel Xeon Processor E5 Series

In the second installment of the Intel® Math Kernel Library technical preview package, we present an optimized multi-node implementation using Caffe* that builds on our previous release of an optimized single-node implementation. This implementation scales up to 64 nodes of Intel® Xeon® processor E5 (Intel® microarchitecture code name Haswell) on the AlexNet neural network topology, and can train it to 80 percent Top-5 accuracy in roughly 5 hours, using synchronous minibatch stochastic gradient descent. Below is a view into the technical details of how we achieved this level of strong scaling for this very difficult problem.


Multi-node Synchronous SGD


In this work we perform strong scaling of the synchronous mini-batch stochastic gradient descent algorithm. We scale the computation of each iteration across multiple nodes, such that the multi-threaded, multi-node parallel implementation is equivalent to a single-node, single-threaded serial implementation. We utilize data- and model-parallelism, and a hybrid parallelism approach, to scale computation. We present a detailed theoretical analysis of computation and communication balance equations, and determine strategies for work partitioning between nodes.


Balance Equations in Data Parallelism


Consider a convolutional layer with ofm output feature maps, each of size output_w × output_h (width and height), ifm input feature maps, a stride of stride, and a kernel of size kernel_w × kernel_h. Clearly, the amount of computation, in floating-point operations (FLOPs), in this layer for a forward pass is:


Computation_FPROP = 2 × ifm × ofm × kernel_w × kernel_h × output_w × output_h


Recall that the computation for forward propagation, backward propagation and weight gradient calculation is the same. Now if we consider a multinode implementation where the number of data-points assigned per node is MB_node, then the total computation per node, per iteration is:


Computation = 2 × ifm × ofm × kernel_w × kernel_h × output_w × output_h × 3 × MB_node


The total communication per iteration can similarly be estimated for a data-parallel approach. In each iteration, the partial weight gradients must be communicated out of the node, and the updated weights should be received by each node. Hence the total communication volume is:


Communication = data_size × ifm × ofm × kernel_w × kernel_h × (1 + (1 - overlap))


Here overlap is the amount of overlap afforded by the software/algorithm between the sends and receives. Assuming floating point data representation, and complete overlap (overlap = 1) of sends and receives, we can estimate the communication volume (in bytes) to be:


Communication = 4 × ifm × ofm × kernel_w × kernel_h


The communication-to-computation ratio for data parallel implementation of a single layer is therefore computed as:


Algo-comp-to-comm-ratio = 1.5 × output_w × output_h × MB_node


It is notable that the algorithmic computation-to-communication ratio does not depend on the kernel size or number of input and output feature maps or stride, but instead solely depends on the size of the output feature-map and the number of data-points assigned per node.


For the neural network training computation to scale, the time taken for computation should dominate the time for communication. Hence the algorithmic computation-to-communication ratio computed above must be greater than the system computation-to-communication ratio.


Let us consider the implications of this observation for three cases and three hardware options: one for an Intel Xeon processor with an FDR InfiniBand* link, another for an Intel Xeon processor with 10GigE Ethernet, and another for a dense compute solution like the Intel® Xeon Phi™ processor with Intel® Omni-Path Fabric. First let us consider the three layers we want to study:


  1. A convolutional layer with 55×55 output feature map (like C1 layer of AlexNet, or similar to C2 layer of VGG networks) with algorithmic-compute-to-communication ratio of: 4537×MB_node

  2. A convolutional layer with 12×12 output feature maps like C5 in OverFeat-FAST (and which constitutes the bulk of OverFeat-FAST computation), where the algorithmic computation-to-communication ratio is: 216×MB_node

  3. A fully connected layer which can be considered as a convolutional layer with feature map size = 1, where the algorithmic compute-to-communication ratio is 1.5×MB_node


It is notable that the aforesaid algorithmic compute-to-communication ratios are optimistic, best-case scenarios. The worst case happens when overlap=0, and then these values are halved. For example, the ratio for fully connected layers becomes 0.75×MB_node. Note that this is a theoretical analysis, and both the computation and communication times may vary in an actual implementation.


Now let us consider the system computation-to-communication ratios for the three hypothetical platforms described earlier:


  1. A server class CPU C1 (with 2.7TF peak SP performance), with FDR InfiniBand = 2700GFLOPs/7GB/s = 386.
  2. Same server class CPU C1, with Ethernet = 2700/1.2GB/s = 2250
  3. A manycore processor M1 (with 6.0TF peak SP performance) with Omni-Path Fabric/PCI Express* Gen 3 = 6000GFLOPs/12.5GB/s = 480


Given the system computation-to-communication ratio for the three systems mentioned here, and the algorithmic computation-to-communication ratio for the layers presented earlier, we can estimate the minimum number of data points which can be assigned to each node. This in conjunction with the size of the minibatch, sets limits on the scaling possible for data-parallel approach to neural network training.



Figure 1. The minimum number of data points which must be assigned to a given node, shown for the C1 (55x55), C5 (12x12), and F1 (1x1) layers on three platforms: Intel® Xeon® processor + InfiniBand FDR, Intel® Xeon® processor + 10Gb Ethernet, and Intel® Xeon Phi™ + Omni-Path Fabric.
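
As a rough guide to how such numbers fall out of the balance equations above, here is a small sketch that derives the minimum data points per node by dividing each system compute-to-communication ratio by the layer's algorithmic coefficient (1.5 × output_w × output_h). The full-overlap assumption and the rounding up to whole data points are assumptions of this sketch, not values reproduced from the original figure.

import math

SYSTEM_RATIOS = {
    "Xeon + FDR InfiniBand": 386,
    "Xeon + 10Gb Ethernet": 2250,
    "Xeon Phi + Omni-Path": 480,
}

LAYERS = {  # output feature-map width and height
    "C1 (55x55)": (55, 55),
    "C5 (12x12)": (12, 12),
    "F1 (1x1)": (1, 1),
}

def min_data_points(output_w: int, output_h: int, system_ratio: float) -> int:
    # Data parallelism pays off when 1.5 * output_w * output_h * MB_node >= system_ratio.
    algo_coefficient = 1.5 * output_w * output_h
    return math.ceil(system_ratio / algo_coefficient)

for layer, (w, h) in LAYERS.items():
    row = {name: min_data_points(w, h, ratio) for name, ratio in SYSTEM_RATIOS.items()}
    print(layer, row)

The fully connected layer dominates: with an algorithmic ratio of only 1.5×MB_node, it needs hundreds of data points per node (over a thousand on 10Gb Ethernet) before data parallelism breaks even, which is exactly the problem the next paragraphs address.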


Clearly there are several cases where an inordinately large number of data points must be assigned to a given node in order to make data-parallelism beneficial. Often this is greater than the size of the mini-batch needed to converge at a reasonable rate. Hence, the alternative method of model-parallelism is needed to parallelize neural network training.


Model Parallel Approach


Model parallelism refers to partitioning the model or weights into nodes, such that parts of weights are owned by a given node and each node processes all the data points in a mini-batch. This requires communication of the activations and gradients of activations, unlike communication of weights and weight gradients as is in the case of data parallelism.


For analyzing model parallelism, we should note that the forward and back-propagation need to be treated differently. This is because during the forward propagation we cannot overlap communication of the previous layer activations with the forward propagation operation of the current layer, while during backpropagation we can overlap activation gradient communication with weight gradient computation step.


Analyzing the Model Parallel Approach:


We first consider a simple model parallel approach where each node operates on a part of the model of size: ifm_b×ofm_b input- and output-feature maps. In this case, the computation for the forward pass, or backward-pass, or weight-gradient update is given as:


Computation = 2 × ifm_b × ofm_b × kernel_w × kernel_h × output_w × output_h × minibatch


For the forward pass the amount of data received by this layer is:


Recv_comms = 4 × ifm_b × input_w × input_h × minibatch × (ifm/ifm_b - 1)


The amount of data sent out by the previous layer is:


Send_comms = 4 × ifm_b × input_w × input_h × minibatch


Hence the time taken for a forward pass with no compute and communication overlap for a given layer is:


Computation/System-flops + (Recv_comms + Send_comms)/Communication-bandwidth


Similar to the analysis of data-parallel multinode implementations, we can compare the communication and computation in the model parallelism. The algorithmic compute-to-communication ratio is:


(2 × ifm_b × ofm_b × kernel_w × kernel_h × output_w × output_h × minibatch) / (4 × ifm × input_w × input_h × minibatch)


This can be simplified as: 0.5 × ifm_b × ofm_b × kernel_w × kernel_h × feature-size-ratio/ifm (here feature size ratio is the ratio of the size of the output feature map to the input feature map). This ratio is independent of the mini-batch size. The algorithmic ratio can be further simplified to: 0.5 × ofm × kernel_w × kernel_h × feature-size-ratio/NUM_NODES (NUM_NODES = (ifm × ofm)/(ifm_b × ofm_b)). We then consider mirrored operations for backpropagation and no communication during weight gradient computation, which leads to up to 3X increase in compute and up to 2X increase in communication. The operation is compute bound if:


0.75 × ofm × kernel_w × kernel_h × feature-size-ratio/NUM_NODES > system-compute-to-comm-ratio


Exploring this limit for C5 layer described earlier, and Intel microarchitecture code name Haswell processors with FDR-IB we obtain the following:


0.75 × 1024 × 9 × 0.73 / NUM_NODES > 386, so NUM_NODES < 14.


Similarly for a fully connected layer with 4096 output feature maps we have the following conclusions: 3072/NUM_NODES > 386, so NUM_NODES < 8
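
A short sketch of that bound, using the same two worked cases as above (the floor rounding to a whole node count is an assumption of this sketch):

import math

def max_model_parallel_nodes(ofm: int, kernel_w: int, kernel_h: int,
                             feature_size_ratio: float, system_ratio: float) -> int:
    # Compute bound requires 0.75 * ofm * kernel_w * kernel_h * feature_size_ratio / NUM_NODES > system_ratio.
    return math.floor(0.75 * ofm * kernel_w * kernel_h * feature_size_ratio / system_ratio)

# Haswell + FDR InfiniBand system ratio of 386, as above.
print(max_model_parallel_nodes(1024, 3, 3, 0.73, 386))  # C5 layer: 13 nodes
print(max_model_parallel_nodes(4096, 1, 1, 1.0, 386))   # fully connected layer: 7 nodes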


Clearly model parallelism alone does not scale well to multiple nodes, even for convolutional layers. However, the choice of parallelization strategy is also dictated by whether model or data parallelism works better for a given layer. In particular, if we compare data and model parallelism for a 4096-4096 fully connected layer, we can easily conclude that model parallelism scales several times better than data parallelism. In particular, for a mini-batch size of 256, a fully connected layer cannot even scale beyond one node using data-parallelism. However, we must highlight the challenges in software design needed to overlap computation and communication in model-parallelism.


There is therefore a clear need to have both data parallelism and model parallelism for different types of layers. Of particular interest therefore is the question: “When to use model parallelism and when to use data parallelism?” This is answered by simply comparing the volume of data communicated in both schemes. The ratio of communication volume in model and data parallelism is:


(1.5 × output_w × output_h × MINIBATCH/NUM_NODES) / (0.5 × ofm × kernel_w × kernel_h × feature-size-ratio/NUM_NODES)


We can simplify this ratio to be dependent on the MINIBATCH size and surprisingly independent of the number of nodes the problem is mapped to. One should pick model parallelism over data parallelism if:


(3 × input_w × input_h × MINIBATCH)/(ofm × kernel_w × kernel_h) < 1, or equivalently: (3 × input_w × input_h × MINIBATCH) < (ofm × kernel_w × kernel_h)


Consider now the fully connected layer F1, where ofm=3072 and input_w/h kernel_w/h are all 1. The equation above indicates that model parallelism is favored as long as MINIBATCH is less than 1024. In visual understanding neural networks, MINIBATCH is less than or equal to 256, hence for fully connected layers we use model parallelism, while for convolutional layers we use data parallelism. In ASR networks MINIBATCH is often larger than 1024, so data parallelism is the preferred route for this case.
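
Here is a small sketch of that layer-by-layer decision rule; the convolutional-layer parameters in the second call are illustrative values, not taken from a specific network.

def choose_parallelism(input_w: int, input_h: int, ofm: int,
                       kernel_w: int, kernel_h: int, minibatch: int) -> str:
    # Prefer model parallelism when 3 * input_w * input_h * MINIBATCH < ofm * kernel_w * kernel_h.
    model_side = 3 * input_w * input_h * minibatch
    data_side = ofm * kernel_w * kernel_h
    return "model" if model_side < data_side else "data"

# Fully connected layer F1, treated as a 1x1 convolution with ofm=3072, minibatch of 256:
print(choose_parallelism(1, 1, 3072, 1, 1, 256))    # -> "model"
# An illustrative convolutional layer (55x55 input, 256 output maps, 5x5 kernel):
print(choose_parallelism(55, 55, 256, 5, 5, 256))   # -> "data"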


In the tech preview package we focus on convolutional neural networks, and perform data parallelism for convolutional layers and model parallelism for fully connected layers. This is aligned with the method proposed by Alex Krizhevsky in his paper.


A special thank you to Dipankar Das, Karthikeyan Vaidyanathan, Sasikanth Avancha and Dheevatsa Mudigere from the Parallel Computing Lab in Intel Labs, and Vadim Pirogov from Intel’s Software and Services Group. They continue to be the driving force behind the research and performance optimization work illustrated in this blog.


Exascale at SC15

Posted by Shekhar Borkar Nov 12, 2015

Well, it’s that time again, getting ready for Supercomputing 2015, a premier conference on HPC. The focus, for a while now, has been on achieving exascale performance by 2022. This won’t be easy; there are numerous challenges, such as energy efficiency, programmability, productivity, and reliability—just to name a few. All of these will be discussed at the conference by prominent researchers in the field. I am honored to participate in three panel and birds-of-a-feather sessions, to further discuss these challenges and potential opportunities.


The first panel is on Tuesday, Nov 17th, from 1:30 to 3:00 pm, discussing new approaches to computing. This panel will discuss what new computing paradigms look like in a challenging environment, with two proposed paradigms being quantum and neuromorphic computing. Does anyone know how and why neuromorphic computing works? Today’s theory of computing is age old, from Boolean logic (centuries) to the Turing machine (decades), and it took that long to put into practice. Same thing with quantum computing; is it really computing?


The second panel will discuss the future of memory technology and will meet Tuesday, November 17th, from 3:30 to 5 pm. We know that achieving exascale will require innovation in memory to deal with extremely large data sets. Many different technologies have been proposed to replace DRAM, with different tradeoffs of performance, energy use, cost and endurance. We’ll never have a perfect solution for everyone and need to find the right mix of DRAM and new memory technology solutions to drive to exascale and beyond.


The birds-of-a-feather session scheduled on Wednesday, from 1:30 to 3:00 pm, will once again discuss computing approaches for the future. This session will have a slightly different twist, taking in future devices such as tunnel FETs. But it shares essentially the same theme as Tuesday’s panel and will discuss the same very important issues! People are so eager to find a solution that they forget the time it takes.


I am looking forward to the conference, and the intellectually stimulating discussions and debates which we all enjoy.


To learn more about the upcoming conference visit: http://www.sc15.supercomputing.org/

A while back, I talked to a data center IT professional who made an interesting observation on one of the problems with current approaches to storage—specifically hard disk drive (HDD) failures.


“We never get out of rebuild,” he said. He went on to explain that his IT team was constantly dealing with the effects of failing hard drives. On a regular basis, drives would go down and an operation would kick in to automatically rebuild them. And while systems rebuilt the failed drives, infrastructure performance would suffer.


These reliability issues point to one of the reasons why data centers need to move away from hard disk drives, with an average mean time to failure of 30 years, and into the era of widespread use of reliable solid state storage technology, with a mean time to failure double that of HDDs or longer. But that’s just one reason. Another popular one is performance. HDDs offer on average 200-250 IOPS per drive with an average response time of 2-7 milliseconds, while today’s SSDs are more on the magnitude of at least 6,000 IOPS with an average response time of 100-500 microseconds, depending on the manufacturer. That is a 30x improvement in performance (IOPS) and an order-of-magnitude or better improvement in response time.


These dramatic performance gains make it possible to better utilize today’s multi-core processors and high-speed networking in the rest of the infrastructure.


What can this mean for data center applications?  Better performance for business processing applications, big data processing for data analytics, and faster processing of scientific and life science applications.  For virtual environments this means more efficient server and desktop consolidation, and better performing applications.  Overall better efficiency and performance in the data center means increased productivity and the capacity to handle more business and revenue.


And there is good news here. Next-generation solid state storage technologies are racing ahead as the new face of primary (hot/warm tiers) storage. We took a quantum leap forward with the arrival of NAND and flash memory. And now we are poised to take quantum leap No. 2 with the rise of persistent DRAM and non-volatile memory technologies, most notably 3D XPoint technology. These will become the building blocks for ultra-high performance storage and memory.

The new NVM technologies will wipe out the I/O bottlenecks caused by legacy primary (hot/warm tiers) storage architectures. A case in point: 3D XPoint technology, developed by Intel and Micron, is 1,000 times faster than NAND. NAND latency is measured in 100s of microseconds; 3D XPoint latency is measured in tens of nanoseconds. This is yet another magnitude faster.


And, better still, 3D XPoint technology has 1000 times the endurance of NAND and 10 times the density of conventional memory. Put it together and you have a unique balance of performance, persistency, and capacity—the characteristics we are going to need for the storage landscape of the coming years.


All of this means that storage can now keep pace with the speeds of modern multicore processors and next-generation networks, along with an ever-larger deluge of data. And, in another important benefit, with the move to data centers dominated by solid state storage with no moving parts, primary storage will become more reliable—and less of a headache for data center IT professionals.


This doesn’t mean you will have to throw out your traditional disk arrays. They will still have a place in the data center, although they will play a different role. They will be repurposed for non-primary storage.


For a closer look at next-generation NVM technologies, including the new 3D XPoint technology, visit www.intel.com/nvm.

In just a matter of days (November 14 and 15 to be precise) Intel will host its annual Intel® HPC Developer Conference for the builders and creators that hold high performance computing near and dear to their hearts. This year the conference will be held in Austin, just before SC15, celebrating all things supercomputing.


This year, the Intel developer teams are bringing top notch content and speakers to the event. Here are the top things that developers will experience at the Intel HPC Developer Conference 2015.


1. Representatives from supercomputing centres across the globe will dive into their projects, provide overviews of their progress and share learnings on topics including: machine learning on cryo-EM data; porting LHC detector simulations to Intel® Xeon® and Xeon Phi™ architectures; the benefits of leveraging software-defined visualization; and P-k-d trees – massive, low-overhead particle ray tracing.


2. The Intel developer ninjas will be on-site. Ok, it’s really the amazing people that are part of Intel’s elite Black Belt program, which recognizes people who go above and beyond in contributing to the Intel developer community and in helping our community with their projects. The Black Belts are at the show to help other developers get involved, to help make connections and increase collaboration, and to help answer questions drawing from their own years of expertise and experience. Access to this kind of talent and knowledge cannot be overvalued. Interested in the Black Belt program? More information is available at https://software.intel.com/en-us/blackbelt.


3. Intel is going to award students some amazing prizes. The Intel® Modern Code Developer Challenge invited students to improve the runtime performance of code that will be used in a simulation of interacting brain cells during brain development. Winners could receive a nine-week internship at CERN openlab, a tour of CERN openlab, or a trip to SC16! Intel, in partnership with CERN openlab, will announce winners on Saturday, November 14 at Intel HPC DevCon. To learn more about Intel’s partnership with CERN openlab, visit: http://openlab.web.cern.ch/.


4. Hands-on time. Intel is hosting open lab spaces throughout the Intel HPC Developer Conference, and attendees will have a unique opportunity to access and use a variety of software tools. Conference-goers will also be able to connect their systems to a training server with an Intel Xeon Phi coprocessor. There are special open lab sessions, led by Intel platform experts, that will focus on performance issues, tools for optimization, and building parallelization into code.


We’re looking forward to hearing about attendees’ experiences at the show. Tweet at us @IntelHPC and let us know. For more information about Intel HPC Developer Conference, check out the website: https://hpcdevcon.intel.com.
