
The Data Stack


Back when the concept of Big Data was young, it described the reams of data kicked up by social media and the Internet in general. But Big Data keeps evolving, and at IBM Impact 2014, we will focus on a new horizon for data management and analytics: the Internet of Things, where Big Data, the cloud, and new forms of data management and analytics meet up to deliver business intelligence.

 

Today’s big data is bigger than ever: the Internet of Things encompasses sensor data, machine data, and device-to-device communication to create a tidal wave of data that pushes the boundaries of current technologies to store, analyze, and make sense of it. For example, an average airplane carries more than 50,000 sensors, which create between 5 and 6 petabytes of data per flight. Consider the number of flights per day around the world and you begin to get an idea of the volume of data generated each day by the Internet of Things (#IoT).

 

We’re just now putting in place the tools to unlock the information and intelligence in these massive data flows. Intel and IBM are working together to leverage the explosion in data volume and mine it for knowledge that can support business decisions, build customer loyalty and reveal patterns and relationships to revolutionize our understanding in such areas as healthcare, national security, and climate change.  Specifically, Intel brings to the table leadership performance on IBM WebSphere Application Server* with Intel® Xeon® processors, Intel® SSDs & NICs. Also, our Intel® Gateway IoT appliance team has partnered with the IBM Cloud team. One specific demo will highlight #IoT on a custom collectible car!

 

Intel is proud to sponsor IBM IMPACT 2014 and I encourage you to attend the following breakout sessions led by Intel and IBM industry technologists, along with our Internet of Things demonstrations in the EXPO. Each of our sessions will include a drawing to win an Intel® Solid-State Drive!

Please seek me out at the conference as I tweet man-on-the-street impressions so we can exchange tweets, MTs and RTs. Follow me and Intel’s Big Data community at @TimIntel and the Intel IT Center.

 

I look forward to seeing you in Las Vegas!

I recently made my third trip to World Hosting Day (WHD) in Germany, a gathering I always look forward to. The content and audience are consistently exceptional, and even with the long journey the event is always a valuable experience.

 

WHD provides an opportunity to meet with leading and emerging hosting and cloud service providers and gain a better understanding of the trends and requirements shaping their future build-out plans. That, in turn, helps me shape Intel’s cloud technology strategies so those providers can reach their goals more efficiently.

 

In my keynote, I talked about how Intel sees the future of the cloud data center. With today’s growing numbers of users, services and data types, building a highly efficient and seamlessly scalable infrastructure is no longer optional; it is now a critical competitive requirement.

 

During WHD, I talked with many CSPs about re-architecting the data center to create an IT environment that is more flexible, easier to manage and designed to accelerate service deployments. Software-defined infrastructure is a key to addressing this challenge, and Intel is committed to delivering the rack scale architecture (RSA) and other technologies to make SDI a reality.

 

In my conversations with CSPs, it was also great to learn more about how they are expanding their portfolios to address new workloads by deploying Intel® Atom™ C2000-based services. I shared the keynote stage with one of these forward-looking service providers, 1&1, one of Europe’s largest CSPs, which is using Atom C2000-based services.

 

During WHD, I also had the pleasure of announcing nine additional CSPs who are joining the Intel Cloud Technology program: Joyent (US), GoGrid (US), ScaleMatrix (US), Cirrity (US), Cloud Provider (Netherlands), Nine Internet (Switzerland), Mobily (Saudi Arabia), Wortmann AG (Germany) and Sakura Network (Japan). At last count, that makes 25 companies now branding Intel products in their services—and you can expect to learn about more such companies in the coming months. Want to see how services from all of these providers (and many more) compare against each other? Go to www.intelcloudfinder.com – your one-stop shop for comparing the world’s leading cloud services and finding the best service to match your specific needs.

 

There’s a reason for all this momentum behind the Intel Cloud Technology program. Companies understand the benefit of giving their customers hardware transparency so they can select the right performance for their needs: a cost-optimized service for lighter use, or the latest and most powerful for more demanding workloads. Customers can also benefit from choosing a service with specific Intel capabilities made available to optimize workloads. Find out more about all the participating service providers and Intel Cloud Technologies at www.intelcloudfinder.com/IntelCloudTechnology.

*Please note this is a guest post from COMSOL AB:



Numerical simulation is the third pillar of scientific discovery. Here's how COMSOL Multiphysics® simulation software takes advantage of hybrid parallelism on Intel® multicore processors and HPC clusters.


State-of-the-Art Technology and Beyond


Alongside theory and experiment, numerical simulation has established itself as the third pillar of scientific discovery. It provides an essential tool for modeling the physical behavior of complex processes in industry and science. To fully simulate real-world applications, many different and interdependent physical phenomena often need to be considered.


Multiphysics simulations help engineers and scientists to develop safer cars, design more energy-efficient aircraft, search for new energy sources, optimize chemical production processes, advance communication technologies, and create new medical equipment. While providing a cost-efficient and flexible tool for simulating the physical behavior of real-world processes, multiphysics simulation places high demands on compute power and memory resources.


In the multicore decade, hardware has turned parallel even on desktop computers, so parallelism is of paramount importance. A vital feature of compute-intensive software is the ability to scale up to hundreds and thousands of cores. COMSOL Multiphysics® is ready to exploit shared memory and distributed memory parallelism at the same time: hybrid parallel computing.


Let's dive deeper into the specifics of this type of computing.


Living in a Hybrid Parallel World


While the integration density of silicon chips keeps on growing, the clock frequencies have stagnated. Additional transistors are now used to pack more and more complex cores onto a single die. The latest multicore incarnations of the classic in-socket CPU types have more than 10 cores, such as the recently introduced 15-core Intel® Xeon® processor E7-8890 v2 (Ivy Bridge).


For the programmer, the tide has turned in such a way that she now needs to address additional levels of parallelism. Modern software needs to account for core level parallelism, parallelism between sockets on a shared memory node, and parallelism between nodes in a cluster. This can be boiled down to shared memory and distributed memory parallelism. For illustration, consider a small cluster with six distributed memory processes (MPI processes) assigned to three nodes, below. Each process uses shared memory across four cores.



comsolpic1.png

Configuration of a hybrid cluster with three nodes connected by a network, two sockets per node, one quad core processor per socket, and one MPI process with four threads per socket. Image credit: COMSOL, Inc.


 

When it comes to the algorithms, we also need to think about data parallelism and task parallelism. The overall goal of parallel execution is to perform more work per time unit and thereby increase user productivity. The user can then either solve the same problem in a shorter amount of time (i.e. she can run more simulations per day) or she can use additional resources for solving even larger problems in order to obtain more accurate results with better resolution.   


Data and Task Parallelism in Numerical Simulation


Numerical simulation, in large part, relies on uniform loop-based operations on huge matrices and vectors. Consider, for instance, an iterative solver for solving a linear system of equations (LSE) for several million degrees of freedom (DOFs). The solver can be put together by vector additions, matrix-vector multiplications, and scalar products. In a parallel iterative solver, all these routines run in parallel.


When parallelizing the kernels, you find that blocks of data can be local to one thread or to a group of threads. So, not only can work be divided, but data arrays can also be broken into distinct blocks that can be kept in different memory locations. The distribution of matrix blocks and the division of loop iterations is known as data parallelism.


In contrast, you can also imagine a case where the LSE to be solved depends on a parameter. Instead of a single LSE, you might then have to solve hundreds of LSEs. Of course, these tasks (solving the LSEs) can be processed in parallel as well, and this kind of parallelism is called task parallelism.


However, there are also algorithms that do not contain any kind of parallelism due to dependencies between intermediate results. These sequential parts are known to limit the achievable speedup both in theory (Amdahl's law) and in practice.


Shared Memory Parallelism: A Global View of a Local Part


Shared memory parallelism is based on a global view of the data. A shared memory program typically consists of sequential and parallel parts. In the parallel parts, a fork-join mechanism can be used to generate a team of threads that are taking over the parallel work by sharing data items and computing thread-private work. Communication between threads is accomplished by means of shared data and synchronization mechanisms.


For the user, it is important to know that every desktop computer nowadays is a shared memory parallel computer due to the multicore processor(s) under its hood. However, she needs to know that her resources are limited.


Typically, the problem size will be limited by memory capacity and the performance will be limited by the available memory bandwidth. For additional resources, you would need to add more computers or shared memory nodes. To this end, shared memory nodes are interconnected by fast networks and make up a cluster. For cluster type systems, distributed memory parallelism is needed and hybrid parallelism needs to be taken into account for better performance.


Distributed Memory and Hybrid Parallelism: A Discrete View of the Whole Ensemble


For distributed memory computing, the data has to be divided and assigned to distributed memory locations. This requires considerable changes in the algorithms and programs.


Remote data items cannot be accessed directly, since they belong to different memory spaces managed by different processes. If data in other blocks is needed, it must be communicated explicitly between the distributed memory processes via message passing. Common patterns are two-sided communication with one sending and one receiving process, or global communications (e.g. all-to-all communication). The additional communication requires further time and resources, and should be kept at a minimum. This also requires new and improved algorithms that focus on data locality and reduced communication. Inside every process, shared memory parallelism can be used in order to fully exploit the available resources.


Due to the hybrid configuration of modern hardware, a single programming and execution model is not sufficient. There are structural differences in communication between two threads on the same socket and two processes on different nodes. The hybrid model reflects the actual hardware in more detail and provides a much more versatile and adaptable tool to express all necessary mechanisms for good performance. It combines the advantages of a global and discrete view to the memory. Most importantly, the hybrid model helps to reduce overhead and demand for resources.


Next, we will show you a benchmarking example to illustrate the benefits of hybrid simulations.


A Hybrid Scalability Example


The scalability of hybrid numerical simulation with COMSOL Multiphysics® is exemplified with a frequency-distributed electromagnetic waves model representing a balanced patch antenna, for which the electric field is simulated. We use a small Gigabit Ethernet-connected cluster containing three nodes, each with two quad-core Intel® Xeon® E5-2609 processors and 64 GB RAM, for a total of 24 cores.



comsolpic2.png.jpg

The electrical field of a balanced patch antenna. The distributed frequency model has 1.1 million DOFs and was solved with the iterative BiCGStab solver. Image credit: COMSOL, Inc.

 

 

Our study compares the number of simulations that can be run per day for a number of processes ranging from one to twenty-four and a number of threads per process varying from one to eight. You can see our results in the graph below. Each bar represents an (nn x np)-configuration, where nn is the number of distributed memory processes, np is the number of threads per process, and nn*np is the number of active cores.

 

The graph shows a general performance increase with the number of active cores. For the full system load with twenty-four active cores, the best performance is obtained for one distributed memory process per socket (i.e. six processes in total). The performance and productivity gain on this small system with a hybrid process-thread configuration (case 6×4) is more than a factor of four over a single shared memory node (case 1×8). The hybrid 6x4 configuration is also almost 15% better than the purely distributed case with twenty-four processes (case 24×1).


comsolpic3.png

Benchmarking the electromagnetic wave model using different process x thread configurations in a hybrid model. The y-axis indicates performance in terms of the total number of simulations that can be run per day. The bars indicate different configurations of nn x np, where nn is the number of distributed memory processes and np is the number of threads per process. Image credit: COMSOL, Inc.


 

For additional reading and further benchmark examples for shared and distributed memory, hybrid computing, batch sweeps, and details on how to set up hybrid parallel runs in COMSOL Multiphysics®, check out the hybrid modeling series on the COMSOL® Blog.



About the Authors


Jan-Philipp Weiss received his diploma degree in mathematics from the University of Freiburg, Germany, in 2000 and a PhD in applied mathematics from the Technical University of Karlsruhe, Germany, in 2006. From 2008 until 2012, Jan-Philipp headed a shared research group with Hewlett-Packard on numerical simulation for multicore technologies at the Karlsruhe Institute of Technology.


Pär Persson Mattsson received his Master's degree, with a major in applied mathematics and a minor in informatics, from the Georg-August-University Göttingen, Germany, in 2013.


Intel and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. COMSOL and COMSOL Multiphysics are trademarks of COMSOL AB.

For the first time, Red Hat Summit, the premier Open Source technology showcase, will take place on the West Coast, at San Francisco’s Moscone Center from April 14-17. It’s exciting to gather with the Open Source community of system administrators and enterprise architects to hear about the latest achievements in the Open Source tech universe.

 

Intel and Red Hat are key pioneers in bringing Open Source Linux to the enterprise, providing an established foundation for mission-critical workloads and cloud deployments based on the Red Hat Enterprise Linux* operating system and Intel® Architecture.

 

Doug Fisher, VP and general manager of Intel’s Software and Services Group, will lay out Intel’s vision for a Software Defined Infrastructure in his keynote address on April 15, 9:30-10am. SDI encompasses compute, network, and storage domains and leads the way toward more agile and cost-effective data center architectures. Doug’s keynote will also focus on how Red Hat and Intel are addressing the architecture challenges posed by the explosive growth in data and connected devices.

 

Intel is also sponsoring a number of sessions and will participate in several panel discussions. Here are a few highlights – you won’t want to miss them:

 

  • Tuesday, April 15, 1:20-2:20pm: Empowering Enterprise IT. What’s next in data center efficiency and agility? Join Jonathan Donaldson, general manager of Intel’s Software Defined Infrastructure Group, for a discussion of how Intel and Red Hat are uniquely poised to take advantage of the changing face of IT infrastructure.

It will be a great show, to be sure. Please share your thoughts about Red Hat Summit and Intel!

 

For the latest on data center optimization, follow me at @RobShiveley.

By Shannon Poulin: Intel VP Data Center Group, GM Datacenter and Enterprise IT Marketing



Around the world, IT is facing significant pressure to be more responsive to the business: to deliver new services quickly and become much more efficient and cost effective, while at the same time helping make sense of mountains of data to speed decision making and enable new insights and discoveries. These challenges are particularly acute in rapidly growing China, where technology leaders and hardware and software developers are gathering this week for the Intel Developer Forum (IDF14) in Shenzhen.

 

Among its many attractions, IDF provides a place where developers come together to collaborate, network and accelerate innovation while gaining insights into the new advancements Intel is making to address key challenges facing IT. From cloud platforms to solutions for high-performance computing (HPC) and big data analytics, Intel is collaborating with a broad range of partners and customers to re-architect the data center and make it more efficient, automated, and agile.

 

In her keynote address at IDF Shenzhen, Diane Bryant, senior vice president and general manager of the Intel Data Center Group, highlighted technology collaborations ranging from the world’s #1 supercomputer to projects that significantly improve transportation and healthcare in China. Here are a few notable examples of Intel’s collaboration in China.

 

Enabling breakthrough new discoveries


China’s National Supercomputing Center in Guangzhou hosts the #1 supercomputer on the current TOP500 list, the “Milky Way-2” (Tianhe-2) system, which relies on industry-leading Intel technologies. The massive system includes 48,000 Intel® Xeon Phi™ coprocessors and 32,000 Intel Xeon processors and operates at a peak performance of 54.9 petaflops—or 54.9 quadrillion floating point operations per second. This kind of performance will enable new discoveries and scientific breakthroughs, such as improved weather prediction and advancements in life sciences, among others.

 

Improving transportation & healthcare via big data analytics


There are about 10 million trucks on the road in China, where the overall cost of logistics is fairly high due to a lack of efficiency in the operation and routing of trucks. The City of Zhengzhou is now addressing its truck-related challenges with a cloud-based transportation solution that takes advantage of Intel technologies.

 

The system links telematics data from trucks to a central data center to improve efficiency through better routing of trucks. The system uses real-time traffic data and monitors the safe operation of vehicles. The goal is to connect 50,000 trucks to the system by the end of 2014.

 

In another big data implementation, the Shanghai City Government is using an Intel-based Hadoop* cluster to process 16 million new records created every day.  The goal is to improve the health of citizens through enhanced quality of healthcare services, improved utilization of hospital resources and chronic disease diagnosis and treatments, while supporting social security and medical insurance reform.

 

Re-architecting the data center


You can’t achieve gains like those highlighted above with yesterday’s approaches to the data center. We are in a new era for computing that requires us to rethink our approaches to the data center. Intel is taking a leadership role in this re-architecting of the data center, collaborating with a broad range of customers and partners like those discussed above to meet the challenges of a global economy increasingly driven by digital services.

 

That’s the idea behind software defined infrastructure (SDI), an Intel industry initiative that is a focus of the discussions about datacenter innovation this week at IDF. SDI changes the game, moving data center infrastructure from static to dynamic, and from manual to automated. In simple terms, SDI creates a software layer that automates the allocation of infrastructure resources, be they servers, storage, networking or memory, and that takes advantage of telemetry data from the underlying hardware, such as performance, power, and security, to optimally provision these resources. Ultimately this will make data centers more agile, efficient, and responsive, enabling fast, cost-effective delivery of new digital services. This includes rack-level innovation, re-defining how racks are designed and deployed to enable software defined infrastructure.

 

I am excited about the opportunities we have working closely with the ecosystem in China to continue to advance innovation to solve significant challenges through the use of Intel technologies.

 

You can see a replay of Diane’s presentation at IDF Shenzhen here. To learn more about Intel’s efforts to further the evolution of the data center, visit our cloud, HPC, and big data analytics sites.

 

Follow @IntelITS on Twitter for more.

Jay Kyathsandra is the product marketing manager for Intel Rack Scale Architecture.



For years, people have been talking about the potential of rack scale architecture (RSA), a logical architecture with pooled and disaggregated computing, networking and storage resources, plus software that enables composing usage-specific systems. The key motivation for RSA is to increase data center flexibility, enable greater utilization of assets, and drive down the total cost of ownership for IT infrastructure. Today, the talk is turning to action.


We now have a functioning Intel RSA prototype solution and a well-defined path forward toward the day of RSA adoption. At Intel, we are working on a complete reference architecture that will allow a range of implementation choices for OEM providers and organizations that want to build their own RSA solutions.

 

I am in Shenzhen, China this week talking to our customers and partners about rack-level innovation at IDF Shenzhen. Intel is working closely as a technical advisor to Project Scorpio, a collaborative effort among Tencent, Alibaba, Baidu and China Telecom to develop the next-generation rack scale architecture. Intel is also collaborating with several key end users and system builders in China to build an RSA solution that meets local needs.

 

The case for RSA


So why is RSA enjoying so much momentum? Through its ability to disaggregate the components of the server rack, RSA provides the foundation for software-defined infrastructure (SDI) and the next-generation data center.

 

In this new era for the data center, the server, storage, and networking components in a rack are turned into a set of pooled and disaggregated resources. Hardware attributes are exposed upward to a software layer, where provisioning takes place. The software composes a system based on the requirements of a specific application or service definition. SDI is very much a world where the application defines the system.

 

Among other benefits, disaggregation will help data center operators increase the flexibility of IT infrastructure, improve asset utilization rates, and reduce TCO. Much of the market is motivated by the success of the mega data center operators, who are driven by the need for hyper-efficiency, performance and flexibility, which together enable agile service delivery at the lowest possible total cost of ownership.

 

Enterprises want to leverage public cloud operator efficiencies but in a completely different application and requirements environment.  The disaggregated data center will help them achieve capital efficiencies and, more importantly, enable the real-time matching of the infrastructure to meet the requirements of dynamic workloads and applications.

 

The elements of RSA


At the hardware layer, the Intel Rack Scale Architecture has four main pieces: flexible compute nodes, POD-wide (multi-rack) storage, a low-latency, high-bandwidth Ethernet fabric, and modular POD management.

 

Intel Rack Scale Architecture will include a suite of innovative technologies based on Intel® Xeon® processors and the Intel® Atom™ system-on-chip (SoC) processors. These components will power servers, storage, and networking in the disaggregated rack. Intel Ethernet switch silicon will enable distributed input/output.

 

In addition, Intel Rack Scale Architecture will include the new Intel photonic architecture, based on high-bandwidth, Intel® Silicon Photonics Technology. Compared to today’s copper-based interconnects, this technology enables fewer cables, increased bandwidth, farther reach and extreme power efficiency.

 

At the management and provisioning layer, Intel RSA incorporates firmware, APIs, and software to access the pooled resources across multi-vendor systems, manage the logical assets, and enable IT orchestration and SDI solutions.

 

Moving forward


We’re at the point where people are excited about the concept of RSA. Now the focus is shifting to what it will take to leverage this new approach to the data center. We have demonstrated a prototype solution. The next step is to put a complete reference architecture in place, so that a broad range of OEMs and end-user data centers can get down to the business of building and implementing RSA solutions.

In March, Chip Chat continued archiving HP Discover episodes, with episodes on the telco industry and SSDs, as well as episodes from partners including McAfee, VMware and Dell. We also celebrated our 300th episode, talking with the City of Barcelona about its IT infrastructure. And the Digital Nibbles Podcast checks in with some fascinating episodes covering cloud metrics, the IaaS/PaaS debate, OpenStack, and software-defined storage. As always, leave a comment on this post if you’d like to suggest a podcast topic for either Chip Chat or Digital Nibbles.

Intel® Chip Chat:

 

  • Managing a Smart City with Intel® Xeon® Processor E7 – Intel® Chip Chat episode 300: Eduard Martín Lineros and Enrique Félez Zaera from the City of Barcelona stop by to talk about using the Cisco Unified Computing System based on the Intel® Xeon® processor E7 to support their Smart+Connected City Solutions, as well as to transition the city to a cloud-based IT delivery model. They were looking for a system that offered scalable performance with advanced reliability and data analytics for the most business-critical applications. For more information, visit www.youtube.com/watch?v=TCbvxb5t5_8.

 

 

  • Live from HP Discover: The Storage Industry and SSDs – Intel® Chip Chat episode 302: In this archive of a livecast, Andreas Schneider and Shaun Rasmussen from Intel stop by to talk about trends in enterprise storage (the explosive growth of data storage, software defined storage, and when to use hot/cold/warm storage), the benefits of Intel® Xeon® and Atom™ processors, and the use of SSDs in the storage tier. For more information, visit www.intel.com/storage.

 

  • Security and Mobile Devices – Intel® Chip Chat episode 303: In this archive of a livecast from a recent trip to Mobile World Congress, Lianne Caetano, the Director of Mobile Product Marketing for McAfee, stops by to talk about the evolution of McAfee Mobile Security (now available on Google Play and Apple iTunes for free), and what consumers can do to protect themselves from malware and identity theft on various devices. For more information, visit www.mcafeemobilesecurity.com or www.intelsecuritygroup.com.

 

  • Virtual SAN*: A New Storage Tier with VMware – Intel® Chip Chat episode 304: Alberto Farronato, the Director of Product Marketing for Cloud Infrastructure Storage and Availability at VMware, stops by to chat about the recent launch of the VMware* Virtual SAN* (VSAN), which provides a software-defined storage tier that pools compute and DAS resources through the server hypervisor. By clustering server direct-attached HDDs and SSDs, VSAN creates a distributed, shared data store at the hypervisor layer that is designed and optimized for virtual machines. For more information, visit www.vmware.com/now.

 

  • Live from HP Discover: The EMEA HP Intel Solution Center – Intel® Chip Chat episode 305: In this archive of a livecast, Fabienne Chevalier, the manager of the EMEA Industry Solution Center at HP, stops by to talk about the joint solutions center from HP and Intel. They hold over 100 workshops a year to facilitate the adoption of innovative solutions based on Intel and HP technologies. Customers can bring in their needs and get a deep dive into solutions and features during the small-group format training. This year’s workshops will highlight big data analytics and NFV. For more information, visit www.hpintelco.com.

 

  • The Dell* PowerEdge* R920 and Intel® Xeon® E7 v2 Processors – Intel® Chip Chat episode 306: Lisa Onstot, a Server Platform Marketing Director at Dell, stops by to talk about the Dell PowerEdge R920 server featuring the recently-launched Intel® Xeon® processor E7 v2. The R920 server has been built specifically for enterprises to facilitate quick data access – architected with the massive memory capacity needed to accelerate large, mission-critical applications, as well as high-performance databases. For more information, visit www.dell.com/poweredge.

 

Digital Nibbles Podcast:

 

  • Cloud Metrics and Software-Defined Storage – Digital Nibbles Podcast episode 53: A couple of excellent conversations this week on cloud performance metrics and software-defined storage. First up, returning guest Paul Miller (@paulmiller), an analyst with Cloud of Data, chats about transparency and conflict of interest in the consulting market, as well as cloud benchmarks and what they mean for enterprise workloads. Then Bernard Harguindeguy (@atlantisilio), the CEO of Atlantis Computing, discusses optimizing virtual machines (VMs) and how they interact with storage, resulting in less storage traffic on a network and a performance boost for apps.

 

  • OpenStack for Private Clouds and the IaaS/PaaS Debate – Digital Nibbles Podcast episode 54: This week’s first guest is Chris Kemp (@kemp), the founder of Nebula and NASA’s first CTO. He’s here to talk about OpenStack, which he co-developed, and his company’s piece of hardware that can turn servers into their own private cloud, helping enterprises efficiently run their infrastructure. Then JP Morgenthal (@jpmorgenthal), the director of the Cloud Computing Practice at Perficient, chats about the blurring of IaaS and PaaS and where Perficient fits in.

At this year’s Interop show, Intel and many of our closest friends, including end customers, system companies, and optical companies, announced support for the 100G CLR4 Alliance. Behind the techy name are some very important issues that this Alliance is trying to address.

 

The first issue is growth. For any industry to expand, there need to be industry-wide specifications or standards that ultimately make things work better together and, more importantly, drive down costs. At Intel, we have worked with the industry many times and have helped enable countless standards in the PC and data center markets that have allowed for explosive growth in these industries.

 

Second, photonic communication is a great way to move data. Optical has the known benefits of moving data farther than electrical links, transmitting data faster, and not being affected by EMI. As we move from 10Gbps signaling to 25Gbps signaling, optical communication becomes even more important.


Need for longer reach in the data center


Data centers are becoming massive in scale, requiring longer and longer reaches for connectivity. This leaves an enormous opportunity to bring high-speed, low-power, optical links that can span up to 2 kilometers in modern data centers operating at data rates up to 100Gbps. That’s more than 20 football fields!

 

Yes, there are telecom centric optical transceivers today operating at 100Gbps, but their power, size and costs are non-starters for the new data center. Thus, there is a huge gap that needs to be filled for reaches that span from say 100m to 2km. And that’s the problem we are trying to address here.

Intel, along with Arista and many others, decided to form an open industry group to rally around a specification that addresses the “up to 2km” data center reach. This approach leverages much of the work done in the standards body but focuses on moving fast. The 100G CLR4 mission is to create an open, multi-vendor specification for a cost-effective, lower-power, small form factor optical transceiver using duplex single-mode fiber, which reduces fiber count by 75 percent. (See the key attributes of the proposed solution in Figure 1 below.)

 

 

In the span of a few weeks, we have already received overwhelming support for the specifications above.  Here is a list of preliminary supporters for this new alliance.

 

 

One of the key voices behind the 100G CLR4 Alliance is Andy Bechtolsheim, founder, chairman, and chief development officer of Arista Networks.

“Arista is excited about the 100G CLR4 Alliance,” said Andy. “We need to accelerate the time-to-market for cost-effective, low-power, 100G CLR4 QSFP form factor optics that address the 2km reach requirements of large data center customers. We believe an open multivendor effort is the best way to bring this to market.”

 

Stay tuned for more updates

 

In the meantime, you can find more information on the news, Intel Silicon Photonics and MXC technology here.

 

From humble beginnings—raised by parents who emigrated to the U.S. from Italy in 1957—Intel Fellow Mario Paniccia was named in 2008 “one of the world’s most recognized corporate researchers,” according to R&D magazine.

I’ve always been told that change in the telecom industry takes a long time. Looking back, it took three to ten years for some major transitions, like circuit-switched calls to VoIP, proprietary form factors to ATCA, and home-grown OSes to carrier-grade Linux*.


By contrast, network functions virtualization (NFV) is on a speedy trajectory all its own. In less than a year and a half, TEMs starting from the ground up already have NFV-based systems in field trials around the world.


Why so fast?

 

Market forces are playing a critical role in the rapid rollout of NFV-based solutions. The explosion of network traffic coming from video and other types of data has severely disrupted the revenue model for service providers. The status quo of networks architected with proprietary hardware appliances has proven to be too costly and inflexible. To reverse these trends, the industry is embracing NFV using standard IT server and virtualization technologies that will accelerate service innovation and lower CapEx and OpEx. Fueling this effort, seven of the world’s leading telecoms network operators initiated an ETSI Industry Specification Group for NFV that was quickly joined by over 150 other network operators, telecoms equipment vendors, IT vendors, and technology providers.1,2

 

It’s all about virtualized network functions


The exhibition hall at Mobile World Congress (MWC) 2014 was filled with NFV-based proof of concepts (PoCs), most of which showcased virtualized network functions from multiple vendors – all running on the same system. Please read further in the newsletter for specific examples.


Disruption creates opportunities


With every disruption, there’s a chance for companies to reset themselves, and NFV is no exception. NFV enables Tier 2 and 3 companies to gain market share because software-based functions are perhaps 10 times faster to deploy than their incumbent hardware counterparts. The time to market advantage is real, especially since there’s no specialty silicon (e.g., ASICs) or purpose-built boxes to design, test, and install in the network.

 

Monumental pace of adopted innovation


It’s mindboggling that network architecture has been turned upside down in such a short span of time. There’s no doubt that enterprises and data centers set the stage for this network transformation by optimizing standard IT technologies to the point where the cost and flexibility advantages are abundantly evident. What’s less obvious is the magnitude of innovation that NFV will unleash. Buckle your seat belts!

 

 

Jim St. Ledger is Software Product Line Manager in the Intel Communications & Storage Infrastructure Group.

 

 

 

1"Network Functions Virtualisation: An Introduction, Benefits, Enablers, Challenges & Call for Action," published October 22-24, 2012 at the "SDN and OpenFlow World Congress", Darmstadt-Germany. This white paper is available at the following link: http://portal.etsi.org/NFV/NFV_White_Paper.pdf

2Source: http://www.etsi.org/technologies-clusters/technologies/nfv

*Other names and brands may be claimed as the property of others.

The digital universe continues to grow at an explosive pace. In the next two years alone it is expected that more than 3 billion individual users will be connected to the Internet. During this period, the amount of new data created will double while the amount of mobile data consumed will grow by 11x.

 

The infrastructure serving this explosive growth is facing some daunting challenges. Data trends show that for every 400 new smartphones, one additional server is required in the backend data center. The trend continues with tablets, where every 100 tablets account for an additional server. With these new servers and services comes explosive growth in the amount and types of storage required for all this new data. Throw in the additional switch ports and the new security issues exposed by this growth, and the full impact of the trend becomes clear.

 

How do you add all this backend capacity? How do you manage it? How do you make it work across multiple physical data-centers? How do you manage the lifecycle of these services and applications? Traditional process and management methods simply do not work in a time of unrelenting service growth.

 

To respond effectively to the service explosion, you need a new management framework that empowers users with self-service and a sense of boundless resources: a framework that gives your infrastructure managers the automation and speed to deliver the required breakthrough in operational efficiency. There is good news and bad news. The good: this framework is here today, and it goes by the name “cloud.” The bad: to remain competitive, it is no longer optional.

 

The launch of Citrix CloudPlatform 4.3 is a positive step in the right direction for organizations that want to capitalize on cloud to meet the challenges discussed above. The open source-based CloudPlatform is already one of the world’s most widely used cloud management platforms. This new version makes additional progress in enabling speedy user on-ramps to the cloud and diversity in infrastructure support.

 

While some cloud environments may support very narrow application sets, most cloud environments—private, hybrid, or hosted—need to support a wide diversity of applications. The new applications required by this explosion involve significant data processing, the delivery of an immersive user experience, intensive media processing, and enterprise application requirements.

 

Demands like these require a high-powered computing foundation, and that’s where Intel enters the picture. Intel-powered cloud computing delivers increased performance and lower total cost of ownership (TCO) through dynamic, efficient and workload-optimized infrastructure. Intel products enable the gamut of workloads. Based on the specific needs, a workload can be executed with the highest performance, predictability, and lowest TCO on an infrastructure based on the Intel® Xeon Phi™, Intel® Xeon® or Intel® Atom™ processor families.

 

It is the consistent programming model across Intel platforms, coupled with top-notch orchestration and management from solutions like Citrix CloudPlatform, that enables the future of cloud computing.

 

 

Intel, Intel Xeon and the Intel logo are trademarks of Intel Corporation in the United States and other countries. * Other names and brands may be claimed as the property of others.

Writing this blog post on St Patrick’s Day, I am reminded by the St Patrick’s Day Google Doodle that, amongst other things, St Patrick’s Day brings myths and legends to the fore, making it an ideal day to look at Oracle redo on SSDs, surely one technology topic that attracts a great deal of conflicting opinion. For example, the Oracle support document Troubleshooting: "log file sync" Waits (Doc ID 1376916.1) describes the following issue:

 

If the proportion of the 'log file sync' time spent on 'log file parallel write' times is high, then most of the wait time is due to IO … As a rule of thumb, an average time for 'log file parallel write' over 20 milliseconds suggests a problem with IO subsystem.

 

and includes the following as one of the recommendations:

 

  • Do not put redo logs on Solid State Disk (SSD)
  • Although generally, Solid State Disks write performance is good on average, they may endure write peaks which will highly increase waits on 'log file sync'

 

This seems unequivocal. Firstly, if your 'log file parallel write' time is on average (and not just at peak) greater than 20 milliseconds, you should tune your I/O; this I can agree with. However, “Do not put redo logs on Solid State Disk (SSD)” is worth investigating to see whether it is really based on established fact. After all, I have used SSDs exclusively for Oracle workloads for the last 5 years, starting with the Intel® SSD X25-E Series in 2009 and currently with the Intel® SSD 910 Series and DC S3700 Series. I have also worked with other storage for Oracle, but always an SSD-based solution. Not only do SSDs have excellent latency characteristics, and the endurance of modern Intel SSDs, measured in drive writes per day (DWPD), essentially eliminates concerns over wear; modern SSDs are also very, very good at data path protection, exactly what you need for high-intensity redo. Nevertheless, Oracle Support is specific on the issue it has identified: SSD write performance may be good on average, but it is “write peaks” that impact 'log file sync' times, and using the rule of thumb of an average 'log file parallel write' time over 20 milliseconds, we would expect these “write peaks” to exceed 20 milliseconds to cause such concerns. This is something we can test, to see if there is truth behind the myths and legends that always surround disruptive technologies.

 

Log File Sync and Log File Parallel Write

 

Stepping back a bit, it is worth clarifying the relationship between 'log file sync' and 'log file parallel write'. 'Log file sync' is the time a foreground session spends waiting, after issuing a COMMIT (or ROLLBACK), for the redo it has placed in the memory-resident redo log buffer to be flushed to disk to make the transaction permanent. 'Log file parallel write' is the time it takes for that redo to actually be written to disk by the background process. Prior to 12c the background process doing the writing was the log writer (LGWR); at 12c the 'log file parallel writes' are performed by a number of log writer workers (LGnn), with the LGWR elapsed time recorded under the 'target log write size' event. Historically the log writer and the foreground processes have communicated with a post/wait mechanism using semaphores (semctl); however, the more recent parameter _use_adaptive_log_file_sync, set to true by default since 11g, means that polling by the foreground processes may be used instead, with the method selected dynamically. What this means is that whereas 'log file parallel write' is the actual write to disk, 'log file sync' also includes the scheduling of, and communication between, the foreground and background processes, so on a busy system most of the time spent in 'log file sync' by a foreground process may not be spent waiting for the 'log file parallel write' by the background process. However, if there are significant “write peaks”, a number of foreground processes may be waiting in 'log file sync' for that one write to complete, exacerbating the elapsed time. Consequently, what we want to do is capture the time spent on 'log file parallel write' by the background processes on SSD storage and observe whether it will highly increase waits on 'log file sync'.
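Before doing any detailed tracing, the two events can also be compared at the system level. The following is a minimal sketch of my own (not part of the original test), assuming access to the v$ dynamic performance views:

-- Sketch: compare average 'log file sync' with 'log file parallel write'
SELECT event,
       total_waits,
       ROUND(time_waited_micro / 1000) AS time_waited_ms,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_wait_ms
  FROM v$system_event
 WHERE event IN ('log file sync', 'log file parallel write');

If the averages are far apart, most of the 'log file sync' time is being spent outside the physical write, which is exactly what the tests below set out to measure.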

 

Capturing Log File Parallel Write

To do this I used a 4 socket E7 v2 system to drive a significant level of throughput through Oracle 12c running on Oracle Linux 6.5.  It is worth noting that the E7 v2 includes the Intel® Integrated I/O  and Intel® Data Direct I/O Technology features and therefore the CPU latency aspect of I/O is further minimised.  For the SSD storage I used 2 x Intel® SSD 910 Series configured with Oracle ASM as per the post referenced here. Naturally I used HammerDB for the workload and configured the redo logs of a size to ensure log file switch checkpoint activity during the test.
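As a side note, before the run it is worth a quick check that the online redo logs really are sized to force log switches during the test. A simple sketch against the standard v$log view (my own addition; sizes will differ per system):

-- Sketch: check redo log group sizes and status before the test run
SELECT group#, thread#, bytes / 1024 / 1024 AS size_mb, members, status
  FROM v$log
 ORDER BY group#;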

 

I could list the Log Writer and Log Writer worker processes as follows:

 

oracle   127307      1  0 10:23 ?        00:00:00 ora_lgwr_IVYEXDB1

oracle   127312      1  0 10:23 ?        00:00:01 ora_lg00_IVYEXDB1

oracle   127316      1  0 10:23 ?        00:00:00 ora_lg01_IVYEXDB1

oracle   127320      1  0 10:23 ?        00:00:00 ora_lg02_IVYEXDB1

oracle   127322      1  0 10:23 ?        00:00:00 ora_lg03_IVYEXDB1

 

and begin a trace of each of the Log Writer Workers as follows, shown here for process ora_lg03_IVYEXDB1 from the listing above.

 

[oracle@ivyex1 ~]$ sqlplus sys/oracle as sysdba

SQL> oradebug setospid  127322;

Oracle pid: 21, Unix process pid: 127322, image: oracle@ivyex1.example.com (LG03)

SQL> oradebug event 10046 trace name context forever, level 8;

Statement processed.
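As an aside, the OS process IDs passed to oradebug setospid can also be looked up from inside the instance rather than from the operating system process listing. A minimal sketch of my own against v$process (12c worker names assumed):

-- Sketch: list the log writer and its 12c worker processes with their OS PIDs
SELECT pname, spid
  FROM v$process
 WHERE pname = 'LGWR' OR pname LIKE 'LG0%';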

 

I ran a test for 10 minutes with enough virtual users to drive the system to high CPU utilisation, generating a significant number of transactions. When it completed, I stopped the trace as follows before collecting the trace files from the trace directory:

 

SQL> oradebug event 10046 trace name context off;

Statement processed.

 

Looking in the trace file, we can see the timing of the 'log file parallel write' event that we are interested in, as shown below:

 

WAIT #0: nam='log file parallel write' ela= 501 files=1 blocks=58

WAIT #0: nam='LGWR worker group idle' ela= 39

WAIT #0: nam='log file parallel write' ela= 466 files=1 blocks=52

WAIT #0: nam='LGWR worker group idle' ela= 33

WAIT #0: nam='log file parallel write' ela= 368 files=1 blocks=54

 

Of course, one of the advantages of HammerDB is that with a scripted interface you are not restricted to simply running the pre-built workloads; you can run any workload you choose. Therefore, in the Script Editor window of HammerDB I entered the following to process the trace files, and ran it to extract the elapsed times into CSV format.

 

#tclsh
# Parse a log writer worker trace file and write the elapsed time (ela=, in
# microseconds) of every 'log file parallel write' wait to a CSV file, while
# counting how many writes exceed 1ms, 10ms and 20ms and tracking the maximum.
set filename "lg00.trc"
set filename2 "output00.csv"
set fid [open $filename r]
set fid2 [open $filename2 w]
set elapsed 0
set maxelapsed 0
set count 0
set overmilli 0
set over10milli 0
set over20milli 0
while {[gets $fid line] != -1} {
    if {[string match {*log file parallel write*} $line]} {
        incr count
        # ela= values in the 10046 trace are reported in microseconds
        regexp {ela= ([0-9]+) } $line all elapsed
        puts $fid2 "$count,$elapsed"
        if {$elapsed > $maxelapsed} { set maxelapsed $elapsed }
        if {$elapsed > 1000}  { incr overmilli }
        if {$elapsed > 10000} { incr over10milli }
        if {$elapsed > 20000} { incr over20milli }
    }
}
puts "max elapsed was [format "%.2f" [expr {double($maxelapsed) / 1000}]] milliseconds"
puts "[format "%.2f" [expr {double($overmilli) / $count * 100}]]% over 1 millisecond"
puts "[format "%.2f" [expr {double($over10milli) / $count * 100}]]% over 10 milliseconds"
puts "[format "%.2f" [expr {double($over20milli) / $count * 100}]]% over 20 milliseconds"
close $fid
close $fid2

 

The output from the script is summarised as follows (note that, as shown below, nearly all of the workload went through workers LG00 and LG01, while LG02 and LG03 were mostly idle):

 

Log Writer Worker   Over 1ms   Over 10ms   Over 20ms   Max Elapsed
LG00                0.56%      0.02%       0.00%       13.66ms
LG01                0.39%      0.01%       0.00%       13.44ms
LG02                7.96%      0.07%       0.00%       13.05ms
LG03                7.07%      0.08%       0.00%       13.00ms

 

I then verified that the output from the trace files corresponded with the output from v$event_histogram: over 99% of the redo writes completed in less than 1 millisecond, and the maximum elapsed write time was consistently around 13 milliseconds, accounting for an extremely small proportion of all the writes.

 

SQL> select wait_time_milli, wait_count from v$event_histogram where event = 'log file parallel write';

 

WAIT_TIME_MILLI WAIT_COUNT

--------------- ----------

            1    2460371

            2       6605

            4       1702

            8       1774

           16        726
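The percentage figure can also be computed directly from the histogram; this is my own sketch rather than part of the original test:

-- Sketch: share of 'log file parallel write' waits in the sub-millisecond bucket
SELECT ROUND(100 * SUM(CASE WHEN wait_time_milli = 1 THEN wait_count ELSE 0 END)
             / SUM(wait_count), 2) AS pct_under_1ms
  FROM v$event_histogram
 WHERE event = 'log file parallel write';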

 

Of course, how busy was the system? The AWR report shows a load average beginning at around 75 on 120 logical CPUs:

Host CPU

CPUs   Cores   Sockets   Load Average Begin   Load Average End   %User   %System   %WIO   %Idle
120    60      4         75.13                114.68             73.1    5.1       0.5    21.2

 

And a fairly busy redo rate of almost 340MB/sec. Given the redo per transaction and the redo per second (roughly 339.5MB/sec divided by 5,330 bytes per transaction, or around 64,000 transactions per second), it is clear that this system is processing tens of thousands of transactions a second and millions of transactions a minute.


Load Profile

                     Per Second       Per Transaction
DB Time(s):          160.0            0.0
DB CPU(s):           93.2             0.0
Redo size (bytes):   339,512,040.8    5,330.1

 

And the total time spent waiting for ‘log file sync’


Foreground Wait Events

Event           Waits        %Time-outs   Total Wait Time (s)   Avg wait (ms)   Waits/txn   % DB time
log file sync   26,850,803                38,253                1               0.70        39.68

 

was considerably greater than the 'log file parallel write' component, and therefore consistent with high-performance I/O, with most of the 'log file sync' time spent in communication between the foreground and background processes.


Background Wait Events

Event                     Waits       %Time-outs   Total Wait Time (s)   Avg wait (ms)   Waits/txn   % bg time
log file parallel write   1,958,168   0            763                   0               0.05        125.03
target log write size     977,630     0            574                   1               0.03        94.05

 


I then loaded the CSV files into a spreadsheet, highlighted the data and selected a scatter plot. Although they don't overlay perfectly on a time basis, the data is close enough to combine LG00 with LG01 and LG02 with LG03 respectively, noting that LG00 and LG01 were considerably busier workers than LG02 and LG03. The data for LG00 and LG01 is here:


lg0001.PNG


and here is the number of blocks written by LG00 alone, to relate write latency to write volume over time:


lg00blk.PNG


And to complete the picture the latency data for LG02 and LG03 is here:


lg0203.PNG

 

Clearly the latency is, as we would expect, strongly related to the amount of redo written, and on average the response time for redo writes is sub-millisecond even though we are running at a throughput of millions of transactions a minute and generating over 20GB of redo per minute. Where there are longer latencies, they are typically in the order of 12 to 13ms and affect only 0.01 to 0.02% of writes, so they are certainly not a factor that will highly increase 'log file sync' waits. Furthermore, cross-referencing with iostat data, even at this rate of redo the average disk utilisation was still only around 10%.

 

Running a 3 Hour Workload

 

Of course, maybe a 10-minute test is not sufficient, so I then ran the same configuration for 3 hours, generating (as shown below) 342MB/sec of redo and 4.7 terabytes in total.

 

Function Name   Writes: Data   Reqs per sec   Data per sec   Waits: Count   Avg Tm(ms)
LGWR            4.7T           11574.37       342.071M       49M            0.01
DBWR            1.5T           7746.17        110.681M       0

 

And the waits on 'log file parallel write'? Proportionally the same as before.

 

 

SQL> select wait_time_milli, wait_count from v$event_histogram where event = 'log file parallel write'

 

WAIT_TIME_MILLI WAIT_COUNT

--------------- ----------

            1   51521618

            2     165210

            4      36313

            8      36242

           16      18478

 

 

COMMIT_WAIT and COMMIT_LOGGING

 

There is one further test we can do, by setting the following parameters:

 

commit_logging   BATCH
commit_wait      NOWAIT

 

With these settings, redo continues to be generated; however, with commit_logging set to 'BATCH' the foreground process does not notify the log writer to write its redo immediately, and with commit_wait set to 'NOWAIT' the foreground process also does not wait for the log writer to notify it that the redo has been written. In other words, these settings remove the scheduling aspect of 'log file sync', so if the disk is not a bottleneck we should see throughput increase (noting that with batching the writes will be larger). Sure enough, this gives a significant increase in throughput, with Oracle now writing over 400MB/sec of redo at a CPU utilisation of 97%.
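For reference, this is roughly how those settings might be applied for a test run; a minimal sketch of my own (session scope shown, and only appropriate for testing, where losing the most recently committed transactions on failure is acceptable):

-- Sketch (not from the original post): relax commit durability for a test session
ALTER SESSION SET commit_logging = 'BATCH';
ALTER SESSION SET commit_wait = 'NOWAIT';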

 

                     Per Second       Per Transaction
Redo size (bytes):   431,002,118.4    5,336.1

 

In other words removing the waits for scheduling meant that the SSDs could easily cope with higher levels of throughput.

 

V$LGWRIO_OUTLIER

 

In one of my previous posts on redo on SSD I mentioned a new performance view in 12c called V$LGWRIO_OUTLIER. This view reports log writer writes that take over 500ms. I touched on it in passing but didn't investigate further, so I decided to take another look, and sure enough the new view had entries. This is not the only place where long writes to the log file are reported; a trace file is also generated when a write to the log file takes more than 500ms, for example:

 

Warning: log write elapsed time 780ms, size 4KB

 

So it would be reasonable to expect these to correlate, and for a trace file to be generated for each entry in V$LGWRIO_OUTLIER. It does not help that the documentation is not explicit about the units of the time value this view reports; however, from related output it seems reasonable to assume that the IO_LATENCY value is reported in milliseconds, so we should expect entries with values above 500. Additionally, the view does not report a timestamp for each entry, although the underlying x$ view does.

 

SQL> select view_definition from v$fixed_view_Definition where view_name ='GV$LGWRIO_OUTLIER';

 

The view definition points to the underlying x$ view, X$KSFDSTLL, and querying it ordered by timestamp shows the following result:

 

SQL> select IO_SIZE, IO_LATENCY, TIMESTAMP from X$KSFDSTLL order by timestamp;

 

   IO_SIZE IO_LATENCY  TIMESTAMP

---------- ---------- ----------

      32 2694340698 1394629089

      72 2695317938 1394629186

      28 2696135038 1394629268

      40 2696636948 1394629318

      64 2697984098 1394629454

      52 2698638788 1394629519

      68 2699724048 1394629628

      56 2699982768 1394629653

      24 2700330618 1394629688

      32 2752639988 1394634918

      72 2752946678 1394634949

      36 2753861848 1394635041

     108 2754161328 1394635071

      64 2754341738 1394635089

 

and if the value is in milliseconds, then 2754341738 milliseconds is approximately equivalent to one month, and the latency increases with every entry. Cross-referencing this against the trace data, the AWR report, the event histogram and the log write trace files, the only correlation appears to be with the timestamp. Just to be certain, after running the 3-hour test the outlier data showed 396 entries (note that V$LGWRIO_OUTLIER only shows the last 180), again all in ascending order.
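As an aside, the TIMESTAMP values look like Unix epoch seconds, which would place these entries in March 2014, consistent with when the tests were run. That is my assumption rather than anything stated in the documentation, but it is easy to sanity-check with a quick conversion:

-- Sketch: interpret the x$ TIMESTAMP column as Unix epoch seconds (an assumption)
SELECT TO_CHAR(TO_DATE('1970-01-01', 'YYYY-MM-DD') + 1394629089 / 86400,
               'YYYY-MM-DD HH24:MI:SS') AS ts_as_date
  FROM dual;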

 

SQL> select IO_SIZE, IO_LATENCY, TIMESTAMP from X$KSFDSTLL order by timestamp;

   IO_SIZE IO_LATENCY  TIMESTAMP
---------- ---------- ----------
        40 3471738158 1394706828
        44 3472415808 1394706896
        52 3472562658 1394706911
        16 3475418188 1394707196
        16 3475860828 1394707240
        40 3477258658 1394707381
        …
       128 3626810268 1394722336
       256 3627883738 1394722444
        20 3627919588 1394722447
       128 3628119698 1394722467
       320 3628620898 1394722517
       112 3629137298 1394722569

 

396 rows selected.

 

This is sufficient evidence for me to suggest that, for the time being, V$LGWRIO_OUTLIER on Linux should not be relied upon for measuring I/O latency, or at the very least that there is insufficient documentation to accurately interpret what it is meant to show.

 

Conclusion

 

So, in summary, should you put Oracle redo on SSD? If you want sub-millisecond response times coupled with high levels of data path protection, then surely the answer is yes. More to the point, a modern SSD can handle redo with such high throughput and low latency that the ‘log file parallel write’ component of ‘log file sync’, and therefore SSD write performance, is not the determining factor even on systems generating redo at very high rates. If it does appear to be the determining factor, I would recommend reviewing how to correctly configure Oracle redo on SSD before taking your own measurements.

If you work with mission-critical databases you won’t have missed the launch of the Intel® Xeon® Processor E7 v2 Family, with up to double the performance of the previous generation. Benchmarks can be useful in comparing and contrasting systems, and these show that the E7 v2 has up to 80% higher performance than IBM POWER7+ at up to 80% lower cost. Nevertheless, working directly with customers evaluating database performance, I often find that published benchmarks provide some but not all of the answers database professionals are looking for, including single- and multi-threaded performance, power efficiency, platform cost, software cost, reliability, and operating system and virtualization software choice.

 

As a consequence, and especially with a noticeable long-term decline in the frequency of published benchmarks, more and more customers have an engineering team get ‘hands-on’ with evaluating systems for themselves. It is therefore great to see a white paper from Principled Technologies that shows an approach to doing just that, illustrating how a system built on the Intel Xeon processor E7-4890 v2 is a much better value proposition than POWER7+ for running the Oracle database. The white paper shows that the Xeon E7 v2 system has 69% lower hardware purchase costs, up to 42% lower power consumption under load and 40% lower at idle, and 16% higher performance at equivalent load with twice the headroom to grow. All of this contributes to a 5.7x performance/watt advantage for the Xeon E7 v2.

 

More importantly for anyone taking the ‘hands-on’ approach, the white paper includes all the details required for any database engineer to run their own equivalent in-house evaluation, and, not forgetting SPARC, you can run equivalent tests against any system that runs Oracle or other databases. No one should have to accept benchmark data prepared by competitive analysts and published without full details of the system and database configuration.

 

I have to admit that, as the author of HammerDB (Intel enabled engineers to develop open source projects long before it was fashionable), the load testing tool used in the white paper, I especially like the open methodology, as it aligns with the intentions behind developing such a tool. Firstly, being open source, all of the code, right down to the drivers for proprietary databases, is published, empowering the user. If you don’t like the workload you are free to change it – you can even write a whole new workload if you wish (HammerDB’s conversion of Oracle trace files makes this easy) and then contribute that workload back to the database testing community.

 

Single-Threaded Performance

 

With this approach, you may have seen from previous blog posts that typically the first test I run is a PL/SQL CPU routine (or T-SQL on SQL Server) to test single-threaded performance and verify the system BIOS and operating system settings. Running this on an E7-4890 v2 gives me the following result:

 

Res = 873729.72

PL/SQL procedure successfully completed.

Elapsed: 00:00:07.88
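
For anyone wanting to run a similar first check, a minimal sketch of this kind of single-threaded PL/SQL CPU routine is shown below; the exact routine and iteration count behind my historical results may differ, so treat it as illustrative rather than a like-for-like comparison point:

SET SERVEROUTPUT ON
SET TIMING ON
DECLARE
  n NUMBER := 0;
BEGIN
  -- a single-threaded, CPU-bound loop: no I/O, so elapsed time reflects per-core speed
  FOR f IN 1..10000000 LOOP
    n := MOD(n, 999999) + SQRT(f);
  END LOOP;
  DBMS_OUTPUT.PUT_LINE('Res = ' || TO_CHAR(n, '999999.99'));
END;
/

A longer elapsed time than expected is usually the first clue that a BIOS power profile or operating system governor setting is holding the processor back.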

 

Comparing this with previous results shows good incremental gains, and if you look at the other results for Intel Core there are indications of further gains from future Xeon products following Intel’s Tick-Tock model. It would be great if you could fully test a database system in under 10 seconds; however, such a single-threaded test is only a first indicator, a small piece of the puzzle in evaluating database performance. Of course, such a test tells us nothing about platform scalability when running multiple threads. The historical results do show that single-threaded performance is improving in step with the Tick-Tock model, and you would expect better single-threaded performance from a processor that completes this test in a shorter time.

 

Multi-Threaded Performance

 

This is where HammerDB comes in. You can either run your own workload or one based on the specifications of industry standard benchmarks. The advantage of these specifications is that they are relatively straightforward to implement but, more importantly, they are designed to scale and have proven over time that they do scale; as long as the database software and system scale, you have a workload that scales as well. Importantly, it is the relative rather than the absolute performance that matters, and your aim should be to generate a performance profile. What this means is that you should test your system at ever-increasing increments of load until it is fully saturated. HammerDB includes a CPU monitor, and at the top of your performance curve, with a scalable workload and sufficient I/O, your CPU utilisation should look as follows:

 

[Image: orp1.png, CPU utilisation at the top of the performance curve]

 

With this reference data you then have the background information to measure and test a median load that you would expect from a typical production database, and this is the approach that the Principled Technologies white paper takes.

 

Power Consumption

 

As the white paper shows, another important metric is the power consumed by the server for a given level of performance. At data centre scale, power consumed is one of the most important metrics for server density. Starting with the processor, the power-related metric is called TDP (Thermal Design Power) and indicates the power dissipated at the operating temperature called Tcase; further information is available here. So although TDP will not give you all the information on peak CPU power requirements, it remains the most useful processor metric for comparison. If you want to know the max TDP for the E7 v2 family, the values are all published here, showing a max TDP ranging from 105W to 155W. Unfortunately for doing an evaluation, the same data is not published for the IBM POWER family; for SPARC it is published on the datasheet for the SPARC T4 at 240W but is omitted from the equivalent datasheet for the SPARC T5. Consequently, the Principled Technologies approach of measuring the entire server power both at idle and under load is ideal – a 5.7x performance/watt advantage translates into a significant advantage in server density for a data centre running E7 v2 compared to POWER7+.

 

Hardware and Software Cost and Choice

 

After having measured both performance and power, you can then evaluate cost. This means evaluating both the hardware acquisition cost and the software cost, which together comprise the TCO (Total Cost of Ownership). As the white paper shows, there is a 3.2x advantage for the E7 v2 system compared to the IBM system in hardware acquisition cost; however, a significant component of the TCO is the software license cost. Given that the Oracle Database license is calculated per core, and the IBM POWER system has 32 cores compared to the 60 on the Intel E7 v2 system, it is important to reference the Oracle Processor Core Factor Table. This shows that POWER7+ attracts a core factor of 1.0 compared to the 0.5 factor for the E7 v2, meaning that as well as offering higher performance, the E7 v2 system also has a lower Oracle software license cost, as the worked example below shows. Additionally, for those most sensitive to core-based costs, the E7 v2 family offers further options up to the E7-8893 v2, whose 6 cores running at higher frequencies under the same TDP allow choice in managing core-factor-based software acquisition cost. Furthermore, choosing E7 v2 means that the same industry standard platform is available from multiple vendors, supporting multiple operating systems such as Linux, Windows and VMware, relational databases such as SQL Server, PostgreSQL and MySQL, and Big Data technologies such as Hadoop, giving unparalleled technology choice, all helping to drive down acquisition cost and improve manageability by standardising systems and system administrator skills.
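
As a rough worked example using the core counts and core factors above (list prices vary, so this is illustrative only): the 32-core POWER7+ system requires 32 × 1.0 = 32 Oracle processor licenses, while the 60-core E7 v2 system requires 60 × 0.5 = 30 licenses, so despite having nearly twice as many physical cores the Intel-based system needs slightly fewer Oracle licenses.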

 

Reliability

 

Of course, some attributes are by their nature more difficult to measure than others. RAS (Reliability, Availability and Serviceability) is by definition harder to quantify, as it is harder to measure something that is expected not to happen (like a service outage) than something that does. Therefore, evaluating RAS for E7 v2 requires looking at the features of Intel® Run Sure Technology and the longer-term uptime data proving out reliability in the accompanying animation.

 

Conclusions

 

In this post we’ve looked at how the white paper from Principled Technologies illustrates a methodology for bringing platform evaluation in-house to compare database performance. We’ve seen how database benchmarks published along with full configuration information can provide useful indicators; however, the declining frequency of these publications and the availability of free and open source tools are driving the increased popularity of in-house testing to determine the ‘best-fit’ database platform, freeing information and enabling better data centre technology choices than ever before.

Over the past few days, there have been a number of quite exciting announcements in the optical world from Corning, US Conec, TE Connectivity, Molex and others in support of the new MXC™ connector technology. Before I get into what they announced, I should answer why I’d be excited about a simple connector.

 

First off, it’s because it’s not just a simple connector. It is a core building block for optical (or photonic) communications and will help define the way data centers are built in the future. Why should you care? Because it will help serve up all of what you love about the web, apps, and more much, much faster.

 

For example, one MXC™ cable can transmit data at 1.6 terabits per second (64 fibers at 25Gbps each). That’s 1,600,000,000,000 bits per second. If you could transmit data at that speed, you could download a two-hour HD movie from iTunes (4GB) in less than two seconds. With 2.5 quintillion bytes of data (a quintillion is 1 followed by 18 zeroes) created every day, I doubt anyone questions that we are going to need the higher bandwidth that MXC™ cables provide in our 21st century data centers.

 

To make this a reality, the industry needs to work together. This is where the exciting announcements come in. US Conec announced they are making parts for the connector, which they will sell to cable companies. Corning, Molex and TE Connectivity announced they would make cable assemblies using the MXC™ connectors. US Conec and Intel will also host an MXC™ adopters forum at OFC on March 11th, 2014, for current and future companies that want to use MXC™ cables. All of this together means that the ecosystem is lining up and momentum for this technology continues to mount.

 

Microsoft had the following to say about MXC™ technology:


“Microsoft is pleased to join the MXC Adopters Forum and looks forward to evaluating MXC based products,” said Kushagra Vaid, General Manager of Cloud Server Engineering at Microsoft. “We believe that MXC along with Intel® Silicon Photonics will be instrumental in shaping next generation high performance data center architectures. We look forward to working with Intel and open standards bodies like OCP to accelerate information sharing and industry adoption.”

 

For those new to this topic, MXC™ cables have several advantages over traditional optical connectors besides the 1.6Tbps bandwidth. MXC™ connectors have fewer parts, are more robust and smaller, can support 64 fibers, and feature a unique telescoping lens design that is 10 times more resistant to dust.

 

MXC™ connectors, coupled with Intel® Silicon Photonics, will enable many new data center innovations. For example, Fujitsu recently demonstrated an expansion box that increases the storage capacity of, and adds CPU accelerators to, its 1U server. In September, Intel demonstrated a new rack scale architecture (RSA) that, when used with MXC™ cables and Intel® Silicon Photonics, enables a totally new server architecture that increases performance and decreases cost. In the coming months, we expect to see more demonstrations and announcements about MXC™ and Intel® Silicon Photonics.

 

Finally, I wanted to say that we expect to see a few dozen companies attend our first MXC™ adopters meeting on March 11th at the Optical Fiber Conference.

 

2014 will be an exciting year so stay tuned!

 

You can get more information on Intel® Silicon Photonics and MXC technology at:
https://www-ssl.intel.com/content/www/us/en/research/intel-labs-silicon-photonics-research.html.

MXC is a trademark of US Conec

 

About Mario Paniccia

[Image: Mario Paniccia with a 1.6Tbps MXC cable (green) next to a 128Gbps PCI Express copper cable]


Dr. Mario J. Paniccia is an Intel Fellow and general manager of the Silicon Photonics Operations Organization. Paniccia joined Intel in 1995 as a lead researcher developing a novel optical testing technology for probing transistor timing in microprocessors. Paniccia started Intel's research in the area of Silicon Photonics and currently leads the business unit driving silicon photonics product commercialization, which includes engineering, business, and strategy as well as defining future generation products.

 

Scientific American named Paniccia one of 2004's top 50 researchers for his team's leading work in the area of silicon photonics. In October 2008 Paniccia was named "Scientist of the Year" by R&D Magazine for his team's pioneering research in the area of silicon photonics. In 2011 Paniccia was named "Innovator of the Year" by EE Times for his team's demonstration of the world's first integrated 50Gbps silicon photonics link.

 

Paniccia earned a bachelor's degree in physics in 1988 from the State University of New York at Binghamton and a Ph.D. in solid state physics from Purdue University in 1994. In May 2009 Paniccia was awarded an honorary doctorate degree from Binghamton University.

 

From humble beginnings—raised by parents who emigrated to the U.S. from Italy in 1957—Intel Fellow Mario Paniccia was named in 2008 “one of the world’s most recognized corporate researchers,” according to R&D magazine.

Though Intel and IBM have celebrated many achievements together during our rich history of co-engineering, our current collaboration—involving the Intel® Xeon® E7 v2 processors and IBM’s latest DB2 database technologies—is delivering unbelievable breakthrough results, especially with performance gains topping 148x beyond the previous generation of software and processors.

 

How did Intel and IBM generate such dramatic performance improvement?

 

To find out and gain a better understanding of how collaboration flows between IBM and Intel, I went behind the scenes and talked with Jantz Tran, an Intel performance application engineer who works closely with IBM DB2 development.


In fact, Jantz works so closely with IBM that he has an office at IBM’s Silicon Valley Labs. He basically embodies the collaboration between the two companies, working directly with IBM developers to ensure that Intel technologies map to DB2 database development, and vice versa.


“My team assists the IBM dev groups by answering any technical questions they may have about Intel processors,” says Jantz. “For instance, I help ensure DB2 software can take advantage of the parallelism and vectorization support built into the most recent Intel Xeon processors. I also work with the DB2 performance team to set up and tune software and hardware for analysis and benchmark testing on joint IBM and Intel platforms.”

 

Some of Jantz’s most exciting recent projects involved aligning the new columnar database format in IBM DB2 with BLU Acceleration* with new instruction sets and vectorization support in Intel Xeon E7 v2 processors.

 

Columnar data processing is a much faster approach for scanning massive data sets and performing analytical queries, particularly when supported by the Intel® Advanced Vector Extensions (Intel® AVX) and SSE (Streaming SIMD Extensions) instructions in Xeon E7 v2 processors. This enables DB2 to pack more data elements into a single processor register and divide query processing into multiple threads that work simultaneously.
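
As an illustrative sketch of what this looks like from the DB2 side (assuming DB2 10.5 with BLU Acceleration; the table and column names here are hypothetical), a table is simply declared as column-organized and then queried with ordinary SQL:

-- run from the DB2 command line: default new tables to column-organized storage
db2set DB2_WORKLOAD=ANALYTICS

-- create a column-organized table; BLU applies compression automatically
CREATE TABLE sales_fact (
  sale_date  DATE,
  store_id   INTEGER,
  product_id INTEGER,
  amount     DECIMAL(12,2)
) ORGANIZE BY COLUMN;

-- analytic queries need no changes to benefit from the columnar store
SELECT store_id, SUM(amount)
FROM sales_fact
GROUP BY store_id;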

 

“DB2 with BLU Acceleration is a re-architecting of IBM’s database platform, adding columnar store to existing row-based data store,” says Jantz. “This allows the technology to take advantage of Intel AVX and SSE instructions to tap into the massively increased performance potential of highly parallelized, multicore processing.

 

“This is where you get those really big, 148x performance improvements. In the benchmark, just upgrading from previous generation IBM DB2 10.1* to IBM DB2 with BLU Acceleration increased workload speed 77x. Upgrading to Intel Xeon E7 v2 processors from previous generation chips doubled the performance.”

 

Jantz says running DB2 with BLU Acceleration on Intel Xeon E7 v2 processors lets you take advantage of actionable data-compression features and make much more effective use of system memory.

 

“Packing columnar data into SSE registers allows you to use memory pools much more efficiently than row-based stores,” he says, “because you can run queries and evaluate data while it is still compressed. In fact, data compression with columnar store is so much more efficient it requires a lot less memory to run the same data set. So you can house a much larger columnar database on a much smaller memory footprint.”

 

For example, in benchmark tests that Jantz helped engineer, running 10 TB of raw data through the previous generation IBM DB2 10.1 resulted in a row-based database size of 9.69 TB. (That means to run 10 TB of data in-memory required about 10 TB of memory.) However, running the same 10 TB of raw data through DB2 with BLU Acceleration with columnar store and data compression required only 2.13 TB.

 

In other words, the same data was 4.55x smaller with DB2 with BLU Acceleration using actionable compression than with DB2 10.1 using static compression!

 

“So if you have 10 TB of raw data and 2 TB of memory, you can run it as an in-memory database using DB2 with BLU Acceleration and Intel Xeon E7 v2 processors,” says Jantz. “The bottom line: These technologies allow you to run large primary databases directly in-memory at orders-of-magnitude improved performance.”

 

These dramatic performance achievements are groundbreaking, even considering the more than 15 years of collaboration and joint engineering between Intel and IBM, with generation after generation of improvements in performance, uptime, and reliability.

 

Want to learn more about how IBM DB2 with BLU Acceleration and Intel technologies work together to deliver on the promise of Big Data? Watch this video:




And to discover how to unlock the value of your own data with Intel and IBM innovations read the white paper. Both offer insight into what are truly amazing accomplishments in this IBM-Intel collaboration.

 

Follow Tim and the Big Data community for Intel at @TimIntel.

Terracotta BigMemory Max, a recognized leader in in-memory data management solutions, is changing the way that enterprises analyze big data. How? By storing data in a server’s main memory. This change makes big data analysis more reliable and much faster. Under the traditional hard-disk-based data management model, analysis of big data could take hours or days to process, but BigMemory Max whittles that down to minutes or even seconds. Crucial business decisions can be made faster and more accurately when big data is processed in-memory.

 

Watch a select video presentation on Terracotta BigMemory Max and Intel from the Strata Conference



BigMemory Max also safeguards data by copying it across multiple servers. Data loss won’t occur if one server fails because a complete copy of the data sits on another server. BigMemory Max can maintain an uptime of 99.999 percent with this data mirroring method.

 

But wait—it gets even better. The Intel® Xeon® processor E7 v2 family was engineered for systems that require an uptime of 99.999 percent. Combining BigMemory Max with servers built on the Intel® Xeon® processor E7 v2 allows enterprises to follow the scale-up model, which helps reduce power use, cooling costs, and complexity.

 

Servers running BigMemory Max and the Intel® Xeon® processor E7 v2 family help decrease overall total cost of ownership (TCO) without sacrificing the benefits of a large cluster of lower-powered servers. This allows enterprises to expand their big data horizons, turning impossible or improbable analyses into very manageable scenarios.

 

Read more about Terracotta BigMemory Max and Intel at Terracotta Breaks Down Barriers to Big Data Management.

Follow me on Twitter, @TimIntel.
