
Welcome to the Embedded Forum

Inside you'll find resources from Intel's Embedded and Communications Alliance (ECA) members as well as from folks at Intel® Embedded and Communications Group and other embedded developers and experts. Join the conversation and let us know what's on your mind.

Start a Discussion Now. Collaborate with Intel® ECA Members building Intel® Architecture solutions for your embedded design questions.

Recent Blog Posts


The software performance gain you can expect when migrating from single- to multi-core depends on several factors, not the least of which are the architecture and the very nature of the application. A logical starting point would be to scale the single-core performance linearly by the number of cores. But experience shows that system overhead can consume 10-20% of capacity, so a reasonable expectation for many applications is slightly sub-linear, for example 3.2x to 3.8x on a four-core[i]. The really good news is that some embedded applications, such as network packet processing, can actually scale supra-linearly if the right programming concepts are applied to fully leverage the multi-core platform's features. Intel has demonstrated this by porting the popular open source Snort network intrusion detection software to a four-core processor, achieving more than 6.2x performance over single-core. This article is a brief summary of that Snort exercise.
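To make the sub-linear and supra-linear figures above concrete, here is a minimal Python sketch that turns a measured speedup into per-core scaling efficiency. The function names are illustrative only and do not come from the white paper; the speedup values are the ones quoted in this post.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Return speedup per core; 1.0 corresponds to perfectly linear scaling."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


def classify_scaling(speedup: float, cores: int) -> str:
    """Label a result as sub-linear, linear, or supra-linear."""
    efficiency = parallel_efficiency(speedup, cores)
    if efficiency > 1.0:
        return "supra-linear"
    if efficiency < 1.0:
        return "sub-linear"
    return "linear"


if __name__ == "__main__":
    # Speedups quoted in the post, all on a four-core processor.
    for speedup in (3.2, 3.8, 6.2):
        eff = parallel_efficiency(speedup, 4)
        print(f"{speedup}x on 4 cores: {eff:.0%} per core, {classify_scaling(speedup, 4)}")
```

By this measure, 3.2x to 3.8x on four cores is 80% to 95% per-core efficiency, while 6.2x is about 155%, which is what supra-linear means here.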

Exceptional Snort performance was largely achieved through techniques that maximized the benefits of cache memory. Cache efficiency is the performance linchpin of most modern processor systems and takes on even greater significance in the world of multi-core. The cache-hit rate often correlates with the program's locality of reference, meaning the degree to which a program's memory accesses are limited to a relatively small number of addresses. Conversely, a program that accesses a large amount of data from scattered addresses is less likely to use cache efficiently. Multi-core architectures can significantly improve program flow so that the cache associated with each individual core is used more effectively. With multiple caches available, developers can optimize data locality, driving higher cache-hit rates and improved overall application performance.

When migrating applications to multi-core, there are often numerous approaches to distributing the code among the cores, and those different configurations can yield widely ranging performance. Finding the optimal one may require some experimentation. The most obvious, and probably simplest, trial was to run a copy of the full Snort application on each of the four cores, each one handling a quarter of the total packets. This option produced sub-optimal results. After thorough analysis of the code's architecture and dataflow, developers finally converged on a high-performance configuration by combining the concepts of functional pipelining and flow-pinning.

http://communities.intel.com/openport/servlet/JiveServlet/downloadImage/1918/snort.bmp

Functional pipelining is a technique that sub-divides the software into multiple sequential stages and assigns these stages to dedicated cores. Each core runs its application stage and then hands off the intermediate results to the next stage, and so on. Pipelining can increase locality of reference since each core runs a subset of the entire application, potentially increasing the cache hit rate associated with executing instructions. Pipelining also provides an opportunity for load sharing since you can assign multiple cores to the stages that are more CPU-intensive. Snort lent itself well to pipelining since the existing code was already designed as a sequence of well-bounded functional stages (those named in the diagram).
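The hand-off structure of functional pipelining can be sketched in a few lines of Python. This is only an illustration of the idea, not Intel's implementation: the three stage functions (`decode`, `detect`, `report`) are hypothetical stand-ins rather than Snort's real stages, and Python threads are not pinned to cores, so the sketch shows the dataflow and not the cache benefit.

```python
import queue
import threading

_SENTINEL = object()  # unique marker for end of the packet stream


def _run_stage(func, inbox, outbox):
    """Take items from inbox, apply this stage's function, hand results on."""
    while True:
        item = inbox.get()
        if item is _SENTINEL:
            outbox.put(_SENTINEL)  # propagate shutdown to the next stage
            break
        outbox.put(func(item))


def run_pipeline(stages, items):
    """Chain the stages with queues, using one worker thread per stage."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    workers = [
        threading.Thread(target=_run_stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for worker in workers:
        worker.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(_SENTINEL)

    results = []
    while True:
        out = queues[-1].get()
        if out is _SENTINEL:
            break
        results.append(out)
    for worker in workers:
        worker.join()
    return results


# Hypothetical stages, for illustration only.
def decode(raw):
    return {"payload": raw.decode("ascii", errors="replace")}


def detect(packet):
    packet["alert"] = "attack" in packet["payload"]
    return packet


def report(packet):
    return ("ALERT" if packet["alert"] else "ok", packet["payload"])


if __name__ == "__main__":
    print(run_pipeline([decode, detect, report], [b"hello", b"attack!"]))
```

Each stage touches only its own code and working data, which is the property the paragraph above credits for better locality. A fuller version would allow several workers on a CPU-intensive stage, at the cost of having to handle packet ordering.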

Flow-pinning is a technique that overlays the pipelined configuration. Performing functions such as TCP reassembly on a large number of TCP flows is likely to access a large amount of data over a large range of memory locations, resulting in reduced cache efficiency. Restricting, or "pinning," individual TCP flows to a single core improves data locality because each core operates on a smaller number of flows. This translates into less data access over a smaller range of memory locations and therefore better cache efficiency. The diagram box "Packet Classify Hash" implements this pinning function by directing packets from the same TCP flow to the same core.
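One common way to implement this kind of classification is to hash the flow's addresses and ports and use the result to select a core. The sketch below assumes that approach; the actual hash used in the "Packet Classify Hash" stage is not described in this post, so the CRC32 choice, the direction-independent key, and the function names are all assumptions made for illustration.

```python
import zlib


def flow_key(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> bytes:
    """Build a key that is identical for both directions of a TCP connection."""
    # Sorting the two endpoints makes client-to-server and server-to-client
    # packets produce the same key (an assumption, not confirmed by the post).
    low, high = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return f"{low[0]}:{low[1]}-{high[0]}:{high[1]}".encode("ascii")


def pick_core(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
              num_cores: int) -> int:
    """Map a flow to a core index in the range [0, num_cores)."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return zlib.crc32(flow_key(src_ip, src_port, dst_ip, dst_port)) % num_cores


if __name__ == "__main__":
    forward = pick_core("10.0.0.1", 1234, "10.0.0.2", 80, 4)
    reverse = pick_core("10.0.0.2", 80, "10.0.0.1", 1234, 4)
    print(forward, reverse)  # both directions land on the same core index
```

Because the mapping is deterministic, all state for a given flow stays with one core, which is the locality benefit described above.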

By applying pipelining and flow-pinning, developers were able to nearly double cache efficiency, leading directly to the high application performance. And further cache optimization could potentially yield even greater results.

Achieving more than 6.2x four-core performance over single-core is a great real-world example of the potential of Intel multi-core processors. Since Snort is rather typical of a packet processing application, it is likely that the supra-linear performance gains described here can be generalized to other applications with intensive packet processing requirements.

There is one caveat I should mention. Our demonstration was done using Snort version 2.2.0, which has since been superseded by newer versions with increased functionality and a modified software architecture. While the basic transformation process and optimization techniques could be applied to the current release, the optimal multi-core software configuration and performance would likely differ from the results of our exercise.

To view the complete white paper, visit: http://download.intel.com/technology/advanced_comm/31156601.pdf

Lori
-----

[i] For an explanation of the 3.2x - 3.8x performance scaling estimate, see page 12 of the white paper http://download.intel.com/technology/advanced_comm/315697.pdf

2 Comments Permalink

Around this time of year there tends to be a flurry of activity in most companies as
product families are created or updated for the coming year. Concept designs
and platforms can be used to test customer interest at tradeshows and on
roadshows so that specifications can be finalized to maximize market appeal. When
resources are tight it is a good idea for OEMs to find a partner to help them
get the show on the road.

Custom system design services can quickly get you a prototype of your product
idea. By turning ideas into reality, you can gather customer input and mitigate your risk when
entering a new product category or utilizing a new product technology, such as
multicore, content security or advanced management technologies. By selecting a
vendor with a broad product line of motherboards, SBCs and systems as your design
partner you can make sure that you achieve the optimum configuration. Once you have
zeroed in on a processor, your design partner should use the chipset designed for
that processor, so that you can access the full capabilities and performance of the
processor.

What does that mean to you? If your vision is a new fanless small form factor system,
you can take advantage of the Intel GMA 500 controller that supports full hardware acceleration
of H.264, MPEG2, VC1 and WMV9, as well as up to 4 streams of HD audio by using
the Kontron KTUS15/minITX motherboard with the Intel® Atom™ and the Intel®
System Controller Hub US15W combination.

Get ready for '09.


Kontron – Nancy Pantone

2 Comments Permalink

As processor technology becomes better suited to mobile use, it opens new doors for all kinds of applications. More processing power makes it easier for these devices to be more intelligent and to communicate wirelessly amongst themselves. The growth in intelligent mobile devices of all types is going to be phenomenal in the coming years. But many issues still hinder the development and growth of mobile devices.

To be mobile, they must be small and lightweight. They have unique thermal constraints because of the size. You can't simply throw large heat sinks and heat pipes on them to disperse the heat from the processors and chipsets. Fans are not good. They create a long list of design challenges that are more easily addressed by taking them out of the equation. Who wants a fan making noise and constraining how the device is used so as not to block the airflow?

And most visible to users is battery life. Short battery life puts a leash on the mobility factor. Large batteries make the device heavier and less mobile.

Improvements in processor and chipset technology are making it easier to overcome these issues.

I recently moderated an E-cast event on the topic of "Rethink Cool - Intel® Atom™ Meets Tough Design Requirements" that was sponsored by the Intel® Embedded and Communications Alliance, with presentations from Intel ECG, RadiSys, and Nexcom.

Intel ECG started it off with a quick introduction to the Atom processor. RadiSys discussed how COM Express uses the Atom processor to help us re-think cool. The presentation covered the advantages of using a board-level module to overcome some of the challenges of designing small systems. The Nexcom presentation touched on techniques for designing for long battery life in mobile industrial applications.

You can view the E-cast in its entirety at http://w.on24.com/r.htm?e=117593&s=1&k=D93329A40831838D6947B2A6DCCD85DD

Several questions were asked during the E-cast. For a list of the questions and responses, see "Rethink Cool - Intel® Atom™ Meets Tough Design Requirements, Sept 24, 2008, Live Chat."

0 Comments Permalink


I mentioned in my last blog that the people behind ATCA have been pushing towards next-generation technologies. In the case of I/O, this push is being driven by new technologies, but the area I will focus on in this entry is power. ATCA is looking to expand into new market areas, while also meeting the increasing capacity demands of its current market space. At a simple level, doing this requires more performance. And as we know, increased performance tends to require more power, which in turn generates more heat.


Which brings us to “shall” and “shall not.”


One of the most important instances where we reach this crossroads is when considering a change in the ATCA specification in terms of power per blade. Initially, ATCA blades were limited to 200W of power, which of course implied that the chassis surrounding the blade could cool a 200W blade.

However, the latest releases of ATCA-based blades no longer have this restriction. The specification that used to state that an ATCA blade shall be limited to 200W per slot now states that a blade shall not exceed 400W, although the two statements appear at different places within the specification. Obviously, this is an important difference. It allows the ATCA designer to use more powerful CPU solutions to meet that ever-increasing curve of capacity requests and makes it possible to support more cores, more memory and more storage. In short, a single “shall” allows ATCA to expand in a much-needed direction.

0 Comments Permalink

You know what it's like when you have that "aha" moment. Well, we have been getting a lot of these lately from various customers. What's driving these reactions are the implications of having two low-voltage Harpertown processors running on Trenton's MCX/MCG system host boards and what this means to applications that require maximum performance with minimum heat generation. The processors I'm speaking about are the embedded Quad-Core Intel Xeon Processors L5410 and L5408. Their performance, combined with thermal design power (TDP) ratings of 50 Watts and 40 Watts respectively, has significant implications in the high-end performance segment of the embedded system market that Trenton serves.

Now I know that 40W or 50W TDPs sound horrible compared to the sub-20W TDP ratings common in the low-end, commodity-driven portion of the embedded system market. However, the reality we deal with every day in the embedded system market segments that we serve is a demand for a level of processor performance that in the past has precluded most low-power processor solutions. The Intel L5410 and L5408 are rapidly changing this performance vs. heat paradigm.

For example, in a surveillance aircraft application, the required system uses four system host boards, each with two Quad-Core Intel Xeon Processors L5408. This system manages incoming data from a variety of sources, processes all this data and drives the display and communication systems needed to act on these critical inputs. Data processing time and accuracy are critical, as is the need for a system that generates as little heat as possible. It would have been next to impossible to design a system to meet all of the customer's performance and thermal requirements without the L5410 or L5408 processors.
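As a rough illustration of the thermal budget involved, the processor TDP for such a configuration can be totaled from the figures quoted above. This sketch is an assumption-laden estimate: it counts only processor TDP (a thermal design figure, not measured power draw) and ignores chipsets, memory, and every other component. The function name is hypothetical.

```python
# TDP ratings as quoted in this post, in watts.
TDP_WATTS = {"L5410": 50, "L5408": 40}


def cpu_tdp_budget(boards: int, cpus_per_board: int, model: str) -> int:
    """Sum of processor TDP across all boards; excludes all other components."""
    return boards * cpus_per_board * TDP_WATTS[model]


if __name__ == "__main__":
    # Four system host boards, two L5408 processors per board.
    print(cpu_tdp_budget(4, 2, "L5408"), "W of processor TDP")
```

For the four-board, dual-L5408 system described above, this comes to 320 W of processor TDP; the same layout with L5410 parts would come to 400 W.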


Next time I'll share with you some of the system design details regarding the chassis airflow design and the new four-segment PICMG 1.3 Ethernet fabric backplane we produced for this system application.

0 Comments Permalink


Innovation Case Study

Windmill
Remotely managed Advantech ARK-3382 controller, developed for a wind power solution at the Beijing 2008 Olympic Games.