7 Replies Last post: Oct 21, 2009 10:40 AM by pmcfee  
Michael Brinkman - IBM (4 posts since Mar 25, 2009)
 

May 16, 2009 12:16 PM

IBM x3550 M2, x3650 M2, HS22, and dx360

Wanted to know if anyone here has tried out one of IBM's new servers. I was on the team that developed the new UEFI code stack for these servers and would like to get first impressions from the community. This was a large effort and will provide many avenues for innovation as we go forward. I would also be interested in any new functions you would like to have in the pre-boot or systems management areas of your system.

 

Michael Brinkman

Mgr. UEFI Team.





Jeremey Wise (1 post since Jun 2, 2009)
1. Jun 2, 2009 11:07 AM in response to: Michael Brinkman - IBM
Re: IBM x3550 M2, x3650 M2, HS22, and dx360

Yes... dozens of times. :)


****

Background:

I am a Senior Solutions Architect who helps the engineers at our integration facility find resources when a new product is having issues, or when complex designs require a more solutions-based set of technical resources. I only supplement the already great and talented people who debug and work every day with the IBM modular product line (Intel). Any time a new product is released, I ask to be informed so I can get some hands-on time with the new systems and keep my knowledge of the changing technologies sharp.

***

 

The move to the Nehalem processor marks a major change for anyone who has to deal with integration. I had been getting feedback from our facility about more than "the usual issues" of a new product, so I made time to go over, take an example system, and document it. This may not reflect all the issues, and as with any new product release, they are sure to be worked out. I have to say up front that the issues are NOT in any way systemic (i.e., a core technology issue) or even critical in nature. IBM and Intel have for years done very well at avoiding and mitigating issues at that level. What is reflected below are things that, IMAO, need to be reviewed to see if there are ways to make improvements. These opinions reflect the integration and setup aspects of the system and do not capture long-term runtime experience (I am always looking for demo gear to prove that out in our labs).

 

 

****************

System: HS22 Model 7870

Product ID: 7870C3U

SN: (not listed to protect the innocent)

 

System Specs:

1 socket (quad-core)

RAM

PC3-10600 PN:43x5061 (qty 6)

 

Areas of Improvement:

1) System integration times: This is how long it takes us from unboxing to the point where we can test, update, and ship a box. This "T Time" has more than doubled, which impacts how quickly we can meet customer demands for shipping. Most of the increase stems from boot times, measured from the moment the hardware is installed and power is applied to the point where a technical person can interact with the system for loading (or updating). Boot time has more than doubled as well, and it has the greatest impact on how long it takes to debug and root-cause a hardware issue.

               Blade placed into chassis and powered: 0:00

               Chassis Acceptance of Blade: 2:12

               First screen: 34sec

               BIOS Options: 52sec

               Legacy boot Option (select media for loading): 2:24

               (Total time until a boot target can be selected to load the OS: 3:16)

 

2) System boot error from the factory: "PXE-99: Unexpected Network Error". I believe this is also the cause of the long boot times after the initial flash. I did not have time to root-cause it further with any kind of sniffer on the blade switch.

 

3) RAM debug issues: With the new processors binding to RAM, there is a "difficult" process to debug bad RAM and replace it. The long boot cycles make this worse: if the system is NOT stable, RAM is the first (and most usual) culprit. Errors on RAM with the new form factor are definitely higher. My comments relate to a few ways this impacts the Nehalem systems. BIOS does not always indicate the slot, so with 16 DIMMs at 3:16 per boot cycle, root-causing a bad DIMM is very laborious. When a DIMM is flagged as bad, it can be easily replaced, BUT the system locks out the slot. Clearing the error requires pulling the battery from the system board for 10 seconds or so. This also affects EFI/BIOS settings and configurations, which will cause issues for customers. It also impacts systems with the "minimum" DIMM configuration, where customers will see an impact on system core speed and boot capabilities due to the RAM failure. (I did not personally see this, but errors are indicated in the BIOS as to the slot, and I saw no indicator on the planar board for a bad DIMM, such as light path. I need to confirm this and document how a customer can check whether their system "changes" are due to a bad DIMM.)
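For a rough sense of scale, the "16 DIMMs x 3:16 per boot cycle" arithmetic can be sketched in Python. This assumes one full boot cycle per DIMM tested and ignores the time spent reseating modules or pulling the battery, so real-world numbers will be higher.

```python
def mmss_to_seconds(text: str) -> int:
    """Convert an 'm:ss' duration such as '3:16' to seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


def seconds_to_mmss(total: int) -> str:
    """Format a number of seconds back into 'm:ss'."""
    minutes, seconds = divmod(total, 60)
    return f"{minutes}:{seconds:02d}"


def worst_case_isolation(boot_cycle: str, dimm_count: int) -> str:
    """Worst case, assuming one boot cycle is spent per DIMM tested."""
    return seconds_to_mmss(mmss_to_seconds(boot_cycle) * dimm_count)


print(worst_case_isolation("3:16", 16))  # 52:16
```

In other words, close to an hour of pure boot time before anyone touches a screwdriver.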

 

 

4) IPMI issues: Many customers use IPMI to manage and monitor system components. The Blade Management Module does not have the ability to change the IPMI definition of the HS22 systems, nor can you make the changes in the BIOS (view only at this time). This is even more of an issue when the systems are laid down with the VMware hypervisor, which does not have an OS-level agent to report system events back and so relies purely on IPMI for reporting and statistics. This is such an issue that it is under debate whether customers will accept shipment of the product, as it is not "properly configured". (Example: the BSMP IP address was set to 192.168.70.200, but we could not change it to reflect our customer's 10.x.x.x IP range.)
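As a pre-shipment sanity check, something like the following Python sketch could flag blades whose BMC address is still outside the customer's range. It assumes the output of `ipmitool lan print` has already been captured as text with an `IP Address : ...` line; the sample text and the 10.0.0.0/8 network are illustrative assumptions, not values captured from an HS22.

```python
import ipaddress
from typing import Optional


def parse_lan_ip(lan_print_output: str) -> Optional[str]:
    """Pull the 'IP Address' field out of `ipmitool lan print`-style text."""
    for line in lan_print_output.splitlines():
        key, sep, value = line.partition(":")
        # Match the exact key so 'IP Address Source' is not picked up.
        if sep and key.strip() == "IP Address":
            return value.strip()
    return None


def in_customer_range(ip: str, customer_network: str) -> bool:
    """True if the address lies inside the customer's network."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(customer_network)


# Illustrative text only, not captured from a real blade.
sample = """IP Address Source       : Static Address
IP Address              : 192.168.70.200
Subnet Mask             : 255.255.255.0"""

bmc_ip = parse_lan_ip(sample)
print(bmc_ip, in_customer_range(bmc_ip, "10.0.0.0/8"))  # 192.168.70.200 False
```

This only detects the mismatch; as noted above, there was no way to actually change the address from the management module or BIOS at the time.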


5) Missing serial numbers: We still on occasion receive units that have no serial number. When you insert them into the chassis, the management module shows blank data for the model and serial number. Because of the IPMI issue listed above, the only fix at this time is to declare the unit 'bad' and replace the entire system. This immediately throws out an entire day of productivity, at best... assuming we have spares in stock. Though I did not see this process myself, they had the same issue on the x3650 M2 Nehalem systems, but there is a fix using a flash utility from IBM. The issue with the blade units is being worked on by IBM, but there is no ETA.

 

 

6) Logic and understanding of EFI and its relation to the "Legacy boot" option: This is likely a learning issue, but in the last 10 minutes I had to work on this topic, I tried to learn how to make what I would classify as basic changes to the system to affect boot times and boot targets (such as SAN boot, iSCSI targets, setting boot to ONLY the Fibre HBA, and then changing the boot option for the ONE time the initial OS would be loaded). I was not able to lay this out and need to play with it more. The engineer did explain that there seems to be some reliance on the "Legacy boot" option that has to be used and worked with, and that this has not been fully worked out.

 

 

Good things: I also like to take note of the things the engineers say "that is great!" about.

1) RAID capabilities in BIOS save enormous amounts of time and allow us to quickly set up what is almost always a basic RAID 1 for hardware integration and testing.

 

2) System performance is VERY impressive. I have not had any HPC builds to do any "kicking of the tires" on my own, but in the little time I have had my hands on them (post-boot), load and application times for VMs (which is much of what we design) have been very fast.

 

3) Legacy boot options are still very helpful.

 

 

 

I hope this helps. Look me up if you have any additional questions or need points of clarification.

masbe (1 post since Aug 17, 2009)
2. Aug 17, 2009 10:47 AM in response to: Michael Brinkman - IBM
Re: IBM x3550 M2, x3650 M2, HS22, and dx360

I have limited experience with the x3650 M2. Unfortunately I haven't had the opportunity to explore all of its features, so the only comment I'll make is about the time it takes a) to enable the power button so the server can be powered on and b) to load the EFI: each takes longer than I would like and has made troubleshooting DIMM problems very time-consuming. Sorry to be negative: if I have anything positive to comment on in the future, I'll post again!

digg1980 (1 post since Sep 15, 2009)
4. Sep 15, 2009 3:10 PM in response to: Michael Brinkman - IBM
Re: IBM x3550 M2, x3650 M2, HS22, and dx360

Hi Michael,

 

I have installed many of these servers as a System x & Storage technical specialist at one of IBM's larger partners. I have mainly faced two problems with the new EFI:

 

1- Boot time is too slow. It takes about 20 minutes to boot VMware ESX when EFI is used, compared to less than 2 minutes with the normal BIOS.

2- Boot from SAN, especially with Qlogic cards, does not seem to make life any easier. I even wrote a post on how to get this set up, to help our customers through the process. The post can be found at:

The file ql2300.sys is corrupted. press any key to continue.

 

That said, the performance of the new systems after they boot up is about 1.5 times what the earlier models gave. I am comparing performance in a VMware setup, where I can run about 1.5 times the number of VMs on an HS22 as I used to get on an HS21XM.

 

I hope that helps,

Eiad

virtualpete (1 post since Oct 19, 2009)
6. Oct 19, 2009 10:39 PM in response to: Michael Brinkman - IBM
Re: IBM x3550 M2, x3650 M2, HS22, and dx360

Hi,

 

Working on a couple of x3550 M2s in a remote co-location. We don't have KVM hardware or IBM Director in this case, so we have been relying on the remote admin features of the IMM. I would have loved to use the remote console, but the requirements for getting a Java Web Start app going are impossible for me: the host I'm running the browser on can't get to the internet, and due to site rules there's no chance of fixing that. So the remote console feature is basically useless for now. Could someone consider this scenario and maybe come up with something that doesn't need to pull .jar files down from t'internet? ... Anyway, we're using Red Hat, so we at least have the option of a serial console, and I did figure out how to reach COM2 from the IMM CLI with the 'console 1' command, so I prevailed.

 

Our host was shipped to us 3 weeks ago with such old firmware that it really wasn't workable. Perhaps you could push an alert up the supply chain to try to ensure units are not shipped with known bad firmware. I spent the best part of a day with a hung IMM (it died simply because I rebooted the host OS), arranging for remote staff to power-cycle boxes, and applying firmware and waiting 5-15 minutes on each restart until I got them to a stable configuration. It's not the sort of experience that motivates customers to buy another 100 or so units.

 

Anyway, the servers are now at these levels:

IMM YUOO32F (2009/08/26)
UEFI D6E128A (2009/08/20)
DSA DSYT19A (2009/08/20)

 

We're able to boot/build/manage OK, but I'm looking forward to your speedups in the Q4 release. Hopefully you can do something about warm-reboot times: there seems to be about 5 minutes of dead time during a simple reboot, which is not desirable. Maybe if I were looking at the console I'd see what the holdup is...

 

One other thing I've noticed is that the 'onetime PXE Network Boot' option in the IMM web interface does not work for me (and BTW, I can't find it in the IMM CLI). The documentation refers to certain conditions needing to be met but does not explain what they are. Maybe it's a bug, maybe it's something I'm doing wrong, so I'm hoping your documentation people are planning to add more detail on this feature. I had to go down the path of getting someone at the remote site to press F1 on the console to set BootOption.BootOption to "PXE Network=Hard Disk 0", build off PXE, then reboot with the DHCP and TFTP servers disabled, and wait about 10 minutes for PXE to try 12 or so times on each network interface before proceeding to the hard disk boot, before finally being able to fix BootOrder with asu. All this because I can't for the life of me find where you can modify the UEFI settings via the IMM.
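If the boot order is being fixed with asu from a script, a small Python helper can at least build the command line consistently. This is a sketch only: it assumes the '='-separated value format quoted above and assumes a setting named BootOrder.BootOrder (the post mentions both BootOption.BootOption and BootOrder), so check the exact setting name against your ASU documentation before running anything.

```python
from typing import List


def boot_order_value(devices: List[str]) -> str:
    """Join boot devices with '=', as in "PXE Network=Hard Disk 0"."""
    if not devices:
        raise ValueError("at least one boot device is required")
    return "=".join(devices)


def asu_set_boot_order_cmd(devices: List[str], asu_binary: str = "asu") -> List[str]:
    """Build (but do not run) an ASU 'set' command line.

    'BootOrder.BootOrder' is an assumed setting name; verify it first.
    """
    return [asu_binary, "set", "BootOrder.BootOrder", boot_order_value(devices)]


# After the PXE build, put the hard disk first again.
cmd = asu_set_boot_order_cmd(["Hard Disk 0", "PXE Network"])
print(cmd)
# To actually apply it (not done here): subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string keeps the spaces in names like "Hard Disk 0" intact without extra quoting.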

 

And also, while I think of it: is there a way to cut back the access rights of asu and MegaCli to read-only? I'm worried that a mischievous individual on a compromised system could do things like 'reset to factory' on the RAID controller or UEFI. It seems the only option currently open to me is to disable the USB interconnect to the IMM completely, but that doesn't help with the RAID.

 

Otherwise, love your work. Keep it up...

pmcfee (1 post since Oct 21, 2009)
7. Oct 21, 2009 11:02 AM in response to: Michael Brinkman - IBM
Re: IBM x3550 M2, x3650 M2, HS22, and dx360

Just got done getting the HS22 to Boot From SAN (SVC). We recently received an HS22 blade (7870-AC1) and tried to configure it to Boot From SAN. After several attempts to configure the Qlogic card (QMI2572), a support call was placed to IBM because the card was unable to save any settings. Another Qlogic card was also ordered to rule out a defective card, and it had the same issue. I wouldn't even recommend trying to Boot From SAN without these firmware levels.

Firmware levels should be at:
BIOS 1.04, Build P9E130AUS
Diagnostics 1.13, Build P9YT40A
Blade Sys Mgmt Processor 1.05
Qlogic QMI2572, BIOS Revision 2.08
IMM (Integrated Management Module) 1.05, Build YU0032F
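A pre-deployment check against the levels above could be sketched like this in Python. The inventory values are hypothetical, build IDs are ignored, and it assumes dotted version numbers compare part by part as integers (so "1.10" is newer than "1.04").

```python
from typing import Dict, List, Tuple

# Minimum levels from the list above (version numbers only, build IDs omitted).
MINIMUM_LEVELS: Dict[str, str] = {
    "BIOS": "1.04",
    "Diagnostics": "1.13",
    "Blade Sys Mgmt Processor": "1.05",
    "Qlogic QMI2572 BIOS": "2.08",
    "IMM": "1.05",
}


def version_tuple(version: str) -> Tuple[int, ...]:
    """'1.04' -> (1, 4) so versions compare numerically, not as strings."""
    return tuple(int(part) for part in version.split("."))


def below_minimum(installed: Dict[str, str]) -> List[str]:
    """Return components that are missing or older than the listed minimum."""
    problems = []
    for component, minimum in MINIMUM_LEVELS.items():
        current = installed.get(component)
        if current is None or version_tuple(current) < version_tuple(minimum):
            problems.append(component)
    return problems


# Hypothetical inventory for one blade.
inventory = {
    "BIOS": "1.04",
    "Diagnostics": "1.10",
    "Blade Sys Mgmt Processor": "1.05",
    "Qlogic QMI2572 BIOS": "2.08",
    "IMM": "1.05",
}
print(below_minimum(inventory))  # ['Diagnostics']
```

Treating a missing component as a failure is deliberate: a blank reading is exactly the case you want flagged before attempting Boot From SAN.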

Overall we were able to configure the HS22 to Boot From SAN, but only by loading an OS (on the internal drives) and applying the updates from there, since Update Express didn't have the latest firmware for the Qlogic card. The time to deploy a blade has increased: the HS21 took less than 30 minutes to be up and running when booting from the SAN, compared to 2 1/2 hours to deploy the HS22. Please share this email as needed.

 

Message was edited by: William Lea, adjusted to smaller font and removed Bold.
