
SS4000-E: Questions on pushing the limits

idata
Employee
4,952 Views

Hello,

BACKGROUND: according to the support document at http://www.intel.com/support/motherboards/server/ss4000-e/sb/CS-022215.htm , the SS4000-E w/ firmware 1.4 v710 is stated to have a maximum capacity of 3TB.

MY QUESTION: Is that a limitation of the software design of the firmware, or was that number based on hard drive sizes available at the time?

WHY I ASK:

I write this with the hope that one of the support members, or one of the community members, may have tried this.

I bought several of these SS4000-E boxes, new from a reseller blowing them out at a very low price.

Disclaimer: I understand that it's past end-of-sale and end-of-support. And, I understand the list of limited "supported" hard drives. I have read the documentation available. (So, I acknowledge that this is a "message-in-a-bottle" support request that may or may not be answered.)

That being said, it looks like the SS4000-E with 1.4 v710 firmware will work with 2TB drives. (Notice that I did not say "supported" as they are not on the list of supported drives)

...at least, it LOOKS like it does when I plug 4 of them into the SS4000-E box, set them up in RAID 5, and create 3 partitions (2 @ 2048 [1.99TB each], one w/ remainder space [1.45TB])
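
As a side note, the numbers the firmware reports line up with plain RAID 5 arithmetic, if you assume its "GB" figures are really GiB. A quick Python sanity check (the drive byte count is the typical value for a "2 TB" disk; the small per-disk reserve is my guess):

DRIVE_BYTES = 2_000_398_934_016       # a typical "2 TB" (decimal) drive
GIB = 1024**3

# RAID 5 over four drives keeps three drives' worth of space.
print(3 * DRIVE_BYTES / GIB)          # ~5589 GiB; the firmware reports 5587.5,
                                      # presumably less a little reserved per disk

# Two partitions at 2048 each leave the remainder for the third:
print((5587.5 - 2 * 2048) / 1024)     # ~1.46 TiB -- the "1.45TB" remainder partition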

So far, so good.

Then, since I do have these three partitions, I began restoring data to these drives. (Using "SyncBack", a free third-party program, running in a Win7 home environment.)

I filled up my 2TB public partition and started to add data to my 2TB public-2 partition... when that was done, I found that I received a "DISK CHANGE NOTIFICATION" message upon logging in. I could not reach one of the partitions.

Well, I did not change the disk, I could not tell if there was a failure, and I could not get past that screen. I did verify that the disk sizes and serial numbers recognized by the Intel SS4000-E firmware were EXACTLY the same. And, the disks on that page showed no errors.

So, I reinitialized and restarted. (Yes, I always have a backup for my backup. I am a technologist; I have no faith in any single-point solution.)

This time, I started restoring with "public-3" (w/ 391GB of data) and then "public-2" (w/ 1.1TB of data)... everything seemed to be going well. I stopped and started my restore several times along the way without problems.

However, when I got to the end of restoring the second partition, AGAIN, I found that I had received a "DISK CHANGE NOTIFICATION" error.

And, of course, again, no disk failure or change, with all lights on the box a happy shade of green.

I DID notice, this time, that the system said: "Current state: RAID 5 (NORMAL, Resync : 73 %, Finish : 1873 min, Speed : 4540K/sec)"
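
As an aside, those resync numbers are at least internally consistent, if you assume the resync sweeps one whole 2TB member disk (my guess at how the firmware's md-style resync works):

member_kib = 2_000_398_934_016 // 1024   # one "2 TB" member disk in KiB
remaining = 0.27 * member_kib            # 27 % of the sweep left at 73 % complete
print(remaining / 4540 / 60)             # ~1936 min at 4540 K/sec -- close to the reported 1873 min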

Here is a screenshot, resync numbers have changed slightly:

And, I can reach my mapped "public-2" and "public-3" partitions and see and use the data on those partitions, but I cannot reach my mapped system-created "public" partition. (And, if I access the resource directly, I can access "admin", "public-2", "public-3", but not "public".)

Of course, as I have the "DISK CHANGE NOTIFICATION" message screen, I cannot get past that to see if that partition still exists on the system or not.

So, THIS time, rather than reinitializing and starting over, I am choosing to let the RAID restore process run and see what may be the result in a day or two. (if it will actually work or not)

But ultimately, I want to know: Is the 3TB limit mentioned in the documentation something that may be causing the system (with 4 @ 2TB drives) to believe it has had a failure?

... and will it ever work with 4 @ 2TB drives?

I hope for a response. I will post my RAID restore results here in a few days' time.

Thanks,

v.

0 Kudos
21 Replies
idata
Employee
1,617 Views

The large HDDs "should" work, as the 1.4 version of firmware (operating system) supports storage capacity greater than 2 TeraBytes (TB). As you see, the storage is divided into 2TB partitions, including one for a shared public folder, one for user home folders and one for backups. However, Intel has never tested HDDs larger than 1 TB and doesn't know how they'll work.

That said, the way you're configured should be fine. The public1, public2 and public3 shared folders will actually be separate partitions like /dev/vbdi4, vbdi5 and vbdi6 mounted on /nas/NASDisk-00004, NASDisk-00005 and NASDisk-00006. These partitions will each be 2TB.

The different partitions are still part of the single RAID array, unless there was damage to specific sectors in this partition. Damaged sectors on a specific disk may explain the "one or more drives had either failed or been changed" message.

You can troubleshoot the system by creating a diagnostic file (XRay) for the SS4000 and analyzing the results. See:

ftp://download.intel.com/support/motherboards/server/ss4000-e/sb/ss4000etroubleshootingguide.pdf

John

idata
Employee
1,617 Views

John,

Thank you for the input.

As a result of my wait, I was rewarded with the RAID reporting that it is "Normal".

HOWEVER... I am still stuck at the "Disk Change Notification" screen.

Screenshot:

Choosing [Scan] or [Continue] appears to do nothing but return me (again) to the Disk Change Notification screen.

I am able to [ShutDown]; however, on restarting the SS4000-E, the system returns to this same screen.

While this appears to be progress, sadly, the results are worse than before: at this time, I can access the 200MB Admin directory, but I can no longer access Public, Public-2 or Public-3.

Those three partitions make up the majority of the 4 @ 2TB drives in RAID 5, and this is a different result than before the reboot, when I could access Public-2 and Public-3.

Q: Is there any command that can "force" the system to get past this Disk Change Notification screen?

What I was able to do was follow the guide that you provided, turn on debugging, and generate the XRAY file.

I have uploaded this file to http://dl.dropbox.com/u/32146825/xray.tgz . I would appreciate it if you could take a look at this, or provide guidance on what key items I should consider when looking at the output.

Finally, turning off debugging did not make any changes to the accessibility of the partitions (not that I thought that it would, but hey, what the heck, right?)

As before, all lights on the SS4000-E are green, not indicating any errors. There appears to be some drive activity, based on sounds/vibration of the unit. At this point I am planning on letting this run for a day to see if there will be any changes to the partition availability. (not optimistic there either)

Again, thank you for your support (and the support of any others that may provide input, of course).

vincent

0 Kudos
idata
Employee
1,617 Views

Update: After running the SS4000-E overnight, there were no changes: Public shared space was still unavailable.

So, to remove the drive set from the equation, I replaced:

with:

Observations:

  1. The Seagate drive is a lower RPM drive.

  2. On placing the Hitachi drives into another manufacturer's NAS solution (which provides more access to SMART data), one of the drives DID show a SMART event. However, I would think that one SMART event should not lock up the entire SS4000-E solution, as the drive passed the system's drive test.

  3. On reinitializing and reconfiguring the Seagate drives in the SS4000-E, I did notice that while the partitions are created quickly, on checking Advanced / Disks I see that the RAID does not actually get configured (resync'ed) yet, and that it seems to actually be formatting/testing the partitions.

Screenshot showing the Seagate drives, and the beginning of a "resync" process that looks like it will take nearly 5 days (!!):

Of course, my hope(s) are three-fold:

  1. That the preparation / resync of the solution (with no data on the drives) will not actually take 5 days,

  2. That waiting for this resync to complete BEFORE adding data will allow the system to better accept the larger capacity. (Possibly I had overwhelmed the solution by adding data while the initial resync was taking place?) And...

  3. That the Seagate drives will behave differently than the Hitachi drives.

The good news is that we are once again able to reach the SS4000-E Home screen, and see a RAID 5 configured storage space of 5587.5GB

Screenshot:

I would still be very interested in what the XRay output shows for the previous crash (particularly whether it was related to a system limitation); however, I am moving forward with this attempt.

I will continue to document the results for others interested in the outcome of the SS4000-E with 2 TB drives.

Regards,

Vincent

0 Kudos
idata
Employee
1,617 Views

Update:

0 Kudos
idata
Employee
1,617 Views

The Seagate drives have finished their initialize/resync. (Actual time elapsed was closer to 7 days.)

RAID status is Normal, and the Hotplug Indicator shows Yellow for all disks.

I will begin restoring data to the drives today, and will provide updates at various milestones.

0 Kudos
idata
Employee
1,617 Views

MAJOR SETBACK.

The plan was to change the NAS name from NAS-4 to NAS-1, as well as change the fixed IP address.

On changing the NAS name, two hard drives (# 1 and # 4) went offline (drive indicator lights on SS4000-E box off).

Screenshot:

[Scan] did not get the drives to be recognized.

After a shutdown / reboot, the NAS had to be found again using the Storage System Console tool.

As you would expect, losing two drives broke the RAID. The system reports the same drives, but says that they are "new".

Screenshot:

Looks like it's time to reinitialize the disks (again).

0 Kudos
idata
Employee
1,617 Views

zebraitis,

Changing the storage system name or IP address does not have an effect on the RAID. Why would it? It may cause the system to be inaccessible in a networked environment until the name and/or IP address propagates through a DNS'd network, but won't be the cause of a RAID failure.

I just tried both a name and IP address change on my lab system with no adverse effects. Changed from Storage101 with a RAID 5 configuration to Storage999 and IP address 192.168.101 to 192.168.1.150, rebooted and the storage system was now Storage999, IP address 192.168.1.150 with RAID 5.

During the boot I monitored the boot process and the disks were indicated as active sync, and in the operating system storage console the Advanced: Disks Hotplug indicators were all yellow. All functions are operating normally.

Storage system software versions 1.0 through 1.4 are a "mildly" proprietary build based on the standard Linux kernel version 2.6. There are no "kill RAID" commands built in for system name or IP address changes.

John

0 Kudos
idata
Employee
1,617 Views

John,

Thanks for monitoring this journey.

Changing the storage system name or IP address does not have an effect on the RAID. Why would it? It may cause the system to be inaccessible in a networked environment until the name and/or IP address propagates through a DNS'd network, but won't be the cause of a RAID failure.

You know, logically, I absolutely agree with you, and I expected no surprises.

But yet, there was a negative impact. Because it was such an unexpected result, I included the screenshot. Surprised, I even went and looked at the NAS box, and the lights on Drive # 1 and # 4 were off, just as the screenshot / disk change notification showed.

Other than changing the name, there was no other action taken.

Could it be a fluke? Sure. Absolutely could be.

And, possibly, I may have been able to remove/reinstall the two drives and seen whether they would power up again. However, I skipped that step and moved forward with reinitialization, as the second screenshot above did not show a surviving RAID.

Here is the system log, which shows the synchronization completing, and then (seven hours later) a disk error nearly concurrent with my initiating the name change. (Reminder: read from bottom up.)

Sadly, as they say "it is what it is".

I continue on.

0 Kudos
idata
Employee
1,617 Views

Update:

For those who may read this thread and are curious... The initialization / resync of the 4 @ 2TB drives in a RAID 5 configuration appears to process at a rate of 20% per day.
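
As a sanity check, 20% per day squares with the resync speed reported earlier, assuming one full sweep of a 2TB member disk:

member_kib = 2_000_398_934_016 // 1024       # one "2 TB" member disk in KiB
print(member_kib / (5 * 86400))              # ~4522 K/sec over a 5-day pass,
                                             # right in line with the earlier 4540K/sec readout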

While this resync process occurs, the drives show RED, which indicates that the RAID would be broken if a drive failed or was removed.

After two days of processing, here is a screenshot:

My intent is to wait until the resync is complete, restart the system, verify that the RAID is valid following that reboot, and then begin the restoration of data.

0 Kudos
idata
Employee
1,617 Views

SUMMARY:

  • Things are not going well; the SS4000-E failed again.

  • Possible defective unit (?)

  • With different drives, the NAS failed in much the same manner.

  • I have two more of these units; I have placed the sets of drives into them to see the outcome.

Here are the details:

The resync of the drives had completed, and things looked good.

The System log showed that the resync completed:

And, the home page showed the expected status:

And, looking at the drives, the NAS appeared fine, and the 4 drives showed Hotplug YELLOW as expected:

And then... For whatever reason, drive # 4 went dark, and I had a Disk Change Notification:

This time, I chose to pull Drive # 4, and reinsert. Disk Change Notification told me it was rebuilding...

However, after rebuilding for some time, Drive # 4 went dark again. Here is the System log:

At this point, since the RAID was still valid on the three remaining drives, and I could access the various partitions, I started testing the Seagate drive (using "SeaTools") to check for any errors or issues. Finding none, I planned on reinserting the drive.

And that is when I found that Drive # 1 was also dark. At that point, only two drives of the RAID remained, and that was the end of that.

At this point, the SAME box had now had two drives go dark in the same slots (# 1 & # 4) with two different sets of drives from two different manufacturers.

Now, if this takes the drives out of the equation, that may leave the SS4000-E box itself as the possible problem.

SO...

Since I have two additional SS4000-E boxes, I decided to pop them out of the cardboard box, and insert the two sets of drives into those two boxes to see the outcome.

They are both currently in the resync process, at around the 40% mark.

Both boxes came with firmware 1.4 v.709... So, I chose to let them run with that, and not upgrade the firmware to v710.

Sidebar:

One thing that caught my attention was that one of the boxes was MaxData branded.

Same menus and function, but a different color scheme. For the curious, here are a few screenshots:

I'll provide an update once the two boxes finish their resync. I hope that I will be able to begin data restoration.

v.

0 Kudos
idata
Employee
1,617 Views

If I could play devil's advocate here for a moment. If after all this struggling you do manage to get this working, would you actually trust it with your data? Personally I'd be cutting my losses and moving over to something like an HP MicroServer and NAS4Free (formerly FreeNAS).

The time you've spent on trying to make this work has long overshadowed the initial cost savings IMO.

0 Kudos
idata
Employee
1,617 Views

emilec wrote:

If I could play devil's advocate here for a moment. If after all this struggling you do manage to get this working, would you actually trust it with your data? Personally I'd be cutting my losses and moving over to something like an HP MicroServer and NAS4Free (formerly FreeNAS).

The time you've spent on trying to make this work has long overshadowed the initial cost savings IMO.

Thanks for the comment. That's a fair question, and one that has crossed my mind. I have built my own server-based NAS before; however, I like the idea of a stand-alone device.

At this point, I would say that I am committed to finding a solution that works. If these two SS4000-E's have any problems once they complete their resync, then it's off to another solution. However, I will see this through and document my results, to find out whether the 2TB drives work in these boxes or not.

0 Kudos
idata
Employee
1,617 Views

OPTIMISTIC UPDATE:

The two "newer" S4000-E's, with the two sets of 2TB drives (Hitachi & Seagate), appear to have completed their resync, and Hotplug Indicators show yellow.

NAS-1:

NAS-4:

After rebooting each unit several times with no drive loss (which was the problem with the first tested SS4000-E), I am again beginning data restoration to NAS-1.

Updates will be provided after each partition is completed.

0 Kudos
idata
Employee
1,617 Views

Minor Info Update:

  • NAS-1 continues to restore.

  • NAS-4 has been cycled several times with no negative effects, and today I changed the name of that box from NAS-4 to NAS-5, also with no negative effect.

Screenshot of that name change:


0 Kudos
BDear
Beginner
1,617 Views

Awaiting the results of your odyssey as I am very curious. I just started to investigate the capabilities & limits of the SS4000-E in anticipation of making a purchase of the device.

0 Kudos
idata
Employee
1,617 Views

So... Weird thing that happened today...

Yesterday, I attempted to stress the solution by restoring two partitions' worth of info concurrently.

My intent was to write to NAS-1 "public" as well as NAS-1 "public-3" at the same time.

Due to my own error, I actually wrote both restores to "public" (...into separate subdirectories... so it should have been no big deal).

I stopped the restore of the data that should have gone to public-3 when I saw my mistake (a few hours later), and I proceeded to copy that data from "public" to "public-3".

Data copy was S L O W ... much slower than the initial network write.

So, after about 30 minutes, I decided to delete that misplaced data... since I knew I had a backup anyway.

Now, TODAY is when the REALLY strange behavior begins...

On trying to access public-3 today, Win7 said that the network resource was not available, or that I had insufficient rights.

Looking at the "home page", I saw that the NAS was ... confused.

Apparently one partition (the smallest, public-3) was likely gone, another partition seemed to think that it was for backups... and the overall status was "NOT READY"

This was troubling, as I had seen this type of behavior on the first box that I had attempted to test.

At that point, I tried a reboot of the box, but on restart, the status was the same.

Checking the partitions, I saw that "public-3" had decided that it was "0" MB in size:

However, the disks in the RAID were still good... and the Hotplug indicator was YELLOW (which is good)

Thinking that the NAS could be doing something in the background, I checked the System Status page... and the CPU was idle.

And... the system log showed no errors.

So: I'll give the SS4000-E the benefit of the doubt on this one...

I deleted partitions "public-2" and "public-3", recreated them and assigned rights, and continue on with the restoration of the "public" partition.

Now the home page shows what we expect to be normal:

Bottom line: Working, and still restoring, but that disappearing partition is more than a little disconcerting.

0 Kudos
idata
Employee
1,617 Views

Informational Update:

Completed restoring NAS-1/public with 1.88TB of data on a 1.99TB partition.

All appears well. Now beginning the restore of NAS-1/public-2

0 Kudos
idata
Employee
1,617 Views

OK... Now it looks like we are getting to a real problem, one that appears to be reproducible.

Reminder: My SS4000-E has 4 @ 2TB drives installed in a RAID 5 config, resulting in 5.5TB (5587 GB) of available space. This requires the creation of three partitions: public (created by default; my choice to size it to 2048 GB), public-2 (created by choice at 2048 GB), and public-3 (created by choice with the remaining space).

So far, NAS-1/public has restored well.

I paused restoring NAS-1/public-2 and decided to let NAS-1/public-3 restore for a while.

That is when the problem became evident.

When I began to restore NAS-1/public-3, I saw that the speed of transfer was extremely slow.

Here is what that looked like:

The picture above shows a very slow transfer; compare that with the screenshot of NAS-1/public-2 below, showing the expected transfer speed...

Also disturbing was finding that the system log no longer had a complete record, but appeared to start over:

Knowing that the running restore of NAS-1/public-3 might take weeks at the speed displayed, I attempted to abort the restore, finally pulling the Ethernet cable to force a loss of the network resource.

Having stopped the restore, I reconnected the Ethernet cable to the NAS. All partitions were still there, and all physical drives were still indicating YELLOW in the RAID configuration.

On reboot of the device, the home screen showed a screen that we have seen before:

A partition that thinks that it is shared, a partition that thinks that it is a backup, and a partition that is GONE.

(... and, yes, "public-3" again was at 0 bytes.)

In this case, I was able to again delete NAS-1/public-3, and then public and public-2 "came back" and the system again was ready.

The good news is that the physical drives remain in the RAID, and the RAID remains valid.

Again, I continued with the previously interrupted restore of NAS-1/public-2, with apparently no problem.

Here is the "new" home window, showing the space occupied by public and public-2:

SO... HERE ARE MY QUESTIONS FOR THE INTEL SUPPORT TEAM:

  • Why do I appear to have such a problem with the creation of "public-3", the partition that contains the remainder of the available drive space (approx. 1.5TB)?

  • What affects that partition so that write speeds are so low?

  • Are there any suggestions for the size of that third partition? Does that even matter?

  • Are there any limitations in the SS4000-E firmware that affect the creation of that partition? Is there a problem in exceeding 4096 (2048x2) of total partition space? (My back-of-envelope arithmetic on this is sketched below.)

I await your input.
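
While I wait, for what it's worth, here is the back-of-envelope arithmetic behind that last question. This is pure speculation on my part about where a firmware limit might sit; nothing in the documentation confirms it:

SECTOR = 512                       # bytes per sector on these drives
TIB = 1024**4

# The classic 32-bit sector-count boundary: 2^32 sectors of 512 bytes.
print(2**32 * SECTOR / TIB)        # 2.0 TiB -- 32-bit sector math overflows past this

# My first two partitions together span 2 x 2048 GiB of the array,
# so "public-3" is the only partition that sits entirely beyond it.
print(2 * 2048 * 1024**3 / TIB)    # 4.0 TiB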

While waiting, I continue on with the restoration of NAS-1/public-2

v.

0 Kudos
idata
Employee
1,617 Views

Sadly, the SS4000-E has again suffered a critical error.

Unless the support folks at Intel can provide a reason why, this may be the end of the pursuit of having the SS4000-E work w/ 2TB drives.

System XRAY output file available at http://dl.dropbox.com/u/23866842/Apr_22_xray.tgz

SUMMARY:

NAS-1/public-2 was restoring, and continued to restore.

However, there was a Windows message informing me that NAS-1/public was no longer available as a resource.

On checking Windows network resources, I could access the data on NAS-1/public-2, but I could not access NAS-1/public.

On checking the SS4000-E, Drive # 1 was dark.

On logging in, there was a Disk Change Notification message stating that Drive # 1 was no longer active and the RAID was degraded (3 of 4 drives functional).

Removing and reinserting the drive begins the rebuild process; however, NAS-1/public remains unavailable.

DETAILS:

While I did not initiate any change, here is the Disk Change Notification:

After removing the drive and reinserting, I received a rebuilding update:

If the time message is correct, the rebuild process for that drive will take over 4 days.

However, the rebuild process may not be of much value... as after reinserting the drive there is still no access to NAS-1/public:

("admin" and "public-2" can be accessed.)

After reinserting the drive, selecting [ Continue ] on the Disk Change Notification screen would NOT allow me to proceed to the Home screen, so I could not tell if the "public" partition was still there.

I was able to run the Intel XRAY diagnostic program built into the SS4000-E. The output is available for anyone to view at: http://dl.dropbox.com/u/23866842/Apr_22_xray.tgz

While I am unfamiliar with all the info that could be reviewed in this data, on checking the MESSAGES file I found the record of the disk being shut down by the SS4000-E. This looks like the system failing and choosing to shut down the drive, rather than a physical drive failure:

Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.err kernel: drivers/scsi/gd31244/drv/gd31244_lld.c# 2201:gd31244_device_reset: Dev Reset 0:0:0:0, dev# 0: Success
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.err kernel: drivers/scsi/gd31244/drv/gd31244_lld.c# 2218:gd31244_device_reset: reconfigure device # 0 failed
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.err kernel: drivers/scsi/gd31244/drv/gd31244_lld.c# 2112:gd31244_bus_reset: Bus Reset called for Ho:Ch:Tgt:Lun (0:0:0:0)
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.err kernel: drivers/scsi/gd31244/drv/gd31244_lld.c# 2201:gd31244_device_reset: Dev Reset 0:0:0:0, dev# 0: Success
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.err kernel: drivers/scsi/gd31244/drv/gd31244_lld.c# 2218:gd31244_device_reset: reconfigure device # 0 failed
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.info kernel: scsi: Device offlined - not ready after error recovery: host 0 channel 0 id 0 lun 0
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.warn kernel: SCSI error : <0 0 0 0> return code = 0x10000
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.warn kernel: end_request: I/O error, dev sda, sector 3907029008
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.err kernel: scsi0 (0:0): rejecting I/O to offline device
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.alert kernel: raid1: Disk failure on sda1, disabling device.
Apr 22 06:44:22 ZEBRAITIS-NAS-1 user.warn kernel: ^IOperation continuing on 3 devices
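
For anyone who wants to sift their own XRay bundle for this signature, here is a minimal sketch (mine, not Intel's). I am assuming the tarball contains a syslog-style "messages" file; list the archive contents first and adjust the name as needed:

import re
import tarfile

# Pull syslog-style lines out of the XRay bundle and keep the
# disk reset / offline / failure events like those above.
with tarfile.open("xray.tgz", "r:gz") as tgz:
    for member in tgz.getmembers():
        if not (member.isfile() and "messages" in member.name):
            continue
        data = tgz.extractfile(member).read().decode("utf-8", errors="replace")
        for line in data.splitlines():
            if re.search(r"device_reset|offline|I/O error|Disk failure", line):
                print(line)

# The failing sector, converted to a byte offset:
print(3907029008 * 512)            # 2,000,398,852,096 bytes -- about 80 KiB short
                                   # of a 2 TB drive's full capacity

Make of that what you will, but an I/O error that lands at the very tail of the disk at least smells more like an addressing limit than bad media; again, that is speculation on my part.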

Strictly out of curiosity, I will be letting the rebuild continue, just to see what will happen and to see the status of the "public" partition.


Admittedly, I have spent nearly a month working on this, and considering these issues continue on more than one box... Well, I'm nearly at the end.

I would very much like the Intel support team to look at the XRAY output and let me know what's going on.

0 Kudos
idata
Employee
1,448 Views

RAID FAILURE

Turning off the SS4000-E and restarting resulted in the failure of the RAID.

Even though three drives remained, and the RAID and data should have been secure, it completely failed, requiring an initialization of the drives to continue in any manner.

At this point, there is no sense in making any other posts unless the Intel support team provides guidance.

0 Kudos