Intel SS4200-E - 4 drives failed after power outage - help me get my data back

idata · ‎12-12-2011

I've got an Intel easynas 4200-E with 4 1.5TB disks in it in a RAID5.

Today we had a power outage, but the NAS stayed on during the brief outage, as it was connected to a UPS.

When I came and checked my computer, the NAS had an Orange button in the middle of the blue drive buttons and I was unable to access the console. It was locked up and no lights were flashing.

I finally pushed the button on the front to get it to shut down. After a bit it shut itself down, and when it rebooted, all four drive lights flashed briefly and then all lights turned orange.

The console tells me all 4 drives have data on them that needs to be overwritten, and wants me to click a link to authorize overwriting the data.

I looked on the web and found several people have had this issue. I have tried unplugging each drive one at a time and powering up, but with each drive removed it just says my data is unavailable. Each of the drives shows up when it is plugged in, so I don't think any of them is bad; at least there isn't one obviously bad drive.

I have also tried the factory reset as listed in my guide, but that didn't do anything either.

HELP! I have a bunch of digital pictures on there that I can't lose. I had set up a second Intel NAS and was in the middle of mirroring them when this happened. The other NAS is running fine, but it only has about 1 TB of the 3.2 TB total.

Any ideas will help, I'm desperate.

Thanks.

Matt

idata · ‎12-12-2011

Matt,

You say you were "unable to access the console" after the power outage. Using the SS4200 Entry Storage System manager, did it simply not connect, or did it give an error message of some kind?

As you probably know, the amber Power/Status Pushbutton LED indicates a critical or non-recoverable condition has occurred. The amber disk LEDs mean a disk drive fault has occurred.

The "data" the console reports may very well be your data that has become corrupted in a way the system can't repair.

On every boot, the Storage System software does a file system check. This will replay the data in the ext3 logs, correcting any file system issues that may have been caused by an improper shutdown. If this fails a full file system check is performed. The full check will make the file system mountable if possible. If the file system check doesn't recover the partition, it's likely pretty serious.

The first option you should use is the EMC storage console "Recover Disks" option. Recover Disks will attempt to recreate the array on the existing drives, preserving user data. It requires all the original drives to be present and functional.

You can find the "Recover Disks" option with the Storage System manager. Log in if necessary, then select the "Settings" tab and replace settings.html?t….. in the browser address bar with support.html.

If that doesn't work, there are ways to access the system from a command line, but doing so you're accessing as the root user and a mistake can be dangerous.

I suggest contacting the Intel support center; they can help you through checks and possible solutions that I can't through this medium. You can find contact information at the http://www.intel.com/p/en_US/support/contact/phone GLOBAL PHONE SUPPORT link.

Regards,

John

idata · ‎12-12-2011

John,

Thanks for the quick reply. I am at my wits' end and really nervous about the data on the disks.

When I say I initially could not connect to the console, I mean when I tried to hit the IP address in the browser it eventually timed out. When I used the Intel Entry Storage System manager, it just said the NAS device was unavailable and showed the red X on it.

The thing I can't figure out is that the NAS never shut down. It is on a UPS and I heard it beeping, and came in the office and the lights were still on and everything. All four drive lights were blue, and the center button had turned orange. The power was only out for about 6 minutes and then was back on, so the UPS never shut down the NAS.

So after the console didn't connect, I waited a while to see if it was rebuilding the RAID or doing maintenance. It wasn't flashing and didn't appear to be doing anything. This has happened in the past; the performance on these degrades over time and you sometimes have to restart them. I pushed the power button on the front and waited for it to shut down.

It did shut itself down cleanly, and then I pushed the button again. All four disk lights flashed blue, the center light flashed blue, and then as it was spinning up they flashed sort of blue/orange, and then after a bit, all lights went solid orange: the center and all four drive lights.

Just prior to this the drive lights had all been blue. I highly doubt all four drives all at once went out.

After the machine booted, I was able to get into the console just fine, and it shows all four disks, but it says that all four disks contain data that needs to be erased, and has a link asking me to authorize overwriting this data. I haven't clicked it, nor will I.

I will try your suggestion. If you want to tell me what to try through the console at root level, I'm comfortable doing that. I've enabled SSH and have looked around at it with PuTTY. I have read some forum posts saying you can pull the drives, put them in a Linux box, and mount them as an LVM volume group; I am just not sure in what order. I'd assume they are mounted in this volume group as sda sdb sdc sdd, but who knows. I couldn't see in the terminal where they were mounted and didn't poke around too much.

Thanks for any advice you can give me. This is the second time I've had issues with this. I have run it for a year and it's been perfect. The same thing happened to me last year and I got all my files backed up to other places. Then I had a failure of the other device, so I got another one of these and was in the middle of copying everything over when this one died. I am lost here. :-(

Thanks for your help. I will wait for your reply and then call Intel support in a day or two.

matt

idata · ‎12-12-2011

actually John, I just went to the intelnas/support.html and there are only two options:

- support access

- dump files

support access only has one option: enable ssh support

dump files says it can't dump because of the state of the disks.

It's possible Recover Disks isn't there because I have one of the disks unplugged; you said they all have to be there. There are tons of posts saying that this machine can't really deal well with a failed disk and sometimes will report one failed disk as all failed, and to pull one disk at a time and power it on and see if it'll mount up. I'll plug it back in and see what happens.

Thanks again for the reply. If we get the data back I'm sending you to dinner somewhere. I'm in a full. blown. panic here. There was literally one day when I couldn't afford a failure, and it happened on that one day.

idata · ‎12-17-2011

John,

I got the RAID successfully mounted with mdadm. There are three working drives, and the fourth is faulted; I believe it has totally failed. I pulled that drive and tried cloning it with dd to a good drive and got read errors.

So I mounted the array with mdadm --assemble --assume-clean --raid-level=5 --devices=4 /dev/md0 /dev/sdd1 /dev/sdb1 missing /dev/sda1

(I had a previously failed drive in this machine that it had rebuilt itself, so the drives got reordered in the RAID.)

mdadm --examine on all four disks showed them to be

1 /dev/sdd1

2 /dev/sdb1

3 FAULTED

4 SPARE / /dev/sda1

The spare, /dev/sda1, showed clean and its superblocks appeared to be good, so I mounted the array using assume clean. It mounted successfully and it shows a clean, degraded status.

However, I still can't access the file system. One of the posts in the other threads said there is a logical volume on TOP of the array. Is that true?

The reason it's confusing is that I mounted the array /dev/md0 in /mnt/soho_storage

but in some of the documentation, it references running a filesystem check at location /dev/evms/md0vol1

It also wants me to mount the file system from /dev/evms/md0vol1 into /mnt/soho_storage -t ext3

I can't find the logical volume commands on the box. Is the order RAID first, then the logical volume on top of the RAID, then the ext3 file system?

Thanks for your help.

matt

idata · ‎12-19-2011

Matt,

The Intel® Entry Storage System SS4200-E operating system volume is mounted as /dev/hda1 and is a Linux ext2 partition. hda1 is a 256MB DOM (Disk on Module) installed in the motherboard IDE port.

The data is in an ext3 journaled file system on an lvm2 logical volume on an md raid (RAID 1 for two disks and RAID 5 for four disks) array mounted as /dev/evms/md0vol1. The array consists of a partition from each of the disks. There is only one partition on each disk that spans the entire disk.

The operating system and data partitions use a software RAID 32K Byte chunk size.

The file system block size on the operating system partition (hda1) is 1024 Bytes.

The file system block size on the data partition (md0vol1) is 4096 Bytes.
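To illustrate why this layering matters when mounting by hand: if a volume-manager layer sits between the md array and ext3, the ext3 superblock will not be found at the start of /dev/md0 itself. Below is a minimal Python sketch that looks for two well-known on-disk signatures: the ext2/ext3 magic number 0xEF53 at byte offset 1080, and the LVM2 "LABELONE" label at the start of one of the first four 512-byte sectors. The function name probe_signature is hypothetical, and EVMS-managed volumes may carry metadata that differs from plain LVM2, so an "unknown" result should be treated as inconclusive rather than as proof of anything.

```python
import io
import struct

SECTOR = 512
EXT_MAGIC_OFFSET = 1024 + 56   # superblock starts at byte 1024; s_magic is at +56
EXT_MAGIC = 0xEF53
LVM2_LABEL = b"LABELONE"


def probe_signature(dev):
    """Return 'ext', 'lvm2', or 'unknown' for a seekable binary file object."""
    dev.seek(0)
    head = dev.read(4 * SECTOR)
    # LVM2 writes its label at the start of one of the first four sectors.
    for sector in range(4):
        start = sector * SECTOR
        if head[start:start + len(LVM2_LABEL)] == LVM2_LABEL:
            return "lvm2"
    # ext2/ext3 store a little-endian 16-bit magic inside the superblock.
    if len(head) >= EXT_MAGIC_OFFSET + 2:
        (magic,) = struct.unpack_from("<H", head, EXT_MAGIC_OFFSET)
        if magic == EXT_MAGIC:
            return "ext"
    return "unknown"


# Synthetic images standing in for real devices (no hardware is touched).
ext_image = bytearray(4 * SECTOR)
struct.pack_into("<H", ext_image, EXT_MAGIC_OFFSET, EXT_MAGIC)

lvm_image = bytearray(4 * SECTOR)
lvm_image[SECTOR:SECTOR + len(LVM2_LABEL)] = LVM2_LABEL   # label in sector 1

blank_image = bytes(4 * SECTOR)

print(probe_signature(io.BytesIO(bytes(ext_image))))
print(probe_signature(io.BytesIO(bytes(lvm_image))))
print(probe_signature(io.BytesIO(blank_image)))
```

On a Linux recovery box the same function could be pointed at a device opened read-only as root, for example open("/dev/md0", "rb"). It only reads the first 2 KiB and never writes.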

Did you check to see if the UUIDs were the same for all disks by using mdadm --examine /dev/sdX1 (where X is the specific disk you want to see: a, b, c or d)?

When you do the mdadm --examine or --detail, what's the "State : " of the array?
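The UUID and state checks above can be done by eye, but a small script helps when comparing four disks. The following Python sketch parses saved mdadm --examine text (it does not run mdadm itself) and reports whether the members agree on UUID and which ones lag on event count. The function names and the sample UUID are hypothetical; the two event counts are the ones quoted in an mdadm message later in this thread. It assumes the older 0.90-style "Events : hi.lo" form is the high and low 32-bit halves of a single counter.

```python
import re


def parse_examine(text):
    """Extract 'Key : value' fields from mdadm --examine output into a dict."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Za-z ]+?)\s*:\s*(.+?)\s*$", line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def event_count(fields):
    """Convert the Events field to an integer (accepts 'hi.lo' or a plain number)."""
    raw = fields["Events"]
    if "." in raw:
        # Assumption: 0.90 metadata prints the counter as two 32-bit halves.
        hi, lo = raw.split(".")
        return (int(hi) << 32) + int(lo)
    return int(raw)


def compare_members(outputs):
    """outputs maps device name -> examine text; returns (uuids, stale_devices)."""
    parsed = {dev: parse_examine(text) for dev, text in outputs.items()}
    uuids = {fields["UUID"] for fields in parsed.values()}
    newest = max(event_count(fields) for fields in parsed.values())
    stale = sorted(dev for dev, fields in parsed.items()
                   if event_count(fields) < newest)
    return uuids, stale


# Hypothetical sample text; only the event counts come from this thread.
sample = {
    "/dev/sdb1": "UUID : aaaa:bbbb\nState : clean\nEvents : 0.3068556\n",
    "/dev/sdd1": "UUID : aaaa:bbbb\nState : clean\nEvents : 0.3068551\n",
}

uuids, stale = compare_members(sample)
print("UUIDs seen:", uuids)
print("Members behind on events:", stale)
```

More than one UUID in the result means the disks do not all belong to the same array; a non-empty stale list names members whose metadata is older than the newest one.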

Did you try the echo "recover 1" > /tmp/dm.fifo.in where "1" in the command line after "recover" corresponds to the drive in the array (0-3) that is foreign or faulted? I wouldn't do this if the array is in a rebuilding or recovering state.

John

idata · ‎12-29-2011

John,

Thanks for getting back to me, and sorry for the delays. With travel, the holidays, work etc. I've been swamped and haven't had a chance to log in and update.

I did run mdadm --detail /dev/md0

It told me that there was no mdadm device mounted/available, so I ran mdadm --examine on all four disks, /dev/sd[a-d]1

It showed me four devices, one of them faulted. Three of the devices agreed that drive 3 [/dev/sdc] was faulted and not in sync. Interestingly, /dev/sdc initially displayed as OK. The UUID and event IDs were identical on all four disks. All four devices also showed as clean, but I believe /dev/sdc1 was confused. When I ran mdadm --examine /dev/sdc1 the output was:

# mdadm: cannot open /dev/sdc1: no such device or address

leading me to believe /dev/sdc1 was/is completely TU.

Running # mdadm --examine /dev/sdd1 returned:

UUID ee2cbd45:c8d6ccf0:6b4c268d:2d266a5c [Same UUID shows on all remaining disks]

Raid Level : raid5

raid devices : 4

Total Devices : 4

State: clean

Active Devices : 2

Working Devices : 3

Failed Devices : 1

Spare Devices : 1

Checksum : ec81c762 - correct [All three functional devices showed correct on checksum]

Number Major Minor RaidDevice State

this 2 8 49 2 active sync /dev/sdd1

0 0 0 0 0 removed

1 1 8 17 1 active sync /dev/sdb1

2 2 8 49 2 active sync /dev/sdd1

3 3 0 0 3 faulty removed

4 4 8 1 4 spare /dev/sda1

When I used the sheet that I got that showed the steps to try and recover the array, I worked through them and kept getting errors when trying to fsck the array or even mount it.

I got another disk, removed all partitions and replaced the third disk in the array, in hopes that it would rebuild the array on its own. It didn't; it just came back and said all four disks had data that needed to be overwritten.

I then took the new disk and the original /dev/sdc1 device and put them both in a Linux box. The original appeared to spin up; I could see the disk, and fdisk saw its data. I used dd to mirror the original /dev/sdc1 disk to the new hard drive. It kept erroring out, so I finally killed the process and wiped the new disk again.

The last thing I did was use the suggestion at the end of that sheet you sent me about assembling the array using MISSING. So I ran this command:

# mdadm --create /dev/md0 --assume-clean --level=5 --chunk=32 --raid-devices=4 /dev/sdb1 /dev/sdd1 missing /dev/sda1

the return was:

mdadm: /dev/sdd1 appears to be part of a raid array:

mdadm: /dev/sdb1 appears to be part of a raid array:

mdadm: /dev/sda1 appears to be part of a raid array:

Continue creating array? y

mdadm: array /dev/md0 started

I then ran:

# mount /dev/evms/md0vol1 /mnt/soho_storage -t ext3

which returned:

mount: mounting /dev/evms/md0vol1 on /mnt/soho_storage failed: No such file or directory

I then ran:

# mount /dev/md0 /mnt/soho_storage -t ext3

which returned:

mount: mounting /dev/md0 on /mnt/soho_storage failed: invalid argument

I then ran:

# mdadm --examine /dev/md0

which returned:

mdadm: No md superblock detected on /dev/md0

I then ran:

# mdadm --detail /dev/md0

which returns:

/dev/md0:

Version : 00.90.03

Creation Time : Fri Dec 16 14:56:00 2011

Raid Level : raid5

Array Size : 4395407808 (4191.79 GiB 4500.90 GB)

Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)

Raid Devices : 4

Total Devices : 3

Preferred Minor : 0

Persistence : Superblock is persistent

Update Time : Fri Dec 16 14:56:00 2011

State : clean, degraded

Active Devices : 3

Working Devices : 3

Failed Devices : 0

Spare Devices : 0

Layout : left-symmetric

Chunk Size : 32K

UUID : df094012:dc75835b:426b5a71:6591e0c2

Events : 0.1

Number Major Minor RaidDevice State

0 8 17 0 active sync /dev/sdb1

1 8 49 1 active sync /dev/sdd1

2 0 0 2 removed

3 8 1 3 active sync /dev/sda1
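As a sanity check on the sizes reported above: for RAID5, usable capacity is (number of RAID devices minus 1) times the per-device size, because one device's worth of space in total is taken by parity. A short Python sketch, assuming mdadm reports both sizes in 1 KiB blocks (which is consistent with the GiB and GB figures it prints next to them); the function name is hypothetical:

```python
def raid5_capacity_kib(raid_devices, used_dev_size_kib):
    """Usable RAID5 capacity: one device's worth of space holds parity."""
    if raid_devices < 2:
        raise ValueError("need at least two RAID devices")
    return (raid_devices - 1) * used_dev_size_kib


# Figures taken from the mdadm --detail output in this thread.
array_kib = raid5_capacity_kib(raid_devices=4, used_dev_size_kib=1465135936)

print(array_kib)                               # KiB
print(round(array_kib / 1024**2, 2))           # GiB
print(round(array_kib * 1024 / 1000**3, 2))    # GB
```

The result matches the reported Array Size of 4395407808. Note that the size is the same whether or not the array is degraded; a missing member reduces redundancy, not capacity.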

And then I got frustrated and decided to power off and think about it before I did anything else. That's where we sit.

I powered it up today, and launched the Intel Easy Storage Console to power it down and replace the new disk with the original. I notice that the "disks" section of the console says:

"one of the disks has been replaced"

if I go into the "manage disks" section,

and then go to "Data Protection"

there is a link for RAID 1 and RAID 5.

if I click add disk to storage, there is a button and it says

"click next to recreate data protection using newly inserted disks"

I am not sure if that will rebuild the array WITH my data, or give me a brand new array WITHOUT my data. It feels like it's going to build a new array with no data on it, so I exited out and will wait to see if you have any better ideas.

If I run mdadm --examine and --detail again,

this is what I get:

# mdadm --examine /dev/sdc1

mdadm: cannot open /dev/sdc1: No such device or address

# mdadm --detail /dev/md0

/dev/md0:

Version : 00.90.03

Creation Time : Fri Dec 16 14:56:00 2011

Raid Level : raid5

Array Size : 4395407808 (4191.79 GiB 4500.90 GB)

Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)

Raid Devices : 4

Total Devices : 3

Preferred Minor : 0

Persistence : Superblock is persistent

Update Time : Fri Dec 16 14:56:00 2011

State : clean, degraded

Active Devices : 3

Working Devices : 3

Failed Devices : 0

Spar...

idata · ‎01-02-2012

Matt,

Reset Data Protection will clean the drives and rebuild the array, destroying any existing data on the drives.

The Recover Disks option will attempt to recreate the array on the existing drives, preserving user data. It requires all the original drives to be present and functional.

John

MBarb9 · ‎03-07-2015

# mdadm --examine /dev/sda1

/dev/sda1:

Magic : a92b4efc

Version : 00.90.00

UUID : 7fb24cc8:674589b6:5609b484:a2371a97

Creation Time : Sun Jun 9 19:20:05 2013

Raid Level : raid5

Used Dev Size : 1465135936 (1397.26 GiB 1500.30 GB)

Array Size : 4395407808 (4191.79 GiB 4500.90 GB)

Raid Devices : 4

Total Devices : 4

Preferred Minor : 0

Update Time : Sat Mar 7 06:34:58 2015

State : clean

Active Devices : 2

Working Devices : 3

Failed Devices : 1

Spare Devices : 1

Checksum : 86c69707 - correct

Events : 0.3068556

Layout : left-asymmetric

Chunk Size : 32K

Number Major Minor RaidDevice State

0 0 0 0 removed

1 254 1 1 active sync /dev/evms/.nodes/sdb1

2 0 0 2 removed

3 254 2 3 active sync /dev/evms/.nodes/sdd1

4 254 3 - spare /dev/evms/.nodes/sdc1

5 254 0 - faulty spare /dev/evms/.nodes/sda1
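Note the "Layout : left-asymmetric" line above, whereas the array re-created with mdadm --create earlier in this thread reports left-symmetric. The two layouts rotate parity the same way but place data chunks differently, so an array created with a layout different from the one the data was written with would present scrambled contents even if the device order and chunk size are right. Below is a Python sketch of the two mappings as the Linux md RAID5 driver implements them, to the best of my understanding. The function names are hypothetical, and whether a layout mismatch explains any failure in this thread is not established here.

```python
def parity_disk(stripe, raid_disks):
    """Both 'left' layouts start parity on the last disk and rotate toward disk 0."""
    return (raid_disks - 1) - (stripe % raid_disks)


def data_disk(stripe, chunk_in_stripe, raid_disks, layout):
    """Map the Nth data chunk of a stripe to a physical disk index."""
    pd = parity_disk(stripe, raid_disks)
    if layout == "left-asymmetric":
        # Data fills disks in order 0, 1, 2, ... skipping over the parity disk.
        return chunk_in_stripe + 1 if chunk_in_stripe >= pd else chunk_in_stripe
    if layout == "left-symmetric":
        # Data starts on the disk after parity and wraps around.
        return (pd + 1 + chunk_in_stripe) % raid_disks
    raise ValueError(f"unsupported layout: {layout}")


def stripe_map(stripe, raid_disks, layout):
    """Return one row of the layout: 'P' for parity, 'Dn' for data chunk n."""
    row = ["P"] * raid_disks
    data_per_stripe = raid_disks - 1
    for n in range(data_per_stripe):
        disk = data_disk(stripe, n, raid_disks, layout)
        row[disk] = f"D{stripe * data_per_stripe + n}"
    return row


for layout in ("left-asymmetric", "left-symmetric"):
    print(layout)
    for stripe in range(4):
        print("  ", stripe_map(stripe, 4, layout))
```

With four disks the first stripe is identical in both layouts; they diverge from the second stripe on, which is why a wrong layout can look plausible at the very start of the device and still fail to mount.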

# mdadm --assemble --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

mdadm: forcing event count in /dev/sdd1(0) from 3068551 upto 3068556

mdadm: failed to RUN_ARRAY /dev/md0: Input/output error

# mdadm --assemble --run --force /dev/md0 /dev/sda1 /dev/sdb1 /dev/sdc1 /dev/sdd1

mdadm: device /dev/md0 already active - cannot assemble it

...so how do I get this temporarily working again so I can pull my data off it? The recover disks option worked twice out of a great many attempts, but lasted about 30 minutes before the problem repeated. Do I have to replace drive A first for this to work? Really don't want to spend money on a 1.5 TB drive when I was planning on replacing this array in the next few months.

I know it's a REALLY old thread but really hoping someone has an answer.

# mdadm --stop /dev/md0

mdadm: fail to stop array /dev/md0: Device or resource busy

I was told to stop it and then rebuild it, but it won't let me. Ideas?

Allan_J_Intel1 · ‎03-09-2015

I have moved your post to the server team at:

Allan.

AChan85 · ‎03-09-2015

Matt,

Sorry to hear about your loss. Hope you can recover some data. A lot of hard work.

This exposes some of the risks we take. Linux is still not very common (compared with Windows). And it is hard to get help. NAS should be simpler, but in this case it is not.

I run WS2012, so not much help here....

David_A_Intel · ‎03-10-2015

In the last entry it seems that 3 out of the 4 drives are still in sync, so there might be a chance to attempt data recovery. As always, if the data in the unit is sensitive for your business, Intel recommends taking the unit to a data recovery center (http://www.intel.com/support/motherboards/server/sb/CS-025645.htm planning for the worst case).
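The reason three of four members can be enough is that RAID5 parity is the XOR of the data chunks in each stripe, so any single missing chunk can be recomputed from the others. A toy Python illustration with made-up 4-byte chunks (the arrays discussed in this thread use 32K chunks); the function name is hypothetical:

```python
from functools import reduce


def xor_chunks(chunks):
    """XOR equal-length byte strings together, byte by byte."""
    if len({len(chunk) for chunk in chunks}) != 1:
        raise ValueError("chunks must be non-empty and all the same length")
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


# Three data chunks of one stripe (made-up values).
d0 = b"\x01\x02\x03\x04"
d1 = b"\x10\x20\x30\x40"
d2 = b"\xaa\xbb\xcc\xdd"

# Parity as it would be stored on the fourth member.
parity = xor_chunks([d0, d1, d2])

# Pretend the member holding d1 is gone: rebuild it from the survivors.
rebuilt_d1 = xor_chunks([d0, d2, parity])
print(rebuilt_d1 == d1)
```

This only works while at most one chunk per stripe is missing or out of date, which is why a degraded RAID5 array has no remaining redundancy.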

Our http://www.intel.com/support/motherboards/server/ss4200-e/sb/CS-033948.htm Intel® Entry Storage System SS4200-E Array Recovery Procedures document lists examples of the possible options to attempt to recover the array.

For this process, I would recommend contacting our http://www.intel.com/p/en_US/support/contactsupport Intel Customer Support team for proper follow up of this transaction.

idata · ‎03-10-2015

This isn't going to help you any, unfortunately, but just some advice: IMHO, never run RAID 5 (or any RAID set that has a single point of failure) with any data you care about. I've found that recovering from a single drive failure more often than not leads to a second drive failure during the resilvering. These types of issues are one of the reasons I've moved to software RAID 10 sets. I wish you luck with your data recovery; I know your pain.

Ronny_G_Intel · ‎03-10-2015

Hi All,

I am moving this conversation over to the Servers Community.

Thanks,

AChan85 · ‎03-10-2015

This is not really true.

I've had the same RAID 5 since 2008. Never lost any data due to RAID 5. The worst time was when Seagate 7200.11 drives had so many problems. All my HDs are 7200.11. I did have two HDs die, but at different times. Otherwise all the HDs have been running fine for 5 years.

No matter what, always have backup. Anything can happen. There can be fire, theft, flood, or power surge like this...