It was bound to happen. After 4+ years of running multiple NAS units 24x7, I finally ended up in a situation that brought my data availability to a complete halt. Even though I perform a RAID rebuild as part of every NAS evaluation, I had never needed to do one in the course of regular usage. On two occasions (once with a Seagate Barracuda 1 TB drive in a Netgear NV+ v2 and another time with a Samsung Spinpoint 1 TB drive in a QNAP TS-659 Pro II), the NAS UI complained about increasing reallocated sector counts on the drive, and I promptly backed up the data and reinitialized the units with new drives.

Failure Symptoms

I woke up last Saturday morning to incessant beeping from the recently commissioned Synology DS414j. All four HDD lights were blinking furiously and the status light was glowing orange. The unit's web UI was inaccessible. Left with no other option, I powered down the unit with a long press of the front panel power button and restarted it. This time around, the web UI was accessible, but I was presented with the dreaded message that there were no hard drives in the unit.

The NAS, as per its intended usage scenario, had been only very lightly loaded in terms of network and disk traffic. However, in the interest of full disclosure, I have to note that the unit had been used against Synology's directions with respect to hot-swapping during the review process. The unit doesn't support hot-swapping, but we tested it and found that it worked. That said, the drives used for long-term testing were never hot-swapped.

Data Availability at Stake

In my original DS414j review, I had indicated its suitability as a backup NAS. After prolonged usage, it was re-purposed slightly. The Cloud Station and related packages were uninstalled, as they simply refused to let the disks go to sleep. Instead, I created a shared folder for storing data and mapped it on a Windows 8.1 VM running in the QNAP TS-451 NAS (currently under evaluation). By configuring that shared folder as the local path for QSync (QNAP's Dropbox-like package), I intended to have any data uploaded to the DS414j's shared folder backed up in real time to the TS-451's QSync folder (and vice versa). The net result was that I expected data to be backed up irrespective of whether I uploaded it to the TS-451 or the DS414j. Almost all the data I was storing on the NAS units at that time was being generated by benchmark runs for various reviews in progress.

My first task after seeing the 'hard disk not present' message on the DS414j web page was to ensure that my data backup on the QNAP TS-451 was up to date. I had copied over some results to the DS414j on Friday afternoon, but, to my consternation, I found that QSync had failed me. The updates made to the mapped Samba share hadn't been reflected in the QSync folder on the TS-451 (the last version appeared to be from Thursday night, which leads me to suspect that QSync either wasn't doing real-time monitoring / updates, or was not recognizing updates made to a monitored folder from another machine). In any case, I had apparently lost a day's worth of data (mainly machine time).
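In hindsight, a quick cross-check of the two copies would have flagged the stale backup well before any hardware failed. Below is a minimal sketch of such a check using rsync in dry-run mode; the mount points and folder names are hypothetical, and it assumes both shared folders are reachable from a Linux machine (or from one of the NAS units over SSH).

    #!/bin/sh
    # Compare the DS414j shared folder against its QSync copy on the TS-451
    # without transferring anything. Paths are hypothetical mount points.
    SRC=/mnt/ds414j/benchmarks/
    DST=/mnt/ts451/Qsync/benchmarks/

    # -r recurse, -c compare checksums instead of timestamps and sizes,
    # -n dry run (report only), -v list the files that differ.
    rsync -rcnv "$SRC" "$DST"

An empty file list means the two copies match; any files listed are missing or different on the destination, which is exactly the situation QSync had silently left me in.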

Data Recovery - Evaluating Software & Hardware Options
Comments

  • zodiacfml - Saturday, August 23, 2014 - link

    Still that hard to restore files from a NAS? Vendors should develop a better way.
  • colecrowder - Saturday, August 23, 2014 - link

    At work we recently had 2 drives error on a Synology RAID 5. It wasn't quite total failure of two drives, but it crashed the volume. It's a 30+ TB system for our film digitization business, 13-disk RAID (1812+ and 513 expansion) and we've tried just about everything to recover, to no avail. UFS didn't help, an expert in the Ubuntu method we hired couldn't fix it either. Lesson learned: back everything up every night! Did find this useful guide for situations like ours, though:

    http://community.spiceworks.com/how_to/show/24731-...
  • ganeshts - Saturday, August 23, 2014 - link

    Shame about the lost data, but the link is definitely interesting.

    In the case of drive failures within accepted limits (1 for RAID-5, 2 for RAID-6), the NAS itself should be able to rebuild the array. If more drives are lost, RAID rebuild software can't help, since data is actually missing and there is no parity from which to recover it.

    That said, if there is a drive failure as well as a NAS failure, I would personally make sure to image the remaining live drives onto some other storage before attempting recovery using software (rather than trying to recover from the live disks themselves).
  • Navvie - Monday, September 1, 2014 - link

    30TB RAID5? Please tell me you replaced that array with something more suitable.
  • DNABlob - Sunday, August 24, 2014 - link

    Good article & research.

    These days for my personal stuff, I use cloud backups (CrashPlan) and a single disk or striped pair (space and/or performance). If quick recovery is imperative, I'll employ something like AeroFS to sync data between two hosts on the same LAN. Pretty decent setup if you don't need to maintain meta-data like owner & ACLs.

    I'll spare you a long diatribe about software RAID5 and how a partial stripe write can silently corrupt data on a crash. As far as I can tell, this isn't fixed in Linux's RAID implementation. At the time, Sun was very proud of its ZFS / RAID-Z implementation, which fixed the partial-write problem. For light write workloads, partial stripe writes are unlikely, but they remain a very real risk.

    https://blogs.oracle.com/bonwick/en_US/entry/raid_...
  • KAlmquist - Monday, August 25, 2014 - link

    In reply to DNABlob: As far as I know, no released version of Linux's RAID implementation has had a problem with silent data corruption, so there is no need for a fix.

    The author of the article you've linked acknowledges that RAID 5 can be implemented correctly in software when he writes, "There are software-only workarounds for this, but they're so slow that software RAID has died in the marketplace." It is true that an incorrect implementation of RAID 5 could result in silent data corruption, but the same thing can be said of any software, including ZFS. ZFS includes checksums on all data, but those checksums don't do any good if a careless programmer has neglected to call the code that verifies the checksums.
  • elFarto - Sunday, August 24, 2014 - link

    The reason your mdadm commands weren't working is that you were attempting to use the disks themselves, not their partitions.
  • mannyvel - Monday, August 25, 2014 - link

    What the article shows is that if your device uses a Linux RAID implementation, you can get the data off your drives with free or commercial tools when the device goes belly up. While useful, you could have done the same thing by buying a new device and dropping your drives in - correct?

    This isn't really data recovery where your RAID craps out because of a two-drive failure or some other condition that whacks your data. This is recovery due to an enclosure failure. Show me a recovery where your RAID dies, not where your enclosure dies.
  • Lerianis - Friday, September 5, 2014 - link

    Not always. Some machines are so badly designed that they INITIALIZE (wipe the drives) when old drives with data are put into them.
  • crashplan - Thursday, September 18, 2014 - link

    True. Anyways great article.
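To make the advice in the comments above concrete - image the surviving drives first, work only on the copies, and assemble from partitions rather than whole disks - here is a minimal sketch of that workflow on a generic Linux machine. The device names, image paths and partition numbers are assumptions; on Synology units the data array typically lives on a higher-numbered partition and, for SHR volumes, may additionally sit under LVM.

    #!/bin/sh
    # Assumed device names and paths - adjust to the actual drives and layout.
    BACKUP=/mnt/backup            # scratch volume large enough to hold the images

    # 1. Image each surviving member with GNU ddrescue. -n skips the slow
    #    scraping pass on a first run; the map file lets the copy resume.
    for d in sdb sdc sdd sde; do
        ddrescue -n /dev/$d $BACKUP/$d.img $BACKUP/$d.map
    done

    # 2. Attach the images as read-only loop devices, scanning for partitions
    #    (assumes no other loop devices are in use, so they show up as loop0..loop3).
    for d in sdb sdc sdd sde; do
        losetup --find --show --read-only --partscan $BACKUP/$d.img
    done

    # 3. Assemble the md array read-only from the data partitions of the loop
    #    devices (the partition number is an assumption; check with fdisk -l first).
    mdadm --assemble --readonly /dev/md0 /dev/loop0p3 /dev/loop1p3 /dev/loop2p3 /dev/loop3p3

    # 4. Mount read-only and copy the data off.
    mkdir -p /mnt/recovered
    mount -o ro /dev/md0 /mnt/recovered

Working from images means a mistake with mdadm (or a further failing drive) costs nothing but time, and the original disks remain untouched for a professional recovery service if the software route fails.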
