[SOLVED] How to recover data from raid 5 with two failed disks - volume crashed after DSM update

Galaxy
I'm New!
Posts: 5
Joined: Thu Nov 23, 2017 10:47 pm

[SOLVED] How to recover data from raid 5 with two failed disks - volume crashed after DSM update

Post by Galaxy » Thu Nov 23, 2017 11:04 pm

EDIT: I fixed it and got all my data back! I ran a whole bunch of tests and will explain everything for you. For better understanding, I have added comments to my questions in this post in blue. In the next post I will explain how I recovered all my data from a RAID 5 with two failed disks. I hope it helps someone. THE MOST IMPORTANT THING IS TO KEEP CALM AND NOT RUSH INTO ANYTHING! THINK ABOUT IT, AND AFTER THAT, THINK ABOUT IT AGAIN! You read that everywhere and it's true, believe me!

Hey guys,

I need your help!

My Setup:
4 disks: Western Digital Red (WD40EFRX) from February 2016 running in a RAID 5 (slots 1, 2, 3, 4); slot 5 is empty; DS1515+
Let me explain what happened and what I did, step by step:
  1. On 17.11.2017 (early in the morning) my DS did an automatic update (I had set the box to install updates automatically).
  2. On 18.11.2017 at around 00:30 I wanted to access my data and could not reach the web interface or the mount point on my Windows machine, and also couldn't ping it. So I looked at the DS and saw that the blue LED was flashing. If I remember correctly, every disk LED was dark. It looked similar to this one here: https://forum.synology.com/enu/viewtopi ... 7&t=128497 - So I did (sadly, I should have waited...) a restart via the power button: hold it for a few seconds until the beep. The DiskStation did a normal shutdown with a flashing blue LED. After that I started it again via the power button.
  3. After the start had completed, the status LED was orange and all disk LEDs were on (no beeping, just the normal beep that says the station has booted). I tried the web interface and the mount point, but could not reach my data. So I checked with Synology Assistant and found the unit with the issue "Configuration lost" and a DHCP IP address.
  4. I searched the web and it said it should not be a problem to do a reinstall: my data would not be touched, only the system partitions. So I connected and then started a reinstall (the installer showed "reinstall", if I remember correctly) with a manually downloaded .pat file (DSM_DS1515+_15217.pat). If the data were going to be wiped, a red warning should have appeared, and it wasn't shown. The reinstall started and failed between 40% and 50% with "failed formatting system partition [35]" (I don't remember the exact wording, but there was a "[35]" in it). After that the NAS offered to install, not to reinstall.
  5. So I did another manual shutdown and start via the power button, as explained in step 2. This time it came up with an orange LED and a constant beeping (beep, beep, beep...). Ironically, the station now came up with my old configuration (with the manual IP address I had set, for example), but all my installed packages were gone. It showed me that volume1 had crashed. volume1 is now shown with disk3 and disk4 but missing disk1 and disk2, and it reports no bytes. On the disk side every disk is "Normal", but has the error "System partition failed" (in German: "Systempartitionierung fehlgeschlagen"). It looks like this:
    [Screenshots: Storage Manager showing volume1 crashed, with all four disks "Normal" but flagged "System partition failed"]
  6. So now I can't repair the RAID 5: with 2 lost disks, my RAID 5 is gone (well, at least I thought so :wink:). I shut it down and went off to research, which I will explain below. My goal is to get the data back. It doesn't matter if I have to recreate the settings and the volume later; I just want my data back, if that's possible...
Because the automatic monthly SMART tests ran fine (all disks are "Normal"), I don't think my disks failed. I think the update, my manual shutdown (step 2) or the reinstall messed things up, so the volume got lost... Some logs show that a few bad sectors were found: "Bad sector was found on disk[3]", and also "I/O-Error on disk 3 on NAS01". These errors started in the middle of October and there are about 15 of them. But I assume this refers to disk 3 in the NAS, not to disk 1 or disk 2, which are the ones that failed? Am I assuming correctly?
Yes, that's correct: "I/O-Error on disk 3 on NAS01" means the disk in the physically third slot. "I/O-Error on disk 5 on NAS01" would be the physically fifth slot, and "I/O-Error on disk 1 on NAS01" would be the physically first slot. The same goes for "disk[3]", "disk[5]", "disk[1]" and so on. Don't mix this up with the "raid slot" mentioned further down, because mdadm counts from 0. So in my example I have 4 disks, which are physically numbered 1, 2, 3, 4, while mdadm (the tool that manages software RAIDs in Linux) numbers them 0, 1, 2, 3 - just keep that in mind! (Btw: NAS01 is just the name of the NAS - on yours it will most likely be a different name, of course.)

I also checked the output of various files in /etc/space to make sure I did not change the disk order by accident (I panicked a little that day...), and I checked some logs. It seems that on every boot the DiskStation tries to stop /dev/md2 and then do a forced assemble, but fails because it can't stop /dev/md2 (why, I don't know...).
As I said, panic is the worst thing... Write down every step you take and make photos and videos, so you can remember things correctly! And for your peace of mind: even if you changed the disk order by accident, it's not that big an issue. I tried this on a test box as well and could get everything back - mdadm is really robust about such things. Just keep reading. :)

I ran a few mdadm commands to check what is going on:
root@NAS01:/dev# cat /proc/mdstat
This shows the RAID information. As you can see, there are three RAIDs: md0, md1 and md2. On Synology systems md0 and md1 are just system partitions (md0 is normally the system partition and md1 the swap partition). The following ones (md2, md3 and so on) contain your created RAIDs and therefore your actual data. Note that md0, md1, md2, md3 and so on are just names for the RAIDs. It could be that md0 is your data and md2 your system partition, for example. That's not likely on Synology systems, but you'll never know if you don't check. Again: test as much as you can and gather as much information as possible before you do anything! There are several ways to find out which RAID contains your data, which I will explain in the next post.

In the output you can see U and _ characters. A U means the device is still in your RAID and up. A _ means the device is missing from your RAID (which does not necessarily mean your disk is dead!). Above these blocks you can see which devices are in the RAID. For example, on md2 you see that sdc3 (/dev/sdc3) and sdd3 (/dev/sdd3) are still present in slots 2 and 3. But slot 0 and slot 1 are missing - you can tell by the two leading _ and by the fact that the RAID counts from 0 when it is created. Note that the slot number in the RAID doesn't have to match the physical disk order in your device! Slot 0 in the RAID doesn't have to be the first (1) physical disk; it could also be the second, or the fifth, or whatever. Normally Synology devices use physical disk slot 1 for RAID slot 0, physical disk slot 2 for RAID slot 1 and so on, and also name the devices accordingly (first disk is /dev/sda, second disk is /dev/sdb etc.), but keep in mind that it isn't always like this!

Another thing you can see is that md0 and md1 have five devices (UU_ _ _) - but I only have 4 disks in my box :?: This is because Synology creates the system and swap partitions for all disk slots, whether a disk is present or not. It's done like this so that these RAIDs can be expanded onto every new disk you put into the enclosure, to always provide a working DSM regardless of which disk fails. Note also that md2 is raid5 while md1 and md0 are raid1. As I had a RAID 5 with 4 disks, I know at this point that md2 is the RAID containing my data and that it is missing two devices (0 and 1) - I assume these are /dev/sda3 and /dev/sdb3, but at this point I still can't be sure! So let's check it out :wink:

Code: Select all

Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4]
md2 : active raid5 sdc3[2] sdd3[3]
      11706589632 blocks super 1.2 level 5, 64k chunk, algorithm 2 [4/2] [__UU]

md1 : active raid1 sdc2[0] sdd2[1]
      2097088 blocks [5/2] [UU___]

md0 : active raid1 sda1[0] sdb1[1]
      2490176 blocks [5/2] [UU___]

unused devices: <none>

root@NAS01:/dev# mdadm --detail /dev/md0
With this command you can get more details about the existing RAIDs. You can see here that, as I mentioned before, this RAID indeed spans 5 devices, although I only have 4 disks. The funny thing is that the Synology system RAIDs are always degraded (see State for md0 and md1) if you don't populate every physical slot :D

Code: Select all

/dev/md0:
        Version : 0.90
  Creation Time : Sat Nov 18 00:20:00 2017
     Raid Level : raid1
     Array Size : 2490176 (2.37 GiB 2.55 GB)
  Used Dev Size : 2490176 (2.37 GiB 2.55 GB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Thu Nov 23 18:30:08 2017
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : 5c6bb801:f95aa80d:3017a5a8:c86610be
         Events : 0.2264

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       0        0        2      removed
       3       0        0        3      removed
       4       0        0        4      removed

root@NAS01:/dev# mdadm --detail /dev/md1

Code: Select all

/dev/md1:
        Version : 0.90
  Creation Time : Sat Nov 18 01:11:06 2017
     Raid Level : raid1
     Array Size : 2097088 (2048.28 MiB 2147.42 MB)
  Used Dev Size : 2097088 (2048.28 MiB 2147.42 MB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Nov 23 18:27:25 2017
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           UUID : d9a4606d:3e753320:8467a421:84d4f733 (local to host NAS01)
         Events : 0.30

    Number   Major   Minor   RaidDevice State
       0       8       34        0      active sync   /dev/sdc2
       1       8       50        1      active sync   /dev/sdd2
       2       0        0        2      removed
       3       0        0        3      removed
       4       0        0        4      removed

root@NAS01:/dev# mdadm --detail /dev/md2
So let me explain a bit more using the output for md2, as we don't really care about md0 and md1. You can see how many devices the RAID normally contains (Raid Devices : 4) and how many devices are in the RAID at the moment (Total Devices : 2). You can compare this with the output of the previous command (cat /proc/mdstat). Another important thing is that Persistence shows "Superblock is persistent". If it didn't, there would still be ways to recover, so don't lose hope and keep reading. Basically, Linux RAID reserves a bit of space (called a superblock) on each component device. This space holds metadata about the RAID device and allows correct assembly of the array. As this is a RAID 5 with two missing disks, the state of the RAID is in fact FAILED (see State). You can also see when the RAID was created, the version of the superblock format, etc. If you need more information, take a look here: https://github.com/tinganho/linux-kerne ... ion/md.txt

Code: Select all

/dev/md2:
        Version : 1.2
  Creation Time : Sat May 21 17:13:58 2016
     Raid Level : raid5
     Array Size : 11706589632 (11164.27 GiB 11987.55 GB)
  Used Dev Size : 3902196544 (3721.42 GiB 3995.85 GB)
   Raid Devices : 4
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Thu Nov 23 18:27:43 2017
          State : clean, FAILED
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 64K

           Name : NAS01:2  (local to host NAS01)
           UUID : eadf7c1e:9c3e4ac3:3f35ab65:118a3497
         Events : 156

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       0        0        1      removed
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
 

root@NAS01:/dev# fdisk -l
Now it gets interesting, because here you can see what actually happened to my RAID. This command lists all partitions of all disks. Let's take /dev/sdd as an example. On /dev/sdd you can see three existing partitions: /dev/sdd1, /dev/sdd2 and /dev/sdd3, with their sizes. Let's be honest, it's really obvious that /dev/sdd3 is part of my RAID 5, just because of the size (3.6 TiB * 4 disks = 14.4, minus 3.6 for one disk of parity in RAID 5 = 10.8 TiB, which is my volume size). But don't identify your RAID only because it seems obvious. Compare this output with the output of cat /proc/mdstat and mdadm --detail /dev/mdX. From those commands we see which partitions are in which RAIDs:
  • For md0 it is /dev/sda1 and /dev/sdb1
  • For md1 it is /dev/sdc2 and /dev/sdd2
  • For md2 it is /dev/sdc3 and /dev/sdd3


If we compare this with the output here, we can see that the partition /dev/sda2 is missing on /dev/sda and the partition /dev/sdb2 is missing on /dev/sdb, and therefore both are missing from md1. Another thing you should notice is what I mentioned above: Synology normally works like this: /dev/sdX1 is in md0, /dev/sdX2 is in md1 and /dev/sdX3 is in md2 - or, if you have more RAIDs, /dev/sdX4 would be in md3 and so on. As I said, you can never be sure, but it clearly looks like that here, so let's take it as a working assumption and go on. This explains why md1 is missing two devices that should be there: /dev/sda2 and /dev/sdb2 simply don't exist anymore. So the "failed formatting system partition" error that occurred during the reinstall was at least fitting - in fact, two members of the swap RAID are gone :?:

I still don't know why this happened, but I can only imagine that the CPU has something to do with it... Recently Intel admitted that the Intel C2000 series has failures. While other vendors took back products still under warranty, Synology just extended the support of the affected devices - see the announcement: https://www.synology.com/en-global/comp ... %20Updatet. I had automatic updates enabled on my box and could see in the logs that an update was installed the day before it became inaccessible. Either that failed and deleted the two partitions, or my reinstall did... I still don't know, but I will contact Synology for a replacement. My trust in this box is gone, as this should not happen, and I also tested my first disk (which is in fact /dev/sda) with an extended S.M.A.R.T. test, which completed with no errors (I also analyzed the raw values).

OK, back to topic. If all is well, you should see that the partitions of your crashed (failed) RAID are still there. As we presume that everything ending in 3 belongs to /dev/md2, and we find all of them in the output (/dev/sda3, /dev/sdb3, /dev/sdc3 and /dev/sdd3), we know that the partitions of the RAID are still there - so my reinstall didn't wipe my data partitions. (As I said before, take photos. I panicked and wasn't really sure which errors had been shown, and was afraid the reinstall had wiped everything :D This step was a really good point to take a deep breath ^^)

Code: Select all

Disk /dev/sdd: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 2F83CF67-C568-4D90-8884-75ACFBA7954D

Device       Start        End    Sectors  Size Type
/dev/sdd1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdd2  4982528    9176831    4194304    2G Linux RAID
/dev/sdd3  9437184 7813832351 7804395168  3.6T Linux RAID


Disk /dev/sda: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 6E557084-F62E-4FB3-AC85-1A88A1D150DE

Device       Start        End    Sectors  Size Type
/dev/sda1     2048    4982527    4980480  2.4G Linux RAID
/dev/sda3  9437184 7813832351 7804395168  3.6T Linux RAID


Disk /dev/sdb: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 2BAD3013-6B28-4528-8EE9-36FF2DEC9B68

Device       Start        End    Sectors  Size Type
/dev/sdb1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdb3  9437184 7813832351 7804395168  3.6T Linux RAID


Disk /dev/sdc: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 3A38788A-4BE1-4CAC-BBE1-A135FF1AFF6C

Device       Start        End    Sectors  Size Type
/dev/sdc1     2048    4982527    4980480  2.4G Linux RAID
/dev/sdc2  4982528    9176831    4194304    2G Linux RAID
/dev/sdc3  9437184 7813832351 7804395168  3.6T Linux RAID


Disk /dev/md0: 2.4 GiB, 2549940224 bytes, 4980352 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/md1: 2 GiB, 2147418112 bytes, 4194176 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/zram0: 2.4 GiB, 2522873856 bytes, 615936 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/zram1: 2.4 GiB, 2522873856 bytes, 615936 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/zram2: 2.4 GiB, 2522873856 bytes, 615936 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/zram3: 2.4 GiB, 2522873856 bytes, 615936 sectors
Units: sectors of 1 * 4096 = 4096 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes


Disk /dev/synoboot: 120 MiB, 125829120 bytes, 245760 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xfd900657

Device         Boot Start    End Sectors  Size Id Type
/dev/synoboot1 *       63  32129   32067 15.7M 83 Linux
/dev/synoboot2      32130 224909  192780 94.1M 83 Linux
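
If you just want to pick out the RAID partitions without the rest of the fdisk output, you can filter it - purely a convenience one-liner, not required for the recovery (it assumes grep -E is available, which it normally is):

Code: Select all

fdisk -l /dev/sd[abcd] | grep -E '^Disk /dev/sd|Linux RAID'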

root@NAS01:/dev# mdadm --examine /dev/sd[abcd]1

Code: Select all

/dev/sda1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 5c6bb801:f95aa80d:3017a5a8:c86610be
  Creation Time : Sat Nov 18 00:20:00 2017
     Raid Level : raid1
  Used Dev Size : 2490176 (2.37 GiB 2.55 GB)
     Array Size : 2490176 (2.37 GiB 2.55 GB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu Nov 23 18:29:58 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
       Checksum : abbbeb63 - correct
         Events : 2258


      Number   Major   Minor   RaidDevice State
this     0       8        1        0      active sync   /dev/sda1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       17        1      active sync   /dev/sdb1
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed


/dev/sdb1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 5c6bb801:f95aa80d:3017a5a8:c86610be
  Creation Time : Sat Nov 18 00:20:00 2017
     Raid Level : raid1
  Used Dev Size : 2490176 (2.37 GiB 2.55 GB)
     Array Size : 2490176 (2.37 GiB 2.55 GB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 0

    Update Time : Thu Nov 23 18:29:58 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
       Checksum : abbbeb75 - correct
         Events : 2258


      Number   Major   Minor   RaidDevice State
this     1       8       17        1      active sync   /dev/sdb1

   0     0       8        1        0      active sync   /dev/sda1
   1     1       8       17        1      active sync   /dev/sdb1
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed


/dev/sdc1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : c07f7981:9727785b:3017a5a8:c86610be
  Creation Time : Sat Nov 18 00:28:02 2017
     Raid Level : raid1
  Used Dev Size : 2490176 (2.37 GiB 2.55 GB)
     Array Size : 2490176 (2.37 GiB 2.55 GB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 0

    Update Time : Sat Nov 18 00:43:38 2017
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
       Checksum : ad94f06a - correct
         Events : 4936


      Number   Major   Minor   RaidDevice State
this     0       8       33        0      active sync   /dev/sdc1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed
   

/dev/sdd1:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : c07f7981:9727785b:3017a5a8:c86610be
  Creation Time : Sat Nov 18 00:28:02 2017
     Raid Level : raid1
  Used Dev Size : 2490176 (2.37 GiB 2.55 GB)
     Array Size : 2490176 (2.37 GiB 2.55 GB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 0

    Update Time : Sat Nov 18 00:43:40 2017
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
       Checksum : ad94f07f - correct
         Events : 4937


      Number   Major   Minor   RaidDevice State
this     1       8       49        1      active sync   /dev/sdd1

   0     0       8       33        0      active sync   /dev/sdc1
   1     1       8       49        1      active sync   /dev/sdd1
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed



root@NAS01:/dev# mdadm --examine /dev/sd[abcd]2

Code: Select all

/dev/sdc2:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : d9a4606d:3e753320:8467a421:84d4f733 (local to host NAS01)
  Creation Time : Sat Nov 18 01:11:06 2017
     Raid Level : raid1
  Used Dev Size : 2097088 (2048.28 MiB 2147.42 MB)
     Array Size : 2097088 (2048.28 MiB 2147.42 MB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 1

    Update Time : Thu Nov 23 18:27:25 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
       Checksum : 7ec7fead - correct
         Events : 30


      Number   Major   Minor   RaidDevice State
this     0       8       34        0      active sync   /dev/sdc2

   0     0       8       34        0      active sync   /dev/sdc2
   1     1       8       50        1      active sync   /dev/sdd2
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed
   
 
/dev/sdd2:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : d9a4606d:3e753320:8467a421:84d4f733 (local to host NAS01)
  Creation Time : Sat Nov 18 01:11:06 2017
     Raid Level : raid1
  Used Dev Size : 2097088 (2048.28 MiB 2147.42 MB)
     Array Size : 2097088 (2048.28 MiB 2147.42 MB)
   Raid Devices : 5
  Total Devices : 2
Preferred Minor : 1

    Update Time : Thu Nov 23 18:27:25 2017
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 3
  Spare Devices : 0
       Checksum : 7ec7febf - correct
         Events : 30


      Number   Major   Minor   RaidDevice State
this     1       8       50        1      active sync   /dev/sdd2

   0     0       8       34        0      active sync   /dev/sdc2
   1     1       8       50        1      active sync   /dev/sdd2
   2     2       0        0        2      faulty removed
   3     3       0        0        3      faulty removed
   4     4       0        0        4      faulty removed

root@NAS01:/dev# mdadm --examine /dev/sd[abcd]3
OK, just a few words about this command. The output shows information about the partitions /dev/sda3, /dev/sdb3, /dev/sdc3 and /dev/sdd3; the commands before did the same for the partitions ending in 1 and 2. The key information here is the Device UUID and the Events value (event count). The UUID (Universally Unique Identifier) is a randomly generated 128-bit number that identifies the device. The event count basically counts every major event that happens to the whole RAID. When you boot your NAS, mdadm "reassembles" your RAID, and during this process the event count of all member disks is increased (normally it is the current count + 1) - more information: https://raid.wiki.kernel.org/index.php/Event. If one disk differs from the others, mdadm will not automatically include that disk in the RAID. And that is exactly what happens here on every boot. As you can see, the event counts of /dev/sda3 and /dev/sdb3 are lower than the others, so mdadm doesn't reassemble them on boot, which leads to the crashed volume. Why does it work like that? Imagine one of the four disks had failed and been thrown out of the RAID 5. Since a RAID 5 with one lost disk is merely degraded, you could still access and change the data on it. If you then change a lot of data and later try to assemble the lost disk back in, that could lead to data loss, because the data on the thrown-out disk is outdated... This is why, in that situation, you do a rebuild and not an assemble.
  • /dev/sda3 -> 144
  • /dev/sdb3 -> 144
  • /dev/sdc3 -> 156
  • /dev/sdd3 -> 156


Note also that the RAID level of each partition is shown. As we know that md0 and md1 are RAID 1 and md2 is RAID 5, this is another indication that /dev/sda3, /dev/sdb3, /dev/sdc3 and /dev/sdd3 are in fact my crashed RAID 5.

Code: Select all

/dev/sda3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : eadf7c1e:9c3e4ac3:3f35ab65:118a3497
           Name : NAS01:2  (local to host NAS01)
  Creation Time : Sat May 21 17:13:58 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7804393120 (3721.42 GiB 3995.85 GB)
     Array Size : 23413179264 (11164.27 GiB 11987.55 GB)
  Used Dev Size : 7804393088 (3721.42 GiB 3995.85 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 5a742e05:cd4dc858:cc89ff46:afe66cbc

    Update Time : Fri Nov 17 05:48:35 2017
       Checksum : 6caaf75c - correct
         Events : 144

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 0
   Array State : AAAA ('A' == active, '.' == missing)
/dev/sdb3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : eadf7c1e:9c3e4ac3:3f35ab65:118a3497
           Name : NAS01:2  (local to host NAS01)
  Creation Time : Sat May 21 17:13:58 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7804393120 (3721.42 GiB 3995.85 GB)
     Array Size : 23413179264 (11164.27 GiB 11987.55 GB)
  Used Dev Size : 7804393088 (3721.42 GiB 3995.85 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 808a7ce1:02b4f1bd:e1e366ff:0a4cd1a8

    Update Time : Fri Nov 17 05:48:35 2017
       Checksum : 52ee333a - correct
         Events : 144

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 1
   Array State : AAAA ('A' == active, '.' == missing)
   
 
/dev/sdc3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : eadf7c1e:9c3e4ac3:3f35ab65:118a3497
           Name : NAS01:2  (local to host NAS01)
  Creation Time : Sat May 21 17:13:58 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7804393120 (3721.42 GiB 3995.85 GB)
     Array Size : 23413179264 (11164.27 GiB 11987.55 GB)
  Used Dev Size : 7804393088 (3721.42 GiB 3995.85 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 91949959:f4bafdcb:5b756116:9ae3f29e

    Update Time : Thu Nov 23 18:27:43 2017
       Checksum : e63a082d - correct
         Events : 156

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 2
   Array State : ..AA ('A' == active, '.' == missing)
 

/dev/sdd3:
          Magic : a92b4efc
        Version : 1.2
    Feature Map : 0x0
     Array UUID : eadf7c1e:9c3e4ac3:3f35ab65:118a3497
           Name : NAS01:2  (local to host NAS01)
  Creation Time : Sat May 21 17:13:58 2016
     Raid Level : raid5
   Raid Devices : 4

 Avail Dev Size : 7804393120 (3721.42 GiB 3995.85 GB)
     Array Size : 23413179264 (11164.27 GiB 11987.55 GB)
  Used Dev Size : 7804393088 (3721.42 GiB 3995.85 GB)
    Data Offset : 2048 sectors
   Super Offset : 8 sectors
          State : clean
    Device UUID : 1626944f:cc7b07bc:a62141ac:3024989c

    Update Time : Thu Nov 23 18:27:43 2017
       Checksum : 5fc3476d - correct
         Events : 156

         Layout : left-symmetric
     Chunk Size : 64K

   Device Role : Active device 3
   Array State : ..AA ('A' == active, '.' == missing)
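
By the way, a quick way to compare just the event counts and device roles of the data partitions is to filter the --examine output - a convenience one-liner only; adjust the device list to your disks:

Code: Select all

mdadm --examine /dev/sd[abcd]3 | grep -E '^/dev/sd|Events|Device Role'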

I read about how to get the disks back into the array and thought about running these commands:
mdadm --stop /dev/md2
mdadm --assemble --run /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 -v

My problem is that I don't know whether the order of the disks matters in the last command, because I don't know whether /dev/sda3 or /dev/sdb3 should be in the first or second position. Furthermore, I don't know for certain whether the array was really built from /dev/sda3 and /dev/sdb3 - I think it can only be those two? I am also scared now that the reinstall process simply wiped my data partitions on disk1 and disk2. Could that be possible? Or is that ruled out because /dev/sda3 and /dev/sdb3 still exist on those disks? So if I ran it with --force, would I kill my last hope, which would be a data rescue company... (if I'm right)?
Yes, this is really a concern - you should know the correct order! Well, mdadm is very robust, as I said, and when you reassemble the RAID it checks which disk belonged to which logical slot of the RAID, no matter where the disk physically is now or was before (CAUTION! THAT ONLY APPLIES TO AN --ASSEMBLE). But I would not recommend just relying on that robustness and switching off your brain. I tested it twice on another box and it went well, but a bird in the hand is worth two in the bush. So check the correct disk order beforehand - I will explain how in the next post.

More problematic is that the event count differs:
/dev/sda3 144
/dev/sdb3 144
/dev/sdc3 156
/dev/sdd3 156

I heard that if I run it without --force, it will fail because of the "possibly out of date" issue. But I think the event count only differs because of the boots I did (about 4, after step 5 I think). I also read that running this command with --force can be dangerous...
Command with force:
mdadm --assemble --force --run /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 -v
Yes, this is true: without --force it would fail with a "possibly out of date" error on /dev/sda3 and /dev/sdb3 and would start the RAID as crashed with only /dev/sdc3 & /dev/sdd3. It would not hurt, but nothing would be achieved either - this is exactly what happened on every boot. By the way, I assumed correctly: the event count increased by 2 on /dev/sdc3 and /dev/sdd3 with every boot - this is why the event counts of my partitions differ.

So what is my best option? I don't think my disks failed, because a double failure at exactly the same time is very unlikely - one shortly after the other, yes, but not both at the same moment on a reboot... But if they do have a fault, I could break even more by poking around... So I'm asking myself whether I should clone one of the failed disks, or every disk...
I will explain in the next post what I did and what you should do.

I appreciate any help.
Thank you in advance!

So there we are. I hope you could follow me - if not, just ask :). I think it is really important that you know what you are doing and don't unthinkingly run commands recommended somewhere on the internet. This is why I write in such detail. So research the web and take the time you need. I know it is hard when there is valuable data on the RAID and you don't know whether you'll get it back. But as I said in the beginning: KEEP CALM! Every step you take could be the one that leads to full data loss; if you do nothing, you can't destroy anything. So think and research before running any commands. Make sure you have understood what I or others did and are trying to explain before you act. And if you have another box to test with, even better.

OK, let's move on and let me explain step by step what I did to get everything back. :)
Last edited by Galaxy on Thu Dec 28, 2017 5:05 pm, edited 9 times in total.

Galaxy
I'm New!
Posts: 5
Joined: Thu Nov 23, 2017 10:47 pm

Re: [SOLVED] How to Recover from raid 5 with two failed disks - volume crashed after update (raid 5)

Post by Galaxy » Wed Dec 27, 2017 5:03 pm

  1. First of all, whenever something like this happens, you can never be sure whether it is a hardware defect in the box or the disks, or just a software issue like mine was. So check this first:
    1. First I took out my disks and put a number on each of them (from 1 up to 4), so I can't mix them up by accident. The better way is to do that in advance, when you put your NAS into operation for the first time. So if you haven't done it yet, do it now.
    2. After that I checked every disk for noises. I started the disks one by one in a quickport like this: http://en.sharkoon.com/category/storage ... tions.aspx, but you could also connect them inside your computer. The quickport has the advantage that there are no other noises that could mislead you. Just listen for scratching or similar sounds - noises you normally wouldn't expect. If you aren't sure, watch videos on YouTube or try other disks for comparison. Mine all sounded alright.

      There are several recommendations on the internet to test the disks with an extended S.M.A.R.T. test. But I would not recommend that when the disk sounds alright. If a disk does in fact have errors and you test every disk, that will probably only produce more errors. Why stress a disk that may already be defective? If it is, the test will not change anything about that; it will only increase the probability of causing more errors on the disk. An extended test on a 4 TB disk, for example, takes around 10 hours, while the commands you run to gather logs and information, and even the assemble itself, only take minutes. So if a disk isn't making any strange noises and you don't have previously recorded SMART values showing that it is nearly dead, don't rush into a test (see the sketch below for reading the already recorded values without starting a new test). However, if you do have old SMART values which indicate that a disk is about to die, your best option is to clone the disk, or, if the data is worth the money, go to a data rescue company and let them clone the disk for you. Then you can try to rescue your data from the cloned disk.
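
      To look at what a disk has already recorded, without kicking off a new test, reading the SMART attribute table and the self-test log is enough. A small sketch (smartctl is the same tool used further below; the device name is just an example):

      Code: Select all

      # print only the recorded SMART attributes (including raw values)
      smartctl -A /dev/sda
      # print the log of previously run self-tests
      smartctl -l selftest /dev/sda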
  2. After that, test whether the box itself has an issue. Steps for a hardware test are described here: https://forum.synology.com/enu/viewtopi ... 15#p472440 - All OK? Then go on; otherwise raise a ticket with Synology support.
  3. Once we have checked that the box and the disks seem to be alright, we can begin to gather some information about what happened. The best approach is to write down all commands you want to run (e.g. in a notepad on your computer). Then you can start the box, connect via SSH, paste the commands in one go and shut the box down immediately afterwards (see the capture sketch at the end of this step). Afterwards you can analyze the output and think about your next steps. You should do it like that because, if there is an issue with the disks after all, it's best not to start them too often or keep them running for a long time.

    At the very least you should do the following to get some information about what is going on:
    • Run the following commands via ssh:
      Note that before you can run any command, you must activate SSH on your box, if you haven't already: https://www.synology.com/en-global/know ... m_terminal. How to log in via SSH and get root permissions is described here: https://www.synology.com/en-global/know ... SSH_Telnet

      These commands are specific to my number of disks and partitions. If you have more disks or partitions, run the commands for those as well (e.g. for /dev/sde or /dev/sdb4).

      Code: Select all

      mdadm --examine /dev/sd[abcd]1
      mdadm --examine /dev/sd[abcd]2
      mdadm --examine /dev/sd[abcd]3
      
      smartctl --xall /dev/sda
      smartctl --xall /dev/sdb
      smartctl --xall /dev/sdc
      smartctl --xall /dev/sdd
      
      mdadm --detail /dev/md0
      mdadm --detail /dev/md1
      mdadm --detail /dev/md2
      
      fdisk -l /dev/sd[abcd]
      
      cat /proc/mdstat
      
      Also have a look in the folder /etc/space. There should be files named space_history_<date>_<time>.xml, like this one: space_history_20171203_141741.xml. To get the content, just put cat in front of them.
      These files are essential for working out the disk order you had at the beginning. They describe how the RAIDs were assembled at the time given in the file name (amongst other things: the serial numbers of the physical disks, the logical slot of each partition in the RAID and the device names, e.g. /dev/sda3). Get the content of some older files and a current one; that shows how the RAID was assembled before and how it is assembled now.

      Code: Select all

      cd /etc/space
      
      cat space_history_20161212_054349.xml
      cat space_history_20170101_103420.xml
      cat space_history_20170616_054728.xml
      
      Example outputs from my system:
      You can see that the RAID 5 containing the data is indeed, as assumed, built out of the partitions /dev/sda3, /dev/sdb3, /dev/sdc3 and /dev/sdd3. Furthermore, you can see the serial number of each disk and which device name was given to each partition (remember that /dev/sda doesn't have to be the same physical disk if you change the physical disk order in the box). And of course you can see the logical slot of each device in the RAID. If we now compare an older output with a more recent one, we can see that the disks with serials 123C and 123D have the same device names and sit in the same slots as before, so those disks are fine. Now we have to check whether the same holds for the two lost disks.

      We can get this information by analyzing the commands we ran above. In the output of smartctl --xall /dev/sdX there is a line "Serial Number:". So for every device name we can get the serial number and compare it with an older xml. For example: take the "Serial Number:" from smartctl --xall /dev/sda and check whether it matches the serial listed for the device name /dev/sda in an older xml. If it's the same, you know that the disk is named correctly and is most likely in the correct physical slot, and, more importantly, you know which logical slot it was and should be in within the RAID configuration (for /dev/sda it is slot 0 in my case).

      You can be reasonably sure that the disk is also in the correct physical slot, because normally the first physical disk is slot 0 in the RAID and named /dev/sda, the second is slot 1 and named /dev/sdb, and so on (I mentioned that in my previous post). However, if you see for example that the "Serial Number:" from smartctl --xall /dev/sda matches the serial of /dev/sdb in an older xml file and vice versa, then you most likely swapped the disks /dev/sda and /dev/sdb. You can change the physical disk order so that it matches again (and check with the same commands afterwards whether it now fits). Anyway, as I said, mdadm is very robust and should recognize every disk correctly and therefore assemble them in the correct order, because it uses the UUID rather than the device name. So the physical order isn't strictly necessary when you only want to assemble the RAID. I have to put a big :!: here, because I would not stake my life on that! I tested it twice on my test box with different mixed-up orders and it worked, but I can't promise it would work with another version of mdadm or DSM. So it's better to restore the correct physical disk order if you mixed it up, or even better, never mix it up in the first place and number your disks (step 1 :wink:)
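
      If you want to collect the serial-to-device-name mapping in one go, a small loop like this helps (just a convenience sketch; adjust the device names to your system):

      Code: Select all

      # print the serial number reported for each device name
      for d in sda sdb sdc sdd; do
          printf '/dev/%s: ' "$d"
          smartctl -i "/dev/$d" | grep -i 'Serial Number'
      done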

      If everything fits we can go on.

      Sidenote: I changed the original serial numbers to 123A, 123B and so on.
      • Older xml with active raid

        Code: Select all

        <?xml version="1.0" encoding="UTF-8"?>
        <spaces>
            <space path="/dev/md2" reference="/volume1" uuid="eadf7c1e:9c3e4ac3:3f35ab65:118a3497" device_type="2" drive_type="0" container_type="2" limited_raidgroup_num="12" >
                <device>
                    <raid path="/dev/md2" uuid="eadf7c1e:9c3e4ac3:3f35ab65:118a3497" level="raid5" version="1.2">
                        <disks>
                            <disk status="normal" dev_path="/dev/sda3" model="WD40EFRX-68WT0N0        " serial="123A" partition_version="8" partition_start="9437184" partition_size="7804395168" slot="0">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdb3" model="WD40EFRX-68WT0N0        " serial="123B" partition_version="8" partition_start="9437184" partition_size="7804395168" slot="1">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdc3" model="WD40EFRX-68WT0N0        " serial="123C" partition_version="8" partition_start="9437184" partition_size="7804395168" slot="2">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdd3" model="WD40EFRX-68WT0N0        " serial="123D" partition_version="8" partition_start="9437184" partition_size="7804395168" slot="3">
                            </disk>
                        </disks>
                    </raid>
                </device>
                <reference>
                    <volume path="/volume1" dev_path="/dev/md2" uuid="eadf7c1e:9c3e4ac3:3f35ab65:118a3497" type="btrfs">
                    </volume>
                </reference>
            </space>
        </spaces>
        
      • xml shortly after the raid failed (volume crashed)

        Code: Select all

        <?xml version="1.0" encoding="UTF-8"?>
        <spaces>
            <space path="/dev/md2" reference="/volume1" uuid="eadf7c1e:9c3e4ac3:3f35ab65:118a3497" device_type="2" drive_type="0" container_type="2" limited_raidgroup_num="12" >
                <device>
                    <raid path="/dev/md2" uuid="eadf7c1e:9c3e4ac3:3f35ab65:118a3497" level="raid5" version="1.2">
                        <disks>
                            <disk status="normal" dev_path="/dev/sdc3" model="WD40EFRX-68WT0N0        " serial="123C" partition_version="8" partition_start="9437184" partition_size="7804395168" slot="2">
                            </disk>
                            <disk status="normal" dev_path="/dev/sdd3" model="WD40EFRX-68WT0N0        " serial="123D" partition_version="8" partition_start="9437184" partition_size="7804395168" slot="3">
                            </disk>
                        </disks>
                    </raid>
                </device>
                <reference>
                    <volume path="/volume1" dev_path="/dev/md2" uuid="eadf7c1e:9c3e4ac3:3f35ab65:118a3497">
                    </volume>
                </reference>
            </space>
        </spaces>
        
      Don't forget to shut down the box after running the commands. You can analyze the output in peace while the box is powered off. :wink:
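
      If you prefer to capture everything into a single file instead of copying it out of the terminal, a sketch like this works (the output path and file name are only examples; adjust the device list to your disks):

      Code: Select all

      # collect all diagnostic output into one timestamped file
      {
          cat /proc/mdstat
          mdadm --detail /dev/md0 /dev/md1 /dev/md2
          mdadm --examine /dev/sd[abcd]1 /dev/sd[abcd]2 /dev/sd[abcd]3
          fdisk -l /dev/sd[abcd]
          smartctl --xall /dev/sda
          smartctl --xall /dev/sdb
          smartctl --xall /dev/sdc
          smartctl --xall /dev/sdd
          cat /etc/space/space_history_*.xml
      } > /root/diag_$(date +%Y%m%d_%H%M%S).txt 2>&1
      # then copy the file off the box (e.g. with WinSCP) and shut it down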
  4. Now, after you have analyzed what happened based on the output, you can think about what to do next. If you have a case like mine, where two disks of your RAID 5 got lost at the same time and their event count doesn't differ too much from the event counts of the disks still in the RAID, your chances are very good of getting your RAID back online just by assembling the disks.

    To assemble the RAID, follow these steps:
    1. Start the box and log in via SSH.
    2. Run some of the commands you ran before to check whether the disk order still fits.
    3. Stop the RAID (run the command mdadm --stop followed by the name of the RAID, in my case /dev/md2):

      Code: Select all

      mdadm --stop /dev/md2
    4. Assemble the RAID. Run the command mdadm --assemble --run followed by the name of the RAID (in my case /dev/md2) and every device name (partition) your RAID contains. In my case these are /dev/sda3, /dev/sdb3, /dev/sdc3 and /dev/sdd3. The order doesn't matter when you run the assemble command (I tried this on my test box), but I would recommend listing them in the correct order according to the xml files (device name of slot 0 first, then slot 1, and so on). The -v at the end means verbose and just prints more information while the command runs.
      In my case the command is this one:

      Code: Select all

      mdadm --assemble --run /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 -v
      Because of the differing event counts of my disks, the command will fail - well, not fail exactly, but my RAID will only start with two disks (/dev/sdc3 & /dev/sdd3). The first two (/dev/sda3 & /dev/sdb3) have an older event count (144) than the two (/dev/sdc3 & /dev/sdd3) which were still in the RAID after the crash (156). The good part is that you can now see how mdadm will try to assemble your RAID and compare this with the output of your previous commands. For example: in my case, the serial number (123A) of /dev/sda3 in the xml file should match the "Serial Number:" in the output of smartctl --xall /dev/sda, and the slot number (0) in the xml should match the slot number in the assemble output ("/dev/sda3 is identified as a member of /dev/md2, slot 0"). The same goes for the other disks. If something doesn't fit, mdadm might assemble your RAID in the wrong order if you force the assemble - no matter what, don't do that!
      Note that the "(possibly out of date)" part might not be shown.

      Code: Select all

      mdadm: looking for devices for /dev/md2
      mdadm: /dev/sda3 is identified as a member of /dev/md2, slot 0.
      mdadm: /dev/sdb3 is identified as a member of /dev/md2, slot 1.
      mdadm: /dev/sdc3 is identified as a member of /dev/md2, slot 2.
      mdadm: /dev/sdd3 is identified as a member of /dev/md2, slot 3.
      mdadm: added /dev/sda3 to /dev/md2 as 0 (possibly out of date)
      mdadm: added /dev/sdb3 to /dev/md2 as 1 (possibly out of date)
      mdadm: added /dev/sdc3 to /dev/md2 as 2
      mdadm: added /dev/sdd3 to /dev/md2 as 3 
      mdadm: /dev/md2 has been started with 2 drives (out of 4).
    5. When you are sure that the command will assemble your RAID in the correct order, you can run it with --force:

      Code: Select all

      mdadm --assemble --force --run /dev/md2 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3 -v
      If everything goes well, it will succeed, assemble your RAID and start it (you can see how the event count is forced upwards):
      Note that the order of the "mdadm: added x to y" lines in the output can vary, but that doesn't matter.

      Code: Select all

      mdadm: looking for devices for /dev/md2
      mdadm: /dev/sda3 is identified as a member of /dev/md2, slot 0.
      mdadm: /dev/sdb3 is identified as a member of /dev/md2, slot 1.
      mdadm: /dev/sdc3 is identified as a member of /dev/md2, slot 2.
      mdadm: /dev/sdd3 is identified as a member of /dev/md2, slot 3.
      mdadm: forcing event count in /dev/sda3(0) from 144 upto 160
      mdadm: forcing event count in /dev/sdb3(1) from 144 upto 160
      mdadm: added /dev/sdb3 to /dev/md2 as 1
      mdadm: added /dev/sdc3 to /dev/md2 as 2
      mdadm: added /dev/sdd3 to /dev/md2 as 3
      mdadm: added /dev/sda3 to /dev/md2 as 0
      mdadm: /dev/md2 has been started with 4 drives.
    6. Hell yeah, if this goes well you are one, OK, two steps away from your beloved data. Create a directory (e.g. name it recovery):

      Code: Select all

      mkdir /recovery
    7. And mount your RAID (in my case I mounted /dev/md2 to the newly created directory /recovery). I would also recommend mounting it read-only, as I did (-o ro).

      Code: Select all

      mount -o ro /dev/md2 /recovery
    8. If you now take a look into the directory, your data should be there:

      Code: Select all

      ls -la /recovery
  5. Now just take a program like WinSCP, connect to your NAS and copy everything to another storage. After that you can decide what to do next. I would recommend checking all disks with extended SMART tests and reinstalling the NAS - but that's your choice 8)
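
    If you prefer the command line over WinSCP, something like rsync can do the copy as well - a sketch only; the user, host and target path are examples, the target machine must accept SSH, and rsync must be available on your DSM:

    Code: Select all

    # copy the mounted (read-only) volume to another machine, preserving attributes
    rsync -avh --progress /recovery/ backupuser@192.168.1.50:/mnt/backup/nas01/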


    Some more information about my tests:

    Before I did anything, I tested all of this on a test box with other disks, and I could even reproduce the error I had. On my first two disks the swap partition was missing (/dev/md1 -> /dev/sda2 & /dev/sdb2), so I decided to delete them manually on a test box. I installed the test box from scratch with 4 disks and created a RAID 5. After the consistency check I shut it down, removed the first disk (/dev/sda) and deleted the partition (/dev/sda2) with GParted. After that I reinserted it, and guess what happened: yes, the RAID 5 (/dev/md2) was degraded. I then did the same with the second disk. And indeed, I got the "Configuration lost" issue in Synology Assistant and couldn't access the NAS. So I tried to reinstall. This time the reinstallation ran without any error, but after the reboot nothing had been reinstalled - it came up with "Configuration lost" again (I tried the reinstall 3 times, with the same result). So I checked disk1 and disk2 in GParted - both partitions were still missing... The reinstall simply could not recreate the partitions - the same issue as on my box... The only difference was that my box booted up after the reinstall failure instead of showing "Configuration lost" again. So I pulled disk1 and disk2 out, started the box and after a few seconds put disk1 and disk2 back in. Now it booted up and, as expected, the RAID was crashed. As a final test, I installed the box from scratch again and this time deleted the partitions on the last two disks (disk3 and disk4). Now the box started normally, but, as expected, the RAID was crashed... This means that Synology boxes always try to start the DSM from the first disk, and if there is any issue there, they won't boot. It also means that if a system partition on a disk dies (I assume the same happens if you delete parts of /dev/md0), the partitions of your volumes are affected as well... Very strange :?: :shock: The only explanation I can come up with is that a disk is thrown out of all RAIDs when something happens to one of its arrays (e.g. /dev/md1), because it is assumed that something abnormal happened to that disk.

    I also tested a recreation of the RAID after deleting the system partitions from disk3 and disk4. The assemble worked just as before, but I wanted to test the recreation. This link explains how to recreate the RAID successfully if an assemble won't work (for example if the superblock is missing, as I mentioned in my previous post): https://unix.stackexchange.com/a/146944/208241 But keep in mind that recreating is dangerous. You should know what you are doing! And for a recreation the disk order matters, especially in the mdadm --create command! The wrong order will cause data loss!

    Some useful links:

Galaxy
I'm New!
Posts: 5
Joined: Thu Nov 23, 2017 10:47 pm

Re: [SOLVED] How to recover data from raid 5 with two failed disks - volume crashed after DSM update

Post by Galaxy » Thu Dec 28, 2017 6:23 pm

So there we are - I hope this helps someone. If you have questions, suggestions or find mistakes, please post them so I can adjust the guide.

Last but not least, it has to be said again: RAID is no backup! So back up more often. I nearly learned that the hard way and would have lost some data that is valuable to me - not all of it, but some of the latest :?
Even better is to make a backup of the backup, and after that a backup of that backup offsite :lol: (fire or water can be dangerous... :twisted:)
Seriously though, get a better backup concept that runs automatically, if you don't have one already. :wink:

Another thing: I won't use RAID 5 anymore. I had wanted to change my setup for a while but never did (you know how life gets in the way :D), so I'm almost glad it crashed now, while I got lucky, because I can finally do it. The main reason for not using RAID 5 anymore is the time a rebuild takes. If you have disks from the same vendor, bought at the same time, the probability that more than one disk fails around the same time is high. In the past that wasn't a big problem, because you were running disks of around 500 GB. But nowadays, with 4, 8 or even 10 TB disks, a rebuild takes days. It then becomes quite likely that another disk dies during the rebuild, or while you are copying your data off a degraded RAID 5, and your RAID is gone... You can imagine what follows: even RAID 6 with such large disks, and arrays with very many disks, will probably not be recommended anymore in the future. For home use it isn't that big an issue if you have a good backup concept, because normally you don't have that many large disks or the same reliability requirements. But for critical business use it's an issue.

I will build my nas from scratch with raid 6 now and go for an encrypted cloud backup: https://www.synology.com/en-global/dsm/ ... cloud_sync :mrgreen:
