Shutdown / Recovery script - where's the mistake?

Postby dognose » Tue Jul 11, 2017 9:57 pm

Hi everyone.

I'm running a 2-node Hyper-V cluster along with 3 Synology RS815+. Two NAS are clustered and hold the VMs; one is independent and used for "cold data".
The connection is iSCSI. Everything (hyperv1, hyperv2, nas1, nas2, nas3, switches) is backed by a UPS.

I'm currently in the process of developing a "fully" automated shutdown and recovery setup. Everything works according to plan, except the recovery process.
The Synologies are showing some strange behavior that I don't understand...

Layout

- A Hyper-V node called "hyperv1" (backed by a UPS)
- A Hyper-V node called "hyperv2" (backed by a UPS)
- A RS815+ (nas1) (backed by a UPS)
- A RS815+ (nas2) (backed by a UPS)
- A RS815+ (nas3) (backed by a UPS)
- nas1 + nas2 together are known as "syncluster". (backed by a UPS)

- A Raspberry Pi, NOT backed by the UPS, responsible for recovery - more on that later.
- A router, NOT backed by the UPS.

Plan

If there's a power loss, the following sequence should apply:

- shut down hyperv1
- shut down hyperv2
- shut down nas1, nas2, nas3

Since the Hyper-V hosts use iSCSI for VM storage, the NAS need to shut down later and power on earlier than the hosts.

Scripts

So, I wrote some scripts for this. Everything is IP-based; for this post I replaced every IP with the names "hyperv1", "hyperv2", "nas1", "nas2", "nas3", "syncluster", "router" and "raspberry".

First, "hyperv1" should shut down. The following PowerShell script works as expected.

Code: Select all

$LogFileName = "C:\controlled_shutdown.log"

# -Quiet makes Test-Connection return $true/$false instead of reply objects.
if (!(Test-Connection "raspberry" -Count 3 -Delay 2 -Quiet))
{
  $d = (Get-Date -Format "yyyy-MM-dd HH:mm:ss") + ": "
  $d + "(Raspberry) is down..." >> $LogFileName
  if (!(Test-Connection "router" -Count 3 -Delay 2 -Quiet))
  {
    $d = (Get-Date -Format "yyyy-MM-dd HH:mm:ss") + ": "
    $d + "(router) is down as well... shutdown" >> $LogFileName
    # Force-close running applications and power off immediately.
    shutdown -s -t 0 -f
  }
}


So, the first Hyper-V node shuts down if neither "router" nor "raspberry" is reachable, passing all VMs to the second node.
The second node in turn runs the following script:

Code: Select all

$LogFileName = "C:\controlled_shutdown.log"
$d = (Get-Date -Format "yyyy-MM-dd HH:mm:ss") + ": "

# -Quiet makes Test-Connection return $true/$false instead of reply objects.
if (!(Test-Connection "raspberry" -Count 2 -Delay 2 -Quiet))
{
  $d + "(Raspberry) is down..." >> $LogFileName

  if (!(Test-Connection "hyperv1" -Count 2 -Delay 3 -Quiet))
  {
    $d + "(hyperv1) is down... shutdown" >> $LogFileName
    shutdown -s -t 0
  }
}


So, if the raspberry AND the first Hyper-V node are down, the second Hyper-V node shuts down as well.

Subsequently, the NAS check the availability of the cluster nodes as well, and shut down if none of "raspberry", "hyperv1" and "hyperv2" is reachable:

Code: Select all

#!/bin/sh

# Shut down only if none of the listed hosts answers a ping.
shutdown_flag=true
for i in "raspberry" "hyperv1" "hyperv2"
do
  if ping -c 10 -W 5 "$i" > /dev/null; then
    shutdown_flag=false
    break
  else
    /bin/echo "$i is down" >> /var/log/ha.log
  fi
done

if [ "$shutdown_flag" = "true" ] ; then
  /bin/echo "shutdown!" >> /var/log/ha.log
  /sbin/poweroff
fi


-----------------------------------------------------
So, after about 5-10 minutes all Hyper-V nodes and all NAS have shut down.
-----------------------------------------------------

Recovery

My idea was that the "raspberry", which is not backed by the UPS, will come online again once power is restored.
Thus, it runs the following script to wake up every device in the required order (WOL is configured on every device):

Code: Select all

#!/bin/sh

hyperv1ip="10.10.20.2"
hyperv1mac="FC:AA:14:70:91:0A"

hyperv2ip="10.10.20.3"
hyperv2mac="40:8D:5C:1A:35:9B"

nas1ip="10.10.20.10"
nas1mac="00:11:32:56:3A:5D"

nas2ip="10.10.20.11"
nas2mac="00:11:32:52:71:61"

nas3ip="10.10.20.12"
nas3mac="00:11:32:4F:87:D1"

synclusterip="10.10.20.13"

ts=$(date '+%Y-%m-%d %H:%M:%S');

ping -c 2 $nas1ip >/dev/null 2>&1
if [ $? -ne 0 ] ; then
  echo "$ts : Wakeup NAS 1" >> /home/pi/Documents/wol.log
  etherwake "$nas1mac"
else
  ping -c 2 $nas2ip >/dev/null 2>&1
  if [ $? -ne 0 ] ; then
    echo "$ts : Wakeup NAS 2" >> /home/pi/Documents/wol.log
    etherwake "$nas2mac"
  else
    ping -c 2 $nas3ip >/dev/null 2>&1
    if [ $? -ne 0 ] ; then
      echo "$ts : Wakeup NAS 3" >> /home/pi/Documents/wol.log
      etherwake "$nas3mac"
    else
      ping -c 2 $hyperv1ip >/dev/null 2>&1
      if [ $? -ne 0 ] ; then
        echo "$ts : Wakeup HyperV 1" >> /home/pi/Documents/wol.log
        etherwake "$hyperv1mac"
      else
        ping -c 2 $hyperv2ip >/dev/null 2>&1
        if [ $? -ne 0 ] ; then
          echo "$ts : Wakeup HyperV 2" >> /home/pi/Documents/wol.log
          etherwake "$hyperv2mac"
        else
          echo "All good"
        fi
      fi
    fi
  fi
fi


The Problem

It somewhat works. After power is restored, NAS1, NAS2 and NAS3 start due to the WOL packets, as expected.

The problem I'm facing is this: even though the device sending the WOL packets (raspberry) is clearly available on the network,
which should make every NAS stay awake according to its shutdown script (they check the availability of "raspberry"),
NAS1, NAS2 and NAS3 keep shutting down again, logging that none of "raspberry", "hyperv1" and "hyperv2" is online.

10.10.20.30 is down
10.10.20.2 is down
10.10.20.3 is down
shutdown!
10.10.20.30 is down
10.10.20.2 is down
10.10.20.3 is down
shutdown!

(that's raspberry/hyperv1/hyperv2)

As you can see, I already increased the ping count (for the NAS shutdown) to 10, with a timeout of 5 seconds. Still, the NAS
keep starting and shutting down during "recovery"...

The raspberry never actually tries to wake up hyperv1 or hyperv2; it stays busy waking up nas1, nas2 and nas3...

----------------

I believe that the shutdown script on the Synology kicks in before network connectivity has been restored,
thus leading to another immediate shutdown.

However, if I power on "hyperv1" and/or "hyperv2" manually, none of the NAS attempts another shutdown...

Any ideas what might go wrong during the recovery sequence?

Generally the "raspberry" responds to pings, otherwise my logs would be filled with "Raspberry is down" entries on all the NAS and Hyper-V nodes...
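One way to test the "script runs before the network is back" theory would be a boot grace period on the NAS: skip the shutdown check while the box has only just come up. This is only a sketch; in_grace_period, GRACE_SECONDS and UPTIME_FILE are made-up names, and 600 seconds is an arbitrary value that would have to be tuned to the real boot times.

```shell
#!/bin/sh
# Sketch: skip the shutdown check while the NAS has only just booted.
# GRACE_SECONDS and UPTIME_FILE are assumptions, not values from the setup above.
GRACE_SECONDS="${GRACE_SECONDS:-600}"
UPTIME_FILE="${UPTIME_FILE:-/proc/uptime}"

# Succeeds (exit 0) while the uptime is still below the grace period.
in_grace_period() {
    # /proc/uptime looks like "123.45 678.90"; keep the integer part of field 1.
    read -r up _ < "$UPTIME_FILE" || return 1
    up=${up%%.*}
    [ "$up" -lt "$GRACE_SECONDS" ]
}

# Possible use at the top of the NAS shutdown script:
#   if in_grace_period; then
#       /bin/echo "booted recently, skipping check" >> /var/log/ha.log
#       exit 0
#   fi
```

If the NAS still power off after the grace period has passed, the cause is more likely on the side of the ping targets than on the NAS itself.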

cheers,
dognose

Re: Shutdown / Recovery script - where's the mistake?

Postby dognose » Sat Jul 15, 2017 7:15 pm

After trying some different options in the NAS shutdown scripts, I finally figured out the problem:

For whatever reason, the raspberry starts its wake-up sequence about 1 minute after power is restored,
but it takes about 7.5 minutes until it FINALLY starts to answer pings (or ssh etc.).

Therefore, during this time span, the NAS kept shutting down again.
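An alternative to adding more ping targets on the NAS would be to hold back the WOL packets on the raspberry until its own network path looks usable, for example until it can reach the router. The following is a rough sketch under that assumption; wait_for_host and PING_CMD are hypothetical names, and whether the raspberry can ping out during those 7.5 minutes is not something I have verified.

```shell
#!/bin/sh
# Sketch: block until a host answers a ping, or give up after a number of tries.
# PING_CMD is overridable so the function can be exercised without a network.
# wait_for_host and its arguments are illustrative, not part of the scripts above.
PING_CMD="${PING_CMD:-ping}"

# wait_for_host HOST [TRIES] [PAUSE_SECONDS]
wait_for_host() {
    host=$1
    tries=${2:-60}
    pause=${3:-5}
    n=0
    while [ "$n" -lt "$tries" ]; do
        if $PING_CMD -c 1 -W 2 "$host" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        sleep "$pause"
    done
    return 1
}

# Possible use at the top of the wake-up script (10.10.20.1 is the router):
#   wait_for_host 10.10.20.1 || exit 1
```

The defaults (60 tries, 5 seconds pause) give up after roughly 5 to 7 minutes and are arbitrary.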

I have now changed four things in the scripts:

1.) I added a check whether eth0 is "up" on the NAS running the script (this wasn't the cause, but it can't hurt).

Code: Select all

#!/bin/sh

# Only evaluate the shutdown condition once eth0 is reported as "UP".
conn_info=$(ip link show | grep "eth0")

# grep -q instead of [[ ... ]] so the test also works in a plain POSIX sh.
if echo "$conn_info" | grep -q "UP" ; then
  shutdown_flag=true
  for i in "10.10.20.1" "10.10.20.30" "10.10.20.2" "10.10.20.3"
  do
    if ping -c 2 -W 2 "$i" > /dev/null; then
      shutdown_flag=false
      break
    else
      /bin/echo "$i is down" >> /var/log/ha_shutdown.log
    fi
  done

  if [ "$shutdown_flag" = "true" ] ; then
    /bin/echo "shutdown!" >> /var/log/ha_shutdown.log
    /sbin/poweroff
  fi
else
  /bin/echo "eth0 not yet ready..." >> /var/log/ha_shutdown.log
fi


2.) See above: I noticed that Synology uses the file "ha.log" itself, so I changed the file to "ha_shutdown.log" (just to have dedicated logs).
3.) See above: I also added the "router" (10.10.20.1) as a ping target for the shutdown condition (so: router, raspberry, hyperv1, hyperv2) and turned the ping count and timeout back to "normal" values
(so now: 4 targets x 2 pings x 2 seconds timeout = about 16 seconds, plus up to 60 seconds because the script runs every minute, before the shutdown is invoked).

4.) I combined the wake-up for the cluster and now only check whether the cluster IP responds, rather than nas1 and nas2 individually.

Code: Select all

#!/bin/sh

hyperv1ip="10.10.20.2"
hyperv1mac="FC:AA:14:70:91:0A"
hyperv2ip="10.10.20.3"
hyperv2mac="40:8D:5C:1A:35:9B"
nas1ip="10.10.20.10"
nas1mac="00:11:32:56:3A:5D"
nas2ip="10.10.20.11"
nas2mac="00:11:32:52:71:61"
nas3ip="10.10.20.12"
nas3mac="00:11:32:4F:87:D1"
synclusterip="10.10.20.13"

ts=$(date '+%Y-%m-%d %H:%M:%S');

ping -c 2 $synclusterip >/dev/null 2>&1
if [ $? -ne 0 ] ; then
  echo "$ts : Wakeup NAS 1" >> /home/pi/Documents/wol.log
  etherwake "$nas1mac"
  echo "$ts : Wakeup NAS 2" >> /home/pi/Documents/wol.log
  etherwake "$nas2mac"
else
  ping -c 2 $nas3ip >/dev/null 2>&1
  if [ $? -ne 0 ] ; then
    echo "$ts : Wakeup NAS 3" >> /home/pi/Documents/wol.log
    etherwake "$nas3mac"
  else
    ping -c 2 $hyperv1ip >/dev/null 2>&1
    if [ $? -ne 0 ] ; then
      echo "$ts : Wakeup HyperV 1" >> /home/pi/Documents/wol.log
      etherwake "$hyperv1mac"
    else
      ping -c 2 $hyperv2ip >/dev/null 2>&1
      if [ $? -ne 0 ] ; then
        echo "$ts : Wakeup HyperV 2" >> /home/pi/Documents/wol.log
        etherwake "$hyperv2mac"
      else
        echo "All good"
      fi
    fi
  fi
fi
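The nested if/else above always wakes the first device in the list that does not answer. The same idea could be written as a loop over an ordered list, which is easier to extend. This is just a sketch: wake_first_down, the "ip=mac,mac" pair format and the PING_CMD / WAKE_CMD / WOL_LOG overrides are made up for illustration, and the log line prints the IP instead of a device name.

```shell
#!/bin/sh
# Sketch: "wake the first device that does not answer", written as a loop.
# PING_CMD and WAKE_CMD are overridable for dry runs; wake_first_down and the
# "ip=mac[,mac...]" pair format are illustrative, not part of the script above.
PING_CMD="${PING_CMD:-ping}"
WAKE_CMD="${WAKE_CMD:-etherwake}"
WOL_LOG="${WOL_LOG:-/home/pi/Documents/wol.log}"

# wake_first_down "ip=mac[,mac...]" ...
# Pings each IP in order and sends WOL packets to all MACs of the first IP
# that does not answer. Returns 0 only if every IP answered.
wake_first_down() {
    for pair in "$@"; do
        ip=${pair%%=*}
        macs=${pair#*=}
        if ! $PING_CMD -c 2 "$ip" >/dev/null 2>&1; then
            echo "$(date '+%Y-%m-%d %H:%M:%S') : Wakeup $ip" >> "$WOL_LOG"
            for mac in $(echo "$macs" | tr ',' ' '); do
                $WAKE_CMD "$mac"
            done
            return 1
        fi
    done
    echo "All good"
    return 0
}

# Same order as above (cluster IP wakes nas1 + nas2, then nas3, hyperv1, hyperv2):
#   wake_first_down \
#       "10.10.20.13=00:11:32:56:3A:5D,00:11:32:52:71:61" \
#       "10.10.20.12=00:11:32:4F:87:D1" \
#       "10.10.20.2=FC:AA:14:70:91:0A" \
#       "10.10.20.3=40:8D:5C:1A:35:9B"
```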


So, I ran another test and tracked every device with a ping tool (NirSoft PingInfoView), which was quite helpful here. It had also been helpful before in figuring out what was actually happening.

Finally everything works "as expected", at least in 3 recent tests.

The raspberry (10.10.20.30) stayed "invisible" until the cluster, nas3 and the first Hyper-V host were considered alive. Only then did it start to respond to pings as well,
which was after roughly 445 pings (one per second), so about 7.5 minutes.

This might be caused by the unclean power-loss shutdown of the raspberry. After a regular reboot, it starts to answer pings right away.
(Maybe some sort of: "Uh, repairing, blabla, whatever")
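For anyone who wants to repeat the measurement without a GUI tool, a small loop can report how long a host takes to answer its first ping. This is only a sketch and not what I used: seconds_until_reply and PING_CMD are hypothetical names, and it relies on "date +%s", which works on Linux and BusyBox but is not strictly POSIX.

```shell
#!/bin/sh
# Sketch: print how many seconds pass until HOST answers its first ping.
# seconds_until_reply and PING_CMD are illustrative names.
PING_CMD="${PING_CMD:-ping}"

# seconds_until_reply HOST [MAX_SECONDS]
# Prints the elapsed seconds and returns 0 on the first reply;
# returns 1 without output once MAX_SECONDS have passed.
seconds_until_reply() {
    host=$1
    max=${2:-900}
    start=$(date +%s)
    while :; do
        now=$(date +%s)
        elapsed=$((now - start))
        if $PING_CMD -c 1 -W 1 "$host" >/dev/null 2>&1; then
            echo "$elapsed"
            return 0
        fi
        [ "$elapsed" -ge "$max" ] && return 1
        sleep 1
    done
}

# Example: start this on another machine right when power comes back.
#   seconds_until_reply 10.10.20.30
```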
