I'm running a 2-node Hyper-V cluster along with three Synology RS815+ units. Two of the NAS are clustered and hold the VMs; the third is independent and used for "cold data".
Storage is connected via iSCSI. Everything (hyperv1, hyperv2, nas1, nas2, nas3, switches) is backed by a UPS.
I'm currently in the process of building a "fully" automated shutdown and recovery setup. Everything works as planned, except the recovery process.
The Synologies are showing some strange behavior that I don't understand...
Layout
- A Hyper-V node called "hyperv1" (backed by a UPS)
- A Hyper-V node called "hyperv2" (backed by a UPS)
- An RS815+ (nas1) (backed by a UPS)
- An RS815+ (nas2) (backed by a UPS)
- An RS815+ (nas3) (backed by a UPS)
- nas1 + nas2 together are known as "syncluster". (backed by a UPS)
- A Raspberry Pi, NOT backed by the UPS, responsible for recovery - more on that later.
- A router, NOT backed by the UPS.
Plan
If there's a power loss, the following sequence should apply:
- shut down hyperv1
- shut down hyperv2
- shut down nas1, nas2, nas3
Since the Hyper-V hosts use iSCSI for VM storage, the NAS need to shut down later and power on earlier than the hosts.
Scripts
So, I wrote some scripts for this. I replaced every IP (it's all IP-based) with the terms "hyperv1", "hyperv2", "nas1", "nas2", "nas3", "syncluster", "router" and "raspberry".
First, "hyperv1" should shut down. The following PowerShell script works as expected.
$LogFileName = "C:\controlled_shutdown.log"

# Shut down only if both the Raspberry Pi and the router are unreachable.
# -Quiet makes Test-Connection return a plain $true/$false.
if (!(Test-Connection "raspberry" -Count 3 -Delay 2 -Quiet))
{
    $d = (Get-Date -Format "yyyy-MM-dd HH:mm:ss") + ": "
    $d + "(Raspberry) is down..." >> $LogFileName
    if (!(Test-Connection "router" -Count 3 -Delay 2 -Quiet))
    {
        $d = (Get-Date -Format "yyyy-MM-dd HH:mm:ss") + ": "
        $d + "(router) is down as well... shutdown" >> $LogFileName
        shutdown -s -t 0 -f
    }
}
So, the first Hyper-V node shuts down if neither "router" nor "raspberry" is reachable, passing all VMs to the second node.
The second node in turn runs the following script:
$LogFileName = "C:\controlled_shutdown.log"
$d = (Get-Date -Format "yyyy-MM-dd HH:mm:ss") + ": "

# Shut down only if both the Raspberry Pi and the first node are unreachable.
# -Quiet makes Test-Connection return a plain $true/$false.
if (!(Test-Connection "raspberry" -Count 2 -Delay 2 -Quiet))
{
    $d + "(Raspberry) is down..." >> $LogFileName
    if (!(Test-Connection "hyperv1" -Count 2 -Delay 3 -Quiet))
    {
        $d + "(hyperv1) is down... shutdown" >> $LogFileName
        shutdown -s -t 0
    }
}
So, if the raspberry AND the first Hyper-V node are down, the second Hyper-V node shuts down as well.
Subsequently, the NAS check the availability of the cluster nodes as well, and shut down if neither "raspberry" nor "hyperv1" nor "hyperv2" is reachable:
#!/bin/sh
# Power off only if none of the three hosts answers a ping.
shutdown_flag=true
for i in "raspberry" "hyperv1" "hyperv2"
do
    if ping -c 10 -W 5 "$i" > /dev/null; then
        # At least one host is reachable: stay up.
        shutdown_flag=false
        break
    else
        /bin/echo "$i is down" >> /var/log/ha.log
    fi
done
if [ "$shutdown_flag" = "true" ] ; then
    /bin/echo "shutdown!" >> /var/log/ha.log
    /sbin/poweroff
fi
-----------------------------------------------------
So, after about 5-10 minutes all Hyper-V nodes and all NAS have shut down.
-----------------------------------------------------
Recovery
My idea was that the "raspberry", which is not backed by the UPS, comes online again once power is restored.
It then runs the following script to wake up every device (WOL is configured on all of them) in the required order:
#!/bin/sh
hyperv1ip="10.10.20.2"
hyperv1mac="FC:AA:14:70:91:0A"
hyperv2ip="10.10.20.3"
hyperv2mac="40:8D:5C:1A:35:9B"
nas1ip="10.10.20.10"
nas1mac="00:11:32:56:3A:5D"
nas2ip="10.10.20.11"
nas2mac="00:11:32:52:71:61"
nas3ip="10.10.20.12"
nas3mac="00:11:32:4F:87:D1"
synclusterip="10.10.20.13"
ts=$(date '+%Y-%m-%d %H:%M:%S')

# Wake at most one device per run, in start-up order (NAS first, then Hyper-V).
if ! ping -c 2 "$nas1ip" >/dev/null 2>&1; then
    echo "$ts : Wakeup NAS 1" >> /home/pi/Documents/wol.log
    etherwake "$nas1mac"
elif ! ping -c 2 "$nas2ip" >/dev/null 2>&1; then
    echo "$ts : Wakeup NAS 2" >> /home/pi/Documents/wol.log
    etherwake "$nas2mac"
elif ! ping -c 2 "$nas3ip" >/dev/null 2>&1; then
    echo "$ts : Wakeup NAS 3" >> /home/pi/Documents/wol.log
    etherwake "$nas3mac"
elif ! ping -c 2 "$hyperv1ip" >/dev/null 2>&1; then
    echo "$ts : Wakeup HyperV 1" >> /home/pi/Documents/wol.log
    etherwake "$hyperv1mac"
elif ! ping -c 2 "$hyperv2ip" >/dev/null 2>&1; then
    echo "$ts : Wakeup HyperV 2" >> /home/pi/Documents/wol.log
    etherwake "$hyperv2mac"
else
    echo "All good"
fi
The Problem
It somewhat works. After power is restored, NAS1, NAS2 and NAS3 start up in response to the WOL packets, as expected.
The problem I'm facing: even though the device sending the WOL packets (raspberry) is clearly available on the network,
which should keep every NAS awake according to its shutdown script (they check the availability of "raspberry"),
NAS1, NAS2 and NAS3 keep shutting down again, logging that neither "raspberry" nor "hyperv1" nor "hyperv2" is online.
10.10.20.30 is down
10.10.20.2 is down
10.10.20.3 is down
shutdown!
10.10.20.30 is down
10.10.20.2 is down
10.10.20.3 is down
shutdown!
(that's raspberry/hyperv1/hyperv2)
As you can see, I already increased the ping count (for the NAS shutdown check) to 10, with a timeout of 5 seconds. Still, the NAS
keep starting and shutting down during "recovery"...
The raspberry actually never tries to wake up hyperv1 or hyperv2; it stays busy waking up nas1, nas2 and nas3...
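That part at least follows from the wake script itself: each run wakes only the first unreachable device, so the Hyper-V hosts are not touched until all three NAS answer pings. A minimal sketch of that logic as a flat loop, using a simulated probe instead of real pings; `first_unreachable`, `fake_probe`, `WAKE_ORDER` and `DOWN_HOSTS` are names made up for illustration and are not part of my actual script:

```sh
#!/bin/sh
# Hosts in the order they must come up (storage first, then the Hyper-V nodes).
WAKE_ORDER="nas1 nas2 nas3 hyperv1 hyperv2"

# Simulated state for the demo: which hosts currently do not answer.
DOWN_HOSTS="nas2 hyperv1"

# Stand-in for `ping -c 2 "$host"`: fails if the host is listed in DOWN_HOSTS.
fake_probe() {
    case " $DOWN_HOSTS " in
        *" $1 "*) return 1 ;;
        *)        return 0 ;;
    esac
}

# Print the first host (in argument order) for which the probe command fails.
# Returns 1 and prints nothing when every host answers.
first_unreachable() {
    probe=$1
    shift
    for host in "$@"; do
        if ! "$probe" "$host"; then
            echo "$host"
            return 0
        fi
    done
    return 1
}

# One run of the wake script: only the first unreachable device gets a WOL packet.
if target=$(first_unreachable fake_probe $WAKE_ORDER); then
    echo "would wake: $target"
else
    echo "All good"
fi
```

With nas2 and hyperv1 simulated as down, a run only reports nas2; hyperv1 would be handled on a later run, once nas2 answers.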
----------------
I believe that the shutdown script on the Synology kicks in before network connectivity has been restored,
which leads to another immediate shutdown.
However, if I power on "hyperv1" and/or "hyperv2" manually, none of the NAS attempts another shutdown...
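To rule the timing theory in or out, the availability check could be skipped for a while after boot. A minimal sketch of such a guard for the top of the NAS script, assuming /proc/uptime is readable on DSM; the 600-second value and the names `GRACE_SECONDS`, `uptime_seconds` and `past_grace_period` are made up for illustration and untested on the Synology:

```sh
#!/bin/sh
# Assumed grace period after boot during which no shutdown check is made.
GRACE_SECONDS=600

# Print whole seconds since boot (reads /proc/uptime, Linux only).
uptime_seconds() {
    cut -d. -f1 /proc/uptime
}

# Succeed when the given uptime (in seconds) has reached the grace period.
past_grace_period() {
    [ "$1" -ge "$GRACE_SECONDS" ]
}

if past_grace_period "$(uptime_seconds)"; then
    echo "grace period over, run availability check"
else
    echo "booted recently, skip check"
fi
```

In the real script the first branch would run the existing ping loop, and the second would simply exit.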
Any ideas what might go wrong during the "recovery sequence"?
In general the "raspberry" does respond to pings; otherwise my logs on all the NAS and Hyper-V nodes would be filled with "Raspberry is down" entries...
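One gap in my setup: the NAS log has no timestamps, so I can't line its entries up with wol.log on the raspberry. A small helper along these lines could prefix each entry, using the same date format as the wake script; `stamp` is a hypothetical name and this is only a sketch:

```sh
#!/bin/sh
# Print a message prefixed with the current date and time.
stamp() {
    printf '%s : %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$1"
}

# In the NAS script this would replace the plain echo, e.g.:
#   stamp "$i is down" >> /var/log/ha.log
stamp "example entry"
```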
cheers,
dognose