- The approach below is at the design stage. In theory everything looks good, and I will keep you posted soon on how it works out in practice
- Your comments, concerns, feedback and suggestions are highly welcome
Hi All,
Since the release of NPM 12.0.1 we have had an exciting new feature in the Orion platform - High Availability. The one downside is that it requires the HA cluster members to sit in the same subnet. Oops... For some that is minor, for others major. What do you do when you have two sites and cannot easily stretch a subnet between them due to architectural constraints?
Well, I have been digging into this for the past several weeks. FoE is no longer sold. It may well still be supported, but you certainly cannot buy it as a new customer. I started an FoE thread here, which has great insights, although it is no longer relevant. At the moment there seems to be a gap where a smooth inter-site fail-over solution should be, and I hope it will be plugged soon. For now SolarWinds offers an Active-Active approach (you can find a PDF about it in the FoE thread, or just get it here directly).
At first I thought that maintaining two instances manually would be a huge pain and a massive overhead for engineers, and this stopped me from accepting the idea. However, after thinking further and consulting SolarWinds support, I realised that we do not need to. Here is what I came up with:
(1)
First - you do need to purchase a second set of the licenses you already own (this is the most painful step). To soften this a bit - contact your SolarWinds reseller partner (or SolarWinds directly) and ask for a 50% discount. This is known as a "Disaster Recovery License" and is offered upon request
(2)
Deploy the live environment as normal: add all nodes, configure settings, alerts, etc. - the usual practice, nothing fancy here
(3)
Deploy an exact copy of your live environment at the DR site, use the Disaster Recovery License and point it at an "empty" database. By "empty" I mean that you do not need to populate it with any assets; just install a fresh deployment and leave it as is. You don't even need to configure any settings, views, etc. - just a vanilla setup (a cold standby, so to speak)
Now you will have an Active-Active setup (although not quite Active-Active as originally suggested by SolarWinds in the above PDF. I would rather call it Active-Reserved, because at the DR site you do not add any devices and you do not configure anything)
DR considerations:
The SolarWinds kit at the DR site will just sit there and do nothing until DR is invoked. We have a database AG in place (formerly mirroring), so in the event of a main site failure SQL will fail over to the mirror copy at the DR site. Because SolarWinds is already deployed at DR, all that is left to do is run the configuration wizard and re-point the DR deployment from the empty db to the mirror copy of the live db at the DR site. Recovery should take as long as the configuration wizard needs to complete its magic (I haven't measured it yet, and it will depend on the number of modules, the hardware kit, etc., but I anticipate under 2 hours). That is not instant fail-over as with HA, but it is good enough, given that monitoring is not in itself a business-critical tool and, in the event of a site fail-over, is definitely not high on the business's list of priorities. The exception is if you are a managed monitoring solutions provider (which I hope you are not, as otherwise a 2-hour recovery may be a "killer").
If you do not have SQL AG, simply make sure you back up your SQL database at the live site and transfer the backup over to the DR site on a regular basis. There is no need to restore it each time, but a fresh backup should be available at the DR site so that the DB can be restored there in the event of DR
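Since this option only works if a recent backup really is sitting at the DR site, it may be worth checking that automatically. Below is a minimal sketch, not a tested part of this design: it assumes the backups arrive as `*.bak` files in a single folder, and the function names, the file pattern and the age threshold are all my own placeholders.

```python
from datetime import datetime, timedelta
from pathlib import Path
from typing import Optional


def newest_backup(backup_dir: Path, pattern: str = "*.bak") -> Optional[Path]:
    """Return the most recently modified backup file, or None if there is none."""
    # "*.bak" is an assumed naming convention for the transferred backups.
    candidates = [p for p in backup_dir.glob(pattern) if p.is_file()]
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def backup_is_fresh(backup_dir: Path, max_age: timedelta,
                    now: Optional[datetime] = None) -> bool:
    """Return True if the newest backup in backup_dir is no older than max_age."""
    latest = newest_backup(backup_dir)
    if latest is None:
        # No backup at all counts as "not fresh".
        return False
    now = now or datetime.now()
    age = now - datetime.fromtimestamp(latest.stat().st_mtime)
    return age <= max_age
```

You could run `backup_is_fresh(Path(...), timedelta(hours=24))` from a scheduled task on a DR server, with the path pointing at wherever your backups land, and raise an alert when it returns `False`. Note that this checks file timestamps only; it does not prove the backup is restorable.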
To "complicate" things further (or, better said, to safeguard and increase availability) - we also plan to create an HA cluster for the application layer (the new feature) at both sites, thereby protecting against local app server failures. Although HA at DR might be a bit excessive, we would like to keep the DR and LIVE sites as closely mirrored as possible
And yet another thing - the NTA server. You can have it on the same APP box (which is not recommended, although fully supported), or on a separate box. In the event of DR you simply recover the NTA DB from backup at the DR site, and then you should be able to switch to the DR box as well
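To keep the DR considerations above in one place, here is a rough sketch of them as an ordered checklist. It is only an illustration of the plan as described in this post: the function name and the step wording are mine, and the position of the NTA step is a guess, since NTA recovery has not been tested yet.

```python
from typing import List


def dr_runbook(has_sql_ag: bool) -> List[str]:
    """Return the ordered DR steps from this post for a given database setup."""
    steps = []
    if has_sql_ag:
        # With an AG, SQL fails over to the mirror copy on its own.
        steps.append("Confirm SQL has failed over to the mirror copy "
                     "of the live db at the DR site")
    else:
        # Without an AG, the regularly transferred backup is restored instead.
        steps.append("Restore the latest transferred backup of the live db "
                     "at the DR site")
    steps.append("Run the configuration wizard on the DR deployment and "
                 "re-point it from the empty db to the live db copy")
    # Assumption: NTA recovery comes last; the real order still needs testing.
    steps.append("Recover the NTA DB from backup at the DR site and switch "
                 "to the DR NTA box")
    return steps


for number, step in enumerate(dr_runbook(has_sql_ag=True), start=1):
    print(f"{number}. {step}")
```

The only branch is the database step, which is the point of the design: everything after the database is available at DR is the same regardless of how it got there.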
Grey areas:
- After running the config wizard and re-pointing the DR instance at the live db, it is not clear how to proceed with NTA recovery, in particular which steps are involved in recovering the NTA box. This needs further testing, but I guess the standard recovery approach will apply here
- Running two databases for different SolarWinds deployments within one SQL instance (see image below). After reading a lot of manuals I could not find any reason why this would not work. Your comments are highly appreciated here.
- Restoring the [App DR] <--> [HA DR] relationship after failover (after running the config wizard at the DR site and re-pointing to the live db). Not sure what is going to happen to the HA cluster at this point - again, needs testing. We have asked SolarWinds support to confirm and are still waiting for them to come back
Final say:
Once again, after digging through many different options, this one seems the most appealing, with virtually no day-to-day overhead for engineers - which is key to not overloading them. Running Active-Active and managing all changes manually at both ends is way too much - sorry, but no
Diagrams:
Normal operation:
Site failover: