Once upon a time, round about the time computer dinosaurs roamed the planet (early 1990s), your scribe was an IT project manager. His project at the time was to deploy a networked Windows 3.1 infrastructure of about 1300 PCs across a large manufacturing site - large as in 1.5 miles long and 0.5 miles wide with about 50 buildings containing humans who needed PCs. The site was predominantly VAX/VMS based using dumb VT terminals, with a number of standalone PCs and a handful of networked PCs running on thick-wire Ethernet, mostly with DECnet and a few with the Wollongong TCP/IP stack installed. Suffice it to say it was an environment that cave men would be familiar with. I’m guessing a few forum members might also remember some of this technology.
As well as being the project manager I was also our ‘intelligent buyer’ for much of the new infrastructure, so I made sure I was up to speed with what a third-party installer was going to install for us. We cared about Disaster Recovery (DR), yes even back then, and whilst I didn’t have a limitless budget I tried to have the environment built in such a way as to support at least some aspects of DR. I couldn’t afford a spare set of servers, but the servers we were buying could take hardware RAID cards that I could afford, so I had those fitted. To put these servers in context, they were Compaq servers, probably Intel Pentium 100MHz, which could have 14 x 1GB SCSI disks installed, giving us 12GB of usable RAID 5 storage. We installed 5 of these to cover use by all 1300 PCs!
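For anyone who wants the arithmetic spelled out, here’s a rough back-of-the-envelope sketch of where the 12GB came from: one disk sat as a hot spare (as described further down) and one disk’s worth of space went to RAID5 parity. The figures are from our setup; the little calculation itself is just mine for illustration.

```python
# Back-of-the-envelope check of the RAID5 capacity described above:
# 14 slots per server, 1 disk kept as a hot spare, the remaining 13 formatted
# as a single RAID5 set (one disk's worth of space goes to parity).

DISK_SIZE_GB = 1      # 1GB SCSI disks
TOTAL_SLOTS = 14
HOT_SPARES = 1

raid5_members = TOTAL_SLOTS - HOT_SPARES          # 13 disks in the array
usable_gb = (raid5_members - 1) * DISK_SIZE_GB    # parity costs one disk: 12GB

print(f"{raid5_members} disks in RAID5 -> {usable_gb}GB usable, "
      f"{HOT_SPARES} hot spare standing by")
# 13 disks in RAID5 -> 12GB usable, 1 hot spare standing by
```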
The servers were built off-site and we went to the third-party premises to inspect them before they were delivered to site. Whilst we’d had ‘big’ RAID on the VAXes, this was our first encounter with RAID on smaller servers and the third-party was expounding the virtues of it. They said you could pull a disk whilst the server was running, put a new disk in, and the array would rebuild itself back to a fully working state. Never one to miss challenging ‘brave talk’, I said “Go on, pull that disk right now.” They seemed a little nervous, but did it. The server told us something was wrong, the disk was put back in and the server said it would take a little while to get things back to normal, but it happily did what was needed. The test was successful and everyone breathed a sigh of relief.
The servers, PCs, network and software were installed and the users trained, which all went fairly well. Once we’d finished I put my mind to DR again. We could survive the loss of a disk in a server, and even a second failure if the hot spare had finished rebuilding by then (14 disks: 1 hot spare and 13 formatted as RAID5, giving us 12GB usable), but I thought: what happens if we have a server hardware failure? We had some high priority users and some lower priority ones. So if the server used by the high priority users failed, was there a way to shut the low priority users’ server down and use the hardware for the high priority users?
The servers were all identical, so I wondered whether it would be possible to take the disks out of the failed high priority server and install them in the low priority server, effectively bringing the high priority environment back online within an hour or so. I talked to the third-party server storage experts about taking RAID5 disks out of one server and putting them into another identical server. They said they saw no reason why that wouldn’t work.
So we set up a test. On a Friday evening we would shut two servers down, swap the disks, bring each server back up again and check that the transfer had worked. The Change Control paperwork was duly filled in and approved. We’d allowed 4 hours to do the test and then swap things back to the original configuration. Friday evening was quiet, so it wouldn’t impact too many people.
Our Friday evening started. We made sure we had a good backup. We shut the servers down. We checked the disk labels so we knew which ones went where. We removed the disks from both servers and put them into the equivalent slots on the other server. Then we powered one server on. Lots of beeps and warnings emanated from it. We’d kind of expected that, but in the end the server didn’t boot, giving an error along the lines of “Bad RAID Array”. That was worrying, so we abandoned the test and worked to put things back as they were. We put the ‘Bad RAID Array’ disks back into the original server and booted it up. Lots more beeps and warnings emanated from it before it also failed to boot with a ‘Bad RAID Array’ error!
Now we were getting even more worried. The test might not have worked, but we had always assumed that if we put the disks back in the original server we’d be back where we started.
The evening turned to night. We tried everything we could think of to get the servers to recognise their original disks, but nothing we tried worked. At about midnight we gave up and got the backup tapes out. We kicked off the data restore routines, which we’d only ever used before to restore small numbers of files. Because of the way the backups were being done, a full backup plus incrementals, the software wanted to read all the tapes from the previous week to build a complete catalogue of all the files, working out which version of each file was current as of the final incremental backup taken just before we started the test. So we spent a few hours feeding tapes into the tape drives whilst the catalogue was built. At about 3-4am the system said the catalogue had been fully rebuilt and asked which files we wanted to restore…we said “All of them!”. The restore started and I left a colleague (who volunteered to stay) to nurse the restore process through. I went home to get a bit of sleep, a shower and some breakfast. We’d estimated the restore would take 5-6 hours, so I said I’d be back towards the end just to check things had gone OK.
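For anyone who hasn’t lived with full+incremental tape sets, the reason every tape had to be read is that the newest copy of any given file could be on any of them. Conceptually the catalogue rebuild looks something like the sketch below - purely illustrative, with invented tape labels and file paths, not the actual backup product’s logic or format:

```python
# Conceptual sketch of the full+incremental catalogue rebuild: replay the file
# list from the full backup and then from each incremental, in backup order,
# keeping the newest entry for each path. The restore then pulls each file from
# whichever tape holds its latest copy - which is why every tape had to be read.
# (Illustrative only; tape labels and paths are invented.)

def build_catalogue(tape_file_lists):
    """tape_file_lists: iterable of (tape_label, [(path, version)]) in backup order."""
    catalogue = {}  # path -> (tape_label, version) of the most recent copy seen
    for tape_label, files in tape_file_lists:
        for path, version in files:
            current = catalogue.get(path)
            if current is None or version > current[1]:
                catalogue[path] = (tape_label, version)
    return catalogue

tapes = [
    ("FULL_SUN", [("users/fred/report.doc", 1), ("users/ann/model.xls", 1)]),
    ("INCR_MON", [("users/fred/report.doc", 2)]),
    ("INCR_THU", [("users/ann/model.xls", 3)]),
]

for path, (tape, _) in build_catalogue(tapes).items():
    print(f"{path} -> restore newest copy from {tape}")
```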
I got back there at about 10am on Saturday morning to find that the restore had hardly started. It was running at about 10% of the speed we’d expected. So instead of 5-6 hours we were looking at 50-60 hours of restore time, which in the end stretched out to 4 to 5 days!!
We just had to live with that. We didn’t seem to be able to do anything to convince the restore to go faster, and if we stopped after 12 hours and re-started then we were just going to get a longer outage. I guess we also wondered whether, once we got to the less populated ‘incremental’ backups, it might speed up. It didn’t.
So, some time on the Wednesday we finally gave the system back to the high priority users. They weren’t entirely happy at having lost access to their data for so long. We managed to convince them that it wasn’t our fault and that a mysterious and unexpected software fault had caused the problem.
We did eventually find out why it didn’t work as planned. The RAID controller board in each server had NVRAM which stored a ‘map’ of the array, i.e. which disk was in which slot and what part of the array each disk contained. There was a utility you were supposed to run to dump that NVRAM configuration to a floppy disk. When you put the disks into a new server you needed to load the exported data from the floppy back into the NVRAM on that server’s RAID card so that it would recognise how to use the disks. Unfortunately the third-party disk storage experts omitted to tell us that little detail, so we were doomed from the start. Having said that, whilst there was an outage we didn’t lose any data…what the users got back on Wednesday was exactly what they had when they went home on Friday.
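To give a feel for what that NVRAM ‘map’ held, here’s a purely illustrative sketch of the kind of configuration a RAID card keeps about its array. The field names and values are made up for illustration; they are not Compaq’s actual NVRAM format.

```python
# Illustrative sketch of the kind of array 'map' a RAID card keeps in NVRAM.
# Field names and values are invented for illustration - not Compaq's real format.
from dataclasses import dataclass
from typing import List

@dataclass
class ArrayMap:
    array_id: str               # identity of the logical RAID5 array
    raid_level: int             # 5 in our case
    stripe_size_kb: int         # how data is striped across the members
    member_slots: List[int]     # which physical slots hold the array members, in order
    hot_spare_slots: List[int]  # slots reserved as hot spares

# The controller only trusted disks that matched the map held in its *own* NVRAM.
# Moving the disks without also exporting/importing this map meant the card saw
# a set of disks it had no record of - hence the "Bad RAID Array" error.
original_card_nvram = ArrayMap(
    array_id="ARRAY-A",
    raid_level=5,
    stripe_size_kb=16,
    member_slots=list(range(13)),   # slots 0-12 hold the 13 RAID5 members
    hot_spare_slots=[13],           # slot 13 is the hot spare
)
print(original_card_nvram)
```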
Only two of us ever knew the real reason why it went wrong, and my colleague took the secret with him to the grave. I’ve now told you, but I think the majority of the people affected by the outage will be retired by now and not worrying about an incident that took place about 30 years ago.