NWS College Park datacenter recovery continues

2024-04-17 06:05:50 EDT - Iva Talley Additional comments
College Park Datacenter Issue - Update #10

Key Points

No significant updates during overnight restoration efforts.
NCEP Centers’ (OPC, CPC, and WPC) operations remain severely degraded due to downed NetApp systems in College Park.
NCO support will continue to prioritize bringing up RZDM servers and the NCWCP intranet site.
No ETR.
Current Known Impacts include:
NCEP Centers’ websites hosted in CP that remain inaccessible include EMC and NCWCP intranet sites.
WPC, OPC, and CPC’s operational product suites range from degraded to down.
FTPPRD is inaccessible in CP (Customers are able to use nomads.ncep.noaa.gov as a viable backup in the meantime)
NCO operations personnel are unable to monitor NWS networks and circuits.
CONUS QPE data is not updating on MRMS (Index of /data)
Several layers are not updating on NWS Cloud Services (GIS and Map Viewer)
Multiple outside datasets are not available/delayed (UKMET data, ECMWF data, Canadian METARS, ACARS aircraft data)
See the NCO IT Status Dashboard for the latest updates and more information

SDM Mary Beth

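Since the update above points customers at nomads.ncep.noaa.gov as the backup while FTPPRD is down, here is a minimal fallback sketch for anyone pulling model data by script. The base URLs and the gfs.YYYYMMDD/HH/atmos/ layout are assumptions about the usual FTPPRD/NOMADS mirror structure, not something confirmed by the message, so adjust them to whatever your pipeline actually requests.

```python
# Hedged sketch: try FTPPRD first, fall back to NOMADS when it is unreachable.
# The base URLs and the gfs.YYYYMMDD/HH/atmos/ layout are assumptions about the
# usual mirror structure; adjust them to whatever your pipeline actually pulls.
from urllib.error import URLError
from urllib.request import urlopen

MIRRORS = [
    "https://ftpprd.ncep.noaa.gov/data/nccf/com/gfs/prod",      # primary (down per this update)
    "https://nomads.ncep.noaa.gov/pub/data/nccf/com/gfs/prod",  # suggested backup
]

def fetch_gfs_file(day: str, cycle: str, name: str) -> bytes:
    """Fetch e.g. day='20240417', cycle='00', name='gfs.t00z.pgrb2.0p25.f000'."""
    last_err = None
    for base in MIRRORS:
        url = f"{base}/gfs.{day}/{cycle}/atmos/{name}"
        try:
            with urlopen(url, timeout=60) as resp:   # plain HTTPS GET
                return resp.read()
        except URLError as err:                      # mirror down or file missing: try the next one
            last_err = err
    raise RuntimeError(f"all mirrors failed for {name}") from last_err
```

The same try-primary-then-backup pattern applies to any of the other product trees; the point is simply to fail over to the backup the message recommends.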

Also on the NCO status page, NCEP Central Operations: SDM Admin Messages (which is now back in operation):

SENIOR DUTY METEOROLOGIST NWS ADMINISTRATIVE MESSAGE
NWS NCEP CENTRAL OPERATIONS COLLEGE PARK MD
1019Z WED APR 17 2024

…UPDATES TO RECENT NWS OPERATIONAL OUTAGES…

WIDESPREAD WFO NETWORK OUTAGES…
WFOs’ internet and AWIPS connections have remained stable since
the circuit was restored in College Park Tuesday morning. NCO’s
network team will continue to work towards mitigating impacts
during the recurring circuit outages in College Park.

NWS BROADCAST REPEATED MESSAGES…
The problem with NWS products being broadcast multiple times
was traced to Monday’s efforts to mitigate impacts from the
temperature spike in the College Park Data Center. NCO
implemented a fix to correct the issue at 2:00pm EDT Tuesday.

MRMS…
CONUS QPE data continues to not update on MRMS
(Index of /data). The problem has been linked
to the College Park Data Center Outage. Efforts to restore the
data will resume early Wednesday.

RECOVERY EFFORTS IN THE COLLEGE PARK DATA CENTER…
No significant updates during overnight restoration efforts. NCEP
Centers’ (OPC, CPC, and WPC) operations remain severely degraded
due to downed NetApp systems in College Park. No ETR.

Current Known Impacts include:
-NCEP Centers’ websites hosted in CP that remain inaccessible
include EMC and NCWCP intranet sites.
-WPC, OPC, and CPC’s operational product suites range from
degraded to down.
-FTPPRD is inaccessible in CP (Customers are able to use
nomads.ncep.noaa.gov as a viable backup in the meantime)
-NCO operations personnel are unable to monitor NWS networks and
circuits.
-CONUS QPE data is not updating on MRMS
(Index of /data)
-Several layers are not updating on NWS Cloud Services (GIS and
Map Viewer)
-Multiple outside datasets are not available/delayed (UKMET data,
ECMWF data, Canadian METARS, ACARS aircraft data)

Gerhardt/SDM/NCO/NCEP

Looks like the worst is over…

SENIOR DUTY METEOROLOGIST NWS ADMINISTRATIVE MESSAGE
NWS NCEP CENTRAL OPERATIONS COLLEGE PARK MD
1836Z WED APR 17 2024

…AWIPS 12Z GFS (AND DOWNSTREAM) DATA DELAYS - UPDATE…

NCO Support is reporting that a large queue has cleared out and
most data should be available…Please check and report whether
data is updating. Thank you.

Shruell/SDM/NCO/NCEP

This morning’s synopsis from the problem report email:

2024-04-17 22:10:34 EDT - Iva Talley Additional comments

NCO received the latest update concerning this outage:

College Park Datacenter Issue - Update #11

Key Points

6 pm EDT update: Data is currently rebuilding from the NetApp multi-disk failure that NCO encountered a few days ago. It’s moving in a positive direction, as both previously failed aggregates are progressively rebuilding. The estimated rebuild time is unknown, but if things keep progressing without failure, the rebuilds should complete by tomorrow.

After the NetApp aggregates rebuild, NCO will verify that the file systems are consistent. If they are not, consistency checks will need to be performed; more details on the impact will follow.
Meanwhile, NCO is rebooting WPC compute farm systems to resolve hung mounts in support of restoring WPC operations.

ACARS aircraft data are again ingesting into the numerical forecast models.
FTPPRD is accessible again in CP, currently populated with GFS and GEFS data (Customers should continue using nomads.ncep.noaa.gov as a viable backup, in case of issues retrieving model data from ftpprd)
NCEP Centers’ (OPC, CPC, and WPC) operations remain severely degraded due to downed NetApp systems in College Park.

NCO support will continue to prioritize bringing up RZDM servers and the NCWCP intranet site.
No ETR.
Current Known Impacts include:
NCEP Centers’ websites hosted in CP that remain inaccessible include EMC and NCWCP intranet sites.
WPC, OPC, and CPC’s operational product suites range from degraded to down.
NCO operations personnel are still unable to monitor NWS networks and circuits.
CONUS QPE data is not updating on MRMS (Index of /data)
Several layers are not updating on NWS Cloud Services (GIS and Map Viewer)
Multiple outside datasets are not available/delayed (UKMET data, ECMWF data, Canadian METARS)
See the NCO IT Status Dashboard for the latest updates and more information

(idt)

At least the recovery is getting sorted out.

This morning’s update:

2024-04-19 06:00:48 EDT - Iva Talley Additional comments

College Park Datacenter Issue - Update #14

Key Points

No significant updates overnight for restoration efforts.
Teams will resume efforts today, prioritizing the disk rebuilds, RZDM, and the NCWCP intranet.
Still no ETR for the datacenter restoration.
Current known impacts include:
NCEP Centers’ websites hosted in CP that remain inaccessible include EMC and NCWCP intranet sites.
WPC, OPC, and CPC’s operational product suites range from degraded to down.
NCO operations personnel are still unable to monitor NWS networks and circuits.
Several layers are not updating on NWS Cloud Services (GIS and Map Viewer)
The NCEP Model Status website is not updating in real time
Multiple outside datasets are not available/delayed (UKMET data, ECMWF data, Canadian METARS)
See the NCO IT Status Dashboard for the latest updates and more information

Keith Liddick

Senior Duty Meteorologist
NOAA/NWS/NCEP Central Operations

I’m glad I’m not on their Ops team! I remember one major disk outage back in the early 90s where we ‘lost’ the storage for our biggest manufacturing plant. They had no access from Friday evening until the following Wednesday. I was there from Friday evening until I drove home at about 3am on Sunday morning, to get about 3 hours sleep before I had to go back in again to make sure the restore process ran successfully.

I learned a valuable lesson from that incident…you must always test backups. We had been doing just that…we restored a set of random files from a randomly selected tape or two each month just to make sure the tapes, tape drives and restore software were working. However, I learned that there’s a huge difference between restoring a few files and a whole disk storage unit. A few files take minutes…a whole storage unit took the best part of 4 days. That was back in the days when a storage array of 12GB was considered big!

As for how the storage got ‘lost’ that’s an entirely different story that maybe I’ll tell one day.

[Steve Urkel “Oops” GIF]

:rofl:
Is that story going to start
“Once upon a time”
There was a data storage backup

It might have anonymous characters appearing in it.

The NWS has two major datacenters, one in College Park, MD and one in Boulder, CO. They regularly switch from one to the other so the non-primary one can have patches/upgrades.

What seems to have happened at College Park was a major chiller failure leading to many over-temperature alarms and chaotic shutdowns of various systems as they thermaled-out.

The first message on the thread of the problem report showed:

2024-04-15 22:59:07 EDT - Iva Talley
Work notes
College Park Datacenter Issue - Update #1

Key Points

The IDP College Park data center experienced a temperature spike around 15/1530Z due to a faulty switch on a temperature chiller. The temperature chiller issue was resolved by 15/1620Z.
NCO remains engaged in disk rebuilding/systems recovery efforts associated with fallout from this morning’s temp chiller issue at the CP datacenter.

Current Known Impacts include:
NCEP Centers’ websites hosted in CP, including NCO, WPC intranet, OPC, and EMC, are currently inaccessible
CPC is unable to issue products due to Intranet service being inaccessible
FTPPRD is inaccessible in CP. Customers are able to use nomads.ncep.noaa.gov as a viable backup in the meantime.
No ETR at this time.
See the NCO IT Status Dashboard for the latest updates and more information

Reginald Ready

Senior Duty Meteorologist
NOAA/NWS/NCEP Central Operations

The next message showed 188 servers in deep trouble. No wonder it’s taking time to rebuild.

Once upon a time, round about the time computer dinosaurs roamed the planet (early 1990s), your scribe was an IT project manager. His project at the time was to deploy a networked Windows 3.1 infrastructure of about 1300 PCs across a large manufacturing site - large as in 1.5 miles long and 0.5 miles wide, with about 50 buildings containing humans who needed PCs. The site was predominantly VAX/VMS based, using dumb VT terminals, a number of standalone PCs and a handful of networked PCs running on thick-wire Ethernet, mostly with DECnet and a few with the Wollongong TCP/IP stack installed. Suffice to say it was an environment that cave men would be familiar with. I’m guessing a few forum members might also remember some of this technology :wink:

As well as being the project manager, I was also our ‘intelligent buyer’ for much of the new infrastructure, so I made sure I was up to speed with what a third-party installer was going to install for us. We cared about Disaster Recovery (DR), yes even back then, and whilst I didn’t have a limitless budget I tried to have the environment built in such a way as to start to support some aspects of DR. I couldn’t afford a spare set of servers, but the servers we were buying could have hardware RAID cards fitted, and those I could afford, so I had them fitted. To put these servers in context, these were Compaq servers, probably Intel Pentium 100MHz, which could have 14 x 1GB SCSI disks installed, giving us 12GB of usable RAID 5 storage. We installed 5 of these to cover use by all 1300 PCs!

The servers were built off-site and we went to the third-party premises to inspect them before they were delivered to site. Whilst we’d had ‘big’ RAID on the VAXes, this was our first encounter with RAID on smaller servers, and the third party was expounding its virtues. They said…you can pull a disk whilst they’re running and put a new disk in, which will rebuild back to a fully working environment. Never one to miss challenging ‘brave talk’, I said “Go on, pull that disk right now.” They seemed a little nervous, but did it. The server told us something was wrong, the disk was put back in, and the server said it would take a little while to get things back to normal, but it happily did what was needed. The test was successful and everyone breathed a sigh of relief.

The servers, PCs, network and software were installed and users trained, which went fairly well. Once we’d finished I put my mind to DR again. We could survive the failure of two disks in a server (14 disks - 1 hot spare and 13 formatted as RAID5, giving us 12GB usable), but I thought: what happens if we have a server hardware failure? We had some high priority users and some lower priority. So if the server used by the high priority users failed, was there a way to shut the low priority user server down and use the hardware for the high priority users?
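
For anyone who doesn’t want to do the sums in their head, here is where the 12GB usable figure comes from; the 14-disk and 1GB-per-disk numbers are from the post, and the rest is just the standard RAID 5 “one disk’s worth of parity” rule:

```python
# Where the 12GB usable figure comes from (disk counts and sizes as in the post).
disk_size_gb = 1                          # each SCSI disk
total_disks = 14
hot_spares = 1                            # kept idle, ready to replace a failed member
raid5_members = total_disks - hot_spares  # 13 disks actually in the RAID 5 set
parity_disks = 1                          # RAID 5 gives up one disk's worth of space to parity
usable_gb = (raid5_members - parity_disks) * disk_size_gb
print(usable_gb)                          # -> 12
```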

The servers were all identical so I wondered whether it would be possible to take the disks out of the high priority server and install them in the low priority server, effectively bringing the high priority server back online within an hour or so. I talked to the third-party server storage experts about taking RAID5 disks out of one server and putting them into another identical server. They said they saw no reason why that wouldn’t work.

So we set up a test. On a Friday evening we’d shut two servers down and swap the disks, bringing each server back up again and testing that the transfer had worked. The Change Control paperwork was duly filled in and approved. We’d allowed 4 hours to do the test and then swap things back to the original configuration. Friday evening was quiet so it wouldn’t impact too many people.

Our Friday evening started. We made sure we had a good backup. We shut the servers down. We checked the disk labels so we knew which ones went where. We removed the disks from both servers and put them into the equivalent slots on the other server. Then we powered one server on. Lots of beeps and warnings emanated from it. We’d kind of expected that, but eventually the server didn’t boot, giving an error along the lines of “Bad RAID Array”. That was worrying, so we abandoned the test and worked to put things back as they were. So we put the ‘Bad RAID Array’ disks back into the original server and booted it up. Lots more beeps and warnings emanated from it before it also failed to boot giving a ‘Bad RAID array’ error!

Now we were getting even more worried. The test might not have worked but we had always assumed that if we put the disks back in the original server that we’d be back where we started from.

The evening turned to night. We tried everything we could think of to get the servers to recognise their original disks, but nothing we tried worked. At about midnight we gave up and got the backup tapes out. We kicked off the data restore routines that we’d only ever used to restore small numbers of files. Because of the way the backups were being done, full+incrementals, the software wanted to read all the tapes from the previous week to build a full catalogue of all the files, working out which ones were the final versions when we did the final incremental backup before starting the test. So we spent a few hours feeding tapes into the tape drives whilst the catalogue was built. At about 3-4am the system said the catalogue had been fully rebuilt and asked which files we wanted to restore…we said “All of them!”. The restore started and I left a colleague (who volunteered to stay) to nurse the restore process through. I went home to get a bit of sleep, a shower and some breakfast. We’d estimated the restore would take 5-6 hours, so I said I’d be back towards the end just to check things had gone OK.
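
In case the catalogue step isn’t familiar: a full+incremental restore first merges the full set with each incremental in order, keeping the newest copy of every file, and only then reads the winning copies back. A rough sketch of the idea (the Entry structure is invented purely for illustration, not how any particular backup product stores its catalogue):

```python
# Illustrative sketch only: how a full+incremental restore catalogue ends up
# pointing at the latest copy of every file. The Entry structure is made up.
from dataclasses import dataclass

@dataclass
class Entry:
    path: str        # file path as recorded on the tape
    mtime: float     # timestamp of the copy on that tape
    tape_id: str     # which tape holds this copy

def build_catalogue(full: list, incrementals: list) -> dict:
    """Scan the full backup first, then each incremental in order; newer copies win."""
    catalogue = {}
    for tape in [full, *incrementals]:             # oldest tape set first
        for entry in tape:
            current = catalogue.get(entry.path)
            if current is None or entry.mtime > current.mtime:
                catalogue[entry.path] = entry      # later incremental supersedes earlier copy
    return catalogue

# Restoring "all of them" then means reading, for every path, the tape its
# winning entry points at - which is why every tape from the week had to be
# mounted before a single file came back.
```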

I got back there at about 10am on Saturday morning to find that the restore had hardly started. It was running at about 10% of the speed we’d expected. So instead of 5-6 hours we were looking at 50-60 hours…or 4 to 5 days!!

We just had to live with that. We didn’t seem to be able to do anything to convince the restore to go faster, and if we stopped after 12 hours and restarted then we were just going to get a longer outage. I guess we also wondered if once we got to the less populated ‘incremental’ backups it might speed up. It didn’t :frowning:

So, some time on the Wednesday we finally gave the system back to the high priority users. They weren’t entirely happy at having lost access to their data for so long. We managed to convince them that it wasn’t our fault and that a mysterious and unexpected software fault had caused the problem.

We did eventually find out why it didn’t work as planned. The RAID controller boards in each server had NVRAM which stored a ‘map’ of the array, i.e. which disks were in which slot and what part of the array each disk contained. There was a utility that you were supposed to run to dump the NVRAM to a floppy disk. When you put the disks into a new server you needed to load that exported data back into the NVRAM on the new server’s RAID card so that it would recognise how to use the disks. Unfortunately the third-party disk storage experts omitted to tell us that little detail, so we were doomed from the start. Having said that, whilst there was an outage we didn’t lose any data…what they got back on Wednesday was exactly what they had when they went home on Friday.
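
To make the failure mode concrete, here is a toy model of that NVRAM ‘map’: the data on the disks is fine, but a controller whose NVRAM doesn’t describe those disks just reports a bad array. Everything below is hypothetical illustration, not the real controller’s metadata format:

```python
# Toy model of the failure mode described above. All structures and names are
# hypothetical illustration, not the real controller's metadata format.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ArrayMap:
    """What the controller's NVRAM remembers about its RAID set."""
    slot_roles: dict = field(default_factory=dict)   # slot number -> role in the array

@dataclass
class Controller:
    nvram: Optional[ArrayMap] = None                 # a swapped-in set of disks meets empty NVRAM

    def assemble(self, slots_populated: set) -> str:
        # The data on the disks is intact, but without a map the controller
        # has no idea how the stripes are laid out across them.
        if self.nvram is None or set(self.nvram.slot_roles) != slots_populated:
            return "Bad RAID Array"
        return "Array online"

# The step the third party never mentioned: export the map before moving the
# disks, then import it into the destination controller's NVRAM.
source = Controller(nvram=ArrayMap(slot_roles={i: f"member-{i}" for i in range(13)}))
exported_map = source.nvram                          # the "dump NVRAM to floppy" utility
destination = Controller()                           # the server we actually booted
print(destination.assemble(set(range(13))))          # -> Bad RAID Array
destination.nvram = exported_map                     # the missing "load from floppy" step
print(destination.assemble(set(range(13))))          # -> Array online
```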

Only two of us ever knew the real reason why it went wrong and my colleague took the secret with him to the grave. I’ve now told you, but I think the majority of people affected by the outage will be retired now and not worrying about an incident that took place about 30 years ago.


WOW! That’s quite a saga. Thanks for sharing the experience.
