Possible GFS/ECMWF data problems

I’ve identified a database problem which might affect the availability of the GFS and ECMWF data for WxSim. I’ve tried to get the databases back in sync but it’s taking a long time. I’m really tired so it’s probably safer to go to bed than to try to apply fixes that might break things further!

I’ll come back to this problem in the morning.

I’m recovering all the systems. One of the servers (the one that runs the forum) ran out of disk space overnight, which killed lots of stuff. I’m back from 100% full on the root partition to 91% full, which is enough to get the forum and some other systems back up and running.
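For anyone wanting to catch a filling root partition before it reaches 100%, here is a minimal sketch of a usage check. It is not what runs on these servers; the 90% threshold and the function names are assumptions chosen for illustration.

```python
import shutil


def usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path="/", threshold=90.0):
    """True when usage is at or above `threshold` (a hypothetical limit)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    pct = usage_percent("/")
    status = "WARNING" if is_nearly_full("/") else "ok"
    print(f"{status}: / is {pct:.1f}% full")
```

Run from cron or a similar scheduler, a check like this could send an alert while there is still room to act. The figure may differ slightly from what `df` reports, because `df` accounts for reserved blocks.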

Next job is to untangle the database which doesn’t like running out of space. Worst case I’ll have to restore a backup but there are a few other tricks to try first. A dead database means no WxSim data for now.

I think I found the problem. A job that’s supposed to clear out old WxSim data from the database looks to have failed which means there are many more records than there should be. I’m looking into getting it started again.
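As an illustration of the kind of clean-up job described here, the sketch below deletes rows older than a retention window and returns how many were removed, so a wrapper script can notice when the job stops doing anything. Everything specific is an assumption: the real database engine, table and column names are not given in this thread, so `sqlite3`, a `forecast` table and an integer `fetched_at` Unix timestamp stand in for them.

```python
import sqlite3
import time

SECONDS_PER_DAY = 86400


def purge_old_rows(conn, keep_days=7, now=None):
    """Delete rows older than `keep_days` and return the number removed.

    `forecast` and `fetched_at` are hypothetical names for illustration.
    """
    now = time.time() if now is None else now
    cutoff = int(now) - keep_days * SECONDS_PER_DAY
    with conn:  # commits on success, rolls back if the DELETE raises
        cur = conn.execute("DELETE FROM forecast WHERE fetched_at < ?", (cutoff,))
    return cur.rowcount


def demo():
    """Build a throwaway in-memory table and purge it once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE forecast (id INTEGER PRIMARY KEY, fetched_at INTEGER)")
    now = 1_000_000_000
    ages_in_days = [0, 3, 10, 30]
    conn.executemany(
        "INSERT INTO forecast (fetched_at) VALUES (?)",
        [(now - days * SECONDS_PER_DAY,) for days in ages_in_days],
    )
    conn.commit()
    deleted = purge_old_rows(conn, keep_days=7, now=now)
    remaining = conn.execute("SELECT COUNT(*) FROM forecast").fetchone()[0]
    conn.close()
    return deleted, remaining
```

Because exceptions are not swallowed, a failed run exits with an error rather than passing silently, which makes it easier to spot a purge job that has stopped working.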

I’ve already switched the GFS to bohler till you get it sorted.

I may be some time. I’ve got corrupt log files which means the database won’t start. To fix them I need to copy all of the data just in case the fix goes wrong. There’s not enough space left on the server to take a copy of the files. Arrgghhh.

It’s not impossible to fix, but it’s just going to take a while longer.

I had to copy the data off-server, but as that’s a network operation it was slow. At least that gave me an opportunity to eat lunch :slight_smile:

I think I’m on the downhill part of the journey although there’s still some way to go.

  1. I found a flaw in my backup methods. The backups still run when there’s a database problem, so they create zero-sized backup files. However, as a backup of the backup I download all the files to my NAS so I have a secondary copy. Unfortunately I forgot to turn on snapshotting, which would have let me keep older versions of the files for a week or three, so the files on my NAS were also zero length. This isn’t sounding good at this stage.
  2. I turned on the ‘nuclear - don’t enable this unless it’s a real emergency’ database recovery option which has enabled me to start the database in read-only mode - which bypasses the corrupt log files (these aren’t error logs, they’re database transaction logs which are much more important). This has allowed me to get into the database and I’m currently running a database dump to get a good copy of all the data. This is running well but as it’s writing to off-disk storage it’s going to take a while to complete.
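The backup flaw in item 1 (an empty dump overwriting the last good one) can be guarded against with a size check before the new file replaces the old. This is a rough sketch under assumptions, not the backup script used here: `promote_backup`, the file names and the `min_bytes` limit are all hypothetical.

```python
import os
import tempfile


def promote_backup(candidate, dest, min_bytes=1):
    """Replace `dest` with `candidate` only if the candidate looks plausible.

    Returns True if `dest` was replaced, False if the old backup was kept.
    """
    if os.path.getsize(candidate) < min_bytes:
        os.remove(candidate)  # discard the suspicious dump, keep the old one
        return False
    os.replace(candidate, dest)  # atomic when both paths share a filesystem
    return True


def demo():
    """Show that a zero-length dump does not overwrite a good backup."""
    with tempfile.TemporaryDirectory() as workdir:
        dest = os.path.join(workdir, "db.dump")
        with open(dest, "wb") as f:
            f.write(b"yesterday's good dump")
        empty = os.path.join(workdir, "db.dump.new")
        open(empty, "wb").close()  # simulates a dump taken while the DB is down
        kept_old = not promote_backup(empty, dest)
        with open(dest, "rb") as f:
            survived = f.read() == b"yesterday's good dump"
        return kept_old, survived
```

In practice `min_bytes` would be set well above 1, for example to a fraction of the previous backup's size, and the same check could run before files are synced to secondary storage.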

Next steps will be to nuke the existing database installation across all three servers and re-install from scratch. That will let me rebuild the 3-server database cluster. I can then restore the database backup from remote storage. After that I need to delete the old WxSim data. The forecast file had grown to about 313GB which is a bit excessive! Finally, I need to review my backup mechanisms to avoid this kind of thing happening in future.

I have the database cluster started again and the data is reloading from network storage. I don’t know how long this will take because I’ve never done it before.

Chris, many thanks for your efforts. I can’t imagine how much time and knowledge this takes.
(y)

At least it’s not a Microsoft update that has messed it up.
Keep going, you will get there in the end, even if it takes a while.
:+1:

Unfortunately I found a problem so had to restart the restore process. It looks like it will take a couple of hours to complete so I’m going to find something else to do for a while and will check back every so often.

All the data has been restored, so the next step is to clear out the unwanted records and then start all the WxSim data download scripts running again.

Getting closer. I’ve dumped the 330GB of unwanted data so the whole database system is much leaner now! The 1st Sept 12z GFS update is running and when that’s complete I’ll run the latest ECMWF update.

We have normality. I repeat, we have normality. Anything you still can’t cope with is therefore your own problem.

The 1st Sept 12z ECMWF data is also now available and (as far as I can tell) all functions are back up and running again. I’ll check to make sure of that but nothing is ringing bells to say “I’m still broken”.

Sorry for the extended outage. Luckily it happened on Sunday when there wasn’t anything else important that needed doing.

Thanks, Chris, for getting it back up so quickly and keeping us updated.
It is much appreciated!

Thanks so much, Chris!

It’s good to see no further reports of problems overnight. Hopefully WxSim downloads are working properly again for everyone.

Thank you so much, Chris!
