McMahon data missing

administrator · 21 July 2024 17:04

I am aware of the problem with missing GFS and ECMWF data on my server. I can see the error so I know it’s a database issue and I’m pretty sure I know how to fix it.

Unfortunately I’m away from home at our annual family weekend. I’m on top of a hill with the nearest town about 10 miles away and only have my phone with me. I could probably have fixed the problem on the small screen, but…

Good security practice dictates that you don’t allow root user to log into SSH using a password. So I use a certificate for root user logins. That’s nice and secure but does mean that if you don’t have the current certificate then you can’t log in. I’ve forgotten to upload the latest certificate onto my phone so I’ve no way to log in to a command shell.

We’ll arrive home tomorrow (Monday) afternoon so I should be able to fix the problem then. I also now know another parameter that I need to monitor and report on, although in this case it wouldn’t have helped because I still wouldn’t have been able to log in when notified of the problem.

broadstairs · 21 July 2024 17:33

We can use Sam’s (Bohler) data for now and my forecasts have been OK with that.

Stuart

administrator · 22 July 2024 12:29

I am home and I’m working on getting the database back online. As it’s a clustered database it takes a little more to start it up than a single server database.

administrator · 22 July 2024 13:21

The first server is back online and the second server is starting up. The two servers are now communicating to synchronise the data from the second server to the first. It looks like the system might have been down to one running server for a while because there seems to be a lot of data to synch.

Once the first and second servers are back in line with each other I can start server three, which will also have to synchronise back into the cluster. I can’t let any systems start using the database until everything is back in synch, otherwise it complicates things and potentially slows everything down.

Edit: Two finished - third server synchronising

administrator · 22 July 2024 14:21

The three databases are all back in sync and the GFS/ECMWF jobs have started. They’re running in parallel though so they will take longer than normal to complete.

administrator · 22 July 2024 15:15

I think I’m getting to the bottom of the problem now! The table that contains the GFS/ECMWF data has 6.5 million records in it and there should only normally be about 130,000 at most. The disk isn’t full, but it’s at 80% which won’t help. Hopefully once I tidy the database up things will return to more like sanity!
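For illustration only, a check along these lines could flag this kind of growth early. It is a sketch, not what runs on the server: the row limit and the 80% figure come from the numbers above, while the function names, the 2x margin and the `/` mount point are assumptions.

```python
import shutil

EXPECTED_MAX_ROWS = 130_000   # normal upper bound quoted above
ROW_MARGIN = 2                # assumption: warn at twice the normal size
DISK_WARN_FRACTION = 0.80     # the level the disk had reached

def current_disk_fraction(path="/"):
    """Fraction of the filesystem at `path` that is in use (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def health_warnings(row_count, disk_fraction):
    """Return a list of warning strings; empty when nothing looks wrong."""
    warnings = []
    if row_count > EXPECTED_MAX_ROWS * ROW_MARGIN:
        warnings.append(
            f"forecast table has {row_count:,} rows, "
            f"expected at most about {EXPECTED_MAX_ROWS:,}"
        )
    if disk_fraction >= DISK_WARN_FRACTION:
        warnings.append(f"disk is {disk_fraction:.0%} full")
    return warnings
```

The row count itself would come from a `SELECT COUNT(*)` against the forecast table; that query is left out here because the real table name isn’t given in this thread.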

weatherbee · 22 July 2024 16:07

Chris,
I just downloaded your GFS/ECMWF data without any problems.
Thanks for all your hard work.
Tom

administrator · 22 July 2024 16:15

The data will only be there temporarily. I need to empty the forecast table to get it back down to size. It takes a long time for MariaDB to process a 190GB table!

administrator · 22 July 2024 18:53

I’m still working on this

Turns out that sorting out what has now become a 210GB database table that’s consuming a vast proportion of the server’s disk space is non-trivial! I was also tripped up by a database backup that kicked off in the background that was preventing my operations doing what I wanted until it had finished.

More news soon…I hope.

administrator · 22 July 2024 19:08

An alternative approach has paid dividends. I’ve dropped the huge table and re-created the structure of it from the last backup. So I now have an empty table, a lot more disk space and a MariaDB server that’s not struggling to handle a huge database.

I need to re-sync the three database servers now, but that will hopefully be quicker because there’s less data to move around now. Then I can kick off the GFS and ECMWF jobs again.

I also understand why the tables grew so much. At the end of each processing run there’s some code that deletes old run data. The system only ever uses the most recently processed run, so there’s no need to keep old data. Unfortunately there’s something in that code that’s failing and therefore no old data has been deleted for a couple of weeks. The same code is used by the GFS and ECMWF scripts so that doubled the problem. I’ll look at the code once I’ve got things back up and running again.
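As a rough illustration of that kind of cleanup step (not the actual script), the sketch below keeps only the newest run per model and reports how many rows were removed, so a run that deletes nothing for days stands out. It uses Python’s built-in `sqlite3` as a stand-in for MariaDB, and the `forecast` table, its columns and the function name are all hypothetical.

```python
import sqlite3

def delete_old_runs(conn, model):
    """Keep only the newest run for `model`; return the number of rows deleted."""
    latest = conn.execute(
        "SELECT MAX(run_time) FROM forecast WHERE model = ?", (model,)
    ).fetchone()[0]
    if latest is None:
        return 0  # nothing stored for this model yet
    cur = conn.execute(
        "DELETE FROM forecast WHERE model = ? AND run_time < ?",
        (model, latest),
    )
    conn.commit()
    return cur.rowcount

# Small in-memory demonstration with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forecast (model TEXT, run_time TEXT, value REAL)")
conn.executemany(
    "INSERT INTO forecast VALUES (?, ?, ?)",
    [
        ("GFS", "2024-07-22T06", 1.0),
        ("GFS", "2024-07-22T12", 2.0),
        ("ECMWF", "2024-07-22T06", 3.0),
    ],
)
deleted = delete_old_runs(conn, "GFS")
print(f"deleted {deleted} old GFS row(s)")
```

Logging the returned count, and letting any connection error propagate instead of swallowing it, is the design choice that matters here: a silent failure is what allows a table to grow unnoticed.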

administrator · 22 July 2024 19:24

I’m on the downhill stretch now! The three servers have synced…it was a lot faster without a 210GB table to deal with! The GFS and ECMWF data download scripts are running and new data is being added to what was a new and empty forecast table. I think it’s about 30-40% complete now so hopefully there will be new data for you to download in the next 30-40 minutes.

administrator · 22 July 2024 19:36

Today’s 06z ECMWF is now available. 12z won’t be far off but the data isn’t available to download just yet.

administrator · 22 July 2024 19:40

GFS 12z data for today is now available.

I think things are back to normal now. Time to find out why the old data isn’t being deleted.

administrator · 22 July 2024 19:52

I’ve also fixed the reason why old data wasn’t being deleted. It was actually only ECMWF data that wasn’t being cleared out, but the problem had been going on for between 3 and 4 weeks so that was a lot of extra data. The cause…

hosts.docker.internal is not the same as host.docker.internal.

The second of those is correct, but I’d mistakenly added an extra ‘s’ to the first one. That’s the internal hostname that the database sits on, so with the typo the cleanup script wasn’t able to connect to the database.
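One way to catch a typo like this sooner is to resolve the hostname before doing any work and fail loudly if it doesn’t resolve. The sketch below is an assumption about how such a pre-flight check might look, not part of the existing scripts; the function name and the injectable `resolver` argument are illustrative.

```python
import socket

def resolve_or_fail(hostname, resolver=socket.getaddrinfo):
    """Return `hostname` if it resolves, otherwise raise RuntimeError."""
    try:
        resolver(hostname, None)
    except socket.gaierror as exc:
        raise RuntimeError(
            f"cannot resolve database host {hostname!r}: {exc}"
        ) from exc
    return hostname
```

Calling `resolve_or_fail("host.docker.internal")` at the top of a cleanup script would turn a silent connection failure into an immediate, visible error that a scheduler or monitoring job can report.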

Before I go away again I’ll also make sure I’ve got the required SSH certificate on any devices I’m taking with me. Having said that, I wouldn’t have been able to carry out this fix on my phone and even if I had known sooner I wouldn’t have been able to spend the time required to resolve this issue until I got home from our family gathering. I only see quite a few of the people there once a year.