I’m still investigating a variety of, probably linked, issues.
The forum may bounce a time or two as I try to get things back up and running properly again.
I think things are slowly returning to normal. I’ve found some weird things that I don’t understand, e.g.
I don’t think I’ve been hacked, because the servers didn’t seem to be in a state that would be useful to a hacker. They were all checked as being fully patched about an hour or so before these problems started.
I’m working my way through all systems to see what is/isn’t working. I may not finish that before my brain decides that sleep is necessary, but tomorrow is another day (says Zebedee for those who remember!)
Getting 403 forbidden for model data downloads.
Boing! Time for bed. . .
I’ve gone as far as I can tonight. If I try to continue I’ll make mistakes because I’m sleepy. GFS and ECMWF data are both broken in ways that I don’t currently understand.
I’ll be back to this ASAP tomorrow morning.
I’ve started my recovery work again.
Next job is to completely rebuild the distributed storage because it’s not working properly. Luckily the forum lives in its own disk space so that can stay up, but if you access any other weather-watch.com service then you’ll get errors.
I’ll give updates as I make progress.
Thanks for the update Chris. I’ve switched to Bohler data and no ECMWF for now and will only switch back once you have completed your work.
Wanted to say how much we all appreciate all your efforts on this.
Stuart
It’s slow going but I’m making progress. I’ve taken this opportunity to upgrade the distributed data storage software from v15 to v18 which means it’s back to a supported version. I tested with v17 and the problem I was having with data not being replicated was fixed so it should still be OK once I get to v18.
The next steps after the upgrade completes are to start moving files back onto the data storage and then start up the various systems one at a time to check that they’re working OK.
Thanks for all your hard work Chris. I’m just letting my forecast run with errors until you have fixed it.
Thanks once again Chris
Thanks for all the HARD work Chris!
MikeyM
I’m moving data back and starting containers up again. Early test cases have worked OK and I’m just running test GFS/ECMWF downloads which seem to be working OK.
Thanks Chris ! works well again !!
It doesn’t work for me yet so I’ll just wait.
Anyway, many thanks for your efforts, Chris!
Sorry this is taking so long. Some systems are coming back up without problems and others are still broken. I think it’s being caused by a network issue which is preventing containers from accessing the database. I don’t understand why because it was working fine until yesterday morning and I didn’t make any changes because I was asleep when it went wrong!
I’m still working at this…hopefully once I find the magic fix everything will spring back into action!
Time to eat and relax my brain for a while.
I’m now pretty sure the remaining issues relate to a network issue. I’ve managed to get a response from the web server and PHP inside a couple of the containers so the only thing left that they need to do is access the database. I think that the containers that work don’t use the database and those that are broken do use it.
It’s not easy to diagnose because there are about 80 different network sub-nets involved in the environment (40 IPv4 and 40 IPv6). I think there’s a clash of IPv4 sub-nets that’s blocking access to the database server. It’s difficult to debug because I can’t get a view from inside a container of which network routes it can see!
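A suspected sub-net clash like this can be checked offline, without getting inside a container, by comparing the configured ranges pairwise. The following is only a minimal sketch using Python's standard `ipaddress` module. The sub-nets in `example` are made up for illustration and are not the real weather-watch.com ranges.

```python
from ipaddress import ip_network
from itertools import combinations


def find_overlaps(subnets):
    """Return pairs of same-family sub-nets whose address ranges overlap."""
    nets = [ip_network(s, strict=False) for s in subnets]
    clashes = []
    for a, b in combinations(nets, 2):
        # Only compare IPv4 with IPv4 and IPv6 with IPv6.
        if a.version == b.version and a.overlaps(b):
            clashes.append((str(a), str(b)))
    return clashes


# Hypothetical sub-nets, for illustration only.
example = [
    "10.0.0.0/16",
    "10.0.42.0/24",    # sits inside 10.0.0.0/16, so it clashes
    "192.168.1.0/24",
    "fd00:1::/64",
    "fd00:2::/64",
]
print(find_overlaps(example))
```

With around 40 sub-nets per address family that is fewer than 1,000 comparisons per family, so a brute-force pairwise check is fine.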
I’m thoroughly confused now.
It doesn’t seem to be a network issue after all. I’ve started to debug some code (with difficulty in a container) and some test PHP code can connect to the database but cannot then make a query against it. A lot (probably all) of the systems that are no longer working will have similar code in them.
So it seems that something has changed in the way that MariaDB or PHP/Perl handle database calls. I’ve searched and can’t find any evidence to support this though.
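The "connects but cannot query" symptom can be pinned down with a small probe that reports which stage fails. The original test script was PHP and is not shown in the thread, so this is a hypothetical Python sketch of the same idea. The `probe` function is invented for illustration, and `sqlite3` only stands in for a real MariaDB driver so that the example runs anywhere.

```python
import sqlite3


def probe(connect, query="SELECT 1"):
    """Report which stage fails: 'connect', 'query', or None if both work.

    `connect` is any zero-argument callable that returns a DB-API 2.0
    connection, so a real MariaDB driver could be swapped in.
    """
    try:
        conn = connect()
    except Exception as exc:
        return "connect", repr(exc)
    try:
        cur = conn.cursor()
        cur.execute(query)
        cur.fetchall()
    except Exception as exc:
        return "query", repr(exc)
    finally:
        conn.close()
    return None, "ok"


# sqlite3 is a stand-in here, not the database used on the real servers.
print(probe(lambda: sqlite3.connect(":memory:")))
```

Running the same probe against two server versions would show whether the connect stage or the query stage is the one that differs.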
Zebedee: Time for bed.
Florence: Already?
Zebedee: . . . you’ve had enough magic for one day. (Boing)
I need to be up early tomorrow for a trip to the airport, so it’s time for bed again but I’ve made significant progress.
I’m now sure that the issue is a database client to server version incompatibility. I’m not sure which it is yet, but I created a temporary server with MariaDB 10.5 installed on it and copied a database to it. The test PHP script I was using worked first time! It failed every time with MariaDB 11.6 server.
My plan when I get home is to:
(3) is a major job
a) dump the databases to SQL
b) uninstall the server software
c) install an older version of the server software
d) configure the server software
e) reload all the databases from SQL
f) test all the systems
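Steps (a) and (e) above could be scripted roughly as follows. This is a sketch under assumptions, not the procedure actually used: it assumes the `mariadb-dump` and `mariadb` client programs are on the path and pick up credentials from an option file, and the database name and dump directory in the usage comment are hypothetical. Steps (b) to (d) depend on the operating system's package manager and are left out.

```python
import subprocess
from pathlib import Path


def dump_command(database, out_dir):
    """Build the command for step (a): dump one database to an SQL file."""
    out_file = Path(out_dir) / f"{database}.sql"
    cmd = [
        "mariadb-dump",
        "--single-transaction",        # consistent snapshot for InnoDB tables
        "--routines",                  # include stored procedures and functions
        f"--result-file={out_file}",
        "--databases", database,       # also writes CREATE DATABASE / USE
    ]
    return cmd, out_file


def dump_database(database, out_dir):
    """Run step (a) for one database and return the path of the dump."""
    cmd, out_file = dump_command(database, out_dir)
    subprocess.run(cmd, check=True)
    return out_file


def reload_database(sql_file):
    """Run step (e): feed a dump back into the (older) server."""
    with open(sql_file, "rb") as fh:
        subprocess.run(["mariadb"], stdin=fh, check=True)


# Example usage with a hypothetical database name and directory:
#   dump = dump_database("forum", "/var/backups/sql")
#   ... steps (b) to (d) happen here ...
#   reload_database(dump)
```

Dumping each database to its own file means a failed reload in step (e) can be retried for that one database, without repeating the whole job.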
Still getting errors for Wxsim at the moment.
Has not run correctly for 2 runs now.
When it next runs I will post the error message, as it can’t download from Chris’s servers.
Probably best posted in the WxSim category. It will get lost in other stuff here.