Weather-watch v3 configuration is underway!

It’s quite clear that the current infrastructure (what I’m calling Weather-watch v2) has some problems, e.g. WxSim data and status files are not always available. The environment has a distributed/replicating file system, i.e. data is stored in three places and is accessible from all three of the servers in real time. This software is causing me problems, the current one being that some files on one server aren’t being kept in sync. I’m trying to resolve that, but…

In the last hour I’ve also started a significant project to build a new environment to move everything to (I’ll call this Weather-watch v3).

Whilst building a new environment is a fair chunk of work, it will be much easier than my last migration for a few reasons:

  1. I already have scripts to do a lot of the build process. For example, it’s taken less than an hour to commission three new servers, install Linux and get my base build of software installed onto all three.

  2. As I’ve now containerised all of my applications they are much easier to migrate to the new environment. Technically I could connect the three new servers to the current three and move containers across. However, that would take some messy networking using vSwitches and getting sub-nets correct. That might break the current system so I’m probably just going to create new services on the new servers and copy the data across.

  3. The network connection from Cloudflare to the servers actually makes it very easy to move a single application at a time. In the past that required DNS changes, but now I can switch between current and new for a single app within about 30 seconds.

I’ve been researching replicating file systems in the background since I started to have problems with the current one and I’m happy that I’ve found what seems to be a better solution. For me it’s open source (free) because my environment is fairly small compared to the terabyte, petabyte and even exabyte solutions that their ‘big’ customers have. If it can work in exabyte sized environments then I’m sure it can handle my sub-terabyte world!

Onwards and upwards as someone once said!

Thanks for all your hard work!!!

MikeyM

I’m not ignoring the growing number of WxSim data errors being reported. There is a problem with the file storage on the current environment. I’ve chosen not to try to fix it because I’ve had a quick try and failed. I decided that my time was better spent setting up a new environment to migrate to.

I’ve wasted a lot of time today trying to install my chosen distributed file system…and failing, only latterly finding a small, but very important, detail in a document “on display in the bottom of a locked filing cabinet stuck in a disused lavatory with a sign on the door saying ‘Beware of the Leopard’.” The detail is that the version I need to use (the open source/free version) will only run on Kubernetes, and I’m using Docker Swarm. So that solution got thrown away.

I then went back to look at my list of alternative solutions again and came back to ‘ceph’. I know of it from a previous version of the W-W environment. It was available but I didn’t use it. I’d discounted it because some documentation said that it needed a minimum of 4 servers to run, and I only have 3 (and at the time I last saw it I only had two servers). Well, as above, don’t believe everything you read in the documentation. I found a recipe which, with some tweaking (it was for a slightly older version of ceph), runs happily on 3 servers. I’m happy running ceph…it’s a well known name and is (or has been) used by some big players in the tech world - Nvidia, AT&T, Google and Yahoo to name a few. If it’s good enough for them, it’s good enough for me!

So, I’m now at the stage where I have:

  • 3 Linux servers configured, each with 4 CPUs/8 threads, 64GB RAM and 2*500GB NVMe SSD drives
  • Clustered Galera/MariaDB database installed and tested
  • ceph distributed file storage installed and tested
  • Docker Swarm installed
  • Cloudflare tunnels connected

So I’m at the point where I can start to test some applications from the current environment in the new world. I’ll be starting with the small (personal) stuff that you won’t notice, but I’m hoping I’ll prove very quickly that my newly built environment is fully compatible with the current environment. Once I get some confidence about that then I can speed up the migration.

Not tonight though. I’ve already had a long day. I’ll be better served by getting to bed so that I can start afresh in the morning.

Finally, I’m sorry about the recent problems and outages. I really hadn’t expected things to go downhill like they did. Hopefully third time around things will be different!

I’m making good progress. I’ve successfully migrated 3 services across to the new servers and have proved that the environment is working as I had hoped. The database and new distributed file service are working well.

I’m just starting to look at the WxSim related services and have the test version of ECMWF download/processing running as I type, so I’ll know in a few minutes if that’s worked correctly. If it has then it shouldn’t take long to test GFS and then set the production ECMWF/GFS services up. I can set them up to run in parallel initially, so you’ll still be using the old/broken version to begin with. After a few successful cycles I’ll be able to switch everyone over to the production services very quickly and hopefully the problems of the last week will be behind us!

More good progress this afternoon. Firstly, I’ve successfully tested my development versions of the ECMWF and GFS download/processing scripts.

That allowed me to move on to getting the eight parts of the production ECMWF/GFS system running on the new servers. The eight parts include the various web sites associated with data downloads, displaying charts and managing the system. These have all worked successfully when initiated manually. I’ve no reason to suspect that they will behave any differently when scheduled to run automatically because the mechanism is pretty much identical to the manual process.

I want to let both ECMWF and GFS run through a full cycle on their own (today’s 12z run) and if that works OK then I’ll switch over to the new system this evening. The new system is correctly displaying the status files which is the main problem with the current system, so hopefully WxSim data downloading will stabilise pretty soon.

I have to admit that migrating to a containerised environment hasn’t been trivial and if I had much hair to lose I’d probably have lost more over the last few months. Having said that, the speed at which I’ve been able to move the containers to a new server environment is amazing. In the past my server migrations have taken 1 to 2 months (and sometimes longer) so the fact that I’ve got a large part of the migration complete in just 2 days is great. I know I’ve got more to do, but it’s a relatively easy exercise…just repeating the same process multiple times!

The WxSim processing has been moved so that’s a big worry off my mind!

I’ll have to move the forum soon, but I’ve got a few precursor tasks to complete before I’m ready to do that. As last time, I’ll probably need an outage of a couple of hours to let me do all I need. I’ll confirm when that will be nearer the time.

Please note - there may still be short outages as I move other services to the new servers. Some may require minor adjustments to the server configuration which necessitate a reboot of one, or all, servers. Rebooting all the servers requires a little extra work to make sure that any clustered services, e.g. the database, are back up and running.

I’ll try to keep any outages as short as possible.
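
For anyone curious what "making sure the database is back up" can look like after a full reboot, here is a minimal sketch in Python. It is not the actual check used here. It assumes the status values have already been read from each node with SHOW GLOBAL STATUS LIKE 'wsrep_%' and put into a dictionary; the function name and the expected node count of 3 are illustrative assumptions, while the wsrep_* names are standard Galera status variables.

```python
# Hypothetical helper: judge Galera node health from its wsrep status variables.
# The dictionary is assumed to come from SHOW GLOBAL STATUS LIKE 'wsrep_%'.

EXPECTED_NODES = 3  # assumption: one database node on each of the three servers


def galera_problems(status, expected_nodes=EXPECTED_NODES):
    """Return a list of problems found; an empty list means the node looks healthy."""
    problems = []
    if status.get("wsrep_cluster_status") != "Primary":
        problems.append("node is not part of the primary component")
    if status.get("wsrep_ready") != "ON":
        problems.append("node is not ready to accept queries")
    if status.get("wsrep_local_state_comment") != "Synced":
        problems.append("node has not finished syncing")
    try:
        size = int(status.get("wsrep_cluster_size", 0))
    except ValueError:
        size = 0  # treat an unparsable value as "no nodes seen"
    if size != expected_nodes:
        problems.append(f"cluster size is {size}, expected {expected_nodes}")
    return problems


if __name__ == "__main__":
    # Made-up sample values for illustration only.
    sample = {
        "wsrep_cluster_status": "Primary",
        "wsrep_ready": "ON",
        "wsrep_local_state_comment": "Synced",
        "wsrep_cluster_size": "3",
    }
    print(galera_problems(sample) or "cluster looks healthy")
```

Running a check like this against every node after a reboot, and only bringing the applications back once each node returns an empty list, is one way to avoid pointing services at a half-formed cluster.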

I’m assuming that the lack of comments means that everything is running nicely again and downloads are succeeding.

I’m still working away in the background to complete the transfer. I’ve completed 28 components so far with 8 more to do. A component could be a stack (multiple related containers) or a service (a single container). The forum (4 related containers, but not a stack :rofl:) is a quarter complete. It will probably be the last thing to move, but that might be as soon as tomorrow, or perhaps Wednesday.

I’ve also started to add metrics and alerting support. I’ll be able to send alerts, e.g. GFS or ECMWF data not updated to schedule, to Telegram so I’ll get much quicker notification of problems once I’ve completed setting the alerts up.
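
As a rough illustration of the "data not updated to schedule" idea, the sketch below flags a status file as stale if it has not changed for longer than a threshold, then posts a message through the Telegram Bot API's sendMessage method. It is only a sketch under stated assumptions, not the alerting stack described above: the 7 hour threshold, the function names, and the file path, bot token and chat ID passed in are all hypothetical.

```python
import os
import time
import urllib.parse
import urllib.request

# Assumption: model data arrives roughly every 6 hours, so allow 7 before alerting.
MAX_AGE_SECONDS = 7 * 3600


def is_stale(last_updated, now, max_age=MAX_AGE_SECONDS):
    """True if the last update is older than max_age seconds."""
    return (now - last_updated) > max_age


def build_alert(name, last_updated, now):
    """Build the alert text for a data source that has gone quiet."""
    hours = (now - last_updated) / 3600
    return f"{name} data not updated for {hours:.1f} hours"


def send_telegram(token, chat_id, text):
    """Send a message via the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    data = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode()
    with urllib.request.urlopen(url, data=data, timeout=10) as resp:
        return resp.status == 200


def check_file(name, path, token, chat_id):
    """Alert if the file at path (a hypothetical status file) is stale."""
    now = time.time()
    last_updated = os.path.getmtime(path)
    if is_stale(last_updated, now):
        return send_telegram(token, chat_id, build_alert(name, last_updated, now))
    return False
```

Scheduling something like check_file("GFS", ...) and check_file("ECMWF", ...) every few minutes would give a notification soon after a run is missed, rather than waiting for someone to report it on the forum.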

Node metrics…

Web site monitoring/metrics…

Distributed disk space metrics/management…

Container management…

Very nice looking setup Chris :smiley:

Everything apart from the forum has been moved now. It’s getting a little late in the day to start doing it now, so I’m looking to do it tomorrow morning (UK time).

It’s going to take about an hour to do a backup, copy the backup to the new server, restore it and test that everything looks OK. So if you can’t access the forum tomorrow that will almost certainly be the reason.