Things should be getting back to normal. The 12z data has downloaded and is being processed as I type, so it will probably be ready by the time I post this message.

I’ve resolved more of the disk space issues, but still have some more work to do. I initially missed something significant whilst trying to debug the problem. The servers have some shared/replicated storage: data written to one server is replicated to the others, so the data that needs to be shared is available ‘everywhere’. I could see that the shared storage was getting full and thought that was the problem, so I concentrated on trying to clear some space. I thought I’d done that, but I was still getting some weird ‘read-only’ errors.

After a lot of digging I eventually noticed that the non-shared disk space on server 3 was full. This was preventing data being written to the shared disk on server 1. I did some tidying up on all servers which gave enough space to start allowing data to be written again. Partial victory was mine!
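For anyone wanting to catch this sort of thing earlier, here is a minimal sketch of a check that looks at every filesystem of interest, not just the shared one. It is only an illustration: the function name `full_mounts`, the 90% threshold and the mount points in the example are all assumptions, not what actually runs on the servers.

```python
import shutil


def full_mounts(paths, threshold=0.90):
    """Return (path, fraction_used) for each path whose filesystem
    is at or above the given usage threshold."""
    flagged = []
    for path in paths:
        usage = shutil.disk_usage(path)
        if usage.total == 0:
            # Some pseudo-filesystems report a size of zero; skip them.
            continue
        used_fraction = usage.used / usage.total
        if used_fraction >= threshold:
            flagged.append((path, used_fraction))
    return flagged


if __name__ == "__main__":
    # Hypothetical mount points: check local AND shared storage,
    # since a full local disk can block writes elsewhere.
    for path, fraction in full_mounts(["/", "/tmp"]):
        print(f"{path} is {fraction:.0%} full")
```

Run on each server in turn, something like this would have flagged the full non-shared disk on server 3 even while attention was on the shared storage.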

However, I then noticed that a config file that should contain data was empty. The system had tried to write the file while disk writes were blocked. Luckily I have implemented a hierarchical backup system…

  1. The important data is replicated across three servers (shared storage described above), so the data still survives if any two servers die. However, this doesn’t protect the data from bad writes (like the one here). This is copy 1.
  2. The important data is copied a few times a day to a separate disk attached to one of the servers. This is copy 2.
  3. Once a day ‘copy 2’ is synchronised to my home NAS, which implements snapshotting, effectively giving me another 14 copies of the data locally (at varying snapshot intervals). This is copy 3.

In this case, copies 1 and 2 were corrupt, but copy 3 had the uncorrupted data which was quickly copied back to the servers. Partial victory was again mine!
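The idea of walking down the backup hierarchy until a good copy turns up can be sketched roughly as below. This is an illustration under stated assumptions, not the actual restore procedure: the function name `first_usable_copy` and the example paths are made up, and "not empty" is only a crude stand-in for "not corrupt" (it happens to match the empty config file described above).

```python
import os


def first_usable_copy(candidates):
    """Return the first path in `candidates` that is a regular,
    non-empty file, or None if no candidate qualifies.

    `candidates` should be ordered from most preferred (live copy)
    to least preferred (oldest backup)."""
    for path in candidates:
        if os.path.isfile(path) and os.path.getsize(path) > 0:
            return path
    return None


if __name__ == "__main__":
    # Hypothetical locations for copies 1, 2 and 3 of one config file.
    copies = [
        "/shared/config/site.conf",
        "/backup-disk/config/site.conf",
        "/nas-snapshots/latest/config/site.conf",
    ]
    print(first_usable_copy(copies))
```

A real check would want something stronger than file size, for example comparing against a stored checksum, but the ordering logic stays the same.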

Then it was time to download the GFS 12z data…which failed with a weird database permissions error. A quick restart of the database and load balancer on all three servers seems to have fixed this. So for now, complete victory is mine :wink:

I’ve typed long enough that the GFS 12z data is now available.

Chris, as of now (22:28 on the 16th) I am unable to access either the GFS or ECMWF report files; Cloudflare gave me a timeout on your host! Error code 524 on both.


I suspect I’ve forgotten to restart one of the containers. I’ll take a look before I get in bed.

Should be working again now…at least it’s working OK for me.

I’ve made quite a few changes in the last few hours and I think I must have forgotten to restart one of the affected containers that provides the WxSim web service. Things came back to life after I restarted it.

I can’t check everything until I get back on my laptop, but a quick web check this morning suggests that the overnight runs worked well. I’m hopeful that my fixes have stabilised things.

All looking good today.


Thanks for all the work!

Had some issues/errors last night with wxsim-lite, but I’m away from home and have to check on a small laptop.

Will see how the upcoming runs go.

wxsim-lite does not use McMahon data, only Bohler, so it is a different issue.

