I think I’ve worked out why the WxSim database keeps growing huge and then crashing.
Every 6 hours the GFS and ECMWF scripts each create about 65,000 forecast records. The previous run’s records have to stay in place at the same time, because they’re still being read whilst the new data is created. So at any given time the maximum should be about 260,000 records (65,000 × 2 models × 2 runs).
After each new run finishes, the old data is deleted by another script. Whilst the script is pretty simple (delete 65,000 records), it puts quite a load on the database: the deletion has to find and lock all of the records to be deleted, then commit the transaction that actually removes the records and releases the locks.
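For anyone curious what that looks like, here’s a minimal sketch of the old "delete everything in one go" pattern. The post doesn’t describe the real scripts, schema or database engine, so the `forecast` table, its `model` and `run_id` columns, and the use of Python with SQLite are all just assumptions for illustration.

```python
import sqlite3

# Hypothetical schema: the real WxSim tables aren't described in the post,
# so 'forecast' with 'model' and 'run_id' columns is purely illustrative.
conn = sqlite3.connect("wxsim.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS forecast (model TEXT, run_id INTEGER, data TEXT)"
)

def delete_old_run(conn, model, current_run_id):
    """Old approach: remove every record from earlier runs in one statement.

    The single DELETE has to find (and, on a server database, lock) all
    ~65,000 rows before the transaction commits, which is where the load
    and the long lock times come from.
    """
    with conn:  # one big transaction covering every old row
        conn.execute(
            "DELETE FROM forecast WHERE model = ? AND run_id < ?",
            (model, current_run_id),
        )
```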
Locking 65,000 records for deletion can be done; it’s been working for years. However, it can be slow, so it sometimes runs longer than normal. Depending on run timings there’s also the possibility that the GFS and ECMWF deletion scripts run at the same time, which makes the process slower still.
Occasionally, perhaps if a run is late, or both the GFS and ECMWF runs are late, the cleanup scripts might not have finished (or might not even have started) before the next downloads begin. At that point there could be around 400,000 records in total, with 270,000 of them waiting to be deleted. That’s an even bigger deletion job which takes longer still. It can reach the point where one deletion job doesn’t finish before the next one starts, and then record locking issues kill the jobs or, at best, slow them down significantly.
I caught the database with over 1,000,000 records in it this evening; just counting the records took nearly 3 minutes, and I could see the deletion jobs failing due to locked records. Luckily 1,000,000 records didn’t consume all the disk space, so things are still working.
So what’s the fix?
Currently I have one job for GFS and one for ECMWF, each running every 6 hours to delete all of the old records for its forecast type. I’ve now changed the way this works so the jobs don’t try to delete all the records at once. They now run every minute but delete only 1,000 old records per run. Finding and locking 1,000 records is much quicker, so each run completes well within the minute. Deleting 1,000 records a minute means that, normally, all the old records are gone in just over an hour, but it also means that if there’s a huge number of old records for some reason, like now, the jobs will just keep running every minute until the backlog has been cleared.
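Continuing the earlier sketch, this is roughly what the batched version looks like. Again, the table and column names are assumptions, and the real scheduled scripts almost certainly aren’t written exactly like this; the point is just the "delete a small batch per run" pattern.

```python
def delete_old_run_batched(conn, model, current_run_id, batch_size=1000):
    """New approach: delete at most `batch_size` old rows per call.

    Scheduled once a minute, each call finds and locks only a small set of
    rows, so it finishes well within the minute. If a backlog builds up,
    the job simply keeps chipping away at it, 1,000 rows at a time.
    """
    with conn:  # short transaction, few locks held
        cur = conn.execute(
            "DELETE FROM forecast WHERE rowid IN ("
            "  SELECT rowid FROM forecast"
            "  WHERE model = ? AND run_id < ?"
            "  LIMIT ?)",
            (model, current_run_id, batch_size),
        )
    return cur.rowcount  # 0 means the backlog for this model is cleared
```

The trade-off is simply latency for load: old records hang around a little longer, but no single run ever has to lock more than a thousand rows, so the cleanup can never snowball into the multi-hundred-thousand-row deletes that were causing the lock failures.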
The changed scripts are running now and have deleted over 30,000 records so far. There are still a lot more records to delete, but the total is actually falling, which it wasn’t with the previous method.
Hopefully this is the end of the database crashes!