My quest to generate better stats for the availability of ECMWF (and GFS) data continues. I’m now able to generate timings for the following key stages in the process:
Time between the run ‘start’ time, e.g. 2023-11-20 00Z, and the data being flagged as being available for download
The time to download the data
The time to process the data
I’ve got 3.5 days of data to look at so far. Not much, but it’s giving me some things to think about. What I can see so far (assuming my stats code isn’t playing tricks on me) is:
a) The ECMWF data for the shorter 06Z/18Z runs and longer 00Z/12Z runs seems to become available at exactly the same delay after the run ‘start time’ (to the nearest 5 minutes - which is how often my script runs).
b) Once the data is downloaded it takes my data processing script about 1.4 times longer to process the longer runs. That’s reasonable…there’s a lot more data to crunch. It should actually be more like 1.6 times longer based on the extra data volume but I can understand why the longer runs don’t follow a linear multiplier for run time, i.e. there are two main time components - data crunching and database updates. There’s a lot more data to crunch but the same number of database updates (although each update is a bit larger for the longer runs).
c) The thing that’s puzzling me is that the download times are weird. The shorter runs download 30 files in about 50 seconds, but the longer runs are taking about 15 minutes to download 48 files. If I look at the file timestamps on disk the files for the longer run are all written within a couple of minutes, so something seems to be causing the longer-run downloads to re-start, probably 3 times, before the full set of files is downloaded.
It’s getting late here so I’m not going to try to figure out where the extra 15 minutes is going tonight. At least I’ve got something to investigate now, although it could be slow going because I can only run debug code twice per day when the 00Z/12Z runs are due to arrive.
I said I wasn’t going to investigate tonight but I had an idea and I think I know what’s going wrong. I have a script that checks for the data being fully available and I think that’s got an error in it when checking for the longer runs. So it tells the download script that the data is fully available when it’s not all there. The download script tries to download all the files, but because some aren’t there yet the download fails. It re-runs every 5 minutes and after 15 minutes all the files are available and the processing continues. I’ve checked the source files on the ECMWF site and the timestamps seem to agree with the 00/12Z runs being ready about 15 minutes after the 06/18Z run data.
I’ll test this theory out tomorrow…I can really only test this once per day…the 00Z run arrives too early…whilst I’ll be awake I have other things I need to do before 9am so I can’t spend time on the laptop. That just leaves the 12Z run to look at, so my first chance to check my theory will be tomorrow evening.
If I’m right about this then it’s not going to change the ECMWF data availability times for WxSimate. The 00/12Z run data just becomes available later so if I fix the script it will just report correct times to the stats database. It will save me some download capacity, but the server package I’m on doesn’t count download data so that doesn’t really matter much - although it will reduce the load a little on the ECMWF data servers.
Having done some sleep analysis on this I’m now sure that my ‘latest data check’ script isn’t working correctly.
A short ECMWF run for WxSim is 90 hours and a long run 144 hours. ECMWF scripts publish each 3-hour forecast file individually onto their distribution server, which takes about 50 seconds per file. What I’ve not taken into account is a short period of approx 15 minutes (18 more files at about 50 seconds each), when the long run data files between 90 and 144 hours are being published.
When all files are published my script works OK, but it returns an incorrect value when 90 hours worth of files are available for a 144 hour run. I’ll only have tested when all 144 hours of files were available, mostly because the 15 minutes when there are >90 hours but <144 hours of files would usually occur at times (about 7:15am and 7:15pm) when I’m usually doing other things and not coding at my laptop.
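To illustrate the failure case described above, here is a minimal sketch of a completeness check. It is not the actual script: the function names are made up for illustration, and the assumption that files cover hours 3, 6, ... up to the run length is only chosen because it matches the 30 and 48 file counts mentioned earlier.

```python
STEP_HOURS = 3  # one file per 3-hour forecast step


def expected_hours(run_hour):
    """Forecast hours a complete run should contain (hypothetical helper)."""
    # 00Z/12Z are the long 144-hour runs; 06Z/18Z are the short 90-hour runs.
    last_hour = 144 if run_hour in (0, 12) else 90
    # Assumption: hours 3, 6, ..., last_hour, giving 30 or 48 files.
    return list(range(STEP_HOURS, last_hour + 1, STEP_HOURS))


def run_is_complete(run_hour, published_hours):
    """True only when every expected forecast hour has been published."""
    missing = set(expected_hours(run_hour)) - set(published_hours)
    return not missing


# The awkward 15-minute window: only the first 90 hours are published.
first_90 = range(3, 91, 3)
print(run_is_complete(6, first_90))   # True: a short run is complete
print(run_is_complete(12, first_90))  # False: a long run is still missing files
```

The point of comparing against the full expected set, rather than checking a single file or a count that suits the short runs, is that the partially published long run is rejected.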
This isn’t actually having very much effect at the moment though. Until I made changes a few days ago the incorrect check causing a download failure was one of the conditions that would cause the processing to be delayed for 30 minutes. However, now this failure only pauses processing for 5 minutes, so it only delays the downloads by a few minutes.
I think I know how to fix the problem, but I won’t be able to fully test the modified version until after the 06Z data becomes available in a few hours time.
Grrr…the changed script didn’t work, but I understand why and have fixed it now. The script is written in Python and I don’t use Python very much. I was trying to get a substring out of a longer one so used…
variable = "0123456789"
substr = variable[3:4]
…and intended to get the 4th and 5th characters of the string (it’s numbered from zero). Unfortunately, in Python the second value is actually the index of the character after the last one you want (the end of a slice is exclusive), so [3:4] only gets one character - the fourth. I’ve now changed it to be [3:5] and hopefully things will work properly for the 00Z run tomorrow morning.
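For anyone else who doesn’t use Python much, a quick demonstration of the slice behaviour. The `start` and `length` names are just for illustration; writing the end as start + length is one way to avoid the off-by-one.

```python
variable = "0123456789"

# The end index is exclusive, so [3:4] returns only the character at index 3.
one_char = variable[3:4]
print(one_char)   # 3

# To get the 4th and 5th characters (indices 3 and 4), the end must be 5.
two_chars = variable[3:5]
print(two_chars)  # 34

# Expressing the end as start + length makes the intent explicit.
start, length = 3, 2
also_two = variable[start:start + length]
print(also_two)   # 34
```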
This has been a PSA on behalf of the Python Programmers Association
I managed to grab a quick 2 minute diagnostic session on the laptop during the crucial 15 minute period when the script was working incorrectly. It looks to be working correctly now. The stats will confirm that later when I get longer to look at the data.
The fixed “is the ECMWF data ready to download?” script is now in place on PROD and was used for the 06Z (short) and 12Z (long) runs. The 06Z run finished by 13:32 (7h32m after 06Z) and the 12Z run finished by 19:53 (7h53m after 12Z).
I’m now getting good and understandable statistics, so over the next few weeks I’ll be able to get a much better idea of a realistic time when, in say 95% of cases, the data should be available for WxSimate to download.
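As a rough idea of what the “95% of cases” figure means, here is a sketch using the nearest-rank percentile method. This is not the stats code itself, and the delay values are made-up numbers purely for illustration, not real measurements.

```python
import math


def percentile_nearest_rank(values, pct):
    """Smallest value such that at least pct% of the values are <= it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical delays in minutes between run start and data being ready.
delays = [452, 450, 455, 473, 470, 451, 449, 500, 453, 454]

p95 = percentile_nearest_rank(delays, 95)
p50 = percentile_nearest_rank(delays, 50)
print(p95, p50)
```

With only a handful of samples the 95th percentile is effectively the worst case seen, which is why a few weeks of data are needed before the number is meaningful.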
Thanks, Chris. I’m trying to follow the topic and understand all the issues you covered and described.
I remember there was a post that suggested the optimum times when WXSIM(ate) should download fresh data and make a forecast. This is probably linked to your analysis, right? And if my memory is OK, the GFS and ECMWF data only overlap a little (they are not available “at the same time”).
The stats I’m working on should hopefully give a better and more up-to-date view of the best times to download GFS and ECMWF data. They work in near real-time rather than me doing manual data dumps and processing the data in a spreadsheet. I’m hoping to create a web page with up-to-date run status and stats.
The GFS and ECMWF data don’t really overlap at all, at least not in a timely manner. Typically, GFS data is available before the next forecast cycle starts, e.g. 00Z is available by approx 04:30. ECMWF isn’t available until after the next forecast cycle starts, e.g. 00Z is available by approx 08:00 when the 06Z forecast is already being calculated. That means there really isn’t a good time to grab both sets of data for WxSim.
For example, if you run WxSim when new GFS data is just available, e.g. at 04:30, you’ll get 00Z GFS data mixed with 18Z ECMWF data. If you run WxSim when new ECMWF data is just available, e.g. 08:00, you’ll get 00Z ECMWF mixed with 00Z GFS, but both data sets will be 8 hours old.
I’ve been thinking about this and perhaps we need to get some advice from Tom about how mixing works.
Does the mix take the latest available data from ECMWF and use it to replace the GFS data? If so, running at a time when ECMWF is most current would be best when mixing.
How exactly does the mixing work? Knowing that might help in determining the best way to run forecasts.
I don’t know exactly how it works, but I do know that ECMWF isn’t a direct replacement for GFS. ECMWF has far fewer data fields available than GFS so you couldn’t run WxSim on ECMWF alone. I assume that if you set the mix to 100% ECMWF then it just uses the fields from ECMWF that match GFS fields and uses GFS data for all the other data that’s missing from ECMWF.
I suspect beyond that it’s fairly simple. For the example above of running at 04:30 you get 18Z ECMWF mixing with 00Z GFS. If you take the 2m Temp from hour 12 of ECMWF (06:00) you can mix that with the 2m Temp from hour 6 of GFS (also 06:00). The difference is that the ECMWF forecast is older, so it is possibly not as relevant as the more recent GFS forecast.
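The time alignment in that example can be checked with a few lines of Python. This only shows the arithmetic of matching forecast hours from two runs to the same valid time. It says nothing about how WxSim actually does the mixing, and the function name and dates are just for illustration.

```python
from datetime import datetime, timedelta, timezone


def valid_time(run_start, forecast_hour):
    """Clock time that a given forecast hour of a run refers to."""
    return run_start + timedelta(hours=forecast_hour)


# Illustrative dates: the 18Z ECMWF run and the following 00Z GFS run.
ecmwf_run = datetime(2023, 11, 19, 18, tzinfo=timezone.utc)
gfs_run = datetime(2023, 11, 20, 0, tzinfo=timezone.utc)

ecmwf_valid = valid_time(ecmwf_run, 12)  # hour 12 of the 18Z run
gfs_valid = valid_time(gfs_run, 6)       # hour 6 of the 00Z run

# Both refer to 06:00 UTC, but the ECMWF run started 6 hours earlier.
print(ecmwf_valid == gfs_valid)
print(gfs_run - ecmwf_run)
```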
It’s working at the moment, but it is still a work in progress so don’t be surprised if it changes/looks different at some point. I don’t think data will be removed but I may add info to the page.
The page does quite a lot of database access so please don’t sit on the page refreshing it every few seconds. There’s no real benefit to refreshing it more often than about every 5-10 minutes. If use of the page begins to impact the server/database then technical measures will be put in place to prevent frequent refreshes, or to limit persistent rapid refreshers to one refresh every 30 minutes or more.
The averaging periods will change. They’re set relatively short at the moment because there isn’t a lot of data. I think they’ll eventually become 2 weeks, 2 months and 1 year. That’s configurable in the scripts and I’ll adjust as time goes on.
The averages aren’t particularly accurate at the moment. They do reflect the collected data, but there have been recent outages caused by code errors and also changes in the way I record the stats data, both of which affect the averages. These are making some of the averages a little longer than they would normally be. Over time, with more data, the averages will average out (if there can be such a thing!) as the outlier data moves beyond the averaging periods or becomes diluted by a larger volume of more accurate data.
Hopefully this will be useful to you if you use McMahon GFS or ECMWF data in WxSimate. Please note, there is no equivalent of this page for the Bohler GFS data. I don’t have access to equivalent stats data from that system.
I already have some ideas of how to improve the page and will work on those alongside the current page. I’ll announce any changes as I have them ready.