Forum downtime - 7th July

I’m just getting ready to migrate the forum to the new servers. The forum will be offline for about 45 minutes from when I start, which will be within 5-10 minutes of this post.

We’re back up and running on the new servers. I still have some work to do with the email components but that won’t stop you using the forum.


Email has been re-enabled. Outbound is definitely working. I think inbound should be too, but that’s not as easy to test.

If this arrives then both outbound and inbound email are working…I know @bitsostring has already tested this, but I like to do my own tests so that I can see email logs and headers :wink:

I’ve been trying to lock down the firewall to make the servers more secure and managed to make them very secure…I locked everyone (apart from me) out. I’m still working on the problem, so things may come and go as I figure out the right combination of rules to let you in but keep the bad guys out.

This took a bit longer to figure out than it should have done. I spent some time scratching my head before finding the real cause. I shut down the old servers and the connection to the new servers died, or that’s what it looked like. There’s no connection between the old servers and the new ones, and whilst both have private network connections, the connections are (supposed to be) unaware of each other.

So, I spent some time trying to think what connection there was between old and new that I’d missed. I restarted the old servers, which came back online as expected, but the new servers were still disconnected. That’s when I started to look elsewhere and found a missing firewall rule for the new servers.


So what was the latest problem? I was getting a Cloudflare 502 Bad Gateway host error for at least the past half hour.

I was checking the forum error logs and noticed that the translator wasn’t working. I had to rebuild the Discourse app to get it functional and that takes things offline. It had probably been broken for a couple of months but I’d not noticed previously. That’s another problem that won’t be reappearing.

@bitsostring has identified some images that don’t display. I’ve corrected those, but also found some more. To use the correct terminology I believe that some posts are marked as ‘uncooked’ so I’m running the ‘bake’ process to try to tidy them up. There are a few different baking recipes (not correct terminology) so I might need to try a few to find the correct one.

Home baking hasn’t worked, so I’ve kicked off a huge commercial bakery process. Sorry about the puns.

This will probably run all night and maybe longer and the forum will probably be slow whilst it runs.

I know now why this has happened. I used a temporary domain to facilitate the migration and some images have been given a URL from that domain. The rebaking will correct those problems, but it’s just going to take some time.

To put this big bake job in context, there are two parts to it. One part is queueing background jobs to do the baking. After 30 minutes, that’s about 20% complete.

There are currently 80,000 jobs waiting in the queue to be processed and the background processor is handling about 110,000 jobs per hour. Under normal circumstances the queue is empty, or only has jobs in it for a few seconds before they’re dealt with. The queue is used for a lot of jobs, one of which is sending notifications (emails and browser) so those are likely to be delayed because they have to sit in the same queue. Notifications should run at a higher priority than the baking jobs so they shouldn’t be delayed too much.
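
For anyone who wants to play with the numbers, here is a rough back-of-envelope sketch in Python. It only uses the figures quoted above (20% queued after 30 minutes, 80,000 jobs waiting, about 110,000 jobs per hour). The function names are made up for illustration, and it assumes both rates stay constant and ignores the fact that new jobs keep arriving while the queue drains, so treat the results as ballpark figures only.

```python
def total_minutes_at_constant_rate(elapsed_minutes: float, fraction_done: float) -> float:
    """Extrapolate total duration from progress so far, assuming a constant rate."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in the range (0, 1]")
    return elapsed_minutes / fraction_done


def drain_minutes(queued_jobs: int, jobs_per_hour: float) -> float:
    """Minutes to clear the current backlog, assuming no new jobs are added."""
    if jobs_per_hour <= 0:
        raise ValueError("jobs_per_hour must be positive")
    return queued_jobs / jobs_per_hour * 60


# Figures quoted in the post; everything else is an assumption.
queueing_total = total_minutes_at_constant_rate(elapsed_minutes=30, fraction_done=0.20)
backlog_clear = drain_minutes(queued_jobs=80_000, jobs_per_hour=110_000)

print(f"Queueing phase, start to finish: about {queueing_total:.0f} minutes")
print(f"Current backlog alone: about {backlog_clear:.0f} minutes to clear")
```

In practice the queueing phase keeps adding jobs while the processor works, so the real finish time will be later than the backlog figure on its own suggests.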

1.2 million jobs later and everything should be rebaked. Hopefully there’s nothing else looking weird now!

