Route Scheduling and Outages

Hi all!

As some of you may know, I have made some changes to the route scheduling system to improve the reliability of automatic route scheduling. Over the past month, the route scheduling system failed to run to completion a handful of times. While most routes were scheduled automatically, at times up to 10% of routes were not scheduled. This is not a major problem, as the route scheduling will eventually schedule them the next day (or, if that fails, the day after), but the 2 days of buffer may not be enough for comfort, especially for long haul flights. Investigating this issue proved to be a challenge: there was no pattern to when the route scheduling failed, there were no easily interpreted error messages, and the failure was not reproducible on our test systems. The workaround required tedious manual action from users. Last week, I implemented a stopgap measure that doesn’t require destructive testing, disabling routes, or any major changes to any airline.

I am happy to announce that routes have been scheduling successfully for the past few days, and it appears that the route scheduling issue is resolved. Please note that route scheduling will still take between 2 and 3 hours.

However, some of you may have noticed that we had a couple of outages in the past few months. Both incidents were caused by data corruption on a server that handles and stores financial transactions for airlines, which caused that service to crash. With this system processing over 6 GB of data (about 12 million transactions) per hour, errors do occur. Nonetheless, data corruption is very rare, as the system that handles the financial transactions has built-in error correction and vendor-provided automatic crash recovery. Data corruption severe enough to crash the system is even rarer. Luckily, the manual crash recovery procedures and backups meant we lost no data. After further investigation, we felt that two incidents in two months is no coincidence, and after working with our cloud provider, we suspect that this may be related to hardware issues. Therefore, we will be redeploying some of our servers and systems in the near future, and this will require downtime. The issue tracker and the new forums will remain online, but everything else will be down for maintenance. We will provide notice of when this will happen.

Thank you and we sincerely appreciate your patience.

sw889432
(the guy who fixes things)

and the Airline Enterprise team
(the hard-working people who made and run this game)