Disclosure statements about previous outages that have affected the FMI Works cloud product
This article relates to the FMI Works product, when delivered as a cloud solution
FMI is committed to providing a robust and secure platform for facilities management. As part of this commitment, we are transparent about the security and availability of the FMI Works solution. This page provides information on the outages that have affected the FMI Works cloud solution.
17 November 2023, 6:40am-11:20am AEST, SSO users
Original Report
FMI Works is experiencing an issue where SSO users cannot log in.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On 17th November, customers using SSO were unable to log in to FMI for approximately 4 hours. The system did not recognize their e-mail addresses as being valid after a successful SSO authentication. The root cause of the issue was identified as insufficient visibility and testing stemming from inconsistencies in the SSO configurations across development, test and production environments.
Resolution
The system was set to recognize existing SSO customers' e-mail addresses as valid, resolving the issue. Additionally, FMI has standardized the SSO configurations across all environments.
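As an illustration of how configuration differences between environments might be surfaced before they reach production, the following is a minimal sketch, not FMI's actual tooling. The setting names (`email_claim`, `validate_email_domain`) and their values are hypothetical and chosen only for the example.

```python
def find_config_drift(configs):
    """Return {setting: {env: value}} for settings that differ between environments."""
    all_keys = set()
    for settings in configs.values():
        all_keys.update(settings)
    drift = {}
    for key in sorted(all_keys):
        # A setting missing from an environment is reported as None.
        values = {env: settings.get(key) for env, settings in configs.items()}
        if len(set(values.values())) > 1:
            drift[key] = values
    return drift


# Hypothetical SSO settings; the keys and values are illustrative only.
environments = {
    "development": {"email_claim": "email", "validate_email_domain": False},
    "test": {"email_claim": "email", "validate_email_domain": False},
    "production": {"email_claim": "upn", "validate_email_domain": True},
}

drift = find_config_drift(environments)
for setting, per_env in drift.items():
    print(setting, per_env)
```

A check like this could run as part of a release pipeline so that any setting that differs between environments is reviewed deliberately rather than discovered during an outage.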
30 August 2023, 6:30pm-31 August, 7:20am AEST, all customers
Original Report
FMI Works experienced an issue where all users were unable to log in. The outage was first reported at 05:45 AEST August 31st. Review of logs indicated that the outage started at 18:30 AEST August 30th.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On August 30 at 6:30 pm, FMI experienced an outage for all customers of up to 12 hours and 50 minutes. It was determined that our cloud infrastructure provider experienced a service outage due to a utility power surge in the Australia East region. After the power was restored, our authentication servers restarted and were not able to connect to database resources. Without database connectivity, the servers would not process any authentication requests. Additionally, availability monitoring was hosted in the same data center and failed to alert DevOps support teams in a timely fashion.
Resolution
Restarting the authentication servers re-established communication with the database servers, restoring system functionality. FMI is looking at alternative products and/or hosting regions to ensure uptime alerts are more responsive.
28 July 2023, 9:00am-3:00pm AEST, some users
Original Report
FMI Works experienced an issue where some users were unable to perform certain actions in FMI Works due to the loss of permissions.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On July 28th, approximately 5 percent of users of FMI experienced restricted access that lasted up to 6 hours. The cause was related to permission changes restricting access (no unauthorised access was ever granted). As a result, these users were unable to carry out certain actions that required these permissions during this time. The root cause was the undocumented use of magic numbers in the source code. A magic number is a bare numeric literal, such as "7", whose meaning is not apparent from the code itself: it could stand for the days in a week or the wonders of the world.
Resolution
Permission data was restored from database backups. We have updated coding guidelines to improve understanding through the use of named constants and documentation for all values that have some "magic" meaning.
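The difference between a magic number and a named constant can be shown with a short sketch. The source does not describe the actual values involved in this incident, so the permission names and bit values below are hypothetical and serve only to illustrate the coding guideline.

```python
from enum import IntFlag


class Permission(IntFlag):
    """Hypothetical permission bits; each value has a name and a documented meaning."""
    VIEW = 1     # can view records
    EDIT = 2     # can edit records
    APPROVE = 4  # can approve records


# A documented named constant instead of a bare 7 scattered through the code.
FULL_ACCESS = Permission.VIEW | Permission.EDIT | Permission.APPROVE


def can_approve(granted):
    """Return True if the granted permission set includes APPROVE."""
    return Permission.APPROVE in granted


print(int(FULL_ACCESS))
print(can_approve(FULL_ACCESS))
print(can_approve(Permission.VIEW | Permission.EDIT))
```

With the bare literal, a reader cannot tell whether `7` means "all three permissions" or something unrelated; with `FULL_ACCESS`, the intent is stated where the value is defined.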
9 January 2023, 8:00am-10:00am AEST
Original Report
FMI Works experienced an issue where some customers were presented with the error "Your connection is not private" when they attempted to access the application.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On the morning of the 9th of January, FMI experienced an outage for some customers of up to 2 hours. The outage was determined to be caused by expired SSL certificates, where the certificates failed to renew automatically. During this time, affected customers were presented with the error message "Your connection is not private".
Resolution
The certificates failed to renew due to a missing DNS entry causing the domain verification to fail. This was remediated, and certificates were renewed, resolving the issue.
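A failed automatic renewal can go unnoticed until the certificate actually expires. The following is a minimal sketch, using only the Python standard library, of an independent check that reports how many days remain on a certificate. The 14-day threshold and the function names are assumptions for illustration and do not describe FMI's monitoring.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days between `now` and a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  9 00:00:00 2023 GMT".
    """
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter field of the certificate served by host:port.

    With default verification, an already-expired certificate raises
    ssl.SSLCertVerificationError during the handshake instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


WARN_DAYS = 14  # assumed alert threshold

# Offline example with a fixed date so the result does not depend on the clock.
remaining = days_until_expiry(
    "Jan  9 00:00:00 2023 GMT",
    now=datetime(2023, 1, 1, tzinfo=timezone.utc),
)
print(remaining, remaining <= WARN_DAYS)
```

In practice, `fetch_not_after` would be called against each served hostname on a schedule, and an alert raised whenever the remaining days drop below the threshold.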
14 November 2022, 8:45am-9:30am AEST
Original Report
FMI Works experienced an issue where some users were presented with a 500 error when they attempted to log in.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On the morning of 14th November, FMI experienced an outage for some users of up to one hour. The primary failure occurred in a single application server instance in one of our web farms. Users that were assigned this instance were unable to access the application and received a 500 error instead.
Resolution
After analysis, it was determined that all of the failed requests were related to a single application server instance. As no other instances were failing, the team moved to recycle the instance. The failure was identified to have occurred in the same web farm as the outage on 9th November. No other common cause was identified. As a result, the entire web farm is scheduled for decommissioning and clients will be moved to replacement web farms during the next maintenance window.
9 November 2022, 10am-12:20pm AEST, some users
Original Report
FMI Works experienced an issue where some users were presented with the legacy login page after initially logging in, preventing access to FMI Works.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On the morning of 9th November, FMI experienced an outage for some users of up to two hours. Most users were provided a workaround by support and were able to resume work immediately. The primary failure occurred in a single application server instance in one of our web farms. Users that were assigned this instance were unable to access the home page and received the legacy login page instead.
Resolution
After analysis, it was determined that all of the failed requests were related to a single application server instance. As no other instances were failing, the team moved to recycle the instance. The failure was identified to have occurred in the same web farm as the outage on 27th October. No other common cause was identified. As a result, the entire web farm is scheduled for decommissioning and clients will be moved to replacement web farms during the next maintenance window.
27 October 2022, 1pm-3pm AEST, some users
Original Report
FMI Works experienced an issue where some users were given an error page when they attempted to use the home screen.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On the afternoon of 27th October, FMI experienced an outage for some users of up to two hours. Most users were provided a workaround by support and were able to resume work immediately. The primary failure occurred in a single application server instance in one of our clusters. Users that were assigned this instance were unable to access the home page and received an error page instead. The outage coincided with a depletion of networking resources for the same instance and appeared to be a failure within the operating system or networking stack.
The system auto-detected the failing application server instance and attempted to automatically recover. This auto-recovery process did not succeed.
Resolution
After analysis, it was determined that all of the failed requests were related to a single application server instance. As no other instances were failing, the team moved to recycle the instance. The standard mechanism (as advised by the vendor) did not succeed in recycling the instance, just as the auto-recovery mechanism had not. The DevOps team proceeded to manually delete and recycle this instance.
To facilitate the prompt identification and recovery of underlying infrastructure issues, FMI has added the identification and recovery steps to our internal outage remediation procedures.
13 July 2022, 6am-10am AEST, some users
Original Report
FMI Works experienced an issue where users were being redirected to the logout page when they attempted to log in.
This issue was assigned a priority of Urgent and was investigated by our team in accordance with the process outlined in our Support Services Guide.
Analysis
On the morning of 13th July, FMI experienced an outage of services for some customers of approximately four hours. This outage was determined to be caused by a scaling failure during unexpectedly high activity. The primary failure occurred when capacity was reached on a database server cluster. The resultant failure cascaded onto several servers in a web server farm. The issue was resolved via manual scaling events by responding team members. During these manual scaling events, some customers previously unaffected by the outage reported 'sluggish' application behavior. Once the scaling was completed, service was restored for all customers.
FMI routinely provisions capacity of database servers with 25% free space to allow for growth. Capacity scaling for databases is done during scheduled outage windows. Application servers automatically scale in/out as needs demand. If necessary, scaling up/down of servers is done during scheduled outage windows. In June/July of 2022, FMI saw larger data growth than anticipated, exhausting our provisioned capacity.
Resolution
After analysis of the process, FMI determined that improvements to our capacity planning process were desirable. Starting with awareness, FMI has added additional monitoring alerts and has expanded the distribution of those alerts. The regular maintenance update periods have also been extended to include a check on capacity before new releases are completed. Finally, FMI is changing from reactive capacity planning to predictive planning. Instead of sizing to current needs, FMI will use historical growth rates and predicted sales conversions to estimate future capacity requirements.
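To make the idea of predictive planning concrete, the following is a minimal sketch of a linear forecast based on historical growth alone. It is an illustration under stated assumptions, not FMI's planning model: the function name and sample figures are hypothetical, sales predictions are not modelled, and the 25% figure is taken from the free-space policy described in the analysis.

```python
def months_until_threshold(usage_history_gb, capacity_gb, free_fraction=0.25):
    """Estimate months until usage eats into the reserved free space.

    Uses the average month-over-month growth in `usage_history_gb`
    (oldest sample first) as a simple linear forecast. Returns 0.0 when
    the threshold is already exceeded and None when usage is flat or
    shrinking.
    """
    if len(usage_history_gb) < 2:
        raise ValueError("need at least two monthly samples")
    growth_per_month = (usage_history_gb[-1] - usage_history_gb[0]) / (len(usage_history_gb) - 1)
    threshold_gb = capacity_gb * (1 - free_fraction)
    headroom_gb = threshold_gb - usage_history_gb[-1]
    if headroom_gb <= 0:
        return 0.0
    if growth_per_month <= 0:
        return None
    return headroom_gb / growth_per_month


# Hypothetical monthly database usage in GB, oldest first.
history = [500, 520, 540, 560, 600]
months_left = months_until_threshold(history, capacity_gb=1000)
print(months_left)
```

A forecast like this can be compared against the interval between scheduled outage windows: if the estimated months remaining is shorter than that interval, capacity should be scaled at the next window rather than the one after.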
8 January 2021, 8am-1pm AEST, all customers
Original Report
Analysis
On the morning of 8th January, FMI experienced an outage of services for all customers for approximately five hours. This outage was determined to be caused by an SSL certificate that was improperly recycled by one of our partner vendors. The certificate re-issue and deployment process was completely automated and retrieved a new, but invalid, certificate. Upon expiry of the old certificate, there was no valid certificate available to secure the application.
While concurrently trying to repair the broken certificate, FMI went through an urgent requisition process for a new SSL certificate. The new certificate arrived before the old one could be repaired and all certificates were replaced manually.
Resolution
After analysis of the process, FMI determined that automatic issuance and recycling of certificates is still our preferred option. However, FMI has changed our certificate issuer to a new vendor which provides a more robust automatic certificate issuance pipeline, reducing the number of failure points. Concurrently, FMI changed from wildcard certificates to individual domain-specific certificates. This latter change will additionally limit the scope of failures should one recur.
More information
For information on any current outages, see Current Outages
For more information on our processes for outages generally, see Software Update Process