We have completed our investigation and identified the core cause of this incident. The services that handle traffic for our most-used features, such as assignment and page elements, entered a complex server-to-server communication error state. Resolving a situation like this means starting with the most plausible cause and working through the list of candidates. Every action or change has to be tested and verified, and its results assessed. In this instance we had to work through a series of changes before the service was fully up and running again, which also explains why the incident lasted so long.
We have taken action to prevent similar events in the future, and additional longer-term activities will further strengthen the affected service.