Production Services Affected by Issues with Underlying Service Provider (AWS)
Incident Report for Reflektion, Inc
Postmortem

Executive Summary

AWS API Gateway and Lambda functions in the us-east-1 region had increased error rates for 1 hour 20 minutes. During this time, Sitecore Discover’s API customers were affected: their requests were dropped at the API Gateway level and never reached our backend.

Throughout the outage, Sitecore Discover continued to serve the traffic that reached its service workers. In terms of impact:

  • API Integration: About 67% of requests were dropped because of the AWS outage
  • JS Integration: About 35% of requests were dropped because of the AWS outage
  • CEC: The Customer Engagement Console (CEC) loaded only intermittently during that time

Issue

Per the AWS status page, the AWS Lambda and API Gateway services in us-east-1 had degraded responses and high error rates. Sitecore Discover has deployments in the us-east-1 region in the US.

When API Gateway errored out, requests stopped reaching the service workers and were instead dropped at API Gateway.

For JavaScript integrations, certain flows still require API Gateway, and queries using those flows were dropped.

Impacts

Most customers in the US were impacted during the outage window.

Planned changes / followups

  • Sitecore Discover will build automation to bring up a temporary DNS route that bypasses API Gateway in the event of such outages. Customers will still need to make some changes on their end to use the new endpoint.
  • Sitecore Discover will re-evaluate and prioritize the change to move to us-east-2.
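
As an illustration of the kind of customer-side change the first followup might involve, the following is a minimal sketch of a client that tries the normal endpoint first and falls back to a bypass endpoint when the first one fails. It is not Sitecore Discover's actual client or API: the hostnames, the `request_with_fallback` function, and the `send` callable are all hypothetical, and the real endpoints are not given in this report.

```python
from typing import Callable, Sequence

# Hypothetical endpoints; the real hostnames are not given in this report.
PRIMARY_URL = "https://api.example.com/search"      # normal path via API Gateway
FALLBACK_URL = "https://bypass.example.com/search"  # temporary route that bypasses API Gateway


def request_with_fallback(
    send: Callable[[str], int],
    urls: Sequence[str] = (PRIMARY_URL, FALLBACK_URL),
) -> str:
    """Try each endpoint in order; return the first URL that did not fail.

    `send` (an assumed helper) performs the request and returns the HTTP
    status code, or raises OSError on a connection-level failure.
    A 5xx status is treated as a failure of that endpoint.
    """
    for url in urls:
        try:
            status = send(url)
        except OSError:
            continue  # connection failed; try the next endpoint
        if status < 500:
            return url
    raise RuntimeError("all endpoints failed")
```

In this sketch, `send` could wrap whatever HTTP client the integration already uses. Passing it in as a parameter keeps the fallback logic independent of any particular library and lets it be exercised without network access.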
Posted Jun 15, 2023 - 13:55 PDT

Resolved
This incident has been resolved.
Posted Jun 13, 2023 - 15:31 PDT
Update
We are continuing to monitor for any further issues.
Posted Jun 13, 2023 - 14:34 PDT
Monitoring
AWS is working on a fix and we see most services are recovering. We are actively monitoring the status.
Posted Jun 13, 2023 - 13:47 PDT
Identified
We are monitoring the issue and are in touch with the AWS team. The latest updates can be found here: https://health.aws.amazon.com/health/status
Posted Jun 13, 2023 - 12:21 PDT
This incident affected: Production Recommendations, Production Search, Production API, Production Customer Engagement Console, and Production Feed Processing.