
🚀 Deep Dive: EBOD‑917 – What Went Wrong, How We Fixed It, and What We Learned

Date: 14 April 2026
Author: [Your Name], Senior Engineer, E‑BOD Team
Tag(s): #BugReport #Postmortem #EBOD917 #Reliability #DevOps



EBOD-917 — Quick Update & Next Steps

Summary: EBOD-917 is progressing; key milestones reached and a short action plan follows to keep momentum.

2️⃣ The Incident Timeline

| Time (UTC) | Event |
|------------|-------|
| 10:12 | Alert from SRE on spike in GET /users/id 500 errors (Grafana threshold: >200 rpm). |
| 10:15 | Incident commander assigned – J. Lee. |
| 10:20 | Triage: error traced to UserDirectoryService v2.4.1 (deployed at 09:45). |
| 10:27 | Reproduction steps verified in staging – pagination bug triggers when page=0. |
| 10:40 | Hot‑fix branch created (hotfix/EBOD-917-paginate-fix). |
| 10:55 | Fix merged, container image built, and canary deployed to 2 % of traffic. |
| 11:08 | Metrics show error rate dropped from 4.3 % → 0.2 % (within canary). |
| 11:12 | Full rollout to all regions completed. |
| 11:20 | Incident declared Resolved. |
| 12:00 | Post‑mortem meeting scheduled (see notes below). |


6️⃣ Lessons Learned

| Area | Action Item |
|------|-------------|
| Testing | Introduce boundary‑value testing for all pagination parameters. |
| Feature Flags | Enforce staged rollout (canary → 5 % → 20 % → 100 %). |
| Monitoring | Track business‑level symptoms (e.g., UI error rates) in addition to HTTP status codes. |
| Documentation | Keep API version change logs in sync with release notes. |
| Post‑mortem Process | Conduct a blameless review within 24 h and publish a public incident summary for transparency. |
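
To make the boundary‑value testing item concrete, here is a minimal sketch of the kind of check it implies. It assumes zero‑based page indexing and an in‑memory list; the class PaginationBoundarySketch and the helper pageOf are hypothetical names for illustration, not the actual UserDirectoryService code.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical illustration only; not the real service implementation.
final class PaginationBoundarySketch {

    // Returns the zero-based page of items, or an empty list when the page is out of range.
    static <T> List<T> pageOf(List<T> items, int page, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        // Ceiling division without overflow: 5 items at pageSize 2 gives 3 pages.
        int totalPages = items.size() / pageSize + (items.size() % pageSize == 0 ? 0 : 1);
        // Valid indexes are 0 .. totalPages - 1.
        if (page < 0 || page >= totalPages) {
            return Collections.emptyList();
        }
        int from = page * pageSize;
        int to = from + Math.min(pageSize, items.size() - from);
        return items.subList(from, to);
    }

    static void check(boolean condition, String message) {
        if (!condition) {
            throw new AssertionError(message);
        }
    }

    public static void main(String[] args) {
        List<String> users = List.of("a", "b", "c", "d", "e");
        int pageSize = 2;
        int totalPages = 3; // 5 users at 2 per page

        // Exercise each edge of the valid range, plus one step outside it on both sides.
        check(pageOf(users, -1, pageSize).isEmpty(), "page below the range");
        check(pageOf(users, 0, pageSize).equals(List.of("a", "b")), "first page");
        check(pageOf(users, totalPages - 1, pageSize).equals(List.of("e")), "last page");
        check(pageOf(users, totalPages, pageSize).isEmpty(), "first page past the end");
        check(pageOf(List.<String>of(), 0, pageSize).isEmpty(), "empty data set");
        System.out.println("all boundary checks passed");
    }
}
```

The values worth pinning down are the two edges of the valid range (0 and totalPages - 1) and their immediate neighbours, since an off‑by‑one in the guard only shows up there.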



5️⃣ The Fix (What We Did)

  1. Corrected the Conditional

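    // Before: page == totalPages passed the guard, although pages are zero-based
    if (page > totalPages)
        return Collections.emptyList();
    // After: any index at or beyond totalPages is out of range
    if (page >= totalPages)
        return Collections.emptyList();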
    
  2. Added Contract Tests

    • New Pact contract covering GET /users?page=0 → non‑empty payload when totalPages > 0.
    • Runs in CI pipeline on every PR.
  3. Feature‑Flag Guardrails

    • Updated the rollout script to require manual approval for production enablement.
    • Added a visibility toggle in the Ops dashboard (shows which services are using the flag).
  4. Improved Observability

    • New Prometheus metric: user_directory.empty_page_responses_total.
    • Alert fires when empty‑page rate exceeds 0.1 % of total calls.
  5. Documentation

    • Updated the API spec to explicitly state that page indexing is zero‑based.
    • Added a “Gotchas” section for front‑end developers.
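
The alert condition from step 4 comes down to a small piece of arithmetic, sketched below in Java for illustration. In practice this would more likely live in an alerting rule than in application code; the class EmptyPageAlertSketch and the method shouldAlert are hypothetical, and the only figure taken from the text above is the 0.1 % threshold.

```java
// Hypothetical illustration of the threshold arithmetic; not the deployed alert.
final class EmptyPageAlertSketch {

    // 0.1 % written as "1 per 1000" so the comparison stays in integer arithmetic.
    static final long THRESHOLD_PER_THOUSAND = 1;

    // True when empty-page responses strictly exceed 0.1 % of total calls.
    static boolean shouldAlert(long emptyPageResponses, long totalCalls) {
        if (emptyPageResponses < 0 || totalCalls < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        if (totalCalls == 0) {
            return false; // no traffic in the window, nothing to alert on
        }
        // multiplyExact throws ArithmeticException instead of silently overflowing.
        long scaledEmpty = Math.multiplyExact(emptyPageResponses, 1000L);
        long scaledLimit = Math.multiplyExact(totalCalls, THRESHOLD_PER_THOUSAND);
        return scaledEmpty > scaledLimit;
    }

    public static void main(String[] args) {
        System.out.println(shouldAlert(10, 10_000)); // exactly 0.1 %, does not exceed the threshold
        System.out.println(shouldAlert(11, 10_000)); // 0.11 %, exceeds the threshold
        System.out.println(shouldAlert(0, 0));       // no calls at all
    }
}
```

Because the alert fires only when the rate exceeds the threshold, a window that sits at exactly 0.1 % stays quiet; whether that strict comparison is the intended behaviour is worth confirming against the real rule.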

