
Impact story: From performance failures to trusted delivery

Filed in: Drupal, Azure, performance optimization, database performance, cloud migration

Performance regressions after cloud migration prompted a comprehensive overhaul: infrastructure fixes, caching strategy, editorial efficiency, and governance, all coordinated through evidence-based delivery.

tl;dr

  • Azure migration followed best practices but performance degraded: timeouts, daily failures jumped 50%, editor experience regressed
  • Rigorous diagnosis revealed root causes: database bottleneck (Azure MySQL 2-3x slower for Drupal), caching gaps, editorial workflow friction
  • Coordinated recovery across multiple fronts: infrastructure migration, caching strategy, editorial efficiency, governance frameworks
  • Infrastructure fixes: migrated to MariaDB on VMs, cutting database latency 51%, reducing 503 errors 77%
  • Platform-wide optimization: standardized 15 sites, doubled Lighthouse SEO scores, saved editors 120 hours monthly
  • Outcomes: median response time 0.45s, 20% Azure cost reduction, 79.9% user satisfaction (+7pp year-over-year)

Context#

The United Nations Office for Disaster Risk Reduction (UNDRR) is the UN's focal point for disaster risk reduction, coordinating global policy and supporting Member States to reduce disaster risk and losses.

UNDRR's 15-site Drupal ecosystem migrated to Azure following best practices: managed services, platform-as-a-service, letting Azure handle infrastructure complexity. The "modern" architecture should have been faster, more reliable, more scalable.

Instead, performance degraded. Long-tail content timed out. Daily page failures jumped 50%. Editor workflows slowed. User trust eroded.

We needed comprehensive recovery — not just infrastructure fixes, but coordinated improvements across database performance, caching strategy, editorial workflows, and governance. This level of transformation required rigorous diagnosis to understand root causes, then patient execution across multiple fronts.

The database layer turned out to be a critical bottleneck. Load testing revealed Azure Database for MySQL was 2-3x slower than self-hosted alternatives on Drupal's query patterns. But fixing the database alone wouldn't restore trust — we needed platform-wide optimization.

This is the story of how evidence-based diagnosis drove coordinated recovery: infrastructure migration, performance engineering, editorial efficiency improvements, and governance frameworks — all working together to rebuild performance and user trust.

The architectural bet that didn't pay off#

We chose Azure Database for MySQL for good reasons:

  • Microsoft's recommended architecture: Managed PaaS services over self-managed VMs
  • Operational simplicity: Automatic backups, patching, high availability built in
  • Scalability promise: Elastic compute and storage scaling
  • Industry best practice: Modern cloud-native applications favor managed databases

But Drupal's query patterns didn't match Azure Database for MySQL's optimization profile, which, as we discovered, is tuned for OLTP (Online Transaction Processing) workloads with predictable access patterns. Drupal produces complex joins, large result sets, and bursty query patterns, especially on cold paths when caches miss.

The problems manifested in unexpected ways:

  • Cold path failures: If a page hadn't been visited recently, the first request would time out
  • Admin path regression: Editor workflows involving heavy queries slowed substantially
  • 95th percentile latency spike: Tail latency pointed to the database as the bottleneck
  • Cache dependency: Performance was acceptable only while the cache stayed warm, a fragile architecture

Diagnosis: multiple bottlenecks#

Performance degradation after migration wasn't one problem — it was several compounding issues that required systematic diagnosis.

Database layer: The most critical bottleneck. We needed rigorous evidence to justify reversing architectural decisions, so load testing alternated between Azure Database for MySQL and MariaDB on a VM while keeping code and configuration static. We measured both user-facing paths and CRUD (Create, Read, Update, Delete) micro-benchmarks, neutralizing cache effects to see the real performance delta. The results were consistent: MariaDB on a VM outperformed the managed service by 2-3x on database-bound operations. (See Drupal delayed by Azure MySQL for full benchmarking methodology.)
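
As a rough illustration of what such a CRUD micro-benchmark can look like, here is a minimal sketch assuming a pymysql client, a scratch bench_items table, and placeholder connection details; the backend hostnames and credentials are stand-ins, and this is not the actual harness we ran.

```python
# Minimal CRUD micro-benchmark sketch: time the same INSERT/SELECT/UPDATE/DELETE
# cycle against both database backends from the same client.
# Hostnames, credentials, and table name are illustrative placeholders.
import time
import statistics
import pymysql

BACKENDS = {
    "azure-mysql": dict(host="azure-mysql.example", user="bench", password="example", database="bench"),
    "mariadb-vm": dict(host="mariadb-vm.example", user="bench", password="example", database="bench"),
}

DDL = "CREATE TABLE IF NOT EXISTS bench_items (id INT PRIMARY KEY, body TEXT)"

def run_crud_benchmark(dsn, iterations=500):
    """Return p50/p95 latency (ms) for one full CRUD cycle per iteration."""
    conn = pymysql.connect(**dsn, autocommit=True)
    timings = []
    with conn.cursor() as cur:
        cur.execute(DDL)
        for i in range(iterations):
            start = time.perf_counter()
            cur.execute("INSERT INTO bench_items (id, body) VALUES (%s, %s)", (i, "payload" * 50))
            cur.execute("SELECT body FROM bench_items WHERE id = %s", (i,))
            cur.fetchall()
            cur.execute("UPDATE bench_items SET body = %s WHERE id = %s", ("updated", i))
            cur.execute("DELETE FROM bench_items WHERE id = %s", (i,))
            timings.append((time.perf_counter() - start) * 1000)
    conn.close()
    return {
        "p50": statistics.median(timings),
        "p95": statistics.quantiles(timings, n=20)[18],  # 95th percentile cut point
    }

for name, dsn in BACKENDS.items():
    print(name, run_crud_benchmark(dsn))
```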

Caching gaps: Cache strategy wasn't comprehensive enough for Drupal's cold-path patterns. Long-tail content served policymakers during urgent decisions, but these pages weren't staying warm. We needed cache warming, better render cache tuning, and CDN optimization.
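
As an illustration of the cache-warming idea, here is a minimal sketch that walks a site's XML sitemap and requests each URL so page cache and CDN entries are populated before real visitors arrive; the sitemap URL, concurrency, and timeouts are illustrative assumptions, not the production setup.

```python
# Minimal cache-warming sketch: fetch every URL in an XML sitemap so cold paths
# are served from a warm cache. The sitemap URL and worker count are placeholders.
import concurrent.futures
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://www.example.org/sitemap.xml"  # placeholder, not the real site
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(sitemap_url):
    """Return the <loc> entries from a standard sitemap.org urlset."""
    root = ET.fromstring(requests.get(sitemap_url, timeout=30).content)
    return [loc.text for loc in root.findall(".//sm:loc", NS)]

def warm(url):
    """Request one URL and report its status and server response time."""
    try:
        resp = requests.get(url, timeout=60)
        return url, resp.status_code, resp.elapsed.total_seconds()
    except requests.RequestException as exc:
        return url, None, str(exc)

if __name__ == "__main__":
    urls = sitemap_urls(SITEMAP_URL)
    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        for url, status, elapsed in pool.map(warm, urls):
            print(status, elapsed, url)
```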

Editorial workflow friction: Editors faced compounding delays — slow menu loads (7.8s), timeout failures on save operations, manual publication workflows requiring 10 minutes per item. These weren't infrastructure problems, but they eroded trust in the platform just as much as public-facing timeouts.

With root causes identified, we could plan coordinated recovery across all three fronts.

Recovery and optimization#

We executed coordinated recovery addressing all diagnosed bottlenecks:

  • Database migration: Moved to MariaDB on a VM with tight backup and continuity plans. This immediately cut DB-bound latency by 51% and reduced daily 503 errors by 77%. Editor time-to-first-byte improved materially, especially on admin paths and heavy queries.
  • Platform standardization: Consolidated asset handling with deterministic responsive images in build. Standardized Drupal 10 theming with design-system patterns and component libraries across all 15 sites. Unified UX while maintaining site-specific requirements.
  • Performance engineering: Implemented comprehensive caching strategy covering page cache, render cache, and CDN layers. Tuned image pipelines for Core Web Vitals (CWV) compliance. Conducted render path audits and fixed performance regressions. Added cache warming for cold paths to prevent the timeout pattern we saw post-migration.
  • AI-readiness: Implemented RDF (Resource Description Framework) / SKOS (Simple Knowledge Organization System) metadata pipelines and JSON-LD (JavaScript Object Notation for Linked Data) exports (see the sketch after this list). Built machine-actionable content models positioning UNDRR as a leader in structured data publishing. Full context-first approach documented in Smart AI integration.
  • Analytics and monitoring: Deployed GA4 (Google Analytics 4) with custom dimensions for segment-level reporting. Built Looker Studio dashboards tracking chatbot referral patterns and AI-driven traffic. Hardened monitoring with failure categorization by root cause and owner. Implemented on-site user feedback to validate that numbers matched experience.
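
To make the AI-readiness item above concrete, here is a minimal sketch of what a schema.org JSON-LD export for a publication-style page could look like; the content model, field names, and values are illustrative assumptions, not UNDRR's actual metadata pipeline.

```python
# Minimal JSON-LD export sketch using schema.org vocabulary.
# Field names and values are illustrative only.
import json

def to_json_ld(node):
    """Map a simplified content dict to a schema.org JSON-LD document."""
    return {
        "@context": "https://schema.org",
        "@type": "Report",
        "name": node["title"],
        "datePublished": node["published"],
        "inLanguage": node["language"],
        "keywords": node["tags"],
        "publisher": {"@type": "Organization", "name": "UNDRR"},
        "url": node["url"],
    }

node = {
    "title": "Example report on disaster risk reduction",
    "published": "2025-01-01",
    "language": "en",
    "tags": ["disaster risk reduction", "resilience"],
    "url": "https://www.example.org/publications/example-report",
}

# Typically embedded in the page head inside <script type="application/ld+json">.
print(json.dumps(to_json_ld(node), indent=2))
```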

Governance: managing complexity#

This level of transformation — platform standardization, infrastructure migration, performance recovery, and feature delivery — required disciplined governance to stay on track.

We introduced a living risk register with owners, triggers, and mitigations to steer prioritization. The database migration was flagged early as high-impact/high-risk, which allowed us to plan the transition carefully and maintain service continuity.

We committed to a predictable 3-week release cadence with scope control and demo checkpoints. This gave stakeholders clear expectations about what would ship when, and gave the team focus during a period of significant platform change.

We required measurable outcomes on every major ticket — not just "improve performance" but "reduce 95th percentile Time to First Byte (TTFB) to under 1s on admin paths." This aligned work artifacts to organizational goals and made reporting clearer.
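
A target like that is only useful if it can be checked mechanically each cycle. As a minimal sketch, assuming a hypothetical CSV export of per-request TTFB samples (path, ttfb_ms) rather than our actual monitoring pipeline, such a check could look like this:

```python
# Minimal sketch: verify a 95th-percentile TTFB target from a CSV of samples.
# The file format (path, ttfb_ms) and threshold are illustrative assumptions.
import csv
import statistics

TARGET_MS = 1000  # "under 1s on admin paths"

def p95(values):
    return statistics.quantiles(values, n=20)[18]  # 95th percentile cut point

admin_ttfb = []
with open("ttfb_samples.csv", newline="") as fh:
    for row in csv.DictReader(fh):  # expected columns: path, ttfb_ms
        if row["path"].startswith("/admin"):
            admin_ttfb.append(float(row["ttfb_ms"]))

result = p95(admin_ttfb)
print(f"admin p95 TTFB: {result:.0f} ms ({'OK' if result < TARGET_MS else 'over target'})")
```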

The cadence held throughout 2024-2025. 90%+ of major tickets included defined outcomes and success metrics. The risk register was reviewed each cycle with owner sign-off. Stakeholder confidence increased as we demonstrated predictable delivery even while managing platform complexity.

Outcomes#

Performance (what it means for users):

  • Median response time cut in half (0.88s → 0.45s) across 15 sites — disaster risk data accessible in under a second instead of multiple seconds of waiting
  • Database latency: 51% reduction after MariaDB migration — editors spend less time waiting for the CMS and more time on content that saves lives
  • Doubled Lighthouse SEO scores on representative pages — critical risk reduction information reaches broader audiences through search

Reliability (rebuilding user trust):

  • Daily page failures: recovered from post-migration peak, trending toward pre-migration baseline — fewer dead ends when users need data urgently
  • 503 errors: 77% reduction after database migration — regional pages and historical reports accessible consistently
  • Stability visible to users: smoother experience on long-tail and regional content, particularly for users in bandwidth-constrained environments

Cost optimization (doing more with less):

  • 20% quarterly Azure spend reduction — budget freed up for content and feature development rather than infrastructure overhead
  • Better resource utilization through right-sized infrastructure — technical efficiency enabling mission focus

User satisfaction (evidence that technical improvements matter):

  • 79.9% of visitors report sites are helpful (+7 percentage points year over year) — real people finding real value
  • 35,113 user responses (+4.3% year over year volume) — growing engagement from the communities we serve
  • Perception matched reality: instrumentation validated that performance fixes improved actual user experience, not just backend metrics

Campaign impact:

  • GAR2025 landing: 40% engagement lift during campaign period (see GAR2025 impact story)
  • Platform stability enabled successful high-stakes launches

AI readiness (positioning UNDRR for the future):

  • Machine-actionable metadata positioned UNDRR as industry leader — disaster risk knowledge discoverable by AI systems serving global audiences
  • Structured publishing workflows supporting both human and AI audiences — content reaches policymakers whether they're reading directly or querying through AI tools
  • Analytics tracking AI-driven referral patterns and behavior shifts — understanding how audiences discover and use critical information in an evolving landscape

What made this work#

This wasn't heroics — it was the result of standardization, rigorous diagnosis, and patient execution. We focused on high-leverage improvements and applied them consistently across 15 sites. The compounding effect is what moved the needle.

It took cross-functional coordination across platform, product, editorial, design, and operations teams in Geneva, Bangkok, Bonn, New York, and Manila, with clear ownership, a disciplined cadence, and evidence-based decision making. When we hit the database bottleneck, we didn't guess: we benchmarked rigorously and made the infrastructure change with data in hand.

We also kept listening to users throughout. The 79.9% satisfaction score wasn't an accident — it reflected our commitment to ensuring that technical metrics translated to real improvements in user experience.

Deep dives:

  • Drupal delayed by Azure MySQL: full database benchmarking methodology
  • Smart AI integration: the context-first approach to AI readiness

Related impact stories:

  • GAR2025 impact story: engagement results from the GAR2025 campaign launch