The Challenge: Complex Systems, High Stakes, and Tribal Knowledge Bottlenecks
Wix has built one of the world’s most successful website-building platforms, serving over 250 million users globally. With over 4,000 microservices in production, 500+ server developers, and billions of requests, production reliability is mission-critical.
Wix had already put significant effort into training programs, documentation, and internal tools to help developers respond to incidents. But with thousands of alerts coming in daily across their complex microservices architecture, troubleshooting still consumed nearly 20% of developers’ time and relied heavily on institutional knowledge.
"We were already heavily invested in production reliability, but our ecosystem is very complex," explains Aviva Peisach, Head of Backend Engineering at Wix. "There's a lot of tribal knowledge that people need to know, capture, and activate when they troubleshoot production incidents and alerts. Some of it is generic across all Wix services, and some of it is very specific to the team and the specific architecture and infrastructure that they use."

Responsible for server engineering infrastructure, Peisach, together with the Wix Production team, oversees resilience across Wix: monitoring, troubleshooting, and continuously improving mean time to recovery (MTTR) and mean time to acknowledge (MTTA). For her team, the challenge was clear: find a way to make incident response faster, less stressful, and less dependent on in-house knowledge, freeing developers to focus on building features, not chasing root causes.
“Every alert eventually impacts a user, so the stress levels are high because we must resolve issues quickly to maintain user satisfaction.”
Bridging the Gap Between In-House Knowledge and Production-Ready AI
Wix’s journey began with a multi-year effort to develop an internal tool called Alert Enricher to automate common troubleshooting steps. Alert Enricher checked predefined conditions for root causes such as recent GAs, infrastructure changes, or widespread issues, giving developers a head start when investigating alerts.
These efforts had a meaningful impact, but couldn’t fully address the breadth and nuance of Wix’s production environment. "Alert Enricher was very helpful, but it was also limited because all the rules had to be hard-coded," Peisach notes.
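To make that limitation concrete, hard-coded enrichment of this kind can be pictured roughly as the minimal Python sketch below. Everything in it is hypothetical and for illustration only: the `Alert` fields, the thresholds, and the `enrich` function are assumptions, not Wix's actual Alert Enricher code.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Alert:
    """Hypothetical alert context; the field names are illustrative assumptions."""
    service: str
    minutes_since_last_ga: Optional[float] = None  # None means no recent GA is known
    recent_infra_change: bool = False
    services_alerting_now: int = 1


# Each rule is a (label, predicate) pair. The thresholds (30 minutes, 10 services)
# are made-up values; covering a new scenario means editing this list by hand.
RULES: List[Tuple[str, Callable[[Alert], bool]]] = [
    ("recent GA",
     lambda a: a.minutes_since_last_ga is not None and a.minutes_since_last_ga <= 30),
    ("infrastructure change", lambda a: a.recent_infra_change),
    ("widespread issue", lambda a: a.services_alerting_now >= 10),
]


def enrich(alert: Alert) -> List[str]:
    """Return the labels of every predefined condition that matches the alert."""
    return [label for label, check in RULES if check(alert)]


if __name__ == "__main__":
    alert = Alert(service="checkout", minutes_since_last_ga=5.0)
    print(enrich(alert))
```

In a sketch like this, any scenario outside the predefined list simply produces no hints, which is the gap the hard-coded approach left open.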
Wix had also experimented with AI internally and produced promising initial results. But turning those into reliable, production-ready solutions would have required significant ongoing investment and maintenance.
Finding the Missing Link: AI That Scales Expertise, Not Replaces It
After years of building internal tools and experimenting with AI, Wix had clear criteria shaped by their previous experiences. They needed a solution that could do three things exceptionally well:
- Automate tribal knowledge without dictating process.
- Integrate seamlessly into existing workflows.
- Provide full control for internal teams to tune and enhance.
Wild Moose stood out for its unique approach to the problem. Unlike generic AI copilots that provide one-off summaries, Wild Moose continuously codifies tribal knowledge into dynamic, self-updating playbooks that evolve with every incident.
“Wild Moose had the most mature approach for automating tribal knowledge. The responsiveness in tuning and adapting the product to our unique needs made it clear this was the right partnership.”
This partnership struck the right balance: humans define the knowledge, and AI executes it faster and more reliably. Wix valued the control and customization that Wild Moose offered, as it fit into their processes rather than dictating them.
Wix also found that Wild Moose didn't position itself as a replacement for human expertise, but as a force multiplier. The platform captures and automates Wix's domain knowledge rather than trying to solve problems without understanding Wix’s production instance. As Peisach notes, this reinforced an important principle:
“AI is not magic, but it’s an incredibly powerful tool for automation. It's strengthening our ability to automate our unique domain knowledge, rather than try and solve things out of the box for us.”
From Playbooks to >80% Accuracy in Three Weeks
Wild Moose's implementation followed a phased approach that built towards scale without compromising accuracy:
Step 1: Automating Company-Wide Playbooks Through Rapid Tuning Cycles
The process began with Wix sharing their existing playbooks, which Wild Moose quickly automated. This initial phase included a three-week tuning period that front-loaded configuration work but delivered immediate results. During this time, Wild Moose ingested data from hundreds of services and daily deployments, learning Wix’s unique production environment and investigation patterns.
Wix saw value when Wild Moose automated their company-wide playbooks, quickly producing correct root causes and allowing rapid tuning when adjustments were needed. Within just three weeks, Wild Moose consistently achieved over 80% accuracy.
"We were able to get to a stage of over 80% accuracy in just three weeks, which is mind-blowing for a system as complex as ours."
Step 2: Custom Team Playbooks for the Long Tail
The second phase focused on enabling teams to create new playbooks that reflected their unique workflows. After just a two-hour training session, dozens of teams were able to build their own playbooks, demonstrating both the platform's accessibility and demand for better automation tools.
These team-authored playbooks represented the long tail of scenarios, the edge cases and specialized systems not covered by the core general investigation flows that already addressed roughly 90% of incidents.
This wasn't just about volume: the playbooks captured domain-specific expertise that had previously existed only in developers' heads.
Transforming the Developer Experience: How Wild Moose Reduced On-Call Stress
The impact of automation was felt almost immediately. Developers could resolve issues faster and pinpoint root causes that previously required hours of log-diving, scrolling through fragmented Slack threads to find prior fixes, bouncing between dashboards, and back-and-forth coordination with multiple teams. In Wix’s complex architecture, where any request might traverse a long chain of calls between services, Wild Moose’s ability to pinpoint the exact root cause service proved invaluable.
"With Wild Moose, we’re able to spot the root cause, whether it’s a recent GA, traffic switch, new experiment, or something environmental. Wild Moose helps pinpoint the actual root cause service causing the problem in a very complex environment with billions of requests coming in through a very long chain of calls between services."
On-call rotations became less stressful, and engineers felt more confident knowing they had a system that could guide them quickly to the right answer. Maintenance and improvements of the enrichment results also became dramatically easier. Unlike the hard-coded rules of Alert Enricher, Wild Moose learns and adapts through natural-language feedback, automatically improving accuracy over time.

The Results: Higher Accuracy, Faster MTTR, and Boosted Team Morale
Within weeks, Wix could clearly see how their work with Wild Moose delivered results:
- Achieved 90% root cause accuracy within three weeks of implementation and tuning
- Projected to reduce MTTR by 50%, giving developers valuable time back to focus on building
- Enriched 30,000+ alerts per month, freeing developers from repetitive, manual investigation
- Supported hundreds of teams across 200 Slack channels, enabling consistent, automated troubleshooting at scale
Wix consistently achieves 90% accuracy in root cause detection, with more teams continuing to onboard and contribute new playbooks. MTTR is projected to improve by 50%, freeing significant developer time and reducing customer-impacting downtime. Crucially, these gains require minimal ongoing maintenance. Wild Moose learns from every incident, so accuracy continues to increase over time without constant manual intervention.
"We consistently achieve over 90% accuracy on root cause, with more teams onboarding and writing playbooks every week."
The Wix team is impressed by Wild Moose’s ability to pinpoint root causes with such accuracy across an extremely complex service environment. By addressing a critical bottleneck in the developer lifecycle, Wild Moose enables Wix to maintain full control and flexibility over how knowledge is captured and utilized, without removing human insight from the process.
Since launch, Wild Moose has scaled to enrich over 30,000 alerts per month, with accuracy continuing to improve as volume grows.
A Partnership Built on Deep Domain Expertise
Above all, Wix needed a partner that truly understood their environment and challenges.
"Wild Moose found a critical pain point in the developer lifecycle. The team deeply understands our domain, the product is mature, flexible, and continuously evolving, which makes it an excellent partnership."
This close collaboration allowed Wix to customize and integrate the platform seamlessly, creating a solution that aligned with how engineers actually work instead of forcing them to adopt rigid tools or unpredictable AI agents.
Wix’s experience highlights a crucial insight: the most effective AI solutions don’t replace human expertise; they scale it. Success comes from partnering with teams who combine technical sophistication with deep domain knowledge that matches an organization’s unique needs and culture.
