Draft for Review

SkyCop Engine Migration Plan

2026-03-27  •  Technical Review Document  •  Phased Strangler Fig
  • 274K lines of PHP — Symfony 3.4 / PHP 7.4 monolith
  • ~11 domain services — target decomposition
  • 15 months — target timeline, 6 phases, buffer included
  • 3 developers — 1 fullstack + 2 senior Python

Why We're Migrating

The current system is an 8-year-old Symfony 3.4 / PHP 7.4 monolith. Both are end-of-life with no security patches. The codebase has 274,000 lines of PHP centered on a single 5,647-line Claim entity imported by 370 files. There is no domain isolation — every feature touches everything. Adding new capabilities is slow, risky, and expensive.

We're not rewriting for the sake of rewriting. We need a system that:

  • Can be developed by a small team with AI-assisted tooling
  • Has observability from day one (not bolted on after incidents)
  • Allows independent deployment of different business domains
  • Runs on modern infrastructure with proper CI/CD
  • Makes the claim lifecycle manageable, not a 102-transition maze

What We're Building

The Development Platform (First)

Before writing any business logic, we build a development platform — a foundation that every new service inherits. This is the most important decision in the entire migration: get the platform right, and every service gets monitoring, deployment, auth, and testing for free. Get it wrong, and we build 10 services that each reinvent the wheel.

  • Service Template (Copier): Generate new services with one command. Merges the best of both existing templates: auth, CI/CD, Docker; functional testing patterns, worker health checks, structured logging.
  • Container Orchestration (DOKS): DigitalOcean Kubernetes. Auto-scaling, health checks, rolling deploys. Existing cluster from the Aerobots platform.
  • Event Backbone (RabbitMQ): Async communication. Already in production — new services share the same broker as the monolith. No bridge needed.
  • Workflow Engine (Temporal Cloud): Durable long-running processes — claim lifecycle, payment flows, legal escalations. Retries, timeouts, saga patterns.
  • Observability (Sentry + OpenTelemetry + Grafana): Errors (Sentry); metrics, traces, and logs (Prometheus + Loki + Tempo). Unified from minute one.
  • Secrets (HashiCorp Vault): No more .env files with production credentials.
  • Feature Flags (Unleash): Safe migration cutover. Instant rollback per domain.
  • CI/CD (GitHub Actions → GHCR → DOKS): Push to main → build → test → deploy. Every service, same pipeline.
  • Admin UI Shell (React + Module Federation): One admin app, independently deployed domain modules. Shared navigation, auth, design system.
Why platform-first matters: The existing Aerobots services prove that Python/FastAPI on DOKS works. But each was set up independently — different monitoring, different deploy scripts, different patterns. The platform ensures every new service starts with its operational concerns already solved, and improvements to the template propagate to all services via copier update.
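As a sketch of how the unified template could be parameterized, here is a hypothetical copier.yml fragment — the question names, choices, and defaults are illustrative, not the real template:

```yaml
# Hypothetical copier.yml for the unified platform template.
# All question names and defaults below are illustrative.
service_name:
  type: str
  help: Service name in kebab-case, e.g. lead-service
domain:
  type: str
  help: Business domain this service owns
  choices: [dictionary, lead, documents, messaging, legal, partners, payments, claims]
enable_worker:
  type: bool
  default: true
  help: Include RabbitMQ worker with health checks
```

Running `copier update` in a generated service would then re-apply template improvements while keeping these answers.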

The Services (After Platform)

We decompose the monolith into ~10 domain services. Each service owns its own PostgreSQL database, publishes events to RabbitMQ, and exposes a REST API.

  • Dictionary (Low): Reference data (countries, airports, airlines, jurisdictions)
  • Lead (Medium): Claim intake wizard (8-step form, AI validation, jurisdiction check)
  • Documents (Medium): Upload, storage, classification, PDF generation
  • Messaging (Medium): Email (150+ templates, 29 languages), SMS, WhatsApp, Telegram — delivery infrastructure
  • Communication Engine (High): AI-powered conversation orchestration, webchat, Zendesk replacement, human escalation UI
  • Legal (High): 8 law firm integrations, court cases, distribution logic
  • Partners (Medium): B2B API, affiliate networks, brands, commissions
  • Payments (High): Stripe, Revolut, payouts, invoicing, Rivile/Navision
  • Users & Auth (Medium): Registration, authentication, profiles, Google OAuth
  • Claims Core (Very High): Claim lifecycle (Temporal workflows), state transitions, scoring, orchestration
  • Analytics & Search (Medium): Elasticsearch, reporting, dashboards

Shared Message Broker

New services and the monolith share the same RabbitMQ instance. No bridge, no translation layer, no second messaging system. New services publish and consume on the same broker — the monolith already has 75 consumers running on it. This is the simplest possible approach and eliminates an entire category of migration complexity.
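Sharing the broker still requires a common event contract between old and new publishers. A minimal sketch of what that envelope could look like, in stdlib Python — the field names, routing-key convention, and the exchange name in the comment are assumptions, not an agreed schema:

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class DomainEvent:
    """Envelope shared by the monolith and new services on the common broker.

    Field names are illustrative — the real contract is the per-domain
    event schema agreed during each extraction."""
    event_type: str  # e.g. "lead.created"
    payload: dict
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    occurred_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )
    source: str = "lead-service"

    def routing_key(self) -> str:
        # Convention sketch: "<domain>.<action>" maps directly to the key
        return self.event_type

    def to_bytes(self) -> bytes:
        return json.dumps(asdict(self)).encode("utf-8")


# A real publisher would hand this to pika/aio-pika, e.g.:
#   channel.basic_publish(exchange="skycop.events",
#                         routing_key=event.routing_key(),
#                         body=event.to_bytes())
event = DomainEvent(event_type="lead.created", payload={"legacy_id": 12345})
decoded = json.loads(event.to_bytes())
```

Because both sides serialize to plain JSON on the same broker, the monolith's 75 existing consumers need no changes to coexist with new publishers.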

AI Services — Current State & Vision

Today, the Aerobots platform runs 5 AI services on DOKS that interact with the monolith:

  • Voice Bot: Automated airline/customer calls (Twilio + Pipecat AI). CRM integration: receives call data via API, returns results via webhook.
  • WA Bot: WhatsApp customer support (Claude agent loop). CRM integration: Robot API (Basic Auth) + Port API (OAuth2).
  • CRM-MCS: Middleware — scrapes admin panel HTML, returns JSON. CRM integration: HTML scraping via Squid proxy (fragile).
  • Doc Moderator: Human-in-the-loop AI document classification review. CRM integration: reads from CRM-MCS.
  • Doc Classifier: Chrome extension for AI document classification. CRM integration: calls the CRM-MCS classify endpoint.

Core problem: CRM-MCS exists because the monolith has no proper API for AI services. It logs into the admin panel, parses HTML with BeautifulSoup, and returns structured data. Every CRM UI change can break it.

Phase 0 fix: Extend the Robot API. The monolith already has /robot/ endpoints that provide structured JSON (the WA bot uses them). We extend the existing Robot API with the missing operations that CRM-MCS currently gets via HTML scraping — claim search, documents, action history, scores. This eliminates the scraping layer and gives all AI services a proper API immediately.
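To make the contrast with HTML scraping concrete, here is a hedged sketch of how an AI service might consume an extended Robot API claim-search endpoint. The response shape and field names are invented for illustration — the real contract comes from the monolith's /robot/ endpoints:

```python
from dataclasses import dataclass

# Hypothetical shape of an extended Robot API response — the real
# endpoint paths and field names will come from the monolith team.
SAMPLE_RESPONSE = {
    "claims": [
        {"id": 987, "state": "legal_review", "score": 0.82,
         "documents": [{"id": 1, "type": "boarding_pass"}]}
    ]
}


@dataclass
class ClaimSummary:
    claim_id: int
    state: str
    score: float
    document_count: int


def parse_claims(response: dict) -> list[ClaimSummary]:
    """Structured JSON in, typed records out — no HTML parsing, so
    admin-panel UI changes can no longer break the integration."""
    return [
        ClaimSummary(
            claim_id=c["id"],
            state=c["state"],
            score=c["score"],
            document_count=len(c.get("documents", [])),
        )
        for c in response["claims"]
    ]


claims = parse_claims(SAMPLE_RESPONSE)
```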

Vision — two-layer communication architecture:

  • Messaging Service (CRM build team): Delivery infrastructure — template rendering, email/SMS/WhatsApp/Telegram delivery, delivery tracking, bounce handling. Any service can send a message through it.
  • Communication Engine (Aerobots AI team): AI layer — conversation orchestration, webchat assistant on the claim form, email response engine, channel routing, human escalation, Zendesk replacement (Admin Shell module).
  • Voice Assistant (Aerobots AI team): Separate service — Twilio SIP, Pipecat real-time audio pipeline. Integrates with the Communication Engine for orchestration but runs independently.

Document classification moves from a Chrome extension hack into the Documents Service as a first-class API capability.

How We Migrate

Strategy: Phased Strangler Fig

We extract one domain at a time. The monolith keeps running throughout. Each extraction follows the same playbook:

  1. Design — reverse-engineer domain from monolith code + knowledge base
  2. Build — new service on the platform, behind feature flag
  3. Shadow — run in parallel, compare results with monolith
  4. Switch — flip feature flag, route traffic to new service
  5. Verify — monitor for issues, maintenance window for cutover
  6. Clean up — remove domain code from monolith, retire bridge routes
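Step 3 (Shadow) is the safety net of the playbook. A minimal sketch of the comparison logic, with toy in-process handlers standing in for HTTP calls to the monolith and the new service; the ignored field names are illustrative:

```python
from typing import Callable


def shadow_compare(
    request: dict,
    monolith_handler: Callable[[dict], dict],
    service_handler: Callable[[dict], dict],
    ignore_fields: frozenset[str] = frozenset({"request_id", "timestamp"}),
) -> tuple[dict, list[str]]:
    """Serve the monolith's answer; run the new service in shadow and
    report field-level mismatches instead of failing the request."""
    authoritative = monolith_handler(request)
    mismatches: list[str] = []
    try:
        shadow = service_handler(request)
        for key in set(authoritative) | set(shadow):
            if key in ignore_fields:
                continue
            if authoritative.get(key) != shadow.get(key):
                mismatches.append(key)
    except Exception as exc:  # shadow failures must never affect users
        mismatches.append(f"shadow_error: {exc}")
    return authoritative, mismatches


# Toy handlers standing in for the two systems
def old(req: dict) -> dict:
    return {"eligible": True, "amount": 400, "timestamp": "t1"}


def new(req: dict) -> dict:
    return {"eligible": True, "amount": 600, "timestamp": "t2"}


response, diffs = shadow_compare({"claim_id": 1}, old, new)
```

The mismatch list feeds the Verify step: the feature flag flips only once shadow traffic shows a sustained period with no diffs.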

Phasing

  • Phase 0 (Platform, 2–3 weeks): Harden the Copier template (Sentry, OTel, Vault, K8s, RabbitMQ). Deploy Temporal Cloud. Build the Admin Shell skeleton. Extend the Robot API.
  • Phase 1 (LEAD Domain, 3–5 weeks): Dictionary Service + Lead Service + first Admin modules. Validates the entire pipeline end-to-end.
  • Phase 2 (Content, 4–6 weeks): Messaging Service (150+ email templates) + Documents Service. Parallelized across 2 devs.
  • Phase 3 (External, 5–8 weeks): Legal Service (8 law firms) + Partners Service (B2B API). Parallelized.
  • Phase 4 (Financial, 6–8 weeks): Payments Service (Stripe, Revolut, Rivile). Security audit before go-live. Zero tolerance for errors.
  • Phase 5 (Core, 12–18 weeks): Claims Core (Temporal workflows) + Users & Auth. The hardest extraction — all other services must be stable first.

Critical dependency: Phase 0 is the prerequisite — it blocks everything. If the platform isn't solid, we don't start Phase 1. Each phase includes: backend service + admin UI module + data migration + event bridge routes + integration testing.

What's NOT in Scope (Yet)

  • Analytics & Search — follows after Claims Core is stable
  • Monolith shutdown — only after all domains are extracted and stable
  • Mobile apps — existing mobile experience continues via current APIs
  • WordPress (www.skycop.com) — separate system, not affected

Decision Matrix

Key architectural choices, with the alternatives considered. Each matrix closes with the verdict.

Language: Python vs Java

Factor | Python / FastAPI | Java / Spring Boot
AI-assisted development | Better — AI tools generate more concise, higher-quality Python | Good, but more boilerplate to generate and review
Existing codebase | 7 running services on the Aerobots platform, 2 service templates | Zero Java services at SkyCop
Hiring | Larger pool of AI-capable generalists | Larger pool of enterprise backend developers
Type safety | Pydantic v2 + mypy strict — requires discipline | Native strong typing — compiler catches mistakes
Startup speed | Template to deployed service: hours | Days to a week with full Spring patterns
Long-term maintainability | Needs strict linting/typing conventions | Framework conventions enforce consistency

Verdict: Python — velocity + existing patterns + AI acceleration

Message Broker: RabbitMQ vs Kafka

Factor | RabbitMQ | Kafka
Current state | Already running, 75 consumers | Only in the AI module (emerging)
Migration complexity | New services share the same broker — zero bridging needed | Would need a Kafka↔RabbitMQ bridge for 18+ months
Event replay | Not built in | Native — consumers can re-read events
Scale fit | Handles our volume trivially | Designed for millions of messages/sec — massive overkill
Ops complexity | Simple, well understood | Significant — ZooKeeper/KRaft, partitions, consumer groups

Verdict: RabbitMQ — already running, no bridge needed, right-sized

Workflow Engine: Temporal vs Camunda vs Code-based

Factor | Temporal | Camunda | Code-based state machine
Durability | Built in — survives restarts, resumes from the exact step | Built in (BPMN engine) | Manual — DB state tracking + cron polling
Python support | Excellent SDK | Java-native; Python via REST/gRPC | Full control
Long-running processes | Native — timers span weeks/months | Native (BPMN timers) | Cron jobs checking "is it time yet?"
Retries / timeouts | Declarative retry policies | BPMN error handling | Hand-rolled try/catch + backoff
Saga pattern | Built-in compensation logic | BPMN compensation events | Manual rollback code
Visibility | Web UI showing every workflow, step, and wait status | Cockpit UI, BPMN visualization | Custom admin pages
Ops overhead | Temporal Cloud = zero ops ($200–400/mo) | Self-hosted = heavy (Java, DB, Elasticsearch) | Zero infra, but code complexity grows
Learning curve | Moderate — new concepts (workflows, activities) | Steep — BPMN notation, engine concepts | Low — plain code

Verdict: Temporal Cloud — durable workflows for the claim lifecycle, zero ops

Frontend: Unified SPA vs Micro-frontends vs Keep Twig

Factor | App Shell + Module Federation | Monolithic React SPA | Keep Twig during migration
Operator UX | One app, one login, consistent | One app, one login, consistent | Existing UX; no improvement for 1–2 years
Independent deployment | Yes — each domain module deploys separately | No — one bug blocks all domains | N/A — monolith frontend
Build speed | Each module builds independently | Entire SPA rebuilds | No frontend builds needed
Migration path | Add modules as domains are extracted | Big-bang frontend rewrite needed | Incremental, but the old Twig stays
Upfront effort | Shell + design system (~2 weeks) | Full SPA upfront | Zero

Verdict: App Shell + Module Federation — unified UX, independent deployment

Database: PostgreSQL only vs PostgreSQL + MariaDB access

Factor | PostgreSQL only | PostgreSQL + MariaDB reads
Separation | Clean — new services never touch the old DB | Coupling — new services depend on the old schema
Sync complexity | Needs ETL/events to move data | Direct reads, simpler short-term
Migration end state | Clean cutover when the monolith shuts down | Must detach MariaDB reads later — deferred pain

Verdict: PostgreSQL only — clean separation from day one

Technology Decisions

  • Language: Python 3.13+. AI-assisted development velocity, existing patterns from Aerobots, and a larger hiring pool for AI-capable devs.
  • Framework: FastAPI. Async, fast, auto-generated OpenAPI docs, Pydantic validation. Proven in 7 existing services.
  • Database: PostgreSQL. New services only; no direct MariaDB access. Clean separation via RabbitMQ events + ETL.
  • Messaging: RabbitMQ. Already in production. New services share the same broker as the monolith — no bridge, no second system.
  • Workflows: Temporal Cloud. Durable workflow execution for the claim lifecycle. Retries, timeouts, saga patterns. No infrastructure to manage.
  • Frontend: React + Module Federation. Admin shell with independently deployed domain modules. Operators see one app; teams deploy independently.
  • Type safety: mypy strict + Pydantic v2 + ruff. Compensates for Python's dynamic typing. All domain models are Pydantic, all public APIs are typed, 30+ lint categories enforced.
  • Template: Copier. Parameterized scaffolding with auth, testing, CI/CD, and Docker built in. copier update propagates improvements to all services.
  • Infrastructure: DOKS. Existing cluster from Aerobots. K8s provides auto-scaling, health checks, rolling deploys, namespace isolation.
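The type-safety decision could translate into a shared pyproject.toml fragment baked into the service template. A sketch — the exact mypy settings and ruff rule selection below are assumptions, not the final configuration:

```toml
# Sketch of the shared lint/type config for the service template.
# The rule selection is illustrative, not final.
[tool.mypy]
strict = true
plugins = ["pydantic.mypy"]

[tool.ruff]
target-version = "py313"

[tool.ruff.lint]
# The real template enforces 30+ categories; a few representative ones:
select = ["E", "F", "I", "B", "UP", "ASYNC", "S", "RUF"]
```

Because the config ships inside the Copier template, every generated service starts with the same guarantees and inherits tightening via copier update.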

Team

  • Fullstack developer: Anatoliy Gusev (existing). Admin Shell, frontend modules, backend services. Already has context on SkyCop systems.
  • 2 senior Python developers: To hire. Platform hardening, backend services, Claims Core reverse-engineering.
  • Advisor: Pavel Tarasov (part-time, not on the build team). DOKS/deployment patterns, monitoring, Aerobots platform knowledge transfer. Continues AI initiatives separately.
  • System analysts: Existing. Review designs produced by the build team; domain knowledge support.
  • PHP team (split): Existing. The strongest devs retrain to Python over time; the others maintain the monolith — bug fixes, feature flags, sync mechanisms.

Key Principles

  1. Platform before product. Every service inherits observability, deployment, auth, and testing. No exceptions.
  2. One domain at a time. Extract, validate, switch, clean up. No big-bang rewrite.
  3. The monolith keeps running. Customers never notice the migration. Instant rollback via feature flags at every step.
  4. Design from code, not from memory. The build team reverse-engineers domain logic from the actual monolith codebase, not from assumptions about how it should work.
  5. All 10 jurisdictions, always. No partial migration per jurisdiction. When a domain switches, it handles all jurisdictions from day one.
  6. Maintenance windows are OK. Zero-downtime is ideal but not required for every cutover. Match the approach to the domain risk.
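The instant-rollback principle can be sketched in a few lines. In production the routing decision would come from Unleash; this stdlib version only illustrates a sticky percentage rollout with a kill switch (the function and parameter names are hypothetical):

```python
import hashlib


def route_to_new_service(flag_enabled: bool, rollout_percent: int, claim_id: str) -> bool:
    """Sticky percentage rollout: the same claim always routes the same way,
    and turning the flag off is an instant, total rollback.
    (In production this decision comes from Unleash; the hashing here
    just illustrates the mechanism.)"""
    if not flag_enabled:
        return False  # kill switch: everything back to the monolith
    digest = hashlib.sha256(claim_id.encode()).digest()
    bucket = digest[0] * 100 // 256  # stable bucket in 0..99
    return bucket < rollout_percent
```

Hashing on claim_id keeps each claim on one implementation for its whole lifecycle, which keeps shadow comparisons and debugging sane during a gradual ramp-up.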

Timeline

Target: 15 months with a team of 3 developers (1 fullstack + 2 Python) + AI-assisted development. Phases overlap where possible — 3 devs means parallel work.

  • Phase 0 (Platform): 2–3 weeks. Anatoliy starts the Admin Shell while the Python devs harden the template.
  • Phase 1 (LEAD Domain): 3–5 weeks. Analyst specs are ready; mostly code generation from OpenAPI.
  • Phase 2 (Content): 4–6 weeks. Documents + Messaging parallelized across 2 devs.
  • Phase 3 (External): 5–8 weeks. Legal + Partners parallelized.
  • Phase 4 (Financial): 6–8 weeks. Payments needs extra care regardless of speed.
  • Phase 5 (Core): 12–18 weeks. Claims Core — the hard one. Temporal helps, but the domain complexity is real.
  • Total: 32–48 weeks (~8–12 months), buffered to 15 months for unknowns.

Data Strategy

  • New services own PostgreSQL. No new code touches MariaDB.
  • Sync via RabbitMQ events + ETL. Data flows from MariaDB to PostgreSQL for each extracted domain. New services share the same RabbitMQ as the monolith — no bridge needed.
  • Legacy IDs kept forever. Every new table has a legacy_id column as a permanent audit trail linking back to the original MariaDB records.
  • Partner API: break with notice. Partners get 3 months to migrate to the new API. Migration guide provided.
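A worked example of the sync and legacy-ID bullets — one ETL transform that moves a row from the old schema to the new one while preserving legacy_id. The column names are invented for illustration; the real mapping is derived per domain from the monolith's schema:

```python
from dataclasses import dataclass


# Illustrative column names — the real MariaDB schema, with 551
# migrations behind it, is the source of truth for the mapping.
@dataclass
class PgLead:
    legacy_id: int  # permanent audit trail back to the MariaDB record
    email: str
    jurisdiction: str


def transform(mariadb_row: dict) -> PgLead:
    """One ETL step: old row in, new-schema record out, legacy_id kept."""
    return PgLead(
        legacy_id=mariadb_row["id"],
        email=mariadb_row["email"].strip().lower(),
        jurisdiction=mariadb_row["jurisdiction_code"].upper(),
    )


row = {"id": 8812, "email": "  User@Example.com ", "jurisdiction_code": "lt"}
lead = transform(row)
```

The same transform runs for the initial backfill and for incremental updates driven by RabbitMQ events, so both paths produce identical records.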

Immediate Next Steps

  1. Post job listing for senior Python developer #2 — blocking prerequisite
  2. Start Phase 0 — harden the Copier template (Sentry, OTel, Vault, K8s, worker pattern, RabbitMQ)
  3. Reverse-engineer the Claim state machine — validate the knowledge base's 102-transition matrix against the actual workflow.yml in the monolith
  4. Audit the 355 console commands — catalog active cron jobs vs dead code (assign to the PHP team)
  5. Set up a Temporal Cloud account — provision a namespace, test Python SDK connectivity
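Step 3 — validating the transition matrix — can be approached by replaying historical transitions against the matrix extracted from workflow.yml. A toy sketch with invented states (this is not the real 102-transition matrix):

```python
# A tiny model of the transition matrix the build team will extract from
# workflow.yml — the states and transitions here are invented examples.
TRANSITIONS: dict[str, set[str]] = {
    "new": {"validated", "rejected"},
    "validated": {"airline_contacted", "rejected"},
    "airline_contacted": {"paid", "legal_review"},
    "legal_review": {"paid", "closed"},
}


def can_transition(current: str, target: str) -> bool:
    """Check one proposed transition against the extracted matrix."""
    return target in TRANSITIONS.get(current, set())


def audit(history: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Replay historical transitions; return every one the matrix forbids —
    candidates for either undocumented behavior or dead matrix entries."""
    return [(a, b) for a, b in history if not can_transition(a, b)]


violations = audit([("new", "validated"), ("validated", "paid")])
```

Running this over real claim histories separates the transitions the business actually uses from the ones that only exist on paper — exactly the input Claims Core needs before modeling Temporal workflows.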

What Exists Today

We're not starting from zero. These assets feed directly into the migration:

  • Two service templates (new-project-template + aerbots-mcs-template): Will be combined into one platform template. Auth, CI/CD, Docker from new-project-template; functional testing, worker patterns, structured logging from aerbots-mcs-template.
  • Aerobots platform (existing DOKS cluster): Running K8s cluster with 7 Python services, CI/CD patterns, monitoring.
  • Knowledge base (skycop-knowledge): 220 docs covering 12 domains, 200+ entities, 30+ integrations, and the claim state machine.
  • Analyst designs (Confluence SS3): Phase 1 specs — Dictionary Service OpenAPI, Lead Service database schema, AI validation logic, event contracts.
  • The monolith itself: The source of truth for all business logic, 551 migrations of schema evolution, 75 RabbitMQ consumer patterns.