Why We're Migrating
The current system is an 8-year-old Symfony 3.4 / PHP 7.4 monolith. Both are end-of-life with no security patches. The codebase has 274,000 lines of PHP centered on a single 5,647-line Claim entity imported by 370 files. There is no domain isolation — every feature touches everything. Adding new capabilities is slow, risky, and expensive.
We're not rewriting for the sake of rewriting. We need a system that:
- Can be developed by a small team with AI-assisted tooling
- Has observability from day one (not bolted on after incidents)
- Allows independent deployment of different business domains
- Runs on modern infrastructure with proper CI/CD
- Makes the claim lifecycle manageable, not a 102-transition maze
What We're Building
The Development Platform (First)
Before writing any business logic, we build a development platform — a foundation that every new service inherits. This is the most important decision in the entire migration: get the platform right, and every service gets monitoring, deployment, auth, and testing for free. Get it wrong, and we build 10 services that each reinvent the wheel.
| Component | Technology | Purpose |
|---|---|---|
| Service Template | Copier | Generate new services with one command. Merges the best of both existing templates: auth, CI/CD, Docker; functional testing patterns, worker health checks, structured logging. |
| Container Orchestration | DOKS (DigitalOcean Kubernetes) | Auto-scaling, health checks, rolling deploys. Existing cluster from Aerobots platform. |
| Event Backbone | RabbitMQ | Async communication. Already in production — new services share the same broker as the monolith. No bridge needed. |
| Workflow Engine | Temporal Cloud | Durable long-running processes — claim lifecycle, payment flows, legal escalations. Retries, timeouts, saga patterns. |
| Observability | Sentry + OpenTelemetry + Grafana | Errors in Sentry; metrics, traces, and logs via OpenTelemetry into Prometheus, Tempo, and Loki, visualized in Grafana. Unified from minute one. |
| Secrets | HashiCorp Vault | No more .env files with production credentials. |
| Feature Flags | Unleash | Safe migration cutover. Instant rollback per domain. |
| CI/CD | GitHub Actions → GHCR → DOKS | Push to main → build → test → deploy. Every service, same pipeline. |
| Admin UI Shell | React + Module Federation | One admin app, independently deployed domain modules. Shared navigation, auth, design system. |
Template improvements propagate to every generated service via `copier update`.
The Services (After Platform)
We decompose the monolith into ~10 domain services. Each service owns its own PostgreSQL database, publishes events to RabbitMQ, and exposes a REST API.
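The exact event contract lives in the analysts' specs, but a minimal envelope sketch shows the shape of what each service publishes (field names here are illustrative, not the actual contract):

```python
import json
import uuid
from datetime import datetime, timezone

def build_event(domain: str, event_type: str, payload: dict) -> str:
    """Build a JSON event envelope for publishing to RabbitMQ.

    Field names (event_id, occurred_at, ...) are illustrative only —
    the real contract is defined in the analyst event-contract specs.
    """
    envelope = {
        "event_id": str(uuid.uuid4()),           # idempotency key for consumers
        "event_type": f"{domain}.{event_type}",  # doubles as the routing key
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    return json.dumps(envelope)

# Example: the Lead service announcing a new claim intake
message = build_event("lead", "created", {"lead_id": 42, "jurisdiction": "EU261"})
decoded = json.loads(message)
```

Consumers in other services (and in the monolith) key deduplication on `event_id`, so replays are safe.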
| Service | Domain | Complexity |
|---|---|---|
| Dictionary | Reference data (countries, airports, airlines, jurisdictions) | Low |
| Lead | Claim intake wizard (8-step form, AI validation, jurisdiction check) | Medium |
| Documents | Upload, storage, classification, PDF generation | Medium |
| Messaging | Email (150+ templates, 29 languages), SMS, WhatsApp, Telegram — delivery infrastructure | Medium |
| Communication Engine | AI-powered conversation orchestration, webchat, Zendesk replacement, human escalation UI | High |
| Legal | 8 law firm integrations, court cases, distribution logic | High |
| Partners | B2B API, affiliate networks, brands, commissions | Medium |
| Payments | Stripe, Revolut, payouts, invoicing, Rivile/Navision | High |
| Users & Auth | Registration, authentication, profiles, Google OAuth | Medium |
| Claims Core | Claim lifecycle (Temporal workflows), state transitions, scoring, orchestration | Very High |
| Analytics & Search | Elasticsearch, reporting, dashboards | Medium |
Shared Message Broker
New services and the monolith share the same RabbitMQ instance. No bridge, no translation layer, no second messaging system. New services publish and consume on the same broker — the monolith already has 75 consumers running on it. This is the simplest possible approach and eliminates an entire category of migration complexity.
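Assuming topic exchanges (the usual RabbitMQ pattern for domain events), new services simply declare their own queue bindings next to the monolith's existing ones. A stdlib sketch of AMQP topic-binding semantics — `*` matches exactly one dot-separated word, `#` matches zero or more — shows how both sets of consumers select from the same event stream:

```python
def binding_matches(binding: str, routing_key: str) -> bool:
    """Return True if an AMQP topic binding matches a routing key.

    AMQP semantics: '*' matches exactly one dot-separated word,
    '#' matches zero or more words.
    """
    def match(b: list[str], k: list[str]) -> bool:
        if not b:
            return not k
        head, rest = b[0], b[1:]
        if head == "#":
            # '#' can absorb any number of words: try every split point
            return any(match(rest, k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        if head == "*" or head == k[0]:
            return match(rest, k[1:])
        return False

    return match(binding.split("."), routing_key.split("."))
```

So a monolith consumer bound to `claim.#` and a new service bound to `claim.*` both receive `claim.updated` from the same exchange, with no bridging.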
AI Services — Current State & Vision
Today, the Aerobots platform runs 5 AI services on DOKS that interact with the monolith:
| Service | What it does | CRM Integration |
|---|---|---|
| Voice Bot | Automated airline/customer calls (Twilio + Pipecat AI) | Receives call data via API, returns results via webhook |
| WA Bot | WhatsApp customer support (Claude agent loop) | Robot API (Basic Auth) + Port API (OAuth2) |
| CRM-MCS | Middleware — scrapes admin panel HTML, returns JSON | HTML scraping via Squid proxy (fragile) |
| Doc Moderator | Human-in-the-loop AI document classification review | Reads from CRM-MCS |
| Doc Classifier | Chrome extension for AI document classification | Calls CRM-MCS classify endpoint |
Phase 0 fix: Extend the Robot API. The monolith already has /robot/ endpoints that provide structured JSON (the WA bot uses them). We extend the existing Robot API with the missing operations that CRM-MCS currently gets via HTML scraping — claim search, documents, action history, scores. This eliminates the scraping layer and gives all AI services a proper API immediately.
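The new Robot API operations would keep the same Basic Auth scheme the WA Bot already uses (per the table above). A minimal stdlib sketch of validating such a header server-side — the credentials here are placeholders:

```python
import base64
import hmac

def check_basic_auth(authorization_header: str, user: str, password: str) -> bool:
    """Validate an HTTP Basic Auth header: 'Basic <base64(user:password)>'.

    Uses hmac.compare_digest for a constant-time comparison.
    """
    scheme, _, encoded = authorization_header.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        return False
    try:
        decoded = base64.b64decode(encoded).decode("utf-8")
    except Exception:
        return False
    return hmac.compare_digest(decoded, f"{user}:{password}")

# Placeholder credentials for illustration only
header = "Basic " + base64.b64encode(b"robot:s3cret").decode()
```

In the real service the expected credentials would come from Vault, not from code or `.env` files.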
Vision — two-layer communication architecture:
| Service | Owner | Responsibility |
|---|---|---|
| Messaging Service | CRM build team | Delivery infrastructure — template rendering, email/SMS/WhatsApp/Telegram delivery, delivery tracking, bounce handling. Any service can send a message through it. |
| Communication Engine | Aerobots AI team | AI layer — conversation orchestration, webchat assistant on claim form, email response engine, channel routing, human escalation, Zendesk replacement (Admin Shell module). |
| Voice Assistant | Aerobots AI team | Separate service — Twilio SIP, Pipecat real-time audio pipeline. Integrates with Communication Engine for orchestration but runs independently. |
Document classification moves from a Chrome extension hack into the Documents Service as a first-class API capability.
How We Migrate
Strategy: Phased Strangler Fig
We extract one domain at a time. The monolith keeps running throughout. Each extraction follows the same playbook:
- Design — reverse-engineer domain from monolith code + knowledge base
- Build — new service on the platform, behind feature flag
- Shadow — run in parallel, compare results with monolith
- Switch — flip feature flag, route traffic to new service
- Verify — monitor for issues, maintenance window for cutover
- Clean up — remove domain code from monolith, retire bridge routes
Phasing
What's NOT in Scope (Yet)
- Analytics & Search — follows after Claims Core is stable
- Monolith shutdown — only after all domains are extracted and stable
- Mobile apps — existing mobile experience continues via current APIs
- WordPress (www.skycop.com) — separate system, not affected
Decision Matrix
Key architectural choices with the alternatives considered. Each table ends with a Verdict row stating the winning option.
Language: Python vs Java
| Factor | Python / FastAPI | Java / Spring Boot |
|---|---|---|
| AI-assisted development | Better — AI tools generate more concise, higher-quality Python | Good but more boilerplate to generate and review |
| Existing codebase | 7 running services on Aerobots platform, 2 service templates | Zero Java services at SkyCop |
| Hiring | Larger pool for AI-capable generalists | Larger pool for enterprise backend |
| Type safety | Pydantic v2 + mypy strict — requires discipline | Native strong typing — compiler catches mistakes |
| Startup speed | Template to deployed service: hours | Days to a week with full Spring patterns |
| Long-term maintainability | Needs strict linting/typing conventions | Framework conventions enforce consistency |
| Verdict | Python — velocity + existing patterns + AI acceleration | |
Message Broker: RabbitMQ vs Kafka
| Factor | RabbitMQ | Kafka |
|---|---|---|
| Current state | Already running, 75 consumers | Only in AI module (emerging) |
| Migration complexity | New services share same broker — zero bridging needed | Would need Kafka↔RabbitMQ bridge for 18+ months |
| Event replay | Not built-in | Native — can re-read events |
| Scale fit | Handles our volume trivially | Designed for millions/sec — massive overkill |
| Ops complexity | Simple, well-understood | Significant — ZooKeeper/KRaft, partitions, consumer groups |
| Verdict | RabbitMQ — already running, no bridge needed, right-sized | |
Workflow Engine: Temporal vs Camunda vs Code-based
| Factor | Temporal | Camunda | Code-based state machine |
|---|---|---|---|
| Durability | Built-in — survives restarts, resumes from exact step | Built-in (BPMN engine) | Manual — DB state tracking + cron polling |
| Python support | Excellent SDK | Java-native, Python via REST/gRPC | Full control |
| Long-running processes | Native — timers span weeks/months | Native (BPMN timers) | Cron jobs checking "is it time yet?" |
| Retries / timeouts | Declarative retry policies | BPMN error handling | Hand-rolled try/catch + backoff |
| Saga pattern | Built-in compensation logic | BPMN compensation events | Manual rollback code |
| Visibility | Web UI showing every workflow, step, wait status | Cockpit UI, BPMN visualization | Custom admin pages |
| Ops overhead | Temporal Cloud = zero ops ($200–400/mo) | Self-hosted = heavy (Java, DB, Elasticsearch) | Zero infra, but code complexity grows |
| Learning curve | Moderate — new concepts (workflows, activities) | Steep — BPMN notation, engine concepts | Low — plain code |
| Verdict | Temporal Cloud — durable workflows for claim lifecycle, zero ops | ||
Frontend: Unified SPA vs Micro-frontends vs Keep Twig
| Factor | App Shell + Module Federation | Monolithic React SPA | Keep Twig during migration |
|---|---|---|---|
| Operator UX | One app, one login, consistent | One app, one login, consistent | Existing UX, no improvement for 1–2 years |
| Independent deployment | Yes — each domain module deploys separately | No — one bug blocks all domains | N/A — monolith frontend |
| Build speed | Each module built independently | Entire SPA rebuilds | No frontend builds needed |
| Migration path | Add modules as domains are extracted | Big-bang frontend rewrite needed | Incremental, but old Twig stays |
| Upfront effort | Shell + design system (~2 weeks) | Full SPA upfront | Zero |
| Verdict | App Shell + Module Federation — unified UX, independent deployment | ||
Database: PostgreSQL only vs PostgreSQL + MariaDB access
| Factor | PostgreSQL only | PostgreSQL + MariaDB reads |
|---|---|---|
| Separation | Clean — new services never touch old DB | Coupling — new services depend on old schema |
| Sync complexity | Need ETL/events to move data | Direct reads, simpler short-term |
| Migration end state | Clean cutover when monolith shuts down | Must detach MariaDB reads later — deferred pain |
| Verdict | PostgreSQL only — clean separation from day one | |
Technology Decisions
| Area | Decision | Rationale |
|---|---|---|
| Language | Python 3.13+ | AI-assisted development velocity. Existing patterns from Aerobots. Larger hiring pool for AI-capable devs. |
| Framework | FastAPI | Async, fast, auto-generated OpenAPI docs, Pydantic validation. Proven in 7 existing services. |
| Database | PostgreSQL | New services only. No direct MariaDB access. Clean separation via RabbitMQ events + ETL. |
| Messaging | RabbitMQ | Already in production. New services share the same broker as the monolith — no bridge, no second system. |
| Workflows | Temporal Cloud | Durable workflow execution for claim lifecycle. Retries, timeouts, saga patterns. No infrastructure to manage. |
| Frontend | React + Module Federation | Admin shell with independently deployed domain modules. Operators see one app. Teams deploy independently. |
| Type Safety | mypy strict + Pydantic v2 + ruff | Compensates for Python's dynamic typing. All domain models are Pydantic, all public APIs typed, 30+ lint categories enforced. |
| Template | Copier | Parameterized scaffolding. Auth, testing, CI/CD, Docker built in. copier update propagates improvements to all services. |
| Infrastructure | DOKS | Existing cluster from Aerobots. K8s provides auto-scaling, health checks, rolling deploys, namespace isolation. |
Team
| Role | Who | Scope |
|---|---|---|
| Fullstack developer | Anatoliy Gusev (existing) | Admin Shell, frontend modules, backend services. Already has context on SkyCop systems. |
| 2 Senior Python developers | To hire | Platform hardening, backend services, Claims Core reverse-engineering. |
| Advisor | Pavel Tarasov (part-time, not on build team) | DOKS/deployment patterns, monitoring, Aerobots platform knowledge transfer. Continues AI initiatives separately. |
| System analysts | Existing | Review designs produced by the build team. Domain knowledge support. |
| PHP team (split) | Existing | Strongest devs retrain to Python over time. Others maintain monolith — bug fixes, feature flags, sync mechanisms. |
Key Principles
Timeline
Target: 15 months with a team of 3 developers (1 fullstack + 2 Python) + AI-assisted development. Phases overlap where possible — 3 devs means parallel work.
| Phase | Estimate | Notes |
|---|---|---|
| 0: Platform | 2–3 weeks | Anatoliy starts Admin Shell while Python devs harden the template |
| 1: LEAD Domain | 3–5 weeks | Analyst specs ready, mostly code generation from OpenAPI |
| 2: Content | 4–6 weeks | Documents + Messaging parallelized across 2 devs |
| 3: External | 5–8 weeks | Legal + Partners parallelized |
| 4: Financial | 6–8 weeks | Payments needs extra care regardless of speed |
| 5: Core | 12–18 weeks | Claims Core — the hard one. Temporal helps but domain complexity is real |
| Total | 32–48 weeks (~8–12 months) | Buffer to 15 months for unknowns |
Data Strategy
- New services own PostgreSQL. No new code touches MariaDB.
- Sync via RabbitMQ events + ETL. Data flows from MariaDB to PostgreSQL for each extracted domain. New services share the same RabbitMQ as the monolith — no bridge needed.
- Legacy IDs kept forever. Every new table has a `legacy_id` column as a permanent audit trail linking back to the original MariaDB records.
- Partner API: break with notice. Partners get 3 months to migrate to the new API. Migration guide provided.
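The `legacy_id`-keyed sync can be sketched as an idempotent upsert, so a replayed event just overwrites rather than duplicating. Here stdlib `sqlite3` stands in for PostgreSQL, and the `claims` table is a hypothetical example:

```python
import sqlite3

# sqlite3 stands in for PostgreSQL here; the UPSERT shape is the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE claims (
        id        INTEGER PRIMARY KEY,
        legacy_id INTEGER UNIQUE,    -- permanent link back to MariaDB
        status    TEXT NOT NULL
    )
    """
)

def sync_claim(legacy_id: int, status: str) -> None:
    """Idempotent ETL step: replay-safe upsert keyed on legacy_id."""
    conn.execute(
        """
        INSERT INTO claims (legacy_id, status) VALUES (?, ?)
        ON CONFLICT(legacy_id) DO UPDATE SET status = excluded.status
        """,
        (legacy_id, status),
    )

sync_claim(1001, "submitted")
sync_claim(1001, "approved")  # replayed/updated event overwrites in place
row = conn.execute("SELECT status FROM claims WHERE legacy_id = 1001").fetchone()
```

Idempotency matters because RabbitMQ delivery is at-least-once: the same MariaDB change may arrive more than once.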
Immediate Next Steps
What Exists Today
We're not starting from zero. These assets feed directly into the migration:
| Asset | What it provides |
|---|---|
| Two service templates (`new-project-template` + `aerbots-mcs-template`) | Will be combined into one platform template. Auth, CI/CD, Docker from `new-project-template`; functional testing, worker patterns, structured logging from `aerbots-mcs-template`. |
| Aerobots platform (existing DOKS cluster) | Running K8s cluster with 7 Python services, CI/CD patterns, monitoring |
| Knowledge base (`skycop-knowledge`) | 220 docs covering 12 domains, 200+ entities, 30+ integrations, claim state machine |
| Analyst designs (Confluence SS3) | Phase I specs: Dictionary Service OpenAPI, Lead Service database schema, AI validation logic, event contracts |
| Monolith itself | The source of truth for all business logic, 551 migrations of schema evolution, 75 RabbitMQ consumer patterns |