# Keystone Implementation Review — Gaps vs `docs/implementation-spec.md` The schema/migrations/models are about 98% correct. The orchestration, drivers, UI, and tests have substantial gaps. Below are concrete, file-anchored issues to fix. ## Critical orchestration bugs ### 1. Operation parent/child hierarchy is flat — replicas are siblings of their service_deploy, not children **Spec §3 example** nests `service_deploy → replica_deploy`. **`app/Jobs/Environments/DeployEnvironment.php:74-82`** creates both as siblings under the same `environment_deploy` parent. Replica_deploy siblings rely on `RunStep::dispatchNextSiblingOperation` (`app/Jobs/Services/RunStep.php:120-140`) ordering by `id` — fragile, and if `service_deploy` fails, replica operations are not cancelled. **Fix:** Nest replica operations under their service's `service_deploy` operation (parent_id = service_deploy.id), and cascade-cancel children when a parent fails. ### 2. Failed operations don't cancel siblings or children `RunStep::failed` (`app/Jobs/Services/RunStep.php:163-178`) only cancels the failed operation's remaining steps. Sibling/child operations under the same parent continue to dispatch via `dispatchNextSiblingOperation`. A failing service deploy will still trigger gateway cutover. **Fix:** On step failure, mark parent + all descendant operations as `CANCELLED`/`FAILED`. Re-check during sibling dispatch. ### 3. Gateway cutover uses a hardcoded container name that doesn't exist `DeployEnvironment.php:202` runs `docker exec keystone-caddy caddy reload ...`, but Caddy replicas are created with name `keystone-service-{service->id}-{N}` per `DeployEnvironment.php:155`. The `keystone-caddy` container is never created — cutover will always fail. **Fix:** Look up the Caddy service's replica container name; or set a stable container_name in Caddy compose. ### 4. Gateway cutover is monolithic; no add-upstream/reload/drain sub-sequence **Spec §15** requires: render new replica → health check → add new upstream → reload → drain old → stop old. `DeployEnvironment.php:201-206` does only `caddy reload && sleep 10 && stop draining`. There's no add-upstream step (Caddyfile is fully overwritten at `slice_configure` time), no real health check during cutover, and `sleep 10` is an arbitrary drain. **Fix:** Split into separate steps with explicit ordering, and tie drain to active connections / Caddy upstream health. ### 5. `dispatchChildOperations` dispatches only the first child's first step `DeployEnvironment.php:377-387` dispatches a single step. Continuation depends on `RunStep::dispatchNextSiblingOperation` chasing siblings by id. If a child operation has zero steps (e.g. `service_deploy` for a service whose driver returned no plan), the chain dies silently. **Fix:** Make a single orchestrator (e.g. `DispatchOperationChain`) that knows how to walk the tree. Don't rely on implicit id-ordering between independently-created sibling operations. ## Driver contract & implementations ### 6. `Driver` base contract is anemic **Spec §9** lists 13 required driver capabilities. `app/Drivers/Driver.php` only declares `__construct` and `getOperationPlan`. Image policy, ports, volumes, env schema, health checks, resource defaults, slice types, env exports, firewall, update behavior are scattered across drivers without contractual enforcement. **Fix:** Define an interface with explicit methods (`type()`, `versionTrack()`, `defaultPorts()`, `firewallRules()`, `updateBehavior()`, etc.) and assert per-driver via tests. ### 7. `Caddy2Driver::buildCaddyfile()` reads an undefined field — dead code `app/Drivers/Caddy/Caddy2Driver.php:46` references `$this->service->credentials['backend']`. Nothing ever sets this. The actual Caddyfile is rendered inline by `DeployEnvironment::configureCaddyRouteScript` (`app/Jobs/Environments/DeployEnvironment.php:321-335`). **Fix:** Delete `buildCaddyfile()`, or move Caddyfile generation into the driver and remove the duplicate in `DeployEnvironment`. ### 8. Postgres18Driver has no slice provisioning in its operation plan **Spec §12:** "Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container." `AttachManagedService::createSliceProvisionOperation` (`app/Actions/Environments/AttachManagedService.php:108-125`) hardcodes the SQL script outside the driver. Slice provisioning logic belongs in the driver so other Postgres versions can implement it. **Fix:** Add `provisionSliceScript(ServiceSlice $slice): string` to the slice contract; have Postgres driver own the SQL. ### 9. Postgres provision script assumes a `keystone` admin user that is never created `AttachManagedService.php:133` uses `($service->credentials ?? [])['user'] ?? 'keystone'`. The Postgres service is never seeded with admin credentials and the compose for Postgres never sets `POSTGRES_USER=keystone`. This will fail in production. **Fix:** Establish admin credentials on Postgres service creation (write to `service->credentials`); pass via `POSTGRES_USER`/`POSTGRES_PASSWORD` in compose env. ### 10. Stateful update steps are placeholder strings that won't actually run `app/Actions/Services/CreateStatefulServiceUpdateOperation.php:38-42`: - `'docker compose down'` — no `-f path` so it runs in the SSH user's home directory. - `'docker volume ls'` — listing isn't "preserving"; it's a no-op. - `'docker compose up -d'` — no `-f path`, no image digest, no env update. - `'docker compose ps'` — not a real health check. Spec §11 specifies a real sequence: stop → preserve named volume → start new with updated digest → health check. **Fix:** Build steps from the driver against the service's actual compose path; verify the named volume exists before/after; replace healthcheck stub with `docker inspect --format '{{.State.Health.Status}}'` polling. ### 11. Stateful update doesn't write the updated digest into compose before restart The operation sets `available_image_digest` on the service (line 55) but the compose file on disk is not re-rendered. `docker compose up -d` will pull from whatever digest is currently in the compose, not the new one. **Fix:** Insert a "Render compose with new digest" step before the start step. ### 12. Valkey driver doesn't emit role-based env vars `app/Drivers/Valkey/Valkey8Driver.php:42-48` only emits `REDIS_HOST` and `REDIS_PORT`. **Spec §13** explicitly recommends `CACHE_STORE=redis`, `SESSION_DRIVER=redis`, `QUEUE_CONNECTION=redis` based on attachment role. **Fix:** Read the `EnvironmentAttachment.role` and add the appropriate Laravel env defaults (with the "Do not silently change queue behavior without confirmation" guard from §12). ### 13. Valkey has no logical-DB isolation `AttachManagedService.php:64-71` creates a `logical_database` slice but never assigns a Redis database index (`REDIS_DB`). All environments attached to the same Valkey service share DB 0. **Fix:** Assign `REDIS_DB` per slice; include in `environmentExportsForSlice`. ### 14. Caddy and Valkey slices have no `SLICE_PROVISION` operation `AttachManagedService::createSliceProvisionOperation` (line 110) early-returns unless `service->type === POSTGRES`. Caddy routes and Valkey logical DBs are created in the DB but never reconciled to the running service. **Fix:** Emit slice operations for all service types whose driver supports slices. ### 15. Postgres driver doesn't export DB_* at the service level `Postgres18Driver::environmentExports()` returns `[]`. That's fine, but only slice exports work, so if a service has no slice but a Laravel app references DB_HOST, nothing wires it. **Fix:** Either guarantee a slice always exists for Postgres attachments, or emit DB_HOST at the service level via the attachment. ## Deployment flow ### 16. Migration timing is hardcoded to pre_switch `DeployEnvironment::serviceDeployScripts` (`app/Jobs/Environments/DeployEnvironment.php:236-238`) always emits the migration step before "Deploy replicas". **Spec §18** lists `migration_timing: pre_switch | post_switch` on service config — never read. **Fix:** Check `$service->config['migration_timing']` and either emit the migration step before replicas or after the gateway cutover. ### 17. Migration mode `manual` is ignored `migrationScript` (line 292-301) only short-circuits when `migration_mode=disabled`. `manual` still auto-runs `php artisan migrate --force` during environment deploy. Spec §18 says manual mode should not run automatically. **Fix:** Treat `manual` the same as `disabled` for environment deploys; only the dedicated `environment-migrations.store` controller should run it. ### 18. Two parallel migration code paths `DeployEnvironment::migrationScript` (line 292), `LaravelRuntimeDriver::getOperationPlan`, and `EnvironmentMigrationController` all emit migration scripts independently. They will drift. **Fix:** Centralize in one place (driver method or dedicated action) and call from all three. ### 19. "Update gateway routes" step is a no-op `DeployEnvironment.php:248-250`: ``` 'script' => 'test -f /home/keystone/gateway/Caddyfile', ``` This just checks file presence. The actual route update is in a separate `slice_configure` operation, so this step is dead code in the service-deploy chain. **Fix:** Either remove the step or have it actually trigger the route update for this service. ### 20. Pre-switch service steps from spec §17 step 6 are missing entirely No "pre-switch" hooks are emitted by `DeployEnvironment` or any driver. **Fix:** Add a `preSwitchSteps(): array` driver capability and call it before the migration/replica steps. ### 21. Multi-server replica placement is not implemented `DeployEnvironment::ensureServiceReplicas` (line 147-171) always assigns `server_id = $service->server_id`. There is no way to place replicas across multiple servers — yet the registry-required check at line 39-41 assumes multi-server deployments exist. **Fix:** Either ship single-server-only v1 and remove the multi-server gate, or wire `Service.process_roles`/placement policy to multiple servers in `ensureServiceReplicas`. ### 22. No explicit `docker pull @` on target servers for multi-server `replicaDeployScripts` (line 261-290) does `docker compose up -d` only; for multi-server, target servers need to pull from the registry by digest. There is no pull step. **Fix:** Add `docker pull` step using `registry_ref` + digest before the `up -d` step on each target server. ### 23. Build strategy `dedicated_builder` and `external_registry` are not enforced `BuildApplicationArtifact::execute` (`app/Actions/Environments/BuildApplicationArtifact.php:30`) picks any server via `buildServer()`. The push is only conditionally added when strategy === `EXTERNAL_REGISTRY` (line 89). There's no enforcement that `dedicated_builder` requires a builder service to exist, nor that `external_registry` skips local build entirely. **Fix:** Branch on strategy at the top of `execute()`; for `external_registry`, skip the build and resolve the digest from the registry (`docker manifest inspect`); for `dedicated_builder`, fail if no builder service is provisioned. ### 24. Scheduler placement is not enforced at runtime **Spec §8:** `single` mode runs `schedule:run` on exactly one replica; `every_replica` runs on all. `LaravelRuntimeDriver` sets `AUTORUN_LARAVEL_SCHEDULER=true` based on the service's `process_roles`, but nothing applies the env per-replica based on `scheduler_target_service_id`/`scheduler_mode`. All replicas of the target service end up with the env var. **Fix:** When generating replica config, only emit `AUTORUN_LARAVEL_SCHEDULER=true` on the elected replica when `scheduler_mode=single`. The existing `PlanEnvironmentDeployment::blockers` (`app/Actions/Environments/PlanEnvironmentDeployment.php:79-87`) is a pre-flight check, not an enforcement. ### 25. Compose file is generated via shell heredoc instead of a real upload `composeUploadScript` (line 303-319) inlines the compose body in a `cat <<'KEYSTONE_COMPOSE'` heredoc. Any quoting issue (binary, single-quote in env, large file) breaks. Spec §16 implies generated artifacts should be transferred via SSH/SCP. **Fix:** Use SCP (or `ssh ... 'cat > path'` with binary-safe encoding) instead of heredoc. Also drop separate generation of `.env` files — currently only the compose is uploaded; `.env` references in compose will 404 on disk. ### 26. `.env` files never written to disk **Spec §16** layout includes `/home/keystone/services//.env`. `composeUploadScript` writes only `compose.yml`. If the compose `env_file: .env` directive is rendered, the deploy will fail. **Fix:** Render and upload `.env` alongside `compose.yml`. ## Models, schema, and encryption ### 27. Service.credentials migration column is plaintext but model casts as `encrypted:array` Migration `database/migrations/2025_03_27_121050_create_services_table.php:35` declares `text('credentials')`, while `app/Models/Service.php:36` casts it as `encrypted:array`. The cast works (Laravel encrypts on write) — but a `text` column without explicit `nullable()`/encoding intent is confusing. Confirm cast actually encrypts (it does for `encrypted:*` casts) and document. **Fix:** Add `->nullable()` and a comment in the migration noting the field is encrypted at the model layer. ### 28. `EnvironmentVariable.value` cast missing array-vs-string handling Spec doesn't require it, but worth flagging: the cast is `encrypted` (scalar), which means complex values (JSON-encoded secrets) need explicit json_encode by callers. Currently `AttachManagedService.php:97-104` always passes scalar values, so this is OK. ## UI and onboarding ### 29. Onboarding (Spec §19) is not implemented No onboarding controller, no routes, no Inertia pages. The spec calls for a guided flow: organisation → provider → source → deploy key → registry → server → app/env → attachments. Currently users must navigate disjoint pages manually. **Fix:** Add an `OnboardingController` with a state machine on `Organisation` (or session-based progress) plus an Inertia wizard. ### 30. Service detail/edit pages are missing `resources/js/pages/services/` only contains `updates/Create.vue`. There's no Index/Show/Edit. Spec §20 Phase 6 calls for "services under an environment with sensible defaults". **Fix:** Add service Show/Edit pages, including replica health, slices, and one-click update. ### 31. No "managed attachment" guided flow `resources/js/pages/environment-attachments/Create.vue` exposes a raw service list. Spec §12 + §20 call for managed flows for Postgres / Valkey / Caddy with auto-defaulted slices. **Fix:** Build a guided picker per role (database / cache / queue / storage / gateway) that filters services to compatible types and previews the generated slice + env vars. ### 32. Deploy policies are visible / no defaults hiding Spec §20 Phase 6: "Hide deploy policies by default." Currently they're set/exposed via `Service` form requests (`StoreServiceRequest`). No UI hiding. **Fix:** Don't expose `deploy_policy` in service create/edit UI; rely on driver-provided defaults from spec §2. ### 33. `resources/js/pages/applications/Show.vue` has a stale stub comment Line 72: `` — references the removed `Instance` model. **Fix:** Remove the stale comment; add the actual "New environment" button. ### 34. `resources/js/pages/servers/Index.vue` contains a `@todo pagination` literal Ship-ready code shouldn't have TODOs visible to the user. **Fix:** Either implement pagination or remove the comment. ### 35. No UI surfaces variable source / overridable `EnvironmentVariableController::store` hardcodes `source=USER`. The UI in `environment-variables/Create.vue` provides no way to see managed vs. user vars, and the spec §13 requires the source/overridable badge. **Fix:** Read variables on the environment Show page grouped by source; for `managed_attachment` rows, show a "managed by Postgres slice X" badge and disable editing unless `overridable`. ### 36. No environment Show page; deploys/migrations etc. are triggered from `applications/Show.vue` per-environment row This works but spec §20 Phase 6 wants environments to be the primary surface. Currently there's no environment detail page where services, replicas, slices, attachments, env vars, and operations are visible together. **Fix:** Add `environments/Show.vue` and route `applications/Show.vue`'s environment row to it. ## Tests — coverage gaps ### 37. `EnvironmentDeploymentControllerTest` only asserts dispatch Short test, no state assertions after run. **Fix:** Replace `Bus::fake()` with running the job inline; assert that operations, steps, replica records, compose files, and env vars are all in expected state. ### 38. No test asserts deploy key cleanup after build `BuildApplicationArtifact.php:100-101` adds the cleanup trap. No test verifies the trap actually runs or that the key file is gone. **Fix:** In `BuildApplicationArtifactTest`, fake the remote runner and assert the build script contains `trap cleanup EXIT` AND that the operation_dir path resolves under `/home/keystone/operations/`. ### 39. No test asserts the named volume naming convention Spec §10 names volumes `keystone_service__postgres_data`. No test in `ComposeRendererTest` checks volume name format. **Fix:** Snapshot-test or regex-assert the volume name pattern in `ComposeRendererTest`. ### 40. No test for parent-child operation chain executing end-to-end `DeployEnvironmentJobTest` creates operations but never runs the resulting `RunStep` jobs through `Queue::handleAll()`-style flow. **Fix:** Add an integration test that fakes the SSH layer (return canned success) and lets the chain run, asserting each operation transitions to `COMPLETED` in the correct order. ### 41. No test for cancellation cascade on failure None of the test files exercise `RunStep::failed` with sibling/child cancellation expectations (because that behavior doesn't exist yet — see gap #2). **Fix:** Add a test that fails a step mid-chain and asserts all later operations are `CANCELLED`. ### 42. No test for stateful update flow against a Postgres service `StatefulServiceUpdateTest` likely asserts only operation/step rows. Need: rendered script asserts compose path is correct, named volume is preserved, new digest is written. **Fix:** Strengthen assertions to validate the full step contents. ### 43. No test for multi-server build/push/pull `BuildArtifactPlanningTest` checks `requiresRegistry=true` but no test asserts a `docker push` and per-target `docker pull` actually occur. **Fix:** Add a job-level test with two-server topology and assert each target's deploy script includes a `docker pull @` before `compose up`. ### 44. No test that `manual`/`disabled` migration modes are honored **Fix:** Parametrized test asserting `migrationScript` returns `'true'` for `disabled` and `manual` modes, and the real command for `auto`. ### 45. No test for scheduler enforcement per replica **Fix:** Test that for `scheduler_mode=single`, only one replica's rendered env has `AUTORUN_LARAVEL_SCHEDULER=true`; for `every_replica`, all do. ### 46. No test that managed attachment auto-creates slices for Valkey + Caddy `ManagedAttachmentTest` likely tests Postgres only. **Fix:** Extend dataset to cover Valkey logical_database and Caddy route slices, with their env exports. --- ## Suggested ordering 1. Fix the orchestration bugs (#1-#5) — without these the chain doesn't reliably reach completion. 2. Fix the Caddy cutover (#3, #4, #7, #19) — without these no environment can serve traffic. 3. Fix Postgres slice provision admin user (#9) and stateful update scripts (#10, #11). 4. Implement migration timing/mode (#16, #17, #18). 5. Implement scheduler enforcement (#24) and multi-server placement + pull (#21, #22, #23). 6. Then UI: onboarding (#29), environment Show page (#36), managed attachment UI (#31), variable source display (#35). 7. Strengthen tests last (#37-#46) once the orchestrator and drivers are stable. Most of the schema and the high-level structure are correct — the gap is between the data model and the runtime behavior that's supposed to enforce/realize it.