Files
keystone/docs/implementation-review.md

221 lines
20 KiB
Markdown

# Keystone Implementation Review — Gaps vs `docs/implementation-spec.md`
The schema/migrations/models are about 98% correct. The orchestration, drivers, UI, and tests have substantial gaps. Below are concrete, file-anchored issues to fix.
## Critical orchestration bugs
### 1. Operation parent/child hierarchy is flat — replicas are siblings of their service_deploy, not children
**Spec §3 example** nests `service_deploy → replica_deploy`. **`app/Jobs/Environments/DeployEnvironment.php:74-82`** creates both as siblings under the same `environment_deploy` parent. Replica_deploy siblings rely on `RunStep::dispatchNextSiblingOperation` (`app/Jobs/Services/RunStep.php:120-140`) ordering by `id` — fragile, and if `service_deploy` fails, replica operations are not cancelled.
**Fix:** Nest replica operations under their service's `service_deploy` operation (parent_id = service_deploy.id), and cascade-cancel children when a parent fails.
### 2. Failed operations don't cancel siblings or children
`RunStep::failed` (`app/Jobs/Services/RunStep.php:163-178`) only cancels the failed operation's remaining steps. Sibling/child operations under the same parent continue to dispatch via `dispatchNextSiblingOperation`. A failing service deploy will still trigger gateway cutover.
**Fix:** On step failure, mark parent + all descendant operations as `CANCELLED`/`FAILED`. Re-check during sibling dispatch.
### 3. Gateway cutover uses a hardcoded container name that doesn't exist
`DeployEnvironment.php:202` runs `docker exec keystone-caddy caddy reload ...`, but Caddy replicas are created with name `keystone-service-{service->id}-{N}` per `DeployEnvironment.php:155`. The `keystone-caddy` container is never created — cutover will always fail.
**Fix:** Look up the Caddy service's replica container name; or set a stable container_name in Caddy compose.
### 4. Gateway cutover is monolithic; no add-upstream/reload/drain sub-sequence
**Spec §15** requires: render new replica → health check → add new upstream → reload → drain old → stop old. `DeployEnvironment.php:201-206` does only `caddy reload && sleep 10 && stop draining`. There's no add-upstream step (Caddyfile is fully overwritten at `slice_configure` time), no real health check during cutover, and `sleep 10` is an arbitrary drain.
**Fix:** Split into separate steps with explicit ordering, and tie drain to active connections / Caddy upstream health.
### 5. `dispatchChildOperations` dispatches only the first child's first step
`DeployEnvironment.php:377-387` dispatches a single step. Continuation depends on `RunStep::dispatchNextSiblingOperation` chasing siblings by id. If a child operation has zero steps (e.g. `service_deploy` for a service whose driver returned no plan), the chain dies silently.
**Fix:** Make a single orchestrator (e.g. `DispatchOperationChain`) that knows how to walk the tree. Don't rely on implicit id-ordering between independently-created sibling operations.
## Driver contract & implementations
### 6. `Driver` base contract is anemic
**Spec §9** lists 13 required driver capabilities. `app/Drivers/Driver.php` only declares `__construct` and `getOperationPlan`. Image policy, ports, volumes, env schema, health checks, resource defaults, slice types, env exports, firewall, update behavior are scattered across drivers without contractual enforcement.
**Fix:** Define an interface with explicit methods (`type()`, `versionTrack()`, `defaultPorts()`, `firewallRules()`, `updateBehavior()`, etc.) and assert per-driver via tests.
### 7. `Caddy2Driver::buildCaddyfile()` reads an undefined field — dead code
`app/Drivers/Caddy/Caddy2Driver.php:46` references `$this->service->credentials['backend']`. Nothing ever sets this. The actual Caddyfile is rendered inline by `DeployEnvironment::configureCaddyRouteScript` (`app/Jobs/Environments/DeployEnvironment.php:321-335`).
**Fix:** Delete `buildCaddyfile()`, or move Caddyfile generation into the driver and remove the duplicate in `DeployEnvironment`.
### 8. Postgres18Driver has no slice provisioning in its operation plan
**Spec §12:** "Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container." `AttachManagedService::createSliceProvisionOperation` (`app/Actions/Environments/AttachManagedService.php:108-125`) hardcodes the SQL script outside the driver. Slice provisioning logic belongs in the driver so other Postgres versions can implement it.
**Fix:** Add `provisionSliceScript(ServiceSlice $slice): string` to the slice contract; have Postgres driver own the SQL.
### 9. Postgres provision script assumes a `keystone` admin user that is never created
`AttachManagedService.php:133` uses `($service->credentials ?? [])['user'] ?? 'keystone'`. The Postgres service is never seeded with admin credentials and the compose for Postgres never sets `POSTGRES_USER=keystone`. This will fail in production.
**Fix:** Establish admin credentials on Postgres service creation (write to `service->credentials`); pass via `POSTGRES_USER`/`POSTGRES_PASSWORD` in compose env.
### 10. Stateful update steps are placeholder strings that won't actually run
`app/Actions/Services/CreateStatefulServiceUpdateOperation.php:38-42`:
- `'docker compose down'` — no `-f path` so it runs in the SSH user's home directory.
- `'docker volume ls'` — listing isn't "preserving"; it's a no-op.
- `'docker compose up -d'` — no `-f path`, no image digest, no env update.
- `'docker compose ps'` — not a real health check.
Spec §11 specifies a real sequence: stop → preserve named volume → start new with updated digest → health check.
**Fix:** Build steps from the driver against the service's actual compose path; verify the named volume exists before/after; replace healthcheck stub with `docker inspect --format '{{.State.Health.Status}}'` polling.
### 11. Stateful update doesn't write the updated digest into compose before restart
The operation sets `available_image_digest` on the service (line 55) but the compose file on disk is not re-rendered. `docker compose up -d` will pull from whatever digest is currently in the compose, not the new one.
**Fix:** Insert a "Render compose with new digest" step before the start step.
### 12. Valkey driver doesn't emit role-based env vars
`app/Drivers/Valkey/Valkey8Driver.php:42-48` only emits `REDIS_HOST` and `REDIS_PORT`. **Spec §13** explicitly recommends `CACHE_STORE=redis`, `SESSION_DRIVER=redis`, `QUEUE_CONNECTION=redis` based on attachment role.
**Fix:** Read the `EnvironmentAttachment.role` and add the appropriate Laravel env defaults (with the "Do not silently change queue behavior without confirmation" guard from §12).
### 13. Valkey has no logical-DB isolation
`AttachManagedService.php:64-71` creates a `logical_database` slice but never assigns a Redis database index (`REDIS_DB`). All environments attached to the same Valkey service share DB 0.
**Fix:** Assign `REDIS_DB` per slice; include in `environmentExportsForSlice`.
### 14. Caddy and Valkey slices have no `SLICE_PROVISION` operation
`AttachManagedService::createSliceProvisionOperation` (line 110) early-returns unless `service->type === POSTGRES`. Caddy routes and Valkey logical DBs are created in the DB but never reconciled to the running service.
**Fix:** Emit slice operations for all service types whose driver supports slices.
### 15. Postgres driver doesn't export DB_* at the service level
`Postgres18Driver::environmentExports()` returns `[]`. That's fine, but only slice exports work, so if a service has no slice but a Laravel app references DB_HOST, nothing wires it.
**Fix:** Either guarantee a slice always exists for Postgres attachments, or emit DB_HOST at the service level via the attachment.
## Deployment flow
### 16. Migration timing is hardcoded to pre_switch
`DeployEnvironment::serviceDeployScripts` (`app/Jobs/Environments/DeployEnvironment.php:236-238`) always emits the migration step before "Deploy replicas". **Spec §18** lists `migration_timing: pre_switch | post_switch` on service config — never read.
**Fix:** Check `$service->config['migration_timing']` and either emit the migration step before replicas or after the gateway cutover.
### 17. Migration mode `manual` is ignored
`migrationScript` (line 292-301) only short-circuits when `migration_mode=disabled`. `manual` still auto-runs `php artisan migrate --force` during environment deploy. Spec §18 says manual mode should not run automatically.
**Fix:** Treat `manual` the same as `disabled` for environment deploys; only the dedicated `environment-migrations.store` controller should run it.
### 18. Two parallel migration code paths
`DeployEnvironment::migrationScript` (line 292), `LaravelRuntimeDriver::getOperationPlan`, and `EnvironmentMigrationController` all emit migration scripts independently. They will drift.
**Fix:** Centralize in one place (driver method or dedicated action) and call from all three.
### 19. "Update gateway routes" step is a no-op
`DeployEnvironment.php:248-250`:
```
'script' => 'test -f /home/keystone/gateway/Caddyfile',
```
This just checks file presence. The actual route update is in a separate `slice_configure` operation, so this step is dead code in the service-deploy chain.
**Fix:** Either remove the step or have it actually trigger the route update for this service.
### 20. Pre-switch service steps from spec §17 step 6 are missing entirely
No "pre-switch" hooks are emitted by `DeployEnvironment` or any driver.
**Fix:** Add a `preSwitchSteps(): array` driver capability and call it before the migration/replica steps.
### 21. Multi-server replica placement is not implemented
`DeployEnvironment::ensureServiceReplicas` (line 147-171) always assigns `server_id = $service->server_id`. There is no way to place replicas across multiple servers — yet the registry-required check at line 39-41 assumes multi-server deployments exist.
**Fix:** Either ship single-server-only v1 and remove the multi-server gate, or wire `Service.process_roles`/placement policy to multiple servers in `ensureServiceReplicas`.
### 22. No explicit `docker pull <ref>@<digest>` on target servers for multi-server
`replicaDeployScripts` (line 261-290) does `docker compose up -d` only; for multi-server, target servers need to pull from the registry by digest. There is no pull step.
**Fix:** Add `docker pull` step using `registry_ref` + digest before the `up -d` step on each target server.
### 23. Build strategy `dedicated_builder` and `external_registry` are not enforced
`BuildApplicationArtifact::execute` (`app/Actions/Environments/BuildApplicationArtifact.php:30`) picks any server via `buildServer()`. The push is only conditionally added when strategy === `EXTERNAL_REGISTRY` (line 89). There's no enforcement that `dedicated_builder` requires a builder service to exist, nor that `external_registry` skips local build entirely.
**Fix:** Branch on strategy at the top of `execute()`; for `external_registry`, skip the build and resolve the digest from the registry (`docker manifest inspect`); for `dedicated_builder`, fail if no builder service is provisioned.
### 24. Scheduler placement is not enforced at runtime
**Spec §8:** `single` mode runs `schedule:run` on exactly one replica; `every_replica` runs on all. `LaravelRuntimeDriver` sets `AUTORUN_LARAVEL_SCHEDULER=true` based on the service's `process_roles`, but nothing applies the env per-replica based on `scheduler_target_service_id`/`scheduler_mode`. All replicas of the target service end up with the env var.
**Fix:** When generating replica config, only emit `AUTORUN_LARAVEL_SCHEDULER=true` on the elected replica when `scheduler_mode=single`. The existing `PlanEnvironmentDeployment::blockers` (`app/Actions/Environments/PlanEnvironmentDeployment.php:79-87`) is a pre-flight check, not an enforcement.
### 25. Compose file is generated via shell heredoc instead of a real upload
`composeUploadScript` (line 303-319) inlines the compose body in a `cat <<'KEYSTONE_COMPOSE'` heredoc. Any quoting issue (binary, single-quote in env, large file) breaks. Spec §16 implies generated artifacts should be transferred via SSH/SCP.
**Fix:** Use SCP (or `ssh ... 'cat > path'` with binary-safe encoding) instead of heredoc. Also drop separate generation of `.env` files — currently only the compose is uploaded; `.env` references in compose will 404 on disk.
### 26. `.env` files never written to disk
**Spec §16** layout includes `/home/keystone/services/<service-id>/.env`. `composeUploadScript` writes only `compose.yml`. If the compose `env_file: .env` directive is rendered, the deploy will fail.
**Fix:** Render and upload `.env` alongside `compose.yml`.
## Models, schema, and encryption
### 27. Service.credentials migration column is plaintext but model casts as `encrypted:array`
Migration `database/migrations/2025_03_27_121050_create_services_table.php:35` declares `text('credentials')`, while `app/Models/Service.php:36` casts it as `encrypted:array`. The cast works (Laravel encrypts on write) — but a `text` column without explicit `nullable()`/encoding intent is confusing. Confirm cast actually encrypts (it does for `encrypted:*` casts) and document.
**Fix:** Add `->nullable()` and a comment in the migration noting the field is encrypted at the model layer.
### 28. `EnvironmentVariable.value` cast missing array-vs-string handling
Spec doesn't require it, but worth flagging: the cast is `encrypted` (scalar), which means complex values (JSON-encoded secrets) need explicit json_encode by callers. Currently `AttachManagedService.php:97-104` always passes scalar values, so this is OK.
## UI and onboarding
### 29. Onboarding (Spec §19) is not implemented
No onboarding controller, no routes, no Inertia pages. The spec calls for a guided flow: organisation → provider → source → deploy key → registry → server → app/env → attachments. Currently users must navigate disjoint pages manually.
**Fix:** Add an `OnboardingController` with a state machine on `Organisation` (or session-based progress) plus an Inertia wizard.
### 30. Service detail/edit pages are missing
`resources/js/pages/services/` only contains `updates/Create.vue`. There's no Index/Show/Edit. Spec §20 Phase 6 calls for "services under an environment with sensible defaults".
**Fix:** Add service Show/Edit pages, including replica health, slices, and one-click update.
### 31. No "managed attachment" guided flow
`resources/js/pages/environment-attachments/Create.vue` exposes a raw service list. Spec §12 + §20 call for managed flows for Postgres / Valkey / Caddy with auto-defaulted slices.
**Fix:** Build a guided picker per role (database / cache / queue / storage / gateway) that filters services to compatible types and previews the generated slice + env vars.
### 32. Deploy policies are visible / no defaults hiding
Spec §20 Phase 6: "Hide deploy policies by default." Currently they're set/exposed via `Service` form requests (`StoreServiceRequest`). No UI hiding.
**Fix:** Don't expose `deploy_policy` in service create/edit UI; rely on driver-provided defaults from spec §2.
### 33. `resources/js/pages/applications/Show.vue` has a stale stub comment
Line 72: `<!-- Add instance button would go here -->` — references the removed `Instance` model.
**Fix:** Remove the stale comment; add the actual "New environment" button.
### 34. `resources/js/pages/servers/Index.vue` contains a `@todo pagination` literal
Ship-ready code shouldn't have TODOs visible to the user.
**Fix:** Either implement pagination or remove the comment.
### 35. No UI surfaces variable source / overridable
`EnvironmentVariableController::store` hardcodes `source=USER`. The UI in `environment-variables/Create.vue` provides no way to see managed vs. user vars, and the spec §13 requires the source/overridable badge.
**Fix:** Read variables on the environment Show page grouped by source; for `managed_attachment` rows, show a "managed by Postgres slice X" badge and disable editing unless `overridable`.
### 36. No environment Show page; deploys/migrations etc. are triggered from `applications/Show.vue` per-environment row
This works but spec §20 Phase 6 wants environments to be the primary surface. Currently there's no environment detail page where services, replicas, slices, attachments, env vars, and operations are visible together.
**Fix:** Add `environments/Show.vue` and route `applications/Show.vue`'s environment row to it.
## Tests — coverage gaps
### 37. `EnvironmentDeploymentControllerTest` only asserts dispatch
Short test, no state assertions after run.
**Fix:** Replace `Bus::fake()` with running the job inline; assert that operations, steps, replica records, compose files, and env vars are all in expected state.
### 38. No test asserts deploy key cleanup after build
`BuildApplicationArtifact.php:100-101` adds the cleanup trap. No test verifies the trap actually runs or that the key file is gone.
**Fix:** In `BuildApplicationArtifactTest`, fake the remote runner and assert the build script contains `trap cleanup EXIT` AND that the operation_dir path resolves under `/home/keystone/operations/`.
### 39. No test asserts the named volume naming convention
Spec §10 names volumes `keystone_service_<id>_postgres_data`. No test in `ComposeRendererTest` checks volume name format.
**Fix:** Snapshot-test or regex-assert the volume name pattern in `ComposeRendererTest`.
### 40. No test for parent-child operation chain executing end-to-end
`DeployEnvironmentJobTest` creates operations but never runs the resulting `RunStep` jobs through `Queue::handleAll()`-style flow.
**Fix:** Add an integration test that fakes the SSH layer (return canned success) and lets the chain run, asserting each operation transitions to `COMPLETED` in the correct order.
### 41. No test for cancellation cascade on failure
None of the test files exercise `RunStep::failed` with sibling/child cancellation expectations (because that behavior doesn't exist yet — see gap #2).
**Fix:** Add a test that fails a step mid-chain and asserts all later operations are `CANCELLED`.
### 42. No test for stateful update flow against a Postgres service
`StatefulServiceUpdateTest` likely asserts only operation/step rows. Need: rendered script asserts compose path is correct, named volume is preserved, new digest is written.
**Fix:** Strengthen assertions to validate the full step contents.
### 43. No test for multi-server build/push/pull
`BuildArtifactPlanningTest` checks `requiresRegistry=true` but no test asserts a `docker push` and per-target `docker pull` actually occur.
**Fix:** Add a job-level test with two-server topology and assert each target's deploy script includes a `docker pull <ref>@<digest>` before `compose up`.
### 44. No test that `manual`/`disabled` migration modes are honored
**Fix:** Parametrized test asserting `migrationScript` returns `'true'` for `disabled` and `manual` modes, and the real command for `auto`.
### 45. No test for scheduler enforcement per replica
**Fix:** Test that for `scheduler_mode=single`, only one replica's rendered env has `AUTORUN_LARAVEL_SCHEDULER=true`; for `every_replica`, all do.
### 46. No test that managed attachment auto-creates slices for Valkey + Caddy
`ManagedAttachmentTest` likely tests Postgres only.
**Fix:** Extend dataset to cover Valkey logical_database and Caddy route slices, with their env exports.
---
## Suggested ordering
1. Fix the orchestration bugs (#1-#5) — without these the chain doesn't reliably reach completion.
2. Fix the Caddy cutover (#3, #4, #7, #19) — without these no environment can serve traffic.
3. Fix Postgres slice provision admin user (#9) and stateful update scripts (#10, #11).
4. Implement migration timing/mode (#16, #17, #18).
5. Implement scheduler enforcement (#24) and multi-server placement + pull (#21, #22, #23).
6. Then UI: onboarding (#29), environment Show page (#36), managed attachment UI (#31), variable source display (#35).
7. Strengthen tests last (#37-#46) once the orchestrator and drivers are stable.
Most of the schema and the high-level structure are correct — the gap is between the data model and the runtime behavior that's supposed to enforce/realize it.