20 KiB
Keystone Implementation Review — Gaps vs docs/implementation-spec.md
The schema/migrations/models are about 98% correct. The orchestration, drivers, UI, and tests have substantial gaps. Below are concrete, file-anchored issues to fix.
Critical orchestration bugs
1. Operation parent/child hierarchy is flat — replicas are siblings of their service_deploy, not children
Spec §3 example nests service_deploy → replica_deploy. app/Jobs/Environments/DeployEnvironment.php:74-82 creates both as siblings under the same environment_deploy parent. Replica_deploy siblings rely on RunStep::dispatchNextSiblingOperation (app/Jobs/Services/RunStep.php:120-140) ordering by id — fragile, and if service_deploy fails, replica operations are not cancelled.
Fix: Nest replica operations under their service's service_deploy operation (parent_id = service_deploy.id), and cascade-cancel children when a parent fails.
2. Failed operations don't cancel siblings or children
RunStep::failed (app/Jobs/Services/RunStep.php:163-178) only cancels the failed operation's remaining steps. Sibling/child operations under the same parent continue to dispatch via dispatchNextSiblingOperation. A failing service deploy will still trigger gateway cutover.
Fix: On step failure, mark parent + all descendant operations as CANCELLED/FAILED. Re-check during sibling dispatch.
3. Gateway cutover uses a hardcoded container name that doesn't exist
DeployEnvironment.php:202 runs docker exec keystone-caddy caddy reload ..., but Caddy replicas are created with name keystone-service-{service->id}-{N} per DeployEnvironment.php:155. The keystone-caddy container is never created — cutover will always fail.
Fix: Look up the Caddy service's replica container name; or set a stable container_name in Caddy compose.
4. Gateway cutover is monolithic; no add-upstream/reload/drain sub-sequence
Spec §15 requires: render new replica → health check → add new upstream → reload → drain old → stop old. DeployEnvironment.php:201-206 does only caddy reload && sleep 10 && stop draining. There's no add-upstream step (Caddyfile is fully overwritten at slice_configure time), no real health check during cutover, and sleep 10 is an arbitrary drain.
Fix: Split into separate steps with explicit ordering, and tie drain to active connections / Caddy upstream health.
5. dispatchChildOperations dispatches only the first child's first step
DeployEnvironment.php:377-387 dispatches a single step. Continuation depends on RunStep::dispatchNextSiblingOperation chasing siblings by id. If a child operation has zero steps (e.g. service_deploy for a service whose driver returned no plan), the chain dies silently.
Fix: Make a single orchestrator (e.g. DispatchOperationChain) that knows how to walk the tree. Don't rely on implicit id-ordering between independently-created sibling operations.
Driver contract & implementations
6. Driver base contract is anemic
Spec §9 lists 13 required driver capabilities. app/Drivers/Driver.php only declares __construct and getOperationPlan. Image policy, ports, volumes, env schema, health checks, resource defaults, slice types, env exports, firewall, update behavior are scattered across drivers without contractual enforcement.
Fix: Define an interface with explicit methods (type(), versionTrack(), defaultPorts(), firewallRules(), updateBehavior(), etc.) and assert per-driver via tests.
7. Caddy2Driver::buildCaddyfile() reads an undefined field — dead code
app/Drivers/Caddy/Caddy2Driver.php:46 references $this->service->credentials['backend']. Nothing ever sets this. The actual Caddyfile is rendered inline by DeployEnvironment::configureCaddyRouteScript (app/Jobs/Environments/DeployEnvironment.php:321-335).
Fix: Delete buildCaddyfile(), or move Caddyfile generation into the driver and remove the duplicate in DeployEnvironment.
8. Postgres18Driver has no slice provisioning in its operation plan
Spec §12: "Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container." AttachManagedService::createSliceProvisionOperation (app/Actions/Environments/AttachManagedService.php:108-125) hardcodes the SQL script outside the driver. Slice provisioning logic belongs in the driver so other Postgres versions can implement it.
Fix: Add provisionSliceScript(ServiceSlice $slice): string to the slice contract; have Postgres driver own the SQL.
9. Postgres provision script assumes a keystone admin user that is never created
AttachManagedService.php:133 uses ($service->credentials ?? [])['user'] ?? 'keystone'. The Postgres service is never seeded with admin credentials and the compose for Postgres never sets POSTGRES_USER=keystone. This will fail in production.
Fix: Establish admin credentials on Postgres service creation (write to service->credentials); pass via POSTGRES_USER/POSTGRES_PASSWORD in compose env.
10. Stateful update steps are placeholder strings that won't actually run
app/Actions/Services/CreateStatefulServiceUpdateOperation.php:38-42:
'docker compose down'— no-f pathso it runs in the SSH user's home directory.'docker volume ls'— listing isn't "preserving"; it's a no-op.'docker compose up -d'— no-f path, no image digest, no env update.'docker compose ps'— not a real health check.
Spec §11 specifies a real sequence: stop → preserve named volume → start new with updated digest → health check.
Fix: Build steps from the driver against the service's actual compose path; verify the named volume exists before/after; replace healthcheck stub with docker inspect --format '{{.State.Health.Status}}' polling.
11. Stateful update doesn't write the updated digest into compose before restart
The operation sets available_image_digest on the service (line 55) but the compose file on disk is not re-rendered. docker compose up -d will pull from whatever digest is currently in the compose, not the new one.
Fix: Insert a "Render compose with new digest" step before the start step.
12. Valkey driver doesn't emit role-based env vars
app/Drivers/Valkey/Valkey8Driver.php:42-48 only emits REDIS_HOST and REDIS_PORT. Spec §13 explicitly recommends CACHE_STORE=redis, SESSION_DRIVER=redis, QUEUE_CONNECTION=redis based on attachment role.
Fix: Read the EnvironmentAttachment.role and add the appropriate Laravel env defaults (with the "Do not silently change queue behavior without confirmation" guard from §12).
13. Valkey has no logical-DB isolation
AttachManagedService.php:64-71 creates a logical_database slice but never assigns a Redis database index (REDIS_DB). All environments attached to the same Valkey service share DB 0.
Fix: Assign REDIS_DB per slice; include in environmentExportsForSlice.
14. Caddy and Valkey slices have no SLICE_PROVISION operation
AttachManagedService::createSliceProvisionOperation (line 110) early-returns unless service->type === POSTGRES. Caddy routes and Valkey logical DBs are created in the DB but never reconciled to the running service.
Fix: Emit slice operations for all service types whose driver supports slices.
15. Postgres driver doesn't export DB_* at the service level
Postgres18Driver::environmentExports() returns []. That's fine, but only slice exports work, so if a service has no slice but a Laravel app references DB_HOST, nothing wires it.
Fix: Either guarantee a slice always exists for Postgres attachments, or emit DB_HOST at the service level via the attachment.
Deployment flow
16. Migration timing is hardcoded to pre_switch
DeployEnvironment::serviceDeployScripts (app/Jobs/Environments/DeployEnvironment.php:236-238) always emits the migration step before "Deploy replicas". Spec §18 lists migration_timing: pre_switch | post_switch on service config — never read.
Fix: Check $service->config['migration_timing'] and either emit the migration step before replicas or after the gateway cutover.
17. Migration mode manual is ignored
migrationScript (line 292-301) only short-circuits when migration_mode=disabled. manual still auto-runs php artisan migrate --force during environment deploy. Spec §18 says manual mode should not run automatically.
Fix: Treat manual the same as disabled for environment deploys; only the dedicated environment-migrations.store controller should run it.
18. Two parallel migration code paths
DeployEnvironment::migrationScript (line 292), LaravelRuntimeDriver::getOperationPlan, and EnvironmentMigrationController all emit migration scripts independently. They will drift.
Fix: Centralize in one place (driver method or dedicated action) and call from all three.
19. "Update gateway routes" step is a no-op
DeployEnvironment.php:248-250:
'script' => 'test -f /home/keystone/gateway/Caddyfile',
This just checks file presence. The actual route update is in a separate slice_configure operation, so this step is dead code in the service-deploy chain.
Fix: Either remove the step or have it actually trigger the route update for this service.
20. Pre-switch service steps from spec §17 step 6 are missing entirely
No "pre-switch" hooks are emitted by DeployEnvironment or any driver.
Fix: Add a preSwitchSteps(): array driver capability and call it before the migration/replica steps.
21. Multi-server replica placement is not implemented
DeployEnvironment::ensureServiceReplicas (line 147-171) always assigns server_id = $service->server_id. There is no way to place replicas across multiple servers — yet the registry-required check at line 39-41 assumes multi-server deployments exist.
Fix: Either ship single-server-only v1 and remove the multi-server gate, or wire Service.process_roles/placement policy to multiple servers in ensureServiceReplicas.
22. No explicit docker pull <ref>@<digest> on target servers for multi-server
replicaDeployScripts (line 261-290) does docker compose up -d only; for multi-server, target servers need to pull from the registry by digest. There is no pull step.
Fix: Add docker pull step using registry_ref + digest before the up -d step on each target server.
23. Build strategy dedicated_builder and external_registry are not enforced
BuildApplicationArtifact::execute (app/Actions/Environments/BuildApplicationArtifact.php:30) picks any server via buildServer(). The push is only conditionally added when strategy === EXTERNAL_REGISTRY (line 89). There's no enforcement that dedicated_builder requires a builder service to exist, nor that external_registry skips local build entirely.
Fix: Branch on strategy at the top of execute(); for external_registry, skip the build and resolve the digest from the registry (docker manifest inspect); for dedicated_builder, fail if no builder service is provisioned.
24. Scheduler placement is not enforced at runtime
Spec §8: single mode runs schedule:run on exactly one replica; every_replica runs on all. LaravelRuntimeDriver sets AUTORUN_LARAVEL_SCHEDULER=true based on the service's process_roles, but nothing applies the env per-replica based on scheduler_target_service_id/scheduler_mode. All replicas of the target service end up with the env var.
Fix: When generating replica config, only emit AUTORUN_LARAVEL_SCHEDULER=true on the elected replica when scheduler_mode=single. The existing PlanEnvironmentDeployment::blockers (app/Actions/Environments/PlanEnvironmentDeployment.php:79-87) is a pre-flight check, not an enforcement.
25. Compose file is generated via shell heredoc instead of a real upload
composeUploadScript (line 303-319) inlines the compose body in a cat <<'KEYSTONE_COMPOSE' heredoc. Any quoting issue (binary, single-quote in env, large file) breaks. Spec §16 implies generated artifacts should be transferred via SSH/SCP.
Fix: Use SCP (or ssh ... 'cat > path' with binary-safe encoding) instead of heredoc. Also drop separate generation of .env files — currently only the compose is uploaded; .env references in compose will 404 on disk.
26. .env files never written to disk
Spec §16 layout includes /home/keystone/services/<service-id>/.env. composeUploadScript writes only compose.yml. If the compose env_file: .env directive is rendered, the deploy will fail.
Fix: Render and upload .env alongside compose.yml.
Models, schema, and encryption
27. Service.credentials migration column is plaintext but model casts as encrypted:array
Migration database/migrations/2025_03_27_121050_create_services_table.php:35 declares text('credentials'), while app/Models/Service.php:36 casts it as encrypted:array. The cast works (Laravel encrypts on write) — but a text column without explicit nullable()/encoding intent is confusing. Confirm cast actually encrypts (it does for encrypted:* casts) and document.
Fix: Add ->nullable() and a comment in the migration noting the field is encrypted at the model layer.
28. EnvironmentVariable.value cast missing array-vs-string handling
Spec doesn't require it, but worth flagging: the cast is encrypted (scalar), which means complex values (JSON-encoded secrets) need explicit json_encode by callers. Currently AttachManagedService.php:97-104 always passes scalar values, so this is OK.
UI and onboarding
29. Onboarding (Spec §19) is not implemented
No onboarding controller, no routes, no Inertia pages. The spec calls for a guided flow: organisation → provider → source → deploy key → registry → server → app/env → attachments. Currently users must navigate disjoint pages manually.
Fix: Add an OnboardingController with a state machine on Organisation (or session-based progress) plus an Inertia wizard.
30. Service detail/edit pages are missing
resources/js/pages/services/ only contains updates/Create.vue. There's no Index/Show/Edit. Spec §20 Phase 6 calls for "services under an environment with sensible defaults".
Fix: Add service Show/Edit pages, including replica health, slices, and one-click update.
31. No "managed attachment" guided flow
resources/js/pages/environment-attachments/Create.vue exposes a raw service list. Spec §12 + §20 call for managed flows for Postgres / Valkey / Caddy with auto-defaulted slices.
Fix: Build a guided picker per role (database / cache / queue / storage / gateway) that filters services to compatible types and previews the generated slice + env vars.
32. Deploy policies are visible / no defaults hiding
Spec §20 Phase 6: "Hide deploy policies by default." Currently they're set/exposed via Service form requests (StoreServiceRequest). No UI hiding.
Fix: Don't expose deploy_policy in service create/edit UI; rely on driver-provided defaults from spec §2.
33. resources/js/pages/applications/Show.vue has a stale stub comment
Line 72: <!-- Add instance button would go here --> — references the removed Instance model.
Fix: Remove the stale comment; add the actual "New environment" button.
34. resources/js/pages/servers/Index.vue contains a @todo pagination literal
Ship-ready code shouldn't have TODOs visible to the user. Fix: Either implement pagination or remove the comment.
35. No UI surfaces variable source / overridable
EnvironmentVariableController::store hardcodes source=USER. The UI in environment-variables/Create.vue provides no way to see managed vs. user vars, and the spec §13 requires the source/overridable badge.
Fix: Read variables on the environment Show page grouped by source; for managed_attachment rows, show a "managed by Postgres slice X" badge and disable editing unless overridable.
36. No environment Show page; deploys/migrations etc. are triggered from applications/Show.vue per-environment row
This works but spec §20 Phase 6 wants environments to be the primary surface. Currently there's no environment detail page where services, replicas, slices, attachments, env vars, and operations are visible together.
Fix: Add environments/Show.vue and route applications/Show.vue's environment row to it.
Tests — coverage gaps
37. EnvironmentDeploymentControllerTest only asserts dispatch
Short test, no state assertions after run.
Fix: Replace Bus::fake() with running the job inline; assert that operations, steps, replica records, compose files, and env vars are all in expected state.
38. No test asserts deploy key cleanup after build
BuildApplicationArtifact.php:100-101 adds the cleanup trap. No test verifies the trap actually runs or that the key file is gone.
Fix: In BuildApplicationArtifactTest, fake the remote runner and assert the build script contains trap cleanup EXIT AND that the operation_dir path resolves under /home/keystone/operations/.
39. No test asserts the named volume naming convention
Spec §10 names volumes keystone_service_<id>_postgres_data. No test in ComposeRendererTest checks volume name format.
Fix: Snapshot-test or regex-assert the volume name pattern in ComposeRendererTest.
40. No test for parent-child operation chain executing end-to-end
DeployEnvironmentJobTest creates operations but never runs the resulting RunStep jobs through Queue::handleAll()-style flow.
Fix: Add an integration test that fakes the SSH layer (return canned success) and lets the chain run, asserting each operation transitions to COMPLETED in the correct order.
41. No test for cancellation cascade on failure
None of the test files exercise RunStep::failed with sibling/child cancellation expectations (because that behavior doesn't exist yet — see gap #2).
Fix: Add a test that fails a step mid-chain and asserts all later operations are CANCELLED.
42. No test for stateful update flow against a Postgres service
StatefulServiceUpdateTest likely asserts only operation/step rows. Need: rendered script asserts compose path is correct, named volume is preserved, new digest is written.
Fix: Strengthen assertions to validate the full step contents.
43. No test for multi-server build/push/pull
BuildArtifactPlanningTest checks requiresRegistry=true but no test asserts a docker push and per-target docker pull actually occur.
Fix: Add a job-level test with two-server topology and assert each target's deploy script includes a docker pull <ref>@<digest> before compose up.
44. No test that manual/disabled migration modes are honored
Fix: Parametrized test asserting migrationScript returns 'true' for disabled and manual modes, and the real command for auto.
45. No test for scheduler enforcement per replica
Fix: Test that for scheduler_mode=single, only one replica's rendered env has AUTORUN_LARAVEL_SCHEDULER=true; for every_replica, all do.
46. No test that managed attachment auto-creates slices for Valkey + Caddy
ManagedAttachmentTest likely tests Postgres only.
Fix: Extend dataset to cover Valkey logical_database and Caddy route slices, with their env exports.
Suggested ordering
- Fix the orchestration bugs (#1-#5) — without these the chain doesn't reliably reach completion.
- Fix the Caddy cutover (#3, #4, #7, #19) — without these no environment can serve traffic.
- Fix Postgres slice provision admin user (#9) and stateful update scripts (#10, #11).
- Implement migration timing/mode (#16, #17, #18).
- Implement scheduler enforcement (#24) and multi-server placement + pull (#21, #22, #23).
- Then UI: onboarding (#29), environment Show page (#36), managed attachment UI (#31), variable source display (#35).
- Strengthen tests last (#37-#46) once the orchestrator and drivers are stable.
Most of the schema and the high-level structure are correct — the gap is between the data model and the runtime behavior that's supposed to enforce/realize it.