Implement Keystone environment deployments

2026-05-13 16:11:23 +01:00
parent 65d3142d03
commit aa680b25fd
175 changed files with 10258 additions and 740 deletions
--- a/docs/implementation-review.md
+++ b/docs/implementation-review.md
@@ -0,0 +1,220 @@
+# Keystone Implementation Review — Gaps vs `docs/implementation-spec.md`
+
+The schema/migrations/models are about 98% correct. The orchestration, drivers, UI, and tests have substantial gaps. Below are concrete, file-anchored issues to fix.
+
+## Critical orchestration bugs
+
+### 1. Operation parent/child hierarchy is flat — replicas are siblings of their service_deploy, not children
+**Spec §3 example** nests `service_deploy → replica_deploy`. **`app/Jobs/Environments/DeployEnvironment.php:74-82`** creates both as siblings under the same `environment_deploy` parent. The `replica_deploy` siblings rely on `RunStep::dispatchNextSiblingOperation` (`app/Jobs/Services/RunStep.php:120-140`) ordering by `id`, which is fragile, and if `service_deploy` fails, the replica operations are not cancelled.
+**Fix:** Nest replica operations under their service's `service_deploy` operation (parent_id = service_deploy.id), and cascade-cancel children when a parent fails.
+
+### 2. Failed operations don't cancel siblings or children
+`RunStep::failed` (`app/Jobs/Services/RunStep.php:163-178`) only cancels the failed operation's remaining steps. Sibling/child operations under the same parent continue to dispatch via `dispatchNextSiblingOperation`. A failing service deploy will still trigger gateway cutover.
+**Fix:** On step failure, mark parent + all descendant operations as `CANCELLED`/`FAILED`. Re-check during sibling dispatch.
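+
+A minimal sketch of that cascade, assuming operations are plain arrays keyed by id with `parent_id` and `status` keys. The function name `cancelAfterFailure`, the array shape, and the lowercase status strings are hypothetical stand-ins for the real `Operation` model:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: $operations maps id => ['parent_id' => ?int, 'status' => string].
+ * Marks the failed operation and its ancestors as failed, then cancels every
+ * still-pending operation in the same tree.
+ */
+function cancelAfterFailure(array $operations, int $failedId): array
+{
+    // Walk up to the root, marking each ancestor as failed.
+    $rootId = $failedId;
+    while (true) {
+        $operations[$rootId]['status'] = 'failed';
+        $parentId = $operations[$rootId]['parent_id'];
+        if ($parentId === null) {
+            break;
+        }
+        $rootId = $parentId;
+    }
+
+    // Breadth-first walk down from the root, cancelling pending operations.
+    $queue = [$rootId];
+    while ($queue !== []) {
+        $current = array_shift($queue);
+        foreach ($operations as $id => $operation) {
+            if ($operation['parent_id'] !== $current) {
+                continue;
+            }
+            if ($operation['status'] === 'pending') {
+                $operations[$id]['status'] = 'cancelled';
+            }
+            $queue[] = $id;
+        }
+    }
+
+    return $operations;
+}
+```
+
+Completed operations keep their status, so the audit trail still shows what ran before the failure.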
+
+### 3. Gateway cutover uses a hardcoded container name that doesn't exist
+`DeployEnvironment.php:202` runs `docker exec keystone-caddy caddy reload ...`, but Caddy replicas are created with name `keystone-service-{service->id}-{N}` per `DeployEnvironment.php:155`. The `keystone-caddy` container is never created — cutover will always fail.
+**Fix:** Look up the Caddy service's replica container name; or set a stable container_name in Caddy compose.
+
+### 4. Gateway cutover is monolithic; no add-upstream/reload/drain sub-sequence
+**Spec §15** requires: render new replica → health check → add new upstream → reload → drain old → stop old. `DeployEnvironment.php:201-206` does only `caddy reload && sleep 10 && stop draining`. There's no add-upstream step (Caddyfile is fully overwritten at `slice_configure` time), no real health check during cutover, and `sleep 10` is an arbitrary drain.
+**Fix:** Split into separate steps with explicit ordering, and tie drain to active connections / Caddy upstream health.
+
+### 5. `dispatchChildOperations` dispatches only the first child's first step
+`DeployEnvironment.php:377-387` dispatches a single step. Continuation depends on `RunStep::dispatchNextSiblingOperation` chasing siblings by id. If a child operation has zero steps (e.g. `service_deploy` for a service whose driver returned no plan), the chain dies silently.
+**Fix:** Make a single orchestrator (e.g. `DispatchOperationChain`) that knows how to walk the tree. Don't rely on implicit id-ordering between independently-created sibling operations.
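+
+One possible shape for the tree walk, assuming each operation node carries a step count and an ordered list of children. `executionOrder` and the node layout are hypothetical; the point is only that a zero-step operation is skipped without ending the chain:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: each node is ['id' => int, 'steps' => int, 'children' => array].
+ * Returns operation ids in pre-order, skipping operations that have no steps
+ * while still descending into their children.
+ */
+function executionOrder(array $node): array
+{
+    $order = $node['steps'] > 0 ? [$node['id']] : [];
+
+    foreach ($node['children'] as $child) {
+        $order = array_merge($order, executionOrder($child));
+    }
+
+    return $order;
+}
+```
+
+Because the order is computed from the tree rather than from row ids, it does not matter in which order sibling operations were inserted.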
+
+## Driver contract & implementations
+
+### 6. `Driver` base contract is anemic
+**Spec §9** lists 13 required driver capabilities. `app/Drivers/Driver.php` only declares `__construct` and `getOperationPlan`. Image policy, ports, volumes, env schema, health checks, resource defaults, slice types, env exports, firewall, update behavior are scattered across drivers without contractual enforcement.
+**Fix:** Define an interface with explicit methods (`type()`, `versionTrack()`, `defaultPorts()`, `firewallRules()`, `updateBehavior()`, etc.) and assert per-driver via tests.
+
+### 7. `Caddy2Driver::buildCaddyfile()` reads an undefined field — dead code
+`app/Drivers/Caddy/Caddy2Driver.php:46` references `$this->service->credentials['backend']`. Nothing ever sets this. The actual Caddyfile is rendered inline by `DeployEnvironment::configureCaddyRouteScript` (`app/Jobs/Environments/DeployEnvironment.php:321-335`).
+**Fix:** Delete `buildCaddyfile()`, or move Caddyfile generation into the driver and remove the duplicate in `DeployEnvironment`.
+
+### 8. Postgres18Driver has no slice provisioning in its operation plan
+**Spec §12:** "Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container." `AttachManagedService::createSliceProvisionOperation` (`app/Actions/Environments/AttachManagedService.php:108-125`) hardcodes the SQL script outside the driver. Slice provisioning logic belongs in the driver so other Postgres versions can implement it.
+**Fix:** Add `provisionSliceScript(ServiceSlice $slice): string` to the slice contract; have Postgres driver own the SQL.
+
+### 9. Postgres provision script assumes a `keystone` admin user that is never created
+`AttachManagedService.php:133` uses `($service->credentials ?? [])['user'] ?? 'keystone'`. The Postgres service is never seeded with admin credentials and the compose for Postgres never sets `POSTGRES_USER=keystone`. This will fail in production.
+**Fix:** Establish admin credentials on Postgres service creation (write to `service->credentials`); pass via `POSTGRES_USER`/`POSTGRES_PASSWORD` in compose env.
+
+### 10. Stateful update steps are placeholder strings that won't actually run
+`app/Actions/Services/CreateStatefulServiceUpdateOperation.php:38-42`:
+- `'docker compose down'` — no `-f path` so it runs in the SSH user's home directory.
+- `'docker volume ls'` — listing isn't "preserving"; it's a no-op.
+- `'docker compose up -d'` — no `-f path`, no image digest, no env update.
+- `'docker compose ps'` — not a real health check.
+
+Spec §11 specifies a real sequence: stop → preserve named volume → start new with updated digest → health check.
+**Fix:** Build steps from the driver against the service's actual compose path; verify the named volume exists before/after; replace healthcheck stub with `docker inspect --format '{{.State.Health.Status}}'` polling.
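+
+A hedged sketch of the polling step, written as a PHP helper that returns the shell script for an operation step. The helper name `healthPollScript` and its defaults are assumptions, and the approach only works when the image defines a `HEALTHCHECK`:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical helper: builds a shell script that polls a container's health.
+ * Assumes the image defines a HEALTHCHECK; without one, .State.Health is absent
+ * and the loop runs out of attempts.
+ */
+function healthPollScript(string $container, int $attempts = 30, int $sleepSeconds = 2): string
+{
+    $name = escapeshellarg($container);
+
+    return implode("\n", [
+        "for attempt in \$(seq 1 {$attempts}); do",
+        "  status=\$(docker inspect --format '{{.State.Health.Status}}' {$name} 2>/dev/null || true)",
+        '  if [ "$status" = "healthy" ]; then exit 0; fi',
+        "  sleep {$sleepSeconds}",
+        'done',
+        'echo "container did not become healthy" >&2',
+        'exit 1',
+    ]);
+}
+```
+
+The script exits non-zero when the attempts run out, so the step fails instead of reporting a stopped or unhealthy container as success.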
+
+### 11. Stateful update doesn't write the updated digest into compose before restart
+The operation sets `available_image_digest` on the service (line 55) but the compose file on disk is not re-rendered. `docker compose up -d` will pull from whatever digest is currently in the compose, not the new one.
+**Fix:** Insert a "Render compose with new digest" step before the start step.
+
+### 12. Valkey driver doesn't emit role-based env vars
+`app/Drivers/Valkey/Valkey8Driver.php:42-48` only emits `REDIS_HOST` and `REDIS_PORT`. **Spec §12** explicitly recommends `CACHE_STORE=redis`, `SESSION_DRIVER=redis`, or `QUEUE_CONNECTION=redis` depending on attachment role.
+**Fix:** Read the `EnvironmentAttachment.role` and add the appropriate Laravel env defaults (with the "Do not silently change queue behavior without confirmation" guard from §12).
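+
+A small sketch of the role mapping with the confirmation guard. The function name and the `$queueChangeConfirmed` flag are hypothetical, and tying `SESSION_DRIVER` to the `cache` role is an assumption, since the spec lists no separate session role:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: recommended Laravel env defaults per attachment role.
+ * Queue behavior is only changed when the caller passes an explicit confirmation.
+ */
+function recommendedEnvForRole(string $role, bool $queueChangeConfirmed = false): array
+{
+    return match ($role) {
+        // Assumption: the cache role also covers sessions.
+        'cache' => ['CACHE_STORE' => 'redis', 'SESSION_DRIVER' => 'redis'],
+        'queue' => $queueChangeConfirmed ? ['QUEUE_CONNECTION' => 'redis'] : [],
+        default => [],
+    };
+}
+```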
+
+### 13. Valkey has no logical-DB isolation
+`AttachManagedService.php:64-71` creates a `logical_database` slice but never assigns a Redis database index (`REDIS_DB`). All environments attached to the same Valkey service share DB 0.
+**Fix:** Assign `REDIS_DB` per slice; include in `environmentExportsForSlice`.
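+
+A minimal sketch of index allocation, assuming the Valkey server keeps the default of 16 logical databases. `nextLogicalDatabase` is a hypothetical name, and the used indexes would come from existing slices on the same service:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: returns the lowest database index not yet in use.
+ * Assumes the server keeps the default of 16 logical databases (0-15).
+ */
+function nextLogicalDatabase(array $usedIndexes, int $databases = 16): int
+{
+    for ($index = 0; $index < $databases; $index++) {
+        if (!in_array($index, $usedIndexes, true)) {
+            return $index;
+        }
+    }
+
+    throw new RuntimeException('No free logical database index left.');
+}
+```
+
+Throwing when the range is exhausted keeps a seventeenth environment from silently sharing a database with another one.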
+
+### 14. Caddy and Valkey slices have no `SLICE_PROVISION` operation
+`AttachManagedService::createSliceProvisionOperation` (line 110) early-returns unless `service->type === POSTGRES`. Caddy routes and Valkey logical DBs are created in the DB but never reconciled to the running service.
+**Fix:** Emit slice operations for all service types whose driver supports slices.
+
+### 15. Postgres driver doesn't export DB_* at the service level
+`Postgres18Driver::environmentExports()` returns `[]`. That is fine on its own, but it means only slice exports produce `DB_*` values: if a service has no slice and a Laravel app references `DB_HOST`, nothing sets it.
+**Fix:** Either guarantee a slice always exists for Postgres attachments, or emit DB_HOST at the service level via the attachment.
+
+## Deployment flow
+
+### 16. Migration timing is hardcoded to pre_switch
+`DeployEnvironment::serviceDeployScripts` (`app/Jobs/Environments/DeployEnvironment.php:236-238`) always emits the migration step before "Deploy replicas". **Spec §18** lists `migration_timing: pre_switch | post_switch` on service config — never read.
+**Fix:** Check `$service->config['migration_timing']` and either emit the migration step before replicas or after the gateway cutover.
+
+### 17. Migration mode `manual` is ignored
+`migrationScript` (lines 292-301) only short-circuits when `migration_mode=disabled`. `manual` still auto-runs `php artisan migrate --force` during environment deploy. Spec §18 says manual mode should not run automatically.
+**Fix:** Treat `manual` the same as `disabled` for environment deploys; only the dedicated `environment-migrations.store` controller should run it.
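+
+The mode and timing decisions from #16 and #17 can be sketched as one function. The name `migrationPlacement` and the fallback defaults (`auto`, `pre_switch`) are assumptions, not values confirmed by the spec:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: decides where an environment deploy runs migrations.
+ * Returns 'pre_switch', 'post_switch', or null when the deploy must not run them.
+ */
+function migrationPlacement(array $config): ?string
+{
+    // Assumed default mode; 'disabled' and 'manual' never run during a deploy.
+    $mode = $config['migration_mode'] ?? 'auto';
+    if ($mode !== 'auto') {
+        return null;
+    }
+
+    // Assumed default timing when the key is missing or unrecognized.
+    $timing = $config['migration_timing'] ?? 'pre_switch';
+
+    return $timing === 'post_switch' ? 'post_switch' : 'pre_switch';
+}
+```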
+
+### 18. Three parallel migration code paths
+`DeployEnvironment::migrationScript` (line 292), `LaravelRuntimeDriver::getOperationPlan`, and `EnvironmentMigrationController` all emit migration scripts independently. They will drift.
+**Fix:** Centralize in one place (driver method or dedicated action) and call from all three.
+
+### 19. "Update gateway routes" step is a no-op
+`DeployEnvironment.php:248-250`:
+```
+'script' => 'test -f /home/keystone/gateway/Caddyfile',
+```
+This just checks file presence. The actual route update is in a separate `slice_configure` operation, so this step is dead code in the service-deploy chain.
+**Fix:** Either remove the step or have it actually trigger the route update for this service.
+
+### 20. Pre-switch service steps from spec §17 step 6 are missing entirely
+No "pre-switch" hooks are emitted by `DeployEnvironment` or any driver.
+**Fix:** Add a `preSwitchSteps(): array` driver capability and call it before the migration/replica steps.
+
+### 21. Multi-server replica placement is not implemented
+`DeployEnvironment::ensureServiceReplicas` (lines 147-171) always assigns `server_id = $service->server_id`. There is no way to place replicas across multiple servers, yet the registry-required check at lines 39-41 assumes multi-server deployments exist.
+**Fix:** Either ship single-server-only v1 and remove the multi-server gate, or wire `Service.process_roles`/placement policy to multiple servers in `ensureServiceReplicas`.
+
+### 22. No explicit `docker pull <ref>@<digest>` on target servers for multi-server
+`replicaDeployScripts` (lines 261-290) runs only `docker compose up -d`. For multi-server deployments, each target server needs to pull from the registry by digest, and there is no pull step.
+**Fix:** Add `docker pull` step using `registry_ref` + digest before the `up -d` step on each target server.
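+
+A sketch of the pull step as a script builder, assuming `registry_ref` holds the repository reference without a tag (for example `registry.example.com/acme/app`). The helper name is hypothetical:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical helper: builds the pull command for an exact image digest.
+ * Assumes $registryRef is a repository reference without a tag.
+ */
+function pullByDigestScript(string $registryRef, string $digest): string
+{
+    // Reject anything that is not a full sha256 digest before it reaches a shell.
+    if (preg_match('/^sha256:[a-f0-9]{64}$/', $digest) !== 1) {
+        throw new InvalidArgumentException('Expected a sha256 digest.');
+    }
+
+    return 'docker pull ' . escapeshellarg($registryRef . '@' . $digest);
+}
+```
+
+Pulling by digest rather than by tag is what guarantees that every target server runs the same image.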
+
+### 23. Build strategy `dedicated_builder` and `external_registry` are not enforced
+`BuildApplicationArtifact::execute` (`app/Actions/Environments/BuildApplicationArtifact.php:30`) picks any server via `buildServer()`. The push is only conditionally added when strategy === `EXTERNAL_REGISTRY` (line 89). There's no enforcement that `dedicated_builder` requires a builder service to exist, nor that `external_registry` skips local build entirely.
+**Fix:** Branch on strategy at the top of `execute()`; for `external_registry`, skip the build and resolve the digest from the registry (`docker manifest inspect`); for `dedicated_builder`, fail if no builder service is provisioned.
+
+### 24. Scheduler placement is not enforced at runtime
+**Spec §8:** `single` mode runs `schedule:run` on exactly one replica; `every_replica` runs on all. `LaravelRuntimeDriver` sets `AUTORUN_LARAVEL_SCHEDULER=true` based on the service's `process_roles`, but nothing applies the env per-replica based on `scheduler_target_service_id`/`scheduler_mode`. All replicas of the target service end up with the env var.
+**Fix:** When generating replica config, only emit `AUTORUN_LARAVEL_SCHEDULER=true` on the elected replica when `scheduler_mode=single`. The existing `PlanEnvironmentDeployment::blockers` (`app/Actions/Environments/PlanEnvironmentDeployment.php:79-87`) is a pre-flight check, not an enforcement.
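+
+A minimal sketch of per-replica enforcement. Electing the replica with the lowest id is an assumption made for illustration, and `schedulerFlags` is a hypothetical name:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: maps replica id => whether it gets AUTORUN_LARAVEL_SCHEDULER=true.
+ * In 'single' mode only the lowest replica id is elected (an assumed election rule).
+ */
+function schedulerFlags(array $replicaIds, string $mode): array
+{
+    sort($replicaIds);
+
+    $flags = [];
+    foreach ($replicaIds as $position => $replicaId) {
+        $flags[$replicaId] = $mode === 'every_replica' || $position === 0;
+    }
+
+    return $flags;
+}
+```
+
+A deterministic election keeps the scheduler on the same replica across re-renders, as long as that replica still exists.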
+
+### 25. Compose file is generated via shell heredoc instead of a real upload
+`composeUploadScript` (lines 303-319) inlines the compose body in a `cat <<'KEYSTONE_COMPOSE'` heredoc. Any quoting issue (binary content, a single quote in an env value, a large file) breaks the upload. Spec §16 implies generated artifacts should be transferred via SSH/SCP.
+**Fix:** Use SCP (or `ssh ... 'cat > path'` with binary-safe encoding) instead of a heredoc. Also upload the `.env` file the same way: currently only the compose file is uploaded, so any `.env` reference in compose points at a file that does not exist on disk.
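+
+If the upload has to stay inside a script step, base64 is one binary-safe encoding. This is a sketch with a hypothetical helper name; it assumes GNU `base64` on the target and is bounded by the shell's argument-length limit, so SCP remains the better choice for large artifacts:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical helper: builds a command that writes $contents to $remotePath.
+ * The payload is base64-encoded, so quotes and binary bytes cannot break the shell.
+ */
+function uploadFileScript(string $remotePath, string $contents): string
+{
+    return sprintf(
+        "printf '%%s' %s | base64 -d > %s",
+        escapeshellarg(base64_encode($contents)),
+        escapeshellarg($remotePath)
+    );
+}
+```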
+
+### 26. `.env` files never written to disk
+**Spec §16** layout includes `/home/keystone/services/<service-id>/.env`. `composeUploadScript` writes only `compose.yml`. If the compose `env_file: .env` directive is rendered, the deploy will fail.
+**Fix:** Render and upload `.env` alongside `compose.yml`.
+
+## Models, schema, and encryption
+
+### 27. Service.credentials migration column is plaintext but model casts as `encrypted:array`
+Migration `database/migrations/2025_03_27_121050_create_services_table.php:35` declares `text('credentials')`, while `app/Models/Service.php:36` casts it as `encrypted:array`. The cast works (Laravel's `encrypted:*` casts encrypt on write), but a `text` column with no explicit `nullable()` and no note about encryption is confusing to read.
+**Fix:** Add `->nullable()` and a comment in the migration noting the field is encrypted at the model layer.
+
+### 28. `EnvironmentVariable.value` cast missing array-vs-string handling
+Spec doesn't require it, but worth flagging: the cast is `encrypted` (scalar), which means complex values (JSON-encoded secrets) need explicit json_encode by callers. Currently `AttachManagedService.php:97-104` always passes scalar values, so this is OK.
+
+## UI and onboarding
+
+### 29. Onboarding (Spec §19) is not implemented
+No onboarding controller, no routes, no Inertia pages. The spec calls for a guided flow: organisation → provider → source → deploy key → registry → server → app/env → attachments. Currently users must navigate disjoint pages manually.
+**Fix:** Add an `OnboardingController` with a state machine on `Organisation` (or session-based progress) plus an Inertia wizard.
+
+### 30. Service detail/edit pages are missing
+`resources/js/pages/services/` only contains `updates/Create.vue`. There's no Index/Show/Edit. Spec §20 Phase 6 calls for "services under an environment with sensible defaults".
+**Fix:** Add service Show/Edit pages, including replica health, slices, and one-click update.
+
+### 31. No "managed attachment" guided flow
+`resources/js/pages/environment-attachments/Create.vue` exposes a raw service list. Spec §12 + §20 call for managed flows for Postgres / Valkey / Caddy with auto-defaulted slices.
+**Fix:** Build a guided picker per role (database / cache / queue / storage / gateway) that filters services to compatible types and previews the generated slice + env vars.
+
+### 32. Deploy policies are exposed in the UI instead of hidden by default
+Spec §20 Phase 6: "Hide deploy policies by default." Currently they are set and exposed via `Service` form requests (`StoreServiceRequest`), and nothing in the UI hides them.
+**Fix:** Don't expose `deploy_policy` in service create/edit UI; rely on driver-provided defaults from spec §2.
+
+### 33. `resources/js/pages/applications/Show.vue` has a stale stub comment
+Line 72: `<!-- Add instance button would go here -->` — references the removed `Instance` model.
+**Fix:** Remove the stale comment; add the actual "New environment" button.
+
+### 34. `resources/js/pages/servers/Index.vue` contains a `@todo pagination` literal
+Ship-ready code shouldn't have TODOs visible to the user.
+**Fix:** Either implement pagination or remove the comment.
+
+### 35. No UI surfaces variable source / overridable
+`EnvironmentVariableController::store` hardcodes `source=USER`. The UI in `environment-variables/Create.vue` provides no way to tell managed variables from user variables, and spec §13 requires the source/overridable badge.
+**Fix:** Read variables on the environment Show page grouped by source; for `managed_attachment` rows, show a "managed by Postgres slice X" badge and disable editing unless `overridable`.
+
+### 36. No environment Show page; deploys/migrations etc. are triggered from `applications/Show.vue` per-environment row
+This works but spec §20 Phase 6 wants environments to be the primary surface. Currently there's no environment detail page where services, replicas, slices, attachments, env vars, and operations are visible together.
+**Fix:** Add `environments/Show.vue` and route `applications/Show.vue`'s environment row to it.
+
+## Tests — coverage gaps
+
+### 37. `EnvironmentDeploymentControllerTest` only asserts dispatch
+Short test, no state assertions after run.
+**Fix:** Replace `Bus::fake()` with running the job inline; assert that operations, steps, replica records, compose files, and env vars are all in expected state.
+
+### 38. No test asserts deploy key cleanup after build
+`BuildApplicationArtifact.php:100-101` adds the cleanup trap. No test verifies the trap actually runs or that the key file is gone.
+**Fix:** In `BuildApplicationArtifactTest`, fake the remote runner and assert the build script contains `trap cleanup EXIT` AND that the operation_dir path resolves under `/home/keystone/operations/`.
+
+### 39. No test asserts the named volume naming convention
+Spec §10 names volumes `keystone_service_<id>_postgres_data`. No test in `ComposeRendererTest` checks volume name format.
+**Fix:** Snapshot-test or regex-assert the volume name pattern in `ComposeRendererTest`.
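+
+A sketch of the regex assertion, assuming `<id>` is numeric and the suffix is lowercase snake case. Only the Postgres example is given in the spec, so the general pattern and the helper name are assumptions:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical helper: checks a volume name against the assumed pattern
+ * keystone_service_<numeric id>_<lowercase suffix>.
+ */
+function isKeystoneVolumeName(string $name): bool
+{
+    return preg_match('/^keystone_service_\d+_[a-z0-9_]+$/', $name) === 1;
+}
+```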
+
+### 40. No test for parent-child operation chain executing end-to-end
+`DeployEnvironmentJobTest` creates operations but never runs the resulting `RunStep` jobs through `Queue::handleAll()`-style flow.
+**Fix:** Add an integration test that fakes the SSH layer (return canned success) and lets the chain run, asserting each operation transitions to `COMPLETED` in the correct order.
+
+### 41. No test for cancellation cascade on failure
+None of the test files exercise `RunStep::failed` with sibling/child cancellation expectations (because that behavior doesn't exist yet — see gap #2).
+**Fix:** Add a test that fails a step mid-chain and asserts all later operations are `CANCELLED`.
+
+### 42. No test for stateful update flow against a Postgres service
+`StatefulServiceUpdateTest` likely asserts only operation/step rows. It also needs to assert on the rendered scripts: the compose path is correct, the named volume is preserved, and the new digest is written.
+**Fix:** Strengthen assertions to validate the full step contents.
+
+### 43. No test for multi-server build/push/pull
+`BuildArtifactPlanningTest` checks `requiresRegistry=true` but no test asserts a `docker push` and per-target `docker pull` actually occur.
+**Fix:** Add a job-level test with two-server topology and assert each target's deploy script includes a `docker pull <ref>@<digest>` before `compose up`.
+
+### 44. No test that `manual`/`disabled` migration modes are honored
+**Fix:** Parametrized test asserting `migrationScript` returns `'true'` for `disabled` and `manual` modes, and the real command for `auto`.
+
+### 45. No test for scheduler enforcement per replica
+**Fix:** Test that for `scheduler_mode=single`, only one replica's rendered env has `AUTORUN_LARAVEL_SCHEDULER=true`; for `every_replica`, all do.
+
+### 46. No test that managed attachment auto-creates slices for Valkey + Caddy
+`ManagedAttachmentTest` likely tests Postgres only.
+**Fix:** Extend dataset to cover Valkey logical_database and Caddy route slices, with their env exports.
+
+---
+
+## Suggested ordering
+
+1. Fix the orchestration bugs (#1-#5) — without these the chain doesn't reliably reach completion.
+2. Fix the Caddy cutover (#3, #4, #7, #19) — without these no environment can serve traffic.
+3. Fix Postgres slice provision admin user (#9) and stateful update scripts (#10, #11).
+4. Implement migration timing/mode (#16, #17, #18).
+5. Implement scheduler enforcement (#24) and multi-server placement + pull (#21, #22, #23).
+6. Then UI: onboarding (#29), environment Show page (#36), managed attachment UI (#31), variable source display (#35).
+7. Strengthen tests last (#37-#46) once the orchestrator and drivers are stable.
+
+Most of the schema and the high-level structure are correct — the gap is between the data model and the runtime behavior that's supposed to enforce/realize it.
--- a/docs/implementation-spec.md
+++ b/docs/implementation-spec.md
@@ -0,0 +1,726 @@
+# Keystone Implementation Spec
+
+## 1. Product Scope
+
+Keystone is a Laravel Forge-like deployment platform that runs applications and services with Docker. The v1 product is intentionally narrow:
+
+- Laravel is the only first-class application framework.
+- Application containers use a Keystone-managed Dockerfile based on `serversideup/php` with FrankenPHP.
+- Services are explicitly coded drivers, not arbitrary Docker images.
+- v1 is agentless and executes operations over SSH.
+- Docker Compose is used as the generated runtime artifact.
+- Caddy 2 is the default and only gateway for v1.
+- The Keystone database is the source of truth. Server files are generated artifacts.
+
+V1 should make the simple path robust before adding generic Docker support, distributed agents, HA databases, edge routing, or additional frameworks.
+
+## 2. Core Domain Model
+
+### Organisation
+
+Owns users, providers, registries, applications, servers, services, and environments.
+
+### Application
+
+A source-code project. In v1, first-class applications are Laravel repositories.
+
+Recommended fields:
+
+- `organisation_id`
+- `name`
+- `repository_url`
+- `repository_type`
+- `default_branch`
+- `deploy_key_public`
+- `deploy_key_private` encrypted
+- `deploy_key_fingerprint`
+- `deploy_key_installed_at` nullable
+
+### Environment
+
+The primary application deployment unit. An application has environments such as production, staging, or dev.
+
+Recommended fields:
+
+- `application_id`
+- `name`
+- `branch`
+- `status`
+- `scheduler_enabled`
+- `scheduler_target_service_id` nullable
+- `scheduler_mode`: `single` or `every_replica`
+- `build_config` json
+
+Default for Laravel environments:
+
+- Scheduler enabled.
+- Scheduler target is the primary web service.
+- Scheduler mode is `single`.
+
+### Service
+
+Every deployable thing is represented as a `Service`.
+
+Examples:
+
+- Laravel web runtime
+- Laravel worker runtime
+- Laravel websocket runtime
+- Caddy gateway
+- Postgres
+- Valkey
+- Future standalone services
+
+Recommended fields:
+
+- `organisation_id`
+- `environment_id` nullable
+- `server_id` nullable, kept only as a single-placement legacy convenience; long term, placement lives on replicas
+- `name`
+- `category`
+- `type`
+- `version_track`
+- `driver_name`
+- `status`
+- `desired_replicas`
+- `desired_revision`
+- `deploy_policy`
+- `process_roles` json
+- `current_image_digest` nullable
+- `available_image_digest` nullable
+- `update_status`
+- `default_cpu_limit` nullable
+- `default_memory_limit_mb` nullable
+- `config` json
+
+Deploy policy defaults:
+
+- Laravel web: `with_environment`
+- Laravel worker: `with_environment`
+- Laravel websocket: `with_environment`
+- Database/cache/storage: `dependency_only`
+- Gateway: `manual_or_on_route_change`
+- Standalone services: `manual`
+
+The user should not need to configure these defaults during normal setup.
+
+### ServiceReplica
+
+A running instance of a service on a server. A service is logical; a replica is runtime placement.
+
+Recommended fields:
+
+- `service_id`
+- `server_id`
+- `operation_id` nullable
+- `container_name`
+- `container_id` nullable
+- `image_digest`
+- `internal_host`
+- `internal_port`
+- `public_port` nullable
+- `status`
+- `health_status`
+- `cpu_limit` nullable
+- `memory_limit_mb` nullable
+- `config` json
+
+Replica resource limits override service defaults. Null means unrestricted except host capacity.
+
+### ServiceSlice
+
+A logical sub-resource inside a service. Slices belong to `Service`, not `ServiceReplica`.
+
+Examples:
+
+- Database and user inside Postgres
+- Logical database or namespace inside Valkey
+- Route inside Caddy
+- Future bucket, topic, vhost, etc.
+
+Recommended fields:
+
+- `service_id`
+- `environment_id` nullable
+- `name`
+- `type`
+- `status`
+- `config` json
+- `credentials` encrypted json nullable
+
+Slices are not containers and should not be used for scaling. They are stable logical resources that survive service replica replacement.
+
+### EnvironmentAttachment
+
+Connects an environment to managed service slices.
+
+Recommended fields:
+
+- `environment_id`
+- `service_id`
+- `service_slice_id` nullable
+- `role`: `database`, `cache`, `queue`, `storage`, `gateway`, `custom`
+- `env_prefix` nullable
+- `is_primary`
+
+Attachments should point to slices whenever a slice exists. For example, a Laravel environment attaches to a Postgres database/user slice, not merely to the Postgres service.
+
+### EnvironmentVariable
+
+Represents user-defined and Keystone-managed runtime environment values.
+
+Recommended fields:
+
+- `environment_id`
+- `key`
+- `value` encrypted
+- `source`: `user`, `managed_attachment`, `system`
+- `service_slice_id` nullable
+- `overridable` boolean
+
+Managed values should be regenerated from attachments and slices.
+
+## 3. Operations Model
+
+Rename `Deployment` to `Operation`.
+
+An operation is the generic audit and execution object for all state-changing work.
+
+### Operation
+
+Recommended fields:
+
+- `id`
+- `parent_id` nullable
+- `hash`
+- `kind`
+- `target_type`
+- `target_id`
+- `status`
+- `started_at`
+- `finished_at`
+- timestamps
+
+Operation kinds:
+
+- `server_provision`
+- `service_deploy`
+- `replica_deploy`
+- `slice_provision`
+- `slice_configure`
+- `environment_deploy`
+- `gateway_cutover`
+- `config_change`
+- `credential_rotation`
+
+### OperationStep
+
+Rename `Step` to `OperationStep`.
+
+Recommended fields:
+
+- `operation_id`
+- `name`
+- `order`
+- `status`
+- `script`
+- `logs`
+- `error_logs`
+- `secrets` encrypted json nullable
+- `started_at`
+- `finished_at`
+- timestamps
+
+### Parent-Child Operations
+
+Environment deploys are parent operations that create child operations.
+
+Example:
+
+- `environment_deploy`
+  - child `service_deploy` for web
+    - child `replica_deploy` for each web replica
+  - child `slice_configure` for Caddy route updates
+  - child `gateway_cutover`
+
+Standalone service deploys and slice operations can also run independently.
+
+## 4. Server Provisioning
+
+V1 remains agentless over SSH.
+
+Provisioning flow:
+
+1. Create server through provider API.
+2. Wait for root SSH to become available.
+3. Execute provisioning script over SSH.
+4. Create Keystone management user.
+5. Install Docker Engine, Docker Compose plugin, UFW, fail2ban, and required runtime packages.
+6. Install Keystone SSH public key.
+7. Disable password login.
+8. Enable UFW with SSH open.
+9. Callback or SSH verification marks server active.
+
+Server permanent keys are for Keystone management only. Repository deploy keys must not be permanently installed on servers.
+
+## 5. Source Providers And Repository Access
+
+V1 source support:
+
+- Self-hosted Gitea
+- GitHub
+- Generic Git over SSH
+
+Repository access uses a Keystone-generated deploy key per application/repository.
+
+V1 flow:
+
+1. User enters repo SSH URL.
+2. Keystone generates an ed25519 deploy key.
+3. UI shows the public key.
+4. User adds it to Gitea/GitHub as read-only.
+5. Keystone verifies access with `git ls-remote`.
+
+During build operations, Keystone injects the private key (stored encrypted in the Keystone database) into a temporary operation directory and uses it via `GIT_SSH_COMMAND`. The key is removed after the build. Repo keys are never permanently stored on target servers or builder services.
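+
+A hedged sketch of what such a build step could look like, as a PHP helper that returns the shell script. It assumes the key has already been written to `<operation dir>/deploy_key` and that the directory path contains no spaces; the helper name, file names, and clone flags are hypothetical:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical helper: builds a clone script that uses a temporary deploy key.
+ * Assumes the key file already exists at "$operationDir/deploy_key" and that
+ * $operationDir contains no spaces.
+ */
+function cloneWithDeployKeyScript(string $operationDir, string $repoUrl, string $branch): string
+{
+    $keyPath = $operationDir . '/deploy_key';
+    $sshCommand = 'ssh -i ' . $keyPath . ' -o IdentitiesOnly=yes';
+
+    return implode("\n", [
+        'set -eu',
+        // The trap removes the key on success and on failure alike.
+        'cleanup() { rm -f ' . escapeshellarg($keyPath) . '; }',
+        'trap cleanup EXIT',
+        'chmod 600 ' . escapeshellarg($keyPath),
+        'GIT_SSH_COMMAND=' . escapeshellarg($sshCommand)
+            . ' git clone --depth 1 --branch ' . escapeshellarg($branch)
+            . ' ' . escapeshellarg($repoUrl)
+            . ' ' . escapeshellarg($operationDir . '/source'),
+    ]);
+}
+```
+
+Registering the trap before the clone is what keeps a failed or interrupted build from leaving the key behind.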
+
+## 6. Registry And Build Artifacts
+
+An external registry is required for multi-server application deployments.
+
+Single-server deployments may build and run a local image without a registry.
+
+Multi-server deployments must:
+
+1. Build once.
+2. Push the image to the configured external registry.
+3. Pull the exact same image digest on each target server.
+
+Supported registry types:
+
+- Generic Docker registry
+- Gitea registry
+- GHCR
+- Docker Hub
+
+### Build Service
+
+Building is a service capability, not a server type.
+
+A dedicated builder is represented as a `Service` with category `builder`. If no builder service exists, Keystone may build on the target server for single-server deployments.
+
+Build strategies:
+
+- `target_server`: build on selected target server. Valid for single-server.
+- `dedicated_builder`: build on builder service, then push/export artifact.
+- `external_registry`: pull prebuilt image from registry.
+
+For v1:
+
+- Single-server default: build on target server.
+- Multi-server: require configured registry and build once.
+- Do not rebuild independently on each server.
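+
+The rules above can be sketched as a pre-flight check that returns blockers. The function name, parameter names, and message strings are hypothetical, and treating a missing builder as a blocker for `dedicated_builder` is an interpretation rather than something this section states outright:
+
+```php
+<?php
+declare(strict_types=1);
+
+/**
+ * Hypothetical sketch: returns reasons a build strategy cannot be used.
+ * An empty array means the strategy is allowed for this topology.
+ */
+function buildStrategyBlockers(string $strategy, int $targetServers, bool $hasRegistry, bool $hasBuilder): array
+{
+    $blockers = [];
+
+    if ($targetServers > 1 && !$hasRegistry) {
+        $blockers[] = 'Multi-server deployments require an external registry.';
+    }
+    if ($strategy === 'target_server' && $targetServers > 1) {
+        $blockers[] = 'target_server builds are only valid for single-server deployments.';
+    }
+    if ($strategy === 'dedicated_builder' && !$hasBuilder) {
+        // Interpretation: a dedicated build needs a builder service to run on.
+        $blockers[] = 'dedicated_builder requires a builder service.';
+    }
+    if ($strategy === 'external_registry' && !$hasRegistry) {
+        $blockers[] = 'external_registry requires a configured registry.';
+    }
+
+    return $blockers;
+}
+```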
+
+### BuildArtifact
+
+Recommended fields:
+
+- `environment_id`
+- `commit_sha`
+- `image_tag`
+- `image_digest`
+- `registry_ref` nullable
+- `built_by_operation_id`
+- `built_by_service_id` nullable
+- `status`
+- `metadata` json
+
+## 7. Managed Laravel Runtime
+
+V1 uses Keystone-managed Dockerfile templates only. Custom Dockerfiles are deferred.
+
+Laravel runtime defaults:
+
+- Base: `serversideup/php` FrankenPHP image
+- PHP version configurable
+- Document root default: `public`
+- Health path default: `/up`, fallback `/`
+- Composer install with production defaults
+- JS build step configurable
+- Bun/Node strategy configurable
+
+The same build artifact is used by web, worker, and websocket services. Runtime services differ by entrypoint/command.
+
+Default topology:
+
+- One web service.
+- No worker service by default.
+- Scheduler enabled on the web service by default.
+- Dedicated worker service is recommended when queues are used, but created only when the user opts in.
+
+Worker options:
+
+- Dedicated worker service, recommended.
+- Embedded worker in web service, allowed for low-throughput apps but not recommended for production.
+- No workers, default.
+
+Keystone should warn when a deployed environment uses `QUEUE_CONNECTION=sync`, but it should not automatically create worker services.
+
+## 8. Scheduler Model
+
+Mirror Laravel Cloud's scheduler model.
+
+Scheduler is not a standalone service by default. It is a role/capability attached to a selected web or worker service.
+
+Defaults:
+
+- `scheduler_enabled`: true for Laravel templates.
+- `scheduler_target_service_id`: primary web service.
+- `scheduler_mode`: `single`.
+
+Runtime behavior:
+
+- `single`: run `schedule:run` every minute on exactly one selected replica.
+- `every_replica`: run on each replica. This is advanced and explicit.
+
+Keystone should enforce one scheduler runner per environment by default. Users may still use Laravel's `onOneServer()` for application-level safety.
+
+## 9. Service Drivers
+
+V1 services are explicitly coded drivers only. No arbitrary Docker image service in the v1 happy path.
+
+Driver contract should define:
+
+- service type and version track
+- default image policy
+- ports
+- volumes
+- environment schema
+- health checks
+- resource defaults
+- supported slice types
+- Compose rendering
+- operation steps
+- env var exports
+- firewall requirements
+- update behavior
+
+V1 driver list:
+
+- Caddy 2 gateway
+- Laravel managed runtime using `serversideup/php` FrankenPHP
+- Postgres 18
+- Valkey 8
+
+Use latest minor versions for new service deploy/update operations by resolving image tags to digests. Store the resolved digest on the operation/service/replica for reproducible rollbacks.
+
+Do not silently update managed service images. Show updates in the UI and require an explicit service update/redeploy operation.
+
+## 10. Persistent Storage
+
+Use named Docker volumes for persistent service-local data.
+
+Examples:
+
+- Postgres: `keystone_service_<id>_postgres_data`
+- Valkey: named volume when persistence is enabled
+- Caddy: named volumes for `/data` and `/config`
+
+Avoid distributed storage in v1. Moving a stateful service to another server requires an explicit migration operation.
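+
+A minimal Compose sketch of the Postgres naming scheme above, assuming a hypothetical service id of `42`. The `postgres:18` tag stands in for the resolved digest, and the mount path is an assumption to verify against the image's documentation:
+
+```yaml
+# Hypothetical fragment: service id 42 and the mount path are assumptions.
+services:
+  postgres:
+    image: postgres:18
+    volumes:
+      # Service-local alias mapped to the named volume declared below.
+      - postgres_data:/var/lib/postgresql
+
+volumes:
+  postgres_data:
+    # Explicit name so the volume survives container replacement
+    # and is not prefixed with the Compose project name.
+    name: keystone_service_42_postgres_data
+```
+
+Setting `name:` explicitly keeps the volume name deterministic even when each replica is rendered under a different Compose project name.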
+
+## 11. Stateful Service Updates
+
+V1 accepts downtime for single-node stateful updates.
+
+Postgres/Valkey update flow:
+
+1. User explicitly triggers update/redeploy.
+2. Keystone warns about downtime and data risk.
+3. Optional backup checkbox appears only if backup capability exists.
+4. Stop container.
+5. Preserve named volume.
+6. Start new container with updated image digest.
+7. Health check.
+8. Mark operation complete.
+
+Rolling stateful updates and HA clusters are v2.
+
+## 12. Slices And Attachments
+
+Attaching a managed service to an environment should create sensible default slices automatically.
+
+Postgres attachment:
+
+- Create database/user slice by default.
+- Generate credentials.
+- Wire `DB_*` environment variables.
+
+Valkey attachment:
+
+- Create/select logical slice if supported.
+- Wire `REDIS_*`.
+- Recommend `CACHE_STORE=redis`, `SESSION_DRIVER=redis`, or `QUEUE_CONNECTION=redis` depending on role.
+- Do not silently change queue behavior without confirmation.
+
+Caddy/domain attachment:
+
+- Create route slice.
+- Wire gateway route to environment web service.
+
+Advanced users can select existing slices or create slices manually from service detail pages.
+
+Slice operations should be independent from service container deployments. Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container.
+
+## 13. Environment Variables
+
+Keystone manages env vars from attachments and slices.
+
+Postgres slice should export:
+
+- `DB_CONNECTION=pgsql`
+- `DB_HOST`
+- `DB_PORT=5432`
+- `DB_DATABASE`
+- `DB_USERNAME`
+- `DB_PASSWORD`
+
+Valkey slice/service should export:
+
+- `REDIS_HOST`
+- `REDIS_PORT=6379`
+- optional `CACHE_STORE=redis`
+- optional `SESSION_DRIVER=redis`
+- optional `QUEUE_CONNECTION=redis`
+
+User-defined variables remain editable. Managed variables should show their source and whether they are overridable.
+
+## 14. Networking And Internal Aliases
+
+Support both same-server Docker networking and cross-server private networking.
+
+Routing preference:
+
+1. Same server: Docker network aliases/container DNS.
+2. Same provider private network: private IP and internal port.
+3. Public fallback only if explicitly allowed.
+
+V1 should not build distributed DNS. Use deterministic internal hostnames and generated env vars. Where Keystone controls Docker networks, use network aliases. For cross-server communication, inject private IP/port endpoints.
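+
+As a rough illustration of the same-server case, a service could join a shared Docker network under a deterministic alias. The network name, alias format, and the choice of a pre-created external network are assumptions, not part of this spec:
+
+```yaml
+# Hypothetical fragment: network name and alias format are assumptions.
+services:
+  web:
+    image: registry.example.com/app:abc123
+    networks:
+      keystone_env:
+        aliases:
+          # Deterministic internal hostname other containers can resolve.
+          - web.env-7.keystone.internal
+
+networks:
+  keystone_env:
+    # Assumed to be created by Keystone before any service is started,
+    # so separate Compose projects can share it.
+    name: keystone_env_7
+    external: true
+```
+
+An external network is used in this sketch because each service has its own Compose file; a network declared inside one project would otherwise be scoped to that project.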
+
+Future agent/DNS systems should be possible, but are out of scope for v1.
+
+Recommended endpoint model:
+
+- `service_id`
+- `service_replica_id` nullable
+- `scope`: `docker_network`, `private_network`, `public`
+- `hostname`
+- `ip_address` nullable
+- `port`
+- `priority`
+- `health_status`
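+
+For illustration, two endpoint records for the same replica might look like the following. All values are made up, and both the `health_status` vocabulary and the rule that a lower `priority` wins are assumptions:
+
+```yaml
+# Hypothetical records; Keystone stores these in the database, not in YAML.
+endpoints:
+  - service_id: 12
+    service_replica_id: 31
+    scope: docker_network      # preferred: same-server container DNS
+    hostname: postgres.service-12.internal
+    ip_address: null
+    port: 5432
+    priority: 10
+    health_status: healthy
+  - service_id: 12
+    service_replica_id: 31
+    scope: private_network     # fallback: provider private network
+    hostname: postgres.service-12.internal
+    ip_address: 10.0.0.5
+    port: 5432
+    priority: 20
+    health_status: healthy
+```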
+
+## 15. Gateway And Cutover
+
+There must be exactly one gateway service per server for v1.
+
+Caddy owns public ports `80` and `443`. Application runtime containers should bind only to internal Docker networks or assigned internal ports.
+
+Zero-downtime deployment happens at the gateway layer:
+
+1. Render/start new service replica with unique container/project name.
+2. Health check new replica.
+3. Update Caddy upstreams to include the new healthy replica.
+4. Reload Caddy.
+5. Drain/remove old replica from Caddy upstreams.
+6. Stop old container after the drain window.
+
+For same-server upstreams, Caddy can address replicas by Docker network alias or container name. For cross-server upstreams, Caddy uses the private IP and assigned internal port.
+
+Web services may span multiple servers in v1. Keystone provides load balancing through Caddy upstreams but does not optimize global latency or regional placement.
+
+A future v2 doctor page could flag:
+
+- cross-region upstreams
+- public-network fallbacks
+- missing workers for async queues
+- scheduler every-replica risks
+- inefficient database/cache placement
+
+## 16. Docker Compose Runtime
+
+Use generated Docker Compose files, not raw `docker run`, for v1 runtime management.
+
+Suggested server layout:
+
+- `/home/keystone/services/<service-id>/compose.yml`
+- `/home/keystone/services/<service-id>/.env`
+- `/home/keystone/gateway/Caddyfile`
+- `/home/keystone/operations/<operation-hash>/`
+
+Compose files are generated artifacts. The Keystone database is canonical.
+
+Compose should be used for:
+
+- container definitions
+- env files
+- named volumes
+- networks
+- health checks
+- restart policies
+- resource limits
+- labels
+
+Resource controls:
+
+- Use plain Docker runtime constraints such as `cpus`, `mem_limit`, and `memswap_limit`.
+- Avoid relying on Swarm-only `deploy.resources` semantics for v1.
+
+Example:
+
+```yaml
+services:
+  web:
+    image: registry.example.com/app:abc123
+    cpus: "1.0"
+    mem_limit: 1024m
+    memswap_limit: 1024m
+```
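+
+A slightly fuller sketch covering the other concerns listed above (env files, health checks, restart policies, labels). The health check command, port, path, and label keys are assumptions for illustration and depend on the runtime image:
+
+```yaml
+# Hypothetical fragment: health check target and label keys are assumptions.
+services:
+  web:
+    image: registry.example.com/app:abc123
+    env_file:
+      - .env
+    restart: unless-stopped
+    healthcheck:
+      # Assumes curl exists in the image and the app serves /up on port 8080.
+      test: ["CMD", "curl", "-f", "http://localhost:8080/up"]
+      interval: 10s
+      timeout: 3s
+      retries: 5
+      start_period: 20s
+    labels:
+      keystone.service_id: "42"
+      keystone.replica: "1"
+    cpus: "1.0"
+    mem_limit: 1024m
+    memswap_limit: 1024m
+```
+
+Setting `memswap_limit` equal to `mem_limit` prevents the container from using swap.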
+
+## 17. Environment Deployment Flow
+
+Environment deployment creates a parent `environment_deploy` operation.
+
+High-level flow:
+
+1. Resolve target commit.
+2. Create or reuse build artifact.
+3. Compute desired service changes.
+4. Include only services with `deploy_policy=with_environment` and changed revision/config.
+5. Check dependency-only services and attached slices.
+6. Run pre-switch service steps.
+7. Run application migrations according to service migration policy.
+8. Deploy new web/worker/websocket replicas.
+9. Health check new replicas.
+10. Update gateway routes.
+11. Reload Caddy.
+12. Drain and stop old replicas.
+13. Mark operation complete.
+
+Database/cache services attached to the environment are checked but not redeployed unless the user explicitly deploys or updates them.
+
+## 18. Migrations
+
+Database migrations are owned by the application runtime service deployment.
+
+Recommended fields on service config:
+
+- `migration_mode`: `auto`, `manual`, `disabled`
+- `migration_timing`: `pre_switch`, `post_switch`
+- `migration_command`: default `php artisan migrate --force`
+
+Default for Laravel web services:
+
+- `migration_mode=auto`
+- `migration_timing=pre_switch`
+- command `php artisan migrate --force`
+
+Manual mode should allow the user to run the migration operation explicitly.
+
+## 19. Onboarding
+
+Onboarding should guide users through:
+
+1. Organisation creation.
+2. Server provider setup, Hetzner first.
+3. Source provider/repository setup, including Gitea/GitHub/generic Git.
+4. Deploy key installation and verification.
+5. Registry setup. Optional for single-server, required for multi-server.
+6. Server creation/provisioning.
+7. Application/environment creation.
+8. Optional service attachments: Postgres, Valkey, domain/gateway.
+
+If an environment spans more than one server and no registry exists, deployment should be blocked with a registry setup prompt.
+
+## 20. Current Code Migration Plan
+
+The current code already has useful pieces:
+
+- Provider abstraction
+- Hetzner server creation
+- Server provisioning jobs
+- Service drivers
+- Polymorphic deployments
+- Step execution over SSH
+
+Refactor in phases.
+
+### Phase 1: Schema Alignment
+
+- Add `environments` table.
+- Rename `deployments` to `operations`.
+- Rename `steps` to `operation_steps`.
+- Add `operations.parent_id`.
+- Add `operations.kind`.
+- Add `service_replicas`.
+- Add `service_slices`.
+- Add `environment_attachments`.
+- Add `environment_variables`.
+- Add registry/source/build artifact tables.
+
+### Phase 2: Model Cleanup
+
+- Replace `Application::instances()` as the primary deployment path with `Application::environments()`.
+- Keep or migrate `Instance` into `ServiceReplica` depending on implementation cost.
+- Replace `Service::slices` references with real `ServiceSlice` relationship.
+- Replace `Deployment` references with `Operation`.
+- Replace deployment step jobs with operation step jobs.
+
+### Phase 3: Driver Contract
+
+- Define formal driver interfaces for service deployment, replica rendering, slices, health checks, and env exports.
+- Implement Caddy 2 driver.
+- Implement Postgres 18 driver with database/user slice provisioning.
+- Implement Valkey 8 driver.
+- Implement Laravel runtime driver/template.
+
+### Phase 4: Compose Renderer
+
+- Render Compose files from DB state.
+- Upload generated files over SSH.
+- Run `docker compose` operations.
+- Capture container IDs and health state into `ServiceReplica`.
+
+### Phase 5: Environment Deploy
+
+- Build application artifact.
+- Deploy web replicas.
+- Run migrations.
+- Health check.
+- Cut over Caddy.
+- Stop old replicas.
+
+### Phase 6: UI Simplification
+
+- Present environments as the primary application surface.
+- Present services under an environment with sensible defaults.
+- Hide deploy policies by default.
+- Provide one-click add worker.
+- Provide managed attachment flows for Postgres/Valkey/Caddy.
+
+## 21. Explicit V2 Deferrals
+
+Out of scope for v1:
+
+- Server agent.
+- Distributed internal DNS.
+- Edge routing or anycast.
+- Automatic regional topology optimization.
+- Custom Dockerfiles.
+- Arbitrary Docker image services.
+- Non-Laravel first-class app frameworks.
+- Managed Docker registry.
+- HA Postgres/Valkey.
+- Rolling stateful updates.
+- Distributed storage.
+- Full backup orchestration.
+- Automatic deploy key installation via Gitea/GitHub API.
+