Implement Keystone environment deployments

This commit is contained in:
2026-05-13 16:11:23 +01:00
parent 65d3142d03
commit aa680b25fd
175 changed files with 10258 additions and 740 deletions

View File

@@ -0,0 +1,220 @@
# Keystone Implementation Review — Gaps vs `docs/implementation-spec.md`
The schema/migrations/models are about 98% correct. The orchestration, drivers, UI, and tests have substantial gaps. Below are concrete, file-anchored issues to fix.
## Critical orchestration bugs
### 1. Operation parent/child hierarchy is flat — replicas are siblings of their service_deploy, not children
**Spec §3 example** nests `service_deploy → replica_deploy`. **`app/Jobs/Environments/DeployEnvironment.php:74-82`** creates both as siblings under the same `environment_deploy` parent. Replica_deploy siblings rely on `RunStep::dispatchNextSiblingOperation` (`app/Jobs/Services/RunStep.php:120-140`) ordering by `id` — fragile, and if `service_deploy` fails, replica operations are not cancelled.
**Fix:** Nest replica operations under their service's `service_deploy` operation (parent_id = service_deploy.id), and cascade-cancel children when a parent fails.
### 2. Failed operations don't cancel siblings or children
`RunStep::failed` (`app/Jobs/Services/RunStep.php:163-178`) only cancels the failed operation's remaining steps. Sibling/child operations under the same parent continue to dispatch via `dispatchNextSiblingOperation`. A failing service deploy will still trigger gateway cutover.
**Fix:** On step failure, mark parent + all descendant operations as `CANCELLED`/`FAILED`. Re-check during sibling dispatch.
### 3. Gateway cutover uses a hardcoded container name that doesn't exist
`DeployEnvironment.php:202` runs `docker exec keystone-caddy caddy reload ...`, but Caddy replicas are created with name `keystone-service-{service->id}-{N}` per `DeployEnvironment.php:155`. The `keystone-caddy` container is never created — cutover will always fail.
**Fix:** Look up the Caddy service's replica container name; or set a stable container_name in Caddy compose.
### 4. Gateway cutover is monolithic; no add-upstream/reload/drain sub-sequence
**Spec §15** requires: render new replica → health check → add new upstream → reload → drain old → stop old. `DeployEnvironment.php:201-206` does only `caddy reload && sleep 10 && stop draining`. There's no add-upstream step (Caddyfile is fully overwritten at `slice_configure` time), no real health check during cutover, and `sleep 10` is an arbitrary drain.
**Fix:** Split into separate steps with explicit ordering, and tie drain to active connections / Caddy upstream health.
### 5. `dispatchChildOperations` dispatches only the first child's first step
`DeployEnvironment.php:377-387` dispatches a single step. Continuation depends on `RunStep::dispatchNextSiblingOperation` chasing siblings by id. If a child operation has zero steps (e.g. `service_deploy` for a service whose driver returned no plan), the chain dies silently.
**Fix:** Make a single orchestrator (e.g. `DispatchOperationChain`) that knows how to walk the tree. Don't rely on implicit id-ordering between independently-created sibling operations.
## Driver contract & implementations
### 6. `Driver` base contract is anemic
**Spec §9** lists 13 required driver capabilities. `app/Drivers/Driver.php` only declares `__construct` and `getOperationPlan`. Image policy, ports, volumes, env schema, health checks, resource defaults, slice types, env exports, firewall, update behavior are scattered across drivers without contractual enforcement.
**Fix:** Define an interface with explicit methods (`type()`, `versionTrack()`, `defaultPorts()`, `firewallRules()`, `updateBehavior()`, etc.) and assert per-driver via tests.
### 7. `Caddy2Driver::buildCaddyfile()` reads an undefined field — dead code
`app/Drivers/Caddy/Caddy2Driver.php:46` references `$this->service->credentials['backend']`. Nothing ever sets this. The actual Caddyfile is rendered inline by `DeployEnvironment::configureCaddyRouteScript` (`app/Jobs/Environments/DeployEnvironment.php:321-335`).
**Fix:** Delete `buildCaddyfile()`, or move Caddyfile generation into the driver and remove the duplicate in `DeployEnvironment`.
### 8. Postgres18Driver has no slice provisioning in its operation plan
**Spec §12:** "Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container." `AttachManagedService::createSliceProvisionOperation` (`app/Actions/Environments/AttachManagedService.php:108-125`) hardcodes the SQL script outside the driver. Slice provisioning logic belongs in the driver so other Postgres versions can implement it.
**Fix:** Add `provisionSliceScript(ServiceSlice $slice): string` to the slice contract; have Postgres driver own the SQL.
### 9. Postgres provision script assumes a `keystone` admin user that is never created
`AttachManagedService.php:133` uses `($service->credentials ?? [])['user'] ?? 'keystone'`. The Postgres service is never seeded with admin credentials and the compose for Postgres never sets `POSTGRES_USER=keystone`. This will fail in production.
**Fix:** Establish admin credentials on Postgres service creation (write to `service->credentials`); pass via `POSTGRES_USER`/`POSTGRES_PASSWORD` in compose env.
### 10. Stateful update steps are placeholder strings that won't actually run
`app/Actions/Services/CreateStatefulServiceUpdateOperation.php:38-42`:
- `'docker compose down'` — no `-f path` so it runs in the SSH user's home directory.
- `'docker volume ls'` — listing isn't "preserving"; it's a no-op.
- `'docker compose up -d'` — no `-f path`, no image digest, no env update.
- `'docker compose ps'` — not a real health check.
Spec §11 specifies a real sequence: stop → preserve named volume → start new with updated digest → health check.
**Fix:** Build steps from the driver against the service's actual compose path; verify the named volume exists before/after; replace healthcheck stub with `docker inspect --format '{{.State.Health.Status}}'` polling.
### 11. Stateful update doesn't write the updated digest into compose before restart
The operation sets `available_image_digest` on the service (line 55) but the compose file on disk is not re-rendered. `docker compose up -d` will pull from whatever digest is currently in the compose, not the new one.
**Fix:** Insert a "Render compose with new digest" step before the start step.
### 12. Valkey driver doesn't emit role-based env vars
`app/Drivers/Valkey/Valkey8Driver.php:42-48` only emits `REDIS_HOST` and `REDIS_PORT`. **Spec §13** explicitly recommends `CACHE_STORE=redis`, `SESSION_DRIVER=redis`, `QUEUE_CONNECTION=redis` based on attachment role.
**Fix:** Read the `EnvironmentAttachment.role` and add the appropriate Laravel env defaults (with the "Do not silently change queue behavior without confirmation" guard from §12).
### 13. Valkey has no logical-DB isolation
`AttachManagedService.php:64-71` creates a `logical_database` slice but never assigns a Redis database index (`REDIS_DB`). All environments attached to the same Valkey service share DB 0.
**Fix:** Assign `REDIS_DB` per slice; include in `environmentExportsForSlice`.
### 14. Caddy and Valkey slices have no `SLICE_PROVISION` operation
`AttachManagedService::createSliceProvisionOperation` (line 110) early-returns unless `service->type === POSTGRES`. Caddy routes and Valkey logical DBs are created in the DB but never reconciled to the running service.
**Fix:** Emit slice operations for all service types whose driver supports slices.
### 15. Postgres driver doesn't export DB_* at the service level
`Postgres18Driver::environmentExports()` returns `[]`. That's fine, but only slice exports work, so if a service has no slice but a Laravel app references DB_HOST, nothing wires it.
**Fix:** Either guarantee a slice always exists for Postgres attachments, or emit DB_HOST at the service level via the attachment.
## Deployment flow
### 16. Migration timing is hardcoded to pre_switch
`DeployEnvironment::serviceDeployScripts` (`app/Jobs/Environments/DeployEnvironment.php:236-238`) always emits the migration step before "Deploy replicas". **Spec §18** lists `migration_timing: pre_switch | post_switch` on service config — never read.
**Fix:** Check `$service->config['migration_timing']` and either emit the migration step before replicas or after the gateway cutover.
### 17. Migration mode `manual` is ignored
`migrationScript` (line 292-301) only short-circuits when `migration_mode=disabled`. `manual` still auto-runs `php artisan migrate --force` during environment deploy. Spec §18 says manual mode should not run automatically.
**Fix:** Treat `manual` the same as `disabled` for environment deploys; only the dedicated `environment-migrations.store` controller should run it.
### 18. Two parallel migration code paths
`DeployEnvironment::migrationScript` (line 292), `LaravelRuntimeDriver::getOperationPlan`, and `EnvironmentMigrationController` all emit migration scripts independently. They will drift.
**Fix:** Centralize in one place (driver method or dedicated action) and call from all three.
### 19. "Update gateway routes" step is a no-op
`DeployEnvironment.php:248-250`:
```
'script' => 'test -f /home/keystone/gateway/Caddyfile',
```
This just checks file presence. The actual route update is in a separate `slice_configure` operation, so this step is dead code in the service-deploy chain.
**Fix:** Either remove the step or have it actually trigger the route update for this service.
### 20. Pre-switch service steps from spec §17 step 6 are missing entirely
No "pre-switch" hooks are emitted by `DeployEnvironment` or any driver.
**Fix:** Add a `preSwitchSteps(): array` driver capability and call it before the migration/replica steps.
### 21. Multi-server replica placement is not implemented
`DeployEnvironment::ensureServiceReplicas` (line 147-171) always assigns `server_id = $service->server_id`. There is no way to place replicas across multiple servers — yet the registry-required check at line 39-41 assumes multi-server deployments exist.
**Fix:** Either ship single-server-only v1 and remove the multi-server gate, or wire `Service.process_roles`/placement policy to multiple servers in `ensureServiceReplicas`.
### 22. No explicit `docker pull <ref>@<digest>` on target servers for multi-server
`replicaDeployScripts` (line 261-290) does `docker compose up -d` only; for multi-server, target servers need to pull from the registry by digest. There is no pull step.
**Fix:** Add `docker pull` step using `registry_ref` + digest before the `up -d` step on each target server.
### 23. Build strategy `dedicated_builder` and `external_registry` are not enforced
`BuildApplicationArtifact::execute` (`app/Actions/Environments/BuildApplicationArtifact.php:30`) picks any server via `buildServer()`. The push is only conditionally added when strategy === `EXTERNAL_REGISTRY` (line 89). There's no enforcement that `dedicated_builder` requires a builder service to exist, nor that `external_registry` skips local build entirely.
**Fix:** Branch on strategy at the top of `execute()`; for `external_registry`, skip the build and resolve the digest from the registry (`docker manifest inspect`); for `dedicated_builder`, fail if no builder service is provisioned.
### 24. Scheduler placement is not enforced at runtime
**Spec §8:** `single` mode runs `schedule:run` on exactly one replica; `every_replica` runs on all. `LaravelRuntimeDriver` sets `AUTORUN_LARAVEL_SCHEDULER=true` based on the service's `process_roles`, but nothing applies the env per-replica based on `scheduler_target_service_id`/`scheduler_mode`. All replicas of the target service end up with the env var.
**Fix:** When generating replica config, only emit `AUTORUN_LARAVEL_SCHEDULER=true` on the elected replica when `scheduler_mode=single`. The existing `PlanEnvironmentDeployment::blockers` (`app/Actions/Environments/PlanEnvironmentDeployment.php:79-87`) is a pre-flight check, not an enforcement.
### 25. Compose file is generated via shell heredoc instead of a real upload
`composeUploadScript` (line 303-319) inlines the compose body in a `cat <<'KEYSTONE_COMPOSE'` heredoc. Any quoting issue (binary, single-quote in env, large file) breaks. Spec §16 implies generated artifacts should be transferred via SSH/SCP.
**Fix:** Use SCP (or `ssh ... 'cat > path'` with binary-safe encoding) instead of heredoc. Also drop separate generation of `.env` files — currently only the compose is uploaded; `.env` references in compose will 404 on disk.
### 26. `.env` files never written to disk
**Spec §16** layout includes `/home/keystone/services/<service-id>/.env`. `composeUploadScript` writes only `compose.yml`. If the compose `env_file: .env` directive is rendered, the deploy will fail.
**Fix:** Render and upload `.env` alongside `compose.yml`.
## Models, schema, and encryption
### 27. Service.credentials migration column is plaintext but model casts as `encrypted:array`
Migration `database/migrations/2025_03_27_121050_create_services_table.php:35` declares `text('credentials')`, while `app/Models/Service.php:36` casts it as `encrypted:array`. The cast works (Laravel encrypts on write) — but a `text` column without explicit `nullable()`/encoding intent is confusing. Confirm cast actually encrypts (it does for `encrypted:*` casts) and document.
**Fix:** Add `->nullable()` and a comment in the migration noting the field is encrypted at the model layer.
### 28. `EnvironmentVariable.value` cast missing array-vs-string handling
Spec doesn't require it, but worth flagging: the cast is `encrypted` (scalar), which means complex values (JSON-encoded secrets) need explicit json_encode by callers. Currently `AttachManagedService.php:97-104` always passes scalar values, so this is OK.
## UI and onboarding
### 29. Onboarding (Spec §19) is not implemented
No onboarding controller, no routes, no Inertia pages. The spec calls for a guided flow: organisation → provider → source → deploy key → registry → server → app/env → attachments. Currently users must navigate disjoint pages manually.
**Fix:** Add an `OnboardingController` with a state machine on `Organisation` (or session-based progress) plus an Inertia wizard.
### 30. Service detail/edit pages are missing
`resources/js/pages/services/` only contains `updates/Create.vue`. There's no Index/Show/Edit. Spec §20 Phase 6 calls for "services under an environment with sensible defaults".
**Fix:** Add service Show/Edit pages, including replica health, slices, and one-click update.
### 31. No "managed attachment" guided flow
`resources/js/pages/environment-attachments/Create.vue` exposes a raw service list. Spec §12 + §20 call for managed flows for Postgres / Valkey / Caddy with auto-defaulted slices.
**Fix:** Build a guided picker per role (database / cache / queue / storage / gateway) that filters services to compatible types and previews the generated slice + env vars.
### 32. Deploy policies are visible / no defaults hiding
Spec §20 Phase 6: "Hide deploy policies by default." Currently they're set/exposed via `Service` form requests (`StoreServiceRequest`). No UI hiding.
**Fix:** Don't expose `deploy_policy` in service create/edit UI; rely on driver-provided defaults from spec §2.
### 33. `resources/js/pages/applications/Show.vue` has a stale stub comment
Line 72: `<!-- Add instance button would go here -->` — references the removed `Instance` model.
**Fix:** Remove the stale comment; add the actual "New environment" button.
### 34. `resources/js/pages/servers/Index.vue` contains a `@todo pagination` literal
Ship-ready code shouldn't have TODOs visible to the user.
**Fix:** Either implement pagination or remove the comment.
### 35. No UI surfaces variable source / overridable
`EnvironmentVariableController::store` hardcodes `source=USER`. The UI in `environment-variables/Create.vue` provides no way to see managed vs. user vars, and the spec §13 requires the source/overridable badge.
**Fix:** Read variables on the environment Show page grouped by source; for `managed_attachment` rows, show a "managed by Postgres slice X" badge and disable editing unless `overridable`.
### 36. No environment Show page; deploys/migrations etc. are triggered from `applications/Show.vue` per-environment row
This works but spec §20 Phase 6 wants environments to be the primary surface. Currently there's no environment detail page where services, replicas, slices, attachments, env vars, and operations are visible together.
**Fix:** Add `environments/Show.vue` and route `applications/Show.vue`'s environment row to it.
## Tests — coverage gaps
### 37. `EnvironmentDeploymentControllerTest` only asserts dispatch
Short test, no state assertions after run.
**Fix:** Replace `Bus::fake()` with running the job inline; assert that operations, steps, replica records, compose files, and env vars are all in expected state.
### 38. No test asserts deploy key cleanup after build
`BuildApplicationArtifact.php:100-101` adds the cleanup trap. No test verifies the trap actually runs or that the key file is gone.
**Fix:** In `BuildApplicationArtifactTest`, fake the remote runner and assert the build script contains `trap cleanup EXIT` AND that the operation_dir path resolves under `/home/keystone/operations/`.
### 39. No test asserts the named volume naming convention
Spec §10 names volumes `keystone_service_<id>_postgres_data`. No test in `ComposeRendererTest` checks volume name format.
**Fix:** Snapshot-test or regex-assert the volume name pattern in `ComposeRendererTest`.
### 40. No test for parent-child operation chain executing end-to-end
`DeployEnvironmentJobTest` creates operations but never runs the resulting `RunStep` jobs through `Queue::handleAll()`-style flow.
**Fix:** Add an integration test that fakes the SSH layer (return canned success) and lets the chain run, asserting each operation transitions to `COMPLETED` in the correct order.
### 41. No test for cancellation cascade on failure
None of the test files exercise `RunStep::failed` with sibling/child cancellation expectations (because that behavior doesn't exist yet — see gap #2).
**Fix:** Add a test that fails a step mid-chain and asserts all later operations are `CANCELLED`.
### 42. No test for stateful update flow against a Postgres service
`StatefulServiceUpdateTest` likely asserts only operation/step rows. Need: rendered script asserts compose path is correct, named volume is preserved, new digest is written.
**Fix:** Strengthen assertions to validate the full step contents.
### 43. No test for multi-server build/push/pull
`BuildArtifactPlanningTest` checks `requiresRegistry=true` but no test asserts a `docker push` and per-target `docker pull` actually occur.
**Fix:** Add a job-level test with two-server topology and assert each target's deploy script includes a `docker pull <ref>@<digest>` before `compose up`.
### 44. No test that `manual`/`disabled` migration modes are honored
**Fix:** Parametrized test asserting `migrationScript` returns `'true'` for `disabled` and `manual` modes, and the real command for `auto`.
### 45. No test for scheduler enforcement per replica
**Fix:** Test that for `scheduler_mode=single`, only one replica's rendered env has `AUTORUN_LARAVEL_SCHEDULER=true`; for `every_replica`, all do.
### 46. No test that managed attachment auto-creates slices for Valkey + Caddy
`ManagedAttachmentTest` likely tests Postgres only.
**Fix:** Extend dataset to cover Valkey logical_database and Caddy route slices, with their env exports.
---
## Suggested ordering
1. Fix the orchestration bugs (#1-#5) — without these the chain doesn't reliably reach completion.
2. Fix the Caddy cutover (#3, #4, #7, #19) — without these no environment can serve traffic.
3. Fix Postgres slice provision admin user (#9) and stateful update scripts (#10, #11).
4. Implement migration timing/mode (#16, #17, #18).
5. Implement scheduler enforcement (#24) and multi-server placement + pull (#21, #22, #23).
6. Then UI: onboarding (#29), environment Show page (#36), managed attachment UI (#31), variable source display (#35).
7. Strengthen tests last (#37-#46) once the orchestrator and drivers are stable.
Most of the schema and the high-level structure are correct — the gap is between the data model and the runtime behavior that's supposed to enforce/realize it.

726
docs/implementation-spec.md Normal file
View File

@@ -0,0 +1,726 @@
# Keystone Implementation Spec
## 1. Product Scope
Keystone is a Laravel Forge-like deployment platform that runs applications and services with Docker. The v1 product is intentionally narrow:
- Laravel is the only first-class application framework.
- Application containers use a Keystone-managed Dockerfile based on `serversideup/php` with FrankenPHP.
- Services are explicitly coded drivers, not arbitrary Docker images.
- v1 is agentless and executes operations over SSH.
- Docker Compose is used as the generated runtime artifact.
- Caddy 2 is the default and only gateway for v1.
- The Keystone database is the source of truth. Server files are generated artifacts.
V1 should make the simple path robust before adding generic Docker support, distributed agents, HA databases, edge routing, or additional frameworks.
## 2. Core Domain Model
### Organisation
Owns users, providers, registries, applications, servers, services, and environments.
### Application
A source-code project. In v1, first-class applications are Laravel repositories.
Recommended fields:
- `organisation_id`
- `name`
- `repository_url`
- `repository_type`
- `default_branch`
- `deploy_key_public`
- `deploy_key_private` encrypted
- `deploy_key_fingerprint`
- `deploy_key_installed_at` nullable
### Environment
The primary application deployment unit. An application has environments such as production, staging, or dev.
Recommended fields:
- `application_id`
- `name`
- `branch`
- `status`
- `scheduler_enabled`
- `scheduler_target_service_id` nullable
- `scheduler_mode`: `single` or `every_replica`
- `build_config` json
Default for Laravel environments:
- Scheduler enabled.
- Scheduler target is the primary web service.
- Scheduler mode is `single`.
### Service
Every deployable thing is represented as a `Service`.
Examples:
- Laravel web runtime
- Laravel worker runtime
- Laravel websocket runtime
- Caddy gateway
- Postgres
- Valkey
- Future standalone services
Recommended fields:
- `organisation_id`
- `environment_id` nullable
- `server_id` nullable for single-placement legacy convenience only; long term use replicas
- `name`
- `category`
- `type`
- `version_track`
- `driver_name`
- `status`
- `desired_replicas`
- `desired_revision`
- `deploy_policy`
- `process_roles` json
- `current_image_digest` nullable
- `available_image_digest` nullable
- `update_status`
- `default_cpu_limit` nullable
- `default_memory_limit_mb` nullable
- `config` json
Deploy policy defaults:
- Laravel web: `with_environment`
- Laravel worker: `with_environment`
- Laravel websocket: `with_environment`
- Database/cache/storage: `dependency_only`
- Gateway: `manual_or_on_route_change`
- Standalone services: `manual`
The user should not need to configure these defaults during normal setup.
### ServiceReplica
A running instance of a service on a server. A service is logical; a replica is runtime placement.
Recommended fields:
- `service_id`
- `server_id`
- `operation_id` nullable
- `container_name`
- `container_id` nullable
- `image_digest`
- `internal_host`
- `internal_port`
- `public_port` nullable
- `status`
- `health_status`
- `cpu_limit` nullable
- `memory_limit_mb` nullable
- `config` json
Replica resource limits override service defaults. Null means unrestricted except host capacity.
### ServiceSlice
A logical sub-resource inside a service. Slices belong to `Service`, not `ServiceReplica`.
Examples:
- Database and user inside Postgres
- Logical database or namespace inside Valkey
- Route inside Caddy
- Future bucket, topic, vhost, etc.
Recommended fields:
- `service_id`
- `environment_id` nullable
- `name`
- `type`
- `status`
- `config` json
- `credentials` encrypted json nullable
Slices are not containers and should not be used for scaling. They are stable logical resources that survive service replica replacement.
### EnvironmentAttachment
Connects an environment to managed service slices.
Recommended fields:
- `environment_id`
- `service_id`
- `service_slice_id` nullable
- `role`: `database`, `cache`, `queue`, `storage`, `gateway`, `custom`
- `env_prefix` nullable
- `is_primary`
Attachments should point to slices whenever a slice exists. For example, a Laravel environment attaches to a Postgres database/user slice, not merely to the Postgres service.
### EnvironmentVariable
Represents user-defined and Keystone-managed runtime environment values.
Recommended fields:
- `environment_id`
- `key`
- `value` encrypted
- `source`: `user`, `managed_attachment`, `system`
- `service_slice_id` nullable
- `overridable` boolean
Managed values should be regenerated from attachments and slices.
## 3. Operations Model
Rename `Deployment` to `Operation`.
An operation is the generic audit and execution object for all state-changing work.
### Operation
Recommended fields:
- `id`
- `parent_id` nullable
- `hash`
- `kind`
- `target_type`
- `target_id`
- `status`
- `started_at`
- `finished_at`
- timestamps
Operation kinds:
- `server_provision`
- `service_deploy`
- `replica_deploy`
- `slice_provision`
- `slice_configure`
- `environment_deploy`
- `gateway_cutover`
- `config_change`
- `credential_rotation`
### OperationStep
Rename `Step` to `OperationStep`.
Recommended fields:
- `operation_id`
- `name`
- `order`
- `status`
- `script`
- `logs`
- `error_logs`
- `secrets` encrypted json nullable
- `started_at`
- `finished_at`
- timestamps
### Parent-Child Operations
Environment deploys are parent operations that create child operations.
Example:
- `environment_deploy`
- child `service_deploy` for web
- child `replica_deploy` for each web replica
- child `slice_configure` for Caddy route updates
- child `gateway_cutover`
Standalone service deploys and slice operations can also run independently.
## 4. Server Provisioning
V1 remains agentless over SSH.
Provisioning flow:
1. Create server through provider API.
2. Wait for root SSH to become available.
3. Execute provisioning script over SSH.
4. Create Keystone management user.
5. Install Docker Engine, Docker Compose plugin, UFW, fail2ban, and required runtime packages.
6. Install Keystone SSH public key.
7. Disable password login.
8. Enable UFW with SSH open.
9. Callback or SSH verification marks server active.
Server permanent keys are for Keystone management only. Repository deploy keys must not be permanently installed on servers.
## 5. Source Providers And Repository Access
V1 source support:
- Self-hosted Gitea
- GitHub
- Generic Git over SSH
Repository access uses a Keystone-generated deploy key per application/repository.
V1 flow:
1. User enters repo SSH URL.
2. Keystone generates an ed25519 deploy key.
3. UI shows the public key.
4. User adds it to Gitea/GitHub as read-only.
5. Keystone verifies access with `git ls-remote`.
During build operations, Keystone injects the encrypted private key into a temporary operation directory and uses `GIT_SSH_COMMAND`. The key is removed after the build. Repo keys are never permanently stored on target servers or builder services.
## 6. Registry And Build Artifacts
An external registry is required for multi-server application deployments.
Single-server deployments may build and run a local image without a registry.
Multi-server deployments must:
1. Build once.
2. Push the image to the configured external registry.
3. Pull the exact same image digest on each target server.
Supported registry types:
- Generic Docker registry
- Gitea registry
- GHCR
- Docker Hub
### Build Service
Building is a service capability, not a server type.
A dedicated builder is represented as a `Service` with category `builder`. If no builder service exists, Keystone may build on the target server for single-server deployments.
Build strategies:
- `target_server`: build on selected target server. Valid for single-server.
- `dedicated_builder`: build on builder service, then push/export artifact.
- `external_registry`: pull prebuilt image from registry.
For v1:
- Single-server default: build on target server.
- Multi-server: require configured registry and build once.
- Do not rebuild independently on each server.
### BuildArtifact
Recommended fields:
- `environment_id`
- `commit_sha`
- `image_tag`
- `image_digest`
- `registry_ref` nullable
- `built_by_operation_id`
- `built_by_service_id` nullable
- `status`
- `metadata` json
## 7. Managed Laravel Runtime
V1 uses Keystone-managed Dockerfile templates only. Custom Dockerfiles are deferred.
Laravel runtime defaults:
- Base: `serversideup/php` FrankenPHP image
- PHP version configurable
- Document root default: `public`
- Health path default: `/up`, fallback `/`
- Composer install with production defaults
- JS build step configurable
- Bun/Node strategy configurable
The same build artifact is used by web, worker, and websocket services. Runtime services differ by entrypoint/command.
Default topology:
- One web service.
- No worker service by default.
- Scheduler enabled on the web service by default.
- Dedicated worker service is recommended when queues are used, but created only when the user opts in.
Worker options:
- Dedicated worker service, recommended.
- Embedded worker in web service, allowed for low-throughput apps but not recommended for production.
- No workers, default.
Keystone should warn against deployed environments using `QUEUE_CONNECTION=sync`, but it should not automatically create worker services.
## 8. Scheduler Model
Mirror Laravel Cloud's scheduler model.
Scheduler is not a standalone service by default. It is a role/capability attached to a selected web or worker service.
Defaults:
- `scheduler_enabled`: true for Laravel templates.
- `scheduler_target_service_id`: primary web service.
- `scheduler_mode`: `single`.
Runtime behavior:
- `single`: run `schedule:run` every minute on exactly one selected replica.
- `every_replica`: run on each replica. This is advanced and explicit.
Keystone should enforce one scheduler runner per environment by default. Users may still use Laravel's `onOneServer()` for application-level safety.
## 9. Service Drivers
V1 services are explicitly coded drivers only. No arbitrary Docker image service in the v1 happy path.
Driver contract should define:
- service type and version track
- default image policy
- ports
- volumes
- environment schema
- health checks
- resource defaults
- supported slice types
- Compose rendering
- operation steps
- env var exports
- firewall requirements
- update behavior
V1 driver list:
- Caddy 2 gateway
- Laravel managed runtime using `serversideup/php` FrankenPHP
- Postgres 18
- Valkey 8
Use latest minor versions for new service deploy/update operations by resolving image tags to digests. Store the resolved digest on the operation/service/replica for reproducible rollbacks.
Do not silently update managed service images. Show updates in the UI and require an explicit service update/redeploy operation.
## 10. Persistent Storage
Use named Docker volumes for persistent service-local data.
Examples:
- Postgres: `keystone_service_<id>_postgres_data`
- Valkey: named volume when persistence is enabled
- Caddy: named volumes for `/data` and `/config`
Avoid distributed storage in v1. Moving a stateful service to another server requires an explicit migration operation.
## 11. Stateful Service Updates
V1 accepts downtime for single-node stateful updates.
Postgres/Valkey update flow:
1. User explicitly triggers update/redeploy.
2. Keystone warns about downtime and data risk.
3. Optional backup checkbox appears only if backup capability exists.
4. Stop container.
5. Preserve named volume.
6. Start new container with updated image digest.
7. Health check.
8. Mark operation complete.
Rolling stateful updates and HA clusters are v2.
## 12. Slices And Attachments
Attaching a managed service to an environment should create sensible default slices automatically.
Postgres attachment:
- Create database/user slice by default.
- Generate credentials.
- Wire `DB_*` environment variables.
Valkey attachment:
- Create/select logical slice if supported.
- Wire `REDIS_*`.
- Recommend `CACHE_STORE=redis`, `SESSION_DRIVER=redis`, or `QUEUE_CONNECTION=redis` depending on role.
- Do not silently change queue behavior without confirmation.
Caddy/domain attachment:
- Create route slice.
- Wire gateway route to environment web service.
Advanced users can select existing slices or create slices manually from service detail pages.
Slice operations should be independent from service container deployments. Creating a Postgres database/user should run as a slice operation against an existing Postgres replica, not redeploy the Postgres container.
## 13. Environment Variables
Keystone manages env vars from attachments and slices.
Postgres slice should export:
- `DB_CONNECTION=pgsql`
- `DB_HOST`
- `DB_PORT=5432`
- `DB_DATABASE`
- `DB_USERNAME`
- `DB_PASSWORD`
Valkey slice/service should export:
- `REDIS_HOST`
- `REDIS_PORT=6379`
- optional `CACHE_STORE=redis`
- optional `SESSION_DRIVER=redis`
- optional `QUEUE_CONNECTION=redis`
User-defined variables remain editable. Managed variables should show their source and whether they are overridable.
## 14. Networking And Internal Aliases
Support both same-server Docker networking and cross-server private networking.
Routing preference:
1. Same server: Docker network aliases/container DNS.
2. Same provider private network: private IP and internal port.
3. Public fallback only if explicitly allowed.
V1 should not build distributed DNS. Use deterministic internal hostnames and generated env vars. Where Keystone controls Docker networks, use network aliases. For cross-server communication, inject private IP/port endpoints.
Future agent/DNS systems should be possible, but are out of scope for v1.
Recommended endpoint model:
- `service_id`
- `service_replica_id` nullable
- `scope`: `docker_network`, `private_network`, `public`
- `hostname`
- `ip_address` nullable
- `port`
- `priority`
- `health_status`
## 15. Gateway And Cutover
There must be exactly one gateway service per server for v1.
Caddy owns public ports `80` and `443`. Application runtime containers should bind only to internal Docker networks or assigned internal ports.
Zero-downtime deployment happens at the gateway layer:
1. Render/start new service replica with unique container/project name.
2. Health check new replica.
3. Update Caddy upstreams to include the new healthy replica.
4. Reload Caddy.
5. Drain/remove old replica from Caddy upstreams.
6. Stop old container after the drain window.
For same-server upstreams, Caddy can use Docker network names. For cross-server upstreams, Caddy uses private IP and assigned internal port.
Web services may span multiple servers in v1. Keystone provides load balancing through Caddy upstreams but does not optimize global latency or regional placement.
Future v2 doctor page can flag:
- cross-region upstreams
- public-network fallbacks
- missing workers for async queues
- scheduler every-replica risks
- inefficient database/cache placement
## 16. Docker Compose Runtime
Use generated Docker Compose files, not raw `docker run`, for v1 runtime management.
Suggested server layout:
- `/home/keystone/services/<service-id>/compose.yml`
- `/home/keystone/services/<service-id>/.env`
- `/home/keystone/gateway/Caddyfile`
- `/home/keystone/operations/<operation-hash>/`
Compose files are generated artifacts. The Keystone database is canonical.
Compose should be used for:
- container definitions
- env files
- named volumes
- networks
- health checks
- restart policies
- resource limits
- labels
Resource controls:
- Use plain Docker runtime constraints such as `cpus`, `mem_limit`, and `memswap_limit`.
- Avoid relying on Swarm-only `deploy.resources` semantics for v1.
Example:
```yaml
services:
web:
image: registry.example.com/app:abc123
cpus: "1.0"
mem_limit: 1024m
memswap_limit: 1024m
```
## 17. Environment Deployment Flow
Environment deployment creates a parent `environment_deploy` operation.
High-level flow:
1. Resolve target commit.
2. Create or reuse build artifact.
3. Compute desired service changes.
4. Include only services with `deploy_policy=with_environment` and changed revision/config.
5. Check dependency-only services and attached slices.
6. Run pre-switch service steps.
7. Run application migrations according to service migration policy.
8. Deploy new web/worker/websocket replicas.
9. Health check new replicas.
10. Update gateway routes.
11. Reload Caddy.
12. Drain and stop old replicas.
13. Mark operation complete.
Database/cache services attached to the environment are checked but not redeployed unless the user explicitly deploys or updates them.
## 18. Migrations
Database migrations are owned by the application runtime service deployment.
Recommended fields on service config:
- `migration_mode`: `auto`, `manual`, `disabled`
- `migration_timing`: `pre_switch`, `post_switch`
- `migration_command`: default `php artisan migrate --force`
Default for Laravel web services:
- `migration_mode=auto`
- `migration_timing=pre_switch`
- command `php artisan migrate --force`
Manual mode should allow the user to run migration operation explicitly.
## 19. Onboarding
Onboarding should guide users through:
1. Organisation creation.
2. Server provider setup, Hetzner first.
3. Source provider/repository setup, including Gitea/GitHub/generic Git.
4. Deploy key installation and verification.
5. Registry setup. Optional for single-server, required for multi-server.
6. Server creation/provisioning.
7. Application/environment creation.
8. Optional service attachments: Postgres, Valkey, domain/gateway.
If an environment spans more than one server and no registry exists, deployment should be blocked with a registry setup prompt.
## 20. Current Code Migration Plan
The current code already has useful pieces:
- Provider abstraction
- Hetzner server creation
- Server provisioning jobs
- Service drivers
- Polymorphic deployments
- Step execution over SSH
Refactor in phases.
### Phase 1: Schema Alignment
- Add `environments` table.
- Rename `deployments` to `operations`.
- Rename `steps` to `operation_steps`.
- Add `operations.parent_id`.
- Add `operations.kind`.
- Add `service_replicas`.
- Add `service_slices`.
- Add `environment_attachments`.
- Add `environment_variables`.
- Add registry/source/build artifact tables.
### Phase 2: Model Cleanup
- Replace `Application::instances()` as the primary deployment path with `Application::environments()`.
- Keep or migrate `Instance` into `ServiceReplica` depending on implementation cost.
- Replace `Service::slices` references with real `ServiceSlice` relationship.
- Replace `Deployment` references with `Operation`.
- Replace deployment step jobs with operation step jobs.
### Phase 3: Driver Contract
- Define formal driver interfaces for service deployment, replica rendering, slices, health checks, and env exports.
- Implement Caddy 2 driver.
- Implement Postgres 18 driver with database/user slice provisioning.
- Implement Valkey 8 driver.
- Implement Laravel runtime driver/template.
### Phase 4: Compose Renderer
- Render Compose files from DB state.
- Upload generated files over SSH.
- Run `docker compose` operations.
- Capture container IDs and health state into `ServiceReplica`.
### Phase 5: Environment Deploy
- Build application artifact.
- Deploy web replicas.
- Run migrations.
- Health check.
- Cut over Caddy.
- Stop old replicas.
### Phase 6: UI Simplification
- Present environments as the primary application surface.
- Present services under an environment with sensible defaults.
- Hide deploy policies by default.
- Provide one-click add worker.
- Provide managed attachment flows for Postgres/Valkey/Caddy.
## 21. Explicit V2 Deferrals
Out of scope for v1:
- Server agent.
- Distributed internal DNS.
- Edge routing or anycast.
- Automatic regional topology optimization.
- Custom Dockerfiles.
- Arbitrary Docker image services.
- Non-Laravel first-class app frameworks.
- Managed Docker registry.
- HA Postgres/Valkey.
- Rolling stateful updates.
- Distributed storage.
- Full backup orchestration.
- Automatic deploy key installation via Gitea/GitHub API.