---
status: in-progress
implemented_slice: app-side-operation-generation-maintenance-readiness
---

# Managed Registry Plan

## Implementation Status

Keystone now has app-side managed registry provisioning, auth, smoke-check, and pruning operations. The generated remote scripts install and run `registry:2` with local storage, htpasswd auth, deletion enabled, Caddy HTTPS proxy snippets, Docker auth setup, push/pull smoke checks, manifest deletion, registry garbage collection, and artifact pruning status transitions.

This document remains `in-progress` until the install flow runs these operations as part of first-run Keystone setup and the generated remote scripts have been validated against the supported production host layouts.

Keystone should be self-hosted first. A fresh install should include a working build and image pipeline without requiring the user to bring an external Docker registry, S3 bucket, or separate build server.

## Product Principles

- The Keystone control node is the default build node.
- Keystone provides a first-party managed Docker registry by default.
- The managed registry stores images on local disk first.
- The registry storage path must be configurable for mounted VPS volumes.
- Multi-server deployments using the managed registry require an HTTPS registry URL trusted by every build and runtime node.
- External registries, S3-backed storage, and dedicated build nodes are optional advanced features.
- Multi-server deployments should work out of the box after Keystone is installed.
- Registry credentials must not be persisted in operation scripts, logs, or UI-visible output.
- Old build artifacts should be pruned automatically, retaining the latest 3 successful artifacts per environment by default.
- Build and deploy should be separate phases, even when started by one user action.
- Users should be able to connect an existing Ubuntu server as a Keystone node without using a cloud provider integration.

## Default Self-Hosted Shape

When Keystone is installed on a server, that server becomes the control node. The install process should prepare:

- Keystone application services.
- Docker and Docker Compose.
- A managed `registry:2` service.
- Local registry storage.
- Generated registry credentials.
- A default build capability on the control node.

This is separate from server provisioning. Keystone needs two scripts/flows:

- `install-keystone.sh` installs Keystone itself on the control node.
- The remote provisioning script prepares other servers so they can be managed by Keystone.

Remote provisioning should continue to install Docker, configure SSH access, prepare the `keystone` user, and link the server back to Keystone. It should not be responsible for installing the Keystone application itself.

Default settings:

```text
Build node: Keystone control node
Registry: registry:2 managed by Keystone
Registry URL: install-provided HTTPS hostname, for example registry.example.com
Registry storage driver: local
Registry storage path: /home/keystone/registry/data
Image retention: latest 3 successful artifacts per environment
Auth: generated htpasswd build and runtime credentials managed by Keystone
```

The install flow should allow overriding the storage path, for example:

```text
/mnt/keystone-registry
```

This lets users place registry image data on a mounted VPS volume while keeping Keystone's default behavior simple.

The install flow should also require a registry hostname for normal multi-server managed-registry use. Keystone should configure that hostname with HTTPS, usually through the control node's web proxy. A plain HTTP `host:5000` registry should only be available as an explicit local development or advanced fallback because it requires insecure-registry configuration on every Docker daemon that builds or pulls images.
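
A minimal sketch of how the default settings above might map onto a `registry:2` container. The function name, the container name `keystone-registry`, the htpasswd directory, and the loopback port binding are assumptions for illustration; the `REGISTRY_*` variables are the registry's standard environment overrides.

```bash
# Sketch: start the managed registry with local storage, htpasswd auth, and
# deletion enabled. Container name and host paths are illustrative assumptions.
start_managed_registry() {   # [storage_path] [auth_dir]
    local storage_path="${1:-/home/keystone/registry/data}"   # override for mounted volumes
    local auth_dir="${2:-/home/keystone/registry/auth}"       # hypothetical htpasswd location

    # Bind to loopback only; the HTTPS proxy on the control node fronts this port.
    docker run --detach \
        --name keystone-registry \
        --restart unless-stopped \
        --publish 127.0.0.1:5000:5000 \
        --volume "${storage_path}:/var/lib/registry" \
        --volume "${auth_dir}:/auth:ro" \
        --env REGISTRY_AUTH=htpasswd \
        --env "REGISTRY_AUTH_HTPASSWD_REALM=Keystone Registry" \
        --env REGISTRY_AUTH_HTPASSWD_PATH=/auth/htpasswd \
        --env REGISTRY_STORAGE_DELETE_ENABLED=true \
        registry:2
}
```

Calling `start_managed_registry /mnt/keystone-registry` would place image data on a mounted volume while leaving every other default unchanged.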

## Default Image Flow

```text
Git repository
  -> Keystone control node builds Docker image
  -> Keystone pushes image to the managed registry
  -> Target servers pull image from the managed registry
  -> Target servers run containers
```

The build node and registry are separate concepts:

- Build node: where `git clone`, `docker build`, and `docker push` run.
- Registry: where built images are stored and later pulled from.

The control node is the default build node, but users should later be able to add a dedicated build node from Keystone settings.

The running Keystone server is the control node. It does not need to be represented as a normal deploy target server at first; a lightweight installation or control-node setting may be enough until Keystone needs HA control-plane support.

If Keystone later supports HA control planes, the control node concept should become more explicit so the app can distinguish between:

- The current web/queue/scheduler node.
- The active registry host.
- The default build node.
- Runtime nodes used for deployed applications.

## Image References

Managed registry image names should be stable and collision-resistant. Use IDs in the repository path so renaming an application or environment does not move the image repository.

Default tag format:

```text
registry.example.com/keystone/{application_uuid}/{environment_uuid}:{git_sha}
```

Deployment reference format:

```text
registry.example.com/keystone/{application_uuid}/{environment_uuid}@sha256:...
```

Each successful build artifact should store:

- The registry host.
- The full pushed tag.
- The registry manifest digest.
- The application and environment IDs.
- The source commit SHA.

Deployments should consume the stored digest reference. Tags are useful for humans and registry lookup, but deployments should not depend on mutable tags such as `latest`.
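
As an illustration, both reference forms can be derived from the fields stored on a build artifact. The helper names below are hypothetical; the formats are the ones given above.

```bash
# Sketch: derive the two reference forms from stored artifact fields.

# Human-readable tag reference, used for pushes and registry lookup.
image_tag_ref() {      # registry_host application_uuid environment_uuid git_sha
    printf '%s/keystone/%s/%s:%s\n' "$1" "$2" "$3" "$4"
}

# Immutable digest reference, used for deployments.
image_digest_ref() {   # registry_host application_uuid environment_uuid manifest_digest
    printf '%s/keystone/%s/%s@%s\n' "$1" "$2" "$3" "$4"
}
```

Because the repository path contains only IDs, renaming an application or environment changes neither output.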

## Registry URL And TLS

The managed registry must be exposed over HTTPS for the normal multi-server path, ideally behind the control node's web proxy, for example:

```text
registry.example.com
```

The install flow should ask for the registry hostname and configure TLS before marking the managed registry ready. Target servers and build nodes must be able to resolve the hostname and trust the certificate before they can build, push, or deploy images.

Avoid defaulting to a plain `host:5000` registry. Plain HTTP registries require Docker daemon insecure-registry configuration on every build and target server, which adds onboarding friction and weakens the default security posture. If Keystone supports this fallback, it should be clearly labelled as local development or advanced use.

Target servers must be able to reach the registry URL before they can deploy images built by Keystone.

Managed registry health checks should verify:

- The registry service is running.
- The registry URL is reachable over HTTPS from the control node.
- The registry URL is reachable over HTTPS from the selected build node.
- The registry URL is reachable over HTTPS from target runtime servers.
- Build credentials can log in and push a small test manifest or image.
- Runtime credentials can log in and pull the pushed test artifact.

## Authentication

Use `registry:2` htpasswd authentication for the first version.

Keystone should:

- Generate separate build and runtime registry credentials.
- Write the registry htpasswd file during provisioning.
- Store credentials encrypted.
- Configure build and target servers for registry access.
- Use `docker login --password-stdin` when login is needed.

Do not inline registry passwords into persisted operation scripts. Operation steps are stored and may be visible in the UI or logs.

The build node should receive build credentials. Runtime target servers should receive runtime credentials. With `registry:2` htpasswd authentication alone, these credentials are not truly push- or pull-scoped; any authenticated registry user can push and pull. The separation is still useful for rotation, auditing, and limiting which credential is distributed to each machine, but Keystone should not present runtime credentials as read-only until it adds token auth or another authorization layer.

When Keystone configures Docker auth on a server, it should do so idempotently and with explicit ownership. For the default `keystone` user model, registry auth should live in that user's Docker config:

```text
/home/keystone/.docker/config.json
```

The file should be owned by `keystone:keystone` and readable only by that user where possible. If a root-owned Docker context is required for a specific operation, Keystone should write the equivalent root-owned config intentionally rather than relying on whichever user happened to run `docker login`.
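
One possible shape for a login step that keeps the password out of process arguments and persisted scripts. The function name and the idea of a per-server password file are assumptions; only `docker login --password-stdin` comes from the plan above.

```bash
# Sketch: log in as the calling user without exposing the password in argv.
registry_login() {   # registry_host username password_file
    local host="$1" user="$2" password_file="$3"
    [ -r "$password_file" ] || { echo "cannot read $password_file" >&2; return 1; }
    # The subshell keeps the restrictive umask local, so files docker creates
    # for this user are private without changing the caller's umask.
    ( umask 077
      docker login "$host" --username "$user" --password-stdin < "$password_file" )
}
```

Running this as the `keystone` user would place the resulting auth entry in that user's Docker config rather than in whichever account happened to invoke the step.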

Preferred approaches:

- Configure Docker auth on each server through a separate secure action.
- Or write root-owned or user-owned credential files on the server and have deployment scripts read from those files.

Token auth can be considered later if Keystone needs per-repository, per-server, or true push/pull scoped credentials. It should not be part of the first implementation.

## Build Planning

Build planning should assume a default managed registry exists after install.

For the default path:

- Build strategy: build on control node.
- Registry: managed local registry.
- Artifact reference: full managed registry image reference.

Multi-server deploys should no longer be blocked just because the user has not configured an external registry. They should be blocked only if the managed registry is missing, unhealthy, or unreachable.

External registries should remain available as an advanced override.

Build strategy should not be exposed to users as low-level values such as `target_server`, `dedicated_builder`, or `external_registry`. The UI should expose intent instead:

- Default build node.
- Specific build node.
- External registry override.

Internally, build planning can still map those choices to implementation strategies.

## Build Execution

The default build execution should:

1. Select the configured build node, defaulting to the control node.
2. Clone the application repository.
3. Render the Keystone Dockerfile.
4. Log in to the managed registry.
5. Build the image.
6. Tag the image using the managed registry reference.
7. Push the image.
8. Resolve and store the registry manifest digest.

Control-node builds should have guardrails so the default path does not destabilize Keystone itself:

- Limit concurrent builds on the control node, defaulting to one at a time.
- Check available disk before cloning, building, and pushing.
- Remove temporary clone/build directories after each build.
- Prune local build images and intermediate layers separately from registry artifact retention.
- Surface disk pressure as a build-node health problem before accepting more builds.
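
Two of these guardrails, the disk check and the one-at-a-time limit, could be sketched as follows. The helper names, the lock directory, and any threshold passed in are assumptions.

```bash
# Available space in KiB on the filesystem that holds the given path.
available_kib() {   # path
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds only if at least min_kib is free; callers refuse new builds otherwise.
has_free_space() {   # path min_kib
    local free
    free=$(available_kib "$1") || return 1
    [ -n "$free" ] && [ "$free" -ge "$2" ]
}

# One-at-a-time build slot: mkdir is atomic, so only one caller can create the lock.
acquire_build_slot() { mkdir "$1" 2>/dev/null; }
release_build_slot() { rmdir "$1"; }
```

A build step would call `has_free_space` before cloning, building, and pushing, and hold the slot for the duration of the build.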

Example flow:

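```bash
# The password is read from stdin (here a hypothetical root-only file), never passed as an argument.
docker login registry.example.com --username keystone-build --password-stdin < /path/to/build-password
docker build --file Dockerfile.keystone --tag registry.example.com/keystone/app-uuid/env-uuid:aaaaaaaaaaaa .
docker push registry.example.com/keystone/app-uuid/env-uuid:aaaaaaaaaaaa
# Query the registry (not the local image store) for the pushed manifest.
docker manifest inspect registry.example.com/keystone/app-uuid/env-uuid:aaaaaaaaaaaa
```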

The stored digest must be the registry manifest digest, not a local image ID. Digest-based pulls and registry manifest deletion depend on this being correct.
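
One possible way to obtain the registry digest is to read the pushed image's `RepoDigests` entries and keep the one that belongs to the managed repository. This is a sketch, not the implemented resolver: the function name is hypothetical, and it assumes the build node's Docker records a repo digest after a successful push.

```bash
# Sketch: pick the registry manifest digest for one repository out of
# RepoDigests-style lines ("repo@sha256:<hex>") read from stdin.
# Exits non-zero when no entry matches, so a local image ID is never stored by mistake.
digest_for_repo() {   # repository (host/path, no tag and no digest)
    awk -v repo="$1" -F'@' '
        $1 == repo && $2 ~ /^sha256:[0-9a-f]+$/ { print $2; found = 1; exit }
        END { exit !found }
    '
}

# Possible usage on the build node after the push (assumption):
#   docker image inspect --format '{{range .RepoDigests}}{{println .}}{{end}}' "$tag" \
#       | digest_for_repo registry.example.com/keystone/app-uuid/env-uuid
```

Matching on the repository name matters because an image can carry digests for several repositories, and only the managed registry's entry is valid for later pulls and manifest deletion.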

Build execution should create a build operation that can succeed or fail independently from deployment. A deployment can then depend on a successful build artifact.

## Deploy Execution

Target servers should pull immutable image references from the managed registry.

Deploy execution should:

1. Ensure the target server has registry auth configured.
2. Pull the exact image digest.
3. Render Compose with the full registry image reference.
4. Start or update containers.

Example pull reference:

```text
registry.example.com/keystone/app-uuid/env-uuid@sha256:...
```

Compose should use the full registry reference, not only `sha256:...`.
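
A deploy step could guard against bare digests and mutable tags before rendering Compose. The check below is a hypothetical helper and only inspects the shape of the reference; it does not contact the registry.

```bash
# Sketch: accept only full, digest-pinned references.
is_pinned_reference() {   # image_reference
    case "$1" in
        sha256:*)       return 1 ;;   # bare digest: no registry or repository
        */*@sha256:?*)  return 0 ;;   # host/repository@sha256:<hex>
        *)              return 1 ;;   # tag-only or otherwise mutable reference
    esac
}
```

For example, `registry.example.com/keystone/app-uuid/env-uuid@sha256:...` passes, while `sha256:...` on its own or a `:latest` tag is rejected.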

Deploy execution should be a separate operation phase from build execution. The deploy phase should consume a completed build artifact and should not be responsible for building the artifact itself.

Operations should have explicit execution targets. Inferring the SSH target only from the operation target model becomes fragile once Keystone has build nodes, registry maintenance, and runtime deployment steps.

Each operation or operation step should be able to declare where it runs:

- Control node.
- Build node.
- Runtime server.
- Specific server.

## Pruning And Retention

Default retention should keep the latest 3 successful build artifacts per environment.

Pruning should also retain:

- Any artifact currently referenced by a service's available image digest.
- Any artifact currently referenced by a service's current image digest.
- Any artifact needed for an active deployment operation.
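
The retention rule can be sketched as a filter over one environment's successful artifacts. The function name and the input format (digests on stdin, newest first, with protected digests passed as a space-separated list) are assumptions.

```bash
# Sketch: print the digests that may be pruned for one environment.
# stdin: successful artifact digests, newest first, one per line.
# Keeps the newest `keep` entries and anything in the protected list
# (digests referenced by services or by active deployment operations).
prunable_artifacts() {   # keep "digest1 digest2 ..."
    awk -v keep="${1:-3}" -v protected_list="$2" '
        BEGIN {
            count = split(protected_list, items, " ")
            for (i = 1; i <= count; i++) protected[items[i]] = 1
        }
        NF == 0 { next }                                   # ignore blank lines
        { seen++ }
        seen > keep && !($0 in protected) { print }
    '
}
```

With five artifacts, `keep` set to 3, and the oldest digest protected, only the fourth-newest digest is reported as prunable.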

Pruning should remove old registry manifests first, then run registry garbage collection to remove unreferenced blobs from local disk.

`registry:2` rejects manifest deletion through its API unless deletion is explicitly enabled:

```text
REGISTRY_STORAGE_DELETE_ENABLED=true
```

Garbage collection is safest when the registry is not accepting writes: a layer uploaded while collection is running can be deleted before its manifest arrives, leaving a corrupted image. The first implementation should therefore treat manifest deletion and blob garbage collection as separate steps. Delete old manifests under the normal retention policy, then run blob garbage collection only during a controlled maintenance window, holding a lock so pruning does not race with active builds or pushes.

Suggested cleanup flow:

1. Acquire a registry maintenance lock.
2. Find prunable artifacts by environment retention rules.
3. Delete old manifests through the registry API.
4. Stop the registry or put it in a safe maintenance state.
5. Run registry garbage collection.
6. Restart the registry.
7. Mark artifacts as pruned or delete their records.
8. Release the lock.
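
A rough sketch of steps 1, 3 through 6, and 8 for a single repository. The container name `keystone-registry`, the lock directory, and the netrc file are assumptions, as is the image entrypoint accepting the `garbage-collect` subcommand. Steps 2 and 7 are left to the application, which supplies the prunable digests on stdin and updates artifact records afterwards.

```bash
# Sketch: delete manifests, then garbage-collect while the registry is stopped.
prune_registry() {   # registry_host repository netrc_file lock_dir
    local host="$1" repo="$2" netrc="$3" lock="$4" digest status=0

    # 1. Acquire the maintenance lock (mkdir is atomic).
    mkdir "$lock" 2>/dev/null || { echo "registry maintenance already running" >&2; return 1; }

    # 3. Delete each old manifest by digest through the registry API.
    while IFS= read -r digest; do
        [ -n "$digest" ] || continue
        curl --silent --fail --netrc-file "$netrc" --request DELETE \
            "https://${host}/v2/${repo}/manifests/${digest}" </dev/null || status=1
    done

    if [ "$status" -eq 0 ]; then
        docker stop keystone-registry                                     # 4. stop accepting writes
        docker run --rm --volumes-from keystone-registry registry:2 \
            garbage-collect /etc/docker/registry/config.yml || status=1   # 5. remove unreferenced blobs
        docker start keystone-registry                                    # 6. restart
    fi

    rmdir "$lock"                                                         # 8. release the lock
    return "$status"
}
```

Garbage collection is skipped if any manifest deletion fails, so a partial prune never triggers a registry restart.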

## Future Extensions

These should be optional settings, not onboarding requirements:

- Dedicated build nodes.
- S3-compatible registry storage.
- External registries such as GHCR, Gitea, Docker Hub, or generic registries.
- True push- and pull-scoped credentials.
- Credential rotation.
- Per-server or per-repository scoped auth.
- Configurable retention per application or environment.

The first version should optimize for a self-hosted user installing Keystone on a VPS and being able to deploy with minimal additional setup.

## Existing Server Provisioning

Keystone should support connecting an existing Ubuntu server as a managed node. This is important for users running VPSs, Proxmox VMs, homelab hardware, or manually provisioned servers.

The flow should be:

1. User creates a server record in Keystone as an existing server.
2. Keystone shows a one-time provisioning command.
3. User runs the command on the server as root or a sudo-capable user.
4. The script installs Docker and required packages.
5. The script creates/configures the `keystone` user.
6. The script installs Keystone's management SSH key.
7. The script calls back to Keystone with a one-time token.
8. Keystone marks the server active.

This should sit alongside cloud-provider provisioning. Cloud providers can create the VM automatically, but the same remote preparation logic should be reused where possible.

Provisioning callbacks should not authenticate only by `server_id` or IP address. They should use a short-lived, single-use provisioning token tied to the server record.

Avoid passing sensitive values such as sudo passwords in URL query strings. Safer options include:

- Generate a short-lived provisioning token and pass only that in the URL.
- Store sensitive bootstrap data server-side and let the provisioning script exchange the one-time token for the data it needs.
- Prefer SSH key-based provider bootstrap where available instead of root password bootstrap.
- If a password must be used, pass it over SSH stdin or an encrypted job payload, not through a script URL.

The remote provisioning script can still be downloaded from Keystone, but the URL should not contain long-lived secrets or reusable credentials.
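
For the follow-up exchange, one option is to send the one-time token as a request header supplied through curl's stdin config, so it appears in neither a query string nor the process list. The endpoint path and function name below are hypothetical.

```bash
# Sketch: exchange a one-time provisioning token for bootstrap data.
# The header line is fed to curl as config on stdin (--config -), and printf
# is a shell builtin, so the token is never part of any command's arguments.
exchange_provisioning_token() {   # keystone_url token
    printf 'header = "Authorization: Bearer %s"\n' "$2" |
        curl --silent --fail --max-time 30 --config - \
            --request POST "$1/api/provisioning/exchange"
}
```

The response body would carry whatever bootstrap data the script needs, and the server would invalidate the token after the first successful exchange.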

### Sudo Password Handling

Keep the current Forge-like user model for now:

- Provisioned servers have a `keystone` user.
- SSH login is key-only.
- The generated sudo password lets a human administrator run elevated commands manually after logging in over SSH with a key; it is not an SSH login credential.
- Keystone automation continues to use SSH key access and Docker/sudo-capable permissions as required.

This model is acceptable, but sudo password delivery should be hardened.

Laravel protections help with some leak paths:

- `ShouldBeEncrypted` protects queued job payloads.
- Encrypted casts protect stored secrets.
- Hidden model attributes avoid accidental serialization.
- PHP's `#[\SensitiveParameter]` attribute (PHP 8.2 and later) redacts marked argument values in stack traces.

These protections do not cover query strings, shell process arguments, rendered scripts left on disk, reverse-proxy logs, or third-party request logging.

Minimal hardening plan:

1. Keep generating a sudo password for the provisioned `keystone` user.
2. Keep flashing the sudo password to the user once after server creation.
3. Add `#[\SensitiveParameter]` to job constructor parameters such as `rootPassword` and `sudoPassword`.
4. Stop passing `sudo_password` in the provision script URL.
5. Use a short-lived, single-use provisioning token in the URL instead.
6. Store the sudo password encrypted server-side until the provisioning script is rendered or the one-time token is exchanged for it.
7. Ensure the remote provisioning script deletes itself at the end of provisioning.
8. Avoid writing the plaintext sudo password to logs or long-lived files.
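
Steps 7 and 8 might look like this inside the remote provisioning script. Both helper names are hypothetical, and the sketch assumes the script runs from a file on disk rather than being piped into a shell.

```bash
# Sketch: remove the given script file when the shell exits, on success or failure.
delete_on_exit() {   # path of the script to remove, typically "$0"
    SELF_DELETE_PATH="$1"
    trap 'rm -f -- "$SELF_DELETE_PATH"' EXIT
}

# Set a user's password from stdin. The user:password line is built by the
# printf builtin and piped to chpasswd, so the plaintext never reaches argv
# or a file on disk.
set_user_password() {   # username; password arrives on stdin
    local user="$1" password
    IFS= read -r password || [ -n "$password" ] || return 1
    printf '%s:%s\n' "$user" "$password" | chpasswd
}
```

The script would call `delete_on_exit "$0"` near the top so that even a failed run leaves no bootstrap artifact behind.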

The goal is to preserve the simple human-admin UX while removing avoidable secret exposure from URLs and leftover bootstrap artifacts.