Implementation Plans

How we plan, track, and implement features and fixes.

Related: WORKFLOW.md - End-to-end flow from idea to implementation


Folder Structure

website/docs/ai-developer/plans/
├── backlog/ # Approved plans waiting for implementation
├── active/ # Currently being worked on (max 1-2 at a time)
└── completed/ # Done - kept for reference

Flow

Idea/Problem → PLAN file in backlog/ → active/ → completed/

(or INVESTIGATE file first if unclear)
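
The folder structure and flow above can be sketched as two small shell helpers. This is only an illustration: `init_plan_dirs`, `move_plan`, and the `PLANS_ROOT` variable are hypothetical names, not existing repository code.

```bash
# Hypothetical helpers, not part of the repository.
# PLANS_ROOT defaults to the folder shown above; override it for testing.
PLANS_ROOT="${PLANS_ROOT:-website/docs/ai-developer/plans}"

# Create the three stage folders if they do not exist yet.
init_plan_dirs() {
  mkdir -p "$PLANS_ROOT/backlog" "$PLANS_ROOT/active" "$PLANS_ROOT/completed"
}

# Move a plan file from one stage folder to another.
# Usage: move_plan <file-name> <from-stage> <to-stage>
move_plan() {
  local name="$1" from="$2" to="$3"
  if [ ! -f "$PLANS_ROOT/$from/$name" ]; then
    echo "not found: $PLANS_ROOT/$from/$name" >&2
    return 1
  fi
  mv "$PLANS_ROOT/$from/$name" "$PLANS_ROOT/$to/$name"
}
```

For example, `move_plan PLAN-postgres-backup-cronjob.md backlog active` would start work on a backlog plan.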

File Types

PLAN-*.md

For work that is ready to implement. The scope is clear, the approach is known.

When to create:

  • Bug fix with known solution
  • Feature request with clear requirements
  • Infrastructure change with defined scope

Naming Conventions:

| Format | Use Case | Example |
|--------|----------|---------|
| `PLAN-<short-name>.md` | Standalone plan, no specific order | `PLAN-postgres-backup-cronjob.md` |
| `PLAN-<nnn>-<short-name>.md` | Ordered sequence, indicates execution order | `PLAN-001-monitoring-foundation.md` |
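
A file name can be checked against these two formats with a single regular expression. The sketch below is a hypothetical helper, and the exact character rules for `<short-name>` (lowercase letters, digits, and hyphens, starting with a letter) are an assumption drawn from the examples rather than a documented rule.

```bash
# Hypothetical check, not part of the repository.
# Succeeds when the name matches PLAN-<short-name>.md or
# PLAN-<nnn>-<short-name>.md.
# Assumption: short names start with a lowercase letter and contain only
# lowercase letters, digits, and hyphens.
is_valid_plan_name() {
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^PLAN-([0-9]{3}-)?[a-z][a-z0-9]*(-[a-z0-9]+)*\.md$'
}
```

Usage: `is_valid_plan_name PLAN-001-monitoring-foundation.md && echo ok`.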

Ordered Plans (PLAN-nnn-*)

When an investigation produces multiple related plans that should be executed in a specific order, use three-digit numbering to indicate the sequence:

PLAN-001-monitoring-foundation.md      # Must be done first (critical foundation)
PLAN-002-prometheus-config.md # Can start after 001
PLAN-003-grafana-dashboards.md # Depends on 002
PLAN-004-alerting-rules.md # Depends on 003

Benefits of ordered numbering:

  • Clear execution sequence at a glance
  • Dependencies are implicit in the number order
  • Easy to track progress through a large initiative
  • Files sort naturally in file explorers

When to use ordered numbering:

  • Investigation produces 3+ related plans
  • Plans have sequential dependencies
  • Work is part of a larger initiative (e.g., monitoring stack overhaul)

When NOT to use ordered numbering:

  • Standalone bug fix or small feature
  • Plans can be executed in any order
  • Single plan from an investigation
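
Because the numbers are zero-padded to three digits, a plain sort already yields the execution order. A minimal sketch, assuming a hypothetical helper named `list_ordered_plans` and an optional topic keyword used to filter file names:

```bash
# Hypothetical helper, not part of the repository.
# Print the ordered plans (PLAN-nnn-*) in a folder in execution order.
# An optional second argument keeps only names containing that keyword.
list_ordered_plans() {
  local dir="$1" topic="${2:-}" f
  for f in "$dir"/PLAN-[0-9][0-9][0-9]-*"$topic"*.md; do
    [ -e "$f" ] || continue   # the glob matched nothing
    printf '%s\n' "${f##*/}"
  done | LC_ALL=C sort
}
```

Standalone plans such as `PLAN-postgres-backup-cronjob.md` are skipped because they carry no number.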

Splitting Investigations into Multiple Plans

When an investigation covers a large initiative (e.g., deploying a new platform service with multiple phases), split it into separate ordered plans rather than one monolithic plan. Each plan should be independently completable and deliverable.

How to split:

  1. Group by dependency and risk — phases that need different prerequisites (e.g., "no cluster needed" vs "requires running cluster") should be separate plans
  2. Group by completeness — each plan should deliver something useful on its own, even if later plans aren't started yet
  3. Keep optional/deferred work separate — don't mix required work with nice-to-haves in the same plan

Example: Deploying a new service with catalog generation

INVESTIGATE-backstage.md                    ← Research and decisions
↓ produces:
PLAN-001-backstage-metadata-and-generator.md ← No cluster needed, low risk
PLAN-002-backstage-deployment.md ← Cluster needed, medium risk
PLAN-003-backstage-auth-and-plugins.md ← Optional, after deployment works

  • PLAN-001 adds metadata fields and builds the generator: pure code, no cluster, can be tested locally
  • PLAN-002 deploys Backstage following the adding-a-service guide and requires a running cluster
  • PLAN-003 adds Authentik SSO and extra plugins. It is optional and only applies if Authentik is deployed

Each plan references the investigation and the previous plan in its header:

**Investigation**: [INVESTIGATE-backstage.md](../backlog/INVESTIGATE-backstage.md)
**Prerequisites**: PLAN-001 must be complete first

Benefits:

  • Earlier plans can be completed and merged while later plans are still being refined
  • Risk is isolated — a deployment failure in PLAN-002 doesn't block the metadata/generator work in PLAN-001
  • Optional work (auth, plugins) can stay in backlog indefinitely without blocking core functionality
  • Each plan is small enough to review and validate in one session

INVESTIGATE-*.md

For work that needs research first. The problem exists but the solution is unclear.

When to create:

  • Complex infrastructure where options need evaluation
  • Bug with unknown root cause
  • Feature requiring architectural decisions

Naming: INVESTIGATE-<topic>.md

Examples:

  • INVESTIGATE-monitoring-architecture.md
  • INVESTIGATE-multi-cluster-networking.md

After investigation: Create one or more PLAN files with the chosen approach.


Plan Structure

Every plan has these sections:

1. Header (Required)

# Plan Title

> **IMPLEMENTATION RULES:** Before implementing this plan, read and follow:
> - [WORKFLOW.md](../../WORKFLOW.md) - The implementation process
> - [PLANS.md](../../PLANS.md) - Plan structure and best practices

## Status: Backlog | Active | Blocked | Completed

**Goal**: One sentence describing what this achieves.

**Last Updated**: 2026-01-18

**GitHub Issue**: #42 (optional - if tracking with issues)

The IMPLEMENTATION RULES blockquote ensures Claude Code reads the workflow and plan guidelines before starting work.

2. Dependencies (If applicable)

**Prerequisites**: PLAN-001 must be complete first
**Blocks**: PLAN-003 cannot start until this is done
**Priority**: High | Medium | Low

For ordered plans (PLAN-nnn-*), dependencies are often implicit in the number order. Only add explicit dependency notes when the relationship is non-obvious.

3. Problem Summary (Required)

What's wrong or what's needed. Be specific.

4. Phases with Tasks (Required)

Break work into phases. Each phase has:

  • Numbered tasks
  • A validation step at the end (usually user confirmation)

## Phase 1: Setup

### Tasks

- [ ] 1.1 Create the ConfigMap
- [ ] 1.2 Add validation rules
- [ ] 1.3 Test with dry-run

### Validation

User confirms phase is complete.

---

## Phase 2: Implementation

### Tasks

- [ ] 2.1 Create the deployment manifest
- [ ] 2.2 Add the service manifest
- [ ] 2.3 Apply and verify deployment

### Validation

User confirms deployment works correctly.
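
Because tasks are numbered checkboxes, progress can be read straight from the file. The following is a sketch only; `plan_progress` is a hypothetical helper and assumes tasks are written exactly as in the example above (`- [ ] 1.1 ...`).

```bash
# Hypothetical helper, not part of the repository.
# Print "<done>/<total>" for the numbered task checkboxes in a plan file.
# Only lines such as "- [ ] 1.1 ..." or "- [x] 1.1 ..." are counted, so
# acceptance-criteria checkboxes are ignored.
plan_progress() {
  local file="$1" finished total
  # grep -c exits non-zero on zero matches but still prints 0.
  finished=$(grep -Ec '^- \[x\] [0-9]+\.[0-9]+ ' "$file" || true)
  total=$(grep -Ec '^- \[[ x]\] [0-9]+\.[0-9]+ ' "$file" || true)
  printf '%s/%s\n' "$finished" "$total"
}
```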

5. Acceptance Criteria (Required)

## Acceptance Criteria

- [ ] Manifests apply without errors
- [ ] Pods are running and healthy
- [ ] Service is accessible
- [ ] Documentation is updated

6. Implementation Notes (Optional)

Technical details, gotchas, code patterns to follow.

7. Files to Modify (Optional but helpful)

## Files to Modify

- `manifests/250-new-service.yaml`
- `docs/services/new-service.md`
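
The required parts of a PLAN file can be checked mechanically. This is a rough sketch, not an existing tool: `check_plan_sections` is a hypothetical name, and the patterns are assumptions based on the examples in this section.

```bash
# Hypothetical lint, not part of the repository.
# Report required parts that cannot be found in a PLAN file.
# Assumption: the patterns follow the examples in this document. The problem
# summary is not checked because its heading varies ("Problem", "Overview").
check_plan_sections() {
  local file="$1" missing=0 pattern
  for pattern in '^## Status: ' '^\*\*Goal\*\*: ' '^## Phase [0-9]+' '^## Acceptance Criteria'; do
    if ! grep -Eq "$pattern" "$file"; then
      echo "missing: $pattern" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

The function returns a non-zero status when any part is missing, so it could be used as a pre-commit or review check.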

Status Values

| Status | Meaning | Location |
|--------|---------|----------|
| Backlog | Approved, waiting to start | plans/backlog/ |
| Active | Currently being worked on | plans/active/ |
| Blocked | Waiting on something else | plans/backlog/ or plans/active/ |
| Completed | Done | plans/completed/ |
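
The status-to-folder mapping can be verified with a short check. The helper names `plan_status` and `status_matches_folder` are hypothetical, and the sketch assumes the status is written on a single `## Status: <value>` line as in the header example.

```bash
# Hypothetical helpers, not part of the repository.
# Print the value of the first "## Status:" line in a plan file.
plan_status() {
  sed -n 's/^## Status: *//p' "$1" | head -n 1
}

# Succeed when the status is allowed in the folder that holds the file,
# following the status table in this section.
status_matches_folder() {
  local file="$1" status folder
  status=$(plan_status "$file")
  folder=$(basename "$(dirname "$file")")
  case "$status:$folder" in
    Backlog:backlog | Active:active | Completed:completed) return 0 ;;
    Blocked:backlog | Blocked:active) return 0 ;;
    *) return 1 ;;
  esac
}
```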

Updating Plans During Implementation

Critical: Plans are living documents. Update them as you work.

When starting a phase:

## Phase 2: Implementation — IN PROGRESS

When completing a task:

- [x] 2.1 Create the deployment manifest ✓
- [ ] 2.2 Add the service manifest

When a phase is done:

## Phase 2: Implementation — ✅ DONE

When blocked:

## Status: Blocked

**Blocked by**: Waiting for decision on approach

When complete:

  1. Update status: ## Status: Completed
  2. Add completion date: **Completed**: 2026-01-18
  3. Move file: mv website/docs/ai-developer/plans/active/PLAN-xyz.md website/docs/ai-developer/plans/completed/
  4. (Optional) Close GitHub issue if using issue tracking
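
Steps 1 to 3 can be combined into one helper. This is a minimal sketch under stated assumptions: `complete_plan` and `PLANS_ROOT` are hypothetical names, the plan is assumed to contain a single `## Status:` line, and the completion date is placed directly below that line.

```bash
# Hypothetical helper, not part of the repository.
# Apply steps 1-3 to a plan in active/: set the status, add the
# completion date, and move the file to completed/.
complete_plan() {
  local name="$1" root="${PLANS_ROOT:-website/docs/ai-developer/plans}"
  local src="$root/active/$name" dst="$root/completed/$name"
  if [ ! -f "$src" ]; then
    echo "not found: $src" >&2
    return 1
  fi
  # Write the updated copy to completed/, then remove the original.
  awk -v today="$(date +%Y-%m-%d)" '
    /^## Status:/ {
      print "## Status: Completed"
      print ""
      print "**Completed**: " today
      next
    }
    { print }
  ' "$src" > "$dst" && rm "$src"
}
```

Writing the new copy with `awk` and then removing the original avoids `sed -i`, whose syntax differs between GNU and BSD systems.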

Validation

Every phase ends with validation. The simplest form is asking the user to confirm.

Default: User Confirmation

Claude asks: "Phase 1 complete. Does this look good to continue?"

In the plan, this can be written as:

### Validation

User confirms phase is complete.

Optional: Automated Check

When a command can verify the work, include it:

### Validation

```bash
kubectl apply --dry-run=client -f manifests/xxx-new-service.yaml
kubectl get pods -n namespace -l app=new-service
```

User confirms output is correct.

Key Point

Don't force automated validation when it's impractical. User confirmation is valid and often the best approach.


Plan Templates

Simple Bug Fix

# Fix: [Bug Description]

> **IMPLEMENTATION RULES:** Before implementing this plan, read and follow:
> - [WORKFLOW.md](../../WORKFLOW.md) - The implementation process
> - [PLANS.md](../../PLANS.md) - Plan structure and best practices

## Status: Backlog

**Goal**: [One sentence]

**GitHub Issue**: #XX (optional)

**Last Updated**: YYYY-MM-DD

---

## Problem

[What's broken]

## Solution

[How to fix it]

---

## Phase 1: Fix

### Tasks

- [ ] 1.1 [Specific change]
- [ ] 1.2 [Another change]

### Validation

User confirms fix is correct.

---

## Acceptance Criteria

- [ ] Bug is fixed
- [ ] No regressions
- [ ] Manifests apply cleanly

Feature Implementation

# Feature: [Feature Name]

> **IMPLEMENTATION RULES:** Before implementing this plan, read and follow:
> - [WORKFLOW.md](../../WORKFLOW.md) - The implementation process
> - [PLANS.md](../../PLANS.md) - Plan structure and best practices

## Status: Backlog

**Goal**: [One sentence]

**GitHub Issue**: #XX (optional)

**Last Updated**: YYYY-MM-DD

---

## Overview

[What this feature does and why]

---

## Phase 1: [Setup/Preparation]

### Tasks

- [ ] 1.1 [Task]
- [ ] 1.2 [Task]

### Validation

User confirms phase is complete.

---

## Phase 2: [Core Implementation]

### Tasks

- [ ] 2.1 [Task]
- [ ] 2.2 [Task]

### Validation

User confirms phase is complete.

---

## Acceptance Criteria

- [ ] [Criterion]
- [ ] Deployment succeeds
- [ ] Services are accessible
- [ ] Documentation updated

---

## Files to Modify

- `manifests/xxx-new-feature.yaml`

Investigation

# Investigate: [Topic]

> **IMPLEMENTATION RULES:** Before implementing this plan, read and follow:
> - [WORKFLOW.md](../../WORKFLOW.md) - The implementation process
> - [PLANS.md](../../PLANS.md) - Plan structure and best practices

## Status: Backlog

**Goal**: Determine the best approach for [topic]

**Last Updated**: YYYY-MM-DD

---

## Questions to Answer

1. [Question 1]
2. [Question 2]

---

## Current State

[What exists now]

---

## Options

### Option A: [Name]

**Pros:**
-

**Cons:**
-

### Option B: [Name]

**Pros:**
-

**Cons:**
-

---

## Recommendation

[After investigation, what do we do?]

---

## Next Steps

- [ ] Create PLAN-xyz.md with chosen approach
- For multiple related plans, use ordered naming: PLAN-001-*, PLAN-002-*, etc.

Working with Claude Code

See WORKFLOW.md for the complete flow from idea to implementation.


Best Practices

  1. One active plan at a time where possible (two at most) - finish before starting another
  2. Small phases - easier to validate and recover from errors
  3. Specific tasks - "Update line 42 in manifests/xyz.yaml" not "Fix the thing"
  4. Runnable validation where practical - commands, not descriptions; otherwise ask the user to confirm
  5. Update as you go - the plan is the source of truth
  6. Keep completed plans - they're documentation
  7. Check existing lib/ before creating new code - see Library Reuse Rules below

Library Reuse Rules

CRITICAL: Before writing any new code in provision-host/uis/lib/, you MUST:

1. Check Existing Libraries

Review these files for existing functionality:

| Library | Purpose |
|---------|---------|
| paths.sh | All path detection - TEMPLATES_DIR, EXTEND_DIR, SECRETS_DIR, etc. |
| utilities.sh | Base utilities - get_base_path(), die(), config_* functions |
| logging.sh | All logging - log_info(), log_error(), print_section() |
| first-run.sh | Initialization - check_first_run(), generate_ssh_keys() |

2. Use Existing Functions

DO NOT create duplicate path functions. Use paths.sh:

# Good - use paths.sh functions
source "$LIB_DIR/paths.sh"
templates_dir=$(get_templates_dir)
secrets_dir=$(get_secrets_dir)

# Bad - creating your own path detection
_my_detect_templates_dir() { ... } # WRONG!

3. If New Functionality is Needed

Ask these questions before creating new functions:

  1. Does this already exist in another library?
  2. Should this be added to an existing library instead?
  3. Will multiple libraries need this? → Add to shared library
  4. Is this truly specific to this feature? → OK to add locally
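
The first question can be answered quickly by searching the library folder for function definitions. The sketch below is a hypothetical helper (`find_lib_functions` is not an existing function), and it assumes the libraries define functions in the usual `name() {` or `function name() {` form.

```bash
# Hypothetical helper, not part of the repository.
# List shell function definitions whose names contain a keyword.
# Output uses grep's "file:line:text" format. The exit status is non-zero
# when nothing matches.
find_lib_functions() {
  local keyword="$1" lib_dir="${2:-provision-host/uis/lib}"
  # /dev/null is added so grep always prints the file name.
  grep -nE "^[[:space:]]*(function[[:space:]]+)?[A-Za-z0-9_]*${keyword}[A-Za-z0-9_]*[[:space:]]*\(\)" \
    "$lib_dir"/*.sh /dev/null
}
```

For example, `find_lib_functions templates` would show whether a templates-related function already exists before a new one is written.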

4. Centralized Path Functions

All paths are managed by paths.sh. Available functions:

get_templates_dir()           # provision-host/uis/templates/
get_extend_dir() # .uis.extend/
get_secrets_dir() # .uis.secrets/
get_services_dir() # provision-host/uis/services/
get_tools_dir() # provision-host/uis/tools/
get_hosts_templates_dir() # templates/uis.extend/hosts/
get_secrets_templates_dir() # templates/uis.secrets/
get_cloud_init_templates_dir() # templates/ubuntu-cloud-init/

Why This Matters

Code duplication leads to:

  • Inconsistent behavior (different functions return different values)
  • Maintenance burden (fix bugs in multiple places)
  • Confusion (which function should I use?)