## Goals
- Make tests readable and reviewable.
- Prefer verifying behavior through the public contract (user-facing surfaces) over internal details.
- Keep the suite reliable: deterministic, minimal flake, clear failures.
- Any behavior change should include a test change (add/adjust); pure refactors should not require test changes.
## What “public contract” means

The public contract is anything exposed to end users, as opposed to internal implementation details. Practical mapping in this repo:

- Server: HTTP endpoints + request/response schemas (and documented behavior).
- SDK: symbols exported from `sdks/python/src/agent_control/__init__.py` (and documented behavior).
- Models: Pydantic models/fields/validation/serialization in `models/src/agent_control_models/`.
- Engine: stable, intended entrypoints for evaluation behavior (avoid asserting on internal helpers/private module structure).
## Prefer testing via public contract (rule of thumb)

Choose the narrowest user-facing interface that can express the scenario:

- Server behavior → drive via HTTP endpoints (create/setup via API where feasible).
- SDK behavior → drive via exported SDK API.
- Engine behavior → drive via the engine’s stable entrypoints.
- Only if needed: test internal helpers for hard-to-reach edge cases or performance-sensitive parsing/validation.
Why contract-first tests:

- Contract tests survive refactors (less coupled to internals).
- They catch integration mismatches between packages (models/server/sdk).
- They better reflect how users experience failures.

It is reasonable to test internals when:

- The public route to set up state is disproportionately slow or complex.
- You need to force an otherwise-unreachable error path.
- You’re testing a pure function where the “public API” adds no value.

When you do bypass the public contract, say why in the `# Given:` block (e.g., “Given: seeded DB row directly for speed”).
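Such a documented bypass might look like this sketch, where `seed_run_row` and the dict-based store are hypothetical stand-ins for a real DB helper and database:

```python
# Hypothetical sketch: setup bypasses the public API, and the reason is
# recorded in the Given comment. Names are illustrative, not from this repo.
def seed_run_row(store: dict, run_id: str, status: str) -> None:
    """Write a row directly instead of creating it via the public API."""
    store[run_id] = {"id": run_id, "status": status}


def test_completed_run_reports_completed_status():
    store = {}
    # Given: seeded DB row directly for speed (API setup is slow in bulk)
    seed_run_row(store, "run-1", "completed")
    # When: the status is read back (stand-in for GET /runs/run-1)
    status = store["run-1"]["status"]
    # Then: the contract-level status is asserted
    assert status == "completed"
```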
## Given / When / Then style

Use `# Given`, `# When`, `# Then` comments to separate intent from mechanics. This helps smaller models (and humans) avoid mixing setup, action, and assertions, and makes tests easy to scan.
Guidelines:
- Given: inputs, state, preconditions (fixtures/mocks/seed data).
- When: the single action under test (call a function / make a request).
- Then: assertions about outcomes (return value, error, side effects).
- Prefer one When per test. If you need multiple actions, split tests unless the steps are inseparable.
- Keep comments short and specific (often one line each).
### Example: unit-level validation
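A minimal sketch of the style, assuming a hypothetical validator `parse_threshold` (stdlib-only so it stays self-contained; in the real suite, prefer `pytest.raises` for the error case):

```python
# Illustrative Given/When/Then tests for a hypothetical validator.
def parse_threshold(raw: str) -> float:
    """Hypothetical validator: parse a threshold constrained to [0, 1]."""
    value = float(raw)
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold out of range: {value}")
    return value


def test_parse_threshold_accepts_in_range_value():
    # Given: a string encoding a valid threshold
    raw = "0.25"
    # When: the validator parses it
    value = parse_threshold(raw)
    # Then: the parsed float is returned
    assert value == 0.25


def test_parse_threshold_rejects_out_of_range_value():
    # Given: a string outside the allowed [0, 1] range
    raw = "1.5"
    # When: the validator parses it / Then: it raises, naming the bad value
    try:
        parse_threshold(raw)
    except ValueError as err:
        assert "1.5" in str(err)
    else:
        raise AssertionError("expected ValueError")
```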
These examples are illustrative; adjust imports/names and fill in placeholders to match the concrete code under test.

### Example: API-level behavior
### Example: SDK-level behavior
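In this sketch, `AgentControlClient` is a hypothetical in-memory stand-in for the client exported from `sdks/python/src/agent_control/__init__.py`; a real test would import the SDK and point it at the local server that `make sdk-test` starts:

```python
# Hypothetical SDK-level test: the test touches only the exported API.
class AgentControlClient:
    """Hypothetical stand-in for the exported SDK client."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self._runs = {}

    def create_run(self, name: str) -> dict:
        run = {"id": str(len(self._runs) + 1), "name": name, "status": "pending"}
        self._runs[run["id"]] = run
        return run

    def get_run(self, run_id: str) -> dict:
        return self._runs[run_id]


def test_sdk_round_trips_a_run():
    # Given: an SDK client configured against a (stubbed) local server
    client = AgentControlClient(base_url="http://127.0.0.1:8000")
    # When: a run is created and fetched back through the exported API
    created = client.create_run(name="demo")
    fetched = client.get_run(created["id"])
    # Then: only contract-level fields are asserted
    assert fetched == created
    assert fetched["status"] == "pending"
```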
## Setup guidance (contract-first)
- Prefer creating records via public endpoints rather than writing DB rows directly.
- Prefer invoking behavior via public entrypoints:
  - Server: HTTP endpoints (the service layer is internal; use it directly only when endpoint setup is impractical).
  - SDK: symbols exported from `sdks/python/src/agent_control/__init__.py`.
- Avoid asserting on internal/private fields unless they are part of the contract (schemas, response fields, documented behavior).
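A sketch of contract-first setup: the helper creates its record through the public POST endpoint instead of inserting a DB row (`FakeClient` is a hypothetical in-memory stand-in for the real HTTP test client):

```python
# Hypothetical setup helper that goes through the public API.
class FakeClient:
    """Hypothetical stand-in for an HTTP test client."""

    def __init__(self):
        self._store = {}

    def post(self, path: str, json: dict) -> dict:
        record = {"id": str(len(self._store) + 1), **json}
        self._store[record["id"]] = record
        return record

    def get(self, path: str) -> dict:
        return self._store[path.rsplit("/", 1)[-1]]


def make_run(client, name: str) -> dict:
    # Setup goes through the public endpoint, not the database.
    return client.post("/runs", json={"name": name})


def test_run_is_visible_after_creation():
    client = FakeClient()
    # Given: a run created via the public API
    run = make_run(client, name="demo")
    # When: it is fetched through the public API
    fetched = client.get(f"/runs/{run['id']}")
    # Then: contract-level fields match
    assert fetched["name"] == "demo"
```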
## Running tests (Makefile-first)

Prefer Makefile targets when available:

- All tests: `make test`
- Server tests: `make server-test`
- Engine tests: `make engine-test`
- SDK tests: `make sdk-test`
- Models tests: run pytest directly (`cd models && uv run pytest`)
Package-specific notes:
- Server tests use a configured test database (see `server/Makefile`; invoked via `make server-test`).
- SDK tests start a local server and wait on `/health` (invoked via `make sdk-test`).