Core Concepts

This page explains the key concepts behind Krayne and Ray on Kubernetes.

Ray cluster anatomy

A Ray cluster consists of a head node and one or more worker groups. Krayne manages these as Kubernetes pods via the KubeRay operator.

graph TB
  subgraph cluster["Ray Cluster"]
    direction TB
    Head["<b>Head Node</b><br/>GCS Server<br/>Dashboard :8265<br/>Ray Client :10001<br/>GCS :6379"]
    subgraph wg1["Worker Group: cpu-workers"]
      W1["Worker 1<br/>15 CPU, 48Gi"]
      W2["Worker 2<br/>15 CPU, 48Gi"]
    end
    subgraph wg2["Worker Group: gpu-workers"]
      G1["Worker 1<br/>1x A100 GPU"]
      G2["Worker 2<br/>1x A100 GPU"]
    end
    Head --- W1
    Head --- W2
    Head --- G1
    Head --- G2
  end
  User["User"] -->|"Dashboard :8265"| Head
  User -->|"Ray Client :10001"| Head
  User -->|"Notebook :8888"| Head

Component	Role
Head node	Runs the Global Control Service (GCS), the Ray dashboard, and scheduling. By default, Krayne pins the head's Ray `num-cpus` to `0` so that user tasks are routed to workers. To run tasks on the head as well, opt in by setting `head.runs_tasks: true`.
Worker group	A set of identically configured worker pods. A cluster can have multiple worker groups (e.g., CPU workers and GPU workers).
Services	Jupyter notebook, code-server, and SSH are exposed on the head node by default; each can be disabled in `services`.

KubeRay and the RayCluster CRD

KubeRay is a Kubernetes operator that manages Ray clusters through a Custom Resource Definition (CRD): the `RayCluster` kind in API version `ray.io/v1`.

Krayne generates the RayCluster manifest from your configuration and submits it to the Kubernetes API. The KubeRay operator then reconciles the desired state — creating pods, services, and networking.

sequenceDiagram
  participant User
  participant Krayne as Krayne CLI/SDK
  participant K8s as Kubernetes API
  participant KubeRay as KubeRay Operator
  participant Pods as Ray Pods

  User->>Krayne: krayne create my-cluster
  Krayne->>Krayne: Build ClusterConfig
  Krayne->>Krayne: build_manifest(config)
  Krayne->>K8s: Create RayCluster CR
  K8s->>KubeRay: Notify: new RayCluster
  KubeRay->>Pods: Create head + worker pods
  Pods-->>KubeRay: Pods running
  KubeRay-->>K8s: Update status: Ready
  K8s-->>Krayne: Status: ready
  Krayne-->>User: Cluster ready!

You never need to write the RayCluster YAML yourself — Krayne handles manifest generation, submission, and status polling.
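To make the generated resource more concrete, here is a minimal sketch of the kind of manifest involved. It assumes a simplified, hypothetical `sketch_manifest` helper; it is not Krayne's `build_manifest`, whose signature and output are not shown on this page. The field names (`headGroupSpec`, `workerGroupSpecs`, `rayStartParams`) come from the KubeRay RayCluster CRD, while the image tag and group name are placeholders.

```python
from typing import Any


def sketch_manifest(name: str, namespace: str = "default", workers: int = 1,
                    image: str = "rayproject/ray:latest") -> dict[str, Any]:
    """Build a minimal RayCluster manifest as a plain dict (illustrative only)."""

    def pod_template(container_name: str) -> dict[str, Any]:
        # Placeholder pod template: one container, no resource requests.
        return {"spec": {"containers": [{"name": container_name, "image": image}]}}

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "headGroupSpec": {
                # num-cpus "0" keeps user tasks off the head node.
                "rayStartParams": {"num-cpus": "0"},
                "template": pod_template("ray-head"),
            },
            "workerGroupSpecs": [
                {
                    "groupName": "cpu-workers",  # placeholder group name
                    "replicas": workers,
                    "minReplicas": workers,
                    "maxReplicas": workers,
                    "rayStartParams": {},
                    "template": pod_template("ray-worker"),
                }
            ],
        },
    }


manifest = sketch_manifest("my-cluster", namespace="ml-team", workers=2)
print(manifest["kind"], manifest["metadata"]["namespace"])
```

A real manifest would also carry resource requests, service ports, and lifecycle hooks; the sketch only shows the overall shape that the operator reconciles.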

Cluster lifecycle

A cluster moves through several states from creation to deletion:

stateDiagram-v2
  [*] --> creating: krayne create
  creating --> ready: All pods running
  creating --> image_pull_error: Bad container image
  creating --> crash_loop: Container crash
  creating --> unschedulable: Insufficient resources
  ready --> ready: krayne scale
  ready --> [*]: krayne delete
  image_pull_error --> [*]: krayne delete
  crash_loop --> [*]: krayne delete
  unschedulable --> [*]: krayne delete

Status	Meaning
`creating`	Cluster submitted, pods being scheduled
`ready`	All pods running, cluster operational
`containers-creating`	Pod scheduled, pulling images
`image-pull-error`	Container image not found or inaccessible
`crash-loop`	Container repeatedly crashing (`CrashLoopBackOff`)
`unschedulable`	Kubernetes cannot schedule pods (insufficient CPU, memory, or GPUs)
`pods-pending`	Pods waiting to be scheduled
`running`	Pods running but cluster not fully ready

Use `krayne describe <name>` to check the current status at any time.
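As a rough illustration of where these statuses could come from, the sketch below maps a single pod's Kubernetes status to one of the strings in the table. The `classify_pod` function is hypothetical and is not Krayne's actual detection logic, which this page does not describe. The pod fields and reason strings (`PodScheduled`, `Unschedulable`, `ImagePullBackOff`, `CrashLoopBackOff`, and so on) are standard Kubernetes values.

```python
def classify_pod(pod: dict) -> str:
    """Map one pod's Kubernetes status to a status string (illustrative only)."""
    status = pod.get("status", {})

    # A False PodScheduled condition means the pod has not been placed on a node.
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            if cond.get("reason") == "Unschedulable":
                return "unschedulable"
            return "pods-pending"

    # Container waiting reasons reveal image and crash problems.
    for cs in status.get("containerStatuses", []):
        reason = cs.get("state", {}).get("waiting", {}).get("reason")
        if reason in ("ErrImagePull", "ImagePullBackOff"):
            return "image-pull-error"
        if reason == "CrashLoopBackOff":
            return "crash-loop"
        if reason == "ContainerCreating":
            return "containers-creating"

    if status.get("phase") == "Running":
        return "running"
    return "pods-pending"


crashing = {
    "status": {
        "phase": "Running",
        "containerStatuses": [{"state": {"waiting": {"reason": "CrashLoopBackOff"}}}],
    }
}
print(classify_pod(crashing))  # crash-loop
```

A cluster-level status would have to combine the results for the head and all worker pods; how Krayne does that aggregation is not covered here.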

Namespaces

Krayne scopes all operations to a Kubernetes namespace. The default namespace is `default`, but you can specify any namespace with the `-n` flag on the CLI or the `namespace` field in the SDK:

# CLI
krayne create my-cluster -n ml-team
krayne get -n ml-team

# Python SDK
from krayne.api import create_cluster
from krayne.config import ClusterConfig

config = ClusterConfig(name="my-cluster", namespace="ml-team")
create_cluster(config)

Clusters in different namespaces are independent — they can share names without conflict.

Configuration model

Krayne uses a layered configuration system with three sources, resolved in order of precedence:

flowchart LR
  CLI["<b>CLI Flags</b><br/>(highest priority)"]
  YAML["<b>YAML File</b>"]
  Defaults["<b>Built-in Defaults</b><br/>(lowest priority)"]

  CLI --> Merge["Merge"]
  YAML --> Merge
  Defaults --> Merge
  Merge --> Validate{"Pydantic<br/>Validation"}
  Validate -->|"valid"| Config["ClusterConfig"]
  Validate -->|"invalid"| Error["ConfigValidationError"]

The only required field is `name`. Everything else has sensible defaults:

# This is a complete, valid command
krayne create my-cluster

See Configuration for the full config model and defaults.
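The precedence rule in the flowchart can be sketched with plain dictionaries. This is a minimal illustration, not Krayne's implementation: the `deep_merge` and `resolve_config` helpers are hypothetical, the key-by-key merging of nested sections is an assumption, and a plain `ValueError` stands in for the Pydantic validation step and `ConfigValidationError`.

```python
from copy import deepcopy
from typing import Any


def deep_merge(base: dict[str, Any], override: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def resolve_config(defaults: dict, yaml_values: dict, cli_flags: dict) -> dict:
    """Apply layers from lowest to highest priority, then check the required field."""
    config = deep_merge(deep_merge(defaults, yaml_values), cli_flags)
    if not config.get("name"):
        # Stand-in for validation; only the required `name` field is checked.
        raise ValueError("name is required")
    return config


defaults = {"namespace": "default", "services": {"notebook": True, "ssh": True}}
yaml_values = {"name": "my-cluster", "services": {"ssh": False}}
cli_flags = {"namespace": "ml-team"}

resolved = resolve_config(defaults, yaml_values, cli_flags)
print(resolved["namespace"], resolved["services"])
```

In this example the CLI flag overrides the default namespace, the YAML file disables SSH, and the untouched `notebook` default survives the merge.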

Services

Krayne exposes three services on the head node, each mapped to a container port:

Service	Default	Port	Description
Jupyter Notebook	Enabled	8888	Web-based notebook environment on the head node
SSH	Enabled	22	SSH access to the head node
Code Server	Enabled	8443	Browser-based code-server, installed at container startup

When a service is enabled, its URL appears in `ClusterInfo` (e.g. `notebook_url`, `code_server_url`, `ssh_url`) and in the CLI output.

All services are installed and started via a `postStart` lifecycle hook on the `ray-head` container. Jupyter is installed with `pip install notebook`, and Code Server is installed from a standalone pre-built binary (no `apt-get` or `curl` required).

Services are configured via the `services` section of `ClusterConfig` or the YAML file:

services:
  notebook: true
  code_server: true
  ssh: true
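The relationship between the `services` flags and the ports in the table can be sketched as a simple lookup. The `enabled_ports` helper is hypothetical and only mirrors the table above; treating a missing key as enabled follows the documented defaults but is otherwise an assumption.

```python
# Container ports taken from the services table above.
SERVICE_PORTS = {"notebook": 8888, "ssh": 22, "code_server": 8443}


def enabled_ports(services: dict[str, bool]) -> dict[str, int]:
    """Return the container port of each enabled service (missing keys count as enabled)."""
    return {
        name: port
        for name, port in SERVICE_PORTS.items()
        if services.get(name, True)
    }


print(enabled_ports({"ssh": False}))  # {'notebook': 8888, 'code_server': 8443}
```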

To access services from your local machine, use `krayne tun-open` and `krayne tun-close`:

krayne tun-open my-cluster   # start tunnels (idempotent)
krayne tun-close my-cluster   # stop tunnels (idempotent)

CLI and SDK parity

Every operation available from the CLI is available as a Python function with the same semantics:

CLI Command	SDK Function	Description
`krayne create`	`create_cluster()`	Create a new cluster
`krayne get`	`list_clusters()`	List all clusters
`krayne describe`	`describe_cluster()`	Get detailed cluster info
`krayne scale`	`scale_cluster()`	Scale a worker group
`krayne delete`	`delete_cluster()`	Delete a cluster
—	`get_cluster()`	Get info for a single cluster
—	`wait_until_ready()`	Poll until cluster is ready

The SDK is designed for automation — use it in scripts, notebooks, and CI/CD pipelines.
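For automation, the general pattern behind "poll until ready" looks roughly like the sketch below. It is a generic illustration and not the implementation or signature of `wait_until_ready()`: the `poll_until_ready` helper, its parameters, and the choice of which statuses count as failures are all assumptions. The status source is passed in as a callable so the sketch runs without a cluster.

```python
import time
from typing import Callable


def poll_until_ready(get_status: Callable[[], str], timeout: float = 300.0,
                     interval: float = 5.0) -> str:
    """Call get_status until it returns "ready", a failure status, or the timeout expires."""
    # Assumption: these statuses are treated as terminal failures.
    failure_statuses = {"image-pull-error", "crash-loop", "unschedulable"}
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "ready":
            return status
        if status in failure_statuses:
            raise RuntimeError(f"cluster failed to start: {status}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"cluster still {status!r} after {timeout} seconds")
        time.sleep(interval)


# Simulated status sequence standing in for real cluster queries.
statuses = iter(["creating", "containers-creating", "running", "ready"])
result = poll_until_ready(lambda: next(statuses), interval=0.0)
print(result)  # ready
```

Failing fast on error statuses keeps a CI job from waiting out the full timeout on a bad image. An `unschedulable` cluster may recover if nodes are added, so a real script might keep polling in that case.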

What's next

Local Sandbox — set up a local development environment
Creating Clusters — create CPU, GPU, and multi-worker clusters
Configuration — full config model, defaults, and YAML schema