
Core Concepts

This page explains the key concepts behind Krayne and Ray on Kubernetes.


Ray cluster anatomy

A Ray cluster consists of a head node and one or more worker groups. Krayne manages these as Kubernetes pods via the KubeRay operator.

graph TB
  subgraph cluster["Ray Cluster"]
    direction TB
    Head["<b>Head Node</b><br/>GCS Server<br/>Dashboard :8265<br/>Ray Client :10001<br/>GCS :6379"]
    subgraph wg1["Worker Group: cpu-workers"]
      W1["Worker 1<br/>15 CPU, 48Gi"]
      W2["Worker 2<br/>15 CPU, 48Gi"]
    end
    subgraph wg2["Worker Group: gpu-workers"]
      G1["Worker 1<br/>1x A100 GPU"]
      G2["Worker 2<br/>1x A100 GPU"]
    end
    Head --- W1
    Head --- W2
    Head --- G1
    Head --- G2
  end
  User["User"] -->|"Dashboard :8265"| Head
  User -->|"Ray Client :10001"| Head
  User -->|"Notebook :8888"| Head
| Component | Role |
| --- | --- |
| Head node | Runs the Global Control Service (GCS), the Ray dashboard, and scheduling. By default Krayne sets the head's Ray num-cpus to 0 so that user tasks are routed to workers; to let the head run tasks, set head.runs_tasks: true. |
| Worker group | A set of identically configured worker pods. A cluster can have multiple worker groups (e.g., CPU workers and GPU workers). |
| Services | Jupyter notebook, code-server, and SSH are exposed on the head node by default; each can be disabled in the services section of the configuration. |

KubeRay and the RayCluster CRD

KubeRay is a Kubernetes operator that manages Ray clusters via a Custom Resource Definition (CRD): ray.io/v1/RayCluster.

Krayne generates the RayCluster manifest from your configuration and submits it to the Kubernetes API. The KubeRay operator then reconciles the desired state — creating pods, services, and networking.

sequenceDiagram
  participant User
  participant Krayne as Krayne CLI/SDK
  participant K8s as Kubernetes API
  participant KubeRay as KubeRay Operator
  participant Pods as Ray Pods

  User->>Krayne: krayne create my-cluster
  Krayne->>Krayne: Build ClusterConfig
  Krayne->>Krayne: build_manifest(config)
  Krayne->>K8s: Create RayCluster CR
  K8s->>KubeRay: Notify: new RayCluster
  KubeRay->>Pods: Create head + worker pods
  Pods-->>KubeRay: Pods running
  KubeRay-->>K8s: Update status: Ready
  K8s-->>Krayne: Status: ready
  Krayne-->>User: Cluster ready!

You never need to write the RayCluster YAML yourself — Krayne handles manifest generation, submission, and status polling.
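
For orientation, the sketch below shows roughly what a minimal RayCluster manifest looks like when written as a Python dict. The field layout (headGroupSpec, workerGroupSpecs, rayStartParams) follows the KubeRay ray.io/v1 CRD. The helper name sketch_manifest, the image tag, and the head resource figures are illustrative assumptions; this is not the output of Krayne's build_manifest.

```python
def sketch_manifest(name, namespace="default", workers=2,
                    image="rayproject/ray:latest"):
    """Hypothetical helper: assemble a minimal RayCluster manifest as a dict."""

    def pod_template(container_name, cpu, memory):
        # Pod template with identical requests and limits (an assumption).
        resources = {"cpu": cpu, "memory": memory}
        return {
            "spec": {
                "containers": [{
                    "name": container_name,
                    "image": image,
                    "resources": {"requests": resources, "limits": resources},
                }]
            }
        }

    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "headGroupSpec": {
                # num-cpus "0" keeps user tasks off the head node.
                "rayStartParams": {"num-cpus": "0"},
                "template": pod_template("ray-head", "4", "16Gi"),
            },
            "workerGroupSpecs": [{
                "groupName": "cpu-workers",
                "replicas": workers,
                "minReplicas": workers,
                "maxReplicas": workers,
                "rayStartParams": {},
                "template": pod_template("ray-worker", "15", "48Gi"),
            }],
        },
    }


manifest = sketch_manifest("my-cluster", namespace="ml-team")
```

The KubeRay operator watches for objects of this kind and creates one head pod plus the requested number of worker pods for each entry in workerGroupSpecs.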


Cluster lifecycle

A cluster moves through several states from creation to deletion:

stateDiagram-v2
  [*] --> creating: krayne create
  creating --> ready: All pods running
  creating --> image_pull_error: Bad container image
  creating --> crash_loop: Container crash
  creating --> unschedulable: Insufficient resources
  ready --> ready: krayne scale
  ready --> [*]: krayne delete
  image_pull_error --> [*]: krayne delete
  crash_loop --> [*]: krayne delete
  unschedulable --> [*]: krayne delete
| Status | Meaning |
| --- | --- |
| creating | Cluster submitted, pods being scheduled |
| ready | All pods running, cluster operational |
| containers-creating | Pod scheduled, pulling images |
| image-pull-error | Container image not found or inaccessible |
| crash-loop | Container repeatedly crashing (CrashLoopBackOff) |
| unschedulable | Kubernetes cannot schedule pods (insufficient CPU, memory, or GPUs) |
| pods-pending | Pods waiting to be scheduled |
| running | Pods running but cluster not fully ready |

Use krayne describe <name> to check the current status at any time.
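
To make the status table concrete, here is a minimal sketch of how per-pod observations could be reduced to a single cluster status. The waiting reasons (ImagePullBackOff, ErrImagePull, CrashLoopBackOff, ContainerCreating) are standard Kubernetes values. The classify helper, the simplified pod dict shape, and the priority order are assumptions made for illustration, not Krayne's actual logic.

```python
# Waiting reasons reported by Kubernetes, mapped to statuses from the table.
WAITING_REASONS = {
    "ImagePullBackOff": "image-pull-error",
    "ErrImagePull": "image-pull-error",
    "CrashLoopBackOff": "crash-loop",
    "ContainerCreating": "containers-creating",
}


def classify(pods, cluster_ready=False):
    """Hypothetical helper: reduce per-pod observations to one cluster status.

    Each pod is a dict with a "phase" key plus optional "waiting_reason"
    and "unschedulable" keys; this simplified shape is an assumption.
    """
    if not pods:
        return "creating"
    reasons = [WAITING_REASONS.get(p.get("waiting_reason")) for p in pods]
    # Failure states take priority, so one bad pod surfaces immediately.
    for status in ("image-pull-error", "crash-loop"):
        if status in reasons:
            return status
    if any(p.get("unschedulable") for p in pods):
        return "unschedulable"
    if "containers-creating" in reasons:
        return "containers-creating"
    if any(p["phase"] == "Pending" for p in pods):
        return "pods-pending"
    if all(p["phase"] == "Running" for p in pods):
        return "ready" if cluster_ready else "running"
    return "creating"
```

In this sketch a single failing pod decides the cluster status, so an error on one worker is reported even while the other pods are healthy.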


Namespaces

Krayne scopes all operations to a Kubernetes namespace. The default namespace is default, but you can specify any namespace:

# CLI
krayne create my-cluster -n ml-team
krayne get -n ml-team

# Python SDK
from krayne.api import create_cluster
from krayne.config import ClusterConfig

config = ClusterConfig(name="my-cluster", namespace="ml-team")
create_cluster(config)

Clusters in different namespaces are independent — they can share names without conflict.


Configuration model

Krayne uses a layered configuration system with three sources, resolved in order of precedence:

flowchart LR
  CLI["<b>CLI Flags</b><br/>(highest priority)"]
  YAML["<b>YAML File</b>"]
  Defaults["<b>Built-in Defaults</b><br/>(lowest priority)"]

  CLI --> Merge["Merge"]
  YAML --> Merge
  Defaults --> Merge
  Merge --> Validate{"Pydantic<br/>Validation"}
  Validate -->|"valid"| Config["ClusterConfig"]
  Validate -->|"invalid"| Error["ConfigValidationError"]

The only required field is name. Everything else has sensible defaults:

# This is a complete, valid command
krayne create my-cluster

See Configuration for the full config model and defaults.
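
The precedence rule above can be sketched with plain dictionaries. This is a simplified illustration, not Krayne's implementation: the DEFAULTS values are placeholders, the key-by-key merge of nested sections is an assumption, and a plain check stands in for the Pydantic validation step (raising ValueError rather than ConfigValidationError).

```python
import copy

# Illustrative defaults only; the real defaults are on the Configuration page.
DEFAULTS = {
    "namespace": "default",
    "services": {"notebook": True, "code_server": True, "ssh": True},
}


def deep_merge(base, override):
    """Return a copy of base with override applied; nested dicts merge by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve(yaml_values, cli_flags):
    """Merge lowest to highest priority: defaults, then YAML, then CLI flags."""
    config = deep_merge(deep_merge(DEFAULTS, yaml_values), cli_flags)
    # Stand-in for Pydantic validation: name is the only required field.
    if not config.get("name"):
        raise ValueError("name is required")
    return config


config = resolve(
    yaml_values={"name": "my-cluster", "namespace": "ml-team",
                 "services": {"ssh": False}},
    cli_flags={"namespace": "research"},
)
```

Here the CLI flag wins for namespace ("research"), the YAML file turns off ssh, and the notebook and code_server flags keep their default values.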


Services

Krayne exposes several services on the head node, each mapped to a container port:

| Service | Default | Port | Description |
| --- | --- | --- | --- |
| Jupyter Notebook | Enabled | 8888 | Web-based notebook environment on the head node |
| SSH | Enabled | 22 | SSH access to the head node |
| Code Server | Enabled | 8443 | Browser-based code-server, installed at container startup |

When enabled, service URLs appear in ClusterInfo (e.g. notebook_url, code_server_url, ssh_url) and in the CLI output.

All services are installed and started via a postStart lifecycle hook on the ray-head container. Jupyter is installed with pip install notebook, and Code Server is installed from a standalone pre-built binary (no apt-get or curl required).

Services are configured via the services section of ClusterConfig or the YAML file:

services:
  notebook: true
  code_server: true
  ssh: true

To access services from your local machine, open tunnels with krayne tun-open and close them with krayne tun-close:

krayne tun-open my-cluster   # start tunnels (idempotent)
krayne tun-close my-cluster   # stop tunnels (idempotent)
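
As a rough illustration of how the services flags relate to head-node ports, the sketch below computes which ports a tunnel would need to cover. The port numbers come from this page. The ports_to_forward helper is hypothetical, and treating the dashboard and Ray Client ports as always forwarded is an assumption rather than documented tun-open behavior.

```python
# Ports taken from this page; the keys mirror the services section.
SERVICE_PORTS = {"notebook": 8888, "code_server": 8443, "ssh": 22}
# Assumed to be forwarded regardless of the services flags.
ALWAYS_ON_PORTS = {"dashboard": 8265, "client": 10001}


def ports_to_forward(services):
    """Hypothetical helper: list the head-node ports a tunnel would need."""
    ports = dict(ALWAYS_ON_PORTS)
    for service, port in SERVICE_PORTS.items():
        # Services count as enabled when the flag is omitted.
        if services.get(service, True):
            ports[service] = port
    return ports


ports = ports_to_forward({"notebook": True, "code_server": False})
```

With code_server disabled, the result contains the dashboard, client, notebook, and ssh ports but no entry for port 8443.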

CLI and SDK parity

Every operation available from the CLI is available as a Python function with the same semantics:

| CLI Command | SDK Function | Description |
| --- | --- | --- |
| krayne create | create_cluster() | Create a new cluster |
| krayne get | list_clusters() | List all clusters |
| krayne describe | describe_cluster() | Get detailed cluster info |
| krayne scale | scale_cluster() | Scale a worker group |
| krayne delete | delete_cluster() | Delete a cluster |
| (SDK only) | get_cluster() | Get info for a single cluster |
| (SDK only) | wait_until_ready() | Poll until cluster is ready |

The SDK is designed for automation — use it in scripts, notebooks, and CI/CD pipelines.
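
For automation scripts, the core of a wait-until-ready step is a polling loop with a timeout. The sketch below shows that pattern in isolation. It is not the SDK's wait_until_ready: the poll_until_ready name, its parameters, and the choice to treat three statuses as failures are assumptions, and get_status is any callable you supply that returns a status string from the lifecycle table.

```python
import time

# Statuses treated as failures in this sketch (an assumption).
FAILED = {"image-pull-error", "crash-loop", "unschedulable"}


def poll_until_ready(get_status, timeout=600.0, interval=5.0, sleep=time.sleep):
    """Hypothetical helper: call get_status() until it returns "ready".

    Raises RuntimeError on a failed status and TimeoutError on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status == "ready":
            return status
        if status in FAILED:
            raise RuntimeError(f"cluster entered status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"still {status!r} after {timeout} seconds")
        sleep(interval)
```

Failing fast on error statuses keeps a CI job from waiting out the full timeout when the cluster has already hit a bad image or a crash loop. The sleep parameter is injectable so the loop can be tested without real delays.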


What's next