
# Core Concepts

This page explains the key concepts behind Krayne and Ray on Kubernetes.


## Ray cluster anatomy

A Ray cluster consists of a head node and one or more worker groups. Krayne manages these as Kubernetes pods via the KubeRay operator.

```mermaid
graph TB
  subgraph cluster["Ray Cluster"]
    direction TB
    Head["<b>Head Node</b><br/>GCS Server<br/>Dashboard :8265<br/>Ray Client :10001<br/>GCS :6379"]
    subgraph wg1["Worker Group: cpu-workers"]
      W1["Worker 1<br/>15 CPU, 48Gi"]
      W2["Worker 2<br/>15 CPU, 48Gi"]
    end
    subgraph wg2["Worker Group: gpu-workers"]
      G1["Worker 1<br/>1x A100 GPU"]
      G2["Worker 2<br/>1x A100 GPU"]
    end
    Head --- W1
    Head --- W2
    Head --- G1
    Head --- G2
  end
  User["User"] -->|"Dashboard :8265"| Head
  User -->|"Ray Client :10001"| Head
  User -->|"Notebook :8888"| Head
```
| Component | Role |
| --- | --- |
| Head node | Runs the Global Control Service (GCS), Ray dashboard, and scheduling. Does not typically run user workloads. |
| Worker group | A set of identically configured worker pods. A cluster can have multiple worker groups (e.g., CPU workers and GPU workers). |
| Services | Jupyter notebook, Code Server, and SSH are optionally exposed on the head node. |

## KubeRay and the RayCluster CRD

KubeRay is a Kubernetes operator that manages Ray clusters via a Custom Resource Definition (CRD): `ray.io/v1/RayCluster`.

Krayne generates the RayCluster manifest from your configuration and submits it to the Kubernetes API. The KubeRay operator then reconciles the desired state — creating pods, services, and networking.

```mermaid
sequenceDiagram
  participant User
  participant Krayne as Krayne CLI/SDK
  participant K8s as Kubernetes API
  participant KubeRay as KubeRay Operator
  participant Pods as Ray Pods

  User->>Krayne: krayne create my-cluster
  Krayne->>Krayne: Build ClusterConfig
  Krayne->>Krayne: build_manifest(config)
  Krayne->>K8s: Create RayCluster CR
  K8s->>KubeRay: Notify: new RayCluster
  KubeRay->>Pods: Create head + worker pods
  Pods-->>KubeRay: Pods running
  KubeRay-->>K8s: Update status: Ready
  K8s-->>Krayne: Status: ready
  Krayne-->>User: Cluster ready!
```

You never need to write the RayCluster YAML yourself — Krayne handles manifest generation, submission, and status polling.
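To make the flow concrete, the custom resource Krayne submits looks roughly like the sketch below. This is an illustrative, hand-written builder: the field values are a minimal example of the `ray.io/v1` `RayCluster` shape, not Krayne's actual `build_manifest` output.

```python
def build_raycluster_manifest(name: str, namespace: str, replicas: int) -> dict:
    """Return a minimal RayCluster custom resource as a plain dict.

    Illustrative only -- Krayne's real manifest includes resource requests,
    images, service ports, and lifecycle hooks.
    """
    return {
        "apiVersion": "ray.io/v1",
        "kind": "RayCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "headGroupSpec": {
                "rayStartParams": {"dashboard-host": "0.0.0.0"},
                "template": {"spec": {"containers": [{"name": "ray-head"}]}},
            },
            "workerGroupSpecs": [
                {
                    "groupName": "cpu-workers",
                    "replicas": replicas,
                    "template": {"spec": {"containers": [{"name": "ray-worker"}]}},
                }
            ],
        },
    }

manifest = build_raycluster_manifest("my-cluster", "default", replicas=2)
```

Krayne serializes a structure like this and POSTs it to the Kubernetes API; from that point on, the KubeRay operator owns reconciliation.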


## Cluster lifecycle

A cluster moves through several states from creation to deletion:

```mermaid
stateDiagram-v2
  [*] --> creating: krayne create
  creating --> ready: All pods running
  creating --> image_pull_error: Bad container image
  creating --> crash_loop: Container crash
  creating --> unschedulable: Insufficient resources
  ready --> ready: krayne scale
  ready --> [*]: krayne delete
  image_pull_error --> [*]: krayne delete
  crash_loop --> [*]: krayne delete
  unschedulable --> [*]: krayne delete
```
| Status | Meaning |
| --- | --- |
| `creating` | Cluster submitted, pods being scheduled |
| `ready` | All pods running, cluster operational |
| `containers-creating` | Pod scheduled, pulling images |
| `image-pull-error` | Container image not found or inaccessible |
| `crash-loop` | Container repeatedly crashing (`CrashLoopBackOff`) |
| `unschedulable` | Kubernetes cannot schedule pods (insufficient CPU, memory, or GPUs) |
| `pods-pending` | Pods waiting to be scheduled |
| `running` | Pods running but cluster not fully ready |

Use `krayne describe <name>` to check the current status at any time.
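The status check can also be polled in a loop until the cluster settles. The helper below is a generic sketch of that pattern, not Krayne's `wait_until_ready()` itself:

```python
import time

def wait_for_status(get_status, target="ready",
                    terminal=("image-pull-error", "crash-loop", "unschedulable"),
                    timeout=600, interval=5.0):
    """Poll get_status() until it returns `target`, a terminal error state,
    or the timeout elapses. Sketch of the wait_until_ready() pattern."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_status()
        if status == target:
            return status
        if status in terminal:
            raise RuntimeError(f"cluster entered terminal state: {status}")
        time.sleep(interval)
    raise TimeoutError(f"cluster not {target} after {timeout}s")
```

Treating `image-pull-error`, `crash-loop`, and `unschedulable` as terminal matches the lifecycle diagram above, where the only exit from those states is `krayne delete`.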


## Namespaces

Krayne scopes all operations to a Kubernetes namespace. The default namespace is `default`, but you can specify any namespace:

```bash
# CLI
krayne create my-cluster -n ml-team
krayne get -n ml-team
```

```python
# Python SDK
from krayne.api import create_cluster
from krayne.config import ClusterConfig

config = ClusterConfig(name="my-cluster", namespace="ml-team")
create_cluster(config)
```

Clusters in different namespaces are independent — they can share names without conflict.
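This works because Kubernetes identifies a namespaced resource by the pair (namespace, name), not by the name alone. A toy illustration:

```python
# Toy sketch: identity is (namespace, name), so the same cluster name
# can exist in two namespaces without colliding.
clusters = {}
for namespace in ("ml-team", "research"):
    clusters[(namespace, "my-cluster")] = {"status": "ready"}

distinct = len(clusters)  # two independent entries, one per namespace
```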


## Configuration model

Krayne uses a layered configuration system with three sources, resolved in order of precedence:

```mermaid
flowchart LR
  CLI["<b>CLI Flags</b><br/>(highest priority)"]
  YAML["<b>YAML File</b>"]
  Defaults["<b>Built-in Defaults</b><br/>(lowest priority)"]

  CLI --> Merge["Merge"]
  YAML --> Merge
  Defaults --> Merge
  Merge --> Validate{"Pydantic<br/>Validation"}
  Validate -->|"valid"| Config["ClusterConfig"]
  Validate -->|"invalid"| Error["ConfigValidationError"]
```

The only required field is `name`. Everything else has sensible defaults:

```bash
# This is a complete, valid command
krayne create my-cluster
```

See Configuration for the full config model and defaults.
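The precedence order amounts to a right-biased merge, where later (higher-priority) sources overwrite earlier ones. The sketch below is illustrative, with plain dicts standing in for the real Pydantic `ClusterConfig`:

```python
# Illustrative defaults -- Krayne's actual built-in values may differ.
BUILTIN_DEFAULTS = {"namespace": "default", "workers": 2}

def resolve_config(defaults: dict, yaml_file: dict, cli_flags: dict) -> dict:
    """Merge the three config sources; later sources win on conflicts."""
    merged = {**defaults, **yaml_file, **cli_flags}
    if "name" not in merged:
        # `name` is the only required field
        raise ValueError("missing required field: name")
    return merged

config = resolve_config(
    BUILTIN_DEFAULTS,
    {"name": "my-cluster", "workers": 4},  # from a YAML file
    {"workers": 8},                        # from CLI flags
)
# workers resolves to 8: CLI flags outrank YAML, which outranks defaults
```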


## Services

Krayne exposes several services on the head node, each mapped to a container port:

| Service | Default | Port | Description |
| --- | --- | --- | --- |
| Jupyter Notebook | Enabled | 8888 | Web-based notebook environment on the head node |
| SSH | Enabled | 22 | SSH access to the head node |
| Code Server | Enabled | 8443 | Browser-based code-server, installed at container startup |

When enabled, service URLs appear in `ClusterInfo` (e.g. `notebook_url`, `code_server_url`, `ssh_url`) and in the CLI output.

All services are installed and started via a `postStart` lifecycle hook on the `ray-head` container. Jupyter is installed with `pip install notebook`, and Code Server is installed from a standalone pre-built binary (no `apt-get` or `curl` required).

Services are configured via the `services` section of `ClusterConfig` or the YAML file:

```yaml
services:
  notebook: true
  code_server: true
  ssh: true
```

To access services from your local machine, use `krayne tun-open` / `krayne tun-close`:

```bash
krayne tun-open my-cluster    # start tunnels (idempotent)
krayne tun-close my-cluster   # stop tunnels (idempotent)
```
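With tunnels open, each service is reachable on `localhost`. The helper below is a hypothetical sketch that maps the default ports from the services table to local URLs, assuming the tunnel mirrors each remote port locally:

```python
# Default head-node ports from the services table above.
SERVICE_PORTS = {"notebook": 8888, "ssh": 22, "code_server": 8443}

def local_service_urls(enabled: dict) -> dict:
    """Build localhost URLs for each enabled service after `krayne tun-open`.

    Hypothetical helper for illustration -- not part of the Krayne SDK.
    """
    urls = {}
    for name, on in enabled.items():
        if on and name in SERVICE_PORTS:
            scheme = "ssh" if name == "ssh" else "http"
            urls[name] = f"{scheme}://localhost:{SERVICE_PORTS[name]}"
    return urls

urls = local_service_urls({"notebook": True, "code_server": True, "ssh": False})
```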

## CLI and SDK parity

Every operation available from the CLI is available as a Python function with the same semantics:

| CLI Command | SDK Function | Description |
| --- | --- | --- |
| `krayne create` | `create_cluster()` | Create a new cluster |
| `krayne get` | `list_clusters()` | List all clusters |
| `krayne describe` | `describe_cluster()` | Get detailed cluster info |
| `krayne scale` | `scale_cluster()` | Scale a worker group |
| `krayne delete` | `delete_cluster()` | Delete a cluster |
| *(SDK only)* | `get_cluster()` | Get info for a single cluster |
| *(SDK only)* | `wait_until_ready()` | Poll until cluster is ready |

The SDK is designed for automation — use it in scripts, notebooks, and CI/CD pipelines.


## What's next