Deploying a Headless Cluster Using the CLI

This guide walks you through setting up an 8-node distributed LLM cluster using the webAI CLI. You'll learn how to configure multiple machines to work together, enabling distributed processing for large language models.

Prerequisites
Before beginning, ensure you have:
• 8 machines (macOS or Linux preferred) on the same subnet/LAN
• All machines must have:
  • Internet access
  • SSH enabled
  • Xcode Command Line Tools (macOS only): xcode-select --install
• One machine designated as the controller
• Seven machines designated as workers

Setup Process

  1. Generate SSH Key on Controller

    First, create an SSH key on your controller machine that will be used to connect to workers:

    ssh-keygen -t ed25519 -C "webai@controller"
    • Press Enter to accept the default file location (~/.ssh/id_ed25519)
    • You can leave the passphrase empty to keep automated connections simple; for production clusters, setting a passphrase (managed with ssh-agent) is recommended
  2. Copy SSH Key to Each Worker Node

    Copy your public key to each worker machine to enable passwordless authentication:

    ssh-copy-id user@192.168.1.101
    ssh-copy-id user@192.168.1.102
    ssh-copy-id user@192.168.1.103
    ssh-copy-id user@192.168.1.104
    ssh-copy-id user@192.168.1.105
    ssh-copy-id user@192.168.1.106
    ssh-copy-id user@192.168.1.107

    Replace user and IP addresses with your actual worker usernames and IPs.
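
    If every worker shares the same username, a shell loop saves typing the command seven times. This is a sketch using the example IPs and username from above; substitute your own:

    for ip in 192.168.1.{101..107}; do   # example worker IPs; adjust to your LAN
      ssh-copy-id "user@${ip}"
    done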

    If your system doesn't have ssh-copy-id, use this alternative method:

    cat ~/.ssh/id_ed25519.pub | ssh user@192.168.1.101 "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 700 ~/.ssh && chmod 600 ~/.ssh/authorized_keys"
  3. Verify SSH Connectivity

    Test the SSH connection to each worker to ensure passwordless access works properly:

    ssh user@192.168.1.101

    If successful, you'll connect without being prompted for a password. Type exit to return to the controller, then repeat the test for each worker node (or use the loop sketch below).
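
    To test all seven workers non-interactively, you can loop over them with BatchMode, which makes ssh fail immediately instead of prompting when key authentication isn't working. A sketch using the example IPs and username from step 2:

    for ip in 192.168.1.{101..107}; do   # example worker IPs; adjust to your LAN
      if ssh -o BatchMode=yes -o ConnectTimeout=5 "user@${ip}" true; then
        echo "OK:   ${ip}"
      else
        echo "FAIL: ${ip}"   # key auth or connectivity problem on this node
      fi
    done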

  4. Unpack webAI CLI

    On the controller machine:

    1. Unzip the webAI CLI package:

      unzip webai-cli.zip
      cd Headless
    2. Verify the folder structure:

      Headless/
      ├── rtctl
      ├── ips.yaml
      └── runtime/
  5. Configure ips.yaml

    Edit the ips.yaml file to define your cluster layout:

    controller: 192.168.1.100
    workers:
    - user@192.168.1.101
    - user@192.168.1.102
    - user@192.168.1.103
    - user@192.168.1.104
    - user@192.168.1.105
    - user@192.168.1.106
    - user@192.168.1.107
    • Replace 192.168.1.100 with your controller's IP address
    • Replace user with the actual username on each worker node
    • Use # at the beginning of a line to comment out any node you don't want to include
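
    For example, commenting out the last worker excludes it from the cluster without deleting the line:

    workers:
    - user@192.168.1.106
    # - user@192.168.1.107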

Cluster Management

Starting the Cluster

From within the Headless/ directory:

  1. Start the controller:

    ./rtctl start controller
  2. Start all workers:

    ./rtctl start workers --from-file ips.yaml --controller-ip 192.168.1.100

    The --controller-ip should match the controller IP in your ips.yaml file.
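
If you bring the cluster up frequently, both steps can be wrapped in a small script. A sketch, assuming it runs from within Headless/ and that 192.168.1.100 is your controller's IP as in ips.yaml:

#!/usr/bin/env bash
# Start the controller first, then attach the workers to it.
set -euo pipefail

./rtctl start controller
./rtctl start workers --from-file ips.yaml --controller-ip 192.168.1.100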

Running a Model

  1. To see all available models:

    ./rtctl run model --list
  2. To run a distributed model across your cluster:

    ./rtctl run scaled -f ips.yaml --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit

Interacting with the Model

Once the model is running, you can start a chat session:

./rtctl run chat "What's the fastest land animal?"

The system will respond with the answer and performance metrics:

  • ttft (Time to First Token): Latency from submitting the prompt to receiving the first token
  • tps (Tokens Per Second): The model's generation throughput

Checking Cluster Status

To verify the health of your cluster:

./rtctl status -f ips.yaml

Stopping the Cluster

  1. Stop the running model:
    ./rtctl stop scaled
  2. Stop all worker nodes:
    ./rtctl stop workers -f ips.yaml
  3. Stop the controller:
    ./rtctl stop controller
  4. (Optional) For a complete cleanup:
    ./rtctl clean -f ips.yaml
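
The same four steps can be scripted for an orderly shutdown. A sketch, run from within Headless/:

#!/usr/bin/env bash
# Stop in reverse order of startup: model, workers, then controller.
set -euo pipefail

./rtctl stop scaled
./rtctl stop workers -f ips.yaml
./rtctl stop controller
./rtctl clean -f ips.yaml   # optional: complete cleanup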

Best Practices

  • Network Configuration: Use a private network or VLAN for optimal performance and security
  • Connection Types: You can mix Wi-Fi, Ethernet, and Thunderbolt connections in your cluster
    • Thunderbolt: Provides the highest performance (up to 40Gbps) and lowest latency, preferred when available between nodes
    • Ethernet: Offers reliable, stable connectivity (1-10Gbps) and is preferred over Wi-Fi for consistent performance
    • Wi-Fi: Acceptable for internet access and basic connectivity, but not recommended for primary inter-node communication
  • Resource Allocation: Distribute models based on the available memory and compute power of each node
  • Error Handling: If a node fails to connect, check its SSH configuration and network connectivity
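
When a node fails to connect, a quick diagnostic pass usually narrows the cause to network, firewall, or SSH configuration. A sketch against one of the example worker addresses:

# Is the host reachable on the network at all?
ping -c 3 192.168.1.101

# Is the SSH port open (default 22)? A timeout suggests a firewall.
nc -vz 192.168.1.101 22

# Does key-based auth work without prompting?
ssh -o BatchMode=yes user@192.168.1.101 true && echo "SSH OK"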

Troubleshooting

  • Check that SSH is enabled on all machines and that firewalls allow SSH connections
  • Ensure all nodes have sufficient RAM for the selected model
  • Check network bandwidth between nodes (see the iperf3 sketch below); bandwidth limitations can impact distributed processing
  • Ensure you're running commands from within the Headless/ directory
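
To measure bandwidth between two nodes, a common choice is iperf3 (not part of the webAI CLI; install it separately, e.g. brew install iperf3 on macOS). A sketch using the example controller IP:

# On one node (e.g., the controller), start a server:
iperf3 -s

# On another node, run a test against it:
iperf3 -c 192.168.1.100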