webFrame


webFrame provides a comprehensive approach to running Large Language Models (LLMs) in resource-constrained and distributed environments. Our vision is for webFrame to become our de facto approach to LLM inference on any compute platform.

How Does It Work?

webFrame parses models from Hugging Face into an intermediate representation that can be mapped to backend-specific modules. At execution time, webFrame:

  • Reads the compute availability of the current environment
  • Determines the appropriate compute plan (returned as a sub-flow)
  • Automatically creates the flow on the canvas

This compute plan may run on a single node or across several nodes from the selected compute cluster.
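
As a rough mental model, the sketch below shows what this kind of plan selection could look like: fit the model into the fewest nodes with enough free memory. Everything in it (the Node type, build_compute_plan, the greedy fit) is an illustrative assumption, not webFrame's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_memory_gb: float

def build_compute_plan(model_size_gb: float, nodes: list[Node]) -> list[Node]:
    """Greedily pick the fewest nodes whose combined free memory fits the model."""
    chosen: list[Node] = []
    remaining = model_size_gb
    # Try the roomiest nodes first so the plan spans as few nodes as possible.
    for node in sorted(nodes, key=lambda n: n.free_memory_gb, reverse=True):
        if remaining <= 0:
            break
        chosen.append(node)
        remaining -= node.free_memory_gb
    if remaining > 0:
        raise RuntimeError("cluster does not have enough free memory for this model")
    return chosen

# e.g. a 35 GB model on nodes with 24, 16, and 8 GB free -> two nodes (24 + 16)
plan = build_compute_plan(35.0, [Node("a", 24), Node("b", 16), Node("c", 8)])
print([n.name for n in plan])
```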

Distribution and Optimization

By default, webFrame evenly distributes LLM workloads across all provided nodes in a cluster.
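
The sketch below illustrates that even-split idea for a layer-sharded model. The function and the layer-range representation are assumptions made for illustration; webFrame's real sharding logic is internal.

```python
def even_shards(num_layers: int, num_nodes: int) -> list[range]:
    """Split num_layers into num_nodes contiguous, near-equal ranges."""
    base, extra = divmod(num_layers, num_nodes)
    shards, start = [], 0
    for i in range(num_nodes):
        size = base + (1 if i < extra else 0)  # spread the remainder evenly
        shards.append(range(start, start + size))
        start += size
    return shards

# e.g. a 32-layer model on 3 nodes -> layers 0-10, 11-21, 22-31
print(even_shards(32, 3))
```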

When using webFrame's optimizer settings, a novel quantization process is applied at execution time to:

  • Maximize model accuracy given the provided resources
  • Distribute the workload across as few nodes as possible
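
As a hedged illustration of that trade-off, and not webFrame's actual algorithm, the following sketch keeps the highest precision (best accuracy) that still fits a given memory budget, dropping to smaller bit-widths only when it must:

```python
# Approximate in-memory cost per parameter at each precision (bf16, int8, int4).
BYTES_PER_PARAM = {16: 2.0, 8: 1.0, 4: 0.5}

def pick_bit_width(params_billions: float, budget_gb: float) -> int:
    """Return the highest precision whose weights fit in the memory budget."""
    for bits in sorted(BYTES_PER_PARAM, reverse=True):  # best accuracy first
        if params_billions * BYTES_PER_PARAM[bits] <= budget_gb:
            return bits
    raise RuntimeError("model does not fit even at 4-bit")

# e.g. a 70B model under a 48 GB budget lands on 4-bit (35 GB of weights)
print(pick_bit_width(70, 48))
```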

Supported Models

webFrame supports many common LLM architectures, with plans to expand to more architectures in the future:

  • codegemma: 7B it
  • Codestral: 22B v0.1
  • DeepSeek R1: 4bit
  • DeepSeek R1 Distill: Llama 8B
  • DeepSeek V3: 4bit
  • Llama: 3.2 1B
  • Llama: 3.2 3B Instruct
  • Llama: 3.3 70B Instruct
  • Llama: 3.3 70B Instruct 4bit
  • Llama: 3.3 70B Instruct 8bit
  • Ministral: 8B Instruct 2410 bf16
  • Mistral: 7B Instruct v0.3
  • Mistral NeMo: Minitron 8B Instruct
  • Mistral Small: 24B Instruct 2501
  • Mixtral: 8x22B Instruct v0.1
  • NVIDIA Llama 3.1 Nemotron: 70B Instruct HF 4bit
  • Phi 3: medium 128k instruct bf16
  • Phi 3.5: mini instruct bf16
  • Phi: 4
  • Qwen2: 72B Instruct
  • Qwen: 7B Instruct
  • Qwen2.5: 14B Instruct
  • Qwen2.5: 72B Instruct
  • Qwen2.5 Coder: 7B Instruct
  • QwQ: 32B
  • QwQ: 32B Preview
  • sum small unquantized

Models requiring an API key are marked with a key icon, while optimizer-compatible models have a rocket icon in the interface.

Creating a webFrame Flow

There are two main approaches to creating a webFrame flow. Both require:

1. A configured cluster
2. A Hugging Face API key (required only for gated models)

Method 1: Manual Element Assembly

  • Drag the LLM, Prompt API, and Response API elements onto the canvas
  • Connect the elements in sequence

Key differences from previous versions:

  • On Navigator versions ≥2.191-1, the LLM element will be replaced with the webFrame element
  • Safeguarding is now its own separate element
  • Custom adapter weights are temporarily unavailable

[Image: webframe-kb-image1.png]

Method 2: API Integration

The API element enables scripting and programmatic interaction, including Companion app integration.

  • Drag the API and LLM elements onto the canvas
  • Connect the elements appropriately
  • The runtime automatically assigns a port and opens a connection at localhost:<port>
  • Send prompts via POST requests to http://localhost:<port>/prompt (see the example below)
  • Running a flow initiates a new conversation in Companion
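
A minimal client for this endpoint might look like the following. The /prompt path comes from the steps above, but the port value, the JSON payload shape, and the response handling are assumptions; check your flow's API element for the exact schema.

```python
import requests

port = 8080  # substitute the port the runtime assigned to your flow

resp = requests.post(
    f"http://localhost:{port}/prompt",
    json={"prompt": "Summarize webFrame in one sentence."},  # assumed field name
    timeout=120,
)
resp.raise_for_status()
print(resp.text)
```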

Configuration Settings

  • Cluster: The compute cluster where the model will run. Configure this on the Clusters page.
  • Hugging Face API key: Required only for accessing gated models on Hugging Face.
  • Temperature: Controls response randomness; higher values produce more creative outputs.
  • Token Limit: Restricts the length of model responses.
  • Optimizer: Algorithm-driven optimization that reduces model size while minimizing performance impact. Ranges from 0 (no optimization) to 1 (maximum compression). Not compatible with pre-quantized models.
  • Cluster Node Memory Cap: Limits the memory allocated per cluster node for model execution. Useful for reserving memory for other flows and for controlling how the model is distributed. Currently uses fixed values; dynamic allocation based on system memory is coming soon.
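
For reference, a settings bundle covering the options above might look like the sketch below. The key names and value types are illustrative assumptions rather than webFrame's exact schema.

```python
# Hypothetical webFrame element settings; the real UI/API fields may differ.
webframe_settings = {
    "cluster": "gpu-cluster-01",        # configured on the Clusters page
    "hugging_face_api_key": None,       # only needed for gated models
    "temperature": 0.7,                 # higher = more creative output
    "token_limit": 1024,                # caps response length
    "optimizer": 0.5,                   # 0 = off, 1 = maximum compression
    "cluster_node_memory_cap_gb": 24,   # per-node ceiling (fixed values today)
}
```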