webFrame
webFrame provides a comprehensive approach for executing LLMs in resource-constrained and distributed environments. Our vision is for webFrame to become our de facto inference approach for LLMs on any compute platform.
How Does It Work?
webFrame parses models from Hugging Face into an intermediate representation that can be mapped to backend-specific modules. At execution time, webFrame reads the compute availability of the current environment, determines an appropriate compute plan (returned as a sub-flow), and automatically creates the flow on the canvas. This compute plan may run on one node or across several nodes in the selected compute cluster.
By default, webFrame distributes LLM workloads evenly across all provided nodes in a cluster. When webFrame's optimizer settings are used, a novel quantization process is applied at execution time to maximize model accuracy given the available resources while distributing the workload across as few nodes as possible.
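For intuition, even distribution amounts to partitioning a model's layers into near-equal contiguous slices, one per node. The sketch below is purely illustrative; the `partition_layers` helper and its inputs are hypothetical and not part of webFrame's API, which derives its actual compute plan from the model's intermediate representation and cluster telemetry.

```python
# Conceptual illustration of even workload distribution across cluster nodes.
# Hypothetical helper -- webFrame computes its real plan internally.

def partition_layers(num_layers: int, nodes: list[str]) -> dict[str, range]:
    """Assign a contiguous, near-equal slice of layers to each node."""
    plan = {}
    base, extra = divmod(num_layers, len(nodes))
    start = 0
    for i, node in enumerate(nodes):
        count = base + (1 if i < extra else 0)  # spread the remainder
        plan[node] = range(start, start + count)
        start += count
    return plan

print(partition_layers(32, ["node-a", "node-b", "node-c"]))
# {'node-a': range(0, 11), 'node-b': range(11, 22), 'node-c': range(22, 32)}
```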
webFrame currently supports the following common LLM architectures, with support for more planned:
- LLaMA
- Mistral
- Cohere
- Gemma
- Phi
- Phi MoE
- Grin MoE
- StableLM
- DeepSeek
- Nemotron
- Qwen 2
Creating a webFrame Flow
There are three ways to create a webFrame flow. All three require the following to get started:
1. A configured cluster
2. A Hugging Face API key (only for gated models)
Models requiring an API key are marked with a key icon, while optimizer-compatible models have a rocket icon.
Using A Template
A webFrame LLM template is available on the Home page. It creates a new project with the prompt and response API elements already connected to the webFrame LLM element. The default settings are sufficient to get started; see the Settings section below for more detail.
Manual Element Assembly
Drag the LLM, Prompt API, and response API elements onto the canvas.
Key differences from previous versions:
- On Navigator versions ≥2.191-1, the LLM element is replaced with the webFrame element.
- Safeguarding is now a separate element.
- Custom adapter weights are temporarily unavailable.
API Integration
The API element enables scripting and programmatic interaction, including Companion app integration.
Drag the API and LLM elements onto the canvas.
Runtime will automatically assign a port, opening a connection at localhost:<port>.
Send prompts via GET requests to http://localhost:<port>/prompt, as in the example below.
Running a flow initiates a new conversation in Companion.
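As a minimal sketch, the snippet below sends a prompt to a running flow using Python's `requests` library. The port value is whatever Runtime assigned to your flow, and the `prompt` query-parameter name is an assumption for illustration; check your flow's API element for the exact request format.

```python
import requests

PORT = 8080  # substitute the port Runtime assigned to your flow

# Query-parameter name ("prompt") is an assumption for illustration.
response = requests.get(
    f"http://localhost:{PORT}/prompt",
    params={"prompt": "Summarize the webFrame architecture."},
    timeout=60,
)
print(response.text)
```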
Settings
- Cluster: The compute cluster where the model will run. Configure this on the Clusters page.
- Hugging Face API key: Required only for accessing gated models on Hugging Face. See the Hugging Face documentation for instructions on obtaining a key.
- Temperature: Controls response randomness; higher values produce more creative outputs.
- Token Limit: Restricts the maximum length of model responses.
- Optimizer: Algorithm-driven optimization that reduces model size while minimizing the impact on performance. Scale: 0 (no optimization) to 1 (maximum compression). Not compatible with pre-quantized models.
- Cluster Node Memory Cap: Limits the memory allocated per cluster node for model execution. Useful for reserving memory for other flows and for controlling how the model is distributed (see the sketch below). Currently uses fixed values; dynamic allocation based on system memory is coming soon.
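To see how the memory cap steers distribution, consider a back-of-the-envelope calculation. This is illustrative only: the `min_nodes_required` helper is hypothetical, and webFrame's actual planner also has to account for activations, KV cache, and per-node availability.

```python
import math

def min_nodes_required(model_size_gb: float, node_cap_gb: float) -> int:
    """Lower bound on nodes needed when each node may hold at most
    node_cap_gb of model weights. Ignores activation and KV-cache memory."""
    return math.ceil(model_size_gb / node_cap_gb)

# A 26 GB model under an 8 GB per-node cap must span at least 4 nodes.
print(min_nodes_required(26, 8))  # 4
```

Lowering the cap therefore forces the model across more nodes, while raising it lets webFrame concentrate the workload on fewer nodes.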