LLM API


The LLM API element integrates with your LLM flows, enabling external applications to interact with your workflows. This guide provides step-by-step instructions for setting up, configuring, and using the API element.

The API element is currently in Beta and can only handle one API request at a time.

When using a load balancer, make sure each API deployment uses a different "port" setting (for example, increment the port for each additional instance).

API Element Setup

To set up the API element in your workflow:

Inputs

  • Top Left Input: Always connect this to the "Initiator" element, which serves as a keep-alive for the API.
  • Second Input: Connect this to your LLM flow's output.

Output

  • API Output: The output of the API element should feed back into your LLM flow's input.

API Element Settings

The API element has customizable settings to suit your needs:

  • API Key: A user-defined alphanumeric key used for authentication.
  • Port Number: The default port is 5050, but it can be changed to any available port.

API Endpoint

To interact with the API, use the following endpoint details:

  • URL: http://<machine-ip>:<port-number> (default: http://localhost:5050)
  • Method: POST
  • Route: /prompt
  • Authentication: Pass your API key in the X-API-Key header.
  • Response: The response is streamed back as server-sent events (SSE).
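
Putting these details together, the sketch below posts a single prompt (using the payload format described in the next section) and prints the raw streamed bytes as they arrive. The URL, port, and API key QWERTY123 are placeholder values; a fuller example that parses the stream appears under Usage Examples.

import requests

# Placeholder values: substitute your own machine IP, port, and API key.
URL = "http://localhost:5050/prompt"
HEADERS = {"Content-Type": "application/json", "X-API-Key": "QWERTY123"}

payload = {"message": [{"role": "user", "content": "Hello"}]}

# stream=True keeps the connection open so the SSE chunks can be read as they arrive.
with requests.post(URL, headers=HEADERS, json=payload, stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=None):
        print(chunk.decode("utf-8"), end="", flush=True)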

API Request Payload

When making a POST request to the API, the payload should be structured as follows:

{
  "message": [
    {
      "role": "string",
      "content": "string"
    }
  ]
}

  • role: One of "system", "user", or "assistant".
  • content: The chat/message content to send to the LLM.
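
For example, a small helper along these lines can assemble a valid payload and catch an unsupported role early (the build_payload function and its checks are illustrative, not part of the API):

VALID_ROLES = {"system", "user", "assistant"}

def build_payload(messages: list) -> dict:
    # Illustrative helper: wraps the chat history in the "message" field
    # expected by the /prompt route and validates each entry.
    for entry in messages:
        if entry.get("role") not in VALID_ROLES:
            raise ValueError(f"Unsupported role: {entry.get('role')!r}")
        if "content" not in entry:
            raise ValueError("Each message needs a 'content' field")
    return {"message": messages}

payload = build_payload([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Once upon a time,"},
])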

Usage Examples

Here are examples of how to use the API with cURL and Python.

cURL Example:

curl -N --location 'http://localhost:5050/prompt' \
--header 'Content-Type: application/json' \
--header 'X-API-Key: QWERTY123' \
--data '{
  "message": [{"role": "user", "content": "Once upon a time,"}]
}'

Python Example (using requests):

import json
import traceback

import requests

def fetch_llm_response(chat_history) -> str:
    url = "http://localhost:5050/prompt"
    headers = {
        "Content-Type": "application/json",
        "X-API-Key": "QWERTY123",
    }
    data = {"message": chat_history}

    # Stream the server-sent events back from the API element.
    response = requests.post(url, headers=headers, json=data, stream=True)
    if response.status_code != 200:
        print(f"Error: Failed to fetch response from server. Status code: {response.status_code}")
        return ""

    output = ""
    for chunk in response.iter_content(chunk_size=None):
        try:
            # A chunk may contain several JSON objects; split on the
            # '{"choices":' marker and restore it so each piece parses on its own.
            pieces = [
                '{"choices": ' + x
                for x in chunk.decode("utf-8").split('{"choices":')[1:]
            ]
            for piece in pieces:
                piece_json = json.loads(piece.strip())
                content = piece_json["choices"][0]["message"]["content"]
                print(content, flush=True, end="")
                output += content
        except json.JSONDecodeError:
            print(traceback.format_exc())
            print("Invalid JSON:", chunk.decode("utf-8"))

    print("\n")
    return output

# Example usage
chat_history = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Please help the user with their questions and answer as truthfully as you can."
    },
]

if __name__ == "__main__":
    while True:
        prompt = input("Enter your LLM prompt (or 'exit' to quit): ")
        if prompt == "exit":
            break
        chat_history.append({"role": "user", "content": prompt})
        msg = fetch_llm_response(chat_history)
        chat_history.append({"role": "assistant", "content": msg})

Advanced Usage: Load Balancing with NGINX

For handling increased load, consider using a load balancer like NGINX with multiple API deployments across different network ports.

NGINX Configuration Example:

With this configuration, the API is available at localhost:8080/prompt and can handle two simultaneous connections (one per upstream deployment).

events {
    worker_connections 1024;
}

http {
    upstream backend {
        server host.docker.internal:5050 max_conns=1;
        server host.docker.internal:5051 max_conns=1;
    }

    server {
        listen 8080;
        server_name host.docker.internal;

        location / {
            proxy_pass http://backend;
            proxy_set_header Connection '';
            proxy_http_version 1.1;
            proxy_buffering off;
            proxy_cache off;
            proxy_read_timeout 86400s;
            proxy_send_timeout 86400s;

            # SSE-specific headers
            proxy_set_header X-Accel-Buffering no;
            proxy_set_header Cache-Control no-cache;
        }
    }
}

Dockerfile Example:

FROM nginx:latest
COPY nginx.conf /etc/nginx/nginx.conf
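
To confirm that the balanced setup really serves two requests at once, a quick test along these lines can fire two prompts concurrently at the NGINX front end. This sketch assumes two API deployments on ports 5050 and 5051 behind localhost:8080, as configured above, and that both share the same placeholder API key QWERTY123.

import concurrent.futures

import requests

URL = "http://localhost:8080/prompt"  # NGINX front end from the example above
HEADERS = {"Content-Type": "application/json", "X-API-Key": "QWERTY123"}

def stream_prompt(prompt: str) -> int:
    # Sends one prompt through the load balancer and drains the SSE stream,
    # returning the number of bytes received.
    payload = {"message": [{"role": "user", "content": prompt}]}
    received = 0
    with requests.post(URL, headers=HEADERS, json=payload, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None):
            received += len(chunk)
    return received

if __name__ == "__main__":
    prompts = ["Tell me a short story.", "Explain load balancing in one sentence."]
    # Two workers, matching the two upstream deployments (max_conns=1 each).
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        for prompt, size in zip(prompts, pool.map(stream_prompt, prompts)):
            print(f"{prompt!r} -> {size} bytes streamed")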