💥 OpenAI Proxy Server

LiteLLM Server manages:

  • Calling 100+ LLMs (Huggingface, Bedrock, TogetherAI, etc.) in the OpenAI ChatCompletions and Completions format
  • Setting custom prompt templates and model-specific configs (temperature, max_tokens, etc.)

Quick Start​

View all the supported args for the Proxy CLI here

$ litellm --model huggingface/bigcode/starcoder

#INFO: Proxy running on http://0.0.0.0:8000

Test​

In a new shell, run the command below. It makes an openai.ChatCompletion request to the proxy:

litellm --test

This will now automatically route any requests for gpt-3.5-turbo to bigcode starcoder, hosted on huggingface inference endpoints.

Replace openai base​

import openai 

openai.api_base = "http://0.0.0.0:8000"

print(openai.ChatCompletion.create(model="test", messages=[{"role":"user", "content":"Hey!"}]))

Supported LLMs​

$ export AWS_ACCESS_KEY_ID=""
$ export AWS_REGION_NAME="" # e.g. us-west-2
$ export AWS_SECRET_ACCESS_KEY=""
$ litellm --model bedrock/anthropic.claude-v2
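
Before starting the proxy for Bedrock, it can help to confirm that the three variables exported above are actually set. The following is a minimal sketch using only the standard library; `missing_env_vars` is a hypothetical helper written for this example, not part of LiteLLM.

```python
import os

# Variable names taken from the export commands above.
REQUIRED_BEDROCK_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_REGION_NAME",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_env_vars(required=REQUIRED_BEDROCK_VARS, environ=None):
    """Return the names of required variables that are unset or empty."""
    environ = os.environ if environ is None else environ
    return [name for name in required if not environ.get(name)]
```

Calling `missing_env_vars()` with no arguments checks the current process environment; an empty list means all three variables have a value.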

Server Endpoints​

  • POST /chat/completions - chat completions endpoint to call 100+ LLMs
  • POST /completions - completions endpoint
  • POST /embeddings - embedding endpoint for Azure, OpenAI, Huggingface endpoints
  • GET /models - available models on server
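
The endpoints listed above can also be called without the openai package. The sketch below uses only the Python standard library and assumes the proxy is listening on the default http://0.0.0.0:8000 from the Quick Start. The helper names (`build_chat_request`, `build_models_request`, `send`) are hypothetical and exist only for this example.

```python
import json
import urllib.request

PROXY_BASE = "http://0.0.0.0:8000"  # assumed: default host and port from the Quick Start


def build_chat_request(model, messages, base_url=PROXY_BASE):
    """Build (but do not send) a POST request for the /chat/completions endpoint."""
    payload = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_models_request(base_url=PROXY_BASE):
    """Build (but do not send) a GET request for the /models endpoint."""
    return urllib.request.Request(url=f"{base_url}/models", method="GET")


def send(request, timeout=600):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the proxy running, `send(build_chat_request("test", [{"role": "user", "content": "Hey!"}]))` would issue the same kind of request as the openai example, and `send(build_models_request())` would list the models on the server.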

Using with OpenAI compatible projects​

LiteLLM lets you point openai.api_base at the proxy server, so any OpenAI-compatible project can use all LiteLLM-supported LLMs.

This tutorial assumes you're using the `big-refactor` branch of LM Harness https://github.com/EleutherAI/lm-evaluation-harness/tree/big-refactor

Step 1: Start the local proxy

$ litellm --model huggingface/bigcode/starcoder

Using a custom api base

$ export HUGGINGFACE_API_KEY=my-api-key #[OPTIONAL]
$ litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

OpenAI Compatible Endpoint at http://0.0.0.0:8000

Step 2: Set OpenAI API Base & Key

$ export OPENAI_API_BASE=http://0.0.0.0:8000

LM Harness requires an OpenAI API key, set as OPENAI_API_SECRET_KEY, to run benchmarks. Any placeholder value works here:

$ export OPENAI_API_SECRET_KEY=anything

Step 3: Run LM-Eval-Harness

python3 -m lm_eval \
--model openai-completions \
--model_args engine=davinci \
--task crows_pairs_english_age
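
Steps 2 and 3 can also be driven from Python rather than the shell. This is a rough sketch under the same assumptions as the tutorial (proxy on http://0.0.0.0:8000, placeholder API key); `harness_env` and `HARNESS_COMMAND` are names invented for this example.

```python
import os


def harness_env(proxy_base="http://0.0.0.0:8000", api_key="anything", base_env=None):
    """Return a copy of the environment with the two variables from Step 2 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENAI_API_BASE"] = proxy_base
    env["OPENAI_API_SECRET_KEY"] = api_key
    return env


# The Step 3 command, split into an argument list for subprocess.
HARNESS_COMMAND = [
    "python3", "-m", "lm_eval",
    "--model", "openai-completions",
    "--model_args", "engine=davinci",
    "--task", "crows_pairs_english_age",
]
```

With the proxy running and LM Harness installed, `subprocess.run(HARNESS_COMMAND, env=harness_env(), check=True)` would launch the benchmark against the proxy.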

Proxy Configs​

The Config allows you to set the following params

Param Name       | Description
model_list       | List of supported models on the server, with model-specific configs
litellm_settings | litellm module settings, e.g. litellm.drop_params=True, litellm.set_verbose=True, litellm.api_base

Example Config​

model_list:
  - model_name: zephyr-alpha
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: huggingface/HuggingFaceH4/zephyr-7b-alpha
      api_base: http://0.0.0.0:8001
  - model_name: zephyr-beta
    litellm_params:
      model: huggingface/HuggingFaceH4/zephyr-7b-beta
      api_base: https://<my-hosted-endpoint>

litellm_settings:
  drop_params: True
  set_verbose: True
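
To make the expected shape of the config concrete, here is a small sketch that checks a config, already loaded into a Python dict, for the keys used in the example above. `validate_proxy_config` is a hypothetical helper for illustration; it is not LiteLLM's own validation, and the rules it applies are assumptions based only on the fields shown in this section.

```python
def validate_proxy_config(config):
    """Return a list of human-readable problems found in a proxy config dict."""
    problems = []
    model_list = config.get("model_list")
    if not isinstance(model_list, list) or not model_list:
        problems.append("model_list must be a non-empty list")
        return problems
    for index, entry in enumerate(model_list):
        if not entry.get("model_name"):
            problems.append(f"model_list[{index}] is missing model_name")
        params = entry.get("litellm_params") or {}
        if not params.get("model"):
            problems.append(f"model_list[{index}] is missing litellm_params.model")
    return problems


# The example config, written as the dict a YAML loader would produce.
example_config = {
    "model_list": [
        {
            "model_name": "zephyr-alpha",
            "litellm_params": {
                "model": "huggingface/HuggingFaceH4/zephyr-7b-alpha",
                "api_base": "http://0.0.0.0:8001",
            },
        },
        {
            "model_name": "zephyr-beta",
            "litellm_params": {
                "model": "huggingface/HuggingFaceH4/zephyr-7b-beta",
                "api_base": "https://<my-hosted-endpoint>",
            },
        },
    ],
    "litellm_settings": {"drop_params": True, "set_verbose": True},
}
```

`validate_proxy_config(example_config)` returns an empty list, while a config with no model_list returns a single message describing the problem.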

Quick Start - Config​

Here's how you can serve multiple LLMs from one proxy using a single config.yaml.

Step 1: Setup Config​

model_list:
  - model_name: zephyr-alpha
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: huggingface/HuggingFaceH4/zephyr-7b-alpha
      api_base: http://0.0.0.0:8001
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: sk-1233
  - model_name: claude-2
    litellm_params:
      model: claude-2
      api_key: sk-claude

Step 2: Start Proxy with config​

$ litellm --config /path/to/config.yaml

Step 3: Use the proxy

If your repo lets you set the model name, you can call a specific model by passing in that model's name:

Setting model name

import openai 
openai.api_base = "http://0.0.0.0:8000"

completion = openai.ChatCompletion.create(model="zephyr-alpha", messages=[{"role": "user", "content": "Hello world"}])
print(completion.choices[0].message.content)

Setting API Base with model name

If your repo only lets you specify the api base, you can add the model name to the api base you pass in:

import openai 
openai.api_base = "http://0.0.0.0:8000/openai/deployments/zephyr-alpha/chat/completions" # zephyr-alpha will be used

completion = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "Hello world"}])
print(completion.choices[0].message.content)
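
The api base in the example above follows a fixed pattern, so it can be built from the model name. A minimal sketch, assuming the default proxy address; `api_base_for_model` is a hypothetical helper, and the URL layout is copied from the example rather than from a separate specification.

```python
def api_base_for_model(model_name, proxy_base="http://0.0.0.0:8000"):
    """Embed a model name from the config in the api base, following the URL pattern above."""
    return f"{proxy_base.rstrip('/')}/openai/deployments/{model_name}/chat/completions"
```

For instance, `openai.api_base = api_base_for_model("zephyr-alpha")` produces the same value as the hard-coded string in the example.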

Save Model-specific params (API Base, API Keys, Temperature, etc.)​

You can use the config to save model-specific information like api_base, api_key, temperature, max_tokens, etc.

Step 1: Create a config.yaml file

model_list:
  - model_name: gpt-3.5-turbo
    litellm_params: # params for litellm.completion() - https://docs.litellm.ai/docs/completion/input#input---request-body
      model: azure/chatgpt-v-2 # azure/<your-deployment-name>
      api_key: your_azure_api_key
      api_version: your_azure_api_version
      api_base: your_azure_api_base
  - model_name: mistral-7b
    litellm_params:
      model: ollama/mistral
      api_base: your_ollama_api_base

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Model Alias​

Set a model alias for your deployments.

In the config.yaml the model_name parameter is the user-facing name to use for your deployment.

For example, to expose a Huggingface TGI Mistral-7b deployment to users as 'mistral-7b', we might save it as:

model_list:
  - model_name: mistral-7b # ALIAS
    litellm_params:
      model: huggingface/mistralai/Mistral-7B-Instruct-v0.1 # ACTUAL NAME
      api_key: your_huggingface_api_key # [OPTIONAL] if deployed on huggingface inference endpoints
      api_base: your_api_base # url where model is deployed
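
Conceptually, an alias is a lookup from the user-facing model_name to the litellm_params of the deployment behind it. The sketch below illustrates that mapping on a plain Python list; `resolve_alias` is a hypothetical function and does not describe how the proxy routes requests internally.

```python
def resolve_alias(model_list, requested_name):
    """Return the litellm_params for the entry whose model_name matches, else None."""
    for entry in model_list:
        if entry.get("model_name") == requested_name:
            return entry.get("litellm_params")
    return None


# The alias config above, written as the list a YAML loader would produce.
model_list = [
    {
        "model_name": "mistral-7b",  # ALIAS
        "litellm_params": {
            "model": "huggingface/mistralai/Mistral-7B-Instruct-v0.1",  # ACTUAL NAME
            "api_base": "your_api_base",
        },
    }
]
```

Here `resolve_alias(model_list, "mistral-7b")["model"]` gives the actual model name, and an unknown alias yields `None`.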

Set Custom Prompt Templates​

By default, LiteLLM checks whether a model has a prompt template and applies it (e.g. if a Huggingface model has a chat template saved in its tokenizer_config.json). You can also set a custom prompt template on your proxy in the config.yaml:

Step 1: Save your prompt template in a config.yaml

# Model-specific parameters
model_list:
  - model_name: mistral-7b # model alias
    litellm_params: # actual params for litellm.completion()
      model: "huggingface/mistralai/Mistral-7B-Instruct-v0.1"
      api_base: "<your-api-base>"
      api_key: "<your-api-key>" # [OPTIONAL] for hf inference endpoints
      initial_prompt_value: "\n"
      roles: {"system":{"pre_message":"<|im_start|>system\n", "post_message":"<|im_end|>"}, "assistant":{"pre_message":"<|im_start|>assistant\n","post_message":"<|im_end|>"}, "user":{"pre_message":"<|im_start|>user\n","post_message":"<|im_end|>"}}
      final_prompt_value: "\n"
      bos_token: "<s>"
      eos_token: "</s>"
      max_tokens: 4096
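
To show what the roles, initial_prompt_value, and final_prompt_value fields are for, here is a simplified sketch of how such a template could turn a message list into one prompt string. It assumes each message is wrapped in its role's pre_message and post_message; `render_prompt` is a hypothetical function, and bos_token and eos_token are deliberately left out because their exact placement is not described here. LiteLLM's actual rendering may differ.

```python
def render_prompt(messages, roles, initial_prompt_value="", final_prompt_value=""):
    """Wrap each message in its role's pre/post strings and join them into one prompt."""
    prompt = initial_prompt_value
    for message in messages:
        wrapper = roles.get(message["role"], {})  # unknown roles get no wrapping
        prompt += wrapper.get("pre_message", "") + message["content"] + wrapper.get("post_message", "")
    return prompt + final_prompt_value


# The roles mapping from the config above.
roles = {
    "system": {"pre_message": "<|im_start|>system\n", "post_message": "<|im_end|>"},
    "assistant": {"pre_message": "<|im_start|>assistant\n", "post_message": "<|im_end|>"},
    "user": {"pre_message": "<|im_start|>user\n", "post_message": "<|im_end|>"},
}
```

Under these assumptions, `render_prompt([{"role": "user", "content": "Hi"}], roles, "\n", "\n")` returns `"\n<|im_start|>user\nHi<|im_end|>\n"`.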

Step 2: Start server with config

$ litellm --config /path/to/config.yaml

Proxy CLI Arguments​

--host​

  • Default: '0.0.0.0'
  • The host for the server to listen on.
  • Usage:
    litellm --host 127.0.0.1

--port​

  • Default: 8000
  • The port to bind the server to.
  • Usage:
    litellm --port 8080

--num_workers​

  • Default: 1
  • The number of uvicorn workers to spin up.
  • Usage:
    litellm --num_workers 4

--api_base​

  • Default: None
  • The API base for the model litellm should call.
  • Usage:
    litellm --model huggingface/tinyllama --api_base https://k58ory32yinf1ly0.us-east-1.aws.endpoints.huggingface.cloud

--api_version​

  • Default: None
  • For Azure services, specify the API version.
  • Usage:
    litellm --model azure/gpt-deployment --api_version 2023-08-01 --api_base https://<your api base>

--model or -m​

  • Default: None
  • The model name to pass to LiteLLM.
  • Usage:
    litellm --model gpt-3.5-turbo

--test​

  • Type: bool (Flag)
  • Makes a test request to the proxy's chat completions URL.
  • Usage:
    litellm --test

--alias​

  • Default: None
  • An alias for the model, for user-friendly reference.
  • Usage:
    litellm --alias my-gpt-model

--debug​

  • Default: False
  • Type: bool (Flag)
  • Enable debugging mode for the input.
  • Usage:
    litellm --debug

--temperature​

  • Default: None
  • Type: float
  • Set the temperature for the model.
  • Usage:
    litellm --temperature 0.7

--max_tokens​

  • Default: None
  • Type: int
  • Set the maximum number of tokens for the model output.
  • Usage:
    litellm --max_tokens 50

--request_timeout​

  • Default: 600
  • Type: int
  • Set the timeout in seconds for completion calls.
  • Usage:
    litellm --request_timeout 300

--drop_params​

  • Type: bool (Flag)
  • Drop any unmapped params.
  • Usage:
    litellm --drop_params

--add_function_to_prompt​

  • Type: bool (Flag)
  • If a function is passed but unsupported, pass it as part of the prompt.
  • Usage:
    litellm --add_function_to_prompt

--config​

  • Configure LiteLLM by providing a configuration file path.
  • Usage:
    litellm --config path/to/config.yaml

--telemetry​

  • Default: True
  • Type: bool
  • Help track usage of this feature.
  • Usage:
    litellm --telemetry False
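
When launching the proxy from a script, the arguments above can be assembled programmatically. The following sketch covers a handful of the documented flags and uses only the standard library; `build_litellm_command` and `as_shell` are hypothetical helpers, and only flags listed in this section are emitted.

```python
import shlex


def build_litellm_command(model=None, host=None, port=None, api_base=None,
                          config=None, drop_params=False, debug=False):
    """Return the litellm CLI invocation as an argument list for subprocess."""
    command = ["litellm"]
    valued_flags = (
        ("--model", model),
        ("--host", host),
        ("--port", port),
        ("--api_base", api_base),
        ("--config", config),
    )
    for flag, value in valued_flags:
        if value is not None:  # skip flags the caller did not set
            command += [flag, str(value)]
    if drop_params:
        command.append("--drop_params")
    if debug:
        command.append("--debug")
    return command


def as_shell(command):
    """Render an argument list as a shell-safe string, e.g. for logging."""
    return shlex.join(command)
```

The resulting list can be passed to `subprocess.Popen` to start the server; `as_shell` is useful for printing the exact command that will run.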