Request Parameters

Complete reference for request parameters in the Chat Completions API.

Overview

The Chat Completions API provides a unified interface for interacting with multiple LLM providers. All requests require at minimum a model identifier and a messages array.

Base Endpoint: POST /chat/completions

Authentication: Bearer token in the Authorization header.
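As a sketch, a request can be assembled client-side before sending. The base URL and API key below are placeholders, not real values; substitute your own deployment's endpoint and credentials:

```python
import json

# Placeholder values -- replace with your deployment's endpoint and key.
API_BASE = "https://api.example.com"
API_KEY = "YOUR_API_KEY"

def build_request(model, messages, **options):
    """Assemble the URL, headers, and JSON body for a chat completion call."""
    url = f"{API_BASE}/chat/completions"
    headers = {
        "Authorization": f"Bearer {API_KEY}",  # bearer-token authentication
        "Content-Type": "application/json",
    }
    body = {"model": model, "messages": messages, **options}
    return url, headers, json.dumps(body)

url, headers, body = build_request(
    "gpt-4o",
    [{"role": "user", "content": "Hello"}],
    temperature=0.7,
)
```

The assembled body can then be sent with any HTTP client; only model and messages are required, and every optional parameter passes through unchanged.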

Required Parameters

Parameter | Type | Description
--------- | ---- | -----------
model | string | The ID of the model to use (e.g., gpt-4o, gemini-1.5-pro, claude-3-5-sonnet-20241022)
messages | array | An array of message objects representing the conversation history

Optional Parameters

Parameter | Type | Default | Range | Description
--------- | ---- | ------- | ----- | -----------
stream | boolean | false | - | Enable streaming responses via Server-Sent Events
temperature | number | 1.0 | 0.0 - 2.0 | Sampling temperature controlling randomness
max_tokens | integer | model-dependent | varies | Maximum number of tokens to generate
top_p | number | 1.0 | 0.0 - 1.0 | Nucleus sampling threshold
frequency_penalty | number | 0 | -2.0 - 2.0 | Penalty for frequently used tokens
presence_penalty | number | 0 | -2.0 - 2.0 | Penalty for introducing new topics
stop | string or array | null | - | Sequences where generation stops

Message Object Structure

Each message in the messages array must contain:

Parameter | Type | Required | Description
--------- | ---- | -------- | -----------
role | string | Yes | The role of the message author: system, user, assistant, or tool
content | string | Yes | The content of the message
name | string | No | Optional name of the participant
tool_calls | array | No | Tool calls generated by the model (for role: assistant)
tool_call_id | string | No | ID of the tool call (for role: tool)

Parameter Details

model

Identifies which AI model to use for generating completions. Different models have different capabilities, costs, and performance characteristics.

Examples:

  • gpt-4o - OpenAI's GPT-4 Omni
  • gemini-1.5-pro - Google's Gemini 1.5 Pro
  • claude-3-5-sonnet-20241022 - Anthropic's Claude 3.5 Sonnet

messages

An ordered array of message objects representing the conversation. The first message is typically a system message setting behavior, followed by alternating user and assistant messages.

Structure:

[
  {
    "role": "system",
    "content": "You are a helpful assistant."
  },
  {
    "role": "user",
    "content": "Hello! How are you?"
  },
  {
    "role": "assistant",
    "content": "I'm doing well, thank you!"
  }
]

stream

When set to true, the API returns responses as Server-Sent Events (SSE), allowing you to display partial results as they're generated. When false, the complete response is returned in a single JSON object.

Streaming Response Format:

data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}

data: {"choices":[{"delta":{"content":" world!"},"index":0}]}

data: [DONE]
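The event stream above can be consumed by reading lines, stripping the data: prefix, and accumulating the content deltas until the [DONE] sentinel. A minimal parsing sketch, assuming the delta format shown above:

```python
import json

def parse_sse_chunks(lines):
    """Yield content deltas from raw SSE lines until the [DONE] sentinel."""
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines between events
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break  # server signals end of stream
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

raw = [
    'data: {"choices":[{"delta":{"content":"Hello"},"index":0}]}',
    'data: {"choices":[{"delta":{"content":" world!"},"index":0}]}',
    'data: [DONE]',
]
text = "".join(parse_sse_chunks(raw))  # "Hello world!"
```

In a real client the lines would come from the HTTP response body rather than a list, but the parsing logic is the same.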

temperature

Controls the randomness of the model's output. Higher values (closer to 2.0) make the output more random and creative, while lower values (closer to 0.0) make it more focused and deterministic.

Guidelines:

  • 0.0 - 0.3: Highly deterministic, factual responses
  • 0.7 - 1.0: Balanced creativity and coherence (the 1.0 default falls here)
  • 1.0 - 2.0: Highly creative, unpredictable outputs

max_tokens

Sets an upper limit on the number of tokens the model can generate in the response. The actual output may be shorter if the model naturally completes its response.

Note: Different models have different maximum token limits. Exceeding these limits will result in an error.

top_p

An alternative to temperature for controlling diversity. The model samples only from the smallest set of tokens whose cumulative probability mass exceeds p, discarding the unlikely tail.

Recommendation: Use either temperature or top_p, but not both simultaneously.

frequency_penalty

Reduces the likelihood of the model repeating the same content. Positive values discourage repetition, while negative values encourage it.

Range: -2.0 to 2.0

presence_penalty

Encourages the model to talk about new topics. Higher positive values make the model more likely to introduce new concepts.

Range: -2.0 to 2.0

stop

Specifies sequences where the model should stop generating. This can be a single string or an array of up to 4 sequences.

Example:

{
  "stop": ["\n", "END"]
}

Request Examples

Minimal Request

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ]
}

Request with Temperature Control

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a creative writer."
    },
    {
      "role": "user",
      "content": "Write a short story about a robot."
    }
  ],
  "temperature": 1.5,
  "max_tokens": 500
}

Request with Streaming

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Explain quantum computing in simple terms."
    }
  ],
  "stream": true,
  "temperature": 0.7
}

Request with Penalties

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Write a product description for a new smartphone."
    }
  ],
  "temperature": 0.8,
  "frequency_penalty": 0.5,
  "presence_penalty": 0.3,
  "max_tokens": 300
}

Request with Stop Sequences

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "List three programming languages."
    }
  ],
  "temperature": 0.5,
  "stop": ["4.", "\n\n"]
}

Complex Multi-turn Conversation

{
  "model": "gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a technical support specialist. Be concise and helpful."
    },
    {
      "role": "user",
      "content": "My internet connection is slow. What should I do?"
    },
    {
      "role": "assistant",
      "content": "Try these steps:\n1. Restart your router\n2. Check your cables\n3. Test with another device"
    },
    {
      "role": "user",
      "content": "I tried those but it's still slow."
    }
  ],
  "temperature": 0.3,
  "max_tokens": 200
}
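Because the API is stateless, the client maintains the conversation by resending the full history each turn: append the model's reply and the user's follow-up before the next request. A sketch of that bookkeeping:

```python
def extend_history(messages, assistant_reply, next_user_message):
    """Append the model's reply and the user's follow-up to the history.

    The full messages array is resent on every request, so each
    conversational turn grows it by two entries.
    """
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_message},
    ]

history = [
    {"role": "system", "content": "You are a technical support specialist."},
    {"role": "user", "content": "My internet connection is slow."},
]
history = extend_history(history, "Try restarting your router.", "Still slow.")
```

Long conversations eventually approach the model's context window, at which point older turns must be truncated or summarized.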

Best Practices

Choosing Temperature

  • Code generation: Use low temperature (0.0 - 0.3)
  • Creative writing: Use higher temperature (0.8 - 1.5)
  • General assistance: Use default (1.0)

Managing Token Limits

  • Always set max_tokens to control costs
  • Account for both input and output tokens in your pricing calculations
  • Different models have different context windows
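A rough pre-flight check can estimate whether input plus output budget fits the context window. The 4-characters-per-token heuristic below is an approximation for English text, not a real tokenizer, and the window size is an illustrative placeholder:

```python
def rough_token_estimate(text):
    """Very rough heuristic: ~4 characters per token for English text.

    Use the model's actual tokenizer for billing-accurate counts.
    """
    return max(1, len(text) // 4)

def fits_context(messages, max_tokens, context_window):
    """Check that estimated input tokens plus the output budget fit."""
    input_tokens = sum(rough_token_estimate(m["content"]) for m in messages)
    return input_tokens + max_tokens <= context_window

msgs = [{"role": "user", "content": "What is the capital of France?"}]
ok = fits_context(msgs, max_tokens=200, context_window=8192)  # True
```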

Optimizing for Quality

  • Use presence_penalty to encourage variety in long conversations
  • Use frequency_penalty to avoid repetitive phrasing
  • Provide clear system messages to set expectations

Streaming vs Non-Streaming

  • Use streaming for real-time user interfaces
  • Use non-streaming for batch processing and API integrations

Common Errors

Error | Cause | Solution
----- | ----- | --------
invalid_model | Non-existent model ID | Verify the model ID against the models list
invalid_messages | Empty or malformed messages array | Ensure at least one message with a valid role and content
invalid_temperature | Temperature outside the 0.0 - 2.0 range | Adjust temperature to the valid range
invalid_max_tokens | Exceeds the model's maximum | Reduce the max_tokens value
invalid_stop | More than 4 stop sequences | Limit to 4 sequences maximum
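Several of these errors can be caught client-side before a request is sent. A validation sketch mirroring the checks above (the error names follow the table; the helper itself is illustrative, not part of the API):

```python
def validate_request(body):
    """Client-side checks mirroring the common API errors above."""
    errors = []
    if not body.get("model"):
        errors.append("invalid_model")
    messages = body.get("messages") or []
    if not messages or any("role" not in m or "content" not in m for m in messages):
        errors.append("invalid_messages")
    temperature = body.get("temperature", 1.0)
    if not 0.0 <= temperature <= 2.0:
        errors.append("invalid_temperature")
    stop = body.get("stop")
    if isinstance(stop, list) and len(stop) > 4:
        errors.append("invalid_stop")
    return errors

errs = validate_request({"model": "", "messages": []})
# errs == ["invalid_model", "invalid_messages"]
```

Note that invalid_max_tokens cannot be fully checked client-side, since the limit varies by model.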
