Building Applications with the ChatGPT API
The previous chapter covered the fundamentals: getting a key, understanding tokens, and making single API calls. Now we go deeper. Real applications are more complex — they stream responses so users see text as it generates, they maintain multi-turn conversation history, they call external tools via function calling, and they need to handle errors and rate limits without crashing.
This chapter builds up to a complete, working CLI chatbot in Python that you can run on your own machine today.
1. Streaming Responses
By default, the API waits until the model has finished generating the entire response, then returns it all at once. For a short response this is fine. For a 500-word essay or a long piece of code, the user stares at a blank screen for several seconds before seeing anything. That is a poor experience.
Streaming solves this: the API sends tokens as they are generated, and your application displays them progressively — just like watching ChatGPT type in the web interface.
Enabling Streaming in Python
Add stream=True to your API call. The response is now an iterator of chunks rather than a single object:
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Explain the history of the Indian rupee in 3 paragraphs."}
    ],
    stream=True
)

# Print each token as it arrives
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()  # newline after stream ends
The flush=True argument to print() ensures the output appears immediately rather than buffering. Without it, Python may batch terminal output and the streaming effect is lost.
Streaming in a Web Application
In a web application (Flask, FastAPI, Django), you would return the streaming response as a Server-Sent Events (SSE) stream. The browser receives chunks and updates the DOM progressively, much as the ChatGPT web interface does. The pattern is:
import os
from flask import Flask, Response, stream_with_context
from openai import OpenAI

app = Flask(__name__)
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

@app.route("/chat")
def chat():
    def generate():
        stream = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": "Tell me about Bengaluru's tech ecosystem."}],
            stream=True
        )
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                # Each SSE message is a "data:" line followed by a blank line
                yield f"data: {content}\n\n"
    return Response(stream_with_context(generate()), mimetype="text/event-stream")
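One caveat with the pattern above: in the SSE format every line of a payload needs its own data: prefix, so a chunk that contains a newline would not survive being interpolated directly into the string. A minimal sketch of one way around this is to JSON-encode each chunk before framing it. The helper name sse_event and the {"content": ...} payload shape are assumptions made for this illustration, not part of Flask or the OpenAI SDK.

```python
import json


def sse_event(content: str) -> str:
    """Frame one streamed chunk as a single Server-Sent Events message.

    JSON-encoding escapes any newlines inside the chunk, so the payload
    always fits on one "data:" line. The {"content": ...} shape is an
    assumption for this sketch; the browser would JSON.parse each event.
    """
    payload = json.dumps({"content": content}, ensure_ascii=False)
    return f"data: {payload}\n\n"


# A chunk containing a paragraph break still yields one well-formed event.
print(repr(sse_event("First line.\n\nSecond line.")))
```

Inside generate(), the yield would then become yield sse_event(content), and the browser-side handler would parse each event's data as JSON.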
2. Managing Conversation History
The API is stateless. Every request you make starts fresh — the model has no memory of previous calls in the same session unless you explicitly include prior messages in the messages array.
This means your application is responsible for:
- Storing messages as the conversation progresses
- Appending both user messages and assistant responses to the history
- Sending the full history with each new API call
The Conversation History Pattern
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

# Start with a system prompt
conversation_history = [
    {
        "role": "system",
        "content": "You are a knowledgeable assistant about Indian mutual funds. Answer clearly and always recommend users consult a SEBI-registered advisor for investment decisions."
    }
]

def chat(user_message: str) -> str:
    # Add the new user message to history
    conversation_history.append({"role": "user", "content": user_message})
    # Call the API with the full history
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=conversation_history
    )
    # Extract the assistant's reply
    assistant_message = response.choices[0].message.content
    # Append it to history for the next turn
    conversation_history.append({"role": "assistant", "content": assistant_message})
    return assistant_message

# Example conversation
print(chat("What is the difference between ELSS and PPF?"))
print(chat("Which one is better for tax saving?"))
print(chat("Can you explain the lock-in period for the first one you mentioned?"))
The third message ("the first one you mentioned") works because the model has access to the full prior exchange in conversation_history.
The Context Window Problem
Every model has a maximum context window — the total number of tokens it can process in a single request (input + output combined). For gpt-4o-mini, this is 128,000 tokens. That sounds enormous, but a long conversation accumulates quickly, especially if responses are detailed.
When history grows too large, you have three options:
Option A — Truncation: Keep only the last N messages. Simple, but the model loses early context.
MAX_HISTORY = 20  # keep last 20 messages

if len(conversation_history) > MAX_HISTORY + 1:  # +1 for system message
    # Always keep system message + last MAX_HISTORY messages
    conversation_history = [conversation_history[0]] + conversation_history[-MAX_HISTORY:]
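A message count is only a rough proxy for size, since one long reply can outweigh twenty short ones. The sketch below trims by an approximate token budget instead. Both function names are hypothetical, and estimate_tokens relies on the rough rule of thumb of about four characters per token for English text, which is an assumption; a real tokenizer would give exact counts. It also assumes every message is a plain dict with string content.

```python
def estimate_tokens(message: dict) -> int:
    """Very rough size estimate: about 4 characters per token (assumption)."""
    return len(message["content"]) // 4 + 1


def trim_to_budget(history: list, max_tokens: int) -> list:
    """Keep the system message plus the newest messages that fit the budget."""
    system, rest = history[0], history[1:]
    budget = max_tokens - estimate_tokens(system)
    kept = []
    # Walk backwards from the newest message and stop when the budget runs out.
    for message in reversed(rest):
        cost = estimate_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    # kept was collected newest-first, so reverse it back to chronological order.
    return [system] + kept[::-1]
```

Walking from the newest message backwards means the most recent turns are always the ones preserved, which matches the intent of Option A.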
Option B — Summarisation: Periodically ask the model to summarise the conversation so far, then replace the accumulated history with a single summary message.
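As a rough sketch of Option B, the function below folds older messages into a single summary message and keeps the most recent turns verbatim. The name compact_history and the keep_recent parameter are hypothetical. The summarise argument is any function that turns a transcript into a short summary; in a real application it would presumably wrap a chat completion call, but it is left as a parameter here so the logic can be tried without network access.

```python
from typing import Callable


def compact_history(
    history: list,
    summarise: Callable[[str], str],
    keep_recent: int = 4,
) -> list:
    """Replace older messages with one summary, keeping the newest few verbatim."""
    system, rest = history[0], history[1:]
    if len(rest) <= keep_recent:
        return history  # nothing old enough to compress
    split = len(rest) - keep_recent
    old, recent = rest[:split], rest[split:]
    # Flatten the old turns into a plain-text transcript for the summariser.
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary_message = {
        "role": "system",
        "content": "Summary of the earlier conversation: " + summarise(transcript),
    }
    return [system, summary_message] + recent
```

Keeping a few recent turns untouched is a design choice: references such as "the first one you mentioned" usually point at the last couple of exchanges, which a summary might blur.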
Option C — Semantic search: Store messages in a vector database and retrieve only the most relevant prior messages for each new query (Retrieval-Augmented Generation). This is more complex but scales to very long conversations.
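To make the retrieval idea in Option C concrete, here is a toy sketch that ranks stored messages by similarity to the new query. It deliberately swaps in word-count vectors for real embeddings and an in-memory list for a vector database, so it only illustrates the ranking step; all function names are hypothetical.

```python
import math
from collections import Counter


def vectorise(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def most_relevant(messages: list, query: str, k: int = 2) -> list:
    """Return the k stored messages most similar to the query, in original order."""
    query_vec = vectorise(query)
    ranked = sorted(
        range(len(messages)),
        key=lambda i: cosine(vectorise(messages[i]["content"]), query_vec),
        reverse=True,
    )
    return [messages[i] for i in sorted(ranked[:k])]
```

In a production system the vectorise step would be replaced by an embedding model and the linear scan by a vector index, but the shape of the flow (embed, score, take the top k, send only those) stays the same.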
3. Function Calling (Tool Use)
Function calling is one of the most powerful API features. It allows you to define functions that the model can "call" when it determines that an external data source or action is needed to answer the user's question. The model does not actually execute your code — it generates a structured JSON object specifying which function to call and with what arguments. Your application then calls the real function and returns the result to the model.
The Flow
1. User asks a question
2. You send the question + function definitions to the API
3. Model responds with a "tool_call" instead of a text answer
4. Your application executes the real function with the model's arguments
5. You send the function's result back to the model
6. Model generates a final natural-language response using the result
Defining Tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_stock_price",
            "description": "Get the current stock price for an Indian company listed on NSE or BSE.",
            "parameters": {
                "type": "object",
                "properties": {
                    "ticker_symbol": {
                        "type": "string",
                        "description": "The NSE/BSE ticker symbol, e.g. RELIANCE, TCS, INFY"
                    },
                    "exchange": {
                        "type": "string",
                        "enum": ["NSE", "BSE"],
                        "description": "The stock exchange"
                    }
                },
                "required": ["ticker_symbol"]
            }
        }
    }
]
Handling a Tool Call
import json

# Assumes client and tools are defined as in the earlier examples

def get_stock_price(ticker_symbol: str, exchange: str = "NSE") -> dict:
    # In a real app, this would call a market data API
    # Here we return mock data
    mock_prices = {"RELIANCE": 2950.50, "TCS": 3820.00, "INFY": 1680.25}
    price = mock_prices.get(ticker_symbol.upper())
    if price is not None:
        return {"ticker": ticker_symbol, "price": price, "currency": "INR", "exchange": exchange}
    return {"error": f"Ticker {ticker_symbol} not found"}

def chat_with_tools(user_message: str) -> str:
    messages = [
        {"role": "system", "content": "You are a helpful stock market assistant for Indian investors."},
        {"role": "user", "content": user_message}
    ]
    # First call: model decides whether to use a tool
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        tools=tools,
        tool_choice="auto"  # model decides when to use tools
    )
    choice = response.choices[0]
    # Check if the model wants to call a function
    if choice.finish_reason == "tool_calls":
        # Add the assistant's tool call message to the history first
        messages.append(choice.message)
        # The model may request several calls; each one needs its own tool reply
        for tool_call in choice.message.tool_calls:
            function_args = json.loads(tool_call.function.arguments)
            # Execute the real function (get_stock_price is the only tool defined here)
            function_result = get_stock_price(**function_args)
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": json.dumps(function_result)
            })
        # Second call: model generates a natural-language response using the result
        final_response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages
        )
        return final_response.choices[0].message.content
    # No tool call: return the direct response
    return choice.message.content

print(chat_with_tools("What is the current price of TCS?"))
Function calling is the mechanism behind AI agents that can search the web, query databases, send emails, or call any API you define.
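Once an application defines more than one tool, it helps to route calls through a small registry and to check the model-generated arguments before running anything. The sketch below shows one possible shape for that. TOOL_REGISTRY and run_tool are hypothetical names, the stock lookup is the same kind of mock used above, and the validation is intentionally basic (unknown tool, malformed JSON, unexpected or missing argument names).

```python
import json


def get_stock_price(ticker_symbol: str, exchange: str = "NSE") -> dict:
    """Mock lookup standing in for a real market data API."""
    prices = {"RELIANCE": 2950.50, "TCS": 3820.00, "INFY": 1680.25}
    price = prices.get(ticker_symbol.upper())
    if price is None:
        return {"error": f"Ticker {ticker_symbol} not found"}
    return {"ticker": ticker_symbol.upper(), "price": price, "exchange": exchange}


# Map each tool name to its function and the argument names it accepts.
TOOL_REGISTRY = {
    "get_stock_price": (get_stock_price, {"ticker_symbol", "exchange"}),
}


def run_tool(name: str, raw_arguments: str) -> str:
    """Validate a model-generated tool call and return a JSON string result."""
    if name not in TOOL_REGISTRY:
        return json.dumps({"error": f"Unknown tool: {name}"})
    function, allowed = TOOL_REGISTRY[name]
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return json.dumps({"error": "Arguments were not valid JSON"})
    # Reject anything that is not an object or that carries unexpected keys.
    if not isinstance(arguments, dict) or set(arguments) - allowed:
        return json.dumps({"error": "Unexpected arguments"})
    try:
        return json.dumps(function(**arguments))
    except TypeError as exc:  # e.g. a required argument is missing
        return json.dumps({"error": str(exc)})
```

Returning errors as JSON rather than raising lets the application send the failure back to the model as the tool result, so the model can apologise or retry with different arguments.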
4. Structured Outputs — JSON Mode
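Sometimes you need the model to return data in a specific format that your application can parse: not prose, but structured JSON. The API offers two mechanisms for this. JSON mode, covered here, guarantees syntactically valid JSON and leaves the shape to your prompt. Structured Outputs goes further and constrains the reply to a JSON Schema that you supply through response_format with type "json_schema".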
JSON Mode
Add response_format={"type": "json_object"} to your call. The model will return valid JSON, but you still control the schema through your prompt:
import json

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": "You are a data extraction assistant. Always respond with valid JSON."
        },
        {
            "role": "user",
            "content": """Extract the following details from this job posting and return as JSON:
Job posting: "We are hiring a Senior Python Developer at our Pune office.
CTC: ₹18–24 LPA. Requirements: 5+ years Python, Django, PostgreSQL.
Apply by 31 July 2026."
Extract: job_title, location, salary_range, required_skills (list), application_deadline"""
        }
    ],
    response_format={"type": "json_object"}
)

data = json.loads(response.choices[0].message.content)
print(data)
Output:
{
    "job_title": "Senior Python Developer",
    "location": "Pune",
    "salary_range": "₹18–24 LPA",
    "required_skills": ["Python", "Django", "PostgreSQL"],
    "application_deadline": "31 July 2026"
}
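Because JSON mode leaves the schema to the prompt, the reply can be valid JSON and still lack a key or use the wrong type. A small check before the data reaches a database or UI is a cheap safeguard. The following is a sketch under the assumption that the five fields from the example above are all required; REQUIRED_FIELDS and parse_job_posting are hypothetical names.

```python
import json

# Expected keys and their Python types for the job-posting example.
REQUIRED_FIELDS = {
    "job_title": str,
    "location": str,
    "salary_range": str,
    "required_skills": list,
    "application_deadline": str,
}


def parse_job_posting(raw: str) -> dict:
    """Parse the model's JSON reply and check it has the expected shape."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed input
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in data:
            raise ValueError(f"Missing field: {field}")
        if not isinstance(data[field], expected_type):
            raise ValueError(f"Field {field} should be {expected_type.__name__}")
    return data
```

In the example above, the call would be data = parse_job_posting(response.choices[0].message.content), with a ValueError treated as a signal to retry or to flag the record for review.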
When to Use Structured Output
Structured output is essential when:
- Your application needs to parse the model's response programmatically
- You are feeding the model's output into a database, a UI component, or another system
- You want consistent, predictable response shapes rather than free-form prose
5. Building a Complete CLI Chatbot in Python
Now let us combine everything — history management, streaming, and a system prompt — into a complete, working CLI chatbot. This is the kind of tool you could actually use for daily work.
Full Code
Save this as chatbot.py:
import os
import sys
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

SYSTEM_PROMPT = """You are a helpful assistant for Indian professionals and students.
You answer questions clearly and concisely. When asked about financial, legal, or medical topics,
provide helpful general information while recommending consultation with a qualified professional.
You are familiar with Indian context: rupees, Indian companies, Indian law, Indian education system."""

MAX_HISTORY_MESSAGES = 20  # beyond system prompt

def truncate_history(history: list) -> list:
    """Keep system message + last MAX_HISTORY_MESSAGES messages."""
    if len(history) <= MAX_HISTORY_MESSAGES + 1:
        return history
    return [history[0]] + history[-MAX_HISTORY_MESSAGES:]

def stream_response(messages: list) -> str:
    """Stream the model's response and return the full text."""
    full_response = ""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages,
        stream=True,
        temperature=0.7,
        max_tokens=1000
    )
    print("\nAssistant: ", end="", flush=True)
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            print(delta.content, end="", flush=True)
            full_response += delta.content
    print("\n")
    return full_response

def main():
    print("=== ChatBot (type 'quit' or 'exit' to stop, 'clear' to reset) ===\n")
    conversation_history = [
        {"role": "system", "content": SYSTEM_PROMPT}
    ]
    while True:
        try:
            user_input = input("You: ").strip()
        except (KeyboardInterrupt, EOFError):
            print("\nGoodbye!")
            sys.exit(0)
        if not user_input:
            continue
        if user_input.lower() in ("quit", "exit"):
            print("Goodbye!")
            break
        if user_input.lower() == "clear":
            conversation_history = [{"role": "system", "content": SYSTEM_PROMPT}]
            print("Conversation cleared.\n")
            continue
        # Add user message
        conversation_history.append({"role": "user", "content": user_input})
        # Truncate if needed
        conversation_history = truncate_history(conversation_history)
        try:
            assistant_reply = stream_response(conversation_history)
            conversation_history.append({"role": "assistant", "content": assistant_reply})
        except Exception as e:
            print(f"\nError: {e}\n")
            # Remove the failed user message from history
            conversation_history.pop()

if __name__ == "__main__":
    main()
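Install the two dependencies (pip install openai python-dotenv), make sure OPENAI_API_KEY is set in your environment or in a .env file, then run it:
python chatbot.py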
You now have a streaming, multi-turn chatbot in the terminal with history management and graceful error handling.
6. Production Best Practices
Moving from a working script to a reliable production service requires handling the messy realities of the real world: API errors, rate limits, and unexpected inputs.
Rate Limits
OpenAI enforces rate limits on two dimensions:
- RPM (Requests Per Minute) — how many calls you can make per minute
- TPM (Tokens Per Minute) — how many tokens you can process per minute
Rate limits vary by tier. New accounts have lower limits; as you spend more, limits increase. When you exceed a rate limit, the API returns a 429 error.
Retry with Exponential Backoff
The standard pattern for handling rate limits and transient errors is exponential backoff — wait a short time after the first failure, longer after the second, and so on:
import os
import time
import random
from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))

def call_with_retry(messages: list, max_retries: int = 5) -> str:
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                max_tokens=500
            )
            return response.choices[0].message.content
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential delay (1s, 2s, 4s, ...) plus random jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait_time:.1f} seconds...")
            time.sleep(wait_time)
        except APIStatusError as e:
            # APIStatusError always carries status_code; the broader APIError
            # also covers connection errors, which have no status code
            if e.status_code in (500, 503) and attempt < max_retries - 1:
                wait_time = (2 ** attempt) + random.uniform(0, 1)
                print(f"API error {e.status_code}. Retrying in {wait_time:.1f} seconds...")
                time.sleep(wait_time)
            else:
                raise
Input Validation
Before sending user input to the API, validate and sanitise it:
MAX_INPUT_LENGTH = 4000  # characters

def validate_input(user_input: str) -> str:
    if not user_input or not user_input.strip():
        raise ValueError("Input cannot be empty.")
    if len(user_input) > MAX_INPUT_LENGTH:
        raise ValueError(f"Input too long. Maximum {MAX_INPUT_LENGTH} characters.")
    return user_input.strip()
Cost Controls
Set usage limits in your OpenAI account dashboard to prevent unexpected bills. In your application, log token usage per request and aggregate daily costs:
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def call_with_cost_tracking(messages: list) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=messages
    )
    usage = response.usage
    # gpt-4o-mini pricing (approximate, check current rates)
    input_cost = (usage.prompt_tokens / 1_000_000) * 0.15
    output_cost = (usage.completion_tokens / 1_000_000) * 0.60
    total_cost = input_cost + output_cost
    logger.info(
        f"Tokens: {usage.prompt_tokens} in / {usage.completion_tokens} out | "
        f"Cost: ${total_cost:.6f}"
    )
    return response.choices[0].message.content
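The function above logs each request but does not aggregate anything. One simple way to get daily totals is an in-memory tracker keyed by date, sketched below. DailyCostTracker is a hypothetical helper, the prices are the same approximate gpt-4o-mini figures used above (check current rates), and a real service would presumably persist the totals rather than hold them in memory.

```python
from collections import defaultdict
from datetime import date

# Approximate gpt-4o-mini prices in USD per million tokens, as used above.
INPUT_PRICE_PER_M = 0.15
OUTPUT_PRICE_PER_M = 0.60


class DailyCostTracker:
    """Accumulate estimated spend per calendar day, in memory only."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    def record(self, prompt_tokens: int, completion_tokens: int, day=None) -> float:
        """Add one request's estimated cost to the given day (default: today)."""
        cost = (
            (prompt_tokens / 1_000_000) * INPUT_PRICE_PER_M
            + (completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_M
        )
        self.totals[day or date.today()] += cost
        return cost

    def over_budget(self, daily_limit_usd: float, day=None) -> bool:
        """Report whether the day's running total has passed the limit."""
        return self.totals[day or date.today()] > daily_limit_usd
```

After each API call, tracker.record(usage.prompt_tokens, usage.completion_tokens) updates the running total, and over_budget can gate further requests once a self-imposed daily limit is reached.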
Model Fallback
If your primary model is unavailable or too slow, fall back to a faster, cheaper model:
def call_with_fallback(messages: list) -> str:
    for model in ["gpt-4o", "gpt-4o-mini"]:
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                timeout=10.0  # 10-second timeout
            )
            return response.choices[0].message.content
        except Exception as e:
            logger.warning(f"Model {model} failed: {e}. Trying next.")
    raise RuntimeError("All models failed.")
Environment-Based Configuration
Avoid hardcoding model names, temperature values, or token limits. Use environment variables or a config file:
import os
MODEL = os.environ.get("OPENAI_MODEL", "gpt-4o-mini")
TEMPERATURE = float(os.environ.get("OPENAI_TEMPERATURE", "0.7"))
MAX_TOKENS = int(os.environ.get("OPENAI_MAX_TOKENS", "1000"))
This lets you change behaviour across environments (development, staging, production) without code changes.
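Environment variables are plain strings, so a typo in a deployment file can surface much later as a confusing API error. A possible refinement is to load the settings once into a small validated object. The sketch below assumes the same three variable names as above; ChatConfig and load_config are hypothetical names, and the temperature bounds reflect the 0 to 2 range the Chat Completions API accepts.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ChatConfig:
    model: str
    temperature: float
    max_tokens: int


def load_config(env=None) -> ChatConfig:
    """Read settings from a mapping (defaults to os.environ) and validate them."""
    env = os.environ if env is None else env
    config = ChatConfig(
        model=env.get("OPENAI_MODEL", "gpt-4o-mini"),
        temperature=float(env.get("OPENAI_TEMPERATURE", "0.7")),
        max_tokens=int(env.get("OPENAI_MAX_TOKENS", "1000")),
    )
    # Fail fast at startup instead of on the first API call.
    if not 0.0 <= config.temperature <= 2.0:
        raise ValueError("OPENAI_TEMPERATURE must be between 0 and 2")
    if config.max_tokens <= 0:
        raise ValueError("OPENAI_MAX_TOKENS must be positive")
    return config
```

Accepting the mapping as a parameter keeps the loader easy to test: a plain dict can stand in for the real environment.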
Common Pitfalls
Pitfall 1 — Not handling the stateless API correctly. Forgetting to append the assistant's response to the conversation history means the model loses context on the next turn. Every turn must append both the user message and the resulting assistant message.
Pitfall 2 — Allowing unbounded history growth. Without truncation or summarisation, long conversations will eventually exceed the context window, causing errors. Implement a history management strategy from day one.
Pitfall 3 — No retry logic in production. The OpenAI API occasionally returns 429 (rate limit) or 5xx (server error) responses. Without retry logic, your application fails on these transient errors. Exponential backoff is the standard solution.
Pitfall 4: Treating model-generated function arguments as trusted. When you define tools, the model generates the function arguments. Validate those arguments before passing them to real functions, especially if they interact with databases or external APIs. A maliciously crafted user message could attempt prompt injection to manipulate the arguments.
Pitfall 5 — Not setting a timeout. API calls can occasionally hang. Set a timeout parameter to prevent your application from waiting indefinitely.
Pitfall 6 — Ignoring finish_reason in streaming. In a streaming response, the final chunk includes the finish_reason. If it is length, the response was cut off. Your application should handle this gracefully rather than presenting a truncated answer as complete.
Pitfall 7 — Over-engineering for day one. You do not need vector databases, caching layers, and model fallback on your first prototype. Build simple, observe real usage patterns, then optimise what actually causes problems.
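To illustrate Pitfall 6, the sketch below collects a stream while remembering the last finish_reason it saw, then marks the text if the reply was cut off. collect_stream and fake_chunk are hypothetical helpers; fake_chunk only builds stand-in objects with the attributes the loop reads, so the logic can be exercised without calling the API.

```python
from types import SimpleNamespace


def collect_stream(stream) -> tuple:
    """Join streamed text and report the last finish_reason seen."""
    parts = []
    finish_reason = None
    for chunk in stream:
        if not chunk.choices:  # some chunks may carry no choices, e.g. usage-only ones
            continue
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason is not None:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason


def fake_chunk(content, finish_reason=None):
    """Build a stand-in object with the attributes the loop reads."""
    delta = SimpleNamespace(content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta, finish_reason=finish_reason)])


text, reason = collect_stream(
    [fake_chunk("The rupee "), fake_chunk("was"), fake_chunk(None, "length")]
)
if reason == "length":
    text += " [response truncated]"
print(text)
```

With a real stream, the same function accepts the iterator returned by client.chat.completions.create(..., stream=True); what to do on "length" (warn the user, request a continuation, raise max_tokens) is an application decision.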
Practice Exercises
- Extend the CLI chatbot from section 5 to display the token usage and estimated cost in rupees at the end of each response. Use the approximate rate of ₹85 per USD. (When streaming, usage is only reported if you pass stream_options={"include_usage": True}.)
- Add a /summarise command to the CLI chatbot that, when typed, asks the model to summarise the conversation so far and replaces the full history with the summary as a single system-level context message.
- Implement function calling for a simple use case: define a get_weather function (return mock data) and a convert_currency function (convert USD to INR at a fixed rate). Build a chatbot that uses these tools when relevant.
- Build a batch processing script that reads 20 customer support emails from a text file (one per line), classifies each as "Billing", "Technical", "Returns", or "General" using the API with temperature=0 and response_format={"type": "json_object"}, and writes the results to a CSV file.
- Implement the exponential backoff retry function from section 6 and test it by temporarily setting an invalid API key to trigger errors, then a valid one. Confirm that the retry mechanism behaves correctly: an invalid key produces a 401 authentication error, which should be raised immediately rather than retried.
Summary
- Streaming responses (stream=True) send tokens progressively to the user, eliminating blank-screen wait times and significantly improving perceived performance.
- Conversation history is managed entirely by your application: append both user messages and assistant responses to the messages array after each turn.
- History truncation (keeping only the last N messages or summarising) is necessary for long conversations to stay within the model's context window.
- Function calling allows the model to request execution of your application's functions. The model specifies which function and with what arguments, your code executes it and returns the result, and the model uses that result in its final response.
- JSON mode (response_format={"type": "json_object"}) makes the model return valid, parseable JSON, which is essential for data extraction and any integration with downstream systems.
- The complete CLI chatbot combines several of these patterns: history management, streaming, error handling, and a graceful command loop.
- Production deployments require retry logic with exponential backoff for rate limit and server errors, input validation, cost tracking, and configuration via environment variables rather than hardcoded values.
- Build simply first, and add complexity (vector search, caching, fallback models) only when real usage reveals the need.