Why Your LLM Responses Are Slow and What a Full Stack Developer Can Actually Do About It
Your AI feature shipped. Users are complaining it's slow. You check the LLM API dashboard — the model inference time is 1.2 seconds. That looks fine. But your users are waiting 6–8 seconds. Something between your application and the API is adding 5–7 seconds of latency that has nothing to do with the model.
This is the most common LLM performance problem in production, and it's largely invisible until you measure it correctly. The model gets the blame. The actual culprits are almost always in the application layer.
Unlike model inference speed (which you can't control without switching providers), every problem covered in this article is something a full stack developer can fix in the application layer. Some of these fixes take an afternoon. Some require architectural changes. All of them are measurable.

The Latency Stack: Where Time Is Actually Spent
Before optimising, you need to measure correctly. Most developers look at total response time and assume it's model inference. It almost never is.
LLM response latency has five distinct components:
1. Network round trip to the API: the time for your request to travel from your server to the LLM provider's infrastructure and back. Largely determined by geographic distance, plus connection setup (DNS, TCP, TLS) when connections are not reused.
2. Time to first token (TTFT): the time from when the model receives the request until it emits the first output token. Affected by prompt length, server load, and cold/warm inference workers.
3. Token generation time: total tokens in the response ÷ tokens per second. Pure inference speed, largely outside your control for a given model.
4. Application processing overhead: time your backend spends constructing the prompt, making database calls to retrieve conversation history, running input validation, and assembling the request before it's sent to the API.
5. Time to first byte on the client: after the API starts responding, the time for your backend to process the response and begin forwarding it to the frontend.
The non-obvious finding: when latency is measured correctly and broken into these five components, components 1, 4, and 5 often account for more of the total than components 2 and 3 combined. The model inference is often not the slowest part. The application plumbing around it is.
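To make the split concrete, here is a minimal sketch that models total latency as the sum of the five components. The function name estimateLatency and every input value are illustrative assumptions, not measurements; plug in your own numbers.

```javascript
// Hypothetical helper: models end-to-end latency as the sum of the five components.
// All inputs are in milliseconds except outputTokens and tokensPerSecond.
function estimateLatency({ networkMs, ttftMs, outputTokens, tokensPerSecond, appMs, forwardMs }) {
  // Component 3: generation time is tokens divided by throughput
  const generationMs = (outputTokens / tokensPerSecond) * 1000;
  // Components 2 + 3: time attributable to the model
  const modelMs = ttftMs + generationMs;
  // Components 1 + 4 + 5: time attributable to the application plumbing
  const plumbingMs = networkMs + appMs + forwardMs;
  const totalMs = modelMs + plumbingMs;
  return { generationMs, modelMs, plumbingMs, totalMs, plumbingShare: plumbingMs / totalMs };
}

// Example with made-up inputs
const estimate = estimateLatency({
  networkMs: 180,
  ttftMs: 400,
  outputTokens: 100,
  tokensPerSecond: 50,
  appMs: 300,
  forwardMs: 20,
});
console.log(estimate);
```

The plumbingShare field is the fraction of total time you can influence from the application layer.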
Measuring: Add Timestamps Now
You cannot optimise what you cannot measure. Add timestamp logging to your LLM pipeline immediately:
// Must be an async generator (function*) because it yields chunks to the caller
async function* callLLMWithTiming(userId, messages) {
  const timings = {
    requestReceived: Date.now(),
  };
  // Component 4: application processing overhead
  const processedMessages = await buildMessages(userId, messages);
  timings.messagesBuilt = Date.now();
  // Components 1 + 2: network round trip + TTFT
  timings.apiCallStarted = Date.now();
  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: processedMessages,
    stream: true,
  });
  let tokenCount = 0;
  for await (const chunk of stream) {
    if (timings.firstTokenReceived === undefined) {
      timings.firstTokenReceived = Date.now();
    }
    tokenCount++; // Counts stream chunks, a rough proxy for output tokens
    yield chunk; // Stream to client
  }
  timings.generationComplete = Date.now();
  // Log the component breakdown
  logger.info('llm_latency_breakdown', {
    userId,
    // Time spent building the prompt before the API call was made
    appProcessingMs: timings.messagesBuilt - timings.requestReceived,
    ttftMs: timings.firstTokenReceived - timings.apiCallStarted,
    generationMs: timings.generationComplete - timings.firstTokenReceived,
    totalMs: timings.generationComplete - timings.requestReceived,
    tokenCount,
  });
}
After running this in production for a day, look at your appProcessingMs value. If it's over 200ms, you have application-layer overhead to address before you do anything else.
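A single request tells you little, so it helps to look at percentiles across a day of logs. The sketch below assumes you can load the logged llm_latency_breakdown entries as an array of plain objects; the helper names percentile and summariseBreakdowns are hypothetical.

```javascript
// Nearest-rank percentile over an array of numbers. Returns NaN for an empty array.
function percentile(values, p) {
  if (values.length === 0) return NaN;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Summarises each latency field as p50 and p95.
// Assumes entries look like the objects logged under 'llm_latency_breakdown'.
function summariseBreakdowns(entries) {
  const fields = ['appProcessingMs', 'ttftMs', 'generationMs', 'totalMs'];
  const summary = {};
  for (const field of fields) {
    // Skip entries where the field is missing or NaN (for example, an empty stream)
    const values = entries.map((entry) => entry[field]).filter((v) => Number.isFinite(v));
    summary[field] = { p50: percentile(values, 50), p95: percentile(values, 95) };
  }
  return summary;
}
```

Comparing the p95 of appProcessingMs against the p95 of ttftMs is usually a quicker way to see where the tail latency comes from than averaging everything together.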
Fix 1: Move the Server Closer to the LLM Provider
If your server is in ap-south-1 (Mumbai) and the OpenAI API endpoint you reach is served from US East, you're adding roughly 180ms of round-trip network latency on every request. For streaming, this affects every step: request transmission, first token receipt, and stream completion.
Check your latency by provider region:
# Measure network latency to OpenAI API (without an API key the response is a 401, but the timings are still valid)
curl -w "@curl-format.txt" -o /dev/null -s https://api.openai.com/v1/models
# curl-format.txt:
# time_namelookup: %{time_namelookup}
# time_connect: %{time_connect}
# time_appconnect: %{time_appconnect}
# time_total: %{time_total}
If your measurements confirm that the provider's endpoint is closest to US East and your users are in Asia, deploying your API server in us-east-1 reduces network latency to the LLM provider significantly, even though it increases latency to the user. Use edge functions or regional caching for the user-facing layer.
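The same measurement can be taken from inside your running service, which captures the network path your production traffic actually uses. This is a minimal sketch, assuming Node 18+ (global fetch and performance); measureRoundTripMs and its options are hypothetical names, and fetchFn and now are injectable only so the function can be exercised without a network.

```javascript
// Hypothetical helper: times several sequential GET requests and reports min and median.
// The first sample usually includes DNS and TLS setup; later samples may reuse the connection.
async function measureRoundTripMs(url, { samples = 5, fetchFn = fetch, now = () => performance.now() } = {}) {
  const durations = [];
  for (let i = 0; i < samples; i++) {
    const start = now();
    const response = await fetchFn(url, { method: 'GET' });
    await response.arrayBuffer(); // Drain the body so the full response is timed
    durations.push(now() - start);
  }
  durations.sort((a, b) => a - b);
  return {
    minMs: durations[0],
    medianMs: durations[Math.floor(durations.length / 2)],
    samples: durations,
  };
}
```

Run it from each candidate region against the same URL and compare the medians. The numbers include the provider's own processing time for the request, so treat them as an upper bound on pure network latency.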
Fix 2: Reduce Prompt Construction Overhead
Slow prompt construction is the most commonly missed application-layer latency source. A typical conversation feature:
1. Receives the user's message
2. Queries PostgreSQL for conversation history (N messages)
3. Queries PostgreSQL or Redis for user context (preferences, profile)
4. Queries a vector database for relevant context documents
5. Builds the final messages array
6. Calls the LLM API
Steps 2–4 are database calls that happen sequentially in most implementations. Each adds 10–50ms. Three sequential calls add 30–150ms before the first LLM token.
// Slow: sequential database calls
async function buildMessagesSlowly(userId, userMessage) {
  const history = await getConversationHistory(userId); // 40ms
  const userContext = await getUserContext(userId); // 25ms
  const relevantDocs = await retrieveContext(userMessage); // 60ms
  // Total: 125ms (the sum of all three calls)
  return buildPrompt(history, userContext, relevantDocs, userMessage);
}
// Fast: parallel database calls
async function buildMessagesFast(userId, userMessage) {
  const [history, userContext, relevantDocs] = await Promise.all([
    getConversationHistory(userId), // 40ms
    getUserContext(userId), // 25ms (runs simultaneously)
    retrieveContext(userMessage), // 60ms (runs simultaneously)
  ]);
  // Total: 60ms (the slowest call) instead of 125ms
  return buildPrompt(history, userContext, relevantDocs, userMessage);
}
Parallelising database calls at prompt construction time is one of the highest-ROI latency improvements available, and the code change itself is often only a few lines.
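If you want to see the effect before touching production code, the sketch below simulates the three lookups with timers and times both approaches. Everything in it is a stand-in: sleep replaces the real database calls, and the delays are the illustrative figures used above.

```javascript
// Resolves with `value` after `ms` milliseconds. Stands in for a database call.
const sleep = (ms, value) => new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Runs an async function and reports how long it took.
async function timeIt(fn) {
  const start = performance.now();
  const result = await fn();
  return { result, ms: performance.now() - start };
}

// Compares awaiting the simulated calls one by one against Promise.all.
async function compareSequentialVsParallel(delays = [40, 25, 60]) {
  const tasks = delays.map((ms, index) => () => sleep(ms, `result-${index}`));
  const sequential = await timeIt(async () => {
    const results = [];
    for (const task of tasks) results.push(await task()); // Each call waits for the previous one
    return results;
  });
  // All calls start immediately; total time is roughly the slowest one
  const parallel = await timeIt(() => Promise.all(tasks.map((task) => task())));
  return {
    sequentialMs: sequential.ms,
    parallelMs: parallel.ms,
    sameResults: JSON.stringify(sequential.result) === JSON.stringify(parallel.result),
  };
}

compareSequentialVsParallel().then((timings) => console.log(timings));
```

Promise.all preserves result order, so the destructuring in buildMessagesFast stays correct. It also rejects as soon as any call rejects, which matches the behaviour of the sequential version.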
Fix 3: Cache User Context
User context (profile, preferences, subscription tier, feature flags) changes rarely but is fetched on every request. Cache it:
async function getUserContextCached(userId) {
  const cacheKey = `user_context:${userId}`;
  const cached = await redis.get(cacheKey);
  // Note: on a cache hit, Date columns such as created_at come back as strings
  if (cached) return JSON.parse(cached);
  const result = await db.query(
    'SELECT plan, preferences, locale, created_at FROM users WHERE id = $1',
    [userId]
  );
  const context = result.rows[0];
  if (!context) return null; // Unknown user: nothing to cache
  await redis.setex(cacheKey, 300, JSON.stringify(context)); // 5-minute TTL
  return context;
}
A 25ms database call that happens on every request becomes a 1ms Redis read after the first call. At 1,000 requests per minute, that removes roughly 24 seconds of cumulative database time every minute.
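The cost of a 5-minute TTL is that a changed plan or preference can be stale for up to 5 minutes. One common way to handle this is to delete the cache entry whenever the underlying row changes. The sketch below assumes the same user_context key format as above, a preferences column that accepts JSON, and db and redis clients passed in as arguments; updateUserPreferences is a hypothetical function name.

```javascript
// Hypothetical write path: update the row, then drop the cached copy so the
// next read repopulates it from the database.
async function updateUserPreferences(db, redis, userId, preferences) {
  await db.query(
    'UPDATE users SET preferences = $1 WHERE id = $2',
    [JSON.stringify(preferences), userId] // Assumes a JSON-compatible column
  );
  // Invalidate after the write succeeds, so a failed update leaves the cache untouched
  await redis.del(`user_context:${userId}`);
}
```

Deleting the key rather than rewriting it keeps the write path simple: the read path stays the only place that knows how to build the cached value.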
Fix 4: Trim Conversation History
Every message in conversation history is tokens sent to the API. A 20-turn conversation might have 4,000 tokens of history — adding 80–100ms to TTFT and increasing cost on every call.
function trimConversationHistory(history, maxTurns = 8, maxTokens = 4000) {
  // Always keep system messages; only the conversation turns are trimmed
  const systemMessages = history.filter(m => m.role === 'system');
  const conversationMessages = history.filter(m => m.role !== 'system');
  // Take most recent N turns (one turn = a user message plus a reply)
  const recentMessages = conversationMessages.slice(-maxTurns * 2);
  // Estimate token count (roughly 4 chars per token)
  const estimatedTokens = recentMessages.reduce(
    (sum, m) => sum + Math.ceil(m.content.length / 4), 0
  );
  if (estimatedTokens > maxTokens) {
    // Keep only the most recent messages that fit in the budget
    let tokenCount = 0;
    const trimmedMessages = [];
    for (let i = recentMessages.length - 1; i >= 0; i--) {
      const msgTokens = Math.ceil(recentMessages[i].content.length / 4);
      if (tokenCount + msgTokens > maxTokens) break;
      trimmedMessages.unshift(recentMessages[i]);
      tokenCount += msgTokens;
    }
    return [...systemMessages, ...trimmedMessages];
  }
  return [...systemMessages, ...recentMessages];
}
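The length-based estimate above assumes every message has string content. In the OpenAI chat format, content can also be an array of parts or null (for example, on an assistant message that only carries tool calls), and m.content.length would then throw or miscount. A minimal sketch of a more defensive estimator, keeping the same 4-characters-per-token approximation; estimateMessageTokens is a hypothetical helper name.

```javascript
// Hypothetical helper: estimates tokens for one chat message at ~4 characters per token.
// Handles string content, arrays of content parts, and null or missing content.
function estimateMessageTokens(message) {
  const { content } = message;
  let text = '';
  if (typeof content === 'string') {
    text = content;
  } else if (Array.isArray(content)) {
    // Count only text parts; image parts are ignored by this rough estimate
    text = content
      .filter((part) => part && part.type === 'text' && typeof part.text === 'string')
      .map((part) => part.text)
      .join(' ');
  }
  // Null or undefined content contributes no text
  return Math.ceil(text.length / 4);
}
```

This is still an approximation. If the budget matters for billing or hard context limits, a real tokenizer for your model gives exact counts.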
Fix 5: Start Streaming Earlier
The single highest-impact change for perceived latency is enabling streaming if you haven't already. A 7-second total generation time with streaming shows the first token at ~400ms. Without streaming, users wait the full 7 seconds.
Even if you implement all the other optimisations above and get total generation time down to 4 seconds — the first-token experience with streaming at 400ms will feel dramatically faster than waiting 4 seconds for a full response to appear.
Streaming is not an optimisation — it's a requirement for any interactive AI feature.
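Streaming from the provider only helps if your backend also forwards chunks to the browser as they arrive. One way to do that is Server-Sent Events. The sketch below assumes a Node http-style response object and chunks shaped like OpenAI chat completion stream chunks; streamCompletionAsSSE, the { text } payload, and the [DONE] sentinel are choices made for this example.

```javascript
// Hypothetical forwarder: writes each text delta to the client as an SSE event
// as soon as it arrives, instead of buffering the full completion.
async function streamCompletionAsSSE(stream, res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  for await (const chunk of stream) {
    // Some chunks carry no text (role-only deltas, final chunks), so skip those
    const text = chunk.choices?.[0]?.delta?.content;
    if (text) res.write(`data: ${JSON.stringify({ text })}\n\n`);
  }
  res.write('data: [DONE]\n\n'); // Sentinel so the client knows the stream is complete
  res.end();
}
```

If a reverse proxy sits in front of your server, check that it is not buffering the response. A buffering proxy can hold back every chunk until the stream ends, which silently cancels the benefit.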

The Optimisation Priority Order
Apply these in order of impact-to-effort ratio:
1. Enable streaming: 1 day of work, transforms user experience from 7s wait to 400ms first token
2. Parallelise database calls at prompt construction: 2 hours of work, saves 50–150ms per request
3. Cache user context: 2 hours of work, saves 20–50ms per request
4. Trim conversation history: 2 hours of work, reduces TTFT and cost
5. Deploy server closer to LLM provider: 1 day of infrastructure work, saves 50–200ms of network latency
6. Implement prompt caching (OpenAI and Anthropic both offer this): saves 30–50% on TTFT for requests with shared context
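Prompt caching generally works on the leading portion of the prompt, so it pays to put content that is identical across requests first and content that varies last. The sketch below shows one way to order messages with that in mind. It assumes prefix-based matching, and both function names are hypothetical; check your provider's current documentation for minimum prompt lengths and for whether caching is automatic or needs explicit markers.

```javascript
// Hypothetical builder: stable content first, per-request content last,
// so consecutive requests share the longest possible identical prefix.
function buildCacheFriendlyMessages({ systemPrompt, sharedDocs, history, userMessage }) {
  return [
    { role: 'system', content: systemPrompt }, // Identical on every request
    { role: 'system', content: `Reference material:\n${sharedDocs.join('\n\n')}` }, // Shared context
    ...history, // Grows over the conversation
    { role: 'user', content: userMessage }, // Changes on every request
  ];
}

// Counts how many leading messages two requests have in common.
// Useful as a quick check that per-request data has not crept into the prefix.
function countSharedPrefixMessages(a, b) {
  let i = 0;
  while (i < a.length && i < b.length && a[i].role === b[i].role && a[i].content === b[i].content) {
    i++;
  }
  return i;
}
```

A common mistake is interpolating a timestamp or user name into the system prompt, which changes the prefix on every request and can prevent cache hits.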
The model is not the bottleneck in most production LLM applications. The application plumbing around it is. These fixes are measurable, implementable without switching providers, and collectively can reduce perceived latency from "frustratingly slow" to "acceptably fast" without touching model configuration.





