Support email is a weird problem for solo developers. You ship an app, people use it, and then the emails start rolling in. Feature requests, bug reports, refund demands, people who think you’re a completely different company, and the occasional heartfelt message from someone who genuinely loves what you built.
For a really long time, I handled everything manually. I was spending hours every week writing the same responses to the same questions, and the worst part was that the quality suffered. By email number forty, my responses were terse and unhelpful. Nobody wins in that scenario.
So I built an AI agent to handle it. Not a chatbot sitting on a website somewhere—an autonomous email processing pipeline. It reads my inbox, figures out what each email is about, writes a response using my product knowledge, and sends it. When it’s not confident, it escalates to me. When it’s spam, it quietly files it away. When it’s a legal inquiry, it routes it to the right place.
It took a few iterations to get right. Here’s how it works.
The Pipeline
The core architecture is a sequential pipeline that processes one email at a time. No fancy event-driven system, no message queues, no microservices. Just a CLI tool that connects to an IMAP inbox, pulls unread emails, processes each one, and moves them to the appropriate folder when done.
The pipeline has five stages:
- Fetch – Connect to IMAP, grab all unread emails, parse them
- Classify – Ask the LLM to categorize the email
- Lookup – Check conversation history, load the knowledge base
- Respond – Generate a response (or route to a human)
- Deliver – Send the reply and file the original
Every email goes through every stage, and every stage is designed to fail safely. If anything goes wrong at any point, the email gets forwarded to me with an error report. No email ever gets silently dropped.
I run this on a cron schedule. It’s not real-time, but support email doesn’t need to be. A 15-minute delay is imperceptible to users and gives me time to spot-check if I want to.
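The five stages above can be sketched as a plain sequential loop. This is a minimal illustration, not the actual implementation; the stage functions are hypothetical stand-ins passed in as parameters. The point is the shape: one email at a time, and any failure forwards the original instead of dropping it.

```python
# Minimal sketch of the five-stage pipeline. All stage functions
# (fetch_unread, classify, lookup, respond, deliver, forward_error)
# are hypothetical stand-ins injected by the caller.

def process_mailbox(fetch_unread, classify, lookup, respond,
                    deliver, forward_error):
    for email in fetch_unread():                    # Stage 1: Fetch
        try:
            category = classify(email)              # Stage 2: Classify
            context = lookup(email, category)       # Stage 3: Lookup
            reply = respond(email, category, context)  # Stage 4: Respond
            deliver(email, reply)                   # Stage 5: Deliver
        except Exception as exc:
            # Fail safe: no email is ever silently dropped.
            forward_error(email, exc)
```

A cron job invoking this loop every 15 minutes is all the orchestration the system needs.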
Classification: The Triage Layer
Classification is the most important design decision in the whole system, and the one I got wrong initially.
My first approach was obvious: give the LLM the email, the knowledge base, all the context, and ask it to both classify and respond in one shot. This worked okay, but it had a subtle problem. The knowledge base was biasing the classification. The LLM would see a vague email, notice that the knowledge base had information about a related topic, and confidently classify it as a support request even when it was clearly a sales pitch or a misdirected inquiry.
So I split it into two separate calls. The classifier gets the email and nothing else—no knowledge base, no help articles, no product documentation. It makes a pure triage decision based solely on the content of the email.
The classifier returns a structured JSON response:
{
"classification": "SUPPORT",
"confidence": 0.85,
"reason": "User reports app crashes when opening large files",
"detectedLanguage": "en",
"detectedProduct": "primary"
}
I defined seven categories: spam, business inquiries, legal requests, wrong-product emails, feature requests, support questions, and refund requests. Each category has a detailed definition in the classifier prompt—not a one-liner, but a paragraph explaining what qualifies and what doesn’t.
The spam definition, for instance, includes “empty bug report templates that contain only diagnostic information but no actual problem description.” That one came from experience. I was getting dozens of automated crash reports with zero user context, and the classifier kept trying to help these phantom users until I explicitly told it not to.
One important detail: if the classifier fails or returns something unexpected, it defaults to “support” with a very low confidence score. This means the email will be processed but almost certainly escalated to me. I’d rather handle a false positive than miss a real support request.
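The fail-safe default can be sketched as a thin wrapper around the LLM call. This is illustrative only: `call_llm` is a hypothetical function returning the model's raw text, and the exact category identifiers are my invention (the article's seven categories, spelled as constants).

```python
import json

# Assumed category identifiers for the seven categories described above.
VALID = {"SPAM", "BUSINESS", "LEGAL", "WRONG_PRODUCT",
         "FEATURE_REQUEST", "SUPPORT", "REFUND"}

# Fallback: treat as support with very low confidence, so the email is
# processed but almost certainly escalated to a human.
FALLBACK = {"classification": "SUPPORT", "confidence": 0.1,
            "reason": "classifier failed; defaulting to human review"}

def classify_email(raw_email, call_llm):
    try:
        result = json.loads(call_llm(raw_email))
    except (json.JSONDecodeError, TypeError, ValueError):
        return FALLBACK
    # Anything unparseable or out-of-vocabulary also falls back.
    if not isinstance(result, dict) or result.get("classification") not in VALID:
        return FALLBACK
    return result
```

The important property is that every failure mode converges on "a human will look at this", never on "this email disappears".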
The Knowledge Base: Why I Didn’t Build RAG
When people hear “AI agent that answers questions,” they immediately think RAG—retrieval-augmented generation, with vector embeddings, chunked documents, semantic search, the whole nine yards.
I didn’t build any of that. My knowledge base is a single Markdown file per product that gets sent to the LLM in its entirety with every request.
Why? Because my knowledge base is small enough. For a typical app, the entire knowledge base—FAQs, common issues, policies, feature descriptions—fits in maybe 4,000 tokens. That’s a rounding error in a 200k context window. Chunking and embedding a document that small would add complexity without adding value. I’d be building infrastructure to solve a problem I don’t have.
The knowledge base file is structured intentionally. It’s not just a dump of documentation. It has sections like:
- Product overview – What the app does, what platforms it runs on
- Features NOT available – Explicit list of things the app doesn’t do (this prevents the LLM from hallucinating capabilities)
- Common issues – Detailed decision trees for frequent support scenarios
- Internal data – Pricing tiers, refund policies, marked as “INTERNAL ONLY—never mention to users”
That last one is interesting. I include internal pricing data so the LLM can verify whether a user’s charge matches my app or belongs to some other company. Users occasionally email me about charges from apps that have nothing to do with mine—same category, similar name, different developer. The internal data helps the agent figure this out without ever revealing the actual prices.
The explicit “features NOT available” section is equally important. LLMs have a tendency to assume capabilities that sound reasonable. If a user asks “can your app do X?” and X sounds plausible, the LLM might say yes. Listing what the app explicitly doesn’t do is a simple guardrail that prevents this.
Agent-Driven Search
The knowledge base handles most questions, but some users ask about specific workflows or niche features that are only documented on the help site. For one of my apps, that's a comprehensive online guide at help.pdf-pro.net with dozens of articles covering every feature and workflow. (I migrated that entire help site from WordPress to Hugo earlier this year, in eleven languages, and the search index the agent queries was a byproduct of that migration.) Too much to stuff into the knowledge base, but too valuable to ignore. So I gave the agent tools.
Not pre-computed retrieval. Actual tools—function calls that the LLM can invoke during response generation. Two tools, specifically:
- search_help_site(query, language) – Searches the help site, returns titles, excerpts, and URLs
- get_help_article(url) – Fetches the full content of a specific article
The LLM decides when to search and what to look for. It can search multiple times with different queries, read several articles, and synthesize the results. It’s essentially a ReAct loop with a cap of ten iterations to prevent runaway behavior.
This approach has a big advantage over traditional RAG: the search is contextual. The LLM understands the user’s question, crafts appropriate search queries, evaluates whether the results are relevant, and can refine its search if the first attempt doesn’t find what it needs. Pre-computed embeddings can’t do that—they retrieve based on surface-level similarity, which often misses the actual intent.
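The capped ReAct-style loop might look something like this sketch. The message format and function names are invented for illustration, not the real API: `call_llm` returns either a final answer or a tool request, and `tools` maps the two tool names to plain functions.

```python
MAX_ITERATIONS = 10  # hard cap to prevent runaway tool use

def run_agent(question, call_llm, tools):
    messages = [{"role": "user", "content": question}]
    for _ in range(MAX_ITERATIONS):
        reply = call_llm(messages)
        if reply["type"] == "final":
            return reply["text"]
        # Tool request: execute it and feed the result back to the model.
        result = tools[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    # Cap reached: return nothing so the caller escalates to a human
    # rather than looping forever.
    return None
```

Returning `None` at the cap keeps the fail-safe property of the rest of the pipeline: when the loop can't converge, a human gets the email.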
When the agent uses help articles in its response, it includes the URLs so users can read more. These are tracked in the response metadata, which helps me see which articles are most useful and which questions aren’t covered by existing documentation.
Response Generation and Confidence
The responder is a separate LLM call that receives everything: the email, the knowledge base, conversation history, any help articles the agent fetched, and even image attachments (resized screenshots get sent as vision input).
Its output is structured:
{
"response": "The actual email text to send",
"confidence": 0.85,
"needsClarification": false,
"shouldEscalate": false,
"escalationReason": null,
"helpArticlesUsed": ["https://help.example.com/article/123"]
}
The confidence score drives the entire downstream behavior. Above 0.7: send the response automatically. Below 0.7: send a clarification email asking the user for more details. If shouldEscalate is true: route the entire conversation to me with the escalation reason.
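The dispatch rule is simple enough to state as code. A minimal sketch, assuming the structured responder output shown above (the action names are mine):

```python
CONFIDENCE_THRESHOLD = 0.7

def dispatch(result):
    # Escalation wins over everything else.
    if result.get("shouldEscalate"):
        return "escalate"   # route to a human, with escalationReason
    if result["confidence"] >= CONFIDENCE_THRESHOLD:
        return "send"       # auto-send the drafted reply
    return "clarify"        # ask the user for more details instead
```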
I spent a lot of time calibrating the confidence scoring in the prompt. Without guidance, LLMs tend to be overconfident. My prompt explicitly defines what different confidence levels mean: 0.9+ means the answer is directly covered in the knowledge base, 0.7-0.9 means it’s a reasonable inference, below 0.7 means significant uncertainty.
The response format is plain text only. No Markdown, no HTML, no formatting that might render weirdly in different email clients. This is a deliberate choice—I’d rather have slightly less pretty responses that work everywhere than beautifully formatted ones that break in Outlook.
One rule I’m particularly happy with: “Answer the question and stop.” No “Is there anything else I can help you with?” No “Feel free to reach out if you have more questions!” These canned engagement hooks are the hallmark of a bot, and users notice. A concise, direct answer that addresses their specific problem is more helpful and more human-sounding than a thorough answer wrapped in pleasantries.
Safety Mechanisms
This is the part I was most nervous about. Letting an AI send emails on your behalf is—let’s be honest—terrifying. One hallucinated response, one confidently wrong answer, and you’ve got an angry user and a damaged reputation.
So I built multiple safety layers:
The classification firewall. The classifier operates without any product knowledge. It can’t be influenced by the knowledge base into misclassifying emails. A legal inquiry stays a legal inquiry even if the knowledge base has information that could technically “answer” it.
The confidence threshold. Any response below 0.7 confidence gets downgraded to a clarification email. The user still gets a timely response, but it asks for more details rather than guessing.
Explicit escalation triggers. The responder prompt defines specific scenarios that must always escalate: knowledge gaps where neither the knowledge base nor help site covers the topic, and—this is crucial—any time a user suspects they’re talking to an AI. The prompt says: don’t confirm, don’t deny, just escalate. Trying to convince someone they’re talking to a human when they’re not is a losing game.
Repeat refund detection. First refund request from a sender gets an automated response (usually verification steps). Second refund request from the same sender automatically routes to me. This prevents the agent from getting stuck in a loop with a frustrated user who needs human intervention.
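The repeat-refund rule reduces to a per-sender counter. A sketch with an in-memory dict standing in for persistence (the real system would presumably persist this alongside the SQLite conversation store):

```python
def handle_refund(sender, refund_counts):
    # refund_counts: persistent mapping of sender -> request count.
    refund_counts[sender] = refund_counts.get(sender, 0) + 1
    if refund_counts[sender] >= 2:
        return "route_to_human"  # second request: a human takes over
    return "auto_respond"        # first request: verification steps
```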
Error forwarding. If the entire pipeline fails for any reason—API timeout, parsing error, whatever—the original email gets forwarded to me with the error details. I’d rather get woken up by a forwarded email than discover three days later that someone’s urgent issue fell into a void.
Blacklist. Persistent sender blacklist for repeat offenders, scammers, and AI-generated spam. Blacklisted emails are silently moved to a folder without any processing. One subtlety here: for emails that come through contact forms (which all share the same From address), the blacklist checks the Reply-To header instead. I learned this the hard way when blacklisting the contact form’s no-reply address accidentally blocked every future contact form submission.
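The contact-form subtlety can be captured in a few lines. A sketch with invented addresses; `CONTACT_FORM_SENDERS` is an assumed configuration set of shared From addresses:

```python
# Assumed config: From addresses shared by all contact-form submissions.
CONTACT_FORM_SENDERS = {"no-reply@forms.example.com"}

def effective_sender(from_addr, reply_to):
    # Contact forms share one From address, so match on Reply-To instead.
    if from_addr in CONTACT_FORM_SENDERS and reply_to:
        return reply_to
    return from_addr

def is_blacklisted(from_addr, reply_to, blacklist):
    return effective_sender(from_addr, reply_to) in blacklist
```

Blacklisting the actual human behind the form, rather than the form's shared address, is what prevents one bad actor from blocking every future submission.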
Conversation Threading
Support rarely ends in one email. Users reply, provide more context, ask follow-up questions. The agent needs to understand that email #3 in a thread is connected to emails #1 and #2.
I use the standard email threading headers—Message-ID, In-Reply-To, and References—to track conversations. When a new email arrives, the parser extracts the thread ID from the References header (the first entry is always the original message) and looks up the conversation history in a local SQLite database.
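Extracting the thread ID from standard headers is straightforward with Python's stdlib `email` package. A sketch of the rule described above (first References entry for replies, the message's own Message-ID for new threads):

```python
import email

def thread_id(raw_message):
    msg = email.message_from_string(raw_message)
    refs = (msg.get("References") or "").split()
    if refs:
        return refs[0]                   # reply: root of the thread
    return msg.get("Message-ID", "")     # new thread: its own ID
```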
Every inbound and outbound message gets stored as a conversation record. When generating a response, the full conversation history is included in the prompt, formatted with timestamps and message numbers. This gives the LLM context about what was already discussed, what solutions were already suggested, and what the user has already tried.
When an email gets escalated or routed to me, the conversation history comes along—numbered, timestamped, with all attachments from previous messages prefixed with [msg N] so I can cross-reference. If the email is in a language I don’t speak, a separate LLM call translates both the message and the conversation history, and appends the translation below.
Here’s roughly what a routed email looks like when it lands in my inbox:
Support escalation — requires human attention.
Reason: User reports repeated crashes after update
Language: ja
Confidence: 0.45
──────────────────────────────────────────────────
ORIGINAL MESSAGE
──────────────────────────────────────────────────
From: Tanaka Yuki <tanaka@example.com>
Date: 2026-02-11T09:15:00Z
Subject: アップデート後にクラッシュする
macOS 15.3、ファイルサイズは約120MBです。
再インストールも試しましたが改善しません。
──────────────────────────────────────────────────
CONVERSATION HISTORY (2 messages)
──────────────────────────────────────────────────
[msg 1] [2026-02-10 14:23:45 UTC] Customer:
アップデート後、大きなファイルを開くと
毎回クラッシュします。助けてください。
[msg 2] [2026-02-10 15:05:33 UTC] Support Agent (automated):
お問い合わせありがとうございます。お使いのmacOSの
バージョンと、問題が発生するファイルの
おおよそのサイズを教えていただけますか?
──────────────────────────────────────────────────
TRANSLATED MESSAGE (auto-translated to English)
──────────────────────────────────────────────────
Subject: Crashes after update
macOS 15.3, file size is about 120MB. I also tried
reinstalling but it didn't help.
──────────────────────────────────────────────────
CONVERSATION HISTORY — TRANSLATED
──────────────────────────────────────────────────
[msg 1] Customer:
After updating, the app crashes every time I open
large files. Please help.
[msg 2] Support Agent (automated):
Thank you for contacting us. Could you tell me your
macOS version and the approximate size of the files
causing the problem?
This means I don’t have to dig through my inbox to reconstruct the conversation. Everything I need is in one forwarded email—original message, full history, and a translation if I need it.
Handling Attachments
Screenshots are incredibly common in support email. Users attach photos of error messages, screen recordings of bugs, screenshots of their settings. Ignoring these means missing crucial context.
The system processes attachments in two layers. Every attachment—regardless of type—gets preserved as-is for potential human review. But PNG and JPG images additionally get resized (capped at 1536 pixels on the longest edge) and sent to the LLM as vision input.
This means the agent can actually look at a screenshot and say “I can see you’re on the settings page, and the toggle for X is turned off—try enabling it.” That’s a dramatically better experience than asking the user to describe what they see.
Images over 10MB get skipped for vision to avoid API timeouts, but they’re still preserved for human escalation. Non-image attachments (PDFs, logs, zip files) aren’t sent to the LLM but are forwarded along when the email is routed to me.
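The sizing rules combine into a small pure function. A sketch of just the decision math (the actual resampling would use an imaging library, which is omitted here):

```python
MAX_EDGE = 1536                       # cap on the longest edge, in pixels
MAX_VISION_BYTES = 10 * 1024 * 1024   # images above this skip vision

def vision_size(width, height, size_bytes):
    if size_bytes > MAX_VISION_BYTES:
        return None  # preserved for human review, not sent to the LLM
    longest = max(width, height)
    if longest <= MAX_EDGE:
        return (width, height)          # already small enough
    scale = MAX_EDGE / longest          # shrink, preserving aspect ratio
    return (round(width * scale), round(height * scale))
```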
Routing
Not every email deserves an AI response. Business inquiries, legal requests, feature requests, and wrong-product emails all need human attention—but from different humans and with different context.
Each category has a configured routing address and email template. The routing is deterministic: the AI classifies, and the code routes based on that classification. The AI never decides where to send an email—it only decides what the email is about. This prevents a hallucinating LLM from sending sensitive emails to the wrong address.
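Deterministic routing is just a lookup table keyed by the classifier's output. A sketch with invented addresses and category identifiers; the essential property is that the LLM's output selects a row, and only the code decides the destination:

```python
# Assumed routing table: category -> destination address (all invented).
ROUTES = {
    "BUSINESS": "biz@example.com",
    "LEGAL": "legal@example.com",
    "FEATURE_REQUEST": "me@example.com",
    "WRONG_PRODUCT": "me@example.com",
}

def route(classification):
    # Categories the AI answers itself (SUPPORT, REFUND) and anything
    # unknown return None: no forwarding happens.
    return ROUTES.get(classification)
```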
When routing an email, the system includes the full conversation history and all attachments. If the email is in a language other than the ones I speak, a separate translation call appends a translation to the forwarded email so the recipient can understand it without reaching for Google Translate.
The translation is fire-and-forget: if the API call fails, the routing continues without it. Translation is nice to have, not mission-critical.
Configuration as Prompt Engineering
Here’s something I realized midway through building this: the real product isn’t the code. It’s the prompts.
The classifier prompt and responder prompt are external text files, not hardcoded strings. The code only appends the output format. This means someone can completely change how the agent behaves—different classification categories, different response tone, different escalation criteria—without touching a line of code.
The knowledge base is the same way. It’s a Markdown file that the user writes and maintains. The system just reads it and passes it along.
This turned out to be the right call. Tuning the agent’s behavior is almost entirely a prompt engineering exercise, not a software engineering one. When the agent misclassifies a certain type of email, I update the classifier prompt with a better category definition. When it gives an unhelpful response to a common question, I add more detail to the knowledge base. The code itself rarely changes.
The configuration file uses TOML with environment variable substitution for secrets (API keys, SMTP credentials). Each product in a multi-product mailbox gets its own knowledge base file and can optionally have its own help site. The classifier detects which product an email is about and the responder loads the matching context.
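To make this concrete, a config along these lines might look as follows. This is purely illustrative: the field names, table layout, and the `${VAR}` substitution syntax are all invented for this sketch, not the actual schema.

```toml
# Illustrative shape only — field names and syntax are assumptions.
[mailbox]
imap_host = "imap.example.com"
smtp_password = "${SMTP_PASSWORD}"  # substituted from the environment

[products.primary]
knowledge_base = "kb/primary.md"
help_site = "https://help.example.com"

[products.secondary]
knowledge_base = "kb/secondary.md"  # no help site for this product
```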
What I Learned
After running this for a while, a few things surprised me:
Most support email is predictable. About 80% of inbound emails fall into maybe 15 patterns. The knowledge base doesn’t need to be exhaustive—it needs to cover the common cases well. The uncommon cases escalate to me, and that’s fine.
Confidence calibration matters more than response quality. A perfect response sent with incorrect confidence is worse than a mediocre response sent with correct confidence. If the agent is 60% sure and says so, I can review it. If it’s 60% sure but reports 90%, I won’t review it and users will get bad answers.
The “features NOT available” section is worth its weight in gold. Before I added it, the agent would occasionally tell users that yes, the app can do something it absolutely cannot do. Explicitly listing what’s not there eliminated this class of hallucination almost entirely.
Plain text beats formatted responses. I initially tried Markdown responses with nice formatting. They looked great in Gmail, terrible in Apple Mail, and unreadable in some corporate email clients. Plain text is universal and—counterintuitively—reads as more human. Real humans don’t send Markdown emails.
Short responses are better responses. The agent’s best responses are 2-3 sentences: acknowledge the problem, provide the solution, done. Its worst responses are the ones where it over-explains, adds caveats, and tries to anticipate follow-up questions. I tuned the prompt to prioritize brevity, and customer satisfaction went up.
Building this has fundamentally changed how I think about support. It’s not about replacing human support with AI—it’s about reserving human attention for the situations that actually need it. The agent handles the routine stuff quickly and consistently, and I handle the edge cases with the care they deserve.
That feels like the right division of labor.