How a Self-Hosted AI Assistant Helps Our Team To Monitor AWS

Managing a backend infrastructure with 300+ containers (microservices, databases, etc.) and only a 10–15 person team is tough. That’s why we built a self-hosted AI assistant. Instead of drowning in logs, spikes, and endless alerts, the AI Assistant helps us monitor systems 24/7, flag anomalies early, and even suggest optimizations — all while keeping sensitive data private.

At some point, we thought: Why not get an AI to help with this? Specifically, we envisioned a kind of intelligent assistant, powered by a large language model (LLM), that could watch over our logs and systems 24/7, alert us to anything weird, and even suggest optimizations. Essentially, an AI co-pilot for our infrastructure.

To avoid any sensitive data leaving our walls, we decided to self-host this AI Assistant. You might worry that a self-hosted LLM would be too “dumb” or weak compared to the latest cloud AI services. But that's not really true anymore. Open-source and open-weight LLMs have improved a lot – OpenAI itself recently released GPT-OSS, a family of open models that deliver strong real-world performance and even support tool usage. Sure, it's not going to be ChatGPT-5 Premium level (if that even exists!), but an open model is perfectly capable of doing the kind of text analysis and pattern recognition we need. And we get the benefit of keeping our data private.

In this article, I'll walk you through how we built and implemented our AI Assistant, what tools and data we equipped it with, and how it's helping us manage our infrastructure without going crazy. I'll also share some examples of what it does – from catching anomalies before they become incidents, to suggesting scaling and updates that make our lives easier. All in a friendly, (hopefully) not-too-formal tone, because this is coming from an engineer who nearly went insane tailing logs at 3 AM, and lived to tell the tale!

Building an AI Assistant for Monitoring

To get our AI assistant up and running, we had to provide it with the right inputs and context about our systems. At a high level, we set up three main things for it:

  1. A knowledge base of all our logs (RAG system): We push logs from all our services into a Retrieval-Augmented Generation (RAG) database that the LLM can query. In practice, this means embedding and indexing our logs in a vector database so the AI can pull up relevant log lines or events when analyzing a problem. By having access to historical and real-time logs, the assistant can spot unusual patterns or errors by comparing against what's "normal" in the data it has seen. (There's a minimal sketch of this indexing step right after this list.)
  2. Context about system versions and configurations: We built a tool that can provide the AI with information about the versions of libraries, protocols, and components we’re running. For example, if we have Service A running version 1.2.3 of a library, the AI will know that context. This way, when analyzing an issue, it can consider whether a bug might be related to a specific version. We essentially prompt the LLM with environment data (versions, config settings) by default, so it doesn't operate in a vacuum.
  3. Knowledge of known vulnerabilities and backdoors: We integrated a secondary knowledge base (another RAG system) filled with information from public security advisories, CVE databases, and websites where people post details of backdoors or exploits (especially those that have been patched). The idea is, if some suspicious activity shows up in our logs, the AI can cross-reference it against known issues. For instance, if there's a known backdoor exploit for Apache version X, and our AI knows we run that version, it can immediately flag that the pattern in the logs might be an attempt to use that exploit. Even if the exploit is already patched in a newer version, knowing about it helps the AI to diagnose the issue faster.
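
To make item 1 a bit more concrete, here's a minimal sketch of the indexing step. I'm using ChromaDB as the vector store purely for illustration, and the service name and log line are made up; our real pipeline adds batching, retention, and richer metadata, but the retrieval idea is the same.

```python
# Minimal sketch: index raw log lines into a vector store so the LLM can retrieve them later.
# ChromaDB with its default embedding function; names and the sample log line are made up.
import uuid

import chromadb

client = chromadb.PersistentClient(path="./log-index")
logs = client.get_or_create_collection(name="service_logs")

def index_log_lines(service: str, lines: list[str]) -> None:
    """Store raw log lines with per-service metadata so we can filter retrieval later."""
    logs.add(
        documents=lines,
        metadatas=[{"service": service} for _ in lines],
        ids=[f"{service}-{uuid.uuid4()}" for _ in lines],
    )

def find_similar_events(query: str, n: int = 5) -> list[str]:
    """Retrieve the stored log lines most similar to the current anomaly description."""
    result = logs.query(query_texts=[query], n_results=n)
    return result["documents"][0]

# Example: index a suspicious auth log line, then retrieve it by describing the symptom.
index_log_lines("auth-service", ["2025-09-01 03:12:01 WARN failed login for admin from 1.2.3.4"])
print(find_similar_events("surge of failed login attempts"))
```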

By setting up these three knowledge sources, we give our AI assistant a holistic view of the system: what's happening now (logs), what we're running (versions/config), and what threats or bugs are known out there (vulnerabilities database). That’s a lot of context for it to chew on, but that's exactly what modern LLMs are good at: absorbing a bunch of text and finding connections.

Ensuring the AI Assistant Stays in Line

Now, I know what you might be thinking: “Giving an AI access to my logs and system info? What if it goes rogue and does something crazy?” Don’t worry: we designed the assistant to be read-only and tool-limited, meaning it can only perform actions that we explicitly allow. We don’t let it execute arbitrary code or push buttons in our infrastructure on its own.

Using frameworks like LangChain, we provided the AI with a fixed set of tools/functions it can call. Think of it like writing a list of safe commands for a junior engineer on their first day. For example, we gave it a function to read files from a specific logs directory, a function to query our monitoring API for metrics, a function to perform web searches (more on that in a second), and so on. Each function is carefully sandboxed – it can only do exactly what it says. If a function is for reading logs, it can't delete anything or modify anything, just read. By defining this toolset, we hard-limit what the AI can do. It can't, say, deploy new containers or change configurations, because we never gave it a tool that does that.
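
To give you a feel for what that toolset looks like, here's a stripped-down sketch using LangChain's tool decorator. The log directory and the monitoring endpoint are placeholders, not our real ones; the point is that each tool is a narrow, read-only wrapper, and the tools list is the agent's entire action surface.

```python
# Sketch of read-only tools handed to the agent. Paths and the metrics URL are illustrative.
from pathlib import Path

import requests
from langchain_core.tools import tool

LOG_DIR = Path("/var/log/services")  # the only directory the assistant may read

@tool
def read_log_file(filename: str, max_lines: int = 200) -> str:
    """Return the last lines of a log file from the allowed logs directory."""
    path = (LOG_DIR / filename).resolve()
    if LOG_DIR not in path.parents:  # block path traversal outside the sandbox
        return "Access denied: outside the logs directory."
    return "\n".join(path.read_text(errors="ignore").splitlines()[-max_lines:])

@tool
def query_metric(metric: str) -> str:
    """Fetch a single metric value from our internal monitoring API (read-only)."""
    resp = requests.get(
        "http://monitoring.internal/api/v1/query",  # hypothetical internal endpoint
        params={"query": metric},
        timeout=10,
    )
    return resp.text

tools = [read_log_file, query_metric]  # this list is the entire action surface of the agent
```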

Speaking of web searches: we even set up a mini “Google-like” tool for the AI, using a project called Firecrawl. Firecrawl is basically a web crawling and search API for AI agents; its tagline is that it “delivers the entire internet to AI agents” (docs.firecrawl.dev), which is both awesome and a little terrifying. The nice part is that we can self-host and control it, so when the AI "googles" something, that activity stays within our environment and the results come back in a controlled way. In our case we restrict it to browsing known security sites and documentation when needed.
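
To show what “controlled” means in practice, here's the shape of the allowlist wrapper we keep around the search backend. The approved domains are just examples, and I've left the backend itself as an abstract callable rather than pin down a specific Firecrawl SDK signature; wiring it up is a few lines against their documented API.

```python
# Sketch of the domain-allowlist wrapper around the search backend (Firecrawl, in our case).
# The backend callable is expected to yield (url, snippet) pairs; how it talks to Firecrawl
# (SDK or a self-hosted instance) is deliberately left out of this sketch.
from typing import Callable, Iterable

ALLOWED_DOMAINS = ("nvd.nist.gov", "cve.mitre.org", "stackoverflow.com")  # example allowlist

def restricted_search(
    query: str,
    backend: Callable[[str], Iterable[tuple[str, str]]],
    max_results: int = 3,
) -> str:
    """Run a web search through the backend, keeping only results from approved domains."""
    kept = [
        (url, snippet)
        for url, snippet in backend(query)
        if any(domain in url for domain in ALLOWED_DOMAINS)
    ]
    if not kept:
        return "No results from approved domains."
    return "\n\n".join(f"{url}\n{snippet}" for url, snippet in kept[:max_results])
```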

The bottom line: we have full control over what the AI assistant can and cannot do. It's like giving it a toy steering wheel – it feels like it's driving, but we ensure it stays on the rails we’ve set. This way, we can sleep easy knowing the AI won't, for example, suddenly decide to shut down servers or rm -rf anything. It’s there to observe, analyze, and report – not directly operate critical systems.

AI in Action: Early Detection of Anomalies and Attacks

So, how does this AI assistant actually help us day-to-day? One of the coolest use cases is catching security issues or anomalies early, sometimes even before any human notices a problem. Here's an example scenario of what it can do (with a condensed code sketch of the whole triage loop right after the list):

  1. Noticing Strange Log Patterns: The AI is continuously parsing our logs (or can be prompted to do so at intervals). Let’s say it sees a sudden surge of failed login attempts across several services, or an unusual sequence of 500-error codes that haven't happened before. Because it has read so much of our logs historically, it can tell what's out of the ordinary. It might think, "Hmm, this pattern of errors looks a bit like a DDoS attack or a brute-force attempt."
  2. Cross-Referencing Known Issues: Suppose those log entries look somewhat like an exploit. The assistant will cross-reference the details (IP addresses, error messages, payload patterns, etc.) with that vulnerability/backdoor knowledge base we gave it. For example, maybe there's a known vulnerability CVE-2025-1234 that describes an attack sequence that leaves a telltale error message in logs. If the AI finds a match or even a fuzzy similarity, that's a big clue. It could say, "These logs resemble the signature of CVE-2025-1234 exploitation attempts."
  3. Web Search for Clues: If the AI doesn't find anything in the internal knowledge bases, it can use the search tool. It might do a quick web search like "weird error code XYZ after login attempt" or search security forums for the error text it's seeing. Thanks to the controlled search capability, it can retrieve a few relevant snippets from the internet (support forum posts, security blogs, StackOverflow, etc.) to gather more context. Maybe it finds a forum post where someone had a similar log and it turned out to be a bot scraping the API, for instance.
  4. Forming a Hypothesis and Alerting: After gathering all this info, the AI puts together an analysis for us. It might say something like: "I noticed an abnormal pattern of login failures from a single IP, which could indicate a brute-force attack. It resembles a known issue (CVE-2025-1234) affecting our version of Service A. I searched online and found an admin on a forum who saw similar logs, which were caused by a botnet trying default passwords. Recommendation: Consider blocking IP 1.2.3.4 and checking if Service A version 1.2.3 has a patch for that vulnerability."
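
To tie those four steps together, here's a condensed sketch of the triage loop. Everything is passed in as a dependency (the log retriever sketched earlier, the vulnerability knowledge base, the restricted web search, any LangChain-style chat model, and an alerting hook), so treat the names as illustrative rather than as our exact code.

```python
# Condensed sketch of the triage loop: logs -> internal knowledge -> web -> report.
# All collaborators are injected; in our setup they are the pieces described above.
from typing import Callable

def triage(
    suspicious_lines: list[str],
    find_similar_events: Callable[[str], list[str]],  # step 1: compare against historical logs
    search_vuln_kb: Callable[[str], str],             # step 2: CVE / backdoor knowledge base
    web_search: Callable[[str], str],                 # step 3: controlled web search
    llm,                                              # step 4: any chat model with .invoke()
    send_alert: Callable[[str], None],                # e.g. a Slack or pager webhook poster
) -> None:
    context = "\n".join(suspicious_lines)
    history = "\n".join(find_similar_events(context))
    known = search_vuln_kb(context)
    clues = web_search(context[:200]) if not known else ""  # only hit the web if the KB is silent

    prompt = (
        "You are our monitoring assistant. Given these log lines, similar past events, known "
        "vulnerabilities, and web findings, explain what is likely happening and recommend one "
        "or two safe next steps.\n\n"
        f"LOGS:\n{context}\n\nHISTORY:\n{history}\n\nKNOWN ISSUES:\n{known}\n\nWEB:\n{clues}"
    )
    send_alert(llm.invoke(prompt).content)
```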

That kind of analysis is incredibly useful. Instead of just generic alerts like "high error rate on service X," we get a contextual explanation and even a suggested action. Essentially, the AI assistant acts like a Tier-1 support engineer triaging an incident, but supercharged with all knowledge at its fingertips.

In one real instance, our AI alerted us to a possible DDoS pattern on a lesser-used API endpoint. Nothing had crashed yet, but it noticed a spike in traffic and error responses. We probably wouldn't have caught it at 2 AM until it actually overwhelmed something. But the AI gave us a heads-up, and we mitigated by scaling that service preemptively and adding some IP filtering. That early warning prevented an outage. You can imagine how happy the on-call engineer was to not get woken up by a pager an hour later, because the issue never got that far!

Performance Profiling and Scaling Recommendations

Catching security issues is awesome, but we didn't stop there. We also use the AI assistant for performance profiling and suggesting scaling optimizations. This is more about efficiency and reliability than security.

For example, the AI can look at our metrics and logs to identify slow spots or potential bottlenecks in the system. We occasionally allow it to access certain performance metrics (CPU usage, memory usage, request latency, etc.) in read-only mode. If we notice, say, an API is consistently hitting 90% CPU during peak hours, the AI might flag that and say, "Service B has high CPU usage daily around 8 PM, which is causing response times to increase. This might be a good candidate for scaling out or optimizing the code." It's like having a performance analyst on staff, always watching the trends.
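
As an example of what read-only metrics access looks like, here's a small sketch that pulls per-service CPU usage from a Prometheus-compatible endpoint and flags sustained saturation. The URL, the PromQL query, and the 90% threshold are placeholders to adapt to your own setup.

```python
# Sketch: pull average CPU usage per service from a Prometheus-compatible API and
# flag anything that stays above a threshold. URL and PromQL query are placeholders.
import requests

PROM_URL = "http://prometheus.internal:9090/api/v1/query"
CPU_QUERY = 'avg by (service) (rate(container_cpu_usage_seconds_total[5m]))'

def cpu_hotspots(threshold: float = 0.9) -> list[str]:
    resp = requests.get(PROM_URL, params={"query": CPU_QUERY}, timeout=10)
    resp.raise_for_status()
    findings = []
    for sample in resp.json()["data"]["result"]:
        service = sample["metric"].get("service", "unknown")
        value = float(sample["value"][1])  # Prometheus returns [timestamp, value] pairs
        if value > threshold:
            findings.append(f"{service} is at {value:.0%} CPU over the last 5 minutes")
    return findings

# These findings get injected into the assistant's prompt so it can reason about scaling.
```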

We even experimented with letting the AI trigger profiling mode on some services. Essentially, we have some endpoints or internal toggles that can turn on more detailed logging or metrics for short periods (because running with full debug logging 24/7 would be too slow). The AI can decide, "Hmm, Service C looks like it's using a lot more memory than usual. Let me enable detailed GC logging for 5 minutes." It uses a tool-function to do that (again, one we carefully provided). Then it reads those detailed logs, figures out maybe there's a memory leak or just a big spike due to some cron job, and then it reports that back. This selective profiling means we gather deep insights only when needed, without permanently sacrificing performance.
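
The toggle itself is nothing fancy. Here's roughly what that tool looks like: a time-boxed switch that calls a hypothetical internal admin endpoint and always schedules the revert, so the assistant can't leave debug logging on by accident.

```python
# Sketch of a time-boxed profiling toggle exposed to the agent.
# The admin endpoint is hypothetical; the important part is the automatic revert and the hard cap.
import threading

import requests
from langchain_core.tools import tool

ADMIN_API = "http://admin.internal/api"  # hypothetical internal control plane

def _set_log_level(service: str, level: str) -> None:
    requests.post(f"{ADMIN_API}/services/{service}/log-level", json={"level": level}, timeout=10)

@tool
def enable_detailed_logging(service: str, minutes: int = 5) -> str:
    """Turn on debug logging for a service for a few minutes, then revert automatically."""
    minutes = min(minutes, 15)  # hard cap, regardless of what the model asks for
    _set_log_level(service, "DEBUG")
    timer = threading.Timer(minutes * 60, _set_log_level, args=(service, "INFO"))
    timer.daemon = True
    timer.start()
    return f"Debug logging enabled on {service} for {minutes} minutes."
```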

Perhaps most directly, the AI can recommend scaling actions. If a particular container or service is frequently maxing out its resources, the AI might suggest "Hey, we should increase the replica count for Service D from 3 to 5," or "maybe it's time to move to a larger instance type for the database because it's hitting I/O limits." We haven’t fully automated it to execute scaling (that would be a next step, though we’d want a human to approve first), but even the suggestion is valuable. It saves us from digging through Grafana dashboards to figure out where the pain points are – the AI surfaces them proactively.

Proactive Maintenance: Updates and CI/CD Insights

Another area where our AI assistant shines is in proactive maintenance tasks – things that won't break anything, but can improve our system health and keep us up to date.

One such task is tracking when important libraries or dependencies have updates. Since the AI knows what versions we’re running (from that context we feed it), it can periodically check against the latest known versions. For instance, if we're running PostgreSQL 13 and version 15 is out with significant performance improvements, the AI can gently remind us: "It looks like PostgreSQL 15 is available and offers better indexing performance. We are two major versions behind. Consider planning an upgrade." Or it might notice we're using an outdated version of a Python package that has known bugs, and suggest updating to a newer patch release.
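
For Python dependencies, that check is almost trivial to script. Here's a minimal sketch that compares recorded versions against the latest release on PyPI (the JSON API is real; the inventory dict is a stand-in for the version context we already feed the assistant).

```python
# Sketch: compare our recorded dependency versions against the latest release on PyPI.
# The inventory dict stands in for the version/config context the assistant already has.
import requests
from packaging.version import Version

inventory = {"requests": "2.28.0", "sqlalchemy": "1.4.46"}  # example data

def outdated_packages(inventory: dict[str, str]) -> list[str]:
    notes = []
    for package, ours in inventory.items():
        info = requests.get(f"https://pypi.org/pypi/{package}/json", timeout=10).json()
        latest = info["info"]["version"]
        if Version(ours) < Version(latest):
            notes.append(f"{package}: we run {ours}, latest is {latest}")
    return notes

# The resulting notes go into the assistant's context so it can phrase the reminder.
print(outdated_packages(inventory))
```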

The assistant can also watch our CI/CD pipeline logs for recurring issues. Imagine that every time we deploy Service E, there's a warning or a non-critical test that fails (but not enough to halt the deployment). Over time, the team might ignore it if it's not urgent. The AI, however, doesn't forget. It could point out, "Every deploy of Service E shows a deprecation warning about an API call. This hasn’t caused failure yet, but it might in the future. Maybe we should fix that before it becomes a real problem." It’s like having a very meticulous QA engineer who combs through all the build logs and remembers historical patterns.
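
A dead-simple version of that “doesn't forget” behaviour is just counting recurring warning lines across recent build logs and surfacing anything that keeps showing up. The log directory and the regex below are placeholders for whatever your CI writes out; in our setup the resulting summary goes into the assistant's context.

```python
# Sketch: count warning/deprecation lines across recent CI logs and surface repeat offenders.
# Directory layout and the pattern are placeholders for whatever your CI produces.
import re
from collections import Counter
from pathlib import Path

CI_LOG_DIR = Path("./ci-logs")  # e.g. one file per pipeline run
WARNING_RE = re.compile(r"(DeprecationWarning|WARNING).*", re.IGNORECASE)

def recurring_warnings(min_runs: int = 3) -> list[str]:
    counts = Counter()
    for log_file in CI_LOG_DIR.glob("*.log"):
        text = log_file.read_text(errors="ignore")
        seen_this_run = {m.group(0).strip() for m in WARNING_RE.finditer(text)}
        counts.update(seen_this_run)  # count each distinct warning once per run
    return [f"'{msg}' appeared in {n} recent runs" for msg, n in counts.most_common() if n >= min_runs]
```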

In essence, the AI serves as an advisor for housekeeping: suggesting updates, nudging us about warnings, and highlighting things that are easy to overlook during busy development cycles. These are things that won't immediately bring the system down, but addressing them can prevent future incidents or performance issues. And importantly, the AI can do this without any risk – it’s just reading info and making recommendations, not actually applying any patches itself (that’s still up to us humans, at least for now!).

Benefits for a Small Team

Integrating an AI assistant into our monitoring and ops workflow has brought huge benefits for our small team. To sum up a few key advantages:

  • 24/7 Vigilance: The AI never sleeps. It can continuously watch logs and metrics, catching issues at 3 AM that we might miss until next morning. This has improved our incident response time and often prevented incidents entirely by early detection.
  • Reduced Alert Fatigue: Because the AI provides context and analysis (not just raw alerts), the notifications we get are more actionable. The team doesn’t have to sift through as many false positives or vague alerts. We hear about the meaningful stuff.
  • Knowledge Integration: The assistant ties together knowledge of our internal system and external information (like vulnerabilities and best practices). A human engineer would have to Google around for that; the AI does it in seconds and gives a consolidated answer.
  • Proactive Optimizations: It’s not just about problems. The AI highlights opportunities to improve (scale up this service, update that dependency, optimize this query). It’s like having a proactive team member always looking for tuning opportunities, which is something we rarely have time for otherwise.
  • Scaling with a Small Team: Perhaps the biggest benefit of all is that it allows 10–15 people to effectively manage an infrastructure that would normally require a much larger ops team. The AI handles a lot of the grunt work of monitoring and initial analysis, so the human team can focus on actual development and high-level decision making. It’s a force multiplier for us.

Conclusion

Is our AI assistant perfect? Of course not. An LLM is not a magic wand that can solve everything effortlessly – it has its limitations. It only knows what fits in its prompt (context window), so if there’s a ton of data or some very recent event not in the context, it might miss it. And it can occasionally get things wrong or make an odd suggestion (we always double-check critical recommendations).

But even with those caveats, the payoff has been well worth it. By giving the AI a limited set of tools and clear instructions, we've basically gained an automated teammate who is diligent, tireless, and pretty smart about our system. We’ve learned that an LLM can do a lot as long as you constrain the scope and feed it the right information. It won’t replace our engineers (no AI overlords just yet!), but it certainly augments our capabilities in a big way.

Going forward, we’re excited to expand this system even more. We might enable it to automatically open tickets or even execute safe remediation scripts in the future, once we’re confident in its judgments. For now, it’s already doing a fantastic job of helping us stay on top of a complex infrastructure with minimal stress.

So, if you’re a CTO or an engineer in a small team overwhelmed by too many servers, containers, and not enough eyes to watch them – you might want to give an AI assistant a try. Speaking from experience, it’s a game-changer to have that extra pair of (virtual) eyes keeping watch. Oh, and one more thing I almost forgot: don't be afraid that the AI will cook up something bizarre in your infrastructure (as I half-jokingly worried). With the right safeguards, it will stay firmly under your control, doing only what it’s told. In the end, it’s just another tool – but wow, what a tool it can be for those of us trying to keep complex systems running smoothly with lean teams.
