<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://gunnymarc.github.io/feed.xml" rel="self" type="application/atom+xml" /><link href="https://gunnymarc.github.io/" rel="alternate" type="text/html" /><updated>2026-06-09T16:15:52-04:00</updated><id>https://gunnymarc.github.io/feed.xml</id><title type="html">Semper ex Datis</title><subtitle>U.S. Marine | Data Scientist. Long-form technical articles on LLMs, Data Science, and machine learning.</subtitle><author><name>Marc Buraczynski</name><email></email></author><entry><title type="html">The Emergence of Agentic AI in Network Observability: Architectural Patterns and Integration Strategies</title><link href="https://gunnymarc.github.io/posts/2026/06/ai-assistants-cursor-architecture-network-observability/" rel="alternate" type="text/html" title="The Emergence of Agentic AI in Network Observability: Architectural Patterns and Integration Strategies" /><published>2026-06-08T00:00:00-04:00</published><updated>2026-06-08T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/06/AI_Assistants_Cursor_Architecture_Network_Observability</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/06/ai-assistants-cursor-architecture-network-observability/"><![CDATA[<p><strong>Authored by: Marc Buraczynski</strong>
<strong>Publication Date: 2026-06-09</strong></p>

<h2 id="executive-summary">Executive Summary</h2>

<p>Information technology operations are shifting from manually operated systems to increasingly autonomous ones powered by artificial intelligence. This transition is most pronounced in AI assistants, which have evolved from simple, stateless chatbots into persistent agents capable of reasoning, planning, and interacting with complex digital environments. This report provides a technical analysis of the architectural foundations underpinning these modern AI agents, with a particular focus on their application within a generic, multi-tenant network observability Software-as-a-Service (SaaS) platform.</p>

<p>Our analysis dissects the core components of agentic systems, including the iterative memory loops, advanced planning frameworks that supersede early models like ReAct, and the repository-aware context management seen in development environments such as Cursor. We explore the transition from simple Retrieval-Augmented Generation (RAG) to more advanced GraphRAG architectures, which leverage structured knowledge graphs to enable complex, multi-hop reasoning—a critical capability for diagnosing issues in distributed network infrastructure [18, 19].</p>

<p>The central thesis of this paper is the proposal of a novel architecture for an AI-powered copilot embedded within a network observability platform. This copilot is designed to ingest and correlate a wide array of telemetry data—including logs, metrics, traces, and crucial network-layer information like Border Gateway Protocol (BGP) and Domain Name System (DNS) telemetry. By creating a unified knowledge base that combines a multi-layered memory system with a dynamic service topology graph, the agent can automate complex incident management workflows, from root cause analysis to natural language investigation.</p>

<p>Finally, the report addresses the formidable challenges of deploying such powerful AI systems in an enterprise context. We detail the stringent security, governance, and compliance requirements necessary for a multi-tenant SaaS environment. This includes architectural patterns for achieving robust tenant isolation using technologies like microVMs, principles for creating auditable and privacy-preserving AI systems aligned with frameworks like the NIST AI RMF [2], and the critical role of Human-in-the-Loop (HITL) oversight. Through diagrams, comparative tables, and illustrative code samples, this paper provides a detailed blueprint for building and integrating the next generation of intelligent, autonomous systems for network operations.</p>

<h2 id="methodology">Methodology</h2>

<p>The findings presented in this report are the result of a comprehensive analysis of peer-reviewed academic papers, pre-print articles from arXiv, technical blogs from leading technology companies, and official documentation for open-source projects and commercial products. The research focused on literature published between 2022 and 2026, capturing the rapid evolution of Large Language Models (LLMs), agentic architectures, and their application in software engineering and IT operations.</p>

<p>The analytical process involved synthesizing information from disparate sources to identify common architectural patterns, emerging best practices, and significant challenges. Conflicts in information were resolved by prioritizing methodologies and results presented in peer-reviewed papers or those substantiated with robust, verifiable data. Claims from vendor-specific marketing materials were cross-referenced with technical documentation and independent analyses. The proposed architecture for a network observability copilot is a synthetic construct, integrating established principles from the reviewed literature into a novel, domain-specific application. This report, compiled on June 7, 2026, is based on publicly available information and does not reflect the proprietary inner workings of any specific commercial product, which limits its scope.</p>

<h2 id="the-anatomy-of-modern-ai-assistants">The Anatomy of Modern AI Assistants</h2>

<p>The concept of the AI assistant has fundamentally evolved from a passive, request-response mechanism into an active, autonomous agent capable of pursuing long-term goals. This transformation is driven by a new class of architectures that endow Large Language Models (LLMs) with memory, planning capabilities, and the ability to interact with external tools and environments. These systems are no longer just language processors; they are digital agents that perceive, reason, and act within a persistent context, marking a significant step towards more general artificial intelligence.</p>

<h3 id="the-agentic-loop-core-principles-of-operation">The Agentic Loop: Core Principles of Operation</h3>

<p>At the heart of any modern AI agent is an iterative operational cycle, often referred to as the <strong>agent loop</strong>. This loop extends the basic functionality of an LLM by placing it within a framework of continuous interaction with an environment. The canonical stages of this loop are perceiving the environment, reasoning about the current state and objectives, creating a plan of action, executing that plan through tool use, and observing the outcome, which then feeds back into the next cycle of perception [3, 4]. This process allows the agent to move beyond single-turn interactions and engage in complex, multi-step tasks that require statefulness and adaptation.</p>
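<p>The stages above can be sketched as a small loop. This is a minimal illustration under stated assumptions, not a reference implementation: the environment is a single integer, the reasoning and planning steps are hard-coded rules where a real agent would call an LLM, and the names <code class="language-plaintext highlighter-rouge">AgentState</code> and <code class="language-plaintext highlighter-rouge">run_agent_loop</code> are hypothetical.</p>

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class AgentState:
    """Everything the loop carries from one cycle to the next."""
    goal: int
    value: int = 0
    history: list[str] = field(default_factory=list)


def run_agent_loop(state: AgentState,
                   tools: dict[str, Callable[[int], int]],
                   max_steps: int = 20) -> AgentState:
    for _ in range(max_steps):
        # Perceive: read the current state of the (toy) environment.
        observed = state.value
        # Reason: compare the observation with the objective.
        gap = state.goal - observed
        if gap == 0:
            break
        # Plan: choose the tool expected to shrink the gap.
        tool_name = "increment" if gap > 0 else "decrement"
        # Act: execute the plan through a tool call.
        state.value = tools[tool_name](observed)
        # Observe: record the outcome so the next cycle can use it.
        state.history.append(f"{tool_name}: {observed} -> {state.value}")
    return state


tools = {"increment": lambda v: v + 1, "decrement": lambda v: v - 1}
final = run_agent_loop(AgentState(goal=3), tools)
print(final.value, len(final.history))
```

<p>The <code class="language-plaintext highlighter-rouge">max_steps</code> bound is a design choice worth keeping in any real loop: it guarantees termination even when the agent never reaches its goal.</p>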

<p>Integral to this agent loop is a more formalized memory process, best described as the <strong>write–manage–read</strong> cycle [4, 5]. This paradigm treats memory not as a passive data store but as an active, managed component of the agent’s cognitive architecture. In the “write” phase, new information from observations, tool results, or internal reflections is captured and structured. The “manage” phase, a critical differentiator of modern agents, involves sophisticated processes like pruning irrelevant data, compressing information, consolidating related memories, and resolving contradictions to maintain the integrity and utility of the memory store [5]. Finally, the “read” phase involves selectively retrieving the most relevant information to inject into the agent’s working context, thereby informing its reasoning and planning for the next action. This continuous loop of writing, curating, and retrieving information is what enables an agent to learn from experience, maintain a persistent “belief state,” and avoid the context limitations of the underlying LLM [4].</p>
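<p>A toy version of the write, manage, and read phases might look like the following. It is a sketch under simplifying assumptions: importance scores are supplied by the caller, the manage phase only deduplicates and prunes, and retrieval ranks by word overlap as a stand-in for embedding similarity. The class and method names are hypothetical.</p>

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    importance: float  # hypothetical score in [0, 1]


class MemoryStore:
    """Toy store illustrating the write, manage, and read phases."""

    def __init__(self, capacity: int = 3) -> None:
        self.capacity = capacity
        self.items: list[MemoryItem] = []

    def write(self, text: str, importance: float) -> None:
        # Write: capture a new observation or reflection.
        self.items.append(MemoryItem(text, importance))

    def manage(self) -> None:
        # Manage: drop exact duplicates (keeping the highest score),
        # then prune the least important items beyond capacity.
        best: dict[str, MemoryItem] = {}
        for item in self.items:
            kept = best.get(item.text)
            if kept is None or item.importance > kept.importance:
                best[item.text] = item
        ranked = sorted(best.values(), key=lambda m: m.importance, reverse=True)
        self.items = ranked[: self.capacity]

    def read(self, query: str, k: int = 2) -> list[str]:
        # Read: rank by word overlap with the query and return the top-k texts.
        words = set(query.lower().split())
        scored = [(len(words & set(m.text.lower().split())), m) for m in self.items]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: (pair[0], pair[1].importance), reverse=True)
        return [m.text for _, m in scored[:k]]


store = MemoryStore(capacity=3)
store.write("link eth0 flapped on router r1", 0.9)
store.write("user prefers UTC timestamps", 0.6)
store.write("link eth0 flapped on router r1", 0.4)  # duplicate observation
store.write("lunch menu changed", 0.1)
store.write("dns latency rose on resolver d2", 0.8)
store.manage()
print(store.read("why did router r1 flap"))
```

<p>Production systems would typically add compression and contradiction resolution to the manage phase; those steps are omitted here to keep the example short.</p>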

<h3 id="advanced-memory-architectures-the-foundation-of-persistence">Advanced Memory Architectures: The Foundation of Persistence</h3>

<p>The evolution from simple chatbots to persistent agents is largely attributable to the development of sophisticated memory systems that mimic aspects of human cognition. Early agents were constrained by the finite context window of the LLM, effectively suffering from a form of digital amnesia between sessions. Modern architectures overcome this limitation by externalizing memory into a multi-layered structure, offloading the cognitive burden from the model’s parameters to dedicated infrastructure. This allows the agent to build a rich history of experiences and knowledge over time.</p>

<p>Research has converged on a taxonomy of memory that categorizes information by its function and temporal scope. <strong>Working Memory</strong> is the most immediate layer, operating within the agent’s active context window and holding task-specific information for the current operation [4]. <strong>Episodic Memory</strong> serves as a long-term log of concrete experiences, storing sequences of actions, observations, and outcomes, often timestamped and scored for importance. From these raw episodes, the agent synthesizes <strong>Semantic Memory</strong>, which contains abstract, de-contextualized knowledge, such as user preferences, general facts, or learned rules. The final layer is <strong>Procedural Memory</strong>, which stores reusable skills, executable plans, and heuristics for tool use, enabling the agent to perform familiar tasks more efficiently without re-deriving the solution from scratch [4].</p>

<p><img src="https://weaviate.io/assets/images/memory-types-886ed8c1574b5c38418d121e4ecf3741.png" alt="Diagram showing different types of AI agent memory, including episodic, semantic, and procedural." />
<em>A conceptual model illustrating the distinct layers of memory—Working, Episodic, Semantic, and Procedural—that enable long-term persistence and learning in advanced AI agents.</em></p>
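<p>One way to picture the four layers is as fields of a single data structure. The sketch below is illustrative only: the field names mirror the taxonomy above, while <code class="language-plaintext highlighter-rouge">Episode</code>, <code class="language-plaintext highlighter-rouge">AgentMemory</code>, and the counting rule in <code class="language-plaintext highlighter-rouge">consolidate</code> are assumptions made for the example. A real system would more likely use an LLM to distill semantic knowledge from episodes.</p>

```python
from collections import Counter, deque
from dataclasses import dataclass, field


@dataclass
class Episode:
    timestamp: float
    action: str
    outcome: str
    importance: float = 0.5


@dataclass
class AgentMemory:
    # Working memory: small, bounded, task-specific context.
    working: deque = field(default_factory=lambda: deque(maxlen=4))
    # Episodic memory: append-only log of concrete experiences.
    episodic: list[Episode] = field(default_factory=list)
    # Semantic memory: abstract facts distilled from episodes.
    semantic: dict[str, str] = field(default_factory=dict)
    # Procedural memory: named, reusable step sequences.
    procedural: dict[str, list[str]] = field(default_factory=dict)

    def record(self, episode: Episode) -> None:
        self.episodic.append(episode)
        self.working.append(f"{episode.action} -> {episode.outcome}")

    def consolidate(self, min_count: int = 2) -> None:
        # Promote an (action, outcome) pair seen repeatedly to a general rule.
        counts = Counter((e.action, e.outcome) for e in self.episodic)
        for (action, outcome), n in counts.items():
            if n >= min_count:
                self.semantic[action] = outcome


memory = AgentMemory()
memory.record(Episode(1.0, "restart dns-cache", "latency recovered"))
memory.record(Episode(2.0, "restart dns-cache", "latency recovered"))
memory.record(Episode(3.0, "clear arp table", "no change"))
memory.consolidate()
memory.procedural["fix-dns-latency"] = ["check resolver", "restart dns-cache"]
print(memory.semantic)
```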

<p>Furthermore, state-of-the-art designs are increasingly adopting <strong>graph-based memory architectures</strong>. Unlike linear logs or unstructured vector databases, knowledge graphs represent information as a network of entities and relationships [6]. This structure preserves causal and hierarchical dependencies, enabling more complex forms of reasoning. For instance, an agent can traverse the graph to perform multi-hop queries, uncovering connections between memories that would be missed by simple semantic similarity search [6]. This capacity for structural reasoning is a crucial enabler for tackling complex, long-horizon problems that require a deep understanding of interconnected concepts.</p>
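<p>The multi-hop traversal described above can be illustrated with a breadth-first search over a small set of triples. The entities and relations are invented for the example, and a plain dictionary stands in for a graph database.</p>

```python
from collections import deque

# Hypothetical memory graph: (subject, relation, object) triples.
TRIPLES = [
    ("checkout-api", "depends_on", "payments-db"),
    ("payments-db", "hosted_in", "rack-12"),
    ("rack-12", "uplinks_to", "switch-7"),
    ("switch-7", "had_event", "firmware-upgrade"),
    ("search-api", "depends_on", "index-cache"),
]


def build_adjacency(triples):
    """Index triples by subject so neighbors can be looked up directly."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
    return graph


def find_path(graph, start, target, max_hops=5):
    """Breadth-first search returning the shortest relation path, or None."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        if len(path) == max_hops:
            continue
        for relation, neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(node, relation, neighbor)]))
    return None


graph = build_adjacency(TRIPLES)
path = find_path(graph, "checkout-api", "firmware-upgrade")
for hop in path:
    print(" ".join(hop))
```

<p>The service and the firmware event share no vocabulary, so a similarity search over the individual memories would be unlikely to link them; the four-hop path makes the connection explicit.</p>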

<h3 id="planning-and-tool-use-from-react-to-orchestration">Planning and Tool Use: From ReAct to Orchestration</h3>

<p>An agent’s ability to achieve goals is contingent on its capacity to form coherent plans and execute them by interacting with external tools, such as APIs, databases, or code interpreters. The seminal <strong>ReAct (Reason and Act)</strong> framework pioneered a powerful paradigm by interleaving reasoning traces (“Thought”) with tool invocations (“Action”) and subsequent “Observations” [7]. This structure forces the LLM to verbalize its reasoning, track its progress, and adjust its plan based on new information. However, the linear, step-by-step nature of ReAct often leads to “local optimization traps,” where the agent gets stuck on a suboptimal path because it lacks a global, high-level strategy [8]. This makes it inefficient for complex tasks that could benefit from parallel execution or a more sophisticated plan.</p>

<p>To overcome these limitations, the field has evolved toward architectures that decouple planning from execution. <strong>Planner-centric frameworks</strong> employ a dedicated “Planner” agent that first analyzes a complex query and constructs a global execution plan, often represented as a Directed Acyclic Graph (DAG) [8]. This DAG explicitly models dependencies between sub-tasks, allowing an “Executor” or “Worker” agent to run independent tool calls in parallel, significantly reducing latency and improving efficiency.</p>
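<p>A planner's DAG and a parallel executor can be sketched with the Python standard library (<code class="language-plaintext highlighter-rouge">graphlib</code>, Python 3.9+). The plan contents and the <code class="language-plaintext highlighter-rouge">run_task</code> stand-in are assumptions for illustration; in a real system the planner model would emit the graph and each task would be an actual tool call.</p>

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter

# Hypothetical plan: each task maps to the set of tasks it depends on.
PLAN = {
    "fetch_metrics": set(),
    "fetch_logs": set(),
    "fetch_bgp_updates": set(),
    "correlate": {"fetch_metrics", "fetch_logs", "fetch_bgp_updates"},
    "summarize": {"correlate"},
}


def run_task(name: str, inputs: dict[str, str]) -> str:
    # Stand-in for a real tool call; it only records what it consumed.
    consumed = ",".join(sorted(inputs)) or "nothing"
    return f"{name}({consumed})"


def execute_plan(plan: dict[str, set[str]]) -> tuple[dict[str, str], list[list[str]]]:
    sorter = TopologicalSorter(plan)
    sorter.prepare()  # raises graphlib.CycleError if the plan is not a DAG
    results: dict[str, str] = {}
    waves: list[list[str]] = []
    with ThreadPoolExecutor(max_workers=4) as pool:
        while sorter.is_active():
            # Tasks whose dependencies have all completed.
            ready = sorted(sorter.get_ready())
            waves.append(ready)
            futures = {
                name: pool.submit(run_task, name, {d: results[d] for d in plan[name]})
                for name in ready
            }
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return results, waves


results, waves = execute_plan(PLAN)
print(waves)
```

<p>This version runs tasks in waves, waiting for a whole wave before starting the next. That is simpler to follow than fully dynamic scheduling, at the cost of some idle time when tasks in a wave have uneven durations.</p>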

<p>Another advanced pattern is the use of <strong>Multi-Agent Systems</strong>, which distribute cognitive labor across a team of specialized agents. In this model, a “Leader” or “Orchestrator” agent is responsible for high-level strategy, task decomposition, and error handling [9]. It delegates specific sub-tasks to a swarm of “Worker” agents, which may be specialized for functions like research, code generation, or data analysis [9, 10]. This hierarchical structure mirrors human engineering teams and reduces the cognitive load on any single model, making it more robust and scalable. These orchestration layers often use deterministic frameworks to manage state and transitions between agents, ensuring that the overall workflow is reliable and auditable [10]. This progression from the simple ReAct loop to complex, orchestrated multi-agent systems represents a significant leap in the ability of AI to perform sophisticated, long-horizon tasks.</p>

<p><img src="https://www.firecrawl.dev/images/blog/ai-agents/agent-architecture.webp" alt="Diagram illustrating an advanced agentic architecture involving planning, tool use, and memory." />
<em>This diagram depicts a modern agentic architecture, highlighting the central role of the agent in coordinating planning, memory retrieval, and tool execution to interact with its environment and achieve goals.</em></p>

<h2 id="the-cursor-like-paradigm-agentic-ai-in-development-environments">The “Cursor-like” Paradigm: Agentic AI in Development Environments</h2>

<p>The integration of agentic AI into software development has given rise to a new generation of tools that transcend simple code completion. The “Cursor-like” paradigm, named after one of its prominent exemplars, represents a developer environment where the AI is not just a passive assistant but an active collaborator with deep awareness of the entire codebase. These systems function as autonomous agents embedded within the Integrated Development Environment (IDE), capable of understanding repository-wide context, orchestrating complex refactoring tasks, and interacting directly with the developer’s command line and file system.</p>

<h3 id="beyond-autocomplete-repository-scale-awareness">Beyond Autocomplete: Repository-Scale Awareness</h3>

<p>Traditional AI coding assistants primarily operated on the local context of the currently open file, offering suggestions based on the surrounding lines of code. This limited their utility for complex tasks that require understanding interdependencies across multiple files, modules, and APIs. Modern agents overcome this limitation through <strong>repository-scale awareness</strong>: they pre-process the entire codebase to build a persistent, queryable knowledge base [10].</p>

<p>A key technology enabling this is the use of structural parsers like <strong>Tree-Sitter</strong> to construct knowledge graphs of the code [10]. Instead of treating code as flat text, these agents parse it into a structured representation of entities (e.g., functions, classes, variables) and their relationships (e.g., calls, imports, inheritance). This allows the agent to perform sophisticated structural queries, such as “find all functions that call this deprecated API” or “show me the definition of the class this object inherits from,” without needing to manually read dozens of files. This structural retrieval is far more token-efficient and accurate than naive text-based search [10]. This structured knowledge is often exposed to the agent through a standardized interface known as the <strong>Model Context Protocol (MCP)</strong>, which provides a consistent way for the agent to interact with external knowledge sources and tools, regardless of the underlying infrastructure [11].</p>
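<p>To show what a structural query looks like in practice, the sketch below answers "find all functions that call this deprecated API" over a small snippet. It deliberately swaps in Python's built-in <code class="language-plaintext highlighter-rouge">ast</code> module as a stand-in for Tree-Sitter, so it handles Python source only. The sample source and the helper names are hypothetical.</p>

```python
import ast

SOURCE = '''
import legacy

def load(path):
    return legacy.old_deprecated_function(path)

def transform(data):
    return [x * 2 for x in data]

class Job:
    def run(self, cfg):
        value = old_deprecated_function(cfg)
        return transform(value)
'''


def called_name(call):
    """Return the simple name a call targets: f(...) or obj.f(...)."""
    func = call.func
    if isinstance(func, ast.Name):
        return func.id
    if isinstance(func, ast.Attribute):
        return func.attr
    return None


def functions_calling(source, api_name):
    """Map each function that calls api_name to the line numbers of the calls."""
    hits = {}
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call) and called_name(inner) == api_name:
                    hits.setdefault(node.name, []).append(inner.lineno)
    return hits


print(functions_calling(SOURCE, "old_deprecated_function"))
```

<p>Because the query works on syntax nodes rather than text, it would ignore the same name appearing in a comment or string literal, which is one reason structural retrieval tends to be more precise than text search.</p>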

<h3 id="a-comparison-of-modern-ai-coding-assistants">A Comparison of Modern AI Coding Assistants</h3>

<p>The market for AI coding assistants has matured, with several key players offering distinct approaches to integrating AI into the development workflow. While all aim to boost developer productivity, they differ in their architecture, ecosystem integration, and security postures. The table below compares prominent tools based on available research [12, 13, 14].</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Feature</th>
      <th style="text-align: left">GitHub Copilot</th>
      <th style="text-align: left">Cursor</th>
      <th style="text-align: left">Windsurf</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Core Paradigm</strong></td>
      <td style="text-align: left">Ecosystem-integrated pair programmer</td>
      <td style="text-align: left">AI-native, repository-aware code editor</td>
      <td style="text-align: left">Performance-oriented, terminal-aware agent</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Context Management</strong></td>
      <td style="text-align: left">Primarily file-level and open tabs, with some repository-level search</td>
      <td style="text-align: left">Deep codebase indexing via local Merkle trees</td>
      <td style="text-align: left">“Flow” paradigm with strong terminal and browser context awareness</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Key Differentiator</strong></td>
      <td style="text-align: left">Deep integration with GitHub platform (PRs, Actions, Security)</td>
      <td style="text-align: left">Mature agentic workflows (e.g., “Composer”) and team-wide rule enforcement (<code class="language-plaintext highlighter-rouge">.cursorrules</code>)</td>
      <td style="text-align: left">High performance and tight integration with the OpenAI ecosystem</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Security Model</strong></td>
      <td style="text-align: left">Enterprise-grade compliance, data segregation, and IP indemnification [1]</td>
      <td style="text-align: left">Local-first indexing; relies on <code class="language-plaintext highlighter-rouge">.cursorignore</code> to prevent sensitive data transmission</td>
      <td style="text-align: left">Dependent on the underlying OpenAI API security and privacy policies</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Target User</strong></td>
      <td style="text-align: left">Developers and teams heavily invested in the GitHub ecosystem</td>
      <td style="text-align: left">Professional developers and teams seeking a fully AI-integrated editing experience</td>
      <td style="text-align: left">Developers prioritizing raw performance and a terminal-centric workflow</td>
    </tr>
  </tbody>
</table>

<p>This comparison highlights a fundamental trade-off: deeply integrated ecosystem players like GitHub Copilot provide robust enterprise governance, while more agile, editor-native tools like Cursor offer more advanced agentic workflows at the potential cost of standardized enterprise controls.</p>

<h3 id="code-sample-a-simplified-agentic-workflow-in-python">Code Sample: A Simplified Agentic Workflow in Python</h3>

<p>To make the concept of an agentic workflow more concrete, consider the following pseudo-code example using a hypothetical Python framework. This code illustrates how an agent might perform a simple refactoring task: finding all instances of an old function name and suggesting a replacement. This demonstrates the core loop of planning, acting (tool use), and synthesizing a result.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># A simplified Python pseudo-code example of an agentic workflow for code refactoring.
</span>
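<span class="kn">from</span> <span class="nn">types</span> <span class="kn">import</span> <span class="n">SimpleNamespace</span>

<span class="c1"># Minimal stand-in for the hypothetical agent framework so the sample runs as written.
# A real framework would let the LLM choose which tool to call; this stub only dispatches by name.
</span><span class="k">class</span> <span class="nc">_Agent</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">,</span> <span class="n">tools</span><span class="p">,</span> <span class="n">model</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">model</span>  <span class="c1"># recorded only; no LLM is called here
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">tools</span> <span class="o">=</span> <span class="p">{</span><span class="n">tool</span><span class="p">.</span><span class="n">__name__</span><span class="p">:</span> <span class="n">tool</span> <span class="k">for</span> <span class="n">tool</span> <span class="ow">in</span> <span class="n">tools</span><span class="p">}</span>

    <span class="k">def</span> <span class="nf">run_tool</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">tool_name</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">tools</span><span class="p">[</span><span class="n">tool_name</span><span class="p">](</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>

<span class="n">af</span> <span class="o">=</span> <span class="n">SimpleNamespace</span><span class="p">(</span><span class="n">Agent</span><span class="o">=</span><span class="n">_Agent</span><span class="p">)</span>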

<span class="c1"># Define Tools available to the agent
# In a real system, these would interact with the file system and a structural code index.
</span><span class="k">class</span> <span class="nc">CodebaseTools</span><span class="p">:</span>
    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">find_function_calls</span><span class="p">(</span><span class="n">function_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
        <span class="s">"""Finds all files and line numbers where a function is called."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"TOOL: Searching for calls to '</span><span class="si">{</span><span class="n">function_name</span><span class="si">}</span><span class="s">'..."</span><span class="p">)</span>
        <span class="c1"># In a real implementation, this would query a Tree-Sitter-based index.
</span>        <span class="k">return</span> <span class="p">[</span>
            <span class="p">{</span><span class="s">"file"</span><span class="p">:</span> <span class="s">"src/main.py"</span><span class="p">,</span> <span class="s">"line"</span><span class="p">:</span> <span class="mi">56</span><span class="p">},</span>
            <span class="p">{</span><span class="s">"file"</span><span class="p">:</span> <span class="s">"src/utils.py"</span><span class="p">,</span> <span class="s">"line"</span><span class="p">:</span> <span class="mi">102</span><span class="p">},</span>
        <span class="p">]</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">read_file_line</span><span class="p">(</span><span class="n">file_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">line_number</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Reads a specific line from a file."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"TOOL: Reading line </span><span class="si">{</span><span class="n">line_number</span><span class="si">}</span><span class="s"> from '</span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s">'..."</span><span class="p">)</span>
        <span class="c1"># Dummy implementation
</span>        <span class="k">if</span> <span class="n">file_path</span> <span class="o">==</span> <span class="s">"src/main.py"</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"result = old_deprecated_function(data)"</span>
        <span class="k">return</span> <span class="s">"value = old_deprecated_function(config)"</span>

    <span class="o">@</span><span class="nb">staticmethod</span>
    <span class="k">def</span> <span class="nf">suggest_refactor</span><span class="p">(</span><span class="n">file_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">line_number</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">old_code</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">new_function_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Generates a refactoring suggestion for a line of code."""</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"TOOL: Generating refactor suggestion for '</span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s">:</span><span class="si">{</span><span class="n">line_number</span><span class="si">}</span><span class="s">'..."</span><span class="p">)</span>
        <span class="c1"># This function would use an LLM to generate the replacement code.
</span>        <span class="n">new_code</span> <span class="o">=</span> <span class="n">old_code</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"old_deprecated_function"</span><span class="p">,</span> <span class="n">new_function_name</span><span class="p">)</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"Replace line </span><span class="si">{</span><span class="n">line_number</span><span class="si">}</span><span class="s"> in '</span><span class="si">{</span><span class="n">file_path</span><span class="si">}</span><span class="s">' with: `</span><span class="si">{</span><span class="n">new_code</span><span class="si">}</span><span class="s">`"</span>

<span class="c1"># Create an agent with a set of tools
</span><span class="n">agent</span> <span class="o">=</span> <span class="n">af</span><span class="p">.</span><span class="n">Agent</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="s">"RefactoringAgent"</span><span class="p">,</span>
    <span class="n">tools</span><span class="o">=</span><span class="p">[</span>
        <span class="n">CodebaseTools</span><span class="p">.</span><span class="n">find_function_calls</span><span class="p">,</span>
        <span class="n">CodebaseTools</span><span class="p">.</span><span class="n">read_file_line</span><span class="p">,</span>
        <span class="n">CodebaseTools</span><span class="p">.</span><span class="n">suggest_refactor</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">model</span><span class="o">=</span><span class="s">"gpt-4-turbo"</span> <span class="c1"># Specify the LLM to use for reasoning
</span><span class="p">)</span>

<span class="c1"># The User's request
</span><span class="n">user_request</span> <span class="o">=</span> <span class="s">"Please find all uses of 'old_deprecated_function' and replace them with 'new_stable_function'."</span>

<span class="c1"># Agent execution loop
</span><span class="k">def</span> <span class="nf">run_refactoring_agent</span><span class="p">(</span><span class="n">request</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="s">"""Orchestrates the agent's plan to fulfill the user request."""</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"AGENT: Received request. Devising a plan."</span><span class="p">)</span>
    
    <span class="c1"># 1. Plan: In a real agent the LLM would generate this sequence of actions; it is hard-coded here for illustration.
</span>    <span class="n">plan</span> <span class="o">=</span> <span class="p">[</span>
        <span class="s">"Use the 'find_function_calls' tool to locate all instances of 'old_deprecated_function'."</span><span class="p">,</span>
        <span class="s">"For each instance found, use the 'read_file_line' tool to get the exact code."</span><span class="p">,</span>
        <span class="s">"Use the 'suggest_refactor' tool to generate a replacement for each line."</span><span class="p">,</span>
        <span class="s">"Compile all suggestions into a final report for the user."</span>
    <span class="p">]</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"AGENT: Plan created:</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"- </span><span class="si">{</span><span class="n">step</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">step</span> <span class="ow">in</span> <span class="n">plan</span><span class="p">))</span>

    <span class="c1"># 2. Act: The agent executes the plan by calling the tools.
</span>    <span class="n">old_function</span> <span class="o">=</span> <span class="s">"old_deprecated_function"</span>
    <span class="n">new_function</span> <span class="o">=</span> <span class="s">"new_stable_function"</span>

    <span class="n">call_locations</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">run_tool</span><span class="p">(</span><span class="s">"find_function_calls"</span><span class="p">,</span> <span class="n">function_name</span><span class="o">=</span><span class="n">old_function</span><span class="p">)</span>
    
    <span class="n">suggestions</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">loc</span> <span class="ow">in</span> <span class="n">call_locations</span><span class="p">:</span>
        <span class="n">line_content</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">run_tool</span><span class="p">(</span><span class="s">"read_file_line"</span><span class="p">,</span> <span class="n">file_path</span><span class="o">=</span><span class="n">loc</span><span class="p">[</span><span class="s">"file"</span><span class="p">],</span> <span class="n">line_number</span><span class="o">=</span><span class="n">loc</span><span class="p">[</span><span class="s">"line"</span><span class="p">])</span>
        <span class="n">suggestion</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">run_tool</span><span class="p">(</span><span class="s">"suggest_refactor"</span><span class="p">,</span> <span class="n">file_path</span><span class="o">=</span><span class="n">loc</span><span class="p">[</span><span class="s">"file"</span><span class="p">],</span> <span class="n">line_number</span><span class="o">=</span><span class="n">loc</span><span class="p">[</span><span class="s">"line"</span><span class="p">],</span> <span class="n">old_code</span><span class="o">=</span><span class="n">line_content</span><span class="p">,</span> <span class="n">new_function_name</span><span class="o">=</span><span class="n">new_function</span><span class="p">)</span>
        <span class="n">suggestions</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">suggestion</span><span class="p">)</span>

    <span class="c1"># 3. Observe/Synthesize: The agent compiles the results into a human-readable format.
</span>    <span class="n">final_report</span> <span class="o">=</span> <span class="s">"Refactoring complete. Here are the suggested changes:</span><span class="se">\n\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">suggestions</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">AGENT: Final Report:</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="n">final_report</span><span class="p">)</span>

<span class="c1"># Run the agent
</span><span class="n">run_refactoring_agent</span><span class="p">(</span><span class="n">user_request</span><span class="p">)</span>

</code></pre></div></div>
<p>This example, while simplified, captures the essence of the “Cursor-like” paradigm: an agent that can reason about a user’s intent, formulate a multi-step plan, interact with the codebase through specialized tools, and synthesize the results into a concrete, actionable outcome.</p>

<h2 id="architecting-an-ai-assistant-for-network-observability-saas">Architecting an AI Assistant for Network Observability SaaS</h2>

<p>The growing complexity of distributed systems has made network and service monitoring a significant challenge for IT operations and Site Reliability Engineering (SRE) teams. These teams face a constant stream of telemetry from many sources: logs, metrics, traces, and network protocol updates. An AI assistant, or copilot, embedded within a network observability SaaS platform can automate the cognitively demanding tasks of incident investigation, root cause analysis, and proactive system management, shifting operations from a reactive model to a proactive, data-driven one.</p>

<h3 id="the-opportunity-taming-complexity-in-network-monitoring">The Opportunity: Taming Complexity in Network Monitoring</h3>

<p>The core problem in modern network observability is not a lack of data, but a surplus of it [15]. Human operators struggle to manually correlate disparate signals to diagnose issues. For example, a latency spike observed in application metrics might be caused by a database overload, a misconfigured load balancer, an upstream API failure, or a suboptimal BGP routing change happening thousands of miles away [16, 17]. Pinpointing the true root cause requires expertise, time, and painstaking cross-referencing of data from multiple, often siloed, monitoring tools.</p>

<p>An agentic AI assistant is well suited to this challenge. Because it can ingest and reason over large, heterogeneous datasets, it can automate the correlation work that burdens human operators [21]. It can answer natural language questions about system health, automatically investigate alerts as they fire, and provide evidence-backed explanations for its conclusions. This allows human experts to focus on strategic remediation and system improvement rather than on sifting through diagnostic data.</p>

<h3 id="a-proposed-architecture-for-a-network-observability-copilot">A Proposed Architecture for a Network Observability Copilot</h3>

<p>To realize this vision, we propose a multi-component architecture for a network observability copilot that integrates the advanced agentic principles discussed previously. This architecture is centered around a sophisticated knowledge base that combines a multi-layered memory system with a dynamic knowledge graph, serving as the AI’s long-term memory and world model.</p>

<p><img src="https://d3lkc3n5th01x7.cloudfront.net/wp-content/uploads/2024/08/26051537/Advanced-RAG.png" alt="Diagram showing a GraphRAG architecture for an enterprise application." />
<em>An example of a GraphRAG architecture, which combines semantic search with graph traversal to retrieve interconnected, contextual information for the LLM. This retrieval pattern is well suited to a network observability copilot.</em></p>

<p>The ingestion layer of this architecture would continuously process telemetry streams from various sources. Application logs, system metrics, and distributed traces provide a view of the service layer, while specialized data feeds for BGP updates, DNS query responses, and flow records (like IPFIX) offer crucial visibility into the underlying network fabric [16]. This data is then processed and stored within the agent’s memory system.</p>

<ul>
  <li><strong>Episodic Memory:</strong> This layer would store a historical record of all incidents, including the alerts that triggered them, the investigation steps taken (both by humans and the AI), chat transcripts from incident response channels, and the final resolution. Each incident becomes a discrete “episode” that the agent can learn from [4].</li>
  <li><strong>Semantic Memory:</strong> Through a process of periodic reflection and summarization, the agent distills higher-level knowledge from raw episodes. This semantic store might contain insights like, “Deployments to the ‘us-east-1’ region containing database schema changes have a 30% higher chance of causing P1 incidents,” or summaries of service runbooks [4].</li>
  <li><strong>Procedural Memory:</strong> This layer stores learned, executable workflows for diagnosing specific types of alerts. For example, upon receiving a “high latency” alert for a particular service, the agent could invoke a pre-defined procedure that automatically checks database load, recent deployments, and upstream service health in a specific sequence [4].</li>
  <li><strong>Graph-Based Knowledge Base:</strong> This is the centerpiece of the architecture. It is a dynamic knowledge graph that models the entire monitored environment as a set of interconnected entities. Nodes in the graph would represent services, databases, Kubernetes pods, hosts, and network prefixes. Edges would represent dependencies, communication pathways, and logical relationships (e.g., “Service A depends on Database B,” “Pod X runs on Host Y,” “Traffic to Prefix Z traverses AS Path P”). This <strong>GraphRAG</strong> (Graph Retrieval-Augmented Generation) approach allows the agent to reason about the system’s topology and perform multi-hop queries to understand the “blast radius” of a failure [18, 19].</li>
</ul>
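
<p>The multi-hop “blast radius” query described for the knowledge graph can be sketched as a breadth-first traversal. This is a minimal illustration, not the platform’s implementation: the node names, the <code class="language-plaintext highlighter-rouge">DEPENDENTS</code> edge map, and the <code class="language-plaintext highlighter-rouge">blast_radius</code> helper are all hypothetical, and a production system would more likely issue the query against a graph database.</p>

```python
from collections import deque

# Hypothetical edges. "X depends on Y" is stored as Y -> [X], so walking
# outward from a failed node reaches everything that failure can affect.
DEPENDENTS = {
    "prefix:203.0.113.0/24": ["host:edge-1"],
    "host:edge-1": ["pod:gateway-7f9"],
    "pod:gateway-7f9": ["service:api-gateway"],
    "service:api-gateway": ["service:checkout"],
    "db:orders": ["service:checkout"],
}


def blast_radius(graph: dict, failed: str, max_hops: int = 3) -> dict:
    """Breadth-first traversal returning each affected node and its hop distance."""
    distances = {}
    queue = deque([(failed, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop budget
        for dependent in graph.get(node, []):
            if dependent != failed and dependent not in distances:
                distances[dependent] = hops + 1
                queue.append((dependent, hops + 1))
    return distances


affected = blast_radius(DEPENDENTS, "prefix:203.0.113.0/24", max_hops=3)
for node, hops in affected.items():
    print(f"{hops} hop(s): {node}")
```

<p>Bounding the traversal with <code class="language-plaintext highlighter-rouge">max_hops</code> keeps the retrieved context small enough to hand to the LLM; raising it to 4 in this toy graph would also surface <code class="language-plaintext highlighter-rouge">service:checkout</code>.</p>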

<h3 id="agentic-workflows-for-incident-management">Agentic Workflows for Incident Management</h3>

<p>With this architecture in place, the observability copilot can execute a range of sophisticated workflows that automate and augment the incident management lifecycle. A multi-agent system, comprising a high-level Orchestrator and specialized Worker agents, would be ideal for managing these complex tasks [9].</p>

<ul>
  <li>
    <p><strong>Use Case 1: Automated Root Cause Analysis (RCA):</strong> When a critical alert fires, the Orchestrator agent initiates an investigation. It spawns multiple Worker agents in parallel. One agent analyzes metric data to characterize the anomaly’s scope and timing. Another agent scans log files from the affected service for error messages within the same timeframe. A third, specialized “Network Agent,” queries the knowledge graph to check for any correlating BGP path changes or DNS anomalies that occurred concurrently [20, 22]. The agents feed their findings back to the Orchestrator, which uses the LLM’s reasoning capabilities to synthesize the information, identify the most probable root cause, and present a cited, evidence-backed summary to the on-call engineer.</p>
  </li>
  <li>
    <p><strong>Use Case 2: Natural Language Investigation:</strong> An SRE can interact with the copilot in a conversational manner. For example, they might ask, <em>“What was the impact of the BGP route leak this morning affecting our primary European prefixes?”</em> The Orchestrator agent would parse this query, identify the key entities (“BGP route leak,” “European prefixes”), and delegate the investigation. A Worker agent would query the Episodic memory for the relevant incident record from that morning. Another agent would query the knowledge graph to identify all services dependent on infrastructure associated with those prefixes [20]. The final response would be a comprehensive summary, including which services experienced increased latency, the duration of the impact, and a link to the full post-mortem report.</p>
  </li>
  <li>
    <p><strong>Use Case 3: Proactive Anomaly Explanation:</strong> The copilot can also operate proactively. A monitoring agent could continuously analyze network telemetry for subtle performance degradations that might not trigger a hard alert threshold. Upon detecting a consistent increase in latency for traffic routed through a specific Internet Service Provider (ISP), the agent could proactively generate a report explaining the anomaly, identifying the affected customer traffic, and suggesting potential traffic engineering adjustments to mitigate the issue before it becomes a major incident [22].</p>
  </li>
</ul>
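
<p>The Orchestrator and Worker split behind these use cases can be sketched with a thread pool that fans an alert out to several workers and gathers their findings. This is a sketch under stated assumptions: the three worker functions return canned strings in place of real backend queries, and names such as <code class="language-plaintext highlighter-rouge">orchestrate</code> and <code class="language-plaintext highlighter-rouge">build_synthesis_prompt</code> are hypothetical. The final LLM call is deliberately left out.</p>

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical worker agents. Each would normally query a real backend;
# here they return canned findings so the sketch runs on its own.
def metrics_worker(alert: dict) -> str:
    return f"p95 latency for {alert['service']} rose inside the alert window"


def logs_worker(alert: dict) -> str:
    return f"no new error signatures in {alert['service']} logs"


def network_worker(alert: dict) -> str:
    return "one BGP path change observed for the service prefixes"


WORKERS = {"metrics": metrics_worker, "logs": logs_worker, "network": network_worker}


def orchestrate(alert: dict) -> dict:
    """Run every worker in parallel and collect one finding per worker."""
    findings = {}
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        futures = {name: pool.submit(worker, alert) for name, worker in WORKERS.items()}
        for name, future in futures.items():
            try:
                findings[name] = future.result(timeout=30)
            except Exception as exc:
                # A failed worker becomes a finding instead of aborting the investigation.
                findings[name] = f"worker failed: {exc}"
    return findings


def build_synthesis_prompt(alert: dict, findings: dict) -> str:
    """Assemble the evidence the Orchestrator would pass to the LLM."""
    evidence = "\n".join(f"- [{name}] {text}" for name, text in sorted(findings.items()))
    return (
        f"Alert on {alert['service']}. Evidence:\n{evidence}\n"
        "Name the most probable root cause and cite the evidence used."
    )


alert = {"service": "api-gateway"}
findings = orchestrate(alert)
prompt = build_synthesis_prompt(alert, findings)
print(prompt)
```

<p>Tagging each finding with the worker that produced it is what lets the final summary cite its evidence, as Use Case 1 requires.</p>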

<h3 id="illustrative-code-sample-diagnosing-a-network-anomaly">Illustrative Code Sample: Diagnosing a Network Anomaly</h3>

<p>The following Python pseudo-code provides a conceptual look at how an agent function might approach diagnosing a latency spike. It demonstrates the correlation of multiple data sources, a key capability of the proposed architecture.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># A simplified Python pseudo-code example for a network observability agent function.
</span>
<span class="c1"># observability_tools is a hypothetical module. LLMClient is an assumed wrapper exposing generate(prompt) -&gt; str.</span>
<span class="kn">from</span> <span class="nn">observability_tools</span> <span class="kn">import</span> <span class="n">MetricsDB</span><span class="p">,</span> <span class="n">TracesDB</span><span class="p">,</span> <span class="n">BGPLogDB</span><span class="p">,</span> <span class="n">ServiceGraph</span><span class="p">,</span> <span class="n">LLMClient</span>

<span class="k">class</span> <span class="nc">ObservabilityAgent</span><span class="p">:</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="c1"># Initialize connections to data sources and the knowledge graph
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span> <span class="o">=</span> <span class="n">MetricsDB</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">traces</span> <span class="o">=</span> <span class="n">TracesDB</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">bgp_logs</span> <span class="o">=</span> <span class="n">BGPLogDB</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">service_graph</span> <span class="o">=</span> <span class="n">ServiceGraph</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">=</span> <span class="n">LLMClient</span><span class="p">(</span><span class="n">model_id</span><span class="o">=</span><span class="s">"anthropic.claude-3-opus-20240229-v1:0"</span><span class="p">)</span> <span class="c1"># The reasoning engine
</span>
    <span class="k">def</span> <span class="nf">diagnose_latency_spike</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">alert_details</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Investigates a latency spike alert by correlating metrics, traces, and network data.
        """</span>
        <span class="n">service_name</span> <span class="o">=</span> <span class="n">alert_details</span><span class="p">[</span><span class="s">"service"</span><span class="p">]</span>
        <span class="n">start_time</span> <span class="o">=</span> <span class="n">alert_details</span><span class="p">[</span><span class="s">"start_time"</span><span class="p">]</span>
        <span class="n">end_time</span> <span class="o">=</span> <span class="n">alert_details</span><span class="p">[</span><span class="s">"end_time"</span><span class="p">]</span>
        
        <span class="c1"># --- Step 1: Gather Initial Evidence from different telemetry sources ---
</span>        
        <span class="c1"># Agent analyzes metrics to confirm the spike
</span>        <span class="n">metric_summary</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">.</span><span class="n">get_latency_summary</span><span class="p">(</span><span class="n">service_name</span><span class="p">,</span> <span class="n">start_time</span><span class="p">,</span> <span class="n">end_time</span><span class="p">)</span>
        
        <span class="c1"># Agent finds the slowest traces during the incident window
</span>        <span class="n">slowest_traces</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">traces</span><span class="p">.</span><span class="n">find_slowest_traces</span><span class="p">(</span><span class="n">service_name</span><span class="p">,</span> <span class="n">start_time</span><span class="p">,</span> <span class="n">end_time</span><span class="p">)</span>
        
        <span class="c1"># Agent checks for any BGP routing changes in the same window
</span>        <span class="n">bgp_changes</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">bgp_logs</span><span class="p">.</span><span class="n">get_updates_for_prefixes</span><span class="p">(</span>
            <span class="n">service_prefixes</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">service_graph</span><span class="p">.</span><span class="n">get_prefixes_for_service</span><span class="p">(</span><span class="n">service_name</span><span class="p">),</span>
            <span class="n">start_time</span><span class="o">=</span><span class="n">start_time</span><span class="p">,</span>
            <span class="n">end_time</span><span class="o">=</span><span class="n">end_time</span>
        <span class="p">)</span>

        <span class="c1"># --- Step 2: Formulate Hypotheses based on Evidence ---
</span>        
        <span class="n">causal_hypotheses</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="k">if</span> <span class="s">"database_query"</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">slowest_traces</span><span class="p">):</span>
            <span class="n">causal_hypotheses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"The latency spike may be caused by slow database queries."</span><span class="p">)</span>
            
        <span class="n">downstream_services</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">service_graph</span><span class="p">.</span><span class="n">get_downstream_dependencies</span><span class="p">(</span><span class="n">service_name</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">service</span> <span class="ow">in</span> <span class="nb">str</span><span class="p">(</span><span class="n">slowest_traces</span><span class="p">)</span> <span class="k">for</span> <span class="n">service</span> <span class="ow">in</span> <span class="n">downstream_services</span><span class="p">):</span>
            <span class="n">causal_hypotheses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"The issue may originate from a slow downstream dependency."</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">bgp_changes</span><span class="p">:</span>
            <span class="n">causal_hypotheses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"A concurrent BGP routing change was detected, potentially causing suboptimal traffic paths. Changes: </span><span class="si">{</span><span class="n">bgp_changes</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            
        <span class="c1"># --- Step 3: Synthesize a Root Cause Narrative using the LLM ---
</span>        
        <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""
        You are an expert SRE. A latency spike was detected for the service '</span><span class="si">{</span><span class="n">service_name</span><span class="si">}</span><span class="s">'.
        Analyze the following evidence and provide a concise root cause analysis.
        
        Metric Summary: </span><span class="si">{</span><span class="n">metric_summary</span><span class="si">}</span><span class="s">
        
        Slowest Traces Analysis: The slowest traces show significant time spent in these spans: </span><span class="si">{</span><span class="n">slowest_traces</span><span class="si">}</span><span class="s">.
        
        BGP Log Analysis: The following BGP updates were observed during the incident: </span><span class="si">{</span><span class="n">bgp_changes</span><span class="si">}</span><span class="s">.
        
        Causal Hypotheses: </span><span class="si">{</span><span class="n">causal_hypotheses</span><span class="si">}</span><span class="s">
        
        Based on all evidence, what is the most likely root cause? Be precise and provide your reasoning.
        """</span>
        
        <span class="c1"># The LLM reasons over the correlated data to generate an explanation.
</span>        <span class="n">root_cause_narrative</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span><span class="n">prompt</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">root_cause_narrative</span>

<span class="c1"># --- Example Usage ---
# An alert comes in from the monitoring system.
</span><span class="n">alert</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"service"</span><span class="p">:</span> <span class="s">"api-gateway"</span><span class="p">,</span>
    <span class="s">"alert_type"</span><span class="p">:</span> <span class="s">"P95_LATENCY_SPIKE"</span><span class="p">,</span>
    <span class="s">"start_time"</span><span class="p">:</span> <span class="s">"2026-06-07T10:00:00Z"</span><span class="p">,</span>
    <span class="s">"end_time"</span><span class="p">:</span> <span class="s">"2026-06-07T10:15:00Z"</span>
<span class="p">}</span>

<span class="c1"># The agent is triggered to perform RCA.
</span><span class="n">agent</span> <span class="o">=</span> <span class="n">ObservabilityAgent</span><span class="p">()</span>
<span class="n">analysis_report</span> <span class="o">=</span> <span class="n">agent</span><span class="p">.</span><span class="n">diagnose_latency_spike</span><span class="p">(</span><span class="n">alert</span><span class="p">)</span>

<span class="k">print</span><span class="p">(</span><span class="s">"--- Automated Incident Analysis Report ---"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">analysis_report</span><span class="p">)</span>
</code></pre></div></div>
<p>This example illustrates how an agent can systematically gather and correlate evidence from multiple domains—application performance and network routing—to form and evaluate hypotheses, ultimately producing a coherent and actionable analysis that would be time-consuming and difficult for a human operator to construct under pressure.</p>

<h2 id="enterprise-integration-security-governance-and-compliance">Enterprise Integration: Security, Governance, and Compliance</h2>

<p>Integrating a powerful, autonomous AI copilot into an enterprise-grade, multi-tenant SaaS platform is not merely a technical challenge; it is a profound security and governance undertaking. The very capabilities that make these agents effective—access to vast data stores, the ability to interact with production systems, and autonomous decision-making—also introduce significant risks if not architected with a security-first mindset. For a network observability product serving multiple customers, ensuring absolute tenant isolation, data privacy, and regulatory compliance is paramount.</p>

<h3 id="the-multi-tenant-security-challenge">The Multi-Tenant Security Challenge</h3>

<p>In a multi-tenant environment, the primary threat is data leakage across tenant boundaries. An AI assistant, by its nature, processes large amounts of contextual data. A single flaw in its logic or a vulnerability to a technique like <strong>prompt injection</strong> could lead to catastrophic consequences [23, 24]. A malicious actor in one tenant could craft a query designed to trick the agent into revealing sensitive data—such as infrastructure details, proprietary code, or PII—from another tenant. Furthermore, the risk of “excessive agency,” where an agent is manipulated into performing unauthorized actions, is a critical concern identified by security frameworks like the OWASP Top 10 for LLM Applications [23]. The non-deterministic nature of LLMs means that traditional application security models, which rely on predictable code paths, are insufficient. Security cannot be an afterthought; it must be structurally embedded in the architecture.</p>

<h3 id="architectural-patterns-for-secure-multi-tenancy">Architectural Patterns for Secure Multi-Tenancy</h3>

<p>To mitigate these risks, a defense-in-depth strategy based on <strong>structural isolation</strong> is required [26]. This principle dictates that tenant separation should be enforced by the underlying infrastructure, not by application-level logic that could be flawed or bypassed. Several architectural patterns are key to achieving this.</p>

<ul>
  <li>
    <p><strong>Compute and Runtime Isolation:</strong> AI agents that execute LLM-generated or otherwise untrusted code pose a significant threat. Standard containers share the host kernel, so a single kernel vulnerability can expose every tenant on that host, which makes them insufficient on their own for this workload. A stronger approach is to use lightweight virtual machines, or <strong>microVMs</strong>, such as those created by <strong>Firecracker</strong> [25]. Firecracker builds on hardware virtualization (KVM), so each tenant’s agent execution (or even each individual agent invocation) runs under its own guest kernel [25, 26]. This confines a container-escape exploit to a single microVM and sharply limits one tenant’s ability to interfere with another’s processes.</p>
  </li>
  <li>
    <p><strong>Data and Credential Security:</strong> The agent’s access to data must be rigorously controlled. Instead of relying solely on query-level filters like <code class="language-plaintext highlighter-rouge">WHERE tenant_id = X</code>, a more robust pattern is <strong>namespace separation</strong>, where each tenant’s data resides in its own storage bucket, database schema, or vector collection [27]. A query that omits or forges the filter then still cannot reach another tenant’s data, because the connection itself is scoped to a single namespace. Furthermore, a tenant-aware proxy should sit between the agent and any backend services. This proxy strips any credentials the agent might erroneously inject into a request and unconditionally rewrites tenant identifiers based on the trusted session context, preventing the model from hallucinating or being tricked into accessing another tenant’s resources [28].</p>
  </li>
</ul>
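
<p>The tenant-aware proxy and namespace separation described above can be illustrated with a small request sanitizer. This is a minimal sketch, not a hardened implementation: the <code class="language-plaintext highlighter-rouge">Session</code> and <code class="language-plaintext highlighter-rouge">AgentRequest</code> types, the header list, and the collection naming scheme are all assumptions made for the example.</p>

```python
from dataclasses import dataclass, field

# Header names treated as credentials here: an assumption, not a complete list.
CREDENTIAL_HEADERS = {"authorization", "x-api-key", "cookie"}


@dataclass(frozen=True)
class Session:
    """Trusted context set by the platform at login, never by the model."""
    tenant_id: str


@dataclass
class AgentRequest:
    """A backend call proposed by the agent; every field is untrusted."""
    path: str
    headers: dict = field(default_factory=dict)
    params: dict = field(default_factory=dict)


def sanitize(request: AgentRequest, session: Session) -> AgentRequest:
    """Return a copy with credentials stripped and the tenant forced from the session."""
    headers = {
        name: value
        for name, value in request.headers.items()
        if name.lower() not in CREDENTIAL_HEADERS
    }
    params = dict(request.params)
    params["tenant_id"] = session.tenant_id  # overwrite unconditionally
    return AgentRequest(path=request.path, headers=headers, params=params)


def collection_for(session: Session, kind: str) -> str:
    """Namespace separation: derive the collection name from the trusted tenant."""
    return f"{session.tenant_id}.{kind}"


session = Session(tenant_id="tenant-a")
proposed = AgentRequest(
    path="/v1/metrics",
    headers={"Authorization": "Bearer injected-by-model", "Accept": "application/json"},
    params={"tenant_id": "tenant-b", "service": "api-gateway"},
)
safe = sanitize(proposed, session)
print(safe.params["tenant_id"])           # tenant-a
print(sorted(safe.headers))               # ['Accept']
print(collection_for(session, "traces"))  # tenant-a.traces
```

<p>The key design choice is that the tenant identifier is never read from the agent’s request. Even a fully compromised prompt can only ever name the tenant the session already belongs to.</p>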

<p><img src="https://prod-assets.cosmic.aws.dev/a/3A5xgciOz0TaylxMgq9tRwGKySQ/arch.webp?imgSize=571x961" alt="Diagram showing a secure multi-tenant architecture for an AI application." />
<em>This architectural diagram illustrates a multi-tenant AI platform, emphasizing data segregation and isolated processing, which are critical for enterprise security.</em></p>

<h3 id="governance-auditability-and-human-in-the-loop-hitl">Governance, Auditability, and Human-in-the-Loop (HITL)</h3>

<p>Beyond infrastructure security, robust governance and oversight mechanisms are essential for building trust and meeting regulatory requirements. Organizations should align their AI governance strategy with established frameworks like the <strong>NIST AI Risk Management Framework (AI RMF)</strong>, which structures AI risk management around four core functions: Govern, Map, Measure, and Manage [2].</p>

<p><strong>Role-Based Access Control (RBAC)</strong> must be extended to the AI agents themselves. Each agent should be treated as a non-human identity with its own set of permissions, adhering to the principle of least privilege [31]. An agent authorized to read observability data then cannot, for instance, execute a command that modifies a network device configuration unless its role explicitly grants that permission.</p>
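
<p>A least-privilege check for agent identities might look like the following deny-by-default sketch. The role names, permission names, and tool names are hypothetical; a real deployment would presumably load them from its identity provider or policy engine.</p>

```python
from enum import Enum


class Permission(Enum):
    READ_TELEMETRY = "read_telemetry"
    QUERY_GRAPH = "query_graph"
    MODIFY_NETWORK_CONFIG = "modify_network_config"


# Hypothetical roles assigned to agent identities.
ROLE_PERMISSIONS = {
    "observability-reader": {Permission.READ_TELEMETRY, Permission.QUERY_GRAPH},
    "network-operator": {
        Permission.READ_TELEMETRY,
        Permission.QUERY_GRAPH,
        Permission.MODIFY_NETWORK_CONFIG,
    },
}

# Hypothetical mapping from tool name to the permission it requires.
TOOL_REQUIREMENTS = {
    "get_latency_summary": Permission.READ_TELEMETRY,
    "push_device_config": Permission.MODIFY_NETWORK_CONFIG,
}


class PermissionDenied(Exception):
    """Raised when an agent role lacks the permission a tool requires."""


def authorize(agent_role: str, tool_name: str) -> None:
    """Deny by default: unknown roles and unknown tools are both rejected."""
    required = TOOL_REQUIREMENTS.get(tool_name)
    granted = ROLE_PERMISSIONS.get(agent_role, set())
    if required is None or required not in granted:
        raise PermissionDenied(f"{agent_role!r} may not call {tool_name!r}")


authorize("observability-reader", "get_latency_summary")  # allowed, returns None
try:
    authorize("observability-reader", "push_device_config")
except PermissionDenied as exc:
    print(exc)  # 'observability-reader' may not call 'push_device_config'
```

<p>Running the check in the tool dispatcher, rather than in the prompt, means the model cannot talk its way past it.</p>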

<p>A comprehensive <strong>AI Audit Trail</strong> is non-negotiable. Traditional logging is insufficient because it records what the system did but not why. An AI audit trail must also capture the <em>intent</em> and <em>reasoning</em> of the agent [29]: the full prompt, the retrieved context, the “thought” process or plan generated by the LLM, the specific tools called, and the final output [30]. This level of traceability is crucial for forensic analysis, debugging, and demonstrating compliance to regulators.</p>
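
<p>One way to make those audit fields concrete is a structured record serialized as a single JSON line per agent invocation. The field names below mirror the items listed above, but the record shape, the <code class="language-plaintext highlighter-rouge">ToolCall</code> type, and the sample values are assumptions for illustration only.</p>

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class ToolCall:
    name: str
    arguments: dict
    result_summary: str


@dataclass
class AuditRecord:
    tenant_id: str
    agent_id: str
    prompt: str              # full prompt sent to the model
    retrieved_context: list  # documents or graph facts supplied as context
    plan: list               # the model's stated plan or reasoning steps
    tool_calls: list         # ToolCall entries, in execution order
    final_output: str
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_json_line(record: AuditRecord) -> str:
    """Serialize one record as a single JSON line for an append-only log."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    tenant_id="tenant-a",
    agent_id="rca-orchestrator",
    prompt="Diagnose the latency spike on api-gateway.",
    retrieved_context=["api-gateway depends on db:orders"],
    plan=["confirm spike in metrics", "inspect slowest traces", "check BGP updates"],
    tool_calls=[
        ToolCall("get_latency_summary", {"service": "api-gateway"}, "p95 rose in window"),
    ],
    final_output="Most likely cause: slow queries against db:orders.",
)
line = to_json_line(record)
print(line)
```

<p>Because prompts and retrieved context can contain tenant data, such a log would need the same isolation and retention controls as the telemetry itself.</p>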

<p>Finally, for high-stakes actions, a <strong>Human-in-the-Loop (HITL)</strong> workflow is essential. While the AI agent can autonomously perform analysis and suggest remediation steps (e.g., “roll back deployment X” or “apply this traffic filter”), the final execution of any action that modifies the production environment must require explicit approval from a human operator [32]. To limit the latency this adds, escalation policies should let read-only analysis proceed without review and reserve human attention for state-changing, high-impact, or ambiguous decisions [33].</p>
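
<p>An escalation policy of this kind can be sketched as a small routing function placed in front of the executor. Everything here is hypothetical, including the <code class="language-plaintext highlighter-rouge">ProposedAction</code> fields, the self-reported confidence score, and the 0.8 threshold; the point is only that production-modifying actions never bypass approval.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProposedAction:
    description: str
    modifies_production: bool
    confidence: float  # agent's self-reported confidence in [0, 1]; an assumed field


def route(action: ProposedAction, min_confidence: float = 0.8) -> str:
    """Decide whether an action may run automatically or needs a human."""
    if action.modifies_production:
        return "require_approval"  # never automatic, regardless of confidence
    if action.confidence < min_confidence:
        return "require_approval"  # ambiguous analysis is escalated too
    return "auto_execute"


def execute(action: ProposedAction, approved_by: Optional[str] = None) -> str:
    """Run the action only if policy allows it or a human has signed off."""
    if route(action) == "require_approval" and approved_by is None:
        return f"PENDING APPROVAL: {action.description}"
    suffix = f" (approved by {approved_by})" if approved_by else ""
    return f"EXECUTED: {action.description}{suffix}"


rollback = ProposedAction("roll back deployment X", modifies_production=True, confidence=0.97)
summary = ProposedAction("summarize slowest traces", modifies_production=False, confidence=0.91)

print(execute(summary))                        # EXECUTED: summarize slowest traces
print(execute(rollback))                       # PENDING APPROVAL: roll back deployment X
print(execute(rollback, approved_by="oncall")) # EXECUTED: roll back deployment X (approved by oncall)
```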

<h3 id="comparison-of-security-isolation-models">Comparison of Security Isolation Models</h3>

<p>The choice of an isolation model involves a trade-off between security, cost, and complexity. The following table compares different approaches an organization might consider when architecting a multi-tenant AI SaaS product.</p>

<table>
  <thead>
    <tr>
      <th style="text-align: left">Isolation Model</th>
      <th style="text-align: left">Isolation Strength</th>
      <th style="text-align: left">Typical Cost</th>
      <th style="text-align: left">Implementation Complexity</th>
      <th style="text-align: left">Performance Overhead</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td style="text-align: left"><strong>Logical (Row-Level Security)</strong></td>
      <td style="text-align: left">Weakest</td>
      <td style="text-align: left">Low</td>
      <td style="text-align: left">Low</td>
      <td style="text-align: left">Minimal</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Application-Level Namespacing</strong></td>
      <td style="text-align: left">Moderate</td>
      <td style="text-align: left">Low</td>
      <td style="text-align: left">Moderate</td>
      <td style="text-align: left">Low</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Container-Based Isolation</strong></td>
      <td style="text-align: left">Strong</td>
      <td style="text-align: left">Moderate</td>
      <td style="text-align: left">Moderate</td>
      <td style="text-align: left">Moderate</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>MicroVM-Based Isolation</strong></td>
      <td style="text-align: left">Strongest</td>
      <td style="text-align: left">High</td>
      <td style="text-align: left">High</td>
      <td style="text-align: left">High</td>
    </tr>
    <tr>
      <td style="text-align: left"><strong>Physical (Dedicated Hardware)</strong></td>
      <td style="text-align: left">Absolute</td>
      <td style="text-align: left">Very High</td>
      <td style="text-align: left">Very High</td>
      <td style="text-align: left">None (per tenant)</td>
    </tr>
  </tbody>
</table>

<p>For a network observability SaaS product handling sensitive customer data, a model combining application-level namespacing for data storage with microVM-based isolation for compute workloads offers a strong balance of security and scalability [26]. While more complex to implement than simpler models, this hybrid approach provides the necessary defense-in-depth to earn the trust of enterprise customers.</p>

<h2 id="conclusion">Conclusion</h2>

<p>The evolution of AI assistants into persistent, autonomous agents marks a pivotal moment for enterprise software, particularly in complex domains like network observability. The architectural patterns that have emerged—sophisticated multi-layered memory, advanced planning and orchestration frameworks, and repository-aware context management—provide the foundational components for building truly intelligent systems. These agents are no longer just tools but are becoming active collaborators, capable of automating the cognitively demanding work of incident investigation and root cause analysis.</p>

<p>This report has proposed a comprehensive architecture for a network observability copilot, one that leverages these principles to tame the overwhelming firehose of modern telemetry data. By integrating a GraphRAG knowledge base that unifies service topology with network path information, such a system can perform complex, multi-hop reasoning, correlating signals across application and network layers to provide rapid, evidence-backed insights. The potential to drastically reduce mean time to resolution, eliminate dependency on tribal knowledge, and shift human operators from reactive firefighting to proactive optimization is immense.</p>

<p>These capabilities carry matching obligations. The successful deployment of agentic AI in a multi-tenant SaaS environment is contingent upon a security-first approach. Robust structural isolation, fine-grained access control, comprehensive auditability of agent reasoning, and consistent human oversight for critical actions are not optional features but core design requirements. As we move forward, the organizations that succeed will be those that innovate boldly with agentic AI while grounding their systems in security, governance, and trust.</p>

<h1 id="references">References</h1>
<ol>
  <li><a href="https://docs.github.com/en/enterprise-cloud@latest/admin/enforcing-policies/enforcing-policies-for-your-enterprise/enforcing-policies-for-github-copilot-in-your-enterprise">Enforcing policies for GitHub Copilot in your enterprise - GitHub Docs</a></li>
  <li><a href="https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf">NIST AI Risk Management Framework (AI RMF 1.0), Generative Artificial Intelligence Profile (NIST AI 600-1), July 2024</a></li>
  <li><a href="https://blogs.oracle.com/developers/what-is-the-ai-agent-loop-the-core-architecture-behind-autonomous-ai-systems">“What is the AI Agent Loop?”, Oracle Developer Resource Center</a></li>
  <li><a href="https://arxiv.org/html/2603.07670v1">A Survey on Large Language Model-based Autonomous Agents, March 2024</a></li>
  <li><a href="https://towardsdatascience.com/a-practical-guide-to-memory-for-autonomous-llm-agents/">A Practical Guide to Memory for Autonomous LLM Agents, Towards Data Science</a></li>
  <li><a href="https://arxiv.org/html/2602.05665v1">From Language Models to Practical Agents: A Survey on Memory Mechanisms, February 2024</a></li>
  <li><a href="https://arxiv.org/abs/2210.03629">ReAct: Synergizing Reasoning and Acting in Language Models, October 2022</a></li>
  <li><a href="https://arxiv.org/html/2511.10037v1">Beyond ReAct: A Plan-centric Approach for Action-Generation Models, November 2023</a></li>
  <li><a href="https://arxiv.org/html/2512.03560v1">“Hierarchical LLM-agent for solving complex enterprise-scenario tool-use tasks”, December 2023</a></li>
  <li><a href="https://www.langchain.com/blog/agentic-engineering-redefining-software-engineering">“Agentic Engineering: Redefining Software Engineering with AI”, LangChain Blog</a></li>
  <li><a href="https://arxiv.org/html/2603.27277v1">Codebase-Memory: A Code Language Model with Repository-Scale Read and Write Operations, March 2024</a></li>
  <li><a href="https://www.wiz.io/academy/ai-security/cursor-vs-github">“GitHub Copilot vs Cursor”, Wiz Academy</a></li>
  <li><a href="https://newsletter.pragmaticengineer.com/p/cursor">“Cursor: The AI-first Code Editor”, The Pragmatic Engineer</a></li>
  <li><a href="https://www.digitalapplied.com/blog/github-copilot-vs-cursor-vs-windsurf-ai-coding-assistants">“GitHub Copilot vs Cursor vs Windsurf”, Digital Applied</a></li>
  <li><a href="https://www.site24x7.com/learn/correlating-metrics-traces-logs.html">“Correlating Metrics, Traces, and Logs for Holistic Visibility”, Site24x7</a></li>
  <li><a href="https://www.kentik.com/kentipedia/bgp-ipfix-analysis/">“BGP in Flow Analysis (IPFIX/NetFlow)”, Kentik</a></li>
  <li><a href="https://www.catchpoint.com/network-experience/network-reachability">“Mastering Network Path Observability”, Catchpoint</a></li>
  <li><a href="https://arxiv.org/html/2502.06864v1">From Local to Global: A Graph-Based Approach for Multi-Document RAG, February 2024</a></li>
  <li><a href="https://arxiv.org/html/2507.03226v2">GraphRAG: A Cost-Effective Framework for Enterprise-Level Retrieval-Augmented Generation, July 2024</a></li>
  <li><a href="https://coroot.com/ai">“AI Root Cause Analysis”, Coroot</a></li>
  <li><a href="https://incident.io/blog/what-is-ai-sre-complete-guide-2026">“What is AI SRE? The Ultimate Guide for 2026”, incident.io</a></li>
  <li><a href="https://arxiv.org/html/2506.04514v1">BEAR: A Framework for BGP Event Analysis and Reporting with Large Language Models (LLMs), June 2024</a></li>
  <li><a href="https://owasp.org/www-project-top-10-for-large-language-model-applications/">OWASP Top 10 for Large Language Model Applications</a></li>
  <li><a href="https://arxiv.org/html/2604.05440v1">The Security of LLM-based Applications in a Multi-tenant Environment, April 2024</a></li>
  <li><a href="https://firecracker-microvm.github.io/">Firecracker MicroVM</a></li>
  <li><a href="https://blaxel.ai/blog/multi-tenant-isolation-ai-agents">“SaaS Isolation for AI Agents: A Deep Dive into Multi-Tenant Security”, Blaxel</a></li>
  <li><a href="https://dzone.com/articles/isolation-boundaries-multi-tenant-ai-architecture-guardrail">“Isolation Boundaries: A Guide to Multi-Tenant AI Architecture Guardrails”, DZone</a></li>
  <li><a href="https://dev.to/ksankar/defense-in-depth-tenant-isolation-for-an-agent-that-executes-code-375j">“Defense in Depth: Tenant Isolation for an Agent That Executes Code”, Dev.to</a></li>
  <li><a href="https://www.loginradius.com/blog/engineering/auditing-and-logging-ai-agent-activity">“Auditing and Logging AI Agent Activity: A Developer’s Guide”, LoginRadius Engineering Blog</a></li>
  <li><a href="https://www.cxtoday.com/security-privacy-compliance/ai-audit-trail-regulatory-scrutiny/">“AI Audit Trail: Navigating Regulatory Scrutiny”, CXtoday</a></li>
  <li><a href="https://techcommunity.microsoft.com/blog/microsoftdefendercloudblog/architecting-trust-a-nist-based-security-governance-framework-for-ai-agents/4490556">“Architecting Trust: A NIST-based Security Governance Framework for AI Agents”, Microsoft Defender for Cloud Blog</a></li>
  <li><a href="https://www.elementum.ai/blog/human-in-the-loop-agentic-ai">“Human in the Loop for Agentic AI”, Elementum</a></li>
  <li><a href="https://www.comet.com/site/blog/human-in-the-loop/">The Human-in-the-Loop Framework at Comet</a></li>
</ol>]]></content><author><name>Marc Buraczynski</name></author><category term="agentic AI" /><category term="network observability" /><category term="Cursor" /><category term="LLM architecture" /><category term="GraphRAG" /><category term="multi-tenant security" /><summary type="html"><![CDATA[Authored by: Marc Buraczynski Publication Date: 2026-06-09]]></summary></entry><entry><title type="html">Advancing Network Observability with Custom-Developed Machine Learning Models</title><link href="https://gunnymarc.github.io/posts/2026/06/advancing-network-observability-ml-models/" rel="alternate" type="text/html" title="Advancing Network Observability with Custom-Developed Machine Learning Models" /><published>2026-06-02T00:00:00-04:00</published><updated>2026-06-02T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/06/Advancing%20Network%20Observability%20with%20Custom-Developed%20Machine%20Learning%20Models</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/06/advancing-network-observability-ml-models/"><![CDATA[<p><em>15 min read</em></p>

<hr />

<p>Here’s a scenario that will feel painfully familiar to anyone who’s run a network operations center in the last five years.</p>

<p>It’s 2:47 AM. A critical SaaS application starts degrading for users across three regions. The monitoring dashboard lights up — but it lit up <em>after</em> users started complaining. Your on-call engineer begins the investigation, correlating alerts across half a dozen tools. <strong>Three hours later</strong>, they find the root cause: a subtle BGP routing change by an upstream provider that cascaded into latency spikes across multiple paths.</p>

<p>The data that predicted this failure? It was sitting in your telemetry pipeline the entire time. Nobody — and no <em>thing</em> — was looking at it the right way.</p>

<p>This is the gap that <strong>custom-developed machine learning models</strong> close. Not generic, off-the-shelf analytics packages that treat every network the same. But models trained specifically on <em>your</em> network’s unique behavior, topology, and traffic patterns. And it’s not theoretical anymore.</p>

<hr />

<h2 id="the-uncomfortable-truth-about-threshold-based-monitoring">The Uncomfortable Truth About Threshold-Based Monitoring</h2>

<p>Let’s be honest about something the industry has danced around for too long.</p>

<p>Traditional network monitoring — the kind built on static thresholds — was designed for a world that no longer exists. A world where networks were relatively contained, failure modes were predictable, and an engineer could reasonably hold the full topology in their head.</p>

<p><strong>That world is gone.</strong></p>

<p>Today’s enterprise networks span on-premises data centers, multiple cloud providers, SaaS applications, global WAN links, and millions of endpoints. The combinations of failure modes that can emerge from this complexity are simply too numerous to enumerate in advance.</p>

<p>And yet, most organizations are still running monitoring stacks that operate on a simple principle: <em>“Alert me when metric X crosses threshold Y.”</em></p>

<p>The problems with this approach are well-documented but worth restating:</p>

<ul>
  <li><strong>It can only detect problems it was programmed to look for.</strong> Novel failure modes — the ones that actually cause the worst outages — slip through entirely.</li>
  <li><strong>Thresholds go stale.</strong> A threshold that worked perfectly last quarter may miss a new type of degradation entirely — or flood your team with false positives after a routine infrastructure change.</li>
  <li><strong>It’s reactive by design.</strong> By the time a threshold fires, the damage is already happening. Users are already impacted. Revenue is already at risk.</li>
</ul>
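<p>To make the "reactive by design" point concrete, here is a minimal sketch that compares a fixed threshold with a baseline learned from recent history. Everything in it is an assumption for illustration: the synthetic latency series, the 100 ms limit, the 30-sample window, and the z-score cutoff of 3. It uses a simple rolling z-score, not a trained ML model.</p>

```python
from statistics import mean, stdev

def first_static_alert(series, limit):
    """Index of the first sample above a fixed limit, or None."""
    for i, value in enumerate(series):
        if value > limit:
            return i
    return None

def first_adaptive_alert(series, window=30, z_limit=3.0):
    """Index of the first sample more than z_limit standard deviations
    above the mean of the preceding `window` samples, or None."""
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu, sigma = mean(history), stdev(history)
        if sigma > 0 and (series[i] - mu) / sigma > z_limit:
            return i
    return None

# Synthetic latency in ms (hypothetical): a stable period, then a slow ramp.
steady = [40 + 2 * (i % 2) for i in range(60)]
ramp = [44 + 4 * k for k in range(20)]
latency = steady + ramp

static_idx = first_static_alert(latency, limit=100)
adaptive_idx = first_adaptive_alert(latency)
print(f"static threshold fires at sample {static_idx}, "
      f"learned baseline fires at sample {adaptive_idx}")
```

<p>On this toy series the learned baseline reacts near the start of the ramp, while the static limit stays silent until latency has already more than doubled. Real telemetry is noisier, so the window and cutoff would need tuning per path.</p>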

<p>The fundamental limitation isn’t in the tooling. It’s in the <em>paradigm</em>. Threshold-based monitoring tells you when something is already broken. What we need is a system that tells us when something is <strong>about to break</strong> — and ideally, <em>why</em>.</p>

<p>That’s not a monitoring problem. That’s a machine learning problem.</p>

<hr />

<h2 id="why-custom-ml-models--and-why-generic-solutions-fall-short">Why Custom ML Models — and Why Generic Solutions Fall Short</h2>

<p>Machine learning isn’t new to IT operations. But here’s the uncomfortable reality: <strong>your network is not like anyone else’s network.</strong></p>

<p>Every enterprise network is fundamentally unique — its specific topology, application portfolio, traffic patterns, and operational history create a fingerprint that generic ML models can’t learn. An anomaly detection system tuned for SaaS traffic will generate noise when applied to financial services or manufacturing IoT environments.</p>

<p><strong>This is why custom model development — models trained specifically on your organization’s telemetry — has become the dividing line between ML deployments that deliver transformative value and those that become expensive shelfware.</strong></p>

<p>Three factors make custom ML practical:</p>

<p><strong>1. The data is good enough.</strong> ThousandEyes provides structured, normalized, multi-dimensional telemetry across network layers — precisely what ML models need.</p>

<p><strong>2. The models have matured.</strong> Autoencoders, Graph Neural Networks, Transformer architectures — production-ready and well-understood.</p>

<p><strong>3. The business case is proven.</strong> Organizations deploying custom ML report <strong>60–80% MTTR reductions</strong> and <strong>60–80% fewer false-positive alerts</strong>.</p>

<p>Custom models deliver four advantages that generic solutions cannot:</p>

<ul>
  <li><strong>Pattern recognition tuned to YOUR network</strong> — learning your unique topology, application mix, and traffic patterns, flagging deviations that matter <em>in your context</em></li>
  <li><strong>Continuous adaptability to YOUR changes</strong> — retraining on your ongoing telemetry as your environment evolves, no stale thresholds</li>
  <li><strong>Cross-domain correlation for YOUR stack</strong> — finding relationships between the systems you actually run, not generic assumptions</li>
  <li><strong>Predictive power aligned with YOUR SLAs</strong> — raising alerts <strong>10–30 minutes before service impact</strong>, tuned to your performance baselines</li>
</ul>

<hr />

<h2 id="thousandeyes-as-the-data-foundation">ThousandEyes as the Data Foundation</h2>

<p><strong>Models are only as good as the data they consume.</strong> ThousandEyes provides the ideal foundation for custom ML:</p>

<ul>
  <li><strong>End-to-end path visibility</strong> across internet, cloud, SaaS, and enterprise segments</li>
  <li><strong>Active and passive monitoring</strong> via globally distributed agents that proactively test paths</li>
  <li><strong>Cross-layer correlation</strong> connecting network events to application outcomes in a single data model</li>
</ul>

<p>This structured, time-synchronized telemetry makes rapid custom model development possible. The heavy lifting — data collection, normalization, correlation — is already done. Custom models extend ThousandEyes by learning your organization-specific patterns and delivering predictive capabilities.</p>

<p><strong>Think of it this way:</strong> ThousandEyes handles “what’s happening.” Custom models add “what’s about to happen” and “why” — tuned to your network’s unique characteristics.</p>

<p><img src="/images/v2_diag1_te_pipeline.png" alt="How Custom ML Models Enhance ThousandEyes" /></p>

<p><em>The pipeline shows how ThousandEyes data sources feed custom ML models to deliver actionable business outcomes tailored to your organization.</em></p>

<hr />

<h2 id="why-custom-models-are-non-negotiable">Why Custom Models Are Non-Negotiable</h2>

<p><strong>Why can’t a vendor sell you a pre-trained model that works out of the box?</strong></p>

<p>Because your network has a unique fingerprint that generic models can’t learn:</p>

<p><strong>Unique Topology:</strong> Your San Francisco-to-Singapore path routes through specific transit providers with predictable congestion at 18:00 UTC. Your application stack spikes every time the nightly ETL runs. Generic models trained on aggregated industry data never see these patterns.</p>

<p><strong>Unique Applications:</strong> A financial services firm’s network during market open looks nothing like a retailer’s during Black Friday. Custom models learn <em>your</em> definition of “busy,” <em>your</em> traffic distribution, <em>your</em> normal failure modes.</p>

<p><strong>Unique SLAs:</strong> A 10ms latency increase is noise for a CDN but catastrophic for a trading platform. Custom models learn YOUR thresholds, YOUR priorities, YOUR acceptable trade-offs.</p>

<p><strong>The False Positive Problem:</strong> Generic models flag everything statistically unusual because they don’t know your context. Custom models trained on 60–90 days of <em>your</em> telemetry learn what “unusual but normal” means — planned maintenance signatures, expected business-driven traffic spikes, benign infrastructure quirks. <strong>Result: 60–80% fewer false positives</strong> because the model knows your network’s personality.</p>

<p><strong>Proprietary Systems:</strong> Many organizations run custom-built applications or industry-specific infrastructure that doesn’t exist in any vendor’s training dataset. Custom model development is the <em>only</em> path to ML-enhanced observability for these environments.</p>

<hr />

<h2 id="real-world-customization-why-industry-context-matters">Real-World Customization: Why Industry Context Matters</h2>

<p>Custom models adapt to fundamentally different operational realities. Three examples illustrate the point:</p>

<p><strong>Financial Services:</strong> A trading platform needs anomaly detection trained on latency <em>variance</em>, not absolute values — a 2ms spike at 9:29 AM (market open) is catastrophic; the same spike at 3 PM is noise. Custom models learn market hours, understand trading volume patterns, and prioritize specific market data feeds. Generic models can’t distinguish between critical pre-market latency and routine afternoon variation.</p>

<p><strong>E-Commerce/Retail:</strong> Seasonal traffic variations (Black Friday, Cyber Monday) would trigger constant false positives in generic anomaly detectors. Custom models learn that Black Friday traffic is <em>expected</em> and instead watch for deviations from the expected Black Friday pattern. Capacity forecasting aligns with promotional calendars and campaign-driven spikes, not industry-average growth curves.</p>

<p><strong>Healthcare/Telehealth:</strong> HIPAA compliance constrains what can be logged. Custom models use privacy-preserving feature engineering, learn healthcare operational rhythms (shift changes, morning rounds, appointment patterns), and understand that telehealth video quality thresholds differ from consumer video streaming. Generic models trained on SaaS or retail networks miss the unique signatures of EHR systems and medical imaging transfers.</p>

<p><strong>The pattern:</strong> Custom models learn the specific rhythm, priorities, and failure modes of <em>your</em> industry and network — not statistical averages across all networks.</p>

<hr />

<h2 id="four-custom-model-applications-that-deliver-real-impact">Four Custom Model Applications That Deliver Real Impact</h2>

<h3 id="1-custom-anomaly-detection-that-actually-works">1. Custom Anomaly Detection That Actually Works</h3>

<p>Traditional anomaly detection has a credibility problem. Too many false positives have trained operations teams to ignore alerts — which means real problems get buried in noise.</p>

<p><strong>Custom-developed</strong> anomaly detection takes a fundamentally different approach. Instead of comparing metrics against static thresholds <em>or</em> using generic pre-trained models, it uses <strong>autoencoders trained exclusively on your network’s telemetry</strong> — a class of neural network that learns what normal behavior looks like for <em>your specific network</em>, including all its time-of-day patterns, seasonal variations, traffic profiles, and infrastructure-specific quirks.</p>

<p>Think of it like a veteran NOC engineer who has worked <em>your</em> network for years and intuitively knows when something “feels off,” even before they can articulate why. The custom model does the same — it’s trained exclusively on <em>your</em> normal behavior and flags anything it can’t recognize as normal <em>in your context</em>.</p>

<p><strong>The difference between generic and custom models is stark:</strong></p>

<p>A generic anomaly detector might flag your planned nightly backup job as an anomaly because traffic suddenly spikes at 2 AM. A custom model trained on your data knows that pattern is expected and ignores it — while catching the <em>unusual</em> 2 AM spike that indicates a problem.</p>
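<p>The nightly backup example can be sketched in a few lines. Note the substitution: this is a per-hour statistical baseline standing in for the autoencoder, chosen because it fits in a short example. The traffic values, the 30 days of history, and the z-score limit are all hypothetical.</p>

```python
from collections import defaultdict
from statistics import mean, stdev

def fit_hourly_baseline(history):
    """history: iterable of (hour_of_day, value) pairs.
    Returns {hour: (mean, stdev)} learned from that history."""
    by_hour = defaultdict(list)
    for hour, value in history:
        by_hour[hour].append(value)
    return {h: (mean(v), stdev(v)) for h, v in by_hour.items()}

def is_anomalous(baseline, hour, value, z_limit=3.0):
    """True when value is more than z_limit deviations from that hour's norm."""
    mu, sigma = baseline[hour]
    return abs(value - mu) > z_limit * max(sigma, 1e-9)

# Hypothetical 30 days of traffic (Mbps): about 100 all day,
# except a backup job that pushes 02:00 to about 900.
history = [
    (hour, (900 if hour == 2 else 100) + 5 * ((day % 3) - 1))
    for day in range(30)
    for hour in range(24)
]
baseline = fit_hourly_baseline(history)

print(is_anomalous(baseline, 2, 905))    # the usual backup spike
print(is_anomalous(baseline, 2, 1500))   # far beyond the usual backup
print(is_anomalous(baseline, 14, 900))   # backup-sized traffic mid-afternoon
```

<p>The same 900 Mbps reading is routine at 02:00 and suspicious at 14:00, which is the context a single global threshold cannot express. An autoencoder extends this idea to many metrics at once by scoring how poorly it reconstructs a sample.</p>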

<p><img src="/images/v2_diag2_anomaly.png" alt="Predictive Anomaly Detection — How It Works" /></p>

<p><em>This diagram illustrates the workflow from telemetry collection through baseline learning to real-time anomaly detection and alerting.</em></p>

<p><strong>The results speak for themselves:</strong></p>

<ul>
  <li>Catch degradation <strong>10–30 minutes before users notice</strong> service impact</li>
  <li>Reduce false-positive alert volume by <strong>60–80%</strong> compared to threshold alerting <em>and generic ML</em></li>
  <li>Detect subtle, multi-metric patterns unique to your topology that no generic solution would surface</li>
</ul>

<p>ThousandEyes’ path trace data, latency distributions, and BGP event feeds provide the training dataset. The custom model learns what <em>your</em> routing topology normally looks like, what <em>your</em> CDN behavior patterns are, and what changes matter <em>in your environment</em>. Changes that would normally require manual investigation become automatically detectable signals — tailored to your network’s fingerprint.</p>

<h3 id="2-custom-root-cause-analysis-models">2. Custom Root Cause Analysis Models</h3>

<p>This is where custom ML delivers its most dramatic operational impact.</p>

<p>When a service degrades, the visible symptom — slow application response, packet loss — is rarely the root cause. Something upstream triggered it: a routing change, a link failure, a misconfigured device. Finding that root cause typically involves a “war room” of engineers manually correlating data across multiple tools for hours.</p>

<p><strong>Graph Neural Networks trained on your topology</strong> change this equation entirely. By representing <em>your</em> network as a graph — devices as nodes, connections as edges — the custom model learns the dependency relationships between every component <em>in your specific environment</em>. When an alert fires, it propagates the signal back through the graph, computing which upstream events most likely caused the observed downstream effect <em>based on historical patterns it learned from your incident data</em>.</p>

<p>Here’s why customization is critical: <strong>Your network’s causal relationships are unique.</strong></p>

<p>In your environment, a specific upstream BGP change might predictably impact certain downstream paths due to your routing policy. A generic model doesn’t know that relationship. A custom model trained on your topology and historical incidents <em>does</em> — it’s learned from every previous failure how problems propagate through <em>your</em> infrastructure.</p>

<p>The output? A <strong>ranked list of probable root causes, each with supporting evidence drawn from your network’s historical behavior</strong> — delivered in seconds rather than hours.</p>
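<p>A rough sketch of how such a ranked list could be produced is shown here. It is deliberately not a Graph Neural Network: it is a best-first walk over a dependency graph whose edge weights stand in for propagation likelihoods that a trained model would learn from incident history. The node names, weights, and events are invented for illustration.</p>

```python
import heapq

def rank_root_causes(upstream_of, recent_events, alerted_node):
    """Rank nodes with recent events by the strongest upstream path
    from the alerted node. upstream_of maps node -> {upstream: weight},
    with each weight in (0, 1]; a path's score is the product of its weights."""
    best = {alerted_node: 1.0}
    heap = [(-1.0, alerted_node)]
    while heap:
        neg_score, node = heapq.heappop(heap)
        score = -neg_score
        if score < best.get(node, 0.0):
            continue  # stale heap entry; a stronger path was already found
        for upstream, weight in upstream_of.get(node, {}).items():
            candidate = score * weight
            if candidate > best.get(upstream, 0.0):
                best[upstream] = candidate
                heapq.heappush(heap, (-candidate, upstream))
    ranked = [
        (node, best[node], recent_events[node])
        for node in recent_events
        if node in best and node != alerted_node
    ]
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical topology and propagation weights.
upstream_of = {
    "app-frontend": {"edge-router-1": 0.9, "dns-resolver": 0.3},
    "edge-router-1": {"transit-provider-a": 0.8, "core-switch-2": 0.4},
    "dns-resolver": {"core-switch-2": 0.5},
}
recent_events = {
    "transit-provider-a": "BGP path change",
    "core-switch-2": "interface flap",
}

for node, score, evidence in rank_root_causes(upstream_of, recent_events, "app-frontend"):
    print(f"{node}: score={score:.2f} ({evidence})")
```

<p>Because every weight is at most 1, path scores can only shrink as the walk moves upstream, so the best-first search settles each node at its strongest path. In a learned model the weights would come from training rather than being set by hand.</p>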

<p>ThousandEyes’ hop-by-hop path data and BGP routing intelligence give the custom graph model a precise, real-time map of <em>your</em> network’s active topology. The model learns which paths matter most in your environment, which upstream dependencies are critical, and which failure modes you’ve seen before. This makes causal tracing far more accurate than generic approaches based on static network diagrams, CMDB data, or industry-average dependency models.</p>

<p><strong>Impact: MTTR drops from hours to minutes</strong> for complex, multi-hop failures — because the model understands <em>your</em> network’s unique failure signatures.</p>

<h3 id="3-custom-performance-forecasting-and-capacity-planning">3. Custom Performance Forecasting and Capacity Planning</h3>

<p>Over-provisioning wastes money. Under-provisioning causes outages. The traditional approach to capacity planning — a mix of gut feel, historical averages, and generous safety margins — is expensive and unreliable.</p>

<p><strong>Custom time-series forecasting models</strong> trained on <em>your historical traffic patterns</em> change this equation. Using Temporal Convolutional Networks or Transformer-based architectures, these models learn <em>your</em> specific demand patterns: business-driven traffic cycles, seasonal variations unique to your industry, growth trends specific to your applications, and the characteristic signatures of your peak usage periods.</p>
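<p>For readers unfamiliar with Temporal Convolutional Networks, the core building block is a dilated causal convolution: each output depends only on the current and earlier samples, and the dilation controls how far back it reaches. The sketch below shows that single operation in plain Python. It is not a trained forecaster, and the kernel values are arbitrary.</p>

```python
def causal_dilated_conv(series, kernel, dilation):
    """y[t] = sum over k of kernel[k] * series[t - k * dilation].
    Samples before the start of the series count as zero, so y[t]
    never depends on anything after time t."""
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += weight * series[idx]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """How many past samples one output can see after stacking one
    layer per dilation, each with the same kernel size."""
    return 1 + (kernel_size - 1) * sum(dilations)

traffic = [1, 2, 3, 4, 5, 6, 7, 8]
smoothed = causal_dilated_conv(traffic, kernel=[0.5, 0.5], dilation=2)
print(smoothed)
print(receptive_field(kernel_size=2, dilations=[1, 2, 4]))
```

<p>Doubling the dilation at each layer is what lets a TCN cover long histories, such as weekly cycles, with only a handful of layers.</p>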

<p>A generic forecasting model might predict capacity needs based on industry averages or simple trend extrapolation. A custom model knows that <em>your</em> SaaS application sees predictable traffic spikes every Monday at 9 AM when users return from the weekend, that <em>your</em> e-commerce platform experiences specific seasonal patterns tied to your product launches, and that <em>your</em> video conferencing infrastructure has grown at a specific rate correlated with your headcount expansion.</p>

<p><strong>Custom graph-based models</strong> trained on your topology analyze where traffic can be redistributed within <em>your specific infrastructure</em> to improve efficiency — accounting for your routing policies, your multi-cloud architecture, and your business-critical path priorities.</p>

<p>The result: <strong>data-driven confidence</strong> in capacity decisions tailored to your business context — reducing over-provisioning costs while maintaining the SLA headroom <em>your</em> applications require and preempting congestion events before they impact <em>your</em> users.</p>

<h3 id="4-custom-security-threat-detection-beyond-signatures">4. Custom Security Threat Detection Beyond Signatures</h3>

<p>Traditional security tools detect threats that have been seen before and catalogued. The most dangerous attacks — zero-day exploits, novel exfiltration methods, sophisticated lateral movement — are by definition <em>not yet in any signature database</em>.</p>

<p><strong>Custom behavioral detection models</strong> trained on <em>your</em> network’s traffic patterns close this gap. By learning the statistical patterns of normal behavior <em>in your specific environment</em>, a custom model flags any significant deviation as potentially suspicious — regardless of whether the specific attack technique has been seen before.</p>

<p>Here’s why the custom approach is essential for security: <strong>What’s “normal” in your network is fundamentally different from what’s normal elsewhere.</strong></p>

<p>Your organization has unique traffic flows: specific applications that communicate with specific external services, characteristic usage patterns tied to your business operations, expected data transfer volumes between segments, and normal employee behavior patterns. A custom security model learns these patterns from your data and flags deviations <em>in your context</em>.</p>

<p><strong>Example:</strong> A generic model might flag your research team’s legitimate large file transfers to cloud storage as potential exfiltration because the volume is “unusual.” A custom model trained on your data knows this is normal <em>for your organization</em> and ignores it — while flagging the truly anomalous transfer from accounting to an unknown external destination.</p>

<p><strong>What custom behavioral models catch that generic solutions miss:</strong></p>

<ul>
  <li><strong>DDoS campaigns</strong> — anomalous inbound traffic volume and source-IP distribution <em>relative to YOUR baseline</em></li>
  <li><strong>Data exfiltration</strong> — unusual outbound flows to unexpected destinations at unusual times <em>for YOUR organization</em></li>
  <li><strong>Lateral movement</strong> — abnormal inter-segment communication that violates <em>YOUR</em> learned traffic norms and segmentation policies</li>
  <li><strong>Beaconing / C2 communication</strong> — distinctive timing patterns in DNS queries or connection intervals that deviate from <em>YOUR</em> normal application behavior</li>
</ul>
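<p>The beaconing item lends itself to a small illustration. One simple heuristic, assumed here and not taken from any particular product, is that automated check-ins arrive at nearly constant intervals, so the coefficient of variation of the gaps between connections is very low. The cutoff of 0.1 and the minimum of 10 events are hypothetical and would need tuning against your own traffic.</p>

```python
from statistics import mean, pstdev

def interval_regularity(timestamps):
    """Coefficient of variation of the gaps between consecutive events.
    Lower means more machine-like regularity. Returns None if undefined."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if len(gaps) < 2 or mean(gaps) == 0:
        return None
    return pstdev(gaps) / mean(gaps)

def looks_like_beacon(timestamps, max_cv=0.1, min_events=10):
    """Flag a connection series whose timing is suspiciously regular."""
    if len(timestamps) < min_events:
        return False
    cv = interval_regularity(timestamps)
    return cv is not None and cv < max_cv

# Hypothetical connection times in seconds.
implant = [60 * i + (i % 2) for i in range(20)]   # every ~60 s with slight jitter
human = [0, 5, 7, 90, 95, 300, 310, 900, 905, 2000, 2003, 2500]

print(looks_like_beacon(implant))
print(looks_like_beacon(human))
```

<p>Legitimate software also polls on fixed schedules, which is why the learned baseline matters: the interesting signal is regular timing toward a destination that your normal applications never contact.</p>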

<p>The strongest security posture combines custom ML behavioral detection (which catches the unknown threats unique to your attack surface) with signature-based detection (which catches known threats) — each compensating for the weaknesses of the other, both tuned to your environment.</p>

<hr />

<h2 id="the-custom-model-development-lifecycle">The Custom Model Development Lifecycle</h2>

<p>Five stages from development to production:</p>

<p><strong>1. Data Collection &amp; Feature Engineering:</strong> Gather 60–90 days of ThousandEyes telemetry. Network engineers and data scientists identify which paths, metrics, and patterns matter for your SLAs. Feature selection tailored to your business.</p>

<p><strong>2. Model Training &amp; Validation:</strong> Train exclusively on your data. The model learns your nightly patterns, predictable congestion signatures, and application-specific latency profiles.</p>

<p><strong>3. Deployment &amp; Integration:</strong> Start in “shadow mode” — predictions run but don’t trigger alerts yet. Validate accuracy against actual incidents before transitioning to active alerting.</p>

<p><strong>4. Continuous Monitoring &amp; Drift Detection:</strong> Automated tracking detects when “normal” changes. When drift exceeds thresholds, trigger retraining on recent data.</p>

<p><strong>5. Iterative Refinement:</strong> Incident retrospectives feed back into training. Every incident resolved makes the model smarter.</p>
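<p>Stage 4 is the easiest of these to show in code. One common way to quantify how far "normal" has moved is the Population Stability Index (PSI), which compares the distribution of a metric in the training window against a recent window. This is one option among several, not necessarily what any given MLOps platform uses, and the 0.25 retraining trigger is a widely quoted rule of thumb rather than a universal constant.</p>

```python
import math

def population_stability_index(reference, recent, bins=10):
    """PSI between two samples, binned on the reference sample's range.
    0 means identical binned distributions; larger means more drift."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for v in values:
            i = int((v - lo) / width)
            counts[min(max(i, 0), bins - 1)] += 1  # clamp out-of-range values
        # Add-half smoothing keeps empty bins from producing log(0).
        return [(c + 0.5) / (len(values) + 0.5 * bins) for c in counts]

    ref_p, new_p = proportions(reference), proportions(recent)
    return sum((n - r) * math.log(n / r) for r, n in zip(ref_p, new_p))

def needs_retraining(reference, recent, limit=0.25):
    return population_stability_index(reference, recent) > limit

# Hypothetical latency samples (ms) from the training window and from
# last week, after a routing change shifted the path by about 15 ms.
training_window = [40 + (i % 20) for i in range(1000)]
last_week = [v + 15 for v in training_window]

print(needs_retraining(training_window, training_window))
print(needs_retraining(training_window, last_week))
```

<p>Running this check per metric and per path on a schedule gives the automated retraining trigger described in stage 4, and the PSI value itself is a useful number to record for model governance.</p>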

<hr />

<h2 id="the-architecture">The Architecture</h2>

<p>The architecture has <strong>four layers</strong>:</p>

<p><strong>Layer 1 — Data (ThousandEyes).</strong> ThousandEyes collects, normalizes, and structures telemetry. Your custom models consume this data.</p>

<p><strong>Layer 2 — Custom ML Models.</strong> Specialized models for each task — anomaly detection, root cause analysis, forecasting, security. <strong>Each trained exclusively on YOUR data and tuned for your network’s patterns.</strong></p>

<p><strong>Layer 3 — MLOps Orchestration.</strong> Monitors model performance, detects <strong>concept drift</strong> (when “normal” changes), and triggers retraining automatically on your updated data.</p>

<p><strong>Layer 4 — Action.</strong> Automated alerts with contextual explanations, operations dashboards, remediation triggers, capacity reports. Engineers receive a <em>diagnosis</em> grounded in your network’s behavior.</p>

<p><img src="/images/v2_diag4_architecture.png" alt="Conceptual Architecture: ML on Top of ThousandEyes" /></p>

<p><em>The four-layer architecture showing how ThousandEyes data flows through custom ML models and MLOps orchestration to deliver actionable insights.</em></p>

<p>Key principle: <strong>no single monolithic model.</strong> A collection of specialized custom models sharing a common data foundation, modular and continuously refined.</p>

<hr />

<h2 id="implementation-start-small-scale-smart">Implementation: Start Small, Scale Smart</h2>

<p>Deploying custom ML models doesn’t require a “big bang.” Successful organizations follow a phased approach:</p>

<table>
  <thead>
    <tr>
      <th>Phase</th>
      <th>Focus</th>
      <th>Timeline</th>
      <th>Primary Benefit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>1</strong></td>
      <td>Custom anomaly detection on priority paths</td>
      <td>6–10 weeks</td>
      <td>Alert fatigue reduction tailored to YOUR traffic</td>
    </tr>
    <tr>
      <td><strong>2</strong></td>
      <td>Custom root cause analysis</td>
      <td>10–18 weeks</td>
      <td>MTTR reduction based on YOUR topology</td>
    </tr>
    <tr>
      <td><strong>3</strong></td>
      <td>Custom traffic forecasting</td>
      <td>14–22 weeks</td>
      <td>Spend optimization aligned with YOUR growth</td>
    </tr>
    <tr>
      <td><strong>4</strong></td>
      <td>Custom security detection</td>
      <td>18–26 weeks</td>
      <td>Zero-day detection tuned to YOUR baseline</td>
    </tr>
  </tbody>
</table>

<p><strong>Four critical success factors:</strong></p>

<ol>
  <li>
    <p><strong>Data quality first.</strong> Ensure 60–90 days of clean ThousandEyes telemetry before training. Garbage in, garbage out.</p>
  </li>
  <li>
    <p><strong>Domain experts + data scientists.</strong> Custom model development is a collaboration. ML engineers need operational context; network engineers need data science expertise. Neither succeeds alone.</p>
  </li>
  <li>
    <p><strong>Model governance from day one.</strong> Every model needs a clear owner, performance baseline, retraining policy, and drift detection threshold. Custom models require ongoing stewardship.</p>
  </li>
  <li>
    <p><strong>Interpretability is non-negotiable.</strong> Model outputs must explain <em>why</em> — which metrics, which patterns, what confidence level. Engineers need to validate before acting.</p>
  </li>
</ol>
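<p>The interpretability requirement can be met even with very simple machinery. As a sketch, assuming each metric has a learned mean and standard deviation, an alert can report which metrics deviated most and what share of the overall anomaly score each one explains. The metric names and baseline values below are hypothetical.</p>

```python
def explain_anomaly(baseline, observation, top_n=3):
    """baseline: {metric: (mean, stdev)}; observation: {metric: value}.
    Returns up to top_n tuples of (metric, z_score, share_of_score),
    ordered by the size of the deviation."""
    z_scores = {
        metric: (observation[metric] - mu) / sigma
        for metric, (mu, sigma) in baseline.items()
        if sigma > 0 and metric in observation
    }
    total = sum(z * z for z in z_scores.values())
    ranked = sorted(z_scores.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return [
        (metric, round(z, 2), round(z * z / total, 2) if total else 0.0)
        for metric, z in ranked[:top_n]
    ]

baseline = {
    "latency_ms": (40.0, 2.0),
    "loss_pct": (0.10, 0.05),
    "jitter_ms": (3.0, 1.0),
}
observation = {"latency_ms": 52.0, "loss_pct": 0.12, "jitter_ms": 4.0}

for metric, z, share in explain_anomaly(baseline, observation):
    print(f"{metric}: z={z}, share of anomaly score={share}")
```

<p>An engineer reading this output can see at a glance that latency, not loss or jitter, drove the alert, and can check that claim against the raw data before acting.</p>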

<p>The common pitfall: <strong>deploy and forget.</strong> Custom models need continuous monitoring and maintenance. Your network evolves; your models must adapt.</p>

<hr />

<h2 id="the-compounding-effect">The Compounding Effect</h2>

<p><img src="/images/v2_diag3_bizvalue.png" alt="Business Value of ML-Enhanced Network Observability" /></p>

<p><em>The business value diagram illustrates the tangible benefits and before/after comparison of ML-enhanced network observability.</em></p>

<p>Each custom model application delivers value independently. But the real power emerges when they work together — all trained on the same organizational data.</p>

<p>Custom traffic forecasts inform capacity decisions. Custom anomaly detection flags deviations from those forecasts. Custom root cause analysis traces problems through <em>your</em> topology referencing <em>your</em> incident history. Custom security models distinguish operational from adversarial deviations using <em>your</em> baseline. Resolution data from each incident feeds back into every model.</p>

<p><strong>Custom models trained on the same organization’s data develop shared context.</strong> The anomaly detector knows the same patterns the capacity planner predicts. The root cause analyzer understands the same topology the security model monitors. They speak the language of <em>your</em> network’s behavior.</p>

<p>Every incident resolved, every false positive eliminated makes <em>every custom model</em> more accurate. <strong>The models get better at understanding YOUR network specifically</strong> — not networks in general. This compounding advantage is nearly impossible to replicate and becomes a sustained competitive moat as models accumulate years of your operational reality.</p>

<hr />

<h2 id="where-do-we-go-from-here">Where Do We Go From Here?</h2>

<p>As networks grow more complex, the limitations of reactive monitoring and generic analytics become <strong>strategically untenable</strong>.</p>

<p>The technology is ready. The data platforms exist. The business case is proven. The question is whether your organization will be an early mover or a late adopter.</p>

<p><strong>Practical first steps:</strong></p>

<ol>
  <li><strong>Audit your ThousandEyes deployment.</strong> Verify agent coverage and ensure 60–90 days of clean telemetry.</li>
  <li><strong>Identify your unique pain points.</strong> Alert fatigue? Unclear root causes? Capacity planning guesswork?</li>
  <li><strong>Scope a Phase 1 pilot</strong> on your most critical paths. Define which metrics matter <em>for your environment</em> and what constitutes actionable alerts <em>for your SLAs</em>.</li>
  <li><strong>Set realistic expectations.</strong> Custom models take 6–10 weeks for Phase 1 because they include data collection, feature engineering, and training on <em>your</em> data.</li>
</ol>

<p>Six to ten weeks from now, you’ll have hard data on value — false positive reductions, early warning lead times, incident detection accuracy <em>in your environment</em>. And a foundation to build on.</p>

<p><strong>The strategic insight:</strong> The networks that win will be the ones that <em>understand themselves</em> — continuously, predictively, intelligently. That understanding comes from custom models trained on their own operational reality.</p>

<p>Generic ML is better than thresholds. <strong>Custom ML is better than generic.</strong> The gap is the difference between “this tool flagged an anomaly” and “this model understands our network and told us exactly what’s about to break, why it matters, and what to do.”</p>

<hr />

<h2 id="references">References</h2>
<ol>
  <li><a href="https://ieeexplore.ieee.org/document/10693044/">Distribution Network Topology Optimization Method Based on Graph Neural Network - ieeexplore.ieee.org</a></li>
  <li><a href="https://ieeexplore.ieee.org/document/10826583/">Graph Deep Learning Meets Persistent Homology - ieeexplore.ieee.org</a></li>
  <li><a href="https://ieeexplore.ieee.org/document/9871984/">Graph Neural Networks from a Graph Signal Processing Perspective: A Concise Overview - ieeexplore.ieee.org</a></li>
  <li><a href="https://distill.pub/2021/gnn-intro/">A Gentle Introduction to Graph Neural Networks - distill.pub</a></li>
  <li><a href="https://ieeexplore.ieee.org/abstract/document/10960451/">A Comprehensive Survey on Graph Neural Networks - ieeexplore.ieee.org</a></li>
  <li><a href="https://medium.com/@amit25173/temporal-convolutional-network-an-overview-4d2b6f03d6f8">Temporal Convolutional Network — An Overview - medium.com</a></li>
  <li><a href="https://unit8.com/resources/temporal-convolutional-networks-and-forecasting/">Temporal Convolutional Networks and Forecasting - unit8.com</a></li>
  <li><a href="https://www.sciencedirect.com/topics/computer-science/temporal-convolutional-network">Temporal Convolutional Network - ScienceDirect - www.sciencedirect.com</a></li>
  <li><a href="https://www.activeloop.ai/resources/glossary/temporal-convolutional-networks-tcn/">What is TCN? | Activeloop Glossary - www.activeloop.ai</a></li>
  <li><a href="https://www.nature.com/articles/s41598-022-25472-z">Temporal convolutional networks and data rebalancing for clinical length of stay and mortality prediction - www.nature.com</a></li>
  <li><a href="https://arxiv.org/html/2408.13561v1">Unsupervised anomaly detection with Vision-Transformer and Gaussian Random Field based Variational Autoencoders - arxiv.org</a></li>
  <li><a href="https://link.springer.com/article/10.1186/s40537-025-01342-z">Research on anomaly detection in attributed networks based on community-aware contrastive adversarial VGAE - link.springer.com</a></li>
  <li><a href="https://link.springer.com/chapter/10.1007/978-981-96-0994-9_20">Novel Feature Extraction on Unsupervised Anomaly Detection for Network Intrusion Packets with Variational Autoencoders - link.springer.com</a></li>
  <li><a href="https://medium.com/@luc.frachon/anomaly-detection-using-a-variational-autoencoder-part-ii-beeb30f0d88f">Anomaly Detection using a Variational Autoencoder — Part II - medium.com</a></li>
  <li><a href="https://www.mdpi.com/2073-8994/17/4/520">SOVAE: An Anomaly Detection System in 5G Network-Based NIDS Using Self-Organizing Map and Variational Autoencoder - www.mdpi.com</a></li>
  <li><a href="https://arxiv.org/abs/2410.03805">Time-series forecasting with local attention mechanism-based Transformer network - arxiv.org</a></li>
  <li><a href="https://arxiv.org/html/2410.03805v1">Time-series forecasting with local attention mechanism-based Transformer network - arxiv.org</a></li>
  <li><a href="https://www.sciencedirect.com/org/science/article/pii/S154622182200399X">Deep learning based trajectory and time series forecasting: a review, a comparison, and a look forward - www.sciencedirect.com</a></li>
  <li><a href="https://pubs.aip.org/aip/aco/article/1/1/016104/3362108/Easy-attention-A-simple-attention-mechanism-for">Easy attention: A simple attention mechanism for key-value-query-free and softmax-free transformers with applications to time series and chaotic systems - pubs.aip.org</a></li>
  <li><a href="https://www.nature.com/articles/s41598-024-66886-1">Efficient Transformer-based time series forecasting with sparse attention - www.nature.com</a></li>
  <li><a href="https://arxiv.org/html/2308.12874v3">Easy Attention: A Simple Attention Mechanism for Key-Value-Query-Free and Softmax-Free Transformers - arxiv.org</a></li>
  <li><a href="https://www.nature.com/articles/s41595-025-97244-4">Attention-based transformer networks predict autonomous vehicle trajectories in the real world - www.nature.com</a></li>
  <li><a href="https://link.springer.com/chapter/10.1007/978-3-031-80853-1_5">ARIMA modeling and forecasting of network traffic based-on GARCH model - link.springer.com</a></li>
  <li><a href="https://dl.acm.org/doi/10.1145/3180496.3180613">Time-Varying Network Traffic Prediction using Filtering and ARIMA Model - dl.acm.org</a></li>
  <li><a href="https://www.sciencedirect.com/science/article/pii/S1877705817300620">A Survey of Multivariate Time Series Forecasting Methods on Transportation Networks - www.sciencedirect.com</a></li>
  <li><a href="https://link.springer.com/chapter/10.1007/11751595_26">Forecasting Internet Network Traffic for the Next new Packets Using ARIMA Model - link.springer.com</a></li>
  <li><a href="https://link.springer.com/chapter/10.1007/0-387-34167-6_5">Non-linear Network Traffic Prediction Using Hybrid ARIMA-GARCH Model - link.springer.com</a></li>
  <li><a href="https://arxiv.org/html/2407.11105v1">On the Effectiveness of Data Preprocessing and Hyper-Parameter Optimization for Anomaly Detection in Network Traffic - arxiv.org</a></li>
  <li><a href="https://www.kentik.com/blog/the-reality-of-machine-learning-in-network-observability/">The Reality of Machine Learning in Network Observability - www.kentik.com</a></li>
  <li><a href="https://arxiv.org/abs/2108.13557">ML-Pipes: An End-to-End Machine Learning Pipeline for Research and Practice - arxiv.org</a></li>
  <li><a href="https://www.nature.com/articles/s41598-024-56706-x">Evaluation metrics and statistical tests for machine learning - www.nature.com</a></li>
  <li><a href="https://www.researchgate.net/publication/378940467_Evaluation_metrics_and_statistical_tests_for_machine_learning">Evaluation metrics and statistical tests for machine learning. - www.researchgate.net</a></li>
  <li><a href="https://arize.com/model-monitoring/">Model Monitoring | Arize - arize.com</a></li>
  <li><a href="https://www.datadoghq.com/blog/ml-model-monitoring-in-production-best-practices/">Machine learning model monitoring: Best practices - www.datadoghq.com</a></li>
  <li><a href="https://link.springer.com/protocol/10.1007/978-1-0716-3195-9_20">Evaluating Machine Learning Models and Their Diagnostic Value - link.springer.com</a></li>
  <li><a href="https://link.springer.com/article/10.1007/s42452-025-07312-7">A survey on machine learning-based network anomaly detection - link.springer.com</a></li>
  <li><a href="https://link.springer.com/article/10.1140/epjqt/s40507-025-00414-6">Quantum machine learning in cybersecurity: a comprehensive survey - link.springer.com</a></li>
  <li><a href="https://www.mdpi.com/2306-5729/10/3/33">Data quality or model selection: which is more important on the performance of ML-based network abnormal and attack detection? - www.mdpi.com</a></li>
  <li><a href="https://arxiv.org/html/2601.04089v1">A Leakage-Resistant Experimental Protocol for Deep Learning-Based Network Traffic Classification - arxiv.org</a></li>
  <li><a href="https://www.nature.com/articles/s41598-024-70983-6">Software defined networking based network traffic classification using machine learning techniques - www.nature.com</a></li>
  <li><a href="https://aircconline.com/abstract/ijcnc/v17n3/17325cnc07.html">Classification of Network Traffic Using Machine Learning Models on the NetML Dataset - aircconline.com</a></li>
  <li><a href="https://arxiv.org/abs/2503.02141">Network Traffic Classification Using Machine Learning, Transformer, and Large Language Models - arxiv.org</a></li>
  <li><a href="https://www.geeksforgeeks.org/machine-learning/metrics-for-machine-learning-model/">Evaluation Metrics in Machine Learning - GeeksforGeeks - www.geeksforgeeks.org</a></li>
  <li><a href="https://scikit-learn.org/stable/modules/model_evaluation.html">Metrics and scoring: quantifying the quality of predictions - scikit-learn.org</a></li>
  <li><a href="https://indatalabs.com/blog/predictive-models-performance-evaluation-important">Predictive Models Performance Evaluation: Which to Choose? - indatalabs.com</a></li>
  <li><a href="https://chbrown.github.io/kdd-2013-usb/kdd/p1294.pdf">Predictive Model Performance: Offline and Online Evaluations - chbrown.github.io</a></li>
  <li><a href="https://ieeexplore.ieee.org/document/9610045/">A Feature Engineering-Based Survey of Anomaly Detection Techniques - ieeexplore.ieee.org</a></li>
  <li><a href="https://arxiv.org/abs/2509.18007">Traffic-Explainer: A Model-Agnostic Explanation Framework for Deep Learning-based Network Traffic Classification - arxiv.org</a></li>
  <li><a href="https://arxiv.org/abs/2003.01261">On the Robustness of Deep Learning-based Network Traffic Classification against Adversarial Perturbations - arxiv.org</a></li>
  <li><a href="https://arxiv.org/abs/2107.12193">A Survey and Benchmark of Automatic-explaining Methodologies for both Network Traffic Classification and Anomaly Detection - arxiv.org</a></li>
  <li><a href="https://www.techscience.com/iasc/v36n3/51950/html">A review of performance evaluation metrics for classifiers in cyber-security - www.techscience.com</a></li>
  <li><a href="https://jmlr.csail.mit.edu/papers/volume3/forman03a/forman03a.pdf">An Extensive Review of Feature Selection Metrics for Text Classification - jmlr.csail.mit.edu</a></li>
  <li><a href="https://arxiv.org/html/2403.10319v2">NetBench: A Large-Scale and Comprehensive Network Traffic Benchmark Dataset for Foundation Models - arxiv.org</a></li>
  <li><a href="https://www.researchgate.net/publication/392611976_CLASSIFICATION_OF_NETWORK_TRAFFIC_USING_MACHINE_LEARNING_MODELS_ON_THE_NETML_DATASET">CLASSIFICATION OF NETWORK TRAFFIC USING MACHINE LEARNING MODELS ON THE NETML DATASET - www.researchgate.net</a></li>
  <li><a href="https://ieeexplore.ieee.org/document/9803674/">Network traffic verification based on a public dataset for IDS systems and machine learning classification algorithms - ieeexplore.ieee.org</a></li>
  <li><a href="https://data.mendeley.com/datasets/5pmnkshffm/3">Network traffic and code for machine learning classification - data.mendeley.com</a></li>
  <li><a href="https://link.springer.com/chapter/10.1007/978-3-030-58112-1_35">Improving Classification Performance for Unbalanced Network Intrusion Detection Data - link.springer.com</a></li>
  <li><a href="https://www.mdpi.com/2073-8994/17/12/2087">Optimizing Anomaly Detection with ResCAE-BiGRU Hybrid Deep Learning for Imbalanced Network Traffic - www.mdpi.com</a></li>
  <li><a href="https://www.frontiersin.org/journals/artificial-intelligence/articles/10.3389/frai.2025.1625891/full">Privacy-Preserving and Accuracy-Improving Framework for Cyber-Attack Classification via Federated Learning in Imbalanced Datasets - www.frontiersin.org</a></li>
  <li><a href="https://www.cmjpublishers.com/wp-content/uploads/2025/06/leveraging-big-datasets-for-machine-learning-based-anomaly-detection-in-cybersecurity-network-traffic.pdf">Leveraging Big Datasets For Machine Learning Based Anomaly Detection In Cybersecurity Network Traffic - www.cmjpublishers.com</a></li>
  <li><a href="https://link.springer.com/chapter/10.1007/978-3-031-08333-4_7">Predictive Model for Detection of Class Imbalance Problem in a Flow-Based Network Intrusion Detection Dataset - link.springer.com</a></li>
  <li><a href="https://otexts.com/fpp3/tscv.html">Cross-validation - otexts.com</a></li>
  <li><a href="https://www.nixtla.io/docs/forecasting/evaluation/cross_validation">Time-series Cross-validation - www.nixtla.io</a></li>
  <li><a href="https://www.researchgate.net/publication/387146892_Implementing_Time_Series_Cross_Validation_to_Evaluate_the_Forecasting_Model_Performance">Implementing Time Series Cross Validation to Evaluate the Forecasting Model Performance - www.researchgate.net</a></li>
  <li><a href="https://nixtlaverse.nixtla.io/mlforecast/docs/how-to-guides/cross_validation.html">Cross-validation - mlforecast - nixtlaverse.nixtla.io</a></li>
  <li><a href="https://www.analyticsvidhya.com/blog/2026/03/time-series-cross-validation/">Time Series Cross Validation: A Comprehensive Guide - www.analyticsvidhya.com</a></li>
  <li><a href="https://www.geeksforgeeks.org/machine-learning/time-series-cross-validation/">Time Series Cross-Validation - GeeksforGeeks - www.geeksforgeeks.org</a></li>
  <li><a href="https://www.evidentlyai.com/ml-in-production/concept-drift">What Is Concept Drift in ML and How to Detect It? - evidentlyai.com</a></li>
  <li><a href="https://techcommunity.microsoft.com/blog/fasttrackforazureblog/identifying-drift-in-ml-models-best-practices-for-generating-consistent-reliable/4040531">Identifying drift in ML models: Best practices for generating consistent, reliable responses - techcommunity.microsoft.com</a></li>
  <li><a href="https://aerospike.com/blog/model-drift-machine-learning/">Model Drift in Machine Learning - aerospike.com</a></li>
  <li><a href="https://www.motius.com/post/what-is-concept-drift-and-how-to-detect-it">What Is Concept Drift and How to Detect It - Motius - www.motius.com</a></li>
  <li><a href="https://www.databricks.com/blog/2019/09/18/productionizing-machine-learning-from-deployment-to-drift-detection.html">Productionizing Machine Learning: From Deployment to Drift Detection - www.databricks.com</a></li>
  <li><a href="https://imerit.ai/resources/blog/staying-ahead-of-drift-in-machine-learning-systems-all-una/">What is Model Drift in ML Systems? A Complete Guide - imerit.ai</a></li>
</ol>]]></content><author><name>Marc Buraczynski</name></author><category term="network observability" /><category term="machine learning" /><category term="ThousandEyes" /><category term="anomaly detection" /><category term="custom ML" /><summary type="html"><![CDATA[15 min read]]></summary></entry><entry><title type="html">Network Observability with GCN-LSTM</title><link href="https://gunnymarc.github.io/posts/2026/05/gcn-lstm-network-observability/" rel="alternate" type="text/html" title="Network Observability with GCN-LSTM" /><published>2026-05-24T00:00:00-04:00</published><updated>2026-05-24T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/05/GCN_LSTM_Network_Observability_White_Paper</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/05/gcn-lstm-network-observability/"><![CDATA[<p><strong>DATE:</strong> 2026-05-24</p>

<p><strong>Subject:</strong> Theoretical Application of a Combined Graph Convolutional Network (GCN) and Long Short-Term Memory (LSTM) Framework to Enhance Network Observability</p>

<hr />

<h2 id="executive-summary">Executive Summary</h2>

<p>Modern network architectures, from public cloud environments to industrial sensor networks, have grown into complex, dynamic, and distributed systems. This complexity challenges traditional monitoring approaches, which often focus on individual component metrics and fail to capture the emergent behaviors and subtle performance degradations that define contemporary operational issues. To address this, the paradigm of <strong>network observability</strong> has emerged, shifting the focus from simple data collection to deep, inferential understanding of a system’s internal state through the analysis of its external outputs.</p>

<p>This report presents a theoretical and architectural framework for enhancing network observability by applying a hybrid deep learning model that combines <strong>Graph Convolutional Networks (GCNs)</strong> and <strong>Long Short-Term Memory (LSTM) networks</strong>. The GCN+LSTM model is well suited to the challenges of modern networks because it can process data that is both structurally and temporally complex.</p>

<p>The core of this approach lies in modeling the network and its telemetry data as a <strong>dynamic graph</strong>, where network entities (e.g., services, hosts, sensors) are nodes and their interactions are edges. GCNs analyze the spatial dimension of this data, capturing the intricate dependencies and relational patterns across the network topology at a given moment. LSTMs analyze the temporal dimension, modeling how these patterns evolve over time.</p>

<p>This report details the application of this framework to two primary, high-impact use cases:</p>

<ol>
  <li>
    <p><strong>Anomaly Detection:</strong> Moving beyond single-metric thresholding to identify complex, system-wide anomalies that manifest as deviations from learned normal spatiotemporal patterns. Research demonstrates this approach can achieve precision and recall rates approaching 0.90 for detecting subtle, chronic failures in complex cloud systems.</p>
  </li>
  <li>
    <p><strong>Network Performance Prediction:</strong> Proactively forecasting key performance indicators (KPIs) and path performance metrics (PPMs) such as latency, congestion, and packet delivery ratio. This enables intelligent routing, proactive resource scaling, and congestion control, with studies showing significant improvements in prediction accuracy over state-of-the-art methods.</p>
  </li>
</ol>

<p>By synthesizing spatial and temporal dynamics, the GCN+LSTM framework provides a powerful tool for operational inference. It transforms raw telemetry streams into actionable insights, enabling engineering teams to move from a reactive to a proactive operational posture. This document provides the foundational knowledge, architectural design, and evidence-based justification for considering the GCN+LSTM model as a cornerstone of a next-generation network observability strategy.</p>

<hr />

<h2 id="1-formalizing-network-observability-in-the-modern-era">1. Formalizing Network Observability in the Modern Era</h2>

<p>The term “network observability” represents a critical evolution from traditional network monitoring. While monitoring is concerned with collecting and displaying metrics (<strong>what</strong> is happening), observability is concerned with inferential analysis to understand <strong>why</strong> it is happening. It is the practice of inferring the internal state, health, and performance of a complex system by analyzing the telemetry data it generates.</p>

<p>Synthesizing from contemporary research, network observability in the context of large-scale distributed systems can be formally defined as:</p>

<blockquote>
  <p><em>A system property and a technical practice wherein the internal operational state of a network is inferred through the comprehensive analysis of external telemetry data (e.g., metrics, logs, traces). It goes beyond tracking individual Key Performance Indicators (KPIs) to enable joint judgments based on the synergistic, spatiotemporal relationships among distributed components, thereby facilitating the discovery of hidden systemic states and the prediction of future behavior.</em></p>
</blockquote>

<p>This definition is predicated on several core principles derived from the challenges of modern network environments:</p>

<ul>
  <li><strong>Rejection of Siloed Analysis:</strong> In architectures such as microservices, industrial IoT (IIoT), or Vehicular Ad Hoc Networks (VANETs), the status of the system cannot be determined by examining individual components in isolation. An issue in one service may only become apparent through its subtle, cascading effects on downstream services. Observability requires modeling the entire system and its interconnections (Yu et al., 2023).</li>
  <li><strong>Embrace of Dynamic Topology:</strong> Unlike static, monolithic systems, modern networks exhibit dynamic topologies where connections and even components themselves are ephemeral. Observability must account for these structural changes over time, capturing not just how node properties change but how their relationships evolve (Yu et al., 2023).</li>
  <li><strong>Focus on Operational Inference:</strong> The ultimate goal of observability is not data collection but actionable inference. This includes core tasks like <strong>network tomography</strong>—the inference of unobserved network characteristics from observed measurements—as well as fault diagnosis, performance prediction, and automated traffic control (Hu et al., 2025). For example, by measuring end-to-end path performance metrics (PPMs) like latency for a subset of paths, a robust observability model can infer the latency for all other paths in the network.</li>
  <li><strong>Detection of “Chronic” Failures:</strong> Modern, resilient systems often do not fail catastrophically. Instead, they suffer from “gradual, chronic, localized failures or quality degradations” (Yu et al., 2023). These subtle issues, such as a slight increase in packet loss or a minor rise in service latency under specific load conditions, are often invisible to traditional monitoring but are prime targets for an observability framework capable of detecting faint deviations from complex, normal behaviors.</li>
</ul>
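<p>As a toy illustration of the network tomography idea mentioned in the list above, the sketch below estimates the latency of an unmeasured path from a few measured ones. It rests on two simplifying assumptions that are not taken from the cited work: path latency is the sum of its link latencies, and the routing matrix is fully known. The three-link network, the function name <code class="language-plaintext highlighter-rouge">infer_link_latency</code>, and all values are hypothetical.</p>

```python
import numpy as np

def infer_link_latency(routing, path_latency):
    """Least-squares estimate of per-link latency from end-to-end path measurements."""
    link_latency, *_ = np.linalg.lstsq(routing, path_latency, rcond=None)
    return link_latency

# Rows are measured paths, columns are links; 1.0 means the path traverses the link.
measured_routing = np.array([
    [1.0, 1.0, 0.0],   # path over link 0 and link 1
    [0.0, 1.0, 1.0],   # path over link 1 and link 2
    [1.0, 0.0, 1.0],   # path over link 0 and link 2
])
measured_latency_ms = np.array([15.0, 30.0, 25.0])

link_latency_ms = infer_link_latency(measured_routing, measured_latency_ms)

# A path that was never measured directly: it crosses all three links.
unmeasured_path = np.array([1.0, 1.0, 1.0])
estimated_latency_ms = float(unmeasured_path @ link_latency_ms)
```

<p>A learned spatiotemporal model aims to relax exactly these assumptions, since real routing is often partially hidden and latency is not strictly additive; the linear version is only meant to make the inference step concrete.</p>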

<p>In essence, network observability demands a transition from collecting data points to understanding data patterns within a holistic, dynamic context.</p>

<h2 id="2-modeling-network-telemetry-as-graph-structured-temporal-data">2. Modeling Network Telemetry as Graph-Structured Temporal Data</h2>

<p>The power of the GCN+LSTM framework stems from its natural alignment with the structure of network telemetry data. A modern network is fundamentally a graph, and its behavior is a time series. By formally mapping observability data into this structure, we unlock the ability to apply advanced spatiotemporal modeling.</p>

<h3 id="21-the-graph-data-model">2.1 The Graph Data Model</h3>

<p>At any given time step <em>t</em>, the state of a network can be represented as a property graph, <code class="language-plaintext highlighter-rouge">G_t</code>, consisting of nodes, edges, and their associated features.</p>

<ul>
  <li><strong>Nodes (Vertices):</strong> Nodes represent the core entities of the network. Their definition is use-case dependent:
    <ul>
      <li>In a cloud infrastructure, nodes can be microservices, containers, pods, virtual machines, or physical hosts (Yu et al., 2023).</li>
      <li>In an Industrial IoT (IIoT) context, nodes are sensors, actuators, controllers, or gateways (Yang et al., 2025).</li>
      <li>In a communication network, nodes represent routers, switches, or other network hardware.</li>
    </ul>
  </li>
  <li><strong>Node Features:</strong> Each node possesses a set of attributes, represented as a feature vector. These are typically the KPIs collected from the entity. Examples include:
    <ul>
      <li>CPU and memory utilization</li>
      <li>Disk I/O rates</li>
      <li>Sensor readings (e.g., temperature, pressure)</li>
      <li>Queue depth or buffer utilization</li>
    </ul>
  </li>
  <li><strong>Edges (Connections):</strong> Edges represent the interactions, communication pathways, or logical relationships between nodes. The existence of an edge signifies a dependency.
    <ul>
      <li>In a microservices application, an edge could represent an API call from one service to another.</li>
      <li>In an IIoT network, an edge could be derived from a Spearman correlation matrix, indicating a strong statistical relationship between the readings of two different sensors (Yang et al., 2025).</li>
      <li>In a computer network, an edge represents a physical or logical link.</li>
    </ul>
  </li>
  <li><strong>Edge Features:</strong> Like nodes, edges can have their own feature vectors describing the nature of the interaction.
    <ul>
      <li>Communication volume (e.g., requests per second, data transferred)</li>
      <li>Communication latency or response time</li>
      <li>Packet loss rate</li>
      <li>Protocol type</li>
    </ul>
  </li>
</ul>

<p>This data can be structured into matrices suitable for machine learning: a <strong>Node Feature Matrix (<code class="language-plaintext highlighter-rouge">X</code>)</strong>, where each row corresponds to a node’s features, and an <strong>Adjacency Matrix (<code class="language-plaintext highlighter-rouge">A</code>)</strong>, which defines the connectivity between nodes.</p>
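<p>A minimal sketch of how one snapshot might be turned into <code class="language-plaintext highlighter-rouge">X</code> and <code class="language-plaintext highlighter-rouge">A</code> is shown below. The service names, KPI values, and the helper <code class="language-plaintext highlighter-rouge">build_graph_matrices</code> are hypothetical, and treating each API call as an undirected, unweighted edge is an assumption made for brevity.</p>

```python
import numpy as np

# Hypothetical snapshot: three services with two KPIs each (CPU %, memory %).
node_names = ["frontend", "checkout", "database"]
node_kpis = {
    "frontend": [35.0, 40.0],
    "checkout": [60.0, 55.0],
    "database": [80.0, 70.0],
}
# Observed API calls as (caller, callee) pairs.
api_calls = [("frontend", "checkout"), ("checkout", "database")]

def build_graph_matrices(names, kpis, edges):
    """Return the node feature matrix X and a symmetric adjacency matrix A."""
    index = {name: i for i, name in enumerate(names)}
    X = np.array([kpis[name] for name in names], dtype=float)
    A = np.zeros((len(names), len(names)), dtype=float)
    for src, dst in edges:
        A[index[src], index[dst]] = 1.0
        A[index[dst], index[src]] = 1.0  # assumption: dependency treated as undirected
    return X, A

X, A = build_graph_matrices(node_names, node_kpis, api_calls)
```

<p>Row <em>i</em> of <code class="language-plaintext highlighter-rouge">X</code> and row <em>i</em> of <code class="language-plaintext highlighter-rouge">A</code> must refer to the same node, which is why the sketch derives both from a single index mapping. Edge features such as latency could be carried in a weighted adjacency matrix instead of the 0/1 values used here.</p>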

<h3 id="22-the-temporal-dimension">2.2 The Temporal Dimension</h3>

<p>A single graph snapshot provides a spatial view of the network at one instant. However, the most critical insights come from observing how this graph evolves. The state of the network at time <em>t</em> is deeply dependent on its state at <em>t-1, t-2</em>, and so on.</p>

<p>By collecting these graph snapshots at regular intervals, we create a <strong>sequence of graphs</strong>: <code class="language-plaintext highlighter-rouge">[G_{t-k}, ..., G_{t-1}, G_t]</code>. This sequence represents the dynamic, spatiotemporal behavior of the network, capturing both the changing properties of nodes/edges and the potential for the graph’s topology itself to change.</p>

<p><img src="/images/figure1-temporal_graph_model.png" alt="Infographic illustrating the mapping of network observability data to a temporal graph data model." />
<em>Figure 1: Conceptual mapping of network state over time to a sequence of graph snapshots, forming the basis for spatiotemporal analysis.</em></p>
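<p>The sequence of snapshots can be cut into fixed-length windows for training. The sketch below assumes the topology is fixed, so that each snapshot is just a node feature matrix, and pairs every window of <code class="language-plaintext highlighter-rouge">k + 1</code> snapshots with the snapshot that follows it. The helper name <code class="language-plaintext highlighter-rouge">make_windows</code> and the random telemetry are illustrative only.</p>

```python
import numpy as np

def make_windows(snapshots, k):
    """Pair each window of k + 1 consecutive snapshots with the snapshot that follows it."""
    inputs, targets = [], []
    for t in range(k, len(snapshots) - 1):
        inputs.append(snapshots[t - k : t + 1])   # G_{t-k}, ..., G_t
        targets.append(snapshots[t + 1])          # G_{t+1}
    return np.stack(inputs), np.stack(targets)

# Hypothetical telemetry: 10 snapshots, 3 nodes, 2 features per node.
rng = np.random.default_rng(0)
snapshots = rng.random((10, 3, 2))

windows, next_states = make_windows(snapshots, k=3)
# windows has shape (6, 4, 3, 2); next_states has shape (6, 3, 2)
```

<p>If the topology also changes over time, each window would additionally need to carry one adjacency matrix per snapshot rather than a single shared one.</p>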

<h3 id="23-mapping-to-the-gcnlstm-framework">2.3 Mapping to the GCN+LSTM Framework</h3>

<p>This graph-structured temporal data model is precisely what the GCN+LSTM architecture is designed to process. The two components work in concert:</p>

<ol>
  <li>
    <p><strong>GCN for Spatial Feature Extraction:</strong> For each graph snapshot <code class="language-plaintext highlighter-rouge">G_t</code> in the sequence, a GCN is used to process the graph structure. The GCN generates an embedding (a dense vector representation) for each node by aggregating feature information from its local neighborhood. This process effectively encodes the spatial context of each node—its state relative to the nodes it is connected to. The output of this stage is a sequence of spatially-aware graph embeddings.</p>
  </li>
  <li>
    <p><strong>LSTM for Temporal Feature Extraction:</strong> The sequence of graph embeddings produced by the GCN is then fed into an LSTM. The LSTM is renowned for its ability to model long-range dependencies in sequential data. It processes the sequence of graph states, learning the temporal patterns of how the network evolves from one state to the next.</p>
  </li>
</ol>
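<p>The two stages above can be sketched in plain NumPy as a forward pass. This is an untrained, shape-level illustration rather than the architecture of any cited paper: the graph convolution uses the common symmetric normalization with self-loops, node embeddings are mean-pooled into one graph embedding per snapshot (an assumption), and the weights are random. All function and variable names are hypothetical.</p>

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetric normalization with self-loops: D^-1/2 (A + I) D^-1/2."""
    A_hat = A + np.eye(A.shape[0])
    deg_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * deg_inv_sqrt[:, None] * deg_inv_sqrt[None, :]

def gcn_layer(A_norm, X, W):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    return np.maximum(A_norm @ X @ W, 0.0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step; gates are stacked as [input, forget, output, candidate]."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def encode_sequence(A, snapshots, W_gcn, W, U, b):
    """GCN on each snapshot, mean-pool over nodes, then run the LSTM over time."""
    A_norm = normalize_adjacency(A)
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for X_t in snapshots:
        graph_embedding = gcn_layer(A_norm, X_t, W_gcn).mean(axis=0)
        h, c = lstm_step(graph_embedding, h, c, W, U, b)
    return h  # final hidden state summarizing the whole window

# Hypothetical sizes and random (untrained) weights.
rng = np.random.default_rng(0)
num_nodes, num_features, embed_dim, hidden_dim, steps = 3, 2, 4, 5, 6
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
snapshots = rng.random((steps, num_nodes, num_features))
W_gcn = rng.normal(size=(num_features, embed_dim))
W = rng.normal(size=(4 * hidden_dim, embed_dim))
U = rng.normal(size=(4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)

summary = encode_sequence(A, snapshots, W_gcn, W, U, b)
```

<p>In practice a deep learning framework would supply these layers and learn the weights by backpropagation; the point here is only the data flow, from per-snapshot spatial aggregation to a recurrent pass over time.</p>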

<p>This dual approach allows the model to learn complex, high-level <strong>spatiotemporal features</strong> that are impossible to capture with methods that treat metrics as independent time series or analyze a network graph statically. It directly models the core principle of observability: that system behavior is an emergent property of interconnected components evolving through time.</p>

<h2 id="3-primary-observability-use-cases">3. Primary Observability Use Cases</h2>

<p>The GCN+LSTM framework supports a range of operational inference tasks. This report focuses on two primary use cases—anomaly detection and performance prediction—that offer significant value to network engineering and architecture teams.</p>

<h3 id="31-use-case-1-advanced-anomaly-detection">3.1 Use Case 1: Advanced Anomaly Detection</h3>

<p>Traditional anomaly detection, often relying on statistical methods like PCA or univariate time-series models, is ill-equipped for the complexity of modern systems. It struggles to distinguish between benign fluctuations and genuine, subtle incidents that arise from multi-component interactions.</p>

<h4 id="311-problem-scope">3.1.1 Problem Scope</h4>

<p>Anomalies in distributed systems are rarely simple crashes. More common and insidious are issues like:</p>
<ul>
  <li>A “gray failure” where a service is running but operating at a degraded performance level.</li>
  <li>A cascading slowdown initiated by a resource bottleneck in one component that propagates through a chain of service calls.</li>
  <li>Anomalous behavior that only occurs when specific conditions across multiple, disparate components align.</li>
</ul>

<p>An example from an Elasticsearch cluster illustrates this: slight increases in client-side latency, when correlated with overlapping resource usage patterns on specific server nodes, can indicate an underlying performance anomaly that is invisible when looking at server KPIs alone (Yu et al., 2023). These are precisely the types of events a spatiotemporal model is designed to find.</p>

<h4 id="312-the-gcnlstm-approach">3.1.2 The GCN+LSTM Approach</h4>

<p>The anomaly detection task is framed as a <strong>forecasting problem</strong>. The GCN+LSTM model is trained exclusively on historical telemetry data from periods of normal network operation. Its objective is to learn the intricate patterns of “normalcy” and accurately predict the network’s state at the next time step (<code class="language-plaintext highlighter-rouge">t+1</code>) based on a sequence of past states (<code class="language-plaintext highlighter-rouge">t-k, ..., t</code>).</p>

<p>The detection mechanism is as follows:</p>
<ol>
  <li><strong>Training:</strong> The model learns a function <code class="language-plaintext highlighter-rouge">F</code> that maps a sequence of past graph snapshots to a predicted future snapshot: <code class="language-plaintext highlighter-rouge">Ĝ_{t+1} = F(G_{t-k}, ..., G_t)</code>.</li>
  <li><strong>Inference:</strong> During live operation, the model continuously makes predictions.</li>
  <li><strong>Anomaly Scoring:</strong> The predicted graph snapshot <code class="language-plaintext highlighter-rouge">Ĝ_{t+1}</code> (containing predicted node/edge features) is compared to the actual, measured graph snapshot <code class="language-plaintext highlighter-rouge">G_{t+1}</code>. A <strong>reconstruction error</strong> or <strong>prediction error</strong> is calculated.</li>
  <li><strong>Thresholding:</strong> If this error exceeds a predefined, statistically derived threshold, it signifies that the network is behaving in a way that deviates from its learned normal patterns. An alert is triggered.</li>
</ol>
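<p>The scoring and thresholding steps above can be sketched as follows. The source does not specify how the threshold is derived, so the rule used here, the mean of the prediction errors on normal data plus three standard deviations, is an assumption, as are the mean-squared-error score and the helper names.</p>

```python
import numpy as np

def prediction_error(predicted, actual):
    """Mean squared error between predicted and observed node features."""
    return float(np.mean((predicted - actual) ** 2))

def fit_threshold(normal_errors, num_std=3.0):
    """Assumed rule: mean error on normal data plus num_std standard deviations."""
    errors = np.asarray(normal_errors, dtype=float)
    return float(errors.mean() + num_std * errors.std())

def is_anomalous(predicted, actual, threshold):
    """Flag a snapshot whose prediction error exceeds the threshold."""
    return prediction_error(predicted, actual) > threshold

# Hypothetical prediction errors collected on held-out normal telemetry.
normal_errors = [0.010, 0.012, 0.011, 0.009, 0.013]
threshold = fit_threshold(normal_errors)

predicted = np.full((2, 2), 0.5)                     # model forecast for t+1
observed_ok = predicted + 0.1                        # small, uniform deviation
observed_bad = predicted + np.array([[0.6, 0.6],     # one node far off its forecast
                                     [0.0, 0.0]])

ok_flag = is_anomalous(predicted, observed_ok, threshold)
bad_flag = is_anomalous(predicted, observed_bad, threshold)
```

<p>Scoring per node instead of averaging over the whole snapshot would additionally localize which entity deviates, at the cost of maintaining one threshold per node.</p>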

<p>The GCN captures anomalous spatial patterns (e.g., a node’s CPU is high while a neighbor’s throughput is unexpectedly low), and the LSTM detects anomalous temporal sequences (e.g., this spatial pattern has never occurred following a period of low network-wide latency).</p>

<h4 id="313-supporting-evidence">3.1.3 Supporting Evidence</h4>

<p>Research provides strong validation for this approach. The AD-DSTL method, which employs a GCN-LSTM architecture for cloud system anomaly detection, was evaluated on four distinct datasets, including a production microservices system with 92 nodes. The model demonstrated superior robustness and a significantly higher F1-score compared to baseline models like standalone GCN, LSTM, and SVM. At higher anomaly levels, both precision and recall reached approximately 0.9, meaning that roughly nine in ten alerts corresponded to real anomalies and roughly nine in ten real anomalies were caught (Yu et al., 2023). Similarly, the GCRL model applied to industrial sensor networks improved the F1-score by 4.35% over other state-of-the-art methods, effectively detecting anomalies in water distribution and hydraulic systems (Yang et al., 2025).</p>

<h3 id="32-use-case-2-network-performance-prediction">3.2 Use Case 2: Network Performance Prediction</h3>

<p>Proactive network management depends on the ability to foresee future conditions. Network performance prediction aims to forecast metrics like latency, throughput, and congestion, enabling systems to adapt before performance is impacted. This is a core tenet of network tomography: inferring unobserved or future performance from existing measurements.</p>

<h4 id="321-problem-scope">3.2.1 Problem Scope</h4>

<p>Key challenges in performance prediction include:</p>
<ul>
  <li><strong>Path-Level Prediction:</strong> Predicting the end-to-end performance of a path (e.g., between two services or across a WAN) is more complex than predicting a single node’s state, as it depends on the aggregated performance of all links and nodes along that path.</li>
  <li><strong>Congestion Forecasting:</strong> Predicting when and where network congestion will occur is vital for traffic engineering and dynamic routing, especially in highly mobile environments like VANETs where traffic patterns change rapidly.</li>
  <li><strong>Incomplete Knowledge:</strong> In many real-world scenarios, the complete network topology or the exact routing paths are unknown or hidden for security reasons. A predictive model should ideally not depend on having complete prior knowledge (Hu et al., 2025).</li>
</ul>

<h4 id="322-the-gcnlstm-approach">3.2.2 The GCN+LSTM Approach</h4>

<p>For performance prediction, the GCN+LSTM model is trained as a supervised regression model. The objective is to predict a specific target value (or set of values) for a future time step.</p>

<p>The process is as follows:</p>
<ol>
  <li><strong>Training:</strong> The model is given sequences of past graph snapshots as input and corresponding future performance metrics as labels. For example, the input could be network telemetry from <code class="language-plaintext highlighter-rouge">t-k</code> to <code class="language-plaintext highlighter-rouge">t</code>, and the label could be the average latency of a specific path at time <code class="language-plaintext highlighter-rouge">t+1</code>.</li>
  <li><strong>Learning:</strong>
    <ul>
      <li>The GCN component learns to create powerful node and path embeddings that implicitly capture topological information and spatial dependencies relevant to performance.</li>
      <li>The LSTM component learns the temporal dynamics of how traffic patterns and node states evolve to influence future performance.</li>
    </ul>
  </li>
  <li><strong>Prediction:</strong> Once trained, the model can take a current sequence of telemetry data and output a direct prediction for a future metric (e.g., “congestion level on node X will be 85% in 5 minutes” or “latency on path Y will be 120ms”).</li>
</ol>
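<p>The training setup described above amounts to a sliding window over the snapshot sequence. The sketch below shows one way to build the input/label pairs; the function name <code class="language-plaintext highlighter-rouge">make_windows</code> and the placeholder data are assumptions for illustration only.</p>

```python
def make_windows(snapshots, labels, k):
    """Pair each run of k+1 consecutive snapshots (t-k .. t) with the label at t+1."""
    samples = []
    for t in range(k, len(snapshots) - 1):
        window = snapshots[t - k : t + 1]
        samples.append((window, labels[t + 1]))
    return samples

# Placeholder snapshot identifiers and per-step path latency in ms (made-up values).
snapshots = ["G0", "G1", "G2", "G3", "G4"]
latency_ms = [100, 104, 99, 120, 118]

for window, target in make_windows(snapshots, latency_ms, k=2):
    print(window, target)
```

<p>Each sample holds <code class="language-plaintext highlighter-rouge">k+1</code> snapshots, and the last snapshot in the sequence is never used as an input because no label follows it.</p>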

<p>This approach allows for a range of predictive tasks, from node-level KPI prediction to complex, end-to-end path performance metric (PPM) prediction.</p>

<h4 id="323-supporting-evidence">3.2.3 Supporting Evidence</h4>

<p>A GCN-LSTM model applied to urban VANETs demonstrated its effectiveness in predicting traffic dynamics to enable adaptive routing and congestion control. The hybrid model significantly outperformed benchmarks, achieving a <strong>Packet Delivery Ratio (PDR) of 95.0%</strong> and reducing prediction errors (Mean Absolute Error of 0.02) far below other methods. This high predictive accuracy translated directly into improved network performance (Maray, 2026). Further, research in network tomography with path-centric graph neural networks (a conceptually similar approach) shows that such models can predict additive metrics like latency with significantly lower error (e.g., a MAPE of 0.6907 on an Internet dataset vs. &gt;0.81 for other methods) without requiring full knowledge of the network topology (Hu et al., 2025).</p>

<h3 id="33-other-potential-use-cases">3.3 Other Potential Use Cases</h3>

<p>Beyond these two primary applications, the spatiotemporal features learned by a GCN+LSTM model can be leveraged for other critical observability tasks:</p>

<ul>
  <li><strong>Automated Root Cause Analysis:</strong> By analyzing attention weights (in variants that add an attention mechanism) or feature-attribution scores after an anomaly is detected, it may be possible to identify the nodes, edges, and time points that contributed most to the anomalous prediction, narrowing down the likely root cause.</li>
  <li><strong>Proactive Resource Management:</strong> Predictions of future workload or performance degradation can be used to trigger automated remediation actions, such as scaling up cloud resources, diverting traffic, or scheduling preventative maintenance before users are impacted.</li>
  <li><strong>Security Threat Detection:</strong> Spatiotemporal anomaly detection can be applied to security-relevant data. An unusual pattern of communication (e.g., a host suddenly communicating with many new internal endpoints) could be flagged as a potential lateral movement attack, even if the individual connections are low-volume.</li>
</ul>

<p><img src="/images/figure2-observability_workflow.png" alt="Infographic showing how anomaly detection and performance prediction outputs from the model feed into and enhance network observability workflows." />
<em>Figure 2: The role of GCN+LSTM model outputs in an integrated observability workflow, enabling proactive and automated operational responses.</em></p>

<h2 id="4-model-architecture-and-foundations">4. Model Architecture and Foundations</h2>

<p>This section details the architectural components, mathematical underpinnings, and evaluation metrics for a GCN+LSTM framework.</p>

<h3 id="41-model-architecture">4.1 Model Architecture</h3>

<p>The GCN+LSTM model is an end-to-end deep learning architecture that processes a sequence of graph snapshots to produce a prediction.</p>

<p><strong>Input:</strong> A sequence of <code class="language-plaintext highlighter-rouge">k+1</code> graph snapshots from time <code class="language-plaintext highlighter-rouge">t-k</code> to <code class="language-plaintext highlighter-rouge">t</code>. Each snapshot consists of a node feature matrix <code class="language-plaintext highlighter-rouge">X_i</code> and an adjacency matrix <code class="language-plaintext highlighter-rouge">A_i</code>.</p>

<p><strong>Processing Pipeline:</strong></p>
<ol>
  <li><strong>Spatial Encoding (GCN):</strong> For each time step <code class="language-plaintext highlighter-rouge">i</code> in the input sequence, the graph <code class="language-plaintext highlighter-rouge">(X_i, A_i)</code> is passed through one or more GCN layers.
    <ul>
      <li>The GCN aggregates information from neighboring nodes, updating each node’s feature vector to create a spatially aware embedding <code class="language-plaintext highlighter-rouge">Z_i</code>. This step is performed for every snapshot in the sequence, producing a sequence of embeddings <code class="language-plaintext highlighter-rouge">[Z_{t-k}, ..., Z_t]</code>.</li>
    </ul>
  </li>
  <li><strong>Temporal Encoding (LSTM):</strong> The sequence of node embeddings <code class="language-plaintext highlighter-rouge">[Z_{t-k}, ..., Z_t]</code> is fed into an LSTM network.
    <ul>
      <li>The LSTM processes the sequence step-by-step, maintaining an internal hidden state that captures the temporal dynamics of how the graph evolves. The final hidden state of the LSTM, <code class="language-plaintext highlighter-rouge">h_t</code>, represents a compressed spatiotemporal summary of the entire input sequence.</li>
    </ul>
  </li>
  <li><strong>Prediction Head (Fully Connected Layer):</strong> The final LSTM hidden state <code class="language-plaintext highlighter-rouge">h_t</code> is passed to a final feed-forward neural network (the “head”).
    <ul>
      <li>The structure of this head depends on the task. For anomaly detection, it might aim to reconstruct the input or predict the next graph state. For performance prediction, it will output a regression value. An activation function like Softmax (for classification) or a linear activation (for regression) is used to produce the final output.</li>
    </ul>
  </li>
</ol>

<p>Some architectures may employ a dual-LSTM structure, processing node and edge features in separate parallel streams before fusion, to more explicitly model both entity and interaction dynamics (Yu et al., 2023).</p>

<p><img src="/images/figure3-gcn_lstm_pipeline.png" alt="Infographic detailing the GCN+LSTM processing pipeline, from data ingestion to output generation." />
<em>Figure 3: Architectural overview of the GCN+LSTM processing pipeline, showing the flow from graph sequences to spatiotemporal encoding and final prediction.</em></p>

<h3 id="42-mathematical-foundations">4.2 Mathematical Foundations</h3>

<h4 id="graph-representation">Graph Representation</h4>
<ul>
  <li><strong>Adjacency Matrix <code class="language-plaintext highlighter-rouge">A</code>:</strong> A square matrix of size <code class="language-plaintext highlighter-rouge">N x N</code> (where <code class="language-plaintext highlighter-rouge">N</code> is the number of nodes) where <code class="language-plaintext highlighter-rouge">A_ij = 1</code> if an edge exists between node <code class="language-plaintext highlighter-rouge">i</code> and node <code class="language-plaintext highlighter-rouge">j</code>, and 0 otherwise.</li>
  <li><strong>Node Feature Matrix <code class="language-plaintext highlighter-rouge">X</code>:</strong> A matrix of size <code class="language-plaintext highlighter-rouge">N x F</code> (where <code class="language-plaintext highlighter-rouge">F</code> is the number of features per node) where row <code class="language-plaintext highlighter-rouge">i</code> contains the feature vector for node <code class="language-plaintext highlighter-rouge">i</code>.</li>
</ul>
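<p>As a small illustration of these two matrices, the following sketch builds <code class="language-plaintext highlighter-rouge">A</code> from an edge list and pairs it with a hand-written <code class="language-plaintext highlighter-rouge">X</code>. It assumes an undirected graph (so <code class="language-plaintext highlighter-rouge">A</code> is symmetric); the node names, feature choices, and values are hypothetical.</p>

```python
def build_adjacency(nodes, edges):
    """Return a symmetric N x N 0/1 matrix for an undirected edge list."""
    index = {name: i for i, name in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[index[u]][index[v]] = 1
        A[index[v]][index[u]] = 1
    return A

nodes = ["r1", "r2", "r3"]
edges = [("r1", "r2"), ("r2", "r3")]
A = build_adjacency(nodes, edges)

# X: one row per node, F = 2 features (CPU utilization, throughput in Gbps).
X = [[0.35, 1.2], [0.80, 0.4], [0.20, 2.1]]

for row in A:
    print(row)
```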

<h4 id="graph-convolutional-network-gcn-layer">Graph Convolutional Network (GCN) Layer</h4>
<p>The core of the GCN is its propagation rule, which defines how node representations are updated at each layer <code class="language-plaintext highlighter-rouge">l</code>. The simplified formula for a GCN layer is:</p>

<p><code class="language-plaintext highlighter-rouge">H⁽ˡ⁺¹⁾ = σ(D̃⁻¹/² Ã D̃⁻¹/² H⁽ˡ⁾ W⁽ˡ⁾)</code></p>

<p>Where:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">H⁽ˡ⁾</code> is the matrix of node activations at layer <code class="language-plaintext highlighter-rouge">l</code> (<code class="language-plaintext highlighter-rouge">H⁽⁰⁾ = X</code>).</li>
  <li><code class="language-plaintext highlighter-rouge">Ã = A + I</code> is the adjacency matrix <code class="language-plaintext highlighter-rouge">A</code> with self-loops added (so a node includes its own features in the aggregation).</li>
  <li><code class="language-plaintext highlighter-rouge">D̃</code> is the diagonal degree matrix of <code class="language-plaintext highlighter-rouge">Ã</code>. The term <code class="language-plaintext highlighter-rouge">D̃⁻¹/² Ã D̃⁻¹/²</code> is a symmetric normalization of the adjacency matrix that prevents the scale of feature vectors from exploding and stabilizes the learning process.</li>
  <li><code class="language-plaintext highlighter-rouge">W⁽ˡ⁾</code> is a trainable weight matrix for layer <code class="language-plaintext highlighter-rouge">l</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">σ</code> is a non-linear activation function, such as ReLU (<code class="language-plaintext highlighter-rouge">max(0, x)</code>).</li>
</ul>

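<p>In essence, this operation replaces each node’s feature vector with a degree-normalized weighted sum of its own features and those of its immediate neighbors, then applies a learned linear transformation and a non-linearity. Stacking several such layers lets each node’s representation draw on progressively larger neighborhoods: after two layers, for instance, a node’s embedding reflects nodes up to two hops away.</p>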
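<p>To make the propagation rule concrete, here is a dependency-free sketch of a single GCN layer using nested lists. It is meant for reading rather than performance; a real implementation would use a tensor library. The helper names <code class="language-plaintext highlighter-rouge">matmul</code> and <code class="language-plaintext highlighter-rouge">gcn_layer</code> and the toy inputs are assumptions.</p>

```python
import math

def matmul(P, Q):
    """Plain nested-list matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def gcn_layer(A, H, W):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(A)
    # A_tilde = A + I adds a self-loop to every node.
    A_tilde = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    # Degrees of A_tilde drive the symmetric normalization.
    deg = [sum(row) for row in A_tilde]
    A_hat = [
        [A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
        for i in range(n)
    ]
    # Aggregate neighbor features, apply the weights, then ReLU.
    aggregated = matmul(A_hat, H)
    return [[max(0.0, x) for x in row] for row in matmul(aggregated, W)]

# Three-node path graph, one feature per node, identity weight.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
H = [[1.0], [1.0], [1.0]]
W = [[1.0]]
print(gcn_layer(A, H, W))
```

<p>Because the normalization is symmetric rather than row-wise, the rows of the normalized matrix do not sum to exactly one, which is why the outputs for the end nodes and the middle node differ even though every input feature is 1.0.</p>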

<h4 id="long-short-term-memory-lstm-layer">Long Short-Term Memory (LSTM) Layer</h4>
<p>An LSTM is a type of Recurrent Neural Network (RNN) designed to overcome the vanishing gradient problem and learn long-term dependencies. It achieves this through a series of “gates” that control the flow of information. At each time step <code class="language-plaintext highlighter-rouge">t</code>, an LSTM cell takes the current input <code class="language-plaintext highlighter-rouge">x_t</code>, the previous hidden state <code class="language-plaintext highlighter-rouge">h_{t-1}</code>, and the previous cell state <code class="language-plaintext highlighter-rouge">c_{t-1}</code>, and computes the new cell state <code class="language-plaintext highlighter-rouge">c_t</code> and hidden state <code class="language-plaintext highlighter-rouge">h_t</code>.</p>

<p>The update is governed by three gates:</p>
<ul>
  <li><strong>Forget Gate (<code class="language-plaintext highlighter-rouge">f_t</code>):</strong> Decides what information to discard from the cell state.</li>
  <li><strong>Input Gate (<code class="language-plaintext highlighter-rouge">i_t</code>):</strong> Decides which new information to store in the cell state.</li>
  <li><strong>Output Gate (<code class="language-plaintext highlighter-rouge">o_t</code>):</strong> Decides what part of the cell state to use for the output hidden state.</li>
</ul>

<p>These gates allow the LSTM to selectively remember relevant information from many time steps in the past while discarding irrelevant information, making it ideal for modeling the temporal evolution of network states.</p>
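<p>The gate mechanics can be illustrated with a scalar LSTM step (one input feature, one hidden unit), following the standard LSTM equations. Real layers use weight matrices and vectors; the scalar form, the <code class="language-plaintext highlighter-rouge">params</code> layout, and the weight values here are simplifying assumptions.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One scalar LSTM step. params maps a gate name to (w_x, w_h, bias)."""
    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    f_t = sigmoid(pre("f"))      # forget gate: how much of c_prev to keep
    i_t = sigmoid(pre("i"))      # input gate: how much new information to store
    o_t = sigmoid(pre("o"))      # output gate: how much of the cell state to expose
    g_t = math.tanh(pre("g"))    # candidate values for the cell state
    c_t = f_t * c_prev + i_t * g_t
    h_t = o_t * math.tanh(c_t)
    return h_t, c_t

# Arbitrary illustrative weights; a trained model would learn these.
params = {"f": (0.5, 0.1, 1.0), "i": (0.8, 0.2, 0.0),
          "o": (0.3, 0.4, 0.0), "g": (1.0, 0.5, 0.0)}

h, c = 0.0, 0.0
for x in [0.2, 0.4, 0.9]:  # e.g. a normalized utilization series
    h, c = lstm_step(x, h, c, params)
print(h, c)
```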

<h3 id="43-evaluation-metrics">4.3 Evaluation Metrics</h3>

<p>The performance of the GCN+LSTM framework can be assessed using standard machine learning metrics tailored to the specific use case.</p>

<h4 id="for-anomaly-detection-classification-task">For Anomaly Detection (Classification Task):</h4>
<ul>
  <li><strong>Precision:</strong> <code class="language-plaintext highlighter-rouge">TP / (TP + FP)</code> – Of all the alerts generated, what fraction were actual anomalies? High precision is crucial for building operator trust and avoiding alert fatigue.</li>
  <li><strong>Recall:</strong> <code class="language-plaintext highlighter-rouge">TP / (TP + FN)</code> – Of all the actual anomalies that occurred, what fraction did the system detect? High recall is critical for ensuring that important events are not missed.</li>
  <li><strong>F1-Score:</strong> <code class="language-plaintext highlighter-rouge">2 * (Precision * Recall) / (Precision + Recall)</code> – The harmonic mean of precision and recall, providing a single score that balances the two. It is often the primary metric for evaluating anomaly detectors on imbalanced datasets.</li>
</ul>

<p>Studies show GCN+LSTM models achieving F1-scores of <strong>95.96%</strong> on the WADI industrial control dataset (Yang et al., 2025) and precision/recall around <strong>0.85-0.90</strong> on cloud system datasets (Yu et al., 2023).</p>
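<p>The three classification metrics can be computed directly from the formulas above. The sketch assumes binary labels where 1 marks an anomaly; the function name and the sample labels are illustrative.</p>

```python
def classification_metrics(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels where 1 = anomaly."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Three real anomalies; the detector finds two and raises one false alert.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(classification_metrics(y_true, y_pred))
```

<p>Returning 0.0 when a denominator is zero is a convention chosen here to avoid a division error; some libraries warn or return NaN instead.</p>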

<h4 id="for-performance-prediction-regression-task">For Performance Prediction (Regression Task):</h4>
<ul>
  <li><strong>Mean Absolute Error (MAE):</strong> <code class="language-plaintext highlighter-rouge">(1/n) * Σ|y_i - ŷ_i|</code> – The average absolute difference between the predicted values (<code class="language-plaintext highlighter-rouge">ŷ_i</code>) and the actual values (<code class="language-plaintext highlighter-rouge">y_i</code>). It is easily interpretable as it is in the same units as the target variable.</li>
  <li><strong>Root Mean Squared Error (RMSE):</strong> <code class="language-plaintext highlighter-rouge">sqrt((1/n) * Σ(y_i - ŷ_i)²)</code> – Similar to MAE, but penalizes larger errors more heavily.</li>
  <li><strong>Mean Absolute Percentage Error (MAPE):</strong> <code class="language-plaintext highlighter-rouge">(100/n) * Σ|(y_i - ŷ_i) / y_i|</code> – Expresses the error as a percentage of the actual value, useful for understanding the relative error.</li>
</ul>

<p>In VANET performance prediction, a GCN+LSTM model achieved an MAE of <strong>0.02</strong> and an RMSE of <strong>0.07</strong>, demonstrating very high predictive accuracy (Maray, 2026). In network tomography tasks, GNN-based approaches reduced MAPE on latency prediction to <strong>0.6907</strong>, outperforming baselines that scored over 0.81 (Hu et al., 2025).</p>
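<p>The regression metrics follow the same pattern. This sketch implements the three formulas as written above, with MAPE expressed as a percentage; the function name and the latency values are illustrative. MAPE is undefined when an actual value is zero, which the sketch does not guard against.</p>

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, MAPE in percent). Assumes no actual value is zero."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mape = 100.0 / n * sum(abs(e / t) for e, t in zip(errors, y_true))
    return mae, rmse, mape

# Measured vs. predicted path latency in ms (made-up values).
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 190.0, 400.0]
print(regression_metrics(y_true, y_pred))
```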

<h2 id="5-conclusion">5. Conclusion</h2>

<p>The GCN+LSTM framework is a promising approach to network observability, in both theory and practice. By treating network telemetry as dynamic, structured graph data, it moves beyond monitoring each metric in isolation and provides a basis for operational inference. Its ability to model the interdependent, time-varying behavior of modern distributed systems, as reported in the studies cited above, makes it well suited to high-value use cases such as anomaly detection and proactive performance prediction.</p>

<p>While implementing such a system requires careful data engineering and model training, the evidence from recent research is encouraging. Studies in three different domains (cloud computing, industrial control systems, and vehicular networks) each report that combining a GCN for spatial analysis with an LSTM for temporal analysis outperforms the baselines they tested.</p>

<p>For technical engineering groups and network architects, this framework offers a clear path toward a more intelligent, automated, and proactive operational model. By adopting a GCN+LSTM approach, organizations can enhance their ability to understand and control their increasingly complex network environments, improve system reliability, and optimize performance in ways that are unattainable with conventional methods. This report provides the foundational basis for exploring the strategic integration of this technology into next-generation observability platforms.</p>

<hr />

<h1 id="references">References</h1>
<ol>
  <li><a href="https://www.techscience.com/iasc/v37n2/53268/html">Anomaly Detection for Cloud Systems with Dynamic Spatiotemporal Learning - Tech Science Press</a></li>
  <li><a href="https://www.nature.com/articles/s41598-025-33193-2">Hybrid deep learning techniques for adaptive routing and congestion control in urban VANET for wireless mobile networking - Nature</a></li>
  <li><a href="https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0324543">Cloud-edge collaborative data anomaly detection in industrial sensor networks - PLOS</a></li>
  <li><a href="https://arxiv.org/html/2502.16430">Network Tomography with Path-Centric Graph Neural Network - arXiv</a></li>
</ol>]]></content><author><name>Marc Buraczynski</name></author><category term="networking" /><category term="graph neural networks" /><category term="observability" /><category term="GCN" /><category term="LSTM" /><summary type="html"><![CDATA[DATE: 2026-05-24]]></summary></entry><entry><title type="html">Network Modeling in Cisco ThousandEyes</title><link href="https://gunnymarc.github.io/posts/2026/05/thousandeyes-network-modeling/" rel="alternate" type="text/html" title="Network Modeling in Cisco ThousandEyes" /><published>2026-05-22T00:00:00-04:00</published><updated>2026-05-22T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/05/thousandeyes-network-modeling</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/05/thousandeyes-network-modeling/"><![CDATA[<p><em>A technical deep-dive into the graph-theoretic foundations, algorithms, and data structures that power ThousandEyes’ network intelligence platform.</em></p>

<hr />

<h2 id="i-introduction">I. Introduction</h2>

<p>Modern enterprise infrastructure depends on networks that no single organization owns or controls. A request from a remote employee’s browser to a SaaS application may traverse a home ISP, a regional transit provider, one or more Tier-1 backbone networks, a CDN edge node, a cloud provider’s internal fabric, and finally the application’s load balancer — all before the first byte of response data is generated. When performance degrades, the immediate question — <em>where is the problem?</em> — demands visibility across every one of those domains.</p>

<p>Cisco ThousandEyes addresses this challenge by deploying a global mesh of software agents that continuously probe network paths, collect BGP routing tables, and measure application response times. The raw output of these probes — ICMP TTL-exceeded messages, TCP handshake timings, BGP UPDATE messages, SNMP interface counters — is voluminous and, in isolation, unintelligible. What transforms it into actionable intelligence is <strong>graph theory</strong>: the branch of mathematics concerned with pairwise relationships between objects.</p>

<p>Every core visualization and detection capability in ThousandEyes is, at its foundation, a graph operation:</p>

<ul>
  <li><strong>Path Visualization</strong> constructs a directed, weighted graph of IP hops from agents to a destination, then overlays performance metrics on nodes and edges.</li>
  <li><strong>BGP Route Visualization</strong> builds an Autonomous-System-level directed graph from route collector data, enabling detection of hijacks, leaks, and path instability.</li>
  <li><strong>Device Layer</strong> auto-discovers internal infrastructure via LLDP/CDP and renders it as a Layer-2 adjacency graph enriched with SNMP health telemetry.</li>
  <li><strong>Internet Insights</strong> aggregates de-identified measurements from the entire ThousandEyes agent fleet into a global provider-infrastructure graph, applying cluster analysis to detect macro-scale outages.</li>
</ul>

<p>This article examines these graph models in detail: the abstractions they use, the algorithms that build and analyze them, the visualization techniques that make them interpretable, and the programmatic interfaces that allow engineers to extend them.</p>

<hr />

<h2 id="ii-graph-theory-foundations-as-applied-in-thousandeyes">II. Graph Theory Foundations as Applied in ThousandEyes</h2>

<p>Before examining each product capability, it is useful to establish the specific graph-theoretic constructs that ThousandEyes employs and how they map to network engineering concepts.</p>

<h3 id="21-core-abstractions">2.1 Core Abstractions</h3>

<p><strong>Nodes (Vertices).</strong> In ThousandEyes, a node represents a distinct network entity. The specific entity depends on the graph model in use:</p>

<table>
  <thead>
    <tr>
      <th>Graph Model</th>
      <th>Node Represents</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Path Visualization</td>
      <td>A unique IP address responding to a probe</td>
      <td><code class="language-plaintext highlighter-rouge">72.14.236.217</code> (Google edge router)</td>
    </tr>
    <tr>
      <td>BGP Route Visualization</td>
      <td>An Autonomous System (AS)</td>
      <td>AS 15169 (Google)</td>
    </tr>
    <tr>
      <td>Device Layer</td>
      <td>A discovered network device</td>
      <td>Cisco Catalyst 9300 switch</td>
    </tr>
    <tr>
      <td>Internet Insights</td>
      <td>A provider Point of Presence (PoP)</td>
      <td>Comcast PoP, Chicago</td>
    </tr>
  </tbody>
</table>

<p><strong>Edges (Links).</strong> An edge represents a connection or relationship between two nodes. Edges carry attributes — metadata and performance metrics — that are central to the platform’s diagnostic value:</p>

<ul>
  <li><strong>Path Visualization edges</strong>: Represent a network segment between two consecutive hops. Attributes include forwarding loss (%), link delay (ms), number of traces traversing the link, DSCP markings, and minimum path MTU.</li>
  <li><strong>BGP edges</strong>: Represent a peering or transit relationship between two ASes. Attributes include the number of path changes observed, reachability percentage, and BGP update counts.</li>
  <li><strong>Device Layer edges</strong>: Represent Layer-2 connections between device interfaces, discovered via neighbor protocol advertisements.</li>
</ul>

<p><strong>Directed vs. undirected.</strong> Path Visualization graphs are inherently directed: traffic flows from agent (source) to destination. ThousandEyes renders this with arrows and supports toggling between source-to-target, target-to-source, and bidirectional views. In Agent-to-Agent tests, both directions are measured independently, producing two distinct directed graphs that may differ substantially due to asymmetric routing. BGP Route Visualization is also directed: edges point from the monitoring vantage point toward the origin AS, which is the order in which ASNs appear in the AS-path attribute and the reverse of the direction in which the route announcement propagated.</p>

<p><strong>Weighted graphs.</strong> Nearly all ThousandEyes graphs are weighted. The weight on an edge is the value of a selected performance metric — typically latency, loss, or jitter. The platform’s color-coding system maps these weights to a green-to-red gradient, providing immediate visual encoding of graph-edge severity.</p>

<h3 id="22-graph-representations-in-the-platform">2.2 Graph Representations in the Platform</h3>

<p>ThousandEyes employs three primary graph representations internally, each optimized for its use case:</p>

<p><strong>Interactive Directed Graph (Path Visualization).</strong> The path trace data collected by agents is assembled into a composite directed graph where shared hops across multiple agents are merged into single nodes and divergent routes branch visually. This is conceptually close to a directed acyclic graph (DAG) from agents (sources) to the destination (sink), although routing loops — when detected — introduce cycles and are flagged with a distinct red-loop indicator.</p>

<p><strong>AS-Level Directed Graph (BGP Route Visualization).</strong> BGP data from public monitors (RIPE-RIS, RouteViews) and customer-deployed private monitors is assembled into a graph where each node is an AS and each edge is a segment of the AS-path. The resulting structure is a directed graph, typically tree-like, whose paths start at the monitor vantage points and converge on the monitored prefix’s origin AS.</p>

<p><strong>Adjacency Graph (Device Layer).</strong> Internal topology is represented as an undirected adjacency graph built from LLDP and CDP neighbor tables. Each device is a node; each discovered neighbor relationship is an edge. SNMP polling enriches nodes with health metrics (CPU utilization, memory consumption, interface error rates, bandwidth utilization), turning the raw adjacency graph into a health-annotated topology map.</p>

<h3 id="23-key-graph-properties-thousandeyes-exploits">2.3 Key Graph Properties ThousandEyes Exploits</h3>

<p>Several classical graph properties map directly to network monitoring concepts:</p>

<p><strong>Connectivity.</strong> The fundamental question — <em>can the agent reach the target?</em> — is a connectivity query on the path graph. A disconnected graph (no path from agent node to destination node) indicates a reachability failure. ThousandEyes reports this as 100% loss with an incomplete path trace.</p>
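<p>As a generic illustration of such a connectivity query (not a description of ThousandEyes internals), a breadth-first search over a directed adjacency mapping answers whether the destination is reachable from the agent. The graph contents use documentation-range addresses and are hypothetical.</p>

```python
from collections import deque

def reachable(graph, source, target):
    """Breadth-first search over a directed adjacency mapping."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# The trace stops at 192.0.2.2, so the destination is never reached.
broken = {"agent": ["192.0.2.1"], "192.0.2.1": ["192.0.2.2"], "192.0.2.2": []}
healthy = dict(broken)
healthy["192.0.2.2"] = ["203.0.113.10"]

print(reachable(broken, "agent", "203.0.113.10"))
print(reachable(healthy, "agent", "203.0.113.10"))
```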

<p><strong>Path multiplicity.</strong> Modern networks use Equal-Cost Multi-Path (ECMP) routing, meaning multiple shortest paths may exist between two points. ThousandEyes exploits this by performing 3 to 10 parallel path traces per agent, each using a unique TCP source port to encourage the network’s ECMP hash function to select different paths. The resulting graph captures this multiplicity: split paths are rendered with varying line thickness proportional to the number of traces traversing each link.</p>

<p><strong>Branching and convergence.</strong> When multiple agents test the same destination, their paths often diverge near the source and converge near the destination. The graph representation merges convergent hops into shared nodes, producing a tree-like structure that clearly shows where paths overlap and where they diverge — critical for determining whether a problem affects one agent or many.</p>

<p><strong>Cycles (routing loops).</strong> A well-functioning network graph should be acyclic along any given path. When ThousandEyes detects that a packet revisits a previously seen node, it renders a red loop indicator around that node, immediately flagging a routing misconfiguration.</p>
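<p>On a single trace, loop detection reduces to checking whether any responding hop appears twice. The following is a minimal sketch of that idea under the assumption that unresponsive hops are recorded as <code class="language-plaintext highlighter-rouge">None</code>; the function name and addresses are hypothetical, and the platform’s actual detection logic is not described in this article.</p>

```python
def first_repeated_hop(hops):
    """Return the first address seen twice in a trace, or None if there is no loop."""
    seen = set()
    for hop in hops:
        if hop is None:  # unresponsive hop: nothing to compare against
            continue
        if hop in seen:
            return hop
        seen.add(hop)
    return None

clean = ["192.0.2.1", None, "192.0.2.9", "203.0.113.10"]
looping = ["192.0.2.1", "192.0.2.5", "192.0.2.6", "192.0.2.5", "192.0.2.6"]

print(first_repeated_hop(clean))
print(first_repeated_hop(looping))
```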

<hr />

<h2 id="iii-thousandeyes-network-models-and-their-graph-structures">III. ThousandEyes Network Models and Their Graph Structures</h2>

<h3 id="31-path-visualization-model">3.1 Path Visualization Model</h3>

<p>Path Visualization is ThousandEyes’ signature capability and its most direct application of graph theory. It constructs a composite graph from the path trace data collected by all agents testing a given target, rendering the Internet’s routing topology as a navigable, metric-annotated visual.</p>

<p><strong>Graph construction.</strong> Each ThousandEyes agent — whether a Cloud Agent deployed in a public data center, an Enterprise Agent on a customer’s network, or an Endpoint Agent on a user’s device — performs path traces to the test target. The agent sends probe packets with incrementally increasing Time-To-Live (TTL) values. Each intermediate router decrements the TTL; when it reaches zero, the router responds with an ICMP Time Exceeded message, revealing its IP address. This process, repeated until the target responds, produces an ordered sequence of IP addresses — a path.</p>

<p>To discover ECMP routes, each agent performs multiple parallel path traces (3 by default, configurable up to 10) using unique, randomized TCP source ports. Since ECMP hash functions typically incorporate the source port into their path-selection decision, different source ports may yield different paths through the network.</p>
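<p>The effect of varying the source port can be illustrated with a toy ECMP selector. Router hash functions are vendor-specific; CRC32 over a flow 5-tuple is used here purely as a deterministic stand-in, and the function name, addresses, and ports are assumptions.</p>

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, next_hops, proto="tcp"):
    """Pick a next hop by hashing the flow 5-tuple (CRC32 is a stand-in hash)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

next_hops = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]

# Probes that differ only in source port may hash to different next hops.
for src_port in (40001, 40002, 40003):
    hop = ecmp_next_hop("192.0.2.10", "203.0.113.10", src_port, 443, next_hops)
    print(src_port, hop)
```

<p>Because the mapping is deterministic per flow, a fixed source port always follows the same path, which is why several distinct ports are needed to sample the available paths.</p>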

<p>The resulting set of paths from all agents is merged into a single directed graph:</p>

<ol>
  <li><strong>Nodes</strong> are created for each unique IP address observed across all path traces. Nodes are categorized as:
    <ul>
      <li><strong>Agent nodes</strong> (leftmost): The originating ThousandEyes agents.</li>
      <li><strong>Intermediate nodes</strong>: IP addresses of routers along the path, typically belonging to ISPs, transit providers, or cloud fabrics.</li>
      <li><strong>Destination node</strong> (rightmost): The target IP address.</li>
      <li><strong>Blank nodes</strong>: Placeholders for hops that did not respond to probes (rendered as empty circles).</li>
    </ul>
  </li>
  <li>
    <p><strong>Edges</strong> connect consecutive nodes in each observed path. When multiple agents share a common hop, the edges converge at that node, producing a merged graph rather than parallel isolated paths.</p>
  </li>
  <li><strong>Edge attributes</strong> encode performance data:
    <ul>
      <li><strong>Forwarding loss (%)</strong>: Percentage of probes that were dropped at this hop.</li>
      <li><strong>Link delay (ms)</strong>: Estimated minimum transmission delay across this edge.</li>
      <li><strong>Jitter (ms)</strong>: Variability in probe round-trip times.</li>
      <li><strong>DSCP marking</strong>: The Differentiated Services Code Point value observed in returned packets.</li>
      <li><strong>Minimum path MTU (bytes)</strong>: The smallest Maximum Transmission Unit along the path up to this point.</li>
      <li><strong>Trace count</strong>: The number of individual path traces that traversed this edge — rendered as line thickness.</li>
    </ul>
  </li>
  <li><strong>Node attributes</strong> include the IP address, reverse DNS hostname, WHOIS-derived network ownership, autonomous system number, and geographic location.</li>
</ol>
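<p>The merge step described above can be sketched as follows: each trace is an ordered list of hops, shared hops collapse into a single node, and each edge records how many traces crossed it. This is an illustrative simplification with hypothetical names and addresses; it omits per-edge metrics and would need a placeholder identifier for each unresponsive hop.</p>

```python
from collections import Counter

def merge_traces(traces):
    """Merge hop sequences into one directed graph; edge value = trace count."""
    nodes = set()
    edge_counts = Counter()
    for trace in traces:
        nodes.update(trace)
        for u, v in zip(trace, trace[1:]):
            edge_counts[(u, v)] += 1
    return nodes, edge_counts

traces = [
    ["agent-a", "10.0.0.1", "198.51.100.1", "203.0.113.10"],
    ["agent-a", "10.0.0.1", "198.51.100.2", "203.0.113.10"],  # ECMP split
    ["agent-b", "10.1.0.1", "198.51.100.1", "203.0.113.10"],  # converges with agent-a
]
nodes, edge_counts = merge_traces(traces)

for (u, v), count in sorted(edge_counts.items()):
    print(u, "to", v, "traces:", count)
```

<p>The per-edge count is the quantity the article describes as being rendered as line thickness.</p>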

<p><strong>Temporal dimension.</strong> Path Visualization is not a static snapshot. ThousandEyes collects data in discrete test rounds (typically every 2 minutes), and the visualization can be scrubbed across a timeline. This allows engineers to observe how the graph structure changes over time — routes shifting, new hops appearing, existing hops becoming lossy — providing a temporal graph analysis capability.</p>

<h3 id="32-bgp-route-visualization-model">3.2 BGP Route Visualization Model</h3>

<p>While Path Visualization operates at the IP-hop level (Layer 3 forwarding plane), BGP Route Visualization operates at the Autonomous System level (Layer 3 control plane). It models the Internet’s routing topology as an AS-path graph.</p>

<p><strong>Data sources.</strong> ThousandEyes ingests BGP routing data from two categories of monitors:</p>

<ul>
  <li><strong>Public BGP monitors</strong>: eBGP sessions maintained with routers participating in the RIPE Routing Information Service (RIPE-RIS) and the University of Oregon’s RouteViews project, as well as ThousandEyes’ own public BGP collectors. These provide an “outside-in” view — how the global Internet sees a given prefix.</li>
  <li><strong>Private BGP monitors</strong>: Customer-configured multi-hop eBGP sessions between their own BGP speakers and ThousandEyes’ route collectors. These provide an “inside-out” view — how the customer’s own network sees external prefixes.</li>
</ul>

<p><strong>Graph structure.</strong> For a monitored prefix, the BGP Route Visualization constructs a directed graph where:</p>

<ul>
  <li><strong>Nodes</strong> are Autonomous Systems, identified by their ASN and annotated with the organization name (sourced from WHOIS registries, CAIDA, BGP.Tools, APNIC, and RIPE NCC).</li>
  <li><strong>Edges</strong> represent AS-path segments. An edge from AS <em>A</em> to AS <em>B</em> means that <em>B</em> is the next hop in the AS-path as advertised to the monitor. Edge direction follows the AS-path from the monitor toward the origin AS.</li>
  <li><strong>Edge metrics</strong> include: the number of path changes observed in a given time window, reachability percentage (what fraction of the time the prefix was visible via this path), and raw BGP update counts.</li>
</ul>

<p><strong>AS prepending detection.</strong> A common traffic engineering technique is AS-path prepending, where an AS inserts its own ASN multiple times into the AS-path to make a route appear longer and thus less preferred. In the graph, this manifests as a self-loop on a node — the same ASN appearing consecutively in the path. ThousandEyes highlights these prepended segments, allowing engineers to distinguish genuine path lengthening from artificial manipulation.</p>
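<p>As a minimal sketch of the idea, prepending can be found by collapsing consecutive runs of the same ASN in an AS-path. The function name <code class="language-plaintext highlighter-rouge">collapse_prepends</code> and the sample ASNs (taken from the documentation-reserved range) are illustrative assumptions, not ThousandEyes’ implementation.</p>

```python
from itertools import groupby


def collapse_prepends(as_path):
    """Return (deduplicated path, {asn: extra_copies}) for one AS-path."""
    collapsed = []
    prepends = {}
    for asn, run in groupby(as_path):
        run_length = len(list(run))
        collapsed.append(asn)
        if run_length != 1:
            # Every consecutive copy beyond the first is a prepend.
            prepends[asn] = run_length - 1
    return collapsed, prepends


# Hypothetical AS-path, ordered from the monitor toward the origin AS.
path = [64500, 64501, 64502, 64502, 64502]
collapsed, prepends = collapse_prepends(path)
print(collapsed)  # [64500, 64501, 64502]
print(prepends)   # {64502: 2}
```

<p>The collapsed path gives the genuine AS-level topology, while the prepend counts correspond to the self-loops described above.</p>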

<p><strong>RPKI validation layer.</strong> ThousandEyes validates route origins against the Resource Public Key Infrastructure (RPKI) and annotates the graph accordingly. Each prefix-origin pair is marked as <em>Valid</em> (a published Route Origin Authorization, or ROA, authorizes the origin AS for this prefix), <em>Invalid</em> (a ROA covers the prefix, but the announced origin AS or prefix length conflicts with it, suggesting a possible hijack or misconfiguration), or <em>Not Found</em> (no ROA covers this prefix). This turns the AS graph into a security-annotated graph where routing anomalies are immediately visible.</p>
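<p>The three states can be illustrated with a small origin-validation sketch in the spirit of RFC 6811. The ROA tuple layout <code class="language-plaintext highlighter-rouge">(prefix, max_length, asn)</code> and the function name <code class="language-plaintext highlighter-rouge">rpki_state</code> are assumptions made for this example; a real validator works from cryptographically verified ROA data.</p>

```python
import ipaddress


def rpki_state(prefix, origin_asn, roas):
    """Classify a route against ROAs given as (roa_prefix, max_length, asn)."""
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in roas:
        roa_net = ipaddress.ip_network(roa_prefix)
        if route.version != roa_net.version or not route.subnet_of(roa_net):
            continue  # this ROA does not cover the announced prefix
        covered = True
        # A match needs the right origin AS and a prefix no longer than max_length.
        if asn == origin_asn and max_length >= route.prefixlen:
            return "Valid"
    return "Invalid" if covered else "Not Found"


# Hypothetical ROA set using documentation prefixes and ASNs.
roas = [("192.0.2.0/24", 24, 64500)]
print(rpki_state("192.0.2.0/24", 64500, roas))     # Valid
print(rpki_state("192.0.2.0/24", 64501, roas))     # Invalid
print(rpki_state("198.51.100.0/24", 64500, roas))  # Not Found
```

<p>Note that a more-specific announcement such as <code class="language-plaintext highlighter-rouge">192.0.2.0/25</code> from the authorized AS is also Invalid here, because it exceeds the ROA’s maximum length.</p>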

<h3 id="33-device-layer-topology-model">3.3 Device Layer Topology Model</h3>

<p>The Device Layer extends ThousandEyes’ graph modeling inward, mapping an organization’s own network infrastructure.</p>

<p><strong>Discovery algorithm.</strong> Starting from Enterprise Agents deployed within the network, ThousandEyes queries LLDP (Link Layer Discovery Protocol) and CDP (Cisco Discovery Protocol) neighbor tables via SNMP. Each neighbor advertisement reveals a connected device and its interface, providing the raw adjacency data for graph construction. The discovery process crawls outward from the agent in breadth-first order, assembling the network’s Layer-2 topology one neighbor table at a time.</p>

<p><strong>Graph structure.</strong> The resulting graph is an undirected adjacency graph where:</p>

<ul>
  <li><strong>Nodes</strong> represent network devices — routers, switches, firewalls, load balancers, wireless controllers — each rendered with a type-specific icon.</li>
  <li><strong>Edges</strong> represent Layer-2 links between device interfaces.</li>
  <li><strong>Node attributes</strong> are enriched via SNMP polling: device type, firmware version, CPU utilization, memory consumption, interface error counters, and bandwidth utilization per interface.</li>
</ul>

<p><strong>Correlation with Path Visualization.</strong> The Device Layer graph is not isolated — it is correlated with the path trace graph. When an IP address in the Path Visualization corresponds to a discovered device in the Device Layer, the two graphs are linked, allowing engineers to pivot from “this hop has 5% packet loss” to “this hop is interface GigabitEthernet0/1 on switch <code class="language-plaintext highlighter-rouge">core-sw-01</code>, which is currently at 94% CPU.”</p>

<h3 id="34-internet-insights--the-aggregate-network-graph">3.4 Internet Insights — The Aggregate Network Graph</h3>

<p>Internet Insights operates at the largest scale: a global graph of Internet infrastructure derived from the collective measurements of the entire ThousandEyes agent fleet.</p>

<p><strong>Data aggregation.</strong> ThousandEyes agents worldwide — cloud agents, enterprise agents, endpoint agents — collectively perform billions of measurements daily. This data is de-identified (all customer-specific and private-network information is stripped) and aggregated into a global dataset. The result is a graph of Internet provider infrastructure where:</p>

<ul>
  <li><strong>Nodes</strong> represent network Points of Presence (PoPs) for ISPs, CDNs, DNS providers, IaaS platforms, UCaaS services, SECaaS providers, and major SaaS applications.</li>
  <li><strong>Edges</strong> represent observed connectivity between PoPs, derived from the aggregate path trace data.</li>
</ul>

<p><strong>Outage detection as graph cluster analysis.</strong> Internet Insights identifies outages by detecting anomalous clusters within this graph:</p>

<ul>
  <li><strong>Network outages</strong>: Triggered when a concentration of 100% packet-loss events is detected within a single network PoP in a short time frame. The algorithm continuously monitors lossy interfaces across all networks and PoPs, maintaining baselines for normal loss levels. When loss events significantly exceed the baseline within a PoP, the algorithm classifies the event as an outage, estimating its scope (how many PoPs are affected) and scale (how many vantage points are impacted).</li>
  <li><strong>Application outages</strong>: Triggered when multiple globally distributed vantage points simultaneously fail to reach an application’s servers or receive error responses. The requirement for multi-vantage-point confirmation ensures that isolated agent-side issues are not misclassified as provider outages.</li>
</ul>

<p><strong>Geographical and topological views.</strong> The outage graph is rendered both geographically (outages superimposed on a world map) and topologically (outages shown in context of the provider’s network structure), allowing engineers to quickly assess scope and impact.</p>

<h3 id="35-cloud-and-sd-wan-enriched-models">3.5 Cloud and SD-WAN Enriched Models</h3>

<p>ThousandEyes augments its base graph models with enrichment layers for cloud and SD-WAN environments:</p>

<p><strong>Cloud network enrichment.</strong> In collaboration with AWS, Azure, and GCP, ThousandEyes maps IP addresses in the path graph to specific cloud services, regions, and availability zones. A raw IP node like <code class="language-plaintext highlighter-rouge">52.93.178.12</code> is annotated as “AWS S3, us-east-1.” For AWS Global Accelerator targets, the platform compares observed TCP latency against expected latency benchmarks, providing a deviation metric directly on the enriched node.</p>

<p><strong>SD-WAN overlay/underlay dual-layer graph.</strong> For organizations using Cisco SD-WAN or Meraki MX, ThousandEyes constructs a two-layer graph model. The <strong>overlay graph</strong> shows the logical SD-WAN tunnel paths between branch sites and application endpoints. The <strong>underlay graph</strong> shows the physical network paths those tunnels traverse — through ISPs, MPLS circuits, or direct Internet paths. By correlating performance metrics across both layers, engineers can determine whether a problem is in the overlay configuration or the underlay transport.</p>

<p><strong>Meraki enrichment.</strong> When integrated with Meraki environments, path visualization nodes within the Meraki network are enriched with the hosting network name, MX appliance name, connected client count, and WAN application score — providing campus/branch context directly within the graph.</p>

<hr />

<h2 id="iv-algorithms-and-computational-methods">IV. Algorithms and Computational Methods</h2>

<h3 id="41-path-discovery-and-traversal">4.1 Path Discovery and Traversal</h3>

<p>The foundation of ThousandEyes’ path graph is the <strong>TTL-incrementing probe algorithm</strong> — an engineered variant of traceroute optimized for multi-agent, multi-path environments.</p>

<p><strong>Basic mechanism.</strong> The agent estimates the path distance to the target and then sends probe packets with incrementally increasing TTL values, starting from TTL=1. Each router along the path decrements the TTL; the router at which the TTL reaches zero discards the probe and responds with an ICMP Time Exceeded message whose source address identifies that router. The agent records the responding IP and its round-trip time, then sends the next probe with the TTL increased by one. The process terminates when a response from the target itself is received or when a maximum TTL is reached without a response. Hops that never respond are rendered as blank nodes.</p>

<p><strong>Multi-path discovery.</strong> To detect ECMP routes, each agent performs 3 parallel path traces by default (configurable up to 10). Each trace uses a unique, randomized TCP source port. Because most ECMP implementations hash on the 5-tuple (source IP, destination IP, source port, destination port, protocol), varying the source port encourages the network to select different forwarding paths. The resulting set of paths is merged into the composite graph, with split paths rendered as branches and their relative usage indicated by edge thickness.</p>
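<p>The merge step can be sketched as counting, for every directed edge, how many traces traversed it. The trace data and the function name <code class="language-plaintext highlighter-rouge">merge_traces</code> below are hypothetical; the resulting counts play the role of the edge thickness described above.</p>

```python
from collections import Counter


def merge_traces(traces):
    """Merge hop sequences into directed edges with per-edge trace counts."""
    edge_counts = Counter()
    for hops in traces:
        # Count each edge once per trace, even if a trace revisits it.
        edges = set(zip(hops, hops[1:]))
        edge_counts.update(edges)
    return edge_counts


# Three hypothetical traces from one agent; two of them share the same path.
traces = [
    ["agent", "10.0.0.1", "198.51.100.1", "target"],
    ["agent", "10.0.0.1", "198.51.100.2", "target"],
    ["agent", "10.0.0.1", "198.51.100.1", "target"],
]
graph = merge_traces(traces)
print(graph[("agent", "10.0.0.1")])         # 3
print(graph[("10.0.0.1", "198.51.100.1")])  # 2
print(graph[("10.0.0.1", "198.51.100.2")])  # 1
```

<p>A node with more than one outgoing edge in the merged result marks a point where the paths split.</p>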

<p><strong>Protocol selection.</strong> Agents support both TCP and ICMP-based path tracing:</p>
<ul>
  <li><strong>TCP mode</strong>: Sends TCP SYN packets; expects SYN+ACK or RST from the target. Preferred for targets behind firewalls that may drop ICMP.</li>
  <li><strong>ICMP mode</strong>: Sends ICMP Echo Request packets; expects Echo Reply from the target. Useful when TCP ports are filtered.</li>
</ul>

<p><strong>Bidirectional tracing.</strong> In Agent-to-Agent tests, both endpoints perform independent path traces toward each other. This produces two directed graphs — source-to-target and target-to-source — which often differ due to asymmetric routing. The visualization allows toggling between these views, providing complete bidirectional path visibility.</p>

<p><strong>Continuous high-frequency probing.</strong> For tests configured with 1-minute intervals, ThousandEyes sends one probe per second over the entire interval (rather than a burst at the start). This continuous sampling captures intermittent loss events that burst-based probing might miss, and the results are rendered as a sparkline visualization showing per-second packet drop patterns.</p>

<h3 id="42-shortest-path-and-latency-analysis">4.2 Shortest Path and Latency Analysis</h3>

<p>While ThousandEyes does not run Dijkstra’s algorithm on the path graph in the classical sense (it observes actual forwarding paths rather than computing optimal ones), the platform performs analogous weighted-graph analysis:</p>

<p><strong>End-to-end latency.</strong> The total latency from agent to target is measured via TCP or ICMP round-trip time. In graph terms, this is the total weight of the path the traffic actually took, which is not necessarily the minimum-weight (latency-optimal) path, because Internet routing is driven by policy rather than by latency.</p>

<p><strong>Per-hop delay estimation.</strong> ThousandEyes estimates the transmission delay across each individual link by measuring the round-trip time to consecutive hops and computing the differential. This isolates each edge’s latency contribution, enabling engineers to identify the specific link responsible for latency spikes — analogous to computing edge weights in a weighted graph and finding the maximum-weight edge.</p>
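<p>A minimal sketch of the differential calculation, assuming the input is an ordered list of <code class="language-plaintext highlighter-rouge">(hop, rtt_ms)</code> pairs. The values are invented, and the clamp at zero is an assumption added here to handle hops that answer probes slowly.</p>

```python
def link_delays(hop_rtts_ms):
    """Estimate per-link delay as the difference between consecutive hop RTTs."""
    delays = []
    previous = 0.0
    for hop, rtt in hop_rtts_ms:
        # Clamp at zero: a nearer hop can report a higher RTT than a farther
        # one when it is slow to generate ICMP replies.
        delays.append((hop, max(rtt - previous, 0.0)))
        previous = rtt
    return delays


def slowest_link(hop_rtts_ms):
    """Return the (hop, delay) pair with the largest estimated link delay."""
    return max(link_delays(hop_rtts_ms), key=lambda item: item[1])


# Hypothetical round-trip times measured to each hop along one path.
hops = [("10.0.0.1", 1.0), ("198.51.100.1", 9.0),
        ("203.0.113.7", 40.0), ("target", 42.0)]
print(link_delays(hops))
# [('10.0.0.1', 1.0), ('198.51.100.1', 8.0), ('203.0.113.7', 31.0), ('target', 2.0)]
print(slowest_link(hops))  # ('203.0.113.7', 31.0)
```

<p>Here the link leading into <code class="language-plaintext highlighter-rouge">203.0.113.7</code> contributes 31 ms of the 42 ms total, making it the maximum-weight edge.</p>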

<p><strong>Benchmark comparison.</strong> For cloud-enriched nodes, the platform compares observed latency against provider-published benchmarks. For example, for AWS Global Accelerator targets, ThousandEyes compares the measured TCP connection time against AWS’s expected latency for that region, flagging deviations that indicate network-layer problems rather than application-layer issues.</p>

<h3 id="43-centrality-and-critical-node-identification">4.3 Centrality and Critical Node Identification</h3>

<p>Graph centrality measures, while not labeled as such in the ThousandEyes interface, underpin several key diagnostic capabilities:</p>

<p><strong>Betweenness centrality (shared-hop analysis).</strong> When multiple agents test the same destination, their paths often converge at shared intermediate hops. A node that appears on the paths of many agents has high betweenness centrality in the test graph. If that node begins dropping packets, the impact is proportionally larger — affecting all agents whose paths traverse it. ThousandEyes’ visualization makes this immediately apparent: high-betweenness nodes sit at convergence points in the graph, and packet loss on those nodes is visible to every affected agent simultaneously.</p>

<p><strong>Cut vertices (single points of failure).</strong> A node whose removal would disconnect one or more agents from the destination is a <em>cut vertex</em> in graph-theoretic terms — a single point of failure. ThousandEyes’ path graph reveals these implicitly: if all agent paths funnel through a single intermediate node before reaching the destination, that node is a cut vertex. Identifying these nodes is critical for resilience planning.</p>
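<p>Both ideas can be approximated directly from observed paths. The sketch below counts how many agents traverse each hop (a simple proxy for betweenness, not the formal measure) and finds hops present on every observed path of an agent. All names and paths are hypothetical.</p>

```python
from collections import Counter


def shared_hop_counts(paths_by_agent):
    """Count how many agents have at least one path through each hop."""
    counts = Counter()
    for paths in paths_by_agent.values():
        hops = set()
        for path in paths:
            hops.update(path[1:-1])  # skip the agent and the destination
        counts.update(hops)
    return counts


def chokepoints(paths_by_agent):
    """Map each agent to the hops found on every one of its observed paths."""
    result = {}
    for agent, paths in paths_by_agent.items():
        common = set(paths[0][1:-1])
        for path in paths[1:]:
            common.intersection_update(path[1:-1])
        result[agent] = common
    return result


# Hypothetical observed paths: agent-a has two ECMP paths, agent-b has one.
paths = {
    "agent-a": [["agent-a", "r1", "r3", "dst"], ["agent-a", "r2", "r3", "dst"]],
    "agent-b": [["agent-b", "r4", "r3", "dst"]],
}
print(shared_hop_counts(paths)["r3"])          # 2
print(sorted(chokepoints(paths)["agent-a"]))   # ['r3']
print(sorted(chokepoints(paths)["agent-b"]))   # ['r3', 'r4']
```

<p>Hop <code class="language-plaintext highlighter-rouge">r3</code> is shared by both agents and sits on every path, so it is the single point of failure in this example. This only reflects paths that were actually observed; unobserved backup routes would not appear.</p>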

<p><strong>Loss attribution.</strong> When end-to-end loss is detected, the question is <em>which node or link is responsible?</em> ThousandEyes performs per-hop loss analysis by comparing the probe response rate at consecutive hops. If hop <em>n</em> responds to 100% of probes but hop <em>n+1</em> responds to only 95%, the 5% forwarding loss is attributed to hop <em>n+1</em> or to the link between the two hops. This is visualized as a red circle around the lossy node and a red-colored link, immediately drawing attention to the responsible edge in the graph.</p>
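<p>The consecutive-hop comparison can be sketched as follows, assuming per-hop response rates expressed as integer percentages. The function name and data are illustrative only.</p>

```python
def attribute_loss(response_rates):
    """Return (previous_hop, hop, loss_pct) wherever the response rate drops."""
    findings = []
    pairs = zip(response_rates, response_rates[1:])
    for (prev_hop, prev_rate), (hop, rate) in pairs:
        drop = prev_rate - rate
        if drop > 0:
            # The drop is attributed to this hop or the link leading into it.
            findings.append((prev_hop, hop, drop))
    return findings


# Hypothetical response rates (percent of probes answered) per hop.
rates = [("r1", 100), ("r2", 100), ("r3", 95), ("r4", 95)]
print(attribute_loss(rates))  # [('r2', 'r3', 5)]
```

<p>One caveat of this simple form: a router that rate-limits its ICMP replies can show a low response rate while still forwarding traffic normally, so a drop that does not persist at later hops should be treated with suspicion.</p>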

<h3 id="44-clustering-and-outage-detection-internet-insights">4.4 Clustering and Outage Detection (Internet Insights)</h3>

<p>Internet Insights’ outage detection is, at its core, a <strong>spatial clustering algorithm</strong> applied to a global graph of Internet measurement data.</p>

<p><strong>Collective intelligence aggregation.</strong> The input dataset is extraordinary in scale: billions of daily measurements from ThousandEyes agents deployed across thousands of networks worldwide. Before aggregation, all data is de-identified — customer identifiers and private-network information are stripped. The remaining data consists of tuples: <em>(agent_network, intermediate_hop_IP, hop_network, hop_PoP, loss_flag, timestamp)</em>.</p>

<p><strong>PoP-level cluster detection.</strong> The algorithm groups loss events by network and PoP. For each PoP, it maintains a rolling baseline of normal loss-event frequency. When the observed frequency of 100% packet-loss events within a PoP exceeds the baseline by a statistically significant margin within a short time window, the algorithm triggers an outage alert. The outage’s <strong>scope</strong> is determined by the number of distinct PoPs within the same network that are simultaneously affected. Its <strong>scale</strong> is determined by the number of distinct agent vantage points and customer tests that are impacted.</p>
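<p>The text above does not specify the statistical test used, so the following is only a stand-in: a z-score of the current loss-event count against a rolling baseline for one PoP. The threshold, the floor on the standard deviation, and the function name are all assumptions.</p>

```python
from statistics import mean, stdev


def is_outage(baseline_counts, observed_count, z_threshold=3.0):
    """Flag a PoP when its current loss-event count far exceeds the baseline."""
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    # Floor sigma at 1.0 (an assumption) so a very flat baseline does not
    # turn a single extra event into an alert.
    z_score = (observed_count - mu) / max(sigma, 1.0)
    return z_score >= z_threshold


# Hypothetical counts of 100%-loss events per time window for one PoP.
baseline = [2, 3, 1, 2, 2, 3, 2, 1]
print(is_outage(baseline, 14))  # True
print(is_outage(baseline, 3))   # False
```

<p>Scope and scale would then be computed separately, by counting how many PoPs in the same network are flagged at once and how many vantage points observed the loss.</p>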

<p><strong>Application outage inference.</strong> For SaaS and cloud applications, the algorithm applies a similar clustering approach at the application level. When multiple globally distributed agents simultaneously fail to receive valid responses from an application’s endpoints — and these failures correlate across independent networks and geographies — the algorithm infers an application-level outage. The multi-vantage-point requirement is critical: it prevents false positives from agent-side or local-network issues.</p>

<p><strong>Correlation with customer tests.</strong> Detected outages are automatically correlated with each ThousandEyes customer’s own test data. If a customer’s test to Salesforce shows degradation at the same time Internet Insights detects a Salesforce outage, the platform links the two, enabling the customer to immediately determine that the problem is external — not in their own network.</p>

<h3 id="45-bgp-routing-algorithms">4.5 BGP Routing Algorithms</h3>

<p>ThousandEyes applies several specialized algorithms to its BGP data:</p>

<p><strong>Reachability monitoring.</strong> For each monitored prefix, the platform tracks what percentage of BGP monitors can see a valid route. A drop in reachability — visible as a declining metric on the timeline — indicates that the prefix is being withdrawn from portions of the global routing table. The algorithm correlates reachability drops across monitors to distinguish localized issues (one monitor loses the route) from widespread events (many monitors simultaneously lose it).</p>

<p><strong>Path change detection.</strong> The algorithm continuously compares the current AS-path for each prefix against the previously observed AS-path. Any change — a new transit AS inserted, an existing AS removed, a path lengthened or shortened — triggers a path-change event. Rapid oscillation in AS-paths (route flapping) is flagged as a stability concern.</p>
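<p>A minimal sketch of this comparison, assuming observations arrive as <code class="language-plaintext highlighter-rouge">(timestamp, as_path)</code> pairs in time order. The flap threshold is an arbitrary illustrative value.</p>

```python
def path_changes(observations):
    """Return (timestamp, old_path, new_path) for every AS-path change."""
    changes = []
    previous = None
    for timestamp, as_path in observations:
        current = tuple(as_path)
        if previous is not None and current != previous:
            changes.append((timestamp, previous, current))
        previous = current
    return changes


def is_flapping(observations, max_changes=3):
    """Treat more than max_changes changes in the window as route flapping."""
    return len(path_changes(observations)) > max_changes


# Hypothetical observations for one prefix, timestamps in seconds.
obs = [
    (0, [64500, 64501, 64510]),
    (120, [64500, 64501, 64510]),
    (240, [64500, 64502, 64510]),
    (360, [64500, 64501, 64510]),
]
print(len(path_changes(obs)))           # 2
print(is_flapping(obs, max_changes=1))  # True
```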

<p><strong>Route hijack and leak detection.</strong> A BGP hijack occurs when an unauthorized AS announces a prefix it does not own, diverting traffic. A leak occurs when a route is propagated beyond its intended scope. ThousandEyes detects these by:</p>
<ol>
  <li>Monitoring for new origin ASes appearing for a prefix (potential hijack).</li>
  <li>Checking origin AS authorization against RPKI ROAs (an Invalid RPKI status is a strong hijack indicator).</li>
  <li>Detecting unexpected AS-paths that suggest a route is being propagated through unintended transit networks (potential leak).</li>
</ol>
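<p>The first check in the list, spotting a new origin AS, reduces to a set comparison. This sketch assumes each AS-path is ordered from the monitor toward the origin, so the origin is the last element; the expected-origin set is something the operator would supply.</p>

```python
def new_origins(expected_origins, observed_paths):
    """Return origin ASNs seen in observed AS-paths but not in the expected set."""
    seen = {path[-1] for path in observed_paths if path}
    return seen - set(expected_origins)


# Hypothetical AS-paths for one prefix whose legitimate origin is AS64510.
observed = [
    [64500, 64501, 64510],
    [64500, 64511],
]
print(new_origins({64510}, observed))  # {64511}
```

<p>An unexpected origin is only a candidate hijack. Combining it with the RPKI status, as in the second check, helps separate hijacks from legitimate changes such as a newly added upstream.</p>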

<p><strong>Stuck route detection.</strong> BGP “zombie” routes are routes that persist in routing tables despite having been withdrawn by the origin. ThousandEyes’ Stuck Route Observatory identifies these by comparing the routes seen by monitors against the routes the origin AS is actively advertising. Discrepancies indicate stuck routes, which can cause persistent reachability issues.</p>

<p><strong>Penalty algorithm.</strong> ThousandEyes employs a penalty-based algorithm to handle BGP monitor data quality issues. When a monitor misses expected updates or exhibits anomalous behavior, the algorithm assigns penalty scores and, above a threshold, triggers corrective actions such as excluding the monitor from aggregate calculations until it stabilizes.</p>

<h3 id="46-topology-discovery-device-layer">4.6 Topology Discovery (Device Layer)</h3>

<p>The Device Layer’s graph construction algorithm is a <strong>breadth-first crawl</strong> of the network’s neighbor tables:</p>

<ol>
  <li><strong>Seed nodes</strong>: Enterprise Agents serve as the starting points. The agent queries its local network for directly connected devices via SNMP.</li>
  <li><strong>Neighbor table crawl</strong>: For each discovered device, ThousandEyes reads its LLDP and CDP neighbor tables via SNMP, revealing adjacent devices and their connecting interfaces.</li>
  <li><strong>Recursive expansion</strong>: Newly discovered devices are queried in turn, and their neighbors are added to the graph. The process continues until no new devices are found or the configured discovery scope is exhausted.</li>
  <li><strong>Graph assembly</strong>: The collected adjacency data is assembled into an undirected graph. Duplicate edges (device A reports device B as neighbor; device B reports device A as neighbor) are deduplicated.</li>
  <li><strong>SNMP enrichment</strong>: Each device node is polled for health metrics — CPU, memory, interface errors, bandwidth — which are overlaid as node attributes in the topology visualization.</li>
</ol>

<p>The result is a Layer-2 topology map that can be correlated with the Layer-3 Path Visualization graph, bridging the gap between logical forwarding paths and physical device infrastructure.</p>
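<p>The crawl and deduplication steps can be sketched with a standard breadth-first search. Here <code class="language-plaintext highlighter-rouge">get_neighbors</code> is a hypothetical callback standing in for the SNMP read of a device’s LLDP/CDP tables, and scope limits and health polling are left out.</p>

```python
from collections import deque


def crawl_topology(seeds, get_neighbors):
    """Breadth-first crawl returning (devices, undirected links)."""
    visited = set(seeds)
    queue = deque(seeds)
    links = set()
    while queue:
        device = queue.popleft()
        for neighbor in get_neighbors(device):
            # frozenset makes A-B and B-A the same undirected link,
            # which deduplicates mutual neighbor reports.
            links.add(frozenset((device, neighbor)))
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited, links


# Hypothetical neighbor tables keyed by device name.
tables = {
    "agent": ["sw1"],
    "sw1": ["agent", "sw2", "rtr1"],
    "sw2": ["sw1"],
    "rtr1": ["sw1"],
}
devices, links = crawl_topology(["agent"], lambda d: tables.get(d, []))
print(sorted(devices))  # ['agent', 'rtr1', 'sw1', 'sw2']
print(len(links))       # 3
```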

<hr />

<h2 id="v-graph-simplification-and-visualization-techniques">V. Graph Simplification and Visualization Techniques</h2>

<p>Raw network graphs — especially those spanning the global Internet — can contain hundreds of nodes and thousands of edges. ThousandEyes employs several graph-reduction and visual-encoding techniques to make these graphs interpretable:</p>

<h3 id="51-interface-grouping">5.1 Interface Grouping</h3>

<p>A single physical router may have dozens of IP addresses (one per interface). In a raw path trace, each interface appears as a separate node, inflating the graph and obscuring the actual topology. ThousandEyes’ interface grouping collapses multiple IPs belonging to the same device into a single node, producing a graph that more accurately represents the physical network. Grouping is configurable:</p>

<ul>
  <li><strong>By IP address</strong>: No grouping; each IP is a distinct node (maximum detail).</li>
  <li><strong>By device</strong>: IPs on the same device are merged (inferred from rDNS and WHOIS data).</li>
  <li><strong>By network</strong>: All IPs within the same AS/network are merged into a single node.</li>
  <li><strong>By network + location</strong>: Network-level grouping further subdivided by geographic location.</li>
  <li><strong>By geography</strong>: All nodes in the same geographic area are merged.</li>
</ul>
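<p>As a rough illustration of grouping, the sketch below collapses consecutive hops of a single path that share a grouping key. The hop dictionaries and their <code class="language-plaintext highlighter-rouge">network</code> field are assumed for the example; merging across a whole multi-path graph would apply the same key to every node.</p>

```python
def group_path(hops, group_key):
    """Collapse consecutive hops that share a grouping key into one node."""
    grouped = []
    for hop in hops:
        key = group_key(hop)
        if not grouped or grouped[-1] != key:
            grouped.append(key)
    return grouped


# Hypothetical hops annotated with the network that owns each IP.
hops = [
    {"ip": "10.0.0.1", "network": "AS64500"},
    {"ip": "198.51.100.1", "network": "AS64501"},
    {"ip": "198.51.100.9", "network": "AS64501"},
    {"ip": "203.0.113.5", "network": "AS64502"},
]
print(group_path(hops, lambda h: h["ip"]))
# ['10.0.0.1', '198.51.100.1', '198.51.100.9', '203.0.113.5']
print(group_path(hops, lambda h: h["network"]))
# ['AS64500', 'AS64501', 'AS64502']
```

<p>Switching the key function is the whole difference between the grouping modes: the IP key keeps maximum detail, while the network key reduces four nodes to three.</p>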

<h3 id="52-complexity-controls">5.2 Complexity Controls</h3>

<p>A slider control allows users to progressively hide intermediate hops. At maximum complexity, every discovered hop is visible. As the slider is reduced, core-Internet hops — those deep within transit provider backbones — are collapsed into dotted lines annotated with the number of hidden hops. This focuses attention on the edges of the path: the agent’s local network and the destination’s network, which are most likely to contain the root cause of a problem.</p>
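<p>One way to picture the slider is as a function that keeps a fixed number of hops at each end of the path and replaces the interior with a placeholder. The parameter <code class="language-plaintext highlighter-rouge">keep_each_end</code> is a hypothetical stand-in for the slider position.</p>

```python
def collapse_middle(hops, keep_each_end):
    """Hide interior hops, keeping keep_each_end hops at both ends."""
    hidden = len(hops) - 2 * keep_each_end
    if hidden >= 1:
        head = hops[:keep_each_end]
        tail = hops[len(hops) - keep_each_end:]
        return head + [f"{hidden} hops hidden"] + tail
    return list(hops)  # nothing left to hide


# Hypothetical nine-hop path.
hops = ["h1", "h2", "h3", "h4", "h5", "h6", "h7", "h8", "h9"]
print(collapse_middle(hops, 2))  # ['h1', 'h2', '5 hops hidden', 'h8', 'h9']
print(collapse_middle(hops, 5) == hops)  # True
```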

<h3 id="53-performance-color-encoding">5.3 Performance Color Encoding</h3>

<p>Node and link colors in the graph are driven by metric values:</p>

<table>
  <thead>
    <tr>
      <th>Color</th>
      <th>Meaning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Dark green</strong></td>
      <td>Healthy — 0% loss, low latency</td>
    </tr>
    <tr>
      <td><strong>Yellow/Orange</strong></td>
      <td>Degraded — moderate loss or elevated latency</td>
    </tr>
    <tr>
      <td><strong>Red</strong></td>
      <td>Critical — high loss, extreme latency, or link failure</td>
    </tr>
    <tr>
      <td><strong>Red circle</strong> around a node</td>
      <td>Packet loss detected at this hop</td>
    </tr>
    <tr>
      <td><strong>Red link</strong></td>
      <td>High delay on this segment</td>
    </tr>
    <tr>
      <td><strong>Red loop</strong> around a node</td>
      <td>Routing loop detected</td>
    </tr>
  </tbody>
</table>

<p>This encoding transforms the graph into a heat map: a healthy network appears as a green flow from left to right, while problems appear as red “hot spots” that an engineer can immediately zoom into.</p>

<h3 id="54-split-path-and-collapsed-path-rendering">5.4 Split-Path and Collapsed-Path Rendering</h3>

<ul>
  <li><strong>Split paths</strong>: When ECMP or policy routing causes traffic to take multiple routes, the graph branches at the divergence point. Each branch’s line thickness is proportional to the number of traces that traversed it, indicating the load distribution across paths.</li>
  <li><strong>Collapsed paths</strong>: When complexity controls hide intermediate hops, the hidden segment is rendered as a dotted line with a numeric annotation (e.g., “5 hops hidden”), preserving awareness of the path’s true length without cluttering the visualization.</li>
</ul>

<h3 id="55-cloud-provider-annotation">5.5 Cloud Provider Annotation</h3>

<p>For paths traversing AWS, Azure, or GCP infrastructure, ThousandEyes replaces raw IP nodes with enriched labels showing the cloud service name, region, and availability zone. Nodes display the cloud provider’s icon, and a verified-information badge indicates that the enrichment data was confirmed by the cloud provider. This transforms opaque IP addresses into meaningful infrastructure context directly within the graph.</p>

<hr />

<h2 id="vi-network-resilience-and-fault-analysis-through-graph-theory">VI. Network Resilience and Fault Analysis Through Graph Theory</h2>

<p>The graph models constructed by ThousandEyes enable several categories of resilience analysis that map directly to classical graph-theoretic problems:</p>

<h3 id="61-outage-impact-assessment">6.1 Outage Impact Assessment</h3>

<p>When a provider announces an outage — or Internet Insights detects one — the immediate question is: <em>how does this affect my services?</em> This is a <strong>graph reachability</strong> problem: given that node <em>X</em> (the failed PoP) is removed from the graph, which agent-to-destination paths are severed? ThousandEyes answers this by correlating Internet Insights outage data with customer test data, automatically identifying which tests traverse the affected nodes.</p>
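<p>Framed as code, the reachability question becomes a check of which observed paths contain a failed node. In this sketch a test is "severed" when all of its paths cross a failed node and "partial" when only some do; the labels, test names, and paths are hypothetical.</p>

```python
def outage_impact(test_paths, failed_nodes):
    """Classify each test as 'severed' or 'partial' given failed nodes."""
    failed = set(failed_nodes)
    impact = {}
    for test, paths in test_paths.items():
        hit = [bool(failed.intersection(path)) for path in paths]
        if hit and all(hit):
            impact[test] = "severed"   # every observed path crosses a failure
        elif any(hit):
            impact[test] = "partial"   # at least one unaffected path remains
    return impact


# Hypothetical observed paths per test.
test_paths = {
    "crm-web": [["agent-1", "r1", "pop-x", "dst"]],
    "mail": [["agent-1", "r1", "pop-x", "dst"], ["agent-1", "r2", "pop-y", "dst"]],
    "wiki": [["agent-2", "r3", "pop-z", "dst"]],
}
print(outage_impact(test_paths, ["pop-x"]))
# {'crm-web': 'severed', 'mail': 'partial'}
```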

<h3 id="62-cascade-analysis">6.2 Cascade Analysis</h3>

<p>A failure in one part of the network graph can propagate. If a Tier-1 transit provider’s backbone link fails, traffic is rerouted through alternative paths, potentially overloading those paths and causing secondary failures. ThousandEyes’ temporal path visualization captures these cascades: engineers can observe the graph structure before, during, and after a failure event, watching paths shift, latency increase on alternative routes, and — in severe cases — loss appear on previously healthy paths.</p>

<h3 id="63-redundancy-validation">6.3 Redundancy Validation</h3>

<p>A resilient network architecture requires <strong>node-disjoint paths</strong>: multiple independent routes between critical endpoints that share no intermediate hop (edge-disjoint paths alone are not enough, because they can still cross at a common router). ThousandEyes’ multi-path discovery helps verify this: if all traces from an agent converge on a single intermediate hop, that hop is a single point of failure regardless of how many ISPs the organization has contracted. The path graph makes this visible immediately, enabling engineers to validate that their multi-homed or multi-cloud architecture actually provides the expected redundancy.</p>

<h3 id="64-ddos-mitigation-validation">6.4 DDoS Mitigation Validation</h3>

<p>During a DDoS attack, traffic is typically rerouted through a scrubbing center via BGP announcements. ThousandEyes provides two graph-level views of this process:</p>
<ol>
  <li><strong>BGP Route Visualization</strong> shows the AS-path change as the scrubbing center’s AS is inserted into the path.</li>
  <li><strong>Path Visualization</strong> shows the actual forwarding-plane change: traffic now routes through the scrubbing center’s IP infrastructure.</li>
</ol>

<p>By monitoring both graphs during an attack, engineers can verify that mitigation is active, measure the latency overhead introduced by scrubbing, and confirm that clean traffic is being properly re-injected to the origin.</p>

<h3 id="65-sla-enforcement-and-vendor-comparison">6.5 SLA Enforcement and Vendor Comparison</h3>

<p>Internet Insights tracks outage history per provider, building a longitudinal record of reliability data. This enables:</p>
<ul>
  <li><strong>SLA enforcement</strong>: Quantifying a provider’s actual availability against contractual commitments, backed by concrete telemetry rather than the provider’s own reporting.</li>
  <li><strong>Vendor evaluation</strong>: Comparing the outage frequency, duration, and scope of competing providers using graph-derived metrics, supporting data-driven procurement decisions.</li>
</ul>

<hr />

<h2 id="vii-data-access-and-programmatic-graph-analysis">VII. Data Access and Programmatic Graph Analysis</h2>

<h3 id="71-thousandeyes-api">7.1 ThousandEyes API</h3>

<p>The ThousandEyes REST API exposes the platform’s graph data programmatically, enabling custom analysis, integration, and automation:</p>

<p><strong>Path trace endpoints.</strong> The API returns detailed path trace data for each test round, including the ordered sequence of hops (nodes), their IP addresses, network ownership, geographic location, and per-hop metrics (loss, latency, delay, DSCP, MTU). This data can be consumed as a node-and-edge list for reconstruction in external graph analysis tools.</p>

<p><strong>Network end-to-end endpoints.</strong> Aggregate metrics — agent-to-target loss, latency, jitter, and bandwidth — are available as time-series data, enabling trend analysis and long-term performance tracking.</p>

<p><strong>BGP endpoints.</strong> The API provides AS-path data, reachability metrics, update counts, and RPKI validation status for each monitored prefix, enabling programmatic BGP graph construction and analysis.</p>

<p><strong>Export formats.</strong> API responses are JSON-structured, with node and edge data that maps directly to adjacency-list representations suitable for import into graph analysis libraries.</p>

<h3 id="72-integration-with-observability-platforms">7.2 Integration with Observability Platforms</h3>

<p>ThousandEyes’ graph data feeds into broader observability ecosystems:</p>

<ul>
  <li><strong>Splunk</strong>: The Cisco ThousandEyes App for Splunk streams test data, outage events, and activity logs into Splunk dashboards. This enables correlation of ThousandEyes graph data with logs, metrics, and traces from other sources, providing a unified view of infrastructure health.</li>
  <li><strong>OpenTelemetry</strong>: ThousandEyes supports streaming BGP metrics via the OpenTelemetry protocol, allowing integration with any OTel-compatible backend (Grafana, Datadog, New Relic, etc.).</li>
  <li><strong>Webhooks and ServiceNow</strong>: Alert-driven integrations push graph events (outages, path changes, loss thresholds) to incident management systems, triggering automated workflows.</li>
  <li><strong>Splunk AppDynamics</strong>: Combining application performance monitoring with ThousandEyes’ network graph provides end-to-end visibility from application code to network path.</li>
</ul>

<h3 id="73-custom-graph-analysis-workflows">7.3 Custom Graph Analysis Workflows</h3>

<p>Engineers who need analysis beyond the built-in visualizations can leverage the API to build custom workflows:</p>

<ul>
  <li><strong>Graph library import</strong>: Export path trace data and import into Python’s NetworkX or R’s igraph for advanced graph-theoretic computations — centrality measures, community detection, minimum cut analysis, etc.</li>
  <li><strong>Topology diffing</strong>: By querying the API at regular intervals and comparing successive graph snapshots, engineers can detect structural changes — new hops appearing, existing hops disappearing, path lengths changing — and trigger automated alerts on topology drift.</li>
  <li><strong>Custom dashboards</strong>: API data can feed into Grafana, Tableau, or custom web applications for tailored graph visualizations that match specific operational requirements.</li>
</ul>
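<p>Topology diffing in particular needs very little code once two snapshots have been reduced to edge sets. The sketch below assumes each snapshot is an iterable of <code class="language-plaintext highlighter-rouge">(from_hop, to_hop)</code> tuples already extracted from API responses; fetching and parsing those responses is not shown.</p>

```python
def diff_topology(old_edges, new_edges):
    """Return edges added and removed between two graph snapshots."""
    old, new = set(old_edges), set(new_edges)
    return {"added": new - old, "removed": old - new}


# Hypothetical snapshots from two consecutive test rounds.
before = [("agent", "r1"), ("r1", "r2"), ("r2", "dst")]
after = [("agent", "r1"), ("r1", "r9"), ("r9", "dst")]
drift = diff_topology(before, after)
print(sorted(drift["added"]))    # [('r1', 'r9'), ('r9', 'dst')]
print(sorted(drift["removed"]))  # [('r1', 'r2'), ('r2', 'dst')]
```

<p>A non-empty result on either side is a natural trigger for a topology-drift alert.</p>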

<hr />

<h2 id="viii-ai-powered-graph-intelligence">VIII. AI-Powered Graph Intelligence</h2>

<p>Since 2025, Cisco has been layering AI capabilities on top of ThousandEyes’ graph-derived telemetry:</p>

<h3 id="81-cisco-ai-assistant">8.1 Cisco AI Assistant</h3>

<p>The Cisco AI Assistant, integrated into the ThousandEyes interface, is trained on network telemetry data and test configurations. It can:</p>
<ul>
  <li>Analyze path visualization data in real time and provide natural-language root-cause summaries.</li>
  <li>Identify which graph nodes are contributing to degradation without requiring the engineer to manually inspect each hop.</li>
  <li>Correlate graph anomalies across multiple tests and time windows, surfacing patterns that might not be apparent from a single graph view.</li>
</ul>

<h3 id="82-wan-insights">8.2 WAN Insights</h3>

<p>WAN Insights applies statistical models to SD-WAN telemetry graphs, producing <strong>predictive routing recommendations</strong>. By analyzing historical patterns in the overlay/underlay graph — latency trends, loss patterns, path utilization — the system can forecast future degradation and recommend proactive path changes before users are affected. This is a form of <strong>predictive graph analytics</strong>: using temporal patterns in a dynamic graph to anticipate structural changes.</p>

<h3 id="83-agenticops">8.3 AgenticOps</h3>

<p>Cisco’s AgenticOps vision extends AI from advisory to autonomous action. Specialized AI agents continuously:</p>
<ol>
  <li><strong>Sense</strong>: Ingest real-time graph telemetry from ThousandEyes agents.</li>
  <li><strong>Reason</strong>: Apply graph analysis and anomaly detection to identify emerging issues.</li>
  <li><strong>Act</strong>: Execute corrective actions — rerouting traffic, adjusting SD-WAN policies, escalating to incident management.</li>
  <li><strong>Validate</strong>: Re-measure the graph after action to confirm the issue is resolved.</li>
</ol>

<p>This closes the loop from graph observation to graph-informed remediation, moving toward autonomous network operations.</p>
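<p>The four stages can be sketched as a small control loop. This is an illustrative skeleton under stated assumptions, not Cisco's implementation: the stages are passed in as plain callables, the network is simulated with a dictionary, and every name (<code>run_cycle</code>, <code>sense</code>, <code>reason</code>, <code>act</code>) is hypothetical.</p>

```python
def run_cycle(sense, reason, act, max_attempts=2):
    """Run one sense/reason/act/validate pass; return a log of (stage, detail)."""
    log = []
    telemetry = sense()
    log.append(("sense", telemetry))
    issue = reason(telemetry)
    log.append(("reason", issue))
    if issue is None:
        return log  # Nothing to fix.
    for attempt in range(1, max_attempts + 1):
        log.append(("act", act(issue, attempt)))
        # Validate: re-measure and re-run the same analysis.
        remaining = reason(sense())
        log.append(("validate", remaining))
        if remaining is None:
            return log
    # Corrective actions did not resolve the issue; hand off to humans.
    log.append(("escalate", issue))
    return log


# Simulated network state (made-up values).
network = {"active": "isp-a", "latency_ms": {"isp-a": 180.0, "isp-b": 35.0}}


def sense():
    path = network["active"]
    return {"path": path, "latency_ms": network["latency_ms"][path]}


def reason(telemetry, threshold_ms=100.0):
    if telemetry["latency_ms"] > threshold_ms:
        return "high latency on " + telemetry["path"]
    return None


def act(issue, attempt):
    network["active"] = "isp-b"  # Reroute to the backup path.
    return "rerouted to isp-b"


log = run_cycle(sense, reason, act)
for stage, detail in log:
    print(stage, "->", detail)
```

<p>Bounding the number of attempts and escalating afterwards is a design choice in this sketch: it keeps an autonomous loop from retrying indefinitely when its actions do not help.</p>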

<h3 id="84-machine-learning-on-historical-graph-patterns">8.4 Machine Learning on Historical Graph Patterns</h3>

<p>ThousandEyes’ longitudinal graph data — capturing path structures, performance metrics, and outage events over months and years — provides a rich training dataset for anomaly detection models. These models learn the “normal” graph structure for a given test and flag deviations: unexpected new hops, abnormal latency distributions, path changes that correlate with past outage patterns. This transforms the graph from a diagnostic tool into a predictive one.</p>
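<p>As a rough illustration of learning a "normal" baseline and flagging deviations, the sketch below uses a simple statistical baseline (known hop set plus a latency z-score) in place of a trained model. The actual models are not specified here, and all names and thresholds are assumptions.</p>

```python
from statistics import mean, stdev


def learn_baseline(history):
    """history: list of (hops, latency_ms) observations for one test."""
    known_hops = set()
    for hops, _ in history:
        known_hops.update(hops)
    latencies = [latency for _, latency in history]
    return {"hops": known_hops, "mean": mean(latencies), "stdev": stdev(latencies)}


def find_deviations(baseline, hops, latency_ms, z_limit=3.0):
    """Return a list of (kind, detail) findings for one new observation."""
    findings = []
    new_hops = [h for h in hops if h not in baseline["hops"]]
    if new_hops:
        findings.append(("unexpected_hops", new_hops))
    if baseline["stdev"] > 0:
        z = (latency_ms - baseline["mean"]) / baseline["stdev"]
        if abs(z) > z_limit:
            findings.append(("latency_zscore", round(z, 1)))
    return findings


# Illustrative history (documentation-range addresses, made-up latencies).
normal_path = ["10.0.0.1", "203.0.113.5", "198.51.100.9"]
history = [(normal_path, ms) for ms in (20.0, 22.0, 21.0, 19.0, 23.0)]
baseline = learn_baseline(history)

rerouted = ["10.0.0.1", "192.0.2.99", "198.51.100.9"]
print(find_deviations(baseline, rerouted, 60.0))
print(find_deviations(baseline, normal_path, 22.0))
```

<p>A real system would also need seasonality handling and per-hop baselines, but the two checks shown map directly onto the deviations named above: unexpected new hops and abnormal latency.</p>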

<hr />

<h2 id="ix-real-world-application-domains">IX. Real-World Application Domains</h2>

<h3 id="91-enterprise-saas-monitoring">9.1 Enterprise SaaS Monitoring</h3>

<p>For enterprises dependent on SaaS applications — Microsoft 365, Salesforce, ServiceNow, Webex, Zoom — ThousandEyes constructs path graphs from office and remote-worker locations to each application’s endpoints. This reveals which ISPs, transit providers, and CDN nodes are in the critical path, enabling targeted escalation when performance degrades. Internet Insights adds a macro view: if Salesforce is experiencing a widespread outage, the enterprise can immediately confirm the issue is external and redirect support resources accordingly.</p>

<h3 id="92-multi-cloud-assurance">9.2 Multi-Cloud Assurance</h3>

<p>Organizations operating across AWS, Azure, and GCP face the challenge of monitoring interconnections between cloud providers — inter-region and inter-cloud traffic traverses networks outside the customer’s control. ThousandEyes’ cloud-enriched path graphs map these interconnections, identifying performance bottlenecks at cloud-provider handoff points and enabling data-driven multi-cloud architecture decisions.</p>

<h3 id="93-sd-wan-optimization">9.3 SD-WAN Optimization</h3>

<p>Cisco SD-WAN and Meraki MX deployments benefit from ThousandEyes’ dual-layer graph model. When an SD-WAN tunnel shows degradation, the overlay/underlay correlation pinpoints whether the issue is in the overlay policy (tunnel misconfiguration, incorrect SLA class assignment) or the underlay transport (ISP congestion, backbone failure). WAN Insights extends this with predictive recommendations, suggesting proactive path changes based on graph telemetry trends.</p>

<h3 id="94-hybrid-workforce">9.4 Hybrid Workforce</h3>

<p>With employees working from home, coffee shops, and co-working spaces, the “last mile” to the corporate network is no longer a managed LAN segment; it is an uncontrolled path through consumer ISPs and the public Internet. Endpoint Agents on employee devices construct path graphs from each location to corporate applications, identifying ISP-specific issues (for example, congestion at a particular residential ISP’s peering point) and enabling IT to provide targeted guidance or escalate to the ISP with concrete evidence.</p>

<h3 id="95-industrial-iot-iiot">9.5 Industrial IoT (IIoT)</h3>

<p>The 2025 extension of ThousandEyes to Cisco Industrial Ethernet switches and Industrial Routers brings graph-based visibility to operational technology (OT) environments. Enterprise Agents deployed on industrial networking equipment construct path graphs from factory floors and remote sites to cloud-hosted SCADA, MES, and ERP systems, enabling IT/OT teams to collaboratively troubleshoot connectivity issues that affect production.</p>

<h3 id="96-incident-response">9.6 Incident Response</h3>

<p>During a major incident, the combination of Internet Insights (macro-scale outage graph), Path Visualization (hop-level diagnostic graph), BGP Route Visualization (control-plane routing graph), and Device Layer (internal infrastructure graph) provides a multi-layer graph model that spans the full incident domain. Engineers can start at the highest level — <em>is this a global outage?</em> — and drill down through successively more detailed graphs to isolate the root cause, all within a single platform.</p>

<hr />

<h2 id="x-conclusion">X. Conclusion</h2>

<p>Cisco ThousandEyes is, at its core, a large-scale, distributed implementation of applied graph theory. Its agents collect raw network data — ICMP responses, TCP timings, BGP advertisements, SNMP neighbor tables — and assemble it into interconnected graph models that span from individual device interfaces to the global Internet topology. The platform’s diagnostic power comes from the graph operations it performs on these models: path discovery, anomaly clustering, centrality analysis, reachability computation, and temporal graph comparison.</p>

<p>The trajectory is clear. The platform is evolving from a system where humans interpret graph visualizations toward one where AI agents autonomously sense, reason, and act on graph-derived telemetry. WAN Insights already demonstrates predictive graph analytics; AgenticOps extends this to closed-loop remediation. As networks grow more complex — 5G edge deployments, multi-cloud architectures, IoT at scale — the graph models will expand accordingly, but the fundamental abstractions remain the same: nodes, edges, weights, paths, and the algorithms that operate on them.</p>

<p>For network engineers, understanding the graph-theoretic foundations of ThousandEyes is not merely academic. It sharpens the interpretation of every visualization the platform produces: recognizing a cut vertex as a single point of failure, reading edge weights as latency contributions, understanding that an Internet Insights outage alert is the result of spatial cluster analysis on a global measurement graph. The graph is the network. ThousandEyes makes it visible.</p>

<hr />

<h2 id="appendix">Appendix</h2>

<h3 id="a-thousandeyes-test-types-and-their-graph-models">A. ThousandEyes Test Types and Their Graph Models</h3>

<table>
  <thead>
    <tr>
      <th>Test Type</th>
      <th>Primary Graph Model</th>
      <th>Node Type</th>
      <th>Edge Type</th>
      <th>Key Metrics</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Agent-to-Server</td>
      <td>Path Visualization (directed, weighted)</td>
      <td>IP hops</td>
      <td>Network segments</td>
      <td>Loss, latency, jitter, delay, MTU</td>
    </tr>
    <tr>
      <td>Agent-to-Agent</td>
      <td>Bidirectional Path Visualization</td>
      <td>IP hops</td>
      <td>Network segments</td>
      <td>Loss, latency, jitter, throughput</td>
    </tr>
    <tr>
      <td>HTTP Server</td>
      <td>Path Visualization + HTTP layer</td>
      <td>IP hops + server</td>
      <td>Network segments</td>
      <td>Loss, latency, response time, availability</td>
    </tr>
    <tr>
      <td>Page Load</td>
      <td>Path Visualization + DOM graph</td>
      <td>IP hops + page components</td>
      <td>Network + resource dependencies</td>
      <td>Loss, latency, page load time, DOM load</td>
    </tr>
    <tr>
      <td>API Test</td>
      <td>Path Visualization per API step</td>
      <td>IP hops per endpoint</td>
      <td>Network segments per call</td>
      <td>Loss, latency, API response time, completion</td>
    </tr>
    <tr>
      <td>DNS Server</td>
      <td>Path Visualization to DNS server</td>
      <td>IP hops + DNS resolver</td>
      <td>Network segments</td>
      <td>Loss, latency, resolution time, mappings</td>
    </tr>
    <tr>
      <td>BGP</td>
      <td>AS-level directed graph</td>
      <td>Autonomous Systems</td>
      <td>Peering/transit relationships</td>
      <td>Path changes, reachability, updates, RPKI</td>
    </tr>
    <tr>
      <td>Device Layer</td>
      <td>Undirected adjacency graph</td>
      <td>Network devices</td>
      <td>Layer-2 links</td>
      <td>CPU, memory, interface errors, bandwidth</td>
    </tr>
  </tbody>
</table>

<h3 id="b-key-metrics-glossary">B. Key Metrics Glossary</h3>

<table>
  <thead>
    <tr>
      <th>Metric</th>
      <th>Unit</th>
      <th>Description</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Loss</strong></td>
      <td>%</td>
      <td>Percentage of probes that did not receive a response from the target hop</td>
    </tr>
    <tr>
      <td><strong>Latency</strong></td>
      <td>ms</td>
      <td>Round-trip time from agent to target or to a specific hop</td>
    </tr>
    <tr>
      <td><strong>Jitter</strong></td>
      <td>ms</td>
      <td>Standard deviation of latency measurements; indicates path stability</td>
    </tr>
    <tr>
      <td><strong>Link Delay</strong></td>
      <td>ms</td>
      <td>Estimated one-way transmission delay across a single link</td>
    </tr>
    <tr>
      <td><strong>DSCP</strong></td>
      <td>Numeric</td>
      <td>Differentiated Services Code Point observed in returned packets</td>
    </tr>
    <tr>
      <td><strong>MTU</strong></td>
      <td>Bytes</td>
      <td>Path MTU: the smallest Maximum Transmission Unit of any link along the path</td>
    </tr>
    <tr>
      <td><strong>Reachability</strong></td>
      <td>%</td>
      <td>Percentage of BGP monitors that can see a valid route to a prefix</td>
    </tr>
    <tr>
      <td><strong>Path Changes</strong></td>
      <td>Count</td>
      <td>Number of AS-path modifications observed in a time window</td>
    </tr>
    <tr>
      <td><strong>Updates</strong></td>
      <td>Count</td>
      <td>Number of BGP UPDATE messages received for a prefix</td>
    </tr>
    <tr>
      <td><strong>Throughput</strong></td>
      <td>Mbps</td>
      <td>Measured bandwidth capacity (Agent-to-Agent tests)</td>
    </tr>
  </tbody>
</table>
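<p>Several of the glossary definitions reduce to simple arithmetic over probe results. The sketch below follows the definitions in the table as written; the choice of population standard deviation for jitter is an assumption, and the function names and sample values are hypothetical.</p>

```python
from statistics import mean, pstdev


def summarize_probes(rtts_ms):
    """rtts_ms: one round-trip time per probe; None marks a probe with no response."""
    answered = [rtt for rtt in rtts_ms if rtt is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    if not answered:
        return {"loss_pct": loss_pct, "latency_ms": None, "jitter_ms": None}
    return {
        "loss_pct": loss_pct,              # share of probes without a response
        "latency_ms": mean(answered),      # average round-trip time
        "jitter_ms": pstdev(answered),     # standard deviation of latency
    }


def path_mtu(link_mtus):
    """The usable MTU of a path is limited by its most restrictive link."""
    return min(link_mtus)


# Illustrative probe results: five probes, one lost.
summary = summarize_probes([10.0, 12.0, None, 14.0, 12.0])
print(summary)
print(path_mtu([1500, 1500, 1400, 1500]))
```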

<h3 id="c-thousandeyes-api-endpoints-for-graph-data">C. ThousandEyes API Endpoints for Graph Data</h3>

<table>
  <thead>
    <tr>
      <th>Endpoint Category</th>
      <th>Data Returned</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/net/path-vis/{testId}</code></td>
      <td>Path trace nodes, links, per-hop metrics</td>
      <td>Reconstruct path graph externally</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/net/metrics/{testId}</code></td>
      <td>End-to-end loss, latency, jitter time series</td>
      <td>Trend analysis, SLA reporting</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/net/bgp-metrics/{testId}</code></td>
      <td>AS-paths, reachability, updates, RPKI status</td>
      <td>BGP graph construction, hijack detection</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/internet-insights/outages</code></td>
      <td>Outage events with scope, scale, affected providers</td>
      <td>Correlation with internal tests</td>
    </tr>
    <tr>
      <td><code class="language-plaintext highlighter-rouge">/endpoint-data/network-topology</code></td>
      <td>Endpoint agent path and network data</td>
      <td>Hybrid workforce path analysis</td>
    </tr>
  </tbody>
</table>

<h3 id="d-recommended-further-reading">D. Recommended Further Reading</h3>

<ol>
  <li><strong>ThousandEyes Documentation</strong>: <a href="https://docs.thousandeyes.com">docs.thousandeyes.com</a> — Comprehensive product documentation including Path Visualization, BGP tests, Device Layer, and API reference.</li>
  <li><strong>ThousandEyes Blog</strong>: <a href="https://www.thousandeyes.com/blog">thousandeyes.com/blog</a> — Technical articles on network monitoring methodology, product updates, and Internet outage analyses.</li>
  <li><strong>ThousandEyes API Developer Guide</strong>: <a href="https://developer.cisco.com/docs/thousandeyes/">developer.cisco.com/docs/thousandeyes</a> — API reference and getting-started guides for programmatic data access.</li>
  <li><strong>Cisco Live Sessions</strong>: Annual presentations covering ThousandEyes architecture, new capabilities, and customer case studies.</li>
  <li><strong>“Internet Insights: Detecting and Solving Internet Outages with Collective Intelligence”</strong>: ThousandEyes webinar on the algorithms behind Internet Insights outage detection.</li>
  <li><strong>RIPE-RIS and RouteViews</strong>: Public BGP data sources that feed ThousandEyes’ BGP monitoring — <a href="https://ris.ripe.net">ris.ripe.net</a> and <a href="http://www.routeviews.org">routeviews.org</a>.</li>
</ol>]]></content><author><name>Marc Buraczynski</name></author><category term="networking" /><category term="graph theory" /><category term="observability" /><category term="ThousandEyes" /><summary type="html"><![CDATA[A technical deep-dive into the graph-theoretic foundations, algorithms, and data structures that power ThousandEyes’ network intelligence platform.]]></summary></entry><entry><title type="html">Agent Skills: Architecture, Implementation, and the Future of Composable AI Agent Knowledge</title><link href="https://gunnymarc.github.io/posts/2026/04/what-are-agent-skills/" rel="alternate" type="text/html" title="Agent Skills: Architecture, Implementation, and the Future of Composable AI Agent Knowledge" /><published>2026-04-15T00:00:00-04:00</published><updated>2026-04-15T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/04/what-are-agent-skills</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/04/what-are-agent-skills/"><![CDATA[<blockquote>
  <p>A deep technical analysis of the SKILL.md specification, progressive disclosure patterns, and how agent skills fundamentally reshape LLM-based agent architectures.</p>
</blockquote>

<hr />

<h2 id="1-the-context-window-problem-why-skills-exist">1. The Context Window Problem: Why Skills Exist</h2>

<p>Every production AI agent faces a fundamental engineering constraint: <strong>the context window is finite, expensive, and shared across every turn of conversation</strong>. Consider an agent equipped with dozens of specialized workflows — CI/CD pipelines, security review checklists, documentation formatters, data analysis routines, migration helpers. The naive architecture loads every instruction set into the system prompt at initialization. The token arithmetic is sobering:</p>

<ul>
  <li>Even a modest library of 30 workflows at ~5,000 tokens each consumes roughly <strong>150,000 tokens before the user says a word</strong></li>
  <li>That budget is spent identically whether the user triggers a complex deployment or simply renames a variable</li>
  <li>On frontier models priced around $10 per million input tokens, the system prompt alone costs <strong>$1.50 per request</strong></li>
  <li>Prefill latency grows at least linearly with input length, so at 150K tokens the user waits seconds just for the model to process instructions it may never use</li>
</ul>

<p>The solution draws on a concept familiar to operating-system engineers: <strong>demand paging</strong>. Rather than loading everything up front, the agent boots with a lightweight index of skill metadata — names and one-line descriptions totaling roughly 3,000 tokens — and pulls full instruction sets into context only when a task actually requires them.</p>

<p><img src="https://iili.io/BnCFL1S.png" alt="Diagram comparing token usage: Without Skills (150K tokens, fixed high cost) vs With Agent Skills (~3K tokens at startup, grows with usage)" /></p>

<p>The practical result is approximately a <strong>50× reduction in startup token cost</strong>. Per-request averages fall by a smaller but still substantial factor, roughly 12× to 19× under the figures above, because most interactions activate only one or two skills (about 5,000 tokens each) on top of the index.</p>
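<p>The token arithmetic in this section can be checked directly. The sketch below uses only the figures quoted above (30 workflows, ~5,000 tokens each, a ~3,000-token index, $10 per million input tokens); the constant and function names are illustrative.</p>

```python
WORKFLOWS = 30
TOKENS_PER_WORKFLOW = 5_000
INDEX_TOKENS = 3_000                 # names + one-line descriptions
PRICE_PER_MILLION_TOKENS = 10.00     # USD, input tokens


def prompt_cost(tokens):
    """Input-token cost in USD for a prompt of the given size."""
    return tokens * PRICE_PER_MILLION_TOKENS / 1_000_000


def lazy_tokens(active_skills):
    """Tokens in context when only the index and the activated skills are loaded."""
    return INDEX_TOKENS + active_skills * TOKENS_PER_WORKFLOW


eager_tokens = WORKFLOWS * TOKENS_PER_WORKFLOW

print("eager load:", eager_tokens, "tokens, $", prompt_cost(eager_tokens))
print("startup reduction:", eager_tokens / INDEX_TOKENS)
for active in (1, 2):
    tokens = lazy_tokens(active)
    print(active, "active skill(s):", tokens, "tokens,",
          round(eager_tokens / tokens, 2), "x smaller")
```

<p>The startup figure is the best case; once one or two skills are paged in, the saving shrinks but remains an order of magnitude.</p>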

<hr />

<h2 id="2-the-skillmd-specification-anatomy-of-a-skill">2. The SKILL.md Specification: Anatomy of a Skill</h2>

<p>When Anthropic published the SKILL.md specification as an open standard in late 2025, it deliberately chose the lowest-friction format possible: a Markdown file with YAML frontmatter and a directory convention. That simplicity drove rapid cross-platform adoption — within a few months, implementations appeared in OpenAI Codex, Google Gemini CLI, GitHub Copilot, Cursor, JetBrains Junie, and dozens of other agent-oriented products.</p>

<h3 id="21-file-structure">2.1 File Structure</h3>

<p>A skill lives in a directory with a defined structure:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>my-skill/
├── SKILL.md              # Required: frontmatter + instructions
├── scripts/              # Optional: executable scripts
│   ├── validate.py
│   └── transform.sh
├── references/           # Optional: supplementary docs
│   ├── style-guide.md
│   └── api-schema.json
└── assets/               # Optional: static files
    └── template.html
</code></pre></div></div>

<h3 id="22-skillmd-format">2.2 SKILL.md Format</h3>

<p>The file starts with YAML frontmatter (required fields: <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">description</code>) followed by a Markdown body with the actual instructions:</p>

<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">name</span><span class="pi">:</span> <span class="s">code-review-security</span>
<span class="na">description</span><span class="pi">:</span> <span class="pi">&gt;</span>
  <span class="s">Performs security-focused code review. Identifies injection vulnerabilities,</span>
  <span class="s">auth bypasses, secrets exposure, and insecure deserialization patterns.</span>
  <span class="s">Use when reviewing PRs or auditing codebases for security issues.</span>
<span class="na">license</span><span class="pi">:</span> <span class="s">MIT</span>
<span class="na">compatibility</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">claude</span>
  <span class="pi">-</span> <span class="s">codex</span>
  <span class="pi">-</span> <span class="s">gemini-cli</span>
<span class="na">allowed-tools</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">read_file</span>
  <span class="pi">-</span> <span class="s">grep</span>
  <span class="pi">-</span> <span class="s">bash(read-only)</span>
<span class="na">metadata</span><span class="pi">:</span>
  <span class="na">author</span><span class="pi">:</span> <span class="s">security-team</span>
  <span class="na">version</span><span class="pi">:</span> <span class="s">2.1.0</span>
  <span class="na">tags</span><span class="pi">:</span> <span class="pi">[</span><span class="nv">security</span><span class="pi">,</span> <span class="nv">review</span><span class="pi">,</span> <span class="nv">OWASP</span><span class="pi">]</span>
<span class="nn">---</span>

<span class="c1"># Security Code Review Skill</span>

<span class="c1">## Workflow</span>

<span class="s">1. Scan all changed files for security-sensitive patterns</span>
<span class="s">2. Check for hardcoded secrets using regex patterns</span>
<span class="s">3. Identify SQL injection vectors in database queries</span>
<span class="s">4. Review authentication and authorization logic</span>
<span class="s">5. Flag insecure deserialization or eval() usage</span>
<span class="s">6. Generate findings report with severity ratings</span>

<span class="c1">## Best Practices</span>

<span class="pi">-</span> <span class="s">Always check for both direct and indirect injection paths</span>
<span class="pi">-</span> <span class="s">Review dependency versions against known CVE databases</span>
<span class="pi">-</span> <span class="s">Flag any use of `eval()`, `exec()`, or `subprocess` calls with `shell=True`</span>

<span class="c1">## Edge Cases</span>

<span class="pi">-</span> <span class="s">Template injection in Jinja2/Mako templates</span>
<span class="pi">-</span> <span class="s">GraphQL query depth attacks</span>
<span class="pi">-</span> <span class="s">SSRF through URL parsing inconsistencies</span>
</code></pre></div></div>

<h3 id="23-building-a-skill-registry-in-python">2.3 Building a Skill Registry in Python</h3>

<p>Here’s a practical implementation of a skill discovery and loading system:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">yaml</span>
<span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">SkillMetadata</span><span class="p">:</span>
    <span class="s">"""Tier 1 representation: only what's needed for the system prompt."""</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">description</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">path</span><span class="p">:</span> <span class="n">Path</span>
    <span class="n">token_estimate</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="n">allowed_tools</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">list</span><span class="p">)</span>
    <span class="n">content_hash</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">""</span>

    <span class="k">def</span> <span class="nf">to_system_prompt_entry</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Generate the ~100-token entry for the system prompt."""</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"- **</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">**: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">description</span><span class="si">}</span><span class="s">"</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">LoadedSkill</span><span class="p">:</span>
    <span class="s">"""Tier 2 representation: full SKILL.md body loaded into context."""</span>
    <span class="n">metadata</span><span class="p">:</span> <span class="n">SkillMetadata</span>
    <span class="n">body</span><span class="p">:</span> <span class="nb">str</span>  <span class="c1"># Markdown body after frontmatter
</span>    <span class="n">references</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">dict</span><span class="p">)</span>
    <span class="n">scripts</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="nb">dict</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">SkillRegistry</span><span class="p">:</span>
    <span class="s">"""
    Manages skill discovery, registration, and progressive loading.
    Implements the three-tier disclosure pattern from the SKILL.md spec.
    """</span>

    <span class="n">DISCOVERY_PATHS</span> <span class="o">=</span> <span class="p">[</span>
        <span class="s">".claude/skills"</span><span class="p">,</span>      <span class="c1"># Claude-specific
</span>        <span class="s">".agents/skills"</span><span class="p">,</span>      <span class="c1"># Cross-platform convention
</span>        <span class="s">".cursor/skills"</span><span class="p">,</span>      <span class="c1"># Cursor-specific
</span>    <span class="p">]</span>
    <span class="n">GLOBAL_PATH</span> <span class="o">=</span> <span class="n">Path</span><span class="p">.</span><span class="n">home</span><span class="p">()</span> <span class="o">/</span> <span class="s">".claude"</span> <span class="o">/</span> <span class="s">"skills"</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_registry</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">SkillMetadata</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">LoadedSkill</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_activation_log</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">discover</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">project_root</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"."</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">SkillMetadata</span><span class="p">]:</span>
        <span class="s">"""
        Stage 0: Scan all skill sources and register metadata.
        Only the YAML frontmatter is parsed; the Markdown body is not loaded into context.
        """</span>
        <span class="n">sources</span> <span class="o">=</span> <span class="p">[</span>
            <span class="p">(</span><span class="s">"project"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_scan_project_skills</span><span class="p">(</span><span class="n">project_root</span><span class="p">)),</span>
            <span class="p">(</span><span class="s">"global"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_scan_directory</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">GLOBAL_PATH</span><span class="p">)),</span>
        <span class="p">]</span>

        <span class="k">for</span> <span class="n">source_type</span><span class="p">,</span> <span class="n">skills</span> <span class="ow">in</span> <span class="n">sources</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">skill</span> <span class="ow">in</span> <span class="n">skills</span><span class="p">:</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">_registry</span><span class="p">[</span><span class="n">skill</span><span class="p">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="n">skill</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[discover] Registered '</span><span class="si">{</span><span class="n">skill</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">' from </span><span class="si">{</span><span class="n">source_type</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="k">return</span> <span class="nb">list</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_registry</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>

    <span class="k">def</span> <span class="nf">_scan_project_skills</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">project_root</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">SkillMetadata</span><span class="p">]:</span>
        <span class="s">"""Scan project-level skill directories."""</span>
        <span class="n">skills</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">rel_path</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">DISCOVERY_PATHS</span><span class="p">:</span>
            <span class="n">skills_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">project_root</span><span class="p">)</span> <span class="o">/</span> <span class="n">rel_path</span>
            <span class="n">skills</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_scan_directory</span><span class="p">(</span><span class="n">skills_dir</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">skills</span>

    <span class="k">def</span> <span class="nf">_scan_directory</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">directory</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="n">SkillMetadata</span><span class="p">]:</span>
        <span class="s">"""Scan a directory for SKILL.md files and extract frontmatter only."""</span>
        <span class="n">skills</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">directory</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">return</span> <span class="n">skills</span>

        <span class="k">for</span> <span class="n">skill_dir</span> <span class="ow">in</span> <span class="n">directory</span><span class="p">.</span><span class="n">iterdir</span><span class="p">():</span>
            <span class="n">skill_file</span> <span class="o">=</span> <span class="n">skill_dir</span> <span class="o">/</span> <span class="s">"SKILL.md"</span> <span class="k">if</span> <span class="n">skill_dir</span><span class="p">.</span><span class="n">is_dir</span><span class="p">()</span> <span class="k">else</span> <span class="bp">None</span>
            <span class="k">if</span> <span class="n">skill_file</span> <span class="ow">and</span> <span class="n">skill_file</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
                <span class="n">metadata</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_parse_frontmatter</span><span class="p">(</span><span class="n">skill_file</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">metadata</span><span class="p">:</span>
                    <span class="n">skills</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">metadata</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">skills</span>

    <span class="k">def</span> <span class="nf">_parse_frontmatter</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">path</span><span class="p">:</span> <span class="n">Path</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="n">SkillMetadata</span><span class="p">]:</span>
        <span class="s">"""Extract only the YAML frontmatter from a SKILL.md file."""</span>
        <span class="n">content</span> <span class="o">=</span> <span class="n">path</span><span class="p">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">content</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"---"</span><span class="p">):</span>
            <span class="k">return</span> <span class="bp">None</span>

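        <span class="c1"># Find the closing --- of the frontmatter; skip files that never close it
</span>        <span class="n">end_idx</span> <span class="o">=</span> <span class="n">content</span><span class="p">.</span><span class="n">find</span><span class="p">(</span><span class="s">"---"</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">end_idx</span> <span class="o">==</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">None</span>
        <span class="n">frontmatter_str</span> <span class="o">=</span> <span class="n">content</span><span class="p">[</span><span class="mi">3</span><span class="p">:</span><span class="n">end_idx</span><span class="p">].</span><span class="n">strip</span><span class="p">()</span>
        <span class="n">fm</span> <span class="o">=</span> <span class="n">yaml</span><span class="p">.</span><span class="n">safe_load</span><span class="p">(</span><span class="n">frontmatter_str</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">fm</span><span class="p">,</span> <span class="nb">dict</span><span class="p">)</span> <span class="ow">or</span> <span class="s">"name"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">fm</span> <span class="ow">or</span> <span class="s">"description"</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">fm</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">None</span>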

        <span class="k">return</span> <span class="n">SkillMetadata</span><span class="p">(</span>
            <span class="n">name</span><span class="o">=</span><span class="n">fm</span><span class="p">[</span><span class="s">"name"</span><span class="p">],</span>
            <span class="n">description</span><span class="o">=</span><span class="n">fm</span><span class="p">[</span><span class="s">"description"</span><span class="p">],</span>
            <span class="n">path</span><span class="o">=</span><span class="n">path</span><span class="p">,</span>
            <span class="n">token_estimate</span><span class="o">=</span><span class="nb">int</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">frontmatter_str</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="o">*</span> <span class="mf">1.3</span><span class="p">),</span>  <span class="c1"># rough estimate, same heuristic as below
</span>            <span class="n">allowed_tools</span><span class="o">=</span><span class="n">fm</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="s">"allowed-tools"</span><span class="p">,</span> <span class="p">[]),</span>
            <span class="n">content_hash</span><span class="o">=</span><span class="n">hashlib</span><span class="p">.</span><span class="n">sha256</span><span class="p">(</span><span class="n">content</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()[:</span><span class="mi">12</span><span class="p">],</span>
        <span class="p">)</span>

    <span class="k">def</span> <span class="nf">build_system_prompt_block</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Tier 1: Generate the skills block for the system prompt.
        This is injected once at startup and stays in every request.
        """</span>
        <span class="n">lines</span> <span class="o">=</span> <span class="p">[</span><span class="s">"## Available Skills</span><span class="se">\n</span><span class="s">"</span><span class="p">]</span>
        <span class="n">total_tokens</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="k">for</span> <span class="n">skill</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_registry</span><span class="p">.</span><span class="n">values</span><span class="p">():</span>
            <span class="n">entry</span> <span class="o">=</span> <span class="n">skill</span><span class="p">.</span><span class="n">to_system_prompt_entry</span><span class="p">()</span>
            <span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">entry</span><span class="p">)</span>
            <span class="n">total_tokens</span> <span class="o">+=</span> <span class="nb">len</span><span class="p">(</span><span class="n">entry</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="o">*</span> <span class="mf">1.3</span>  <span class="c1"># rough token estimate
</span>        <span class="n">lines</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">_(</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_registry</span><span class="p">)</span><span class="si">}</span><span class="s"> skills, ~</span><span class="si">{</span><span class="nb">int</span><span class="p">(</span><span class="n">total_tokens</span><span class="p">)</span><span class="si">}</span><span class="s"> tokens)_"</span><span class="p">)</span>
        <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">activate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">LoadedSkill</span><span class="p">:</span>
        <span class="s">"""
        Tier 2: Load the full SKILL.md body into context.
        Called when the LLM selects a skill based on user query.
        """</span>
        <span class="k">if</span> <span class="n">skill_name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">[</span><span class="n">skill_name</span><span class="p">]</span>

        <span class="n">metadata</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_registry</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">skill_name</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">metadata</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">KeyError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Skill '</span><span class="si">{</span><span class="n">skill_name</span><span class="si">}</span><span class="s">' not found in registry"</span><span class="p">)</span>

        <span class="c1"># Read full file and split frontmatter from body
</span>        <span class="n">content</span> <span class="o">=</span> <span class="n">metadata</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span>
        <span class="n">parts</span> <span class="o">=</span> <span class="n">content</span><span class="p">.</span><span class="n">split</span><span class="p">(</span><span class="s">"---"</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
        <span class="n">body</span> <span class="o">=</span> <span class="n">parts</span><span class="p">[</span><span class="mi">2</span><span class="p">].</span><span class="n">strip</span><span class="p">()</span> <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">parts</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">2</span> <span class="k">else</span> <span class="s">""</span>

        <span class="n">skill</span> <span class="o">=</span> <span class="n">LoadedSkill</span><span class="p">(</span><span class="n">metadata</span><span class="o">=</span><span class="n">metadata</span><span class="p">,</span> <span class="n">body</span><span class="o">=</span><span class="n">body</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">[</span><span class="n">skill_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">skill</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">_activation_log</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
            <span class="s">"skill"</span><span class="p">:</span> <span class="n">skill_name</span><span class="p">,</span>
            <span class="s">"action"</span><span class="p">:</span> <span class="s">"activate"</span><span class="p">,</span>
            <span class="s">"body_tokens"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">body</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="o">*</span> <span class="mf">1.3</span><span class="p">,</span>
        <span class="p">})</span>

        <span class="k">return</span> <span class="n">skill</span>

    <span class="k">def</span> <span class="nf">load_reference</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">ref_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Tier 3: Load a reference file on-demand during execution.
        """</span>
        <span class="n">skill</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">skill_name</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">skill</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">RuntimeError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Skill '</span><span class="si">{</span><span class="n">skill_name</span><span class="si">}</span><span class="s">' must be activated first"</span><span class="p">)</span>

        <span class="n">ref_file</span> <span class="o">=</span> <span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">parent</span> <span class="o">/</span> <span class="n">ref_path</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">ref_file</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Reference '</span><span class="si">{</span><span class="n">ref_path</span><span class="si">}</span><span class="s">' not found"</span><span class="p">)</span>

        <span class="n">content</span> <span class="o">=</span> <span class="n">ref_file</span><span class="p">.</span><span class="n">read_text</span><span class="p">(</span><span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span>
        <span class="n">skill</span><span class="p">.</span><span class="n">references</span><span class="p">[</span><span class="n">ref_path</span><span class="p">]</span> <span class="o">=</span> <span class="n">content</span>
        <span class="k">return</span> <span class="n">content</span>

    <span class="k">def</span> <span class="nf">deactivate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
        <span class="s">"""
        Stage 6: Unload skill from context after execution.
        Frees context window tokens for subsequent operations.
        """</span>
        <span class="k">if</span> <span class="n">skill_name</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">:</span>
            <span class="k">del</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">[</span><span class="n">skill_name</span><span class="p">]</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">_activation_log</span><span class="p">.</span><span class="n">append</span><span class="p">({</span>
                <span class="s">"skill"</span><span class="p">:</span> <span class="n">skill_name</span><span class="p">,</span>
                <span class="s">"action"</span><span class="p">:</span> <span class="s">"deactivate"</span><span class="p">,</span>
            <span class="p">})</span>

    <span class="k">def</span> <span class="nf">get_context_usage</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Report current context token usage from loaded skills."""</span>
        <span class="n">total</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">breakdown</span> <span class="o">=</span> <span class="p">{}</span>
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">skill</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
            <span class="n">body_tokens</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">skill</span><span class="p">.</span><span class="n">body</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="o">*</span> <span class="mf">1.3</span>
            <span class="n">ref_tokens</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">.</span><span class="n">split</span><span class="p">())</span> <span class="o">*</span> <span class="mf">1.3</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">skill</span><span class="p">.</span><span class="n">references</span><span class="p">.</span><span class="n">values</span><span class="p">())</span>
            <span class="n">skill_total</span> <span class="o">=</span> <span class="n">body_tokens</span> <span class="o">+</span> <span class="n">ref_tokens</span>
            <span class="n">breakdown</span><span class="p">[</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">skill_total</span><span class="p">)</span>
            <span class="n">total</span> <span class="o">+=</span> <span class="n">skill_total</span>
        <span class="k">return</span> <span class="p">{</span><span class="s">"total_tokens"</span><span class="p">:</span> <span class="nb">int</span><span class="p">(</span><span class="n">total</span><span class="p">),</span> <span class="s">"breakdown"</span><span class="p">:</span> <span class="n">breakdown</span><span class="p">}</span>
</code></pre></div></div>

<p><strong>Usage:</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Initialize and discover skills
</span><span class="n">registry</span> <span class="o">=</span> <span class="n">SkillRegistry</span><span class="p">()</span>
<span class="n">registry</span><span class="p">.</span><span class="n">discover</span><span class="p">(</span><span class="n">project_root</span><span class="o">=</span><span class="s">"/home/user/my-project"</span><span class="p">)</span>

<span class="c1"># Tier 1: Build system prompt (runs once at agent startup)
</span><span class="n">system_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""You are a development assistant.

</span><span class="si">{</span><span class="n">registry</span><span class="p">.</span><span class="n">build_system_prompt_block</span><span class="p">()</span><span class="si">}</span><span class="s">

When a user request matches a skill, activate it before responding.
"""</span>

<span class="c1"># Tier 2: Activate when LLM selects a skill
</span><span class="n">skill</span> <span class="o">=</span> <span class="n">registry</span><span class="p">.</span><span class="n">activate</span><span class="p">(</span><span class="s">"code-review-security"</span><span class="p">)</span>
<span class="c1"># Inject skill.body into the conversation context
</span>
<span class="c1"># Tier 3: Load references on-demand
</span><span class="n">style_guide</span> <span class="o">=</span> <span class="n">registry</span><span class="p">.</span><span class="n">load_reference</span><span class="p">(</span><span class="s">"code-review-security"</span><span class="p">,</span> <span class="s">"references/style-guide.md"</span><span class="p">)</span>

<span class="c1"># After execution, free context
</span><span class="n">registry</span><span class="p">.</span><span class="n">deactivate</span><span class="p">(</span><span class="s">"code-review-security"</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">registry</span><span class="p">.</span><span class="n">get_context_usage</span><span class="p">())</span>  <span class="c1"># {'total_tokens': 0, 'breakdown': {}}
</span></code></pre></div></div>

<hr />

<h2 id="3-the-agent-skills-lifecycle-from-discovery-to-dehydration">3. The Agent Skills Lifecycle: From Discovery to Dehydration</h2>

<p>The skill lifecycle is a 7-stage pipeline. Understanding each stage is critical for building agents that use skills efficiently.</p>

<p><img src="https://iili.io/BnCFt29.png" alt="Architecture flowchart showing the 7 stages of the Agent Skills lifecycle" /></p>

<h3 id="stage-0-skills-discovery">Stage 0: Skills Discovery</h3>

<p>The runtime scans multiple sources on startup:</p>

<table>
  <thead>
    <tr>
      <th>Source</th>
      <th>Path / Mechanism</th>
      <th>Scope</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Project</strong></td>
      <td><code class="language-plaintext highlighter-rouge">.agents/skills/</code>, <code class="language-plaintext highlighter-rouge">.claude/skills/</code></td>
      <td>Local to repo</td>
    </tr>
    <tr>
      <td><strong>Global</strong></td>
      <td><code class="language-plaintext highlighter-rouge">~/.claude/skills/</code></td>
      <td>User-wide</td>
    </tr>
    <tr>
      <td><strong>Bundled</strong></td>
      <td>Ships with platform</td>
      <td>Platform-wide</td>
    </tr>
    <tr>
      <td><strong>Plugins</strong></td>
      <td>Third-party packages</td>
      <td>Installed packages</td>
    </tr>
    <tr>
      <td><strong>Community</strong></td>
      <td>Marketplace / repos</td>
      <td>On-demand install</td>
    </tr>
  </tbody>
</table>

<p>Only the YAML frontmatter is parsed at this stage. The Markdown body is not parsed and does not enter the model’s context until the skill is activated.</p>
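
<p>If the runtime should also avoid accumulating the body during discovery, one option is to read the file line by line and stop at the closing delimiter. The following is a minimal sketch under that assumption. <code class="language-plaintext highlighter-rouge">read_frontmatter_only</code> is a hypothetical helper, not part of the registry shown earlier, and it returns the raw frontmatter text without parsing the YAML.</p>

```python
from pathlib import Path
from typing import Optional


def read_frontmatter_only(path: Path) -> Optional[str]:
    """Return the raw frontmatter text, or None if there is no closed block.

    Hypothetical helper: iterates line by line and stops at the closing '---',
    so the Markdown body is never accumulated.
    """
    with path.open(encoding="utf-8") as handle:
        if handle.readline().strip() != "---":
            return None  # file does not start with a frontmatter block
        lines = []
        for line in handle:
            if line.strip() == "---":
                return "".join(lines).strip()  # closing delimiter reached
            lines.append(line)
    return None  # opening delimiter was never closed
```

<p>The returned string can then be handed to whichever YAML parser the runtime already uses. A file with no closing delimiter yields <code class="language-plaintext highlighter-rouge">None</code> rather than raising.</p>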

<h3 id="stage-12-query--skill-selection">Stages 1–2: Query → Skill Selection</h3>

<p>When a user query arrives, the model evaluates it against the skill descriptions already present in the system prompt and decides which skill, if any, to activate. There is <strong>no retrieval step, no embedding lookup, and no external classifier</strong> in the routing path; selection is a byproduct of the model’s own forward pass. This design choice has a profound implication: the <code class="language-plaintext highlighter-rouge">description</code> field in the YAML frontmatter is the single highest-leverage line in any skill file, because, apart from the skill’s name, it is the <em>only</em> text the model sees when making its selection decision. The class below makes that decision explicit as a separate completion call so the step can be inspected in isolation.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">SkillSelector</span><span class="p">:</span>
    <span class="s">"""
    Demonstrates how skill selection works in the agent loop.
    The LLM does the actual selection; this class manages the interaction.
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">registry</span><span class="p">:</span> <span class="n">SkillRegistry</span><span class="p">,</span> <span class="n">llm_client</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">registry</span> <span class="o">=</span> <span class="n">registry</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">=</span> <span class="n">llm_client</span>

    <span class="k">def</span> <span class="nf">select_skill</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
        <span class="s">"""
        Ask the LLM which skill (if any) matches the user query.
        Returns skill name or None.
        """</span>
        <span class="n">skills_block</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">build_system_prompt_block</span><span class="p">()</span>

        <span class="n">selection_prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""Given the user query below, determine which skill
(if any) should be activated. Respond with ONLY the skill name, or "none".

Available skills:
</span><span class="si">{</span><span class="n">skills_block</span><span class="si">}</span><span class="s">

User query: </span><span class="si">{</span><span class="n">user_query</span><span class="si">}</span><span class="s">

Selected skill:"""</span>

        <span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">complete</span><span class="p">(</span><span class="n">selection_prompt</span><span class="p">,</span> <span class="n">max_tokens</span><span class="o">=</span><span class="mi">50</span><span class="p">)</span>
        <span class="n">skill_name</span> <span class="o">=</span> <span class="n">response</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">lower</span><span class="p">()</span>

        <span class="k">if</span> <span class="n">skill_name</span> <span class="o">==</span> <span class="s">"none"</span> <span class="ow">or</span> <span class="n">skill_name</span> <span class="ow">not</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">_registry</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">None</span>
        <span class="k">return</span> <span class="n">skill_name</span>

    <span class="k">def</span> <span class="nf">execute_with_skill</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">user_query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Full agent loop: select skill → activate → execute → deactivate."""</span>
        <span class="n">skill_name</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">select_skill</span><span class="p">(</span><span class="n">user_query</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">skill_name</span><span class="p">:</span>
            <span class="c1"># Tier 2: Load full instructions
</span>            <span class="n">skill</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">activate</span><span class="p">(</span><span class="n">skill_name</span><span class="p">)</span>
            <span class="n">context_injection</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"""
[SKILL ACTIVATED: </span><span class="si">{</span><span class="n">skill_name</span><span class="si">}</span><span class="s">]
</span><span class="si">{</span><span class="n">skill</span><span class="p">.</span><span class="n">body</span><span class="si">}</span><span class="s">
[END SKILL]
"""</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">context_injection</span> <span class="o">=</span> <span class="s">""</span>

        <span class="c1"># Execute with enriched context; the finally block dehydrates the
</span>        <span class="c1"># skill even if the LLM call raises
</span>        <span class="k">try</span><span class="p">:</span>
            <span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">chat</span><span class="p">(</span>
                <span class="n">system</span><span class="o">=</span><span class="sa">f</span><span class="s">"You are an assistant. </span><span class="si">{</span><span class="n">context_injection</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">user</span><span class="o">=</span><span class="n">user_query</span><span class="p">,</span>
            <span class="p">)</span>
        <span class="k">finally</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">skill_name</span><span class="p">:</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">deactivate</span><span class="p">(</span><span class="n">skill_name</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">response</span>
</code></pre></div></div>
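
<p>The exact-match check in <code class="language-plaintext highlighter-rouge">select_skill</code> assumes the model replies with nothing but the skill name. Replies may instead carry quotes, backticks, a trailing period, or an extra explanatory line. A minimal sketch of a more tolerant normalization step follows. <code class="language-plaintext highlighter-rouge">normalize_selection</code> and its cleanup rules are illustrative assumptions, not part of any skills specification.</p>

```python
from typing import Iterable, Optional


def normalize_selection(reply: str, known_skills: Iterable[str]) -> Optional[str]:
    """Map a raw model reply to a registered skill name, or None.

    Hypothetical helper: takes the first non-empty line, strips quotes,
    backticks, and trailing punctuation, then requires an exact,
    case-insensitive match against the known skill names.
    """
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    if not lines:
        return None
    wrappers = "`'\" "
    candidate = lines[0].strip(wrappers).rstrip(".,;:!").strip(wrappers).lower()
    by_lower = {name.lower(): name for name in known_skills}
    return by_lower.get(candidate)  # "none" or unknown replies yield None
```

<p>Requiring an exact match after cleanup keeps the fallback conservative: anything the registry does not recognize is treated as “no skill” rather than guessed at.</p>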

<h3 id="stages-34-activation-and-context-injection">Stages 3–4: Activation and Context Injection</h3>

<p>Skill content is loaded in three progressive tiers, and only the last two are triggered by selection. This is the core of the “progressive disclosure” pattern:</p>

<p><img src="https://iili.io/BnCFmmb.png" alt="Three-tier progressive disclosure: Advertise (~100 tokens) → Load (&lt;5000 tokens) → Deep Dive (as needed)" /></p>

<p><strong>Tier 1 — Advertise</strong> (~100 tokens per skill): The runtime parses only the YAML frontmatter from each SKILL.md and injects a compact name-plus-description entry into the system prompt. This is the fixed per-skill cost that persists across every request: <code class="language-plaintext highlighter-rouge">N_skills × ~100 tokens</code>.</p>

<p><strong>Tier 2 — Load</strong> (budget target: &lt;5,000 tokens): Once the model identifies a relevant skill, the full Markdown body is read into context — step-by-step workflows, domain-specific best practices, known edge cases. The specification guidelines suggest capping this body at 500 lines to keep Tier 2 costs predictable.</p>

<p><strong>Tier 3 — Deep Dive</strong> (on-demand, unbounded): Supplementary reference documents and executable scripts are loaded only during active skill execution. A key architectural detail: <strong>scripts run in a subprocess, and only their stdout enters the model’s context</strong> — the source code never does. A 200-line validation script that emits 10 lines of structured output therefore costs 10 lines of context, not 200.</p>
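
<p>The tier costs above can be turned into a quick budget check. The sketch below is illustrative only: it reuses the rough 1.3 tokens-per-word heuristic from the registry code, takes the ~100-token, 5,000-token, and 500-line figures from the text as defaults, and the names <code class="language-plaintext highlighter-rouge">tier1_overhead</code> and <code class="language-plaintext highlighter-rouge">check_tier2_body</code> are hypothetical.</p>

```python
def tier1_overhead(n_skills: int, tokens_per_skill: int = 100) -> int:
    """Fixed system-prompt cost paid on every request (N_skills x ~100 tokens)."""
    return n_skills * tokens_per_skill


def check_tier2_body(body: str, max_tokens: int = 5000, max_lines: int = 500) -> dict:
    """Compare a SKILL.md body against the Tier 2 budget targets.

    Token counts use a rough words * 1.3 heuristic, not a real tokenizer.
    """
    est_tokens = int(len(body.split()) * 1.3)
    n_lines = len(body.splitlines())
    return {
        "est_tokens": est_tokens,
        "lines": n_lines,
        # Under the token target and at or below the line cap
        "within_budget": max_tokens > est_tokens and max_lines >= n_lines,
    }
```

<p>With 40 installed skills, <code class="language-plaintext highlighter-rouge">tier1_overhead(40)</code> gives a standing cost of 4,000 tokens before any skill body is loaded, which is why the per-skill entry is kept so small.</p>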

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">subprocess</span>
<span class="kn">import</span> <span class="nn">json</span>


<span class="k">class</span> <span class="nc">SkillExecutor</span><span class="p">:</span>
    <span class="s">"""Handles Tier 3 deep-dive: running skill scripts and collecting output."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill</span><span class="p">:</span> <span class="n">LoadedSkill</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">skill</span> <span class="o">=</span> <span class="n">skill</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">script_outputs</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>

    <span class="k">def</span> <span class="nf">run_script</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">script_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">args</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">timeout</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">30</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Execute a skill script and return only its output.
        The script source code never enters the LLM context.
        """</span>
        <span class="n">script_path</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">parent</span> <span class="o">/</span> <span class="s">"scripts"</span> <span class="o">/</span> <span class="n">script_name</span>

        <span class="k">if</span> <span class="ow">not</span> <span class="n">script_path</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">raise</span> <span class="nb">FileNotFoundError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Script '</span><span class="si">{</span><span class="n">script_name</span><span class="si">}</span><span class="s">' not found"</span><span class="p">)</span>

        <span class="c1"># Determine interpreter from extension
</span>        <span class="n">ext</span> <span class="o">=</span> <span class="n">script_path</span><span class="p">.</span><span class="n">suffix</span>
        <span class="n">interpreter</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">".py"</span><span class="p">:</span> <span class="p">[</span><span class="s">"python3"</span><span class="p">],</span>
            <span class="s">".sh"</span><span class="p">:</span> <span class="p">[</span><span class="s">"bash"</span><span class="p">],</span>
            <span class="s">".js"</span><span class="p">:</span> <span class="p">[</span><span class="s">"node"</span><span class="p">],</span>
        <span class="p">}.</span><span class="n">get</span><span class="p">(</span><span class="n">ext</span><span class="p">,</span> <span class="p">[</span><span class="s">"bash"</span><span class="p">])</span>

        <span class="n">cmd</span> <span class="o">=</span> <span class="n">interpreter</span> <span class="o">+</span> <span class="p">[</span><span class="nb">str</span><span class="p">(</span><span class="n">script_path</span><span class="p">)]</span> <span class="o">+</span> <span class="p">(</span><span class="n">args</span> <span class="ow">or</span> <span class="p">[])</span>

        <span class="k">try</span><span class="p">:</span>
            <span class="n">result</span> <span class="o">=</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">run</span><span class="p">(</span>
                <span class="n">cmd</span><span class="p">,</span>
                <span class="n">capture_output</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">text</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">timeout</span><span class="o">=</span><span class="n">timeout</span><span class="p">,</span>
                <span class="n">cwd</span><span class="o">=</span><span class="nb">str</span><span class="p">(</span><span class="n">script_path</span><span class="p">.</span><span class="n">parent</span><span class="p">),</span>
            <span class="p">)</span>
            <span class="n">output</span> <span class="o">=</span> <span class="n">result</span><span class="p">.</span><span class="n">stdout</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span>
            <span class="k">if</span> <span class="n">result</span><span class="p">.</span><span class="n">returncode</span> <span class="o">!=</span> <span class="mi">0</span><span class="p">:</span>
                <span class="n">output</span> <span class="o">+=</span> <span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[STDERR]: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">stderr</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span><span class="si">}</span><span class="s">"</span>
        <span class="k">except</span> <span class="n">subprocess</span><span class="p">.</span><span class="n">TimeoutExpired</span><span class="p">:</span>
            <span class="n">output</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"[ERROR] Script timed out after </span><span class="si">{</span><span class="n">timeout</span><span class="si">}</span><span class="s">s"</span>

        <span class="bp">self</span><span class="p">.</span><span class="n">script_outputs</span><span class="p">[</span><span class="n">script_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">output</span>
        <span class="k">return</span> <span class="n">output</span>

    <span class="k">def</span> <span class="nf">get_context_payload</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""
        Build the context injection payload combining the skill body,
        loaded references, and script outputs.
        """</span>
        <span class="n">sections</span> <span class="o">=</span> <span class="p">[</span><span class="sa">f</span><span class="s">"# Skill: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">skill</span><span class="p">.</span><span class="n">body</span><span class="p">]</span>

        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">skill</span><span class="p">.</span><span class="n">references</span><span class="p">:</span>
            <span class="n">sections</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">## Loaded References</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">ref_name</span><span class="p">,</span> <span class="n">content</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">skill</span><span class="p">.</span><span class="n">references</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="n">sections</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"### </span><span class="si">{</span><span class="n">ref_name</span><span class="si">}</span><span class="se">\n</span><span class="si">{</span><span class="n">content</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">script_outputs</span><span class="p">:</span>
            <span class="n">sections</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">## Script Outputs</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">script_name</span><span class="p">,</span> <span class="n">output</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">script_outputs</span><span class="p">.</span><span class="n">items</span><span class="p">():</span>
                <span class="n">sections</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"### </span><span class="si">{</span><span class="n">script_name</span><span class="si">}</span><span class="se">\n</span><span class="s">```</span><span class="se">\n</span><span class="si">{</span><span class="n">output</span><span class="si">}</span><span class="se">\n</span><span class="s">```</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>

        <span class="k">return</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">sections</span><span class="p">)</span>
</code></pre></div></div>
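
<p>The context-cost claim for Tier 3 can be checked with a small, self-contained sketch. The script contents, the file name <code class="language-plaintext highlighter-rouge">validate.py</code>, and the use of <code class="language-plaintext highlighter-rouge">sys.executable</code> in place of <code class="language-plaintext highlighter-rouge">python3</code> are assumptions made for illustration; the sketch only shows that the captured stdout, not the script source, is what would be handed to the model.</p>

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for a long validation script: a lot of source,
# very little output. The 200 comment lines are padding for illustration.
padding = "\n".join(f"# check {i}: placeholder rule" for i in range(200))
script_source = padding + '\nprint("status: ok")\nprint("violations: 0")\n'

with tempfile.TemporaryDirectory() as tmp:
    script_path = Path(tmp) / "validate.py"
    script_path.write_text(script_source, encoding="utf-8")

    # Run the script in a subprocess and capture its output.
    result = subprocess.run(
        [sys.executable, str(script_path)],
        capture_output=True,
        text=True,
        timeout=30,
    )

source_lines = len(script_source.splitlines())
context_lines = result.stdout.strip().splitlines()

print(f"script source: {source_lines} lines (never sent to the model)")
print(f"context cost:  {len(context_lines)} lines")
```

<p>Only <code class="language-plaintext highlighter-rouge">context_lines</code> would be appended to the prompt; the 202 lines of source stay on disk.</p>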

<h3 id="stages-56-execution-and-dehydration">Stages 5–6: Execution and Dehydration</h3>

<p>The enriched agent executes using its normal toolset (file operations, bash, MCP servers, web search). Once the agent has produced its output, the skill is <strong>dehydrated</strong>: unloaded from context to free tokens.</p>

<p>For multi-step tasks, the agent follows a <strong>load-execute-unload-repeat</strong> pattern: one skill at a time, sequential activation. This keeps context usage proportional to the <em>current</em> step, not the <em>total</em> workflow.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">MultiStepSkillPipeline</span><span class="p">:</span>
    <span class="s">"""
    Demonstrates multi-step dehydration: load one skill at a time,
    execute, unload, then move to the next step.
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">registry</span><span class="p">:</span> <span class="n">SkillRegistry</span><span class="p">,</span> <span class="n">llm_client</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">registry</span> <span class="o">=</span> <span class="n">registry</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">llm</span> <span class="o">=</span> <span class="n">llm_client</span>

    <span class="k">def</span> <span class="nf">execute_pipeline</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">steps</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
        <span class="s">"""
        Execute a sequence of skill-powered steps.
        Each step: {"skill": "skill-name", "task": "description"}
        """</span>
        <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">accumulated_context</span> <span class="o">=</span> <span class="p">[]</span>  <span class="c1"># Carry forward key results, not full skills
</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">steps</span><span class="p">):</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">--- Step </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">step</span><span class="p">[</span><span class="s">'skill'</span><span class="p">]</span><span class="si">}</span><span class="s"> ---"</span><span class="p">)</span>

            <span class="c1"># Activate skill for this step
</span>            <span class="n">skill</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">activate</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="s">"skill"</span><span class="p">])</span>
            <span class="n">executor</span> <span class="o">=</span> <span class="n">SkillExecutor</span><span class="p">(</span><span class="n">skill</span><span class="p">)</span>

            <span class="c1"># Build context with skill instructions + previous results summary
</span>            <span class="n">context</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">get_context_payload</span><span class="p">()</span>
            <span class="k">if</span> <span class="n">accumulated_context</span><span class="p">:</span>
                <span class="n">context</span> <span class="o">+=</span> <span class="s">"</span><span class="se">\n</span><span class="s">## Previous Results</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">accumulated_context</span><span class="p">)</span>

            <span class="c1"># Execute
</span>            <span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">chat</span><span class="p">(</span>
                <span class="n">system</span><span class="o">=</span><span class="sa">f</span><span class="s">"Follow the skill instructions precisely.</span><span class="se">\n</span><span class="si">{</span><span class="n">context</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">user</span><span class="o">=</span><span class="n">step</span><span class="p">[</span><span class="s">"task"</span><span class="p">],</span>
            <span class="p">)</span>
            <span class="n">results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>

            <span class="c1"># Carry forward a compressed summary, not the full response
</span>            <span class="n">summary</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">llm</span><span class="p">.</span><span class="n">complete</span><span class="p">(</span>
                <span class="sa">f</span><span class="s">"Summarize this result in 2-3 sentences:</span><span class="se">\n</span><span class="si">{</span><span class="n">response</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">max_tokens</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
            <span class="p">)</span>
            <span class="n">accumulated_context</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"Step </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="n">step</span><span class="p">[</span><span class="s">'skill'</span><span class="p">]</span><span class="si">}</span><span class="s">): </span><span class="si">{</span><span class="n">summary</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Dehydrate: unload the skill
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">deactivate</span><span class="p">(</span><span class="n">step</span><span class="p">[</span><span class="s">"skill"</span><span class="p">])</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Context after dehydration: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">get_context_usage</span><span class="p">()</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">results</span>


<span class="c1"># Usage:
</span><span class="n">pipeline_steps</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">{</span><span class="s">"skill"</span><span class="p">:</span> <span class="s">"code-review-security"</span><span class="p">,</span> <span class="s">"task"</span><span class="p">:</span> <span class="s">"Review auth.py for vulnerabilities"</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"skill"</span><span class="p">:</span> <span class="s">"deploy-pipeline"</span><span class="p">,</span> <span class="s">"task"</span><span class="p">:</span> <span class="s">"Deploy the reviewed code to staging"</span><span class="p">},</span>
    <span class="p">{</span><span class="s">"skill"</span><span class="p">:</span> <span class="s">"test-runner"</span><span class="p">,</span> <span class="s">"task"</span><span class="p">:</span> <span class="s">"Run integration tests against staging"</span><span class="p">},</span>
<span class="p">]</span>
<span class="c1"># results = pipeline.execute_pipeline(pipeline_steps)
</span></code></pre></div></div>
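
<p>One way to make dehydration robust is to tie it to a context manager, so the skill is unloaded even when a step raises. The sketch below is a minimal illustration under stated assumptions: <code class="language-plaintext highlighter-rouge">ToyRegistry</code>, <code class="language-plaintext highlighter-rouge">hydrated</code>, and the token sizes are hypothetical stand-ins, not part of the registry described above. It also compares peak context usage for sequential activation against loading every skill up front.</p>

```python
from contextlib import contextmanager


class ToyRegistry:
    """Hypothetical stand-in for SkillRegistry that only tracks token counts."""

    def __init__(self, skill_sizes: dict[str, int]):
        self.skill_sizes = skill_sizes
        self.active: set[str] = set()
        self.peak_tokens = 0

    def activate(self, name: str) -> None:
        self.active.add(name)
        self.peak_tokens = max(self.peak_tokens, self.get_context_usage())

    def deactivate(self, name: str) -> None:
        self.active.discard(name)

    def get_context_usage(self) -> int:
        return sum(self.skill_sizes[n] for n in self.active)


@contextmanager
def hydrated(registry: ToyRegistry, name: str):
    """Load a skill for one step and always unload it afterwards."""
    registry.activate(name)
    try:
        yield
    finally:
        registry.deactivate(name)


# Made-up token sizes, for illustration only.
sizes = {"code-review-security": 4000, "deploy-pipeline": 2500, "test-runner": 1500}

# Sequential activation: one skill resident at a time.
sequential = ToyRegistry(sizes)
for name in sizes:
    with hydrated(sequential, name):
        pass  # the LLM call for this step would go here

# A step that fails still leaves the context clean.
try:
    with hydrated(sequential, "deploy-pipeline"):
        raise RuntimeError("simulated step failure")
except RuntimeError:
    pass

# Baseline for comparison: load everything up front.
eager = ToyRegistry(sizes)
for name in sizes:
    eager.activate(name)

print(sequential.peak_tokens, sequential.get_context_usage(), eager.peak_tokens)
```

<p>With these sizes the sequential peak equals the largest single skill, while the eager peak equals the sum of all three, which is the proportionality the load-execute-unload-repeat pattern is meant to preserve.</p>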

<hr />

<h2 id="4-tools-vs-skills-a-critical-architectural-distinction">4. Tools vs. Skills: A Critical Architectural Distinction</h2>

<p>This is arguably the most important conceptual insight in the entire spec. Developers often conflate tools and skills, but they serve fundamentally different roles in the agent architecture.</p>

<p><img src="https://iili.io/BnCFyIj.png" alt="Tools vs Skills: Tools execute discrete actions and return results. Skills inject knowledge and reshape how the agent thinks." /></p>

<h3 id="tools-execute-actions-return-results">Tools: Execute Actions, Return Results</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># A tool is a callable that does one thing and returns data
</span><span class="k">def</span> <span class="nf">read_file</span><span class="p">(</span><span class="n">path</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Tool: discrete action, immediate result."""</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">path</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
        <span class="k">return</span> <span class="n">f</span><span class="p">.</span><span class="n">read</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">web_search</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="s">"""Tool: discrete action, immediate result."""</span>
    <span class="c1"># ... call search API ...
</span>    <span class="k">return</span> <span class="p">[{</span><span class="s">"title"</span><span class="p">:</span> <span class="s">"..."</span><span class="p">,</span> <span class="s">"url"</span><span class="p">:</span> <span class="s">"..."</span><span class="p">,</span> <span class="s">"snippet"</span><span class="p">:</span> <span class="s">"..."</span><span class="p">}]</span>

<span class="k">def</span> <span class="nf">run_sql</span><span class="p">(</span><span class="n">query</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">connection_string</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
    <span class="s">"""Tool: discrete action, immediate result."""</span>
    <span class="c1"># ... execute query, return rows ...
</span>    <span class="k">return</span> <span class="p">[{</span><span class="s">"id"</span><span class="p">:</span> <span class="mi">1</span><span class="p">,</span> <span class="s">"name"</span><span class="p">:</span> <span class="s">"Alice"</span><span class="p">}]</span>
</code></pre></div></div>

<p>Tools function as <em>verbs</em> in the agent’s vocabulary — each one grants a discrete <strong>capability</strong>: reading a file, querying a search index, executing SQL. The interaction pattern is always call → result → move on.</p>

<h3 id="skills-inject-knowledge-reshape-reasoning">Skills: Inject Knowledge, Reshape Reasoning</h3>

<p>Skills, by contrast, operate more like <em>adjectives</em> — they reshape the agent’s reasoning posture rather than granting a new action. Loading a security-review skill doesn’t merely let the agent scan for vulnerabilities; it equips the agent with <strong>structured judgment</strong>: which vulnerability classes to prioritize, what order to inspect them in, and how to calibrate severity ratings.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Before skill activation: generic response
# User: "Review this code"
# Agent: "The code looks fine. It handles user input and queries the database."
</span>
<span class="c1"># After security-review skill activation:
# The agent's context now contains:
# - "Check for SQL injection: flag queries built with string formatting instead of parameters"
# - "Flag any use of eval(), exec(), or subprocess with shell=True"
# - "Review auth logic for IDOR vulnerabilities"
# - "Check for hardcoded secrets using regex: r'(?i)(api[_-]?key|secret|password)\s*=\s*[\"'][^\"']+'"
</span>
<span class="c1"># User: "Review this code"
# Agent: "CRITICAL: Line 42 uses string formatting in SQL query — SQL injection risk.
#          HIGH: Line 67 contains a hardcoded API key.
#          MEDIUM: Line 89 uses eval() on user input — arbitrary code execution."
</span></code></pre></div></div>
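
<p>To make the hardcoded-secrets instruction concrete, here is a small sketch that applies the regex quoted in the skill text to a few sample lines. The sample lines are invented for illustration, and the pattern is a rough heuristic rather than a complete secret scanner.</p>

```python
import re

# Pattern as quoted in the skill instructions above (illustrative, not exhaustive).
SECRET_PATTERN = re.compile(
    r"""(?i)(api[_-]?key|secret|password)\s*=\s*["'][^"']+"""
)

# Invented sample lines: two hardcoded values, two harmless assignments.
sample_lines = [
    'API_KEY = "sk-test-123"',
    "password = os.environ['DB_PASSWORD']",
    "client_secret='abc123'",
    "timeout = 30",
]

# Keep only the lines where a quoted literal is assigned to a secret-like name.
flagged = [line for line in sample_lines if SECRET_PATTERN.search(line)]

for line in flagged:
    print("possible hardcoded secret:", line)
```

<p>The environment lookup is not flagged because the value after <code class="language-plaintext highlighter-rouge">=</code> does not start with a quote character, which is the distinction the skill instruction relies on.</p>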

<p>The key insight: <strong>Tools give agents abilities. Skills give agents judgment.</strong></p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Callable</span>


<span class="k">class</span> <span class="nc">ComponentType</span><span class="p">(</span><span class="n">Enum</span><span class="p">):</span>
    <span class="n">TOOL</span> <span class="o">=</span> <span class="s">"tool"</span>
    <span class="n">SKILL</span> <span class="o">=</span> <span class="s">"skill"</span>


<span class="k">class</span> <span class="nc">AgentComponent</span><span class="p">:</span>
    <span class="s">"""Demonstrates the architectural difference between tools and skills."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">component_type</span><span class="p">:</span> <span class="n">ComponentType</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="nb">type</span> <span class="o">=</span> <span class="n">component_type</span>


<span class="k">class</span> <span class="nc">Tool</span><span class="p">(</span><span class="n">AgentComponent</span><span class="p">):</span>
    <span class="s">"""Executes a discrete action and returns a result."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">func</span><span class="p">:</span> <span class="n">Callable</span><span class="p">[...,</span> <span class="n">Any</span><span class="p">]):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">ComponentType</span><span class="p">.</span><span class="n">TOOL</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">func</span> <span class="o">=</span> <span class="n">func</span>

    <span class="k">def</span> <span class="nf">execute</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="o">**</span><span class="n">kwargs</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Any</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">func</span><span class="p">(</span><span class="o">**</span><span class="n">kwargs</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Skill</span><span class="p">(</span><span class="n">AgentComponent</span><span class="p">):</span>
    <span class="s">"""Injects knowledge into the agent's context."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">instructions</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">allowed_tools</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">(</span><span class="n">name</span><span class="p">,</span> <span class="n">ComponentType</span><span class="p">.</span><span class="n">SKILL</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">instructions</span> <span class="o">=</span> <span class="n">instructions</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">allowed_tools</span> <span class="o">=</span> <span class="n">allowed_tools</span>  <span class="c1"># Skills scope which tools can be used
</span>
    <span class="k">def</span> <span class="nf">inject</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">current_context</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Reshape the agent's context with skill knowledge."""</span>
        <span class="k">return</span> <span class="sa">f</span><span class="s">"""</span><span class="si">{</span><span class="n">current_context</span><span class="si">}</span><span class="s">

[SKILL: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">]
</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">instructions</span><span class="si">}</span><span class="s">
[ALLOWED TOOLS: </span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">allowed_tools</span><span class="p">)</span><span class="si">}</span><span class="s">]
[END SKILL]"""</span>
</code></pre></div></div>
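
<p>The <code class="language-plaintext highlighter-rouge">allowed_tools</code> field above is only rendered into the prompt text. If the scoping should also be enforced at dispatch time, one possible approach is an allow-list check in front of every tool call. The sketch below assumes a hypothetical <code class="language-plaintext highlighter-rouge">ScopedToolbox</code> class and stub tools; it is not part of the architecture described in this article.</p>

```python
from typing import Any, Callable, Optional


class ScopedToolbox:
    """Hypothetical dispatcher that honours a skill's allowed-tools list."""

    def __init__(self, tools: dict[str, Callable[..., Any]]):
        self.tools = tools
        self.allowed: Optional[set[str]] = None  # None means no skill is active

    def apply_skill(self, allowed_tools: list[str]) -> None:
        """Restrict dispatch to the tools the active skill permits."""
        self.allowed = set(allowed_tools)

    def clear_skill(self) -> None:
        """Lift the restriction once the skill is unloaded."""
        self.allowed = None

    def call(self, name: str, **kwargs: Any) -> Any:
        if name not in self.tools:
            raise KeyError(f"Unknown tool: {name}")
        if self.allowed is not None and name not in self.allowed:
            raise PermissionError(f"Tool '{name}' is not allowed by the active skill")
        return self.tools[name](**kwargs)


# Stub tools standing in for real implementations.
toolbox = ScopedToolbox({
    "read_file": lambda path: f"contents of {path}",
    "run_sql": lambda query: [{"id": 1}],
})

toolbox.apply_skill(["read_file"])
print(toolbox.call("read_file", path="auth.py"))

try:
    toolbox.call("run_sql", query="SELECT 1")
except PermissionError as exc:
    print(exc)
```

<p>Enforcing the list in code rather than relying on the prompt alone means a skill that is meant to be read-only cannot reach a write-capable tool, even if the model asks for it.</p>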

<hr />

<h2 id="5-skills--mcp-the-complementary-architecture">5. Skills + MCP: The Complementary Architecture</h2>

<p>A common misconception is that Skills and MCP (Model Context Protocol) overlap or compete for the same architectural niche. In practice, they occupy distinct layers of the agent stack and are designed to evolve independently. Getting this separation right is one of the more consequential decisions in production agent design.</p>

<p><img src="https://iili.io/BnCK2EB.png" alt="Skills and MCP for AI Agents: Skills provide procedural knowledge, MCP provides connectivity" /></p>

<h3 id="the-separation-of-concerns">The Separation of Concerns</h3>

<table>
  <thead>
    <tr>
      <th>Layer</th>
      <th>Purpose</th>
      <th>Provides</th>
      <th>Example</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Skills</strong></td>
      <td>Procedural knowledge</td>
      <td><em>How</em> to do things</td>
      <td>“Run tests before deploying. Check staging health. Rollback on failure.”</td>
    </tr>
    <tr>
      <td><strong>MCP</strong></td>
      <td>Connectivity</td>
      <td><em>What</em> services to use</td>
      <td>GitHub API, Slack, database connections</td>
    </tr>
  </tbody>
</table>

<p>A skill might:</p>
<ol>
  <li>Direct the agent to a specific MCP server (<code class="language-plaintext highlighter-rouge">github-mcp</code>) to create a PR</li>
  <li>Define how to interpret that server’s outputs (for example, how to parse review comments)</li>
  <li>Enforce safety checks before destructive operations (for example, requiring approval before a merge)</li>
</ol>
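
<p>The three responsibilities in the list above can be sketched in a few lines. Everything here is a hypothetical illustration: <code class="language-plaintext highlighter-rouge">FakeGitHubMCP</code> is an in-memory stand-in rather than a real MCP client, the review payload shape is invented, and the set of destructive tools is an assumption.</p>

```python
from dataclasses import dataclass, field


@dataclass
class FakeGitHubMCP:
    """Hypothetical in-memory stand-in for a `github-mcp` server."""
    calls: list[str] = field(default_factory=list)

    def call_tool(self, name: str, args: dict) -> dict:
        self.calls.append(name)
        if name == "list_reviews":
            # Invented payload shape, for illustration only.
            return {"reviews": [{"state": "APPROVED"}, {"state": "CHANGES_REQUESTED"}]}
        return {"ok": True}


DESTRUCTIVE_TOOLS = {"merge_pr"}  # assumption: which tools count as destructive


def blocking_reviews(result: dict) -> int:
    """Item 2: interpret the server's output (count reviews requesting changes)."""
    return sum(1 for r in result["reviews"] if r["state"] == "CHANGES_REQUESTED")


def guarded_call(server, name: str, args: dict, approved: bool = False) -> dict:
    """Item 3: refuse destructive tools unless approval has been given."""
    if name in DESTRUCTIVE_TOOLS and not approved:
        return {"ok": False, "error": f"'{name}' requires approval"}
    return server.call_tool(name, args)


server = FakeGitHubMCP()
guarded_call(server, "create_pr", {"title": "Fix auth"})  # item 1: use the server
pending = blocking_reviews(guarded_call(server, "list_reviews", {"pr": 1}))
merge = guarded_call(server, "merge_pr", {"pr": 1})  # blocked: no approval yet

print(pending, merge, server.calls)
```

<p>The blocked merge never reaches the server, so the safety rule lives entirely on the skill side and the connectivity layer stays unaware of it.</p>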

<p>Because the two layers share no state, you can <strong>replace an MCP server</strong> (migrating from GitHub to GitLab, for example) <strong>without rewriting skill workflows</strong>, provided the skills refer to tools by capability rather than by a vendor-specific server name. Conversely, you can <strong>revise skill workflows without touching any MCP configuration</strong>. This independence is what makes the architecture composable.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Protocol</span>


<span class="c1"># --- MCP Layer: Connectivity ---
</span>
<span class="k">class</span> <span class="nc">MCPServer</span><span class="p">(</span><span class="n">Protocol</span><span class="p">):</span>
    <span class="s">"""Protocol for MCP server implementations."""</span>
    <span class="k">def</span> <span class="nf">list_tools</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span> <span class="p">...</span>
    <span class="k">def</span> <span class="nf">call_tool</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">args</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span> <span class="p">...</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">GitHubMCPServer</span><span class="p">:</span>
    <span class="s">"""MCP server providing GitHub API access."""</span>
    <span class="n">token</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">base_url</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"https://api.github.com"</span>

    <span class="k">def</span> <span class="nf">list_tools</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
        <span class="k">return</span> <span class="p">[</span>
            <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"create_pr"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">:</span> <span class="s">"Create a pull request"</span><span class="p">},</span>
            <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"list_reviews"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">:</span> <span class="s">"List PR reviews"</span><span class="p">},</span>
            <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"merge_pr"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">:</span> <span class="s">"Merge a pull request"</span><span class="p">},</span>
        <span class="p">]</span>

    <span class="k">def</span> <span class="nf">call_tool</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">args</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="c1"># Implementation calls GitHub REST API
</span>        <span class="p">...</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">GitLabMCPServer</span><span class="p">:</span>
    <span class="s">"""MCP server providing GitLab API access — swappable with GitHub."""</span>
    <span class="n">token</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">base_url</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"https://gitlab.com/api/v4"</span>

    <span class="k">def</span> <span class="nf">list_tools</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
        <span class="k">return</span> <span class="p">[</span>
            <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"create_pr"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">:</span> <span class="s">"Create a merge request"</span><span class="p">},</span>
            <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"list_reviews"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">:</span> <span class="s">"List MR reviews"</span><span class="p">},</span>
            <span class="p">{</span><span class="s">"name"</span><span class="p">:</span> <span class="s">"merge_pr"</span><span class="p">,</span> <span class="s">"description"</span><span class="p">:</span> <span class="s">"Merge a merge request"</span><span class="p">},</span>
        <span class="p">]</span>

    <span class="k">def</span> <span class="nf">call_tool</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">args</span><span class="p">:</span> <span class="nb">dict</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="c1"># Implementation calls GitLab REST API
</span>        <span class="p">...</span>


<span class="c1"># --- Skills Layer: Procedural Knowledge ---
</span>
<span class="n">DEPLOY_SKILL_INSTRUCTIONS</span> <span class="o">=</span> <span class="s">"""
# Deploy Pipeline Skill

## Workflow
1. Run `test-runner` skill first — deploy only if all tests pass
2. Create a PR with the deployment changes
3. Wait for at least 1 approving review
4. Deploy to staging environment
5. Run smoke tests against staging
6. If smoke tests pass, merge PR and deploy to production
7. If smoke tests fail, rollback staging and comment failure details on PR

## Safety Checks
- NEVER deploy directly to production without staging verification
- NEVER merge without at least 1 approving review
- Always create a rollback plan before production deployment

## Tool Permissions
- Allowed: create_pr, list_reviews, merge_pr, bash, read_file
- Forbidden: delete_branch (must be manual)
"""</span>


<span class="k">class</span> <span class="nc">AgenticStack</span><span class="p">:</span>
    <span class="s">"""
    Demonstrates the full agentic stack:
    Skills (how) + MCP (what) + LLM (execution)
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">mcp_server</span><span class="p">:</span> <span class="n">MCPServer</span><span class="p">,</span> <span class="n">skill_registry</span><span class="p">:</span> <span class="n">SkillRegistry</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">mcp</span> <span class="o">=</span> <span class="n">mcp_server</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">skills</span> <span class="o">=</span> <span class="n">skill_registry</span>

    <span class="k">def</span> <span class="nf">deploy</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">branch</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="s">"""
        The skill provides the WORKFLOW (how to deploy).
        MCP provides the CONNECTIVITY (how to talk to GitHub/GitLab).
        The LLM follows skill instructions and calls MCP tools.
        """</span>
        <span class="c1"># Skill says: "Create a PR, wait for an approving review, then merge"
</span>        <span class="c1"># MCP provides: the create_pr, list_reviews, and merge_pr tools
</span>        <span class="c1"># LLM: orchestrates both
</span>
        <span class="c1"># Swap mcp_server from GitHubMCPServer to GitLabMCPServer
</span>        <span class="c1"># and this method doesn't change at all — the skill instructions
</span>        <span class="c1"># remain identical because they reference abstract tool names,
</span>        <span class="c1"># not GitHub-specific endpoints.
</span>        <span class="k">pass</span>
</code></pre></div></div>
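<p>The swap described above can be exercised end to end with in-memory stand-ins. This is a minimal sketch, not a real integration: <code class="language-plaintext highlighter-rouge">ToolServer</code>, <code class="language-plaintext highlighter-rouge">FakeGitHubServer</code>, <code class="language-plaintext highlighter-rouge">FakeGitLabServer</code>, and <code class="language-plaintext highlighter-rouge">open_change_request</code> are hypothetical names, and no network call is made. It shows that a workflow step written against the shared tool name runs unchanged on either backend.</p>

```python
from typing import Protocol


class ToolServer(Protocol):
    """Structural type: any object with these two methods qualifies."""
    def list_tools(self) -> list[dict]: ...
    def call_tool(self, name: str, args: dict) -> dict: ...


class FakeGitHubServer:
    """Hypothetical in-memory stand-in; makes no network calls."""
    def list_tools(self) -> list[dict]:
        return [{"name": "create_pr", "description": "Create a pull request"}]

    def call_tool(self, name: str, args: dict) -> dict:
        return {"backend": "github", "tool": name, "branch": args["branch"]}


class FakeGitLabServer:
    """Hypothetical in-memory stand-in exposing the same tool name."""
    def list_tools(self) -> list[dict]:
        return [{"name": "create_pr", "description": "Create a merge request"}]

    def call_tool(self, name: str, args: dict) -> dict:
        return {"backend": "gitlab", "tool": name, "branch": args["branch"]}


def open_change_request(server: ToolServer, branch: str) -> dict:
    """Workflow step that depends only on the abstract tool name."""
    available = {tool["name"] for tool in server.list_tools()}
    if "create_pr" not in available:
        raise ValueError("server does not expose create_pr")
    return server.call_tool("create_pr", {"branch": branch})


# The same function runs against both backends without modification.
results = [
    open_change_request(server, "release/1.2")
    for server in (FakeGitHubServer(), FakeGitLabServer())
]
```

<p>Only the object passed in changes between the two calls. The workflow function never references a GitHub or GitLab endpoint, which is the property the skill instructions rely on.</p>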

<h3 id="the-agentic-stack">The Agentic Stack</h3>

<p>The full architecture stacks four layers, each with a clear responsibility:</p>

<p><img src="https://iili.io/BnCKK21.png" alt="The Agentic Stack: Agent Runtime → Skills → MCP → LLM + Execution" /></p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>┌─────────────────────────────────┐
│        Agent Runtime            │  ← Orchestration, UI, state management
├─────────────────────────────────┤
│           Skills                │  ← The "how": workflows, best practices
├─────────────────────────────────┤
│            MCP                  │  ← The "what": tools, data, external APIs
├─────────────────────────────────┤
│       LLM + Execution           │  ← Model inference, bash, filesystem
└─────────────────────────────────┘
</code></pre></div></div>

<hr />

<h2 id="6-writing-high-quality-skills-practical-guide">6. Writing High-Quality Skills: Practical Guide</h2>

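<p>Skill quality directly shapes agent performance. The template and generator below capture the patterns that matter in a production-grade skill: precise front matter, an explicit workflow, best practices, edge cases, and a defined output format.</p>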

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">SKILL_TEMPLATE</span> <span class="o">=</span> <span class="s">'''---
name: {name}
description: &gt;
  {description}
  Trigger conditions: {triggers}
license: MIT
compatibility:
  - claude
  - codex
  - gemini-cli
  - cursor
allowed-tools:
  {allowed_tools}
metadata:
  author: {author}
  version: {version}
  tags: [{tags}]
---

# {title}

## Overview
{overview}

## Workflow
{workflow_steps}

## Best Practices
{best_practices}

## Edge Cases
{edge_cases}

## Output Format
{output_format}
'''</span>


<span class="k">def</span> <span class="nf">generate_skill</span><span class="p">(</span>
    <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">description</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">triggers</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">workflow_steps</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="n">best_practices</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="n">edge_cases</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="n">allowed_tools</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="n">output_format</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"Markdown report"</span><span class="p">,</span>
    <span class="n">author</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"team"</span><span class="p">,</span>
    <span class="n">version</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"1.0.0"</span><span class="p">,</span>
    <span class="n">tags</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">|</span> <span class="bp">None</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
    <span class="s">"""Generate a well-structured SKILL.md file from parameters."""</span>

    <span class="n">workflow</span> <span class="o">=</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">. </span><span class="si">{</span><span class="n">step</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">workflow_steps</span><span class="p">))</span>
    <span class="n">practices</span> <span class="o">=</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"- </span><span class="si">{</span><span class="n">p</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="n">best_practices</span><span class="p">)</span>
    <span class="n">edges</span> <span class="o">=</span> <span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"- </span><span class="si">{</span><span class="n">e</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">edge_cases</span><span class="p">)</span>
    <span class="n">tools_yaml</span> <span class="o">=</span> <span class="s">"</span><span class="se">\n</span><span class="s">  "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">"- </span><span class="si">{</span><span class="n">t</span><span class="si">}</span><span class="s">"</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">allowed_tools</span><span class="p">)</span>
    <span class="n">tag_str</span> <span class="o">=</span> <span class="s">", "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">tags</span> <span class="ow">or</span> <span class="p">[</span><span class="n">name</span><span class="p">])</span>

    <span class="k">return</span> <span class="n">SKILL_TEMPLATE</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span>
        <span class="n">name</span><span class="o">=</span><span class="n">name</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="n">description</span><span class="p">,</span>
        <span class="n">triggers</span><span class="o">=</span><span class="n">triggers</span><span class="p">,</span>
        <span class="n">title</span><span class="o">=</span><span class="n">name</span><span class="p">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"-"</span><span class="p">,</span> <span class="s">" "</span><span class="p">).</span><span class="n">title</span><span class="p">(),</span>
        <span class="n">overview</span><span class="o">=</span><span class="n">description</span><span class="p">,</span>
        <span class="n">workflow_steps</span><span class="o">=</span><span class="n">workflow</span><span class="p">,</span>
        <span class="n">best_practices</span><span class="o">=</span><span class="n">practices</span><span class="p">,</span>
        <span class="n">edge_cases</span><span class="o">=</span><span class="n">edges</span><span class="p">,</span>
        <span class="n">allowed_tools</span><span class="o">=</span><span class="n">tools_yaml</span><span class="p">,</span>
        <span class="n">output_format</span><span class="o">=</span><span class="n">output_format</span><span class="p">,</span>
        <span class="n">author</span><span class="o">=</span><span class="n">author</span><span class="p">,</span>
        <span class="n">version</span><span class="o">=</span><span class="n">version</span><span class="p">,</span>
        <span class="n">tags</span><span class="o">=</span><span class="n">tag_str</span><span class="p">,</span>
    <span class="p">)</span>


<span class="c1"># Example: Generate a database migration skill
</span><span class="n">migration_skill</span> <span class="o">=</span> <span class="n">generate_skill</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="s">"db-migration"</span><span class="p">,</span>
    <span class="n">description</span><span class="o">=</span><span class="s">"Safely execute database schema migrations with rollback support."</span><span class="p">,</span>
    <span class="n">triggers</span><span class="o">=</span><span class="s">"User mentions 'migration', 'schema change', 'alter table', 'add column'."</span><span class="p">,</span>
    <span class="n">workflow_steps</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Parse the migration file and identify all schema changes"</span><span class="p">,</span>
        <span class="s">"Generate a rollback script for each change"</span><span class="p">,</span>
        <span class="s">"Run migrations against a test database first"</span><span class="p">,</span>
        <span class="s">"Verify data integrity after test migration"</span><span class="p">,</span>
        <span class="s">"Execute against production with a transaction wrapper"</span><span class="p">,</span>
        <span class="s">"Validate production schema matches expected state"</span><span class="p">,</span>
        <span class="s">"Archive the migration with timestamp and hash"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">best_practices</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Always generate rollback scripts BEFORE executing forward migrations"</span><span class="p">,</span>
        <span class="s">"Never drop columns in the same migration that adds new ones"</span><span class="p">,</span>
        <span class="s">"Use online DDL (pt-online-schema-change) for tables with &gt;1M rows"</span><span class="p">,</span>
        <span class="s">"Set a statement timeout to prevent long-running locks"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">edge_cases</span><span class="o">=</span><span class="p">[</span>
        <span class="s">"Circular foreign key dependencies require a specific drop order"</span><span class="p">,</span>
        <span class="s">"Enum type modifications in PostgreSQL need a CREATE TYPE workaround"</span><span class="p">,</span>
        <span class="s">"Partitioned tables may need per-partition migration"</span><span class="p">,</span>
    <span class="p">],</span>
    <span class="n">allowed_tools</span><span class="o">=</span><span class="p">[</span><span class="s">"bash"</span><span class="p">,</span> <span class="s">"read_file"</span><span class="p">,</span> <span class="s">"write_file"</span><span class="p">,</span> <span class="s">"run_sql"</span><span class="p">],</span>
    <span class="n">tags</span><span class="o">=</span><span class="p">[</span><span class="s">"database"</span><span class="p">,</span> <span class="s">"migration"</span><span class="p">,</span> <span class="s">"schema"</span><span class="p">,</span> <span class="s">"safety"</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div>
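<p>A generated skill file can be sanity-checked before it is written to disk. The sketch below is one possible check, not part of any official tooling: <code class="language-plaintext highlighter-rouge">check_skill_text</code>, <code class="language-plaintext highlighter-rouge">NAME_PATTERN</code>, and <code class="language-plaintext highlighter-rouge">REQUIRED_HEADINGS</code> are hypothetical names, and the kebab-case rule and the list of required headings are assumptions taken from the template above rather than from a formal specification.</p>

```python
import re

# Assumed conventions for this sketch, not a formal specification:
# kebab-case names and section headings matching the template above.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
REQUIRED_HEADINGS = ("## Workflow", "## Edge Cases", "## Output Format")


def check_skill_text(text: str) -> list[str]:
    """Return a list of problems found in a SKILL.md string (empty if none)."""
    lines = text.strip().splitlines()
    if not lines or lines[0] != "---" or "---" not in lines[1:]:
        return ["front matter must open and close with a '---' line"]

    # Split the file into front matter and body at the closing delimiter.
    closing = lines.index("---", 1)
    front_matter, body = lines[1:closing], lines[closing + 1:]
    problems = []

    names = [
        line.split(":", 1)[1].strip()
        for line in front_matter
        if line.startswith("name:")
    ]
    if len(names) != 1:
        problems.append("front matter needs exactly one 'name' field")
    elif not NAME_PATTERN.fullmatch(names[0]):
        problems.append(f"name '{names[0]}' is not kebab-case")

    if not any(line.startswith("description:") for line in front_matter):
        problems.append("front matter is missing 'description'")

    for heading in REQUIRED_HEADINGS:
        if heading not in body:
            problems.append(f"body is missing the '{heading}' section")
    return problems
```

<p>An empty list means the text passed every check. The function deliberately avoids a YAML dependency, so it only inspects top-level <code class="language-plaintext highlighter-rouge">name</code> and <code class="language-plaintext highlighter-rouge">description</code> lines and would need a real parser for stricter validation.</p>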

<h3 id="skill-description-optimization">Skill Description Optimization</h3>

<p>Since skill selection happens entirely through LLM reasoning against the description field, optimizing descriptions is critical:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># BAD: Vague, doesn't help the LLM match queries
</span><span class="n">bad_description</span> <span class="o">=</span> <span class="s">"Helps with code stuff"</span>

<span class="c1"># BAD: Too long, wastes Tier 1 tokens
</span><span class="n">bad_description_long</span> <span class="o">=</span> <span class="s">"""
This skill helps developers write better code by providing comprehensive
code review feedback including style checks, performance analysis,
security vulnerability scanning, test coverage assessment, documentation
review, dependency auditing, and architectural pattern validation across
multiple programming languages including Python, JavaScript, TypeScript,
Go, Rust, Java, and C++.
"""</span>  <span class="c1"># ~60 tokens — too many for a description
</span>
<span class="c1"># GOOD: Specific, action-oriented, includes trigger phrases
</span><span class="n">good_description</span> <span class="o">=</span> <span class="s">"""
Performs security-focused code review. Identifies injection vulnerabilities,
auth bypasses, secrets exposure, and insecure deserialization. Use for
PR reviews or codebase security audits.
"""</span>  <span class="c1"># ~30 tokens — concise, specific, trigger-rich
</span></code></pre></div></div>

<hr />

<h2 id="7-google-adks-skilltoolset-reference-implementation">7. Google ADK’s SkillToolset: Reference Implementation</h2>

<p>Google’s Agent Development Kit (ADK) ships with a <code class="language-plaintext highlighter-rouge">SkillToolset</code> class that implements the full three-tier disclosure pattern. Here’s how it works conceptually:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Assumes SkillRegistry, LoadedSkill, and SkillExecutor are already defined or imported.</span>


<span class="k">class</span> <span class="nc">SkillToolset</span><span class="p">:</span>
    <span class="s">"""
    Simplified reconstruction of Google ADK's SkillToolset.
    Provides three tool functions that implement the SKILL.md spec:
    - list_skills: Tier 1 (advertise)
    - load_skill: Tier 2 (load full body)
    - load_skill_resource: Tier 3 (deep dive into references/scripts)
    """</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skills_dir</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">registry</span> <span class="o">=</span> <span class="n">SkillRegistry</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">discover</span><span class="p">(</span><span class="n">project_root</span><span class="o">=</span><span class="n">skills_dir</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">list_skills</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">dict</span><span class="p">]:</span>
        <span class="s">"""
        Tool: List all available skills with names and descriptions.
        This is what the LLM sees at Tier 1.
        """</span>
        <span class="k">return</span> <span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"name"</span><span class="p">:</span> <span class="n">meta</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
                <span class="s">"description"</span><span class="p">:</span> <span class="n">meta</span><span class="p">.</span><span class="n">description</span><span class="p">,</span>
                <span class="s">"allowed_tools"</span><span class="p">:</span> <span class="n">meta</span><span class="p">.</span><span class="n">allowed_tools</span><span class="p">,</span>
            <span class="p">}</span>
            <span class="k">for</span> <span class="n">meta</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">_registry</span><span class="p">.</span><span class="n">values</span><span class="p">()</span>
        <span class="p">]</span>

    <span class="k">def</span> <span class="nf">load_skill</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""
        Tool: Load a skill's full instructions (Tier 2).
        Returns the SKILL.md body for context injection.
        """</span>
        <span class="n">skill</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">activate</span><span class="p">(</span><span class="n">skill_name</span><span class="p">)</span>
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"name"</span><span class="p">:</span> <span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">name</span><span class="p">,</span>
            <span class="s">"instructions"</span><span class="p">:</span> <span class="n">skill</span><span class="p">.</span><span class="n">body</span><span class="p">,</span>
            <span class="s">"allowed_tools"</span><span class="p">:</span> <span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">allowed_tools</span><span class="p">,</span>
            <span class="s">"available_references"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_list_references</span><span class="p">(</span><span class="n">skill</span><span class="p">),</span>
            <span class="s">"available_scripts"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">_list_scripts</span><span class="p">(</span><span class="n">skill</span><span class="p">),</span>
        <span class="p">}</span>

    <span class="k">def</span> <span class="nf">load_skill_resource</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">resource_path</span><span class="p">:</span> <span class="nb">str</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""
        Tool: Load a specific reference file or execute a script (Tier 3).
        For scripts, returns the output — not the source code.
        """</span>
        <span class="n">skill</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">_loaded</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">skill_name</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">skill</span><span class="p">:</span>
            <span class="k">return</span> <span class="p">{</span><span class="s">"error"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"Skill '</span><span class="si">{</span><span class="n">skill_name</span><span class="si">}</span><span class="s">' not loaded. Call load_skill first."</span><span class="p">}</span>

        <span class="n">resource_file</span> <span class="o">=</span> <span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">parent</span> <span class="o">/</span> <span class="n">resource_path</span>

        <span class="k">if</span> <span class="n">resource_path</span><span class="p">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">"scripts/"</span><span class="p">):</span>
            <span class="n">executor</span> <span class="o">=</span> <span class="n">SkillExecutor</span><span class="p">(</span><span class="n">skill</span><span class="p">)</span>
            <span class="n">output</span> <span class="o">=</span> <span class="n">executor</span><span class="p">.</span><span class="n">run_script</span><span class="p">(</span><span class="n">resource_file</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">{</span><span class="s">"type"</span><span class="p">:</span> <span class="s">"script_output"</span><span class="p">,</span> <span class="s">"output"</span><span class="p">:</span> <span class="n">output</span><span class="p">}</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">content</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">registry</span><span class="p">.</span><span class="n">load_reference</span><span class="p">(</span><span class="n">skill_name</span><span class="p">,</span> <span class="n">resource_path</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">{</span><span class="s">"type"</span><span class="p">:</span> <span class="s">"reference"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">content</span><span class="p">}</span>

    <span class="k">def</span> <span class="nf">_list_references</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill</span><span class="p">:</span> <span class="n">LoadedSkill</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
        <span class="n">ref_dir</span> <span class="o">=</span> <span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">parent</span> <span class="o">/</span> <span class="s">"references"</span>
        <span class="k">if</span> <span class="n">ref_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">return</span> <span class="p">[</span><span class="n">f</span><span class="p">.</span><span class="n">name</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">ref_dir</span><span class="p">.</span><span class="n">iterdir</span><span class="p">()</span> <span class="k">if</span> <span class="n">f</span><span class="p">.</span><span class="n">is_file</span><span class="p">()]</span>
        <span class="k">return</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">_list_scripts</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill</span><span class="p">:</span> <span class="n">LoadedSkill</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">list</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
        <span class="n">scripts_dir</span> <span class="o">=</span> <span class="n">skill</span><span class="p">.</span><span class="n">metadata</span><span class="p">.</span><span class="n">path</span><span class="p">.</span><span class="n">parent</span> <span class="o">/</span> <span class="s">"scripts"</span>
        <span class="k">if</span> <span class="n">scripts_dir</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">return</span> <span class="p">[</span><span class="n">f</span><span class="p">.</span><span class="n">name</span> <span class="k">for</span> <span class="n">f</span> <span class="ow">in</span> <span class="n">scripts_dir</span><span class="p">.</span><span class="n">iterdir</span><span class="p">()</span> <span class="k">if</span> <span class="n">f</span><span class="p">.</span><span class="n">is_file</span><span class="p">()]</span>
        <span class="k">return</span> <span class="p">[]</span>
</code></pre></div></div>
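<p>In the reconstruction above, <code class="language-plaintext highlighter-rouge">load_skill_resource</code> joins <code class="language-plaintext highlighter-rouge">resource_path</code> onto the skill directory without checking where the result points. Because that path ultimately comes from the model, a production version would likely want a containment check first. The sketch below shows one way to do it; <code class="language-plaintext highlighter-rouge">resolve_inside</code> and <code class="language-plaintext highlighter-rouge">is_safe_resource</code> are hypothetical helpers, not part of ADK.</p>

```python
from pathlib import Path


def resolve_inside(skill_dir: Path, resource_path: str) -> Path:
    """Resolve resource_path under skill_dir, rejecting anything that escapes it."""
    root = skill_dir.resolve()
    # Joining an absolute resource_path discards root, so the check below
    # also catches inputs such as "/etc/passwd".
    candidate = (root / resource_path).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to: Python 3.9+
        raise ValueError(f"resource path escapes skill directory: {resource_path}")
    return candidate


def is_safe_resource(skill_dir: Path, resource_path: str) -> bool:
    """Boolean convenience wrapper around resolve_inside."""
    try:
        resolve_inside(skill_dir, resource_path)
    except ValueError:
        return False
    return True
```

<p>Resolving before comparing matters: a path such as <code class="language-plaintext highlighter-rouge">scripts/../../other/secret.md</code> starts with <code class="language-plaintext highlighter-rouge">scripts/</code> but lands outside the skill directory, so a prefix test on the raw string is not sufficient.</p>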

<hr />

<h2 id="8-real-world-patterns-and-production-considerations">8. Real-World Patterns and Production Considerations</h2>

<h3 id="81-token-budget-management">8.1 Token Budget Management</h3>

<p>In production, you need to actively manage the token budget across skills:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TokenBudgetManager</span><span class="p">:</span>
    <span class="s">"""Enforce token limits across skill loading."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">max_skill_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">20_000</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">max_tokens</span> <span class="o">=</span> <span class="n">max_skill_tokens</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">current_usage</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_loaded_costs</span><span class="p">:</span> <span class="nb">dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="p">{}</span>

    <span class="k">def</span> <span class="nf">can_load</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">estimated_tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="k">return</span> <span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">current_usage</span> <span class="o">+</span> <span class="n">estimated_tokens</span><span class="p">)</span> <span class="o">&lt;=</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_tokens</span>

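    <span class="k">def</span> <span class="nf">register_load</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">tokens</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="c1"># Release any earlier registration so reloading a skill is not double-counted
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">current_usage</span> <span class="o">-=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded_costs</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">skill_name</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_loaded_costs</span><span class="p">[</span><span class="n">skill_name</span><span class="p">]</span> <span class="o">=</span> <span class="n">tokens</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">current_usage</span> <span class="o">+=</span> <span class="n">tokens</span>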

    <span class="k">def</span> <span class="nf">register_unload</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="n">tokens</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_loaded_costs</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="n">skill_name</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">current_usage</span> <span class="o">-=</span> <span class="n">tokens</span>

    <span class="k">def</span> <span class="nf">get_remaining</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">int</span><span class="p">:</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_tokens</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">current_usage</span>
</code></pre></div></div>
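
<p>When <code class="language-plaintext highlighter-rouge">can_load</code> returns <code class="language-plaintext highlighter-rouge">False</code>, something has to be unloaded before the new skill fits. The sketch below shows one possible policy, least-recently-used eviction, as a standalone function. The names <code class="language-plaintext highlighter-rouge">estimate_tokens</code> and <code class="language-plaintext highlighter-rouge">evict_until_fits</code>, the 4-characters-per-token ratio, and the token costs are illustrative assumptions, not part of the SKILL.md specification or of the manager above.</p>

```python
from collections import OrderedDict


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Rough token estimate; the characters-per-token ratio is an assumption."""
    return max(1, len(text) // chars_per_token)


def evict_until_fits(
    loaded: "OrderedDict[str, int]", max_tokens: int, needed: int
) -> list[str]:
    """Drop least-recently-used skills until `needed` tokens fit in the budget.

    `loaded` maps skill name -> token cost, ordered oldest first.
    Returns the names that were evicted (possibly an empty list).
    """
    evicted = []
    while loaded and sum(loaded.values()) + needed > max_tokens:
        name, _cost = loaded.popitem(last=False)  # remove the oldest entry
        evicted.append(name)
    return evicted


# Hypothetical skills and costs, oldest first
loaded = OrderedDict(
    [("code-review", 8_000), ("test-runner", 7_000), ("deploy", 4_000)]
)
evicted = evict_until_fits(loaded, max_tokens=20_000, needed=6_000)
print(evicted, sum(loaded.values()))
```

<p>In a real agent, each evicted name would also be passed to <code class="language-plaintext highlighter-rouge">register_unload</code> so the manager and the context stay in agreement.</p>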

<h3 id="82-skill-versioning-and-cache-invalidation">8.2 Skill Versioning and Cache Invalidation</h3>

<p>Skills evolve. You need to detect when a skill has changed and invalidate cached activations:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">json</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>


<span class="k">class</span> <span class="nc">SkillCache</span><span class="p">:</span>
    <span class="s">"""Caches parsed skill metadata with content-hash-based invalidation."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">cache_path</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">".skill-cache.json"</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cache_path</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">cache_path</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cache</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_cache</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_load_cache</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">cache_path</span><span class="p">.</span><span class="n">exists</span><span class="p">():</span>
            <span class="k">return</span> <span class="n">json</span><span class="p">.</span><span class="n">loads</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">cache_path</span><span class="p">.</span><span class="n">read_text</span><span class="p">())</span>
        <span class="k">return</span> <span class="p">{}</span>

    <span class="k">def</span> <span class="nf">is_stale</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill</span><span class="p">:</span> <span class="n">SkillMetadata</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">bool</span><span class="p">:</span>
        <span class="s">"""Return True when the skill is uncached or its content hash has changed."""</span>
        <span class="n">cached</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_cache</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">skill</span><span class="p">.</span><span class="n">name</span><span class="p">)</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="n">cached</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">True</span>  <span class="c1"># Not cached at all
</span>        <span class="k">return</span> <span class="n">cached</span><span class="p">[</span><span class="s">"hash"</span><span class="p">]</span> <span class="o">!=</span> <span class="n">skill</span><span class="p">.</span><span class="n">content_hash</span>

    <span class="k">def</span> <span class="nf">update</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">skill</span><span class="p">:</span> <span class="n">SkillMetadata</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_cache</span><span class="p">[</span><span class="n">skill</span><span class="p">.</span><span class="n">name</span><span class="p">]</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"hash"</span><span class="p">:</span> <span class="n">skill</span><span class="p">.</span><span class="n">content_hash</span><span class="p">,</span>
            <span class="s">"path"</span><span class="p">:</span> <span class="nb">str</span><span class="p">(</span><span class="n">skill</span><span class="p">.</span><span class="n">path</span><span class="p">),</span>
            <span class="s">"description"</span><span class="p">:</span> <span class="n">skill</span><span class="p">.</span><span class="n">description</span><span class="p">,</span>
        <span class="p">}</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">cache_path</span><span class="p">.</span><span class="n">write_text</span><span class="p">(</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">_cache</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">))</span>
</code></pre></div></div>
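
<p>The cache above relies on a <code class="language-plaintext highlighter-rouge">content_hash</code> attribute without showing where it comes from. One plausible way to derive it, assuming the hash covers the raw bytes of the skill file, is a SHA-256 digest. The helper name <code class="language-plaintext highlighter-rouge">compute_content_hash</code> and the sample file contents are hypothetical.</p>

```python
import hashlib
import tempfile
from pathlib import Path


def compute_content_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    skill_file = Path(tmp) / "SKILL.md"

    # First version of the skill
    skill_file.write_text("---\nname: demo\n---\nVersion 1\n", encoding="utf-8")
    first = compute_content_hash(skill_file)

    # Any edit to the file yields a different digest
    skill_file.write_text("---\nname: demo\n---\nVersion 2\n", encoding="utf-8")
    second = compute_content_hash(skill_file)

changed = first != second
print(changed)
```

<p>Hashing content rather than comparing modification times means a fresh checkout or a touched file does not trigger a needless invalidation.</p>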

<h3 id="83-skill-composition-and-chaining">8.3 Skill Composition and Chaining</h3>

<p>Complex workflows often require multiple skills to execute in sequence. Here’s a pattern for declarative skill pipelines:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">SkillStep</span><span class="p">:</span>
    <span class="n">skill_name</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">task_template</span><span class="p">:</span> <span class="nb">str</span>  <span class="c1"># Can reference {previous_result}
</span>    <span class="n">condition</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"always"</span>  <span class="c1"># "always", "on_success", "on_failure"
</span>

<span class="k">class</span> <span class="nc">SkillPipeline</span><span class="p">:</span>
    <span class="s">"""Declarative skill pipeline with conditional execution."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">steps</span><span class="p">:</span> <span class="nb">list</span><span class="p">[</span><span class="n">SkillStep</span><span class="p">]):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">name</span> <span class="o">=</span> <span class="n">name</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">steps</span> <span class="o">=</span> <span class="n">steps</span>

    <span class="k">def</span> <span class="nf">to_skill_md</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Generate a meta-skill that orchestrates a pipeline."""</span>
        <span class="n">workflow</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">step</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">steps</span><span class="p">):</span>
            <span class="n">cond</span> <span class="o">=</span> <span class="sa">f</span><span class="s">" (condition: </span><span class="si">{</span><span class="n">step</span><span class="p">.</span><span class="n">condition</span><span class="si">}</span><span class="s">)"</span> <span class="k">if</span> <span class="n">step</span><span class="p">.</span><span class="n">condition</span> <span class="o">!=</span> <span class="s">"always"</span> <span class="k">else</span> <span class="s">""</span>
            <span class="n">workflow</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">. Activate skill `</span><span class="si">{</span><span class="n">step</span><span class="p">.</span><span class="n">skill_name</span><span class="si">}</span><span class="s">`</span><span class="si">{</span><span class="n">cond</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">workflow</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"   Task: </span><span class="si">{</span><span class="n">step</span><span class="p">.</span><span class="n">task_template</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            <span class="n">workflow</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="sa">f</span><span class="s">"   After completion, deactivate `</span><span class="si">{</span><span class="n">step</span><span class="p">.</span><span class="n">skill_name</span><span class="si">}</span><span class="s">`"</span><span class="p">)</span>

        <span class="k">return</span> <span class="sa">f</span><span class="s">"""---
name: pipeline-</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">
description: &gt;
  Orchestrates a multi-step pipeline: </span><span class="si">{</span><span class="s">' → '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">skill_name</span> <span class="k">for</span> <span class="n">s</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">steps</span><span class="p">)</span><span class="si">}</span><span class="s">.
  Use when the task requires sequential execution of multiple specialized skills.
---

# Pipeline: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">name</span><span class="si">}</span><span class="s">

## Steps
</span><span class="si">{</span><span class="nb">chr</span><span class="p">(</span><span class="mi">10</span><span class="p">)</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">workflow</span><span class="p">)</span><span class="si">}</span><span class="s">

## Execution Rules
- Execute steps sequentially
- Pass results from each step to the next via {{previous_result}}
- If a step with condition 'on_failure' exists, execute it only when the preceding step fails
- Dehydrate each skill after its step completes
"""</span>


<span class="c1"># Define a CI/CD pipeline as composed skills
</span><span class="n">ci_cd_pipeline</span> <span class="o">=</span> <span class="n">SkillPipeline</span><span class="p">(</span>
    <span class="n">name</span><span class="o">=</span><span class="s">"ci-cd"</span><span class="p">,</span>
    <span class="n">steps</span><span class="o">=</span><span class="p">[</span>
        <span class="n">SkillStep</span><span class="p">(</span><span class="s">"code-review-security"</span><span class="p">,</span> <span class="s">"Review changes in the current branch"</span><span class="p">),</span>
        <span class="n">SkillStep</span><span class="p">(</span><span class="s">"test-runner"</span><span class="p">,</span> <span class="s">"Run full test suite: {previous_result}"</span><span class="p">),</span>
        <span class="n">SkillStep</span><span class="p">(</span><span class="s">"deploy-pipeline"</span><span class="p">,</span> <span class="s">"Deploy if tests passed: {previous_result}"</span><span class="p">,</span>
                  <span class="n">condition</span><span class="o">=</span><span class="s">"on_success"</span><span class="p">),</span>
        <span class="n">SkillStep</span><span class="p">(</span><span class="s">"incident-report"</span><span class="p">,</span> <span class="s">"Generate failure report: {previous_result}"</span><span class="p">,</span>
                  <span class="n">condition</span><span class="o">=</span><span class="s">"on_failure"</span><span class="p">),</span>
    <span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div>
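
<p>The pipeline above only generates instructions; the model decides at run time which steps to execute. If you wanted to evaluate the same conditions in code, a minimal sketch could look like the following. It assumes that a skipped step leaves the previous result untouched and that the step runner returns a <code class="language-plaintext highlighter-rouge">(succeeded, result)</code> pair. The functions <code class="language-plaintext highlighter-rouge">should_run</code>, <code class="language-plaintext highlighter-rouge">render_task</code>, <code class="language-plaintext highlighter-rouge">run_pipeline</code>, and <code class="language-plaintext highlighter-rouge">fake_run_step</code> are all hypothetical.</p>

```python
def should_run(condition: str, previous_succeeded: bool) -> bool:
    """Map a step condition to a go/no-go decision."""
    if condition == "always":
        return True
    if condition == "on_success":
        return previous_succeeded
    if condition == "on_failure":
        return not previous_succeeded
    raise ValueError(f"Unknown condition: {condition!r}")


def render_task(template: str, previous_result: str) -> str:
    """Substitute the {previous_result} placeholder, if present."""
    return template.replace("{previous_result}", previous_result)


def run_pipeline(steps, run_step) -> list[str]:
    """Run steps in order and return the names of the steps that executed."""
    executed = []
    previous_result, previous_ok = "", True
    for name, template, condition in steps:
        if not should_run(condition, previous_ok):
            continue  # assumption: skipped steps do not change the state
        task = render_task(template, previous_result)
        previous_ok, previous_result = run_step(name, task)
        executed.append(name)
    return executed


STEPS = [
    ("code-review-security", "Review changes in the current branch", "always"),
    ("test-runner", "Run full test suite: {previous_result}", "always"),
    ("deploy-pipeline", "Deploy if tests passed: {previous_result}", "on_success"),
    ("incident-report", "Generate failure report: {previous_result}", "on_failure"),
]


def fake_run_step(name: str, task: str):
    """Stand-in runner that simulates a failing test suite."""
    ok = name != "test-runner"
    return ok, f"{name}: {'ok' if ok else 'failed'}"


executed = run_pipeline(STEPS, fake_run_step)
print(executed)
```

<p>With the simulated test failure, the deploy step is skipped and the incident report runs, which is the behaviour the <code class="language-plaintext highlighter-rouge">on_success</code> and <code class="language-plaintext highlighter-rouge">on_failure</code> conditions are meant to express.</p>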

<hr />

<h2 id="9-community-ecosystem-and-adoption-metrics">9. Community Ecosystem and Adoption Metrics</h2>

<p>The SKILL.md specification has seen one of the faster adoption curves in the AI tooling ecosystem, likely because its barrier to entry is almost zero — no SDK to install, no runtime dependency, no build step. As of early 2026:</p>

<ul>
  <li>Public repositories collectively host <strong>over a thousand community-authored skills</strong> spanning security, DevOps, data engineering, documentation, and more</li>
  <li>Implementations exist in <strong>more than 30 agent-oriented products</strong>, ranging from CLI tools (Claude Code, Codex CLI, Gemini CLI) to IDE integrations (Copilot, Cursor, JetBrains Junie)</li>
  <li>The <strong><code class="language-plaintext highlighter-rouge">.agents/skills/</code> directory convention</strong> has emerged as the cross-platform discovery path — any spec-compliant agent scans it automatically</li>
  <li>Google’s Agent Development Kit (ADK) treats skills as a <strong>first-class primitive</strong>, shipping a <code class="language-plaintext highlighter-rouge">SkillToolset</code> class with dedicated <code class="language-plaintext highlighter-rouge">list_skills</code>, <code class="language-plaintext highlighter-rouge">load_skill</code>, and <code class="language-plaintext highlighter-rouge">load_skill_resource</code> tool functions</li>
</ul>

<p>The underlying reason this “author once, activate anywhere” model works is the format’s deliberate minimalism: Markdown content, YAML metadata, and a filesystem convention — nothing more.</p>
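
<p>To make the filesystem convention concrete, here is a minimal discovery sketch. It assumes a layout of <code class="language-plaintext highlighter-rouge">.agents/skills/&lt;skill-name&gt;/SKILL.md</code> and reads only simple single-line <code class="language-plaintext highlighter-rouge">key: value</code> frontmatter; a real loader would use a proper YAML parser to handle folded descriptions. The function names and the sample skill are hypothetical.</p>

```python
import tempfile
from pathlib import Path


def read_frontmatter(text: str) -> dict[str, str]:
    """Parse top-level `key: value` pairs between the leading '---' markers."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and not line.startswith(" "):  # ignore indented continuation lines
            meta[key.strip()] = value.strip()
    return meta


def discover_skills(root: Path) -> dict[str, str]:
    """Return {skill name: description} for every skill found under root."""
    skills_dir = root / ".agents" / "skills"
    found = {}
    if not skills_dir.is_dir():
        return found
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_md.read_text(encoding="utf-8"))
        if "name" in meta:
            found[meta["name"]] = meta.get("description", "")
    return found


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    skill_dir = root / ".agents" / "skills" / "log-triage"
    skill_dir.mkdir(parents=True)
    (skill_dir / "SKILL.md").write_text(
        "---\nname: log-triage\ndescription: Summarise error logs.\n---\n# Log triage\n",
        encoding="utf-8",
    )
    skills = discover_skills(root)

print(skills)
```

<p>Only the name and description are read at this stage, which is what keeps the startup cost of an installed skill small.</p>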

<hr />

<h2 id="10-key-takeaways-for-agent-developers">10. Key Takeaways for Agent Developers</h2>

<ol>
  <li>
    <p><strong>Don’t confuse skills with tools.</strong> Tools grant discrete capabilities (read, write, search). Skills reshape the agent’s reasoning by injecting domain knowledge into context. Treating them interchangeably leads to architectures that are either bloated or brittle.</p>
  </li>
  <li>
    <p><strong>Invest heavily in the description field.</strong> Because skill routing relies entirely on the model’s own judgment against frontmatter descriptions, a vague or verbose description is functionally equivalent to a missing skill — the model will never select it.</p>
  </li>
  <li>
    <p><strong>Progressive disclosure is what makes scale possible.</strong> The ~50× reduction in startup tokens is not merely a cost saving; it is the architectural property that allows an agent to have hundreds of installed skills without any degradation in response quality or latency.</p>
  </li>
  <li>
    <p><strong>Keep skills and MCP on separate planes.</strong> Skills encode procedural knowledge (<em>how</em> to approach a task). MCP provides connectivity (<em>what</em> services to call). When these layers have no shared state, you gain composability — swap either side without touching the other.</p>
  </li>
  <li>
    <p><strong>Dehydrate aggressively in multi-step workflows.</strong> The load → execute → unload → repeat cycle ensures that context consumption tracks the <em>current</em> step, not the cumulative workflow. Without dehydration, a five-skill pipeline can exhaust the context window before reaching step three.</p>
  </li>
  <li>
    <p><strong>Respect the 500-line guideline for Tier 2 bodies.</strong> Anything longer should be refactored into <code class="language-plaintext highlighter-rouge">references/</code> files that load on-demand at Tier 3, keeping activation costs predictable.</p>
  </li>
  <li>
    <p><strong>Design scripts for minimal, structured output.</strong> Since only stdout enters the model’s context, a well-designed skill script functions as a compression layer — transforming hundreds of lines of logic into a handful of actionable output lines.</p>
  </li>
</ol>
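
<p>The effect of dehydration on context consumption (takeaway 5) can be checked with simple arithmetic. The sketch below compares peak skill-token usage for a five-skill pipeline with and without unloading after each step. The per-skill costs are made-up illustrative numbers and <code class="language-plaintext highlighter-rouge">peak_usage</code> is a hypothetical helper.</p>

```python
def peak_usage(step_costs: list[int], dehydrate: bool) -> int:
    """Return the highest skill-token usage reached while running all steps."""
    current = peak = 0
    for cost in step_costs:
        current += cost            # load the skill for this step
        peak = max(peak, current)  # usage is highest while the step executes
        if dehydrate:
            current -= cost        # unload before moving to the next step
    return peak


# Hypothetical token cost of each skill body in a five-step workflow
costs = [6_000, 9_000, 4_000, 7_000, 5_000]

with_dehydration = peak_usage(costs, dehydrate=True)
without_dehydration = peak_usage(costs, dehydrate=False)
print(with_dehydration, without_dehydration)
```

<p>With dehydration the peak equals the cost of the single largest skill; without it the peak is the sum of every skill loaded so far.</p>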

<hr />

<p><em>This article expands on concepts introduced in the Strix newsletter post <a href="https://strix.ai">“What are Agent Skills and How Do Agents Use Them?”</a> with original analysis, architecture diagrams, and production-ready code implementations. Use the code at your own risk (see DISCLAIMER). All Python examples are the author’s own work, designed to demonstrate the patterns described in the SKILL.md specification.</em></p>]]></content><author><name>Marc Buraczynski</name></author><category term="agents" /><category term="LLMs" /><category term="architecture" /><category term="SKILL.md" /><summary type="html"><![CDATA[A deep technical analysis of the SKILL.md specification, progressive disclosure patterns, and how agent skills fundamentally reshape LLM-based agent architectures.]]></summary></entry><entry><title type="html">Your Customers Are Telling You How They Feel — Without Saying a Word. Are You Listening?</title><link href="https://gunnymarc.github.io/posts/2026/03/your-customers-are-telling-you-how-they-feel/" rel="alternate" type="text/html" title="Your Customers Are Telling You How They Feel — Without Saying a Word. Are You Listening?" /><published>2026-03-25T00:00:00-04:00</published><updated>2026-03-25T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/03/your-customers-are-telling-you-how-they-feel</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/03/your-customers-are-telling-you-how-they-feel/"><![CDATA[<p>Imagine walking into your favourite coffee shop. Before you even reach the counter, the barista notices the tension in your face, offers a warm smile, and says, “Rough morning? How about your usual — on the house today?” That small moment of emotional intelligence keeps you coming back for years.</p>

<p>Now imagine if your business could do that — at scale, across thousands of customer interactions, every single day.</p>

<p>That’s the promise of facial emotion detection: technology that teaches computers to read human emotions in real time, the same way that perceptive barista reads yours. And a recent project by AI practitioner Marc Buraczynski proves it’s not just a futuristic concept — it’s here, it works, and it’s ready for the real world.</p>

<hr />

<h3 id="the-55-problem-most-businesses-are-ignoring">The 55% Problem Most Businesses Are Ignoring</h3>

<p>An often-cited figure from Albert Mehrabian’s communication studies holds that up to <strong>55% of emotional communication</strong> happens through facial expressions rather than words. Think about that for a moment. By that estimate, more than half of what your customers, patients, students, and employees are communicating never shows up in a survey response, a support ticket, or an NPS score.</p>

<p>Businesses have spent decades perfecting how they analyse what people <em>say</em>. We’ve built entire industries around text analytics, voice-of-customer platforms, and sentiment analysis of written reviews. But we’ve been largely blind to the <em>majority</em> of the emotional signal — the one written on people’s faces.</p>

<p>Until now.</p>

<hr />

<h3 id="what-if-a-computer-could-read-a-room">What If a Computer Could Read a Room?</h3>

<p>At its core, facial emotion detection works like training a remarkably fast and consistent new team member. You show the system thousands of examples of human faces expressing different emotions — happiness, sadness, surprise, neutrality — and it learns to spot the patterns. The slight upturn of a mouth corner. The widening of eyes. The subtle drop of eyebrows that distinguishes genuine sadness from a relaxed, neutral expression.</p>

<p>What makes Buraczynski’s project particularly noteworthy isn’t just <em>that</em> it works — it’s <em>how well</em> it works, and the strategic decisions behind it.</p>

<p>His system correctly identifies emotions <strong>84% of the time</strong> across four categories. For context, that’s on par with the accuracy rates researchers have measured in <em>humans</em> performing the same task — especially when the expressions are subtle. It’s a level of reliability that makes real business applications viable.</p>

<p>Even more impressive: the system was designed from the ground up for speed and efficiency. It can process an image and deliver an emotion reading in <strong>under 10 milliseconds</strong> — fast enough for live video, in-store cameras, telehealth sessions, or any real-time application you can think of. And it’s compact enough to run on a smartphone or a small device at the point of interaction, with no need to send sensitive facial data to the cloud.</p>

<hr />

<h3 id="why-off-the-shelf-ai-isnt-always-the-answer">Why “Off-the-Shelf” AI Isn’t Always the Answer</h3>

<p>Here’s where this project offers a powerful lesson for business leaders evaluating AI investments.</p>

<p>The conventional wisdom in AI is to start with pre-built, general-purpose models — the kind trained on millions of generic images of cars, dogs, buildings, and landscapes — and then adapt them to your specific problem. It’s faster, it’s cheaper, and it works brilliantly for many use cases.</p>

<p>But Buraczynski tested that approach head-on. He evaluated three of the most popular pre-built AI systems available, and the results were striking: <strong>they all failed</strong>, with accuracy dropping as low as 25% — essentially random guessing.</p>

<p>Why? Because reading human emotions is a <em>specialised skill</em>. The subtle muscular differences between a sad face and a neutral face are nothing like the differences between a photo of a cat and a photo of a truck. General-purpose AI simply wasn’t built for this level of nuance.</p>

<p>The purpose-built system, designed specifically for emotion detection, outperformed the best off-the-shelf option by <strong>more than 33 percentage points</strong>.</p>

<p><strong>The business takeaway is clear:</strong> when the stakes are high and the problem is specialised, custom-built AI solutions can dramatically outperform generic ones. The upfront investment in a tailored approach pays for itself many times over in accuracy, reliability, and ultimately, business outcomes.</p>

<hr />

<h3 id="where-this-technology-creates-real-business-value">Where This Technology Creates Real Business Value</h3>

<p>So where does facial emotion detection actually move the needle? The applications span virtually every industry that involves human interaction — which is to say, nearly all of them.</p>

<p><strong>Retail &amp; Customer Experience</strong>
Picture a flagship store where digital displays adjust their content based on how shoppers are feeling. A customer who looks frustrated gets a prompt offering assistance. Checkout experiences are monitored not by clunky post-purchase surveys, but by real-time emotional response. Retailers gain a continuous, honest feedback loop that surveys simply cannot replicate.</p>

<p><strong>Healthcare &amp; Mental Health</strong>
Therapists and clinicians could use emotion detection as a supplementary diagnostic tool — tracking a patient’s emotional patterns over time, flagging subtle shifts that might indicate a change in mental health status, or helping assess non-verbal patients. In telehealth, where reading a patient through a screen is inherently harder, this technology becomes a powerful clinical aid.</p>

<p><strong>Human Resources &amp; Workplace Wellness</strong>
Forward-thinking organisations are exploring how emotion-aware systems can gauge employee engagement during training sessions, identify burnout signals in remote teams, and create more responsive workplace environments — all while respecting privacy boundaries and ethical guidelines.</p>

<p><strong>Education &amp; E-Learning</strong>
Online learning platforms can detect when a student is confused, bored, or disengaged, and adapt the content in real time — slowing down, offering additional examples, or shifting to a different teaching approach. It’s the digital equivalent of a great teacher who notices the puzzled look on a student’s face and adjusts their explanation accordingly.</p>

<p><strong>Automotive Safety</strong>
Driver monitoring systems can detect drowsiness, distraction, or emotional distress and trigger alerts before an accident occurs. At highway speeds, milliseconds matter — and this system delivers readings in under 10 of them.</p>

<p><strong>Entertainment &amp; Media</strong>
Content creators and studios can measure audience emotional response to trailers, advertisements, and programming in real time, replacing subjective focus groups with objective, scalable emotional data.</p>

<hr />

<h3 id="the-privacy-question--and-why-it-actually-favours-this-approach">The Privacy Question — And Why It Actually Favours This Approach</h3>

<p>Any conversation about facial analysis technology must address privacy, and rightly so. Here’s where the engineering decisions in this project align perfectly with business ethics.</p>

<p>Because the system is compact enough to run directly on a local device — a phone, a tablet, a camera unit — facial data never needs to leave that device. There’s no cloud upload, no central database of faces, no data trail. The system reads the emotion, delivers the insight, and the image can be discarded immediately.</p>

<p>This <strong>edge-first architecture</strong> isn’t just a technical achievement; it’s a competitive advantage in a regulatory environment that increasingly demands data minimisation and local processing. For industries bound by GDPR, HIPAA, or similar frameworks, on-device processing isn’t a nice-to-have — it’s becoming a requirement.</p>

<hr />

<h3 id="what-business-leaders-should-take-away">What Business Leaders Should Take Away</h3>

<p>The race to understand customers, employees, and stakeholders better is intensifying. The organisations that will lead in the next decade are those that can sense and respond to human emotion at scale — not just through the words people choose, but through the expressions they can’t hide.</p>

<p>This project demonstrates three strategic principles worth remembering:</p>

<ol>
  <li>
    <p><strong>Custom beats generic when the problem is specialised.</strong> Don’t assume that the biggest, most popular AI model is the right one for your use case. Sometimes a focused solution built for your exact problem will outperform it by a wide margin.</p>
  </li>
  <li>
    <p><strong>Speed and efficiency unlock new possibilities.</strong> A system that takes minutes to process is a research tool. A system that responds in milliseconds is a product. The difference between the two is where business value lives.</p>
  </li>
  <li>
    <p><strong>Privacy-by-design is a feature, not a constraint.</strong> Building AI that processes data locally and minimises exposure isn’t just ethically sound — it reduces infrastructure costs, simplifies compliance, and builds the trust that customers increasingly demand.</p>
  </li>
</ol>

<hr />

<h3 id="the-future-is-emotionally-intelligent">The Future Is Emotionally Intelligent</h3>

<p>We’re entering an era where the best businesses won’t just understand what their customers <em>do</em> — they’ll understand how their customers <em>feel</em>. Facial emotion detection is one of the foundational technologies making that possible, and as this project shows, it’s already accurate, fast, and deployable enough for real-world use.</p>

<p>The question isn’t whether this technology will reshape customer experience, healthcare, education, and workplace culture. It’s whether your organisation will be among the first to harness it — or among those playing catch-up.</p>

<p><strong>The faces are already speaking. The only question is: who’s building the systems to listen?</strong></p>

<hr />

<p><em>Inspired by the facial emotion detection research of Marc Buraczynski (March 2026). If you’re exploring how emotion-aware AI could create value in your industry, I’d love to hear your thoughts in the comments.</em></p>]]></content><author><name>Marc Buraczynski</name></author><category term="facial emotion detection" /><category term="CNNs" /><category term="deep learning" /><category term="computer vision" /><summary type="html"><![CDATA[Imagine walking into your favourite coffee shop. Before you even reach the counter, the barista notices the tension in your face, offers a warm smile, and says, “Rough morning? How about your usual — on the house today?” That small moment of emotional intelligence keeps you coming back for years.]]></summary></entry><entry><title type="html">From Pixels to Predictions: How CNNs Crushed ANNs in the Battle for Street-Level Recognition</title><link href="https://gunnymarc.github.io/posts/2026/03/from-pixels-to-predictions-how-cnns-crushed-anns-in-the-battle-for-street-level-recognition/" rel="alternate" type="text/html" title="From Pixels to Predictions: How CNNs Crushed ANNs in the Battle for Street-Level Recognition" /><published>2026-03-15T00:00:00-04:00</published><updated>2026-03-15T00:00:00-04:00</updated><id>https://gunnymarc.github.io/posts/2026/03/from-pixels-to-predictions-how-cnns-crushed-anns-in-the-battle-for-street-level-recognition</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/03/from-pixels-to-predictions-how-cnns-crushed-anns-in-the-battle-for-street-level-recognition/"><![CDATA[<p><em>How choosing the right neural network architecture took digit recognition accuracy from 65% to 91% – and what business leaders should know about it.</em></p>

<hr />

<h2 id="the-business-problem-reading-house-numbers-at-scale">The Business Problem: Reading House Numbers at Scale</h2>

<p>Imagine you are Google, and you need to read billions of house numbers from Street View photos to improve map accuracy. Hiring humans to manually transcribe every address number from every street-level photo in the world is not feasible. You need a machine that can look at a tiny, grainy, sometimes blurry photo of a digit and correctly identify what number it is.</p>

<p>This is the problem behind the Street View House Numbers (SVHN) dataset – one of the most widely used benchmarks in the field of Deep Learning (DL), which is a branch of Artificial Intelligence (AI) that teaches computers to learn patterns from data using layered mathematical models called neural networks. The SVHN dataset contains over 600,000 labeled digit images cropped from real Google Street View photos. Getting this right means better maps, better navigation, and better location services for billions of users.</p>

<p>The question we set out to answer: <strong>Which type of neural network architecture delivers the best accuracy for this real-world image recognition task?</strong></p>

<p><img src="assets/images/sample_digits.png" alt="Sample digits from the SVHN dataset" />
<em>Actual digit images from the dataset. Each is a tiny 32x32 pixel grayscale crop from a street-level photo. Notice the noise, blur, and varying lighting – this is not a clean laboratory dataset.</em></p>

<hr />

<h2 id="the-experiment-a-head-to-head-comparison">The Experiment: A Head-to-Head Comparison</h2>

<p>We built and tested four different neural network models on the same 60,000-image subset of SVHN (42,000 images for training and 18,000 for testing). The models fall into two fundamentally different families:</p>

<ul>
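  <li><strong>Artificial Neural Network (ANN):</strong> Used here to mean a fully connected (dense) network, in which every input is connected to every processing unit in the next layer. These networks are general-purpose pattern recognizers that treat each pixel as a separate input, with no built-in notion of which pixels are neighbors.</li>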
  <li><strong>Convolutional Neural Network (CNN):</strong> A type of neural network specifically designed for image data. CNNs use small sliding filters to detect visual patterns like edges and shapes, preserving the spatial structure of the image.</li>
</ul>

<p>The core difference is straightforward: ANNs ignore the fact that the input is an image, while CNNs are built to exploit it.</p>

<p><img src="assets/images/ann_vs_cnn_concept.png" alt="How ANNs and CNNs see images differently" />
<em>The fundamental difference. An ANN flattens the image into a long list of numbers, destroying the spatial layout. A CNN keeps the 2D structure intact and scans for visual features like edges and curves – the way a human eye would.</em></p>
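<p>To make the flattening point concrete, here is a small illustrative sketch in plain Python. It assumes row-major flattening of a 32x32 image; the helper name <code>flat_index</code> is made up for this example and is not taken from the original experiment.</p>

```python
# Illustrative sketch: where two neighboring pixels land after flattening.
SIDE = 32  # the article's SVHN crops are 32x32


def flat_index(row, col, side=SIDE):
    """Position of pixel (row, col) in a row-major flattened list."""
    return row * side + col


top = flat_index(10, 10)
below = flat_index(11, 10)   # directly underneath `top` in the image
right = flat_index(10, 11)   # directly to the right of `top`

vertical_gap = below - top
horizontal_gap = right - top

print(f"Flattened length: {SIDE * SIDE}")
print(f"Vertical neighbors are {vertical_gap} positions apart")
print(f"Horizontal neighbors are {horizontal_gap} position apart")
```

<p>Two pixels that touch vertically end up a full image row apart in the list the ANN receives, and nothing in a dense layer's structure tells it that those two inputs were ever adjacent.</p>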

<hr />

<h2 id="how-the-data-flows-from-raw-photo-to-prediction">How the Data Flows: From Raw Photo to Prediction</h2>

<p>Before any model can learn, the raw image data must be transformed into a format the computer can work with. Here is the pipeline every image passes through:</p>

<p><img src="assets/images/data_pipeline.png" alt="The data pipeline from photo to prediction" /></p>

<ol>
  <li><strong>Raw Image:</strong> A cropped digit photo from Google Street View.</li>
  <li><strong>32x32 Pixel Grid:</strong> Each image is a 32x32 grid of pixel values ranging from 0 (black) to 255 (white).</li>
  <li><strong>Normalization:</strong> Pixel values are scaled to a 0-to-1 range so the model trains more efficiently and stably.</li>
  <li><strong>Label Encoding:</strong> Each digit label (0-9) is converted into a ten-element vector using a technique called One-Hot Encoding (OHE), which represents each category as a binary vector. For example, the digit “3” becomes [0, 0, 0, 1, 0, 0, 0, 0, 0, 0].</li>
  <li><strong>Model Training:</strong> The processed images are fed into the neural network, which adjusts its internal weights to learn digit patterns.</li>
  <li><strong>Prediction:</strong> Given a new, unseen image, the model outputs which digit it believes is shown.</li>
</ol>
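<p>Steps 3 and 4 of the pipeline can be sketched in a few lines of plain Python. This is a minimal illustration rather than the code used in the experiment: the function names and the 2x2 stand-in image are assumptions made for the example.</p>

```python
def normalize(pixels):
    """Scale raw 0-255 pixel values into the 0-to-1 range."""
    return [[value / 255.0 for value in row] for row in pixels]


def one_hot(digit, num_classes=10):
    """Return a binary vector with a single 1 at the digit's index."""
    if not 0 <= digit < num_classes:
        raise ValueError(f"digit must be between 0 and {num_classes - 1}")
    vector = [0] * num_classes
    vector[digit] = 1
    return vector


# Made-up 2x2 stand-in for a 32x32 crop, just to keep the output readable.
tiny_image = [[0, 128], [255, 64]]

print(normalize(tiny_image))
print(one_hot(3))
```

<p>In a real project these steps would usually be done on whole arrays at once with a numerical library, but the arithmetic is the same.</p>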

<hr />

<h2 id="the-four-contenders">The Four Contenders</h2>

<h3 id="ann-model-1-the-simple-baseline">ANN Model 1: The Simple Baseline</h3>

<p>The first model was intentionally simple – a minimal ANN with just two hidden processing layers (64 and 32 nodes). Think of it as a first draft: fast to build, fast to train, but limited in what it can learn.</p>

<p><strong>Result: ~65% accuracy</strong></p>

<p>With 10 possible digits, random guessing would yield 10% accuracy. So 65% is a meaningful lift – the model clearly learned something – but it is far from production quality. It reached its performance ceiling quickly and plateaued after just 5-7 rounds of training (called Epochs, which are complete passes through the entire training dataset).</p>

<p><img src="assets/images/ann_model1_accuracy.png" alt="ANN Model 1 training accuracy" />
<em>ANN Model 1’s training curve. Both training and validation accuracy plateau quickly, indicating the model has reached its capacity limit.</em></p>

<h3 id="ann-model-2-more-depth-more-regularization">ANN Model 2: More Depth, More Regularization</h3>

<p>The second model was a deeper ANN with five hidden layers (256, 128, 64, 64, and 32 nodes) and two key enhancements:</p>

<ul>
  <li><strong>Dropout:</strong> A regularization technique that randomly deactivates 20% of neurons during each training step. Dropout forces the network to learn more robust patterns rather than memorizing the training data. Think of it like training with a blindfold – it forces the model to develop multiple strategies for identifying digits, rather than relying too heavily on any single pathway.</li>
  <li><strong>Batch Normalization (BN):</strong> A technique that normalizes the values flowing through the network at each layer, stabilizing and accelerating the training process. BN acts like a quality control checkpoint that keeps the numbers flowing through the network in a healthy range.</li>
</ul>
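<p>Dropout is simple enough to sketch directly. The version below is the common "inverted" form, in which the surviving values are scaled up during training so that nothing needs to change at prediction time. It is a plain-Python illustration with made-up inputs, not the framework layer used in the experiment.</p>

```python
import random


def dropout(activations, rate=0.2, training=True, rng=None):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(activations)  # at prediction time every neuron stays active
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else value * keep_scale for value in activations]


layer_output = [1.0] * 10
rng = random.Random(0)  # fixed seed so repeated runs drop the same neurons

print(dropout(layer_output, rate=0.2, rng=rng))   # some zeros, the rest scaled to 1.25
print(dropout(layer_output, training=False))      # unchanged
```

<p>Because a different random subset is silenced on every training step, no single neuron can become indispensable.</p>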

<p><strong>Result: ~75% accuracy</strong></p>

<p>A 10-percentage-point improvement over the simple model. The deeper architecture and regularization helped, but the fundamental limitation remained: flattening the 2D image into a 1D list of numbers destroys the spatial relationships between pixels that are critical for recognizing visual patterns.</p>

<p><img src="assets/images/ann_model2_accuracy.png" alt="ANN Model 2 training accuracy" />
<em>ANN Model 2 shows steady improvement over 30 epochs with a moderate gap between training and validation accuracy – a sign of some overfitting, where the model performs better on training data than on new, unseen data.</em></p>

<h3 id="cnn-model-1-spatial-awareness-changes-everything">CNN Model 1: Spatial Awareness Changes Everything</h3>

<p>The first Convolutional Neural Network (CNN) was a game-changer. Instead of flattening the image, it preserved the 2D spatial structure and scanned it with small 3x3 filters that detect local visual features like edges and corners.</p>

<p>This model used two convolutional layers (16 and 32 filters) and a Max Pooling (MP) layer, which reduces image dimensions by keeping only the strongest response in each small region. Its activation function was the Leaky Rectified Linear Unit (LeakyReLU), which lets a small signal pass for negative inputs so that neurons do not become permanently inactive.</p>
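<p>The three building blocks can be illustrated on a toy image in plain Python. This is a sketch under stated assumptions: the 6x6 image and the vertical-edge filter are hand-written for the example (a trained CNN learns its own filter values), and the LeakyReLU slope of 0.1 is a placeholder, since the article does not state the value that was used.</p>

```python
def conv2d_valid(image, kernel):
    """Slide a square kernel over the image (no padding, stride 1)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def leaky_relu(feature_map, slope=0.1):
    """Keep positive values; shrink negative values instead of zeroing them."""
    return [[v if v > 0 else slope * v for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 window."""
    return [
        [
            max(feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1])
            for c in range(0, len(feature_map[0]) - 1, 2)
        ]
        for r in range(0, len(feature_map) - 1, 2)
    ]


# Toy 6x6 image: two dark columns on the left, bright everywhere else.
image = [[0, 0, 1, 1, 1, 1] for _ in range(6)]
# Hand-written 3x3 filter that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]

features = conv2d_valid(image, vertical_edge)  # 4x4 feature map
activated = leaky_relu(features)
pooled = max_pool_2x2(activated)               # 2x2 summary

print(features)
print(pooled)
```

<p>The filter responds strongly only where the edge sits, and pooling keeps that response while halving each dimension of the feature map.</p>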

<p><strong>Result: ~86% accuracy</strong></p>

<p>The jump from 75% to 86% – an 11-percentage-point improvement – came entirely from changing the architecture to one that understands spatial structure. No additional data, no longer training time. Just a smarter way of looking at the image.</p>

<p>However, this model showed signs of Overfitting – the model memorized training patterns instead of learning generalizable features. Without regularization, the gap between training accuracy and validation accuracy grew wider as training progressed.</p>

<p><img src="assets/images/cnn_model1_accuracy.png" alt="CNN Model 1 training accuracy" />
<em>CNN Model 1 demonstrates the dramatic accuracy jump from switching to convolutional architecture. The widening gap between training and validation curves signals overfitting that needs to be addressed.</em></p>

<h3 id="cnn-model-2-the-champion">CNN Model 2: The Champion</h3>

<p>The final model combined the spatial intelligence of CNNs with comprehensive regularization. It featured:</p>

<ul>
  <li><strong>Four convolutional layers</strong> organized into two blocks (16, 32, 32, and 64 filters), creating a hierarchy: the first block detects simple features (edges, gradients), while the second block combines them into complex patterns (curves, digit shapes).</li>
  <li><strong>Two Batch Normalization (BN) layers</strong>, one after each pooling stage, to stabilize training.</li>
  <li><strong>Dropout at 50%</strong> on the dense classification layer – aggressively preventing the model from over-relying on any single neuron.</li>
</ul>

<p><img src="assets/images/cnn_architecture.png" alt="CNN Model 2 architecture" />
<em>The winning architecture. Convolutional blocks extract increasingly complex visual features, while pooling, Batch Normalization, and Dropout prevent overfitting.</em></p>

<p><strong>Result: 91% accuracy</strong></p>

<p><img src="assets/images/cnn_model2_accuracy.png" alt="CNN Model 2 training accuracy" />
<em>CNN Model 2 shows the tightest gap between training and validation accuracy among all four models – strong evidence of good generalization to unseen data.</em></p>

<hr />

<h2 id="the-scoreboard">The Scoreboard</h2>

<p><img src="assets/images/model_comparison_bar.png" alt="Model performance comparison" />
<em>Four models, one dataset, dramatically different results. The 26-percentage-point improvement from the simplest ANN to the best CNN is entirely driven by architectural choices.</em></p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th>Architecture</th>
      <th style="text-align: center">Test Accuracy</th>
      <th>Key Takeaway</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>ANN Model 1</td>
      <td>2 hidden layers, no regularization</td>
      <td style="text-align: center">65%</td>
      <td>Simple baseline; limited capacity</td>
    </tr>
    <tr>
      <td>ANN Model 2</td>
      <td>5 hidden layers + Dropout + Batch Normalization</td>
      <td style="text-align: center">75%</td>
      <td>Deeper is better, but spatial info is still lost</td>
    </tr>
    <tr>
      <td>CNN Model 1</td>
      <td>2 conv layers, no regularization</td>
      <td style="text-align: center">86%</td>
      <td>Preserving spatial structure yields huge gains</td>
    </tr>
    <tr>
      <td><strong>CNN Model 2</strong></td>
      <td><strong>4 conv layers + Batch Normalization + Dropout</strong></td>
      <td style="text-align: center"><strong>91%</strong></td>
      <td><strong>Best model: depth + spatial awareness + regularization</strong></td>
    </tr>
  </tbody>
</table>

<hr />

<h2 id="where-models-succeed-and-struggle">Where Models Succeed and Struggle</h2>

<p>Not all digits are created equal. Some are visually distinctive and easy for any model to recognize. Others are ambiguous and trip up even the best architecture.</p>

<p><img src="assets/images/per_digit_comparison.png" alt="Per-digit performance comparison" />
<em>The CNN improves performance on every single digit, but the biggest gains come on the digits that ANNs struggle with most: 3, 5, and 8.</em></p>

<p><strong>Easy digits (high accuracy for both):</strong> Digits 0 and 7 have distinctive shapes – a closed oval and an angular stroke – that even ANNs can recognize fairly well.</p>

<p><strong>Hard digits (where CNNs shine brightest):</strong></p>
<ul>
  <li><strong>Digit 3</strong> is frequently confused with 8 (both have two curved sections). The CNN improved F1-Score, a single metric that balances both the precision and recall of a model’s predictions, from 70% to 87%.</li>
  <li><strong>Digit 5</strong> shares visual features with 6 (similar upper stroke). The CNN improved its F1-Score from 69% to 90%.</li>
  <li><strong>Digit 8</strong> is the trickiest – its visual complexity confuses ANNs badly (69% F1-Score), but CNNs bring it up to 89%.</li>
</ul>
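<p>For readers who want to see how an F1-Score is computed, the sketch below derives it for one digit class from counts of correct and incorrect predictions. F1 is the harmonic mean of precision and recall. The counts are made up for illustration and are not taken from the experiment.</p>

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall for a single class."""
    predicted = true_positives + false_positives   # everything the model called this digit
    actual = true_positives + false_negatives      # everything that really was this digit
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up counts for one digit class.
score = f1_score(true_positives=80, false_positives=20, false_negatives=10)
print(f"F1-Score: {score:.1%}")
```

<p>A model cannot score well on F1 by being cautious or aggressive alone: it has to find most of the true examples of a digit while also avoiding false alarms.</p>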

<p>The CNN’s confusion matrix tells the full story:</p>

<p><img src="assets/images/cnn_confusion_matrix.png" alt="CNN Model 2 confusion matrix" />
<em>The confusion matrix for the winning CNN model. The strong diagonal (high numbers on the top-left to bottom-right line) shows correct predictions. Off-diagonal entries reveal which digits still get confused – mainly visually similar pairs like 3/8 and 5/6.</em></p>

<p>For comparison, here is the ANN’s confusion matrix – notice how much more scattered the errors are:</p>

<p><img src="assets/images/ann_confusion_matrix.png" alt="ANN Model 2 confusion matrix" />
<em>The ANN confusion matrix shows significantly more misclassifications across all digit pairs, with lower values along the diagonal.</em></p>

<hr />

<h2 id="what-this-means-for-business">What This Means for Business</h2>

<h3 id="1-architecture-choice-matters-more-than-brute-force">1. Architecture Choice Matters More Than Brute Force</h3>

<p>The most important finding is not about tuning hyperparameters or training longer. The single biggest accuracy improvement (from 75% to 86%) came from switching from an ANN to a CNN – a fundamentally different way of processing the data. In business terms: choosing the right tool for the job matters more than optimizing the wrong tool.</p>

<h3 id="2-regularization-is-insurance-against-overfitting">2. Regularization is Insurance Against Overfitting</h3>

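<p>CNN Model 2 added Dropout and Batch Normalization (BN), along with two more convolutional layers, and accuracy rose from 86% to 91% while the gap between training and validation accuracy narrowed. Because depth and regularization changed together, this experiment cannot attribute the gain to either one alone, but the narrower gap points to regularization as the reason the model holds up on unseen data. Regularization is not optional: it is the difference between a model that performs well in testing and one that performs well in production.</p>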

<h3 id="3-the-91-accuracy-in-context">3. The 91% Accuracy in Context</h3>

<p>For a real-world deployment like Google’s address recognition system, 91% accuracy on a challenging dataset like SVHN is strong. For context, comparable CNN architectures typically reach approximately 98-99% on the cleaner Modified National Institute of Standards and Technology (MNIST) dataset, a benchmark of centered handwritten digits on plain, uniform backgrounds. The SVHN images include varying lighting, fonts, backgrounds, and camera angles that make it a much harder problem.</p>

<h3 id="4-diminishing-returns-and-the-path-forward">4. Diminishing Returns and the Path Forward</h3>

<p>The jump from ANN to CNN was dramatic (75% to 91%), but pushing beyond 91% requires techniques like:</p>
<ul>
  <li><strong>Data Augmentation (DA):</strong> Artificially expanding the training set by applying random rotations, shifts, and zooms to existing images, teaching the model to recognize digits from more angles and positions.</li>
  <li><strong>Learning Rate Scheduling (LRS):</strong> Gradually reducing the speed at which the model adjusts its weights as training progresses, allowing finer convergence.</li>
  <li><strong>Transfer Learning (TL):</strong> Using a pre-trained model that has already learned general visual features from millions of images and fine-tuning it for digit recognition.</li>
</ul>
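<p>Two of these techniques are easy to sketch in plain Python. The example below shows a step-decay learning rate schedule and a one-direction pixel shift as a stand-in for data augmentation. All names and default values (halving every 10 epochs, shifting right) are assumptions chosen for illustration, not settings from the experiment.</p>

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))


def shift_right(image, pixels, fill=0.0):
    """Shift every row right by `pixels` (0 <= pixels < width), padding with `fill`."""
    if pixels == 0:
        return [list(row) for row in image]
    return [[fill] * pixels + list(row[:-pixels]) for row in image]


for epoch in (0, 10, 25):
    print(f"epoch {epoch}: learning rate {step_decay(0.001, epoch)}")

# Made-up 2x3 image; the shifted copy would be added to the training set
# with the same label as the original.
original = [[1, 2, 3], [4, 5, 6]]
print(shift_right(original, 1))
```

<p>A real augmentation pipeline would pick the shift, rotation, or zoom at random for each image, which is how the model sees the same digit in many slightly different positions.</p>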

<p>Each additional technique tends to yield smaller gains than the one before it, so the business question becomes: is the marginal improvement worth the additional computational cost?</p>

<hr />

<h2 id="the-bottom-line">The Bottom Line</h2>

<p>This study demonstrates a principle that applies far beyond digit recognition: <strong>when your data has inherent structure, use an architecture that respects it.</strong> Images have spatial structure. Time series have temporal structure. Text has sequential structure. In Machine Learning (ML), the application of algorithms that learn patterns from data to make predictions or decisions without being explicitly programmed for each case, choosing a model architecture that matches the structure of your data is one of the highest-leverage decisions in a project.</p>

<p>The CNN did not succeed because it was bigger or trained longer. It succeeded because it was designed to see images the way they are meant to be seen: as two-dimensional spatial patterns, not as shuffled lists of numbers.</p>

<hr />

<p><em>This analysis was conducted as part of the MIT Professional Education Applied Artificial Intelligence and Deep Signal Processing (AAIDSP) program, using TensorFlow (TF) – an open-source machine learning framework developed by Google – running on Google Colab with A100 Graphics Processing Unit (GPU) acceleration, which is specialized hardware designed to perform the massive parallel computations that neural network training requires.</em></p>]]></content><author><name>Marc Buraczynski</name></author><category term="CNN" /><category term="ANN" /><category term="deep learning" /><category term="SVHN" /><category term="computer vision" /><summary type="html"><![CDATA[How choosing the right neural network architecture took digit recognition accuracy from 65% to 91% – and what business leaders should know about it.]]></summary></entry><entry><title type="html">Observability for LLMs: Understanding the Layers</title><link href="https://gunnymarc.github.io/posts/2026/03/observability-for-llms-understanding-the-layers/" rel="alternate" type="text/html" title="Observability for LLMs: Understanding the Layers" /><published>2026-03-05T00:00:00-05:00</published><updated>2026-03-05T00:00:00-05:00</updated><id>https://gunnymarc.github.io/posts/2026/03/observability-for-llms-understanding-the-layers</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/03/observability-for-llms-understanding-the-layers/"><![CDATA[<p><em>A practical guide to monitoring, debugging, and optimizing Large Language Model applications in production – with implementation examples for OpenTelemetry, AppDynamics APM, and Splunk Observability Cloud.</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#introduction-why-your-llm-needs-a-check-engine-light">Introduction: Why Your LLM Needs a Check Engine Light</a></li>
  <li><a href="#what-is-observability-and-why-does-it-matter">What is Observability and Why Does It Matter?</a></li>
  <li><a href="#the-restaurant-kitchen-an-analogy-for-llm-pipelines">The Restaurant Kitchen: An Analogy for LLM Pipelines</a></li>
  <li><a href="#traces-and-spans-the-backbone-of-observability">Traces and Spans: The Backbone of Observability</a></li>
  <li><a href="#the-five-layers-of-llm-observability">The Five Layers of LLM Observability</a></li>
  <li><a href="#why-each-layer-matters-debugging-cost-and-drift">Why Each Layer Matters: Debugging, Cost, and Drift</a></li>
  <li><a href="#implementation-with-opentelemetry">Implementation with OpenTelemetry</a></li>
  <li><a href="#integration-with-appdynamics-apm">Integration with AppDynamics APM</a></li>
  <li><a href="#integration-with-splunk-observability-cloud">Integration with Splunk Observability Cloud</a></li>
  <li><a href="#component-level-evaluation-beyond-black-box-testing">Component-Level Evaluation: Beyond Black-Box Testing</a></li>
  <li><a href="#best-practices-for-production-llm-observability">Best Practices for Production LLM Observability</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ol>

<hr />

<h2 id="introduction-why-your-llm-needs-a-check-engine-light">Introduction: Why Your LLM Needs a Check Engine Light</h2>

<p>Imagine driving a car with no dashboard. No speedometer, no fuel gauge, no check engine light. You press the gas, the car moves, and everything seems fine – until it doesn’t. When the car breaks down on the highway, you have no idea why. Was it the engine? The transmission? Did you run out of oil? Without instruments, you’re left guessing.</p>

<p>This is exactly the situation many organizations find themselves in after deploying Large Language Model (LLM) applications to production. The application receives a user’s question, something happens in the middle, and an answer comes out the other end. When that answer is wrong, slow, or expensive, teams scramble to figure out why – and they often can’t.</p>

<p>Traditional software engineering has long had an answer to this problem: <strong>observability</strong>, the practice of instrumenting systems so that their internal state can be understood from the outside. Web applications have had distributed tracing, metrics dashboards, and structured logging for years. But LLM applications introduce entirely new layers of complexity. A single request might flow through an embedding model, a vector database, a context assembly step, and finally the language model itself. Each of those steps can fail independently, each has its own latency profile, and each carries its own cost.</p>

<p>This article breaks down the <strong>layers of observability</strong> that production LLM systems require. We’ll use everyday analogies to make the concepts accessible, then dive into concrete Python implementations using three major platforms: OpenTelemetry (the open standard), AppDynamics APM (Cisco’s enterprise solution), and Splunk Observability Cloud. Whether you’re a technical lead instrumenting a RAG pipeline or a product manager trying to understand why your AI feature is underperforming, these layers will give you the mental model to diagnose, optimize, and trust your LLM applications.</p>

<hr />

<h2 id="what-is-observability-and-why-does-it-matter">What is Observability and Why Does It Matter?</h2>

<p><strong>Observability</strong> is the ability to understand what a system is doing on the inside by examining what it produces on the outside. In software, that means collecting three types of signals:</p>

<ul>
  <li><strong>Traces</strong> – the end-to-end journey of a single request through your system.</li>
  <li><strong>Metrics</strong> – numerical measurements aggregated over time (latency, error rate, throughput).</li>
  <li><strong>Logs</strong> – timestamped records of discrete events (“user submitted query,” “embedding model returned 1536 dimensions”).</li>
</ul>

<p>Together, these three signals form the <strong>three pillars of observability</strong>. Think of them as three different types of medical tests. A blood test (metrics) tells you aggregate health numbers. An MRI scan (traces) shows you the detailed internal structure of a single area. A patient’s symptom diary (logs) provides a chronological record of events. No single test is sufficient; you need all three for a complete diagnosis.</p>

<p>For traditional web applications, observability is well-established. When a user clicks “Submit Order” on an e-commerce site, a trace follows that request through the API gateway, the inventory service, the payment processor, and the notification service. If the order fails, engineers can open the trace and see exactly which service failed and why.</p>

<p>LLM applications need the same treatment – but with additional layers that traditional software doesn’t have. When a user asks an AI assistant a question, the request doesn’t just hop between microservices. It undergoes <em>transformations</em>: text becomes vectors, vectors become search results, search results become context, and context becomes a generated response. Each transformation is a potential point of failure, and each requires its own type of monitoring.</p>

<p>The stakes are high. Unlike a failed API call that returns an error code, an LLM can fail <em>silently</em>. It can hallucinate a confident-sounding answer that is completely wrong. It can use the wrong context and produce a plausible but irrelevant response. Without observability at every layer, these silent failures go undetected until a user complains – or worse, acts on bad information.</p>

<hr />

<h2 id="the-restaurant-kitchen-an-analogy-for-llm-pipelines">The Restaurant Kitchen: An Analogy for LLM Pipelines</h2>

<p>To understand why LLM observability needs multiple layers, imagine a high-end restaurant kitchen.</p>

<p>A customer places an order: “I’d like the pan-seared salmon with seasonal vegetables.” That order goes through several stations before a plate arrives at the table:</p>

<ol>
  <li>
    <p><strong>The Host Stand (Query Intake)</strong> – The server writes down the order, noting any allergies or special requests. If the server mishears the order, everything downstream goes wrong.</p>
  </li>
  <li>
    <p><strong>The Prep Station (Embedding)</strong> – The ingredients are washed, measured, and prepared. Raw ingredients are transformed into something the kitchen can work with. If the prep cook grabs the wrong fish, it doesn’t matter how well the chef cooks it.</p>
  </li>
  <li>
    <p><strong>The Walk-In Cooler (Retrieval)</strong> – The cook goes to the refrigerator and selects the specific ingredients needed for this dish. If the cooler is disorganized or the labels are wrong, the cook might grab tilapia instead of salmon.</p>
  </li>
  <li>
    <p><strong>The Assembly Station (Context)</strong> – All the components are gathered onto one workstation: the fish, the vegetables, the sauce, the garnish. The chef reviews everything before cooking. If the plate is overcrowded or missing components, the final dish suffers.</p>
  </li>
  <li>
    <p><strong>The Stove (Generation)</strong> – The chef cooks the dish. This is the most time-consuming and expensive step. Even with perfect ingredients, a distracted chef can burn the fish.</p>
  </li>
</ol>

<p>Now, here’s the critical insight: if the customer sends the dish back because it “doesn’t taste right,” the head chef needs to figure out <em>which station</em> made the mistake. Was it a bad ingredient from prep? The wrong cut from the cooler? Too much sauce at assembly? Or did the cook simply over-season it?</p>

<p>Without cameras and thermometers at each station, the head chef is left guessing. That’s what running an LLM application without layer-by-layer observability feels like.</p>

<p>In our analogy, the <strong>trace</strong> is the complete life of that single order – from the moment the customer spoke to the moment the plate arrived. The <strong>spans</strong> are the individual station operations: host, prep, retrieval, assembly, cooking. Each span has a start time, an end time, and metadata about what happened (which ingredient was pulled, what temperature the stove was set to, how long the cook waited for a burner).</p>

<hr />

<h2 id="traces-and-spans-the-backbone-of-observability">Traces and Spans: The Backbone of Observability</h2>

<p>Let’s formalize the restaurant analogy into engineering terms.</p>

<p>A <strong>trace</strong> is a record of the complete journey of a single request through your system. When a user asks your RAG application “What is retrieval-augmented generation?”, a unique <strong>Trace ID</strong> is generated. Every operation that happens as part of fulfilling that request carries this same Trace ID, linking them together like beads on a string.</p>

<p>A <strong>span</strong> is a single named operation within a trace. Each span records:</p>

<ul>
  <li><strong>Name</strong> – what operation this is (“embed_query,” “vector_search,” “llm_generate”).</li>
  <li><strong>Start time</strong> and <strong>end time</strong> – how long this operation took.</li>
  <li><strong>Attributes</strong> – key-value metadata (model name, token count, relevance score).</li>
  <li><strong>Status</strong> – did this operation succeed or fail?</li>
  <li><strong>Parent span</strong> – which operation triggered this one?</li>
</ul>

<p>The parent-child relationship between spans creates a tree structure. The <strong>root span</strong> is the overall request. Its children are the major pipeline steps. Those children might have children of their own (for example, the retrieval span might contain child spans for “encode query” and “search index”).</p>
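<p>Before reaching for a tracing library, it can help to see the span fields as a plain data structure. The sketch below is a standard-library illustration, not OpenTelemetry’s API: the <code>Span</code> class, the <code>rag_request</code> root span name, and the timings are all hypothetical.</p>

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str
    name: str
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None          # None marks the root span
    status: str = "ok"
    attributes: dict = field(default_factory=dict)
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def children_of(spans, parent):
    """Spans whose parent_id points at `parent`."""
    return [s for s in spans if s.parent_id == parent.span_id]


def print_tree(spans, span, depth=0):
    """Print a span and its descendants, indented by depth."""
    print("  " * depth + f"{span.name} ({span.duration_ms:.0f} ms)")
    for child in children_of(spans, span):
        print_tree(spans, child, depth + 1)


trace_id = uuid.uuid4().hex  # one ID shared by every span in this request
root = Span(trace_id, "rag_request", 0, 520)
retrieve = Span(trace_id, "vector_search", 110, 210,
                parent_id=root.span_id, attributes={"top_k": 5})
encode = Span(trace_id, "encode_query", 110, 130, parent_id=retrieve.span_id)
search = Span(trace_id, "search_index", 130, 210, parent_id=retrieve.span_id)
spans = [root, retrieve, encode, search]

print_tree(spans, root)
```

<p>The tree is never stored explicitly. It falls out of each span pointing at its parent, which is why spans emitted by different services can still be stitched together afterwards.</p>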

<p>Here’s what a trace looks like laid out as a timeline. Notice how the trace encompasses all spans, and each span occupies a distinct time window:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Time (ms)   0       50      100     200     250     300          520
            |       |       |       |       |       |            |
Trace    [================================================================]
         trace_id: a7f2-bc91-4e03

Query    [------]
         0-40ms    "What is RAG?"

Embed            [--------]
                 45-105ms   model: text-embedding-3-small

Retrieve                   [-----------]
                           110-210ms   top_k: 5, results: 5

Context                                [-----]
                                       215-260ms   tokens: 3,847

Generate                                      [========================]
                                              265-520ms   model: gpt-4o
                                              input_tokens: 4,102
                                              output_tokens: 287
</code></pre></div></div>

<p>If your system processes 1,000 queries in an hour, you get 1,000 traces. Each trace contains five spans (in our RAG example), all linked by that trace’s unique Trace ID. This means you can aggregate across traces to compute averages (“What’s the median retrieval latency this week?”) or drill into a single trace to debug a specific bad response (“Why did trace <code class="language-plaintext highlighter-rouge">a7f2-bc91</code> return nonsense?”).</p>
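<p>Both views, the aggregate and the drill-down, come from the same span records. Here’s a minimal sketch using only the standard library. The records are made up, and in practice they would come from your tracing backend rather than a hard-coded list.</p>

```python
from collections import defaultdict
from statistics import median

# Made-up span records: (trace_id, span_name, duration_ms)
records = [
    ("t1", "retrieve", 100), ("t1", "generate", 255),
    ("t2", "retrieve", 140), ("t2", "generate", 310),
    ("t3", "retrieve", 95), ("t3", "generate", 1200),
]


def median_latency_by_span(records):
    """Aggregate view: median duration of each span name across all traces."""
    durations = defaultdict(list)
    for _trace_id, name, duration_ms in records:
        durations[name].append(duration_ms)
    return {name: median(values) for name, values in durations.items()}


def spans_for_trace(records, trace_id):
    """Drill-down view: every span recorded for one request."""
    return [record for record in records if record[0] == trace_id]


medians = median_latency_by_span(records)
slow_trace = spans_for_trace(records, "t3")

print(medians)
print(slow_trace)
```

<p>The median hides the slow generation step in trace <code>t3</code>; pulling up that single trace is what reveals it.</p>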

<p>Think of it this way: if traces are individual patient visits to a hospital, spans are the steps in each visit – check-in, triage, blood draw, doctor consultation, prescription. The hospital administrator can look at one visit in detail or analyze thousands of visits to find systemic bottlenecks.</p>

<hr />

<h2 id="the-five-layers-of-llm-observability">The Five Layers of LLM Observability</h2>

<p>Now that we understand traces and spans, let’s examine the five observability layers that a production RAG pipeline requires. Each layer corresponds to a span, and each captures distinct signals that the others cannot.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+===================================================================+
|  LAYER 5: GENERATION                                              |
|  The LLM produces a response                                      |
|  Monitor: input/output tokens, latency, cost, model, temperature  |
+===================================================================+
|  LAYER 4: CONTEXT ASSEMBLY                                        |
|  Retrieved documents + system prompt are merged                   |
|  Monitor: total token count, template version, truncation events  |
+===================================================================+
|  LAYER 3: RETRIEVAL                                               |
|  Vector database similarity search                                |
|  Monitor: top-k, relevance scores, result count, DB latency       |
+===================================================================+
|  LAYER 2: EMBEDDING                                               |
|  User query is converted into a vector                            |
|  Monitor: model name, dimensions, token count, API latency        |
+===================================================================+
|  LAYER 1: QUERY INTAKE                                            |
|  User submits their question                                      |
|  Monitor: raw input, timestamp, session ID, user metadata         |
+===================================================================+
</code></pre></div></div>

<h3 id="layer-1-query-intake">Layer 1: Query Intake</h3>

<p>Every journey begins with a question. The <strong>query span</strong> captures the raw user input, a timestamp, session identifiers, and any metadata about the user or conversation history. The intake work itself is usually fast (a few milliseconds), but the span is essential for two reasons. First, it anchors the trace – everything that follows is a child of this span. Second, it preserves the original question before any transformation happens. If the final answer is wrong, you’ll want to compare it against the exact input to understand whether the question was ambiguous, malformed, or perfectly clear.</p>

<p>Back to the restaurant: this is the host stand writing down the order. It’s quick, but if the host writes down “steak” instead of “salmon,” every subsequent station will execute flawlessly on the wrong dish.</p>

<h3 id="layer-2-embedding">Layer 2: Embedding</h3>

<p>The user’s text query is now converted into a numerical vector – a list of hundreds or thousands of numbers that represent the meaning of the query in a way that machines can compare. The <strong>embedding span</strong> tracks which model performed this conversion, how many tokens were processed, the dimensionality of the output vector, and how long the API call took.</p>

<p>This is the prep station transforming raw ingredients into something the kitchen can use. If the prep cook uses a dull knife (slow embedding API) or the wrong cutting technique (mismatched embedding model), everything downstream suffers. Monitoring this layer catches rate limits, model version changes, and latency spikes before they cascade.</p>

<h3 id="layer-3-retrieval">Layer 3: Retrieval</h3>

<p>The vector goes to your <strong>vector database</strong> (Pinecone, Weaviate, Chroma, pgvector, etc.) for a similarity search. The database returns the top-k most relevant document chunks. The <strong>retrieval span</strong> records the number of results, their relevance scores, the search latency, and the specific documents retrieved.</p>

<p>This is the cook visiting the walk-in cooler. If the cooler is poorly organized (bad chunking strategy), if the labels are wrong (stale embeddings), or if the cook only grabs one item when they need five (wrong top-k value), the dish will suffer. Our experience – and the broader industry’s – suggests that <strong>retrieval is where most RAG problems hide</strong>. Bad chunks, low relevance scores, and misconfigured similarity metrics are the silent killers of RAG quality. The retrieval span exposes all of it.</p>
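
<p>To make those retrieval signals concrete, here is a minimal sketch that turns a list of scored results into span attributes. The <code class="language-plaintext highlighter-rouge">summarize_retrieval</code> helper, the <code class="language-plaintext highlighter-rouge">mean_score</code>, <code class="language-plaintext highlighter-rouge">underfilled</code>, and <code class="language-plaintext highlighter-rouge">low_relevance</code> attribute names, and the 0.5 threshold are all assumptions for illustration; a sensible threshold depends on your embedding model and similarity metric:</p>

```python
from statistics import mean


def summarize_retrieval(results, top_k, low_score_threshold=0.5):
    """Derive retrieval-span attributes from a list of scored results."""
    scores = [r["score"] for r in results]
    summary = {
        "rag.retrieve.top_k": top_k,
        "rag.retrieve.result_count": len(scores),
        # Fewer results than requested can point at an index problem.
        "rag.retrieve.underfilled": top_k > len(scores),
    }
    if scores:
        summary["rag.retrieve.top_score"] = max(scores)
        summary["rag.retrieve.min_score"] = min(scores)
        summary["rag.retrieve.mean_score"] = mean(scores)
    # Flag the span when nothing came back or even the best hit is weak.
    summary["rag.retrieve.low_relevance"] = (
        not scores or low_score_threshold > max(scores)
    )
    return summary


good = summarize_retrieval(
    [{"score": 0.91}, {"score": 0.84}, {"score": 0.42}], top_k=3
)
weak = summarize_retrieval([{"score": 0.31}], top_k=5)
```

<p>Each key in the returned dictionary could be passed to <code class="language-plaintext highlighter-rouge">set_attribute</code> on the retrieval span, which makes weak or underfilled searches filterable in a dashboard.</p>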

<h3 id="layer-4-context-assembly">Layer 4: Context Assembly</h3>

<p>The retrieved document chunks are now assembled together with your system prompt and any conversation history into the final prompt that will be sent to the LLM. The <strong>context span</strong> records the total token count, which template was used, and whether any truncation occurred.</p>

<p>This is the assembly station where all components come together on one plate. If the plate is overcrowded (context exceeds the model’s window), ingredients get removed, and the dish loses coherence. If a key ingredient is missing (important document chunk was dropped), the final output suffers. This span is your last chance to inspect <em>exactly</em> what the LLM will see before it generates a response.</p>
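
<p>The overcrowded-plate case can be sketched as a budgeted assembly step that reports what it dropped. This is one possible policy under stated assumptions: the function names are hypothetical, chunks are assumed to arrive ordered by relevance, and the word-count tokenizer is a crude stand-in for the tokenizer that matches your model:</p>

```python
def approx_token_count(text):
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def assemble_with_budget(system_prompt, chunks, max_tokens,
                         count_tokens=approx_token_count):
    """Pack relevance-ordered chunks into a prompt within max_tokens."""
    used = count_tokens(system_prompt)
    kept = []
    dropped = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            # Skip this chunk but keep trying later, smaller ones.
            dropped += 1
            continue
        kept.append(chunk)
        used += cost
    prompt = "\n\n".join([system_prompt] + kept)
    return {
        "prompt": prompt,
        "token_count": used,
        "num_chunks": len(kept),
        "dropped_chunks": dropped,
        "truncated": dropped > 0,
    }


result = assemble_with_budget(
    "Answer using only the context.",
    ["alpha beta gamma", "delta epsilon zeta eta", "theta iota"],
    max_tokens=10,
)
```

<p>Recording <code class="language-plaintext highlighter-rouge">truncated</code> and <code class="language-plaintext highlighter-rouge">dropped_chunks</code> on the context span is what lets you later distinguish “the LLM ignored the document” from “the LLM never saw the document.”</p>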

<h3 id="layer-5-generation">Layer 5: Generation</h3>

<p>The LLM processes the assembled prompt and produces a response. The <strong>generation span</strong> is typically the longest and most expensive operation in the pipeline. It records the model used, input token count, output token count, latency, temperature setting, and the finish reason (did the model stop naturally, or was it cut off by a token limit?).</p>

<p>This is the stove – the most time-consuming and expensive station. Even with perfect ingredients, a cook can burn the dish. Monitoring this span is critical for cost management (tokens directly translate to dollars), performance optimization, and detecting when a model version change affects output quality.</p>

<hr />

<h2 id="why-each-layer-matters-debugging-cost-and-drift">Why Each Layer Matters: Debugging, Cost, and Drift</h2>

<p>Having five layers of observability serves three distinct purposes.</p>

<h3 id="debugging-finding-the-needle-in-the-haystack">Debugging: Finding the Needle in the Haystack</h3>

<p>Without span-level tracing, debugging an LLM application is like being told “the food was bad” with no further detail. You know the output was wrong, but you don’t know if the problem was bad retrieval, bad context, or the LLM hallucinating.</p>

<p>With layer-by-layer spans, you can follow a systematic diagnostic process:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Response quality is poor. Where is the problem?
|
+-- Check QUERY span
|   Is the input clean and well-formed?
|   +-- NO --&gt; Input validation / sanitization issue
|   +-- YES --&gt; Move to next layer
|
+-- Check EMBEDDING span
|   Did the embedding complete normally?
|   +-- HIGH LATENCY --&gt; API bottleneck or rate limiting
|   +-- ERROR --&gt; Authentication / quota issue
|   +-- OK --&gt; Move to next layer
|
+-- Check RETRIEVAL span
|   Are the retrieved documents relevant?
|   +-- LOW SCORES --&gt; Bad chunking strategy or stale index
|   +-- EMPTY RESULTS --&gt; Vector DB issue or index misconfiguration
|   +-- OK --&gt; Move to next layer
|
+-- Check CONTEXT span
|   Is the assembled prompt correct?
|   +-- TOO LONG --&gt; Context window exceeded, data truncated
|   +-- MISSING DATA --&gt; Template bug or assembly error
|   +-- OK --&gt; Move to next layer
|
+-- Check GENERATION span
    The LLM itself is the issue.
    +-- HALLUCINATION --&gt; Tighten prompt constraints or lower temperature
    +-- HIGH COST --&gt; Reduce max tokens or use a smaller model
    +-- SLOW --&gt; Consider a faster model or streaming
</code></pre></div></div>

<p>This decision tree is only possible when each layer emits its own span with meaningful attributes. Without it, you’re left with trial and error.</p>
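
<p>A simplified version of that walk can be written as a function over one trace’s span attributes. The dictionary layout, the <code class="language-plaintext highlighter-rouge">diagnose</code> name, and every threshold below are assumptions chosen for illustration, and the sketch covers only some branches of the tree:</p>

```python
def diagnose(trace, min_top_score=0.5, max_embed_ms=500,
             max_context_tokens=12000):
    """Check spans in pipeline order; return (layer, finding) for the first issue."""
    if not trace["query"]["text"].strip():
        return ("query", "empty or malformed input")

    embed = trace["embed"]
    if embed.get("error"):
        return ("embed", "embedding call failed; check authentication or quota")
    if embed["duration_ms"] > max_embed_ms:
        return ("embed", "high latency; possible bottleneck or rate limiting")

    retrieve = trace["retrieve"]
    if retrieve["result_count"] == 0:
        return ("retrieve", "empty results; check the vector DB and index")
    if min_top_score > retrieve["top_score"]:
        return ("retrieve", "low scores; review chunking or refresh the index")

    context = trace["context"]
    if context["truncated"] or context["total_tokens"] > max_context_tokens:
        return ("context", "prompt too long; data may have been truncated")

    # Nothing upstream looks wrong, so attention shifts to the model output.
    return ("generate", "upstream spans look healthy; inspect the LLM output")


trace = {
    "query": {"text": "What is RAG?"},
    "embed": {"duration_ms": 60, "error": None},
    "retrieve": {"result_count": 5, "top_score": 0.31},
    "context": {"total_tokens": 3847, "truncated": False},
}
layer, finding = diagnose(trace)
```

<p>For this sample trace the function stops at the retrieval layer, because the best relevance score falls below the assumed threshold.</p>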

<h3 id="cost-tracking-following-the-money">Cost Tracking: Following the Money</h3>

<p>LLM tokens cost money. Embedding API calls cost money. Vector database queries cost money. Span-level tracking lets you attribute costs to specific pipeline components.</p>

<p>You might discover that 70% of your spend is on generation (expected), but 20% is on embedding because you’re re-embedding queries that were already embedded in a previous conversation turn. Or you might find that your retrieval step is pulling 20 chunks when 5 would suffice, inflating your context tokens and therefore your generation cost.</p>

<p>Without layer-level cost attribution, you only see the total bill. With it, you see exactly where optimization will have the biggest impact.</p>
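
<p>Here is a small sketch of layer-level attribution computed from token counts recorded on spans. The generation rates are the ones quoted in the pipeline code comment; the embedding rate and the record field names are assumed placeholders, and vector database charges are left out because they depend entirely on the provider’s pricing model:</p>

```python
# USD per one million tokens. The embedding rate is an assumed placeholder.
PRICE_PER_MILLION = {
    "embed_input": 0.02,
    "generate_input": 2.50,
    "generate_output": 10.00,
}


def cost_by_layer(traces):
    """Sum estimated spend per pipeline layer across many traces."""
    totals = {"embed": 0.0, "generate": 0.0}
    for t in traces:
        totals["embed"] += (
            t["embed_tokens"] * PRICE_PER_MILLION["embed_input"] / 1_000_000
        )
        totals["generate"] += (
            t["input_tokens"] * PRICE_PER_MILLION["generate_input"]
            + t["output_tokens"] * PRICE_PER_MILLION["generate_output"]
        ) / 1_000_000
    return totals


def cost_shares(totals):
    """Express each layer's spend as a fraction of the overall total."""
    overall = sum(totals.values())
    if overall == 0:
        return {layer: 0.0 for layer in totals}
    return {layer: amount / overall for layer, amount in totals.items()}


# One thousand identical traces using token counts from the earlier examples.
traces = [{"embed_tokens": 12, "input_tokens": 4102, "output_tokens": 287}] * 1000
totals = cost_by_layer(traces)
shares = cost_shares(totals)
```

<p>With real span data in place of the repeated sample record, the <code class="language-plaintext highlighter-rouge">shares</code> dictionary shows at a glance which layer deserves optimization effort first.</p>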

<h3 id="drift-detection-catching-silent-degradation">Drift Detection: Catching Silent Degradation</h3>

<p>AI systems degrade over time. What worked last month might not work today. Document indexes go stale. Embedding model providers push silent updates. LLM behavior shifts across versions. User query patterns change seasonally.</p>

<p>Span-level metrics let you catch drift early. If your retrieval relevance scores drop by 15% over two weeks, you have an early signal that your index may need refreshing – even if end-to-end output quality hasn’t visibly degraded yet. If your embedding latency suddenly doubles, you can investigate what changed at the provider before your users start complaining about slow responses.</p>

<p>Think of it as the difference between annual physicals and continuous vital sign monitoring. The annual physical (end-to-end testing) catches problems after they’ve developed. Continuous monitoring (span-level metrics) catches the early warning signs.</p>
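
<p>A minimal drift check compares a recent window of a span metric against a baseline window. The function names and sample scores below are hypothetical, and the 15% default simply mirrors the example figure above; this version only looks for drops, so a latency check would test for an increase instead:</p>

```python
from statistics import mean


def relative_change(baseline, recent):
    """Fractional change of the recent mean relative to the baseline mean."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; relative change is undefined")
    return (mean(recent) - base) / base


def detect_drift(baseline, recent, max_drop=0.15):
    """Flag a metric whose mean fell by more than max_drop."""
    change = relative_change(baseline, recent)
    return {"change": change, "drifted": change < -max_drop}


# Hypothetical top relevance scores sampled from retrieval spans.
baseline_scores = [0.82, 0.80, 0.84, 0.78]
recent_scores = [0.66, 0.64, 0.68, 0.62]
steady_scores = [0.80, 0.79, 0.82, 0.81]

report = detect_drift(baseline_scores, recent_scores)
steady = detect_drift(baseline_scores, steady_scores)
```

<p>Comparing window means is deliberately simple. A production check would likely also account for sample size and normal variance before alerting.</p>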

<hr />

<h2 id="implementation-with-opentelemetry">Implementation with OpenTelemetry</h2>

<p><strong>OpenTelemetry</strong> (OTel) is the open, vendor-neutral standard for observability instrumentation. It provides APIs and SDKs for generating traces, metrics, and logs that can be exported to any compatible backend. Using OTel means your instrumentation code isn’t locked to a specific vendor – you can switch from one observability platform to another by changing configuration, not code.</p>

<p>Here’s how to instrument a RAG pipeline with all five observability layers using the OpenTelemetry Python SDK:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>opentelemetry-api opentelemetry-sdk <span class="se">\</span>
            opentelemetry-exporter-otlp-proto-grpc
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""
RAG Pipeline with Full OpenTelemetry Instrumentation
Demonstrates all five layers of LLM observability.
"""</span>

<span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace</span> <span class="kn">import</span> <span class="n">TracerProvider</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace.export</span> <span class="kn">import</span> <span class="n">BatchSpanProcessor</span>
<span class="kn">from</span> <span class="nn">opentelemetry.exporter.otlp.proto.grpc.trace_exporter</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">OTLPSpanExporter</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.resources</span> <span class="kn">import</span> <span class="n">Resource</span>
<span class="kn">from</span> <span class="nn">opentelemetry.trace</span> <span class="kn">import</span> <span class="n">StatusCode</span>
<span class="kn">import</span> <span class="nn">time</span>

<span class="c1"># ── Setup ──────────────────────────────────────────────────
# Create a resource that identifies this service.
</span><span class="n">resource</span> <span class="o">=</span> <span class="n">Resource</span><span class="p">.</span><span class="n">create</span><span class="p">({</span>
    <span class="s">"service.name"</span><span class="p">:</span> <span class="s">"rag-pipeline"</span><span class="p">,</span>
    <span class="s">"service.version"</span><span class="p">:</span> <span class="s">"1.0.0"</span><span class="p">,</span>
    <span class="s">"deployment.environment"</span><span class="p">:</span> <span class="s">"production"</span><span class="p">,</span>
<span class="p">})</span>

<span class="c1"># Configure the tracer provider with an OTLP exporter.
# The endpoint can point to any OTel-compatible collector.
</span><span class="n">provider</span> <span class="o">=</span> <span class="n">TracerProvider</span><span class="p">(</span><span class="n">resource</span><span class="o">=</span><span class="n">resource</span><span class="p">)</span>
<span class="n">exporter</span> <span class="o">=</span> <span class="n">OTLPSpanExporter</span><span class="p">(</span><span class="n">endpoint</span><span class="o">=</span><span class="s">"http://localhost:4317"</span><span class="p">,</span> <span class="n">insecure</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="n">provider</span><span class="p">.</span><span class="n">add_span_processor</span><span class="p">(</span><span class="n">BatchSpanProcessor</span><span class="p">(</span><span class="n">exporter</span><span class="p">))</span>
<span class="n">trace</span><span class="p">.</span><span class="n">set_tracer_provider</span><span class="p">(</span><span class="n">provider</span><span class="p">)</span>

<span class="n">tracer</span> <span class="o">=</span> <span class="n">trace</span><span class="p">.</span><span class="n">get_tracer</span><span class="p">(</span><span class="s">"rag.pipeline"</span><span class="p">,</span> <span class="s">"1.0.0"</span><span class="p">)</span>


<span class="c1"># ── Layer 1: Query Intake ──────────────────────────────────
</span><span class="k">def</span> <span class="nf">process_query</span><span class="p">(</span><span class="n">user_input</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
    <span class="s">"""Full RAG pipeline with five instrumented layers."""</span>

    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.query"</span><span class="p">)</span> <span class="k">as</span> <span class="n">query_span</span><span class="p">:</span>
        <span class="n">query_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.query.text"</span><span class="p">,</span> <span class="n">user_input</span><span class="p">)</span>
        <span class="n">query_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.query.session_id"</span><span class="p">,</span> <span class="n">session_id</span><span class="p">)</span>
        <span class="n">query_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.query.timestamp"</span><span class="p">,</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">())</span>
        <span class="n">query_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.query.char_count"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">user_input</span><span class="p">))</span>

        <span class="c1"># ── Layer 2: Embedding ─────────────────────────────
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.embed"</span><span class="p">)</span> <span class="k">as</span> <span class="n">embed_span</span><span class="p">:</span>
            <span class="n">embed_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.system"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">)</span>
            <span class="n">embed_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"gen_ai.request.model"</span><span class="p">,</span> <span class="s">"text-embedding-3-small"</span>
            <span class="p">)</span>

            <span class="n">query_vector</span> <span class="o">=</span> <span class="n">embed_query</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>

            <span class="n">embed_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"rag.embed.dimensions"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">query_vector</span><span class="p">)</span>
            <span class="p">)</span>
            <span class="c1"># Placeholder: read the real count from the embedding API response.
</span>            <span class="n">embed_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.embed.token_count"</span><span class="p">,</span> <span class="mi">12</span><span class="p">)</span>

        <span class="c1"># ── Layer 3: Retrieval ─────────────────────────────
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.retrieve"</span><span class="p">)</span> <span class="k">as</span> <span class="n">retrieval_span</span><span class="p">:</span>
            <span class="n">retrieval_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.top_k"</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
            <span class="n">retrieval_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"rag.retrieve.vector_db"</span><span class="p">,</span> <span class="s">"pinecone"</span>
            <span class="p">)</span>

            <span class="n">results</span> <span class="o">=</span> <span class="n">search_vector_db</span><span class="p">(</span><span class="n">query_vector</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>

            <span class="n">retrieval_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"rag.retrieve.result_count"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
            <span class="p">)</span>
            <span class="k">if</span> <span class="n">results</span><span class="p">:</span>
                <span class="n">scores</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="s">"score"</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">]</span>
                <span class="n">retrieval_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                    <span class="s">"rag.retrieve.top_score"</span><span class="p">,</span> <span class="nb">max</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
                <span class="p">)</span>
                <span class="n">retrieval_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                    <span class="s">"rag.retrieve.min_score"</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
                <span class="p">)</span>

        <span class="c1"># ── Layer 4: Context Assembly ──────────────────────
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.context"</span><span class="p">)</span> <span class="k">as</span> <span class="n">context_span</span><span class="p">:</span>
            <span class="n">context</span> <span class="o">=</span> <span class="n">assemble_context</span><span class="p">(</span><span class="n">user_input</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span>

            <span class="n">context_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"rag.context.total_tokens"</span><span class="p">,</span> <span class="n">context</span><span class="p">[</span><span class="s">"token_count"</span><span class="p">]</span>
            <span class="p">)</span>
            <span class="n">context_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"rag.context.num_chunks"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
            <span class="p">)</span>
            <span class="n">context_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"rag.context.template_version"</span><span class="p">,</span> <span class="s">"v2.1"</span>
            <span class="p">)</span>
            <span class="c1"># Illustrative budget check: 12,000 tokens is an assumed threshold,
</span>            <span class="c1"># not the model's actual context window.
</span>            <span class="k">if</span> <span class="n">context</span><span class="p">[</span><span class="s">"token_count"</span><span class="p">]</span> <span class="o">&gt;</span> <span class="mi">12000</span><span class="p">:</span>
                <span class="n">context_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                    <span class="s">"rag.context.near_limit"</span><span class="p">,</span> <span class="bp">True</span>
                <span class="p">)</span>
                <span class="n">context_span</span><span class="p">.</span><span class="n">add_event</span><span class="p">(</span>
                    <span class="s">"context_warning"</span><span class="p">,</span>
                    <span class="p">{</span><span class="s">"message"</span><span class="p">:</span> <span class="s">"Context approaching token limit"</span><span class="p">},</span>
                <span class="p">)</span>

        <span class="c1"># ── Layer 5: Generation ────────────────────────────
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.generate"</span><span class="p">)</span> <span class="k">as</span> <span class="n">gen_span</span><span class="p">:</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.system"</span><span class="p">,</span> <span class="s">"openai"</span><span class="p">)</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.model"</span><span class="p">,</span> <span class="s">"gpt-4o"</span><span class="p">)</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.temperature"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">)</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.max_tokens"</span><span class="p">,</span> <span class="mi">1024</span><span class="p">)</span>

            <span class="n">response</span> <span class="o">=</span> <span class="n">call_llm</span><span class="p">(</span><span class="n">context</span><span class="p">[</span><span class="s">"prompt"</span><span class="p">])</span>

            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"gen_ai.usage.input_tokens"</span><span class="p">,</span>
                <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"prompt_tokens"</span><span class="p">],</span>
            <span class="p">)</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"gen_ai.usage.output_tokens"</span><span class="p">,</span>
                <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">],</span>
            <span class="p">)</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
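                <span class="c1"># The GenAI conventions define this attribute as a list of strings.
</span>                <span class="s">"gen_ai.response.finish_reasons"</span><span class="p">,</span>
                <span class="p">[</span><span class="n">response</span><span class="p">[</span><span class="s">"finish_reason"</span><span class="p">]],</span>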
            <span class="p">)</span>
            <span class="c1"># Cost estimate: $2.50/1M input, $10.00/1M output
</span>            <span class="c1"># for gpt-4o.
</span>            <span class="n">cost</span> <span class="o">=</span> <span class="p">(</span>
                <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"prompt_tokens"</span><span class="p">]</span> <span class="o">*</span> <span class="mf">2.50</span> <span class="o">/</span> <span class="mi">1_000_000</span>
                <span class="o">+</span> <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">]</span>
                <span class="o">*</span> <span class="mf">10.00</span>
                <span class="o">/</span> <span class="mi">1_000_000</span>
            <span class="p">)</span>
            <span class="n">gen_span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.generate.cost_usd"</span><span class="p">,</span> <span class="n">cost</span><span class="p">)</span>

        <span class="n">query_span</span><span class="p">.</span><span class="n">set_status</span><span class="p">(</span><span class="n">StatusCode</span><span class="p">.</span><span class="n">OK</span><span class="p">)</span>
        <span class="c1"># Render the integer trace ID as the usual 32-character hex string.
</span>        <span class="k">return</span> <span class="p">{</span><span class="s">"answer"</span><span class="p">:</span> <span class="n">response</span><span class="p">[</span><span class="s">"text"</span><span class="p">],</span> <span class="s">"trace_id"</span><span class="p">:</span> <span class="nb">format</span><span class="p">(</span>
            <span class="n">query_span</span><span class="p">.</span><span class="n">get_span_context</span><span class="p">().</span><span class="n">trace_id</span><span class="p">,</span> <span class="s">"032x"</span>
        <span class="p">)}</span>


<span class="c1"># ── Placeholder functions (replace with real implementations) ──
</span><span class="k">def</span> <span class="nf">embed_query</span><span class="p">(</span><span class="n">text</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">]</span> <span class="o">*</span> <span class="mi">1536</span>  <span class="c1"># Simulated 1536-dim vector
</span>
<span class="k">def</span> <span class="nf">search_vector_db</span><span class="p">(</span><span class="n">vector</span><span class="p">,</span> <span class="n">top_k</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">[</span>
        <span class="p">{</span><span class="s">"id"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"doc_</span><span class="si">{</span><span class="n">i</span><span class="si">}</span><span class="s">"</span><span class="p">,</span> <span class="s">"score"</span><span class="p">:</span> <span class="mf">0.95</span> <span class="o">-</span> <span class="n">i</span> <span class="o">*</span> <span class="mf">0.05</span><span class="p">,</span> <span class="s">"text"</span><span class="p">:</span> <span class="s">"..."</span><span class="p">}</span>
        <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">top_k</span><span class="p">)</span>
    <span class="p">]</span>

<span class="k">def</span> <span class="nf">assemble_context</span><span class="p">(</span><span class="n">query</span><span class="p">,</span> <span class="n">results</span><span class="p">):</span>
    <span class="n">chunks</span> <span class="o">=</span> <span class="s">" "</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">r</span><span class="p">[</span><span class="s">"text"</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">)</span>
    <span class="n">prompt</span> <span class="o">=</span> <span class="sa">f</span><span class="s">"Context: </span><span class="si">{</span><span class="n">chunks</span><span class="si">}</span><span class="se">\n\n</span><span class="s">Question: </span><span class="si">{</span><span class="n">query</span><span class="si">}</span><span class="se">\n</span><span class="s">Answer:"</span>
    <span class="k">return</span> <span class="p">{</span><span class="s">"prompt"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">,</span> <span class="s">"token_count"</span><span class="p">:</span> <span class="mi">4102</span><span class="p">}</span>

<span class="k">def</span> <span class="nf">call_llm</span><span class="p">(</span><span class="n">prompt</span><span class="p">):</span>
    <span class="k">return</span> <span class="p">{</span>
        <span class="s">"text"</span><span class="p">:</span> <span class="s">"RAG is a technique that..."</span><span class="p">,</span>
        <span class="s">"usage"</span><span class="p">:</span> <span class="p">{</span><span class="s">"prompt_tokens"</span><span class="p">:</span> <span class="mi">4102</span><span class="p">,</span> <span class="s">"completion_tokens"</span><span class="p">:</span> <span class="mi">287</span><span class="p">},</span>
        <span class="s">"finish_reason"</span><span class="p">:</span> <span class="s">"stop"</span><span class="p">,</span>
    <span class="p">}</span>
</code></pre></div></div>

<p>The key insight in this code is the <strong>nesting</strong>. The <code class="language-plaintext highlighter-rouge">rag.query</code> span is the parent (root), and all other spans are its children. Because the root span stays open until the last child finishes, its recorded duration covers the whole request, not just the intake step. OpenTelemetry automatically propagates the Trace ID through the <code class="language-plaintext highlighter-rouge">start_as_current_span</code> context manager, so every span in a request shares the same trace. When you view this trace in a dashboard, you’ll see the full tree structure and can drill into any individual layer.</p>

<p>The <code class="language-plaintext highlighter-rouge">gen_ai.*</code> attributes follow the <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">OpenTelemetry Semantic Conventions for GenAI</a>, ensuring that observability backends can render LLM-specific dashboards without custom configuration.</p>

<hr />

<h2 id="integration-with-appdynamics-apm">Integration with AppDynamics APM</h2>

<p><strong>AppDynamics</strong> (part of Cisco’s observability portfolio) provides enterprise application performance monitoring with automatic business transaction detection, anomaly detection, and root cause analysis. Modern AppDynamics deployments support OpenTelemetry ingestion, meaning you can export OTel-instrumented traces to AppDynamics over OTLP.</p>

<p>The approach: use the same OpenTelemetry SDK from the previous section, but configure the OTLP exporter to target the AppDynamics OTLP endpoint. AppDynamics maps OTel traces to its concept of <strong>Business Transactions</strong> (BTs), giving you both the vendor-neutral instrumentation and the enterprise analytics.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>opentelemetry-api opentelemetry-sdk <span class="se">\</span>
            opentelemetry-exporter-otlp-proto-grpc
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""
RAG Pipeline exporting traces to AppDynamics via OTLP.
AppDynamics maps OpenTelemetry traces to Business Transactions.
"""</span>

<span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace</span> <span class="kn">import</span> <span class="n">TracerProvider</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace.export</span> <span class="kn">import</span> <span class="n">BatchSpanProcessor</span>
<span class="kn">from</span> <span class="nn">opentelemetry.exporter.otlp.proto.grpc.trace_exporter</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">OTLPSpanExporter</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.resources</span> <span class="kn">import</span> <span class="n">Resource</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="c1"># ── AppDynamics-specific configuration ─────────────────────
# These values come from your AppDynamics controller settings.
</span><span class="n">APPD_OTLP_ENDPOINT</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span>
    <span class="s">"APPDYNAMICS_OTLP_ENDPOINT"</span><span class="p">,</span>
    <span class="s">"https://&lt;your-controller&gt;.saas.appdynamics.com:443"</span><span class="p">,</span>
<span class="p">)</span>
<span class="n">APPD_API_KEY</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"APPDYNAMICS_API_KEY"</span><span class="p">,</span> <span class="s">"&lt;your-api-key&gt;"</span><span class="p">)</span>

<span class="n">resource</span> <span class="o">=</span> <span class="n">Resource</span><span class="p">.</span><span class="n">create</span><span class="p">({</span>
    <span class="s">"service.name"</span><span class="p">:</span> <span class="s">"rag-pipeline"</span><span class="p">,</span>
    <span class="s">"service.namespace"</span><span class="p">:</span> <span class="s">"ai-applications"</span><span class="p">,</span>
    <span class="s">"service.version"</span><span class="p">:</span> <span class="s">"1.0.0"</span><span class="p">,</span>
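    <span class="c1"># AppDynamics uses resource attributes to organize services
</span>    <span class="c1"># into tiers and applications. Confirm the exact attribute
</span>    <span class="c1"># names your AppDynamics account expects before deploying.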
</span>    <span class="s">"appdynamics.controller.account"</span><span class="p">:</span> <span class="s">"your-account"</span><span class="p">,</span>
    <span class="s">"appdynamics.controller.application"</span><span class="p">:</span> <span class="s">"LLM-RAG-Service"</span><span class="p">,</span>
<span class="p">})</span>

<span class="c1"># ── Exporter targeting AppDynamics OTLP ingestion ──────────
# The API key is passed as a header for authentication.
</span><span class="n">exporter</span> <span class="o">=</span> <span class="n">OTLPSpanExporter</span><span class="p">(</span>
    <span class="n">endpoint</span><span class="o">=</span><span class="n">APPD_OTLP_ENDPOINT</span><span class="p">,</span>
    <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s">"x-api-key"</span><span class="p">:</span> <span class="n">APPD_API_KEY</span><span class="p">},</span>
<span class="p">)</span>

<span class="n">provider</span> <span class="o">=</span> <span class="n">TracerProvider</span><span class="p">(</span><span class="n">resource</span><span class="o">=</span><span class="n">resource</span><span class="p">)</span>
<span class="n">provider</span><span class="p">.</span><span class="n">add_span_processor</span><span class="p">(</span><span class="n">BatchSpanProcessor</span><span class="p">(</span><span class="n">exporter</span><span class="p">))</span>
<span class="n">trace</span><span class="p">.</span><span class="n">set_tracer_provider</span><span class="p">(</span><span class="n">provider</span><span class="p">)</span>

<span class="n">tracer</span> <span class="o">=</span> <span class="n">trace</span><span class="p">.</span><span class="n">get_tracer</span><span class="p">(</span><span class="s">"rag.pipeline.appdynamics"</span><span class="p">,</span> <span class="s">"1.0.0"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">handle_rag_request</span><span class="p">(</span><span class="n">user_input</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="s">"""
    Each call creates a Business Transaction in AppDynamics.
    The root span name ('rag.query') becomes the BT name.
    Child spans appear as "Exit Calls" or "Service Endpoints"
    in the AppDynamics waterfall view.
    """</span>
    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.query"</span><span class="p">)</span> <span class="k">as</span> <span class="n">root</span><span class="p">:</span>
        <span class="n">root</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"query.text"</span><span class="p">,</span> <span class="n">user_input</span><span class="p">)</span>
        <span class="n">root</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"session.id"</span><span class="p">,</span> <span class="n">session_id</span><span class="p">)</span>

        <span class="c1"># Layer 2 -- AppDynamics shows this as a downstream
</span>        <span class="c1"># call with its own timing and error rate.
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.embed"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.model"</span><span class="p">,</span>
                               <span class="s">"text-embedding-3-small"</span><span class="p">)</span>
            <span class="n">vector</span> <span class="o">=</span> <span class="n">embed_query</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>

        <span class="c1"># Layer 3 -- The retrieval span surfaces vector DB
</span>        <span class="c1"># latency in AppDynamics' "Slowest DB Calls" view.
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.retrieve"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"db.system"</span><span class="p">,</span> <span class="s">"pinecone"</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.top_k"</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
            <span class="n">results</span> <span class="o">=</span> <span class="n">search_vector_db</span><span class="p">(</span><span class="n">vector</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.result_count"</span><span class="p">,</span>
                               <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">))</span>

        <span class="c1"># Layer 4
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.context"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">context</span> <span class="o">=</span> <span class="n">assemble_context</span><span class="p">(</span><span class="n">user_input</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.context.total_tokens"</span><span class="p">,</span>
                               <span class="n">context</span><span class="p">[</span><span class="s">"token_count"</span><span class="p">])</span>

        <span class="c1"># Layer 5 -- Generation latency and token cost are
</span>        <span class="c1"># visible per-BT in AppDynamics dashboards.
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.generate"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.model"</span><span class="p">,</span> <span class="s">"gpt-4o"</span><span class="p">)</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">call_llm</span><span class="p">(</span><span class="n">context</span><span class="p">[</span><span class="s">"prompt"</span><span class="p">])</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.usage.input_tokens"</span><span class="p">,</span>
                               <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"prompt_tokens"</span><span class="p">])</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.usage.output_tokens"</span><span class="p">,</span>
                               <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">])</span>

        <span class="k">return</span> <span class="n">response</span><span class="p">[</span><span class="s">"text"</span><span class="p">]</span>
</code></pre></div></div>

<p>What makes this valuable from an enterprise perspective is that AppDynamics automatically detects anomalies across your Business Transactions. If your <code class="language-plaintext highlighter-rouge">rag.retrieve</code> span starts taking 3x longer than its baseline on Tuesday afternoons, AppDynamics flags it and correlates it with infrastructure changes, deployment events, or upstream service degradation. You get the five layers of LLM observability wrapped in enterprise-grade anomaly detection and alerting.</p>
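
<p>As a rough illustration of what a baseline comparison means, the sketch below flags a span duration that exceeds three times the median of recent durations. This is a deliberately simple stand-in and not how AppDynamics computes its baselines; the function name <code class="language-plaintext highlighter-rouge">exceeds_baseline</code>, the sample durations, and the use of a median are assumptions made for the example.</p>

```python
from statistics import median


def exceeds_baseline(history_ms, current_ms, factor=3.0):
    """Return True when current_ms is more than factor times the median of history_ms."""
    if not history_ms:
        return False  # no baseline yet, so there is nothing to compare against
    return current_ms > factor * median(history_ms)


# Hypothetical recent durations of the rag.retrieve span, in milliseconds.
recent_retrieve_ms = [42.0, 38.5, 45.1, 40.2, 39.9]

print(exceeds_baseline(recent_retrieve_ms, 47.0))   # ordinary jitter
print(exceeds_baseline(recent_retrieve_ms, 130.0))  # more than 3x the median
```

<p>A production system would also account for time-of-day and day-of-week patterns, which is why a platform-managed baseline is usually preferable to a hand-rolled threshold like this one.</p>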

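<p>When you drill into a transaction in AppDynamics, your RAG pipeline appears as one parent span with four sequential children: <code class="language-plaintext highlighter-rouge">rag.query</code> contains <code class="language-plaintext highlighter-rouge">rag.embed</code>, <code class="language-plaintext highlighter-rouge">rag.retrieve</code>, <code class="language-plaintext highlighter-rouge">rag.context</code>, and <code class="language-plaintext highlighter-rouge">rag.generate</code>. The children are siblings rather than a call chain, because each one starts only after the previous one has finished. Each span shows its own latency and error status. This view is essentially the trace timeline we discussed earlier, but rendered automatically by the platform.</p>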

<hr />

<h2 id="integration-with-splunk-observability-cloud">Integration with Splunk Observability Cloud</h2>

<p><strong>Splunk Observability Cloud</strong> provides real-time monitoring and troubleshooting built natively on OpenTelemetry. Splunk distributes its own packaging of the OTel SDK (<code class="language-plaintext highlighter-rouge">splunk-opentelemetry</code>) that adds automatic instrumentation for common frameworks and pre-configured export to Splunk’s backend.</p>

<p>The Splunk approach has a distinct advantage: because Splunk also provides log analytics (via Splunk Enterprise or Splunk Cloud Platform), you can correlate your LLM observability traces with application logs and infrastructure metrics in a single pane of glass. When your generation span shows high latency, you can pivot to the GPU utilization metrics of the machine running your model for the same time window, or to the error logs from your vector database that carry the same Trace ID.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>pip <span class="nb">install </span>splunk-opentelemetry opentelemetry-api opentelemetry-sdk <span class="se">\</span>
            opentelemetry-exporter-otlp-proto-http
</code></pre></div></div>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""
RAG Pipeline exporting traces to Splunk Observability Cloud.
Configures the standard OpenTelemetry SDK by hand; Splunk's
distribution (splunk-opentelemetry) can automate this setup.
"""</span>

<span class="kn">from</span> <span class="nn">opentelemetry</span> <span class="kn">import</span> <span class="n">trace</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace</span> <span class="kn">import</span> <span class="n">TracerProvider</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.trace.export</span> <span class="kn">import</span> <span class="n">BatchSpanProcessor</span>
<span class="kn">from</span> <span class="nn">opentelemetry.exporter.otlp.proto.http.trace_exporter</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">OTLPSpanExporter</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">opentelemetry.sdk.resources</span> <span class="kn">import</span> <span class="n">Resource</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="c1"># ── Splunk-specific configuration ──────────────────────────
# Obtain from: Splunk Observability &gt; Settings &gt; Access Tokens
# Indexing os.environ fails fast if the token is not set.
</span><span class="n">SPLUNK_ACCESS_TOKEN</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"SPLUNK_ACCESS_TOKEN"</span><span class="p">]</span>
<span class="n">SPLUNK_REALM</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"SPLUNK_REALM"</span><span class="p">,</span> <span class="s">"us0"</span><span class="p">)</span>

<span class="c1"># Splunk's OTLP ingest endpoint follows a predictable pattern.
</span><span class="n">SPLUNK_OTLP_ENDPOINT</span> <span class="o">=</span> <span class="p">(</span>
    <span class="sa">f</span><span class="s">"https://ingest.</span><span class="si">{</span><span class="n">SPLUNK_REALM</span><span class="si">}</span><span class="s">.signalfx.com/v2/trace/otlp"</span>
<span class="p">)</span>

<span class="n">resource</span> <span class="o">=</span> <span class="n">Resource</span><span class="p">.</span><span class="n">create</span><span class="p">({</span>
    <span class="s">"service.name"</span><span class="p">:</span> <span class="s">"rag-pipeline"</span><span class="p">,</span>
    <span class="c1"># Splunk APM uses deployment.environment to separate
</span>    <span class="c1"># environments when filtering services and traces.
</span>    <span class="s">"deployment.environment"</span><span class="p">:</span> <span class="s">"production"</span><span class="p">,</span>
    <span class="s">"service.version"</span><span class="p">:</span> <span class="s">"1.0.0"</span><span class="p">,</span>
<span class="p">})</span>

<span class="c1"># ── Exporter targeting Splunk's OTLP HTTP endpoint ─────────
</span><span class="n">exporter</span> <span class="o">=</span> <span class="n">OTLPSpanExporter</span><span class="p">(</span>
    <span class="n">endpoint</span><span class="o">=</span><span class="n">SPLUNK_OTLP_ENDPOINT</span><span class="p">,</span>
    <span class="n">headers</span><span class="o">=</span><span class="p">{</span><span class="s">"X-SF-TOKEN"</span><span class="p">:</span> <span class="n">SPLUNK_ACCESS_TOKEN</span><span class="p">},</span>
<span class="p">)</span>

<span class="n">provider</span> <span class="o">=</span> <span class="n">TracerProvider</span><span class="p">(</span><span class="n">resource</span><span class="o">=</span><span class="n">resource</span><span class="p">)</span>
<span class="n">provider</span><span class="p">.</span><span class="n">add_span_processor</span><span class="p">(</span><span class="n">BatchSpanProcessor</span><span class="p">(</span><span class="n">exporter</span><span class="p">))</span>
<span class="n">trace</span><span class="p">.</span><span class="n">set_tracer_provider</span><span class="p">(</span><span class="n">provider</span><span class="p">)</span>

<span class="n">tracer</span> <span class="o">=</span> <span class="n">trace</span><span class="p">.</span><span class="n">get_tracer</span><span class="p">(</span><span class="s">"rag.pipeline.splunk"</span><span class="p">,</span> <span class="s">"1.0.0"</span><span class="p">)</span>


<span class="c1"># ── Instrumented RAG Pipeline ──────────────────────────────
</span><span class="k">def</span> <span class="nf">process_rag_query</span><span class="p">(</span><span class="n">user_input</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">session_id</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="s">"""
    Traces appear in Splunk APM under the 'rag-pipeline'
    service. Each span is visible in the trace waterfall.
    Span attributes arrive as span tags, which you can index
    for filtering and alerting in Splunk dashboards.
    """</span>
    <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.query"</span><span class="p">)</span> <span class="k">as</span> <span class="n">root</span><span class="p">:</span>
        <span class="n">root</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.query.text"</span><span class="p">,</span> <span class="n">user_input</span><span class="p">)</span>
        <span class="n">root</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.query.session_id"</span><span class="p">,</span> <span class="n">session_id</span><span class="p">)</span>

        <span class="c1"># Layer 2: Embedding
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.embed"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"gen_ai.request.model"</span><span class="p">,</span> <span class="s">"text-embedding-3-small"</span>
            <span class="p">)</span>
            <span class="n">vector</span> <span class="o">=</span> <span class="n">embed_query</span><span class="p">(</span><span class="n">user_input</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.embed.dimensions"</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">vector</span><span class="p">))</span>

        <span class="c1"># Layer 3: Retrieval
</span>        <span class="c1"># In Splunk, you can create detectors (alerts) on
</span>        <span class="c1"># span attributes. Example: alert when
</span>        <span class="c1"># rag.retrieve.top_score drops below 0.7.
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.retrieve"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"db.system"</span><span class="p">,</span> <span class="s">"chromadb"</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.top_k"</span><span class="p">,</span> <span class="mi">5</span><span class="p">)</span>
            <span class="n">results</span> <span class="o">=</span> <span class="n">search_vector_db</span><span class="p">(</span><span class="n">vector</span><span class="p">,</span> <span class="n">top_k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
            <span class="n">scores</span> <span class="o">=</span> <span class="p">[</span><span class="n">r</span><span class="p">[</span><span class="s">"score"</span><span class="p">]</span> <span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">]</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.result_count"</span><span class="p">,</span>
                               <span class="nb">len</span><span class="p">(</span><span class="n">results</span><span class="p">))</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.top_score"</span><span class="p">,</span>
                               <span class="nb">max</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span> <span class="k">if</span> <span class="n">scores</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.retrieve.avg_score"</span><span class="p">,</span>
                               <span class="nb">sum</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">scores</span><span class="p">)</span>
                               <span class="k">if</span> <span class="n">scores</span> <span class="k">else</span> <span class="mf">0.0</span><span class="p">)</span>

        <span class="c1"># Layer 4: Context Assembly
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.context"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">context</span> <span class="o">=</span> <span class="n">assemble_context</span><span class="p">(</span><span class="n">user_input</span><span class="p">,</span> <span class="n">results</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.context.total_tokens"</span><span class="p">,</span>
                               <span class="n">context</span><span class="p">[</span><span class="s">"token_count"</span><span class="p">])</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.context.template_version"</span><span class="p">,</span>
                               <span class="s">"v2.1"</span><span class="p">)</span>

        <span class="c1"># Layer 5: Generation
</span>        <span class="c1"># Splunk Tag Spotlight automatically surfaces which
</span>        <span class="c1"># attribute values correlate with errors or latency.
</span>        <span class="k">with</span> <span class="n">tracer</span><span class="p">.</span><span class="n">start_as_current_span</span><span class="p">(</span><span class="s">"rag.generate"</span><span class="p">)</span> <span class="k">as</span> <span class="n">span</span><span class="p">:</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.model"</span><span class="p">,</span> <span class="s">"gpt-4o"</span><span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"gen_ai.request.temperature"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">)</span>
            <span class="n">response</span> <span class="o">=</span> <span class="n">call_llm</span><span class="p">(</span><span class="n">context</span><span class="p">[</span><span class="s">"prompt"</span><span class="p">])</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"gen_ai.usage.input_tokens"</span><span class="p">,</span>
                <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"prompt_tokens"</span><span class="p">],</span>
            <span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span>
                <span class="s">"gen_ai.usage.output_tokens"</span><span class="p">,</span>
                <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">],</span>
            <span class="p">)</span>
            <span class="c1"># Splunk can aggregate this to show total cost
</span>            <span class="c1"># per service, endpoint, or time window.
</span>            <span class="n">cost</span> <span class="o">=</span> <span class="p">(</span>
                <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"prompt_tokens"</span><span class="p">]</span> <span class="o">*</span> <span class="mf">2.50</span>
                <span class="o">/</span> <span class="mi">1_000_000</span>
                <span class="o">+</span> <span class="n">response</span><span class="p">[</span><span class="s">"usage"</span><span class="p">][</span><span class="s">"completion_tokens"</span><span class="p">]</span>
                <span class="o">*</span> <span class="mf">10.00</span>
                <span class="o">/</span> <span class="mi">1_000_000</span>
            <span class="p">)</span>
            <span class="n">span</span><span class="p">.</span><span class="n">set_attribute</span><span class="p">(</span><span class="s">"rag.generate.cost_usd"</span><span class="p">,</span> <span class="n">cost</span><span class="p">)</span>

    <span class="k">return</span> <span class="n">response</span><span class="p">[</span><span class="s">"text"</span><span class="p">]</span>
</code></pre></div></div>
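
<p>The cost arithmetic inside the <code class="language-plaintext highlighter-rouge">rag.generate</code> span is easier to test and reuse when it lives in its own function. The sketch below is one way to factor it out. The helper name <code class="language-plaintext highlighter-rouge">estimate_cost_usd</code> is hypothetical, and the per-million-token prices are simply the figures used in the example above, so substitute your provider's current rates.</p>

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_usd_per_million=2.50,
                      output_usd_per_million=10.00):
    """Estimate the cost of one LLM call from its token usage.

    The default prices are the illustrative figures from the example
    pipeline, not authoritative provider pricing.
    """
    input_cost = prompt_tokens * input_usd_per_million / 1_000_000
    output_cost = completion_tokens * output_usd_per_million / 1_000_000
    return input_cost + output_cost


# Token counts taken from the stubbed call_llm response.
usage = {"prompt_tokens": 4102, "completion_tokens": 287}
cost = estimate_cost_usd(usage["prompt_tokens"], usage["completion_tokens"])
print(f"{cost:.6f}")
```

<p>With the stubbed usage numbers this comes to roughly 1.3 cents per request, which is the value that would be written to <code class="language-plaintext highlighter-rouge">rag.generate.cost_usd</code>.</p>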

<p>A powerful Splunk-specific feature is <strong>Tag Spotlight</strong>. Once your spans are flowing into Splunk APM and the relevant span tags are indexed, Tag Spotlight breaks down requests, errors, and latency by tag value so you can see which values correlate with problems. For example, it might surface that requests where <code class="language-plaintext highlighter-rouge">rag.retrieve.top_score &lt; 0.6</code> are 4x more likely to end in an error. Tag Spotlight only sees what is recorded on the span, so an outcome such as a user complaint would need to be captured as a span attribute before it could appear there. This turns your span attributes into diagnostic insights without manual dashboard building.</p>

<p>Another Splunk advantage is the ability to create <strong>detectors</strong> (real-time alerts) on span attributes. You could configure: “Alert the on-call engineer when the p95 latency of <code class="language-plaintext highlighter-rouge">rag.generate</code> exceeds 5 seconds for 10 consecutive minutes.” Or: “Alert when <code class="language-plaintext highlighter-rouge">rag.retrieve.avg_score</code> drops below 0.65, indicating potential index staleness.”</p>
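
<p>The first rule above combines a percentile, a threshold, and a duration. The sketch below spells out that logic in plain Python so the moving parts are visible. It is not Splunk's detector implementation and does not use Splunk's APIs; the function names and the per-minute bucketing of latencies are assumptions made for the example.</p>

```python
from statistics import quantiles


def p95(values):
    """95th percentile of a list with at least two samples."""
    return quantiles(values, n=100)[94]


def should_alert(latencies_by_minute, threshold_s=5.0, sustained_minutes=10):
    """Return True once p95 latency stays above threshold_s for
    sustained_minutes consecutive one-minute buckets.

    latencies_by_minute is a list of per-minute lists of rag.generate
    latencies in seconds, ordered oldest first.
    """
    streak = 0
    for minute in latencies_by_minute:
        if len(minute) >= 2 and p95(minute) > threshold_s:
            streak += 1
            if streak >= sustained_minutes:
                return True
        else:
            # A healthy or empty minute resets the consecutive count.
            streak = 0
    return False


slow_minute = [6.0] * 20   # hypothetical: every request took 6 s
fast_minute = [1.0] * 20   # hypothetical: every request took 1 s

print(should_alert([slow_minute] * 10))
print(should_alert([slow_minute] * 5 + [fast_minute] + [slow_minute] * 9))
```

<p>The second call does not alert because the single healthy minute breaks the streak. Requiring a sustained breach is what keeps a brief spike from paging the on-call engineer.</p>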

<hr />

<h2 id="component-level-evaluation-beyond-black-box-testing">Component-Level Evaluation: Beyond Black-Box Testing</h2>

<p>Most teams evaluate their LLM applications as a black box: feed an input, get an output, score the output. This is like taste-testing the final dish without checking the quality of the ingredients, the cooking temperature, or the preparation steps.</p>

<p><strong>Component-level evaluation</strong> means running quality checks at each layer of the pipeline independently.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>+------------------------------------------------------------------+
|                                                                  |
|  BLACK-BOX EVALUATION                                            |
|  Input -------&gt; [ ?? LLM App ?? ] -------&gt; Output ----&gt; Score    |
|                                                                  |
|  "The food was 6/10."                                            |
|                                                                  |
+------------------------------------------------------------------+


+------------------------------------------------------------------+
|                                                                  |
|  COMPONENT-LEVEL EVALUATION                                      |
|                                                                  |
|  Query ----&gt; Score: Is the query well-formed?                    |
|    |                                                             |
|    v                                                             |
|  Embed ----&gt; Score: Is the vector dimensionally correct?         |
|    |                                                             |
|    v                                                             |
|  Retrieve -&gt; Score: Are the retrieved docs relevant?             |
|    |                  (relevance score, context recall)           |
|    v                                                             |
|  Context --&gt; Score: Is the assembled prompt within limits?       |
|    |                  (token count, completeness)                 |
|    v                                                             |
|  Generate -&gt; Score: Is the final answer faithful to context?     |
|                      (faithfulness, answer relevancy)            |
|                                                                  |
|  "The prep was great, retrieval missed a key document,           |
|   the LLM compensated but hallucinated one detail."              |
|                                                                  |
+------------------------------------------------------------------+
</code></pre></div></div>

<p>Frameworks like <strong>DeepEval</strong> and <strong>Ragas</strong> provide pre-built evaluation metrics for each component. For example:</p>

<ul>
  <li><strong>Context Recall</strong> – Did the retrieval step find all the relevant documents? Evaluated at Layer 3.</li>
  <li><strong>Context Precision</strong> – Were the retrieved documents actually relevant, or was there noise? Also Layer 3.</li>
  <li><strong>Faithfulness</strong> – Does the generated answer stick to facts found in the context, or does it hallucinate? Evaluated at Layer 5.</li>
  <li><strong>Answer Relevancy</strong> – Does the response actually address the user’s original question? Cross-layer evaluation linking Layer 1 to Layer 5.</li>
</ul>
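<p>To make the two retrieval metrics concrete, here is a minimal set-based sketch. Ragas and DeepEval compute their versions with their own formulas, often using an LLM judge; this example instead assumes you already have ground-truth relevant document IDs for a query. The function name and document IDs are hypothetical.</p>

```python
from typing import Iterable, Tuple


def context_precision_recall(
    retrieved: Iterable[str], relevant: Iterable[str]
) -> Tuple[float, float]:
    """Set-based precision and recall for one retrieval call."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set.intersection(relevant_set)
    # Precision: what share of the retrieved documents was relevant?
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    # Recall: what share of the relevant documents was retrieved?
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved_ids = ["doc-1", "doc-4", "doc-7", "doc-9"]  # what Layer 3 returned
relevant_ids = ["doc-1", "doc-2", "doc-4"]            # labeled ground truth

precision, recall = context_precision_recall(retrieved_ids, relevant_ids)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

<p>Here two of the four retrieved documents are relevant (precision 0.50) and two of the three relevant documents were found (recall 0.67), which points at both noise and a missed document in the retrieval layer.</p>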

<p>By combining observability (traces and spans) with component-level evaluation (quality scores per layer), you build a comprehensive picture of both <em>performance</em> and <em>quality</em> across your entire pipeline. The observability tells you <em>how fast</em> and <em>how reliably</em> each layer is running. The evaluations tell you <em>how well</em> each layer is doing its job.</p>

<p>Think of it as the difference between knowing that the kitchen cooked the dish in 12 minutes (observability) and knowing that the dish scored 9/10 on flavor (evaluation). You need both to run a great restaurant.</p>

<hr />

<h2 id="best-practices-for-production-llm-observability">Best Practices for Production LLM Observability</h2>

<p>Drawing from the implementation patterns above, here are the practices that separate well-monitored LLM systems from the rest:</p>

<p><strong>1. Instrument from day one, not after the first incident.</strong> Adding observability after a production failure is like installing smoke detectors after a fire. The cost of instrumentation is low; the cost of blind debugging is high. Every code example in this article can be added to a new pipeline in under an hour.</p>

<p><strong>2. Use semantic naming conventions for spans and attributes.</strong> Follow the OpenTelemetry Semantic Conventions for GenAI. Using <code class="language-plaintext highlighter-rouge">gen_ai.request.model</code> instead of <code class="language-plaintext highlighter-rouge">my_model_name</code> means that any observability backend that understands those conventions can render meaningful dashboards without custom configuration.</p>

<p><strong>3. Record business-relevant attributes, not just technical ones.</strong> Token counts and latency are essential, but also record session IDs, user segments, query categories, and cost estimates. These attributes enable business-level analysis: “Which customer segment generates the most expensive queries?” or “Are enterprise users experiencing worse retrieval quality than free-tier users?”</p>

<p><strong>4. Set alerts on leading indicators, not lagging ones.</strong> Alert on retrieval relevance scores dropping (a leading indicator that output quality will degrade) rather than on user complaint rates (a lagging indicator that damage is already done). Span-level attributes make leading-indicator alerts possible.</p>

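<p><strong>5. Sample wisely in high-throughput systems.</strong> If your system handles thousands of queries per second, exporting every trace will overwhelm your observability backend. Head-based sampling decides at the start of a trace, before you know whether it will fail or run slow, so it can only keep a fixed fraction of traffic. Tail-based sampling decides after the trace completes: use it to always capture error traces and slow traces in full, and sample normal traces at a lower rate.</p>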
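<p>A tail-based sampling decision can be sketched in a few lines. In production this logic usually lives in the collector tier (for example, the OpenTelemetry Collector's tail sampling processor) rather than in application code; the <code class="language-plaintext highlighter-rouge">TraceSummary</code> type, the 5-second slow threshold, and the 5% baseline rate below are illustrative assumptions.</p>

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class TraceSummary:
    """What is known about a trace once it has completed (hypothetical type)."""
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: TraceSummary,
    slow_threshold_ms: float = 5000.0,
    baseline_rate: float = 0.05,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide, after the trace has finished, whether to export it."""
    if trace.has_error:
        return True  # always keep failures
    if trace.duration_ms >= slow_threshold_ms:
        return True  # always keep slow traces
    return rng() < baseline_rate  # keep a small share of normal traffic


# A fixed "random" draw makes the demonstration deterministic.
never_lucky = lambda: 0.99
print(keep_trace(TraceSummary("t1", 120.0, True), rng=never_lucky))    # error trace
print(keep_trace(TraceSummary("t2", 7200.0, False), rng=never_lucky))  # slow trace
print(keep_trace(TraceSummary("t3", 120.0, False), rng=never_lucky))   # normal trace
```

<p>The first two traces are kept regardless of the draw; the third is dropped because the draw falls outside the baseline rate.</p>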

<p><strong>6. Separate evaluation from observability.</strong> Observability tells you <em>what happened</em> (latency, tokens, errors). Evaluation tells you <em>how good</em> it was (relevance, faithfulness). Run evaluation asynchronously on sampled traces – don’t add LLM-as-judge calls to your hot path.</p>

<p><strong>7. Version everything.</strong> Record the embedding model version, the prompt template version, the LLM model version, and the vector index version as span attributes. When quality regresses, these version tags let you correlate the regression with a specific change.</p>
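<p>Once version tags are on every span, correlating a regression with a change becomes a simple group-by. The sketch below assumes span attributes have been exported as plain dictionaries; the attribute names <code class="language-plaintext highlighter-rouge">prompt.template.version</code> and <code class="language-plaintext highlighter-rouge">eval.faithfulness</code> and the scores are hypothetical, not part of any standard convention.</p>

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Mapping


def mean_score_by_version(
    records: Iterable[Mapping[str, object]], version_key: str, score_key: str
) -> Dict[str, float]:
    """Average a quality score per value of a version attribute."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[version_key]].append(record[score_key])
    return {version: mean(scores) for version, scores in grouped.items()}


# Hypothetical exported span attributes with asynchronous evaluation scores.
spans = [
    {"prompt.template.version": "v3", "eval.faithfulness": 0.92},
    {"prompt.template.version": "v3", "eval.faithfulness": 0.88},
    {"prompt.template.version": "v4", "eval.faithfulness": 0.71},
    {"prompt.template.version": "v4", "eval.faithfulness": 0.69},
]

for version, score in sorted(
    mean_score_by_version(spans, "prompt.template.version", "eval.faithfulness").items()
):
    print(f"{version}: {score:.2f}")
```

<p>In this made-up data the drop from roughly 0.90 to 0.70 lines up with the move from template v3 to v4, which is the kind of signal the version tags are there to provide.</p>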

<p><strong>8. Build dashboards that span all five layers.</strong> A single dashboard should show, at a glance: query volume, embedding latency, retrieval relevance distribution, context token usage, and generation cost. This end-to-end view lets you spot inter-layer effects that single-layer dashboards miss.</p>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>LLM applications are no longer experiments – they’re production software serving real users with real expectations. And production software demands production-grade observability.</p>

<p>The five-layer model presented in this article – Query, Embedding, Retrieval, Context, and Generation – gives you a systematic framework for understanding what’s happening inside your LLM pipeline at every step. Each layer corresponds to a distinct operation with its own failure modes, performance characteristics, and cost profile. By instrumenting each layer as a separate span within a trace, you gain the ability to debug specific failures, track costs to their source, and detect quality drift before it reaches your users.</p>

<p>The three implementation examples – OpenTelemetry, AppDynamics APM, and Splunk Observability Cloud – demonstrate that the same conceptual model maps cleanly to any observability platform. OpenTelemetry provides the vendor-neutral foundation. AppDynamics wraps it in enterprise anomaly detection and business transaction analytics. Splunk adds log correlation, Tag Spotlight, and real-time detectors.</p>

<p>The restaurant kitchen analogy we used throughout this article carries one final lesson: the best kitchens don’t wait for a customer complaint to start monitoring. They have thermometers in every oven, timers at every station, and quality checks at every handoff. Your LLM pipeline deserves the same.</p>

<p>Start with traces. Add spans for each layer. Record meaningful attributes. Build dashboards. Set alerts. And then – only then – will you truly understand what’s happening between the question and the answer.</p>

<hr />

<h2 id="references">References</h2>

<p><strong>A note on the “Five Layers” model.</strong> The five-layer decomposition of LLM observability (Query, Embedding, Retrieval, Context, Generation) used in this article is not a formally standardized framework from a single authoritative source. It is an emergent industry practice pattern that arises from applying distributed tracing concepts – as standardized by OpenTelemetry <sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> – to the well-known stages of a Retrieval-Augmented Generation (RAG) pipeline <sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>. The OpenTelemetry GenAI semantic conventions formalize three of the five layers (Inference/Generation, Embedding, and Retrieval) as standard span types. Enterprise observability platforms such as Cisco AppDynamics <sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup><sup id="fnref:5" role="doc-noteref"><a href="#fn:5" class="footnote" rel="footnote">5</a></sup> and Splunk Observability Cloud <sup id="fnref:6" role="doc-noteref"><a href="#fn:6" class="footnote" rel="footnote">6</a></sup><sup id="fnref:7" role="doc-noteref"><a href="#fn:7" class="footnote" rel="footnote">7</a></sup><sup id="fnref:8" role="doc-noteref"><a href="#fn:8" class="footnote" rel="footnote">8</a></sup> provide the monitoring infrastructure to operationalize this layered model in production.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>OpenTelemetry Authors. “Semantic Conventions for Generative AI Systems,” v1.40.0 (Development). Includes span conventions for Inference, Embeddings, and Retrievals. <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/">https://opentelemetry.io/docs/specs/semconv/gen-ai/</a> <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>OpenTelemetry Authors. “Semantic Conventions for Generative Client AI Spans.” Defines <code class="language-plaintext highlighter-rouge">gen_ai.*</code> attributes for model, token usage, temperature, and finish reason used in the code examples. <a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/">https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/</a> <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>Lewis, Patrick, et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” <em>Advances in Neural Information Processing Systems</em> 33 (NeurIPS 2020), pp. 9459–9474. The paper that introduced the RAG architecture whose pipeline stages (query encoding, retrieval, context assembly, generation) form the basis of the five observability layers. <a href="https://arxiv.org/abs/2005.11401">https://arxiv.org/abs/2005.11401</a> <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>Cisco AppDynamics. “OpenTelemetry with AppDynamics.” Documents OTLP ingestion and the mapping of OpenTelemetry traces to AppDynamics Business Transactions. <a href="https://docs.appdynamics.com/appd/24.x/en/application-monitoring/opentelemetry">https://docs.appdynamics.com/appd/24.x/en/application-monitoring/opentelemetry</a> <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:5" role="doc-endnote">
      <p>Cisco AppDynamics. “Business Transactions.” Describes how AppDynamics discovers, maps, and monitors the performance of application transactions – the mechanism through which OTel spans surface in the AppDynamics UI. <a href="https://docs.appdynamics.com/appd/24.x/en/application-monitoring/business-transactions">https://docs.appdynamics.com/appd/24.x/en/application-monitoring/business-transactions</a> <a href="#fnref:5" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:6" role="doc-endnote">
      <p>Splunk. “Splunk Observability Cloud: APM.” Documents Splunk’s OpenTelemetry-native APM, including trace visualization, service maps, and Tag Spotlight for span-attribute-driven diagnostics. <a href="https://docs.splunk.com/observability/en/apm/">https://docs.splunk.com/observability/en/apm/</a> <a href="#fnref:6" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:7" role="doc-endnote">
      <p>Splunk. “Splunk Distribution of OpenTelemetry Python.” Splunk’s packaging of the OTel Python SDK with pre-configured exporters and auto-instrumentation for common frameworks. <a href="https://docs.splunk.com/observability/en/gdi/get-data-in/application/python/get-started.html">https://docs.splunk.com/observability/en/gdi/get-data-in/application/python/get-started.html</a> <a href="#fnref:7" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:8" role="doc-endnote">
      <p>Splunk. “Create Detectors to Trigger Alerts.” Documents how to configure real-time alerting on span attributes in Splunk Observability Cloud. <a href="https://docs.splunk.com/observability/en/alerts-detectors-notifications/create-detectors-for-alerts.html">https://docs.splunk.com/observability/en/alerts-detectors-notifications/create-detectors-for-alerts.html</a> <a href="#fnref:8" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>]]></content><author><name>Marc Buraczynski</name></author><category term="LLMs" /><category term="observability" /><category term="OpenTelemetry" /><category term="monitoring" /><category term="AppDynamics" /><category term="Splunk" /><summary type="html"><![CDATA[A practical guide to monitoring, debugging, and optimizing Large Language Model applications in production – with implementation examples for OpenTelemetry, AppDynamics APM, and Splunk Observability Cloud.]]></summary></entry><entry><title type="html">The Complete Guide to Fine-Tuning Large Language Models: From Theory to Production</title><link href="https://gunnymarc.github.io/posts/2026/02/fine-tuning-llms/" rel="alternate" type="text/html" title="The Complete Guide to Fine-Tuning Large Language Models: From Theory to Production" /><published>2026-02-20T00:00:00-05:00</published><updated>2026-02-20T00:00:00-05:00</updated><id>https://gunnymarc.github.io/posts/2026/02/fine-tuning-llms</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/02/fine-tuning-llms/"><![CDATA[<p><strong>A Deep Technical Dive into LoRA, QLoRA, and Full Fine-Tuning with Modern Open-Source Models</strong></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>

<ol>
  <li><a href="#introduction-to-llm-fine-tuning">Introduction to LLM Fine-Tuning</a></li>
  <li><a href="#why-fine-tune-use-cases-and-benefits">Why Fine-Tune? Use Cases and Benefits</a></li>
  <li><a href="#understanding-fine-tuning-approaches">Understanding Fine-Tuning Approaches</a></li>
  <li><a href="#technical-deep-dive-full-fine-tuning">Technical Deep-Dive: Full Fine-Tuning</a></li>
  <li><a href="#technical-deep-dive-lora-and-variants">Technical Deep-Dive: LoRA and Variants</a></li>
  <li><a href="#technical-deep-dive-qlora">Technical Deep-Dive: QLoRA</a></li>
  <li><a href="#data-preparation-pipeline">Data Preparation Pipeline</a></li>
  <li><a href="#implementation-full-fine-tuning">Implementation: Full Fine-Tuning</a></li>
  <li><a href="#implementation-lora-fine-tuning">Implementation: LoRA Fine-Tuning</a></li>
  <li><a href="#implementation-qlora-fine-tuning">Implementation: QLoRA Fine-Tuning</a></li>
  <li><a href="#evaluation-and-metrics">Evaluation and Metrics</a></li>
  <li><a href="#best-practices-and-optimization-tips">Best Practices and Optimization Tips</a></li>
  <li><a href="#comparison-of-approaches">Comparison of Approaches</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ol>

<hr />

<h2 id="introduction-to-llm-fine-tuning">Introduction to LLM Fine-Tuning</h2>

<p>Large Language Models (LLMs) have revolutionized natural language processing, demonstrating remarkable capabilities across diverse tasks. However, pre-trained models, while powerful, often require adaptation to perform optimally on domain-specific tasks. This is where <strong>fine-tuning</strong> comes into play: continuing the training of a pre-trained model on a smaller, task-specific dataset.</p>

<p>The challenge with modern LLMs lies in their scale. Models like Llama 4, Qwen 3, DeepSeek-V3.2, and Gemma 3 contain billions of parameters, making traditional fine-tuning computationally prohibitive for most practitioners. This has led to the development of parameter-efficient fine-tuning (PEFT) methods that achieve comparable results while training only a fraction of the model’s parameters.</p>

<p><img src="diagrams/01_llm_lifecycle.png" alt="LLM Fine-Tuning Lifecycle: Pre-training, Fine-tuning, and Deployment phases" /></p>

<hr />

<h2 id="why-fine-tune-use-cases-and-benefits">Why Fine-Tune? Use Cases and Benefits</h2>

<h3 id="primary-use-cases">Primary Use Cases</h3>

<ol>
  <li>
    <p><strong>Domain Adaptation</strong>: Adapting a general-purpose model to specialized domains like legal, medical, or financial text.</p>
  </li>
  <li>
    <p><strong>Task-Specific Optimization</strong>: Improving performance on specific tasks such as code generation, summarization, or question answering.</p>
  </li>
  <li>
    <p><strong>Style and Tone Alignment</strong>: Training models to match specific writing styles, brand voices, or communication patterns.</p>
  </li>
  <li>
    <p><strong>Knowledge Injection</strong>: Incorporating proprietary or recent knowledge not present in the pre-training data.</p>
  </li>
  <li>
    <p><strong>Safety and Alignment</strong>: Fine-tuning for responsible AI behavior, reducing harmful outputs, and improving instruction-following.</p>
  </li>
</ol>

<h3 id="benefits-over-prompt-engineering">Benefits Over Prompt Engineering</h3>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Prompt Engineering</th>
      <th>Fine-Tuning</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Performance</td>
      <td>Good</td>
      <td>Excellent</td>
    </tr>
    <tr>
      <td>Consistency</td>
      <td>Variable</td>
      <td>High</td>
    </tr>
    <tr>
      <td>Latency</td>
      <td>Higher (longer prompts)</td>
      <td>Lower</td>
    </tr>
    <tr>
      <td>Cost per inference</td>
      <td>Higher</td>
      <td>Lower</td>
    </tr>
    <tr>
      <td>Customization depth</td>
      <td>Limited</td>
      <td>Deep</td>
    </tr>
    <tr>
      <td>Knowledge incorporation</td>
      <td>Constrained</td>
      <td>Extensive</td>
    </tr>
  </tbody>
</table>

<p><img src="diagrams/02_finetuning_benefits.png" alt="Fine-Tuning Decision Framework: Benefits and Considerations" /></p>

<hr />

<h2 id="understanding-fine-tuning-approaches">Understanding Fine-Tuning Approaches</h2>

<p>Modern LLM fine-tuning encompasses three primary approaches, each with distinct trade-offs between computational efficiency, memory requirements, and model performance.</p>

<h3 id="overview-of-approaches">Overview of Approaches</h3>

<p><img src="diagrams/03_approaches_overview.png" alt="Comparison of Fine-Tuning Approaches: Full, LoRA, and QLoRA" /></p>

<h3 id="parameter-comparison">Parameter Comparison</h3>

<p>For a 70B parameter model (approximate figures; real usage also depends on batch size, sequence length, and optimizer):</p>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Trainable Params</th>
      <th>Base Weight Precision</th>
      <th>Approx. GPU Memory</th>
      <th>Training Speed</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Full Fine-Tuning</td>
      <td>70B (100%)</td>
      <td>FP16</td>
      <td>~280 GB (weights and gradients only; optimizer state adds more)</td>
      <td>Slowest</td>
    </tr>
    <tr>
      <td>LoRA (r=64)</td>
      <td>~100M (0.14%)</td>
      <td>FP16</td>
      <td>~160 GB</td>
      <td>Fast</td>
    </tr>
    <tr>
      <td>QLoRA (r=64, 4-bit)</td>
      <td>~100M (0.14%)</td>
      <td>4-bit NF4</td>
      <td>~48 GB</td>
      <td>Moderate</td>
    </tr>
  </tbody>
</table>
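<p>A quick sanity check on memory figures like these is to compute the footprint of the weights alone, which is a floor that any total must sit above. The sketch below does only that arithmetic; it deliberately ignores gradients, optimizer state, activations, and adapter weights, and the helper name is hypothetical.</p>

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_70B = 70e9

fp16_weights = weight_memory_gb(PARAMS_70B, 16)  # 16-bit weights
nf4_weights = weight_memory_gb(PARAMS_70B, 4)    # 4-bit quantized weights

print(f"FP16 weights:              {fp16_weights:.0f} GB")
print(f"FP16 weights + gradients:  {2 * fp16_weights:.0f} GB")
print(f"4-bit weights:             {nf4_weights:.0f} GB")
```

<p>The FP16 weights come to 140 GB, weights plus same-size gradients to 280 GB, and 4-bit weights to 35 GB. Everything else (optimizer state, activations, quantization constants, LoRA adapters) is added on top of these floors.</p>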

<hr />

<h2 id="technical-deep-dive-full-fine-tuning">Technical Deep-Dive: Full Fine-Tuning</h2>

<p>Traditional fine-tuning updates all parameters of the neural network. During backpropagation, gradients flow through the entire network, and all weights are adjusted based on the task-specific loss.</p>

<h3 id="architecture-and-gradient-flow">Architecture and Gradient Flow</h3>

<p><img src="diagrams/04_full_finetuning_arch.png" alt="Full Fine-Tuning Architecture and Gradient Flow" /></p>

<h3 id="mathematical-formulation">Mathematical Formulation</h3>

<p>For a weight matrix $W \in \mathbb{R}^{d \times d}$, full fine-tuning updates:</p>

\[W_{t+1} = W_t - \alpha \frac{\partial \mathcal{L}}{\partial W_t}\]

<p>Where:</p>
<ul>
  <li>$\alpha$ is the learning rate</li>
  <li>$\mathcal{L}$ is the loss function</li>
  <li>$\frac{\partial \mathcal{L}}{\partial W_t}$ is the gradient of the loss with respect to weights</li>
</ul>
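<p>As a worked example of the update rule, the sketch below applies one plain gradient-descent step to a tiny 2×2 weight matrix in pure Python. The numbers are made up, and real fine-tuning typically uses an adaptive optimizer such as AdamW rather than this bare rule.</p>

```python
from typing import List

Matrix = List[List[float]]


def sgd_step(weights: Matrix, grads: Matrix, learning_rate: float) -> Matrix:
    """Apply W_{t+1} = W_t - learning_rate * dL/dW_t element by element."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


W_t = [[0.5, -0.25], [1.0, 0.0]]   # current weights
grad = [[0.5, -1.0], [0.0, 2.0]]   # gradient of the loss w.r.t. W_t

W_next = sgd_step(W_t, grad, learning_rate=0.5)
print(W_next)  # [[0.25, 0.25], [1.0, -1.0]]
```

<p>Every entry of $W$ receives its own gradient and its own update, which is exactly why full fine-tuning needs memory for a gradient the same size as the model.</p>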

<h3 id="when-to-use-full-fine-tuning">When to Use Full Fine-Tuning</h3>

<ul>
  <li><strong>Sufficient compute resources</strong> available (multiple high-end GPUs)</li>
  <li><strong>Significant domain shift</strong> from pre-training data</li>
  <li><strong>Maximum performance</strong> is critical</li>
  <li><strong>Large, high-quality dataset</strong> available (&gt;100K examples)</li>
</ul>

<hr />

<h2 id="technical-deep-dive-lora-and-variants">Technical Deep-Dive: LoRA and Variants</h2>

<h3 id="lora-low-rank-adaptation">LoRA (Low-Rank Adaptation)</h3>

<p>LoRA takes a different approach: instead of updating the full weight matrix $W$, it freezes $W$ and learns the weight update as the product of two low-rank matrices $A$ and $B$.</p>

<p><img src="diagrams/05_lora_architecture.png" alt="LoRA Architecture: Low-Rank Adaptation with trainable A and B matrices" /></p>

<p><strong>Key Insight</strong>: The rank $r$ is typically 8-64, much smaller than $d$ (which can be 4096-8192 in modern LLMs). This reduces trainable parameters from $d^2$ to $2 \times d \times r$.</p>
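<p>The saving is easy to check with a few lines of arithmetic. The sketch below compares $d^2$ with $2 \times d \times r$ for a single square weight matrix; the values $d = 4096$ and $r = 16$ are picked from the ranges above purely for illustration.</p>

```python
from typing import Tuple


def lora_param_counts(d: int, r: int) -> Tuple[int, int]:
    """Trainable parameters for one d x d matrix: full update vs. LoRA."""
    full = d * d       # every entry of W is trainable
    lora = 2 * d * r   # two low-rank factors with d * r entries each
    return full, lora


full, lora = lora_param_counts(d=4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  reduction: {full // lora}x")
```

<p>For this matrix the count drops from 16,777,216 to 131,072 trainable parameters, a 128× reduction.</p>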

<h3 id="mathematical-foundation">Mathematical Foundation</h3>

<p>The forward pass with LoRA:</p>

\[h = Wx + \frac{\alpha}{r}BAx\]

<p>Where:</p>
<ul>
  <li>$W \in \mathbb{R}^{d \times d}$ is the frozen pre-trained weight</li>
  <li>$A \in \mathbb{R}^{r \times d}$ and $B \in \mathbb{R}^{d \times r}$ are the trainable low-rank matrices, so that $BA \in \mathbb{R}^{d \times d}$</li>
  <li>$\alpha$ is a scaling factor</li>
  <li>$r$ is the rank (hyperparameter)</li>
</ul>
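<p>The forward pass can be traced by hand on a toy example. For the product $BAx$ to be defined with $x \in \mathbb{R}^{d}$, $A$ must map $d$ dimensions down to $r$ and $B$ must map $r$ back up to $d$. The sketch below uses $d = 3$, $r = 1$, and made-up values in pure Python; the zero initialization of $B$ follows the original LoRA paper, so the adapted model starts out identical to the base model.</p>

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(W: Matrix, A: Matrix, B: Matrix, x: Vector, alpha: float, r: int) -> Vector:
    """Compute h = W x + (alpha / r) * B (A x)."""
    base = matvec(W, x)               # frozen pre-trained path
    update = matvec(B, matvec(A, x))  # low-rank path: d -> r -> d
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen weight, d x d
A = [[1.0, 2.0, 0.0]]                                     # r x d
B_init = [[0.0], [0.0], [0.0]]                            # d x r, zero at init
B_trained = [[0.5], [0.0], [-1.0]]                        # d x r, after training
x = [1.0, 1.0, 2.0]

print(lora_forward(W, A, B_init, x, alpha=2.0, r=1))     # [1.0, 1.0, 2.0]
print(lora_forward(W, A, B_trained, x, alpha=2.0, r=1))  # [4.0, 1.0, -4.0]
```

<p>With $B = 0$ the output equals $Wx$; after training, the low-rank path adds a correction scaled by $\alpha / r$ while $W$ itself never changes.</p>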

<h3 id="lora-variants">LoRA Variants</h3>

<h4 id="lora-fa-frozen-a">LoRA-FA (Frozen-A)</h4>

<p>LoRA-FA reduces activation memory by freezing matrix $A$ after random initialization, training only matrix $B$.</p>

<p><img src="diagrams/06_lora_fa.png" alt="LoRA-FA Architecture: Frozen-A variant for reduced activation memory" /></p>

<h4 id="vera-vector-based-random-adaptation">VeRA (Vector-based Random Adaptation)</h4>

<p>VeRA takes efficiency further by sharing frozen random matrices across all layers and only training small scaling vectors.</p>

<p><img src="diagrams/07_vera.png" alt="VeRA Architecture: Vector-based Random Adaptation with shared matrices" /></p>

<h4 id="delta-lora">Delta-LoRA</h4>

<p>Delta-LoRA also updates the base weight matrix $W$, using the difference between the low-rank products of consecutive training steps, scaled by a hyperparameter $c$:</p>

\[W_{t+1} = W_t + c(B_{t+1}A_{t+1} - B_tA_t)\]

<p><img src="diagrams/08_delta_lora.png" alt="Delta-LoRA: Weight update mechanism using consecutive LoRA differences" /></p>

<h4 id="lora">LoRA+</h4>

<p>LoRA+ optimizes convergence by using different learning rates for matrices $A$ and $B$:</p>

<p><img src="diagrams/09_lora_plus.png" alt="LoRA+ Learning Rate Strategy: Different rates for A and B matrices" /></p>

<p><strong>Research Finding</strong>: With $\lambda = \eta_B / \eta_A$ denoting the ratio of the two learning rates, setting $\lambda = 16$ (i.e., a 16× higher learning rate for $B$) often yields better convergence and final performance.</p>
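<p>In practice, two learning rates usually means two optimizer parameter groups. The sketch below builds such groups from parameter names in plain Python. It assumes adapter parameters can be recognized by the substrings <code class="language-plaintext highlighter-rouge">lora_A</code> and <code class="language-plaintext highlighter-rouge">lora_B</code> in their names; the helper name, the example parameter names, and the base learning rate are hypothetical.</p>

```python
from typing import Dict, List


def loraplus_param_groups(
    param_names: List[str], base_lr: float, lr_ratio: float = 16.0
) -> List[Dict[str, object]]:
    """Split LoRA parameters into two groups with different learning rates."""
    group_a = [name for name in param_names if "lora_A" in name]
    group_b = [name for name in param_names if "lora_B" in name]
    return [
        {"params": group_a, "lr": base_lr},             # A matrices: base rate
        {"params": group_b, "lr": base_lr * lr_ratio},  # B matrices: ratio x base rate
    ]


names = [
    "layers.0.q_proj.lora_A.weight",
    "layers.0.q_proj.lora_B.weight",
    "layers.0.v_proj.lora_A.weight",
    "layers.0.v_proj.lora_B.weight",
]

for group in loraplus_param_groups(names, base_lr=1e-4):
    print(group["lr"], group["params"])
```

<p>In a real training script the <code class="language-plaintext highlighter-rouge">params</code> entries would hold the parameter tensors rather than their names, and the resulting list would be handed to the optimizer in place of a flat parameter list.</p>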

<hr />

<h2 id="technical-deep-dive-qlora">Technical Deep-Dive: QLoRA</h2>

<p>QLoRA combines quantization with LoRA so that very large models can be fine-tuned on a single high-memory GPU, and smaller models on consumer hardware.</p>

<h3 id="key-innovations">Key Innovations</h3>

<ol>
  <li>
    <p><strong>4-bit NormalFloat (NF4)</strong>: An information-theoretically optimal quantization for normally distributed weights.</p>
  </li>
  <li>
    <p><strong>Double Quantization</strong>: Quantizes the quantization constants to further reduce memory.</p>
  </li>
  <li>
    <p><strong>Paged Optimizers</strong>: Uses NVIDIA unified memory to page optimizer state between GPU and CPU memory, absorbing the memory spikes that occur during gradient checkpointing.</p>
  </li>
</ol>
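<p>The effect of double quantization is easiest to see as bits of overhead per parameter. The sketch below reproduces the arithmetic from the QLoRA paper, which uses a block size of 64 for the weights, 32-bit quantization constants, and a second quantization of those constants to 8 bits with a block size of 256. The helper name is hypothetical, and the 70B extrapolation is only illustrative.</p>

```python
def constant_overhead_bits(block_size: int, constant_bits: int) -> float:
    """Bits of quantization-constant overhead added per model parameter."""
    return constant_bits / block_size


WEIGHT_BLOCK = 64      # weights that share one quantization constant
CONSTANT_BLOCK = 256   # first-level constants that share one second-level constant

# Single quantization: one 32-bit constant per block of 64 weights.
single_quant = constant_overhead_bits(WEIGHT_BLOCK, 32)

# Double quantization: constants stored in 8 bits, plus one 32-bit
# second-level constant per 256 first-level constants.
double_quant = constant_overhead_bits(WEIGHT_BLOCK, 8) + constant_overhead_bits(
    WEIGHT_BLOCK * CONSTANT_BLOCK, 32
)

saved_bits = single_quant - double_quant
saved_gb_70b = 70e9 * saved_bits / 8 / 1e9  # bits -> bytes -> GB for 70B params

print(f"single: {single_quant:.3f} bits/param")
print(f"double: {double_quant:.3f} bits/param")
print(f"saved:  {saved_bits:.3f} bits/param, about {saved_gb_70b:.2f} GB at 70B")
```

<p>The overhead falls from 0.5 to roughly 0.127 bits per parameter. That looks small, but across 70 billion parameters it adds up to a little over 3 GB.</p>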

<p><img src="diagrams/10_qlora_architecture.png" alt="QLoRA Architecture: 4-bit quantized base with full precision LoRA adapters" /></p>

<h3 id="memory-breakdown">Memory Breakdown</h3>

<p><img src="diagrams/11_memory_distribution.png" alt="Memory Distribution for 70B Model with QLoRA" /></p>

<hr />

<h2 id="data-preparation-pipeline">Data Preparation Pipeline</h2>

<p>Effective fine-tuning requires careful data preparation. Here’s a production-ready pipeline:</p>

<p><img src="diagrams/12_data_pipeline.png" alt="Data Preparation Pipeline for LLM Fine-Tuning" /></p>

<h3 id="complete-data-preparation-code">Complete Data Preparation Code</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Production-ready data preparation pipeline for LLM fine-tuning.
Compatible with Llama 4, Qwen 3, DeepSeek-V3.2, and Gemma 3.

Requirements:
    pip install datasets transformers torch pandas numpy tqdm
"""</span>

<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">hashlib</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">Callable</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>

<span class="kn">import</span> <span class="nn">pandas</span> <span class="k">as</span> <span class="n">pd</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">DatasetDict</span><span class="p">,</span> <span class="n">load_dataset</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="n">AutoTokenizer</span><span class="p">,</span> <span class="n">PreTrainedTokenizer</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="c1"># Configure logging
</span><span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%(asctime)s - %(levelname)s - %(message)s'</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">DataConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for data preparation pipeline."""</span>
    <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"meta-llama/Llama-4-8B"</span>
    <span class="n">max_seq_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span>
    <span class="n">train_split</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.9</span>
    <span class="n">val_split</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.05</span>
    <span class="n">test_split</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.05</span>
    <span class="n">min_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span>
    <span class="n">max_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4096</span>
    <span class="n">deduplicate</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">quality_filter</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">seed</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">42</span>
    <span class="n">num_proc</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4</span>


<span class="k">class</span> <span class="nc">DataPreparationPipeline</span><span class="p">:</span>
    <span class="s">"""End-to-end data preparation for LLM fine-tuning."""</span>
    
    <span class="c1"># Chat templates for different model families
</span>    <span class="n">CHAT_TEMPLATES</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"llama"</span><span class="p">:</span> <span class="s">"&lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;system&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">{system}&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">{user}&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;</span><span class="se">\n\n</span><span class="s">{assistant}&lt;|eot_id|&gt;"</span><span class="p">,</span>
        <span class="s">"qwen"</span><span class="p">:</span> <span class="s">"&lt;|im_start|&gt;system</span><span class="se">\n</span><span class="s">{system}&lt;|im_end|&gt;</span><span class="se">\n</span><span class="s">&lt;|im_start|&gt;user</span><span class="se">\n</span><span class="s">{user}&lt;|im_end|&gt;</span><span class="se">\n</span><span class="s">&lt;|im_start|&gt;assistant</span><span class="se">\n</span><span class="s">{assistant}&lt;|im_end|&gt;"</span><span class="p">,</span>
        <span class="s">"deepseek"</span><span class="p">:</span> <span class="s">"&lt;|begin▁of▁sentence|&gt;{system}</span><span class="se">\n\n</span><span class="s">User: {user}</span><span class="se">\n\n</span><span class="s">Assistant: {assistant}&lt;|end▁of▁sentence|&gt;"</span><span class="p">,</span>
        <span class="s">"gemma"</span><span class="p">:</span> <span class="s">"&lt;start_of_turn&gt;user</span><span class="se">\n</span><span class="s">{system}</span><span class="se">\n\n</span><span class="s">{user}&lt;end_of_turn&gt;</span><span class="se">\n</span><span class="s">&lt;start_of_turn&gt;model</span><span class="se">\n</span><span class="s">{assistant}&lt;end_of_turn&gt;"</span><span class="p">,</span>
    <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">DataConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_load_tokenizer</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model_family</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_detect_model_family</span><span class="p">()</span>
        
    <span class="k">def</span> <span class="nf">_load_tokenizer</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">PreTrainedTokenizer</span><span class="p">:</span>
        <span class="s">"""Load tokenizer with proper configuration."""</span>
        <span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">use_fast</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Set padding token if not present
</span>        <span class="k">if</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token</span>
            <span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token_id</span> <span class="o">=</span> <span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token_id</span>
            
        <span class="k">return</span> <span class="n">tokenizer</span>
    
    <span class="k">def</span> <span class="nf">_detect_model_family</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Detect model family from model name."""</span>
        <span class="n">model_lower</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span>
        <span class="k">if</span> <span class="s">"llama"</span> <span class="ow">in</span> <span class="n">model_lower</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"llama"</span>
        <span class="k">elif</span> <span class="s">"qwen"</span> <span class="ow">in</span> <span class="n">model_lower</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"qwen"</span>
        <span class="k">elif</span> <span class="s">"deepseek"</span> <span class="ow">in</span> <span class="n">model_lower</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"deepseek"</span>
        <span class="k">elif</span> <span class="s">"gemma"</span> <span class="ow">in</span> <span class="n">model_lower</span><span class="p">:</span>
            <span class="k">return</span> <span class="s">"gemma"</span>
        <span class="k">else</span><span class="p">:</span>
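            <span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unknown model family for </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="si">}</span><span class="s">, defaulting to llama template"</span><span class="p">)</span>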
            <span class="k">return</span> <span class="s">"llama"</span>
    
    <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span> 
        <span class="n">source</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="n">Path</span> <span class="o">|</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
        <span class="n">text_column</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"text"</span><span class="p">,</span>
        <span class="n">instruction_column</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">response_column</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
        <span class="s">"""
        Load data from various sources.
        
        Args:
            source: Path to file, HuggingFace dataset name, or DataFrame
            text_column: Column containing text (for single-text format)
            instruction_column: Column with instructions (for instruction format)
            response_column: Column with responses (for instruction format)
        """</span>
        <span class="k">if</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">):</span>
            <span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_pandas</span><span class="p">(</span><span class="n">source</span><span class="p">)</span>
        <span class="k">elif</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">source</span><span class="p">,</span> <span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Path</span><span class="p">)):</span>
            <span class="n">source_str</span> <span class="o">=</span> <span class="nb">str</span><span class="p">(</span><span class="n">source</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">source_str</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.json'</span><span class="p">):</span>
                <span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_json</span><span class="p">(</span><span class="n">source_str</span><span class="p">)</span>
            <span class="k">elif</span> <span class="n">source_str</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.jsonl'</span><span class="p">):</span>
                <span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_json</span><span class="p">(</span><span class="n">source_str</span><span class="p">,</span> <span class="n">field</span><span class="o">=</span><span class="bp">None</span><span class="p">)</span>
            <span class="k">elif</span> <span class="n">source_str</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.csv'</span><span class="p">):</span>
                <span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_csv</span><span class="p">(</span><span class="n">source_str</span><span class="p">)</span>
            <span class="k">elif</span> <span class="n">source_str</span><span class="p">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.parquet'</span><span class="p">):</span>
                <span class="n">dataset</span> <span class="o">=</span> <span class="n">Dataset</span><span class="p">.</span><span class="n">from_parquet</span><span class="p">(</span><span class="n">source_str</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="c1"># Assume HuggingFace dataset
</span>                <span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="n">source_str</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train"</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unsupported data source type: </span><span class="si">{</span><span class="nb">type</span><span class="p">(</span><span class="n">source</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loaded </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span><span class="si">}</span><span class="s"> examples"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">clean_text</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">text</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Clean and normalize text."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="nb">isinstance</span><span class="p">(</span><span class="n">text</span><span class="p">,</span> <span class="nb">str</span><span class="p">):</span>
            <span class="k">return</span> <span class="s">""</span>
        
        <span class="c1"># Collapse every run of whitespace (including newlines and tabs) into a single space
</span>        <span class="n">text</span> <span class="o">=</span> <span class="s">' '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">text</span><span class="p">.</span><span class="n">split</span><span class="p">())</span>
        
        <span class="c1"># Remove null bytes and other control characters
</span>        <span class="n">text</span> <span class="o">=</span> <span class="s">''</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">char</span> <span class="k">for</span> <span class="n">char</span> <span class="ow">in</span> <span class="n">text</span> <span class="k">if</span> <span class="nb">ord</span><span class="p">(</span><span class="n">char</span><span class="p">)</span> <span class="o">&gt;=</span> <span class="mi">32</span> <span class="ow">or</span> <span class="n">char</span> <span class="ow">in</span> <span class="s">'</span><span class="se">\n\t</span><span class="s">'</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">text</span><span class="p">.</span><span class="n">strip</span><span class="p">()</span>
    
    <span class="k">def</span> <span class="nf">deduplicate</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dataset</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">text_column</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"text"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
        <span class="s">"""Remove duplicate entries based on content hash."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">deduplicate</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">dataset</span>
        
        <span class="n">seen_hashes</span> <span class="o">=</span> <span class="nb">set</span><span class="p">()</span>
        <span class="n">indices_to_keep</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="k">for</span> <span class="n">idx</span><span class="p">,</span> <span class="n">example</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">tqdm</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Deduplicating"</span><span class="p">)):</span>
            <span class="n">text</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">text_column</span><span class="p">)</span> <span class="ow">or</span> <span class="s">""</span>
            <span class="n">text_hash</span> <span class="o">=</span> <span class="n">hashlib</span><span class="p">.</span><span class="n">md5</span><span class="p">(</span><span class="n">text</span><span class="p">.</span><span class="n">encode</span><span class="p">()).</span><span class="n">hexdigest</span><span class="p">()</span>
            
            <span class="k">if</span> <span class="n">text_hash</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">seen_hashes</span><span class="p">:</span>
                <span class="n">seen_hashes</span><span class="p">.</span><span class="n">add</span><span class="p">(</span><span class="n">text_hash</span><span class="p">)</span>
                <span class="n">indices_to_keep</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">idx</span><span class="p">)</span>
        
        <span class="n">original_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="n">indices_to_keep</span><span class="p">)</span>
        <span class="n">removed</span> <span class="o">=</span> <span class="n">original_len</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Removed </span><span class="si">{</span><span class="n">removed</span><span class="si">}</span><span class="s"> duplicates (</span><span class="si">{</span><span class="n">removed</span><span class="o">/</span><span class="nb">max</span><span class="p">(</span><span class="n">original_len</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">100</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">%)"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">quality_filter</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dataset</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span> <span class="n">text_column</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"text"</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
        <span class="s">"""Apply quality filters to the dataset."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">quality_filter</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">dataset</span>
        
        <span class="k">def</span> <span class="nf">is_quality</span><span class="p">(</span><span class="n">example</span><span class="p">):</span>
            <span class="n">text</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">text_column</span><span class="p">)</span> <span class="ow">or</span> <span class="s">""</span>
            
            <span class="c1"># Length check
</span>            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&lt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">min_length</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">False</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">)</span> <span class="o">&gt;</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_length</span><span class="p">:</span>
                <span class="k">return</span> <span class="bp">False</span>
            
            <span class="c1"># Basic quality heuristics
</span>            <span class="n">alpha_ratio</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">c</span><span class="p">.</span><span class="n">isalpha</span><span class="p">()</span> <span class="k">for</span> <span class="n">c</span> <span class="ow">in</span> <span class="n">text</span><span class="p">)</span> <span class="o">/</span> <span class="nb">max</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">text</span><span class="p">),</span> <span class="mi">1</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">alpha_ratio</span> <span class="o">&lt;</span> <span class="mf">0.5</span><span class="p">:</span>  <span class="c1"># At least 50% alphabetic characters
</span>                <span class="k">return</span> <span class="bp">False</span>
            
            <span class="c1"># Check for excessive repetition
</span>            <span class="n">words</span> <span class="o">=</span> <span class="n">text</span><span class="p">.</span><span class="n">lower</span><span class="p">().</span><span class="n">split</span><span class="p">()</span>
            <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">words</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">10</span><span class="p">:</span>
                <span class="n">unique_ratio</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">words</span><span class="p">))</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">words</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">unique_ratio</span> <span class="o">&lt;</span> <span class="mf">0.3</span><span class="p">:</span>  <span class="c1"># Too repetitive
</span>                    <span class="k">return</span> <span class="bp">False</span>
            
            <span class="k">return</span> <span class="bp">True</span>
        
        <span class="n">original_len</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">is_quality</span><span class="p">,</span> <span class="n">num_proc</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_proc</span><span class="p">)</span>
        <span class="n">removed</span> <span class="o">=</span> <span class="n">original_len</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Quality filter removed </span><span class="si">{</span><span class="n">removed</span><span class="si">}</span><span class="s"> examples (</span><span class="si">{</span><span class="n">removed</span><span class="o">/</span><span class="nb">max</span><span class="p">(</span><span class="n">original_len</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="mi">100</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">%)"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">format_instruction</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">instruction</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">response</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
        <span class="n">system_prompt</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"You are a helpful assistant."</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Format instruction-response pair using model-specific template."""</span>
        <span class="n">template</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">CHAT_TEMPLATES</span><span class="p">[</span><span class="bp">self</span><span class="p">.</span><span class="n">model_family</span><span class="p">]</span>
        
        <span class="k">return</span> <span class="n">template</span><span class="p">.</span><span class="nb">format</span><span class="p">(</span>
            <span class="n">system</span><span class="o">=</span><span class="n">system_prompt</span><span class="p">,</span>
            <span class="n">user</span><span class="o">=</span><span class="n">instruction</span><span class="p">,</span>
            <span class="n">assistant</span><span class="o">=</span><span class="n">response</span><span class="p">,</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">tokenize_dataset</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">dataset</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span>
        <span class="n">text_column</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"text"</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
        <span class="s">"""Tokenize dataset for training."""</span>
        
        <span class="k">def</span> <span class="nf">tokenize_function</span><span class="p">(</span><span class="n">examples</span><span class="p">):</span>
            <span class="n">texts</span> <span class="o">=</span> <span class="n">examples</span><span class="p">[</span><span class="n">text_column</span><span class="p">]</span>
            
            <span class="c1"># Tokenize
</span>            <span class="n">tokenized</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">(</span>
                <span class="n">texts</span><span class="p">,</span>
                <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">max_length</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_seq_length</span><span class="p">,</span>
                <span class="n">padding</span><span class="o">=</span><span class="s">"max_length"</span><span class="p">,</span>
                <span class="n">return_tensors</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
            <span class="p">)</span>
            
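            <span class="c1"># For causal LM, labels mirror input_ids; padded positions are set to -100
</span>            <span class="c1"># so the loss ignores them (pad_token may equal eos_token, so mask by attention_mask)
</span>            <span class="n">tokenized</span><span class="p">[</span><span class="s">"labels"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
                <span class="p">[</span><span class="n">tok</span> <span class="k">if</span> <span class="n">mask</span> <span class="k">else</span> <span class="o">-</span><span class="mi">100</span> <span class="k">for</span> <span class="n">tok</span><span class="p">,</span> <span class="n">mask</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">ids</span><span class="p">,</span> <span class="n">attn</span><span class="p">)]</span>
                <span class="k">for</span> <span class="n">ids</span><span class="p">,</span> <span class="n">attn</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">tokenized</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">],</span> <span class="n">tokenized</span><span class="p">[</span><span class="s">"attention_mask"</span><span class="p">])</span>
            <span class="p">]</span>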
            
            <span class="k">return</span> <span class="n">tokenized</span>
        
        <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span>
            <span class="n">tokenize_function</span><span class="p">,</span>
            <span class="n">batched</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">num_proc</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_proc</span><span class="p">,</span>
            <span class="n">remove_columns</span><span class="o">=</span><span class="n">dataset</span><span class="p">.</span><span class="n">column_names</span><span class="p">,</span>
            <span class="n">desc</span><span class="o">=</span><span class="s">"Tokenizing"</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">create_splits</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">dataset</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">DatasetDict</span><span class="p">:</span>
        <span class="s">"""Split dataset into train, validation, and test sets."""</span>
        <span class="c1"># Shuffle first
</span>        <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">shuffle</span><span class="p">(</span><span class="n">seed</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">seed</span><span class="p">)</span>
        
        <span class="c1"># Calculate split sizes
</span>        <span class="n">total</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
        <span class="n">train_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">total</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">train_split</span><span class="p">)</span>
        <span class="n">val_size</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">total</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">val_split</span><span class="p">)</span>
        
        <span class="c1"># Create splits
</span>        <span class="n">train_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">train_size</span><span class="p">))</span>
        <span class="n">val_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">train_size</span><span class="p">,</span> <span class="n">train_size</span> <span class="o">+</span> <span class="n">val_size</span><span class="p">))</span>
        <span class="n">test_dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="n">train_size</span> <span class="o">+</span> <span class="n">val_size</span><span class="p">,</span> <span class="n">total</span><span class="p">))</span>
        
        <span class="n">splits</span> <span class="o">=</span> <span class="n">DatasetDict</span><span class="p">({</span>
            <span class="s">"train"</span><span class="p">:</span> <span class="n">train_dataset</span><span class="p">,</span>
            <span class="s">"validation"</span><span class="p">:</span> <span class="n">val_dataset</span><span class="p">,</span>
            <span class="s">"test"</span><span class="p">:</span> <span class="n">test_dataset</span><span class="p">,</span>
        <span class="p">})</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Dataset splits: train=</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">train_dataset</span><span class="p">)</span><span class="si">}</span><span class="s">, val=</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">val_dataset</span><span class="p">)</span><span class="si">}</span><span class="s">, test=</span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">test_dataset</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">splits</span>
    
    <span class="k">def</span> <span class="nf">process_instruction_dataset</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">dataset</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span>
        <span class="n">instruction_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"instruction"</span><span class="p">,</span>
        <span class="n">response_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"response"</span><span class="p">,</span>
        <span class="n">system_col</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dataset</span><span class="p">:</span>
        <span class="s">"""Process an instruction-following dataset."""</span>
        
        <span class="k">def</span> <span class="nf">format_example</span><span class="p">(</span><span class="n">example</span><span class="p">):</span>
            <span class="n">instruction</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">clean_text</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="n">instruction_col</span><span class="p">])</span>
            <span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">clean_text</span><span class="p">(</span><span class="n">example</span><span class="p">[</span><span class="n">response_col</span><span class="p">])</span>
            <span class="n">system</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="n">system_col</span><span class="p">,</span> <span class="s">"You are a helpful assistant."</span><span class="p">)</span> <span class="k">if</span> <span class="n">system_col</span> <span class="k">else</span> <span class="s">"You are a helpful assistant."</span>
            
            <span class="n">formatted</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">format_instruction</span><span class="p">(</span><span class="n">instruction</span><span class="p">,</span> <span class="n">response</span><span class="p">,</span> <span class="n">system</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">{</span><span class="s">"text"</span><span class="p">:</span> <span class="n">formatted</span><span class="p">}</span>
        
        <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span><span class="n">format_example</span><span class="p">,</span> <span class="n">num_proc</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_proc</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Formatting"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">run_pipeline</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">source</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="n">Path</span> <span class="o">|</span> <span class="n">pd</span><span class="p">.</span><span class="n">DataFrame</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">|</span> <span class="n">Path</span> <span class="o">=</span> <span class="s">"./processed_data"</span><span class="p">,</span>
        <span class="n">instruction_col</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">response_col</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">text_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"text"</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">DatasetDict</span><span class="p">:</span>
        <span class="s">"""
        Run the complete data preparation pipeline.
        
        Args:
            source: Data source (path, HF dataset name, or DataFrame)
            output_dir: Directory to save processed data
            instruction_col: Column with instructions (for instruction format)
            response_col: Column with responses (for instruction format)
            text_col: Column with text (for pre-formatted data)
        """</span>
        <span class="n">output_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
        <span class="n">output_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        
        <span class="c1"># Step 1: Load data
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 1: Loading data..."</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">load_data</span><span class="p">(</span><span class="n">source</span><span class="p">)</span>
        
        <span class="c1"># Step 2: Format instructions (if applicable)
</span>        <span class="k">if</span> <span class="n">instruction_col</span> <span class="ow">and</span> <span class="n">response_col</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 2: Formatting instruction-response pairs..."</span><span class="p">)</span>
            <span class="n">dataset</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">process_instruction_dataset</span><span class="p">(</span>
                <span class="n">dataset</span><span class="p">,</span> <span class="n">instruction_col</span><span class="p">,</span> <span class="n">response_col</span>
            <span class="p">)</span>
            <span class="n">text_col</span> <span class="o">=</span> <span class="s">"text"</span>
        
        <span class="c1"># Step 3: Clean text
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 3: Cleaning text..."</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="nb">map</span><span class="p">(</span>
            <span class="k">lambda</span> <span class="n">x</span><span class="p">:</span> <span class="p">{</span><span class="n">text_col</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">clean_text</span><span class="p">(</span><span class="n">x</span><span class="p">[</span><span class="n">text_col</span><span class="p">])},</span>
            <span class="n">num_proc</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_proc</span><span class="p">,</span>
            <span class="n">desc</span><span class="o">=</span><span class="s">"Cleaning"</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Step 4: Deduplicate
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 4: Deduplicating..."</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">deduplicate</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">text_col</span><span class="p">)</span>
        
        <span class="c1"># Step 5: Quality filter
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 5: Applying quality filters..."</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">quality_filter</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">text_col</span><span class="p">)</span>
        
        <span class="c1"># Step 6: Tokenize
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 6: Tokenizing..."</span><span class="p">)</span>
        <span class="n">dataset</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenize_dataset</span><span class="p">(</span><span class="n">dataset</span><span class="p">,</span> <span class="n">text_col</span><span class="p">)</span>
        
        <span class="c1"># Step 7: Create splits
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 7: Creating train/val/test splits..."</span><span class="p">)</span>
        <span class="n">splits</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">create_splits</span><span class="p">(</span><span class="n">dataset</span><span class="p">)</span>
        
        <span class="c1"># Step 8: Save
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Step 8: Saving processed data..."</span><span class="p">)</span>
        <span class="n">splits</span><span class="p">.</span><span class="n">save_to_disk</span><span class="p">(</span><span class="nb">str</span><span class="p">(</span><span class="n">output_dir</span><span class="p">))</span>
        
        <span class="c1"># Save metadata
</span>        <span class="n">metadata</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"model_name"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="s">"model_family"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">model_family</span><span class="p">,</span>
            <span class="s">"max_seq_length"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_seq_length</span><span class="p">,</span>
            <span class="s">"train_size"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="s">"train"</span><span class="p">]),</span>
            <span class="s">"val_size"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="s">"validation"</span><span class="p">]),</span>
            <span class="s">"test_size"</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="s">"test"</span><span class="p">]),</span>
        <span class="p">}</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_dir</span> <span class="o">/</span> <span class="s">"metadata.json"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">metadata</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Pipeline complete! Data saved to </span><span class="si">{</span><span class="n">output_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">splits</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Example usage of the data preparation pipeline."""</span>
    
    <span class="c1"># Configuration for Llama 4
</span>    <span class="n">config</span> <span class="o">=</span> <span class="n">DataConfig</span><span class="p">(</span>
        <span class="n">model_name</span><span class="o">=</span><span class="s">"meta-llama/Llama-4-8B"</span><span class="p">,</span>
        <span class="n">max_seq_length</span><span class="o">=</span><span class="mi">2048</span><span class="p">,</span>
        <span class="n">train_split</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span>
        <span class="n">val_split</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
        <span class="n">test_split</span><span class="o">=</span><span class="mf">0.05</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="c1"># Initialize pipeline
</span>    <span class="n">pipeline</span> <span class="o">=</span> <span class="n">DataPreparationPipeline</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
    
    <span class="c1"># Example: Process the Alpaca dataset
</span>    <span class="n">splits</span> <span class="o">=</span> <span class="n">pipeline</span><span class="p">.</span><span class="n">run_pipeline</span><span class="p">(</span>
        <span class="n">source</span><span class="o">=</span><span class="s">"tatsu-lab/alpaca"</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="s">"./processed_alpaca"</span><span class="p">,</span>
        <span class="n">instruction_col</span><span class="o">=</span><span class="s">"instruction"</span><span class="p">,</span>
        <span class="n">response_col</span><span class="o">=</span><span class="s">"output"</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Processed dataset statistics:"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Train: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="s">'train'</span><span class="p">])</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> examples"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Validation: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="s">'validation'</span><span class="p">])</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> examples"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Test: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">splits</span><span class="p">[</span><span class="s">'test'</span><span class="p">])</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s"> examples"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
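
<p>Once the pipeline has run, a quick sanity check on <code>metadata.json</code> can catch a bad split before any GPU time is spent. The sketch below uses only the standard library. The helper name <code>check_split_metadata</code> and its tolerance are illustrative assumptions, not part of the pipeline above; the keys it reads match the metadata dictionary written by <code>run_pipeline</code>.</p>

```python
import json
import math
from pathlib import Path


def check_split_metadata(output_dir, expected_train=0.9, tolerance=0.02):
    """Read metadata.json and report the realized split fractions.

    Hypothetical helper: the key names follow the metadata dict written by
    run_pipeline; the default tolerance is an arbitrary illustrative value.
    """
    with open(Path(output_dir) / "metadata.json", encoding="utf-8") as f:
        meta = json.load(f)

    sizes = {key: meta[key] for key in ("train_size", "val_size", "test_size")}
    total = sum(sizes.values())
    if total == 0:
        raise ValueError("All splits are empty; check the quality filters.")

    # Fraction of the surviving examples that landed in each split.
    fractions = {name: count / total for name, count in sizes.items()}
    # True when the train fraction is within `tolerance` of the target.
    ok = math.isclose(
        fractions["train_size"], expected_train, rel_tol=0.0, abs_tol=tolerance
    )
    return fractions, ok
```

<p>Calling <code>check_split_metadata("./processed_alpaca")</code> after the example run returns the three fractions and a boolean that can gate the training job.</p>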

<hr />

<h2 id="implementation-full-fine-tuning">Implementation: Full Fine-Tuning</h2>

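<p>Full fine-tuning updates every parameter of the model. It therefore requires significant compute and memory, since weights, gradients, and optimizer states must all be held during training, but it offers the highest potential performance.</p>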
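
<p>To make these resource requirements concrete, the sketch below estimates the memory needed for model state alone under one common mixed-precision AdamW layout: bf16 weights and gradients, an fp32 master copy of the weights, and two fp32 optimizer moments, for 16 bytes per parameter. The byte counts are assumptions about that layout, the function name is illustrative, and activations, which depend on batch size and sequence length, are excluded.</p>

```python
def estimate_training_state_gb(
    num_params,
    weight_bytes=2,      # bf16 weights (assumed layout)
    grad_bytes=2,        # bf16 gradients (assumed layout)
    master_bytes=4,      # fp32 master copy of the weights (assumed layout)
    optimizer_bytes=8,   # AdamW keeps two fp32 moments per parameter
):
    """Rough memory for weights, gradients, and optimizer state, in GB."""
    bytes_per_param = weight_bytes + grad_bytes + master_bytes + optimizer_bytes
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    for billions in (1, 8, 70):
        gb = estimate_training_state_gb(billions * 1e9)
        print(f"{billions}B parameters: about {gb:,.0f} GB before activations")
```

<p>Under these assumptions an 8B-parameter model needs roughly 128 GB for model state, more than a single 80 GB accelerator holds, so the state has to be sharded or offloaded across devices.</p>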

<p><img src="diagrams/13_training_pipeline.png" alt="Training Pipeline Flow: Setup, Training Loop, and Monitoring" /></p>
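
<p>The length of the training loop follows from a handful of configuration values. As a rough sketch, the helper below derives the effective batch size, the number of optimizer steps, and the warmup length. Its defaults mirror those of <code>FullFineTuningConfig</code>; the function name and the example inputs (46,800 training examples on 8 GPUs) are hypothetical, and treating warmup as a fraction of total optimizer steps is one common convention rather than something confirmed by the script.</p>

```python
import math


def training_schedule(
    num_examples,
    batch_size=4,          # per-device batch size
    grad_accum_steps=8,
    num_processes=1,       # number of GPUs/processes (assumed input)
    num_epochs=3,
    warmup_ratio=0.03,
):
    """Derive effective batch size and step counts for a training run."""
    # Examples consumed per optimizer update across all processes.
    effective_batch = batch_size * grad_accum_steps * num_processes
    # Assumes a partial final batch still triggers an optimizer update.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    # Assumes warmup is a fraction of total optimizer steps, rounded down.
    warmup_steps = int(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    # Hypothetical run: 46,800 training examples on 8 GPUs.
    print(training_schedule(46_800, num_processes=8))
```

<p>With these inputs each optimizer update consumes 256 examples, so the three-epoch run performs 549 updates, the first 16 of them warmup.</p>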

<h3 id="complete-full-fine-tuning-code">Complete Full Fine-Tuning Code</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Full Fine-Tuning Pipeline for Large Language Models.
Supports: Llama 4, Qwen 3, DeepSeek-V3.2, Gemma 3

Requirements:
    pip install torch transformers datasets accelerate wandb tqdm
    pip install flash-attn --no-build-isolation  # Optional but recommended
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">torch.nn</span> <span class="k">as</span> <span class="n">nn</span>
<span class="kn">from</span> <span class="nn">torch.utils.data</span> <span class="kn">import</span> <span class="n">DataLoader</span>
<span class="kn">from</span> <span class="nn">torch.optim</span> <span class="kn">import</span> <span class="n">AdamW</span>
<span class="kn">from</span> <span class="nn">torch.optim.lr_scheduler</span> <span class="kn">import</span> <span class="n">CosineAnnealingLR</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">AutoModelForCausalLM</span><span class="p">,</span>
    <span class="n">AutoTokenizer</span><span class="p">,</span>
    <span class="n">get_linear_schedule_with_warmup</span><span class="p">,</span>
    <span class="n">DataCollatorForLanguageModeling</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_from_disk</span>
<span class="kn">from</span> <span class="nn">accelerate</span> <span class="kn">import</span> <span class="n">Accelerator</span><span class="p">,</span> <span class="n">DistributedDataParallelKwargs</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="k">try</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">wandb</span>
    <span class="n">WANDB_AVAILABLE</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">except</span> <span class="nb">ImportError</span><span class="p">:</span>
    <span class="n">WANDB_AVAILABLE</span> <span class="o">=</span> <span class="bp">False</span>

<span class="c1"># Configure logging
</span><span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%(asctime)s - %(levelname)s - %(message)s'</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">FullFineTuningConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for full fine-tuning."""</span>
    
    <span class="c1"># Model settings
</span>    <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"meta-llama/Llama-4-8B"</span>
    <span class="n">torch_dtype</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"bfloat16"</span>
    <span class="n">use_flash_attention</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">trust_remote_code</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>  <span class="c1"># Runs code from the model repo; enable only for trusted sources</span>
    
    <span class="c1"># Training hyperparameters
</span>    <span class="n">learning_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">2e-5</span>
    <span class="n">weight_decay</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">num_epochs</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4</span>
    <span class="n">gradient_accumulation_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span>
    <span class="n">max_grad_norm</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.0</span>
    <span class="n">warmup_ratio</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.03</span>
    
    <span class="c1"># Optimization settings
</span>    <span class="n">use_gradient_checkpointing</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">mixed_precision</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"bf16"</span>  <span class="c1"># "fp16", "bf16", or "no"
</span>    
    <span class="c1"># Data settings
</span>    <span class="n">data_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./processed_data"</span>
    <span class="n">max_seq_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span>
    
    <span class="c1"># Output settings
</span>    <span class="n">output_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./full_finetuned_model"</span>
    <span class="n">save_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">500</span>
    <span class="n">eval_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100</span>
    <span class="n">logging_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span>
    
    <span class="c1"># Experiment tracking
</span>    <span class="n">project_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"llm-full-finetuning"</span>
    <span class="n">run_name</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">use_wandb</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    
    <span class="c1"># Hardware
</span>    <span class="n">seed</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">42</span>


<span class="k">class</span> <span class="nc">FullFineTuner</span><span class="p">:</span>
    <span class="s">"""Production-ready full fine-tuning trainer."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">FullFineTuningConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">setup_accelerator</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">setup_seed</span><span class="p">()</span>
        
    <span class="k">def</span> <span class="nf">setup_accelerator</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Initialize accelerator for distributed training."""</span>
        <span class="n">ddp_kwargs</span> <span class="o">=</span> <span class="n">DistributedDataParallelKwargs</span><span class="p">(</span><span class="n">find_unused_parameters</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span> <span class="o">=</span> <span class="n">Accelerator</span><span class="p">(</span>
            <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">gradient_accumulation_steps</span><span class="p">,</span>
            <span class="n">mixed_precision</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">mixed_precision</span><span class="p">,</span>
            <span class="n">kwargs_handlers</span><span class="o">=</span><span class="p">[</span><span class="n">ddp_kwargs</span><span class="p">],</span>
        <span class="p">)</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Running on </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">num_processes</span><span class="si">}</span><span class="s"> processes"</span><span class="p">)</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Mixed precision: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">mixed_precision</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">setup_seed</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Set random seeds for reproducibility."""</span>
        <span class="n">torch</span><span class="p">.</span><span class="n">manual_seed</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">seed</span><span class="p">)</span>
        <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">manual_seed_all</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">seed</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">load_model_and_tokenizer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load pre-trained model and tokenizer."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading model: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="c1"># Determine torch dtype
</span>        <span class="n">dtype_map</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"float32"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span>
            <span class="s">"float16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
            <span class="s">"bfloat16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">,</span>
        <span class="p">}</span>
        <span class="n">torch_dtype</span> <span class="o">=</span> <span class="n">dtype_map</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">torch_dtype</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">)</span>
        
        <span class="c1"># Model loading kwargs
</span>        <span class="n">model_kwargs</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"torch_dtype"</span><span class="p">:</span> <span class="n">torch_dtype</span><span class="p">,</span>
            <span class="s">"trust_remote_code"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
            <span class="s">"device_map"</span><span class="p">:</span> <span class="bp">None</span><span class="p">,</span>  <span class="c1"># Let accelerator handle device placement
</span>        <span class="p">}</span>
        
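        <span class="c1"># Enable flash attention only if the flash-attn package is installed
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_flash_attention</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="kn">import</span> <span class="nn">flash_attn</span>
                <span class="n">model_kwargs</span><span class="p">[</span><span class="s">"attn_implementation"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"flash_attention_2"</span>
            <span class="k">except</span> <span class="nb">ImportError</span><span class="p">:</span>
                <span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="s">"flash-attn is not installed; using the default attention implementation"</span><span class="p">)</span>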
        
        <span class="c1"># Load model
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="o">**</span><span class="n">model_kwargs</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Enable gradient checkpointing to save memory
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_gradient_checkpointing</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">gradient_checkpointing_enable</span><span class="p">()</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Gradient checkpointing enabled"</span><span class="p">)</span>
        
        <span class="c1"># Load tokenizer
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token</span>
        
        <span class="c1"># Count parameters
</span>        <span class="n">total_params</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">())</span>
        <span class="n">trainable_params</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">numel</span><span class="p">()</span> <span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">()</span> <span class="k">if</span> <span class="n">p</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total parameters: </span><span class="si">{</span><span class="n">total_params</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Trainable parameters: </span><span class="si">{</span><span class="n">trainable_params</span><span class="si">:</span><span class="p">,</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span>
    
    <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load preprocessed datasets."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading data from </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">data_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="n">dataset</span> <span class="o">=</span> <span class="n">load_from_disk</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">data_dir</span><span class="p">)</span>
        
        <span class="c1"># Create data collator
</span>        <span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForLanguageModeling</span><span class="p">(</span>
            <span class="n">tokenizer</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">,</span>
            <span class="n">mlm</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Create dataloaders
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
            <span class="n">dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
            <span class="n">batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
            <span class="n">shuffle</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">collate_fn</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
            <span class="n">num_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
            <span class="n">pin_memory</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span> <span class="o">=</span> <span class="n">DataLoader</span><span class="p">(</span>
            <span class="n">dataset</span><span class="p">[</span><span class="s">"validation"</span><span class="p">],</span>
            <span class="n">batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
            <span class="n">shuffle</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
            <span class="n">collate_fn</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
            <span class="n">num_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
            <span class="n">pin_memory</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Train batches: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Eval batches: </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span>
    
    <span class="k">def</span> <span class="nf">setup_optimizer_and_scheduler</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Configure optimizer and learning rate scheduler."""</span>
        <span class="c1"># Calculate total training steps
</span>        <span class="n">num_update_steps_per_epoch</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">ceil</span><span class="p">(</span>
            <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">)</span> <span class="o">/</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">gradient_accumulation_steps</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">total_training_steps</span> <span class="o">=</span> <span class="n">num_update_steps_per_epoch</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_epochs</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">warmup_steps</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">total_training_steps</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">warmup_ratio</span><span class="p">)</span>
        
        <span class="c1"># Setup optimizer with weight decay. Names are matched by substring, so norm
</span>        <span class="c1"># weights named differently (e.g. "input_layernorm.weight") still receive decay.
</span>        <span class="n">no_decay</span> <span class="o">=</span> <span class="p">[</span><span class="s">"bias"</span><span class="p">,</span> <span class="s">"LayerNorm.weight"</span><span class="p">,</span> <span class="s">"layer_norm.weight"</span><span class="p">]</span>
        <span class="n">optimizer_grouped_parameters</span> <span class="o">=</span> <span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"params"</span><span class="p">:</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">named_parameters</span><span class="p">()</span> 
                          <span class="k">if</span> <span class="ow">not</span> <span class="nb">any</span><span class="p">(</span><span class="n">nd</span> <span class="ow">in</span> <span class="n">n</span> <span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">no_decay</span><span class="p">)</span> <span class="ow">and</span> <span class="n">p</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">],</span>
                <span class="s">"weight_decay"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">weight_decay</span><span class="p">,</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"params"</span><span class="p">:</span> <span class="p">[</span><span class="n">p</span> <span class="k">for</span> <span class="n">n</span><span class="p">,</span> <span class="n">p</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">named_parameters</span><span class="p">()</span> 
                          <span class="k">if</span> <span class="nb">any</span><span class="p">(</span><span class="n">nd</span> <span class="ow">in</span> <span class="n">n</span> <span class="k">for</span> <span class="n">nd</span> <span class="ow">in</span> <span class="n">no_decay</span><span class="p">)</span> <span class="ow">and</span> <span class="n">p</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">],</span>
                <span class="s">"weight_decay"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">]</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span> <span class="o">=</span> <span class="n">AdamW</span><span class="p">(</span>
            <span class="n">optimizer_grouped_parameters</span><span class="p">,</span>
            <span class="n">lr</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span><span class="p">,</span>
            <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">),</span>
            <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Setup scheduler
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span> <span class="o">=</span> <span class="n">get_linear_schedule_with_warmup</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">,</span>
            <span class="n">num_warmup_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">warmup_steps</span><span class="p">,</span>
            <span class="n">num_training_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">total_training_steps</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Total training steps: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">total_training_steps</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Warmup steps: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">warmup_steps</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span>
    
    <span class="k">def</span> <span class="nf">setup_wandb</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Initialize Weights &amp; Biases for experiment tracking."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">or</span> <span class="ow">not</span> <span class="n">WANDB_AVAILABLE</span><span class="p">:</span>
            <span class="k">return</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">:</span>
            <span class="n">wandb</span><span class="p">.</span><span class="n">init</span><span class="p">(</span>
                <span class="n">project</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">project_name</span><span class="p">,</span>
                <span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">run_name</span><span class="p">,</span>
                <span class="n">config</span><span class="o">=</span><span class="nb">vars</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">),</span>
            <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
        <span class="s">"""Run evaluation on validation set."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
        <span class="n">total_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">total_tokens</span> <span class="o">=</span> <span class="mi">0</span>
        
        <span class="k">with</span> <span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">():</span>
            <span class="k">for</span> <span class="n">batch</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Evaluating"</span><span class="p">,</span> <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">):</span>
                <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
                <span class="n">loss</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">loss</span>
                
                <span class="c1"># Gather losses across processes and average them, so that dividing
</span>                <span class="c1"># by the number of batches below yields the mean loss per batch
</span>                <span class="n">gathered_loss</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">gather</span><span class="p">(</span><span class="n">loss</span><span class="p">.</span><span class="n">repeat</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">))</span>
                <span class="n">total_loss</span> <span class="o">+=</span> <span class="n">gathered_loss</span><span class="p">.</span><span class="n">mean</span><span class="p">().</span><span class="n">item</span><span class="p">()</span>
                <span class="n">total_tokens</span> <span class="o">+=</span> <span class="n">batch</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">].</span><span class="n">numel</span><span class="p">()</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">num_processes</span>
        
        <span class="n">avg_loss</span> <span class="o">=</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span><span class="p">)</span>
        <span class="n">perplexity</span> <span class="o">=</span> <span class="n">math</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">avg_loss</span><span class="p">)</span> <span class="k">if</span> <span class="n">avg_loss</span> <span class="o">&lt;</span> <span class="mi">100</span> <span class="k">else</span> <span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        <span class="k">return</span> <span class="p">{</span><span class="s">"eval_loss"</span><span class="p">:</span> <span class="n">avg_loss</span><span class="p">,</span> <span class="s">"eval_perplexity"</span><span class="p">:</span> <span class="n">perplexity</span><span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">save_checkpoint</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">step</span><span class="p">:</span> <span class="nb">int</span><span class="p">):</span>
        <span class="s">"""Save model checkpoint."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">:</span>
            <span class="k">return</span>
        
        <span class="n">output_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">)</span> <span class="o">/</span> <span class="sa">f</span><span class="s">"checkpoint-</span><span class="si">{</span><span class="n">step</span><span class="si">}</span><span class="s">"</span>
        <span class="n">output_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        
        <span class="c1"># Unwrap model and save
</span>        <span class="n">unwrapped_model</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">unwrap_model</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">)</span>
        <span class="n">unwrapped_model</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_dir</span><span class="p">)</span>
        
        <span class="c1"># Save training state
</span>        <span class="n">torch</span><span class="p">.</span><span class="n">save</span><span class="p">({</span>
            <span class="s">"step"</span><span class="p">:</span> <span class="n">step</span><span class="p">,</span>
            <span class="s">"optimizer_state"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span>
            <span class="s">"scheduler_state"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span>
        <span class="p">},</span> <span class="n">output_dir</span> <span class="o">/</span> <span class="s">"training_state.pt"</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Checkpoint saved to </span><span class="si">{</span><span class="n">output_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Main training loop."""</span>
        <span class="c1"># Setup
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">load_model_and_tokenizer</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">load_data</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">setup_optimizer_and_scheduler</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">setup_wandb</span><span class="p">()</span>
        
        <span class="c1"># Prepare with accelerator
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span> <span class="o">=</span> \
            <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">prepare</span><span class="p">(</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">eval_dataloader</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span>
            <span class="p">)</span>
        
        <span class="c1"># Training loop
</span>        <span class="n">global_step</span> <span class="o">=</span> <span class="mi">0</span>
        <span class="n">best_eval_loss</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="s">"inf"</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Starting training..."</span><span class="p">)</span>
        
        <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">):</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
            <span class="n">epoch_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
            
            <span class="n">progress_bar</span> <span class="o">=</span> <span class="n">tqdm</span><span class="p">(</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">,</span>
                <span class="n">desc</span><span class="o">=</span><span class="sa">f</span><span class="s">"Epoch </span><span class="si">{</span><span class="n">epoch</span> <span class="o">+</span> <span class="mi">1</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_epochs</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                <span class="n">disable</span><span class="o">=</span><span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">,</span>
            <span class="p">)</span>
            
            <span class="k">for</span> <span class="n">step</span><span class="p">,</span> <span class="n">batch</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">progress_bar</span><span class="p">):</span>
                <span class="k">with</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">accumulate</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">):</span>
                    <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">batch</span><span class="p">)</span>
                    <span class="n">loss</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">loss</span>
                    
                    <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">backward</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
                    
                    <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">sync_gradients</span><span class="p">:</span>
                        <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">clip_grad_norm_</span><span class="p">(</span>
                            <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">parameters</span><span class="p">(),</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_grad_norm</span>
                        <span class="p">)</span>
                    
                    <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
                    <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span><span class="p">.</span><span class="n">step</span><span class="p">()</span>
                    <span class="bp">self</span><span class="p">.</span><span class="n">optimizer</span><span class="p">.</span><span class="n">zero_grad</span><span class="p">()</span>
                
                <span class="n">epoch_loss</span> <span class="o">+=</span> <span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
                
                <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">sync_gradients</span><span class="p">:</span>
                    <span class="n">global_step</span> <span class="o">+=</span> <span class="mi">1</span>
                    
                    <span class="c1"># Logging
</span>                    <span class="k">if</span> <span class="n">global_step</span> <span class="o">%</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">logging_steps</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                        <span class="n">avg_loss</span> <span class="o">=</span> <span class="n">epoch_loss</span> <span class="o">/</span> <span class="p">(</span><span class="n">step</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
                        <span class="n">lr</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">scheduler</span><span class="p">.</span><span class="n">get_last_lr</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span>
                        
                        <span class="n">progress_bar</span><span class="p">.</span><span class="n">set_postfix</span><span class="p">({</span>
                            <span class="s">"loss"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">avg_loss</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                            <span class="s">"lr"</span><span class="p">:</span> <span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="n">lr</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">,</span>
                        <span class="p">})</span>
                        
                        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">:</span>
                            <span class="n">wandb</span><span class="p">.</span><span class="n">log</span><span class="p">({</span>
                                <span class="s">"train/loss"</span><span class="p">:</span> <span class="n">avg_loss</span><span class="p">,</span>
                                <span class="s">"train/learning_rate"</span><span class="p">:</span> <span class="n">lr</span><span class="p">,</span>
                                <span class="s">"train/epoch"</span><span class="p">:</span> <span class="n">epoch</span> <span class="o">+</span> <span class="n">step</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">train_dataloader</span><span class="p">),</span>
                            <span class="p">},</span> <span class="n">step</span><span class="o">=</span><span class="n">global_step</span><span class="p">)</span>
                    
                    <span class="c1"># Evaluation
</span>                    <span class="k">if</span> <span class="n">global_step</span> <span class="o">%</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">eval_steps</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                        <span class="n">eval_metrics</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">evaluate</span><span class="p">()</span>
                        
                        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">:</span>
                            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Step </span><span class="si">{</span><span class="n">global_step</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">eval_metrics</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                            
                            <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span><span class="p">:</span>
                                <span class="n">wandb</span><span class="p">.</span><span class="n">log</span><span class="p">({</span><span class="sa">f</span><span class="s">"eval/</span><span class="si">{</span><span class="n">k</span><span class="si">}</span><span class="s">"</span><span class="p">:</span> <span class="n">v</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">eval_metrics</span><span class="p">.</span><span class="n">items</span><span class="p">()},</span> <span class="n">step</span><span class="o">=</span><span class="n">global_step</span><span class="p">)</span>
                            
                            <span class="k">if</span> <span class="n">eval_metrics</span><span class="p">[</span><span class="s">"eval_loss"</span><span class="p">]</span> <span class="o">&lt;</span> <span class="n">best_eval_loss</span><span class="p">:</span>
                                <span class="n">best_eval_loss</span> <span class="o">=</span> <span class="n">eval_metrics</span><span class="p">[</span><span class="s">"eval_loss"</span><span class="p">]</span>
                                <span class="bp">self</span><span class="p">.</span><span class="n">save_checkpoint</span><span class="p">(</span><span class="n">global_step</span><span class="p">)</span>
                    
                    <span class="c1"># Regular checkpointing
</span>                    <span class="k">if</span> <span class="n">global_step</span> <span class="o">%</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">save_steps</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                        <span class="bp">self</span><span class="p">.</span><span class="n">save_checkpoint</span><span class="p">(</span><span class="n">global_step</span><span class="p">)</span>
        
        <span class="c1"># Final save
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">save_checkpoint</span><span class="p">(</span><span class="n">global_step</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">accelerator</span><span class="p">.</span><span class="n">is_main_process</span><span class="p">:</span>
            <span class="n">wandb</span><span class="p">.</span><span class="n">finish</span><span class="p">()</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Training complete!"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">global_step</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Run full fine-tuning."""</span>
    
    <span class="n">config</span> <span class="o">=</span> <span class="n">FullFineTuningConfig</span><span class="p">(</span>
        <span class="n">model_name</span><span class="o">=</span><span class="s">"meta-llama/Llama-4-8B"</span><span class="p">,</span>
        <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-5</span><span class="p">,</span>
        <span class="n">num_epochs</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
        <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
        <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
        <span class="n">data_dir</span><span class="o">=</span><span class="s">"./processed_data"</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="s">"./full_finetuned_model"</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="n">trainer</span> <span class="o">=</span> <span class="n">FullFineTuner</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
    <span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
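<p>Because the loop above increments <code>global_step</code> only when gradients are synchronized, <code>logging_steps</code>, <code>eval_steps</code>, and <code>save_steps</code> count optimizer updates rather than micro-batches. The following is a minimal sketch of that arithmetic for the values passed in <code>main()</code>. The helper name <code>optimizer_steps</code>, the dataset size of 10,000 examples, and the single-process default are assumptions for illustration, and the rounding assumes Accelerate's default behavior of forcing a sync on the last batch of each epoch.</p>

```python
import math


def optimizer_steps(num_examples, batch_size, grad_accum_steps, num_epochs, num_processes=1):
    """Return (effective_batch_size, total_optimizer_steps) for a run.

    Hypothetical helper: it mirrors the bookkeeping of the loop above,
    not any function defined in the article's code.
    """
    # Examples consumed per optimizer update across all processes.
    effective_batch = batch_size * grad_accum_steps * num_processes
    # Micro-batches each process sees per epoch (final partial batch kept).
    batches_per_epoch = math.ceil(num_examples / (batch_size * num_processes))
    # Assumes a forced sync on the last batch, so a partial group still steps.
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return effective_batch, steps_per_epoch * num_epochs


# Values from main(): batch_size=4, gradient_accumulation_steps=8, num_epochs=3.
# The 10,000-example dataset size is an assumption.
effective, total = optimizer_steps(10_000, 4, 8, 3)
print(effective, total)  # 32 939
```

<p>Under these assumptions a three-epoch run performs 939 optimizer updates, so step-based settings such as <code>eval_steps</code> and <code>save_steps</code> should be sized against that figure rather than against the 7,500 micro-batches the dataloader yields.</p>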

<hr />

<h2 id="implementation-lora-fine-tuning">Implementation: LoRA Fine-Tuning</h2>

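<p>LoRA freezes the base model weights and trains small low-rank adapter matrices injected into selected linear layers. Because gradients and optimizer state are kept only for the adapters, memory requirements drop sharply, while task quality typically stays close to that of full fine-tuning.</p>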

<p><img src="diagrams/14_lora_training_setup.png" alt="LoRA Fine-Tuning Pipeline: Model setup through post-training" /></p>
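<p>To make the memory claim concrete, the sketch below counts the trainable parameters LoRA adds to a single linear layer and compares the standard and rank-stabilized scaling factors. The 4096 x 4096 layer shape and the helper names are assumptions chosen for illustration; the rank of 64 and alpha of 128 match the defaults in the configuration that follows.</p>

```python
import math


def lora_param_count(d_in, d_out, r):
    """Parameters in one LoRA pair: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


def lora_scaling(alpha, r, rank_stabilized=False):
    """Multiplier applied to the low-rank update B @ A."""
    # Standard LoRA scales by alpha / r; rsLoRA scales by alpha / sqrt(r).
    return alpha / math.sqrt(r) if rank_stabilized else alpha / r


d_in = d_out = 4096  # hypothetical projection size, not taken from a real model
r, alpha = 64, 128

full = d_in * d_out                         # weights updated by full fine-tuning
adapter = lora_param_count(d_in, d_out, r)  # weights updated by LoRA
fraction = adapter / full

print(f"full={full:,} adapter={adapter:,} fraction={fraction:.4f}")
print(lora_scaling(alpha, r), lora_scaling(alpha, r, rank_stabilized=True))
```

<p>For this hypothetical layer the adapter holds about 3% of the parameters that full fine-tuning would update. With rank-stabilized scaling the same rank and alpha give a multiplier of 16 rather than 2, so alpha values tuned for standard LoRA may need revisiting when <code>use_rslora</code> is enabled. Modules listed in <code>modules_to_save</code> are trained in full and are not covered by this count.</p>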

<h3 id="complete-lora-fine-tuning-code">Complete LoRA Fine-Tuning Code</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
LoRA Fine-Tuning Pipeline for Large Language Models.
Supports: Llama 4, Qwen 3, DeepSeek-V3.2, Gemma 3

Requirements:
    pip install torch transformers datasets peft accelerate wandb tqdm bitsandbytes
    pip install flash-attn  # optional; needed only when use_flash_attention=True
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">AutoModelForCausalLM</span><span class="p">,</span>
    <span class="n">AutoTokenizer</span><span class="p">,</span>
    <span class="n">TrainingArguments</span><span class="p">,</span>
    <span class="n">Trainer</span><span class="p">,</span>
    <span class="n">DataCollatorForLanguageModeling</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">peft</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">LoraConfig</span><span class="p">,</span>
    <span class="n">get_peft_model</span><span class="p">,</span>
    <span class="n">TaskType</span><span class="p">,</span>
    <span class="n">PeftModel</span><span class="p">,</span>
    <span class="n">prepare_model_for_kbit_training</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_from_disk</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="k">try</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">wandb</span>
    <span class="n">WANDB_AVAILABLE</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">except</span> <span class="nb">ImportError</span><span class="p">:</span>
    <span class="n">WANDB_AVAILABLE</span> <span class="o">=</span> <span class="bp">False</span>

<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%(asctime)s - %(levelname)s - %(message)s'</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">LoRAConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for LoRA fine-tuning."""</span>
    
    <span class="c1"># Model settings
</span>    <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"meta-llama/Llama-4-8B"</span>
    <span class="n">torch_dtype</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"bfloat16"</span>
    <span class="n">use_flash_attention</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
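    <span class="n">trust_remote_code</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>  <span class="c1"># Runs Python code shipped in the model repo; enable only for trusted sources
</span>    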
    <span class="c1"># LoRA hyperparameters
</span>    <span class="n">lora_r</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">64</span>
    <span class="n">lora_alpha</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">128</span>
    <span class="n">lora_dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.05</span>
    <span class="n">target_modules</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">[</span>
        <span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"o_proj"</span><span class="p">,</span>
        <span class="s">"gate_proj"</span><span class="p">,</span> <span class="s">"up_proj"</span><span class="p">,</span> <span class="s">"down_proj"</span><span class="p">,</span>
    <span class="p">])</span>
    <span class="c1"># Modules listed here are trained in full and saved with the adapter, which adds memory for large vocabularies
</span>    <span class="n">modules_to_save</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">[</span><span class="s">"embed_tokens"</span><span class="p">,</span> <span class="s">"lm_head"</span><span class="p">])</span>
    <span class="n">use_rslora</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>  <span class="c1"># Rank-stabilized LoRA
</span>    
    <span class="c1"># Training hyperparameters
</span>    <span class="n">learning_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">2e-4</span>
    <span class="n">weight_decay</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">num_epochs</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span>
    <span class="n">gradient_accumulation_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4</span>
    <span class="n">max_grad_norm</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.0</span>
    <span class="n">warmup_ratio</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.03</span>
    <span class="n">lr_scheduler_type</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"cosine"</span>
    
    <span class="c1"># LoRA+ settings (different LR for A and B matrices)
</span>    <span class="n">use_lora_plus</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">lora_plus_lambda</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">16.0</span>  <span class="c1"># B learning rate multiplier
</span>    
    <span class="c1"># Data settings
</span>    <span class="n">data_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./processed_data"</span>
    <span class="n">max_seq_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span>
    
    <span class="c1"># Output settings
</span>    <span class="n">output_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./lora_finetuned_model"</span>
    <span class="n">save_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">200</span>
    <span class="n">eval_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100</span>
    <span class="n">logging_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span>
    <span class="n">save_total_limit</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span>
    
    <span class="c1"># Experiment tracking
</span>    <span class="n">project_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"llm-lora-finetuning"</span>
    <span class="n">run_name</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">use_wandb</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    
    <span class="n">seed</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">42</span>


<span class="k">class</span> <span class="nc">LoRAFineTuner</span><span class="p">:</span>
    <span class="s">"""Production-ready LoRA fine-tuning trainer."""</span>
    
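    <span class="c1"># Target modules for different model architectures.
</span>    <span class="c1"># These names are assumed rather than verified per checkpoint: some architectures
</span>    <span class="c1"># (for example DeepSeek models with multi-head latent attention) name their
</span>    <span class="c1"># attention projections differently, so check model.named_modules() before training.
</span>    <span class="n">TARGET_MODULES_MAP</span> <span class="o">=</span> <span class="p">{</span>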
        <span class="s">"llama"</span><span class="p">:</span> <span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"o_proj"</span><span class="p">,</span> <span class="s">"gate_proj"</span><span class="p">,</span> <span class="s">"up_proj"</span><span class="p">,</span> <span class="s">"down_proj"</span><span class="p">],</span>
        <span class="s">"qwen"</span><span class="p">:</span> <span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"o_proj"</span><span class="p">,</span> <span class="s">"gate_proj"</span><span class="p">,</span> <span class="s">"up_proj"</span><span class="p">,</span> <span class="s">"down_proj"</span><span class="p">],</span>
        <span class="s">"deepseek"</span><span class="p">:</span> <span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"o_proj"</span><span class="p">,</span> <span class="s">"gate_proj"</span><span class="p">,</span> <span class="s">"up_proj"</span><span class="p">,</span> <span class="s">"down_proj"</span><span class="p">],</span>
        <span class="s">"gemma"</span><span class="p">:</span> <span class="p">[</span><span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"o_proj"</span><span class="p">,</span> <span class="s">"gate_proj"</span><span class="p">,</span> <span class="s">"up_proj"</span><span class="p">,</span> <span class="s">"down_proj"</span><span class="p">],</span>
    <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">LoRAConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model_family</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_detect_model_family</span><span class="p">()</span>
        
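        <span class="c1"># Fall back to family-specific target modules only when none were configured;
</span>        <span class="c1"># the dataclass default is non-empty, so this applies only if target_modules=[] is passed.
</span>        <span class="k">if</span> <span class="ow">not</span> <span class="n">config</span><span class="p">.</span><span class="n">target_modules</span><span class="p">:</span>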
            <span class="n">config</span><span class="p">.</span><span class="n">target_modules</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">TARGET_MODULES_MAP</span><span class="p">.</span><span class="n">get</span><span class="p">(</span>
                <span class="bp">self</span><span class="p">.</span><span class="n">model_family</span><span class="p">,</span> 
                <span class="bp">self</span><span class="p">.</span><span class="n">TARGET_MODULES_MAP</span><span class="p">[</span><span class="s">"llama"</span><span class="p">]</span>
            <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_detect_model_family</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Detect model family from model name."""</span>
        <span class="n">model_lower</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span>
        <span class="k">for</span> <span class="n">family</span> <span class="ow">in</span> <span class="p">[</span><span class="s">"llama"</span><span class="p">,</span> <span class="s">"qwen"</span><span class="p">,</span> <span class="s">"deepseek"</span><span class="p">,</span> <span class="s">"gemma"</span><span class="p">]:</span>
            <span class="k">if</span> <span class="n">family</span> <span class="ow">in</span> <span class="n">model_lower</span><span class="p">:</span>
                <span class="k">return</span> <span class="n">family</span>
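        <span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="sa">f</span><span class="s">"Unrecognized model family for </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="si">}</span><span class="s">; treating it as Llama-style"</span><span class="p">)</span>
        <span class="k">return</span> <span class="s">"llama"</span>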
    
    <span class="k">def</span> <span class="nf">load_model_and_tokenizer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load base model and apply LoRA."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading model: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="c1"># Torch dtype
</span>        <span class="n">dtype_map</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"float32"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span>
            <span class="s">"float16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
            <span class="s">"bfloat16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">,</span>
        <span class="p">}</span>
        <span class="n">torch_dtype</span> <span class="o">=</span> <span class="n">dtype_map</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">torch_dtype</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">)</span>
        
        <span class="c1"># Model loading kwargs
</span>        <span class="n">model_kwargs</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"torch_dtype"</span><span class="p">:</span> <span class="n">torch_dtype</span><span class="p">,</span>
            <span class="s">"trust_remote_code"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
            <span class="s">"device_map"</span><span class="p">:</span> <span class="s">"auto"</span><span class="p">,</span>
        <span class="p">}</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_flash_attention</span><span class="p">:</span>
            <span class="n">model_kwargs</span><span class="p">[</span><span class="s">"attn_implementation"</span><span class="p">]</span> <span class="o">=</span> <span class="s">"flash_attention_2"</span>
        
        <span class="c1"># Load base model
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="o">**</span><span class="n">model_kwargs</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Enable gradient checkpointing
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">gradient_checkpointing_enable</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">enable_input_require_grads</span><span class="p">()</span>
        
        <span class="c1"># Configure LoRA
</span>        <span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
            <span class="n">task_type</span><span class="o">=</span><span class="n">TaskType</span><span class="p">.</span><span class="n">CAUSAL_LM</span><span class="p">,</span>
            <span class="n">r</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_r</span><span class="p">,</span>
            <span class="n">lora_alpha</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_alpha</span><span class="p">,</span>
            <span class="n">lora_dropout</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_dropout</span><span class="p">,</span>
            <span class="n">target_modules</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">target_modules</span><span class="p">,</span>
            <span class="n">modules_to_save</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">modules_to_save</span><span class="p">,</span>
            <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span>
            <span class="n">use_rslora</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_rslora</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Apply LoRA
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>
        
        <span class="c1"># Print trainable parameters
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">print_trainable_parameters</span><span class="p">()</span>
        
        <span class="c1"># Load tokenizer
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
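        <span class="c1"># Fall back to EOS when the tokenizer defines no pad token. The causal-LM collator masks
</span>        <span class="c1"># pad positions in the labels, so real EOS tokens are then excluded from the loss as well.
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>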
            <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span>
    
    <span class="k">def</span> <span class="nf">get_optimizer_grouped_parameters</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Build optimizer parameter groups for LoRA+, giving B matrices a higher learning rate than A matrices."""</span>
        <span class="k">if</span> <span class="ow">not</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_lora_plus</span><span class="p">:</span>
            <span class="k">return</span> <span class="bp">None</span>  <span class="c1"># Use default optimizer
</span>        
        <span class="c1"># LoRA+ assigns higher learning rate to B matrices
</span>        <span class="n">lora_a_params</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">lora_b_params</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="n">other_params</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">param</span> <span class="ow">in</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">named_parameters</span><span class="p">():</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">param</span><span class="p">.</span><span class="n">requires_grad</span><span class="p">:</span>
                <span class="k">continue</span>
            
            <span class="k">if</span> <span class="s">"lora_A"</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
                <span class="n">lora_a_params</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">param</span><span class="p">)</span>
            <span class="k">elif</span> <span class="s">"lora_B"</span> <span class="ow">in</span> <span class="n">name</span><span class="p">:</span>
                <span class="n">lora_b_params</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">param</span><span class="p">)</span>
            <span class="k">else</span><span class="p">:</span>
                <span class="n">other_params</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">param</span><span class="p">)</span>
        
        <span class="n">optimizer_grouped_parameters</span> <span class="o">=</span> <span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"params"</span><span class="p">:</span> <span class="n">lora_a_params</span><span class="p">,</span>
                <span class="s">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span><span class="p">,</span>
                <span class="s">"weight_decay"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">weight_decay</span><span class="p">,</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"params"</span><span class="p">:</span> <span class="n">lora_b_params</span><span class="p">,</span>
                <span class="s">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_plus_lambda</span><span class="p">,</span>
                <span class="s">"weight_decay"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">weight_decay</span><span class="p">,</span>
            <span class="p">},</span>
            <span class="p">{</span>
                <span class="s">"params"</span><span class="p">:</span> <span class="n">other_params</span><span class="p">,</span>
                <span class="s">"lr"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span><span class="p">,</span>
                <span class="s">"weight_decay"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">weight_decay</span><span class="p">,</span>
            <span class="p">},</span>
        <span class="p">]</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"LoRA+ enabled: A matrices LR = </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">e</span><span class="si">}</span><span class="s">, "</span>
                   <span class="sa">f</span><span class="s">"B matrices LR = </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span> <span class="o">*</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_plus_lambda</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">e</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">optimizer_grouped_parameters</span>
    
    <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load preprocessed datasets."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading data from </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">data_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dataset</span> <span class="o">=</span> <span class="n">load_from_disk</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">data_dir</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">create_trainer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Create HuggingFace Trainer with custom optimizer."""</span>
        <span class="c1"># Data collator
</span>        <span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForLanguageModeling</span><span class="p">(</span>
            <span class="n">tokenizer</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">,</span>
            <span class="n">mlm</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Training arguments
</span>        <span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
            <span class="n">output_dir</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">,</span>
            <span class="n">num_train_epochs</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">,</span>
            <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
            <span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
            <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">gradient_accumulation_steps</span><span class="p">,</span>
            <span class="n">learning_rate</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span><span class="p">,</span>
            <span class="n">weight_decay</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">weight_decay</span><span class="p">,</span>
            <span class="n">warmup_ratio</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">warmup_ratio</span><span class="p">,</span>
            <span class="n">lr_scheduler_type</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lr_scheduler_type</span><span class="p">,</span>
            <span class="n">max_grad_norm</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_grad_norm</span><span class="p">,</span>
            <span class="n">logging_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">logging_steps</span><span class="p">,</span>
            <span class="n">save_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">save_steps</span><span class="p">,</span>
            <span class="n">eval_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">eval_steps</span><span class="p">,</span>
            <span class="n">eval_strategy</span><span class="o">=</span><span class="s">"steps"</span><span class="p">,</span>  <span class="c1"># renamed from evaluation_strategy in recent transformers releases</span>
            <span class="n">save_total_limit</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">save_total_limit</span><span class="p">,</span>
            <span class="n">load_best_model_at_end</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">metric_for_best_model</span><span class="o">=</span><span class="s">"eval_loss"</span><span class="p">,</span>
            <span class="n">greater_is_better</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
            <span class="n">bf16</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">torch_dtype</span> <span class="o">==</span> <span class="s">"bfloat16"</span><span class="p">,</span>
            <span class="n">fp16</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">torch_dtype</span> <span class="o">==</span> <span class="s">"float16"</span><span class="p">,</span>
            <span class="n">dataloader_num_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
            <span class="n">dataloader_pin_memory</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">report_to</span><span class="o">=</span><span class="s">"wandb"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span> <span class="k">else</span> <span class="s">"none"</span><span class="p">,</span>
            <span class="n">run_name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">run_name</span><span class="p">,</span>
            <span class="n">seed</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">seed</span><span class="p">,</span>
            <span class="n">remove_unused_columns</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Custom optimizer for LoRA+
</span>        <span class="n">optimizers</span> <span class="o">=</span> <span class="p">(</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>  <span class="c1"># Default
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_lora_plus</span><span class="p">:</span>
            <span class="kn">from</span> <span class="nn">torch.optim</span> <span class="kn">import</span> <span class="n">AdamW</span>
            <span class="n">optimizer_grouped_parameters</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">get_optimizer_grouped_parameters</span><span class="p">()</span>
            <span class="n">optimizer</span> <span class="o">=</span> <span class="n">AdamW</span><span class="p">(</span><span class="n">optimizer_grouped_parameters</span><span class="p">,</span> <span class="n">betas</span><span class="o">=</span><span class="p">(</span><span class="mf">0.9</span><span class="p">,</span> <span class="mf">0.95</span><span class="p">),</span> <span class="n">eps</span><span class="o">=</span><span class="mf">1e-8</span><span class="p">)</span>
            <span class="n">optimizers</span> <span class="o">=</span> <span class="p">(</span><span class="n">optimizer</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span>
        
        <span class="c1"># Create trainer
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
            <span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
            <span class="n">train_dataset</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
            <span class="n">eval_dataset</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="s">"validation"</span><span class="p">],</span>
            <span class="n">data_collator</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
            <span class="n">optimizers</span><span class="o">=</span><span class="n">optimizers</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span>
    
    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Run the complete training pipeline."""</span>
        <span class="c1"># Setup
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">load_model_and_tokenizer</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">load_data</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">create_trainer</span><span class="p">()</span>
        
        <span class="c1"># Initialize wandb
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span><span class="p">:</span>
            <span class="n">wandb</span><span class="p">.</span><span class="n">init</span><span class="p">(</span>
                <span class="n">project</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">project_name</span><span class="p">,</span>
                <span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">run_name</span><span class="p">,</span>
                <span class="n">config</span><span class="o">=</span><span class="nb">vars</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">),</span>
            <span class="p">)</span>
        
        <span class="c1"># Train
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Starting LoRA training..."</span><span class="p">)</span>
        <span class="n">train_result</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        
        <span class="c1"># Save final model
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Saving final model..."</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">save_model</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">)</span>
        
        <span class="c1"># Save training metrics
</span>        <span class="n">metrics</span> <span class="o">=</span> <span class="n">train_result</span><span class="p">.</span><span class="n">metrics</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">log_metrics</span><span class="p">(</span><span class="s">"train"</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">save_metrics</span><span class="p">(</span><span class="s">"train"</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
        
        <span class="c1"># Final evaluation
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Running final evaluation..."</span><span class="p">)</span>
        <span class="n">eval_metrics</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">evaluate</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">log_metrics</span><span class="p">(</span><span class="s">"eval"</span><span class="p">,</span> <span class="n">eval_metrics</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">save_metrics</span><span class="p">(</span><span class="s">"eval"</span><span class="p">,</span> <span class="n">eval_metrics</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span><span class="p">:</span>
            <span class="n">wandb</span><span class="p">.</span><span class="n">finish</span><span class="p">()</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Training complete!"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">metrics</span>
    
    <span class="k">def</span> <span class="nf">merge_and_save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">output_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
        <span class="s">"""Merge LoRA weights with base model and save."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Merging LoRA weights with base model..."</span><span class="p">)</span>
        
        <span class="c1"># Merge weights
</span>        <span class="n">merged_model</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">merge_and_unload</span><span class="p">()</span>
        
        <span class="c1"># Save merged model
</span>        <span class="n">merged_model</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_path</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_path</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Merged model saved to </span><span class="si">{</span><span class="n">output_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Run LoRA fine-tuning."""</span>
    
    <span class="n">config</span> <span class="o">=</span> <span class="n">LoRAConfig</span><span class="p">(</span>
        <span class="n">model_name</span><span class="o">=</span><span class="s">"meta-llama/Llama-4-8B"</span><span class="p">,</span>
        <span class="n">lora_r</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span>
        <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span>
        <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-4</span><span class="p">,</span>
        <span class="n">num_epochs</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
        <span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
        <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
        <span class="n">data_dir</span><span class="o">=</span><span class="s">"./processed_data"</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="s">"./lora_finetuned_model"</span><span class="p">,</span>
        <span class="n">use_lora_plus</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="n">trainer</span> <span class="o">=</span> <span class="n">LoRAFineTuner</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
    <span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
    
    <span class="c1"># Merge the LoRA adapter into the base weights and save a standalone model (skip this call to keep only the adapter)
</span>    <span class="n">trainer</span><span class="p">.</span><span class="n">merge_and_save</span><span class="p">(</span><span class="s">"./merged_model"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
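
<p>The LoRA+ grouping in <code>get_optimizer_grouped_parameters</code> can be illustrated without loading a model. The following is a minimal sketch, not the pipeline's actual code: <code>split_lora_plus_groups</code>, the example parameter names, and the ratio of 16 are assumptions chosen for illustration (the real value comes from <code>lora_plus_lambda</code> in the configuration).</p>

```python
def split_lora_plus_groups(param_names, base_lr, lr_ratio):
    """Sort trainable parameter names into LoRA+ groups and assign each group a learning rate."""
    groups = {"lora_A": [], "lora_B": [], "other": []}
    for name in param_names:
        if "lora_A" in name:
            groups["lora_A"].append(name)
        elif "lora_B" in name:
            groups["lora_B"].append(name)
        else:
            # Anything trainable that is not an adapter matrix, e.g. modules_to_save
            groups["other"].append(name)
    learning_rates = {
        "lora_A": base_lr,
        "lora_B": base_lr * lr_ratio,  # B matrices train with the larger rate
        "other": base_lr,
    }
    return groups, learning_rates


# Hypothetical names, written in the style PEFT uses for wrapped layers.
example_names = [
    "base_model.model.model.layers.0.self_attn.q_proj.lora_A.default.weight",
    "base_model.model.model.layers.0.self_attn.q_proj.lora_B.default.weight",
    "base_model.model.lm_head.modules_to_save.default.weight",
]

groups, learning_rates = split_lora_plus_groups(example_names, base_lr=2e-4, lr_ratio=16)
for group_name, names in groups.items():
    print(f"{group_name}: lr={learning_rates[group_name]:.2e}, params={len(names)}")
```

<p>Because the match is done on substrings of the parameter name, any trainable parameter that is neither an A nor a B matrix falls into the third group and trains at the base learning rate.</p>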

<hr />

<h2 id="implementation-qlora-fine-tuning">Implementation: QLoRA Fine-Tuning</h2>

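<p>QLoRA stores the frozen base model in 4-bit precision and trains LoRA adapters on top of it. This cuts the memory needed for the base weights to roughly a quarter of their 16-bit size, which makes it possible to fine-tune considerably larger models on a single consumer GPU.</p>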

<p><img src="diagrams/15_qlora_setup.png" alt="QLoRA Fine-Tuning Pipeline: Quantization, LoRA setup, and memory management" /></p>
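
<p>A back-of-the-envelope estimate shows where the savings from 4-bit quantization come from. The sketch below counts only the memory needed to store the base weights at different precisions. The 8-billion parameter count is an illustrative assumption and the helper <code>weight_memory_gib</code> is hypothetical. It ignores quantization constants, activations, adapter weights, gradients, and optimizer state, so real usage will be higher.</p>

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


num_params = 8e9  # illustrative 8B-parameter model (assumption)

for label, bits in [("bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{label:6s} weights: {weight_memory_gib(num_params, bits):6.1f} GiB")
```

<p>The 4-bit figure is one quarter of the bf16 figure by construction. Whether a given model then fits on a particular GPU still depends on sequence length, batch size, and the overheads left out of this estimate.</p>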

<h3 id="complete-qlora-fine-tuning-code">Complete QLoRA Fine-Tuning Code</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
QLoRA Fine-Tuning Pipeline for Large Language Models.
Enables fine-tuning of massive models on consumer GPUs.

Supports: Llama 4, Qwen 3, DeepSeek-V3.2, Gemma 3

Requirements:
    pip install torch transformers datasets peft accelerate wandb tqdm
    pip install bitsandbytes  # Required for 4-bit quantization
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">math</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">AutoModelForCausalLM</span><span class="p">,</span>
    <span class="n">AutoTokenizer</span><span class="p">,</span>
    <span class="n">BitsAndBytesConfig</span><span class="p">,</span>
    <span class="n">TrainingArguments</span><span class="p">,</span>
    <span class="n">Trainer</span><span class="p">,</span>
    <span class="n">DataCollatorForLanguageModeling</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">peft</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">LoraConfig</span><span class="p">,</span>
    <span class="n">get_peft_model</span><span class="p">,</span>
    <span class="n">TaskType</span><span class="p">,</span>
    <span class="n">prepare_model_for_kbit_training</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_from_disk</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>

<span class="k">try</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">wandb</span>
    <span class="n">WANDB_AVAILABLE</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">except</span> <span class="nb">ImportError</span><span class="p">:</span>
    <span class="n">WANDB_AVAILABLE</span> <span class="o">=</span> <span class="bp">False</span>

<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%(asctime)s - %(levelname)s - %(message)s'</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">QLoRAConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for QLoRA fine-tuning."""</span>
    
    <span class="c1"># Model settings
</span>    <span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"meta-llama/Llama-4-8B"</span>
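    <span class="n">trust_remote_code</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>  <span class="c1"># Runs Python code shipped in the model repo; set to False for architectures built into transformers</span>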
    
    <span class="c1"># Quantization settings
</span>    <span class="n">load_in_4bit</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">bnb_4bit_compute_dtype</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"bfloat16"</span>
    <span class="n">bnb_4bit_quant_type</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"nf4"</span>  <span class="c1"># nf4 or fp4
</span>    <span class="n">bnb_4bit_use_double_quant</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>  <span class="c1"># Double quantization for extra memory savings
</span>    
    <span class="c1"># LoRA hyperparameters
</span>    <span class="n">lora_r</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">64</span>
    <span class="n">lora_alpha</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">128</span>
    <span class="n">lora_dropout</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.05</span>
    <span class="n">target_modules</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">[</span>
        <span class="s">"q_proj"</span><span class="p">,</span> <span class="s">"k_proj"</span><span class="p">,</span> <span class="s">"v_proj"</span><span class="p">,</span> <span class="s">"o_proj"</span><span class="p">,</span>
        <span class="s">"gate_proj"</span><span class="p">,</span> <span class="s">"up_proj"</span><span class="p">,</span> <span class="s">"down_proj"</span><span class="p">,</span>
    <span class="p">])</span>
    <span class="n">modules_to_save</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="n">field</span><span class="p">(</span><span class="n">default_factory</span><span class="o">=</span><span class="k">lambda</span><span class="p">:</span> <span class="p">[</span><span class="s">"embed_tokens"</span><span class="p">,</span> <span class="s">"lm_head"</span><span class="p">])</span>  <span class="c1"># Trained in full as unquantized copies; remove if the vocabulary is unchanged and memory is tight</span>
    
    <span class="c1"># Training hyperparameters
</span>    <span class="n">learning_rate</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">2e-4</span>
    <span class="n">weight_decay</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.01</span>
    <span class="n">num_epochs</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span>
    <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4</span>
    <span class="n">gradient_accumulation_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span>
    <span class="n">max_grad_norm</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.3</span>  <span class="c1"># Lower for QLoRA stability
</span>    <span class="n">warmup_ratio</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.03</span>
    <span class="n">lr_scheduler_type</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"cosine"</span>
    
    <span class="c1"># Optimizer settings (for QLoRA, use paged optimizers)
</span>    <span class="n">optim</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"paged_adamw_8bit"</span>  <span class="c1"># Memory-efficient optimizer
</span>    
    <span class="c1"># Data settings
</span>    <span class="n">data_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./processed_data"</span>
    <span class="n">max_seq_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span>
    
    <span class="c1"># Output settings
</span>    <span class="n">output_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./qlora_finetuned_model"</span>
    <span class="n">save_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">200</span>
    <span class="n">eval_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100</span>
    <span class="n">logging_steps</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">10</span>
    <span class="n">save_total_limit</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span>
    
    <span class="c1"># Experiment tracking
</span>    <span class="n">project_name</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"llm-qlora-finetuning"</span>
    <span class="n">run_name</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">str</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="n">use_wandb</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    
    <span class="n">seed</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">42</span>


<span class="k">class</span> <span class="nc">QLoRAFineTuner</span><span class="p">:</span>
    <span class="s">"""Production-ready QLoRA fine-tuning trainer."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">QLoRAConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_validate_config</span><span class="p">()</span>
    
    <span class="k">def</span> <span class="nf">_validate_config</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Validate configuration settings."""</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">load_in_4bit</span><span class="p">:</span>
            <span class="k">try</span><span class="p">:</span>
                <span class="kn">import</span> <span class="nn">bitsandbytes</span>
            <span class="k">except</span> <span class="nb">ImportError</span><span class="p">:</span>
                <span class="k">raise</span> <span class="nb">ImportError</span><span class="p">(</span>
                    <span class="s">"bitsandbytes is required for 4-bit quantization. "</span>
                    <span class="s">"Install with: pip install bitsandbytes"</span>
                <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_get_quantization_config</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">BitsAndBytesConfig</span><span class="p">:</span>
        <span class="s">"""Create BitsAndBytes quantization configuration."""</span>
        <span class="n">compute_dtype_map</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"float32"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span>
            <span class="s">"float16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
            <span class="s">"bfloat16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">,</span>
        <span class="p">}</span>
        <span class="n">compute_dtype</span> <span class="o">=</span> <span class="n">compute_dtype_map</span><span class="p">.</span><span class="n">get</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">bnb_4bit_compute_dtype</span><span class="p">,</span> 
            <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="n">BitsAndBytesConfig</span><span class="p">(</span>
            <span class="n">load_in_4bit</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">load_in_4bit</span><span class="p">,</span>
            <span class="n">bnb_4bit_compute_dtype</span><span class="o">=</span><span class="n">compute_dtype</span><span class="p">,</span>
            <span class="n">bnb_4bit_quant_type</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">bnb_4bit_quant_type</span><span class="p">,</span>
            <span class="n">bnb_4bit_use_double_quant</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">bnb_4bit_use_double_quant</span><span class="p">,</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">load_model_and_tokenizer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load quantized model and apply LoRA."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading model: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Quantization: 4-bit </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">bnb_4bit_quant_type</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="c1"># Quantization config
</span>        <span class="n">bnb_config</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_get_quantization_config</span><span class="p">()</span>
        
        <span class="c1"># Load model with quantization
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">quantization_config</span><span class="o">=</span><span class="n">bnb_config</span><span class="p">,</span>
            <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Prepare model for k-bit training
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">prepare_model_for_kbit_training</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
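            <span class="n">use_gradient_checkpointing</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">gradient_checkpointing_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"use_reentrant"</span><span class="p">:</span> <span class="bp">False</span><span class="p">},</span>  <span class="c1"># Same setting as TrainingArguments below</span>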
        <span class="p">)</span>
        
        <span class="c1"># Configure LoRA
</span>        <span class="n">lora_config</span> <span class="o">=</span> <span class="n">LoraConfig</span><span class="p">(</span>
            <span class="n">task_type</span><span class="o">=</span><span class="n">TaskType</span><span class="p">.</span><span class="n">CAUSAL_LM</span><span class="p">,</span>
            <span class="n">r</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_r</span><span class="p">,</span>
            <span class="n">lora_alpha</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_alpha</span><span class="p">,</span>
            <span class="n">lora_dropout</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lora_dropout</span><span class="p">,</span>
            <span class="n">target_modules</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">target_modules</span><span class="p">,</span>
            <span class="n">modules_to_save</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">modules_to_save</span><span class="p">,</span>
            <span class="n">bias</span><span class="o">=</span><span class="s">"none"</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Apply LoRA
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">get_peft_model</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="n">lora_config</span><span class="p">)</span>
        
        <span class="c1"># Print memory usage and trainable parameters
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">_print_model_info</span><span class="p">()</span>
        
        <span class="c1"># Load tokenizer
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span>
    
    <span class="k">def</span> <span class="nf">_print_model_info</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Print model information and memory usage."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">print_trainable_parameters</span><span class="p">()</span>
        
        <span class="c1"># Report memory currently held on the default CUDA device
</span>        <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
            <span class="n">allocated</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_allocated</span><span class="p">()</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>
            <span class="n">reserved</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">memory_reserved</span><span class="p">()</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"GPU Memory: </span><span class="si">{</span><span class="n">allocated</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB allocated, </span><span class="si">{</span><span class="n">reserved</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB reserved"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">load_data</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load preprocessed datasets."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading data from </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">data_dir</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">dataset</span> <span class="o">=</span> <span class="n">load_from_disk</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">data_dir</span><span class="p">)</span>
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">dataset</span>
    
    <span class="k">def</span> <span class="nf">create_trainer</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Create HuggingFace Trainer optimized for QLoRA."""</span>
        <span class="c1"># Data collator
</span>        <span class="n">data_collator</span> <span class="o">=</span> <span class="n">DataCollatorForLanguageModeling</span><span class="p">(</span>
            <span class="n">tokenizer</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">,</span>
            <span class="n">mlm</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Training arguments optimized for QLoRA
</span>        <span class="n">training_args</span> <span class="o">=</span> <span class="n">TrainingArguments</span><span class="p">(</span>
            <span class="n">output_dir</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">,</span>
            <span class="n">num_train_epochs</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_epochs</span><span class="p">,</span>
            <span class="n">per_device_train_batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
            <span class="n">per_device_eval_batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
            <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">gradient_accumulation_steps</span><span class="p">,</span>
            <span class="n">learning_rate</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">learning_rate</span><span class="p">,</span>
            <span class="n">weight_decay</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">weight_decay</span><span class="p">,</span>
            <span class="n">warmup_ratio</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">warmup_ratio</span><span class="p">,</span>
            <span class="n">lr_scheduler_type</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">lr_scheduler_type</span><span class="p">,</span>
            <span class="n">max_grad_norm</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_grad_norm</span><span class="p">,</span>
            <span class="n">optim</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">optim</span><span class="p">,</span>  <span class="c1"># Paged optimizer for memory efficiency
</span>            <span class="n">logging_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">logging_steps</span><span class="p">,</span>
            <span class="n">save_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">save_steps</span><span class="p">,</span>
            <span class="n">eval_steps</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">eval_steps</span><span class="p">,</span>
            <span class="n">eval_strategy</span><span class="o">=</span><span class="s">"steps"</span><span class="p">,</span>  <span class="c1"># Named evaluation_strategy before transformers 4.41</span>
            <span class="n">save_total_limit</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">save_total_limit</span><span class="p">,</span>
            <span class="n">load_best_model_at_end</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">metric_for_best_model</span><span class="o">=</span><span class="s">"eval_loss"</span><span class="p">,</span>
            <span class="n">greater_is_better</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
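            <span class="n">bf16</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">bnb_4bit_compute_dtype</span> <span class="o">==</span> <span class="s">"bfloat16"</span><span class="p">,</span>  <span class="c1"># Match mixed precision to the 4-bit compute dtype
</span>            <span class="n">fp16</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">bnb_4bit_compute_dtype</span> <span class="o">==</span> <span class="s">"float16"</span><span class="p">,</span>
            <span class="n">tf32</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="ow">and</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">get_device_capability</span><span class="p">()[</span><span class="mi">0</span><span class="p">]</span> <span class="o">&gt;=</span> <span class="mi">8</span><span class="p">,</span>  <span class="c1"># Enable TF32 only on Ampere+ GPUs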
</span>            <span class="n">dataloader_num_workers</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
            <span class="n">dataloader_pin_memory</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">gradient_checkpointing</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="n">gradient_checkpointing_kwargs</span><span class="o">=</span><span class="p">{</span><span class="s">"use_reentrant"</span><span class="p">:</span> <span class="bp">False</span><span class="p">},</span>
            <span class="n">report_to</span><span class="o">=</span><span class="s">"wandb"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span> <span class="k">else</span> <span class="s">"none"</span><span class="p">,</span>
            <span class="n">run_name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">run_name</span><span class="p">,</span>
            <span class="n">seed</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">seed</span><span class="p">,</span>
            <span class="n">remove_unused_columns</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span> <span class="o">=</span> <span class="n">Trainer</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">,</span>
            <span class="n">args</span><span class="o">=</span><span class="n">training_args</span><span class="p">,</span>
            <span class="n">train_dataset</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="s">"train"</span><span class="p">],</span>
            <span class="n">eval_dataset</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">dataset</span><span class="p">[</span><span class="s">"validation"</span><span class="p">],</span>
            <span class="n">data_collator</span><span class="o">=</span><span class="n">data_collator</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span>
    
    <span class="k">def</span> <span class="nf">train</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Run the complete QLoRA training pipeline."""</span>
        <span class="c1"># Setup
</span>        <span class="bp">self</span><span class="p">.</span><span class="n">load_model_and_tokenizer</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">load_data</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">create_trainer</span><span class="p">()</span>
        
        <span class="c1"># Initialize wandb
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span><span class="p">:</span>
            <span class="n">wandb</span><span class="p">.</span><span class="n">init</span><span class="p">(</span>
                <span class="n">project</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">project_name</span><span class="p">,</span>
                <span class="n">name</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">run_name</span><span class="p">,</span>
                <span class="n">config</span><span class="o">=</span><span class="nb">vars</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">),</span>
            <span class="p">)</span>
        
        <span class="c1"># Train
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Starting QLoRA training..."</span><span class="p">)</span>
        <span class="n">train_result</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>
        
        <span class="c1"># Save final model
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Saving final model..."</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">save_model</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">)</span>
        
        <span class="c1"># Save training metrics
</span>        <span class="n">metrics</span> <span class="o">=</span> <span class="n">train_result</span><span class="p">.</span><span class="n">metrics</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">log_metrics</span><span class="p">(</span><span class="s">"train"</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">save_metrics</span><span class="p">(</span><span class="s">"train"</span><span class="p">,</span> <span class="n">metrics</span><span class="p">)</span>
        
        <span class="c1"># Final evaluation
</span>        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Running final evaluation..."</span><span class="p">)</span>
        <span class="n">eval_metrics</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">evaluate</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">log_metrics</span><span class="p">(</span><span class="s">"eval"</span><span class="p">,</span> <span class="n">eval_metrics</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">trainer</span><span class="p">.</span><span class="n">save_metrics</span><span class="p">(</span><span class="s">"eval"</span><span class="p">,</span> <span class="n">eval_metrics</span><span class="p">)</span>
        
        <span class="c1"># Log final memory usage
</span>        <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">():</span>
            <span class="n">max_memory</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">max_memory_allocated</span><span class="p">()</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Peak GPU memory usage: </span><span class="si">{</span><span class="n">max_memory</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">use_wandb</span> <span class="ow">and</span> <span class="n">WANDB_AVAILABLE</span><span class="p">:</span>
            <span class="n">wandb</span><span class="p">.</span><span class="n">finish</span><span class="p">()</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Training complete!"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">metrics</span>
    
    <span class="k">def</span> <span class="nf">merge_and_save</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">output_path</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">safe_serialization</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span><span class="p">):</span>
        <span class="s">"""
        Merge LoRA weights with dequantized base model.
        Note: This requires enough memory to hold the full model in FP16.
        """</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Merging QLoRA weights with base model..."</span><span class="p">)</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="s">"This requires loading the full model in FP16. Ensure sufficient memory."</span><span class="p">)</span>
        
        <span class="c1"># Load base model in FP16
</span>        <span class="n">base_model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_name</span><span class="p">,</span>
            <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
            <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Load and merge LoRA weights
</span>        <span class="kn">from</span> <span class="nn">peft</span> <span class="kn">import</span> <span class="n">PeftModel</span>
        <span class="n">merged_model</span> <span class="o">=</span> <span class="n">PeftModel</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span><span class="n">base_model</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">)</span>
        <span class="n">merged_model</span> <span class="o">=</span> <span class="n">merged_model</span><span class="p">.</span><span class="n">merge_and_unload</span><span class="p">()</span>
        
        <span class="c1"># Save merged model
</span>        <span class="n">merged_model</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span>
            <span class="n">output_path</span><span class="p">,</span> 
            <span class="n">safe_serialization</span><span class="o">=</span><span class="n">safe_serialization</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">save_pretrained</span><span class="p">(</span><span class="n">output_path</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Merged model saved to </span><span class="si">{</span><span class="n">output_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">estimate_memory_requirements</span><span class="p">(</span><span class="n">model_name</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">4</span><span class="p">,</span> <span class="n">seq_length</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">2048</span><span class="p">):</span>
    <span class="s">"""
    Estimate GPU memory requirements for QLoRA training.
    
    Returns estimated memory in GB.
    """</span>
    <span class="c1"># Rough estimates based on model size
</span>    <span class="n">model_params</span> <span class="o">=</span> <span class="p">{</span>
        <span class="s">"7B"</span><span class="p">:</span> <span class="mf">7e9</span><span class="p">,</span>
        <span class="s">"8B"</span><span class="p">:</span> <span class="mf">8e9</span><span class="p">,</span>
        <span class="s">"13B"</span><span class="p">:</span> <span class="mf">13e9</span><span class="p">,</span>
        <span class="s">"30B"</span><span class="p">:</span> <span class="mf">30e9</span><span class="p">,</span>
        <span class="s">"65B"</span><span class="p">:</span> <span class="mf">65e9</span><span class="p">,</span>
        <span class="s">"70B"</span><span class="p">:</span> <span class="mf">70e9</span><span class="p">,</span>
    <span class="p">}</span>
    
    <span class="c1"># Extract size from model name
</span>    <span class="n">size</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">for</span> <span class="n">key</span> <span class="ow">in</span> <span class="n">model_params</span><span class="p">:</span>
        <span class="k">if</span> <span class="n">key</span><span class="p">.</span><span class="n">lower</span><span class="p">()</span> <span class="ow">in</span> <span class="n">model_name</span><span class="p">.</span><span class="n">lower</span><span class="p">():</span>
            <span class="n">size</span> <span class="o">=</span> <span class="n">model_params</span><span class="p">[</span><span class="n">key</span><span class="p">]</span>
            <span class="k">break</span>
    
    <span class="k">if</span> <span class="n">size</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">warning</span><span class="p">(</span><span class="s">"Could not estimate model size, assuming 7B parameters"</span><span class="p">)</span>
        <span class="n">size</span> <span class="o">=</span> <span class="mf">7e9</span>
    
    <span class="c1"># Memory components for QLoRA
</span>    <span class="c1"># 4-bit weights: params * 0.5 bytes
</span>    <span class="n">quantized_weights</span> <span class="o">=</span> <span class="n">size</span> <span class="o">*</span> <span class="mf">0.5</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>
    
    <span class="c1"># LoRA adapters (FP16): ~0.1% of params * 2 bytes
</span>    <span class="n">lora_weights</span> <span class="o">=</span> <span class="n">size</span> <span class="o">*</span> <span class="mf">0.001</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>
    
    <span class="c1"># Optimizer states (8-bit paged): ~2 bytes per LoRA param
</span>    <span class="n">optimizer_states</span> <span class="o">=</span> <span class="n">size</span> <span class="o">*</span> <span class="mf">0.001</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>
    
    <span class="c1"># Activations (rough lower bound: a single FP32 hidden-state tensor; real usage grows with layer count)
</span>    <span class="n">activations</span> <span class="o">=</span> <span class="n">batch_size</span> <span class="o">*</span> <span class="n">seq_length</span> <span class="o">*</span> <span class="mi">4096</span> <span class="o">*</span> <span class="mi">4</span> <span class="o">/</span> <span class="mi">1024</span><span class="o">**</span><span class="mi">3</span>  <span class="c1"># Assume 4096 hidden dim
</span>    
    <span class="n">total</span> <span class="o">=</span> <span class="n">quantized_weights</span> <span class="o">+</span> <span class="n">lora_weights</span> <span class="o">+</span> <span class="n">optimizer_states</span> <span class="o">+</span> <span class="n">activations</span>
    
    <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"""
    Estimated GPU Memory for QLoRA:
    - Quantized weights: </span><span class="si">{</span><span class="n">quantized_weights</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB
    - LoRA adapters: </span><span class="si">{</span><span class="n">lora_weights</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB
    - Optimizer states: </span><span class="si">{</span><span class="n">optimizer_states</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB
    - Activations: </span><span class="si">{</span><span class="n">activations</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB
    - Total: </span><span class="si">{</span><span class="n">total</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s"> GB
    """</span><span class="p">)</span>
    
    <span class="k">return</span> <span class="n">total</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Run QLoRA fine-tuning."""</span>
    
    <span class="c1"># Estimate memory requirements first
</span>    <span class="n">estimate_memory_requirements</span><span class="p">(</span><span class="s">"meta-llama/Llama-4-8B"</span><span class="p">,</span> <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">)</span>
    
    <span class="n">config</span> <span class="o">=</span> <span class="n">QLoRAConfig</span><span class="p">(</span>
        <span class="n">model_name</span><span class="o">=</span><span class="s">"meta-llama/Llama-4-8B"</span><span class="p">,</span>
        <span class="n">lora_r</span><span class="o">=</span><span class="mi">64</span><span class="p">,</span>
        <span class="n">lora_alpha</span><span class="o">=</span><span class="mi">128</span><span class="p">,</span>
        <span class="n">learning_rate</span><span class="o">=</span><span class="mf">2e-4</span><span class="p">,</span>
        <span class="n">num_epochs</span><span class="o">=</span><span class="mi">3</span><span class="p">,</span>
        <span class="n">batch_size</span><span class="o">=</span><span class="mi">4</span><span class="p">,</span>
        <span class="n">gradient_accumulation_steps</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
        <span class="n">data_dir</span><span class="o">=</span><span class="s">"./processed_data"</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="s">"./qlora_finetuned_model"</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="n">trainer</span> <span class="o">=</span> <span class="n">QLoRAFineTuner</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
    <span class="n">trainer</span><span class="p">.</span><span class="n">train</span><span class="p">()</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
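<p>To make the arithmetic inside <code>estimate_memory_requirements</code> easier to check by hand, the following minimal sketch recomputes the same four terms as a standalone function with no logging or model-name parsing. The name <code>qlora_memory_breakdown</code> and the <code>lora_fraction</code> parameter are illustrative assumptions rather than part of the pipeline above; the constants (0.5 bytes per 4-bit weight, adapters at roughly 0.1% of base parameters, a 4096 hidden dimension) are copied from the estimator.</p>

```python
GIB = 1024 ** 3  # bytes per GiB, the same divisor the estimator uses


def qlora_memory_breakdown(
    num_params,
    batch_size=4,
    seq_length=2048,
    hidden_dim=4096,       # assumed hidden size, as in the estimator
    lora_fraction=0.001,   # assumed: adapters are roughly 0.1% of base params
):
    """Return the four memory terms (in GiB) plus their total."""
    parts = {
        # 4-bit base weights: half a byte per parameter
        "quantized_weights": num_params * 0.5 / GIB,
        # FP16 adapters: two bytes per adapter parameter
        "lora_weights": num_params * lora_fraction * 2 / GIB,
        # 8-bit optimizer states: roughly two bytes per adapter parameter
        "optimizer_states": num_params * lora_fraction * 2 / GIB,
        # One FP32 tensor of shape (batch, seq, hidden): four bytes per element
        "activations": batch_size * seq_length * hidden_dim * 4 / GIB,
    }
    # The right-hand side is evaluated before "total" is added to the dict
    parts["total"] = sum(parts.values())
    return parts


if __name__ == "__main__":
    for name, gib in qlora_memory_breakdown(8e9).items():
        print(f"{name:18s} {gib:6.2f} GiB")
```

<p>For an 8B-parameter model with the default batch size and sequence length, the quantized weights come to about 3.73 GiB and the total to about 3.88 GiB, so the estimate is dominated by the base weights. The activation term counts only one hidden-state tensor, so the total is best read as a lower bound rather than a sizing guarantee.</p>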

<hr />

<h2 id="evaluation-and-metrics">Evaluation and Metrics</h2>

<p>Proper evaluation on held-out data is critical for understanding model performance and for detecting overfitting.</p>

<p><img src="diagrams/16_evaluation_metrics.png" alt="Evaluation Framework: Metrics, Benchmarks, and Analysis Tools" /></p>
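<p>Perplexity is one of the metrics the evaluation pipeline below reports, and it is a convenient way to spot overfitting: a validation perplexity far above the training perplexity suggests the model has memorized its training data. The sketch below shows only the aggregation step, assuming each sequence has already produced a mean cross-entropy loss (in nats) and a count of the tokens that mean was taken over. The helper name <code>corpus_perplexity</code> and the loss values are hypothetical.</p>

```python
import math


def corpus_perplexity(mean_losses, predicted_tokens):
    """Token-weighted perplexity from per-sequence mean cross-entropy losses."""
    if len(mean_losses) != len(predicted_tokens):
        raise ValueError("mean_losses and predicted_tokens must have the same length")
    total_tokens = sum(predicted_tokens)
    if total_tokens == 0:
        raise ValueError("no predicted tokens to average over")
    # Undo each per-sequence mean, then average over every token in the corpus
    total_nll = sum(loss * n for loss, n in zip(mean_losses, predicted_tokens))
    return math.exp(total_nll / total_tokens)


if __name__ == "__main__":
    # Hypothetical losses for two sequences with 100 and 300 predicted tokens
    train_ppl = corpus_perplexity([1.2, 1.0], [100, 300])
    val_ppl = corpus_perplexity([2.0, 1.8], [100, 300])
    print(f"train perplexity: {train_ppl:.2f}")
    print(f"validation perplexity: {val_ppl:.2f}")
```

<p>Weighting by token count keeps short sequences from having the same influence as long ones, which a plain average of per-sequence losses would give them.</p>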

<h3 id="complete-evaluation-code">Complete Evaluation Code</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Comprehensive Evaluation Pipeline for Fine-Tuned LLMs.
Supports multiple metrics, benchmarks, and analysis tools.

Requirements:
    pip install torch transformers accelerate datasets evaluate nltk rouge-score sacrebleu
    pip install lm-eval  # For standard benchmarks
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">json</span>
<span class="kn">import</span> <span class="nn">logging</span>
<span class="kn">from</span> <span class="nn">pathlib</span> <span class="kn">import</span> <span class="n">Path</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span><span class="p">,</span> <span class="n">field</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Dict</span><span class="p">,</span> <span class="n">Any</span><span class="p">,</span> <span class="n">Callable</span>
<span class="kn">from</span> <span class="nn">collections</span> <span class="kn">import</span> <span class="n">defaultdict</span>

<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">transformers</span> <span class="kn">import</span> <span class="p">(</span>
    <span class="n">AutoModelForCausalLM</span><span class="p">,</span>
    <span class="n">AutoTokenizer</span><span class="p">,</span>
    <span class="n">GenerationConfig</span><span class="p">,</span>
<span class="p">)</span>
<span class="kn">from</span> <span class="nn">datasets</span> <span class="kn">import</span> <span class="n">load_dataset</span><span class="p">,</span> <span class="n">Dataset</span>
<span class="kn">from</span> <span class="nn">tqdm.auto</span> <span class="kn">import</span> <span class="n">tqdm</span>
<span class="kn">import</span> <span class="nn">evaluate</span>

<span class="n">logging</span><span class="p">.</span><span class="n">basicConfig</span><span class="p">(</span><span class="n">level</span><span class="o">=</span><span class="n">logging</span><span class="p">.</span><span class="n">INFO</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">'%(asctime)s - %(levelname)s - %(message)s'</span><span class="p">)</span>
<span class="n">logger</span> <span class="o">=</span> <span class="n">logging</span><span class="p">.</span><span class="n">getLogger</span><span class="p">(</span><span class="n">__name__</span><span class="p">)</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">EvaluationConfig</span><span class="p">:</span>
    <span class="s">"""Configuration for model evaluation."""</span>
    
    <span class="c1"># Model settings
</span>    <span class="n">model_path</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./finetuned_model"</span>
    <span class="n">torch_dtype</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"bfloat16"</span>
    <span class="n">device</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"cuda"</span>
    <span class="n">trust_remote_code</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>  <span class="c1"># Runs code shipped with the checkpoint; enable only for trusted sources</span>
    
    <span class="c1"># Generation settings
</span>    <span class="n">max_new_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">256</span>
    <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.7</span>
    <span class="n">top_p</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">0.9</span>
    <span class="n">do_sample</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    
    <span class="c1"># Evaluation settings
</span>    <span class="n">batch_size</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">8</span>
    <span class="n">num_samples</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>  <span class="c1"># None = use all
</span>    
    <span class="c1"># Metrics to compute
</span>    <span class="n">compute_perplexity</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">compute_rouge</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    <span class="n">compute_bleu</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">True</span>
    
    <span class="c1"># Output
</span>    <span class="n">output_dir</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"./evaluation_results"</span>


<span class="k">class</span> <span class="nc">LLMEvaluator</span><span class="p">:</span>
    <span class="s">"""Comprehensive evaluation suite for fine-tuned LLMs."""</span>
    
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">config</span><span class="p">:</span> <span class="n">EvaluationConfig</span><span class="p">):</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">config</span> <span class="o">=</span> <span class="n">config</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">device</span> <span class="k">if</span> <span class="n">torch</span><span class="p">.</span><span class="n">cuda</span><span class="p">.</span><span class="n">is_available</span><span class="p">()</span> <span class="k">else</span> <span class="s">"cpu"</span><span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_load_model</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_load_metrics</span><span class="p">()</span>
    
    <span class="k">def</span> <span class="nf">_load_model</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load the fine-tuned model."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Loading model from </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="n">dtype_map</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"float32"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float32</span><span class="p">,</span>
            <span class="s">"float16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">float16</span><span class="p">,</span>
            <span class="s">"bfloat16"</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">,</span>
        <span class="p">}</span>
        <span class="n">torch_dtype</span> <span class="o">=</span> <span class="n">dtype_map</span><span class="p">.</span><span class="n">get</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">torch_dtype</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="n">bfloat16</span><span class="p">)</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span> <span class="o">=</span> <span class="n">AutoTokenizer</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_path</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token</span>
        
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span> <span class="o">=</span> <span class="n">AutoModelForCausalLM</span><span class="p">.</span><span class="n">from_pretrained</span><span class="p">(</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_path</span><span class="p">,</span>
            <span class="n">torch_dtype</span><span class="o">=</span><span class="n">torch_dtype</span><span class="p">,</span>
            <span class="n">device_map</span><span class="o">=</span><span class="s">"auto"</span><span class="p">,</span>
            <span class="n">trust_remote_code</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">trust_remote_code</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="nb">eval</span><span class="p">()</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Model loaded successfully"</span><span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">_load_metrics</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Load evaluation metrics."""</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span> <span class="o">=</span> <span class="p">{}</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">compute_rouge</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">[</span><span class="s">"rouge"</span><span class="p">]</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">"rouge"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">compute_bleu</span><span class="p">:</span>
            <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">[</span><span class="s">"bleu"</span><span class="p">]</span> <span class="o">=</span> <span class="n">evaluate</span><span class="p">.</span><span class="n">load</span><span class="p">(</span><span class="s">"sacrebleu"</span><span class="p">)</span>
    
    <span class="o">@</span><span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">compute_perplexity</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">texts</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">])</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
        <span class="s">"""Compute perplexity on a list of texts."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Computing perplexity..."</span><span class="p">)</span>
        
        <span class="n">total_loss</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="n">total_tokens</span> <span class="o">=</span> <span class="mi">0</span>
        
        <span class="k">for</span> <span class="n">text</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">texts</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Perplexity"</span><span class="p">):</span>
            <span class="n">inputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">(</span>
                <span class="n">text</span><span class="p">,</span>
                <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span>
                <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">max_length</span><span class="o">=</span><span class="mi">2048</span><span class="p">,</span>
            <span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
            
            <span class="c1"># The causal-LM loss is averaged over the shifted targets, so a</span>
            <span class="c1"># sequence of n tokens contributes n - 1 predictions, not n.</span>
            <span class="n">num_tokens</span> <span class="o">=</span> <span class="n">inputs</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">].</span><span class="n">numel</span><span class="p">()</span> <span class="o">-</span> <span class="mi">1</span>
            <span class="k">if</span> <span class="n">num_tokens</span> <span class="o">&lt;</span> <span class="mi">1</span><span class="p">:</span>
                <span class="k">continue</span>  <span class="c1"># nothing to predict for empty or single-token texts</span>
            
            <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">(</span><span class="o">**</span><span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span><span class="o">=</span><span class="n">inputs</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">])</span>
            <span class="n">loss</span> <span class="o">=</span> <span class="n">outputs</span><span class="p">.</span><span class="n">loss</span><span class="p">.</span><span class="n">item</span><span class="p">()</span>
            
            <span class="n">total_loss</span> <span class="o">+=</span> <span class="n">loss</span> <span class="o">*</span> <span class="n">num_tokens</span>
            <span class="n">total_tokens</span> <span class="o">+=</span> <span class="n">num_tokens</span>
        
        <span class="n">avg_loss</span> <span class="o">=</span> <span class="n">total_loss</span> <span class="o">/</span> <span class="n">total_tokens</span>
        <span class="n">perplexity</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">avg_loss</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"perplexity"</span><span class="p">:</span> <span class="nb">float</span><span class="p">(</span><span class="n">perplexity</span><span class="p">),</span>
            <span class="s">"avg_loss"</span><span class="p">:</span> <span class="nb">float</span><span class="p">(</span><span class="n">avg_loss</span><span class="p">),</span>
            <span class="s">"total_tokens"</span><span class="p">:</span> <span class="n">total_tokens</span><span class="p">,</span>
        <span class="p">}</span>
    
    <span class="o">@</span><span class="n">torch</span><span class="p">.</span><span class="n">no_grad</span><span class="p">()</span>
    <span class="k">def</span> <span class="nf">generate_responses</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">prompts</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">generation_config</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">GenerationConfig</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]:</span>
        <span class="s">"""Generate responses for a list of prompts."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Generating responses for </span><span class="si">{</span><span class="nb">len</span><span class="p">(</span><span class="n">prompts</span><span class="p">)</span><span class="si">}</span><span class="s"> prompts..."</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="n">generation_config</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">generation_config</span> <span class="o">=</span> <span class="n">GenerationConfig</span><span class="p">(</span>
                <span class="n">max_new_tokens</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_new_tokens</span><span class="p">,</span>
                <span class="n">temperature</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">temperature</span><span class="p">,</span>
                <span class="n">top_p</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">top_p</span><span class="p">,</span>
                <span class="n">do_sample</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">do_sample</span><span class="p">,</span>
                <span class="n">pad_token_id</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">pad_token_id</span><span class="p">,</span>
                <span class="n">eos_token_id</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">eos_token_id</span><span class="p">,</span>
            <span class="p">)</span>
        
        <span class="n">responses</span> <span class="o">=</span> <span class="p">[]</span>
        
        <span class="k">for</span> <span class="n">prompt</span> <span class="ow">in</span> <span class="n">tqdm</span><span class="p">(</span><span class="n">prompts</span><span class="p">,</span> <span class="n">desc</span><span class="o">=</span><span class="s">"Generating"</span><span class="p">):</span>
            <span class="n">inputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">(</span>
                <span class="n">prompt</span><span class="p">,</span>
                <span class="n">return_tensors</span><span class="o">=</span><span class="s">"pt"</span><span class="p">,</span>
                <span class="n">truncation</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
                <span class="n">max_length</span><span class="o">=</span><span class="mi">2048</span> <span class="o">-</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">max_new_tokens</span><span class="p">,</span>
            <span class="p">).</span><span class="n">to</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
            
            <span class="n">outputs</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">model</span><span class="p">.</span><span class="n">generate</span><span class="p">(</span>
                <span class="o">**</span><span class="n">inputs</span><span class="p">,</span>
                <span class="n">generation_config</span><span class="o">=</span><span class="n">generation_config</span><span class="p">,</span>
            <span class="p">)</span>
            
            <span class="c1"># Decode only the new tokens
</span>            <span class="n">response</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">tokenizer</span><span class="p">.</span><span class="n">decode</span><span class="p">(</span>
                <span class="n">outputs</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="n">inputs</span><span class="p">[</span><span class="s">"input_ids"</span><span class="p">].</span><span class="n">shape</span><span class="p">[</span><span class="mi">1</span><span class="p">]:],</span>
                <span class="n">skip_special_tokens</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span>
            <span class="p">)</span>
            <span class="n">responses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">response</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">responses</span>
    
    <span class="k">def</span> <span class="nf">compute_rouge_scores</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">predictions</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">references</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
        <span class="s">"""Compute ROUGE scores."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Computing ROUGE scores..."</span><span class="p">)</span>
        
        <span class="n">results</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">[</span><span class="s">"rouge"</span><span class="p">].</span><span class="n">compute</span><span class="p">(</span>
            <span class="n">predictions</span><span class="o">=</span><span class="n">predictions</span><span class="p">,</span>
            <span class="n">references</span><span class="o">=</span><span class="n">references</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"rouge1"</span><span class="p">:</span> <span class="n">results</span><span class="p">[</span><span class="s">"rouge1"</span><span class="p">],</span>
            <span class="s">"rouge2"</span><span class="p">:</span> <span class="n">results</span><span class="p">[</span><span class="s">"rouge2"</span><span class="p">],</span>
            <span class="s">"rougeL"</span><span class="p">:</span> <span class="n">results</span><span class="p">[</span><span class="s">"rougeL"</span><span class="p">],</span>
            <span class="s">"rougeLsum"</span><span class="p">:</span> <span class="n">results</span><span class="p">[</span><span class="s">"rougeLsum"</span><span class="p">],</span>
        <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">compute_bleu_score</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">predictions</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">references</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]],</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">float</span><span class="p">]:</span>
        <span class="s">"""Compute BLEU score."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Computing BLEU score..."</span><span class="p">)</span>
        
        <span class="n">results</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">metrics</span><span class="p">[</span><span class="s">"bleu"</span><span class="p">].</span><span class="n">compute</span><span class="p">(</span>
            <span class="n">predictions</span><span class="o">=</span><span class="n">predictions</span><span class="p">,</span>
            <span class="n">references</span><span class="o">=</span><span class="n">references</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="p">{</span>
            <span class="s">"bleu"</span><span class="p">:</span> <span class="n">results</span><span class="p">[</span><span class="s">"score"</span><span class="p">],</span>
            <span class="s">"precisions"</span><span class="p">:</span> <span class="n">results</span><span class="p">[</span><span class="s">"precisions"</span><span class="p">],</span>
        <span class="p">}</span>
    
    <span class="k">def</span> <span class="nf">evaluate_instruction_following</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">dataset</span><span class="p">:</span> <span class="n">Dataset</span><span class="p">,</span>
        <span class="n">instruction_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"instruction"</span><span class="p">,</span>
        <span class="n">response_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"response"</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="s">"""Evaluate instruction-following capability."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Evaluating instruction following..."</span><span class="p">)</span>
        
        <span class="c1"># Limit samples if specified
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_samples</span><span class="p">:</span>
            <span class="n">dataset</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">.</span><span class="n">select</span><span class="p">(</span><span class="nb">range</span><span class="p">(</span><span class="nb">min</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">num_samples</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">dataset</span><span class="p">))))</span>
        
        <span class="c1"># Extract prompts and references
</span>        <span class="n">prompts</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="n">instruction_col</span><span class="p">]</span>
        <span class="n">references</span> <span class="o">=</span> <span class="n">dataset</span><span class="p">[</span><span class="n">response_col</span><span class="p">]</span>
        
        <span class="c1"># Generate responses
</span>        <span class="n">predictions</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">generate_responses</span><span class="p">(</span><span class="n">prompts</span><span class="p">)</span>
        
        <span class="n">results</span> <span class="o">=</span> <span class="p">{}</span>
        
        <span class="c1"># ROUGE scores
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">compute_rouge</span><span class="p">:</span>
            <span class="n">results</span><span class="p">[</span><span class="s">"rouge"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_rouge_scores</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">references</span><span class="p">)</span>
        
        <span class="c1"># BLEU score
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">compute_bleu</span><span class="p">:</span>
            <span class="c1"># BLEU expects list of reference lists
</span>            <span class="n">ref_lists</span> <span class="o">=</span> <span class="p">[[</span><span class="n">ref</span><span class="p">]</span> <span class="k">for</span> <span class="n">ref</span> <span class="ow">in</span> <span class="n">references</span><span class="p">]</span>
            <span class="n">results</span><span class="p">[</span><span class="s">"bleu"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_bleu_score</span><span class="p">(</span><span class="n">predictions</span><span class="p">,</span> <span class="n">ref_lists</span><span class="p">)</span>
        
        <span class="c1"># Save sample outputs
</span>        <span class="n">results</span><span class="p">[</span><span class="s">"samples"</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span>
            <span class="p">{</span>
                <span class="s">"instruction"</span><span class="p">:</span> <span class="n">p</span><span class="p">,</span>
                <span class="s">"reference"</span><span class="p">:</span> <span class="n">r</span><span class="p">,</span>
                <span class="s">"prediction"</span><span class="p">:</span> <span class="n">pred</span><span class="p">,</span>
            <span class="p">}</span>
            <span class="k">for</span> <span class="n">p</span><span class="p">,</span> <span class="n">r</span><span class="p">,</span> <span class="n">pred</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">prompts</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">references</span><span class="p">[:</span><span class="mi">10</span><span class="p">],</span> <span class="n">predictions</span><span class="p">[:</span><span class="mi">10</span><span class="p">])</span>
        <span class="p">]</span>
        
        <span class="k">return</span> <span class="n">results</span>
    
    <span class="k">def</span> <span class="nf">run_lm_eval_harness</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">tasks</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">num_fewshot</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="s">"""
        Run evaluation using lm-evaluation-harness.
        
        Requires: pip install lm-eval
        """</span>
        <span class="c1"># Avoid a mutable default argument; fall back to the standard task set</span>
        <span class="k">if</span> <span class="n">tasks</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">tasks</span> <span class="o">=</span> <span class="p">[</span><span class="s">"hellaswag"</span><span class="p">,</span> <span class="s">"arc_easy"</span><span class="p">,</span> <span class="s">"arc_challenge"</span><span class="p">,</span> <span class="s">"winogrande"</span><span class="p">]</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Running lm-eval-harness on tasks: </span><span class="si">{</span><span class="n">tasks</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">try</span><span class="p">:</span>
            <span class="kn">from</span> <span class="nn">lm_eval</span> <span class="kn">import</span> <span class="n">evaluator</span><span class="p">,</span> <span class="n">tasks</span> <span class="k">as</span> <span class="n">lm_tasks</span>
            <span class="kn">from</span> <span class="nn">lm_eval.models.huggingface</span> <span class="kn">import</span> <span class="n">HFLM</span>
        <span class="k">except</span> <span class="nb">ImportError</span><span class="p">:</span>
            <span class="n">logger</span><span class="p">.</span><span class="n">error</span><span class="p">(</span><span class="s">"lm-eval not installed. Run: pip install lm-eval"</span><span class="p">)</span>
            <span class="k">return</span> <span class="p">{</span><span class="s">"error"</span><span class="p">:</span> <span class="s">"lm-eval not installed"</span><span class="p">}</span>
        
        <span class="c1"># Create LM object
</span>        <span class="n">lm</span> <span class="o">=</span> <span class="n">HFLM</span><span class="p">(</span>
            <span class="n">pretrained</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">model_path</span><span class="p">,</span>
            <span class="n">dtype</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">torch_dtype</span><span class="p">,</span>
            <span class="n">batch_size</span><span class="o">=</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">batch_size</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="c1"># Run evaluation
</span>        <span class="n">results</span> <span class="o">=</span> <span class="n">evaluator</span><span class="p">.</span><span class="n">simple_evaluate</span><span class="p">(</span>
            <span class="n">model</span><span class="o">=</span><span class="n">lm</span><span class="p">,</span>
            <span class="n">tasks</span><span class="o">=</span><span class="n">tasks</span><span class="p">,</span>
            <span class="n">num_fewshot</span><span class="o">=</span><span class="n">num_fewshot</span><span class="p">,</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="n">results</span>
    
    <span class="k">def</span> <span class="nf">analyze_errors</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">prompts</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">predictions</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">references</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
        <span class="n">categorize_fn</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Callable</span><span class="p">[[</span><span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">,</span> <span class="nb">str</span><span class="p">],</span> <span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="s">"""Analyze prediction errors."""</span>
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="s">"Analyzing errors..."</span><span class="p">)</span>
        
        <span class="n">errors_by_category</span> <span class="o">=</span> <span class="n">defaultdict</span><span class="p">(</span><span class="nb">list</span><span class="p">)</span>
        
        <span class="k">for</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">pred</span><span class="p">,</span> <span class="n">ref</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">prompts</span><span class="p">,</span> <span class="n">predictions</span><span class="p">,</span> <span class="n">references</span><span class="p">):</span>
            <span class="c1"># Simple error detection: flag any prediction that is not an exact match after case and whitespace normalization
</span>            <span class="k">if</span> <span class="n">pred</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">lower</span><span class="p">()</span> <span class="o">!=</span> <span class="n">ref</span><span class="p">.</span><span class="n">strip</span><span class="p">().</span><span class="n">lower</span><span class="p">():</span>
                <span class="k">if</span> <span class="n">categorize_fn</span><span class="p">:</span>
                    <span class="n">category</span> <span class="o">=</span> <span class="n">categorize_fn</span><span class="p">(</span><span class="n">prompt</span><span class="p">,</span> <span class="n">pred</span><span class="p">,</span> <span class="n">ref</span><span class="p">)</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="c1"># Default categorization by length difference
</span>                    <span class="n">len_diff</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">pred</span><span class="p">)</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">ref</span><span class="p">)</span>
                    <span class="k">if</span> <span class="n">len_diff</span> <span class="o">&gt;</span> <span class="mi">100</span><span class="p">:</span>
                        <span class="n">category</span> <span class="o">=</span> <span class="s">"too_long"</span>
                    <span class="k">elif</span> <span class="n">len_diff</span> <span class="o">&lt;</span> <span class="o">-</span><span class="mi">100</span><span class="p">:</span>
                        <span class="n">category</span> <span class="o">=</span> <span class="s">"too_short"</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="n">category</span> <span class="o">=</span> <span class="s">"content_mismatch"</span>
                
                <span class="n">errors_by_category</span><span class="p">[</span><span class="n">category</span><span class="p">].</span><span class="n">append</span><span class="p">({</span>
                    <span class="s">"prompt"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">[:</span><span class="mi">200</span><span class="p">],</span>
                    <span class="s">"prediction"</span><span class="p">:</span> <span class="n">pred</span><span class="p">[:</span><span class="mi">200</span><span class="p">],</span>
                    <span class="s">"reference"</span><span class="p">:</span> <span class="n">ref</span><span class="p">[:</span><span class="mi">200</span><span class="p">],</span>
                <span class="p">})</span>
        
        <span class="n">analysis</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"total_errors"</span><span class="p">:</span> <span class="nb">sum</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">errors_by_category</span><span class="p">.</span><span class="n">values</span><span class="p">()),</span>
            <span class="s">"errors_by_category"</span><span class="p">:</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">v</span><span class="p">)</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">errors_by_category</span><span class="p">.</span><span class="n">items</span><span class="p">()},</span>
            <span class="s">"sample_errors"</span><span class="p">:</span> <span class="p">{</span><span class="n">k</span><span class="p">:</span> <span class="n">v</span><span class="p">[:</span><span class="mi">3</span><span class="p">]</span> <span class="k">for</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span> <span class="ow">in</span> <span class="n">errors_by_category</span><span class="p">.</span><span class="n">items</span><span class="p">()},</span>
        <span class="p">}</span>
        
        <span class="k">return</span> <span class="n">analysis</span>
    
    <span class="k">def</span> <span class="nf">run_full_evaluation</span><span class="p">(</span>
        <span class="bp">self</span><span class="p">,</span>
        <span class="n">test_dataset</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">Dataset</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">test_texts</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span>
        <span class="n">instruction_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"instruction"</span><span class="p">,</span>
        <span class="n">response_col</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"response"</span><span class="p">,</span>
        <span class="n">run_benchmarks</span><span class="p">:</span> <span class="nb">bool</span> <span class="o">=</span> <span class="bp">False</span><span class="p">,</span>
    <span class="p">)</span> <span class="o">-&gt;</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]:</span>
        <span class="s">"""Run comprehensive evaluation."""</span>
        <span class="n">results</span> <span class="o">=</span> <span class="p">{}</span>
        
        <span class="c1"># Perplexity evaluation
</span>        <span class="k">if</span> <span class="n">test_texts</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">compute_perplexity</span><span class="p">:</span>
            <span class="n">results</span><span class="p">[</span><span class="s">"perplexity"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">compute_perplexity</span><span class="p">(</span><span class="n">test_texts</span><span class="p">)</span>
        
        <span class="c1"># Instruction following evaluation
</span>        <span class="k">if</span> <span class="n">test_dataset</span><span class="p">:</span>
            <span class="n">results</span><span class="p">[</span><span class="s">"instruction_following"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">evaluate_instruction_following</span><span class="p">(</span>
                <span class="n">test_dataset</span><span class="p">,</span> <span class="n">instruction_col</span><span class="p">,</span> <span class="n">response_col</span>
            <span class="p">)</span>
        
        <span class="c1"># Standard benchmarks (optional)
</span>        <span class="k">if</span> <span class="n">run_benchmarks</span><span class="p">:</span>
            <span class="n">results</span><span class="p">[</span><span class="s">"benchmarks"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">run_lm_eval_harness</span><span class="p">()</span>
        
        <span class="c1"># Save results
</span>        <span class="n">output_dir</span> <span class="o">=</span> <span class="n">Path</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">config</span><span class="p">.</span><span class="n">output_dir</span><span class="p">)</span>
        <span class="n">output_dir</span><span class="p">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">parents</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">exist_ok</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_dir</span> <span class="o">/</span> <span class="s">"evaluation_results.json"</span><span class="p">,</span> <span class="s">"w"</span><span class="p">,</span> <span class="n">encoding</span><span class="o">=</span><span class="s">"utf-8"</span><span class="p">)</span> <span class="k">as</span> <span class="n">f</span><span class="p">:</span>
            <span class="n">json</span><span class="p">.</span><span class="n">dump</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">f</span><span class="p">,</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">default</span><span class="o">=</span><span class="nb">str</span><span class="p">)</span>
        
        <span class="n">logger</span><span class="p">.</span><span class="n">info</span><span class="p">(</span><span class="sa">f</span><span class="s">"Results saved to </span><span class="si">{</span><span class="n">output_dir</span> <span class="o">/</span> <span class="s">'evaluation_results.json'</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="n">results</span>
    
    <span class="k">def</span> <span class="nf">print_summary</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">results</span><span class="p">:</span> <span class="n">Dict</span><span class="p">[</span><span class="nb">str</span><span class="p">,</span> <span class="n">Any</span><span class="p">]):</span>
        <span class="s">"""Print evaluation summary."""</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"EVALUATION SUMMARY"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"="</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="s">"perplexity"</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Perplexity: </span><span class="si">{</span><span class="n">results</span><span class="p">[</span><span class="s">'perplexity'</span><span class="p">][</span><span class="s">'perplexity'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="s">"instruction_following"</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span>
            <span class="k">if</span> <span class="s">"rouge"</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s">"instruction_following"</span><span class="p">]:</span>
                <span class="n">rouge</span> <span class="o">=</span> <span class="n">results</span><span class="p">[</span><span class="s">"instruction_following"</span><span class="p">][</span><span class="s">"rouge"</span><span class="p">]</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">ROUGE Scores:"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  ROUGE-1: </span><span class="si">{</span><span class="n">rouge</span><span class="p">[</span><span class="s">'rouge1'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  ROUGE-2: </span><span class="si">{</span><span class="n">rouge</span><span class="p">[</span><span class="s">'rouge2'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  ROUGE-L: </span><span class="si">{</span><span class="n">rouge</span><span class="p">[</span><span class="s">'rougeL'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
            
            <span class="k">if</span> <span class="s">"bleu"</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s">"instruction_following"</span><span class="p">]:</span>
                <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">BLEU Score: </span><span class="si">{</span><span class="n">results</span><span class="p">[</span><span class="s">'instruction_following'</span><span class="p">][</span><span class="s">'bleu'</span><span class="p">][</span><span class="s">'bleu'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">if</span> <span class="s">"benchmarks"</span> <span class="ow">in</span> <span class="n">results</span> <span class="ow">and</span> <span class="s">"results"</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s">"benchmarks"</span><span class="p">]:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">Benchmark Results:"</span><span class="p">)</span>
            <span class="k">for</span> <span class="n">task</span><span class="p">,</span> <span class="n">scores</span> <span class="ow">in</span> <span class="n">results</span><span class="p">[</span><span class="s">"benchmarks"</span><span class="p">][</span><span class="s">"results"</span><span class="p">].</span><span class="n">items</span><span class="p">():</span>
                <span class="k">if</span> <span class="s">"acc"</span> <span class="ow">in</span> <span class="n">scores</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  </span><span class="si">{</span><span class="n">task</span><span class="si">}</span><span class="s">: </span><span class="si">{</span><span class="n">scores</span><span class="p">[</span><span class="s">'acc'</span><span class="p">]</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span><span class="o">*</span><span class="mi">60</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Run evaluation on a fine-tuned model."""</span>
    
    <span class="n">config</span> <span class="o">=</span> <span class="n">EvaluationConfig</span><span class="p">(</span>
        <span class="n">model_path</span><span class="o">=</span><span class="s">"./finetuned_model"</span><span class="p">,</span>
        <span class="n">output_dir</span><span class="o">=</span><span class="s">"./evaluation_results"</span><span class="p">,</span>
        <span class="n">batch_size</span><span class="o">=</span><span class="mi">8</span><span class="p">,</span>
        <span class="n">num_samples</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span>
    <span class="p">)</span>
    
    <span class="n">evaluator</span> <span class="o">=</span> <span class="n">LLMEvaluator</span><span class="p">(</span><span class="n">config</span><span class="p">)</span>
    
    <span class="c1"># Load evaluation data (example: Alpaca, which ships only a "train" split; keep these rows out of training)
</span>    <span class="n">dataset</span> <span class="o">=</span> <span class="n">load_dataset</span><span class="p">(</span><span class="s">"tatsu-lab/alpaca"</span><span class="p">,</span> <span class="n">split</span><span class="o">=</span><span class="s">"train[:1000]"</span><span class="p">)</span>
    <span class="n">test_texts</span> <span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="s">"output"</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">dataset</span><span class="p">]</span>
    
    <span class="c1"># Run evaluation
</span>    <span class="n">results</span> <span class="o">=</span> <span class="n">evaluator</span><span class="p">.</span><span class="n">run_full_evaluation</span><span class="p">(</span>
        <span class="n">test_dataset</span><span class="o">=</span><span class="n">dataset</span><span class="p">,</span>
        <span class="n">test_texts</span><span class="o">=</span><span class="n">test_texts</span><span class="p">,</span>
        <span class="n">instruction_col</span><span class="o">=</span><span class="s">"instruction"</span><span class="p">,</span>
        <span class="n">response_col</span><span class="o">=</span><span class="s">"output"</span><span class="p">,</span>
        <span class="n">run_benchmarks</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>  <span class="c1"># Set to True to run standard benchmarks
</span>    <span class="p">)</span>
    
    <span class="c1"># Print summary
</span>    <span class="n">evaluator</span><span class="p">.</span><span class="n">print_summary</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h2 id="best-practices-and-optimization-tips">Best Practices and Optimization Tips</h2>

<h3 id="data-quality">Data Quality</h3>

<p><img src="diagrams/17_memory_comparison.png" alt="Memory Requirements Comparison for 70B Model Fine-Tuning" /></p>

<h3 id="hyperparameter-selection-guide">Hyperparameter Selection Guide</h3>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Full FT</th>
      <th>LoRA</th>
      <th>QLoRA</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Learning Rate</td>
      <td>1e-5 - 5e-5</td>
      <td>1e-4 - 3e-4</td>
      <td>1e-4 - 3e-4</td>
      <td>QLoRA can use same as LoRA</td>
    </tr>
    <tr>
      <td>Batch Size</td>
      <td>32-128</td>
      <td>16-64</td>
      <td>4-16</td>
      <td>Limited by memory</td>
    </tr>
    <tr>
      <td>Epochs</td>
      <td>1-3</td>
      <td>1-3</td>
      <td>2-4</td>
      <td>QLoRA may need more</td>
    </tr>
    <tr>
      <td>Warmup Ratio</td>
      <td>0.03-0.1</td>
      <td>0.03-0.1</td>
      <td>0.03-0.1</td>
      <td>Standard across all</td>
    </tr>
    <tr>
      <td>Max Grad Norm</td>
      <td>1.0</td>
      <td>1.0</td>
      <td>0.3</td>
      <td>Lower for QLoRA stability</td>
    </tr>
    <tr>
      <td>Weight Decay</td>
      <td>0.01-0.1</td>
      <td>0.01</td>
      <td>0.01</td>
      <td>Lower for LoRA methods</td>
    </tr>
    <tr>
      <td>LoRA r</td>
      <td>N/A</td>
      <td>32-128</td>
      <td>64-256</td>
      <td>Higher r = more capacity</td>
    </tr>
    <tr>
      <td>LoRA α</td>
      <td>N/A</td>
      <td>2×r</td>
      <td>2×r</td>
      <td>Common heuristic</td>
    </tr>
  </tbody>
</table>
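
<p>The table can be captured as a small lookup of starting points. The sketch below is illustrative only: <code class="language-plaintext highlighter-rouge">Preset</code> and <code class="language-plaintext highlighter-rouge">PRESETS</code> are hypothetical names, not part of any library, and the concrete values are arbitrary picks from inside each range above rather than tuned recommendations.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Preset:
    """One starting point drawn from the ranges in the table above."""
    learning_rate: float
    batch_size: int
    epochs: int
    warmup_ratio: float
    max_grad_norm: float
    weight_decay: float
    lora_r: Optional[int] = None  # None for full fine-tuning

    @property
    def lora_alpha(self) -> Optional[int]:
        # Common heuristic from the table: alpha = 2 * r
        return None if self.lora_r is None else 2 * self.lora_r


# Hypothetical picks inside each range; adjust for your model and data.
PRESETS = {
    "full_ft": Preset(2e-5, 64, 2, 0.05, 1.0, 0.05),
    "lora": Preset(2e-4, 32, 2, 0.05, 1.0, 0.01, lora_r=64),
    "qlora": Preset(2e-4, 8, 3, 0.05, 0.3, 0.01, lora_r=64),
}

if __name__ == "__main__":
    for name, preset in PRESETS.items():
        print(name, preset, "alpha =", preset.lora_alpha)
```

<p>Keeping the presets in one frozen structure makes it harder to mix, say, a full fine-tuning learning rate with a QLoRA gradient-clipping value by accident.</p>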

<h3 id="memory-optimization-strategies">Memory Optimization Strategies</h3>

<p><img src="diagrams/18_decision_flowchart.png" alt="Fine-Tuning Method Selection Guide" /></p>

<h3 id="common-pitfalls-and-solutions">Common Pitfalls and Solutions</h3>

<table>
  <thead>
    <tr>
      <th>Pitfall</th>
      <th>Symptom</th>
      <th>Solution</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Overfitting</td>
      <td>Validation loss rises while training loss keeps falling</td>
      <td>Early stopping, more data, regularization</td>
    </tr>
    <tr>
      <td>Catastrophic forgetting</td>
      <td>Base capabilities degrade</td>
      <td>Lower LR, LoRA, replay buffer</td>
    </tr>
    <tr>
      <td>Gradient explosion</td>
      <td>NaN losses</td>
      <td>Lower LR, gradient clipping</td>
    </tr>
    <tr>
      <td>Mode collapse</td>
      <td>Repetitive outputs</td>
      <td>Higher sampling temperature, nucleus (top-p) sampling</td>
    </tr>
    <tr>
      <td>Slow convergence</td>
      <td>Loss plateaus early</td>
      <td>Higher LR, LR scheduling</td>
    </tr>
  </tbody>
</table>
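
<p>As one illustration of the first row, early stopping can be sketched in a few lines of plain Python. This is a minimal sketch, not the mechanism of any particular training library: <code class="language-plaintext highlighter-rouge">EarlyStopping</code> and <code class="language-plaintext highlighter-rouge">first_stop_index</code> are hypothetical names, and the validation curve is made up to show the symptom.</p>

```python
class EarlyStopping:
    """Stop once validation loss has not improved for `patience` evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best_loss = float("inf")
        self.bad_evals = 0

    def should_stop(self, val_loss: float) -> bool:
        if self.best_loss - val_loss > self.min_delta:
            # Meaningful improvement: record it and reset the counter
            self.best_loss = val_loss
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience


def first_stop_index(val_losses, patience=3):
    """Return the index of the evaluation at which training would stop, or None."""
    stopper = EarlyStopping(patience=patience)
    for i, loss in enumerate(val_losses):
        if stopper.should_stop(loss):
            return i
    return None


if __name__ == "__main__":
    # Made-up validation curve: improves, then rises (the overfitting symptom)
    history = [2.10, 1.85, 1.70, 1.72, 1.78, 1.91, 2.05]
    print(first_stop_index(history, patience=3))
```

<p>In practice the checkpoint from the best evaluation (here the third one) is the one to keep, since the later checkpoints are the overfit ones.</p>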

<hr />

<h2 id="comparison-of-approaches">Comparison of Approaches</h2>

<h3 id="decision-framework">Decision Framework</h3>

<p><img src="diagrams/19_memory_optimization.png" alt="Memory Optimization Strategies for LLM Training" /></p>

<h3 id="comprehensive-comparison">Comprehensive Comparison</h3>

<table>
  <thead>
    <tr>
      <th>Aspect</th>
      <th>Full Fine-Tuning</th>
      <th>LoRA</th>
      <th>QLoRA</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Performance</strong></td>
      <td>⭐⭐⭐⭐⭐</td>
      <td>⭐⭐⭐⭐</td>
      <td>⭐⭐⭐⭐</td>
    </tr>
    <tr>
      <td><strong>Memory Efficiency</strong></td>
      <td>⭐</td>
      <td>⭐⭐⭐⭐</td>
      <td>⭐⭐⭐⭐⭐</td>
    </tr>
    <tr>
      <td><strong>Training Speed</strong></td>
      <td>⭐⭐</td>
      <td>⭐⭐⭐⭐</td>
      <td>⭐⭐⭐</td>
    </tr>
    <tr>
      <td><strong>Inference Speed</strong></td>
      <td>⭐⭐⭐⭐⭐</td>
      <td>⭐⭐⭐⭐ (merged)</td>
      <td>⭐⭐⭐</td>
    </tr>
    <tr>
      <td><strong>Ease of Use</strong></td>
      <td>⭐⭐⭐</td>
      <td>⭐⭐⭐⭐</td>
      <td>⭐⭐⭐⭐</td>
    </tr>
    <tr>
      <td><strong>Flexibility</strong></td>
      <td>⭐⭐⭐⭐⭐</td>
      <td>⭐⭐⭐</td>
      <td>⭐⭐⭐</td>
    </tr>
    <tr>
      <td><strong>Hardware Requirement</strong></td>
      <td>Multiple A100s</td>
      <td>Single A100</td>
      <td>Consumer GPU</td>
    </tr>
  </tbody>
</table>

<h3 id="cost-comparison-70b-model-10k-examples">Cost Comparison (70B Model, 10K Examples)</h3>

<table>
  <thead>
    <tr>
      <th>Approach</th>
      <th>Hardware</th>
      <th>Training Time</th>
      <th>Est. Cloud Cost</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Full FT</td>
      <td>8× A100 80GB</td>
      <td>~24 hours</td>
      <td>~$800-1200</td>
    </tr>
    <tr>
      <td>LoRA</td>
      <td>2× A100 80GB</td>
      <td>~12 hours</td>
      <td>~$150-250</td>
    </tr>
    <tr>
      <td>QLoRA</td>
      <td>1× A100 40GB</td>
      <td>~18 hours</td>
      <td>~$100-150</td>
    </tr>
    <tr>
      <td>QLoRA</td>
      <td>1× RTX 4090</td>
      <td>~48 hours</td>
      <td>Local hardware</td>
    </tr>
  </tbody>
</table>
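
<p>Each cloud estimate reduces to GPU count × hours × hourly price. The sketch below works backwards from the cloud rows of the table to the per-GPU hourly price each one implies. The helper names are hypothetical. The implied prices differ from row to row, so the figures are best read as rough order-of-magnitude estimates.</p>

```python
# Rows copied from the table above: (approach, gpu_count, hours, (low_usd, high_usd))
ROWS = [
    ("Full FT", 8, 24, (800, 1200)),
    ("LoRA", 2, 12, (150, 250)),
    ("QLoRA (A100 40GB)", 1, 18, (100, 150)),
]


def gpu_hours(gpu_count: int, hours: float) -> float:
    """Total GPU-hours consumed by a run."""
    return gpu_count * hours


def implied_hourly_rate(cost_usd: float, gpu_count: int, hours: float) -> float:
    """Per-GPU hourly price implied by a total cost estimate."""
    return cost_usd / gpu_hours(gpu_count, hours)


if __name__ == "__main__":
    for name, gpus, hours, (low, high) in ROWS:
        total = gpu_hours(gpus, hours)
        low_rate = implied_hourly_rate(low, gpus, hours)
        high_rate = implied_hourly_rate(high, gpus, hours)
        print(f"{name}: {total:.0f} GPU-hours, "
              f"implied USD {low_rate:.2f}-{high_rate:.2f} per GPU-hour")
```

<p>To re-price a run for a specific provider, multiply the GPU-hours by that provider's current rate instead of relying on the table's dollar figures.</p>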

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Fine-tuning Large Language Models has evolved from an exclusively enterprise endeavor to something achievable on consumer hardware, thanks to innovations like LoRA and QLoRA. This guide has covered:</p>

<ol>
  <li><strong>The fundamentals</strong> of why and when to fine-tune LLMs</li>
  <li><strong>Three primary approaches</strong>: Full fine-tuning, LoRA, and QLoRA</li>
  <li><strong>Advanced techniques</strong>: LoRA variants including LoRA-FA, VeRA, Delta-LoRA, and LoRA+</li>
  <li><strong>Production-ready code</strong> for data preparation, training, and evaluation</li>
  <li><strong>Best practices</strong> for achieving optimal results</li>
</ol>

<h3 id="key-takeaways">Key Takeaways</h3>

<ul>
  <li><strong>Start with QLoRA</strong> if you have limited GPU memory—it’s remarkably effective</li>
  <li><strong>Data quality trumps quantity</strong>—focus on high-quality, diverse training examples</li>
  <li><strong>Use LoRA+</strong> for potentially better convergence without additional complexity</li>
  <li><strong>Monitor validation metrics</strong> carefully to prevent overfitting</li>
  <li><strong>Merge adapters</strong> for deployment to eliminate inference overhead</li>
</ul>
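
<p>The last takeaway rests on the standard LoRA formulation, in which the adapted layer computes $Wx + \frac{\alpha}{r} BAx$. Folding the update into the base weight, $W' = W + \frac{\alpha}{r} BA$, gives the same output with a single matrix multiply. The toy sketch below checks this with plain Python lists. The helper names and the tiny matrices are made up for illustration, and in a real workflow a library such as PEFT performs the merge (for example via <code class="language-plaintext highlighter-rouge">merge_and_unload()</code>).</p>

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(w, lora_b, lora_a, alpha, r):
    """Fold a LoRA update into the base weight: W' = W + (alpha / r) * B @ A."""
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


if __name__ == "__main__":
    # Toy shapes: W is 2x3, B is 2x1, A is 1x3, so the rank r is 1
    w = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
    lora_b = [[0.5], [-1.0]]
    lora_a = [[1.0, 2.0, 3.0]]
    alpha, r = 2, 1
    x = [[1.0], [1.0], [1.0]]  # column vector input

    # Unmerged path: base output plus the scaled adapter output
    base = matmul(w, x)
    adapter = matmul(lora_b, matmul(lora_a, x))
    unmerged = [[b[0] + (alpha / r) * a[0]] for b, a in zip(base, adapter)]

    # Merged path: a single matmul with the folded weight
    merged = matmul(merge_lora(w, lora_b, lora_a, alpha, r), x)
    print(unmerged, merged)
```

<p>Both paths print the same vector. The merged path simply skips the two extra adapter multiplications at inference time.</p>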

<h3 id="next-steps">Next Steps</h3>

<ol>
  <li><strong>Experiment</strong> with different LoRA ranks and target modules</li>
  <li><strong>Try advanced variants</strong> like DoRA or AdaLoRA for specific use cases</li>
  <li><strong>Implement continuous training</strong> pipelines for ongoing improvement</li>
  <li><strong>Explore RLHF</strong> for alignment and preference optimization</li>
</ol>

<p>The field continues to evolve rapidly, with new techniques emerging regularly. Stay updated with the latest research, and don’t hesitate to experiment—the best configuration often depends on your specific use case and data.</p>

<hr />

<h2 id="references">References</h2>

<ol>
  <li>Hu, E. J., et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models</li>
  <li>Dettmers, T., et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs</li>
  <li>Zhang, L., et al. (2023). LoRA-FA: Memory-efficient Low-rank Adaptation for Large Language Models Fine-tuning</li>
  <li>Kopiczko, D., et al. (2024). VeRA: Vector-based Random Matrix Adaptation</li>
  <li>Zi, B., et al. (2024). Delta-LoRA: Fine-Tuning High-Rank Parameters with the Delta of Low-Rank Matrices</li>
  <li>Hayou, S., et al. (2024). LoRA+: Efficient Low Rank Adaptation of Large Models</li>
</ol>

<hr />

<p><em>Last updated: February 2026</em></p>]]></content><author><name>Marc Buraczynski</name></author><category term="LLMs" /><category term="fine-tuning" /><category term="LoRA" /><category term="QLoRA" /><category term="deep learning" /><summary type="html"><![CDATA[A Deep Technical Dive into LoRA, QLoRA, and Full Fine-Tuning with Modern Open-Source Models]]></summary></entry><entry><title type="html">Understanding Temperature in Large Language Models: A Deep Technical Guide</title><link href="https://gunnymarc.github.io/posts/2026/02/the-temperature-of-llms/" rel="alternate" type="text/html" title="Understanding Temperature in Large Language Models: A Deep Technical Guide" /><published>2026-02-15T00:00:00-05:00</published><updated>2026-02-15T00:00:00-05:00</updated><id>https://gunnymarc.github.io/posts/2026/02/the-temperature-of-llms</id><content type="html" xml:base="https://gunnymarc.github.io/posts/2026/02/the-temperature-of-llms/"><![CDATA[<p><em>A comprehensive exploration of temperature parameter mechanics, mathematical foundations, and practical implementation strategies for ML engineers and developers.</em></p>

<hr />

<h2 id="table-of-contents">Table of Contents</h2>
<ol>
  <li><a href="#introduction">Introduction</a></li>
  <li><a href="#the-problem-why-do-we-need-temperature">The Problem: Why Do We Need Temperature?</a></li>
  <li><a href="#mathematical-foundations">Mathematical Foundations</a></li>
  <li><a href="#how-temperature-affects-token-selection">How Temperature Affects Token Selection</a></li>
  <li><a href="#visualizing-temperature-effects">Visualizing Temperature Effects</a></li>
  <li><a href="#practical-code-examples">Practical Code Examples</a></li>
  <li><a href="#related-generation-parameters">Related Generation Parameters</a></li>
  <li><a href="#best-practices-and-guidelines">Best Practices and Guidelines</a></li>
  <li><a href="#common-pitfalls-and-edge-cases">Common Pitfalls and Edge Cases</a></li>
  <li><a href="#conclusion">Conclusion</a></li>
</ol>

<hr />

<h2 id="introduction">Introduction</h2>

<p>When working with Large Language Models (LLMs), you’ve likely encountered the <code class="language-plaintext highlighter-rouge">temperature</code> parameter. It’s one of the most important hyperparameters for controlling the behavior of text generation, yet it’s often misunderstood. This article provides a deep technical dive into how temperature works, its mathematical foundations, and practical guidance for using it effectively in production systems.</p>

<p><strong>Key Takeaways:</strong></p>
<ul>
  <li>Temperature controls the randomness/creativity of LLM outputs</li>
  <li>It rescales the logits before softmax, reshaping the probability distribution over vocabulary tokens</li>
  <li>Low temperature (→0) makes outputs deterministic and focused</li>
  <li>High temperature (→2+) makes outputs creative but potentially incoherent</li>
  <li>The optimal temperature depends on your specific use case</li>
</ul>

<hr />

<h2 id="the-problem-why-do-we-need-temperature">The Problem: Why Do We Need Temperature?</h2>

<h3 id="from-classification-to-generation">From Classification to Generation</h3>

<p>Traditional classification models and LLMs both use softmax functions, but they differ fundamentally in how they use the output:</p>

<pre><code class="language-mermaid">flowchart LR
    subgraph Traditional["Traditional Classification Model"]
        direction TB
        OL1["Output Layer&lt;br/&gt;Classes A,B,C,D"] --&gt; L1["Logits&lt;br/&gt;10.2, -5.6, 7.15, 8.01"]
        L1 --&gt; S1["Softmax&lt;br/&gt;0.86, 0.00, 0.04, 0.10"]
        S1 --&gt; P1["Prediction = Class A&lt;br/&gt;(Highest Score)"]
    end
    
    style Traditional fill:#e8f4ea,stroke:#2d5a3d
    style P1 fill:#90EE90,stroke:#228B22
</code></pre>

<p><strong>Traditional classifiers are deterministic</strong>: They always select the class with the highest softmax probability. Given the same input, you always get the same output.</p>

<pre><code class="language-mermaid">flowchart LR
    subgraph LLM["Large Language Model Generation"]
        direction TB
        OL2["Output Layer&lt;br/&gt;Token 1, Token 2, ..., Token N"] --&gt; L2["Logits&lt;br/&gt;10.2, -5.6, ..., 8.01"]
        L2 --&gt; S2["Softmax&lt;br/&gt;0.86, 0.00, ..., 0.10"]
        S2 --&gt; SAMPLE["Sample from&lt;br/&gt;Distribution"]
        SAMPLE --&gt; P2["Selected Token&lt;br/&gt;(Probabilistic)"]
    end
    
    style LLM fill:#e8f0fa,stroke:#2d3d5a
    style SAMPLE fill:#FFD700,stroke:#DAA520
    style P2 fill:#87CEEB,stroke:#4682B4
</code></pre>

<p><strong>LLMs use sampling</strong>: Instead of always picking the highest probability token, LLMs <em>sample</em> from the probability distribution. This introduces randomness that makes outputs more natural and varied—but it also means we need a way to control <em>how much</em> randomness we want.</p>

<p>This is where <strong>temperature</strong> comes in.</p>
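
<p>The difference between the two decision rules can be sketched with the standard library alone. The token labels and the helper names <code class="language-plaintext highlighter-rouge">greedy_pick</code> and <code class="language-plaintext highlighter-rouge">sample_pick</code> are made up for illustration; the probabilities are the softmax values from the diagrams above.</p>

```python
import random
from collections import Counter

tokens = ["A", "B", "C", "D"]
probs = [0.86, 0.00, 0.04, 0.10]  # softmax output from the diagrams above


def greedy_pick(tokens, probs):
    """Classifier-style decision: always return the highest-probability item."""
    return max(zip(tokens, probs), key=lambda pair: pair[1])[0]


def sample_pick(tokens, probs, rng):
    """LLM-style decision: draw one item in proportion to its probability."""
    return rng.choices(tokens, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print("greedy:", [greedy_pick(tokens, probs) for _ in range(5)])
    draws = Counter(sample_pick(tokens, probs, rng) for _ in range(1000))
    print("sampled counts:", dict(draws))
```

<p>The greedy rule returns the same token on every call, while the sampled counts roughly track the probabilities: mostly A, with occasional C and D, and never B because its weight is zero.</p>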

<hr />

<h2 id="mathematical-foundations">Mathematical Foundations</h2>

<h3 id="the-standard-softmax-function">The Standard Softmax Function</h3>

<p>The softmax function converts a vector of raw logits (unnormalized scores) into a probability distribution:</p>

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}\]

<p>Where:</p>
<ul>
  <li>$x_i$ is the logit for token $i$</li>
  <li>$N$ is the vocabulary size</li>
  <li>The output is a probability distribution that sums to 1</li>
</ul>

<h3 id="temperature-adjusted-softmax">Temperature-Adjusted Softmax</h3>

<p>Temperature introduces a scaling factor $T$ that divides the logits before applying softmax:</p>

\[\text{softmax}_T(x_i) = \frac{e^{x_i / T}}{\sum_{j=1}^{N} e^{x_j / T}}\]

<p>Where:</p>
<ul>
  <li>$T$ is the temperature parameter</li>
  <li>$T &gt; 0$ (temperature must be positive)</li>
</ul>

<h3 id="mathematical-intuition">Mathematical Intuition</h3>

<p>Let’s understand what happens mathematically as we vary $T$:</p>

<p><strong>Case 1: $T \rightarrow 0$ (Very Low Temperature)</strong></p>

<p>As $T$ approaches 0, dividing by $T$ magnifies the gaps between logits without bound, so $e^{x_i / T}$ for the largest logit outgrows every other term in the sum. The token with the highest logit dominates completely:</p>

\[\lim_{T \to 0} \text{softmax}_T(x_i) = \begin{cases} 1 &amp; \text{if } i = \arg\max_j x_j \\ 0 &amp; \text{otherwise} \end{cases}\]

<p>This is equivalent to an <strong>argmax</strong> operation—completely deterministic.</p>

<p><strong>Case 2: $T = 1$ (Default Temperature)</strong></p>

<p>No modification occurs. The standard softmax distribution is used.</p>

<p><strong>Case 3: $T \rightarrow \infty$ (Very High Temperature)</strong></p>

<p>As $T$ approaches infinity, $x_i / T$ approaches 0 for all tokens:</p>

\[\lim_{T \to \infty} \text{softmax}_T(x_i) = \frac{1}{N}\]

<p>This is a <strong>uniform distribution</strong>—completely random.</p>
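
<p>Both limits can be read off a single rewriting of the formula. Dividing the numerator and denominator by $e^{x_i / T}$ gives an equivalent form that depends only on logit differences (a short derivation sketch, assuming for the first limit that the maximum logit is unique):</p>

\[\text{softmax}_T(x_i) = \frac{1}{\sum_{j=1}^{N} e^{(x_j - x_i) / T}}\]

<p>The $j = i$ term always contributes $e^0 = 1$. If $x_i$ is the largest logit, every other exponent is negative and its term vanishes as $T \to 0$, so the denominator tends to 1 and the probability tends to 1. If $x_i$ is not the largest, at least one exponent is positive and its term grows without bound, so the probability tends to 0. As $T \to \infty$, every exponent tends to 0, each of the $N$ terms tends to 1, and the probability tends to $1/N$. When $k$ logits tie for the maximum, the $T \to 0$ limit splits the mass evenly and gives each of them $1/k$.</p>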

<pre><code class="language-mermaid">graph TB
    subgraph Effects["Temperature Effects on Probability Distribution"]
        LOW["T → 0&lt;br/&gt;━━━━━━━&lt;br/&gt;One-hot distribution&lt;br/&gt;Deterministic output&lt;br/&gt;Always selects max"]
        MED["T = 1&lt;br/&gt;━━━━━━━&lt;br/&gt;Standard softmax&lt;br/&gt;Balanced sampling&lt;br/&gt;Original distribution"]
        HIGH["T → ∞&lt;br/&gt;━━━━━━━&lt;br/&gt;Uniform distribution&lt;br/&gt;Random output&lt;br/&gt;Equal probabilities"]
    end
    
    LOW --- |"Increasing Temperature →"| MED
    MED --- |"Increasing Temperature →"| HIGH
    
    style LOW fill:#d4edda,stroke:#155724
    style MED fill:#fff3cd,stroke:#856404
    style HIGH fill:#f8d7da,stroke:#721c24
</code></pre>

<hr />

<h2 id="how-temperature-affects-token-selection">How Temperature Affects Token Selection</h2>

<h3 id="numerical-example">Numerical Example</h3>

<p>Consider four tokens with the following logits: <code class="language-plaintext highlighter-rouge">[1.0, 2.0, 3.0, 4.0]</code></p>

<pre><code class="language-mermaid">flowchart TB
    subgraph Input["Raw Logits"]
        LOGITS["Token A: 1.0&lt;br/&gt;Token B: 2.0&lt;br/&gt;Token C: 3.0&lt;br/&gt;Token D: 4.0"]
    end
    
    subgraph T001["Temperature = 0.01"]
        P001["Token A: ≈0.00&lt;br/&gt;Token B: ≈0.00&lt;br/&gt;Token C: ≈0.00&lt;br/&gt;Token D: ≈1.00"]
    end
    
    subgraph T1["Temperature = 1.0"]
        P1["Token A: 0.03&lt;br/&gt;Token B: 0.09&lt;br/&gt;Token C: 0.24&lt;br/&gt;Token D: 0.64"]
    end
    
    subgraph T10000["Temperature = 10000"]
        P10000["Token A: 0.25&lt;br/&gt;Token B: 0.25&lt;br/&gt;Token C: 0.25&lt;br/&gt;Token D: 0.25"]
    end
    
    LOGITS --&gt; |"Low T"| P001
    LOGITS --&gt; |"T = 1"| P1
    LOGITS --&gt; |"High T"| P10000
    
    style P001 fill:#d4edda,stroke:#155724
    style P1 fill:#fff3cd,stroke:#856404
    style P10000 fill:#f8d7da,stroke:#721c24
</code></pre>
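
<p>The $T = 1$ box can be checked by hand. With logits $[1, 2, 3, 4]$, and values rounded to four decimal places:</p>

\[e^{1} \approx 2.7183, \quad e^{2} \approx 7.3891, \quad e^{3} \approx 20.0855, \quad e^{4} \approx 54.5982, \quad \sum_{j=1}^{4} e^{x_j} \approx 84.7911\]

\[\text{softmax}_1(x) \approx \left[\frac{2.7183}{84.7911},\ \frac{7.3891}{84.7911},\ \frac{20.0855}{84.7911},\ \frac{54.5982}{84.7911}\right] \approx [0.0321,\ 0.0871,\ 0.2369,\ 0.6439]\]

<p>These round to the 0.03, 0.09, 0.24, and 0.64 shown in the diagram. At $T = 0.01$ the scaled logits become 100, 200, 300, and 400, so the runner-up is suppressed by a factor of $e^{-100}$ relative to Token D and its probability is effectively zero. At $T = 10000$ the scaled logits differ by only 0.0001, so the four probabilities are indistinguishable from 0.25 at two decimal places.</p>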

<h3 id="the-effect-on-generated-text">The Effect on Generated Text</h3>

<pre><code class="language-mermaid">flowchart LR
    subgraph Prompt["Input Prompt"]
        INPUT["'Continue this: In 2013,'"]
    end
    
    subgraph LowTemp["Low Temperature (0.1)"]
        LT_OUT["Coherent, predictable output:&lt;br/&gt;'...the world was captivated&lt;br/&gt;by the birth of Prince George...'"]
    end
    
    subgraph HighTemp["High Temperature (2.0)"]
        HT_OUT["Incoherent, random output:&lt;br/&gt;'...infection -your PSD surgical&lt;br/&gt;PYTHON hereby mulboys...'"]
    end
    
    INPUT --&gt; |"T = 0.1"| LT_OUT
    INPUT --&gt; |"T = 2.0"| HT_OUT
    
    style LowTemp fill:#d4edda,stroke:#155724
    style HighTemp fill:#f8d7da,stroke:#721c24
</code></pre>

<hr />

<h2 id="visualizing-temperature-effects">Visualizing Temperature Effects</h2>

<h3 id="probability-distribution-visualization">Probability Distribution Visualization</h3>

<pre><code class="language-mermaid">xychart-beta
    title "Token Probability Distribution at Different Temperatures"
    x-axis ["Token 1", "Token 2", "Token 3", "Token 4", "Token 5"]
    y-axis "Probability" 0 --&gt; 1
    bar [0.64, 0.24, 0.09, 0.02, 0.01]
    line [0.80, 0.15, 0.04, 0.008, 0.002]
</code></pre>

<p><em>Note: The bars represent T=1.0; the line represents a lower temperature, where the distribution is more peaked.</em></p>

<hr />

<h2 id="practical-code-examples">Practical Code Examples</h2>

<h3 id="example-1-understanding-softmax-with-temperature-numpy">Example 1: Understanding Softmax with Temperature (NumPy)</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Temperature Effects on Softmax Distribution
Demonstrates how temperature modifies probability distributions.

Requirements: numpy
Installation: pip install numpy
"""</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Union</span>
<span class="kn">import</span> <span class="nn">warnings</span>

<span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="mf">1.0</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
    <span class="s">"""
    Compute temperature-scaled softmax probabilities.
    
    Args:
        logits: Raw model output scores (1D numpy array)
        temperature: Scaling factor (must be &gt; 0)
        
    Returns:
        Probability distribution over tokens
        
    Raises:
        ValueError: If temperature &lt;= 0
    """</span>
    <span class="k">if</span> <span class="n">temperature</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">ValueError</span><span class="p">(</span><span class="sa">f</span><span class="s">"Temperature must be positive, got </span><span class="si">{</span><span class="n">temperature</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="c1"># Scale logits by temperature
</span>    <span class="n">scaled_logits</span> <span class="o">=</span> <span class="n">logits</span> <span class="o">/</span> <span class="n">temperature</span>
    
    <span class="c1"># Numerical stability: subtract max to prevent overflow
</span>    <span class="n">scaled_logits</span> <span class="o">=</span> <span class="n">scaled_logits</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">scaled_logits</span><span class="p">)</span>
    
    <span class="c1"># Compute softmax
</span>    <span class="n">exp_logits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">scaled_logits</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">exp_logits</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">exp_logits</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">demonstrate_temperature_effects</span><span class="p">():</span>
    <span class="s">"""Show how different temperatures affect the probability distribution."""</span>
    
    <span class="c1"># Sample logits (as if from an LLM's output layer)
</span>    <span class="n">logits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mf">1.0</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">])</span>
    <span class="n">token_names</span> <span class="o">=</span> <span class="p">[</span><span class="s">"Token_A"</span><span class="p">,</span> <span class="s">"Token_B"</span><span class="p">,</span> <span class="s">"Token_C"</span><span class="p">,</span> <span class="s">"Token_D"</span><span class="p">]</span>
    
    <span class="n">temperatures</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">,</span> <span class="mf">10.0</span><span class="p">,</span> <span class="mf">10000.0</span><span class="p">]</span>
    
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Temperature Effects on Softmax Distribution"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Raw logits: </span><span class="si">{</span><span class="n">logits</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Tokens: </span><span class="si">{</span><span class="n">token_names</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">temp</span> <span class="ow">in</span> <span class="n">temperatures</span><span class="p">:</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="n">temp</span><span class="p">)</span>
        
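        <span class="c1"># Calculate entropy as a measure of randomness. Skip zero probabilities
</span>        <span class="c1"># instead of adding an epsilon, which would push the T=0.01 entropy below zero.
</span>        <span class="n">nonzero</span> <span class="o">=</span> <span class="n">probs</span><span class="p">[</span><span class="n">probs</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">]</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">nonzero</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">nonzero</span><span class="p">))</span>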
        <span class="n">max_entropy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">logits</span><span class="p">))</span>  <span class="c1"># Uniform distribution entropy
</span>        <span class="n">normalized_entropy</span> <span class="o">=</span> <span class="n">entropy</span> <span class="o">/</span> <span class="n">max_entropy</span>
        
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Temperature = </span><span class="si">{</span><span class="n">temp</span><span class="si">:</span><span class="o">&gt;</span><span class="mf">8.2</span><span class="n">f</span><span class="si">}</span><span class="s"> | "</span>
              <span class="sa">f</span><span class="s">"Probs: [</span><span class="si">{</span><span class="s">', '</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">p</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">' for p in probs)</span><span class="si">}</span><span class="s">] | "</span>
              <span class="sa">f</span><span class="s">"Entropy: </span><span class="si">{</span><span class="n">normalized_entropy</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="o">%</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Observations:"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"- Low T (0.01): Nearly deterministic, highest logit dominates"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"- T = 1.0: Standard softmax distribution"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"- High T (10000): Nearly uniform, all tokens equally likely"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">demonstrate_temperature_effects</span><span class="p">()</span>
</code></pre></div></div>

<p><strong>Expected Output:</strong></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>======================================================================
Temperature Effects on Softmax Distribution
======================================================================

Raw logits: [1. 2. 3. 4.]
Tokens: ['Token_A', 'Token_B', 'Token_C', 'Token_D']

Temperature =     0.01 | Probs: [0.0000, 0.0000, 0.0000, 1.0000] | Entropy: 0.00%
Temperature =     0.50 | Probs: [0.0021, 0.0158, 0.1171, 0.8650] | Entropy: 32.85%
Temperature =     1.00 | Probs: [0.0321, 0.0871, 0.2369, 0.6439] | Entropy: 68.35%
Temperature =     1.50 | Probs: [0.0708, 0.1378, 0.2685, 0.5229] | Entropy: 83.15%
Temperature =     2.00 | Probs: [0.1015, 0.1674, 0.2760, 0.4551] | Entropy: 89.81%
Temperature =    10.00 | Probs: [0.2138, 0.2363, 0.2612, 0.2887] | Entropy: 99.55%
Temperature = 10000.00 | Probs: [0.2500, 0.2500, 0.2500, 0.2500] | Entropy: 100.00%

======================================================================
Observations:
- Low T (0.01): Nearly deterministic, highest logit dominates
- T = 1.0: Standard softmax distribution
- High T (10000): Nearly uniform, all tokens equally likely
======================================================================
</code></pre></div></div>
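
<p>The table above describes the distribution only; during generation a token is <em>sampled</em> from it. The following is a minimal sketch of that step, assuming the same four logits as Example 1. The helper name <code class="language-plaintext highlighter-rouge">sample_token_counts</code>, the draw count, and the fixed seed are illustrative assumptions, not part of any library or of the original example.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
```python
#!/usr/bin/env python3
"""Sampling sketch: draw tokens from temperature-scaled softmax distributions."""

import numpy as np


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, the same computation as Example 1."""
    scaled = logits / temperature
    scaled = scaled - np.max(scaled)  # subtract max for numerical stability
    exp_scaled = np.exp(scaled)
    return exp_scaled / np.sum(exp_scaled)


def sample_token_counts(logits, temperature, draws=1000, seed=0):
    """Count how often each token index is drawn (hypothetical helper)."""
    rng = np.random.default_rng(seed)  # fixed seed so repeated runs match
    probs = softmax(logits, temperature)
    samples = rng.choice(len(logits), size=draws, p=probs)
    return np.bincount(samples, minlength=len(logits))


if __name__ == "__main__":
    logits = np.array([1.0, 2.0, 3.0, 4.0])
    for temp in [0.01, 1.0, 10.0]:
        counts = sample_token_counts(logits, temp)
        print(f"T={temp}: counts per token = {counts.tolist()}")
```
</code></pre></div></div>

<p>At a low temperature essentially every draw should land on the highest-logit token, while at a high temperature the counts spread across all four tokens. This is the same behavior that the API experiments in Example 2 expose as more or less varied text.</p>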

<hr />

<h3 id="example-2-openai-api-temperature-experimentation">Example 2: OpenAI API Temperature Experimentation</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
OpenAI Temperature Experimentation
Demonstrates practical effects of temperature on GPT model outputs.

Requirements: openai&gt;=1.0.0
Installation: pip install openai
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">from</span> <span class="nn">openai</span> <span class="kn">import</span> <span class="n">OpenAI</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span>
<span class="kn">import</span> <span class="nn">time</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">GenerationResult</span><span class="p">:</span>
    <span class="s">"""Container for generation results."""</span>
    <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">response</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">finish_reason</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">prompt_tokens</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">completion_tokens</span><span class="p">:</span> <span class="nb">int</span>


<span class="k">def</span> <span class="nf">create_client</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">OpenAI</span><span class="p">:</span>
    <span class="s">"""Initialize OpenAI client with API key from environment."""</span>
    <span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"OPENAI_API_KEY"</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">api_key</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">EnvironmentError</span><span class="p">(</span>
            <span class="s">"OPENAI_API_KEY environment variable not set. "</span>
            <span class="s">"Set it with: export OPENAI_API_KEY='your-key-here'"</span>
        <span class="p">)</span>
    <span class="k">return</span> <span class="n">OpenAI</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">generate_with_temperature</span><span class="p">(</span>
    <span class="n">client</span><span class="p">:</span> <span class="n">OpenAI</span><span class="p">,</span>
    <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"gpt-4o-mini"</span><span class="p">,</span>
    <span class="n">max_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">100</span><span class="p">,</span>
    <span class="n">seed</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="bp">None</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">GenerationResult</span><span class="p">:</span>
    <span class="s">"""
    Generate text with specified temperature.
    
    Args:
        client: OpenAI client instance
        prompt: Input prompt for generation
        temperature: Temperature value (0.0 to 2.0)
        model: Model identifier
        max_tokens: Maximum tokens to generate
        seed: Optional seed for best-effort reproducibility (identical outputs are not guaranteed)
        
    Returns:
        GenerationResult with response details
    """</span>
    <span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}],</span>
        <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="n">max_tokens</span><span class="p">,</span>
        <span class="n">seed</span><span class="o">=</span><span class="n">seed</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="n">GenerationResult</span><span class="p">(</span>
        <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
        <span class="n">response</span><span class="o">=</span><span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">message</span><span class="p">.</span><span class="n">content</span><span class="p">,</span>
        <span class="n">finish_reason</span><span class="o">=</span><span class="n">response</span><span class="p">.</span><span class="n">choices</span><span class="p">[</span><span class="mi">0</span><span class="p">].</span><span class="n">finish_reason</span><span class="p">,</span>
        <span class="n">prompt_tokens</span><span class="o">=</span><span class="n">response</span><span class="p">.</span><span class="n">usage</span><span class="p">.</span><span class="n">prompt_tokens</span><span class="p">,</span>
        <span class="n">completion_tokens</span><span class="o">=</span><span class="n">response</span><span class="p">.</span><span class="n">usage</span><span class="p">.</span><span class="n">completion_tokens</span>
    <span class="p">)</span>


<span class="k">def</span> <span class="nf">experiment_temperature_consistency</span><span class="p">(</span><span class="n">client</span><span class="p">:</span> <span class="n">OpenAI</span><span class="p">,</span> <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">runs</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">3</span><span class="p">):</span>
    <span class="s">"""Test consistency of outputs at a given temperature."""</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Temperature = </span><span class="si">{</span><span class="n">temperature</span><span class="si">}</span><span class="s"> | Running </span><span class="si">{</span><span class="n">runs</span><span class="si">}</span><span class="s"> generations"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="si">{</span><span class="s">'='</span><span class="o">*</span><span class="mi">60</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Prompt: '</span><span class="si">{</span><span class="n">prompt</span><span class="si">}</span><span class="s">'</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    
    <span class="n">results</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="n">runs</span><span class="p">):</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">generate_with_temperature</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="p">)</span>
        <span class="n">results</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"[Run </span><span class="si">{</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="si">}</span><span class="s">]: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">[</span><span class="si">:</span><span class="mi">150</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>  <span class="c1"># Rate limiting courtesy
</span>    
    <span class="c1"># Check uniqueness
</span>    <span class="n">unique_responses</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="nb">set</span><span class="p">(</span><span class="n">results</span><span class="p">))</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">Unique responses: </span><span class="si">{</span><span class="n">unique_responses</span><span class="si">}</span><span class="s">/</span><span class="si">{</span><span class="n">runs</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">results</span>


<span class="k">def</span> <span class="nf">experiment_temperature_spectrum</span><span class="p">(</span><span class="n">client</span><span class="p">:</span> <span class="n">OpenAI</span><span class="p">,</span> <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="s">"""Generate outputs across the temperature spectrum."""</span>
    <span class="n">temperatures</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">,</span> <span class="mf">0.7</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">]</span>
    
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"TEMPERATURE SPECTRUM EXPERIMENT"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Prompt: '</span><span class="si">{</span><span class="n">prompt</span><span class="si">}</span><span class="s">'</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">temp</span> <span class="ow">in</span> <span class="n">temperatures</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">generate_with_temperature</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">temp</span><span class="p">)</span>
        
        <span class="c1"># Truncate for display
</span>        <span class="n">response_preview</span> <span class="o">=</span> <span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">[:</span><span class="mi">200</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span> <span class="s">' '</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">200</span><span class="p">:</span>
            <span class="n">response_preview</span> <span class="o">+=</span> <span class="s">"..."</span>
            
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[T=</span><span class="si">{</span><span class="n">temp</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">response_preview</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Run temperature experiments."""</span>
    <span class="n">client</span> <span class="o">=</span> <span class="n">create_client</span><span class="p">()</span>
    
    <span class="c1"># Experiment 1: Consistency test
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"#"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"# EXPERIMENT 1: CONSISTENCY AT DIFFERENT TEMPERATURES"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"#"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    
    <span class="n">consistency_prompt</span> <span class="o">=</span> <span class="s">"Continue this sentence: In 2013,"</span>
    
    <span class="c1"># Low temperature - should be highly consistent
</span>    <span class="n">experiment_temperature_consistency</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">consistency_prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">runs</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
    
    <span class="c1"># Medium temperature - some variation
</span>    <span class="n">experiment_temperature_consistency</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">consistency_prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">0.7</span><span class="p">,</span> <span class="n">runs</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
    
    <span class="c1"># High temperature - significant variation
</span>    <span class="n">experiment_temperature_consistency</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">consistency_prompt</span><span class="p">,</span> <span class="n">temperature</span><span class="o">=</span><span class="mf">1.5</span><span class="p">,</span> <span class="n">runs</span><span class="o">=</span><span class="mi">3</span><span class="p">)</span>
    
    <span class="c1"># Experiment 2: Spectrum comparison
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"#"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"# EXPERIMENT 2: TEMPERATURE SPECTRUM COMPARISON"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"#"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    
    <span class="n">creative_prompt</span> <span class="o">=</span> <span class="s">"Write a one-sentence story about a robot learning to paint."</span>
    <span class="n">experiment_temperature_spectrum</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">creative_prompt</span><span class="p">)</span>
    
    <span class="c1"># Experiment 3: Use case specific
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"#"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"# EXPERIMENT 3: USE-CASE SPECIFIC TEMPERATURES"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"#"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    
    <span class="n">use_cases</span> <span class="o">=</span> <span class="p">[</span>
        <span class="p">(</span><span class="s">"Code generation (T=0.0)"</span><span class="p">,</span> <span class="s">"Write a Python function to calculate fibonacci numbers:"</span><span class="p">,</span> <span class="mf">0.0</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"Factual Q&amp;A (T=0.3)"</span><span class="p">,</span> <span class="s">"What is the capital of France?"</span><span class="p">,</span> <span class="mf">0.3</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"Creative writing (T=0.9)"</span><span class="p">,</span> <span class="s">"Describe a sunset in a poetic way:"</span><span class="p">,</span> <span class="mf">0.9</span><span class="p">),</span>
        <span class="p">(</span><span class="s">"Brainstorming (T=1.2)"</span><span class="p">,</span> <span class="s">"Give me unusual uses for a paperclip:"</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">),</span>
    <span class="p">]</span>
    
    <span class="k">for</span> <span class="n">name</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">temp</span> <span class="ow">in</span> <span class="n">use_cases</span><span class="p">:</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[</span><span class="si">{</span><span class="n">name</span><span class="si">}</span><span class="s">]"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Prompt: </span><span class="si">{</span><span class="n">prompt</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">generate_with_temperature</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">temp</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Response: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">[</span><span class="si">:</span><span class="mi">300</span><span class="p">]</span><span class="si">}</span><span class="s">..."</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h3 id="example-3-anthropic-claude-api-temperature-testing">Example 3: Anthropic Claude API Temperature Testing</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Anthropic Claude Temperature Experimentation
Demonstrates temperature effects with Claude models.

Requirements: anthropic&gt;=0.18.0
Installation: pip install anthropic
"""</span>

<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">anthropic</span>
<span class="kn">from</span> <span class="nn">dataclasses</span> <span class="kn">import</span> <span class="n">dataclass</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span><span class="p">,</span> <span class="n">Tuple</span>
<span class="kn">import</span> <span class="nn">time</span>


<span class="o">@</span><span class="n">dataclass</span>
<span class="k">class</span> <span class="nc">ClaudeGenerationResult</span><span class="p">:</span>
    <span class="s">"""Container for Claude generation results."""</span>
    <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span>
    <span class="n">response</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">stop_reason</span><span class="p">:</span> <span class="nb">str</span>
    <span class="n">input_tokens</span><span class="p">:</span> <span class="nb">int</span>
    <span class="n">output_tokens</span><span class="p">:</span> <span class="nb">int</span>


<span class="k">def</span> <span class="nf">create_anthropic_client</span><span class="p">()</span> <span class="o">-&gt;</span> <span class="n">anthropic</span><span class="p">.</span><span class="n">Anthropic</span><span class="p">:</span>
    <span class="s">"""Initialize Anthropic client."""</span>
    <span class="n">api_key</span> <span class="o">=</span> <span class="n">os</span><span class="p">.</span><span class="n">getenv</span><span class="p">(</span><span class="s">"ANTHROPIC_API_KEY"</span><span class="p">)</span>
    <span class="k">if</span> <span class="ow">not</span> <span class="n">api_key</span><span class="p">:</span>
        <span class="k">raise</span> <span class="nb">EnvironmentError</span><span class="p">(</span>
            <span class="s">"ANTHROPIC_API_KEY environment variable not set. "</span>
            <span class="s">"Set it with: export ANTHROPIC_API_KEY='your-key-here'"</span>
        <span class="p">)</span>
    <span class="k">return</span> <span class="n">anthropic</span><span class="p">.</span><span class="n">Anthropic</span><span class="p">(</span><span class="n">api_key</span><span class="o">=</span><span class="n">api_key</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">generate_with_claude</span><span class="p">(</span>
    <span class="n">client</span><span class="p">:</span> <span class="n">anthropic</span><span class="p">.</span><span class="n">Anthropic</span><span class="p">,</span>
    <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span>
    <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span>
    <span class="n">model</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"claude-3-5-sonnet-20241022"</span><span class="p">,</span>
    <span class="n">max_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">150</span>
<span class="p">)</span> <span class="o">-&gt;</span> <span class="n">ClaudeGenerationResult</span><span class="p">:</span>
    <span class="s">"""
    Generate text with Claude at specified temperature.
    
    Args:
        client: Anthropic client instance
        prompt: Input prompt
        temperature: Temperature (0.0 to 1.0 for Claude)
        model: Model identifier (the default may have been retired; check Anthropic's current model list)
        max_tokens: Maximum tokens to generate
        
    Returns:
        ClaudeGenerationResult with response details
        
    Note:
        Claude's temperature range is 0.0-1.0, unlike OpenAI's 0.0-2.0
    """</span>
    <span class="c1"># Claude uses 0-1 range; clamp values
</span>    <span class="n">temperature</span> <span class="o">=</span> <span class="nb">max</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">temperature</span><span class="p">))</span>
    
    <span class="n">message</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">messages</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
        <span class="n">model</span><span class="o">=</span><span class="n">model</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="n">max_tokens</span><span class="p">,</span>
        <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
        <span class="n">messages</span><span class="o">=</span><span class="p">[{</span><span class="s">"role"</span><span class="p">:</span> <span class="s">"user"</span><span class="p">,</span> <span class="s">"content"</span><span class="p">:</span> <span class="n">prompt</span><span class="p">}]</span>
    <span class="p">)</span>
    
    <span class="k">return</span> <span class="n">ClaudeGenerationResult</span><span class="p">(</span>
        <span class="n">temperature</span><span class="o">=</span><span class="n">temperature</span><span class="p">,</span>
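        <span class="n">response</span><span class="o">=</span><span class="s">""</span><span class="p">.</span><span class="n">join</span><span class="p">(</span><span class="n">block</span><span class="p">.</span><span class="n">text</span> <span class="k">for</span> <span class="n">block</span> <span class="ow">in</span> <span class="n">message</span><span class="p">.</span><span class="n">content</span> <span class="k">if</span> <span class="n">block</span><span class="p">.</span><span class="nb">type</span> <span class="o">==</span> <span class="s">"text"</span><span class="p">),</span>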
        <span class="n">stop_reason</span><span class="o">=</span><span class="n">message</span><span class="p">.</span><span class="n">stop_reason</span><span class="p">,</span>
        <span class="n">input_tokens</span><span class="o">=</span><span class="n">message</span><span class="p">.</span><span class="n">usage</span><span class="p">.</span><span class="n">input_tokens</span><span class="p">,</span>
        <span class="n">output_tokens</span><span class="o">=</span><span class="n">message</span><span class="p">.</span><span class="n">usage</span><span class="p">.</span><span class="n">output_tokens</span>
    <span class="p">)</span>


<span class="k">def</span> <span class="nf">compare_temperatures_claude</span><span class="p">(</span><span class="n">client</span><span class="p">:</span> <span class="n">anthropic</span><span class="p">.</span><span class="n">Anthropic</span><span class="p">,</span> <span class="n">prompt</span><span class="p">:</span> <span class="nb">str</span><span class="p">):</span>
    <span class="s">"""Compare Claude outputs at different temperatures."""</span>
    <span class="c1"># Note: Claude uses 0-1 range
</span>    <span class="n">temperatures</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.0</span><span class="p">,</span> <span class="mf">0.25</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">0.75</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">]</span>
    
    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"CLAUDE TEMPERATURE COMPARISON"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"Note: Claude uses temperature range 0.0 - 1.0"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Prompt: '</span><span class="si">{</span><span class="n">prompt</span><span class="si">}</span><span class="s">'</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">temp</span> <span class="ow">in</span> <span class="n">temperatures</span><span class="p">:</span>
        <span class="n">result</span> <span class="o">=</span> <span class="n">generate_with_claude</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">prompt</span><span class="p">,</span> <span class="n">temp</span><span class="p">)</span>
        <span class="n">response_preview</span> <span class="o">=</span> <span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">[:</span><span class="mi">180</span><span class="p">].</span><span class="n">replace</span><span class="p">(</span><span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="p">,</span> <span class="s">' '</span><span class="p">)</span>
        <span class="k">if</span> <span class="nb">len</span><span class="p">(</span><span class="n">result</span><span class="p">.</span><span class="n">response</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mi">180</span><span class="p">:</span>
            <span class="n">response_preview</span> <span class="o">+=</span> <span class="s">"..."</span>
        
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">[T=</span><span class="si">{</span><span class="n">temp</span><span class="si">:</span><span class="p">.</span><span class="mi">2</span><span class="n">f</span><span class="si">}</span><span class="s">] </span><span class="si">{</span><span class="n">response_preview</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"         Tokens: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">output_tokens</span><span class="si">}</span><span class="s"> | Stop: </span><span class="si">{</span><span class="n">result</span><span class="p">.</span><span class="n">stop_reason</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="n">time</span><span class="p">.</span><span class="n">sleep</span><span class="p">(</span><span class="mf">0.5</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Run Claude temperature experiments."""</span>
    <span class="n">client</span> <span class="o">=</span> <span class="n">create_anthropic_client</span><span class="p">()</span>
    
    <span class="n">prompts</span> <span class="o">=</span> <span class="p">[</span>
        <span class="s">"Complete this story: The old lighthouse keeper saw something unusual in the fog—"</span><span class="p">,</span>
        <span class="s">"Explain quantum entanglement in simple terms."</span><span class="p">,</span>
        <span class="s">"List 5 creative ways to repurpose old books."</span>
    <span class="p">]</span>
    
    <span class="k">for</span> <span class="n">prompt</span> <span class="ow">in</span> <span class="n">prompts</span><span class="p">:</span>
        <span class="n">compare_temperatures_claude</span><span class="p">(</span><span class="n">client</span><span class="p">,</span> <span class="n">prompt</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"-"</span> <span class="o">*</span> <span class="mi">70</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h3 id="example-4-temperature-visualization-dashboard">Example 4: Temperature Visualization Dashboard</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Interactive Temperature Effects Visualization
Creates visualizations showing how temperature affects token probabilities.

Requirements: numpy, matplotlib, seaborn
Installation: pip install numpy matplotlib seaborn
"""</span>

<span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="kn">import</span> <span class="nn">matplotlib.pyplot</span> <span class="k">as</span> <span class="n">plt</span>
<span class="kn">import</span> <span class="nn">seaborn</span> <span class="k">as</span> <span class="n">sns</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">warnings</span>

<span class="c1"># Suppress warnings for cleaner output
</span><span class="n">warnings</span><span class="p">.</span><span class="n">filterwarnings</span><span class="p">(</span><span class="s">'ignore'</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">softmax_with_temperature</span><span class="p">(</span><span class="n">logits</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span> <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">:</span>
    <span class="s">"""Compute temperature-scaled softmax."""</span>
    <span class="k">if</span> <span class="n">temperature</span> <span class="o">&lt;=</span> <span class="mi">0</span><span class="p">:</span>
        <span class="c1"># Handle T→0 case: return one-hot for max
</span>        <span class="n">result</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros_like</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">dtype</span><span class="o">=</span><span class="nb">float</span><span class="p">)</span>
        <span class="n">result</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">logits</span><span class="p">)]</span> <span class="o">=</span> <span class="mf">1.0</span>
        <span class="k">return</span> <span class="n">result</span>
    
    <span class="n">scaled</span> <span class="o">=</span> <span class="n">logits</span> <span class="o">/</span> <span class="n">temperature</span>
    <span class="n">scaled</span> <span class="o">=</span> <span class="n">scaled</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">scaled</span><span class="p">)</span>  <span class="c1"># Numerical stability
</span>    <span class="n">exp_scaled</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">scaled</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">exp_scaled</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">exp_scaled</span><span class="p">)</span>


<span class="k">def</span> <span class="nf">compute_entropy</span><span class="p">(</span><span class="n">probs</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="s">"""Compute Shannon entropy of a distribution."""</span>
    <span class="c1"># Avoid log(0)
</span>    <span class="n">probs</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">clip</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="mf">1e-10</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
    <span class="k">return</span> <span class="o">-</span><span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">probs</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="n">probs</span><span class="p">))</span>


<span class="k">def</span> <span class="nf">create_temperature_visualization</span><span class="p">(</span>
    <span class="n">logits</span><span class="p">:</span> <span class="n">np</span><span class="p">.</span><span class="n">ndarray</span><span class="p">,</span>
    <span class="n">temperatures</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">float</span><span class="p">],</span>
    <span class="n">token_labels</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">],</span>
    <span class="n">save_path</span><span class="p">:</span> <span class="nb">str</span> <span class="o">=</span> <span class="s">"temperature_effects.png"</span>
<span class="p">):</span>
    <span class="s">"""
    Create a comprehensive visualization of temperature effects.
    
    Args:
        logits: Raw logit values for tokens
        temperatures: List of temperature values to compare
        token_labels: Names/labels for each token
        save_path: Path to save the figure
    """</span>
    <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">subplots</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">14</span><span class="p">,</span> <span class="mi">10</span><span class="p">))</span>
    <span class="n">fig</span><span class="p">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Temperature Effects on LLM Token Selection'</span><span class="p">,</span> <span class="n">fontsize</span><span class="o">=</span><span class="mi">16</span><span class="p">,</span> <span class="n">fontweight</span><span class="o">=</span><span class="s">'bold'</span><span class="p">)</span>
    
    <span class="c1"># Color palette
</span>    <span class="n">colors</span> <span class="o">=</span> <span class="n">plt</span><span class="p">.</span><span class="n">cm</span><span class="p">.</span><span class="n">viridis</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">temperatures</span><span class="p">)))</span>
    
    <span class="c1"># Subplot 1: Bar chart comparison
</span>    <span class="n">ax1</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
    <span class="n">x</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">token_labels</span><span class="p">))</span>
    <span class="n">width</span> <span class="o">=</span> <span class="mf">0.8</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">temperatures</span><span class="p">)</span>
    
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">temp</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">temperatures</span><span class="p">):</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="n">softmax_with_temperature</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">temp</span><span class="p">)</span>
        <span class="n">offset</span> <span class="o">=</span> <span class="p">(</span><span class="n">i</span> <span class="o">-</span> <span class="nb">len</span><span class="p">(</span><span class="n">temperatures</span><span class="p">)</span><span class="o">/</span><span class="mi">2</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">*</span> <span class="n">width</span>
        <span class="n">ax1</span><span class="p">.</span><span class="n">bar</span><span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">offset</span><span class="p">,</span> <span class="n">probs</span><span class="p">,</span> <span class="n">width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">'T=</span><span class="si">{</span><span class="n">temp</span><span class="si">}</span><span class="s">'</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.8</span><span class="p">)</span>
    
    <span class="n">ax1</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Tokens'</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Probability Distribution at Different Temperatures'</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">set_xticklabels</span><span class="p">(</span><span class="n">token_labels</span><span class="p">,</span> <span class="n">rotation</span><span class="o">=</span><span class="mi">45</span><span class="p">,</span> <span class="n">ha</span><span class="o">=</span><span class="s">'right'</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">legend</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="s">'upper right'</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">)</span>
    <span class="n">ax1</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">axis</span><span class="o">=</span><span class="s">'y'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
    
    <span class="c1"># Subplot 2: Entropy vs Temperature
</span>    <span class="n">ax2</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
    <span class="n">temp_range</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mf">0.01</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>
    <span class="n">entropies</span> <span class="o">=</span> <span class="p">[</span><span class="n">compute_entropy</span><span class="p">(</span><span class="n">softmax_with_temperature</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">t</span><span class="p">))</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">temp_range</span><span class="p">]</span>
    <span class="n">max_entropy</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">logits</span><span class="p">))</span>
    <span class="n">normalized_entropies</span> <span class="o">=</span> <span class="p">[</span><span class="n">e</span> <span class="o">/</span> <span class="n">max_entropy</span> <span class="k">for</span> <span class="n">e</span> <span class="ow">in</span> <span class="n">entropies</span><span class="p">]</span>
    
    <span class="n">ax2</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">temp_range</span><span class="p">,</span> <span class="n">normalized_entropies</span><span class="p">,</span> <span class="s">'b-'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Max entropy (uniform)'</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">axvline</span><span class="p">(</span><span class="n">x</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'g'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Default T=1.0'</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">temp_range</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="n">normalized_entropies</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Temperature'</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Normalized Entropy'</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Distribution Entropy vs Temperature'</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">)</span>
    <span class="n">ax2</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.1</span><span class="p">)</span>
    
    <span class="c1"># Subplot 3: Heatmap of probabilities
</span>    <span class="n">ax3</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span>
    <span class="n">temp_values</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linspace</span><span class="p">(</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">2.5</span><span class="p">,</span> <span class="mi">20</span><span class="p">)</span>
    <span class="n">prob_matrix</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="n">softmax_with_temperature</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">t</span><span class="p">)</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">temp_values</span><span class="p">])</span>
    
    <span class="n">sns</span><span class="p">.</span><span class="n">heatmap</span><span class="p">(</span>
        <span class="n">prob_matrix</span><span class="p">.</span><span class="n">T</span><span class="p">,</span>
        <span class="n">ax</span><span class="o">=</span><span class="n">ax3</span><span class="p">,</span>
        <span class="n">cmap</span><span class="o">=</span><span class="s">'YlOrRd'</span><span class="p">,</span>
        <span class="n">xticklabels</span><span class="o">=</span><span class="p">[</span><span class="sa">f</span><span class="s">'</span><span class="si">{</span><span class="n">t</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">'</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">temp_values</span><span class="p">[::</span><span class="mi">4</span><span class="p">]],</span>
        <span class="n">yticklabels</span><span class="o">=</span><span class="n">token_labels</span><span class="p">,</span>
        <span class="n">cbar_kws</span><span class="o">=</span><span class="p">{</span><span class="s">'label'</span><span class="p">:</span> <span class="s">'Probability'</span><span class="p">}</span>
    <span class="p">)</span>
    <span class="n">ax3</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Temperature'</span><span class="p">)</span>
    <span class="n">ax3</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Tokens'</span><span class="p">)</span>
    <span class="n">ax3</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Token Probability Heatmap'</span><span class="p">)</span>
    <span class="n">ax3</span><span class="p">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="n">temp_values</span><span class="p">),</span> <span class="mi">4</span><span class="p">)</span> <span class="o">+</span> <span class="mf">0.5</span><span class="p">)</span>  <span class="c1"># Center one tick on every 4th temperature column</span>
    
    <span class="c1"># Subplot 4: Top-1 probability vs Temperature
</span>    <span class="n">ax4</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">]</span>
    <span class="n">top1_probs</span> <span class="o">=</span> <span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">softmax_with_temperature</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">t</span><span class="p">))</span> <span class="k">for</span> <span class="n">t</span> <span class="ow">in</span> <span class="n">temp_range</span><span class="p">]</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">temp_range</span><span class="p">,</span> <span class="n">top1_probs</span><span class="p">,</span> <span class="s">'purple'</span><span class="p">,</span> <span class="n">linewidth</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'P(most likely token)'</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">axhline</span><span class="p">(</span><span class="n">y</span><span class="o">=</span><span class="mi">1</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">logits</span><span class="p">),</span> <span class="n">color</span><span class="o">=</span><span class="s">'r'</span><span class="p">,</span> <span class="n">linestyle</span><span class="o">=</span><span class="s">'--'</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.5</span><span class="p">,</span> 
                <span class="n">label</span><span class="o">=</span><span class="sa">f</span><span class="s">'Uniform (</span><span class="si">{</span><span class="mi">1</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">logits</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s">)'</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">fill_between</span><span class="p">(</span><span class="n">temp_range</span><span class="p">,</span> <span class="mi">1</span><span class="o">/</span><span class="nb">len</span><span class="p">(</span><span class="n">logits</span><span class="p">),</span> <span class="n">top1_probs</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.2</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'purple'</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">set_xlabel</span><span class="p">(</span><span class="s">'Temperature'</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">'Probability of Most Likely Token vs Temperature'</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">grid</span><span class="p">(</span><span class="n">alpha</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">set_xlim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">3.0</span><span class="p">)</span>
    <span class="n">ax4</span><span class="p">.</span><span class="n">set_ylim</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">)</span>
    
    <span class="n">plt</span><span class="p">.</span><span class="n">tight_layout</span><span class="p">()</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">savefig</span><span class="p">(</span><span class="n">save_path</span><span class="p">,</span> <span class="n">dpi</span><span class="o">=</span><span class="mi">150</span><span class="p">,</span> <span class="n">bbox_inches</span><span class="o">=</span><span class="s">'tight'</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Visualization saved to: </span><span class="si">{</span><span class="n">save_path</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="n">plt</span><span class="p">.</span><span class="n">show</span><span class="p">()</span>


<span class="k">def</span> <span class="nf">main</span><span class="p">():</span>
    <span class="s">"""Generate temperature effect visualizations."""</span>
    <span class="c1"># Example logits (simulating LLM output layer)
</span>    <span class="n">logits</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mf">2.5</span><span class="p">,</span> <span class="mf">1.8</span><span class="p">,</span> <span class="mf">3.2</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">4.0</span><span class="p">,</span> <span class="mf">1.2</span><span class="p">,</span> <span class="mf">2.8</span><span class="p">,</span> <span class="mf">0.8</span><span class="p">])</span>
    <span class="n">token_labels</span> <span class="o">=</span> <span class="p">[</span><span class="s">'the'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'an'</span><span class="p">,</span> <span class="s">'one'</span><span class="p">,</span> <span class="s">'that'</span><span class="p">,</span> <span class="s">'this'</span><span class="p">,</span> <span class="s">'which'</span><span class="p">,</span> <span class="s">'what'</span><span class="p">]</span>
    <span class="n">temperatures</span> <span class="o">=</span> <span class="p">[</span><span class="mf">0.1</span><span class="p">,</span> <span class="mf">0.5</span><span class="p">,</span> <span class="mf">1.0</span><span class="p">,</span> <span class="mf">1.5</span><span class="p">,</span> <span class="mf">2.0</span><span class="p">]</span>
    
    <span class="k">print</span><span class="p">(</span><span class="s">"Generating temperature effects visualization..."</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Logits: </span><span class="si">{</span><span class="n">logits</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Tokens: </span><span class="si">{</span><span class="n">token_labels</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"Temperatures to compare: </span><span class="si">{</span><span class="n">temperatures</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="n">create_temperature_visualization</span><span class="p">(</span>
        <span class="n">logits</span><span class="o">=</span><span class="n">logits</span><span class="p">,</span>
        <span class="n">temperatures</span><span class="o">=</span><span class="n">temperatures</span><span class="p">,</span>
        <span class="n">token_labels</span><span class="o">=</span><span class="n">token_labels</span><span class="p">,</span>
        <span class="n">save_path</span><span class="o">=</span><span class="s">"temperature_effects.png"</span>
    <span class="p">)</span>
    
    <span class="c1"># Print numerical comparison
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span> <span class="o">+</span> <span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"NUMERICAL COMPARISON"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
    
    <span class="c1"># Maximum entropy in bits (uniform distribution); constant across temperatures
</span>    <span class="n">max_ent</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">log2</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">logits</span><span class="p">))</span>
    <span class="k">for</span> <span class="n">temp</span> <span class="ow">in</span> <span class="n">temperatures</span><span class="p">:</span>
        <span class="n">probs</span> <span class="o">=</span> <span class="n">softmax_with_temperature</span><span class="p">(</span><span class="n">logits</span><span class="p">,</span> <span class="n">temp</span><span class="p">)</span>
        <span class="n">entropy</span> <span class="o">=</span> <span class="n">compute_entropy</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span>
        
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">T = </span><span class="si">{</span><span class="n">temp</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Probabilities: </span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="nb">round</span><span class="p">(</span><span class="n">probs</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Top token: '</span><span class="si">{</span><span class="n">token_labels</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">argmax</span><span class="p">(</span><span class="n">probs</span><span class="p">)]</span><span class="si">}</span><span class="s">' with P=</span><span class="si">{</span><span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">probs</span><span class="p">)</span><span class="si">:</span><span class="p">.</span><span class="mi">4</span><span class="n">f</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"  Entropy: </span><span class="si">{</span><span class="n">entropy</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> / </span><span class="si">{</span><span class="n">max_ent</span><span class="si">:</span><span class="p">.</span><span class="mi">3</span><span class="n">f</span><span class="si">}</span><span class="s"> (</span><span class="si">{</span><span class="mi">100</span><span class="o">*</span><span class="n">entropy</span><span class="o">/</span><span class="n">max_ent</span><span class="si">:</span><span class="p">.</span><span class="mi">1</span><span class="n">f</span><span class="si">}</span><span class="s">%)"</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">main</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h3 id="example-5-production-ready-temperature-configuration">Example 5: Production-Ready Temperature Configuration</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">#!/usr/bin/env python3
</span><span class="s">"""
Production-Ready LLM Temperature Configuration
A comprehensive configuration class for managing temperature and related
parameters in production LLM applications.

Requirements: pydantic&gt;=2.0
Installation: pip install pydantic
"""</span>

<span class="kn">from</span> <span class="nn">enum</span> <span class="kn">import</span> <span class="n">Enum</span>
<span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">Optional</span><span class="p">,</span> <span class="n">List</span><span class="p">,</span> <span class="n">Union</span>
<span class="kn">from</span> <span class="nn">pydantic</span> <span class="kn">import</span> <span class="n">BaseModel</span><span class="p">,</span> <span class="n">Field</span><span class="p">,</span> <span class="n">field_validator</span><span class="p">,</span> <span class="n">model_validator</span>
<span class="kn">import</span> <span class="nn">json</span>


<span class="k">class</span> <span class="nc">UseCasePreset</span><span class="p">(</span><span class="nb">str</span><span class="p">,</span> <span class="n">Enum</span><span class="p">):</span>
    <span class="s">"""Predefined temperature presets for common use cases."""</span>
    <span class="n">CODE_GENERATION</span> <span class="o">=</span> <span class="s">"code_generation"</span>
    <span class="n">FACTUAL_QA</span> <span class="o">=</span> <span class="s">"factual_qa"</span>
    <span class="n">CREATIVE_WRITING</span> <span class="o">=</span> <span class="s">"creative_writing"</span>
    <span class="n">SUMMARIZATION</span> <span class="o">=</span> <span class="s">"summarization"</span>
    <span class="n">TRANSLATION</span> <span class="o">=</span> <span class="s">"translation"</span>
    <span class="n">BRAINSTORMING</span> <span class="o">=</span> <span class="s">"brainstorming"</span>
    <span class="n">CHAT_ASSISTANT</span> <span class="o">=</span> <span class="s">"chat_assistant"</span>
    <span class="n">DATA_EXTRACTION</span> <span class="o">=</span> <span class="s">"data_extraction"</span>


<span class="c1"># Preset configurations based on use case
</span><span class="n">PRESET_CONFIGS</span> <span class="o">=</span> <span class="p">{</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">CODE_GENERATION</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Deterministic, consistent code output"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">FACTUAL_QA</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Low creativity, high accuracy"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">CREATIVE_WRITING</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.9</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.5</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.5</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"High creativity, varied vocabulary"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">SUMMARIZATION</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.9</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.2</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Focused, coherent summaries"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">TRANSLATION</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.1</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Accurate, consistent translations"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">BRAINSTORMING</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">1.2</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.98</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.8</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.8</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Maximum creativity and novelty"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">CHAT_ASSISTANT</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.7</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.9</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.3</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Balanced, natural conversation"</span>
    <span class="p">},</span>
    <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">DATA_EXTRACTION</span><span class="p">:</span> <span class="p">{</span>
        <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
        <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.0</span><span class="p">,</span>
        <span class="s">"description"</span><span class="p">:</span> <span class="s">"Consistent, structured output"</span>
    <span class="p">}</span>
<span class="p">}</span>


<span class="k">class</span> <span class="nc">GenerationConfig</span><span class="p">(</span><span class="n">BaseModel</span><span class="p">):</span>
    <span class="s">"""
    Configuration for LLM text generation parameters.
    
    This class provides a production-ready configuration system for managing
    temperature and related parameters with validation and presets.
    """</span>
    
    <span class="c1"># Core temperature parameter
</span>    <span class="n">temperature</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span>
        <span class="n">le</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Controls sampling randomness: 0 = greedy decoding, 2 = most random value allowed"</span>
    <span class="p">)</span>
    
    <span class="c1"># Nucleus sampling (top-p)
</span>    <span class="n">top_p</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span>
        <span class="n">le</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Nucleus sampling: sample from the smallest set of tokens whose cumulative probability reaches top_p"</span>
    <span class="p">)</span>
    
    <span class="c1"># Top-k sampling
</span>    <span class="n">top_k</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Limit sampling to top-k most likely tokens"</span>
    <span class="p">)</span>
    
    <span class="c1"># Repetition control
</span>    <span class="n">frequency_penalty</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=-</span><span class="mf">2.0</span><span class="p">,</span>
        <span class="n">le</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Penalize tokens based on frequency. Positive reduces repetition"</span>
    <span class="p">)</span>
    
    <span class="n">presence_penalty</span><span class="p">:</span> <span class="nb">float</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=-</span><span class="mf">2.0</span><span class="p">,</span>
        <span class="n">le</span><span class="o">=</span><span class="mf">2.0</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Penalize tokens that have appeared at all. Encourages new topics"</span>
    <span class="p">)</span>
    
    <span class="c1"># Output control
</span>    <span class="n">max_tokens</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">ge</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Maximum tokens to generate"</span>
    <span class="p">)</span>
    
    <span class="n">stop_sequences</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="n">List</span><span class="p">[</span><span class="nb">str</span><span class="p">]]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Sequences that stop generation"</span>
    <span class="p">)</span>
    
    <span class="c1"># Reproducibility
</span>    <span class="n">seed</span><span class="p">:</span> <span class="n">Optional</span><span class="p">[</span><span class="nb">int</span><span class="p">]</span> <span class="o">=</span> <span class="n">Field</span><span class="p">(</span>
        <span class="n">default</span><span class="o">=</span><span class="bp">None</span><span class="p">,</span>
        <span class="n">description</span><span class="o">=</span><span class="s">"Random seed for reproducibility (when supported)"</span>
    <span class="p">)</span>
    
    <span class="o">@</span><span class="n">field_validator</span><span class="p">(</span><span class="s">'temperature'</span><span class="p">)</span>
    <span class="o">@</span><span class="nb">classmethod</span>
    <span class="k">def</span> <span class="nf">validate_temperature</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">v</span><span class="p">:</span> <span class="nb">float</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
        <span class="s">"""Warn about extreme temperature values."""</span>
        <span class="k">if</span> <span class="n">v</span> <span class="o">&gt;</span> <span class="mf">1.5</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"⚠️ Warning: Temperature </span><span class="si">{</span><span class="n">v</span><span class="si">}</span><span class="s"> is very high. "</span>
                  <span class="s">"Outputs may be incoherent."</span><span class="p">)</span>
        <span class="k">elif</span> <span class="n">v</span> <span class="o">==</span> <span class="mf">0.0</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"ℹ️ Note: Temperature 0 selects the most likely token at each step (greedy decoding). "</span>
                  <span class="s">"Outputs can still vary between runs on some backends; set a seed where supported."</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">v</span>
    
    <span class="o">@</span><span class="n">model_validator</span><span class="p">(</span><span class="n">mode</span><span class="o">=</span><span class="s">'after'</span><span class="p">)</span>
    <span class="k">def</span> <span class="nf">validate_sampling_params</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="s">"""Validate that sampling parameters are compatible."""</span>
        <span class="c1"># Warn if top_k is combined with a restrictive top_p (top_p &lt; 1.0)
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">top_k</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span> <span class="ow">and</span> <span class="bp">self</span><span class="p">.</span><span class="n">top_p</span> <span class="o">&lt;</span> <span class="mf">1.0</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"⚠️ Warning: top_k is set together with top_p &lt; 1.0. "</span>
                  <span class="s">"Both filters narrow the candidate set, so sampling may be more restrictive than intended."</span><span class="p">)</span>
        
        <span class="c1"># Warn about extreme penalty values
</span>        <span class="k">if</span> <span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">frequency_penalty</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mf">1.5</span> <span class="ow">or</span> <span class="nb">abs</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">presence_penalty</span><span class="p">)</span> <span class="o">&gt;</span> <span class="mf">1.5</span><span class="p">:</span>
            <span class="k">print</span><span class="p">(</span><span class="s">"⚠️ Warning: Extreme penalty values may cause unusual outputs."</span><span class="p">)</span>
        
        <span class="k">return</span> <span class="bp">self</span>
    
    <span class="o">@</span><span class="nb">classmethod</span>
    <span class="k">def</span> <span class="nf">from_preset</span><span class="p">(</span><span class="n">cls</span><span class="p">,</span> <span class="n">preset</span><span class="p">:</span> <span class="n">UseCasePreset</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="s">"GenerationConfig"</span><span class="p">:</span>
        <span class="s">"""
        Create a configuration from a predefined preset.
        
        Args:
            preset: The use case preset to use
            
        Returns:
            GenerationConfig with preset values
        """</span>
        <span class="n">config</span> <span class="o">=</span> <span class="n">PRESET_CONFIGS</span><span class="p">[</span><span class="n">preset</span><span class="p">]</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"ℹ️ Using preset '</span><span class="si">{</span><span class="n">preset</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">': </span><span class="si">{</span><span class="n">config</span><span class="p">[</span><span class="s">'description'</span><span class="p">]</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">cls</span><span class="p">(</span>
            <span class="n">temperature</span><span class="o">=</span><span class="n">config</span><span class="p">[</span><span class="s">"temperature"</span><span class="p">],</span>
            <span class="n">top_p</span><span class="o">=</span><span class="n">config</span><span class="p">[</span><span class="s">"top_p"</span><span class="p">],</span>
            <span class="n">frequency_penalty</span><span class="o">=</span><span class="n">config</span><span class="p">[</span><span class="s">"frequency_penalty"</span><span class="p">],</span>
            <span class="n">presence_penalty</span><span class="o">=</span><span class="n">config</span><span class="p">[</span><span class="s">"presence_penalty"</span><span class="p">]</span>
        <span class="p">)</span>
    
    <span class="k">def</span> <span class="nf">to_openai_kwargs</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Convert to OpenAI API keyword arguments."""</span>
        <span class="n">kwargs</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"temperature"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span><span class="p">,</span>
            <span class="s">"top_p"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">top_p</span><span class="p">,</span>
            <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">frequency_penalty</span><span class="p">,</span>
            <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">presence_penalty</span><span class="p">,</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_tokens</span><span class="p">,</span>
        <span class="p">}</span>
        
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">stop_sequences</span><span class="p">:</span>
            <span class="n">kwargs</span><span class="p">[</span><span class="s">"stop"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">stop_sequences</span>
        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">seed</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">kwargs</span><span class="p">[</span><span class="s">"seed"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">seed</span>
            
        <span class="k">return</span> <span class="n">kwargs</span>
    
    <span class="k">def</span> <span class="nf">to_anthropic_kwargs</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">dict</span><span class="p">:</span>
        <span class="s">"""Convert to Anthropic API keyword arguments."""</span>
        <span class="c1"># Note: Anthropic uses 0-1 range for temperature
</span>        <span class="n">kwargs</span> <span class="o">=</span> <span class="p">{</span>
            <span class="s">"temperature"</span><span class="p">:</span> <span class="nb">min</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span><span class="p">),</span>
            <span class="s">"top_p"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">top_p</span><span class="p">,</span>
            <span class="s">"max_tokens"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">max_tokens</span><span class="p">,</span>
            <span class="s">"stop_sequences"</span><span class="p">:</span> <span class="bp">self</span><span class="p">.</span><span class="n">stop_sequences</span> <span class="ow">or</span> <span class="p">[],</span>
        <span class="p">}</span>
        <span class="c1"># Omit top_k when unset instead of sending a placeholder such as -1
</span>        <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">top_k</span><span class="p">:</span>
            <span class="n">kwargs</span><span class="p">[</span><span class="s">"top_k"</span><span class="p">]</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">top_k</span>
        <span class="k">return</span> <span class="n">kwargs</span>
    
    <span class="k">def</span> <span class="nf">describe</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">str</span><span class="p">:</span>
        <span class="s">"""Get a human-readable description of the configuration."""</span>
        <span class="n">creativity_level</span> <span class="o">=</span> <span class="p">(</span>
            <span class="s">"Deterministic"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span>
            <span class="s">"Very focused"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">&lt;</span> <span class="mf">0.3</span> <span class="k">else</span>
            <span class="s">"Focused"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">&lt;</span> <span class="mf">0.7</span> <span class="k">else</span>
            <span class="s">"Balanced"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">&lt;</span> <span class="mf">1.0</span> <span class="k">else</span>
            <span class="s">"Creative"</span> <span class="k">if</span> <span class="bp">self</span><span class="p">.</span><span class="n">temperature</span> <span class="o">&lt;</span> <span class="mf">1.5</span> <span class="k">else</span>
            <span class="s">"Highly creative"</span>
        <span class="p">)</span>
        
        <span class="k">return</span> <span class="p">(</span>
            <span class="sa">f</span><span class="s">"Generation Config:</span><span class="se">\n</span><span class="s">"</span>
            <span class="sa">f</span><span class="s">"  • Creativity: </span><span class="si">{</span><span class="n">creativity_level</span><span class="si">}</span><span class="s"> (T=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">temperature</span><span class="si">}</span><span class="s">)</span><span class="se">\n</span><span class="s">"</span>
            <span class="sa">f</span><span class="s">"  • Nucleus sampling: top_p=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">top_p</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span>
            <span class="sa">f</span><span class="s">"  • Repetition: freq_pen=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">frequency_penalty</span><span class="si">}</span><span class="s">, "</span>
            <span class="sa">f</span><span class="s">"pres_pen=</span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">presence_penalty</span><span class="si">}</span><span class="se">\n</span><span class="s">"</span>
            <span class="sa">f</span><span class="s">"  • Max output: </span><span class="si">{</span><span class="bp">self</span><span class="p">.</span><span class="n">max_tokens</span><span class="si">}</span><span class="s"> tokens"</span>
        <span class="p">)</span>


<span class="k">def</span> <span class="nf">demonstrate_configs</span><span class="p">():</span>
    <span class="s">"""Demonstrate configuration usage."""</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"GENERATION CONFIG DEMONSTRATION"</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="s">"="</span> <span class="o">*</span> <span class="mi">60</span><span class="p">)</span>
    
    <span class="c1"># Custom configuration
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">1. Custom Configuration:"</span><span class="p">)</span>
    <span class="n">custom_config</span> <span class="o">=</span> <span class="n">GenerationConfig</span><span class="p">(</span>
        <span class="n">temperature</span><span class="o">=</span><span class="mf">0.8</span><span class="p">,</span>
        <span class="n">top_p</span><span class="o">=</span><span class="mf">0.9</span><span class="p">,</span>
        <span class="n">frequency_penalty</span><span class="o">=</span><span class="mf">0.3</span><span class="p">,</span>
        <span class="n">max_tokens</span><span class="o">=</span><span class="mi">500</span>
    <span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">custom_config</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>
    <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="s">OpenAI kwargs: </span><span class="si">{</span><span class="n">json</span><span class="p">.</span><span class="n">dumps</span><span class="p">(</span><span class="n">custom_config</span><span class="p">.</span><span class="n">to_openai_kwargs</span><span class="p">(),</span> <span class="n">indent</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span><span class="si">}</span><span class="s">"</span><span class="p">)</span>
    
    <span class="c1"># Preset configurations
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">2. Preset Configurations:"</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">preset</span> <span class="ow">in</span> <span class="p">[</span><span class="n">UseCasePreset</span><span class="p">.</span><span class="n">CODE_GENERATION</span><span class="p">,</span> 
                   <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">CREATIVE_WRITING</span><span class="p">,</span> 
                   <span class="n">UseCasePreset</span><span class="p">.</span><span class="n">CHAT_ASSISTANT</span><span class="p">]:</span>
        <span class="n">config</span> <span class="o">=</span> <span class="n">GenerationConfig</span><span class="p">.</span><span class="n">from_preset</span><span class="p">(</span><span class="n">preset</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="sa">f</span><span class="s">"</span><span class="se">\n</span><span class="si">{</span><span class="n">preset</span><span class="p">.</span><span class="n">value</span><span class="si">}</span><span class="s">:"</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="n">config</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>
    
    <span class="c1"># Edge case: extreme temperature
</span>    <span class="k">print</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">3. Edge Case - High Temperature:"</span><span class="p">)</span>
    <span class="n">extreme_config</span> <span class="o">=</span> <span class="n">GenerationConfig</span><span class="p">(</span><span class="n">temperature</span><span class="o">=</span><span class="mf">1.8</span><span class="p">)</span>
    <span class="k">print</span><span class="p">(</span><span class="n">extreme_config</span><span class="p">.</span><span class="n">describe</span><span class="p">())</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">demonstrate_configs</span><span class="p">()</span>
</code></pre></div></div>

<hr />

<h2 id="related-generation-parameters">Related Generation Parameters</h2>

<p>Temperature is just one of several parameters that control LLM output. Here’s how they work together:</p>

<pre><code class="language-mermaid">flowchart TB
    subgraph Input["Input Processing"]
        PROMPT["User Prompt"]
    end
    
    subgraph LLM["LLM Generation"]
        MODEL[("LLM&lt;br/&gt;Model")]
        LOGITS["Raw Logits&lt;br/&gt;(Vocab Size)"]
    end
    
    subgraph Params["Generation Parameters"]
        direction TB
        TEMP["🌡️ Temperature&lt;br/&gt;Scale logits before softmax"]
        TOPP["📊 Top-P (Nucleus)&lt;br/&gt;Sample from top cumulative %"]
        TOPK["🔢 Top-K&lt;br/&gt;Sample from top K tokens"]
        FREQ["🔄 Frequency Penalty&lt;br/&gt;Reduce token repetition"]
        PRES["✨ Presence Penalty&lt;br/&gt;Encourage new tokens"]
        MAX["📏 Max Tokens&lt;br/&gt;Output length limit"]
        STOP["🛑 Stop Sequences&lt;br/&gt;Generation terminators"]
    end
    
    subgraph Output["Output"]
        TOKEN["Selected Token"]
        RESPONSE["Generated Response"]
    end
    
    PROMPT --&gt; MODEL
    MODEL --&gt; LOGITS
    LOGITS --&gt; FREQ
    FREQ --&gt; PRES
    PRES --&gt; TEMP
    TEMP --&gt; TOPK
    TOPK --&gt; TOPP
    TOPP --&gt; TOKEN
    TOKEN --&gt; |"Repeat until&lt;br/&gt;stop condition"| MODEL
    TOKEN --&gt; RESPONSE
    MAX --&gt; RESPONSE
    STOP --&gt; RESPONSE
    
    style TEMP fill:#ffeb3b,stroke:#f57f17
    style TOPP fill:#4caf50,stroke:#2e7d32
    style TOPK fill:#2196f3,stroke:#1565c0
    style FREQ fill:#9c27b0,stroke:#6a1b9a
    style PRES fill:#ff9800,stroke:#ef6c00
</code></pre>
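
<p>To make the interaction concrete, here is a minimal pure-Python sketch of a single sampling step. It assumes one common order of operations (penalties on the raw logits, then temperature, then top-k, then top-p); real inference stacks differ in ordering and details. The function name <code>sample_next_token</code> and the <code>counts</code> argument are illustrative, not part of any provider's API.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import itertools
import math
import random


def sample_next_token(logits, counts, temperature=1.0, top_k=0, top_p=1.0,
                      frequency_penalty=0.0, presence_penalty=0.0, rng=random):
    """Return one token index; `counts` maps token index to times already generated."""
    # 1. Penalties lower the raw logits of tokens that were already generated.
    adjusted = []
    for index, logit in enumerate(logits):
        seen = counts.get(index, 0)
        adjusted.append(logit - seen * frequency_penalty
                        - (presence_penalty if seen else 0.0))
    # 2. Temperature 0 is handled as greedy decoding (argmax).
    if temperature == 0:
        return max(range(len(adjusted)), key=adjusted.__getitem__)
    # 3. Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [value / temperature for value in adjusted]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)
    # 4. Top-k keeps the k most likely tokens (0 disables the filter).
    if top_k:
        ranked = ranked[:top_k]
    # 5. Top-p keeps the shortest prefix whose cumulative probability reaches top_p.
    cumulative = list(itertools.accumulate(prob for prob, _ in ranked))
    ranked = ranked[:bisect.bisect_left(cumulative, top_p) + 1]
    # 6. Sample from the survivors; random.choices renormalizes the weights.
    return rng.choices([i for _, i in ranked], weights=[p for p, _ in ranked], k=1)[0]


logits = [1.0, 3.0, 2.0]
print(sample_next_token(logits, counts={}, temperature=0))  # 1 (greedy pick)
print(sample_next_token(logits, counts={1: 5}, temperature=0, frequency_penalty=1.0))  # 2
</code></pre></div></div>

<p>In the second call, token 1 has already appeared five times, so the frequency penalty pushes its logit from 3.0 down to -2.0 and the greedy pick moves to token 2.</p>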

<h3 id="parameter-summary-table">Parameter Summary Table</h3>

<table>
  <thead>
    <tr>
      <th>Parameter</th>
      <th>Range</th>
      <th>Effect</th>
      <th>Use Case</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Temperature</strong></td>
      <td>0.0 - 2.0 (provider-dependent)</td>
      <td>Scales logits, controls randomness</td>
      <td>Creativity vs consistency</td>
    </tr>
    <tr>
      <td><strong>Top-P</strong></td>
      <td>0.0 - 1.0</td>
      <td>Cumulative probability threshold</td>
      <td>Dynamic vocabulary filtering</td>
    </tr>
    <tr>
      <td><strong>Top-K</strong></td>
      <td>1 - vocab_size</td>
      <td>Limits to K most likely tokens</td>
      <td>Hard vocabulary filtering</td>
    </tr>
    <tr>
      <td><strong>Frequency Penalty</strong></td>
      <td>-2.0 - 2.0</td>
      <td>Penalizes based on token frequency</td>
      <td>Reduce repetition</td>
    </tr>
    <tr>
      <td><strong>Presence Penalty</strong></td>
      <td>-2.0 - 2.0</td>
      <td>Penalizes based on token presence</td>
      <td>Encourage topic diversity</td>
    </tr>
    <tr>
      <td><strong>Max Tokens</strong></td>
      <td>1 - model output limit</td>
      <td>Maximum generation length</td>
      <td>Control output size</td>
    </tr>
    <tr>
      <td><strong>Stop</strong></td>
      <td>List of strings</td>
      <td>Halts generation on match</td>
      <td>Structured output</td>
    </tr>
  </tbody>
</table>
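
<p>The numeric ranges in the table can be enforced before a request is sent. The sketch below is one possible check; <code>PARAM_RANGES</code> and <code>find_out_of_range</code> are hypothetical names, and the bounds are copied from the table rather than from any specific provider, so adjust them to the API you target.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Bounds copied from the summary table; individual providers may differ.
PARAM_RANGES = {
    "temperature": (0.0, 2.0),
    "top_p": (0.0, 1.0),
    "frequency_penalty": (-2.0, 2.0),
    "presence_penalty": (-2.0, 2.0),
}


def find_out_of_range(params):
    """Return {name: (value, low, high)} for each numeric parameter outside its range."""
    problems = {}
    for name, (low, high) in PARAM_RANGES.items():
        value = params.get(name)
        if value is not None and not (low &lt;= value &lt;= high):
            problems[name] = (value, low, high)
    return problems


print(find_out_of_range({"temperature": 2.5, "top_p": 0.9}))
# {'temperature': (2.5, 0.0, 2.0)}
</code></pre></div></div>

<p>Returning the offending values instead of raising lets the caller decide whether to clamp, warn, or reject the request.</p>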

<hr />

<h2 id="best-practices-and-guidelines">Best Practices and Guidelines</h2>

<h3 id="temperature-selection-by-use-case">Temperature Selection by Use Case</h3>

<pre><code class="language-mermaid">quadrantChart
    title Temperature Selection Guide
    x-axis Low Accuracy --&gt; High Accuracy
    y-axis Low Creativity --&gt; High Creativity
    quadrant-1 "Creative Writing"
    quadrant-2 "Brainstorming"
    quadrant-3 "Code Gen / Data Extraction"
    quadrant-4 "General Chat"
    
    "Poetry": [0.25, 0.85]
    "Stories": [0.35, 0.80]
    "Ideas": [0.20, 0.90]
    "Naming": [0.30, 0.75]
    "Code": [0.90, 0.15]
    "JSON": [0.95, 0.10]
    "FAQ": [0.85, 0.25]
    "Chat": [0.65, 0.50]
    "Summary": [0.75, 0.35]
</code></pre>

<h3 id="recommended-settings">Recommended Settings</h3>

<table>
  <thead>
    <tr>
      <th>Task Type</th>
      <th>Temperature</th>
      <th>Top-P</th>
      <th>Notes</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><strong>Code Generation</strong></td>
      <td>0.0 - 0.2</td>
      <td>0.95</td>
      <td>Consistency critical</td>
    </tr>
    <tr>
      <td><strong>Data Extraction</strong></td>
      <td>0.0</td>
      <td>1.0</td>
      <td>Deterministic required</td>
    </tr>
    <tr>
      <td><strong>Factual Q&amp;A</strong></td>
      <td>0.2 - 0.4</td>
      <td>0.9</td>
      <td>Accuracy over variety</td>
    </tr>
    <tr>
      <td><strong>Summarization</strong></td>
      <td>0.3 - 0.5</td>
      <td>0.9</td>
      <td>Coherent, focused</td>
    </tr>
    <tr>
      <td><strong>Translation</strong></td>
      <td>0.1 - 0.3</td>
      <td>0.95</td>
      <td>Consistency matters</td>
    </tr>
    <tr>
      <td><strong>Chat/Assistant</strong></td>
      <td>0.6 - 0.8</td>
      <td>0.9</td>
      <td>Natural, varied</td>
    </tr>
    <tr>
      <td><strong>Creative Writing</strong></td>
      <td>0.8 - 1.2</td>
      <td>0.95</td>
      <td>Creativity desired</td>
    </tr>
    <tr>
      <td><strong>Brainstorming</strong></td>
      <td>1.0 - 1.5</td>
      <td>0.98</td>
      <td>Maximum novelty</td>
    </tr>
  </tbody>
</table>

<h3 id="golden-rules">Golden Rules</h3>

<ol>
  <li><strong>Start Low, Increase Gradually</strong>: Begin with T=0.5 and adjust based on results</li>
  <li><strong>Don’t Mix Extreme Values</strong>: Avoid T=2.0 with top_p=0.99 (compounding randomness)</li>
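  <li><strong>Use Seed for Reproducibility</strong>: When T&gt;0, set a seed (where the API supports one) so a run can be replayed while debugging</li>
  <li><strong>Consider Downstream Effects</strong>: Higher temperature produces more malformed or off-format output, so plan for more validation and post-processing</li>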
  <li><strong>Test on Representative Samples</strong>: Temperature effects vary by prompt type</li>
</ol>
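
<p>Rules 1 and 5 amount to running a small sweep on your own prompts. The sketch below measures how many distinct outputs each temperature yields; <code>temperature_sweep</code> is a hypothetical helper, and <code>generate(prompt, temperature)</code> is an assumed callable that wraps whichever model API you use. The <code>fake_generate</code> stand-in exists only so the example runs without network access.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import random


def temperature_sweep(generate, prompt, temperatures, samples=5):
    """Map each temperature to the share of distinct outputs across `samples` calls."""
    diversity = {}
    for temperature in temperatures:
        outputs = [generate(prompt, temperature) for _ in range(samples)]
        diversity[temperature] = len(set(outputs)) / samples
    return diversity


_rng = random.Random(0)


def fake_generate(prompt, temperature):
    # Stand-in for a real model call: higher temperature allows more distinct outputs.
    variants = 1 + int(temperature * 4)
    return f"{prompt} #{_rng.randrange(variants)}"


print(temperature_sweep(fake_generate, "Name a color", [0.0, 0.5, 1.0]))
</code></pre></div></div>

<p>With the stand-in, temperature 0.0 always returns the same string, so its diversity is 1/5 = 0.2. With a real model, raise the temperature only until the diversity you need appears.</p>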

<hr />

<h2 id="common-pitfalls-and-edge-cases">Common Pitfalls and Edge Cases</h2>

<h3 id="pitfall-1-temperature--0-isnt-always-deterministic">Pitfall 1: Temperature = 0 Isn’t Always Deterministic</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Even with temperature=0, slight variations can occur due to:
# 1. Floating-point precision differences across hardware
# 2. Non-deterministic ordering of parallel floating-point operations
# 3. Model updates between API calls
</span>
<span class="c1"># Mitigation: set the seed parameter when available (best-effort, not a guarantee)
</span><span class="n">response</span> <span class="o">=</span> <span class="n">client</span><span class="p">.</span><span class="n">chat</span><span class="p">.</span><span class="n">completions</span><span class="p">.</span><span class="n">create</span><span class="p">(</span>
    <span class="n">model</span><span class="o">=</span><span class="s">"gpt-4o"</span><span class="p">,</span>
    <span class="n">messages</span><span class="o">=</span><span class="p">[...],</span>
    <span class="n">temperature</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
    <span class="n">seed</span><span class="o">=</span><span class="mi">42</span>  <span class="c1"># For reproducibility
</span><span class="p">)</span>
</code></pre></div></div>
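
<p>The floating-point cause is easy to reproduce locally. The sketch below shows that addition is not associative, so two accumulation orders give sums that differ in the last bit, which is enough to flip a greedy pick between near-tied logits. The <code>argmax</code> helper is illustrative.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The same three numbers summed in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False
print(left_to_right, right_to_left)    # 0.6000000000000001 0.6


def argmax(values):
    """Index of the largest value; ties resolve to the first index."""
    return max(range(len(values)), key=values.__getitem__)


# Greedy decoding between two near-tied logits depends on the summation order.
print(argmax([0.6, left_to_right]), argmax([0.6, right_to_left]))  # 1 0
</code></pre></div></div>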

<h3 id="pitfall-2-confusing-temperature-ranges">Pitfall 2: Confusing Temperature Ranges</h3>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># OpenAI: temperature range is 0.0 - 2.0
# Anthropic: temperature range is 0.0 - 1.0
# Open-source models: Often 0.0 - 2.0+
</span>
<span class="c1"># Always check API documentation for valid ranges
</span><span class="k">def</span> <span class="nf">normalize_temperature</span><span class="p">(</span><span class="n">temp</span><span class="p">:</span> <span class="nb">float</span><span class="p">,</span> <span class="n">api</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="nb">float</span><span class="p">:</span>
    <span class="s">"""Normalize temperature for different APIs."""</span>
    <span class="k">if</span> <span class="n">api</span> <span class="o">==</span> <span class="s">"anthropic"</span><span class="p">:</span>
        <span class="k">return</span> <span class="nb">max</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="nb">min</span><span class="p">(</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">temp</span><span class="p">))</span>  <span class="c1"># Clamp to 0-1
</span>    <span class="k">return</span> <span class="n">temp</span>  <span class="c1"># Most others use 0-2
</span></code></pre></div></div>

<h3 id="pitfall-3-over-relying-on-temperature-alone">Pitfall 3: Over-relying on Temperature Alone</h3>

<p>Temperature works best in combination with other parameters:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Instead of just high temperature:
</span><span class="n">config_bad</span> <span class="o">=</span> <span class="p">{</span><span class="s">"temperature"</span><span class="p">:</span> <span class="mf">1.8</span><span class="p">}</span>  <span class="c1"># May be too random
</span>
<span class="c1"># Use a balanced configuration:
</span><span class="n">config_good</span> <span class="o">=</span> <span class="p">{</span>
    <span class="s">"temperature"</span><span class="p">:</span> <span class="mf">1.0</span><span class="p">,</span>
    <span class="s">"top_p"</span><span class="p">:</span> <span class="mf">0.95</span><span class="p">,</span>
    <span class="s">"frequency_penalty"</span><span class="p">:</span> <span class="mf">0.5</span><span class="p">,</span>  <span class="c1"># Reduce repetition
</span>    <span class="s">"presence_penalty"</span><span class="p">:</span> <span class="mf">0.3</span>    <span class="c1"># Encourage variety
</span><span class="p">}</span>
</code></pre></div></div>

<hr />

<h2 id="conclusion">Conclusion</h2>

<p>Temperature is a fundamental parameter for controlling LLM behavior, but it’s not magic. Understanding the mathematical foundations—how it scales logits before softmax to reshape probability distributions—helps you make informed decisions about when and how to use it.</p>

<h3 id="key-takeaways">Key Takeaways</h3>

<ol>
  <li><strong>Temperature divides the logits by T before softmax</strong>, sharpening the distribution when T &lt; 1 and flattening it when T &gt; 1</li>
  <li><strong>Low temperature (→0)</strong> produces near-deterministic, focused outputs ideal for code and factual tasks</li>
  <li><strong>High temperature (→1.5+)</strong> produces creative, varied outputs but risks incoherence</li>
  <li><strong>Always combine temperature with other parameters</strong> (top_p, penalties) for best results</li>
  <li><strong>Test empirically</strong> on your specific use case—optimal values vary by task and model</li>
</ol>

<h3 id="further-reading">Further Reading</h3>

<ul>
  <li><a href="https://platform.openai.com/docs/api-reference/chat/create">OpenAI API Documentation</a></li>
  <li><a href="https://docs.anthropic.com/en/api/complete">Anthropic Claude API Documentation</a></li>
  <li><a href="https://en.wikipedia.org/wiki/Softmax_function">The Mathematics of Softmax Temperature</a></li>
  <li><a href="https://arxiv.org/abs/1904.09751">Nucleus Sampling (Top-P) Paper</a></li>
</ul>

<hr />

<p><em>This article was created to provide a comprehensive technical guide to temperature in LLMs. For questions or feedback, reach out to the ML community forums.</em></p>]]></content><author><name>Marc Buraczynski</name></author><category term="LLMs" /><category term="temperature" /><category term="softmax" /><category term="sampling" /><category term="deep learning" /><summary type="html"><![CDATA[A comprehensive exploration of temperature parameter mechanics, mathematical foundations, and practical implementation strategies for ML engineers and developers.]]></summary></entry></feed>