GPT-4o mini and prompt caching, a turning point in debugging?
TLDR: We are at an inflection point - prompt caching and GPT-4o mini have lowered the cost of searching a problem space enough that LLMs are now a viable strategy for recreating bugs.
It’s no secret that developers spend a lot of time on debugging, and the problem keeps growing. The debugging process typically involves a few consistent steps:
1. Triage / issue selection
2. Building context (viewing metrics, logs, monitoring, etc.)
3. Hypothesis testing (recreating bugs, isolating problematic code within systems)
4. Resolution
Tooling already exists for building context - observability stacks such as Datadog and Splunk are fantastic at helping developers understand a bug. However, not much exists for hypothesis testing, which remains a very manual process. Tools like SWE-Agent and Devin touch this problem to a degree, but it’s not their primary focus - they tend to be single-threaded and too broadly focused.
We address this gap by building a focused hypothesis-testing system. Rather than build a single agent to reproduce a bug, we restructure the problem as a search over the space of possible bug hypotheses. An LLM produces potential hypotheses in parallel in the form of debugging scripts, which run in a simulated environment to test whether they reproduce the error. Recent developments in LLM efficiency (prompt caching and GPT-4o mini) make this strategy not only effective but viable from a cost perspective.
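To make the cost point concrete, here is a minimal sketch, not our actual implementation, of fanning out candidate hypotheses against GPT-4o mini. The heavy, stable context sits at the front of the prompt so that OpenAI's automatic prompt caching can reuse it across calls; the file name, prompt wording, and fan-out of 8 are assumptions for illustration.
```
# Minimal sketch (not our production code) of fanning out candidate hypotheses
# against GPT-4o mini. Assumes the official `openai` Python SDK (v1.x) and an
# OPENAI_API_KEY in the environment; the prompt text and n=8 are illustrative.
from openai import OpenAI

client = OpenAI()

# Put the large, stable context (source excerpts, logs, issue text) at the front
# of the prompt. OpenAI's prompt caching keys on shared prefixes, so repeated
# calls across evolution rounds reuse this prefix at a discount.
shared_context = open("issue_context.txt").read()  # illustrative file name

response = client.chat.completions.create(
    model="gpt-4o-mini",
    n=8,                # request eight candidate reproduction scripts at once
    temperature=1.0,    # keep diversity high so the candidates differ
    messages=[
        {"role": "system",
         "content": "You write short Python scripts that attempt to reproduce the described bug."},
        {"role": "user",
         "content": shared_context + "\n\nWrite a standalone script that reproduces this issue."},
    ],
)

candidate_scripts = [choice.message.content for choice in response.choices]
```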
A typical flow for debugging
1. An SRE or QA engineer identifies a bug from an event trigger, customer complaint, etc., and adds an issue to the GitHub repo / Jira. These issues typically have incomplete or missing logs and supporting data.
2. A developer picks up the issue during maintenance time, or sooner if urgent resolution is needed.
3. The developer uses tools to understand the systems causing the error:
   - Grafana dashboards to understand the performance / ongoing impact of the issue
   - Logging and monitoring systems (e.g. Splunk, Datadog)
   - Searches for similar errors experienced previously
   - Distributed tracing tooling to narrow the source of the issue down to specific systems
4. The developer forms a hypothesis on the cause of the error and attempts to isolate / recreate the issue:
   - Enrich the data from the issue, e.g. add additional logging to capture missing data
   - Recreate the environment where the error occurred (state, config, system specs)
   - Recreate the code execution that generated the error
5. Finally, the developer implements the fix for the bug and resolves the issue.
This flow is clearly generalized, but it illustrates our view that the context-building tools are more mature than tooling for the other steps. We see the most potential for improvement in the hypothesis-testing portion of the process, where developers still rely on largely manual approaches.
Experiment: automate bug recreation for a real-world repo
Our goal for this experiment was to automate the recreation of hard-to-crack bugs from issues in a real development environment. We built a system that uses LLMs to search for potential reproductions. The system takes logs, source code, core dumps, and the actual bug description as inputs and builds the reproduction from them.
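As a rough illustration of this input step (the artifact names, section headers, and truncation limits below are assumptions, not our exact format), the inputs can be flattened into a single text context along these lines:
```
# Illustrative sketch of flattening bug artifacts into one prompt context.
# File names and character limits are assumptions for the example.
from pathlib import Path

def build_context(issue_text: str, log_paths: list[str], source_paths: list[str],
                  core_dump_summary: str | None = None, max_chars: int = 20_000) -> str:
    sections = [f"## Bug report\n{issue_text}"]
    for path in log_paths:
        # Keep the tail of each log, where the error usually appears
        sections.append(f"## Log: {path}\n{Path(path).read_text()[-max_chars:]}")
    for path in source_paths:
        sections.append(f"## Source: {path}\n{Path(path).read_text()[:max_chars]}")
    if core_dump_summary:
        sections.append(f"## Core dump summary\n{core_dump_summary}")
    return "\n\n".join(sections)
```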
Case Study - ComfyUI
We tested this system on the ComfyUI repo, a large, popular repo that happens to be near and dear to our hearts. Issue #4525 serves as our test case: a tough bug to reproduce, involving timing issues and certain behaviors causing lags.
ComfyUI #4525, our test case
We pulled the issue from GitHub along with the source code and attached logs. This data is preprocessed and loaded into context. The system generates multiple attempts (in parallel) at recreating the issue using different methods, and applies a loss function to judge which generated solutions are most effective at recreating it. These solutions are fed through rounds of evolution; after each iteration the top 3 solutions are selected to serve as the basis for the next round.
An evolution of the solutions, with the top 3 solutions selected.
Solutions iterating and improving
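For clarity, here is a minimal sketch of the generate/score/select loop described above. The injected callables, round count, and population size are illustrative assumptions; the only detail taken from the run above is keeping the top 3 scripts as seeds for the next round.
```
# Sketch of the evolutionary search: propose reproduction scripts, score how
# closely each run matches the reported error, and seed the next round with the
# best ones. The `generate` and `score` callables are injected because their
# real implementations (LLM calls, sandboxed execution, the loss function) are
# not shown here.
from typing import Callable

def evolve(
    issue_context: str,
    generate: Callable[[str, list[str], int], list[str]],  # (context, seed scripts, n) -> candidate scripts
    score: Callable[[str], float],                          # run a script in a sandbox; lower loss = closer match
    rounds: int = 4,       # illustrative
    population: int = 9,   # illustrative
    top_k: int = 3,        # the top 3 solutions seed each subsequent round
) -> list[tuple[float, str]]:
    seeds: list[str] = []
    best: list[tuple[float, str]] = []
    for _ in range(rounds):
        candidates = generate(issue_context, seeds, population)
        scored = sorted((score(script), script) for script in candidates)
        best = scored[:top_k]
        seeds = [script for _, script in best]  # best scripts become the basis for the next round
    return best
```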
After several iterations, the system converges on the recreated bug and surfaces its top 5 attempts at recreating the issue. The top-rated script was:
```
import asyncio
import aiohttp
import logging
import json
import sys
import os

# Import the PromptServer class from server.py
sys.path.append('/var/folders/p0/xjn9q68j5yq5by4rd9y2mpf00000gn/T/tmp5vkscqv2')
from server import PromptServer

# Configure logging to display the information
logging.basicConfig(level=logging.INFO)

# Mock request class to simulate incoming requests to the server
class MockRequest:
    def __init__(self, json_data):
        self._json = json_data

    async def json(self):
        return self._json

# Function to simulate sending a prompt to the server
async def send_prompt(server, prompt):
    json_data = {
        "prompt": prompt,
        "number": 1,
        "client_id": "test_client"
    }
    request = MockRequest(json_data)
    # Call the 'post_prompt' method which handles prompts
    response = await server.post_prompt(request)
    return response

# Function to simulate typing events while generating an image
async def simulate_typing(server, prompt):
    for i in range(5):
        logging.info(f"Typing: {prompt} {i}")
        await send_prompt(server, f"{prompt} {i}")
        await asyncio.sleep(0.5)  # Simulate typing delay

# Main function to initialize the server and run tests
async def main():
    loop = asyncio.get_event_loop()
    server = PromptServer(loop)

    # Setup server and add routes
    await server.setup()
    server.add_routes()

    # Start server on an available port to avoid address already in use error
    port = 8081
    while True:
        try:
            await server.start('localhost', port, verbose=False)
            break
        except OSError as e:
            if e.errno == 98:  # Address already in use
                logging.warning(f"Port {port} is already in use. Trying next port...")
                port += 1
            else:
                logging.error(f"Failed to start server: {e}")
                return

    # Start the image generation prompt and typing simulation
    tasks = [
        send_prompt(server, "Generate an image"),
        simulate_typing(server, "Typing while generating")
    ]

    # Run tasks concurrently
    await asyncio.gather(*tasks)
    logging.info("Test completed. Check the logs for the timing information.")

# Run the main function in the asyncio event loop
if __name__ == "__main__":
    try:
        asyncio.run(main())
    except Exception as e:
        logging.error(f"An error occurred: {e}")
```
```
Explanation: The script attempts to start a server on port 8081. If that port is already in use, it increments the port number to find an available one, thus addressing the 'address already in use' error. Overall, the script simulates typing while sending requests to the server, allowing us to reproduce timing issues related to user inputs and server responses.
Critique:
The provided script attempts to create an instance of the PromptServer and simulate user input while generating images. This closely aligns with the bug description of experiencing delays while typing during image generation. The script also attempts to start the server at an available port, which is relevant for simulating the environment described. However, the key issue is that the server fails to start due to the port being occupied, which somewhat diminishes its ability to fully recreate the timing issues stated in the bug report. Overall, it captures relevant aspects of the bug but does not completely reproduce the problem, which mostly stems from environmental setup issues.
Score:
70/100
```
It’s important to note that in this short experiment we were not able to pipe the output of this system directly into a bug recreation; it required manual tweaking to get working. Testing is continuing with better parameterization and more iterations.
Conclusion
Using this system we were able to automatically generate several viable scripts for recreating the selected issue. While we weren’t able to perfectly recreate the issue, for a couple of cents we made decent headway toward understanding (and possibly reproducing) it.
Further work is needed to understand the efficacy of this solution from an ongoing cost (and performance) perspective, but this initial test is promising. The system will also benefit from ongoing efficiency improvements in LLMs.
If you’re interested in this experiment, leave us feedback and keep in touch at www.artemis-hq.com.