GPT-4o mini and prompt caching, a turning point in debugging?
TLDR: We are at an inflection point - prompt caching and GPT-4o mini have lowered the cost of searching a problem space enough that LLMs are now a viable strategy for recreating bugs.
It’s no secret that developers spend a lot of time on debugging, and the problem keeps growing. The debugging process typically involves a few consistent steps:
1. Triage / issue selection
2. Building context (viewing metrics, logs, monitoring, etc.)
3. Hypothesis testing (recreating bugs, isolating problematic code within systems)
4. Resolution
Tooling already exists for building context - observability stacks such as Datadog and Splunk are fantastic at helping developers understand a bug. However, not much exists for hypothesis testing, which remains a very manual process. Tools like SWE-Agent and Devin touch this problem to a degree, but it’s not their primary focus - they tend to be single-threaded and too broadly focused.
We address this gap by building a focused hypothesis-testing system. Rather than build a single agent to reproduce a bug, we restructure the problem as a search over the space of possible bug hypotheses. An LLM produces potential hypotheses in parallel in the form of debugging scripts, which run in a simulated environment to test whether they reproduce the error. Recent developments in LLM efficiency (prompt caching and GPT-4o mini) make this strategy not only effective but viable from a cost perspective.
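To make the cost point concrete, here is a minimal sketch, not our actual implementation, of fanning out candidate hypotheses against GPT-4o mini. The heavy, stable context sits at the front of the prompt so that OpenAI's automatic prompt caching can reuse it across calls; the file name, prompt wording, and fan-out of 8 are assumptions for illustration.
```
# Minimal sketch (not our production code) of fanning out candidate hypotheses
# against GPT-4o mini. Assumes the official `openai` Python SDK (v1.x) and an
# OPENAI_API_KEY in the environment; the prompt text and n=8 are illustrative.
from openai import OpenAI

client = OpenAI()

# Put the large, stable context (source excerpts, logs, issue text) at the front
# of the prompt. OpenAI's prompt caching keys on shared prefixes, so repeated
# calls across evolution rounds reuse this prefix at a discount.
shared_context = open("issue_context.txt").read()  # illustrative file name

response = client.chat.completions.create(
    model="gpt-4o-mini",
    n=8,                # request eight candidate reproduction scripts at once
    temperature=1.0,    # keep diversity high so the candidates differ
    messages=[
        {"role": "system",
         "content": "You write short Python scripts that attempt to reproduce the described bug."},
        {"role": "user",
         "content": shared_context + "\n\nWrite a standalone script that reproduces this issue."},
    ],
)

candidate_scripts = [choice.message.content for choice in response.choices]
```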
A typical flow for debugging
1. An SRE or QA engineer identifies a bug from an event trigger, customer complaint, etc., and adds an issue to the GitHub repo / Jira. These issues typically have incomplete or missing logs and supporting data.
2. A developer picks up the issue during maintenance time, or sooner if urgent resolution is needed.
3. The developer uses tools to understand the systems causing the error:
   - Grafana dashboards to understand the performance / ongoing impact of the issue
   - Logging and monitoring systems (e.g. Splunk, Datadog)
   - Searches for similar errors experienced previously
   - Distributed tracing tooling to narrow the source of the issue down to specific systems
4. The developer forms a hypothesis on the cause of the error and attempts to isolate / recreate the issue:
   - Enrich the data from the issue, e.g. add additional logging to capture missing data
   - Recreate the environment where the error occurred (state, config, system specs)
   - Recreate the code execution that generated the error
5. Finally, the developer implements the fix for the bug and resolves the issue.
This flow is clearly generalized, but it illustrates our view that the context-building tools are more mature than tooling for the other steps. We see the most potential for improvement in the hypothesis-testing portion of the process, where developers still rely on largely manual approaches.
Experiment: automate bug recreation for a real-world repo
Our goal for this experiment was to automate the recreation of hard-to-crack bugs from issues in a real development environment. We built a system that uses LLMs to search for potential reproductions. The system takes logs, source code, core dumps, and the actual bug description as inputs and builds the reproduction from them.
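As a rough illustration of this input step (the artifact names, section headers, and truncation limits below are assumptions, not our exact format), the inputs can be flattened into a single text context along these lines:
```
# Illustrative sketch of flattening bug artifacts into one prompt context.
# File names and character limits are assumptions for the example.
from pathlib import Path

def build_context(issue_text: str, log_paths: list[str], source_paths: list[str],
                  core_dump_summary: str | None = None, max_chars: int = 20_000) -> str:
    sections = [f"## Bug report\n{issue_text}"]
    for path in log_paths:
        # Keep the tail of each log, where the error usually appears
        sections.append(f"## Log: {path}\n{Path(path).read_text()[-max_chars:]}")
    for path in source_paths:
        sections.append(f"## Source: {path}\n{Path(path).read_text()[:max_chars]}")
    if core_dump_summary:
        sections.append(f"## Core dump summary\n{core_dump_summary}")
    return "\n\n".join(sections)
```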
Case Study - ComfyUI
We tested this system on the ComfyUI repo, a large, popular repo that happens to be near and dear to our hearts. Issue #4525 serves as our test case: a tough bug to reproduce, involving timing issues and certain behaviors causing lags.
ComfyUI #4525, our test case
We pulled the issue from GitHub along with the source code and attached logs. This data is preprocessed and loaded into context. The system generates multiple attempts (in parallel) at recreating the issue using different methods, and applies a loss function to judge which generated solutions are most effective at recreating it. These solutions are fed through rounds of evolution; after each iteration the top 3 solutions are selected to serve as the basis for the next round.
An evolution of the solutions, with the top 3 solutions selected.
Solutions iterating and improving
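For clarity, here is a minimal sketch of the generate/score/select loop described above. The injected callables, round count, and population size are illustrative assumptions; the only detail taken from the run above is keeping the top 3 scripts as seeds for the next round.
```
# Sketch of the evolutionary search: propose reproduction scripts, score how
# closely each run matches the reported error, and seed the next round with the
# best ones. The `generate` and `score` callables are injected because their
# real implementations (LLM calls, sandboxed execution, the loss function) are
# not shown here.
from typing import Callable

def evolve(
    issue_context: str,
    generate: Callable[[str, list[str], int], list[str]],  # (context, seed scripts, n) -> candidate scripts
    score: Callable[[str], float],                          # run a script in a sandbox; lower loss = closer match
    rounds: int = 4,       # illustrative
    population: int = 9,   # illustrative
    top_k: int = 3,        # the top 3 solutions seed each subsequent round
) -> list[tuple[float, str]]:
    seeds: list[str] = []
    best: list[tuple[float, str]] = []
    for _ in range(rounds):
        candidates = generate(issue_context, seeds, population)
        scored = sorted((score(script), script) for script in candidates)
        best = scored[:top_k]
        seeds = [script for _, script in best]  # best scripts become the basis for the next round
    return best
```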
After several iterations, the system converges on the recreated bug and surfaces its top 5 attempts at recreating the issue. The top-rated script was:
```
import asyncio
import aiohttp
import logging
import json
import sys
import os

# Import the PromptServer class from server.py
sys.path.append('/var/folders/p0/xjn9q68j5yq5by4rd9y2mpf00000gn/T/tmp5vkscqv2')
from server import PromptServer

# Configure logging to display the information
logging.basicConfig(level=logging.INFO)

# Mock request class to simulate incoming requests to the server
class MockRequest:
    def __init__(self, json_data):
        self._json = json_data

    async def json(self):
        return self._json

# Function to simulate sending a prompt to the server
async def send_prompt(server, prompt):
    json_data = {
        "prompt": prompt,
        "number": 1,
        "client_id": "test_client"
    }
    request = MockRequest(json_data)
    # Call the 'post_prompt' method which handles prompts
    response = await server.post_prompt(request)
    return response

# Function to simulate typing events while generating an image
async def simulate_typing(server, prompt):
    for i in range(5):
        logging.info(f"Typing: {prompt} {i}")
        await send_prompt(server, f"{prompt} {i}")
        await asyncio.sleep(0.5)  # Simulate typing delay

# Main function to initialize the server and run tests
async def main():
    loop = asyncio.get_event_loop()
    server = PromptServer(loop)

    # Setup server and add routes
    await server.setup()
    server.add_routes()

    # Start server on an available port to avoid address already in use error
    port = 8081
    while True:
        try:
            await server.start('localhost', port, verbose=False)
            break
        except OSError as e:
            if e.errno == 98:  # Address already in use
                logging.warning(f"Port {port} is already in use. Trying next port...")
                port += 1
            else:
                logging.error(f"Failed to start server: {e}")
                return

    # Start the image generation prompt and typing simulation
    tasks = [
        send_prompt(server, "Generate an image"),
        simulate_typing(server, "Typing while generating")
    ]

    # Run tasks concurrently
    await asyncio.gather(*tasks)
    logging.info("Test completed. Check the logs for the timing information.")

# Run the main function in the asyncio event loop
if __name__ == "__main__":
    try:
        asyncio.run(main())
    except Exception as e:
        logging.error(f"An error occurred: {e}")
```
```
Explanation: The script attempts to start a server on port 8081. If that port is already in use, it increments the port number to find an available one, thus addressing the 'address already in use' error. Overall, the script simulates typing while sending requests to the server, allowing us to reproduce timing issues related to user inputs and server responses.
Critique:
The provided script attempts to create an instance of the PromptServer and simulate user input while generating images. This closely aligns with the bug description of experiencing delays while typing during image generation. The script also attempts to start the server at an available port, which is relevant for simulating the environment described. However, the key issue is that the server fails to start due to the port being occupied, which somewhat diminishes its ability to fully recreate the timing issues stated in the bug report. Overall, it captures relevant aspects of the bug but does not completely reproduce the problem, which mostly stems from environmental setup issues.
Score:
70/100
```
It’s important to note that in this short experiment we were not able to pipe the output of this system directly into a bug recreation; it required manual tweaking to get working. Testing is continuing with better parameterization and more iterations.
Conclusion
Using this system we were able to automatically generate several viable scripts for recreating the selected issue. While we weren’t able to perfectly recreate the issue, for a couple of cents we made decent headway toward understanding (and possibly reproducing) it.
Further work is needed to understand the efficacy of this solution from an ongoing cost (and performance) perspective, but this initial test is promising. The system will also benefit from ongoing efficiency improvements in LLMs.
If you’re interested in this experiment, leave us feedback and keep in touch at www.artemis-hq.com.