Dev tools AI‑fication 101: lessons from Martian robots

Today, software engineers have GPTs, LLMs, RAGs, and even llamas (🦙 “mhe-e”) coming at them from every direction. Yes, it’s becoming normal for every service or tool to have some kind of special intelligence baked in. As developer tool builders, it’s natural to experience AI-FOMO. And if that’s you, don’t worry—we’re here to help. Read on to see how we approached the AI-ification of our TestProf toolkit and start embracing LLMs today!

For years, Evil Martians have been building for developers. We’ve seen many technology hype waves come and go: “NoSQL is the (unstructured) way!”; “We’re moving away from our monolith to a GraphQL federation (oh, and it’s gonna be serverless, of course)”; and one of our favourites: “Can you help us re-write our software in Go (then Rust)?”.

But, by comparison, the GPT wave has turned out to be a real tsunami, hasn’t it? And, while some have managed to ride the wave, others have been overtaken by it, and the majority are still floating about as they try to learn how to swim in these uncharted waters.

Yet, it’s now inevitable that AI will become a fundamental component of every software engineer’s life. (Alongside code editors and IDEs, virtualization and containerization systems, compute clouds and distracting video calls.) That’s why it’s essential for everyone building developer tools to be ready for the AI-ification of their software.

We want to share our method of bringing intelligence to developer tools based on our recent experience with TestProf, a Ruby test profiling and optimization toolbox.

We’ll use TestProf as an example as we discuss the major aspects of building with LLMs: finding your AI use case, picking the right LLM for the job, prompt engineering, and packaging the result into a developer-friendly UI.

Finding your AI use case

Okay, so, you’ve realized it’s time to join the AI bandwagon. But where to start?

First, try asking yourself some questions. Unfortunately, the obvious starting question, “How can AI make my tool better?”, doesn’t usually help much. “Better” is just too vague. Instead, we need to narrow down the better-ness scope.

Let’s try a different angle: why do engineers choose a particular developer tool? Because they want to improve the developer experience and, thus, increase their productivity. “Experience” is the key here, and that’s exactly what we should aim to improve.

Let’s hop back into our story: developers use TestProf to profile and refactor tests. This process usually consists of two phases:

  1. From general profiling, flamegraph analysis, and configuration-level tweaks…
  2. …to micro-patching test files individually.

The former step usually requires more context to identify problems and some creativity in order to solve them efficiently (i.e., with the least possible amount of refactoring). Both “more context” and “creativity” indicate that it could be problematic to teach (current) LLMs to perform these tasks; it’s not impossible, but probably isn’t worth the effort. Instead, it’s better to look for low-hanging fruit first, that is, to find the developer experience scenarios that can be more easily delegated to AI.

Now, what about that second phase, the refactoring of the test files one by one? Let’s take a look at a minimal example of this kind of refactoring:

 RSpec.describe User do
-  let(:user) { User.create(name: 'Sara Connor', email: 'sara@sky.net') }
+  let_it_be(:user) { User.create(name: 'Sara Connor', email: 'sara@sky.net') }

   it 'knows how to escape from a T-800' do
     expect(user.escape_skills).to include('T-800 evasion')
   end

   it 'is not a robot' do
     expect(user.robot?).to be_falsey
   end
 end

Yeah, that’s it! Now, you might think that this kind of task can be accomplished with a good old “Find and replace” operation, no AI or human required. But that’s only true in some cases.

Most of the time, there are hidden dependencies and other edge cases that must be resolved by hand. Just to give you a visual approximation, imagine a task where you need to replace all the mentions of the color “orange” in a book with “red”. Simple enough? But there’s a catch! “Orange” can be represented in many different forms, like “orangish”, or “#FFA500”, or “a mix of yellow and red”. Oh, and there are also derivatives of orange you must update, too!
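Back in RSpec land, here’s a hypothetical illustration of such a hidden dependency (this snippet is ours, not from an actual refactoring session): blindly swapping let for let_it_be shares one in-memory record across examples, so a mutation in one example can break its neighbor.

RSpec.describe User do
  # With let, every example gets a fresh record.
  # After a blind let -> let_it_be replacement, the record (and its
  # in-memory state) is created once and shared by all examples below.
  let(:user) { User.create(name: 'Sara Connor', email: 'sara@sky.net') }

  it 'can change its name' do
    user.update!(name: 'Sarah Connor')
    expect(user.name).to eq('Sarah Connor')
  end

  it 'keeps the original name by default' do
    # Passes with let, but fails after the naive replacement above:
    # the previous example already renamed the shared object.
    expect(user.name).to eq('Sara Connor')
  end
end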

So, we have a repetitive task that can be solved 90% programmatically with the remaining 10% requiring a pair of eyes. Fair enough.

Still, and most importantly, this is a pretty boring task! Boring tasks hurt the developer experience; no one likes them. But you know who doesn’t mind being bored? AI.

Routine, boring tasks are the best candidates for delegating to LLMs.

We decided that single file refactoring would be the perfect task for an AI assistant:

  • We minimized the context (one file) and the instruction set (possible refactoring techniques, e.g., let -> let_it_be).
  • We have a natural way to determine operational success (the test must pass, the timings must decrease).
  • Since we’re operating on a per-file basis, we can scale the process horizontally and, thus, significantly boost engineer productivity; instead of performing the refactoring themselves, engineers only need to do code reviews.

And this is our candidate for AI-ification. Now, we need to figure out whether we can bring this idea to life, and to do that, first of all, we need to pick the right LLM for the job.

Picking the right LLM for the job

Today’s LLM market is plentiful: local and cloud-based, open source and commercial, cheap and expensive, and, of course, a few released just a week ago. So, how do you pick the right one for your task? Let us share our recipe.

We usually start with the hottest (that is, the most recently released) LLMs and add well-performing ones (according to some rankings, like this one) to the list of candidates. Only after (and if) we’ve found an LLM that succeeds do we try to downscale, that is, try smaller (and cheaper) models, or local/open source ones.

How do we evaluate and compare LLMs? In this case, we don’t use any specific comparison system, just eyeballing. We chat with the LLMs, ask them to solve the task (refactor a test file), and check how well each of them understands our instructions and how they recover from failures.

For TestProf AI, we used OpenRouter to play with different LLMs. With OpenRouter, you can chat with any supported LLM using the same UI. This is pretty handy for our hypothesis verification process.
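If you prefer scripting this comparison over clicking through the chat UI, OpenRouter also exposes an OpenAI-compatible HTTP API. Here’s a rough Ruby sketch of sending the same refactoring prompt to a couple of models (the model slugs and response shape follow OpenRouter’s documentation at the time of writing; adjust as needed):

require "net/http"
require "json"

OPENROUTER_URL = URI("https://openrouter.ai/api/v1/chat/completions")

# Send a single user message to the given model and return its reply
def ask(model, prompt)
  request = Net::HTTP::Post.new(OPENROUTER_URL)
  request["Authorization"] = "Bearer #{ENV.fetch("OPENROUTER_API_KEY")}"
  request["Content-Type"] = "application/json"
  request.body = {model: model, messages: [{role: "user", content: prompt}]}.to_json

  response = Net::HTTP.start(OPENROUTER_URL.host, OPENROUTER_URL.port, use_ssl: true) do |http|
    http.request(request)
  end

  JSON.parse(response.body).dig("choices", 0, "message", "content")
end

prompt = "Refactor this RSpec file to use let_it_be:\n\n#{File.read("spec/models/user_spec.rb")}"

["anthropic/claude-3.5-sonnet", "openai/gpt-4o"].each do |model|
  puts "=== #{model} ===", ask(model, prompt)
end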

So, which questions did we ask to evaluate the different models? We didn’t want them to carry out the full refactoring, and we were mostly interested in the two things discussed below.

First of all, we wanted to see if an LLM would be capable of taking source code as input, doing some minor refactoring (like a “find and replace” operation) and responding with the refactored source code in full and in a parseable form.

In other words, we wanted to ensure that the LLM understands the requirements. For example, with some LLMs it was hard to persuade them to respond with the full source code (and not “# 10 more lines…” placeholders). Accordingly, these were removed from the list of candidates.

Second, we wanted to determine if the LLM has sufficient Ruby, Rails, and RSpec knowledge to recover from errors (in our case, these would be test failures after refactoring). So, we pasted the test run output with error messages and stack traces and asked the AI to fix the refactored code. It’s worth noting that we also limited the number of recovery attempts to 5-7 (depending on the file size—we need to fit the whole conversation into the context window).

In the end, we were able to prove the hypothesis (that an LLM can assist us with test refactoring) and found the model to integrate into TestProf. And the winner was…(🥁)…Claude 3.5 Sonnet! Honestly, this model exceeded all our expectations about how “smart” and reliable an AI could be.

What about the others? We couldn’t get ChatGPT 4o to reliably recover from errors, that is, to produce a refactored and non-failing test file.

Open source LLMs, wizardlm2 and deepseek-coder, showed promising results, but they were just not good enough (although we plan to fine-tune them to see if we can use them as a cheaper replacement for Claude).

Other LLMs, like ChatGPT 3.5, couldn’t return syntactically correct Ruby code on many occasions.

With the LLM selection out of the way, it’s time to integrate our chat-based proof of concept into a product.

On prompts and loops

We’ve already figured out the basics of talking to AI during the LLM evaluation phase when we chatted with different models and tried to convince them to solve our problems. In our case, these conversations were pretty random and consisted of many iterations. Now, we need to automate and stabilize this process. Let’s do some prompt engineering.

When working with AI models, it’s best to start with zero- or few-shot prompting—asking the AI to perform tasks with no examples, or just a couple. This approach revealed Claude’s extensive knowledge of Ruby, RSpec, and even TestProf. So, we were able to avoid explaining the basics of writing tests in Ruby and even describing TestProf recipes.

Our first prompt iteration included two primary blocks: must blocks and how-to blocks. The former aim to keep the LLM’s imagination under control, while the latter give additional context. Here’s what we had in the prompt:

<!-- Task definition -->
You have been asked to refactor a test file to improve its performance.

<!-- MUST-s -->
You MUST keep your answers very short, concise, simple and informative.
You MUST introduce changes only when they are absolutely necessary and bring noticeable performance improvements (don't over-optimize).
You MUST always send back the whole file, even if some parts of it didn't change.
The file contents MUST be surrounded by __BEGIN__ and __END__ tokens.

<!-- HOW-TO-s -->
You SHOULD use TestProf's let_it_be and before_all features.

Use the following example refactoring as a guide:

%{example_git_diff}

Most of the prompt actually looks like a super detailed task definition for a human engineer. However, there are a couple of things that indicate we’re writing for a machine.

First, we try to get parseable output by instructing the LLM to use the __BEGIN__ and __END__ tokens.

Getting structured output has become its own class of LLM-related problems, and you’ll definitely face it when you try to do this for your project. In some cases, a regexp and a polite request work fine (like in our example); in other situations, you may have to use special tools or recovery sub-prompts (yes, you can ask an LLM to fix the format).
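For reference, the regexp approach for our __BEGIN__/__END__ markers fits in a few lines (the helper name below is made up for this sketch):

# Extract the refactored file contents from the model's reply,
# relying on the __BEGIN__/__END__ markers requested in the prompt
def extract_source(reply)
  match = reply.match(/__BEGIN__\s*(.*?)\s*__END__/m)
  # If the markers are missing, this is where a "fix the format" sub-prompt kicks in
  raise "No __BEGIN__/__END__ block found" unless match

  match[1]
end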

One thing we know for sure—you’re gonna miss a human’s ability to understand at a glance!

Second, we provide an example via the Git diff: a machine-readable representation of the refactoring. We do this both to save available context space and to make the instructions as clear as possible.
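If the prompt lives in a plain Ruby string, filling in the %{example_git_diff} placeholder is a one-line format call (the constant and the diff path below are just for illustration):

# PROMPT_TEMPLATE is the prompt text shown above, stored as a Ruby string
prompt = format(PROMPT_TEMPLATE, example_git_diff: File.read("examples/let_it_be.diff"))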

Give AI a soul

So, how did our initial prompt perform? Frankly speaking, so-so. Some files were refactored with success, others failed. It turned out that we forgot one of the key ingredients for cooking a perfect prompt—we forgot to specify the AI’s identity!

Enhancing the context with an identity can significantly boost result quality.

We found that the Pygmalion effect holds for LLMs, too: if you tell an AI to act like an experienced engineer, it will try to do its best not to fail you.

So, we changed the beginning of our prompt as follows:

<!-- Identity -->
You're an experienced Ruby on Rails engineer, attentive to details, and who has spent the last 10 years optimizing performance of *_spec.rb files using the TestProf toolbox.

...

Surprisingly (for us, at least), the results got noticeably better. However, the prompt still produced mistakes. That’s fine. We don’t expect experienced engineers to refactor tests in a single attempt without ever breaking them. So, we need to instruct our LLM to work on its mistakes.

Thought, Action, PAUSE, Observation

We need to equip our LLM with result verification and evaluation tools. In other words, we should provide some feedback and let the AI iterate on it. The most common way to do that is to implement a Thought-Action-PAUSE-Observation loop, aka a ReAct pattern.

In this framework, you give the LLM a set of actions to perform, e.g., “run_rspec”, “run_factory_prof”, and so on.

At the Thought phase, the LLM may decide to perform an Action. If so, it moves to the PAUSE phase and waits for the action to be executed (usually, this action is a function in your code). Once the action completes, you resume the conversation with the AI and provide the results of the action. The LLM observes the results and decides what to do next: terminate or execute another action.

We started with a single action: “run_rspec”. It runs a given RSpec test file (contents are provided by the LLM) and returns the output (including some test profiler information). Thus, the LLM observes whether the test has passed and ran faster, and if both hold true, the refactoring terminates. Otherwise, it iterates on the test contents and runs the action again. Just like a human, right?
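Here’s a simplified sketch of what such a run_rspec action might look like on our side (the real implementation also extracts profiling numbers; this one just runs RSpec with FPROF=1 to enable TestProf’s FactoryProf):

require "open3"

# Overwrite the spec file with the LLM-provided contents, run it,
# and return the output to be fed back as the Observation
def run_rspec(path, test_contents)
  File.write(path, test_contents)

  # FPROF=1 enables TestProf's FactoryProf report in the RSpec output
  output, _status = Open3.capture2e({"FPROF" => "1"}, "bundle", "exec", "rspec", path)
  output
end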

So, the prompt evolved as follows:

<!-- ReAct instruction -->
You run in a loop of Thought, Action, PAUSE, Observation.

At the end of the loop, you may output an Answer or
re-enter the loop if you're not satisfied with the Observation.

Use Thought to describe your thoughts about the question you have been asked.

Use Action to run one of the actions available to you,
then return PAUSE and stop.

Observation will be the result of running those actions.

Every step of the loop MUST start with the corresponding keyword (Question, Thought, Action, PAUSE, Observation, Answer) followed by a colon and a space.

The Action keyword is only followed by the action name; the action payload goes on the next lines.

The action payload MUST end with the __END__ keyword.

Your available actions are:

run_rspec:

Example (it's a multiline action):

Action: run_rspec
<Ruby RSpec code>
__END__

Runs the given test contents and returns the RSpec output containing TestProf profiling information (FactoryProf).

...

(It also makes sense to include an example conversation in the prompt, since clearer instructions and examples mean better results.)

Then, in our code, we scan for the loop phase indicators. Whenever we encounter an action, we parse and execute it. We continue to loop until we reach the max number of attempts. Usually, 3-5 runs is enough; if an LLM cannot solve the task in 5 rounds, it’s likely to get stuck. Additionally, the context space is not unlimited.
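In Ruby terms, that driver loop can be sketched roughly like this, reusing the run_rspec helper from above (ask_llm stands in for whatever client call continues the conversation; none of these names come from TestProf itself):

MAX_ATTEMPTS = 5

def react_loop(messages, spec_path)
  MAX_ATTEMPTS.times do
    reply = ask_llm(messages) # hypothetical: sends the conversation, returns the model's text
    messages << {role: "assistant", content: reply}

    # The model is done once it produces an Answer
    return reply if reply.match?(/^Answer:/)

    # Otherwise, look for an action request, execute it, and feed back the Observation
    if (action = reply.match(/^Action: run_rspec\n(.*?)__END__/m))
      observation = run_rspec(spec_path, action[1])
      messages << {role: "user", content: "Observation: #{observation}"}
    end
  end

  raise "The LLM couldn't finish the refactoring in #{MAX_ATTEMPTS} attempts"
end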

In the end, our prompt reached almost a hundred lines of text (not including dynamic parts, like Git diffs). But size is not what matters here, of course, and the success ratio of this prompt became high enough to finally make it a part of our development workflow.

GitHub Actions as a developer-friendly UI

Once you’re satisfied with a prompt (and its ReActive capabilities), you might be eager to go ahead and pack it into a new tool (or even a product) right away. After all, you want your users to be able to benefit from AI-powered features as soon as possible.

But, for the initial launch, it’s very important to collect feedback and react to it as quickly as possible. The fast-paced evolution of AI and LLMs is another thing you’ll have to adapt to, so it’s better to minimize the maintenance friction of your new AI project.

For instance, with our TestProf AI assistant, we decided to postpone integrating it into our TestProf Autopilot project, or even releasing it as a standalone tool. No web service either: we don’t want to deal with managing users, LLM API keys, source control integrations, and so on. Instead, we packed our solution into a GitHub Action.

Since the main purpose of TestProf AI is to refactor code (tests), making it a part of the GitHub Flow seemed natural. We can orchestrate everything using GitHub Actions. Developers still need to review changes, and pull requests work for that perfectly. Meanwhile, issues can be used as task definitions. Finally, the code doesn’t leave GitHub, a place most of our users already trust.

So, we delegated most of the boring tasks to the existing platform, leaving us to focus on improving our AI project.

Working on developer tools comes with many time-saving opportunities where you can integrate your project into existing platforms and environments instead of building everything yourself. Don’t miss these!

TestProf AI GitHub Action at a glance

Our GitHub flow works as follows:

  • A user creates an issue with the “test-prof” label and specifies a path to the test file to be refactored.

  • A CI action including the test-prof-aiptimize step is triggered.

  • Our action opens a PR and posts comments and code updates on every AI loop iteration (so you can watch the whole thought process in real time).

TestProf AI GitHub Action demo

In the video above, you can see that we leverage one of GitHub’s recent additions (still in beta): issue forms. This way, we are able to control how the target file path is formatted in the issue description (so we can extract it programmatically later). We also automatically attach the required label via our issue template.
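Since the issue form gives the path a predictable spot in the issue body, extracting it can be a one-liner (the section heading here is just an assumption about how such a form might be labeled; the real action handles this its own way):

# Pull the spec file path out of the issue body rendered by the issue form,
# assuming the form field is labeled "Test file path"
def extract_spec_path(issue_body)
  issue_body[/### Test file path\s+(\S+_spec\.rb)/, 1]
end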

The workflow to process this issue request can be defined as follows:

# .github/workflows/test-prof.yml
name: TestProf AI

on:
  issues:
    types: [labeled]

jobs:
  optimize:
    # IMPORTANT: only run this workflow for explicitly labeled issues
    if: github.event.label.name == 'test-prof'
    runs-on: ubuntu-latest

    env:
      RAILS_ENV: test
      # ... this is where your environment setup goes

    steps:
      - uses: actions/checkout@v4
      - uses: ruby/setup-ruby@v1
        with:
          bundler-cache: true
      - name: Prepare app
        run: |
          bundle exec rails db:test:prepare

      - name: Run TestProf AI
        uses: test-prof/test-prof-aiptimize-action@main
        with:
          api-key: ${{ secrets.ANTHROPIC_API_KEY }} # any repository secret holding your Claude API key
          issue-number: ${{ github.event.issue.number }}

That’s it! All you need to do is drop in this kind of workflow (with your application-specific testing setup) and add a Claude API key to the repository secrets.

We’ve been testing our TestProf AI-ification against the Mastodon codebase, and it has shown some pretty good results. You can find those in our fork. The source code of the action itself is also available on GitHub: test-prof/test-prof-aiptimize-action.

Currently, our GitHub Action works autonomously and is only able to handle a single file per issue. One natural extension would be treating issue comments as user commands (like, “Please, refactor another file: path/to/file_spec.rb” or “Please, avoid using let_it_be in nested contexts”). That would make the experience even more interactive and human-like. And we don’t need to leave the comfort zone of the platform to implement such features!

We hope you found our AI-ification journey inspiring and helpful. Don’t hesitate to reach out if you have any questions about bringing AI to your developer-facing tool!

Supercharge your developer tool with AI

We added AI functionality to our open source TestProf in a matter of weeks, putting it to the test on the Mastodon codebase. Hire us to seamlessly integrate AI into your developer tool, creating intuitive and powerful solutions that developers love!

Book a call