Test Refactoring with AI - Efficient and Fun

Any non-trivial software needs maintenance during its lifespan, and tests are no different. We initially wrote our tests with WebdriverIO (WDIO), which was a great start in building test coverage for client projects. However, after adopting Moshe Weitzman’s Drupal Test Traits, we noticed several improvements, for instance fewer false positives and faster execution. The WDIO tests, written in JavaScript and executed in real browsers, were fragile: timing issues, accidental slowdowns in the CI environment, and even changes in browser versions could break them. Large legacy projects that survived even the Drupal 8, 9, and 10 upgrades still have WDIO tests alongside the PHPUnit test suites.

The issue with keeping the WDIO tests is twofold. First, executing and debugging them is especially inefficient for developers with no WDIO experience. Second, they slow down the CI pipeline, as all the extra packages required by WDIO must be installed. But let’s face it: it’s a boring refactor, and the business value is hard to sell since only the developers see it. Still, eliminating technical debt is a wise move in the long run, and we had the chance to act.

Challenge

The problem to solve is a business constraint: refactoring those tests is relatively easy but time-consuming, and to be sold as maintenance it needs to be done quickly. That’s where AI comes into the picture. We divided the WDIO tests into two groups: trivial and complex. We created GitHub issues for the complex ones, to be handled by human developers. We hoped to refactor the simple ones using AI. There were 15 tests in this group, and we timeboxed their conversion to 2.5 hours in total.

Solution

AI platforms offer APIs that let us interact with the models programmatically. So we asked Claude Opus to create a conversion script with this prompt:

Refactor the following WDIO tests using AI:
first.js
second.js
...
- create a python script that converts such JS WDIO tests to using Anthropic AI to phpunit tests. Sample phpunit test:

<?php

namespace Drupal\Tests\server_general\ExistingSite;

use Symfony\Component\HttpFoundation\Response;

/**
 * Test input formats.
 */
class ServerGeneralInputFormatTest extends ServerGeneralTestBase {

  /**
   * Test Full HTML input format.
   */
  public function testFullHtmlFormat() {
    $user = $this->createUser();
    $user->addRole('administrator');
    $user->save();

    $this->drupalLogin($user);

    $this->drupalGet('/node/add/news');
    $this->assertSession()->statusCodeEquals(Response::HTTP_OK);
    $this->getSession()->getPage()->fillField('edit-title-0-value', 'Test Page' . time());
    $this->getSession()->getPage()->fillField('edit-field-body-0-value', 'I cannot have a script tag: <script></script>, as that would be way too dangerous.  See   https://owasp.org/www-community/attacks/xss/. <div class="danger-danger" onmouseover="javascript: whatafunction()">abc</div>');
    $this->getSession()->getPage()->selectFieldOption('edit-field-body-0-format--2', 'full_html');
    $this->click('#edit-submit');
    // <script> tag is eliminated.
    $this->assertSession()->elementNotExists('css', '.node--type-news script');
    // The class attribute is preserved.
    $this->assertSession()->elementExists('css', '.danger-danger');
    // The onmouseover attribute is completely dropped.
    $this->assertStringNotContainsString('onmouseover', $this->getCurrentPage()->getOuterHtml());
    $this->clickLink('Delete');
    $this->click('#edit-submit');
  }
}

- path to the WDIO tests: ../wdio/specs/backend/ , new phpunit class path:  web/modules/custom/server_general/tests/src/ExistingSite/  .
 create a proper prompt for the conversion and create a python script to orchestrate the whole process.

One of the simple WDIO tests looked like this:

const Page = require('../../page_objects/page');
const assert = require('assert');

describe('About', () => {

    it('Should show translated breadcrumbs', () => {
        let page = new Page();
        page.visit('about/who-we-are');

        // Breadcrumbs assertion for this page.
        assert.strictEqual($$('nav .crumb').length, 3);
        page.assertDisplayed("//nav/div[contains(@class, 'flex')]/div[contains(@class, 'crumb')]/a[text()='Home']");
        page.assertDisplayed("//nav/div[contains(@class, 'flex')]/div[contains(@class, 'crumb')][contains(text(), 'About')]");
        page.assertDisplayed("//nav/div[contains(@class, 'flex')]/div[contains(@class, 'crumb')][contains(text(), 'Who we are')]");

        page.login('Translator');
        page.visit('/es/permanent_entity/who-we-are/translations/add/en/es');
        $('#edit-label-0-value').setValue('Quienes Somos');
        $('input[value=Guardar]').click();

        assert.strictEqual($$('nav .crumb').length, 3);
        page.assertDisplayed("//nav/div[contains(@class, 'flex')]/div[contains(@class, 'crumb')]/a[text()='Inicio']");
        page.assertDisplayed("//nav/div[contains(@class, 'flex')]/div[contains(@class, 'crumb')][contains(text(), 'Nosotros')]");
        page.assertDisplayed("//nav/div[contains(@class, 'flex')]/div[contains(@class, 'crumb')][contains(text(), 'Quienes Somos')]");
    });
});

The result was far from perfect, but with quick modifications and minimal Python knowledge, we had a working script. We could have chosen PHP for the script that controls the conversion process, but in my experience, and according to the AI’s self-assessment, Python is one of the best-represented programming languages in the models’ training data.

The generated script contains the following prompt:

You are an AI developer assistant. Your task is to convert this WDIO test to a PHPUnit test.
Here is the WDIO test content:
{wdio_test_content}
The PHPUnit test should follow this sample:
{sample_phpunit_test_content}
Please convert the WDIO test to PHPUnit test, ensuring that the test logic is preserved and adapted to the PHPUnit framework.
Only provide the PHP code as a response, nothing else.
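
The orchestration around this prompt can be sketched roughly as follows. This is a minimal sketch, not the script Claude actually generated: the function names (build_prompt, target_name, convert_all), the file-naming rule, and the injected complete callable are assumptions made for illustration. Passing the model call in as a function keeps the sketch runnable without network access.

```python
from pathlib import Path
from typing import Callable

# The prompt quoted above, with its two placeholders.
PROMPT_TEMPLATE = """You are an AI developer assistant. Your task is to convert this WDIO test to a PHPUnit test.
Here is the WDIO test content:
{wdio_test_content}
The PHPUnit test should follow this sample:
{sample_phpunit_test_content}
Please convert the WDIO test to PHPUnit test, ensuring that the test logic is preserved and adapted to the PHPUnit framework.
Only provide the PHP code as a response, nothing else."""


def build_prompt(wdio_test: str, sample_test: str) -> str:
    """Fill the template with one WDIO test and the sample PHPUnit test."""
    return PROMPT_TEMPLATE.format(
        wdio_test_content=wdio_test,
        sample_phpunit_test_content=sample_test,
    )


def target_name(wdio_file: Path) -> str:
    """Hypothetical naming rule: about-page.js -> ServerGeneralAboutPageTest.php."""
    words = wdio_file.stem.replace("-", "_").replace(".", "_").split("_")
    return "ServerGeneral" + "".join(w.capitalize() for w in words if w) + "Test.php"


def convert_all(src_dir: Path, dest_dir: Path, sample_test: str,
                complete: Callable[[str], str]) -> list[Path]:
    """Convert every *.js file in src_dir, making one model call per file.

    `complete` is an assumed hook: it takes a prompt and returns the
    model's raw text response.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for wdio_file in sorted(src_dir.glob("*.js")):
        prompt = build_prompt(wdio_file.read_text(encoding="utf-8"), sample_test)
        php_file = dest_dir / target_name(wdio_file)
        php_file.write_text(complete(prompt), encoding="utf-8")
        written.append(php_file)
    return written
```

In a real run, complete could wrap the Anthropic Python SDK, for example by calling client.messages.create(...) with a single user message and returning the text of the first content block; the model name and token limit are choices this post does not state. The source and destination directories would be the two paths given in the original prompt.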

The actual source code of the script is not really important, as language models evolve constantly. You can refer to the LLM AI Leaderboard for statistics about the best models. At the time of writing this blog post, Gemini-Exp-1206 is the most capable model, at least according to the users of lmarena.ai. Also pay attention to the WebDev Arena Leaderboard, where Anthropic’s models lead the group.

Results

Not a single PHP test could be executed as-is after the Python script finished the refactor. We had a bunch of PHP scripts; most of them were syntactically correct. However, for some, the model ignored the instruction to “Only provide the PHP code as a response, nothing else.”
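
Stray explanations of this kind can be stripped mechanically. The following is a minimal sketch under the assumption that the unwanted text appears either around a Markdown code fence or before the opening PHP tag; extract_php is a hypothetical helper, not part of the generated script, and prose that follows unfenced PHP code would still need manual cleanup.

```python
import re

# Three backticks, built at runtime so this snippet can itself live inside
# a fenced code block.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:php)?[ \t]*\n(.*?)" + TICKS,
                   re.DOTALL | re.IGNORECASE)


def extract_php(response: str) -> str:
    """Return only the PHP source from a model response.

    Prefers the contents of the first Markdown code fence; otherwise
    drops any prose that precedes the opening <?php tag.
    """
    fenced = FENCE.search(response)
    body = fenced.group(1) if fenced else response
    start = body.find("<?php")
    if start == -1:
        raise ValueError("no <?php tag found in the response")
    return body[start:].strip() + "\n"
```

Raising an error when no PHP tag is found makes it obvious which responses need a second look instead of silently writing an empty test file.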

In the end, I completed the conversion within the 2.5 hours in just a few steps. First, an explanation was sometimes part of the output despite the prompt, but it was easy to clean up. Second, we did not specify a mapping between WDIO and PHPUnit assertions, so the model hallucinated a few methods; since it did so consistently, a search-and-replace was enough to make those work. Third, some tests contained logical errors, which were the most time-consuming to fix. Then PHP_CodeSniffer (phpcs, with its fixer phpcbf) brought the code in line with the Drupal coding standards, so the pipeline could accept the new code. Finally, the PHPUnit test suite passed with the new tests.
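
The search-and-replace step for consistently hallucinated methods could look something like the sketch below. The names on the left of the mapping are invented for illustration (the post does not record which methods the model made up); the names on the right are real Mink WebAssert methods. fix_method_names is a hypothetical helper.

```python
import re

# Hypothetical mapping: invented (hallucinated) name -> real WebAssert method.
RENAMES = {
    "assertTextPresent": "pageTextContains",
    "assertTextNotPresent": "pageTextNotContains",
}


def fix_method_names(php_source: str, renames: dict[str, str]) -> tuple[str, int]:
    """Rename method calls of the form ->name( and count the replacements."""
    total = 0
    for wrong, right in renames.items():
        # Anchor on "->" and "(" so only whole method calls are renamed.
        pattern = re.compile(r"->" + re.escape(wrong) + r"\s*\(")
        php_source, count = pattern.subn("->" + right + "(", php_source)
        total += count
    return php_source, total
```

A plain rename like this only works when the real method takes the same arguments as the invented one; calls whose argument list differs still need a manual fix.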

What did we gain by involving AI? Estimation is one of the most difficult things in software engineering, but the time saved could be somewhere between 4 and 8 hours of development. The API key that I generated for the job consumed merely $0.77 for 15 invocations. Was it worth it? Absolutely.

Additionally, developer satisfaction is certainly higher this way!

Takeaway

You can boldly consider AI-based large-scale changes for your codebase too. When the change can be expressed as an algorithm, static code analysis tools with refactoring capabilities can certainly outperform AI thanks to their deterministic nature, but when the work is tedious and has some uncertainties here and there, AI via an API can help speed up the changes.

What’s next? I can imagine using AI for data quality checks in the next larger migration project, for example, letting the AI summarize what went wrong on a CI build. And for the next project where a WDIO conversion happens, AI could do better if we feed the manual steps from this blog post into the prompt!

Áron Novák