August 28, 2024

Can self-healing tests improve the automated game QA process?

Maybe it’s because I work in automated testing, but I get an ad for Rainforest QA and their self-healing tests on LinkedIn every single day.

https://www.youtube.com/watch?v=Pbr8lUT8vZM

This idea of a self-healing test caught my eye, especially as someone working on automated game testing in Unity. Games are often an order of magnitude more dynamic than that of web or app software. Even small changes in game mechanics and graphics can cause tests to break, not to mention the inherent randomness and non-deterministic behavior that make games replayable and fun.

Can we implement an effective self-healing test system for games? Let's explore this idea further and see where it leads.

What is a self-healing test?

At its core, a self-healing test is a test in which AI can rewrite portions of the test when it fails by identifying elements in a piece of software based on the intent of the original test.

It seems like the space has grown quite a bit over the past few years in the wake of LLMs. Before, pure machine learning approaches would focus on classifying new UI elements on a page when the expected elements could not be located. With LLMs, training complicated models is no longer necessary, and tools like Healenium and Rainforest QA can generally write new rules on the fly using natural-language descriptions of the desired tests. These tools primarily focus on regenerating the rules for finding elements on the page (e.g. the expected login button tag was missing, but we see this new login button - let’s use that), with quite a few limitations on wildly changing the actions taken as part of the test.

Not everyone is sold on the idea of self-healing tests. For instance, take this quote from an excellent article by Eliya Hasan regarding the practical downsides of this approach:

... the reliability of self-healing locators comes into question due to their susceptibility to false positives — misinterpreting changes and leading to incorrect adjustments in tests. This not only undermines the integrity of test results but also complicates the debugging and maintenance processes ...

We still have a long way to go, but if implemented in a way that gives humans the ultimate control over how these tests are accepted, this has the potential to really change how testing is done.

The difficulty of self-healing tests in games

Self-healing tests in games is an interesting concept because the problems that healing tests face in regular software are even more pronounced in this domain. In a game test, it's not often just a "missing element" causing the test to become flaky.

Randomness and non-determinism can cause test scenarios to be inherently non-static, introducing frequent variability in the data you are trying to test.
Even the most minor changes in game mechanics or graphics can cause downstream effects that break tests.
Testing the state of a game typically requires much more data than that of a webpage—hundreds of entities, visual effects, and complicated interconnected systems mean that some tests need to involve many parts of the game at once.

These facts of game testing often cause tests to break and become flaky, and this is cited as a common reason why developers abandon the idea of automated tests. Of course, developers need to write better tests, but this fact alone leads me to think that a system that can heal game tests could be beneficial. Just see this post from Reddit when someone asked why game developers don’t create tests early in the game development process.

https://www.reddit.com/r/gamedev/comments/1031i28/why_arent_you_seeing_tdd_related_issues_around/

I’ve seen this firsthand in Unity Test Runner files as well, where quality assurance teams will start with writing a simple script for automating their gameplay, but it eventually gets removed due to the difficulty in maintaining that file as the game grows. Especially in Unity, the support for dynamic and conditional behavior in tests is lacking (a problem we are looking to solve with smarter agents).

What I want to try

I suspect that a combination of multiple test runs, using humans in the loop, and providing better testing tools that support the dynamics of a game could be a decent way to support self-healing game tests and solve some of these issues.

There are a few things I want to think more deeply about and experiment with at Regression Games:

Can we use GPT-4o to read the logs, test results, and screenshots of the game to determine whether the test needs to be updated? We already collect this information in our Unity testing tool called Quickscope, so it would be interesting to apply it here.
Can the context of the game design be shared with LLMs to help them further understand in what cases a test is indeed failing versus situations where the test is "broken" but in an accepted state?
Can we refine these systems by constraining the types of tests that LLMs can fix? Maybe we can't automatically heal 20-minute gameplay end-to-end tests, but surely fixing Unity UI canvas tests could be a decent start?

There is a lot more to say on this subject - let's chat about it as a community! Join the conversation on our Discord.

‍

References

https://help.rainforestqa.com/docs/how-to-generate-self-healing-tests

https://www.linkedin.com/pulse/automation-self-healing-healenium-venkata-goday/

https://healenium.io/

https://codecept.io/

https://www.testim.io/test-stability/

‍