jungsteven 6 days ago

Hey, engineer here. Mutation testing may not be a familiar concept. Put simply, it measures how effective your unit tests are at catching faults: a tool deliberately injects small faults (mutants) into your codebase and checks whether your tests fail.
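
A tiny hand-written illustration (not tied to any particular tool): a mutation tool applies a small change (a "mutant") to your code, reruns your tests, and checks whether anything fails. If nothing fails, the mutant "survives" and points at a gap in your suite.

    # Code under test
    def is_adult(age):
        return age >= 18

    # A mutation tool would swap in small variants ("mutants"), e.g.
    # ">=" mutated to ">":
    def is_adult_mutant(age):
        return age > 18

    # A suite that only checks the obvious cases still passes when the mutant
    # replaces the original -- the mutant "survives", exposing a missing test:
    def test_obvious_cases():
        assert is_adult(30) is True
        assert is_adult(10) is False

    # Adding the boundary case "kills" the mutant (the mutant returns False for 18):
    def test_boundary():
        assert is_adult(18) is True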

If you are interested in learning more about mutation testing and how big tech companies use it, read:

- State of Mutation Testing at Google: https://research.google/pubs/state-of-mutation-testing-at-go...
- Industrial Application of Mutation Testing: https://homes.cs.washington.edu/~rjust/publ/industrial_mutat...
- LLM-based Mutation Testing: https://arxiv.org/pdf/2406.09843
- Medium Blog on Mutahunter: https://medium.com/codeintegrity-engineering/transforming-qa...
- Short Demo: https://www.youtube.com/watch?v=8h4zpeK6LOA

Feel free to ask me any questions. I'd love to see mutation testing become more widely adopted!

  • vlovich123 6 days ago

    Since LLMs necessarily generate mutations more slowly than traditional techniques and generally cost more, why doesn’t the paper compare against traditional mutation testing frameworks on bugs found per dollar and per unit of time spent testing? That seems like an important criterion for justifying that LLMs are worth it.

    The abstract claims LLMs are 18% better than traditional approaches, but I can’t actually find that in the body of the paper (unless uBert is the “traditional way” but that’s an LLM approach too).

    • jungsteven 6 days ago

      Nice question! The paper acknowledges that LLMs generate mutations more slowly and at higher cost than traditional tools like PIT and Major, and it does include metrics like cost per 1K mutations. However, the researchers focused on the effectiveness and quality of the mutations generated by LLMs. For instance, GPT-3.5 achieves a 96.7% real-bug detectability rate compared to Major’s 91.6% (not to mention GPT-4 outperformed all of them). All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.

      • vlovich123 6 days ago

        > All in all, LLMs produced fewer equivalent mutants, mutants with higher fault detection potential, as well as higher coupling and semantic similarity with real faults.

        The problem with PIT and Major is that they don’t do profile-guided mutation testing [0], which in theory would raise the detectability rate without a meaningful cost increase (rough sketch below). Other work explores the use of GANs [1], which would probably be cheaper and likely as effective, but not as sexy as LLMs.

        [0] https://arxiv.org/pdf/2102.11378

        [1] https://ar5iv.labs.arxiv.org/html/2303.07546
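
        Concretely, as I understand it, the profile-guided idea is to weight candidate mutation sites by how often they are actually executed, so the mutant budget goes where faults would be observable. A made-up sketch of one way that could look (hypothetical data and names, not the paper's actual approach):

          import random

          # Hypothetical sketch of profile-guided mutant selection: candidate
          # mutation sites paired with how often each line is hit according to
          # profiling data (all names and numbers are made up).
          candidate_sites = [
              {"file": "checkout.py", "line": 42, "op": ">= -> >",   "hits": 120_000},
              {"file": "checkout.py", "line": 77, "op": "+ -> -",    "hits": 35_000},
              {"file": "legacy.py",   "line": 10, "op": "and -> or", "hits": 3},
          ]

          def pick_mutants(sites, budget):
              # Weight selection by hit counts so rarely executed (and rarely
              # observable) code doesn't consume the mutant budget.
              weights = [s["hits"] for s in sites]
              return random.choices(sites, weights=weights, k=budget)

          print(pick_mutants(candidate_sites, budget=2))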

        • jungsteven 6 days ago

          Thanks for sharing the papers! I remember reading the first one from Google and can’t wait to dive into the new one. Appreciate the insights!

    • jungsteven 6 days ago

      + it would be great to see more research comparing LLMs and traditional methods in terms of cost-effectiveness. As for the cost issue, in my opinion there are engineering ways to mitigate it, such as running the LLM only on changed code (sketch below).
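
      For example (a hypothetical sketch, not Mutahunter's actual implementation), you could parse the git diff and only generate mutants for the lines the change touched, so LLM calls scale with the size of the diff rather than the whole codebase:

        import subprocess

        # Hypothetical sketch: collect the lines touched by the current change
        # and restrict mutant generation to them. Not Mutahunter's actual code.
        def changed_lines(base="origin/main"):
            diff = subprocess.run(
                ["git", "diff", "--unified=0", base],
                capture_output=True, text=True, check=True,
            ).stdout
            targets, current_file = {}, None
            for line in diff.splitlines():
                if line.startswith("+++ "):
                    path = line[4:]
                    current_file = path[2:] if path.startswith("b/") else None
                elif line.startswith("@@") and current_file:
                    # hunk header: @@ -old_start,old_count +new_start,new_count @@
                    new_range = line.split()[2].lstrip("+")
                    start, _, count = new_range.partition(",")
                    lines = range(int(start), int(start) + int(count or 1))
                    targets.setdefault(current_file, set()).update(lines)
            return targets

        # e.g. {"src/payment.py": {12, 13, 40}} -- only these locations would be
        # sent to the LLM for mutant generation.
        print(changed_lines())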

  • radiospiel 6 days ago

    Quick feedback on the presentation:

    - a one-liner above the video explaining what you are doing would be helpful,
    - and "If you don't know what mutation testing is, you must be living under a rock!" drives people away from your repo faster than you can look the other way.

    • jungsteven 6 days ago

      Appreciate the feedback and suggestions!
