PropertyBasedTesting.md (3668B)


      1 # Property Based Testing (PBT)
      2 
      3 **Definition:** Property based testing is a testing approach where formal executable specifications (properties) are written for software components and then automated harnesses check these specifications against automatically generated inputs.
      4 
      5 ## Property-Based Testing In Practice
      6 
      7 Paper from UPenn and Jane Street.
      8 
      9 - Developers define a property
     10 - Harness checks this property using many random inputs produced by a generator
     11 - If a counterexample is found, the developer is notified
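
The three steps above can be sketched with nothing but the standard library. This is a minimal hand-rolled harness, not how any particular PBT library works; the names `check_property`, `gen_int_list`, `sort_is_idempotent`, and `reverse_is_identity` are made up for illustration.

```python
import random


def check_property(prop, generator, trials=200, seed=0):
    """Harness: return the first counterexample found, or None if all trials pass."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = generator(rng)
        if not prop(value):
            return value
    return None


def gen_int_list(rng):
    """Generator: a random list of 0 to 10 small integers."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 10))]


def sort_is_idempotent(xs):
    """Property: sorting an already sorted list changes nothing."""
    return sorted(sorted(xs)) == sorted(xs)


def reverse_is_identity(xs):
    """Deliberately false property, so the harness has something to report."""
    return list(reversed(xs)) == xs


# No counterexample is expected for the true property.
print(check_property(sort_is_idempotent, gen_int_list))
# The false property should be refuted by some non-palindromic list.
print(check_property(reverse_is_identity, gen_int_list))
```

Real libraries add a lot on top of this loop (smarter generators, shrinking of counterexamples, replay of failures), but the property / generator / counterexample split is the same.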
     12 
     13 The difference between PBT and fuzzing is that fuzzing looks for crashes, whereas PBT attempts to validate properties. In this way, fuzz tests can be seen as a subset of property-based tests where the property being evaluated is that the program doesn't crash.
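
To make the "fuzzing is PBT with a does-not-crash property" framing concrete, here is a small sketch. The target `parse_percent` and every helper name are hypothetical, chosen only for illustration. The same harness first runs the fuzzing-style property and then a stronger functional property that catches a bug which never raises.

```python
import random


def parse_percent(text):
    """Hypothetical target: parse strings such as '42%' into a fraction."""
    return int(text.rstrip("%")) / 100


def does_not_crash(func):
    """Build the fuzzing-style property 'func(x) raises nothing'."""
    def prop(value):
        try:
            func(value)
        except Exception:
            return False
        return True
    return prop


def in_unit_interval(text):
    """Stronger property: a parsed percentage should lie in [0, 1]."""
    return 0 <= parse_percent(text) <= 1


def find_counterexample(prop, generator, trials=500, seed=0):
    """Return the first generated input that violates prop, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = generator(rng)
        if not prop(value):
            return value
    return None


def gen_text(rng):
    """Generator for fuzzing: short strings over digits and '%'."""
    return "".join(rng.choice("0123456789%") for _ in range(rng.randint(0, 4)))


def gen_wellformed(rng):
    """Generator for the functional property: always '<digits>%'."""
    return f"{rng.randint(0, 999)}%"


crash_input = find_counterexample(does_not_crash(parse_percent), gen_text)
range_input = find_counterexample(in_unit_interval, gen_wellformed)
print("crashes on:", repr(crash_input))
print("out of range on:", repr(range_input))
```

The second counterexample is the kind a pure crash-oriented fuzzer would not flag, since `parse_percent` returns normally for it.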
     14 
     15 Python library for property-based testing (Hypothesis):
     16 
     17 - https://github.com/HypothesisWorks/hypothesis
     18 
     19 ## Agentic PBT: Finding Bugs Across the Python Ecosystem
     20 
     21 This was written primarily by people from Anthropic, so there may be a conflict of interest here. Moreover, they seem to overstate the effectiveness: they found a single bug in NumPy and some other trivial bugs, which doesn't really seem like that successful of a campaign.
     22 
     23 ### Agentic property-based testing steps
     24 
     25 1. Define a target
     26     - Their testing is constrained to Python code (somewhat arbitrarily) and only targets functions, files, or a module
     27         - Their approach should be generalizable; I see no reason a diff wouldn't work equally well
     28 1. Prompt agent with the following instructions (they actually include the prompt in the paper, yay!)
     29     1. Analyze the target
     30         - Figure out if the target is a function, files, or a module
     31             - Kind of weird they don't specify this already, maybe this is some form of context loading...
     32     2. Understand the target
     33         - Read documentation, function signatures, source code, use web search, whatever is needed to understand the logic.
     34     3. Propose properties
     35         - Basically, ask it to use what it has 'learned' to define some properties that should hold
     36     4. Write tests to exercise said properties
     37     5. Execute and triage tests
     38     6. Report bugs
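
As a rough illustration of steps 3 to 5, here is what a proposed property and its test run might look like for a hypothetical target, the standard library `json` module. The round-trip property is my own choice, not one taken from the paper, and the hand-rolled generator stands in for a PBT library such as Hypothesis.

```python
import json
import random

ALPHABET = ["a", "b", " ", '"', "\\", "\n", "é"]


def gen_json_value(rng, depth=0):
    """Random JSON-compatible value; floats are left out to avoid NaN noise."""
    if depth >= 3 or rng.random() < 0.5:
        kind = rng.choice(["none", "bool", "int", "str"])
        if kind == "none":
            return None
        if kind == "bool":
            return rng.choice([True, False])
        if kind == "int":
            return rng.randint(-10**6, 10**6)
        return "".join(rng.choice(ALPHABET) for _ in range(rng.randint(0, 5)))
    if rng.random() < 0.5:
        return [gen_json_value(rng, depth + 1) for _ in range(rng.randint(0, 3))]
    return {f"k{i}": gen_json_value(rng, depth + 1) for i in range(rng.randint(0, 3))}


def round_trips(value):
    """Step 3, proposed property: decoding an encoded value gives it back."""
    return json.loads(json.dumps(value)) == value


def run_property(prop, generator, trials=300, seed=0):
    """Steps 4 and 5: run the test and collect failures for triage."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        value = generator(rng)
        try:
            ok = prop(value)
        except Exception as exc:  # a crash is also a finding worth triaging
            failures.append((value, repr(exc)))
            continue
        if not ok:
            failures.append((value, "property returned False"))
    return failures


failures = run_property(round_trips, gen_json_value)
print(f"{len(failures)} failing inputs")
```

An empty `failures` list means nothing to report; a non-empty one is where triage starts, since a failure can just as easily mean the property or the generator is wrong rather than the target.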
     39 
     40 They spent a claimed $5,474.20 on Opus tokens to find 18 bugs, 17 of which were worth reporting. They thus spent ~$322 per bug they reported... There was also quite a bit of manual intervention at the end. It seems like it would've been better to add a final step where the agent resolved the issue and then performed some more validation, similar to the process done by Orion. That would only put humans in the loop for reviewing the PR that fixes the problem.
     41 
     42 More context about the above:
     43 
     44 > Our evaluation demonstrates that LLM-guided property-based testing can systematically uncover
     45 > bugs missed by traditional testing. With a cost of $5.56/bug report, and extrapolating from our manual
     46 > grading that 56% of these are valid bugs, our agent can find bugs for $9.93/valid bug. This is an upper
     47 > bound on the real-world cost, where developers with domain expertise can be more judicious with
     48 > where to target the agent. The diversity of issues, spanning numerical issues to business logic issues,
     49 > show the power of PBT, and the ability of agents to autonomously mine for such properties.
     50 
     51 Okay, so I don't really believe they sampled fairly for the bugs they evaluated, nor do I trust their estimate of the percentage of valid bugs. Despite this, there is a non-zero chance this approach could be useful.
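
The cost figures above use different denominators, which is easier to see when the arithmetic is written out. The inputs below are the numbers quoted in these notes. The "implied reports" line is my own back-calculation, assuming the $5.56 figure is the same total spend divided by the number of generated reports, so it is only approximate.

```python
total_spend = 5474.20     # USD, claimed total spend on tokens
reported_bugs = 17        # bugs judged worth reporting
cost_per_report = 5.56    # USD per generated bug report, from the quote
valid_fraction = 0.56     # share of graded reports judged valid, from the quote

cost_per_reported_bug = total_spend / reported_bugs
cost_per_valid_bug = cost_per_report / valid_fraction
# Assumption: cost_per_report = total_spend / number of generated reports.
implied_reports = total_spend / cost_per_report

print(f"per reported bug: ${cost_per_reported_bug:.2f}")
print(f"per valid bug (extrapolated): ${cost_per_valid_bug:.2f}")
print(f"implied generated reports: about {implied_reports:.0f}")
```

So the ~$322 figure divides the spend by the bugs actually reported, while the $9.93 figure divides it by an extrapolated count of valid reports.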
     52 
     53 They found one issue in NumPy related to the Wald distribution:
     54 - https://github.com/numpy/numpy/pull/29609
     55 
     56 ## Can LLMs Write Good Property-Based Tests?
     57 
     58 The version in my hands is from July 2024.
     59