# Property Based Testing (PBT)

**Definition:** Property-based testing is a testing approach in which formal, executable specifications (properties) are written for software components, and automated harnesses then check those specifications against automatically generated inputs.

## Property-Based Testing In Practice

Paper from UPenn and Jane Street.

- Developers define a property
- A harness checks this property against many random inputs produced by a generator
- If a counterexample is found, the developer is notified

The difference between PBT and fuzzing is that fuzzing looks for crashes, whereas PBT attempts to validate properties. In this way, fuzz tests can be seen as a subset of property-based tests where the property being evaluated is that the program doesn't crash.

Python library for property-based testing (Hypothesis):

- https://github.com/HypothesisWorks/hypothesis

## Agentic PBT: Finding Bugs Across the Python Ecosystem

This was made primarily by people from Anthropic, so there may be a conflict of interest here. Moreover, they seem to overstate the effectiveness: they found a single bug in NumPy and some other trivial bugs, which doesn't really seem like that successful a campaign.

### Agentic property-based testing steps

1. Define a target
   - Their testing is constrained to Python code (somewhat arbitrarily) and only to functions, files, or a module
   - Their approach should be generalizable; I see no reason a diff wouldn't work equally well
2. Prompt the agent with the following instructions (they actually include the prompt in the paper, yay!)
   1. Analyze the target
      - Figure out if the target is a function, files, or a module
      - Kind of weird they don't specify this already, maybe this is some form of context loading...
   2. Understand the target
      - Read documentation, function signatures, source code, use web search, whatever is needed to understand the logic.
   3. Propose properties
      - Basically, ask it to use what it has 'learned' to define some properties that should hold
   4. Write tests to exercise said properties
   5. Execute and triage tests
   6. Report bugs

They spent a claimed $5,474.20 on Opus tokens to find 18 bugs, 17 of which were worth reporting. They thus spent ~$322 per bug they reported ($5,474.20 / 17)... There was also quite a bit of manual intervention at the end. It seems like it would've been better to add a final step where the agent resolves the issue and then performs some more validation, similar to the process done by Orion. That would only put humans in the loop for reviewing the PR that fixes the problem.

More context about the above:

> Our evaluation demonstrates that LLM-guided property-based testing can systematically uncover
> bugs missed by traditional testing. With a cost of $5.56/bug report, and extrapolating from our manual
> grading that 56% of these are valid bugs, our agent can find bugs for $9.93/valid bug. This is an upper
> bound on the real-world cost, where developers with domain expertise can be more judicious with
> where to target the agent. The diversity of issues, spanning numerical issues to business logic issues,
> show the power of PBT, and the ability of agents to autonomously mine for such properties.

Okay, so I don't really believe they sampled fairly for the bugs they evaluated, nor do I trust their estimate of the percentage of valid bugs. Despite this, there is a non-zero chance this approach could be useful.

They found one issue in NumPy related to the Wald distribution:

- https://github.com/numpy/numpy/pull/29609

### Can LLMs Write Good Property-Based Tests?

The version in my hands is from July 2024.
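
For reference when reading these papers, the define-a-property / generate-inputs / report-a-counterexample loop from the notes above can be sketched with nothing but the standard library. This is a hand-rolled illustration, not the Hypothesis API and not the harness from any of the papers; `gen_int_list`, `check_property`, and `buggy_dedupe` are hypothetical names made up for the example. It also skips counterexample shrinking, which real PBT libraries typically provide.

```python
import random
from typing import Callable, List, Optional


def gen_int_list(rng: random.Random, max_len: int = 20) -> List[int]:
    """Generator: produce a random (possibly empty) list of small integers."""
    length = rng.randint(0, max_len)
    return [rng.randint(-100, 100) for _ in range(length)]


def check_property(
    prop: Callable[[List[int]], bool],
    trials: int = 500,
    seed: int = 0,
) -> Optional[List[int]]:
    """Harness: return the first counterexample found, or None if all trials pass."""
    rng = random.Random(seed)
    for _ in range(trials):
        candidate = gen_int_list(rng)
        try:
            ok = prop(candidate)
        except Exception:
            # A crash counts as a failed property (the fuzzing special case).
            ok = False
        if not ok:
            return candidate
    return None


def prop_sort_idempotent(xs: List[int]) -> bool:
    """Property that should hold: sorting twice equals sorting once."""
    return sorted(sorted(xs)) == sorted(xs)


def buggy_dedupe(xs: List[int]) -> List[int]:
    """Hypothetical buggy code under test: only removes *adjacent* duplicates."""
    out: List[int] = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def prop_dedupe_unique(xs: List[int]) -> bool:
    """Property: the output of a dedupe function contains no repeated values."""
    result = buggy_dedupe(xs)
    return len(result) == len(set(result))


def prop_max_does_not_crash(xs: List[int]) -> bool:
    """Fuzz-style property: the only requirement is 'does not raise'."""
    max(xs)  # raises ValueError on an empty list
    return True


if __name__ == "__main__":
    print("sort idempotent:", check_property(prop_sort_idempotent))
    print("dedupe unique:  ", check_property(prop_dedupe_unique))
    print("max no crash:   ", check_property(prop_max_does_not_crash))
```

The third property shows the "fuzzing as a subset of PBT" point: its only assertion is that the call doesn't raise, and the harness treats an exception as a counterexample just like any other property violation.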