# Property-Based Testing (PBT)

**Definition:** Property-based testing is a testing approach in which formal, executable specifications (properties) are written for software components, and an automated harness checks those specifications against automatically generated inputs.

## Property-Based Testing In Practice

Paper from UPenn and Jane Street.

- Developers define a property
- A harness checks the property against many random inputs produced by a generator
- If a counterexample is found, the developer is notified
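
The three bullets above can be sketched with nothing but the standard library. This is a minimal illustration, not how any real PBT library is implemented: `gen_int_list`, `check`, and both properties are made-up names, and the input domain (short lists of small integers) is an assumption.

```python
import random

def gen_int_list(rng, max_len=20):
    """Generator: a random list of small integers (an assumed input domain)."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, max_len))]

def prop_reverse_twice_is_identity(xs):
    """Property that should hold: reversing twice gives back the original."""
    return list(reversed(list(reversed(xs)))) == xs

def prop_reverse_is_identity(xs):
    """Deliberately false property, so the harness has something to find."""
    return list(reversed(xs)) == xs

def check(prop, gen, trials=200, seed=0):
    """Harness: return the first counterexample found, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value  # this is what the developer would be notified about
    return None

print(check(prop_reverse_twice_is_identity, gen_int_list))  # no counterexample
print(check(prop_reverse_is_identity, gen_int_list))        # some non-palindromic list
```

Real libraries add a lot on top of this loop, most notably shrinking a counterexample down to a minimal failing input.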

The difference between PBT and fuzzing is that fuzzing looks for crashes, whereas PBT attempts to validate properties. Fuzz tests can therefore be seen as a subset of property-based tests in which the property under test is that the program doesn't crash.

Python library for property-based testing (Hypothesis):

- https://github.com/HypothesisWorks/hypothesis
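
A small Hypothesis sketch of the fuzzing-versus-PBT distinction, using the standard `json` module as the target. The choice of target and the `json_values` strategy are my own assumptions for illustration; floats are left out on purpose so that NaN (which is not equal to itself) cannot break the equality check.

```python
import json
from hypothesis import given, strategies as st

# Dictionaries with string keys and simple JSON scalar values.
json_values = st.dictionaries(
    st.text(),
    st.one_of(st.none(), st.booleans(), st.integers(), st.text()),
)

@given(json_values)
def test_dumps_does_not_crash(obj):
    # Fuzz-style property: the only claim is that no exception is raised.
    json.dumps(obj)

@given(json_values)
def test_json_round_trip(obj):
    # Stronger property: decoding the encoded value returns the original.
    assert json.loads(json.dumps(obj)) == obj
```

Both functions can be collected by pytest or called directly with no arguments; Hypothesis supplies the inputs. The first test would pass even if `dumps` returned garbage, which is why the round-trip property is the more interesting one.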

## Agentic PBT: Finding Bugs Across the Python Ecosystem

This was written primarily by people from Anthropic, so there may be a conflict of interest here. Moreover, they seem to overstate the effectiveness: they found a single bug in NumPy and some other trivial bugs, which doesn't really seem like that successful a campaign.

### Agentic property-based testing steps

1. Define a target
    - Their testing is constrained to Python code (somewhat arbitrarily), and only to functions, files, or a module
    - Their approach should be generalizable; I see no reason a diff wouldn't work equally well
1. Prompt the agent with the following instructions (they actually include the prompt in the paper, yay!)
    1. Analyze the target
        - Figure out whether the target is a function, files, or a module
        - Kind of weird they don't specify this already, maybe this is some form of context loading...
    2. Understand the target
        - Read documentation, function signatures, and source code, use web search, whatever is needed to understand the logic.
    3. Propose properties
        - Basically, ask it to use what it has 'learned' to define some properties that should hold
    4. Write tests to exercise said properties
    5. Execute and triage tests
    6. Report bugs
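
To make steps 3 to 6 concrete, here is a toy version with no LLM involved. Everything in it is hypothetical: `dedupe` is a made-up target with a planted bug, the two properties stand in for what an agent might propose after reading the docs, and "triage" is reduced to re-running the failing input. It is not the paper's tooling.

```python
import random
from dataclasses import dataclass

def dedupe(items):
    """Toy target. Assumed intent: drop duplicates, keep first-occurrence order.
    Planted bug: the result is sorted, so the original order is lost."""
    return sorted(set(items))

# Step 3: candidate properties that should hold if the assumed intent is right.
def idempotent(xs):
    return dedupe(dedupe(xs)) == dedupe(xs)

def keeps_first_occurrence_order(xs):
    expected = list(dict.fromkeys(xs))  # order-preserving de-duplication
    return dedupe(xs) == expected

PROPERTIES = [
    ("idempotent", idempotent),
    ("keeps_first_occurrence_order", keeps_first_occurrence_order),
]

@dataclass
class BugReport:
    property_name: str
    counterexample: list
    reproducible: bool

# Steps 4-5: run each property on random inputs, then triage any failure.
def run_and_triage(properties, trials=300, seed=1):
    rng = random.Random(seed)
    reports = []
    for name, prop in properties:
        for _ in range(trials):
            xs = [rng.randint(0, 5) for _ in range(rng.randint(0, 6))]
            if not prop(xs):
                # Triage: re-run the failing input to confirm it fails again.
                reports.append(BugReport(name, xs, reproducible=not prop(xs)))
                break
    return reports

# Step 6: report whatever survived triage.
for report in run_and_triage(PROPERTIES):
    print(report)
```

Only the ordering property produces a report here; the idempotence property holds even for the buggy implementation, which shows why the choice of properties matters more than the harness.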

They spent a claimed $5,474.20 on Opus tokens to find 18 bugs, 17 of which were worth reporting. They thus spent ~$322 per bug they reported ($5,474.20 / 17)... There was also quite a bit of manual intervention at the end. It seems like it would have been better to add a final step where the agent resolves the issue and then performs some more validation, similar to the process used by Orion. That would put humans in the loop only for reviewing the PR that fixes the problem.

More context about the above:

> Our evaluation demonstrates that LLM-guided property-based testing can systematically uncover
> bugs missed by traditional testing. With a cost of $5.56/bug report, and extrapolating from our manual
> grading that 56% of these are valid bugs, our agent can find bugs for $9.93/valid bug. This is an upper
> bound on the real-world cost, where developers with domain expertise can be more judicious with
> where to target the agent. The diversity of issues, spanning numerical issues to business logic issues,
> show the power of PBT, and the ability of agents to autonomously mine for such properties.

Okay, so I don't really believe they sampled fairly for the bugs they evaluated, nor do I trust their estimate of the percentage of valid bugs. Despite this, there is a non-zero chance this approach could be useful.
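
The gap between my ~$322 figure and their $9.93 figure comes down to the denominator: I divide by the 17 bugs actually reported, while they divide by the estimated valid share of every report the agent generated. A quick check of the arithmetic, using only the numbers quoted in these notes (the implied report count is derived from rounded figures, so treat it as approximate):

```python
total_cost = 5474.20     # claimed Opus token spend
reported_bugs = 17       # bugs judged worth reporting
cost_per_report = 5.56   # quoted cost per generated bug report
valid_fraction = 0.56    # quoted share of reports graded as valid

cost_per_reported_bug = total_cost / reported_bugs
cost_per_valid_bug = cost_per_report / valid_fraction
implied_reports = total_cost / cost_per_report

print(f"per reported bug:          ${cost_per_reported_bug:.2f}")
print(f"per valid bug (extrapolated): ${cost_per_valid_bug:.2f}")
print(f"implied generated reports: {implied_reports:.1f}")
```

So the $9.93 number only holds if the 56% validity rate from the manually graded sample generalizes to all generated reports, which is exactly the part I'm skeptical of.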

They found one issue in NumPy related to the Wald distribution:

- https://github.com/numpy/numpy/pull/29609

### Can LLMs Write Good Property-Based Tests?

The version in my hands is from July 2024.
