# SoK: Where to Fuzz

**Source:** SoK: Where to Fuzz? Assessing Target Selection Methods in Directed Fuzzing

## Background

It is common to improve fuzzing performance by targeting regions of interest within a program instead of the entire program. This paper analyzes target selection methods for directed fuzzing.

## Selection Methods

Discrete Scoring:

- Assigns each code location a 1 if it is relevant for fuzzing and a 0 otherwise (most tools fall into this category)
- The distance to relevant code locations is often used as a proxy score for indirect ranking and comparisons of locational importance.

Continuous Scoring:

- Assigns a continuous value to all code locations

## Their Experimental Setup

- They have a corpus of 1600 crashes from 97 software projects from OSS-Fuzz
  - These are considered ground truth issues
- They then use each of the methods to pick out code blocks of interest based on the method's weightings
- They then check how many of the ground truth issues are covered by each approach

## Questions & Critiques

- How do they deal with potential differences between actual code failure locations and what is tested with OSS-Fuzz?
  - One could imagine there are even more issues throughout the codebases that haven't yet been caught by OSS-Fuzz or are insufficiently covered by it, making the ground truth dataset faulty in the sense that it doesn't represent the true distribution of errors / failures / vulnerabilities.

> While there is still no guarantee that OSS-Fuzz identified every bug,
> the massive amount of time spent on fuzzing these targets is likely
> to find a large majority of crashes reachable by a fuzzer.

That's kind of lame TBH. That said, if they wrote code with some number of planted vulnerabilities, that would be even more contrived. Ideally, they'd use the entirety of some CVE database because existing fuzzers might still be missing key issues. What they did seems acceptable, albeit limited.

- Their evals are run against open source C and C++ projects from 2016 - 2023
  - This seems like it is limited in a few ways
    - It is only projects written in two languages that aren't exceptionally popular to start projects with today
      - i.e. these are likely mature projects
    - These are open source projects
      - These are distributionally different than proprietary projects, but in what ways I'm unsure

- They are only using issues found by OSS-Fuzz as their metric, but this doesn't weight them according to importance
  - Maybe this doesn't matter as one vulnerability means the system is vulnerable, but still, it seems prudent to weight specific issues more highly than others.

## Results

They score each approach on their dataset with two variants of NDCG (normalized discounted cumulative gain). NDCG- counts only the most recent function in the crash's stack trace as relevant, while NDCG+ counts every function in the stack trace, which gives an underestimate and an overestimate of retrieval quality. They found the following:

- Leopard-V performs the best followed by Leopard-C, Sanitizer, then CodeT5+.
  - The top three are deterministic. There are also worse deterministic ones, but it is not the case that the ML approaches performed better (or even on par)
    - That said, the paper is from July 2024, so LLMs were less sophisticated back then and the literature was more limited.
    - I'd be curious to see how this stacks up today.
- Every approach except Linevul outperforms the random baseline by a statistically significant amount

> Target selection methods based on software metrics perform
> significantly better than every other considered
> method. The best software metric, Leopard-V, correctly
> captures as much as 13% of the crashes with its highest
> ranking function across the whole corpus of more than
> 1600 crashes. This makes it the most natural and really only
> viable candidate for fuzzing approaches which require a
> discrete selection method.

> Software metric-based target selection methods perform
> significantly better than any other method across most
> types of crashes and sanitizers. The only other method
> close to their performance in some cases is the
> sanitizer-based one.

## Takeaways

The problem of identifying key areas to fuzz can be seen as an information retrieval problem where NDCG can be used to compute how well selections match actual problematic code regions. This actually seems fairly useful for our line number based evals.

The Leopard approach (code metrics) outperforms more sophisticated approaches in identifying potential functions of interest, within the constraints of the survey.
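
To make the information retrieval framing concrete, here is a minimal sketch of how NDCG- and NDCG+ could be computed for a single crash. It assumes binary relevance (a function is relevant if it appears in the relevant part of the stack trace) and the standard log2 discount; the paper's exact relevance weighting may differ. The function names, the example ranking, and the stack trace are all hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with the standard log2 rank discount."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_functions, relevant_functions):
    """NDCG of a ranking against a set of relevant functions.

    Assumption: binary relevance (1.0 if relevant, else 0.0).
    """
    gains = [1.0 if f in relevant_functions else 0.0 for f in ranked_functions]
    # The ideal ranking places every relevant function at the top.
    n_ideal = min(len(relevant_functions), len(ranked_functions))
    ideal_dcg = dcg([1.0] * n_ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical output of a target selection method, highest score first.
ranking = ["parse_header", "decode_block", "read_input", "main"]

# Hypothetical crash stack trace, most recent frame first.
stack_trace = ["decode_block", "read_input", "main"]

# NDCG-: only the most recent function counts as relevant.
ndcg_minus = ndcg(ranking, {stack_trace[0]})

# NDCG+: every function in the stack trace counts as relevant.
ndcg_plus = ndcg(ranking, set(stack_trace))

print(f"NDCG- = {ndcg_minus:.3f}, NDCG+ = {ndcg_plus:.3f}")
```

For line number based evals, the same sketch should carry over by ranking lines (or blocks) instead of functions and treating the known faulty lines as the relevant set, with per-crash scores averaged across the corpus.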