      1 # SoK: Where to Fuzz
      2 
      3 **Source:** SoK: Where to Fuzz? Assessing Target Selection Methods in Directed Fuzzing
      4 
      5 ## Background
      6 
      7 It is common to improve fuzzing performance by targeting regions of interest within a program instead of the entire program. This paper analyzes target selection methods for directed fuzzing.
      8 
      9 ## Selection Methods
     10 
     11 Discrete Scoring:
     12 
     13 - Assign a 1 or a 0 to each location depending on whether or not it is relevant for fuzzing (most tools fall into this category)
     14     - The distance to relevant code locations is often used as a proxy score for indirectly ranking and comparing the importance of locations.
     15 
     16 Continuous Scoring:
     17 
     18 - Assigns a continuous value to all code locations
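
A minimal sketch of discrete scoring plus the distance proxy described above. The call graph, the function names, and the choice of hop count as the distance are all assumptions made up for illustration; real tools compute distance in their own ways.

```python
from collections import deque

def discrete_scores(locations, targets):
    """Binary scoring: 1 if a location is a selected target, else 0."""
    return {loc: int(loc in targets) for loc in locations}

def distance_to_targets(call_graph, targets):
    """Hop count from each function to its nearest target.

    Runs a breadth-first search over reversed call edges, so only
    functions that can reach a target get a distance.
    """
    callers = {}
    for caller, callees in call_graph.items():
        for callee in callees:
            callers.setdefault(callee, []).append(caller)

    dist = {t: 0 for t in targets}
    queue = deque(sorted(targets))
    while queue:
        node = queue.popleft()
        for caller in callers.get(node, []):
            if caller not in dist:
                dist[caller] = dist[node] + 1
                queue.append(caller)
    return dist

# Hypothetical call graph: caller -> callees.
call_graph = {"main": ["parse", "log"], "parse": ["decode"], "decode": [], "log": []}
targets = {"decode"}

print(discrete_scores(call_graph, targets))
print(distance_to_targets(call_graph, targets))
```

Here `log` gets no distance at all because it cannot reach the target, which is how a binary selection still ends up producing an indirect ranking over the rest of the program.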
     19 
     20 ## Their Experimental Setup
     21 
     22 - They have a corpus of more than 1600 crashes from 97 software projects from OSS-Fuzz
     23     - These are considered ground truth issues
     24 - They then use each of the methods to pick out code blocks of interest based on the method's weightings
     25 - They then check how many of the ground truth issues are covered by each approach
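
The last two steps can be sketched roughly as below. This is a simplified stand-in, not the paper's actual measurement (they report NDCG, see Results): the scores, the crash traces, and the top-k cutoff are hypothetical, and a crash counts as covered here if any function in its stack trace was selected.

```python
def top_k(scores, k):
    """Return the k highest-scoring functions (ties broken by name)."""
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return set(ranked[:k])

def crash_coverage(selected, crashes):
    """Fraction of crashes whose stack trace touches a selected function."""
    if not crashes:
        return 0.0
    hits = sum(1 for trace in crashes if selected & set(trace))
    return hits / len(crashes)

# Hypothetical per-function scores from some selection method.
scores = {"decode": 0.9, "parse": 0.7, "main": 0.3, "log": 0.1}

# Hypothetical ground-truth crashes, each a stack trace (innermost frame first).
crashes = [
    ["decode", "parse", "main"],
    ["log", "main"],
    ["parse", "main"],
]

selected = top_k(scores, 2)
print(selected, crash_coverage(selected, crashes))
```

With these made-up numbers the top two functions cover two of the three crashes; the `log` crash is missed because neither of its frames was selected.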
     26 
     27 ## Questions & Critiques
     28 
     29 - How do they deal with potential differences between actual code failure locations and what is tested with OSS-Fuzz?
     30     - One could imagine there are even more issues throughout the codebases that haven't yet been caught by OSS-Fuzz or are insufficiently covered by it, making the training dataset faulty in the sense that it doesn't represent the true distribution of errors / failures / vulnerabilities.
     31 
     32 > While there is still no guarantee that OSS-Fuzz identified every bug,
     33 > the massive amount of time spent on fuzzing these targets is likely
     34 > to find a large majority of crashes reachable by a fuzzer.
     35 
     36 That's kind of lame TBH. That said, if they wrote their own code with some number of vulnerabilities, that would be even more contrived. Ideally, they'd use the entirety of some CVE database because existing fuzzers might still be missing key issues. What they did seems acceptable, albeit limited.
     37 
     38 - Their evals are run against open source C and C++ projects from 2016 - 2023
     39     - This seems like it is limited in a few ways
     40         - It is only projects written in two languages that aren't exceptionally popular to start projects with today
     41             - i.e., these are likely mature projects
     42         - These are open source projects
     43             - These are distributionally different from proprietary projects, but I'm unsure in what ways
     44 
     45 - They are only using issues found by OSS-Fuzz as their metric, but this doesn't weight them according to importance
     46     - Maybe this doesn't matter as one vulnerability means the system is vulnerable, but still, it seems prudent to weight specific issues more highly than others.
     47 
     48 ## Results
     49 
     50 Given their dataset, the approaches they evaluated, and the NDCG- and NDCG+ calculations (NDCG = normalized discounted cumulative gain; NDCG- counts only the most recent function from the stack trace as relevant, NDCG+ counts all functions in the stack trace, giving an under- and overestimate of retrieval quality), they found the following:
     51 
     52 - Leopard-V performs the best followed by Leopard-C, Sanitizer, then CodeT5+.
     53     - The top three are deterministic. There are also worse deterministic ones, but it is not the case that the ML approaches performed better (or even on par)
     54         - that said, the paper is from July 2024 so LLMs were less sophisticated back then, and the literature was more limited.
     55             - I'd be curious to see how this stacks up today.
     56 - Every approach except LineVul outperforms the random baseline by a statistically significant amount
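
For reference, a minimal sketch of how NDCG- and NDCG+ could be computed for a single crash. It assumes binary relevance (a function is either in the relevant set or not) and a ranking that covers every function; the paper's exact gain definition may differ, and the ranking and stack trace are made up.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(ranking, relevant):
    """NDCG of a ranking against a set of relevant functions (binary gains).

    Assumes the ranking contains every function, so the ideal ordering
    is just the observed gains sorted in descending order.
    """
    gains = [1 if f in relevant else 0 for f in ranking]
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical ranking from a selection method, best first.
ranking = ["parse", "decode", "main", "log"]

# Hypothetical crash stack trace, most recent (crashing) function first.
trace = ["decode", "parse", "main"]

ndcg_minus = ndcg(ranking, {trace[0]})  # only the most recent function counts
ndcg_plus = ndcg(ranking, set(trace))   # every function in the trace counts

print(ndcg_minus, ndcg_plus)
```

In this example NDCG- is 1/log2(3), about 0.63, because the crashing function is ranked second, while NDCG+ is 1.0 because every function in the trace sits above the one function that is not in it.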
     57 
     58 > Target selection methods based on software metrics perform
     59 > significantly better than every other considered
     60 > method. The best software metric, Leopard-V, correctly
     61 > captures as much as 13% of the crashes with its highest
     62 > ranking function across the whole corpus of more than
     63 > 1600 crashes. This makes it the most natural and really only
     64 > viable candidate for fuzzing approaches which require a
     65 > discrete selection method.
     66 
     67 > Software metric-based target selection methods perform
     68 > significantly better than any other method across most
     69 > types of crashes and sanitizers. The only other method
     70 > close to their performance in some cases is the
     71 > sanitizer-based one.
     72 
     73 ## Takeaways
     74 
     75 The problem of identifying key areas to fuzz can be seen as an information retrieval problem where NDCG can be used to compute how well selections match actual problematic code regions. This actually seems fairly useful for our line number based evals.
     76 
     77 The Leopard approach (code metrics) outperforms more sophisticated approaches in identifying potential functions of interest, within the constraints of the survey.