notes

Unnamed repository; edit this file 'description' to name the repository.
Log | Files | Refs

commit 38b37b107cba99500ea3857613feb424ba17a0a8
parent 221e04e660dba1a628a2a1ff294b521dd340532e
Author: Andrew Laack <andrew@laack.co>
Date:   Sat, 29 Nov 2025 04:17:53 -0600

Took notes on haskell, free software licenses, cs stuff, and code verification topics.

Diffstat:
Mdocs/Associative.md | 2--
Adocs/Associativity.md | 9+++++++++
Adocs/CodeSanitizer.md | 26++++++++++++++++++++++++++
Adocs/CodeVerification.md | 11+++++++++++
Mdocs/ComputerScience.md | 1+
Adocs/CyclomaticComplexity.md | 9+++++++++
Adocs/DirectedFuzzer.md | 11+++++++++++
Mdocs/FreeSoftware.md | 2+-
Mdocs/Haskell.md | 2++
Adocs/Orion.md | 69+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adocs/PointfreeProgramming.md | 17+++++++++++++++++
Adocs/WhereToFuzz.md | 30++++++++++++++++++++++++++++++
12 files changed, 186 insertions(+), 3 deletions(-)

diff --git a/docs/Associative.md b/docs/Associative.md @@ -2,8 +2,6 @@ MML Ch 2.2 - - **Definition:** Associativity of an operation means that regardless of the location of parenthesis the resulting computation is still the same assuming the order of values is also the same. Example: diff --git a/docs/Associativity.md b/docs/Associativity.md @@ -0,0 +1,9 @@ +# Associativity (Context of FP) + +**Source:** Effective Haskell + +**Chapter:** 1 + +**Definition:** Associativity is the choice in direction to parse (right to left or left to right). + +Haskell is by default left associative (this can be overriden for infix functions with a fixity declaration), meaning it applies precedence starting from the left and moving to the right. diff --git a/docs/CodeSanitizer.md b/docs/CodeSanitizer.md @@ -0,0 +1,26 @@ +# Code Sanitizer + +**Source:** Orion Fuzzing Paper + +**Definition:** Code sanitizers are a dynamic program analysis tool that detects bugs from undefined or suspicious behavior by inserting instrumentation into code at runtime. + +Code sanitizers are often used with fuzzers to detect both crashes and unexpected behavior. + +## Examples + +Examples of issues caught by this approach may include: + +- Undefined behavior +- Use after free +- Buffer overflows + +## Implementations + +A few implementations of code sanitizers are: + +- ASan (Google's address sanitizer) + - Uses shadow mapped memory to detect memory corruption +- KASan + - Kernel address sanitizer detects dynamic memory errors in the Linux kernel +- UBSan + - Used to detect undefined behavior diff --git a/docs/CodeVerification.md b/docs/CodeVerification.md @@ -0,0 +1,11 @@ +# Code Verification + +Notes related to correctness verification of code + +## Links + +- [Code Sanitizer](CodeSanitizer.md) +- [Cyclomatic Complexity](CyclomaticComplexity.md) +- [Orion](Orion.md) +- [Where To Fuzz.md](WhereToFuzz.md) +- [Directed Fuzzer](DirectedFuzzer.md) diff --git a/docs/ComputerScience.md b/docs/ComputerScience.md @@ -24,6 +24,7 @@ This is the index for my Computer Science related notes. - [Software Licenses](SoftwareLicenses.md) - [Haskell](Haskell.md) - [Free Software](FreeSoftware.md) +- [Code Verification](CodeVerification.md) ## Forced to Take Notes on diff --git a/docs/CyclomaticComplexity.md b/docs/CyclomaticComplexity.md @@ -0,0 +1,9 @@ +# Cyclomatic Complexity + +**Source:** Orion Fuzzing Paper + +**Definition:** Cyclomatic complexity is a metric used to indicate the complexity of a program. The metric is the number of linearly independent paths through a program's source code. + +This metric can be calculated with a control-flow graph of the program. + +If there are no conditionals in source code, the complexity is 1. If there is a single if conditional, the complexity is 2. Two nested if conditionals results in a complexity of 3. diff --git a/docs/DirectedFuzzer.md b/docs/DirectedFuzzer.md @@ -0,0 +1,11 @@ +# Directed Fuzzer + +**Source:** Where To Fuzz (paper) + +**Definition:** Directed fuzzers are fuzzers that focus on specific parts of source code / a binary. These locations can either be defined manually or using some sort of heuristic. + +## Examples + +One example is regression greybox fuzzing which focuses on recently changed code. The idea behind this is that recently changed code is more likely to impose regressions than existing code which has been, presumably, more thoroughly tested. + +Another example are the metrics used in [Orion](Orion.md) for calculating interfaces of interest to write fuzzing harnesses for. Their approach combines deterministic values like cyclomatic complexity and call graph size with LLM derived values coming from the usage of techniques commonly associated with vulnerabilities (eg. pointer arithmetic and the likes). diff --git a/docs/FreeSoftware.md b/docs/FreeSoftware.md @@ -13,7 +13,7 @@ --- -What the fuck are these patents: +What are these patents: - https://patents.google.com/patent/US20240111498A1/en?q=(LLM)&oq=LLM - https://patents.google.com/patent/US11900068B1/en?q=(LLM)&oq=LLM diff --git a/docs/Haskell.md b/docs/Haskell.md @@ -15,3 +15,5 @@ - [Function Application Operator](FunctionApplicationOperator.md) - [Function Composition Operator](FunctionCompositionOperator.md) - [Higher Order Function](HigherOrderFunction.md) +- [Pointfree Programming](PointfreeProgramming.md) +- [Associativity](Associativity.md) diff --git a/docs/Orion.md b/docs/Orion.md @@ -0,0 +1,69 @@ +# Orion + +**Source:** Orion: Fuzzing Workflow Automation + +## Notes + +This is an approach to fuzzing automation presented in the 'Orion: Fuzzing Workflow Automation' paper by employees of Nvidia. + +## Process + +- Harness generation and execution + - Takes target project source code as input + - Constructs a codebase index + - The codebase is chunked on the basis of functions + - Select interfaces for fuzzing by ranking non-static functions by how likely it thinks fuzzing will trigger bugs + - This ranking is done by computing a few metrics: + - Cyclomatic complexity + - Number of independent paths paths through a function + - Internal function calls + - How frequently a function is called by others + - Lines of code + - Callgraph size + - Number of functions reachable from the given function + - This seems to be an attempt to place more weight on more important functionallity which might be misguided. + - If there is a part of the codebase that is very integral, it seems likely that is more well tested. + - Dangerous expressions + - Constructs like pointer arithmetic, memory maangement, bit operations (detected by the LLM...) + - Honestly, I'd just use regex... + - Sink functions + - Functions associated with vulnerabilities + - Again... why are you using LLMs!? Just use regex + - Parsing functions + - Functions that parse structured inputs + - fair play for LLM usage. + - Generates seed inputs for each selected function + - Generate harness + - A dependency analysis agent identifies setup and teardown processes and header file dependencies + - Constructs compilable fuzz drivers compatible with generated seeds + - The resulting harnesses and seeds are then sent to the fuzzing infra which executes the fuzzer, monitors for errors, and records the results +- Crash handling + - The triage agent filters out harness-related issues + - Triage agent root causes issues and creates minimal repros +- Patching + - Patching agent patches the issue + - Minimal repros are validated as fixed from the prior step + - If the issue still exists, patching agent tries again + +## Questions During Reading + +- How do they ensure the bug triaging / patching doesn't result in further regressions? + - It seems likely they are finding real issues, but the patches would be dubious, even if they result in the problem being resolved for a given byte buffer input to the harness. + - Towards the end of the paper they say basically a human reviews at the end and they ensure it passes minimal tests + - This seems like basically they just validate it compiles and passes the fuzz repros +- Why does parsing the source code to create call grpahs and type indexes improve performance? + - Agentic models should be able to request this information themselves with tool use without the need for preprocessing. + - This paper was put on ArXiv on Sep. 18th so they clearly had access to tool using agentic models that could easily do this. + - Also, I don't see ablations so I don't know if it does. +- How did they choose the target identification metrics? + +## Interesting Questions to Explore + +- Does generating call graphs improve model performance? + - What about other forms of codebase processing for context generation? +- What are the most important metrics for deterministic risk identification? + +## Important Takeaways + +- They define seeds prior to their harnesses. + - Their rationale seems to be if they constrain the input and output spaces, LLMs will be better at generating harnesses. diff --git a/docs/PointfreeProgramming.md b/docs/PointfreeProgramming.md @@ -0,0 +1,17 @@ +# Pointfree Programming / Tacit Programming + +**Source:** Effective Haskell + +**Chapter:** 1 + +**Definition:** Pointfree programming is a programming approach where functions don't have any named parameters. This is often achieved using eta-reduction to remove all named parameters. + +## Example + +```haskell + +firstPart = (<> " ") +makeGreeting' = (<>) . firstPart +makeGreeting' "hello" "andrew" + +``` diff --git a/docs/WhereToFuzz.md b/docs/WhereToFuzz.md @@ -0,0 +1,30 @@ +# SoK: Where to Fuzz + +**Source:** SoK: Where to Fuzz? Assessing Target Selection Methods in Directed Fuzzing + +## Background + +It is common to improve fuzzing performance by selecting regions of interest within a program instead of the entire program. This paper is an analysis of target selection method for fuzzing. + +## Selection Methods + +Discrete Scoring: + +- Assign a 0 or a 1 to a given location if it is relevant or not for fuzzing (most tools fall into this category) + - The distance to relevant code locations is often used as a proxy score for indirect ranking and comparisons of locational importance. + +Continuous Scoring: + +- Assigns a continuous value to all code locations + +## Their Experimental Setup + +- They have a corpus of 1600 crashes from 97 software projects from OSS-Fuzz + - These are considered ground truth issues +- They then use each of the methods to pick out code blocks of interest based on the method's weightings +- They then check how many of the ground truth issues are covered by each approach + +## Questions + +- How do they deal with potential differences between actual code failure locations and what is tested with OSS-Fuzz + - One could imagine there are even more issues throughout the codebases that haven't yet been caught by OSS-Fuzz or are insufficiently covered by it, making the training dataset faulty in the sense that it doesn't represent the true distribution of errors / failures / vulnerabilities.