Arrow Research search

Author name cluster

Richard Zak

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers

Possible papers (2)

NeurIPS 2024 · Conference Paper

Is Function Similarity Over-Engineered? Building a Benchmark

  • Rebecca Saul
  • Chang Liu
  • Noah Fleischmann
  • Richard Zak
  • Kristopher Micinski
  • Edward Raff
  • James Holt

Binary analysis is a core component of many critical security tasks, including reverse engineering, malware analysis, and vulnerability detection. Manual analysis is often time-consuming, but identifying commonly-used or previously-seen functions can reduce the time it takes to understand a new file. However, given the complexity of assembly and the NP-hard nature of determining function equivalence, this task is extremely difficult. Common approaches often use sophisticated disassembly and decompilation tools, graph analysis, and other expensive pre-processing steps to perform function similarity searches over some corpus. In this work, we identify a number of discrepancies between the current research environment and the underlying application need. To remedy this, we build a new benchmark, REFuSe-Bench, for binary function similarity detection, consisting of high-quality datasets and tests that better reflect real-world use cases. In doing so, we address issues like data duplication and accurate labeling, experiment with real malware, and perform the first serious evaluation of ML binary function similarity models on Windows data. Our benchmark reveals that a new, simple baseline, one which looks at only the raw bytes of a function and requires no disassembly or other pre-processing, is able to achieve state-of-the-art performance in multiple settings. Our findings challenge the conventional assumption that complex models with highly-engineered features are being used to their full potential, and demonstrate that simpler approaches can provide significant value.
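The abstract does not specify the baseline's architecture, so as a rough illustration only: the sketch below compares functions by the cosine similarity of their raw byte-frequency histograms. The histogram/cosine choice and the toy byte strings are assumptions made for illustration, not the benchmark's actual model.

```python
from collections import Counter
import math

def byte_histogram(func_bytes: bytes) -> list:
    """256-dim normalized byte-frequency vector for a function body."""
    counts = Counter(func_bytes)
    total = len(func_bytes) or 1
    return [counts.get(b, 0) / total for b in range(256)]

def cosine_similarity(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy "function bodies": f_a and f_b differ only by a nop, f_c is unrelated padding.
f_a = b"\x55\x48\x89\xe5\x5d\xc3"      # push rbp; mov rbp,rsp; pop rbp; ret
f_b = b"\x55\x48\x89\xe5\x90\x5d\xc3"  # same prologue/epilogue plus a nop
f_c = b"\x00\x00\x00\x00\x00\x00"      # zero padding

sim_ab = cosine_similarity(byte_histogram(f_a), byte_histogram(f_b))
sim_ac = cosine_similarity(byte_histogram(f_a), byte_histogram(f_c))
```

Even this crude feature ranks the near-duplicate pair above the unrelated one, which conveys the point: no disassembly or decompilation is needed to get a usable similarity signal from raw bytes.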

AAAI 2021 · Conference Paper

Classifying Sequences of Extreme Length with Constant Memory Applied to Malware Detection

  • Edward Raff
  • William Fleshman
  • Richard Zak
  • Hyrum S. Anderson
  • Bobby Filar
  • Mark McLean

Recent works within machine learning have been tackling inputs of ever-increasing size, with cybersecurity presenting sequence classification problems of particularly extreme lengths. In the case of Windows executable malware detection, inputs may exceed 100 MB, which corresponds to a time series with T = 100,000,000 steps. To date, the closest approach to handling such a task is MalConv, a convolutional neural network capable of processing up to T = 2,000,000 steps. The O(T) memory use of CNNs has prevented their further application to malware. In this work, we develop a new approach to temporal max pooling that makes the required memory invariant to the sequence length T. This makes MalConv 116× more memory efficient, and up to 25.8× faster to train on its original dataset, while removing the input length restrictions to MalConv. We re-invest these gains into improving the MalConv architecture by developing a new Global Channel Gating design, giving us an attention mechanism capable of learning feature interactions across 100 million time steps in an efficient manner, a capability the original MalConv CNN lacked. Our implementation can be found at https://github.com/NeuromorphicComputationResearchProgram/MalConv2
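The memory-invariance idea can be illustrated at inference time: because global temporal max pooling keeps only one running maximum per channel, a sequence of any length T can be processed chunk by chunk while holding just one chunk in memory. The sketch below is a minimal illustration with a toy two-channel "embedding", not the authors' implementation (see the linked repository); in particular it omits the training-time bookkeeping needed to backpropagate only through the winning time steps.

```python
def streaming_temporal_max_pool(chunks, embed, num_channels):
    """Global temporal max pooling with memory independent of the
    sequence length T: only the current chunk and a running per-channel
    maximum are ever held in memory."""
    running_max = [float("-inf")] * num_channels
    for chunk in chunks:
        for step_features in embed(chunk):  # one feature vector per time step
            for c in range(num_channels):
                if step_features[c] > running_max[c]:
                    running_max[c] = step_features[c]
    return running_max

# Toy per-byte "embedding" with two channels: the byte value and its complement.
def toy_embed(chunk: bytes):
    return [(b, 255 - b) for b in chunk]

# Stream a long input as 4 KiB chunks instead of materializing features for all T steps.
data = bytes(range(256)) * 1024  # 256 KiB stand-in for an executable
chunks = (data[i:i + 4096] for i in range(0, len(data), 4096))
pooled = streaming_temporal_max_pool(chunks, toy_embed, num_channels=2)
```

The result is identical to pooling over the whole sequence at once, since max is associative; only the peak memory changes, from O(T) to O(chunk size).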