
Master's Thesis Abdullah Gulraiz

Last modified Jun 2, 2023

Semantic Analysis for Deduplication of Security Findings in DevOps Security Tool Reports

 

Abstract

Semantic Analysis in Natural Language Processing is the process of interpreting natural language the way humans do, based on its meaning and context. It interprets unstructured text by analyzing lexical items and the relations between them. One type of Semantic Analysis is Semantic Similarity: the task of determining how close two sentences are in meaning. This is done by scoring the relationship between text fragments or documents with a defined similarity metric.
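To make the idea of "scoring with a defined similarity metric" concrete, here is a minimal sketch using cosine similarity over word-count vectors. This is only a lexical stand-in chosen for illustration: it measures word overlap, not meaning, and is not one of the models the thesis evaluates. The function names `tokenize` and `cosine_similarity` are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between the word-count vectors of two texts.

    Returns a score in [0, 1]; 0.0 if either text has no tokens.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    # Dot product over the tokens the two texts share.
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A semantic model would replace the word-count vectors with learned embeddings, while the scoring step itself could stay the same.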

In DevOps, multiple security tools produce large numbers of findings about software vulnerabilities or software quality. These tools typically perform either static code analysis, which identifies where and how code fails to comply with coding rules, or dynamic software analysis, which manipulates a running program in order to discover security vulnerabilities.

Often, these DevOps tools have similar or overlapping use cases, resulting in many duplicate security findings that are cumbersome for security experts to review and analyze manually. These findings consist of natural language text and are often accompanied by extra information such as severity, a reference, and/or a unique ID. Each tool delivers its findings in a separate report in a parsable format (JSON, XML, etc.).
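As an illustration of bringing reports from different tools into one shape, the sketch below parses a JSON report and an XML report into a common record. The report layouts (a `findings` array in JSON, `<finding>` elements in XML) and all field names are hypothetical; real tools each use their own schema, so a parser per tool would be needed.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Finding:
    tool: str
    finding_id: str
    severity: str
    description: str


def parse_json_report(tool, text):
    """Parse a report shaped like {"findings": [{"id", "severity", "description"}]}.

    This shape is an assumption made for the example, not a real tool's format.
    """
    data = json.loads(text)
    return [
        Finding(
            tool,
            str(item.get("id", "")),
            item.get("severity", "unknown"),
            item.get("description", ""),
        )
        for item in data.get("findings", [])
    ]


def parse_xml_report(tool, text):
    """Parse a report with <finding id=".." severity=".."> elements (assumed shape)."""
    root = ET.fromstring(text)
    return [
        Finding(
            tool,
            node.get("id", ""),
            node.get("severity", "unknown"),
            (node.findtext("description") or "").strip(),
        )
        for node in root.iter("finding")
    ]
```

Once every report is reduced to `Finding` records, the description text of findings from different tools can be compared directly.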

In this Master's thesis, we intend to explore and compare different semantic similarity techniques for deduplicating the security findings in DevOps security tool reports. To do so, we plan to test several state-of-the-art semantic similarity models on our security findings data, first clustering semantically similar findings together and then identifying duplicate findings within each cluster.
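The two-step idea (cluster first, then look for duplicates inside each cluster) can be sketched as follows. This is an illustrative assumption, not the thesis's method: it uses Jaccard overlap of word sets in place of a semantic similarity model, single-link grouping via union-find in place of a real clustering algorithm, and arbitrary thresholds.

```python
import re
from itertools import combinations


def tokens(text):
    """Set of lowercase alphanumeric tokens in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap of the two token sets (stand-in for a semantic score)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def cluster(texts, threshold=0.5):
    """Group indices whose texts are linked by similarity >= threshold."""
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(texts)), 2):
        if jaccard(texts[i], texts[j]) >= threshold:
            parent[find(j)] = find(i)

    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def duplicate_pairs(texts, clusters, threshold=0.7):
    """Within each cluster, return index pairs scoring at or above a stricter threshold."""
    return [
        (i, j)
        for group in clusters
        for i, j in combinations(group, 2)
        if jaccard(texts[i], texts[j]) >= threshold
    ]
```

Restricting the pairwise duplicate check to members of the same cluster keeps the stricter comparison away from findings that are clearly unrelated.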

According to our initial research, there are four types of semantic similarity methods: knowledge-based, corpus-based, deep neural network-based, and hybrid. We intend to try methods of different types and to assess them both quantitatively and qualitatively, with the help of security experts, to see how well the implemented techniques perform on our security report data.


Research Questions

  1. What semantic similarity methods have been proposed in the literature?
  2. How do we construct a suitable corpus from security tool reports?
  3. What methods are applicable to find semantic clusters in security tool reports?
  4. How can we use these methods to identify semantic duplicates within the clusters?

Sources

  1. Chandrasekaran, D., & Mago, V. (2021). Evolution of Semantic Similarity—A Survey. ACM Computing Surveys (CSUR), 54(2), 1-37.
  2. Ranade, P., Piplai, A., Joshi, A., & Finin, T. (2021, December). CyBERT: Contextualized Embeddings for the Cybersecurity Domain. In IEEE International Conference on Big Data.
  3. Piplai, A., Mittal, S., Joshi, A., Finin, T., Holt, J., & Zak, R. (2020). Creating cybersecurity knowledge graphs from malware after action reports. IEEE Access, 8, 211691-211703.
  4. Kiesling, E., Ekelhart, A., Kurniawan, K., & Ekaputra, F. (2019, October). The SEPSES knowledge graph: an integrated resource for cybersecurity. In International Semantic Web Conference (pp. 198-214). Springer, Cham.
