Malware classification using machine learning

Malware is a common problem on almost all mobile phones, laptops, memory cards and similar devices. One of the most common techniques malware uses to escape detection is binary obfuscation, either through encryption (polymorphism) or through metamorphic attacks (different code for the same functionality). To locate malware quickly and effectively, we should group samples according to their family. This gives rise to a growing need for automated, self-learning, fast and efficient techniques that are resistant to these attacks. In this paper, we intend only to classify malware into their respective families, not to detect it (that is, decide whether a file is malware or not). For the feature dataset used by our machine learning algorithms, we apply a selection criterion of 500 counts of an observed value. We focus on new data visualization techniques such as malware image representation, and on classification based on artificial neural networks and K-Nearest Neighbor.

Malware analysis is usually performed as static analysis, dynamic analysis or signature-based analysis. In static analysis, disassembled code files are analyzed for malicious system calls, which requires building a model of the control flow graph. In dynamic analysis, the sample is analyzed in a controlled environment and its activity is tracked (system logs). This process is extremely slow and demands resources and time. Both techniques work well, but static code analysis suffers from differences in malware implementation, while dynamic analysis is limited by the environment and by the activation conditions of the malware, and therefore does not scale well. To analyze a malware signature, the signature has to be created using N-Gram techniques.
The malware disassembly is analyzed for the most frequently repeated opcodes, and N-grams are built on these. We use malware visualization techniques to visualize the data: each malware byte code file is converted into a grayscale image. The basic principle is that malware samples from the same family show similarities in visual appearance. These images are used for image-based classification. Opcode counts are computed from the disassembly code. The purpose of this paper is to implement machine learning algorithms that classify malware into their respective families. The data are taken from www.kaggle.com, provided by Microsoft, and contain 10868 malware samples belonging to nine malware families: Ramnit, Lollipop, Kelihos ver3, Vundo, Simda, Tracur, Kelihos ver1, Obfuscator.ACY and Gatak. The goal is to analyze and visualize the malware data beforehand, and then to develop a new integrated model that takes advantage of all the models.

Problem Definition: Extensive work has been done on malware analysis, and static, dynamic and signature-based techniques have been studied in many articles. One favored approach is image-based malware visualization [1], which explains how to form an image from malware binary files and how to view such images. An alternative approach extracts data from the disassembly code for use in classification [2], but its accuracy was not optimal. That work suggests a way to extract new features based on N-Grams, code sections, opcode sequences and DLL calls. Even before malware signatures can be developed, however, some work on malware detection and classification is needed.

Related work: Extensive work has been done on malware analysis. Many articles have been published on static, dynamic and signature-based malware analysis techniques.
One publication presents image-based malware visualization as one of the preferred methods [1]. It explains how to create an image from binary malware files and how to view those images. In that work, machine learning is used for image-based classification. We also referred to a paper that describes how to extract data from disassembly code for use in classification [2]. It suggests a way to extract new features based on N-Grams, code sections, opcode sequences and DLL calls. Even before malware signatures can be developed, however, some work on malware detection and classification is needed.

Analysis: We studied several papers that use the same principles as ours to classify malware into their respective families. It has been observed that, when data are missing, the multilayer perceptron model and logistic regression perform well. Image visualization techniques gave an average predictive accuracy of 95% using a deep neural network. We also found that this methodology provides the best results compared with other available techniques. By contrast, machine learning-based malware classification for Android applications using multimodal image representations [3] is somewhat slow at data processing.

Proposed methodology: To analyze the signature, the signature is constructed using N-Gram techniques. The malware disassembly is analyzed for the most frequently repeated opcodes, and N-grams are built on top of these. We propose to use malware visualization techniques: our goal is to convert each malware byte code file into a grayscale image. In research and analysis, malware samples from the same family have been observed to resemble one another in visual appearance, which gives us an opportunity to exploit this weakness. These malware images will be used for image-based classification. From the disassembly code provided, we will calculate the opcode counts, DLLs and section counts.
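
The opcode N-gram counting described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in this paper: it assumes the opcode mnemonics have already been extracted from each disassembly file into a list, and it reads the count criterion as "keep an N-gram only if it occurs at least `min_count` times across all samples". The function names and the `min_count` parameter are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

NGram = Tuple[str, ...]


def opcode_ngrams(opcodes: Sequence[str], n: int) -> List[NGram]:
    """Return every run of n consecutive opcodes in one sample."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def count_ngrams(samples: Sequence[Sequence[str]], n: int) -> Counter:
    """Count each opcode N-gram over all samples."""
    total: Counter = Counter()
    for opcodes in samples:
        total.update(opcode_ngrams(opcodes, n))
    return total


def select_features(counts: Dict[NGram, int], min_count: int = 500) -> List[NGram]:
    """Keep N-grams seen at least min_count times.

    Assumption: this is one possible reading of the count criterion.
    """
    return sorted(gram for gram, c in counts.items() if c >= min_count)


def feature_vector(opcodes: Sequence[str], features: Sequence[NGram], n: int) -> List[int]:
    """Count how often each selected N-gram occurs in one sample."""
    local = Counter(opcode_ngrams(opcodes, n))
    return [local[gram] for gram in features]


if __name__ == "__main__":
    # Toy opcode lists standing in for two disassembled samples.
    samples = [["push", "mov", "call", "mov"], ["push", "mov", "pop"]]
    counts = count_ngrams(samples, n=2)
    features = select_features(counts, min_count=2)  # small threshold for the toy data
    print(features)
    print([feature_vector(s, features, n=2) for s in samples])
```

Each sample then becomes a fixed-length vector of counts, one entry per selected N-gram, which is the kind of input a classifier such as K-Nearest Neighbor expects.
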
The main features extracted from all the assembly files were used for malware classification. For the feature dataset used by our machine learning algorithms, we apply a selection criterion of 500 counts of an observed value. The resulting datasets are used for classification, performed with MATLAB's machine learning toolbox. In this paper we describe the data visualization methods, the analysis, the selection of classification algorithms and the results obtained.

Data visualization: As suggested in the malware image generation and classification technique, each byte of data is converted into a grayscale pixel, so that the array or byte stream becomes an image representation of the malware [1].
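
The byte-to-pixel conversion can be illustrated with a short Python sketch. It is a minimal example under assumptions not stated in this paper: the image width is a fixed, hypothetical parameter (`width`), the last row is zero-padded, and the result is encoded as a binary PGM file, chosen here only because it needs no third-party library.

```python
from typing import List


def bytes_to_grayscale(data: bytes, width: int = 256) -> List[List[int]]:
    """Lay out a byte stream row by row; each byte (0-255) is one pixel."""
    if width <= 0:
        raise ValueError("width must be positive")
    rows = []
    for start in range(0, len(data), width):
        row = list(data[start:start + width])
        # Zero-pad the final row so the image is rectangular (an assumption).
        row.extend([0] * (width - len(row)))
        rows.append(row)
    return rows


def to_pgm(rows: List[List[int]]) -> bytes:
    """Encode the pixel rows as a binary PGM (P5) grayscale image."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + bytes(pixel for row in rows for pixel in row)


if __name__ == "__main__":
    # Ten toy bytes standing in for the contents of a malware binary.
    sample = bytes(range(10))
    image = bytes_to_grayscale(sample, width=4)
    for row in image:
        print(row)
```

Writing the bytes returned by `to_pgm` to a file opened in binary mode gives an image that common viewers can open, so samples can be compared by eye before any classifier is trained.
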