IoT Botnet Detection: Application of Vision Transformer to Classification of Network Flow Traffic

Despite the demonstrated effectiveness of transformer models in NLP and in image and video classification, their application to IoT network traffic remains limited. The available tools for extracting features from captured IoT network flow packets fail to capture sequential patterns, and the extracted features also lack spatial patterns. This work introduces a novel preprocessing method that adapts transformer models, in particular the vision transformer (ViT), for IoT botnet attack detection using network flow packets. The approach extracts features from .pcap files and transforms each instance into a 1-channel 2D image, enabling ViT-based classification. In addition, the ViT model was extended to allow the use of any classifier, not only the Multilayer Perceptron (MLP) head used in the original ViT paper. Models including a conventional feed-forward Deep Neural Network (DNN), LSTM, and Bidirectional LSTM (BLSTM) demonstrated competitive precision, recall, and F1-score for multiclass attack detection when evaluated on two IoT attack datasets.
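The preprocessing idea (a tabular flow record turned into a 1-channel image that a ViT can split into patches) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the image size, padding scheme, or patch size, so zero-padding to the smallest square whose side is a multiple of the patch size is an assumption. The 77-feature vector and the function names `flow_to_image` and `image_to_patches` are hypothetical.

```python
import math

import numpy as np


def flow_to_image(features: np.ndarray, patch: int = 4) -> np.ndarray:
    """Zero-pad a 1-D flow feature vector and reshape it to a (1, H, W) image.

    Assumption: H == W is the smallest multiple of `patch` whose square can
    hold every feature; the source does not specify the exact layout.
    """
    n = features.shape[0]
    side = math.ceil(math.sqrt(n))
    side = math.ceil(side / patch) * patch  # round up so patches tile the image exactly
    padded = np.zeros(side * side, dtype=np.float32)
    padded[:n] = features  # remaining pixels stay zero
    return padded.reshape(1, side, side)  # 1 channel, H x W


def image_to_patches(img: np.ndarray, patch: int = 4) -> np.ndarray:
    """Split a (C, H, W) image into flattened, non-overlapping patches (ViT style).

    Returns an array of shape (num_patches, C * patch * patch), with patches
    ordered row by row across the image.
    """
    c, h, w = img.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly into patches"
    x = img.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)  # (patch_rows, patch_cols, C, patch, patch)
    return x.reshape(-1, c * patch * patch)


# Usage: a hypothetical flow record with 77 extracted features.
rng = np.random.default_rng(0)
flow = rng.random(77).astype(np.float32)  # stand-in for one flow's features
image = flow_to_image(flow, patch=4)  # shape (1, 12, 12)
tokens = image_to_patches(image, patch=4)  # shape (9, 16)
print(image.shape, tokens.shape)
```

The resulting patch sequence is what a ViT linearly embeds before its encoder. The extension described above would pass the encoder output to a classifier other than the MLP head (for example a DNN, LSTM, or BLSTM). This sketch does not show that step.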
@article{wasswa2025_2504.18781,
  title={IoT Botnet Detection: Application of Vision Transformer to Classification of Network Flow Traffic},
  author={Hassan Wasswa and Timothy Lynar and Aziida Nanyonga and Hussein Abbass},
  journal={arXiv preprint arXiv:2504.18781},
  year={2025}
}