14
0

The Dark Side of the Web: Towards Understanding Various Data Sources in Cyber Threat Intelligence

Abstract

Cyber threats have become increasingly prevalent and sophisticated. Prior work has extracted actionable cyber threat intelligence (CTI), such as indicators of compromise, tactics, techniques, and procedures (TTPs), or threat feeds from various sources: open source data (e.g., social networks), internal intelligence (e.g., log data), and ``first-hand'' communications from cybercriminals (e.g., underground forums, chats, darknet websites). However, "first-hand" data sources remain underutilized because it is difficult to access or scrape their data.In this work, we analyze (i) 6.6 million posts, (ii) 3.4 million messages, and (iii) 120,000 darknet websites. We combine NLP tools to address several challenges in analyzing such data. First, even on dedicated platforms, only some content is CTI-relevant, requiring effective filtering. Second, "first-hand" data can be CTI-relevant from a technical or strategic viewpoint. We demonstrate how to organize content along this distinction. Third, we describe the topics discussed and how "first-hand" data sources differ from each other. According to our filtering, 20% of our sample is CTI-relevant. Most of the CTI-relevant data focuses on strategic rather than technical discussions. Credit card-related crime is the most prevalent topic on darknet websites. On underground forums and chat channels, account and subscription selling is discussed most. Topic diversity is higher on underground forums and chat channels than on darknet websites. Our analyses suggest that different platforms may be used for activities with varying complexity and risks for criminals.

View on arXiv
@article{schröer2025_2504.14235,
  title={ The Dark Side of the Web: Towards Understanding Various Data Sources in Cyber Threat Intelligence },
  author={ Saskia Laura Schröer and Noé Canevascini and Irdin Pekaric and Philine Widmer and Pavel Laskov },
  journal={arXiv preprint arXiv:2504.14235},
  year={ 2025 }
}
Comments on this paper