Automatic Labeling for Entity Extraction in Cyber Security

22 August 2013

Abstract

Timely analysis of information in cyber security requires automated information extraction from unstructured text. Unfortunately, state-of-the-art extraction methods require training data, which is unavailable in the cyber-security domain. To avoid the arduous task of hand-labeling data, we develop a high-precision method that automatically labels text from several data sources by leveraging article-specific structured data, and we provide public access to a corpus annotated with cyber-security entities. We then prototype a maximum entropy model that is trained on this corpus of auto-labeled text to label new sentences, and we present results showing that, for parameter fitting, the Collins perceptron outperforms maximum likelihood estimation (MLE) with L-BFGS and OWL-QN optimization. The main contribution of this paper is an automated technique for creating a training corpus from text that is associated with a database. Because many domains could benefit from automated extraction of domain-specific concepts for which no labeled data is available, we hope our solution is widely applicable.
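
The core idea of labeling text from article-specific structured data can be illustrated with a minimal sketch. It assumes each text description comes paired with a database record whose field values (for example vendor, product, version) may appear verbatim in the text; matching token spans then receive BIO-style labels. The function names `tokenize` and `auto_label`, the field names, and the example sentence are all hypothetical and chosen for illustration; this is not the authors' actual pipeline, which the abstract does not describe in detail.

```python
import re


def tokenize(text):
    # Word-like tokens (dotted versions such as "8.0" stay whole),
    # plus single punctuation characters as separate tokens.
    return re.findall(r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*|[^\sA-Za-z0-9]", text)


def auto_label(text, record):
    """Label tokens of `text` using field values from `record`.

    `record` is a hypothetical dict mapping an entity type to the value
    stored for this article in the structured database.
    Returns a list of (token, label) pairs with BIO-style labels.
    """
    tokens = tokenize(text)
    lowered = [t.lower() for t in tokens]
    labels = ["O"] * len(tokens)

    # Try longer field values first so multi-token matches are not
    # blocked by shorter ones.
    fields = sorted(record.items(), key=lambda kv: -len(tokenize(kv[1])))
    for field, value in fields:
        target = [t.lower() for t in tokenize(value)]
        n = len(target)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_free = all(label == "O" for label in labels[i:i + n])
            if lowered[i:i + n] == target and span_free:
                labels[i] = "B-" + field
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + field
    return list(zip(tokens, labels))


# Illustrative input only; not taken from the paper's corpus.
text = ("Buffer overflow in Microsoft Internet Explorer 8.0 allows "
        "remote attackers to execute arbitrary code.")
record = {"vendor": "Microsoft", "product": "Internet Explorer", "version": "8.0"}

for token, label in auto_label(text, record):
    print(f"{token}\t{label}")
```

Exact matching against the record tied to the same article favors precision over recall: a token is labeled only when the database independently confirms it, which is the property a training corpus built without hand-labeling depends on.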
