Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach

The rapid growth in data volume and in the number of datasets available to analysts compels researchers to focus on data content and to select high-quality datasets that improve the performance of analytics operators. Although selecting the highest-quality data substantially improves task accuracy and efficiency, doing so remains difficult, especially when the number of available inputs is very large. To address this issue, we propose a novel methodology that infers the outcome of an analytics operator by building a model from datasets similar to the queried one. Dataset similarity is computed by projecting each dataset to a vector embedding representation. The vectorization is performed by our proposed deep learning model, NumTabData2Vec, which takes a whole dataset as input and projects it into a lower-dimensional vector embedding space. Through experimental evaluation, we compare the prediction performance and execution time of our framework with those of another state-of-the-art operator modelling framework, and show that our approach predicts analytics outcomes accurately. Furthermore, our vectorization model can project different real-world scenarios into the lower-dimensional embedding space and distinguish between them.
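The similarity-based prediction idea can be illustrated with a minimal sketch. It assumes that each dataset has already been embedded as a fixed-length vector (NumTabData2Vec itself is not reproduced here) and that the operator outcome, for example an accuracy score, is known for a set of previously seen datasets. The function names, the toy vectors, and the use of a cosine-similarity-weighted average over the k nearest datasets are illustrative assumptions, not the model construction described by the authors.

```python
import numpy as np

def cosine_similarity(query, matrix):
    """Cosine similarity between one query vector and each row of a matrix."""
    query_unit = query / np.linalg.norm(query)
    rows_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    return rows_unit @ query_unit

def predict_outcome(query_vec, known_vecs, known_outcomes, k=3):
    """Estimate an operator outcome for an unseen dataset (hypothetical helper).

    Assumption: the estimate is a similarity-weighted average of the outcomes
    observed on the k most similar known datasets. This stands in for whatever
    model the framework actually fits on the similar datasets.
    """
    sims = cosine_similarity(query_vec, known_vecs)
    top = np.argsort(sims)[::-1][:k]          # indices of the k most similar datasets
    weights = np.clip(sims[top], 0.0, None)   # ignore negative similarities
    if weights.sum() == 0.0:
        return float(np.mean(known_outcomes[top]))
    return float(np.average(known_outcomes[top], weights=weights))

# Toy embeddings (assumed, 2-dimensional for readability) and observed outcomes.
known_vecs = np.array([
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
])
known_outcomes = np.array([0.90, 0.88, 0.60, 0.62])

query_vec = np.array([0.95, 0.05])            # embedding of the unseen dataset
prediction = predict_outcome(query_vec, known_vecs, known_outcomes, k=2)
print(f"Predicted operator outcome: {prediction:.3f}")
```

In this toy setup the query embedding lies close to the first two known datasets, so the estimate falls between their observed outcomes. A real pipeline would replace the hand-written vectors with learned embeddings and could fit any regression model on the selected neighbours.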
@article{loizou2025_2502.17060,
  title={Data Analysis Prediction over Multiple Unseen Datasets: A Vector Embedding Approach},
  author={Andreas Loizou and Dimitrios Tsoumakos},
  journal={arXiv preprint arXiv:2502.17060},
  year={2025}
}