Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models

19 September 2024
Alexius Wadell
Anoushka Bhutani
Venkatasubramanian Viswanathan
Abstract

Text-based foundation models have become an important part of scientific discovery, with molecular foundation models accelerating advancements in molecular design and materials science. However, existing models are constrained by closed-vocabulary tokenizers that capture only a fraction of molecular space. In this work, we systematically evaluate 30 tokenizers, including 19 chemistry-specific ones, for their coverage of the SMILES molecular representation language, revealing significant gaps. To assess the impact of tokenizer choice, we introduce n-gram language models as a low-cost proxy and validate their effectiveness by training and fine-tuning 18 RoBERTa-style encoders for molecular property prediction. To overcome the limitations of existing tokenizers, we propose two new tokenizers, Smirk and Smirk-GPE, with full coverage of the OpenSMILES specification. Our results highlight the need for open-vocabulary modeling and chemically diverse benchmarks in cheminformatics. The proposed tokenizer framework systematically integrates nuclear, electronic, and geometric degrees of freedom; this facilitates applications in pharmacology, agriculture, biology, and energy storage.
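
The coverage gap described in the abstract can be illustrated with a small sketch. This is not the authors' Smirk implementation; the regular expressions, the function names (`atom_level`, `glyph_level`, `unknown_fraction`), and the toy corpora are all assumptions made for illustration. It compares a closed-vocabulary tokenizer that treats each bracketed atom as a single token with one that splits bracketed atoms into smaller glyphs, and measures how many held-out tokens fall outside the vocabulary seen in training.

```python
import re

# Assumed atom-level pattern: a whole bracket atom such as [N+] is ONE token.
ATOM_LEVEL = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#$:/\\.\-+()~@*]")

# Assumed glyph pattern for the inside of a bracket atom: element symbol,
# chirality mark, charge sign, digit. Simplified: it does not handle
# two-letter aromatic symbols such as "se" as a unit.
GLYPH = re.compile(r"[A-Z][a-z]?|[a-z]|@@|@|[+\-]|\d|:")


def atom_level(smiles):
    """Tokenize with each bracketed atom kept as a single token."""
    return ATOM_LEVEL.findall(smiles)


def glyph_level(smiles):
    """Tokenize, then split each bracketed atom into its glyphs."""
    tokens = []
    for tok in atom_level(smiles):
        if tok.startswith("["):
            tokens.append("[")
            tokens.extend(GLYPH.findall(tok[1:-1]))
            tokens.append("]")
        else:
            tokens.append(tok)
    return tokens


def build_vocab(tokenize, corpus):
    """Closed vocabulary: every token observed in the training corpus."""
    return {tok for smiles in corpus for tok in tokenize(smiles)}


def unknown_fraction(tokenize, vocab, corpus):
    """Share of tokens in `corpus` that are missing from `vocab`."""
    tokens = [tok for smiles in corpus for tok in tokenize(smiles)]
    return sum(tok not in vocab for tok in tokens) / len(tokens)


# Toy corpora (hypothetical): the held-out molecules contain bracket atoms
# that never appear, as whole units, in the training molecules.
TRAIN = ["CCO", "c1ccccc1", "C[C@H](N)C(=O)O", "[Na+].[Cl-]"]
HELD_OUT = ["C[N+](C)(C)C", "CC(=O)[O-]"]

atom_unk = unknown_fraction(atom_level, build_vocab(atom_level, TRAIN), HELD_OUT)
glyph_unk = unknown_fraction(glyph_level, build_vocab(glyph_level, TRAIN), HELD_OUT)

print(atom_unk)   # 0.125
print(glyph_unk)  # 0.0
```

In this toy setting, the atom-level vocabulary has never seen `[N+]` or `[O-]` and must map them to an unknown token, while the glyph-level vocabulary already contains every piece needed to spell them. A tokenizer aiming for full OpenSMILES coverage would presumably fix the glyph set from the specification rather than learn it from a corpus, as done here for brevity.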

@article{wadell2025_2409.15370,
  title={Smirk: An Atomically Complete Tokenizer for Molecular Foundation Models},
  author={Alexius Wadell and Anoushka Bhutani and Venkatasubramanian Viswanathan},
  journal={arXiv preprint arXiv:2409.15370},
  year={2025}
}