
arXiv:2505.19631
Segment First or Comprehend First? Explore the Limit of Unsupervised Word Segmentation with Large Language Models

26 May 2025
Zihong Zhang, Liqi He, Zuchao Li, Lefei Zhang, Hai Zhao, Bo Du
Main: 8 pages · Appendix: 7 pages · Bibliography: 3 pages · 8 figures · 10 tables
Abstract

Word segmentation stands as a cornerstone of Natural Language Processing (NLP). Based on the concept of "comprehend first, segment later", we propose a new framework to explore the limits of unsupervised word segmentation with Large Language Models (LLMs) and to evaluate the semantic understanding capabilities of LLMs through word segmentation. We employ current mainstream LLMs to perform word segmentation across multiple languages to assess their "comprehension". Our findings reveal that LLMs are capable of following simple prompts to segment raw text into words, and there is a trend suggesting that models with more parameters tend to perform better across multiple languages. Additionally, we introduce a novel unsupervised method, termed LLACA (Large Language Model-Inspired Aho-Corasick Automaton). Leveraging the advanced pattern-recognition capabilities of Aho-Corasick automata, LLACA innovatively combines these with the deep insights of well-pretrained LLMs. This approach not only enables the construction of a dynamic n-gram model that adjusts based on contextual information but also integrates the nuanced understanding of LLMs, offering significant improvements over traditional methods. Our source code is available at this https URL.
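To make the core idea concrete, the sketch below shows how an Aho-Corasick automaton can drive unsupervised segmentation: vocabulary words are matched in a single left-to-right scan, and a Viterbi-style dynamic program picks the maximum-probability segmentation. This is a minimal illustration, not the paper's LLACA implementation; the toy vocabulary and its log-probabilities are hypothetical stand-ins for the scores LLACA derives from an LLM, and the dynamic, context-adjusted n-gram model described in the abstract is not reproduced here.

```python
import math
from collections import deque

def build_ac(vocab):
    """Build an Aho-Corasick automaton: goto trie, failure links, output sets."""
    goto, out = [{}], [[]]          # node 0 is the root
    for w in vocab:
        node = 0
        for ch in w:
            if ch not in goto[node]:
                goto[node][ch] = len(goto)
                goto.append({})
                out.append([])
            node = goto[node][ch]
        out[node].append(w)
    fail = [0] * len(goto)
    q = deque(goto[0].values())     # root's children fail to the root
    while q:
        node = q.popleft()
        for ch, nxt in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]  # inherit suffix matches
            q.append(nxt)
    return goto, fail, out

def find_matches(text, goto, fail, out):
    """Yield (end_index_exclusive, word) for every vocabulary occurrence."""
    node = 0
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for w in out[node]:
            yield i + 1, w

def segment(text, vocab_logprob, goto, fail, out):
    """Max-log-probability segmentation via dynamic programming (Viterbi)."""
    n = len(text)
    hits = [[] for _ in range(n + 1)]
    for end, w in find_matches(text, goto, fail, out):
        hits[end].append(w)
    UNK = -20.0                     # floor score for out-of-vocabulary characters
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        cands = [(best[i - 1][0] + UNK, text[i - 1])]   # single-char fallback
        for w in hits[i]:
            cands.append((best[i - len(w)][0] + vocab_logprob[w], w))
        best[i] = max(cands)
    words, i = [], n                # backtrack through the best path
    while i > 0:
        w = best[i][1]
        words.append(w)
        i -= len(w)
    return words[::-1]
```

For example, with a toy vocabulary `{"自然": -2.0, "语言": -2.0, "处理": -2.0, "自然语言": -1.2}`, segmenting "自然语言处理" yields `["自然语言", "处理"]`, since the longer entry "自然语言" scores higher than composing "自然" and "语言" separately.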
