Rethinking tokenization
Crafting better tokenizers for large language models
Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from word-level to
subword-level, analyzing how they balance tokens and types to enhance model
adaptability while controlling complexity. Despite subword tokenizers like Byte
Pair Encoding (BPE) overcoming many word tokenizer limitations, they encounter
difficulties in handling non-Latin languages and depend heavily on extensive
training data and computational resources to grasp the nuances of multiword
expressions (MWEs). This article argues that tokenizers, more than mere
technical tools, should draw inspiration from cognitive science research on
human language processing. It then introduces the “Principle of Least
Effort” from cognitive science, which holds that humans naturally seek to reduce
cognitive effort, and discusses the benefits of this principle for tokenizer development.
Based on this principle, the paper proposes that the Less-is-Better (LiB) model
could be a new approach for LLM tokenizers. The LiB model can autonomously learn
an integrated vocabulary consisting of subwords, words, and MWEs, which
effectively reduces both the number of tokens and the number of types. Comparative
evaluations show that the LiB tokenizer outperforms existing word and BPE
tokenizers, presenting an innovative method for tokenizer development and
suggesting that future tokenizers grounded in cognitive science could be
more efficient.
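The trade-off between tokens and types mentioned in the abstract can be made concrete with a small sketch. The Python below counts tokens (total units) and types (distinct units) for a toy corpus under character-level, word-level, and a simplified BPE-style segmentation. The corpus, the function names, and the number of merges are all assumptions chosen for illustration; the merge loop is a bare-bones approximation of BPE, and it is not the LiB model or any production tokenizer.

```python
from collections import Counter


def count_tokens_and_types(tokens):
    """Return (number of tokens, number of distinct types)."""
    return len(tokens), len(set(tokens))


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words; return the most frequent."""
    pairs = Counter()
    for symbols in words:
        pairs.update(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def toy_bpe(text, num_merges):
    """Apply up to `num_merges` greedy merges inside whitespace-separated words."""
    words = [list(w) for w in text.split()]
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single symbol
            break
        words = [merge_pair(w, pair) for w in words]
    return [s for w in words for s in w]


# Hypothetical toy corpus, chosen only to make the counts easy to inspect.
text = "low lower lowest new newer newest"
char_tokens = [c for w in text.split() for c in w]
word_tokens = text.split()
subword_tokens = toy_bpe(text, num_merges=4)

for name, toks in [("character", char_tokens),
                   ("subword", subword_tokens),
                   ("word", word_tokens)]:
    n_tokens, n_types = count_tokens_and_types(toks)
    print(f"{name:9s} tokens={n_tokens:2d} types={n_types:2d}")
```

On this corpus, character-level segmentation yields many tokens drawn from few types, and word-level segmentation yields few tokens with as many types as there are distinct words. Each merge in the sketch shortens the token sequence by combining adjacent symbols, which is the compromise that subword tokenizers aim for.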
Article outline
- 1. Introduction
- 1.1 From word-level tokenizers to subword-level tokenizers
- 1.2 Balancing tokens and types by subwords
- 1.3 Current marginalization of multiword expressions (MWEs) in language models
- 2. Optimizing future tokenizers
- 2.1 Principle of least effort
- LiB model: An implementation of ‘Principle of least effort’
- Model mechanism
- Results
- Practical application
- 3. Summary
- Notes
Cited by (2)
Fernando, Chrisantha, Simon Osindero & Dylan Banarse. 2024. The origin and function of external representations. Adaptive Behavior.
Wang, Tianlin. 2024. Introduction. International Journal of Chinese Linguistics 11:1, pp. 1 ff.
This list is based on CrossRef data as of 4 July 2024. Please note that it may not be complete. Sources presented here have been supplied by the respective publishers. Any errors therein should be reported to them.