Rethinking tokenization
Crafting better tokenizers for large language models
Tokenization significantly influences the performance of language models (LMs). This paper traces the evolution of tokenizers from the word level to the subword level, analyzing how they balance tokens and types to enhance model adaptability while controlling complexity. Although subword tokenizers such as Byte Pair Encoding (BPE) overcome many limitations of word-level tokenizers, they still struggle with non-Latin languages and depend heavily on extensive training data and computational resources to capture the nuances of multiword expressions (MWEs). This article argues that tokenizers are more than mere technical tools and should draw inspiration from cognitive science research on human language processing. The study then introduces the “Principle of Least Effort” from cognitive science, which holds that humans naturally seek to minimize cognitive effort, and discusses how this principle can benefit tokenizer development. Building on this principle, the paper proposes the Less-is-Better (LiB) model as a new approach to LLM tokenization. The LiB model autonomously learns an integrated vocabulary of subwords, words, and MWEs, effectively reducing both the number of tokens and the number of types. Comparative evaluations show that the LiB tokenizer outperforms existing word-level and BPE tokenizers, presenting an innovative method for tokenizer development and suggesting that future cognitive science-based tokenizers may be more efficient.
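To make the token/type bookkeeping in the abstract concrete, here is a minimal Python sketch that counts both quantities for three hand-made segmentations of a single sentence: a word-level split, an invented subword split, and a vocabulary that also stores the multiword expression “New York” as a single unit. The sentence, the splits, and the counting helper are illustrative assumptions, not the paper’s LiB model or an actual BPE tokenizer.

```python
# Illustrative sketch only: it does not reproduce the paper's LiB model or a
# real BPE vocabulary. "Tokens" are the segments a tokenizer emits for the
# text; "types" are the distinct vocabulary items among those segments.

def count_tokens_and_types(segments):
    """Return (number of tokens, number of distinct types) for a segmentation."""
    return len(segments), len(set(segments))

# One sentence, segmented three ways by hand (all splits are invented):
segmentations = {
    # one token per orthographic word
    "word-level": ["I", "live", "in", "New", "York", "and",
                   "she", "lives", "in", "New", "York", "too"],
    # a made-up subword split: "lives" -> "live" + "s", so pieces get reused
    "subword-level": ["I", "live", "in", "New", "York", "and",
                      "she", "live", "s", "in", "New", "York", "too"],
    # an integrated vocabulary that also stores the MWE "New York" as one unit
    "word + MWE": ["I", "live", "in", "New York", "and",
                   "she", "lives", "in", "New York", "too"],
}

for name, segments in segmentations.items():
    n_tokens, n_types = count_tokens_and_types(segments)
    print(f"{name:>14}: {n_tokens} tokens, {n_types} types")

# Output:
#     word-level: 12 tokens, 9 types
#  subword-level: 13 tokens, 9 types
#     word + MWE: 10 tokens, 8 types
```

Even in this toy setting, storing the recurring MWE lowers both counts at once, while splitting “lives” into reusable pieces increases the token count; the abstract’s claim is that, at corpus scale, the LiB vocabulary learns the mix of subwords, words, and MWEs that keeps both quantities low.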
Article outline
- 1. Introduction
- 1.1 From word-level tokenizers to subword-level tokenizers
- 1.2 Balancing tokens and types by subwords
- 1.3 Current marginalization of multiword expressions (MWEs) in language models
- 2. Optimizing future tokenizers
- 2.1 Principle of least effort
- LiB model: An implementation of ‘Principle of least effort’
- Model mechanism
- Results
- Practical application
- 3. Summary
- Notes
- References
Cited by (2)
Fernando, Chrisantha, Simon Osindero & Dylan Banarse. 2024. The origin and function of external representations. Adaptive Behavior 32(6), pp. 515 ff.
Wang, Tianlin. 2024. Introduction. International Journal of Chinese Linguistics 11(1), pp. 1 ff.
This list is based on CrossRef data as of 6 January 2025 and may not be complete.