Workflow matters: Comparing human translators and multi-agent LLMs in literary translation
Large language models (LLMs) have shown significant potential in translation tasks but often struggle with literary texts. This study compares professional human translations with translations produced by two AI-driven systems that coordinate multiple LLM-based agents. The first system mimics professional human translation practice, with distinct drafting and revision phases. The second redesigns the process specifically for LLMs’ capabilities, breaking translation into granular steps with specialized AI agents handling strategic planning, stylistic refinement, and coherence checking. Expert evaluations revealed that both AI systems achieved accuracy comparable to professional human translators. The LLM-capability-driven system produced translations with superior stylistic qualities and poetic language, though it occasionally added extraneous content. Meanwhile, the practice-derived system delivered concise translations but sometimes lacked cohesive flow. Blind evaluations showed that the translations from both AI systems were frequently preferred over human translations, particularly in terms of fluency. This study demonstrates that rethinking translation workflows around LLM capabilities can yield exceptional results, sometimes surpassing human performance in certain aspects.
Keywords: large language models (LLMs), multi-agent systems, literary translation, translation quality, translation technology
1. Introduction
The translation process has long been studied as a complex cognitive activity requiring multiple competencies, from linguistic knowledge to cultural awareness (e.g., Muñoz Martín 2016; Carl and Schaeffer 2017). The emergence of large language models (LLMs) has introduced new possibilities for translation, with state-of-the-art models now surpassing traditional neural machine translation (NMT) systems in fluency and naturalness for many language pairs (Briva-Iglesias, Camargo, and Dogru 2024; Gao et al. 2024; Jiang et al. 2024). However, compared with expert human translation, particularly in literary contexts, LLMs still show considerable limitations (He 2024; R. Zhang, Zhao, and Eger 2025).
The integration of translation technology into professional workflows is well established in technical and commercial domains, yet its adoption in literary translation has been cautious and limited. Traditional machine translation (MT) has struggled to capture the stylistic nuances, creative metaphors, and cultural subtleties that are central to literary works, leading many literary translators to avoid computer-aided translation (CAT) tools or MT altogether (Taivalkoski-Shilov 2019; Youdale, Rothwell, and Way 2023). These tools are often perceived as incompatible with the artistic and interpretive nature of literary translation. Concerns about creativity, translator autonomy, and emerging ethical issues continue to shape skepticism toward automation in the literary field (Toral and Way 2018; Taivalkoski-Shilov 2019; Kenny and Winters 2020). However, recent quality improvements in NMT and LLMs are beginning to challenge these reservations (Youdale, Rothwell, and Way 2023). Their potential to enhance efficiency and output quality, together with their limitations in creative contexts, underscores the need to understand how such models can be effectively integrated into literary translation practice.
Research on improving LLM-based translation has largely focused on prompting, a method that provides specific instructions to guide the model’s output and is more accessible than technical alternatives like fine-tuning (Elshin et al. 2024). Various prompting strategies have been tested, including providing example translations (Moslem et al. 2023), guiding models through reasoning steps (Wei et al. 2022; Peng et al. 2023), and offering detailed contextual information (He 2024; Jiang et al. 2024). Yet the results have been mixed, with some studies reporting that simpler prompts sometimes outperform complex ones (R. Zhang, Zhao, and Eger 2025). Despite these efforts, current approaches struggle to address the nuanced challenges of literary translation, where creative expression, cultural sensitivity, and stylistic appropriateness are paramount across its diverse genres and forms (Fu and L. Liu 2024; R. Zhang, Zhao, and Eger 2025).
In response to these limitations, a promising new approach involving multi-agent systems organizes multiple LLMs into collaborative structures with specialized roles (Wu, Xu, and Longyue Wang 2024). This approach divides complex tasks into manageable components and has been shown to be effective across various domains (Dorri, Kanhere, and Jurdak 2018; Guo et al. 2024). For translation specifically, systems like TransAgents have demonstrated strong results in fiction translation (Wu, Xu, and Longyue Wang 2024). However, many current multi-agent frameworks replicate human collaborative workflows (Qian et al. 2024; Lei Wang et al. 2024), raising questions about whether these structures optimally serve LLMs’ distinct capabilities and limitations. For instance, TransAgents has been shown to omit substantial portions of source content in longer texts (Wu et al. 2025), suggesting challenges in directly transferring human workflow models to LLM systems.
The question of optimal workflow design for LLM-based translation represents a significant gap in current research. While translation process research has extensively studied human translation workflows and collaboration patterns, the equivalent knowledge for LLM-based systems remains underdeveloped. This gap becomes particularly relevant as translation technology continues to advance and the integration of LLMs into professional translation workflows becomes increasingly common.
This study examines how different multi-agent designs affect literary translation quality by comparing two systems: a practice-derived multi-agent system (PD-MAS) based on standard human translation workflows and an LLM-capability-driven multi-agent system (LCD-MAS) featuring specialized agents for planning, stylistic refinement, and coherence checking. Through expert evaluations of Chinese–English fiction translations, we assess how these different process structures influence translation accuracy, fluency, and stylistic appropriateness. Specifically, this study addresses two primary questions:
- How does the quality of translations from the LCD-MAS compare to those from the PD-MAS in terms of accuracy, fluency, and overall rater preference?
- Can either multi-agent system achieve translation quality comparable to that of professional human translators?
The findings contribute to our understanding of effective translation process design in the era of advanced language models and offer insights into the future of human–AI collaborative translation.
2. Related work
This section reviews three key areas underpinning our research: the challenges of literary translation, LLM capabilities in translation, and multi-agent systems that mirror collaborative translation processes. We trace how the field has shifted from viewing translation as a solitary cognitive activity to a collaborative process — an evolution now reflected in computational approaches that distribute translation tasks across specialized agents.
2.1 Literary translation: Distinctive features and challenges
Literary translation differs fundamentally from technical translation in both its objectives and challenges. Situated at the intersection of linguistic transfer and artistic recreation, it requires not only accuracy but also the ability to reproduce literary effects, voices, and styles in the target language (Jones 2019; Kenny and Winters 2020). Translators must attend to rhythm, wordplay, metaphor, allusion, and narrative voice, all of which shape a work’s artistic identity (Boase-Beier 2014; Jones 2019; Matusov 2019). These demands are further complicated by the diversity of literary genres, such as fiction, drama, and poetry, each with distinct aesthetic goals and stylistic conventions. Consequently, literary translation calls for a variety of strategies and creative skills to recreate the reading experience in another language.
Given these complexities, literary translation quality cannot be adequately assessed through error-counting or linguistic metrics alone. Instead, holistic models considering both textual and contextual factors, such as those proposed by Reiss (2000) and House (2015), emphasize the need to evaluate communicative function, effectiveness in recreating the source text’s aesthetic experience, and cultural resonance. Literary translation is thus judged as much by its ability to stand as an independent work in the target culture as by its formal accuracy. Reader-response approaches, which foreground how audiences perceive translated texts, further highlight the importance of subjective and contextual factors in quality assessment (Brumme and Espunya 2012; Fonteyne, Tezcan, and Macken 2020).
The creative nature of literary translation poses significant challenges for MT systems. Despite advances in NMT, studies consistently show that automated systems struggle with the creative and culturally embedded dimensions of literary texts. They often fail to adequately handle wordplay, metaphors, register shifts, idiomatic expressions, and cultural allusions, which are central to literary effect (Toral and Way 2018; Matusov 2019; Fonteyne, Tezcan, and Macken 2020; Kenny and Winters 2020; Guerberof-Arenas and Toral 2022). These limitations stem partly from training on predominantly non-literary corpora and from an inability to recognize the cultural significance of linguistic choices (Besacier and Schwartz 2015).
Furthermore, current MT architecture faces inherent structural limitations when processing literary texts. Most systems operate at the sentence level within narrow context windows, preventing them from maintaining narrative continuity or consistent character voices across chapters (Matusov 2019; Fonteyne, Tezcan, and Macken 2020). While advanced language models can produce superficially fluent translations, they often flatten stylistic nuances and introduce semantic distortions, particularly with creative or culturally specific language (Toral and Way 2018; R. Zhang, Zhao, and Eger 2025). Research shows that although some machine-translated sentences may approach publishable quality, most require substantial human post-editing to address stylistic, discursive, and cultural issues (Matusov 2019). These persistent limitations underscore why human expertise remains indispensable in literary translation (Kenny and Winters 2020; Guerberof-Arenas and Toral 2022).
2.2 LLMs in translation: Capabilities and process integration
LLMs offer advantages over conventional NMT systems through large-scale transformer architectures and extensive pre-training on diverse multilingual corpora (Achiam et al. 2023). These features enable them to capture broader contextual dependencies during translation, mitigating the sentence-level fragmentation common in NMT outputs. Longyue Wang et al. (2023) and Briva-Iglesias, Camargo, and Dogru (2024) demonstrated that LLMs produce translations with improved coherence, particularly for documents requiring consistent terminology and stylistic choices. Their extensive pre-training also equips them with substantial world knowledge, enabling more nuanced translations of culturally bound expressions. Empirical studies show that LLMs outperform traditional systems across diverse genres, including legal texts (Briva-Iglesias, Camargo, and Dogru 2024), classical Chinese poetry (Gao et al. 2024), political discourse (Jiang et al. 2024), and news content (J. Yan et al. 2024).
Despite these advantages, LLMs encounter specific challenges when applied to literary texts. Their fundamental token-prediction mechanism limits creative problem-solving when confronted with novel linguistic structures or cultural references without direct equivalents (He 2024; R. Zhang, Zhao, and Eger 2025). Context windows, though expanded in recent models, still constrain narrative coherence across long texts (Karpinska and Iyyer 2023). This is particularly problematic for literary translation, where plot development and thematic motifs often span entire works. R. Zhang, Zhao, and Eger (2025) found that LLMs, while outperforming traditional NMT systems on literary texts, still produce translations characterized by overly literal renderings and stylistic deficiencies. These limitations become evident particularly in relation to the “rich points” identified in translation process research, where cultural, linguistic, and stylistic factors converge to create complex translation challenges (PACTE Group 2017).
Researchers have explored a range of prompting strategies to improve the quality of literary translation carried out by LLMs. Few-shot prompting has produced mixed results, with effectiveness depending more on the choice of examples than on their number (Moslem et al. 2023; B. Zhang, Haddow, and Birch 2023). Role-based prompting, which frames the LLM as a professional translator, generally produces only modest improvements (He 2024). Chain-of-thought approaches yield limited gains in translation quality (Wei et al. 2022; Peng et al. 2023). Paradoxically, studies by Puppel and Borg (2025) and R. Zhang, Zhao, and Eger (2025) report that simpler prompts can outperform more complex ones, suggesting that prompt engineering alone remains insufficient to address the creative aspects of literary translation.
2.3 Multi-agent systems: Parallels with human translation process
Recent research has proposed multi-agent systems as a solution to the persistent challenges faced by LLMs in literary translation (Wu, Xu, and Longyue Wang 2024). Building on earlier work in artificial intelligence on agent cooperation (Wooldridge 2009), these systems divide complex problems into components handled by specialized agents with distinct roles and decision-making procedures (Dorri, Kanhere, and Jurdak 2018; Guo et al. 2024). Park et al. (2023) and Chan et al. (2023) have demonstrated the effectiveness of collaborative agent frameworks in complex tasks. When applied to translation, these systems distribute translation tasks across specialized agents, often outperforming single-agent approaches (Liang et al. 2024; Wu, Xu, and Longyue Wang 2024).
Multi-agent translation systems do not simply represent a shift from single-model to collaborative computational workflows; they also parallel the evolution of translation process research from viewing translation as an individual cognitive activity to understanding it as a collaborative social process. Earlier research, such as Jakobsen (2002) and Mossop (2000), conceptualized translation as a linear progression through pre-translation analysis, drafting, and post-translation revision. As research methodologies advanced, studies by Hvelplund (2011) and Schaeffer and Carl (2013) documented the non-linear and recursive nature of the translation process, revealing how translators constantly shift between source and target texts, re-evaluating and refining their work across multiple iterations (Muñoz Martín 2016). Multi-agent systems mirror this recursiveness by assigning different agents to sequential stages, such as initial drafting, revision, and final review, allowing each agent to iteratively improve the translation.
Socio-cognitive models in Translation Studies highlight the collaborative nature of professional translation, where complex projects are distributed among teams with specialized roles through organizational workflows (Kuznik and Verd 2010; Ehrensberger-Dow and Massey 2014; Risku 2014). Multi-agent systems embody this principle by assigning distinct roles to individual agents, such as terminology management, translation, and review. This division of labor mirrors professional translation workflows, in which translators, revisers, and reviewers contribute complementary expertise to the final product (International Organization for Standardization 2015). The multi-agent approach thus offers a computational framework for exploring translation as a distributed cognitive activity in the context of AI.
The design of multi-agent translation systems involves a fundamental choice between human-mimicking and LLM-capability-driven workflows. Human-mimicking approaches replicate professional translation practices by assigning agents to traditional roles (Wu, Xu, and Longyue Wang 2024). While these configurations leverage established process knowledge, they often fail to fully exploit LLMs’ unique capabilities. For instance, TransAgents demonstrated limitations with long texts, omitting significant portions of source content (Wu et al. 2025). In contrast, LLM-capability-driven approaches — designed around LLMs’ computational strengths rather than human role divisions — remain largely unexplored but offer promising directions for developing novel translation systems (Becker 2024).
Research on LLM-based translation workflows remains nascent, with significant gaps concerning optimal agent configurations and information flow between agents. Preliminary findings suggest that different communication protocols affect both output quality and computational efficiency (Becker 2024; Q. Wang et al. 2024). This study addresses these gaps by developing and evaluating two distinct multi-agent systems: a practice-derived workflow modeled on professional translation practice (PD-MAS) and an LLM-capability-driven system (LCD-MAS) featuring more granular task decomposition. This approach enables a systematic comparison between human-mimicking and LLM-capability-driven workflows for literary translation, thereby advancing our understanding of how multi-agent translation systems can be optimized to meet the distinctive challenges of literary texts.
3. Methods
3.1 Multi-agent translation systems
3.1.1 Practice-derived multi-agent translation system (PD-MAS)
The PD-MAS implements a workflow aligned with the ISO 17100:2015 (International Organization for Standardization 2015) translation service requirements. We designed agent profiles to reflect industry standards, enabling direct comparison with professional human workflows.
The system operates through two sequential stages: pre-production and production (Figure 1). In the pre-production stage, two specialized agents prepare essential resources: the text analyst analyzes source text characteristics (genre, domain, purpose, and stylistic features), while the term expert creates bilingual terminology lists for consistency. In this literary context, we use ‘terminology’ operationally to include any recurring element requiring consistent translation. This is particularly important for handling character names, place names, and recurring motifs, which have been identified as a key challenge for literary translation quality (Matusov 2019).
The production stage encompasses translation and quality assurance. The translator generates target text guided by the analysis and terminology resources, applying criteria such as terminological consistency, genre appropriateness, and cultural adaptation. After self-checking, the translator passes the text to the reviser, who conducts comparative analysis between source and target texts, focusing on accuracy and completeness. The reviewer then ensures linguistic and stylistic coherence before the proofreader performs the final quality check.
We structured the agent instructions as itemized lists rather than prose descriptions to optimize LLM performance, following recommended prompt engineering practices (Phoenix and Taylor 2024). The workflow progresses through defined stages of preparation, translation, revision, and review, enabling assessment of whether practice-derived translation processes remain effective when implemented through LLM-based agents.
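The sequence of agents described above can be sketched as a simple chain of calls. This is a minimal illustration rather than the implementation used in the study: the `call_llm` signature, the role identifiers, and the abbreviated itemized instructions are all hypothetical, and a stand-in function replaces the actual model so that the sketch runs without network access.

```python
from typing import Callable, Dict, List

# Hypothetical signature: (role, itemized instructions, payload) -> output text.
LLMCall = Callable[[str, str, str], str]


def run_pd_mas(source_text: str, call_llm: LLMCall) -> Dict[str, str]:
    """Run the two-stage practice-derived workflow on one source text."""
    # Pre-production: text analysis and terminology resources.
    analysis = call_llm(
        "text_analyst",
        "- Identify genre, domain, and purpose\n- Describe stylistic features",
        source_text,
    )
    terms = call_llm(
        "term_expert",
        "- List recurring names, places, and motifs\n- Fix one rendering for each",
        source_text,
    )

    # Production: translation followed by three quality-assurance passes.
    resources = f"ANALYSIS:\n{analysis}\n\nTERMS:\n{terms}\n\nSOURCE:\n{source_text}"
    draft = call_llm(
        "translator",
        "- Translate using the analysis and term list\n- Self-check before handover",
        resources,
    )
    revised = call_llm(
        "reviser",
        "- Compare source and target\n- Correct accuracy and completeness problems",
        f"SOURCE:\n{source_text}\n\nTARGET:\n{draft}",
    )
    reviewed = call_llm(
        "reviewer",
        "- Check linguistic and stylistic coherence of the target text",
        revised,
    )
    final = call_llm("proofreader", "- Perform the final quality check", reviewed)
    return {"analysis": analysis, "terms": terms, "draft": draft, "final": final}


# Stand-in model that only records which agent was called, in order.
pd_calls: List[str] = []


def fake_llm(role: str, instructions: str, payload: str) -> str:
    pd_calls.append(role)
    return f"<{role} output>"


pd_result = run_pd_mas("示例原文。", fake_llm)
print(pd_calls)
```

In this sketch only the reviser sees the source text again after drafting, which mirrors the comparative source-target check described for that role; the reviewer and proofreader work on the target text alone.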
3.1.2 LLM-capability-driven multi-agent translation system (LCD-MAS)
The LCD-MAS was designed around the computational characteristics of large language models, featuring granular task decomposition and dedicated stylistic processing (Figure 2).
A key challenge in the translation of long texts is LLMs’ context limitations. Despite impressive technical context windows (e.g., GPT-4o’s 128K tokens), models show degraded performance in coherence, instruction-following, and accuracy at much shorter context lengths (Hsieh et al. 2024; Levy, Jacoby, and Goldberg 2024). Levy, Jacoby, and Goldberg (2024) report that the reasoning accuracy of LLMs, including GPT-4, declines gradually as input length increases, with measurable degradation even at around 3000 tokens. Such performance deterioration poses consistency challenges for LLM-based translation systems (Liang et al. 2024).
This system addresses context window limits by dividing source texts into semantically coherent units of approximately 300 Chinese characters each before the translation stage. To counterbalance potential loss of global context caused by source text chunking, we implemented specialized mechanisms at pre-translation and finalization stages.
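One simple way to approximate this segmentation is to split the source at sentence-final punctuation and pack whole sentences into chunks of up to about 300 characters. The sketch below makes that assumption explicitly: the text does not specify how semantic coherence was determined, so sentence boundaries serve here only as a stand-in, and the function name and punctuation set are illustrative.

```python
import re
from typing import List

# A sentence is any run of text ending in Chinese sentence-final punctuation,
# plus an optional closing quotation mark; trailing text with no final
# punctuation is kept as a last sentence.
_SENTENCE = re.compile(r"[^。！？]*[。！？]+[”’」』]?|[^。！？]+\Z")


def chunk_source(text: str, target_chars: int = 300) -> List[str]:
    """Pack whole sentences into chunks of at most target_chars characters.

    A single sentence longer than target_chars becomes its own, longer chunk.
    """
    chunks: List[str] = []
    current = ""
    for sentence in _SENTENCE.findall(text):
        # Start a new chunk when adding this sentence would exceed the target.
        if current and len(current) + len(sentence) > target_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


sample_text = "春天来了。" * 100  # 500 characters of toy input
sample_chunks = chunk_source(sample_text)
print([len(chunk) for chunk in sample_chunks])
```

Because chunks are built only from complete sentences, concatenating them reproduces the source text exactly, which makes it straightforward to check that no source content is dropped at this step.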
The system operates through three interconnected stages. Pre-translation planning establishes global context through two agents: the summarizer generates a concise narrative summary capturing main events, characters, and themes, and the strategy planner develops a comprehensive translation plan addressing audience expectations, text type, and cultural references.
The translation and stylistic rewriting stage separates semantic transfer from stylistic refinement. For each source chunk, the translator produces an initial translation using the summary and strategy plan from the pre-translation stage. The reviser checks for accuracy, and then the style guide generator identifies appropriate stylistic enhancements. Finally, the stylistic rewriter applies these recommendations, incorporating literary devices while preserving semantic content. This pipeline addresses known limitations of fluency and style in LLM translations (He 2024; Jiang et al. 2024; R. Zhang, Zhao, and Eger 2025). It breaks down the translation process into two phases: an initial interlingual translation that conveys the semantic content, followed by an intralingual translation (i.e., stylistic rewriting within the target language) (Jakobson 1959; Whyatt 2017), where the stylistic rewriter enhances literary expression. By leveraging LLMs’ strengths in text style unbundling (Phoenix and Taylor 2024) and text style transfer (Reif et al. 2022; Tao et al. 2024), this design aims to achieve stylistic improvement without compromising semantic fidelity.
Finalization ensures coherence across independently translated segments, which are concatenated at this stage. After text concatenation, the style guide generator detects inconsistencies and awkward transitions and produces guidelines for the editor to implement, maintaining global coherence while preserving the established stylistic qualities.
This architecture reconfigures the translation process around LLMs’ computational characteristics rather than human cognitive patterns, combining global context-setting, granular task division, dedicated stylistic processing, and systematic finalization.
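The three stages can be summarized in a short orchestration sketch. It is an illustration under stated assumptions, not the system's code: the `call_llm` signature and role identifiers are hypothetical, prompts are reduced to labeled payloads, and the exact inputs passed to each agent (for example, what the strategy planner receives) are guesses that the text does not confirm.

```python
from typing import Callable, List

# Hypothetical signature: (role, payload) -> output text.
AgentCall = Callable[[str, str], str]


def run_lcd_mas(chunks: List[str], call_llm: AgentCall) -> str:
    """Run planning, per-chunk translation and rewriting, then finalization."""
    full_source = "".join(chunks)

    # Stage 1: pre-translation planning establishes global context.
    summary = call_llm("summarizer", full_source)
    plan = call_llm("strategy_planner", f"SUMMARY:\n{summary}\n\nSOURCE:\n{full_source}")

    # Stage 2: each chunk is translated, checked, and stylistically rewritten.
    rewritten: List[str] = []
    for chunk in chunks:
        context = f"SUMMARY:\n{summary}\n\nPLAN:\n{plan}\n\nSOURCE:\n{chunk}"
        draft = call_llm("translator", context)
        checked = call_llm("reviser", f"SOURCE:\n{chunk}\n\nTARGET:\n{draft}")
        style_notes = call_llm("style_guide_generator", checked)
        rewritten.append(
            call_llm("stylistic_rewriter", f"NOTES:\n{style_notes}\n\nTEXT:\n{checked}")
        )

    # Stage 3: finalization works on the concatenated text.
    joined = "\n".join(rewritten)
    coherence_notes = call_llm("style_guide_generator", joined)
    return call_llm("editor", f"NOTES:\n{coherence_notes}\n\nTEXT:\n{joined}")


# Stand-in model that records the order of agent calls.
lcd_calls: List[str] = []


def echo_llm(role: str, payload: str) -> str:
    lcd_calls.append(role)
    return f"[{role}]"


lcd_final = run_lcd_mas(["第一段。", "第二段。"], echo_llm)
print(lcd_calls)
```

The per-chunk loop is where the design departs most clearly from the practice-derived workflow: the global summary and plan are computed once and reused for every chunk, so each translator call sees a short source segment together with document-level context.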
3.2 Materials
This study evaluated translation quality using a corpus of contemporary Chinese fiction with existing professional English translations. We selected twenty-eight chapters from fifteen works by fourteen prominent Chinese authors, including Nobel laureate Mo Yan (e.g., 天堂蒜薹之歌 Tiantang suantai zhi ge, translated by Howard Goldblatt as The Garlic Ballads; see Appendix A for the complete list). These established translations served as benchmarks for evaluating whether machine-generated translations could match or exceed professional human translation quality in literary contexts.
The corpus included diverse subgenres to ensure broad representativeness: general fiction, mystery and detective fiction, science fiction, romance, and 仙侠 xianxia (a genre featuring cultivation and martial arts elements). To capture potential variations across narrative progression, we selected chapters from the beginning, middle, and end of each work.
To control for text length effects, we standardized chapters to approximately 3000 Chinese characters, with longer chapters truncated at narrative breaks and shorter ones supplemented with adjacent content. This standardization ensured comparable processing conditions across all texts.
3.3 Technical implementation
Both multi-agent systems were implemented using OpenAI’s GPT-4o (version gpt-4o-2024-11-20) via the Microsoft Azure OpenAI API, with Python 3.12.0 as the development environment. All twenty-eight source chapters were processed on 1 January 2025, ensuring consistent model performance across the evaluation corpus. This setup provided a controlled experimental environment where workflow architecture, rather than model capability, served as the independent variable.
Temperature settings were strategically configured based on agent function across both systems. Temperature is a parameter that controls the randomness of model outputs, where lower values produce more deterministic results and higher values allow for more variation. Analytical agents (text analyst, term expert, summarizer, and editor) operated at a temperature of 0 to produce deterministic outputs with high consistency. Translation and stylistic agents operated at a temperature of 0.5, balancing creative language generation with semantic fidelity. The top_p parameter, which controls the range of vocabulary the model considers when generating text, remained at its default value. The max_tokens parameter was left unrestricted to avoid artificial truncation of outputs.
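These settings amount to a small per-agent configuration table. The sketch below records only the values stated above; the dictionary keys and the `request_params` helper are hypothetical names, no API call is made, and agents whose temperature the text does not state are deliberately left out rather than guessed. Treating the stylistic rewriter as one of the "stylistic agents" is likewise an assumption.

```python
from typing import Dict

MODEL = "gpt-4o-2024-11-20"

# Temperatures as stated in the text. Agents not named there (for example the
# reviser, reviewer, and proofreader) are intentionally absent.
AGENT_TEMPERATURE: Dict[str, float] = {
    "text_analyst": 0.0,        # analytical agent
    "term_expert": 0.0,         # analytical agent
    "summarizer": 0.0,          # analytical agent
    "editor": 0.0,              # analytical agent
    "translator": 0.5,          # translation agent
    "stylistic_rewriter": 0.5,  # stylistic agent (assumed mapping)
}


def request_params(agent: str) -> Dict[str, object]:
    """Build sampling parameters for one agent.

    top_p and max_tokens are omitted so that the service defaults apply.
    """
    if agent not in AGENT_TEMPERATURE:
        raise KeyError(f"temperature for {agent!r} is not stated in the text")
    return {"model": MODEL, "temperature": AGENT_TEMPERATURE[agent]}


print(request_params("summarizer"))
print(request_params("translator"))
```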
Agent interactions were coordinated through an orchestration layer that maintained contextual continuity across processing stages, allowing outputs from earlier phases to be seamlessly incorporated into subsequent ones. This implementation ensured that any observed differences in translation quality could be attributed to workflow design rather than technical variables.
3.4 Text selection and quality assessment framework
In the evaluation phase, we extracted thirty text samples from the translated chapters, carefully selecting passages that represented both narrative and dialogue elements from the beginning, middle, and end sections of the source texts to ensure comprehensive coverage of the stylistic and register variations typical in literary fiction (Egbert and Mahlberg 2020; Chou and K. Liu 2024). Table 1 presents summary statistics showing length distributions across the source texts and all three translation versions. As Figure 3 shows, LCD-MAS produced generally longer translations than both human translators and PD-MAS — a pattern examined in our discussion of stylistic tendencies.
| Source and target texts | Min | Max | Median | Mean | SD |
|---|---|---|---|---|---|
| Chinese source texts | 106 | 360 | 163 | 166 | 55.2 |
| PD-MAS translations | 36 | 186 | 94 | 99 | 42.8 |
| LCD-MAS translations | 81 | 313 | 143 | 149.5 | 58.1 |
| Human translations | 65 | 204 | 112 | 113.4 | 30.5 |
The thirty text samples were evaluated along two key dimensions: accuracy (faithful conveyance of source text meaning) and fluency (naturalness and adherence to target language norms) (Castilho et al. 2018; Salmi 2020).
Four expert raters with complementary expertise conducted the evaluations. Two native English speakers with extensive experience teaching English writing assessed fluency (Raters 1 and 2), while two professors of translation from Chinese universities, each with over ten years of experience teaching literary translation, evaluated accuracy (Raters 3 and 4). We adapted Waddington’s (2001) five-level scoring rubric to better suit professional translation evaluation, imposing stricter requirements for higher scores (see Appendix B). Raters used a 1–10 scale with whole-number increments to enhance scoring reliability.
To ensure consistent application of assessment criteria, all raters participated in training and calibration sessions. The evaluation employed a blind design. Each rater received a document containing the source texts alongside three anonymized and randomized translations (produced by human translators and the two multi-agent systems). Raters were not informed that any translations were machine-generated. In addition to assigning numerical scores, raters selected their preferred translation for each sample and provided written comments explaining their evaluations, considering factors including accuracy, fluency, stylistic appropriateness, creativity, and any other aspects they deemed relevant to translation quality.
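The anonymization and randomization step can be illustrated with a short sketch. The procedure actually used to prepare the rating documents is not described, so everything here is an assumption: the letter labels, the seeded shuffle, and the function name are illustrative only.

```python
import random
from typing import Dict, List, Tuple


def build_blind_packet(
    translations: Dict[str, str], rng: random.Random
) -> Tuple[List[Tuple[str, str]], Dict[str, str]]:
    """Return (label, text) pairs in random order plus a label-to-system key.

    Raters would see only the packet; the key stays with the researchers.
    """
    systems = list(translations)
    rng.shuffle(systems)  # a fresh random order for each sample
    labels = [chr(ord("A") + i) for i in range(len(systems))]
    packet = [(label, translations[system]) for label, system in zip(labels, systems)]
    key = dict(zip(labels, systems))
    return packet, key


blind_rng = random.Random(2025)  # fixed seed so the order can be reproduced
sample_translations = {
    "Human": "human version",
    "PD-MAS": "pd-mas version",
    "LCD-MAS": "lcd-mas version",
}
blind_packet, blind_key = build_blind_packet(sample_translations, blind_rng)
print([label for label, _ in blind_packet])
```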
This framework allowed assessment of technical quality through numerical ratings and of subjective reception through preference votes and qualitative feedback, providing a comprehensive view of how different translation approaches performed on literary texts.
4. Results
Our comparison of PD-MAS and LCD-MAS translations with professional human translations revealed distinct patterns in quality and reception. The evaluation examined three dimensions: accuracy (semantic fidelity to source texts), fluency (naturalness and readability in the target language), and overall preference as determined by expert raters. Statistical analyses for each dimension, complemented by qualitative assessments of translation characteristics, revealed both expected and unexpected patterns across the three translation approaches.
4.1 Accuracy analysis
We first assessed the consistency of accuracy evaluations using the intra-class correlation coefficient (ICC). A two-way random-effects model for average ratings, ICC(2,k), showed good interrater reliability between the two translation experts (ICC = .74, 95% CI [.52, .85], F(89, 89) = 4.50, p < .001), indicating consistent application of the evaluation criteria.
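For readers unfamiliar with the coefficient, ICC(2,k) can be computed from the mean squares of a two-way ANOVA as (MSR − MSE) / (MSR + (MSC − MSE) / n), where MSR, MSC, and MSE are the mean squares for rated items, raters, and residual error, and n is the number of items. The sketch below implements this standard formula in plain Python. The toy ratings are invented for illustration and are not the study's data.

```python
from typing import List


def icc2k(ratings: List[List[float]]) -> float:
    """ICC(2,k): two-way random effects, average of k raters.

    ratings[i][j] is the score given to item i by rater j.
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between items
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between raters
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols                  # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)


# Toy data: the second rater scores every item exactly one point higher.
toy_ratings = [[1, 2], [2, 3], [3, 4]]
toy_icc = icc2k(toy_ratings)
print(round(toy_icc, 3))
```

In the toy example the two raters rank the items identically but differ by a constant offset, so the coefficient falls below 1. This shows that ICC(2,k) measures absolute agreement and penalizes systematic differences between raters.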
Statistical comparisons of average accuracy scores across the three translation approaches were conducted using non-parametric tests, since the Shapiro-Wilk test indicated non-normal distributions for all three groups. The Friedman test showed no statistically significant differences in accuracy among professional human translations, PD-MAS translations, and LCD-MAS translations (χ²(2) = 0.37, p = .832).
Follow-up pairwise comparisons using Bonferroni-corrected Wilcoxon signed-rank tests confirmed this result. Median accuracy scores were identical across all three approaches (Mdn = 8), with no significant differences detected between any pair: LCD-MAS versus PD-MAS (p = .739, r = .06), LCD-MAS versus human translators (p = .920, r = .02), and PD-MAS versus human translators (p = .837, r = .04).
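The omnibus test and the correction applied to the follow-up comparisons can be sketched in plain Python. The functions below compute the tie-corrected Friedman statistic, convert it to a p-value for the three-group case (with two degrees of freedom the chi-square survival function is exactly exp(−χ²/2)), and apply a Bonferroni adjustment. Function names and toy scores are illustrative, and the Wilcoxon signed-rank test itself is not reimplemented here.

```python
import math
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> Tuple[List[float], List[int]]:
    """Rank one row (1 = lowest); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_sizes: List[int] = []
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for idx in order[start:end + 1]:
            ranks[idx] = mean_rank
        tie_sizes.append(end - start + 1)
        start = end + 1
    return ranks, tie_sizes


def friedman_chi2(rows: Sequence[Sequence[float]]) -> float:
    """Tie-corrected Friedman statistic; each row holds one sample's k scores."""
    n, k = len(rows), len(rows[0])
    rank_sums = [0.0] * k
    tie_term = 0
    for row in rows:
        ranks, tie_sizes = average_ranks(row)
        for j, rank in enumerate(ranks):
            rank_sums[j] += rank
        tie_term += sum(t ** 3 - t for t in tie_sizes)
    chi2 = 12.0 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3.0 * n * (k + 1)
    correction = 1.0 - tie_term / (n * (k ** 3 - k))
    if correction == 0:
        raise ValueError("every row is fully tied; the statistic is undefined")
    return chi2 / correction


def p_value_df2(chi2: float) -> float:
    """Upper-tail chi-square probability, valid only for 2 degrees of freedom."""
    return math.exp(-chi2 / 2.0)


def bonferroni(p_values: Sequence[float]) -> List[float]:
    """Multiply each p-value by the number of comparisons, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]


# Toy scores (invented): four samples rated under three conditions.
toy_scores = [[6, 7, 8], [5, 6, 9], [7, 8, 9], [4, 6, 7]]
toy_chi2 = friedman_chi2(toy_scores)
print(toy_chi2, p_value_df2(toy_chi2))
```

As a plausibility check on the reported result, `p_value_df2(0.37)` evaluates to roughly 0.83, in line with the reported p = .832 once rounding of the test statistic is taken into account.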
4.2 Fluency analysis
Fluency evaluations showed high consistency between raters, with interrater reliability analysis yielding an ICC of .83 (95% CI [.74, .89], F(89, 89) = 6.30, p < .001). This indicates strong agreement between the two native English-speaking evaluators in their assessment of how naturally the translations read in English.
In contrast to accuracy scores, fluency ratings revealed significant differences across translation approaches. A Friedman test indicated statistically significant variation among the three translation types (χ²(2) = 14.92, p < .001). To identify specific differences, we conducted pairwise comparisons using Wilcoxon signed-rank tests. The p-values were Bonferroni-corrected for multiple comparisons.
These pairwise tests showed that LCD-MAS received significantly higher fluency scores (Mdn = 8) than both human translators (Mdn = 7, p < .001, r = .71) and PD-MAS (Mdn = 7.5, p = .024, r = .49). The effect size for LCD-MAS versus human translations (r = .71) indicates a large effect. No significant difference was observed between human and PD-MAS translations (p = .164, r = .35).
Figure 4 presents a comparison of both accuracy and fluency scores across all three translation approaches. While the accuracy distribution shows identical median scores (Mdn = 8), the box plots reveal that LCD-MAS achieved significantly higher fluency ratings than both human translators and PD-MAS.
4.3Translation preference analysis
Beyond numerical ratings, we examined overall translation preferences through direct comparison. Raters selected their preferred translation for each sample, yielding clear preference patterns across the 120 total evaluations (30 samples × 4 raters). LCD-MAS emerged as the most frequently preferred translation approach, receiving fifty-two votes (43.33%), followed by PD-MAS with thirty-nine votes (32.50%). Professional human translations were least preferred, with only twenty-nine votes (24.17%) (see Table 2).
| Translation approach | Rater 1 | Rater 2 | Rater 3 | Rater 4 | Total votes | Percentage (%) |
|---|---|---|---|---|---|---|
| PD-MAS | 9 | 11 | 7 | 12 | 39 | 32.50 |
| LCD-MAS | 14 | 10 | 15 | 13 | 52 | 43.33 |
| Humans | 7 | 9 | 8 | 5 | 29 | 24.17 |
Although LCD-MAS received the highest number of preference votes overall, individual rater preferences showed some variation. Three of the four raters chose LCD-MAS most often, while one rater (Rater 2) slightly favored PD-MAS (eleven votes versus ten). This variation suggests that despite the overall preference trend, translation quality assessment remains somewhat subjective, with different evaluators prioritizing different aspects of translation.
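The totals, percentages, and per-rater preferences can be recomputed directly from the vote counts in Table 2. The sketch below uses only the numbers reported in the table; the variable names are illustrative.

```python
# Votes per rater (Rater 1..4) for each translation approach, from Table 2.
votes = {
    "PD-MAS": [9, 11, 7, 12],
    "LCD-MAS": [14, 10, 15, 13],
    "Humans": [7, 9, 8, 5],
}

totals = {name: sum(counts) for name, counts in votes.items()}
grand_total = sum(totals.values())  # 30 samples x 4 raters

percentages = {
    name: round(100 * total / grand_total, 2) for name, total in totals.items()
}

# The approach each rater chose most often.
top_choice_per_rater = [
    max(votes, key=lambda name: votes[name][rater]) for rater in range(4)
]

print(totals)                # {'PD-MAS': 39, 'LCD-MAS': 52, 'Humans': 29}
print(percentages)           # {'PD-MAS': 32.5, 'LCD-MAS': 43.33, 'Humans': 24.17}
print(top_choice_per_rater)  # ['LCD-MAS', 'PD-MAS', 'LCD-MAS', 'LCD-MAS']
```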
The preference data align with the fluency results reported in Section 4.2, indicating that fluency may have exerted a stronger influence on overall preference than accuracy. This relationship is particularly noteworthy given that no significant differences were found in accuracy scores across the three translation approaches, while LCD-MAS demonstrated significantly higher fluency.
Raters’ written justifications for their preferences revealed distinctive characteristics associated with each translation approach. These qualitative insights are examined in detail in Section 5, where we analyze how specific translation qualities influenced overall preference patterns and what this suggests about effective translation process design.
5.Discussion
Our comparison of practice-derived and LLM-capability-driven translation systems reveals insights that challenge conventional assumptions about literary translation. This section interprets the observed performance patterns, analyzes distinctive stylistic characteristics, discusses persistent challenges in cultural translation, and considers broader implications for translation practice and technology.
5.1Performance patterns
The equivalence in accuracy scores across all three translation approaches challenges long-standing assumptions about literary translation requirements. The statistical parity between both AI systems and professional human translators suggests that well-designed multi-agent systems can effectively transfer semantic content from source to target language.
The superior fluency performance of LCD-MAS demonstrates how architectural design can influence translation quality beyond semantic fidelity. Its dedicated stylistic processing stage produced translations that raters preferred more often than both PD-MAS outputs and professional human translations. This finding diverges from previous research suggesting that human translators retain significant advantages over LLMs in stylistic aspects (R. Zhang, Zhao, and Eger 2025). Raters’ preference patterns align more closely with fluency than with accuracy, suggesting that they prioritized natural, engaging language over strict semantic equivalence when evaluating literary translation quality.
These findings indicate that translation workflow design significantly impacts output quality. The LCD-MAS architecture, which separates semantic transfer from stylistic refinement, appears particularly effective for literary translation, where both meaning and artistic expression matter. This challenges the conventional integrated approach where translators often address meaning and style simultaneously.
5.2Stylistic profiles and translation quality
The quantitative performance metrics reveal only part of the story. Analysis of raters’ qualitative feedback uncovered distinct stylistic profiles associated with each translation approach, which significantly influenced quality assessments and raters’ preferences. These stylistic differences emerged consistently across the corpus, revealing characteristic approaches to literary translation that extended beyond individual texts.
LCD-MAS’s translations displayed a distinctive literary quality characterized by rich and vivid imagery, figurative language, and dynamic expression. Example (1) illustrates this stylistic profile.
来日的晨时,秋天里的晨时。晨时里的日光,在豫东的平原上,因着晨时,它就血成一团,漫天漫地红着。铺红着,就有了这一天的晨时。
lai ri de chen shi qiu tian li de chen shi chen shi li de ri guang zai yu dong de ping yuan shang yin zhe chen shi ta jiu xue cheng yi tuan man tian man di hong zhe pu hong zhe jiu you le zhe yi tian de chen shi
‘The morning of the coming day, a morning in autumn. The sunlight in the morning, on the plains of eastern Henan, because of the morning, it congealed into a ball of blood, reddening the entire sky and earth. Spreading red, thus came this day’s morning.’
(L. Yan 2006, Chapter 3)
LCD-MAS: The next morning arrived, an autumn morning steeped in quiet beauty. The sunlight stretched across the eastern plains of Henan, painting the horizon in hues of glowing red, like molten fire. The sky and earth seemed to merge in a vivid crimson, as if the day itself was being born in flames. This fiery red heralded the start of a day that carried with it a fragile sense of hope.
PD-MAS: The coming morning marked the onset of autumn. The early sunlight spread across the plains of Eastern Henan, painting the sky and earth in vivid crimson hues. With this brilliant red, the day began.
Human translation: It is late autumn, the dawn of a new day. The sun rises above the East Henan plain. A blood-red ball turning the earth and sky a deep shade of crimson. As red unfurls, so follows morning. Another day begins.
(L. Yan 2011, Chapter 3)
The source text is highly poetic and atmospheric, characterized by rhythmic repetition (晨时 chenshi ‘morning time’), which creates an almost incantatory effect. Its language is deeply descriptive and evocative, employing vivid imagery such as 血成一团 xue cheng yi tuan ‘congealed like blood’ and 漫天漫地红着 man tian man di hong zhe ‘reddening the whole sky and earth’. This repetition and imagery work together to evoke a cyclical and ritualistic sense of dawn on the eastern plain.
The LCD-MAS translation amplifies this effect through vivid figurative language (“like molten fire,” “born in flames”) that heightens the text’s evocative quality. Phrases such as “steeped in quiet beauty” and “a fragile sense of hope” enhance the atmosphere while remaining thematically consistent with the source text. In contrast, PD-MAS delivers a more direct and concise translation that conveys the core information efficiently but lacks the literary flair of the LCD-MAS version. The human translation employs a distinctive staccato and fragmented style, creating a different rhythm that diverges from the more poetic and incantatory mood of the original.
However, this literary embellishment occasionally led LCD-MAS to introduce elements absent from the source text, as Example (2) demonstrates.
半个小时后,当他们轻易译出密电后,方才还是莫名无实的慌惶,顿时像剥掉了皮肉,露出血淋淋、狰狞的本质,把他们都吓瘫了。
ban ge xiao shi hou dang ta men qing yi yi chu mi dian hou fang cai hai shi mo ming wu shi de huang huang dun shi xiang bo diao le pi rou lu chu xue lin lin zheng ning de ben zhi ba ta men dou xia tan le
‘Half an hour later, when they easily deciphered the encrypted message, the vague and groundless panic from just moments before suddenly seemed to have its skin and flesh stripped away, revealing its bloody, hideous true nature, leaving them all paralyzed with fear.’
(Mai 2018, Chapter 1)
LCD-MAS: Half an hour later, as they effortlessly deciphered the encrypted message, the vague disquiet that had settled over them earlier took on a far more sinister shape. It was as if a mask had been torn away, exposing something raw, bloody, and grotesque beneath. The revelation struck with such force that it froze them in place, their minds paralyzed by a terror so overwhelming it seemed to drain the air from the room.
PD-MAS: Half an hour later, when they easily cracked the code, the vague unease they had felt earlier transformed into raw, visceral terror — an unrelenting nightmare that left them paralyzed with fear.
Human translation: Half an hour later, when they had deciphered the dictionary message with perfect ease, that earlier sense of bemusement was replaced by an all-consuming, paralysing terror. It was as if they’d been flayed, as if they’d been stripped of their surface equilibrium and reduced to raw emotion.
(Mai 2020, Chapter 1)
Here, the LCD-MAS translation develops the original metaphor (‘skin and flesh being peeled off’) into an extended series of images. While this creates dramatic tension, it adds elements not present in the source text, such as “drain the air from the room.” This tendency toward embellishment sometimes crossed into over-translation, with raters describing such passages as “florid” or “superfluous.”
PD-MAS consistently produced more concise translations that effectively conveyed the core meaning. Its approach prioritized accuracy and directness, often condensing source text information into efficient target language expressions. However, this conciseness occasionally led to reduced cohesion and stylistic nuance, with raters noting “fragmented syntax” and a “lack of cohesion.”
Human translations exhibited yet another stylistic profile, characterized by accurate rendering of meaning with varying levels of fluency. While human translators generally captured cultural nuances effectively, their stylistic choices sometimes resulted in what raters described as “overly literal” renderings or “fragmented syntax.” The human translation in Example (1) shows this tendency toward fragmentation, with short, choppy sentences that accurately convey content but can appear abrupt.
These stylistic profiles help explain why LCD-MAS received higher fluency scores and preference ratings despite all three approaches achieving comparable accuracy. Its emphasis on literary quality and engaging language appears to have resonated with evaluators, even when it occasionally expanded beyond the source text’s literal meaning. This finding supports the view that literary translation quality depends not only on semantic accuracy but also on the stylistic and affective impact of the target text.
However, the dominant criticism raised by raters against LCD-MAS warrants careful consideration. They observed that it systematically added content, from small descriptive details to entirely new information. This was perceived as its primary flaw, often sacrificing fidelity for a “dramatic” style criticized as “superfluous” and “unwarranted.” Rater 4’s comment that some passages read more like “transcreation” than translation highlights a key tension in our findings: while raters frequently preferred the more engaging prose, they simultaneously questioned its deviation from translation norms.
This tendency toward embellishment raises questions about the boundaries of translation and the ethics of AI-mediated creativity. LCD-MAS’s output, though successful by certain metrics, blurs the lines between translation, adaptation, and creative rewriting. Optimizing AI systems for stylistic effect may inadvertently privilege fluency over the preservation of authorial voice and cultural specificity, which is an especially delicate issue in literary contexts (Taivalkoski-Shilov 2019; Kenny and Winters 2020). When an LLM introduces its own metaphors or dramatic flourishes, it risks misrepresenting the original author’s voice, style, and intended meaning, even if the translation output achieves stylistic appeal. It may also homogenize diverse authorial styles into a recognizable “AI voice,” inadvertently erasing the very cultural and stylistic nuances that make literary works unique. This aligns with concerns that technology could flatten diverse voices “to sound like one and the same person” (Taivalkoski-Shilov 2019, 697). Such embellishment also raises ethical concerns regarding readers who expect a faithful rendering of the original work (Taivalkoski-Shilov 2019).
Ultimately, the system’s success in fluency and preference ratings highlights a promising direction for translation technology, but its content addition signals a departure from established translational ethics. The challenge lies in striking an appropriate balance between aesthetic effect and semantic fidelity in AI-assisted literary translation, while preserving authorial integrity and cultural authenticity.
5.3Challenges in translating cultural references
Despite the impressive performance of both multi-agent systems in overall accuracy and fluency, our analysis revealed persistent difficulties in translating culturally specific references. These challenges represent a significant limitation of current LLM-based approaches to literary translation.
Both multi-agent systems struggled with culturally bound expressions that require deep contextual understanding rather than linguistic knowledge alone. For example, when translating the temporal reference “用了两炷香的时间” yong le liang zhu xiang de shijian ‘in the time it took to burn two incense sticks’, both AI systems opted for literal renderings — “took the time of two incense sticks” and “took him two incense sticks’ worth of time.” While comprehensible, these translations fail to convey the idiomatic meaning readily understood by readers familiar with Chinese culture. The human translator appropriately rendered this as “took him two hours,” demonstrating cultural competence beyond literal transfer.
Similar patterns emerged with titles and proper names. When translating “上神” shangshen ‘high god/supreme deity’ in “青丘的那位九尾狐的上神” Qingqiu de na wei jiuwei hu de shangshen ‘that nine-tailed fox high god from Qingqiu’, LCD-MAS produced “that Nine-Tailed Fox Shangshen from Qingqiu,” while PD-MAS rendered it as “that Nine-Tailed Fox High God from Qingqiu.” The human translation — “this Qingqiu goddess” — better conveys the meaning to English readers. Likewise, both systems translated “土司太太” tusi taitai ‘chieftain’s wife’ literally (“Tusi Madam” and “Tusi’s wife”), whereas the human translator used the more culturally appropriate “the chieftain’s wife.”
Notably, these culturally inappropriate translations did not stem from a lack of relevant knowledge. Our examination of agents’ outputs revealed that the systems often recognized the cultural references but suggested suboptimal translation strategies. For instance, the strategy planner in LCD-MAS correctly identified “上神” shangshen ‘high god/supreme deity’ as referring to “hierarchical relationships in the celestial realm,” yet explicitly recommended transliteration with explanatory notes. These notes, moreover, did not appear in the final translation.
This disconnect between cultural knowledge and translation execution points to a critical limitation in current multi-agent translation systems: while cultural information is available, it is not effectively incorporated into the final translation. Even when individual agents proposed appropriate strategies for handling cultural references, these were not consistently implemented in the translation pipeline. This challenge highlights the continuing importance of human expertise and suggests that fully automated literary translation still faces significant obstacles where cultural competence is required.
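One way to make this gap visible is a post-hoc check that compares the planner's recommendations with the final output. The sketch below is purely illustrative and is not part of either system described in this study: the `PlannedStrategy` structure, the `unmet_strategies` function, and the idea that notes are collected in a separate list are all assumptions introduced for the example.

```python
from dataclasses import dataclass


@dataclass
class PlannedStrategy:
    """A hypothetical record of what a planning agent recommended."""
    source_term: str   # culturally specific term in the source text
    rendering: str     # rendering the planner recommended
    needs_note: bool   # whether an explanatory note was recommended


def unmet_strategies(plan, translation, notes):
    """Return (source_term, problem) pairs for recommendations not honored."""
    problems = []
    text = translation.lower()
    for item in plan:
        rendering = item.rendering.lower()
        if rendering not in text:
            problems.append((item.source_term, "planned rendering missing"))
        elif item.needs_note and not any(rendering in n.lower() for n in notes):
            problems.append((item.source_term, "explanatory note missing"))
    return problems


# The case discussed above: transliteration kept, recommended note absent.
plan = [PlannedStrategy("上神", "Shangshen", needs_note=True)]
issues = unmet_strategies(
    plan,
    translation="that Nine-Tailed Fox Shangshen from Qingqiu",
    notes=[],
)
print(issues)  # [('上神', 'explanatory note missing')]
```

A check of this kind only detects that a recommendation was dropped; whether the recommended strategy was appropriate in the first place would still require human judgment.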
5.4Implications for translation technology and practice
Our findings have important implications for translation technology development and professional practice. LCD-MAS’s superior performance suggests that appropriately designed multi-agent architectures can produce high-quality literary translations that raters may prefer to human translations in certain aspects.
The effectiveness of separating semantic transfer from stylistic refinement demonstrates the value of workflow designs tailored to computational strengths rather than modeled on human cognitive processes. This architectural insight could guide future translation technology development toward specialized processing stages rather than end-to-end approaches.
The persistent difficulties observed in handling cultural references indicate that fully automated literary translation still faces challenges. Optimal approaches may therefore involve human–AI collaboration rather than full automation, with human translators focusing on cultural adaptation while AI systems handle drafting and stylistic enhancement.
For translation theory, our findings invite reconsideration of translation processes. The effectiveness of non-human workflow design, which breaks translation into specialized sub-tasks, challenges traditional models and opens new theoretical directions for translation process design. This perspective shifts translation from being viewed primarily as an individual cognitive activity to a collaborative, functionally distributed process, whether performed by humans or AI agents.
6.Conclusion
This study compared the performance of two multi-agent translation systems against professional human translations for literary texts. The findings demonstrate that LLM-based multi-agent systems can achieve accuracy comparable to human translators while potentially surpassing them in fluency and rater preference. The LLM-capability-driven system, designed around LLMs’ computational capabilities rather than standard human practice, produced translations with enhanced literary quality and stylistic richness, though sometimes at the cost of introducing content absent from the source text. The human-practice-derived system generated more concise translations but sometimes lacked cohesion and natural flow. Notably, both AI approaches struggled with cultural references despite demonstrating understanding of these elements, suggesting a gap between cultural knowledge and effective translation strategy implementation. These results challenge fundamental assumptions about literary translation requirements and indicate that rethinking translation workflows specifically for LLM capabilities can yield exceptional results in certain aspects of translation quality.
Our study has several limitations that should be acknowledged. First, we focused exclusively on Chinese-to-English translation with a single LLM (GPT-4o), limiting the generalizability of our findings to other language pairs and model architectures. Second, the evaluation was based on relatively short text samples rather than full-length novels, leaving questions about how these systems can maintain consistency across longer narratives. Third, our study evaluated each multi-agent system as a holistic unit and did not isolate the performance of individual agents within the pipeline. Finally, our evaluation, while incorporating both quantitative ratings and qualitative assessments from expert raters, still captures only certain dimensions of translation quality and may not fully represent how different audiences would perceive the translations.
Future research should explore a broader range of language pairs, text types, and LLM architectures to assess the generalizability of our findings. Developing methods to address the cultural reference challenges we identified represents a particularly important direction, perhaps through enhanced coordination between agents responsible for strategic planning and those implementing the translation. Studies examining longer texts or complete literary works would also help determine whether multi-agent systems can maintain consistency across book-length translations. Research into human–AI collaborative translation workflows that combine the stylistic strengths of LLM systems with human cultural expertise could lead to particularly productive approaches. Moreover, our study has a potential confounding variable in the design of the LLM-capability-driven system, as it simultaneously introduced text chunking and a more sophisticated agentic architecture. Consequently, our results cannot fully disentangle whether the observed improvements in translation quality stem from the granular processing of smaller text units, the specialized multi-agent architecture, or their synergistic effect. Future research should aim to isolate these variables to determine their independent contributions. These directions can further expand our understanding of how LLM-based systems can contribute to literary translation while addressing their current limitations. By reimagining translation processes around the capabilities of advanced language models rather than simply replicating human workflows, researchers and developers can continue to push the boundaries of what MT can achieve in even the most challenging domains.
Funding
Open Access publication of this article was funded through a Transformative Agreement with Hong Kong Polytechnic University.
Acknowledgements
The authors thank the reviewers and editors for their constructive comments, which greatly improved the quality of this paper. The first author would also like to thank Professors Ricardo Muñoz Martín, Bogusława M. Whyatt, Joss Moorkens, and Christopher D. Mellinger for their valuable input during the individual tutorial sessions at the MC2 Lab’s 3rd International Summer School on Cognitive Translation & Interpreting Studies in July 2025.
Appendix A.Sources of the text samples used for the experiment
| Book title | Author | Publisher (source edition) | Publication year (source edition) | Chapter(s) | Translation title | Translator(s) | Publisher (translation) | Publication year (translation) |
|---|---|---|---|---|---|---|---|---|
| 尘埃落定 Chen’ai luoding ‘Dust settles’ | Alai | People’s Literature Publishing House | 2012 | 3, 12 | Red Poppies | Howard Goldblatt, Sylvia Li-Chun Lin | Houghton Mifflin Harcourt Publishing Company | 2002 |
| 第七天 Di qi tian ‘The seventh day’ | Yu Hua | New Star Press | 2013 | 1, 4 | The Seventh Day: A Novel | Allan H. Barr | Pantheon Books | 2015 |
| 丁庄梦 Ding zhuang meng ‘Dream of Ding Village’ | Yan Lianke | Shanghai Literature and Art Publishing House | 2006 | 2, 3 | Dream of Ding Village | Cindy Carter | Text Publishing | 2011 |
| 我们家 Women jia ‘Our family’ | Yan Ge | Zhejiang Literature and Art Publishing House | 2013 | 6 | The Chilli Bean Paste Clan: A Novel | Nicky Harman | Balestier Press | 2018 |
| 高兴 Gaoxing ‘Happy’ | Jia Pingwa | People’s Literature Publishing House | 2008 | 9, 10 | Happy Dreams | Nicky Harman | AmazonCrossing | 2017 |
| 天堂蒜薹之歌 Tiantang suantai zhi ge ‘Song of garlic scapes in paradise’ | Mo Yan | China Writers Publishing House | 2012 | 9, 10 | The Garlic Ballads | Howard Goldblatt | Arcade Publishing | 2011 |
| 风声 Feng sheng ‘The sound of wind’ | Mai Jia | Beijing October Arts and Literature Publishing House | 2018 | 1, 4 | The Message | Olivia Milburn | Head of Zeus | 2020 |
| 无证之罪 Wu zheng zhi zui ‘Crime without evidence’ | Zijin Chen | Hunan People’s Publishing House | 2014 | 1 | The Untouched Crime | Michelle Deeter | AmazonCrossing | 2016 |
| 北京折叠 Beijing zhedie ‘Folding Beijing’ | Hao Jingfang | Zhejiang Education Publishing House | 2023 | 2, 4 | Folding Beijing | Ken Liu | Uncanny Magazine | 2015 |
| 流浪地球 Liulang diqiu ‘The wandering earth’ | Liu Cixin | Changjiang Literature and Art Publishing House | 2008 | 1 | The Wandering Earth | Ken Liu, Elizabeth Hanlon, Zac Haluza, Adam Lanphier, and Holger Nahm | Head of Zeus | 2017 |
| 三体 San ti ‘The three-body [problem]’ | Liu Cixin | Chongqing Publishing House | 2016 | 21 | The Three-Body Problem | Ken Liu | Head of Zeus | 2015 |
| 荒潮 Huang chao ‘Waste tide’ | Chen Qiufan | Shanghai Literature and Art Publishing House | 2019 | 3, 4 | Waste Tide | Ken Liu | Tom Doherty Associates | 2019 |
| 盗墓笔记1:七星鲁王宫 Daomu biji 1: Qixing Lu wang gong ‘Tomb-robbing notes 1: Seven-star palace of King Lu’ | Nanpai Sanshu | Shanghai Culture Publishing House | 2011 | 2, 3, 8 | The Grave Robbers’ Chronicles: Cavern of the Blood Zombies | Kathy Mok | ThingsAsian Press | 2011 |
| 我欲封天 Wo yu feng tian ‘I shall seal the heavens’ | Er Gen | 21st Century Publishing Group | 2015 | 1, 5 | I Shall Seal the Heavens | Jeremy Bai | Wuxiaworld Publishing | 2021 |
| 三生三世十里桃花 Sansheng sanshi shili taohua ‘Three lifetimes, three worlds, ten miles of peach blossoms’ | Tang Qi | Changjiang Publishing House | 2016 | 2, 15, 16 | To the Sky Kingdom | Poppy Toland | AmazonCrossing | 2016 |
Appendix B.Scoring rubric for translation quality evaluation
| Level | Accuracy | Fluency | Score |
|---|---|---|---|
| Level 5 | Complete transfer of source text information. | Translation reads like a piece originally written in English. | 9–10 |
| Level 4 | Almost complete transfer; there may be one or two insignificant inaccuracies; some revision needed to reach professional standard. | Large sections read like a piece originally written in English, but minor lexical, grammatical, or spelling errors are present. | 7–8 |
| Level 3 | General ideas of the source text are conveyed, but with a number of lapses in accuracy; considerable revision required to reach professional standard. | Certain parts read like a piece originally written in English, but others clearly read like a translation. A considerable number of errors are present. | 5–6 |
| Level 2 | Transfer of content is undermined by serious inaccuracies; thorough revision required to reach professional standard. | Almost the entire text reads like a translation, with continual lexical, grammatical, or spelling errors. | 3–4 |
| Level 1 | Transfer of content is totally inadequate; the translation is not worth revising. | Text reveals a total lack of ability to express ideas adequately in English. | 1–2 |
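For readers who want to work with the rubric programmatically, the score bands can be encoded as a small lookup. This is a minimal sketch that assumes individual ratings are whole numbers from 1 to 10; the names `RUBRIC_BANDS` and `rubric_level` are illustrative and do not come from the study.

```python
# Score bands from the rubric: level -> (lowest score, highest score).
RUBRIC_BANDS = {5: (9, 10), 4: (7, 8), 3: (5, 6), 2: (3, 4), 1: (1, 2)}


def rubric_level(score):
    """Map an integer score (1-10) to its rubric level (1-5)."""
    for level, (low, high) in RUBRIC_BANDS.items():
        if low <= score <= high:
            return level
    raise ValueError(f"score {score!r} is outside the 1-10 rubric range")


print(rubric_level(8))  # 4
```

Under this mapping, scores of 7 and 8 both fall within Level 4.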