Aligning verb + noun collocations to improve a French-Romanian FSMT system
We present several methods for integrating Verb + Noun collocations, based on linguistic information, with the aim
of improving the results of a French-Romanian factored statistical machine translation (FSMT) system. The system uses
lemmatised, tagged and sentence-aligned legal parallel corpora. Verb + Noun collocations are frequent word
associations, sometimes discontinuous, related by syntactic links and with a non-compositional sense (Gledhill, 2007).
The first method extracts collocations from monolingual corpora, using a hybrid approach which combines
morphosyntactic properties and frequency criteria. The second method uses a bilingual collocation dictionary to
identify collocations. Both methods transform collocations into single tokens before alignment. The third method
applies a specific alignment algorithm for collocations. We evaluate the influence of these three collocation
integration methods on the results of the lexical alignment and of the FSMT system.
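
The step of transforming collocations into single tokens before alignment can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function name `merge_collocations`, the dictionary format (a set of verb and noun lemma pairs), the `max_gap` window for discontinuous collocations and the choice to fold intervening tokens into the merged token are all assumptions made for illustration.

```python
def merge_collocations(lemmas, dictionary, max_gap=2):
    """Join known verb + noun pairs into single tokens (illustrative only).

    lemmas: list of lemma strings for one sentence.
    dictionary: set of (verb_lemma, noun_lemma) pairs; hypothetical format.
    max_gap: assumed maximum number of tokens allowed between verb and noun.
    """
    out = []
    i = 0
    while i < len(lemmas):
        merged = False
        # Look ahead for a noun that forms a known pair with the current lemma.
        for j in range(i + 1, min(i + 2 + max_gap, len(lemmas))):
            if (lemmas[i], lemmas[j]) in dictionary:
                # Assumption: intervening tokens are kept inside the merged token.
                out.append("_".join(lemmas[i:j + 1]))
                i = j + 1
                merged = True
                break
        if not merged:
            out.append(lemmas[i])
            i += 1
    return out


# Lemmatised toy sentence ("le conseil prend une mesure") and a one-entry dictionary.
sentence = ["le", "conseil", "prendre", "un", "mesure"]
pairs = {("prendre", "mesure")}
merged_sentence = merge_collocations(sentence, pairs)
```

After this step, a word aligner would see the collocation as one unit rather than as separate verb, determiner and noun tokens.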
Article outline
- 1. Context and motivation
- 2. Handling MWEs for MT
- 3. Collocation definition
- 4. Translation problems
- 5. The architecture of the FSMT system and verb + noun collocation integration
- 6. Preprocessing Verb + Noun collocations
- 7. The MWE dictionary
- 8. The collocation alignment algorithm
- 9. Experiments
- 9.1 MWEs and the lexical alignment system
- 9.2 MWEs and FSMT system
- 9.3 MWE identification before aligning
- 10. Conclusions and future work
- Notes
- References