

4. Quality Estimation

The term Quality Estimation (QE) refers to the task of estimating translation quality in the absence of human reference translations (Specia et al., 2010; Callison-Burch et al., 2012). That is, the only information available is that of the source and translated texts and, possibly, some information on the translation system itself. The problem was introduced ten years ago under the name Confidence Estimation (Blatz et al., 2003), but only more recently has it attracted broader attention from the community, with the creation of specific shared tasks for evaluating QE systems and approaches under the umbrella of the WMT workshops on Statistical Machine Translation (Callison-Burch et al., 2012).3
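To make the setting concrete, the following minimal sketch (hypothetical code, not part of the cited work) computes a few shallow "black-box" indicators from the source sentence and its automatic translation alone, with no reference translation and no access to the MT system; all feature names are illustrative.

```python
# Reference-free ("black-box") QE features: everything is derived from
# the source sentence and its machine translation only.

def qe_features(source: str, translation: str) -> dict:
    src_tokens = source.split()
    tgt_tokens = translation.split()
    return {
        # length-based indicators of over- or under-translation
        "src_len": len(src_tokens),
        "tgt_len": len(tgt_tokens),
        "len_ratio": len(tgt_tokens) / max(len(src_tokens), 1),
        # crude fluency proxies on the target side
        "tgt_avg_token_len": sum(map(len, tgt_tokens)) / max(len(tgt_tokens), 1),
        "tgt_type_token_ratio": len(set(tgt_tokens)) / max(len(tgt_tokens), 1),
        # punctuation mismatch between source and translation
        "punct_diff": abs(sum(c in ".,;:!?" for c in source)
                          - sum(c in ".,;:!?" for c in translation)),
    }

print(qe_features("The cat sat on the mat .", "El gato se sentó en la alfombra ."))
```

Real QE systems combine many such indicators with richer linguistic and language-model features, but the constraint is the same: nothing beyond the source, the translation and, optionally, the translating system is consulted.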

QE measures have a wide range of applications in practical MT system development, analysis and usage. For instance, they can be useful for: system parameter tuning, informing MT end-users about the estimated translation quality, quality-oriented filtering of translation cases (e.g., to identify translations requiring manual post-editing, or to identify casual users' post-editions that are useful for enriching the MT system), selecting the best translation among a set of alternatives (e.g., in a system combination scenario), etc.

3 The 2013 edition is also under way. More information at: http://www.statmt.org/wmt13/quality-estimation-task.html

Quality Estimation is usually addressed as a scoring task (Specia et al., 2009; Specia et al., 2010), in which a regression function predicts the absolute quality of the automatic translation of a source text. QE has recently evolved towards two separate subtasks, scoring proper and ranking, where different MT outputs for a given source sentence have to be ranked according to their comparative quality. Results obtained so far on QE have been more satisfactory for the ranking approach (Specia et al., 2010; Avramidis, 2012; Callison-Burch et al., 2012).
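As a rough illustration of the two formulations, the hypothetical sketch below (scikit-learn is used purely for convenience; none of the cited papers prescribes a particular toolkit) trains an absolute scorer on synthetic stand-in data and then derives a ranking of alternative outputs by sorting their predicted scores.

```python
# Sketch: absolute scoring (regression) and a score-induced ranking.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))              # stand-in for QE feature vectors
y = 2.0 * X[:, 0] + rng.normal(size=200)   # stand-in for human quality scores

# Scoring: a regressor predicts an absolute quality score per translation.
scorer = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
x_new = rng.normal(size=(1, 6))
print("predicted quality:", scorer.predict(x_new)[0])

# Ranking: order alternative translations of the same source by predicted score.
candidates = rng.normal(size=(5, 6))       # features of 5 alternative MT outputs
order = np.argsort(-scorer.predict(candidates))
print("ranking (best first):", order.tolist())
```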

System ranking based on human quality annotations has become common practice for MT evaluation in shared tasks (Callison-Burch et al., 2012). Training corpora are therefore available for researchers to train ranking functions with supervised machine learning methods that perform automatic ranking mimicking the human annotations. The learned models are reusable, provided they are system-independent and based on a generic analysis (i.e., no system-dependent features can be used for training), and they are then applicable to other datasets containing any input with multiple outputs. The applications of QE-for-ranking are diverse, ranging from hybrid MT system combination to the internal optimization and evaluation of MT systems. The most popular practical scenario for QE models (both rankers and regressors) consists of ranking alternative MT systems' outputs in order to predict the best translation at the segment level.
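One generic way to train such a ranking function from human ranking annotations, sketched below under the assumption of per-segment feature vectors and human ranks (this is a common recipe, not the specific method of any cited paper), is to reduce ranking to pairwise classification: each pair of competing outputs for the same source sentence yields one training example labelled with which output the annotators ranked higher.

```python
# Sketch: pairwise reduction for learning to rank MT outputs.
# System-dependent features are deliberately excluded so that the
# learned model stays system-independent and reusable.
import numpy as np
from itertools import combinations
from sklearn.linear_model import LogisticRegression

def pairwise_examples(segments):
    """segments: list of (features, ranks); features has shape (n_outputs, n_feats)."""
    X, y = [], []
    for feats, ranks in segments:
        for i, j in combinations(range(len(ranks)), 2):
            if ranks[i] == ranks[j]:
                continue                                 # skip ties
            X.append(feats[i] - feats[j])                # difference vector for the pair
            y.append(1 if ranks[i] < ranks[j] else 0)    # 1 = output i ranked better
    return np.array(X), np.array(y)

# Synthetic stand-in data: 100 segments, 5 outputs each, 6 generic features.
rng = np.random.default_rng(1)
segments = [(rng.normal(size=(5, 6)), rng.permutation(5) + 1) for _ in range(100)]
X, y = pairwise_examples(segments)

pair_clf = LogisticRegression(max_iter=1000).fit(X, y)
print("pairwise training accuracy:", round(pair_clf.score(X, y), 3))
```

The feature-difference trick keeps the classifier simple while allowing any number of competing outputs to be compared at test time.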

It is worth noting that the research conducted in QE for training ranking models from human annotations has always been done in controlled environments, consisting of well-formed text with little presence of noise (such as News or EU Parliament acts). However, MT in real life has to deal with a more complex scenario, including non-standard text (e.g., social media, blogs, reviews, etc.), which is totally open domain and prone to contain ungrammaticalities and errors (misspellings, slang, abbreviations, etc.). An example of such a noisy environment is found in the publicly available FAUST corpus4 (Pighin et al., 2012b), collected from the 24/7 Reverso.net MT service. This corpus is composed of 1,882 weblog source sentences translated with 5 independent MT systems. The systems were ranked according to human assessments of adequacy made by several users following a graph-based methodology, obtaining considerably high agreement and quality indicators (Pighin et al., 2012a).

At UPC we have studied the supervised training of QE prediction models on the aforementioned FAUST corpus to rank alternative system translations. Our study focused on different aspects, such as: i) the typology of the problem (regression vs. binary classification), ii) the suitability of the learning algorithm, and iii) the best combination of features to learn from. Results showed that it is possible to build reliable QE models from an annotated real-life MT corpus. Concretely, correlation results are comparable to those described in the literature for standard text. Furthermore, we also observed that comparative (ranking-based) QE models fit the system selection task (i.e., always predicting the best translation) better than absolute (regression-based) QE models.
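For the segment-level system selection scenario mentioned above, the two model families are applied differently; the hypothetical snippet below (continuing the earlier sketches, not the actual UPC implementation) picks the presumed best of several candidate translations either from absolute predicted scores or from pairwise comparisons.

```python
# Sketch: segment-level system selection with absolute vs. comparative QE models.
import numpy as np

def select_by_regression(scorer, candidate_feats):
    """Absolute QE: pick the output with the highest predicted quality score."""
    return int(np.argmax(scorer.predict(candidate_feats)))

def select_by_pairwise(pair_clf, candidate_feats):
    """Comparative QE: pick the output that wins the most pairwise comparisons."""
    n = len(candidate_feats)
    wins = np.zeros(n)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            diff = (candidate_feats[i] - candidate_feats[j]).reshape(1, -1)
            if pair_clf.predict_proba(diff)[0, 1] > 0.5:   # P(output i better than j)
                wins[i] += 1
    return int(np.argmax(wins))

# With the models from the earlier sketches and 'candidates' as a (5, 6) array:
# best_by_score = select_by_regression(scorer, candidates)
# best_by_duels = select_by_pairwise(pair_clf, candidates)
```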

4 http://www.faust-fp7.eu/faust/Main/DataReleases


References

1. Avramidis, Eleftherios. 2012. Comparative quality estimation: Automatic sentence-level ranking of multiple machine translation outputs. In Proceedings of 24th International Conference on Computational Linguistics (COLING), pages 115–132, Mumbai, India.

2. Berka, Jan, Ondřej Bojar, Mark Fishel, Maja Popović, and Daniel Zeman. 2012. Automatic MT error analysis: Hjerson helping Addicter. In Proceedings of the 8th International Conference on Language Resources and Evaluation, pages 2158–2163, Istanbul, Turkey.

3. Blatz, John, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2003. Confidence estimation for machine translation. Final Report of Johns Hopkins 2003 Summer Workshop on Speech and Language Engineering. Technical report, Johns Hopkins University.

4. Callison-Burch, Chris, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the Role of BLEU in Machine Translation Research. In Proceedings of 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

5. Callison-Burch, Chris, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2012. Findings of the 2012 workshop on statistical machine translation. In Proceedings of the Seventh Workshop on Statistical Machine Translation, pages 10–51, Montréal, Canada, June.

6. Chan, Yee Seng and Hwee Tou Ng. 2008. MAXSIM: A maximum similarity metric for machine translation evaluation. In Proceedings of ACL-08: HLT, pages 55–62.

7. Coughlin, Deborah. 2003. Correlating Automated and Human Assessments of Machine Translation Quality. In Proceedings of Machine Translation Summit IX, pages 23–27.

8. Culy, Christopher and Susanne Z. Riehemann. 2003. The Limits of N-gram Translation Evaluation Metrics. In Proceedings of MT-SUMMIT IX, pages 1–8.

9. Denkowski, Michael and Alon Lavie. 2010. METEOR-NEXT and the METEOR paraphrase tables: Improved evaluation support for five target languages. In Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR, pages 339–342, Uppsala, Sweden, July. Association for Computational Linguistics.

10. Denkowski, Michael and Alon Lavie. 2012. Challenges in predicting machine translation utility for human post-editors. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA’2012), San Diego, CA, USA, October. AMTA.

11. Dreyer, Markus and Daniel Marcu. 2012. HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 162–171, Montréal, Canada, June. Association for Computational Linguistics.

12. Fishel, Mark, Ondřej Bojar, Daniel Zeman, and Jan Berka. 2011. Automatic Translation Error Analysis. In Proceedings of the 14th International Conference on Text, Speech and Dialogue (TSD).


13. Giménez, Jesús and Lluís Màrquez. 2010a. Asiya: An Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, 94: 77–86.

14. Giménez, Jesús and Lluís Màrquez. 2010b. Linguistic features for automatic MT evaluation. Machine Translation, 24(3–4): 209–240.

15. Gonzàlez, Meritxell, Jesús Giménez, and Lluís Màrquez. 2012. A graphical interface for MT evaluation and error analysis. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). System Demonstrations, pages 139–144, Jeju, South Korea, July.

16. Kahn, Jeremy G., Matthew Snover, and Mari Ostendorf. 2009. Expected Dependency Pair Match: Predicting translation quality with expected syntactic structure. Machine Translation.

17. Kauchak, David and Regina Barzilay. 2006. Paraphrasing for Automatic Evaluation. In Proceedings of the Joint Conference on Human Language Technology and the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), pages 455–462.

18. Liu, Ding and Daniel Gildea. 2005. Syntactic Features for Evaluation of Machine Translation. In Proceedings of ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, pages 25–32.

19. Lo, Chi-kiu and Dekai Wu. 2011. MEANT: An inexpensive, high-accuracy, semi-automatic metric for evaluating translation utility based on semantic roles. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 220–229, Portland, Oregon, USA, June. Association for Computational Linguistics.

20. Mehay, Dennis and Chris Brew. 2007. BLEUATRE: Flattening Syntactic Dependencies for MT Evaluation. In Proceedings of the 11th Conference on Theoretical and Methodological Issues in Machine Translation (TMI).

21. Owczarzak, Karolina, Declan Groves, Josef van Genabith, and Andy Way. 2006. Contextual Bitext-Derived Paraphrases in Automatic MT Evaluation. In Proceedings of the 7th Conference of the Association for Machine Translation in the Americas (AMTA), pages 148–155.

22. Owczarzak, Karolina, Josef van Genabith, and Andy Way. 2007. Dependency-Based Automatic Evaluation for Machine Translation. In Proceedings of SSST, NAACL-HLT/AMTA Workshop on Syntax and Structure in Statistical Translation, pages 80–87.

23. Padó, Sebastian, Michel Galley, Daniel Jurafsky, and Christopher D. Manning. 2009. Machine translation evaluation with textual entailment features. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 37–41, Athens, Greece, March. Association for Computational Linguistics.

24. Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

25. Pighin, Daniele, Lluís Formiga, and Lluís Màrquez. 2012a. A graph-based strategy to streamline translation quality assessments. In Proceedings of the Tenth Conference of the Association for Machine Translation in the Americas (AMTA’2012), San Diego, USA, October.


26. Pighin, Daniele, Lluís Màrquez, and Jonathan May. 2012b. An Analysis (and an Annotated Corpus) of User Responses to Machine Translation Output. In Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC’12), Istanbul, Turkey.

27. Popović, Maja and Hermann Ney. 2007. Word Error Rates: Decomposition over POS classes and Applications for Error Analysis. In Proceedings of the Second Workshop on Statistical Machine Translation, pages 48–55, Prague, Czech Republic, June. Association for Computational Linguistics.

28. Popović, Maja. 2011. Hjerson: An Open Source Tool for Automatic Error Classification of Machine Translation Output. The Prague Bulletin of Mathematical Linguistics, 96: 59–68.

29. Reeder, Florence, Keith Miller, Jennifer Doyon, and John White. 2001. The Naming of Things and the Confusion of Tongues: an MT Metric. In Proceedings of the Workshop on MT Evaluation “Who did what to whom?” at Machine Translation Summit VIII, pages 55–59.

30. Russo-Lassner, Grazia, Jimmy Lin, and Philip Resnik. 2005. A Paraphrase-Based Approach to Machine Translation Evaluation (LAMP-TR-125/CSTR-4754/UMIACS-TR-2005-57). Technical report, University of Maryland, College Park.

31. Snover, Matthew G., Nitin Madnani, Bonnie Dorr, and Richard Schwartz. 2010. TER-Plus: paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2–3): 209–240.

32. Specia, Lucia, Marco Turchi, Nicola Cancedda, Mark Dymetman, and Nello Cristianini. 2009. Estimating the Sentence-Level Quality of Machine Translation Systems. In Proceedings of the 13th Annual Meeting of the European Association for Machine Translation (EAMT-2009), pages 28–35, Barcelona, Spain.

33. Specia, Lucia, Dhwaj Raj, and Marco Turchi. 2010. Machine Translation Evaluation Versus Quality Estimation. Machine Translation, 24: 39–50, March.

34. Vilar, David, Jia Xu, Luis Fernando D’Haro, and Hermann Ney. 2006. Error Analysis of Machine Translation Output. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), pages 697–702, Genoa, Italy.

35. Zeman, Daniel, Mark Fishel, Jan Berka, and Ondřej Bojar. 2011. Addicter: What Is Wrong with My Translations? The Prague Bulletin of Mathematical Linguistics, 96: 79–88.

36. Zhou, Liang, Chin-Yew Lin, and Eduard Hovy. 2006. Re-evaluating Machine Translation Results with Paraphrase Support. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 77–84.


ROMIP MT Evaluation Track 2013: Organizers’ Report

Braslavski P. (pbras@yandex.ru)
Kontur Labs; Ural Federal University, Yekaterinburg, Russia

Beloborodov A. (xander-beloborodov@yandex.ru)
Ural Federal University, Yekaterinburg, Russia

Sharoff S. (s.sharoff@leeds.ac.uk)
University of Leeds, Leeds, UK

Khalilov M. (maxim@tauslabs.com)
TAUS Labs, Amsterdam, Netherlands

The paper presents the settings and the results of the ROMIP 2013 machine translation evaluation campaign for the English-to-Russian language pair. The quality of generated translations was assessed using automatic metrics and human evaluation. We also demonstrate the usefulness of a dynamic mechanism for human evaluation based on pairwise segment comparison.

Keywords: machine translation, evaluation, English-to-Russian translation


1. Introduction

Russian and English were among the first language pairs at the dawn of machine translation (MT) research in the 1950s [Hutchins, 2000]. Since then, MT paradigms have changed many times, and many systems for this language pair have appeared and disappeared, but, to the best of our knowledge, no systematic comparative evaluation of MT systems analogous to DARPA’94 [White et al., 1994] and later campaigns has been carried out so far. In 2013, the Workshop on Statistical Machine Translation (WMT) included the Russian–English pair in its program for the first time.1 At the time of writing, that evaluation has not yet been completed; moreover, the workshop will involve systems trained on data provided by the organizers. Existing systems, in particular rule-based and hybrid ones, remain outside the scope of that evaluation.

Evaluation campaigns play an important role in the development of MT technologies. In recent years a number of open campaigns have been held for various combinations of European, Asian, and Semitic languages; see [Callison-Burch et al., 2011; Callison-Burch et al., 2012; Federico et al., 2012]. In this paper we describe a campaign for evaluating English-to-Russian machine translation conducted within the ROMIP framework.

ROMIP (the Russian Information Retrieval Evaluation Seminar)2 is the Russian counterpart of TREC and other initiatives for evaluating information retrieval tasks. The first evaluation cycle was organized in 2002. Over these ten years, ROMIP has run a series of evaluation tracks, including the classical ad hoc retrieval task as well as tasks on topical document classification, question answering, snippet generation, sentiment analysis, image retrieval, and others. Within this activity, several freely distributable datasets have been prepared, containing documents and relevance judgments produced by assessors.

The Russian information retrieval and machine translation communities have long-standing ties, and their members communicate closely. It was therefore natural to organize an MT evaluation campaign within ROMIP, drawing on the seminar’s accumulated experience. Another important goal of the event was to bring together the groups developing statistical MT (SMT) systems and those developing rule-based (RBMT) systems.

One of the challenges both for MT systems working with Russian and for their evaluation is the need to deal with relatively free word order and rich morphology. Owing to the rich morphology, Russian lemmas have many word forms (on average 8.2 forms for nouns and 34.6 for verbs [Sharoff et al., 2013]), which complicates word-level alignment in the statistical approach. Long-distance dependencies create additional problems, especially for SMT systems.

1 http://www.statmt.org/wmt13/

2 http://romip.ru


A single translation direction (English → Russian) was chosen for the evaluation. First, for this direction it was much easier for us to find assessors who are native speakers of the target language. Second, the participating systems are mostly used in precisely this direction (translating English texts for Russian-speaking users).