The Correlation of Machine Translation Evaluation Metrics with Human Judgement on Persian Language

Document Type: Original Article


1 MA in Translation Studies, Faculty of Persian Literature and Foreign Languages, Islamic Azad University, South Tehran Branch, Tehran, Iran

2 Assistant Professor of Artificial Intelligence, Faculty of Computer Engineering, Shahid Rajaei Teacher Training University, Tehran, Iran

3 Assistant Professor of TESOL, Iran Encyclopedia Compiling Foundation, Tehran, Iran


Machine Translation Evaluation Metrics (MTEMs) are central to the development of Machine Translation (MT) engines, since engines are refined through frequent evaluation. Although MTEMs are now widespread, their validity and quality for many languages remain in question. The aim of this study was to examine the validity and assess the quality of MTEMs from the lexical-similarity family on machine-translated Persian texts. The study addressed three main questions: to what extent automatic MTEMs are valid for evaluating translated Persian texts; whether there is a significant correlation between human evaluation and automatic evaluation metrics for English-to-Persian translation; and which metric is the best predictor of human judgment. To this end, a dataset of 200 English sentences, each with four human reference translations, was translated by four different statistical machine translation systems. The outputs of these systems were scored by seven automatic MTEMs and three human evaluators, and the correlations between the metrics and the human evaluators were computed using both Spearman and Kendall correlation coefficients. The results confirmed a relatively high correlation between MTEMs and human evaluation on the Persian language, with GTM proving the most efficient of the metrics compared.
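The correlation analysis described above can be sketched in a few lines of code. This is an illustrative example, not the authors' implementation: the score values are hypothetical, and the functions implement the textbook no-ties Spearman formula and Kendall's tau-a over per-system scores (e.g. a metric such as GTM against mean human ratings).

```python
def spearman_rho(x, y):
    """Spearman rank correlation, no-ties formula: 1 - 6*sum(d^2)/(n(n^2-1))."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[i] - x[j]) * (y[i] - y[j])
            s += 1 if prod > 0 else (-1 if prod < 0 else 0)
    return s / (n * (n - 1) / 2)

# Hypothetical scores for four MT systems: an automatic metric vs. humans.
metric_scores = [0.42, 0.38, 0.51, 0.45]
human_scores = [3.1, 2.8, 3.9, 3.4]
print(spearman_rho(metric_scores, human_scores))  # 1.0 (identical rankings)
print(kendall_tau(metric_scores, human_scores))   # 1.0
```

A coefficient near 1 means the metric ranks systems the same way the human judges do, which is the sense in which the study compares the seven MTEMs.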


Agarwal, A., & Lavie, A. (2008). METEOR, M-BLEU and M-TER: Evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation (pp. 115–118). Columbus, OH.

Ansari, E., Sadreddini, M. H., Tabebordbar, A., & Wallace, R. (2014). Extracting Persian-English parallel sentences from document level aligned comparable corpus using bi-directional translation. Advances in Computer Science: An International Journal, 3(5), 59–65.

Banerjee, S., & Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization (pp. 65–72). Michigan.

Bouamor, H., Alshikhabobakr, H., Mohit, B., & Oflazer, K. (2014). A human judgment corpus and a metric for Arabic MT evaluation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 207–213). Doha, Qatar.

Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., & Schroeder, J. (2007). (Meta-) Evaluation of Machine Translation. In Proceedings of the Second Workshop on Statistical Machine Translation (pp. 136–158). Stroudsburg, PA, USA: Association for Computational Linguistics.

Callison-Burch, C., Osborne, M., & Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In Proceedings of EACL 2006.

Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence Statistics. In HLT ’02 Proceedings of the second international conference on Human Language Technology Research (pp. 138–145). Morgan Kaufmann Publishers Inc.

Dreyer, M., & Marcu, D. (2012). HyTER: Meaning-equivalent semantics for translation evaluation. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (pp. 162–171).

Farzi, S., & Faili, H. (2015). A swarm-inspired re-ranker system for statistical machine translation. Computer Speech & Language, 29(1), 45–62.

Giménez, J., & Màrquez, L. (2010). Asiya: An Open Toolkit for Automatic Machine Translation (Meta-)Evaluation. The Prague Bulletin of Mathematical Linguistics, 94, 77–86.

Kalyani, A., Kumud, H., Singh, S. P., & Kumar, A. (2014). Assessing the Quality of MT Systems for Hindi to English Translation. International Journal of Computer Applications, 89(15), 41–45.

Machine translation. (2015, September 6). In Wikipedia. Retrieved from

MATLAB. (2017, January 17). In Wikipedia. Retrieved from

Nießen, S., Och, F. J., Leusch, G., & Ney, H. (2000). An evaluation tool for machine translation: Fast evaluation for MT research. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC-2000).

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL) (pp. 311–318). Philadelphia.

Pilevar, M. T., & Faili, H. (2010). PersianSMT: A first attempt to English-Persian statistical machine translation. In JADT 2010: 10th International Conference on Statistical Analysis of Textual Data.

Snover, M., Dorr, B., Schwartz, R., Micciulla, L., & Makhoul, J. (2006). A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas (pp. 223–231).

Sun, Y. (2010). Mining the Correlation between Human and Automatic Evaluation at Sentence Level. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, … D. Tapias (Eds.), Proceedings of the International Conference on Language Resources and Evaluation, LREC 2010, 17-23 May 2010, Valletta, Malta. European Language Resources Association.

Tillmann, C., Vogel, S., Ney, H., Zubiaga, A., & Sawaf, H. (1997). Accelerated DP based search for statistical translation. In Proceedings of European Conference on Speech Communication and Technology. Rhodes, Greece.

Turian, J., Shen, L., & Melamed, I. D. (2003). Evaluation of machine translation and its evaluation. In Proceedings of MT Summit IX (pp. 386–393).