BLEURT: a Transfer Learning-Based Metric for Natural Language Generation

BLEURT is an evaluation metric for Natural Language Generation. It takes a pair of sentences as input, a reference and a candidate, and returns a score that indicates to what extent the candidate is grammatical and conveys the meaning of the reference. It serves the same purpose as sentence-BLEU and BERTScore. BLEURT is a trained metric: it is a regression model trained on ratings data. The model is based on BERT. This repository contains all the code necessary to use BLEURT and/or fine-tune it for your own applications. BLEURT uses TensorFlow and benefits greatly from modern GPUs, though it also runs on CPU. A comprehensive overview of BLEURT can be found in our ACL paper, "BLEURT: Learning Robust Metrics for Text Generation", and in our blog post.
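To make the idea of a "trained metric" concrete, here is a toy sketch. It is not BLEURT and does not use its API. Instead of BERT, it assumes a single hand-written feature (unigram F1 between reference and candidate). It fits a one-variable least-squares line to a handful of made-up ratings, then scores new (reference, candidate) pairs. `unigram_f1`, `fit_linear`, `ToyTrainedMetric` and the ratings are all hypothetical, chosen only for illustration.

```python
from collections import Counter
from typing import List, Tuple


def unigram_f1(reference: str, candidate: str) -> float:
    """Toy feature (stand-in for BERT): F1 of clipped unigram overlap."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    # Overlap counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = w * x + b."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    w = cov_xy / var_x if var_x else 0.0
    return w, mean_y - w * mean_x


class ToyTrainedMetric:
    """Hypothetical regression-based metric: feature -> predicted rating."""

    def __init__(self) -> None:
        self.w, self.b = 0.0, 0.0

    def fit(self, examples: List[Tuple[str, str, float]]) -> "ToyTrainedMetric":
        # Each example is (reference, candidate, human rating).
        xs = [unigram_f1(r, c) for r, c, _ in examples]
        ys = [y for _, _, y in examples]
        self.w, self.b = fit_linear(xs, ys)
        return self

    def score(self, references: List[str], candidates: List[str]) -> List[float]:
        # One predicted rating per (reference, candidate) pair.
        return [self.w * unigram_f1(r, c) + self.b
                for r, c in zip(references, candidates)]


# Made-up ratings, for illustration only.
ratings = [
    ("the cat sat on the mat", "the cat sat on the mat", 1.0),
    ("the cat sat on the mat", "a cat was sitting on the mat", 0.7),
    ("the cat sat on the mat", "dogs bark loudly", 0.0),
]
metric = ToyTrainedMetric().fit(ratings)
ref = "the cat sat on the mat"
scores = metric.score([ref, ref], ["the cat is on the mat", "dogs bark loudly"])
print(scores)
```

BLEURT replaces the single hand-written feature with a fine-tuned BERT model and trains on much larger ratings datasets. The shape of the computation is the same: a learned function maps a (reference, candidate) pair to a quality score.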

