Detecting Erroneous Handwritten Byzantine Text Recognition

Handwritten text recognition (HTR) yields output that contains errors, considerably more than optical character recognition (OCR) of printed text. Post-correction methods can eliminate such errors but may also introduce new ones. In this study, we examine the issues this raises for Byzantine Greek. We investigate the properties of the texts that lead post-correction systems to this adversarial behaviour, and we experiment with text classification systems that learn to detect incorrect recognition output. A large masked language model, pre-trained on modern Greek and fine-tuned on Byzantine Greek, achieves an Average Precision score of 95%. The score improves to 97% with a model that is pre-trained on modern and then on ancient Greek, the two language forms from which Byzantine Greek combines elements. A century-based analysis shows that the advantage of the classifier further pre-trained on ancient Greek concerns texts from earlier centuries. Applying this classifier before a neural post-corrector on HTRed text significantly reduced post-correction mistakes.
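As a rough illustration of the two ideas in the abstract, the sketch below computes a non-interpolated Average Precision for an error detector and gates a post-corrector so that only lines flagged as erroneous are rewritten. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the `threshold` parameter, and the choice of 1 as the label for an erroneous line are all hypothetical, and the paper's exact evaluation setup is not given in this record.

```python
from typing import Callable, List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Non-interpolated Average Precision.

    Assumption: label 1 marks an erroneous recognised line, label 0 a correct one.
    Ties in score keep their input order (Python's sort is stable).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("at least one positive label is required")
    # Rank items from the highest to the lowest detector score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at this rank, accumulated only at positive items.
            precision_sum += hits / rank
    return precision_sum / total_positives


def gated_post_correction(
    lines: Sequence[str],
    error_score: Callable[[str], float],
    post_correct: Callable[[str], str],
    threshold: float = 0.5,  # hypothetical cut-off, not taken from the paper
) -> List[str]:
    """Post-correct only lines the detector flags; pass the rest through unchanged."""
    return [
        post_correct(line) if error_score(line) >= threshold else line
        for line in lines
    ]
```

Here `error_score` stands in for a trained classifier and `post_correct` for a neural post-corrector; any callables with these signatures would work in their place. Leaving unflagged lines untouched is what keeps the post-corrector from introducing mistakes into output that was already correct.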

John Pavlopoulos;Vasiliki Kougia;Paraskevi Platanou;Holger Essler

2023-01-01

Publication year: 2023
Volume title: Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
Type: 4.1 Article in conference proceedings

Files in this record:

File	Size	Format
Pavlopoulos_Detecting Erroneous Handwritten Byzantine Text Recognition_2023_htrec.pdf (pre-print; closed access, personal; not available)	374.76 kB	Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/10278/5044630
