Oxford University-Led Research Achieves AI Breakthrough in Transcribing Sanskrit Manuscripts

Major progress has been reported in 'Digital Humanities & Hindu Studies: Creating AI Models for Handwriting and Text Recognition in South Asian Manuscripts', a University of Oxford and Oxford Centre for Hindu Studies project using artificial intelligence to transcribe handwritten Sanskrit texts. Based in the Faculty of Theology and Religion and led by Dr Bjarne Wernicke-Olesen and Dr Lucian Wong (Oxford Centre for Hindu Studies), the project combines the study of South Asian source texts with AI and data science to develop Devanāgarī Optical Character Recognition (OCR) using Transkribus, converting ancient handwritten manuscripts into machine-readable, searchable e-texts.

Working with a team of Nepalese experts at the OCHS Kathmandu Office and under the digital curatorship of Tom Derrick in Oxford, the project has reached a significant milestone. Through successive rounds of training Transkribus’ leading text recognition AI tool, the team has fine-tuned a model using more than 500 pages of handwritten, digitised manuscripts from the OCHS Indic Manuscript Database.

The improvement in the model's performance became increasingly marginal with each round of training, eventually reaching a plateau after 500 pages. This suggests that the latest model is at or near the upper limit of what is currently possible for auto-transcription of handwritten Sanskrit manuscripts.

The latest model is expected to automatically transcribe the majority of the Sanskrit manuscripts in the collection with 97% accuracy or above, with a proportion approaching 99% accuracy – representing the highest standard of what is currently possible for digitised source material for Digital Humanities research.

This expected accuracy has been confirmed through an extensive analysis of AI-generated transcriptions from a cross-section of handwritten manuscripts in the database. In addition to the accuracy reported by Transkribus’ in-built evaluation tool, the project team manually checked transcriptions produced by the model, comparing them with the correct ‘ground truth’ versions of the text to verify the tool's accuracy.
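
For readers who want to see what such a comparison can look like in practice, the following is a minimal sketch, not the project's actual procedure. It assumes that "accuracy" means one minus the character error rate (the edit distance between the automatic transcription and the ground truth, divided by the length of the ground truth), which the announcement does not specify. The function names are hypothetical, and the sketch counts Unicode code points rather than full Devanāgarī syllables.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy(ground_truth: str, transcription: str) -> float:
    """Return 1 - CER, where CER = edit distance / ground-truth length.

    Assumption: both texts are NFC-normalised first so that equivalent
    Unicode encodings of the same character are not counted as errors.
    """
    gt = unicodedata.normalize("NFC", ground_truth)
    hyp = unicodedata.normalize("NFC", transcription)
    if not gt:
        raise ValueError("ground truth must not be empty")
    return 1.0 - levenshtein(gt, hyp) / len(gt)
```

For example, `character_accuracy("abcd", "abxd")` gives 0.75, since one of four characters differs. Note that a very poor transcription, much longer than the ground truth, can yield a negative value under this definition.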

The Nepalese team consists of the following experts in paleography and manuscriptology: Professor Bhim Kandel, Mr Bharat Maharjan, and Dr Kedar Ghimire. They have expressed their excitement about this new tool, its usefulness for their work and teaching, and its wider implications for future research. As Professor Kandel said: “The current results of the project have far exceeded our expectations of what would be possible.”

Dr Bjarne Wernicke-Olesen shares, “This is an important step towards making large collections of unknown Sanskrit manuscripts widely accessible. As the database work progresses, it has the potential to significantly expand how these texts can be studied, searched, translated, and analysed in ways that would previously have been impossible. In other words, AI tools like this can revolutionise our access to primary sources in South Asian studies.”

The team have undertaken a thorough review of the digitised manuscripts within the current sample collection, assessing their suitability for automated transcription. Each manuscript was assigned a category A, B or C, with those categorised as A representing the best candidates for automated text recognition. These are legible manuscripts that Transkribus will be able to transcribe accurately using the model trained by the team. At the opposite end of the scale, category C manuscripts are those extreme cases Transkribus would likely struggle with, often due to material deterioration of the manuscript or blotchy ink, both of which hamper the AI's attempts to read and transcribe the text. This exercise found that more than two thirds of the digitised manuscripts within the database are prime candidates for automated recognition.

Dr Lucian Wong has recently secured a grant from the OCHS Digital Humanities Fund, which will facilitate an extension of the project until June 2027.

The next phase of the project will involve applying the model to complete manuscripts and assessing performance across wider manuscript collections and Devanāgarī-related scripts, including new collections from Nepal and from Professor Alexis Sanderson's private collection of tantric manuscript material.

The team will also explore the potential for generating transliterations directly from the manuscripts, with the aim of further enhancing the accessibility and usability of these texts for research, study and translation.
