Automated Punctuation Tagging for Medieval Hebrew Manuscripts

Old Hebrew manuscripts, for historical and cultural reasons, often lacked punctuation. In the modern era, many of these texts have been made available online via scanning and OCR. Students clearly want to study them, as evidenced by the popularity of sites like Hebrewbooks.org and Sefaria.org, but the lack of punctuation can serve as a barrier.

While working with Dicta, an educational non-profit dedicated to making old Hebrew manuscripts more accessible, I developed a novel tagging algorithm to automate punctuation of these texts. The algorithm used LSTMs and also incorporated several clever tricks from the structure of Hidden Markov Models, ultimately allowing the final tagging sequence to be decoded via the Viterbi algorithm.
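To give a feel for the decoding step, here is a minimal sketch of Viterbi decoding over per-word tag scores. It is not the Dicta implementation: the tag set, the `viterbi` function, and all of the probabilities are made up for illustration. In a hybrid model of this kind, the emission scores would presumably come from the LSTM's per-word output, while the transition scores play the role of the HMM structure.

```python
from math import log

# Hypothetical tag set: the punctuation mark (if any) that follows each word.
TAGS = ["NONE", "COMMA", "PERIOD"]


def viterbi(emissions, transitions, initial):
    """Return the highest-scoring tag sequence, working in log space.

    emissions[t][j]   : log-score of tag j at word t (stand-in for LSTM output)
    transitions[i][j] : log-score of tag j following tag i
    initial[j]        : log-score of starting in tag j
    """
    if not emissions:
        return []
    n = len(initial)
    # best[j] = score of the best path so far that ends in tag j
    best = [initial[j] + emissions[0][j] for j in range(n)]
    back = []  # back[t][j] = best previous tag for tag j at word t + 1
    for scores in emissions[1:]:
        row, ptr = [], []
        for j in range(n):
            prev = max(range(n), key=lambda i: best[i] + transitions[i][j])
            row.append(best[prev] + transitions[prev][j] + scores[j])
            ptr.append(prev)
        best = row
        back.append(ptr)
    # Follow the back-pointers from the best final tag.
    path = [max(range(n), key=lambda j: best[j])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    path.reverse()
    return [TAGS[j] for j in path]


def to_log(rows):
    """Convert a table of probabilities to log-probabilities."""
    return [[log(p) for p in row] for row in rows]


# Invented numbers: punctuation rarely follows punctuation directly.
initial = [log(p) for p in (0.80, 0.15, 0.05)]
transitions = to_log([
    [0.60, 0.25, 0.15],  # after NONE
    [0.90, 0.05, 0.05],  # after COMMA
    [0.90, 0.05, 0.05],  # after PERIOD
])
# One row per word; each row is the per-tag score for that word.
emissions = to_log([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.4, 0.5, 0.1],
    [0.1, 0.1, 0.8],
])

tags = viterbi(emissions, transitions, initial)
print(tags)  # ['NONE', 'COMMA', 'NONE', 'PERIOD']
```

Picking the best tag for each word independently would put a comma after the third word as well. Viterbi scores the whole sequence, so the low probability of two commas in a row pushes the third word to `NONE`.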

The project began as a submission to a contest, and you can see my submission video below:

Since then, however, it has grown much more complex and, more importantly, more accurate. I am still working with the people at Dicta and Sefaria, and I hope it will launch to the public soon.

For more details, check out the GitHub repository.