'Extract Text' module, but PDFs already have OCR

WillGerrand · August 11, 2020, 2:35am

I’m planning to upload hundreds of scanned PDF documents to an Omeka S installation. A great thing about Omeka is the potential to get search results from the ‘full text’ of an Item as well as its metadata. It seems the most immediate way of doing this is to use the ‘Extract Text’ module. But there are other approaches… I’m paraphrasing user https://forum.omeka.org/u/dfox in this thread … What are the best practices for full text search with Omeka S? … who ‘Extracted the text from the PDF then added the text to the item metadata, which is searchable.’ The PDF docs I’m uploading have OCR already. Is there a way to use this OCR without making Extract Text do the work?

jflatnes · August 11, 2020, 5:16pm

All Extract Text does is pull out the existing text content of a PDF, if there is any. It doesn’t do any OCR or anything like that. If you have OCRed PDFs then Extract Text will read that out of the PDF and save it in the Omeka S database so it’s searchable.
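
For anyone who wants to confirm before uploading that their PDFs really do carry an embedded text layer, here is a minimal sketch. It assumes the Poppler `pdftotext` command-line tool is installed and on the PATH; the function names and the 20-character threshold are illustrative choices, not part of Omeka S or the Extract Text module.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the text has at least min_chars non-whitespace characters.

    The threshold is an arbitrary assumption; it only filters out PDFs
    whose "text" is nothing but page breaks or stray whitespace.
    """
    return sum(1 for ch in extracted if not ch.isspace()) >= min_chars


def extract_pdf_text(pdf_path: Path) -> str:
    """Run pdftotext and return whatever embedded text it finds (no OCR)."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],  # "-" sends the text to stdout
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Usage: python check_text_layer.py scan1.pdf scan2.pdf ...
    for name in sys.argv[1:]:
        text = extract_pdf_text(Path(name))
        status = "has text" if has_text_layer(text) else "no text layer found"
        print(f"{name}: {status}")
```

A file reported as "no text layer found" would have nothing to pull out, so it would need OCR before its content could become searchable.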

WillGerrand · August 11, 2020, 11:14pm

Ok thanks. That is good to know.
