OCR planning notes

Note: It turns out there isn’t anything built into Hyrax for storing/using OCR/hOCR, so even if we were using Hyrax we’d have to implement it ourselves (although we could potentially copy other Hyrax users’ implementations more wholesale). Even Hyrax implementers have to build this stuff as add-ons to Hyrax.

Actual OCR process

  • We definitely want to capture “position” information necessary to do things like create a text layer in a PDF, not just plain text extraction.

  • As far as I can tell, the most common standard format used for this is hOCR.

  • What software should we use to do the OCR and create hOCR? I’ve identified two major possibilities:

    • tesseract is open source, very popular, and used by some of our peers.

      • Considerations:

        • Many of our peers use it

        • Lots of control: LOTS of language packs for different languages; plus third-party ones; plus you can train your own. (Includes Latin and Early Modern English). Choice of “FAST” or “BEST” models.

        • Not necessarily very good at handwriting

        • We can install version 4.x on Heroku; it’s harder to install the newer 5.x (not in the Ubuntu 22 apt repo). But the peers I reached out to are all using 4.x too

        • We’d run it ourselves on Heroku worker dynos, which means we need to manage and pay for dynos; but we have lots of architecture for that, and are comfortable with our background-job dyno autoscaling

      • Cost estimate: To estimate cost, we have to run it on Heroku dynos against some of our images and see how many seconds a page takes.

        • We roughly estimate 10 seconds a page. So 10,000 seconds = 2.8 hours for 1,000 pages.

        • A Heroku standard-2X dyno is approximately $0.07/hour, so that’s about $0.19 per 1,000 pages – nearly two orders of magnitude cheaper than AWS Textract. So for the estimated 40K-page back-fill, that’s about $7.80?!?!
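        • Spelling out that arithmetic (a quick sketch in ruby, using the rough estimates above):

            seconds_per_page   = 10
            dyno_cost_per_hour = 0.07  # Heroku standard-2X, approx

            # ~2.8 dyno-hours per 1,000 pages => ~$0.19
            cost_per_1000 = (1_000 * seconds_per_page / 3600.0) * dyno_cost_per_hour

            # 40K-page back-fill: ~111 dyno-hours => ~$7.80
            backfill_cost = (40_000 * seconds_per_page / 3600.0) * dyno_cost_per_hour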

      • Here’s a Princeton code example (via escowles) of running tesseract in a bg job: https://github.com/pulibrary/figgy/blob/main/app/derivative_services/hocr_derivative_service.rb
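      • A minimal sketch of what our own version might look like (paths and method name are hypothetical; the tesseract CLI usage with the “hocr” config file is standard):

          require "tmpdir"
          require "open3"

          # Run tesseract on a TIFF and return the hOCR as a string
          def create_hocr(tiff_path, lang: "eng")
            Dir.mktmpdir do |dir|
              out_base = File.join(dir, "ocr_output")
              # the "hocr" config makes tesseract write "#{out_base}.hocr"
              _out, err, status = Open3.capture3(
                "tesseract", tiff_path, out_base, "-l", lang, "hocr"
              )
              raise "tesseract failed: #{err}" unless status.success?
              File.read("#{out_base}.hocr")
            end
          end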

    • AWS has a service, Textract

      • Considerations

        • Using a cloud service without having to run our own worker dynos can be an advantage, but it can actually be more complicated for us to wire up (like how we need to poll for async HLS creation to be done with video! – see the polling sketch at the end of this list). It also ties us to AWS a bit – source material maybe needs to be on S3, etc.

        • Unclear if it would work better or worse than tesseract – for all we know, it’s tesseract under the hood!

        • Only does six languages: English, Spanish, German, Italian, French, and Portuguese. But those include the languages we have the most need/use of

        • Claims to be able to do handwriting? In all those languages?

        • We don’t know any peers that use it

      • Cost estimate: Actually kind of expensive! $15.00 per 1000 pages.

        • Need better estimates (how many Assets are actually attached to Works with “text” format? How many are created per month on average?)

        • But if we just OCR all existing Assets to back-fill (this is an overshoot – some are audio/video, some we may know are images without text, etc.), we estimate maybe 40K pages, so ~$600
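      • For reference, a sketch of the async wiring we’d need (bucket/key names hypothetical; the calls are from the aws-sdk-textract gem). Note Textract returns its own JSON block structure, not hOCR, so we’d also have to convert if we standardize on hOCR:

          require "aws-sdk-textract"

          client = Aws::Textract::Client.new(region: "us-east-1")

          # kick off the async job against source material on S3
          resp = client.start_document_text_detection(
            document_location: {
              s3_object: { bucket: "our-bucket", name: "path/to/page.tiff" }
            }
          )

          # poll for completion, much like our HLS video creation
          # (ignoring result pagination for this sketch)
          result = loop do
            r = client.get_document_text_detection(job_id: resp.job_id)
            break r unless r.job_status == "IN_PROGRESS"
            sleep 5
          end

          # position info lives in each block's geometry.bounding_box
          words = result.blocks.select { |b| b.block_type == "WORD" }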

  • Do we want to do manual QA/correction of OCR?

    • I’m not aware of any peers that do this, but haven’t totally investigated

    • I actually can’t find any good tools with UIs for doing manual corrections to hOCR!

  • Specs for what to run OCR on

    • We don’t want to run it on photographs of museum items for instance.

    • We don’t want to run it on hand-written manuscripts

    • We prob do want to run it on graphical advertisements that have text in them

    • We might want to START with English, and then consider other languages

    • Do we have a flag that Annabel et al have to set that’s just “do OCR”, or do we try to automatically identify from existing metadata, like “Format: text”, etc.? (A hypothetical sketch of the metadata approach is at the end of this list.)

    • At what point in the workflow does the metadata tell us to do it?

    • Let’s see what other software (like Islandora?) does – does it just OCR everything?
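    • A hypothetical sketch of the “identify from existing metadata” option – all field names and values here are guesses at our schema, nothing is decided:

        # Would we OCR this work? (work.format / work.genre / work.language
        # are hypothetical stand-ins for whatever our metadata exposes)
        def should_ocr?(work)
          work.format.include?("text") &&          # eg "Format: text"
            !work.genre.include?("manuscript") &&  # skip hand-written material
            work.language.include?("English")      # maybe START with English
        end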

 

Storage of OCR data

The simplest thing to do would be just to make another attr_json text attribute on the Asset for hocr or what have you – similar to the ones we have now for (manual) transcription and translation.

I was worried that putting this much data directly on the record in postgres would cause performance issues – it’s going to be fetched, sent across the wire, and instantiated in ruby objects every time you fetch these Assets. But tpendragon at Princeton says this is what he does; he was initially worried too, but it works out fine for him (by way of valkyrie – granted, he may not be using ActiveRecord, so it could be worse for us).

So we’ll probably at least start like that.
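A minimal sketch of that simplest thing (the attribute name hocr is just a placeholder, and the superclass is as in our app):

    class Asset < Kithe::Asset
      # hOCR is just an (X)HTML document; store it as one big string,
      # like our existing manual transcription/translation attributes
      attr_json :hocr, :string
    end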

There will be some uses where we need the plain text extracted from the hOCR markup (eg for indexing). Shall we extract it at ingest time and store the extracted text separately (possibly making the storage/performance issues doubly worse)? Or extract it each time we need it (causing a different performance problem)? Probably initially do the simplest thing and extract it when we need it – but let’s be aware of indexing performance degradation, which we might not be able to tell for sure until we have hOCR for a large part of the corpus and can run a bulk index; the numbers are so small per-doc either way.
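If/when we do extract plain text on demand: hOCR is just HTML with standardized class names, so it’s a few lines of Nokogiri (a sketch; the ocr_line class name is standard hOCR):

    require "nokogiri"

    # Strip hOCR markup down to plain text, one line per ocr_line element
    def hocr_to_text(hocr)
      Nokogiri::HTML(hocr).css(".ocr_line")
        .map { |line| line.text.gsub(/\s+/, " ").strip }
        .reject(&:empty?)
        .join("\n")
    end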

Use of OCR data

OK, what are we going to actually do with it?

PDF with text layer

This is the initial obvious thing to do. Enhance our existing multi-page PDF so it has a text layer based on hOCR (and make sure it’s available even for single-page works – I’m not sure our current PDF generation is).

While OCR engines can create PDFs with a text layer themselves, we will probably create it ourselves, to our specs, from our images combined with hOCR.

There are tools that can combine an image with an hOCR file to make a PDF with a text layer:

  • https://github.com/ocropus/hocr-tools#hocr-pdf – but it assumes certain file-system locations and file naming, not necessarily convenient for us

  • Another tool from Internet Archive; not sure I understand what it’s doing:

  • hocr2pdf looks maybe more convenient, and is available via the ExactImage toolkit, which is in apt-get and brew (see the sketch at the end of this list).

  • Hypothetically we could write our own code to do this using ruby PDF toolkits or something, but it’s kind of a pain, and ruby PDF tools are limited (hexapdf has a license that may not work for us).

    • But here’s a ruby tool that isn’t really maintained, does things we don’t want it to do, and has an ImageMagick dependency we don’t really love… but it does include hOCR-to-text-layer functionality, demoing how you might do it.
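  • Eg, a sketch of shelling out to hocr2pdf for one page (it reads the hOCR on stdin; method name and paths are hypothetical):

      require "open3"

      def image_plus_hocr_to_pdf(image_path, hocr, pdf_path)
        _out, err, status = Open3.capture3(
          "hocr2pdf", "-i", image_path, "-o", pdf_path,
          stdin_data: hocr
        )
        raise "hocr2pdf failed: #{err}" unless status.success?
        pdf_path
      end

      # per-page PDFs would then be concatenated into our multi-page download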

 

Since our hOCR coords will be based on the full-resolution TIFF, this may require doing some hOCR coordinate scaling; a rough sketch is below.
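hOCR keeps coordinates as bbox values inside title attributes, so the scaling is a string rewrite (a sketch, assuming we know the ratio between the TIFF and the derivative we embed):

    require "nokogiri"

    # Multiply every hOCR bbox coordinate by a scale factor
    def scale_hocr(hocr, factor)
      doc = Nokogiri::HTML(hocr)
      doc.css("[title*='bbox']").each do |node|
        node["title"] = node["title"].gsub(/bbox (\d+) (\d+) (\d+) (\d+)/) do
          "bbox " + [$1, $2, $3, $4].map { |c| (c.to_i * factor).round }.join(" ")
        end
      end
      doc.to_html
    end

    # eg TIFF is 4800px wide, PDF page image is 1200px:
    # scaled = scale_hocr(hocr, 1200.0 / 4800)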

We will have to see how much this slows down our on-demand PDF creation. The difficulty with the alternative – pre-rendering PDFs – is that we’d still need to catch every time a work has an asset added/removed/changed/published, in order to re-generate! But we could use the same on-demand caching system and keep generated things around longer, or pre-trigger their creation.

We might want to indicate somehow which PDFs have text layers? Even if they all do, let people know in the UI?

We may want to revisit the resolution of images in our PDFs; it may still be excessive, and we could perhaps create a smaller PDF for download.

Get Google to index…

It would be nice if a google search for a passage in our text would at least hypothetically hit on us, if our OCR’d text were indexed by google.

Not really sure how we’d do this, we should ask around and see if anyone’s tried.

Might require that we have a UI that lets you look at text of (eg) page 43 as HTML, and press next/prev to go through the pages, just to have something for Google to index.

Or we try to get Google to index all our generated PDFs with text layers? But get them to have links back to DC works too?

Or “Download OCR text as text” below might do it, if done right and included in our SiteMap (and with links back to Digital Collections if someone does find it!)….

Using the existing infrastructure we use for Bredig and Oral History “full text”, it will be fairly easy to index the OCR fulltext in Solr, provide search results in Dig Coll, and show “search in context” snippets/highlights with those results (the Solr side is sketched below).
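For reference, the query side of those snippets is standard Solr highlighting params (the field name searchable_fulltext and the core URL are hypothetical; the hl params are stock Solr, via the rsolr gem):

    require "rsolr"

    solr = RSolr.connect(url: "http://localhost:8983/solr/our_core")
    response = solr.get("select", params: {
      q: "synthetic dye",
      defType: "edismax",
      qf: "searchable_fulltext",
      hl: true,
      "hl.fl" => "searchable_fulltext",
      "hl.snippets" => 3,
      "hl.fragsize" => 120
    })
    # response["highlighting"] maps doc ids to arrays of snippets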

The potential UI concern is: OK, now you click on one of these results… and how do you find where in the (eg) book your terms matched?

So we may need some kind of search-in-work functionality to make this not frustrating, possibly as a pre-req, and that is much harder.

Search within Work

You want to be able to:

  • search within a work, and see what pages have hits – there are various UIs for this, sometimes integrated into a reader/viewer

  • Ideally you also get highlighted results on the page image (hOCR makes this possible)

This – especially the second bullet – will require a very significant development effort one way or another; it is the most challenging thing on this list.

There are two main packages I find people using for this.

  • UniversalViewer: Used by many of our peers

    • Example: NCSU:

      • See more NCSU examples at

      • Note you can search within the book; results are indicated as markers on the page scroll bar; you go there, and the hits are highlighted on the page image

    • Using UniversalViewer probably requires us to build out IIIF infrastructure we don’t have now – a significant investment – including providing our images via a IIIF image server (which can be static files with IIIF “Level 0”, probably what we’d want to investigate); IIIF Manifests for works composed of multiple images; and specifically the “IIIF Content Search” API (I think that’s what it’s called!) for the searching we’re talking about

      • “IIIF Content Search” is not built into hyrax; even if we were using hyrax, we’d have to build it out

      • But some people in the samvera community have built it out, either for use with hyrax or not. I think the implementations use Solr to search the text, but produce results based on hOCR, including page coordinates etc. (a rough sketch of the response shape is at the end of this list)

      • tpendragon shared that Princeton has an implementation; here is their initial PR

      • Looks like an NCSU implementation providing the IIIF Content Search API using hOCR is here?

      • There may be other samvera implementations to look at
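      • For orientation, roughly what a IIIF Content Search (v1) response looks like – an AnnotationList whose annotations carry the matched text plus page coordinates, which is where the hOCR bboxes come in (URLs hypothetical; structure per the Search API 1.0 spec), expressed as a ruby hash:

          {
            "@context" => "http://iiif.io/api/search/1/context.json",
            "@id"      => "https://example.org/works/123/iiif_search?q=dye",
            "@type"    => "sc:AnnotationList",
            "resources" => [
              {
                "@type"      => "oa:Annotation",
                "motivation" => "sc:painting",
                "resource"   => { "@type" => "cnt:ContentAsText", "chars" => "dye" },
                # the xywh fragment comes from the (scaled) hOCR bbox of the match
                "on" => "https://example.org/works/123/canvas/4#xywh=100,200,50,20"
              }
            ]
          }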

  • Internet Archive viewer

    • It’s a really nice UI – I like it maybe better than UV, maybe even better than our custom one? It lets you scroll through all pages and also look at all thumbnails – it could maybe replace much of our current UI, while also offering search within the “book”.

    • Eg

      • Search results down the side PLUS marked in the scroll bar, also highlighted over the image

      • This is the only in-browser non-PDF interface I’ve seen where the text is also selectable/copy-pasteable!!!

      • It’s not clear how a non-IA host supplies the search (and selectable text!); it’s not necessarily standard APIs like IIIF, and it’s not documented super well. But it could end up being easier to implement than IIIF?

    • This is a tool the IA made for their own use, and they have released it open source – but the docs aren’t great for how to use it. There are definitely other libraries using it – but not sure about using the search and selectable-text features! We’ll have to do research, maybe talk to an insider.

  • Hypothetically, it might be possible to add “highlight results in image” functionality to our current custom viewer. After all, UV is doing it on top of OpenSeadragon, which we use too. Could we add our own highlighting layers? Probably. Adding selectable text à la Internet Archive is harder, but maybe possible. Unclear if these would be easier than switching to a third-party viewer – especially over long-term maintenance, as our own local staff expertise may change.

    • some info on openseadragon overlays…

None of these options will be easy.

Selectable/copy/pasteable text overlaid on image

I’ve only seen this in the Internet Archive viewer, as above, but it sure is neat!

Viewable online as HTML text?

Not sure how useful this is, but it does give a hook for google indexing.

Internet Archive does it as one giant HTML page in a kind of pre-formatted ASCII fixed-width font; not sure why…

With OCR errors and all.

One could also imagine a more HTML-ish ordinary text display (although it would still have errors), and/or a one-page-at-a-time display.

Download OCR text as text

Both the NCSU UniversalViewer example and the Internet Archive book reader have integrated “download” buttons.

NCSU offers a “Download as OCR Text” download option (that’s literally how it’s labelled) – in ASCII text. It definitely has lots of errors, but they offer it as an option.

The IA “owl” example above doesn’t have the OCR text in the embedded viewer “download” tab, but it definitely offers “Full Text” as a “Download Option” on the work page, which seems to be OCR text, errors and all.

 

Other UI/UX?

 

Anything else anyone can think of?