Search-inside feature added to local viewer

After exploring several options for providing in-browser search inside a work, including some R&D experimentation with the Internet Archive BookReader, we have decided to add a “search-inside” feature to our current in-house-developed “Viewer” JavaScript after all.

Use case

You want to be able to:

search within a work for a word or phrase, and see which pages have hits. (Should we specify case sensitivity, i.e., that search will or will not provide case-insensitive results? -GZ)
- Gabriela: I’m still working out exactly how the search will work balancing feasibility and needs. It will be case-insensitive. It may or may not be “whole word”. This is something we have to work out by seeing what we can do and how well it works.
get highlighted results on the page image (hOCR makes this possible)
Do we need to specify that initial phase is not for mobile?-GZ
- I’m not sure if that’s true or not, we need to decide!
- My vote is not mobile right now (does “mobile” include iPads?)

We have a prototype demonstrating we can do this, although it will be a lot of work to polish it.

Will not include

The viewer images will be highlighted with results, but these images do not allow copy-paste of text.
- The only tool I know of that does that is Internet Archive BookReader. UniversalViewer, which many Samvera peers are using, does not have this feature either, to my knowledge. If there are other examples of this feature, please link them for reference.
- Adding this to the local viewer is probably prohibitively complex. At any rate, it’s out of scope for the initial implementation.

Challenges – actual search matching

Actually matching queries to OCR text, with coordinates, is kind of technically complex and not well supported by our current toolchain.

Note: even if we had gone with IA BookReader, we would still have been responsible for this part of the implementation and would have had the same challenges.

We looked for Samvera peers who could share their implementation. We’ve found only one so far, from Princeton, which used the same simple approach we had started with and basically finds only exact matches. Well, it can match case-insensitively, but that’s about the only thing it can do.
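As a rough illustration of that simple exact-match approach, here is a minimal Python sketch. It is not our actual code, and not Princeton’s. It assumes Tesseract-style hOCR in which each word is a span with class ocrx_word whose title attribute carries a bbox; the names HocrWordParser and exact_matches and the sample markup are made up for the example.

```python
import re
from html.parser import HTMLParser

# Assumption: word boxes look like title='bbox x0 y0 x1 y1; x_wconf NN'
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []    # list of (text, (x0, y0, x1, y1))
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            m = BBOX_RE.search(attrs.get("title") or "")
            if m:
                self._bbox = tuple(int(g) for g in m.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def exact_matches(hocr, query):
    """Return (text, bbox) for words equal to the query, ignoring case only."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    needle = query.casefold()
    return [(t, b) for t, b in parser.words if t.casefold() == needle]


# Made-up sample page, not real OCR output from our collection.
SAMPLE = """
<div class='ocr_page' title='bbox 0 0 1000 1500'>
 <span class='ocr_line' title='bbox 90 190 600 240'>
  <span class='ocrx_word' title='bbox 100 200 160 230; x_wconf 96'>The</span>
  <span class='ocrx_word' title='bbox 170 200 290 230; x_wconf 93'>Middle</span>
  <span class='ocrx_word' title='bbox 300 200 420 230; x_wconf 91'>middle.</span>
 </span>
</div>
"""

print(exact_matches(SAMPLE, "middle"))
# → [('Middle', (170, 200, 290, 230))]
```

In this sample, a search for “middle” finds “Middle” but misses “middle.” because of the trailing period, which shows how brittle plain exact matching is.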

Problematic false negatives include

It will not find alternate singular, plural, or other endings of words, as our Solr search can.
It will not find words when the query omits their diacritics: it’s not going to find “schön” when you enter “schon”.
- Worse, it might have problems with differing Unicode normalization forms, although we can hypothetically control for this a bit (we have to figure out which normalization form the Tesseract hOCR output uses, or ensure it is converted to a known one).
If a word has been divided between two lines and hyphenated, it will show up in the source as e.g. mid- dle, and a search for “middle” won’t find it.
Alternate versions of punctuation: Tesseract OCR often uses curly quotes for apostrophes, as in |isn‘t|, which won’t be matched by the straight quote entered with a US keyboard, as in isn't.
Obviously typos or noise in the OCR are also an issue
There is no phrase searching (we can make multi-word searches just search for them each individually)

For some of these issues there are possible workarounds, if we choose to spend the time on them; others are intractable unless we change our approach. For now, we are delivering an initial version that will have most of these problems, which are shared by the one peer example we have been able to investigate.
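To give a sense of what such workarounds might look like, here is a hedged Python sketch that normalizes both the OCR words and the query the same way before comparing. It is not what the initial version does. The function names normalize and join_hyphenated are hypothetical, and the choice of NFKD plus stripping combining marks is one assumption among several possible ones.

```python
import unicodedata

# Map curly quotes to their straight keyboard equivalents.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def normalize(token):
    """Fold case, quotes, diacritics and edge punctuation (illustrative only)."""
    token = token.translate(QUOTE_MAP)
    # NFKD splits "ö" into "o" + combining diaeresis, whatever form the input used.
    decomposed = unicodedata.normalize("NFKD", token)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip(".,;:!?()[]\"")


def join_hyphenated(lines):
    """Rejoin a word split across a line break, e.g. 'mid-' + 'dle'.

    `lines` is a list of lines, each a list of word strings. Naive on purpose:
    it cannot tell a line-break hyphen from a real one such as 'well-'.
    """
    out = []
    carry = ""
    for line in lines:
        words = list(line)
        if carry and words:
            words[0] = carry + words[0]
            carry = ""
        if words and len(words[-1]) > 1 and words[-1].endswith("-"):
            carry = words.pop()[:-1]
        out.extend(words)
    if carry:  # trailing fragment on the last line: keep it as it was
        out.append(carry + "-")
    return out


print(normalize("schön") == normalize("schon"))
print(join_hyphenated([["the", "mid-"], ["dle", "way"]]))
```

Two caveats: stripping diacritics also conflates words that really are different, and a rejoined word spans two bounding boxes, so a real version would need to keep both sets of coordinates for highlighting. This sketch only handles the text.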

The BEST way to solve this is to change our approach, and use a custom Solr plugin that lets us use the same Solr search technology we use for our general search, and still get image coordinates for highlighting in our search results. There is such a plugin: https://github.com/dbmdz/solr-ocrhighlighting

But our current cloud-hosted Solr provider does not let us install custom JAR plugins like this (at least not without paying for a plan that’s too expensive for us).

In the future, we may look into switching our Solr to be more “self-hosted”, perhaps via fly.io.

Some search things that it looks like we could tweak one way or the other

Should we require queries to match whole words only, or also allow “starts with” matches?
We will NOT support phrase queries with this implementation. So when a user enters multiple words, they will be highlighted separately. But:
- If you enter two words, should a match require both to be on the same page, or should we just highlight both words wherever they appear, even if only one of them is on a given page?
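The two choices above can be pictured as two switches on the matching function. The following Python sketch is only an illustration under assumed names (page_hits, prefix, require_all) and an assumed input shape (a dict from image sequence number to that page’s words); neither switch setting has been decided.

```python
def page_hits(pages, terms, prefix=False, require_all=False):
    """Return {page: {term: [matching words]}} for pages with hits.

    prefix=False       whole-word matches only
    prefix=True        "starts with" matches are also accepted
    require_all=True   a page counts only if every term matches on it
    """
    terms = [t.casefold() for t in terms]
    result = {}
    for page, words in pages.items():
        hits = {}
        for term in terms:
            if prefix:
                matched = [w for w in words if w.casefold().startswith(term)]
            else:
                matched = [w for w in words if w.casefold() == term]
            if matched:
                hits[term] = matched
        if not hits:
            continue
        if require_all and len(hits) < len(set(terms)):
            continue
        result[page] = hits
    return result


# Made-up example pages keyed by image sequence number.
pages = {1: ["whale", "ship"], 2: ["whales"], 3: ["ship"]}

print(sorted(page_hits(pages, ["whale", "ship"])))                    # → [1, 3]
print(sorted(page_hits(pages, ["whale", "ship"], prefix=True)))       # → [1, 2, 3]
print(sorted(page_hits(pages, ["whale", "ship"], require_all=True)))  # → [1]
```

With whole-word matching, page 2 (“whales”) is missed; with “starts with” it is found. Requiring all terms on the same page narrows results to page 1 only.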

UX Polish to add

Few of these are in the first draft, but we will probably add some as part of this project phase, depending on how expensive they end up being.

In the list of page thumbnails, highlight pages that have a match
Show number of results at top?
Let user open/close search area, to have full screen for page if they aren’t searching
Provide page/image numbers on search results
- We don’t have true pagination-as-labelled-in-work metadata, so we can supply at best image sequence numbers at present.
When going to result, don’t just go to the page, but “zoom” to the area of the page with the match at a readable size.
Search box on main work page, that when you submit it, opens up viewer with your search
- In addition to saving clicks, this will better advertise that the search-inside feature exists
- Next step – pre-fill with their overall search query, where applicable
Provide next/previous to step through search results
We have to figure out what happens on small/mobile screens; currently we have broken them
Bookmarkable search – record the current search in the URL, so if you copy and paste it you can go back to particular search results (just as you can now go back to a particular page in the viewer)
Handle slanted term-on-page highlights – sometimes the words are not completely horizontal but at a slight (or greater) angle, and this info is in the Tesseract hOCR and could probably be used to position the on-screen highlight even better.
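For the “zoom to the match” item in the list above, the geometry is simple enough to sketch. This Python example is an assumption-laden illustration, not viewer code: zoom_region, its padding and min_size parameters, and their default values are all hypothetical, and coordinates are taken to be pixels in the full-size page image, as in hOCR bounding boxes.

```python
def zoom_region(bbox, page_size, padding=0.5, min_size=(400, 300)):
    """Return (left, top, right, bottom) of an area to zoom to around a match.

    bbox       (x0, y0, x1, y1) of the matched word
    page_size  (width, height) of the page image
    padding    extra margin on each side, as a fraction of the word's size
    min_size   smallest region to show, so a short word keeps some context
    """
    x0, y0, x1, y1 = bbox
    page_w, page_h = page_size

    # Grow the box by the padding, but never below the minimum region size
    # and never beyond the page itself.
    w = min(max((x1 - x0) * (1 + 2 * padding), min_size[0]), page_w)
    h = min(max((y1 - y0) * (1 + 2 * padding), min_size[1]), page_h)

    # Center on the word, then shift the region back inside the page bounds.
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    left = min(max(cx - w / 2, 0), page_w - w)
    top = min(max(cy - h / 2, 0), page_h - h)
    return (left, top, left + w, top + h)


print(zoom_region((170, 200, 290, 230), (1000, 1500)))
```

The returned rectangle could then be handed to whatever pan/zoom call the viewer ends up using; which one that is remains open.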