Tuesday, May 8, 2012

How is Google different from traditional Library OPACs & databases?

It's a truism in library circles today that Google and web search engines (I will use "Google" as a stand-in for web search engines) have changed the way users search, which in turn affects what they expect from searches in the library.

Libraries have two ways to react: try to change user behavior through information literacy, or alter the library catalogues and databases to fit user expectations. If one goes by the famous "User is not broken" meme, the latter seems to be the way to go.

That said, how is Google different from typical library catalogues or even library databases?

In particular, Web Scale Discovery products, which are pretty much Google-like search engines, have the potential to work exactly like Google. But what exactly are we talking about, and if it's possible, should we do it?

I was inspired by a recent exchange on a Summon mailing list between librarians and many library IT people (so-called Shambrarians) on how exactly Summon should work, which led to a digression on what additional feature sets should be in there to support searching.

It seems to me we are discussing two different matters, and that most traditional library catalogues and databases differ from Google in two main ways.

  • Firstly the default search works differently from Google. 
  • Secondly they have additional options not available in Google. 

My thesis is that there is increasingly much less debate over the first type of difference (though there are still some holdouts for some features) but still plenty of disagreement about the value of the second as an option. In particular, many of the defaults we take for granted in library catalogues and databases today have *already* been impacted by Google.

It's a long blog post, so here's the summary.

Library OPACS and databases generally already

  • rank using relevancy by default
  • do "implied AND" by default

They are slowly moving towards

  • autostemming (for databases)
  • covering full-text (for Web scale discovery services)

But unless a miracle happens, they will never

  • do a "soft AND", where occasionally search terms might be dropped

In short, the further away your library search is from these characteristics, the more difficult your users will find it to use, because their expectations differ. Trained by Google, they construct searches on the expectation that such features are built in; lacking any one of them will result in difficulties and poor quality results.

Of course, implementing these features means losing control and predictability of searches. Librarians don't want to be surprised, and for sure they don't want to see a result they can't explain. Being able to do a precise, controlled search lets a searcher be *sure* he has done the exhaustive search he wants.

Google works on the opposite paradigm, aiming at getting a few good relevant results, never mind if the searcher can't explain how it got those results. So there's a tension here.

Even if a library search ever achieves all five points, a second battlefront opens... what about adding more advanced options for librarians and power users? This is a pretty contentious issue....

But for now let's discuss the differences in default searches. Let's start with the features that are totally accepted.


1. Traditional Library catalogues and databases do not rank by relevance even for keyword searching

This is an area where Google and web search engines have completely won. The idea that you could get results without any relevance ranking at all is one that is hard to wrap your head around in this day and age.

Even newer librarians find this concept alien. I remember explaining to a colleague in 2007 that traditionally Boolean searches did not rank results by relevancy, since in theory all results can be considered equally relevant as they meet the search criteria, but she didn't believe me.

The older OPACs merely ranked in system order (the date the record was entered) or, if you were lucky, by author, title or publication date.

Of course today the newer OPACs and next generation catalogues have relevancy ranking as the default (relevancy ranking is one of the signature characteristics of a next gen catalogue).

Still there are some older OPACs out there that don't do relevancy ranking at all for keyword searching.

An example would be the National Library Board's ageing classic catalogue (note: they also have Primo). According to the help file:

"The default sorting order for the list of titles is in reverse chronological order, that is, the most current items first. If your retrieval set is less that 300, you may also choose to sort your list by author, title, call number, date, or original order." - no mention of relevancy ranking at all.

Even our classic catalogue, WebPAC Pro, IIRC turned on relevancy ranking by default fairly late in 2007.


Still, doing advanced field searches like (a:twain) and (t:huck*) disables relevancy ranking as an option (probably for a technical reason), something that surprises even librarians sometimes.



In general, though, I think the idea of ranking results by relevancy has met with very little resistance from librarians (if they even knew of a time before relevancy ranking), most of whom are happy with somewhat unpredictable rankings as long as they can explain why a result is found.

Still, there is some loss of control for the searcher: you might be able to explain every result (assuming strict Boolean), but you can't explain why this result is on top and not another.

As such I know of librarians who are a little unhappy with the limited amount of information available on how relevance ranking works.

In the case of OPACs and next generation catalogues like our own Encore, it is fairly easy to figure out.

There are 5 levels only -
  • "Most relevant" - exact phrase match in title $a subfield, 
  • "Highly relevant" - exact phrase match in title but other subfields
  • "Very relevant" - exact phrase match in any field
  • "Relevant" - No phrase match but all terms appear in title field
  • "Other relevant" - No phrase match but all terms appear, but not all in title field 

Then again, with the limited amount of information you have in OPACs, you probably can't get more sophisticated than that.
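
To make the five levels listed above concrete, here is a minimal Python sketch of such a tiered ranking. It is only my reading of the list, not Encore's actual code; the record keys `title_a`, `title_other` and `other` are hypothetical names I made up for the title $a subfield, the other title subfields and all remaining fields.

```python
import re

def tokens(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def has_phrase(field, query_tokens):
    """True if the query tokens appear consecutively, in order, in the field."""
    field_tokens = tokens(field)
    n = len(query_tokens)
    return any(field_tokens[i:i + n] == query_tokens
               for i in range(len(field_tokens) - n + 1))

def relevance_tier(query, record):
    """Return 1 ("Most relevant") to 5 ("Other relevant"), or None if a term is missing."""
    q = tokens(query)
    if not q:
        return None
    title_a = record.get("title_a", "")          # hypothetical key: title $a subfield
    title_other = record.get("title_other", "")  # hypothetical key: other title subfields
    other_fields = record.get("other", [])       # hypothetical key: all non-title fields
    if has_phrase(title_a, q):
        return 1
    if has_phrase(title_other, q):
        return 2
    if any(has_phrase(field, q) for field in other_fields):
        return 3
    title_terms = set(tokens(title_a)) | set(tokens(title_other))
    if set(q) <= title_terms:
        return 4
    all_terms = title_terms.union(*(tokens(f) for f in other_fields))
    if set(q) <= all_terms:
        return 5
    return None
```

Sorting a result set by this tier number (lowest first) gives the kind of coarse but easily explained ordering described above.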

But when it comes to databases that consider full text, it obviously gets complicated (here's EBSCOhost's explanation), and all you get is a sketch. Typically it will state which fields are heavily weighted (usually title and subject heading; if you are lucky it might order the fields by importance), that ranking is weighted toward exact phrase matches, and give some general explanation of tf-idf.

As a librarian whose knowledge of relevance ranking extends just to tf-idf weighting, I probably wouldn't understand everything even if they bothered to explain it all, not that they would. :)
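
For readers who haven't met tf-idf, here is a bare-bones Python sketch of the textbook idea: a term counts for more the more often it appears in a document (tf) and the fewer documents contain it (idf). This is the plain classroom formula under my own simplifying assumptions (whitespace tokenizing, no field weights, no phrase boost), not any vendor's actual ranking.

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    """Score each document against the query with plain tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df: in how many documents does each term occur at least once?
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if df[term]:  # skip terms that occur in no document at all
                term_freq = tf[term] / max(len(toks), 1)
                score += term_freq * math.log(n_docs / df[term])
        scores.append(score)
    return scores
```

Note that a term appearing in every document gets an idf of log(1) = 0, so it contributes nothing to the ranking.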

Still, that doesn't stop librarians from asking if they can alter the relevancy rankings for Web Scale Discovery tools. This is usually not possible; typically you can ask to "boost" the results from your local collections or institutional repositories, but nothing beyond that, except maybe with Primo Central (self-hosted version only?).

Though one wonders if most know what they are asking for.



2. Traditional Library catalogues and databases do not do implied AND

Again, this is one area where Google has slowly chipped away at what traditional databases used to do. In Google, when you enter a search, say Singapore History, it is implied that you are also happy with results that include Singapore AND History, instead of just Singapore History as a phrase.

Correct me if I am wrong, but in the past most databases did not do implied AND. If you typed in Singapore History, the database expected that you wanted articles that have the two terms next to each other in that order, aka a "phrase search".
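
The difference between the two interpretations is easy to show in a few lines of Python. This is just an illustrative sketch with function names of my own choosing; real systems work against an index rather than scanning text.

```python
import re

def tokens(text):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def matches_phrase(query, text):
    """Old-style default: the terms must be adjacent and in the same order."""
    q, t = tokens(query), tokens(text)
    return bool(q) and any(t[i:i + len(q)] == q
                           for i in range(len(t) - len(q) + 1))

def matches_implied_and(query, text):
    """Google-style default: every term must appear somewhere, in any order."""
    q = tokens(query)
    return bool(q) and set(q) <= set(tokens(text))
```

A title like "A History of Modern Singapore" satisfies the implied AND reading of Singapore History but not the phrase reading, which is exactly the kind of record a Google-trained user expects to see.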

But now pretty much every library database, from Scopus and Web of Science to JSTOR, does implied AND.

I am hard pressed to find examples where implied AND is not on by default. So far for us, I have found only LexisNexis Academic, which does not do implied AND.

Of course, some databases have various modes and/or allow you to set the default to whatever you want. On EBSCOhost platforms you can in fact choose between a pure "Boolean/phrase" mode (so if you don't enter AND it assumes a phrase search) and a "Find all my search terms" mode, which does implied AND.

For databases with multiple modes, it's common for the "basic mode" to do implied AND while the advanced/Boolean mode does implied phrase search.

Still, I know the lack of implied AND was very common in library databases in the past, as I can still find mentions of how "some databases do implied AND" or similar statements in older books on database searching. My thesis is that implied AND is now more the norm than the exception.

In short, like relevancy ranking, this is another area where web search engines like Google have impacted how library systems work.

Again, similar to relevance ranking, my impression is that this change has been accepted and absorbed by the library community without much resistance. Sometimes you do need an exact search, but phrase search using quotes is generally sufficient and a worthwhile price to pay for the gain. And thanks to relevance rankings that prioritize phrase matches, you seldom need that anyway.

This change involves a little loss of control for the searcher since the search tries to be "clever", but you can gain control back by adding quotes (or the + operator, until Google changed it because of Google+).



3. Traditional Library catalogues and databases generally do not do stemming by default

Relevance ranking and implied AND have been generally absorbed into library databases and OPACs without much dispute, but what about auto-stemming? Here I use stemming loosely to mean including word stems (e.g. run finds running). Closely related is the ability of a search engine to include synonyms and other related words (e.g. automobiles finds cars).

From what I observe, auto-stemming by default is currently still in the minority, but the tide is turning.

Traditional databases like Scopus/JSTOR generally give you exactly what you search for. If you search for librarian, it won't give you librarians (note extra "s"), library or even information professional. In this sense it is totally predictable.
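
Here is a toy Python sketch of the difference between an exact search and a stemmed one. The suffix stripper below is deliberately naive and entirely my own invention for illustration; production systems use proper algorithms (such as the Porter stemmer) or dictionaries, and I make no claim about what any particular database does.

```python
SUFFIXES = ("ing", "es", "s")  # toy list; checked in this order

def naive_stem(word):
    """Strip one common English suffix, keeping at least 3 letters of stem."""
    word = word.lower()
    for suffix in SUFFIXES:
        if not word.endswith(suffix):
            continue
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words like "class" alone
        if len(word) - len(suffix) < 3:
            continue  # stem would be too short, e.g. "bus"
        word = word[: -len(suffix)]
        break
    # Undo consonant doubling ("runn" -> "run"), but keep ll and ss endings.
    if len(word) > 2 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

def exact_hit(term, text):
    """Traditional behaviour: the term must appear exactly as typed."""
    return term.lower() in text.lower().split()

def stemmed_hit(term, text):
    """Auto-stemming behaviour: compare stems instead of surface forms."""
    return naive_stem(term) in {naive_stem(w) for w in text.lower().split()}
```

With the text "Interviews with academic librarians", an exact search for librarian misses while the stemmed search hits. Neither finds it from library, which is the point where synonym or related-word expansion would have to take over.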

But cracks are appearing. The new Web of Science platform advertises "automatic searches for over 7,000 spelling variations such as British/US English (colour / color) as well as name variants (mice / mouse)".

OvidSP's basic search by default includes related words ("include related terms"), typically plurals but occasionally synonyms.

EBSCOhost databases have an "Apply related words" option, though it isn't turned on by default on my version.

Lastly, Summon itself does autostemming unless you add Boolean operators, in which case it does exact searches. I believe Summon also has a list of proper names so it will by default search for those as well.

From the reaction seen on the Summon list, autostemming by default is more controversial.

Add autostemming without a way around it such as adding quotes or a Verbatim mode in Google? You will have a librarian revolt on your hands.

Google of course does autostemming for word variants, but according to Googleguide (which is not official) it does not find related terms (cars vs automobiles) unless you add the tilde (~) operator. Personally I think the latter portion is a simplification.

If you look at what Verbatim mode actually turns off, you can see Google does more than just match word stems.

Personally I think that if you have a choice, it is good to turn on stemming by default, particularly if the stemming is of the form that searches the root words. In most cases, a searcher looking for car is also looking for cars. It is just a bit cruel to make him use truncation, wildcards or even the Boolean OR just to achieve this. In the rare case where he isn't, he can just add quotes.

I am a bit more on the fence if it is of the form that throws in additional synonyms. On one hand, I agree with Iris that the greatest difficulty students face when searching is what she calls "term economy", where adding the wrong keywords gets you bad or even no results, so if the search can help by adding appropriate related terms to expand the search, it would be great.

As with all the cases where the search tries to be "intelligent" (and hence less predictable), how useful this feature is depends on how good the system is at throwing in the right synonyms.

Our library databases and OPACs generally rely on authority records and thesauri, and in very technical areas, where the right term or lingo can make or break the search, this can be useful. Google probably has an even more sophisticated system to find related words, but either way, whether to turn this on by default can be decided by empirical tests.

Regardless of whether autostemming improves results on average, turning it on by default means losing more control. 

Typically Librarians would prefer for such features to be an option. Even more control can be had if searchers get to choose what synonyms & related words to add.

Something like OvidSP's "Map Term to Subject Heading" or EBSCOhost's "Suggest Subject Terms" would be ideal, since one can decide what extra search terms to add instead of having keywords added immediately without any control.

OvidSP's "include related terms" in basic search mode automatically dumps in all related terms, but at least the search isn't opaque and you can understand why a certain result is found.





4. Traditional Library catalogues and databases generally do exact searches ("hard" implied AND) and will not drop search terms (except stopwords)
 
Google is generally very, very unpredictable, despite the impression given by http://www.googleguide.com/

For example, in a long search query, it may randomly drop a search term if that term causes the number of results to drop drastically. In the UK, Phil Bradley and Karen Blakeman have done quite a bit of experimentation on Google searches; in particular, a comment by a Googler at the end of this blog post is enlightening.

 "When you do a multi-term query on Google (even with quoted terms), the algorithm sometimes backs-off from hard ANDing all of the terms together. It’s a kind of “soft” backoff. Why? Because it’s clear that people will often write long queries (with anywhere from 5 to 10 terms) for which there are no results. Google will then selectively remove the terms that are the lowest frequency to give you some results (rather than none)."

That's just one oddity among others due to the various optimizations done by Google. For now, though, databases and web scale discovery services are still not at this level of opacity and unpredictability.
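
The back-off described in the Googler's comment can be sketched in Python as follows. To be clear, this is only my guess at the general shape: I am assuming "lowest frequency" means the term found in the fewest documents, and the `index` dictionary (term to set of document ids) is a made-up stand-in for a real search index.

```python
def hard_and(terms, index):
    """Strict AND: only documents containing every term."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def soft_and(terms, index):
    """Drop the rarest term, one at a time, until something is found.

    Returns (hits, dropped_terms).
    """
    remaining = list(terms)
    dropped = []
    while remaining:
        hits = hard_and(remaining, index)
        if hits:
            return hits, dropped
        # Assumption: "lowest frequency" = appears in the fewest documents.
        rarest = min(remaining, key=lambda t: len(index.get(t, set())))
        remaining.remove(rarest)
        dropped.append(rarest)
    return set(), dropped
```

The second return value is what a librarian would want surfaced: a system that at least reports which terms it dropped stays explainable, while one that drops them silently does not.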

The closest function to this that I know of in library databases is EBSCOhost's SmartText Searching, which allows you to enter a chunk of text (up to 5000 characters) and tries to match the best articles, which may not contain all the words.

Google also tends to "autocorrect" your searches and will automatically without prompting give you a search it thinks you want, though it does tell you about it and gives you a choice to change back. And this happens even if your inputted search has some results!





In terms of library system analogues, I know of some library catalogues that, on finding no results, will automatically switch to ORing the terms and show those results, but that's the closest I can think of.
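
That catalogue behaviour is much simpler and more predictable than Google's back-off. A rough Python sketch, again using a hypothetical `index` dictionary mapping each term to a set of record ids:

```python
def and_then_or(terms, index):
    """Try a strict AND first; only if it finds nothing, fall back to OR.

    Returns (hits, mode) so the interface can tell the user what happened.
    """
    sets = [index.get(t, set()) for t in terms]
    if not sets:
        return set(), "none"
    hits = set.intersection(*sets)
    if hits:
        return hits, "AND"
    return set.union(*sets), "OR"
```

Because the fallback only triggers on zero results and the mode can be reported back, the searcher can always explain why a record was retrieved.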

If relevancy ranking and implied AND were accepted without much dispute and auto-stemming caused some grumbling, unpredictable searches similar to doing "Soft AND" would be the end of the world. If Summon or library databases started to work like Google where you couldn't tell most of the time why a search result was retrieved, all control is thrown out of the window and the library apocalypse would be here.

Still, while the typical librarian or serious searcher would care, I wonder how much it matters: as the comment on Karen's blog indicates, 99% of people wouldn't care as long as the results were relevant. Again we are back to control/predictability vs quality of results.


5. Traditional Library catalogues & databases which used to be I&A do not have full-text

Similar to relevancy ranking, the fact that OPAC searches cover just the bibliographic record and not the full-text is something that is very alien to today's searchers, thanks to Google. The library database version of this is of course Indexing & Abstract databases like Scopus and Web of Science, and I often field queries from users who are stunned to realize that Scopus does not have the full-text.

Somewhat related, and also confusing to users, is the idea of indexing and searching only certain fields ("Keyword search" in many OPACs/next generation OPACs/databases searches a subset of available fields) or, worse, things like inverted author names.

The idea of full-text search reigns over all, so concepts like index browse, pre-coordination of subjects or controlled vocab are totally alien to users. Even if these are explained to them, they will probably just think it's a strange, weird idea.

The fact that Web Scale Discovery services actually

a) allow full-text search for articles (some of them anyway)

b) allow full-text search for books (some of them anyway)

is in my opinion the biggest gain of Web Scale Discovery systems. It aligns perfectly with what users are thinking of and is the main reason why they are a big hit.

I have often helped students who tell me they can't find anything in the catalogue. If it's an article they are looking for, it's fine to explain that we don't have article titles in there (though I still get weird looks sometimes), but what if they are searching for a book?

It's perfectly possible of course they are searching on very very specific topics that aren't in books but often what the user actually wants is a textbook on a specific statistical technique say a specific type of linear regression.

A search in the library catalogue typically yields nothing, so I will ask them to do a Google Books search and, lo and behold, a ton of books appear, most of which we have in the library! I then have to explain to users that our catalogue doesn't have the full text (though occasionally we have tables of contents), so even though the technique might be covered in a chapter or a few pages, our library catalogue fails.

Web Scale Discovery systems obviously include ebook full text if an agreement exists, but even in cases where all you have is a print version and not the ebook, they can still help indicate a print book is relevant, because the system knows from the ebook that the term you are looking for is in there. Essentially you get something similar to Google Books.

I would argue many cases of people having problems with OPACs stem from the fact that searchers are just used to searching full-text. In Google, they can happily search for very specific terms and still get hits, but they fail in OPACs since we just include limited book information.

While searching over full-text doesn't mean choosing the right search terms is not important, the fact that you have a bigger set of text to match over makes it more likely you will get something.

I have also seen many users struggling with "subject search" in OPACs, thinking it searches over full-text. Of course, no one but a cataloger would ever start from Library of Congress Subject Headings or other controlled vocab (though one could of course use pearl growing techniques from a relevant item).

In general, the ability to search full-text for books and break down silos across databases for articles is still quite a new experience for librarians, so I am unsure how librarians are reacting to it, though I guess most librarians subscribe to the "more is better" view?

Of course, searching over full-text means a lot more possible hits for broad general search terms, which puts greater stress on the relevancy ranking algorithm. Thus far I have seen a request on the Summon list for an option to search just metadata and not full text, similar to what EBSCO Discovery Service has.

Another issue specific to Summon is that while it tells you a result was obtained due to a match in the full-text, you can't easily verify it, since unlike Google Books it doesn't have a "snippet view" to see the words in context.



 Showing that a result is matched via full-text but not exactly what is matched


Even if one can agree on which defaults are best, and that defaults should basically simulate what users expect (aka Google), what about the need to add options for more advanced/librarian-like features?

Here's a list of such features,

What's not in Google but in Library OPAC and databases

1. Boolean operators, proximity operators, truncation/wildcards

Google does have a lot of advanced operators, including OR and the minus operator (equivalent to NOT), and it also uses quotes for exact phrase search. Still, that pales compared to a typical database.

For one thing, Google doesn't have a documented proximity operator, though there is an undocumented AROUND function and there are hacks to sort of achieve the same thing using the asterisk. Neither takes term order into account, though.

E.g. EBSCOhost has NEAR and WITHIN.

While Google uses the * symbol, it is not a truncation feature as in most library databases.

In fact, most databases differentiate between wildcards, typically ? (replacing one character, or sometimes zero or one), and truncation, typically * (replacing an unlimited number of characters, or in some cases a fixed number).
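
One way to see the distinction is to translate both symbols into regular expressions. The Python sketch below picks one convention (? is exactly one character, * is zero or more characters); as noted above, actual databases differ on the details, so treat the mapping as an assumption rather than any vendor's rule.

```python
import re

def to_regex(pattern):
    """Translate a database-style pattern into a compiled regular expression."""
    parts = []
    for ch in pattern:
        if ch == "?":
            parts.append(r"\w")    # wildcard: exactly one character
        elif ch == "*":
            parts.append(r"\w*")   # truncation: zero or more characters
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

def expand(pattern, vocabulary):
    """Return the words in the vocabulary that the whole pattern matches."""
    rx = to_regex(pattern)
    return [word for word in vocabulary if rx.fullmatch(word)]
```

So wom?n would pick up woman and women, while librar* would pick up library, librarian and libraries but not liberal.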

That said, with tons of intelligent searching built-in including autostemming, does Google really need truncation or exact boolean operators?


2. Plenty of search modes and field searches

Google has one advanced search and it is pretty comprehensive in terms of search fields available, including the ability to narrow by language, region, last update, the site or domain it is on, file type etc.

Google "field searches" are limited to title of the page, URL of page, in links to the page (anchor text) and of course text of the page.

With library databases, on the other hand, we have tons of field searches. In Business Source Premier on EBSCOhost I see 19 fields you can search; with PsycINFO on OvidSP I count over 70 searchable fields! With so many fields available to search, it is pretty common for databases to offer multiple search fields connected by Boolean operator pull-down menus, like the multi-field search in OvidSP below.






 Some of 70+ fields in PsycInfo on OVIDSP in multifield search mode


Other databases with a similar layout include EBSCOhost, Scopus, Web of Science etc. This design encourages the use of nested Boolean operators, with each "nest" consisting of one concept combined with OR to pick up synonyms.
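
The search string such a form encourages can be sketched as a tiny Python helper: OR within each concept, AND between concepts. The function name and the choice to quote multi-word terms are my own; the exact syntax each database accepts will vary.

```python
def build_query(concepts):
    """Build a nested Boolean query from a list of synonym lists.

    Each inner list is one concept; its synonyms are ORed together,
    and the concepts themselves are ANDed.
    """
    groups = []
    for synonyms in concepts:
        # Quote multi-word terms so they are treated as phrases.
        quoted = [f'"{s}"' if " " in s else s for s in synonyms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)
```

For example, passing the two concepts `["librarian", "information professional"]` and `["burnout", "stress"]` produces `(librarian OR "information professional") AND (burnout OR stress)`.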

To be fair, Google mostly indexes webpages, which have a lot less metadata, but there is some. I suspect, though, that even if there were more metadata and fields, Google probably wouldn't trust them due to spamming (or the nicer term, SEO).

Of course we librarians love our advanced searches, and to "honour us" Google gave us this awesome advanced search on April 1 2012. Just kidding!




3. Subject/Author etc browsing & Thesauri

Related to the above, many "serious" library databases have subject thesauri that allow you to browse and "Explode" concepts. Examples include PubMed, PsycINFO etc. Even the most humble OPAC allows you to browse by subject. AFAIK Google doesn't have anything close.




4. Combination of search history or sets

Many databases like OvidSP, Scopus, Web of Science and EBSCOhost also keep track of your past searches and allow you to combine them using Boolean (what else)!
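
Under the hood this is just set arithmetic on numbered result sets. A minimal Python sketch, with a made-up `SearchHistory` class standing in for whatever the platforms actually do:

```python
class SearchHistory:
    """Keep numbered result sets and combine them with Boolean operators."""

    def __init__(self):
        self.sets = []

    def add(self, hits):
        """Store a result set and return its 1-based set number."""
        self.sets.append(set(hits))
        return len(self.sets)

    def combine(self, a, op, b):
        """Combine set number a with set number b; store and number the result."""
        left, right = self.sets[a - 1], self.sets[b - 1]
        if op == "AND":
            result = left & right
        elif op == "OR":
            result = left | right
        elif op == "NOT":
            result = left - right
        else:
            raise ValueError(f"unknown operator: {op}")
        return self.add(result)
```

Each combination becomes a new numbered set that can itself be combined again, which is how searchers build up a long, fully controlled search step by step.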


Conclusion

Going through the differences I can see that the major difference between Google and typical library search systems is one of control and predictability.  In most cases modern library systems and newer systems can in fact duplicate a lot of the functions turned on by default in Google, but the modern library systems generally require you explicitly turn on the option.

The typical library search system also places a lot of focus on Boolean operators. Leaving aside the functionality to combine search history sets, a pure library system that does not help users by stemming or adding other related terms will require Boolean operators to get higher quality results.

Conversely, the "smarter" a system is at helping the user, the less he needs Boolean and/or truncation/proximity due to a combination of relevancy ranking, stemming and "soft AND" that knows to exclude search terms.

That said, Google's target audience is often people who just want to find a few relevant pages, while serious academic researchers want exhaustiveness, and this requires a high level of control.

A big debate is now brewing over the ultimate goal of Web Scale Discovery systems. One school of thought is that such systems, including Summon, should aim to cater only to undergraduates and shouldn't aim to be more. David Pattern is just one of several who hold this view (see this blog post and comments).

If I understand correctly, this school of thought feels that serious researchers should still use normal library databases, and while Summon can still be useful to them, it's just one option out of many.

They oppose any attempt to make Web Scale Discovery tools closer to typical databases by adding more library-type functions, like the ability to combine search history, more options to control the search and more powerful advanced searches. (It must be noted, though, that while Summon is not much like a typical library database, some others like EBSCO Discovery Service are far closer.)

They fear the clean UI will be messed up and might even confuse users.

Another school of thought, which includes Marshall Breeding, feels that such services should evolve to support all users, including advanced users. Given that such systems will tend to be the default search, the inability to support more than one class of users seems to be a waste...

EDIT May 10 2012. As noted in the comment below and by others, including Dave Pattern himself, the disagreement here between Marshall Breeding and Dave Pattern may be over-stated. Sincere apologies for misrepresenting the two.

Which side of the debate do you stand on?








Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.