Sunday, October 20, 2013

6 things I am wondering about discovery

I blogged 8 things we know about web scale discovery systems in 2013, an attempt to summarize the current consensus after 4 years of web scale discovery service use in libraries and hundreds of research papers and presentations. (Not sure what they are?)

It was a post that seemed to be pretty widely shared and read, but drew no comments, from which I conclude I didn't really say much that was controversial.

This time, I am back to think aloud about issues that are still up in the air. The same qualifiers I made in the last post apply: I am not an "expert", and my familiarity is mostly with Summon and, to a lesser degree, EDS.

Be warned it's a long ramble.

1. Are Blended results a good idea? Or should we implement Bento style search results?

The original incentive for discovery services was that users were telling us they wanted a "one-search" that put all our results, regardless of content type (particularly journal articles, which were not in catalogues), into a single search "like Google", instead of going to a separate silo for each type of content.

As such, it seemed obvious that the difficult part was to get all your content into a single index (those pesky content providers needed to agree), and that presentation of results was a simple matter of using relevancy ranking techniques, similar to what web search engines do, to order them.

However, doubts have begun to surface about the wisdom of mixing up all the search results from different content types, such as books, journal articles, newspaper articles etc., together in one result list (the so-called blended search model).

Some have pointed out that even Google has silos. For example, the main Google search does not (usually; see later) mix its results with those from Google Books, Google Scholar or Google News.

So what is to be done? One attempted improvement that is becoming popular is a "bento style" results list, with results segregated by content type. Many university libraries around the world, such as Columbia University Library, NCSU Libraries, Princeton University Libraries, Dartmouth etc., are doing so.

A point of clarification: I consider both simple two-column interfaces with "Books & more" and "Articles" columns (Villanova University's VuFind implementation is probably the most well-known example) and results with multiple boxes (say NCSU's example) to be examples of bento style.


As noted here, there are different ways to achieve this, from libraries loading the discovery index into open source systems such as VuFind, Blacklight or Xerxes, to those who create their own custom jumping-off webpage.

Why are bento style interfaces better? (What follows is mostly analysis given here)

First, it eases the burden on the relevancy algorithm, as currently the relevancy system must decide how to rank items from totally different content types with totally different amounts of data to work with (e.g. some are available in full text, some are metadata only).

Secondly, by including a box with only "books" or "books & more/catalog", it caters for faculty and research staff who care only about those items. This can also indirectly solve the known-item search issue for books.

Thirdly, there is mixed evidence on whether the blended style is what users want, according to a Bibliographic Wilderness blog post. My own experience handling user feedback on our Summon implementation here is that:

i) As already mentioned, discovery services occasionally make finding catalogue material harder. You can always filter for it, but that slows down the process.

ii) Less experienced users have a big problem trying to differentiate material types, and discovery services often don't highlight such differences clearly. So there are users who can't tell the difference between an ebook and an online book review, or have problems telling whether a certain item is a newspaper article or a journal article. For discovery services like Summon that have a lot of newspaper article content, this can lead to less experienced users producing poor citations. This is one of the reasons (but not the only one) this paper concluded that EDS users produced better citations for an assignment, as judged by librarians.

Lastly, by implementing your own interface layer, which is what is required if you want a bento style interface, you are protected from future changes in discovery providers, as you can retain the same interface and just plug in results via API.
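To make the "plug in results via API" idea concrete, here is a minimal sketch of such an interface layer. Everything in it is an assumption for illustration: the `bento_search` function, the box labels and the two stub search functions are hypothetical, and a real implementation would call the catalogue and the vendor's actual discovery API instead of the stubs.

```python
from typing import Callable, Dict, List

# Hypothetical signature: each box is backed by a function (query, limit) -> titles.
SearchFn = Callable[[str, int], List[str]]


def bento_search(query: str, boxes: Dict[str, SearchFn], per_box: int = 3) -> Dict[str, List[str]]:
    """Run one search per box and return the results keyed by box label."""
    return {label: search(query, per_box) for label, search in boxes.items()}


def catalogue_stub(query: str, limit: int) -> List[str]:
    """Stand-in for a catalogue search; a real one would query the ILS or OPAC."""
    return [f"Book {n} on {query}" for n in range(1, 6)][:limit]


def discovery_stub(query: str, limit: int) -> List[str]:
    """Stand-in for a discovery index API call (Summon, EDS or another vendor)."""
    return [f"Article {n} on {query}" for n in range(1, 6)][:limit]


boxes = {"Books & more": catalogue_stub, "Articles": discovery_stub}
results = bento_search("open access", boxes, per_box=2)
for label, titles in results.items():
    print(label, titles)
```

The point of the design is that the page only knows about box labels and search functions, so switching discovery vendors would mean replacing one function rather than redesigning the interface.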

The main issue here, of course, is that implementing such bento style systems requires quite a bit of work. It's unclear if the big 4 web scale discovery services will start to roll out native interfaces that offer bento style results, but Summon 2.0 is rolling out an alternative model, called content spotlighting, showing "grouped results" for some content types.

This draws inspiration, I believe, from how Google sometimes embeds results from Google News into a web search. The example below shows how a web search for "Nexus 7" gets you mostly web pages, but in the 3rd result you see results from Google News.

And of course Summon 2.0 has the same idea.

In the above example, you see a grouped result for newspaper articles, where Summon 2.0 highlights the top 3 ranked newspaper articles in the midst of other results, and if you click on "news results for...." it will just show newspaper articles.
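A rough sketch of what such grouping could look like on a ranked result list follows. This is not Summon's actual algorithm: the `spotlight` function, the data layout and the choice to place the group where the first newspaper hit would have appeared are all assumptions made for illustration.

```python
from typing import List, Tuple, Union

Hit = Tuple[str, str]  # (title, content_type)


def spotlight(ranked: List[Hit], group_type: str = "newspaper", top_n: int = 3) -> List[Union[Hit, dict]]:
    """Collapse all hits of one content type into a single grouped entry.

    The group holds the top_n best-ranked hits of that type and is placed
    where the first hit of that type appeared (an assumption for this sketch).
    """
    top_of_type = [hit for hit in ranked if hit[1] == group_type][:top_n]
    output: List[Union[Hit, dict]] = []
    group_placed = False
    for hit in ranked:
        if hit[1] != group_type:
            output.append(hit)
        elif not group_placed:
            output.append({"group": group_type, "items": top_of_type})
            group_placed = True
    return output


ranked = [
    ("Article A", "article"),
    ("News 1", "newspaper"),
    ("Book B", "book"),
    ("News 2", "newspaper"),
    ("News 3", "newspaper"),
    ("News 4", "newspaper"),
    ("Article C", "article"),
]
page = spotlight(ranked)
for entry in page:
    print(entry)
```

Clicking through on the group would then correspond to rerunning the search filtered to that content type.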

It has been announced this grouping will work for "Reference" material as well as possibly other content types in the future (I am hoping for "book reviews").

I would add that this idea isn't new in library systems. Our version of Encore/Encore Synergy, for example, is actually the reverse of Summon 2.0: it shows catalogue results, with selected journal articles appearing in the middle of the results.

2. How should we cater for advanced users who have needs for advanced features in discovery systems? Or should we even do so?

One of the greatest frustrations of librarians and advanced users with discovery systems, besides relevancy, is that they compare them with the features they have in databases and wonder why discovery systems lack them.

Some of this could be due to fundamental limitations in the discovery service. For example, one could never have a controlled vocabulary system like MeSH for all articles (with all the included benefits such as browsing capabilities, thesauri etc.) in a discovery service, because the items in a discovery service all come from different sources. Not without a huge amount of effort on crosswalks anyway, and even that might not be possible.

Similarly, one could never have extremely obscure, granular filters or features of use to just one discipline (e.g. a way to search for chemical reactions, as in SciFinder).

On the other hand, some features are, I think, relatively trivial and could be added. For example, when I was at the Greater China Serialssolutions meeting, a librarian asked for a sort-by-times-cited function similar to what is in Scopus.

On the practical level, I can see some possible obstacles to this feature, but the fundamental question is, should we include a feature set used by a very very small percentage of users?

Here's how I see it: because there is a limit to how many features you can squeeze into the discovery interface, I see 2 paths.

1. Put in a small set of most commonly used standard features from native databases  + unique features that only discovery services can have due to the scope, or to address specific issues of discovery.

2. Put in as many as possible of the standard features you see in most other native interfaces.

Path one would make discovery a unique tool in the toolkit of searchers that would complement native databases.

Path two would make the discovery search a closer substitute of native databases and reduce the need to use multiple tools.

Which path is better?

One wonders, though, what would happen if something like the SciVerse Application Framework, where developers can make plugins/gadgets for SciVerse Scopus and offer them for use, were implemented for discovery services.

Users could shop for and add the ones they want.

The above shows the SciVerse application gallery, where you can shop for apps you want to use with Scopus and ScienceDirect.

This allows a customization at the level of each user, so you wouldn't have a one size fits all interface. Institutions could set default applications/plugins but users would be free to turn on or off the ones they wanted.

There may be practical implications, though, in terms of reuse of data.

Undergraduates vs advanced users

Another dimension to this issue is, what is the target audience of discovery services?

Serialssolutions' line tends to be that Summon does not replace traditional databases (whether full text or A&I) and that there is always a place for them. The line is that Summon is good for undergraduates as a starting or jumping-off point to find good-enough articles, but that they should also go to databases for more serious research.

Ebscohost's EDS tries to differentiate itself by claiming its service is for everyone, not just undergraduates, fitting the needs of more advanced users as well.

A page titled "Beyond undergraduate" read: "If a discovery service should truly encompass a university's entire collection, shouldn't it also cater to its entire user base? With an unparalleled user experience and the inclusion of important subject indexes used at the graduate and post-graduate level via Inclusion of Subject Indexes, EDS is poised to debunk the myth that discovery is merely an “undergraduate” resource."

I would say you can see this focus not just in the emphasis on indexes but in the number of searching options built into EDS. Of course this is because EDS is built off the existing Ebsco platform, but as a librarian who loves control, I find the EDS interface appealing.

So we basically have a difference of philosophy here, it seems. EDS is designed for advanced users, with a multitude of search features, while Summon is designed with a smaller but well-chosen set of features that the majority use.

As we shall see later in #3, in general EDS also pushes users to discover content even if the library does not have direct access to it, a feature that is good for advanced users but confusing to less advanced ones.

The implication of this is that EDS would likely capture a greater portion of advanced users compared to Summon, because Summon lacks some of the advanced search features such users are used to.

NOTE: EDS is essentially already a database platform, though it does include non-Ebsco content, which explains why it has more native database features in the first place.

Would that imply that the displacement effect, where users stop searching in the native databases in favour of the discovery service, would be even stronger in EDS than in Summon? We already see this impact for Summon, but what about EDS? More research is needed here.

I can't imagine a world where EDS or any discovery service has totally killed off use of native interfaces, because some do provide very unique value propositions (e.g. SciFinder, PubMed, Scopus), but I do imagine a philosophy that tends towards EDS's could potentially be more disruptive to native interfaces than Summon's complementary-to-databases approach.

I would add that the earlier discussion of "standard features from native databases" vs "unique to discovery features" is probably independent of the target audience question as both approaches would be valuable to all types of users.

3. Should discovery services provide full text only  results or include results with abstract only by default? Should they only search the metadata or full-text by default?

Should discovery services show (by default) only items the library owns as a physical copy or has in full text, or should they show items that we may not have access to?

The argument for showing only items that are available in the library or full-text, is that the heaviest users of such services are undergraduates and they carry on the mindset from catalogues to library web scale discovery services (after all both are default searches), where everything you list in the catalogue, you can get.

So if you list items they can't get immediately (also, some libraries don't offer, or offer only limited, document delivery services or interlibrary loans to undergraduates), they get upset.

Summon itself by default mostly positions itself to show full-text only (either free or subscription matching your holdings), though there are exceptions where you will see "citation only" results.

For example, if you include abstracting and indexing databases (A&I databases), open access content, free packages which are "all or nothing" ("Zero title databases" in Serialssolutions speak) and other institutional repositories, you may see "citation only" results.

There are two situations where you may see citation online results:

a) Pre-login
b) Post login

Typically pre-login you won't see any citation online items, except in rare situations. Below shows one example.

The above shows an example of a "Citation online" result from the ERIC database in Summon pre-login. In general such examples are rare, as most A&Is do not allow display of their data without authentication.

An additional note: many institutional repository packages, journal packages and free A&Is may be "all or nothing" (e.g. ProQuest Dissertations and Theses, Henry Stewart Talks), so the moment you turn one on, you get the whole set of results whether you have access to each item or not. There is no matching to your holdings.

Most A&Is (Web of Science, MLA etc.) do not provide metadata for free and do not allow display of their results to unauthenticated users, so in Summon results from A&I databases are less problematic: users have to log in before searching to see them, something most users won't do, so only experienced users who go looking for them will find them.

Log in before searching and you will see additional results, some of which are "citation online" only.

Once you have logged in, you will notice additional results appearing compared to the earlier search. Some are full-text items, some are not.

Due to the desire of many libraries to show only full text even for A&I results, Serialssolutions introduced an "Exclude Citation Online Content" option, which hides all citation online results. It's unclear to me how many Summon libraries have turned this on.

Ebsco Discovery Service has a different model when dealing with A&I results that require authentication. Even pre-login, the results will show that a result has matched a subscribed A&I, but it won't give any details, not even the title, and encourages you to log in to see the citation.

As shown above, EDS will show results like #4, where you have to log in to see the item from a restricted A&I source (most likely, but not always, something you have no full-text access to). This is unlike Summon, where you would be unaware such a result exists unless you logged in first.

Which model is better?

Even leaving aside this difference with regard to A&I, it is somewhat interesting that EDS has a limiter called "Available in Library Collection" (this means physical and full text online).

It's somewhat confusing what it does (explanation here), but I would guess it would roughly be the flipside of Summon's "Add results beyond your library's collection", which will show the whole index of Summon (less restricted A&I) items.

So if in EDS you do not have "Available in Library Collection" on, you will see unsubscribed content from, say, JSTOR appearing, I believe.

If we go by the premise that the default settings are the ones that you don't need to turn on, it would seem that in EDS the recommended default is to show results even if they are not available in the library collection?

Of course, this might be a simple artifact of the existing Ebsco platform. In any case, some libraries, like the MIT library, currently do not have this on by default.

Overall though, it seems to me EDS libraries will usually tend to show more citation results to users than Summon, and it's unclear to me which model is actually better. I guess it depends on the sophistication of the users you are dealing with. EDS may be positioning itself to deal with more advanced users.

Matching on metadata vs fulltext

A somewhat smaller issue I have been pondering is whether the search should by default match full text + metadata or just metadata. While full-text searching is more powerful in theory, there are complaints that the relevancy ranking systems of discovery services are often not good enough and often surface irrelevant content because of some chance matching in the full text of books or articles.

I have seen EDS libraries with this option either on or off by default. Libraries that do not have "Searched full text of articles" on are presumably saying they don't want to rely on full-text matching because of poor results, particularly for known-item searches.

In Summon, there is generally no option to restrict a search to just metadata, though there was a recent algorithm change that, by default, would restrict full-text matching to keywords within 200 words of each other, which can help combat the issue where the keywords appear pages apart and have nothing to do with each other. Adding quotes (the sign of an "advanced user") would turn this function off. As of Oct 2013, this seems to have been removed.
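To illustrate the proximity idea, here is a minimal sketch of a check that all keywords occur within a window of 200 words. This is only my reading of the behaviour described above, not Summon's implementation; the function name `terms_within_window` and the simple word tokenisation are assumptions, and it handles single-word keywords only.

```python
import re
from typing import Dict, Iterable


def terms_within_window(text: str, terms: Iterable[str], window: int = 200) -> bool:
    """Return True if every term occurs inside some stretch of `window` words."""
    words = re.findall(r"\w+", text.lower())
    wanted = {term.lower() for term in terms}
    last_seen: Dict[str, int] = {}  # most recent position of each wanted term
    for position, word in enumerate(words):
        if word in wanted:
            last_seen[word] = position
            if len(last_seen) == len(wanted):
                # Tightest stretch ending here that still contains every term.
                span = position - min(last_seen.values())
                if span < window:
                    return True
    return False


# Two keywords separated by 300 filler words: a chance match, pages apart.
far_apart = "summon " + "filler " * 300 + "discovery"
close_together = "summon is a web scale discovery service"

print(terms_within_window(far_apart, ["summon", "discovery"]))
print(terms_within_window(close_together, ["summon", "discovery"]))
```

A document failing this check would still match on metadata; the restriction only stops widely separated full-text hits from counting as a match.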

4. Will Discovery services lead to the decline of Abstracting and Indexing services?

Are A & I Services in a Death Spiral? considers only the impact of Google Scholar, without even considering the effect of discovery services, which only hasten the trend.

Particularly as a new generation of researchers grows up always having access to full text, the idea of "abstract only" results is extremely alien. Even now, I get graduate students who express shock that there are results with no full text. "What is the point of including them then?" they ask.

Currently more A&I content is being fed into discovery systems (something that I wouldn't have expected), with Scopus and Web of Science working with the main discovery services. In addition, Summon itself now supports ERIC, MLA and over 100 A&I databases.

EDS covers even more A&I and boasts of Platform blending. It was explained to me by an EDS vendor that this was unique to EDS and the only way certain A&I content holders like APA would allow their content to be included in a discovery service. You might also want to see the following exchange of letters between Ebsco, Ex Libris and the Orbis Cascade Alliance saying the same thing.

So it seems the future of A&Is is secure, given that you can't cancel them without losing the content in discovery.

But the question is, how much of this content is actually unique? We no longer live in a world where you need to use an A&I to check if something exists. Most publishers of content are perfectly happy to push their metadata to as many places as possible, be it Google Scholar or discovery services.

In many cases, an unrestricted search of a discovery service's index ("search beyond library collection" in Summon) provides as good an index as the A&Is.

While it is true that a lot of this data is not retrospective and may miss out some of the more obscure content providers, as time passes this becomes a smaller problem, as discovery services become ever more encompassing, bringing in even non-English content providers.

A&Is do hold an edge in better indexing, but it's an open question how much this helps, which brings us to the next issue...

5.  How much metadata is needed for good relevancy? Is "thin metadata + full text" sufficient?

This is an age-old debate, with EDS obviously claiming metadata is of greater importance, given their better store of index terms compared to rivals. It is, however, extremely difficult to measure the additional relevancy boost that results from this, so it's unlikely we will see this resolved.

As stated in 8 things we know about web scale discovery systems in 2013, the results of head-to-head tests are mixed.

A common idea floating around is that while Google can do world-class relevancy ranking with mainly full text and little indexing, they have advantages libraries can't match, due to their willingness to track users and use signals that libraries can't and won't use.

6. Will Discovery services lead to the decline of OPAC? Or a new breath of life?

The traditional discovery service will harvest MARC records from the ILS and then display them in the discovery search results, but the moment you click on a result, it will direct you to the traditional catalogue.

Recently the thinking seems to be that this leads to poor usability as the user will suddenly be dropped into a totally different interface that can be jarring.

There seem to be three approaches to solving this.

a. Library discovery vendors who are already ILS vendors, e.g. Ex Libris, offer a combined product with Primo Central.

b. Library discovery vendors partner with ILS vendors, e.g. Ebsco Discovery Service partners with Innovative Interfaces' Encore and other ILSs.

c. Libraries use open source interfaces, e.g. VuFind, and pipe in discovery index results. This is basically a DIY version of b.

All three approaches are interesting in that they make the library catalogue the base and pipe in results from the discovery service index. You get back the familiar catalogue (hopefully a next-generation catalogue) interface, and you can do loan-related transactions directly from the interface (e.g. place holds).

There is also no time delay between cataloguing a record in your ILS and it showing up in your discovery service.

The interesting thing that occurs to me with this arrangement is, how would relevancy be done? Would we really be talking of one combined index of catalogue results and discovery results?

Presumably this would be the case in option a) where the ILS and discovery vendor are the same. But in cases of b) and c) where the API is used, it seems to me relevancy would be a bit more difficult.
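To show why this is harder over an API, consider that the catalogue and the discovery index each return their own ranked list, with relevancy scores that are not comparable. One generic way to combine such lists without comparing scores is reciprocal rank fusion, which uses only each item's rank position. The sketch below is purely illustrative: I am not suggesting any vendor or open source tool does this, and the record identifiers are made up.

```python
from typing import Dict, List


def reciprocal_rank_fusion(ranked_lists: List[List[str]], k: int = 60) -> List[str]:
    """Merge ranked lists using rank positions only, ignoring each system's scores.

    Each item earns 1 / (k + rank) from every list it appears in; k = 60 is a
    commonly used constant. Ties are broken alphabetically for determinism.
    """
    scores: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, item in enumerate(ranked, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical identifiers: "c" records come from the catalogue, "d" from discovery.
catalogue_hits = ["c1", "c2", "c3"]
discovery_hits = ["d1", "c2", "d2"]  # c2 is also found in the discovery index

merged = reciprocal_rank_fusion([catalogue_hits, discovery_hits])
print(merged)
```

Even here, the merged order only reflects positions in each list, which is a much weaker signal than ranking everything within one combined index.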


This has been a long rambling post, hope it was of some value.

BTW, if you want to keep up with articles, blog posts, videos etc. on web scale discovery, do consider subscribing to my custom magazine on Flipboard or looking at the bibliography on web scale discovery services.

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.