Retrieval

After we ingest our higher-level objects into Elasticsearch, we can use all of the tools that come with Elasticsearch to retrieve them.

More recently, two-stage retrieval systems, which apply a deep learning reranking model on top of the retrieved document results, have been shown to perform better both for basic information retrieval and for downstream tasks such as question answering.

We also deploy a two-stage system with a reranking model. Our reranker uses BERT-Large as its base architecture and by default is trained on MS MARCO. Making it easy to train this model on user feedback is on our roadmap.

Our setting differs slightly from the traditional document retrieval setting, and also from settings such as question answering. As in question answering, we retrieve relatively short contexts, but our contexts are not composed purely of text sequences. Also unlike question answering, we seek to return interesting information, not necessarily the specific answer to a user’s query.

With this last point in mind, we prioritize diversity in the returned PDFs. To do this, we use Elasticsearch to retrieve a set of N documents, matching the query against all the text content of each document. We then find all objects of the type specified by the query in these returned documents and run the reranker on all of those objects.

Instead of returning this reranked list directly, we filter it so that only the top-ranking object for each of the initial N documents remains. In this way, we end up with a ranked list of the initial N documents, ordered by how informative the most informative object in each document is.
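The filtering step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: the names ScoredObject, top_object_per_document, and toy_overlap_score are hypothetical, and the toy word-overlap scorer merely stands in for the BERT-based reranker.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class ScoredObject(NamedTuple):
    pdf_name: str   # source document the object came from
    content: str    # text content of the detected object
    score: float    # reranker score (higher means more relevant)


def top_object_per_document(
    query: str,
    objects: List[Tuple[str, str]],
    rerank_score: Callable[[str, str], float],
) -> List[ScoredObject]:
    """Score every (pdf_name, content) pair and keep the best object per PDF."""
    best: Dict[str, ScoredObject] = {}
    for pdf_name, content in objects:
        candidate = ScoredObject(pdf_name, content, rerank_score(query, content))
        current = best.get(pdf_name)
        if current is None or candidate.score > current.score:
            best[pdf_name] = candidate
    # Rank the documents by the score of their single best object.
    return sorted(best.values(), key=lambda obj: obj.score, reverse=True)


def toy_overlap_score(query: str, content: str) -> float:
    """Hypothetical stand-in scorer: fraction of query words found in content."""
    query_words = set(query.lower().split())
    content_words = set(content.lower().split())
    return len(query_words & content_words) / len(query_words) if query_words else 0.0


if __name__ == "__main__":
    candidates = [
        ("a.pdf", "ice sheet mass table"),
        ("a.pdf", "unrelated figure"),
        ("b.pdf", "ice core"),
    ]
    for obj in top_object_per_document("ice sheet", candidates, toy_overlap_score):
        print(obj.pdf_name, obj.content, obj.score)
```

Because only one object survives per document, the length of the result never exceeds N, and a document with many moderately relevant objects cannot crowd out other documents.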

We paginate at the document level. If you retrieve the first 25 documents, you get a ranked list of those 25 documents according to Elasticsearch retrieval plus reranking. If you ask for more results, the next page of objects comes from the next 25 documents, and thus has no overlap with the first page of results.

In this way, you can scroll through hundreds of documents, finding fresh relevant objects to explore.
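One way to picture document-level pagination is as a sliding window over the first-stage document results. The sketch below assumes the window is expressed with the standard Elasticsearch `from` and `size` search parameters; whether the service does exactly this is an assumption, and document_page_window is a hypothetical helper name.

```python
from typing import Dict


def document_page_window(page: int, page_size: int = 25) -> Dict[str, int]:
    """Return the `from`/`size` window for a zero-based page of documents."""
    if page < 0:
        raise ValueError("page must be non-negative")
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Page 0 covers documents 0..24, page 1 covers 25..49, and so on,
    # so consecutive pages never share a document.
    return {"from": page * page_size, "size": page_size}


if __name__ == "__main__":
    print(document_page_window(0))
    print(document_page_window(1))
```

Since reranking and top-object filtering run only on the documents inside one window, objects on different pages always come from disjoint sets of documents.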

Elasticsearch Index - Fields

Object fields

| Field | Type | Description |
| --- | --- | --- |
| area | integer | Size (in sq. pixels) of identified area |
| cls | text | Detected object class (figure, table, body text, section header, etc.) |
| content | text | Text content within the detected object |
| context_from_text | text | Context surrounding mentions of identified object within the body text. (Experimental) |
| dataset_id | text | (COSMOS internal) Identifier for dataset |
| detect_score | float | Score (strength) of detected classification prediction |
| full_content | text | Combined content of the object and its associated parent/header |
| header_content | text | Text content of the associated parent/header object (caption, section header) |
| img_pth | text | Path to image file on disk |
| pdf_name | text | Name of source PDF |
| postprocess_score | float | Confidence of postprocess detection correction process |

Entity fields

| Field | Type | Description |
| --- | --- | --- |
| aliases | text | Aliases for known entities |
| canonical_id | text | Canonical ID for known entities (e.g. UMLS id) |
| description | text | Description of known entities |
| name | text | Name of known entity |
| types | keyword | Type of entity |
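
As an illustration of how the object fields listed above might be queried, the sketch below builds an Elasticsearch query body that matches free text against full_content, restricts results to one object class through cls, and drops low-confidence detections through detect_score. The helper name build_object_query, the example class value, and the threshold are assumptions; the queries the service actually issues may differ.

```python
from typing import Any, Dict


def build_object_query(
    text: str,
    object_class: str,
    min_detect_score: float = 0.0,
) -> Dict[str, Any]:
    """Build a bool query over the object fields of the index."""
    return {
        "query": {
            "bool": {
                # Scored clause: relevance of the object's combined content.
                "must": [{"match": {"full_content": text}}],
                # Unscored clauses: restrict by class and detection confidence.
                "filter": [
                    {"match": {"cls": object_class}},
                    {"range": {"detect_score": {"gte": min_detect_score}}},
                ],
            }
        }
    }


if __name__ == "__main__":
    import json

    # "table" and 0.5 are illustrative values only.
    print(json.dumps(build_object_query("ice sheet mass balance", "table", 0.5), indent=2))
```

The resulting dictionary can be passed as the body of a search request; clauses under filter restrict the result set without contributing to the relevance score.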