Retrieval ========= After we ingest our higher level objects to ElasticSearch, we can use all of the associated tools that come with ElasticSearch to retrieve objects. More recently, two stage retrieval systems, which deploy a deep learning reranking model on top of retrieved document results, have shown to be better performing for both basic information retrieval and also for downstream tasks such as question answering. We also deploy a two stage reranking model. Our reranker uses BERT-Large as its base architecture, and by default is trained on MS-Marco. On our roadmap is easily training this model on user feedback. Our setting is slightly different from the traditional document retrieval setting, and also different from settings such as question answering. Like question answering, we are retrieving relatively short contexts, but the contexts are composed of not pure text sequences. Also unlike question answering, we are seeking to return interesting information, not necessarily the specific answer to a user's query. With this last point in mind, we prioritize diversity in returned PDFs. To do this, we use ElasticSearch to retrieve a set of N documents, given all the text content in that document. We find all objects of the type defined by the query on these returned documents. We then run reranking on all these objects. Instead of returning this reranked list, we choose to filter the list such that only the top ranking object for each of the initial N documents remains. In this way, we end up with a ranked list of the initial N documents based on how informative the "most" informative object in that document is. For pagination, we paginate at the document level. If you retrieve the first 25 documents, you will get a ranked list of those 25 documents according to ElasticSearch + reranking. If you ask for more results, the next page of objects will come from the next 25 documents, and thus will have no overlap with the first page of results. In this way, you can scroll through hundreds of documents, finding fresh relevant objects to explore. Elasticsearch Index - Fields ============================ +-------------------+---------+----------------------------------------------------------------------------------------+ | Field | Type | Description | +===================+=========+========================================================================================+ | Object fields | +-------------------+---------+----------------------------------------------------------------------------------------+ | area | integer | size (in sq. pixels) of identified area | +-------------------+---------+----------------------------------------------------------------------------------------+ | cls | text | detected object class (figure, table, body text, section header, etc) | +-------------------+---------+----------------------------------------------------------------------------------------+ | content | text | Text content within the detected object | +-------------------+---------+----------------------------------------------------------------------------------------+ | context_from_text | text | Context surrounding mentions of identified object within the body text. (Experimental) | +-------------------+---------+----------------------------------------------------------------------------------------+ | dataset_id | text | (COSMOS internal) - Identifier for dataset. | +-------------------+---------+----------------------------------------------------------------------------------------+ | detect_score | float | Score (strength) of detected classification prediciton | +-------------------+---------+----------------------------------------------------------------------------------------+ | full_content | text | Combined content of the object and its associated parent/header | +-------------------+---------+----------------------------------------------------------------------------------------+ | header_content | text | Text content of the associated parent/header object (caption, section header) | +-------------------+---------+----------------------------------------------------------------------------------------+ | img_pth | text | Path to image file on disk | +-------------------+---------+----------------------------------------------------------------------------------------+ | pdf_name | text | Name of source PDF | +-------------------+---------+----------------------------------------------------------------------------------------+ | postprocess_score | float | Confidence of postprocess detection correction process | +-------------------+---------+----------------------------------------------------------------------------------------+ | Entity fields | +-------------------+---------+----------------------------------------------------------------------------------------+ | aliases | text | Aliases for known entities | +-------------------+---------+----------------------------------------------------------------------------------------+ | canonical_id | text | Canonical ID for known entities (e.g. UMLS id) | +-------------------+---------+----------------------------------------------------------------------------------------+ | description | text | Description of known entities | +-------------------+---------+----------------------------------------------------------------------------------------+ | name | text | Name of known entity | +-------------------+---------+----------------------------------------------------------------------------------------+ | types | keyword | Type of entity | +-------------------+---------+----------------------------------------------------------------------------------------+