Getting Started
===============

To get started with COSMOS, we strongly suggest using one of our prebuilt Docker_ images. These handle building dependencies and running your documents through the pipeline, without you having to worry about installing anything locally.

First, clone the COSMOS repository and change into its root directory. Create a file named ``.env`` in that directory and populate it with the following:

.. code-block:: console

    BASE_IMAGE=uwcosmos/cosmos-base
    DETECT_IMAGE=uwcosmos/cosmos-ingestion
    WORKER_IMAGE=uwcosmos/cosmos-ingestion
    RETRIEVAL_IMAGE=uwcosmos/cosmos-retrieval
    EXTRACTION_IMAGE=ankurgos/cosmos-extraction:latest
    VISUALIZER_IMAGE=uwcosmos/visualizer_kb:latest
    UPLOAD_IMAGE=iaross/cosmos-api:latest
    API_IMAGE=iaross/cosmos-api:latest
    LINKING_IMAGE=uwcosmos/cosmos-linking
    DETECT_PROCS=1
    WORKER_PROCS=8
    DEVICE=cuda
    RERANKING_DEVICE=cuda
    #DEVICE=cpu
    #RERANKING_DEVICE=cpu
    SCHEDULER_ADDRESS=scheduler:8786
    ELASTIC_ADDRESS=es01:9200
    INPUT_DIR=/path/to/input/dir
    TMP_DIR=/path/to/tmp/dir
    OUTPUT_DIR=/path/to/output/dir
    ELASTIC_DATA_PATH=/path/to/es/directory

The ``*_IMAGE`` lines at the top of the file define which images to use. The default COSMOS images assume access to a CUDA-enabled GPU. To run on a CPU instead, append ``-cpu`` to each image name; for example, change ``uwcosmos/cosmos-base:latest`` to ``uwcosmos/cosmos-base-cpu:latest``. If you use the CPU images, make sure to also change the ``*_DEVICE`` variables from ``cuda`` to ``cpu``.

Depending on your machine, you can scale the process by setting ``DETECT_PROCS`` and ``WORKER_PROCS`` to the desired number of processes.

Finally, make sure to set the four directory variables at the end of the file:

1. ``INPUT_DIR``: an input directory containing your PDFs (all PDFs will need to be renamed to valid document IDs)
2. ``TMP_DIR``: a temporary directory with sufficient disk space for writing images and other intermediate artifacts
3. ``OUTPUT_DIR``: an output directory where results will be written
4. ``ELASTIC_DATA_PATH``: a directory for ElasticSearch data; its permissions must allow read/write by any user in order to accommodate ElasticSearch's permissions model (e.g. ``sudo chmod 777 /path/to/es/directory``)

Also create an ``images/`` directory within your output directory so that your user has access to all of the output.

If you are working on a remote server, launch ``tmux`` (or another terminal multiplexer) if you have not already, as you will need several panes to run the various docker-compose applications.

To process the documents, run the following:

.. code-block:: console

    docker-compose -f deployment/docker-compose-ingest.yml -p cosmos up

The docker-compose ingest application will process all of the documents and then idle once it is finished. You can ``Ctrl-C`` out of it at that stage.

Newer versions of docker-compose (starting with v1.27) require a slightly different syntax for declaring GPU resources, and require the ``.env`` file to be placed within the ``./deployment`` directory. A version of the docker-compose definitions compatible with these newer versions is supplied (``docker-compose-ingest-newdc.yml``) and can be invoked identically:

.. code-block:: console

    docker-compose -f deployment/docker-compose-ingest-newdc.yml -p cosmos up

The output directory you defined will now be populated with a set of Parquet_ files, as well as an ``images`` directory containing object-specific images, and saved word embeddings over the input corpus.

In the output images directory (``/path/to/output/dir/images``), run the following to move each file into a folder named after the first two characters of its file name:

.. code-block:: console

    for f in *.png; do [ ! -d ${f:0:2} ] && mkdir ${f:0:2}; mv $f ${f:0:2}/$f; done
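If you want to sanity-check the ingestion output before moving on, you can peek at the Parquet files directly. The snippet below is a minimal sketch, assuming ``pandas`` and ``pyarrow`` are installed on the host and that ``/path/to/output/dir`` is the ``OUTPUT_DIR`` from your ``.env`` file; the exact file names and columns will depend on your corpus.

.. code-block:: python

    # Minimal sketch: list the Parquet files produced by ingestion and
    # print their row counts and column names. Assumes pandas + pyarrow
    # are installed locally; replace /path/to/output/dir with OUTPUT_DIR.
    import glob

    import pandas as pd

    for path in sorted(glob.glob("/path/to/output/dir/*.parquet")):
        df = pd.read_parquet(path)
        print(f"{path}: {len(df)} rows")
        print(f"  columns: {list(df.columns)}")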
Entity linking
--------------

Once the documents have been ingested, a separate process can be run to recognize named entities within the extracted objects and link them to the Unified Medical Language System (UMLS). Named entity recognition and UMLS linking are accomplished via SciSpacy_. The Parquet files will be modified to include linked and unlinked entities, and an additional Parquet file will be created containing canonical information for the linked entities. To run the entity linking pipeline:

.. code-block:: console

    docker-compose -f deployment/docker-compose-link.yml -p cosmos up

Deploying the COSMOS search interface
-------------------------------------

To deploy the search interface over this processed dataset, run the following command:

.. code-block:: console

    docker-compose -f deployment/docker-compose-api.yml -p cosmos up

This will create an ElasticSearch_ service for you. If you already have an ElasticSearch cluster online, see :ref:`Existing ElasticSearch Cluster`. Its data path will be set to the ``ELASTIC_DATA_PATH`` environment variable from the ``.env`` file.

While this is running, ``docker ps`` should show five containers up. They are named:

1. cosmos_rerank_model_1
2. cosmos_front_end_1
3. cosmos_scheduler_1
4. cosmos_birdnest_backend
5. es01

If you see an es01 failure in the logs at this stage, it is most likely because the permissions on ``ELASTIC_DATA_PATH`` were not set appropriately, as mentioned above. See the :doc:`troubleshooting` guide for details.

Keep the docker-compose API application running in its own pane/terminal. Start a new pane/terminal for the next step.

Reading ingested data into ElasticSearch
----------------------------------------

The files output by ingestion must now be read into ElasticSearch indices. In a separate window, run:

.. code-block:: console

    docker-compose -f deployment/docker-compose-api-upload.yml -p cosmos up

This could take some time, depending on the size of your document collection.

Note that the ElasticSearch data will persist even if you re-run the ingest, API, and upload applications. To make sure you are only retrieving the latest ElasticSearch data, delete, re-create, and reset the permissions on the ``ELASTIC_DATA_PATH`` directory between cycles.

Viewing output
--------------

You should now be able to navigate to ``localhost:8082`` in your browser to access the search interface.

.. _Docker: https://www.docker.com/
.. _Parquet: https://parquet.apache.org/
.. _ElasticSearch: https://www.elastic.co/home
.. _SciSpacy: https://allenai.github.io/scispacy/

Problems
--------

Common errors and issues, along with their solutions, are detailed in the :doc:`troubleshooting` guide.
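If the search interface comes up but returns no results, one quick sanity check is to confirm that the upload step actually created ElasticSearch indices. The snippet below is a rough sketch, assuming the ``es01`` container publishes port 9200 on ``localhost`` (adjust the host and port if your docker-compose setup maps them differently); it simply prints ElasticSearch's index listing.

.. code-block:: python

    # Rough sketch: ask ElasticSearch to list its indices, to confirm that
    # the upload step populated them. Assumes es01 is reachable at
    # localhost:9200; adjust the URL if your setup publishes a different port.
    from urllib.request import urlopen

    with urlopen("http://localhost:9200/_cat/indices?v") as resp:
        print(resp.read().decode())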