DocumentStores

You can think of the DocumentStore as a "database" that:

stores your texts and meta data
provides them to the retriever at query time

There are different DocumentStores in Haystack to fit different use cases and tech stacks.

Initialisation

Initialising a new DocumentStore within Haystack is straightforward.

Elasticsearch

Install Elasticsearch and then start an instance.

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull docker.elastic.co/elasticsearch/elasticsearch:7.9.2
docker run -d -p 9200:9200 -e "discovery.type=single-node" docker.elastic.co/elasticsearch/elasticsearch:7.9.2

Next you can initialize the Haystack object that will connect to this instance.

from haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
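A freshly started container can take a while before it accepts connections. The sketch below is one possible way to wait for that, using only the Python standard library. `wait_for_port` is a hypothetical helper, not part of Haystack, and `localhost:9200` is assumed from the port mapping in the `docker run` command above. An open port only shows that something is listening; it does not guarantee that Elasticsearch has finished starting up.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Example call (assumes the container from the docker run command above):
# wait_for_port("localhost", 9200)
```

If the call returns True, initializing the document store as shown above is more likely to succeed on the first attempt.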

Open Distro for Elasticsearch

Learn how to get started here

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull amazon/opendistro-for-elasticsearch:1.13.2
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" amazon/opendistro-for-elasticsearch:1.13.2

Next you can initialize the Haystack object that will connect to this instance.

from haystack.document_store import OpenDistroElasticsearchDocumentStore

document_store = OpenDistroElasticsearchDocumentStore()

OpenSearch

Learn how to get started here

If you have Docker set up, we recommend pulling the Docker image and running it.

docker pull opensearchproject/opensearch:1.0.1
docker run -p 9200:9200 -p 9600:9600 -e "discovery.type=single-node" opensearchproject/opensearch:1.0.1

Next you can initialize the Haystack object that will connect to this instance.

from haystack.document_store import OpenSearchDocumentStore

document_store = OpenSearchDocumentStore()

Milvus

Follow the official documentation to start a Milvus instance via Docker. Note that we also have a utility function haystack.utils.launch_milvus that can start up a Milvus instance.

You can initialize the Haystack object that will connect to this instance as follows:

from haystack.document_store import MilvusDocumentStore

document_store = MilvusDocumentStore()

FAISS

The FAISSDocumentStore requires no external setup. Start it by simply using this line.

from haystack.document_store import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

Save & Load

FAISS document stores can be saved to disk and reloaded:

from haystack.document_store import FAISSDocumentStore

document_store = FAISSDocumentStore(faiss_index_factory_str="Flat")

# Generates two files: my_faiss_index.faiss and my_faiss_index.json
document_store.save("my_faiss_index.faiss")

# Looks for the two files generated above
new_document_store = FAISSDocumentStore.load("my_faiss_index.faiss")

assert new_document_store.faiss_index_factory_str == "Flat"

While my_faiss_index.faiss contains the index, my_faiss_index.json contains the parameters used to initialize it (like faiss_index_factory_str). This configuration file is necessary for load() to work. It simply contains the initial parameters in JSON format. For example, a hand-written configuration file for the above FAISS index could look like:

{
    "faiss_index_factory_str": "Flat"
}
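As a rough illustration of what such a configuration file holds, the following sketch writes and reads one with Python's `json` module. It is not Haystack's own save or load code. The key mirrors the `faiss_index_factory_str` constructor argument used in the example above, and the temporary directory is only an assumption so that the sketch does not touch real files.

```python
import json
import os
import tempfile

# Parameter passed to the FAISSDocumentStore constructor in the example above.
init_params = {"faiss_index_factory_str": "Flat"}

# Hypothetical location: a temporary directory instead of the working directory.
config_path = os.path.join(tempfile.mkdtemp(), "my_faiss_index.json")

# Write the parameters as JSON, the format described above.
with open(config_path, "w") as f:
    json.dump(init_params, f, indent=4)

# Read them back, roughly what a loader needs in order to rebuild the store.
with open(config_path) as f:
    loaded = json.load(f)

print(loaded)
```

Note that valid JSON requires double-quoted keys and string values.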

In Memory

The InMemoryDocumentStore() requires no external setup. Start it by simply using this line.

from haystack.document_store import InMemoryDocumentStore

document_store = InMemoryDocumentStore()

SQL

The SQLDocumentStore requires SQLite, PostgreSQL or MySQL to be installed and started. Note that SQLite already comes packaged with most operating systems.

from haystack.document_store import SQLDocumentStore

document_store = SQLDocumentStore()

Weaviate

The WeaviateDocumentStore requires a running Weaviate Server. You can start a basic instance like this (see the Weaviate docs for details):

    docker run -d -p 8080:8080 --env AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED='true' --env PERSISTENCE_DATA_PATH='/var/lib/weaviate' semitechnologies/weaviate:1.4.0

Afterwards, you can use it in Haystack:

from haystack.document_store import WeaviateDocumentStore

document_store = WeaviateDocumentStore()

Each DocumentStore constructor allows for arguments specifying how to connect to existing databases and the names of indexes. See API documentation for more info.

Input Format

DocumentStores expect Documents in dictionary form, as in the example below. They are loaded using the DocumentStore.write_documents() method. See Preprocessing for more information on the cleaning and splitting steps that will help you maximize Haystack's performance.

from haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore()
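dicts = [
    {
        'text': "DOCUMENT_TEXT_HERE",  # replace with the document's text
        'meta': {'name': "DOCUMENT_NAME"}  # add further metadata fields as needed
    },
    # ... one dictionary per document
]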
document_store.write_documents(dicts)
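If your texts start out as plain strings, a small helper can put them into this shape. The sketch below is only an illustration: `to_document_dicts` and the sample file names and texts are hypothetical and not part of Haystack.

```python
from typing import Dict, List


def to_document_dicts(texts_by_name: Dict[str, str]) -> List[dict]:
    """Turn a {name: text} mapping into the list-of-dicts format shown above."""
    return [
        {"text": text, "meta": {"name": name}}
        for name, text in texts_by_name.items()
    ]


# Made-up sample data for illustration.
dicts = to_document_dicts({
    "berlin.txt": "Berlin is the capital of Germany.",
    "paris.txt": "Paris is the capital of France.",
})
print(dicts[0])
```

The resulting list can then be passed to write_documents() as in the example above.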

Writing Documents (Sparse Retrievers)

Haystack lets you store documents in an optimised fashion so that query times can be kept low. For sparse, keyword-based retrievers such as BM25 and TF-IDF, you simply have to call DocumentStore.write_documents(). The inverted index, which speeds up querying, is created automatically.

document_store.write_documents(dicts)
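To give an intuition for what an inverted index is, here is a toy sketch in plain Python. It is not how Elasticsearch or Haystack build their indexes: tokenization is a naive lowercase whitespace split, there is no scoring such as BM25, and the function names and sample texts are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(texts: List[str]) -> Dict[str, Set[int]]:
    """Map each lowercased, whitespace-separated token to the ids of texts containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in enumerate(texts):
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def lookup(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of texts that contain every token of the query."""
    token_sets = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*token_sets) if token_sets else set()


texts = [
    "Berlin is the capital of Germany",
    "Paris is the capital of France",
    "Germany borders France",
]
index = build_inverted_index(texts)
print(lookup(index, "capital germany"))
```

Because each token points directly at the texts that contain it, a query only touches the entries for its own tokens instead of scanning every document.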

Writing Documents (Dense Retrievers)

For dense, neural-network-based retrievers such as Dense Passage Retrieval or Embedding Retrieval, indexing involves computing the Document embeddings, which are then compared against the Query embedding.

The storing of the text is handled by DocumentStore.write_documents() and the computation of the embeddings is started by DocumentStore.update_embeddings().

document_store.write_documents(dicts)
document_store.update_embeddings(retriever)

This step is computationally intensive since it engages the transformer-based encoders. GPU acceleration will speed it up significantly.
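The comparison between the Query embedding and the Document embeddings can be pictured with a toy sketch. The vectors below are hand-written stand-ins for encoder output, and the dot product is just one common similarity measure; which measure is actually used depends on the retriever and the document store configuration. The function names are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_emb: Sequence[float], doc_embs: List[Sequence[float]]) -> List[int]:
    """Return document indices sorted from most to least similar to the query."""
    scores = [dot(query_emb, doc_emb) for doc_emb in doc_embs]
    return sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional embeddings; real encoders produce much larger vectors.
doc_embeddings = [
    [0.9, 0.1, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.9],
]
query_embedding = [0.2, 0.9, 0.0]

print(rank_by_similarity(query_embedding, doc_embeddings))  # prints [1, 0, 2]
```

Computing the Document embeddings once at indexing time means that only the Query has to be encoded when a question comes in.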

Choosing the Right Document Store

The Document Stores have different characteristics. You should choose one depending on the maturity of your project, the use case and technical environment:

Elasticsearch

Pros:

Fast & accurate sparse retrieval with many tuning options
Basic support for dense retrieval
Production-ready

Cons:

Slow for dense retrieval with more than ~1 million documents

Open Distro for Elasticsearch

Pros:

Fully open source (Apache 2.0 license)
Essentially the same features as Elasticsearch

Cons:

Slow for dense retrieval with more than ~1 million documents

OpenSearch

Pros:

Fully open source (Apache 2.0 license)
Essentially the same features as Elasticsearch
Has more support for vector similarity comparisons and approximate nearest neighbours algorithms

Cons:

Not as optimized as dedicated vector similarity options like Milvus and FAISS

Milvus

Pros:

Scalable DocumentStore that excels at handling vectors (hence suited to dense retrieval methods like DPR)
Encapsulates multiple ANN libraries (e.g. FAISS and ANNOY) and provides added reliability
Runs as a separate service (e.g. a Docker container)
Allows dynamic data management

Cons:

No efficient sparse retrieval

FAISS

Pros:

Fast & accurate dense retrieval
Highly scalable due to approximate nearest neighbour algorithms (ANN)
Many options to tune dense retrieval via different index types (more info here)

Cons:

No efficient sparse retrieval

In Memory

Pros:

Simple
Exists already in many environments

Cons:

Only compatible with minimal TF-IDF Retriever
Bad retrieval performance
Not recommended for production

SQL

Pros:

Simple & fast to test
No database requirements
Supports MySQL, PostgreSQL and SQLite

Cons:

Not scalable
Not persisting your data on disk

Weaviate

Pros:

Simple vector search
Stores everything in one place: documents, meta data and vectors - so less network overhead when scaling this up
Allows combination of vector search and scalar filtering, i.e. you can filter for a certain tag and do dense retrieval on that subset

Cons:

Fewer options for ANN algorithms than FAISS or Milvus
No BM25 / TF-IDF retrieval

Our Recommendations

Restricted environment: Use the InMemoryDocumentStore, if you are just giving Haystack a quick try on a small sample and are working in a restricted environment that complicates running Elasticsearch or other databases

Allrounder: Use the ElasticsearchDocumentStore, if you want to evaluate the performance of different retrieval options (dense vs. sparse) and are aiming for a smooth transition from PoC to production

Vector Specialist: Use the MilvusDocumentStore, if you want to focus on dense retrieval and possibly deal with larger datasets
