Milvus with Docker and Python

TL;DR

Learn how to build and use a vector database server to store and search your documents on your local machine. Then, you can use this setup as a tool to give to locally run AI agents.

About This Project

Here, we're going to set up a Milvus server in Docker so we can use the vectorstore on our local machines. We'll see how to add some example data to the vectorstore and then how to search the database for a given query. Once our Milvus server is set up properly, we can also check out some useful information about our client sessions by navigating to http://localhost:9091/webui.

The code we learn and use here will serve as the foundation for an indispensable tool to give to our agents, allowing them to obtain information about our personal data. To see what sorts of agents we'll build to use this tool and others, check out the agents tutorials.

As previously mentioned, the way we're going to build agents is by first building local servers for all the gadgets that our agents will need. Then, we can learn how to pass these gadgets over to our agents with LangChain and LangGraph.

We've gone over how to create an Ollama server to chat with LMs and a SearXNG server to search the web. Now, we're going to follow the same sort of process to create the Milvus server. We'll learn how to set up and use the provided Python code, built on the PyMilvus library, to interact with the server, then dive into the code to see how it all works. This time, we'll also dive into the math behind the search process to better understand our results.

A vector database, or a vectorstore, is just a special type of database that can be used to store and query your data. When data is added to the vectorstore, it's stored with additional representations called embeddings, which map the data in ways that allow relevant information to be obtained in searches. There are various types of embeddings that represent data differently, and we'll get a look at how to use one of them in this tutorial: the sparse embedding, or sparse vector¹.
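
To make the idea of a sparse vector a bit more concrete, here's a minimal sketch in plain Python. It is not how Milvus builds its embeddings; the `tokenize`, `build_vocab`, and `sparse_embed` helpers and the two sample documents are made up for illustration. The point is only that a sparse vector stores the few non-zero dimensions (here, word counts) and leaves everything else out.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_vocab(docs):
    """Assign each distinct token an integer dimension, in first-seen order."""
    vocab = {}
    for doc in docs:
        for token in tokenize(doc):
            vocab.setdefault(token, len(vocab))
    return vocab

def sparse_embed(text, vocab):
    """Map text to {dimension: count}, keeping only the non-zero entries."""
    counts = Counter(tokenize(text))
    return {vocab[t]: c for t, c in counts.items() if t in vocab}

# Hypothetical sample documents
docs = ["grocery list: bananas, bread", "study list: langchain"]
vocab = build_vocab(docs)
vec = sparse_embed("bread and more bread", vocab)
print(len(vocab), vec)
```

Words that never appeared in the documents ("and", "more") have no dimension, so they simply don't show up in the vector.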

Vectorstores also use special structures called indices, which describe how the data should be organized for quick searches. By defining different embeddings and indices, we can represent and search our data in different ways. We can have very different types of data (e.g. images or code documents) and still be able to find relevant information from our queries if we choose proper embeddings and indices.
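
As a rough picture of what an index buys us, here's a toy inverted index in plain Python. This is only a sketch of the general idea (look up a term, get back the documents that contain it) and says nothing about how Milvus implements its own indices; the `build_inverted_index` helper and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in text.lower().split():
            index[token].add(doc_id)
    return {token: sorted(ids) for token, ids in index.items()}

# Hypothetical sample documents
docs = [
    "milvus stores vectors",
    "docker runs milvus",
    "python talks to docker",
]
index = build_inverted_index(docs)
print(index["milvus"])  # ids of the documents containing "milvus"
print(index["docker"])  # ids of the documents containing "docker"
```

With the index in place, finding the documents for a term is a dictionary lookup rather than a scan over every document.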

One way to search data is exact keyword filtering. For example, if the search query has the word Milvus in it, the most relevant data entries found will be those that contain the word Milvus. The search that we'll do in this tutorial, a full-text search, is similar in that it matches the keywords of the query, using sparse vectors to do so.

However, this type of search also ranks documents for a given query, and so it's probably better defined as a ranking algorithm. There's quite a lot of nuance here that can be worked through, and we'll get to do just that when we dive into the code and the math.
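
To see why this is ranking rather than filtering, here's a minimal sketch of the textbook Okapi BM25 score in plain Python. It uses whitespace tokenization and one common variant of the IDF term, so treat it as an illustration under those assumptions and not as the exact computation Milvus performs. The `k1` and `b` arguments play the same role as the `bm25_k1` and `bm25_b` index parameters that appear in the code later on.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            # Rare terms get a larger inverse document frequency
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # b controls how strongly long documents are penalized
            norm = 1 - b + b * len(d) / avg_len
            # k1 controls how quickly repeated terms stop adding to the score
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * norm)
        scores.append(score)
    return scores

docs = [
    "information retrieval is a field of study",
    "data mining and information retrieval overlap in research",
    "the rest of the lyrics go",
]
scores = bm25_scores("information retrieval", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

Both of the first two documents contain the query terms, but the shorter one scores slightly higher because of the length normalization, while the third document scores zero.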

Besides full-text searches, another type of search is the semantic search. These searches find data based on relations that are, much of the time, not readily apparent. Instead of simply finding the data that has a particular keyword in it, these searches find data that has complex relations to the query, which is made possible by describing the data with dense embeddings, or dense vectors².

Using Milvus as a vectorstore to perform these types of searches is an obvious choice. As we'll see here and in the document agent tutorial, it's a powerhouse for storing and searching data. As for all the various types of searches that we discussed above, Milvus already has a lot of these primed and ready for us to use.

With Milvus, we could choose to do either a full-text or a semantic search, depending on our preferences. We could also perform a hybrid search, meaning we can query the data on both sparse and dense vectors at the same time, performing a full-text search (exact keyword) and a semantic search (complex relationships) simultaneously while utilizing ranking algorithms to further increase the relevancy of our results.
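
One common way to combine two result lists like this is reciprocal rank fusion (RRF). The sketch below is plain Python and is only meant to show the idea; this tutorial doesn't say which ranking algorithm a hybrid search will use, and the document ids, the two rankings, and the `k=60` constant are all illustrative assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list, best first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for ids it ranks near the top
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)

full_text_ranking = ["doc_a", "doc_b", "doc_c"]  # hypothetical keyword results
semantic_ranking = ["doc_c", "doc_a", "doc_d"]   # hypothetical semantic results
fused_ranking = reciprocal_rank_fusion([full_text_ranking, semantic_ranking])
print(fused_ranking)
```

Documents that show up near the top of both lists rise to the top of the fused list, even if neither search ranked them first on its own.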

The Milvus server will also utilize a MinIO server for data storage and an etcd server for storage and coordination. I won't go into the details of this part of the setup, though I added extensive documentation to the code while trying to better understand what it was doing.

There are a couple of alternatives that I tried for storing and searching data, both of which bridge nicely with LangChain. If Milvus doesn't fit your needs, you can also check out FAISS or Chroma.

For a refresher on how to use Docker to build an LM server that can power the decision making and response generating aspects of our agents, check out the Ollama server tutorial. To see how to build a metasearch engine tool to search the web, check out the SearXNG server tutorial. For an idea of what types of agents we'll build with our servers, check out the agents tutorials.

Finally, before you start building, you can also check out the Docker compose file on which ours is based.

Now, let's get building!

Getting Started

First, we're going to set up and build the repo to make sure that it works. Then, we can play around with the code and learn more about it.

Check out all the source code here.

To set up and build the repo, follow these steps:

1. Make sure Docker is installed and running.

2. Clone the repo, head there, then create a Python environment:
```
git clone https://github.com/anima-kit/milvus-docker.git
cd milvus-docker
python -m venv venv
```
3. Activate the Python environment (the path below is for Windows; on Linux or macOS, use `source venv/bin/activate`):
```
venv/Scripts/activate
```
4. Install the necessary Python libraries:
```
pip install -r requirements.txt
```
5. Build and start all the Docker containers:
```
docker compose up -d
```
   All server data will be located in Docker volumes (milvus-data, minio-data, and etcd-data).
6. Head to http://127.0.0.1:9091/webui/ to check out some useful Milvus client information.
7. Run the test script to ensure the Milvus server can be reached through the PyMilvus library:
```
python -m scripts.milvus_test
```
   All logs from the test script are output in the console and stored in the ./milvus-docker.log file.
8. When you're done, stop the Docker containers and clean up with:
```
docker compose down
```
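
If you'd like a quick check that the server is up before running anything against it, a small standard-library script can poll it. The sketch below assumes that Milvus answers on a `/healthz` path on the same port 9091 used for the web UI, which is how the standalone Docker setup is usually health-checked; that path isn't mentioned elsewhere in this tutorial, and the `milvus_is_healthy` helper is hypothetical, so adjust both if your setup differs.

```python
import urllib.error
import urllib.request

def milvus_is_healthy(url="http://localhost:9091/healthz", timeout=2.0):
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed URL
        return False

print("Milvus healthy:", milvus_is_healthy())
```

If this prints `False` right after `docker compose up -d`, give the containers a little time to finish starting and try again.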

Example Use Cases

Now that the repo is built and working, let's play around with the code a bit.

When we built our Ollama and SearXNG servers, we demonstrated that the servers could be reached and properly invoked by using the provided Python methods. We did this by first instantiating the classes that held the methods to use, then executing the proper commands, both in the command line and by running scripts.

In this case, the main class to interact with the Milvus server is the MilvusClientInit class. Once this class is initialized, there are many methods that we can use to manage our vectorstore collections and perform full-text searches.

We can create collections to hold our data using the create_collection method and delete these collections with the drop_collection method. We can list all available collections with the list_collections method, and we can add data to or remove data from a collection using the insert and delete methods. Finally, we can search a collection for a given query using the full_text_search method.
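
Here's a rough sketch of how these methods might fit together, including `delete`, which the walkthroughs below don't exercise. It assumes you run it from the repo root with the Milvus server up, and that the value returned by `insert` exposes the auto-generated ids under an `"ids"` key, as PyMilvus insert results typically do; check the log output if that assumption doesn't hold. The `run_lifecycle` function and the `lifecycle_demo` collection name are made up for this example.

```python
def run_lifecycle(name="lifecycle_demo"):
    """Create a collection, insert two rows, delete one, search, then clean up."""
    # Imported here so the function can be defined without the repo on the path
    from pyfiles.milvus_utils import MilvusClientInit

    client = MilvusClientInit()
    client.create_collection(name)
    client.list_collections()

    inserted = client.insert(
        name=name,
        data=[{"text": "grocery: green beans"}, {"text": "study: langchain"}],
    )
    # Assumption: the insert result exposes the auto-generated ids under "ids"
    ids = list(inserted["ids"])
    # Remove only the first inserted row
    client.delete(ids=ids[:1], name=name)

    results = client.full_text_search(name=name, query_list=["study"], limit=1)
    client.drop_collection(name=name)
    return results
```

Calling `run_lifecycle()` should leave the server as it found it, since the collection is dropped at the end.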

For this tutorial, we'll represent our data with sparse vectors, which can then be used to search for particular keywords. When we build our document agent, we'll see how to add in additional dense vectors to represent our data. In that case, we'll be able to use both sparse and dense embeddings to perform hybrid searches for both particular keywords and data based on complex, semantic connections to the query.

To start off learning how to use the code, let's manage and search a vectorstore through the command line.

Data Management and Search through the Command Line

To manage and search data in the command line, follow these steps:

1. Do step 3 then step 5 of the Getting Started section to activate the Python environment and run all the Docker containers to start the Milvus server.
2. Start the Python interpreter from the command line:
```
python
```
3. Now that you're in the Python interpreter, import the MilvusClientInit class:
```
from pyfiles.milvus_utils import MilvusClientInit
```
4. Initialize the MilvusClientInit class:
```
client = MilvusClientInit()
```
5. Create a collection with the default parameters:
```
client.create_collection()
```
6. Insert the default data into the collection:
```
client.insert()
```
7. Perform the default full-text search on the data:
```
client.full_text_search()
```
8. Drop the default collection at the end to clean up:
```
client.drop_collection()
```
9. Do step 8 of the Getting Started section to stop the containers when you're done.

Just like with the test script, all logs will be printed in the console and stored in ./milvus-docker.log.

There are lots of variables we can customize here: the collection we use, the data we insert, and the queries for which we search the database³.

In the next example, I show how to make these customizations by creating and running a custom script.

Data Management and Search through Running Scripts

To manage and search your data through a custom script, follow these steps:

1. Do step 3 then step 5 of the Getting Started section to activate the Python environment and run the Milvus server in Docker.

2. Create a script in the ./scripts folder named my-data-search-ex.py with the following:
```
# Import MilvusClientInit class
from pyfiles.milvus_utils import MilvusClientInit

# Initialize client
client = MilvusClientInit()

# Create collection
collection_name = 'my_collection'
client.create_collection(collection_name)

# Create data and insert into collection
my_data = [
    {'text': 'grocery list: bananas, bread, choco'},
    {'text': 'grocery: green beans'},
    {'text': 'todo list: start chatbot tutorial, network with community'},
    {'text': 'study list: langchain v1, lucid dreaming and asc'},
    {'text': 'study latest gradio implements'},
    {'text': 'My dream last night involved'},
    {'text': 'then I woke up, confused as to if I was still dreaming.'}
]
client.insert(name=collection_name, data=my_data)

# Define the maximum number of results and the search query
num_results = 2
query_list = ['grocery', 'study', 'dream']

# Get results
client.full_text_search(
    name=collection_name,
    query_list=query_list,
    limit=num_results
)

# (Optional) Delete collection when done to clean up
client.drop_collection(name=collection_name)
```
3. Run the script:
```
python -m scripts.my-data-search-ex
```
4. Do step 8 of the Getting Started section to stop the containers when you're done.

Again, all logs will be printed in the console and stored in ./milvus-docker.log. The name of the Python script doesn't matter as long as you use the same name in step 2 and step 3.
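
If you'd rather work with the search results in your own script than read them from the log, a small helper can pair each query with its hits. This is only a sketch: it assumes `full_text_search` hands back one list of hits per query, with each hit behaving like a dictionary that has `distance` and `entity` keys, which is the shape PyMilvus search results usually take. The `summarize_hits` helper is hypothetical, and the sample results (including the scores) are invented stand-ins rather than real output.

```python
def summarize_hits(queries, results):
    """Pair each query with its (text, score) hits, in the order returned."""
    summary = {}
    for query, hits in zip(queries, results):
        summary[query] = [
            (hit["entity"]["text"], round(hit["distance"], 3)) for hit in hits
        ]
    return summary

# Hand-written stand-in for a search result; the scores are invented
fake_results = [
    [
        {"id": 1, "distance": 1.25, "entity": {"text": "grocery: green beans"}},
        {"id": 2, "distance": 0.8, "entity": {"text": "grocery list: bananas, bread, choco"}},
    ],
    [],
]
summary = summarize_hits(["grocery", "nothing"], fake_results)
print(summary)
```

In a real script, you would pass your own `query_list` and the value returned by `client.full_text_search(...)` in place of the stand-in data.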

Notice these searches are full-text, meaning the data is searched for exact keywords. We can improve this search by performing a semantic search simultaneously, picking up on more subtle relationships between the query and the data. This is exactly what we'll do in the document agent tutorial.

Now that we understand how to use the code, let's open it up to check out the gears.

Project Structure

Before we take a deep dive into the source code, let's look at the repo structure to see what code we'll want to learn.

├── docker-compose.yml      # Docker configurations
├── pyfiles/                # Python source code
│   ├── logger.py           # Python logger for tracking progress
│   └── milvus_utils.py     # Python methods to use Milvus server
├── requirements.txt        # Required Python libraries for main app
├── requirements-dev.txt    # Required Python libraries for development
├── scripts/                # Example scripts to use Python methods
│   ├── milvus_test.py      # Python test of methods
│   └── latency_test.py     # Timing tests for methods
├── tests/                  # Testing suite
├── third-party/            # Milvus/PyMilvus licensing
└── validators/             # Validators for Python methods
    └── milvus_types.py     # Type validation for Python methods

milvus_utils.py

The milvus_utils.py file defines the class and methods needed to manage and search custom data using our Milvus server. This is the file on which we're going to spend our deep dive.

milvus_types.py

This file performs type validation for the methods in the milvus_utils.py file. What this means is that the arguments and results of the methods are checked to make sure they have the correct Python type using Pydantic (see lines 211-213 & 223-225 of the full version of the milvus_utils.py file for an example of how this file is used). This trick is quite helpful for ridding your code of potential bugs related to receiving the wrong types of objects. I won't go into this file in detail, but you can check it out if you're interested in learning more.
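
To give a feel for this kind of validation, here's a minimal Pydantic sketch. The `InitClientParamsSketch` model and the `is_valid_uri_arg` helper are hypothetical stand-ins written for this example, not the actual contents of milvus_types.py; they only show the general pattern of constructing a model from the arguments and letting Pydantic raise if a type is wrong.

```python
from pydantic import BaseModel, StrictStr, ValidationError

class InitClientParamsSketch(BaseModel):
    """Hypothetical stand-in for a validator model: uri must be a real str."""
    uri: StrictStr

def is_valid_uri_arg(value):
    """Return True when the value passes validation, False otherwise."""
    try:
        InitClientParamsSketch(uri=value)
        return True
    except ValidationError:
        return False

print(is_valid_uri_arg("http://localhost:19530"))  # a string is accepted
print(is_valid_uri_arg(19530))                     # an int is rejected
```

Using `StrictStr` means an integer isn't silently converted to a string, so passing a port number where a URI is expected is caught right away.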

How do all the files work?

For a refresher on the logger.py, requirements*.txt, latency_test.py, and milvus_test.py files as well as the tests/ folder, check out the Ollama and SearXNG tutorials.

We've also seen Docker compose files in the Ollama and SearXNG tutorials. Here, this file doesn't contain much we haven't seen yet. Similarly to the SearXNG server, we define multiple services for Milvus as well as the supporting servers, MinIO and etcd. We also perform some new tricks for healthchecks and ensure that the Milvus server can't be started without its supporting servers. I won't go into details for these, but you can check out the file to see what I mean.

Ok, that's all the files. Let's go diving!

Code Deep Dive

Here, we're going to look at the relevant files in more detail. We're going to start by looking at the full files to see which parts of the code we'll want to learn, then we can further probe each of the important pieces to see how they work.

File 1 | `milvus_utils.py`

milvus_utils.py skeleton
### milvus_utils
## Defines functions needed to set up, manage, and query a vector database with a Milvus client.
## Based on Milvus docs | https://milvus.io/docs/full-text-search.md

## Imports
# Third-party modules
import time
from pymilvus import (
    MilvusClient,
    Function, 
    FunctionType, 
    DataType 
)

## Constants
# URI | Milvus server uri
uri = 'http://localhost:19530'

## Index params list
# List of dictionaries defining the indices to add to index params
# Here, we're only working with sparse vectors created with BM25
index_params_list = [
    {
        "field_name": "sparse",
        "index_type": "SPARSE_INVERTED_INDEX",
        "metric_type": "BM25",
        "params": {
            "inverted_index_algo": "DAAT_MAXSCORE",
            "bm25_k1": 3,   # Maximize importance of term frequency
            "bm25_b": 1     # Full normalization of docs
        }
    }
]

## BM25 embed function
# The function to get sparse embeddings from text
func_bm25 = Function(
    name="text_bm25_emb",
    input_field_names=["text"],
    output_field_names=["sparse"],
    function_type=FunctionType.BM25,
)

## Field params list
# List of dictionaries defining the fields to add to the schema
# This is how we describe our data:
# just need text and sparse vectors for this demo (full-text search only)
field_params_list = [
    {
        "field_name": "id", 
        "datatype": DataType.INT64, 
        "is_primary": True, 
        "auto_id": True
    },
    {
        "field_name": "text", 
        "datatype": DataType.VARCHAR, 
        "max_length": 1000, 
        "enable_analyzer": True
    },
    {
        "field_name": "sparse", 
        "datatype": DataType.SPARSE_FLOAT_VECTOR
    }
]

## Collection name
collection_name = 'collection_ex'

## Example data
# Default data to use for the insert method
data_ex = [
    {'text': 'information retrieval is a field of study.'},
    {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
    {'text': 'data mining and information retrieval overlap in research.'},
    {'text': 'the rest of the lyrics go,'},
    {'text': 'Last night I dreamed about'}
]

## Query list
# Default list of queries to search the database
query_list = ["What's the focus of information retrieval?"]

## Result limit
# Default maximum number of results to get from search
lim_results = 3



class MilvusClientInit:
    def __init__(
        self, 
        uri = uri
    ):
        self.uri = uri
        # Initialize MilvusClient from PyMilvus
        self.client = self._init_client()


    ## Initialize MilvusClient
    def _init_client(
        self
    ):
        ## Define MilvusClient with PyMilvus library
        client = MilvusClient(
            uri=self.uri
        )
        return client


    ## Create field for schema
    def _create_field(
        self, 
        schema, 
        params
    ):
        # Add the field to the schema for the given params
        schema.add_field(**params)


    ## Create index for index parameters
    def _create_index(
        self, 
        index_params, 
        params
    ):
        # Add the index to the index params for the given params
        index_params.add_index(**params)


    ## List all client collections
    def list_collections(
        self
    ):
        ## List all client collections
        collections = self.client.list_collections()
        return collections


    ## Create a collection for the client
    def create_collection(
        self, 
        name = collection_name, 
        field_params_list = field_params_list, 
        func_list = [func_bm25], 
        index_params_list = index_params_list,
    ):
        ## Initialize the collection schema
        # Allow for adding different fields later on with enable_dynamic_field
        schema = self.client.create_schema(
            enable_dynamic_field=True,
        )

        ## Add all fields to the schema
        # This is how to describe the data
        for params in field_params_list:
            self._create_field(schema, params)

        ## Add all functions to the schema (embedding)
        # This is how to represent the data
        for func in func_list:
            schema.add_function(func)

        ## Add all indices to the index params
        # This is how to search the data
        index_params = self.client.prepare_index_params()
        for params in index_params_list:
            self._create_index(index_params, params)

        ## Create collection
        self.client.create_collection(
            collection_name=name,
            schema=schema,
            index_params=index_params
        )


    ## Drop a collection for the client
    def drop_collection(
        self, 
        name = collection_name
    ):
        # Drop collection
        self.client.drop_collection(collection_name=name)


    ## Insert data into a collection
    def insert(
        self, 
        name = collection_name, 
        data = data_ex
    ):
        # Insert data into the collection
        results = self.client.insert(
            collection_name=name,
            data=data
        )
        # Wait for the database to update
        time.sleep(0.5)

        ## Return results
        return results


    ## Delete data from a collection
    def delete(
        self, 
        ids,
        name = collection_name, 
    ):
        # Delete data from the collection
        results = self.client.delete(
            collection_name=name,
            ids=ids
        )
        # Wait for the database to update
        time.sleep(0.5)


    ## Perform a full text search on a collection
    def full_text_search(
        self, 
        name = collection_name, 
        query_list = query_list, 
        limit = lim_results
    ):
        ## Define search params
        # Only full-text so focus on sparse vectors built from text
        anns_field = 'sparse'
        output_fields = ['text']

        ## Add extra search params
        # Controls trade-off between speed and accuracy in ANN searches
        # Drop some percentage of results before searching
        search_params = {
            'params': {'drop_ratio_search': 0.2},
        }

        ## Get the search results
        results = self.client.search(
            collection_name=name, 
            data=query_list,
            anns_field=anns_field,
            output_fields=output_fields,
            limit=limit,
            search_params=search_params
        )
        return results

milvus_utils.py full
### milvus_utils
## Defines functions needed to set up, manage, and query a vector database with a Milvus client.
## Based on Milvus docs | https://milvus.io/docs/full-text-search.md

## Imports
# Third-party modules
import time
import pprint

from pymilvus.client.search_result import SearchResult # type: ignore
from pymilvus.milvus_client.index import IndexParams # type: ignore
from pymilvus import ( # type: ignore
    MilvusClient, 
    CollectionSchema, 
    Function, 
    FunctionType, 
    DataType 
)

from typing import List

## Internal modules
import milvus_types
from logger import (
    logger, 
    with_spinner
)

## Constants
# URI | Milvus server uri
uri: str = 'http://localhost:19530'

## Index params list
# List of dictionaries defining the indices to add to index params
# Here, we're only working with sparse vectors created with BM25
index_params_list: List[dict] = [
    {
        "field_name": "sparse",
        "index_type": "SPARSE_INVERTED_INDEX",
        "metric_type": "BM25",
        "params": {
            "inverted_index_algo": "DAAT_MAXSCORE",
            "bm25_k1": 3,   # Maximize importance of term frequency
            "bm25_b": 1     # Full normalization of docs
        }
    }
]

## BM25 embed function
# The function to get sparse embeddings from text
func_bm25: Function = Function(
    name="text_bm25_emb",
    input_field_names=["text"],
    output_field_names=["sparse"],
    function_type=FunctionType.BM25,
)

## Field params list
# List of dictionaries defining the fields to add to the schema
# This is how we describe our data:
# just need text and sparse vectors for this demo (full-text search only)
field_params_list: List[dict] = [
    {
        "field_name": "id", 
        "datatype": DataType.INT64, 
        "is_primary": True, 
        "auto_id": True
    },
    {
        "field_name": "text", 
        "datatype": DataType.VARCHAR, 
        "max_length": 1000, 
        "enable_analyzer": True
    },
    {
        "field_name": "sparse", 
        "datatype": DataType.SPARSE_FLOAT_VECTOR
    }
]

## Collection name
collection_name: str = 'collection_ex'

## Example data
# Default data to use for the insert method
data_ex: List[dict] = [
    {'text': 'information retrieval is a field of study.'},
    {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
    {'text': 'data mining and information retrieval overlap in research.'},
    {'text': 'the rest of the lyrics go,'},
    {'text': 'Last night I dreamed about'}
]

## Query list
# Default list of queries to search the database
query_list: List[str] = ["What's the focus of information retrieval?"]

## Result limit
# Default maximum number of results to get from search
lim_results: int = 3



class MilvusClientInit:
    """
    A Milvus client that can be used to manage and search data.

    The user can do a full-text search on their data by initializing the client then using the `full_text_search` method to get results.

    For example, to initialize the client for a given URI:
    ```python
    uri = 'http://localhost:19530'
    client = MilvusClientInit(uri=uri)
    ```

    Then, to create a collection:
    ```python
    name = 'collection_ex'

    field_params_list = [
        {
            "field_name": "id", 
            "datatype": DataType.INT64, 
            "is_primary": True, 
            "auto_id": True
        },
        {
            "field_name": "text", 
            "datatype": DataType.VARCHAR, 
            "max_length": 1000, 
            "enable_analyzer": True
        },
        {
            "field_name": "sparse", 
            "datatype": DataType.SPARSE_FLOAT_VECTOR
        }
    ]

    func_list = [
        Function(
            name="text_bm25_emb",
            input_field_names=["text"],
            output_field_names=["sparse"],
            function_type=FunctionType.BM25,
        )
    ]

    index_params_list = [
        {
            "field_name": "sparse",
            "index_type": "AUTOINDEX",
            "metric_type": "BM25",
        }
    ]

    client.create_collection(
        name=name, 
        field_params_list=field_params_list, 
        func_list=func_list, 
        index_params_list=index_params_list
    )
    ```

    Then, to insert some data:
    ```python
    data = [
        {'text': 'information retrieval is a field of study.'},
        {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
        {'text': 'data mining and information retrieval overlap in research.'},
        {'text': 'the rest of the lyrics go,'},
        {'text': 'Last night I dreamed about'}
    ]
    client.insert(name=name, data=data)
    ```

    Finally, to do a full-text search:
    ```python
    query_list = ['whats the focus of information retrieval?']
    lim_results = 3
    client.full_text_search(name=name, query_list=query_list, limit=lim_results)
    ```

    Attributes
    ------------
        uri: str, Optional
            The uri on which to host the Milvus client.
            Defaults to 'http://localhost:19530'.
        client: MilvusClient
            The Milvus client to use to manage and query data.
    """
    def __init__(
        self, 
        uri: str = uri,
        client: MilvusClient | None = None
    ) -> None:
        """
        Initialize the Milvus client hosted on the given URI.

        Args
        ------------
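            uri: str, Optional
                The uri on which to host the Milvus client.
                Defaults to 'http://localhost:19530'.
            client: MilvusClient | None, Optional
                An existing client to reuse instead of creating a new one.
                Defaults to None, in which case a new client is created for the given uri.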

        Raises
        ------------
            Exception: 
                If initialization fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.InitClientParams(
            uri = uri
        )

        ## Start Init
        try:
            self.uri = uri
            # Initialize MilvusClient from PyMilvus
            if client is None:
                self.client: MilvusClient = self._init_client()
            else:
                ## Validate client
                milvus_types.InitClientResults(
                    results=client
                )
                self.client = client
        except Exception as e:
            logger.error(f'❌ Problem initializing Milvus client: {str(e)}')
            raise


    ## Initialize MilvusClient
    def _init_client(
        self
    ) -> MilvusClient:
        """
        Connect the Milvus client.

        Returns
        ------------
            MilvusClient
                The client instance.

        Raises
        ------------
            Exception: 
                If client connection fails, error is logged and raised.
        """
        logger.info(f'⚙️ Starting Milvus client on URI `{self.uri}`')
        try:
            ## Define MilvusClient with PyMilvus library
            client: MilvusClient = MilvusClient(
                uri=self.uri
            )

            ## Validate results
            milvus_types.InitClientResults(
                results=client
            )

            ## Return results
            logger.info(f'⚙️ Milvus client connected at `{self.uri}`')
            return client
        except Exception as e:
            logger.error(f'❌ Problem connecting to Milvus client: `{str(e)}`')
            raise


    ## Create field for schema
    def _create_field(
        self, 
        schema: CollectionSchema, 
        params: dict
    ) -> None:
        """
        Add a field with the given parameters to the given schema.

        Args
        ------------
            schema: CollectionSchema
                The schema for which a field will be created.
                The fields will be created using the schema.add_field method.
            params: dict
                Dictionary with key-value pairs given by the argument-value pairs for the schema.add_field method.

        Raises
        ------------
            Exception: 
                If adding the field fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.CreateFieldParams(
            collection_schema=schema,
            params=params
        )

        ## Start Create Field
        logger.info(f'⚙️ Creating field for params: \n {pprint.pformat(params, indent=0, width=500)} \n')
        try:
            # Add the field to the schema for the given params
            schema.add_field(**params)
            logger.info(f'✅ Field added to schema')
        except Exception as e:
            logger.error(f'❌ Problem creating field: `{str(e)}`')
            raise


    ## Create index for index parameters
    def _create_index(
        self, 
        index_params: IndexParams, 
        params: dict
    ) -> None:
        """
        Add an index with the given parameters to the given list of index parameters.

        Args
        ------------
            index_params: IndexParams
                The index parameters for which an index will be created.
                The index will be created using the index_params.add_index method.
            params: dict
                Dictionary with key-value pairs given by the argument-value pairs for the index_params.add_index method.

        Raises
        ------------
            Exception: 
                If adding the index fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.CreateIndexParams(
            index_params=index_params,
            params=params
        )

        ## Start Create Index
        logger.info(f'⚙️ Creating index for params: \n {pprint.pformat(params, indent=0, width=500)} \n')
        try:
            # Add the index to the index params for the given params
            index_params.add_index(**params)
            logger.info(f'✅ Index added to index params')
        except Exception as e:
            logger.error(f'❌ Problem creating index params: `{str(e)}`')
            raise


    ## List all client collections
    def list_collections(
        self
    ) -> List[str | None]:
        """
        Get all collections for the client.

        For example, to get the collections for a given client:
        ```python
        # Initialize the client
        uri = 'http://localhost:19530'
        client = MilvusClientInit(uri=uri)

        # List collections
        client.list_collections()
        ```

        Returns
        ------------
            List[str | None]: 
                A list of the names of all available collections.

        Raises
        ------------
            Exception: 
                If getting the collections fails, error is logged and raised.
        """
        try:
            ## List all client collections
            collections = self.client.list_collections()

            ## Validate results
            milvus_types.ListCollectionsResults(
                results=collections
            )

            ## Return results
            logger.info(f'📝 Available collections: \n {pprint.pformat(collections, indent=0, width=500)} \n')
            return collections
        except Exception as e:
            logger.error(f'❌ Problem listing collections: `{str(e)}`')
            raise   


    ## Create a collection for the client
    def create_collection(
        self, 
        name: str = collection_name, 
        field_params_list: List[dict] = field_params_list, 
        func_list: List[Function] = [func_bm25], 
        index_params_list: List[dict] = index_params_list,
    ) -> None:
        """
        Create a collection with the given name, field parameters, embedding functions, and index parameters.

        For example, one can create a collection to describe data with text and sparse vectors. This will be good for performing full-text searches:
        ```python
        # Initialize the client
        uri = 'http://localhost:19530'
        client = MilvusClientInit(uri=uri)

        # Create a collection
        name = 'collection_ex'

        field_params_list = [
            {
                "field_name": "id", 
                "datatype": DataType.INT64, 
                "is_primary": True, 
                "auto_id": True
            },
            {
                "field_name": "text", 
                "datatype": DataType.VARCHAR, 
                "max_length": 1000, 
                "enable_analyzer": True
            },
            {
                "field_name": "sparse", 
                "datatype": DataType.SPARSE_FLOAT_VECTOR
            }
        ]

        func_list = [
            Function(
                name="text_bm25_emb",
                input_field_names=["text"],
                output_field_names=["sparse"],
                function_type=FunctionType.BM25,
            )
        ]

        index_params_list = [
            {
                "field_name": "sparse",
                "index_type": "AUTOINDEX",
                "metric_type": "BM25",
            }
        ]

        client.create_collection(
            name=name, 
            field_params_list=field_params_list, 
            func_list=func_list, 
            index_params_list=index_params_list
        )
        ```

        Args
        ------------
            name: str, Optional
                The name of the collection to create.
                Defaults to 'collection_ex'.
            field_params_list: List[dict]
                List of dictionaries for the field parameters.
                Fields to add to the collection (to describe data).
                Key-value pairs give the argument-value pairs for the schema.add_field method
                Defaults to: [
                    {
                        "field_name": "id", 
                        "datatype": DataType.INT64, 
                        "is_primary": True, 
                        "auto_id": True
                    },
                    {
                        "field_name": "text", 
                        "datatype": DataType.VARCHAR, 
                        "max_length": 1000, 
                        "enable_analyzer": True
                    },
                    {
                        "field_name": "sparse", 
                        "datatype": DataType.SPARSE_FLOAT_VECTOR
                    }
                ]
            index_params_list: List[dict]
                List of dictionaries for the index parameters.
                Fields to use in search with given specifications.
                Key-value pairs give the argument-value pairs for the index_params.add_index method
                Defaults to: [
                    {
                        "field_name": "sparse",
                        "index_type": "AUTOINDEX",
                        "metric_type": "BM25",
                    }
                ]
            func_list: List[Function]
                List of functions for the data embedding.
                Defaults to the BM25 embed function:
                    Function(
                        name="text_bm25_emb",
                        input_field_names=["text"],
                        output_field_names=["sparse"],
                        function_type=FunctionType.BM25,
                    ) 

        Raises
        ------------
            Exception: 
                If creating the collection fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.CreateCollectionParams(
            name=name,
            field_params_list=field_params_list,
            func_list=func_list,
            index_params_list=index_params_list
        )

        ## Check if collection already exists
        collections: List[str | None] = self.list_collections()
        if name in collections:
            logger.info(f'✅ Collection `{name}` already exists.')
            return

        ## Start Create Collection
        logger.info(f'⚙️ Creating collection `{name}`')
        try:
            ## Initialize the collection schema
            # Allow for adding different fields later on with enable_dynamic_field
            schema: CollectionSchema = self.client.create_schema(
                enable_dynamic_field=True,
            )

            ## Add all fields to the schema
            # This is how to describe the data
            for params in field_params_list:
                self._create_field(schema, params)

            ## Add all functions to the schema (embedding)
            # This is how to represent the data
            for func in func_list:
                schema.add_function(func)

            ## Add all indices to the index params
            # This is how to search the data
            index_params: IndexParams = self.client.prepare_index_params()
            for params in index_params_list:
                self._create_index(index_params, params)

            ## Create collection
            self.client.create_collection(
                collection_name=name,
                schema=schema,
                index_params=index_params
            )

            ## List collections
            # Make sure collection was added properly
            collections = self.list_collections()
            if name in collections:
                logger.info(f'✅ Created collection `{name}`')
            else:
                error_message = f"The new collection `{name}` is not in the collections list."
                logger.error(error_message)
                raise ValueError(error_message)
        except Exception as e:
            logger.error(f'❌ Problem creating collection `{name}`: `{str(e)}`')
            raise   


    ## Drop a collection for the client
    def drop_collection(
        self, 
        name: str = collection_name
    ) -> None:
        """
        Drop a collection with the given name.

        For example, to create then drop a collection:
        ```python
        # Initialize the client
        uri = 'http://localhost:19530'
        client = MilvusClientInit(uri=uri)

        # Create a collection
        name = 'collection_ex'
        client.create_collection(name=name)

        # Drop the collection
        client.drop_collection(name=name)
        ```

        Args
        ------------
            name: str
                The name of the collection to drop.

        Raises
        ------------
            Exception: 
                If dropping the collection fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.DropCollectionParams(
            name=name
        )

        ## Check that collection exists
        collections: List[str | None] = self.list_collections()
        if name not in collections:
            logger.info(f'❌ Cannot find collection `{name}`.')
            return

        ## Start Drop collection
        logger.info(f'⚙️ Dropping collection `{name}`')
        try:
            # Drop collection
            self.client.drop_collection(collection_name=name)
            # Make sure collection was dropped
            collections = self.list_collections()
            if name not in collections:
                logger.info(f'✅ Dropped collection `{name}`')
            else:
                error_message = f"The collection `{name}` is still in the collections list."
                logger.error(error_message)
                raise ValueError(error_message)
        except Exception as e:
            logger.error(f'❌ Problem dropping collection: `{str(e)}`')
            raise  


    ## Insert data into a collection
    def insert(
        self, 
        name: str = collection_name, 
        data: List[dict] = data_ex
    ) -> dict | None:
        """
        Insert the given data into the given collection.

        For example, to insert data into a given collection:
        ```python
        # Initialize the client
        uri = 'http://localhost:19530'
        client = MilvusClientInit(uri=uri)

        # Create a collection
        name = 'collection_ex'
        client.create_collection(name=name)

        # Insert some data
        data = [
            {'text': 'information retrieval is a field of study.'},
            {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
            {'text': 'data mining and information retrieval overlap in research.'},
            {'text': 'the rest of the lyrics go,'},
            {'text': 'Last night I dreamed about'}
        ]
        client.insert(name=name, data=data)
        ```

        Args
        ------------
            name: str
                Name of the collection to insert the data into.
            data: List[dict]
                The data to insert into the collection.

        Returns
        ------------
            dict: 
                A dictionary of the data added to the collection.

        Raises
        ------------
            Exception: 
                If adding the data fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.InsertParams(
            name=name,
            data=data
        )

        ## Check that collection exists
        collections = self.list_collections()
        if name not in collections:
            logger.info(f'❌ Cannot find collection `{name}`.')
            return None

        ## Start Insert
        logger.info(f'⚙️ Inserting data into `{name}`')
        try:
            # Insert data into the collection
            results: dict = self.client.insert(
                collection_name=name,
                data=data
            )
            # Wait for the database to update
            time.sleep(0.5)
            logger.info(f'✅ Inserted data')

            ## Validate results types
            milvus_types.InsertResults(
                results=results
            )

            ## Return results
            return results
        except Exception as e:
            logger.error(f'❌ Problem inserting data: {str(e)}')
            raise


    ## Delete data from a collection
    def delete(
        self, 
        ids: List[str],
        name: str = collection_name, 
    ) -> None:
        """
        Delete the given data from the given collection.

        For example, to delete data from a given collection:
        ```python
        # Initialize the client
        uri = 'http://localhost:19530'
        client = MilvusClientInit(uri=uri)

        # Delete entries by id from an existing collection
        name = 'collection_ex'
        client.delete(ids=['data-id'], name=name)
        ```

        Args
        ------------
            ids: List[str]
                A list of ids to delete from the collection.
            name: str
                Name of the collection to delete the data from.

        Raises
        ------------
            Exception: 
                If deleting the data fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.DeleteParams(
            name=name,
            ids=ids
        )

        ## Check that collection exists
        collections = self.list_collections()
        if name not in collections:
            logger.info(f'❌ Cannot find collection `{name}`.')
            return None

        ## Start Delete
        logger.info(f'⚙️ Deleting data from `{name}`')
        try:
            # Delete data from the collection
            results: dict = self.client.delete(
                collection_name=name,
                ids=ids
            )
            # Wait for the database to update
            time.sleep(0.5)
            logger.info(f'✅ Deleted data')
        except Exception as e:
            logger.error(f'❌ Problem deleting data: {str(e)}')
            raise


    ## Perform a full text search on a collection
    def full_text_search(
        self, 
        name: str = collection_name, 
        query_list: List[str] = query_list, 
        limit: int = lim_results
    ) -> SearchResult | None:
        """
        Perform a full text search on the given collection for the queries in the given query list. 
        Get a maximum number of results given by the result limit.

        For example to search a collection for a given list of queries:
        ```python
        # Initialize the client
        uri = 'http://localhost:19530'
        client = MilvusClientInit(uri=uri)

        # Create a collection
        name = 'collection_ex'
        client.create_collection(name=name)

        # Insert some data
        data = [
            {'text': 'information retrieval is a field of study.'},
            {'text': 'information retrieval focuses on finding relevant information in large datasets.'},
            {'text': 'data mining and information retrieval overlap in research.'},
            {'text': 'the rest of the lyrics go,'},
            {'text': 'Last night I dreamed about'}
        ]
        client.insert(name=name, data=data)

        # Query the data
        query_list: List[str] = ['whats the focus of information retrieval?']
        lim_results: int = 3
        client.full_text_search(name=name, query_list=query_list, limit=lim_results)
        ```

        Args
        ------------
            name: str
                Name of the collection to query.
            query_list: List[str]
                List of queries to perform.
            limit: int
                Maximum number of results to obtain.

        Returns
        ------------
            List[dict]: 
                A list of the search results for the given queries.

        Raises
        ------------
            Exception: 
                If performing the search fails, error is logged and raised.
        """
        ## Validate all argument types
        milvus_types.FullTextSearchParams(
            name=name,
            query_list=query_list,
            limit=limit
        )

        ## Check that collection exists
        collections = self.list_collections()
        if name not in collections:
            logger.info(f'❌ Cannot find collection `{name}`.')
            return None

        ## Start Full Text Search
        logger.info(f'⚙️ Performing search on `{name}` for queries: \n {pprint.pformat(query_list, indent=0, width=500)} \n')
        try:
            ## Define search params
            # Only full-text so focus on sparse vectors built from text
            anns_field: str = 'sparse'
            output_fields: List[str] = ['text']

            ## Add extra search params
            # Controls trade-off between speed and accuracy in sparse searches
            # Ignore the smallest 20% of query vector values when searching
            search_params: dict = {
                'params': {'drop_ratio_search': 0.2},
            }
            with with_spinner(description=f"🔎 Getting results..."):
                ## Get the search results
                results: List[dict] = self.client.search(
                    collection_name=name, 
                    data=query_list,
                    anns_field=anns_field,
                    output_fields=output_fields,
                    limit=limit,
                    search_params=search_params
                )

            ## Validate results types
            milvus_types.FullTextSearchResults(
                results=results
            )

            ## Return results
            for i, result in enumerate(results):
                for j, hit in enumerate(result):
                    logger.info(f'📝 Result {i}, {j}: \n {pprint.pformat(hit, indent=0, width=500)} \n')
            return results
        except Exception as e:
            logger.error(f'❌ Problem performing search: {str(e)}')
            raise

Above, I show the milvus_utils.py file in full as well as in a skeleton version, which keeps all the code needed to work but drops almost all of the code for some crucial best practices.

Compared to the ollama_utils.py file in the Ollama server tutorial and the searxng_utils.py in the SearXNG server tutorial, many more of the methods in the MilvusClientInit class are external methods to be used outside the class. However, the main methods that we're going to use are the create_collection, insert, and full_text_search methods. These are what we used when working in the command line and running scripts in the Example Use Cases section.

Before we start, you can also check out the guide that much of this code is based on.

Now, let's check these methods out.

Method 1.1 | `create_collection`

See lines 141-175 of milvus_utils.py

We've already seen that the create_collection method can take in a collection name, then create a Milvus collection that can be used to store and search data. Now, we can open up the method to see how this is all done.

create_collection method of milvus_utils.py
## Collection name
collection_name = 'collection_ex'

## Create a collection for the client
def create_collection(
    self, 
    name = collection_name, 
    field_params_list = field_params_list, 
    func_list = [func_bm25], 
    index_params_list = index_params_list,
):
    ## Initialize the collection schema
    # Allow for adding different fields later on with enable_dynamic_field
    schema = self.client.create_schema(
        enable_dynamic_field=True,
    )

    ## Add all fields to the schema
    # This is how to describe the data
    for params in field_params_list:
        self._create_field(schema, params)

    ## Add all functions to the schema (embedding)
    # This is how to represent the data
    for func in func_list:
        schema.add_function(func)

    ## Add all indices to the index params
    # This is how to search the data
    index_params = self.client.prepare_index_params()
    for params in index_params_list:
        self._create_index(index_params, params)

    ## Create collection
    self.client.create_collection(
        collection_name=name,
        schema=schema,
        index_params=index_params
    )

Before discussing how the method works, let's discuss the default values that go into it. These include the:

collection_name (lines 2 and 7) which is just a string defining the name of the collection,
field_params_list (line 8) defining the fields with which to represent our data,
func_bm25 (line 9) defining the embedding function to use for an additional representation of the data,
index_params_list (line 10) defining the fields to use for searching and the index type for quick retrieval.

Let's look at the last three of these together.


default field_params_list of create_collection method
from pymilvus import DataType 

## Field params list
# List of dictionaries defining the fields to add to the schema
# This is how we describe our data:
# just need text and sparse vectors for this demo (full-text search only)
field_params_list = [
    {
        "field_name": "id", 
        "datatype": DataType.INT64, 
        "is_primary": True, 
        "auto_id": True
    },
    {
        "field_name": "text", 
        "datatype": DataType.VARCHAR, 
        "max_length": 1000, 
        "enable_analyzer": True
    },
    {
        "field_name": "sparse", 
        "datatype": DataType.SPARSE_FLOAT_VECTOR
    }
]

default embedding function of create_collection method
from pymilvus import Function, FunctionType 

## BM25 embed function
# The function to get sparse embeddings from text
func_bm25 = Function(
    name="text_bm25_emb",
    input_field_names=["text"],
    output_field_names=["sparse"],
    function_type=FunctionType.BM25,
)

default index_params_list of create_collection method
## Index params list
# List of dictionaries defining the indices to add to index params
# Here, we're only working with sparse vectors created with BM25
index_params_list = [
    {
        "field_name": "sparse",
        "index_type": "SPARSE_INVERTED_INDEX",
        "metric_type": "BM25",
        "params": {
            "inverted_index_algo": "DAAT_MAXSCORE",
            "bm25_k1": 3,   # Maximize importance of term frequency
            "bm25_b": 1     # Full normalization of docs
        }
    }
]

We can see from the field_params_list that each data entry will have three fields associated with it: an ID for management, a text field which will be the content of the data entry, and a sparse field. This last field will be populated by our embedding function, func_bm25, which takes the text field as input and outputs the sparse field. We can also see from the index_params_list that the sparse field will be the one used to search the data.

In other words, we start with a list of text entries and our embedding function, func_bm25, turns this text into sparse vectors. As defined by our index_params_list, these vectors are then used when scoring results to output the most relevant ones.

I want more info about the default values!

There's a lot more that can be said about the field_params_list, the func_bm25, and the index_params_list.

field_params_list

This is how we tell Milvus which fields we want our data to be represented by.

We let Milvus handle giving IDs to our data entries by setting the auto_id argument to True. We put a limit on the size of text entries with max_length, and we set the enable_analyzer argument to True so that our embedding function can turn the text field into a sparse field. Finally, each of these fields needs an associated datatype that Milvus can work with.

func_bm25

This is how we turn text fields into sparse vectors and how we score documents to obtain results with the full_text_search method.

This uses the BM25 method that's built into Milvus. For more details about how this function works, see the math deep dive section.

index_params_list

This is how we tell Milvus the fields on which we want our documents to be evaluated when searching and how to index our data for quick retrieval.

We use the SPARSE_INVERTED_INDEX value for the index_type. What does this mean?

The INVERTED index type optimizes document retrieval by mapping each term to the documents that contain it. Here, we're just making sure we use the inverted index for sparse vectors specifically.

Milvus stores data in growing segments which, after reaching some size, are indexed according to the user's choice of indices. So, the text fields are turned into sparse vectors with BM25 and this information is stored in a growing segment. Then, when the segment gets large enough, the sparse vectors are indexed in an inverted format that allows for quicker fetching of relevant documents. However, we don't want to index a segment before it reaches a particular size, because we would then use up too much memory for too small a gain in speed.

We let Milvus know that we want to use the BM25 metric for evaluation and we tack on some extra parameters for further customization. We can see that we're setting the \(k_1\) and \(b\) constants from the BM25 scoring function.

We're also using the DAAT_MAXSCORE algorithm to allow for faster evaluation time by filtering out documents that can't possibly score higher than the current highest score evaluated. I believe it also does some sorting of the query terms by how much they can possibly contribute to the document scores (i.e. terms that will give very high scores are scored first and terms that will give very low scores are deemed non-essential and thrown out).
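To make the inverted index idea more concrete, here is a minimal pure-Python sketch of one. It is not how Milvus implements SPARSE_INVERTED_INDEX internally; the `build_inverted_index` helper and the whitespace tokenizer are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        # Naive whitespace tokenization (an assumption for this sketch)
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


docs = [
    "information retrieval is a field of study",
    "data mining and information retrieval overlap in research",
    "the rest of the lyrics go",
]
inverted_index = build_inverted_index(docs)

# Only the documents listed under a query term need to be scored
candidates = inverted_index.get("retrieval", [])
print(candidates)
```

The point of the structure is that a query only touches the documents listed under its terms, rather than scanning every document in the collection.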

After initializing the schema to use for the collection (lines 14-16 of the create_collection method), we add each of the fields in the field_params_list to the schema using the _create_field method (lines 20-21). We then add the func_bm25 to the schema functions using the schema.add_function method (lines 25-26).

We also initialize the index parameters list to use for searching the collection (line 30) and we add each of the indices in the index_params_list using the _create_index method (lines 31-32). Finally, we create the collection with the client.create_collection method (lines 35-39).

Let's discuss how the schema and index parameters are initialized and the collection is created through the client attribute of the MilvusClientInit class.

Method 1.2 | `_init_client`

See lines 101-108 of milvus_utils.py

Every time we use the client attribute of the class, we're using the MilvusClient that we initialized with the _init_client method.

Defining the client attribute of the MilvusClientInit class
from pymilvus import MilvusClient

# URI | Milvus server uri
uri = 'http://localhost:19530'

class MilvusClientInit:
    def __init__(
        self, 
        uri = uri
    ):
        self.uri = uri
        # Initialize MilvusClient from PyMilvus
        self.client = self._init_client()


    ## Initialize MilvusClient
    def _init_client(
        self
    ):
        ## Define MilvusClient with PyMilvus library
        client = MilvusClient(
            uri=self.uri
        )
        return client

So, we can see that we've initialized a MilvusClient on the URI that was defined in our Docker setup. For collection creation, we then use the built-in MilvusClient methods to initialize the schema and index parameters, as well as create the collection.

How do the other methods work?

The _create_field and _create_index methods are also just simple wrappers of built-in PyMilvus methods:

_create_field method of the MilvusClientInit class
## Create field for schema
def _create_field(
    self, 
    schema, 
    params
):
    # Add the field to the schema for the given params
    schema.add_field(**params)

_create_index method of the MilvusClientInit class
## Create index for index parameters
def _create_index(
    self, 
    index_params, 
    params
):
    # Add the index to the index params for the given params
    index_params.add_index(**params)

For the _create_field method, we use the schema.add_field method which takes in all the necessary inputs that are needed for a given field. So, we define these with the field_params_list, then we pass these params to the add_field method (see lines 20-21 of the create_collection method).

The _create_index method works exactly the same way, except it uses the index_params.add_index method. Just like before, we define all the inputs we need for each of the indices in the index_params_list, then pass these params to the add_index method (see lines 31-32 of the create_collection method).
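The `**params` unpacking used by both wrappers can be tried out without a running server. The sketch below uses a hypothetical `FakeSchema` class as a stand-in for the real PyMilvus schema object. It only records what it's given and isn't part of PyMilvus, and the datatype strings are placeholders for the real `DataType` values.

```python
class FakeSchema:
    """Hypothetical stand-in for a PyMilvus schema; it only records fields."""

    def __init__(self):
        self.fields = []

    def add_field(self, field_name, datatype, **kwargs):
        # Store the field definition exactly as it was passed in
        self.fields.append({"field_name": field_name, "datatype": datatype, **kwargs})


# Placeholder strings are used instead of the real DataType enum values
example_field_params = [
    {"field_name": "id", "datatype": "INT64", "is_primary": True, "auto_id": True},
    {"field_name": "text", "datatype": "VARCHAR", "max_length": 1000},
]

schema = FakeSchema()
for params in example_field_params:
    # Each dictionary key becomes a keyword argument of add_field
    schema.add_field(**params)

print(schema.fields)
```

Keeping the field definitions as plain dictionaries is what lets the create_collection method loop over them without knowing which arguments each field needs.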

Now that we understand how collections are defined and created with the create_collection method, let's check out how data is inserted with the insert method.

Method 1.3 | `insert`

See lines 188-202 of milvus_utils.py

We've already seen that the insert method can take in a collection name and some data, then add this data to the appropriate collection. Now, we can open up the method to see how this is all done.

insert method of milvus_utils.py
## Insert data into a collection
def insert(
    self, 
    name = collection_name, 
    data = data_ex
):
    # Insert data into the collection
    results = self.client.insert(
        collection_name=name,
        data=data
    )
    # Wait for the database to update
    time.sleep(0.5)

    ## Return results
    return results

When we recall how the client attribute of the class is defined, we can see that this method is just a wrapper of the MilvusClient method with the same name. We want to insert some given data into the given collection, and we want to wait for a bit before moving on to ensure the data is inserted before trying any searches.

Ok, that one was simple. Now that we know how to create a collection and insert data into the collection, let's see how to do the full-text search.

Method 1.4 | `full_text_search`

See lines 221-248 of milvus_utils.py

We've already seen that the full_text_search method can take in a collection name, a list of queries, and a maximum number of results, then output a list of dictionaries showing the most relevant results. Now, we can open up the method to see how this is all done.

full_text_search method of milvus_utils.py
## Collection name
collection_name = 'collection_ex'

## Query list
# Default list of queries to search the database
query_list = ["What's the focus of information retrieval?"]

## Result limit
# Default maximum number of results to get from search
lim_results = 3

## Perform a full text search on a collection
def full_text_search(
    self, 
    name = collection_name, 
    query_list = query_list, 
    limit = lim_results
):
    ## Define search params
    # Only full-text so focus on sparse vectors built from text
    anns_field = 'sparse'
    output_fields = ['text']

    ## Add extra search params
    # Controls trade-off between speed and accuracy in sparse searches
    # Ignore the smallest 20% of query vector values when searching
    search_params = {
        'params': {'drop_ratio_search': 0.2},
    }

    ## Get the search results
    results = self.client.search(
        collection_name=name, 
        data=query_list,
        anns_field=anns_field,
        output_fields=output_fields,
        limit=limit,
        search_params=search_params
    )
    return results

Here, we define the full_text_search method with some default values for the collection name, the list of queries we want to search for, and the maximum number of results to retrieve. We tell Milvus we want to do a search on the sparse vector fields (lines 21 & 35) and we want the results to be returned with the text fields (lines 22 & 36).

We also define an extra search parameter, the drop_ratio_search, which is some number between zero and one. As described in the Milvus documentation for sparse vector indexes, it sets the proportion of the smallest values in the query vector to ignore during the search. In our case, we tell Milvus to ignore the smallest \(20\%\) of the query vector values. Higher values mean our search will be faster but less accurate.

That's it! We've gone through all the code in this repo that's needed to understand how to set up a Milvus server in Docker and use it to manage and search custom data.

To get an idea of how documents are scored and retrieved using this method, let's check out the func_bm25 function in more detail.

Math Deep Dive

Here, we're going to look at the relevant math in more detail.

BM25

This scoring function was the 25th (and best performing) version of the algorithms tested for the Okapi Information System at London's City University back in the late 1900s and has been widely used since for exact keyword searches.

Let's say we want to search our documents for a query, \(q\). The scoring function will take into account three terms in order to determine the relevancy of a given document, \(d\): the Term Frequency (TF), the Inverse Document Frequency (IDF), and the Document Length Normalization (DLN).

\(\text{TF}(q,d)\): the frequency at which the term, \(q\), appears in the document, \(d\)
\(\text{IDF}(q)\): a measure of how rare the term, \(q\), is across all documents (the fewer documents contain it, the larger the value)
\(\text{DLN}(d)\): ensures that longer documents don't get higher scores based mostly on their length rather than their relevancy.

The TF is just a count: however many times the term appears in the document. The other two terms have more complicated mathematical forms. Let's look at these two, then we can see how all the terms go together to get the BM25 score.

Function 1.1 | Inverse Document Frequency⚓︎

The IDF is given by:

\[ \text{IDF}(q) = \ln \left( \frac{N - n(q) + 0.5}{n(q) + 0.5} + 1 \right) \]

where \(q\) is the term being searched for, \(N\) is the total number of documents, and \(n(q)\) is the number of documents that contain the term \(q\).

What happens if none of the documents contain the term (i.e. \(n(q)=0\))?

\[ \text{IDF}(q) = \ln \left( \frac{N + 0.5}{0.5} + 1 \right) = \ln(2N + 2) = \ln(2) + \ln(N+1) \]

So, if none of the documents contain the term, we get some positive number that grows with the total number of documents.

What happens if all the documents contain the term (i.e. \(n(q)=N\))?

\[ \text{IDF}(q) = \ln \left( \frac{0.5}{N + 0.5} + 1 \right) = \ln \left( \frac{N + 1}{N + 0.5} \right) = \ln \left( \frac{1 + \left(1/N\right)}{1 + \left(0.5/N\right)} \right) \]

\[ \left.\text{IDF}(q)\right\vert_{N\gg1} = \left.\ln \left( \frac{1 + \left(1/N\right)}{1 + \left(0.5/N\right)} \right)\right\vert_{N\gg1} \approx \ln(1) = 0 \]

So, if all the documents contain the term, the IDF for that term gets close to zero (especially when \(N\gg1\), which will usually be the case).

To wrap it all up:

\(n(q)\)	\(\text{IDF}(q)\)
Few documents contain \(q\)	Increases as \(N\) increases
Most documents contain \(q\)	Decreases to zero as \(N\) increases
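To make these two limits concrete, here's a minimal Python sketch that evaluates the IDF formula above for a few corpus sizes. The function name idf and the sample values of \(N\) are only illustrative assumptions; this is not code from Milvus or from this repo.

```python
import math


def idf(n_docs: int, n_containing: int) -> float:
    """IDF from Function 1.1: ln((N - n(q) + 0.5) / (n(q) + 0.5) + 1)."""
    return math.log((n_docs - n_containing + 0.5) / (n_containing + 0.5) + 1.0)


# Illustrative corpus sizes (assumed values, not taken from the tutorial data).
for n_docs in (10, 1_000, 100_000):
    rare = idf(n_docs, 0)         # no document contains the term
    common = idf(n_docs, n_docs)  # every document contains the term
    print(f"N={n_docs:>7}  IDF(n=0)={rare:.4f}  IDF(n=N)={common:.6f}")
```

The first column keeps growing (logarithmically) with \(N\), while the second shrinks toward zero, matching the table.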

Now, let's look at the DLN term.

Function 1.2 | Document Length Normalization⚓︎

The DLN depends on some constant, \(0 \leq b \leq 1\), chosen by the user,

\[ \text{DLN}(d) = \left(1 - b + b \frac{d_l}{D_l}\right) \]

where \(d_l\) is the length of document \(d\) and \(D_l\) is the average length of all documents.

What happens if the length of the document is much longer than the average length (i.e. \( d_l \gg D_l \))?

\[ \text{DLN}(d) = \left(1 - b + b \frac{d_l}{D_l}\right) = \frac{d_l}{D_l}\left(\frac{D_l}{d_l} - b \frac{D_l}{d_l} + b\right) \]

\[ \left. \text{DLN}(d) \right\vert_{d_l \gg D_l} = \left. \frac{d_l}{D_l}\left(\frac{D_l}{d_l} - b \frac{D_l}{d_l} + b\right) \right\vert_{d_l \gg D_l} \approx b \frac{d_l}{D_l} \]

So, if the length of the document is much longer than the average length (and \(b > 0\)), the DLN is roughly a large number, \(d_l / D_l\), multiplied by the constant chosen by the user. If the constant is one, the DLN is exactly the ratio of the document length to the average length, which is some very large positive number. If the constant is zero, the approximation above no longer applies: the DLN is exactly one for every document, so the document length has no effect at all.

What happens if the length of the document is much shorter than the average length (i.e. \( d_l \ll D_l \))?

\[ \left. \text{DLN}(d) \right\vert_{d_l \ll D_l} = \left. \left(1 - b + b \frac{d_l}{D_l}\right) \right\vert_{d_l \ll D_l} \approx (1-b) \]

So, if the length of the document is much shorter, the DLN is some number between zero (if \(b=1\)) and one (if \(b=0\)); a very mild number compared to the ratio above.

To wrap it all up:

\(d_l / D_l\)	\(\text{DLN}(d)\) for \(b=0\)	\(\text{DLN}(d)\) for \(b=1\)
Document much longer than average	1	Some large number
Document much shorter than average	1	0
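As a quick numerical check of these limits, here's a small Python sketch of the DLN formula. The function name dln, the average length of 100 tokens, and the value \(b = 0.75\) (a commonly used choice, not necessarily the one configured in this repo) are assumptions for illustration only. Note that with \(b = 0\) the formula gives exactly one no matter how long the document is.

```python
def dln(doc_len: float, avg_len: float, b: float) -> float:
    """DLN from Function 1.2: 1 - b + b * (d_l / D_l)."""
    return 1.0 - b + b * doc_len / avg_len


avg_len = 100.0  # assumed average document length, in tokens

for b in (0.0, 0.75, 1.0):
    long_doc = dln(10_000.0, avg_len, b)  # document 100x longer than average
    short_doc = dln(1.0, avg_len, b)      # document 100x shorter than average
    print(f"b={b:<4}  DLN(long)={long_doc:8.2f}  DLN(short)={short_doc:.4f}")
```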

Now that we understand these terms, let's check out the BM25 function.

Function 1.3 | Final Form of BM25⚓︎

The BM25 scoring function is given by:

\[ \text{score}(d, Q) = \sum^{m}_{i=1} \text{IDF}(q_i) \frac{\text{TF}(q_i, d) \left(k_1 + 1\right)}{\text{TF}(q_i, d) + k_1 \text{DLN}(d)} \]

where \(d\) is the document being scored, \(Q\) is the user's entire search query which is split into \(m\) terms, and \(q_i\) is the \(i\)-th term of the search query. Here, \(0 \leq k_1 \leq 3\) is a constant chosen by the user which determines the contribution of TF to the score by capping off the score as TF gets really large.

If a document contains a term over and over again, the TF is going to be huge and the score for this term is going to blow up compared to the other scores. However, if we limit how large the score can be for a given term, we won't end up with this problem. Check out the equation below to see what I mean as TF becomes very large.

\[ \left.\frac{\text{TF}(q_i, d) \left(k_1 + 1\right)}{\text{TF}(q_i, d) + k_1 \text{DLN}(d)}\right\vert_{\text{TF}\rightarrow\infty} = \left.\frac{\text{TF}(q_i, d)}{\text{TF}(q_i, d)}\frac{\left(k_1 + 1\right)}{1 + k_1 \frac{\text{DLN}(d)}{\text{TF}(q_i, d)}} \right\vert_{\text{TF}\rightarrow\infty} \approx k_1 + 1 \]

So, for terms that have really large TF, the contribution to the score from TF is capped off to \(k_1 + 1\). This prevents the score from being dominated by overly frequent terms and allows for more subtle results to be found. As \(k_1\) gets larger, the maximum allowed contribution from the TF also gets larger. When \(k_1 = 0\), the TF factor equals one for any term that appears in the document, so how often the term appears no longer matters at all.

To wrap it all up, for very large TF:

\(k_1\)	\(\text{score}(d, Q)\)
0	\(\sum^{m}_{i=1} \text{IDF}(q_i)\)
3	\(4 \sum^{m}_{i=1} \text{IDF}(q_i)\)
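To see the capping in action, here's a minimal Python sketch of just the TF factor from the scoring function. The name tf_factor is hypothetical, and setting \(\text{DLN}(d) = 1\) (a document of exactly average length) is an assumption to keep the numbers simple.

```python
def tf_factor(tf: float, k1: float, dln: float = 1.0) -> float:
    """TF part of the BM25 score: TF * (k1 + 1) / (TF + k1 * DLN)."""
    return tf * (k1 + 1.0) / (tf + k1 * dln)


for k1 in (0.0, 1.2, 3.0):
    # The factor climbs toward k1 + 1 as TF grows, but never exceeds it.
    values = [tf_factor(tf, k1) for tf in (1, 10, 100, 10_000)]
    formatted = ", ".join(f"{value:.4f}" for value in values)
    print(f"k1={k1}: {formatted}  (cap = {k1 + 1.0})")
```

For \(k_1 = 0\) every value is exactly one, and for \(k_1 = 3\) the values approach (but stay below) four.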

Now, recall the limits we examined for the IDF. The scoring function depends on the IDF linearly for a given term. So, ignoring the TF contribution (which is capped off to \(k_1+1\)), if most of the documents contain the term, the score is very small, and if none of the documents contain the term, the score becomes some positive number that grows with the total number of documents.

This means if the document being scored contains terms in our query that don't appear in the other documents very often, it's bound to be relevant. But, if the document contains terms in our query that almost all documents have, it may not be that relevant.

Now, recall the limits we examined for the DLN. As the length of the document being scored becomes much longer than the average length, the DLN term in the denominator becomes larger and larger, causing the score to be smaller. But as the length becomes much shorter than the average, the DLN term is just some number between zero and one and won't affect the score very much.

It seems, then, that this term is mostly used to penalize documents that are much longer than the average length. We don't want the scores for these documents to blow up just because they're much longer and contain many more words (i.e. TF is potentially larger compared to shorter documents).

What does this all mean? Well, we see that a document will get a high score for a given term if the term appears in the document frequently and it's one of the few documents that contain the term. As the frequency of the term in the document goes down or as the number of documents that also contain the term goes up, the score will go down. Furthermore, if the document is very long its score will be penalized, and if the term frequency is very large its score will be capped off.
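Putting the three pieces together, here's a toy Python sketch that scores a tiny corpus with the BM25 formula from Function 1.3. Everything in it is an assumption for illustration: the function name bm25_scores, the lowercase whitespace tokenization, the three sample documents, and the values \(k_1 = 1.2\) and \(b = 0.75\) (commonly used choices). It's not how Milvus computes scores internally, just the same math on a small scale.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # n(q): the number of documents that contain each term at least once
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        dln = 1.0 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue  # a term missing from the document contributes nothing
            n_q = doc_freq[term]
            idf = math.log((n_docs - n_q + 0.5) / (n_q + 0.5) + 1.0)
            score += idf * tf[term] * (k1 + 1.0) / (tf[term] + k1 * dln)
        scores.append(score)
    return scores


docs = [
    "milvus stores sparse vectors",
    "docker runs the milvus server",
    "the cat sat on the mat",
]
scores = bm25_scores("sparse vectors", docs)
for doc, score in zip(docs, scores):
    print(f"{score:.4f}  {doc}")
```

Only the first document contains the query terms, so it's the only one with a non-zero score.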

With all this being said, it seems that a high score means a more relevant document, and as the score goes down the document becomes less relevant to the search terms. This is helpful to understand, because documents are ranked using this score through a cosine-similarity test.

This test checks the similarity between two vectors through a simple dot product (comparing the angle between the two). If the two vectors are the same, the angle between them is equal to zero, and their similarity score is equal to one. As the angle between the two grows, the similarity score gets closer to zero (for strictly positive vector elements). Let's check out the cosine-similarity score for two vectors \(\textbf{A}\) and \(\textbf{B}\) to see what I mean:

\[ S_C\left( \textbf{A}, \textbf{B} \right) = \cos\left(\theta_{AB}\right) = \frac{\textbf{A}\cdot\textbf{B}}{\left|\textbf{A}\right|\left|\textbf{B}\right|} = \frac{\sum_{i=1}^{N}A_i B_i}{\sqrt{\sum_{i=1}^N A_i^2} \sqrt{\sum_{i=1}^N B_i^2}} \]

where \(A_i\) and \(B_i\) are the \(i\)-th components of the vectors \(\textbf{A}\) and \(\textbf{B}\). So, all we need to do is multiply the vectors component by component, add up the products, and divide by the product of the two vector lengths to get the score.
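Here's a minimal Python sketch of the cosine-similarity formula for sparse vectors. Storing each vector as a dictionary that maps a dimension to its non-zero value is an assumption made for this example, as is the function name cosine_similarity.

```python
import math


def cosine_similarity(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors stored as {dimension: value}."""
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * b[key] for key, value in a.items() if key in b)
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity({"milvus": 1.0, "docker": 2.0}, {"milvus": 2.0, "docker": 4.0})
no_overlap = cosine_similarity({"milvus": 1.0}, {"docker": 1.0})
print(same_direction, no_overlap)
```

Vectors pointing in the same direction score (approximately) one, and vectors with no dimensions in common score zero.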

In our case, the vectors that we want to compare are just the sparse vectors that we've been discussing. But how does the func_bm25 turn the text fields into sparse vectors?

When we add a document to Milvus, it first does a lot of preprocessing behind the scenes, including tokenization and stop word removal. From this preprocessing, it obtains the set of all tokens within the document along with the TF value for each token. The sparse vector for the document is then created from these TF values and stored.
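To get a feel for this step, here's a rough Python sketch that turns a text into a TF-based sparse vector. It's only an illustration under loud assumptions: the tiny stop word list, the punctuation stripping, the running vocabulary that assigns each token an index, and the names tokenize and to_sparse_vector are all made up for this example and don't reflect how Milvus's analyzers actually work.

```python
from collections import Counter

# Hypothetical, tiny stop word list; real analyzers use much larger ones.
STOP_WORDS = {"the", "a", "an", "of", "on", "and", "to", "in"}


def tokenize(text: str) -> list[str]:
    """Lowercase, strip basic punctuation, and drop stop words."""
    tokens = [token.strip(".,!?;:").lower() for token in text.split()]
    return [token for token in tokens if token and token not in STOP_WORDS]


def to_sparse_vector(text: str, vocabulary: dict[str, int]) -> dict[int, float]:
    """Map each token to an index and store its term frequency there."""
    counts = Counter(tokenize(text))
    for token in counts:
        # Give unseen tokens the next free index (assumed indexing scheme).
        vocabulary.setdefault(token, len(vocabulary))
    return {vocabulary[token]: float(count) for token, count in counts.items()}


vocabulary: dict[str, int] = {}
vector = to_sparse_vector("The cat sat on the mat. The cat!", vocabulary)
print(vocabulary)
print(vector)
```

Only the dimensions for tokens that actually appear get a value; every other dimension is implicitly zero, which is what makes the vector sparse.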

When a user then invokes the full-text search with a query, the query goes through similar preprocessing to obtain the search tokens, and a sparse vector representation of the query is obtained from their TF values. This query vector is then combined with the IDF values according to the BM25 score and compared to the stored vector of the document using the cosine-similarity score.

Notice that when Milvus gives us results from the full-text search, it includes a distance parameter. This is just the cosine-similarity score between the query vector combined with the IDF and the stored vector of the document. As expected, higher distances pertain to more relevant documents (both cosine-similarity and IDF increase with increasing relevance).

And that's it! We've gone through all the code and math that's needed to understand how to use Milvus to search our custom data for relevant information.

Next Steps & Learning Resources⚓︎

If you've followed along with the servers tutorials up to this point, we've finished building all the individual servers that we'll need in order to start our agent builds. However, there is one final (and very simple) tutorial to show how to combine all the servers covered in the series. This last tutorial will show how to build the complete server stack that we'll use for our specialized agent builds.

Continue learning how to build the final server stack by choosing the last tutorial in the servers series, learn how to pass this Milvus server to an agent and interact with it through a Gradio web UI in the document agent tutorial, or check out other agent builds in the rest of the agents tutorials.

Just like all the other tutorials, all the source code is available so you can plug and play any of the tutorial code right away.

Contributing⚓︎

This tutorial is a work in progress. If you'd like to suggest or add improvements, fix bugs or typos, ask questions to clarify, or discuss your understanding, feel free to contribute through participating in the site discussions! Check out the contributing guidelines to get started.

These vectors are called sparse because most of their components are zero. They're highly focused in a small number of dimensions. Or, in other words, almost all of their magnitude is in a handful of select directions. These seem like good vectors to describe an exact keyword. ↩
Where most available dimensions are naught for sparse vectors, dense vectors are more spread across the dimensions in complex ways. This is good for modeling complex relationships between data and for finding relevant data that may not necessarily contain the exact keywords in the query. ↩
We can also customize the fields that describe our data, the index parameters that describe how we search our data, and the functions with which we embed our data. I leave these examples for a future tutorial where we learn how to give a LangChain agent the ability to do hybrid searches on our documents. ↩
