
Ollama with Docker and Python⚓︎


Ollama Python Docker


TL;DR

Learn how to chat with LMs on your local machine. Then, you can use this setup as a base to power locally run AI agents.


About This Project⚓︎

In this tutorial, we're going to set up an Ollama server in Docker for chatting with LMs on our local machines using the Ollama Python library.

The code we learn and use here will serve as a foundation for building AI agents, showcasing how to connect to an LM and invoke responses. This is how we'll power the decision-making and response-generating processes of our agents. For a general overview of what we're going to do with these agents, check out the next series of tutorials.

As previously mentioned, the way we're going to build agents is by first building local servers for all the gadgets that our agents will need. Then, we can learn how to pass these gadgets over to our agents with LangChain and LangGraph.

To kick off the tutorials, we'll start with the most fundamental server that we'll need to pass to our agents, without which our agents are basically powered off: the LM server.

As for the software to host the LM, I like using Ollama. It's insanely easy to set up and get started chatting right out of the box (as we'll see next), and it plays nicely with all the other libraries that we'll be using.


Some alternatives that I've found for chatting with LMs are LM Studio, Open WebUI, and Huggingface Transformers. There was a time when one could chat with really large LLMs like Qwen3 235B for free using Huggingface's HuggingChat. Thanks for all your help, my ephemeral friend.


Now, let's get building!


Getting Started⚓︎

First, we're going to set up and build the repo to make sure that it works. Then, we can play around with the code and learn more about it.

Check out all the source code here.

Toggle for visual instructions

This is currently under construction.

To set up and build the repo, follow these steps:

  1. Make sure Docker is installed and running.

  2. Clone the repo, head there, then create a Python environment:

    git clone https://github.com/anima-kit/ollama-docker.git
    cd ollama-docker
    python -m venv venv
    

  3. Activate the Python environment (the command below is for Windows; on macOS/Linux, use source venv/bin/activate instead):

    venv/Scripts/activate
    

  4. Install the necessary Python libraries:

    pip install -r requirements.txt
    

  5. Choose to run the LM on GPU or CPU, then build and run the Docker container:

    docker compose -f docker-compose-gpu.yml up -d
    
    docker compose -f docker-compose-cpu.yml up -d
    

    Getting an LM response can run on your GPU or CPU. Using the GPU will generally be faster, but not all GPUs will be compatible. Also, you may not be able to run certain models regardless of whether you use GPU or CPU.

    After successfully completing this step, the Ollama server should be running on http://localhost:11434. For me, when everything is set up correctly and I click the URL link, I get a page that says Ollama is running.

  6. Run the test script to ensure the default LM (Qwen3 0.6B) can be invoked:

    python ollama_test.py
    

    From the Docker setup, all Ollama data (including models) will be located in the local folder ./ollama_data/. All logs from the test script are output in the console and stored in the ./ollama-docker.log file.

  7. When you're done, stop the Docker containers and clean up with:

    docker compose -f docker-compose-gpu.yml down
    
    docker compose -f docker-compose-cpu.yml down
    

Example Use Cases⚓︎

Now that the repo is built and working, let's play around with the code a bit.

Right now, we're only setting up an Ollama server to chat with an LM. That means we're only going to be using one method, namely something that gets an LM response for a given user message. To get our own chat going, all we need to do is initialize the class that holds this method, then we can send messages and get responses as we please.


The main class to interact with an LM is the OllamaClient class of the ./ollama_utils.py file, which is built on the Ollama Python library. Once this class is initialized, the get_response method can be used to get an LM response for a given user message.

To facilitate interactions with the LM, the get_response method can be executed in the command line or in Python scripts. However, in later tutorials we'll implement Gradio web UIs so that we can easily chat with our agents in an intuitive way. For now, we can see how to interact with LMs through these less intuitive ways.

To start off, let's interact with an LM through the command line.


Chatting with an LM through the Command Line⚓︎

Toggle for visual instructions

This is currently under construction.

To chat with an LM, follow these steps:

  1. Do step 3 then step 5 of the Getting Started section to activate the Python environment and run the Ollama server in Docker.

  2. Start a Python interpreter from the command line:

    python
    
  3. Now that you're in the Python interpreter, import the OllamaClient class:

    from ollama_utils import OllamaClient
    
  4. Initialize the OllamaClient class:

    client = OllamaClient()
    

  5. Define a message:

    message = 'What is the average temperature of the universe?'
    

  6. Then get a response:

    client.get_response(message=message)
    
  7. Repeat step 5 and step 6 any number of times to send different messages and get responses.

  8. Do step 7 of the Getting Started section to stop the containers when you're done.

Just like with the test script, all logs will be printed in the console and stored in ./ollama-docker.log. The get_response method can also be executed with the default message Why is the sky blue? by running the method with no arguments: client.get_response().


Now that we know how to get responses for different messages, let's try different LMs. By default, we're getting responses from Qwen3 0.6B. But the get_response method can also be executed with a specified LM by utilizing the lm_name argument.
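
For a quick test in the Python interpreter from the previous section, you can pass the lm_name argument directly. This is just a sketch; the model name is only an example, and any model available in the Ollama library should work the same way:

# Assumes `client` from the previous section is still initialized
# The model name is just an example; it will be pulled automatically if it isn't already available
client.get_response(lm_name='qwen3:1.7b', message='What is the average temperature of the universe?')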

In the next example, I show how to make this more reusable by creating and running a custom script to get responses for a given LM.


Chatting with an LM through Running Scripts⚓︎

Toggle for visual instructions

This is currently under construction.

To chat with an LM, follow these steps:

  1. Do step 3 then step 5 of the Getting Started section to activate the Python environment and run the Ollama server in Docker.

  2. Create a script named my-lm-chat-ex.py with the following:

    # Import OllamaClient class
    from ollama_utils import OllamaClient
    
    # Initialize client
    client = OllamaClient()
    
    # Define LM to use and message to send
    # Change these variables to use a different LM or send a different message
    lm_name = 'deepseek-r1:1.5b'
    message = 'What is the average temperature of the universe?'
    
    # Get response
    client.get_response(lm_name=lm_name, message=message)
    

  3. Run the script:

    python my-lm-chat-ex.py
    
  4. Do step 7 of the Getting Started section to stop the containers when you're done.

Again, all logs will be printed in the console and stored in ./ollama-docker.log. The name of the Python script doesn't matter as long as you use the same name in step 2 and step 3.


Now, you have the tools to edit the script (or create an entirely new script) to send whichever messages to whichever LMs you wish. For more structured chats, you can loop through invoking different LMs for different messages with something like the following:

# Define LMs to use
lm_names = ['deepseek-r1:1.5b', 'qwen3:1.7b']

# Define messages to send
messages = [
    'What is the mathematical symbol PI?',
    'Discuss prominent factors in the evolution of humans on Earth.',
    'What will the weather be like in Houston, TX tomorrow?'
]

# Get response for each message in messages from each LM in lm_names
for lm_name in lm_names:
    for message in messages:
        client.get_response(lm_name=lm_name, message=message)

Notice that the LMs won't be able to give a good answer for the last question because they were built on static stores of knowledge. However, in a future tutorial we'll see how to give an agent a web search tool so that it's easy to answer questions like the last one.


Once set up, you can use this foundation to chat with different LMs. I've gotten a lot of mileage out of the Qwen3 and DeepSeek-R1 models, and they're the defaults that I use in my tutorials. I always like to start with the tiniest Qwen3 model (Qwen3 0.6B) for testing purposes because it's lightweight and fast. However, there are a lot of different models out there for different purposes.

Check out the Ollama library for a list of available models. You can also search the Huggingface library, and the model you like (or a similar one) might be available in the Ollama library.
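
If you'd rather pull a model ahead of time (so the first get_response call doesn't stall on a download), the ollama.pull function that the repo uses under the hood can be called directly. A minimal sketch, with the model name only as an example:

import ollama

# Pull a model from the Ollama library ahead of time
# The model name is just an example; swap in any model from https://ollama.com/library
ollama.pull('qwen3:1.7b')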

Now that we understand how to use the code, let's open it up to check out the gears.


Project Structure⚓︎

Before we take a deep dive into the source code, let's look at the repo structure to see what code we'll want to learn.

├── docker-compose-cpu.yml  # Docker settings for CPU build of Ollama container
├── docker-compose-gpu.yml  # Docker settings for GPU build of Ollama container
├── logger.py               # Python logger for tracking progress
├── ollama_test.py          # Python test of methods
├── ollama_utils.py         # Python methods to use Ollama server
├── requirements.txt        # Required Python libraries
└── tests/                  # Testing suite
    ├── test_integration.py # Integration tests for use with Ollama API
    └── test_unit.py        # Unit tests for Python methods

tests/

This folder contains unit and integration tests to make sure the logic of the source code works as expected without calling the Ollama API, as well as when the API is available. I won't go over these files as they're not important for our results, though I did leave a lot of comments for anyone who's interested. Learning how to create testing suites has not only helped me better understand how my code works, but also how to start my code off on the right track so that I don't have as many surprise bugs popping up along the way. You can check out the best practices bonus info to see how to run the tests.


logger.py

The logger.py file is used to get logs from the code that we run (i.e. all the status updates that are output in the console and saved in the ./ollama-docker.log file). This file doesn't matter for our results either. It's just an extra bow to put on top so that our interactions are informative and nice to look at. Though, if you're interested, the Logging and Rich Python libraries are worth looking into.
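
I won't reproduce logger.py here, but a minimal sketch of the same idea (Rich-formatted console output plus a plain log file) might look something like this; the handlers and format in the actual repo may differ:

import logging
from rich.logging import RichHandler

# Send logs both to a Rich-formatted console and to a plain-text log file
logging.basicConfig(
    level=logging.INFO,
    format='%(message)s',
    handlers=[
        RichHandler(),
        logging.FileHandler('ollama-docker.log'),
    ],
)

logger = logging.getLogger('ollama-docker')
logger.info('Logger is ready')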


requirements.txt

The requirements.txt file tells Python what libraries we need to install in our Python environment to use the code. This is the file we used in step 4 of the Getting Started section to install all the libraries that we needed.


ollama_utils.py

Now, the ollama_utils.py file has the class we need to instantiate and the method we need to get responses. We're going to be spending most of our deep dive on this file.


ollama_test.py

The ollama_test.py file basically does what we did when running the script in the Example Use Cases section, namely use the code in the ollama_utils.py file to get a response.
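
I won't step through the file, but conceptually it boils down to something like the following sketch (the actual script in the repo may log more details along the way):

from ollama_utils import OllamaClient

# Initialize the client and get a response from the default LM (qwen3:0.6b) for the default message
client = OllamaClient()
response = client.get_response()
print(response)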


docker-compose-*.yml

Finally, this leaves the Docker compose files: docker-compose-gpu.yml and docker-compose-cpu.yml. These are the files that tell Docker how to build the Ollama container, one for building with GPU support and one for building with CPU support. We're also going to want to understand how these work because they'll be crucial to all of our agent builds.


Ok, that's all the files. Let's go diving!


Code Deep Dive⚓︎

Here, we're going to look at the relevant files in more detail. We're going to start with looking at the full files to see which parts of the code we'll want to learn, then we can further probe each of the important pieces to see how they work.


File 1 | ollama_utils.py⚓︎

Toggle file visibility

ollama_utils.py skeleton
### ollama_utils
## Defines functions needed to setup and query an LM with an Ollama client.
## Based on README instructions of Ollama Python library | https://github.com/ollama/ollama-python

import re
import ollama
from ollama import Client

## Define constants
# LM | Must be available in Ollama library (https://ollama.com/library)
lm_name: str = 'qwen3:0.6b'

# URL | Ollama server url from Docker setup
url: str = 'http://localhost:11434'

# Message | Example message to send to LM
message: str = 'Why is the sky blue?'


class OllamaClient:
    def __init__(
        self, 
        url = url
    ):
        self.url = url
        self.client = self._init_client()


    ## Connect to the Ollama client
    def _init_client(
        self
    ):
        # Create the client
        client = Client(host=self.url)            
        return client


    ## Check existing Ollama models
    def _list_pulled_models(
        self
    ):
        # List all models available with Ollama
        ollama_models = ollama.list()
        # List all model names
        model_names = [
            model.model for model in ollama_models.models
        ]
        return model_names


    ## Pull LM
    def _pull_lm(
        self,
        lm_name = lm_name
    ):
        response = ollama.pull(lm_name)
        return response


    ## Setup LM
    def _init_lm(
        self, 
        lm_name = lm_name
    ):
        # Check existing models in Ollama
        model_names = self._list_pulled_models()

        # If `lm_name` not in model_names, pull it from Ollama
        if lm_name not in model_names:
            self._pull_lm(lm_name=lm_name)


    ## Clean LM response
    def _remove_think_tags(
        self, 
        text
    ):
        # Find <think>...</think> within the text
        think_tag_pattern = re.compile(r'<think>.*?</think>\s*', re.DOTALL)
        # If the tag is not fully closed, we handle that separately
        if not think_tag_pattern.search(text):
            # Find any instances of <think> and </think> and substitute them with empty strings
            outside_tags = re.sub(r'</?think>', '', text).strip()

        # If matched and closed, clean up tags from outside
        else:
            # Change the <think>...</think> to an empty string
            cleaned_text = think_tag_pattern.sub('', text).strip()
            # Make sure no other tags remain
            outside_tags = re.sub(r'</?think>', '', cleaned_text).strip()

        return outside_tags


    ## Query the LM
    def get_response(
        self, 
        lm_name = lm_name, 
        message = message,
        options = None
    ):
        ## Make sure LM is available
        self._init_lm(lm_name=lm_name)

        ## Get LM response
        # Format the message properly
        messages = {
            'role': 'user',
            'content': message,
        }
        # Send a message to the model
        response = self.client.chat(
            model=lm_name, 
            messages=[messages],
            options=options
        )

        ## Return the response
        content = self._remove_think_tags(response.message.content)
        return content
ollama_utils.py full
### ollama_utils
## Defines functions needed to setup and query an LM with an Ollama client.
## Based on README instructions of Ollama Python library | https://github.com/ollama/ollama-python

import re
from re import Match
import ollama
from ollama import (
    Client, 
    ChatResponse, 
    ListResponse,
    ProgressResponse
)
from typing import (
    List,
    Pattern
)

from logger import (
    logger,
    with_spinner
)

## Define constants
# LM | Must be available in Ollama library (https://ollama.com/library)
lm_name: str = 'qwen3:0.6b'

# URL | Ollama server url from Docker setup
url: str = 'http://localhost:11434'

# Message | Example message to send to LM
message: str = 'Why is the sky blue?'


class OllamaClient:
    """
    An Ollama client that can be used to chat with LMs.

    The user can chat with an LM by initializing this class then using the get_response method for a given LM and message.

    For example, to initialize the client for a given url:
    ```python
    url = 'my_url'
    client = OllamaClient(url=url)
    ```

    Then to get a response for a given message from a given LM:
    ```python
    lm_name = 'my_model'
    message = 'My message'

    client.get_response(lm_name=lm_name, message=message)
    ```

    Attributes
    ------------
        url: str, Optional
            The URL on which to host the Ollama client
            Defaults to 'http://localhost:11434'
        client: Client (Ollama Python)
            The Ollama client to use to get responses
    """
    def __init__(
        self, 
        url: str = url
    ) -> None:
        """
        Initialize the Ollama client hosted on the given url.

        Args
        ------------
            url: str, Optional
                The url on which to host the Ollama client
                Defaults to 'http://localhost:11434'

        Raises
        ------------
            Exception: 
                If initialization fails, error is logged and raised
        """
        try:
            self.url = url
            self.client = self._init_client()
        except Exception as e:
            logger.info(f'❌ Problem initializing Ollama client: {str(e)}')
            raise


    ## Connect to the Ollama client
    def _init_client(
        self
    ) -> Client:
        """
        Connect the Ollama client.

        Returns
        ------------
            Client (Ollama Python): 
                The Ollama client instance

        Raises
        ------------
            Exception: 
                If client connection fails, error is logged and raised
        """
        logger.info(f'⚙️ Starting Ollama client on URL `{self.url}`')
        try:
            # Create the client
            client = Client(host=self.url)

            # Test the connection with list and validate response
            response = client.list()

            # Check if `models` attribute exists
            if not hasattr(response, 'models'):
                raise ValueError("The response from client.list() is missing the 'models' attribute")

            # Check that `models` attribute is a list
            if not isinstance(response.models, list):
                raise ValueError("The 'models' attribute of client.list() should be a list")

            logger.info(f"✅ Ollama client connected successfully. Found {len(response.models)} models available.")
            return client
        except Exception as e:
            logger.error(f"❌ Failed connecting to Ollama client at {self.url}: {str(e)}")
            raise


    ## Check existing Ollama models
    def _list_pulled_models(
        self
    ) -> List[str | None]:
        """
        List all models available in Ollama storage.

        Returns
        ------------
            List[str | None]: 
                A list of all the available models

        Raises
        ------------
            Exception: 
                If getting list fails, error is logged and raised
        """
        try:
            # List all models available with Ollama
            ollama_models: ListResponse = ollama.list()
            # List all model names
            model_names: List[str | None] = [
                model.model for model in ollama_models.models
            ]
            logger.info(f'📝 Existing models `{model_names}`')
            return model_names
        except Exception as e:
            logger.error(f'❌ Problem listing models available in Ollama: `{str(e)}`')
            raise


    ## Pull LM
    def _pull_lm(
        self,
        lm_name: str = lm_name
    ) -> ProgressResponse:
        """
        Pull LM with given name from Ollama library if not already pulled (https://ollama.com/library).

        Args
        ------------
            lm_name: str, Optional 
                Name of the LM to pull from Ollama
                Defaults to 'qwen3:0.6b'
                Must be available in Ollama library (https://ollama.com/library)

        Raises
        ------------
            Exception: 
                If LM pull fails, error is logged and raised
        """
        response = ollama.pull(lm_name)
        return response


    ## Setup LM
    def _init_lm(
        self, 
        lm_name: str = lm_name
    ) -> None:
        """
        Initialize LM by checking that it's available from Ollama data.
        Pulls the model if it isn't available.

        Args
        ------------
            lm_name: str, Optional 
                Name of the LM to initialize from Ollama
                Defaults to 'qwen3:0.6b'
                Must be available in Ollama library (https://ollama.com/library)

        Raises
        ------------
            Exception: 
                If LM initialization fails, error is logged and raised
        """
        logger.info(f'⚙️ Setting up Ollama model for LM `{lm_name}`')
        try:
            # Check existing models in Ollama
            model_names = self._list_pulled_models()

            # If `lm_name` not in model_names, pull it from Ollama
            if lm_name not in model_names:
                logger.info(f'⚙️ Pulling `{lm_name}` from Ollama')

                # Show spinner during the pulling process
                with with_spinner(description=f"⚙️ Pulling LM..."):
                    self._pull_lm(lm_name=lm_name)

                logger.info(f'✅ Model `{lm_name}` pulled from Ollama')
        except Exception as e:
            logger.error(f'❌ Problem getting ollama model: `{str(e)}`')
            raise


    ## Clean LM response
    def _remove_think_tags(
        self, 
        text: str | None
    ) -> str:
        """
        Remove <think></think> tags and all text within the tags.

        Args
        ------------
            text: str 
                Text to remove <think></think> tags from

        Returns
        ------------
            str: 
                The resulting string without the <think></think> tags and all the text within

        Raises
        ------------
            Exception: 
                If cleaning the text fails, error is logged and raised
        """
        try:
            # Find <think>...</think> within the text
            think_tag_pattern: Pattern[str] = re.compile(r'<think>.*?</think>\s*', re.DOTALL)
            # Make sure text is str
            # text can be typed as Optional[str] because of response.message.content type in Ollama Python library
            if text is None:
                error_message: str = '❌ Problem cleaning response, check error logs for details.'
                logger.error(error_message)
                raise ValueError(error_message)
            else:
                # If the tag is not fully closed, we handle that separately
                if not think_tag_pattern.search(text):
                    # Find any instances of <think> and </think> and substitute them with empty strings
                    outside_tags: str = re.sub(r'</?think>', '', text).strip()

                # If matched and closed, clean up tags from outside
                else:
                    # Change the <think>...</think> to an empty string
                    cleaned_text: str = think_tag_pattern.sub('', text).strip()
                    # Make sure no other tags remain
                    outside_tags = re.sub(r'</?think>', '', cleaned_text).strip()

                return outside_tags
        except Exception as e:
            logger.error(f'❌ Failed to clean text: `{str(e)}`')
            raise


    ## Query the LM
    def get_response(
        self, 
        lm_name: str = lm_name, 
        message: str = message,
        options: dict | None = None
    ) -> str | None:
        """
        Get a response for the given message from the LM with the given lm_name.

        For example, to get a response first initialize the class then use the get_response method:
        ```python
        url = 'my_url'
        lm_name = 'my_model'
        message = 'My Message'

        client = OllamaClient(url=url)
        client.get_response(lm_name=lm_name, message=message)
        ```

        Args
        ------------
            lm_name: str, Optional 
                Name of the LM to pull from Ollama
                Defaults to 'qwen3:0.6b'
                Must be available in Ollama library (https://ollama.com/library)
            message: str, Optional
                Message to send to LM and get response
                Defaults to 'Why is the sky blue?'
            options: dict, Optional
                Extra options to pass to the Ollama Client chat method
                Defaults to None

        Returns
        ------------
            str: 
                The LM response (without any thinking context)
                To keep thinking context, refrain from using the `self._remove_think_tags` method

        Raises
        ------------
            Exception: 
                If getting a response fails, error is logged and raised
        """
        logger.info(f'⚙️ Getting LM response for message `{message}`')
        ## Make sure LM is available
        try:
            self._init_lm(lm_name=lm_name)
        except Exception as e:
            logger.error(f'❌ Problem pulling LM `{lm_name}` from Ollama: `{str(e)}`')
            raise

        ## Get LM response
        try:
            with with_spinner(description=f"🔮 LM working its magic..."):
                # Format the message properly
                messages: dict = {
                    'role': 'user',
                    'content': message,
                }
                # Send a message to the model
                response: ChatResponse = self.client.chat(
                    model=lm_name, 
                    messages=[messages],
                    options=options
                )

            ## If no response, raise error
            error_message = '❌ Problem returning response, check error logs for details.'
            if response is None:
                logger.error(error_message)
                raise ValueError(error_message)

            ## Else return the response
            else:
                #content = response.message.content
                # If you want to keep LM's thinking context, comment out line below and use line above instead
                content = self._remove_think_tags(response.message.content)
                logger.info(f'📝 LM response `{content}`\n')
                return content
        except Exception as e:
            logger.error(f'❌ Problem getting LM response: `{str(e)}`')
            raise

Above, I show the ollama_utils.py file in all its full glory as well as in a skeleton version. The skeleton version is all the code needed to work, with almost none of the code for some crucial best practices like logging, error handling, type checking, and documenting.

What about these best practices?

Logging, error handling, type checking, documenting, and using testing suites are practices that I didn't fully appreciate while I was learning how to code through finding tutorials online and chatting with LMs, because these sources usually just demonstrate how to get the code working. But after I learned how to implement them in my code, it made sense why all the repos I had been digging through had these practices in spades. They're great for catching problems early, keeping your code working as expected, and easily keeping yourself (and others) in the loop about how the code is working. I highly suggest trying to implement some or all of these practices into your working code throughout the tutorials.

I use mypy for type checking and pytest, pytest-order, and unittest for creating and running the testing suite. If you want to use type checking and the testing suite, be sure to install the necessary Python libraries:

  1. Activate the python environment (step 3 of the Getting Started section)
  2. Install the development libraries: pip install -r requirements-dev.txt

If you want to do type checking for the ollama_utils.py file, run the type checker with:

mypy ollama_utils.py --verbose

The --verbose flag is optional and just gives more information while running. If you want to use the testing suite, run the tests with:

pytest tests/ -v

The -v flag is, again, optional and just gives more information while running.

Notice that all but one of the methods in the class have _ at the beginning of the name. All of the methods with _ are internal methods, meaning they were created to help the other methods in the class, but they're not meant to be used outside of the class.

There is one, and only one, method that was created to be used outside of the class: the get_response method. This is what we used when working in the command line and running scripts in the Example Use Cases section. Let's check this method out.



Method 1.1 | get_response⚓︎

See lines 96-120 of ollama_utils.py

We've already seen that the get_response method can take in an LM name and a message, then output a response. Now, we can open up the method to see how this is all done.

get_response method of OllamaClient
# LM | Must be available in Ollama library (https://ollama.com/library)
lm_name: str = 'qwen3:0.6b'

# Message | Example message to send to LM
message: str = 'Why is the sky blue?'

## Query the LM
def get_response(
    self, 
    lm_name = lm_name, 
    message = message
):
    ## Make sure LM is available
    self._init_lm(lm_name=lm_name)

    ## Get LM response
    # Format the message properly
    messages = {
        'role': 'user',
        'content': message,
    }
    # Send a message to the model
    response = self.client.chat(
        model=lm_name, 
        messages=[messages]
    )

    ## Return the response
    content = self._remove_think_tags(response.message.content)
    return content

As a brief overview, we're first initializing an LM with the _init_lm method on line 14, then we're formatting the user message to properly work with our LM on lines 18-21. After that, we utilize the Ollama Python library by using the chat method of our Ollama Client to get a response on lines 23-26. Finally, we clean up the response with the _remove_think_tags method on line 29 and return the cleaned result on line 30.

Wasn't there another argument in client.chat?

Yep. In the skeleton and full version, the client.chat method has an extra options argument. I added this here to allow a temperature parameter to be passed to the model, but I omitted it during the deep dive because this parameter is only used in the testing suite.

The temperature parameter controls the randomness of the model, with lower temperatures giving more deterministic results. The options argument should be a dictionary with any of the available parameters, which can be found here.
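
For example, to nudge the default LM toward more deterministic answers, you could pass something like the following (temperature is one of the standard Ollama options; the value here is just an illustration):

# Lower temperature -> less random, more repeatable responses
client.get_response(
    message='Why is the sky blue?',
    options={'temperature': 0.1},
)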

Let's look at each of these aspects in turn.



Method 1.2 | _init_lm⚓︎

See line 14 of get_response and lines 61-70 of ollama_utils.py

This method makes sure the specified LM is available to the Ollama client.

_init_lm method of OllamaClient
# LM | Must be available in Ollama library (https://ollama.com/library)
lm_name: str = 'qwen3:0.6b'

## Setup LM
def _init_lm(
    self, 
    lm_name = lm_name
):
    # Check existing models in Ollama
    model_names = self._list_pulled_models()

    # If `lm_name` not in model_names, pull it from Ollama
    if lm_name not in model_names:
        self._pull_lm(lm_name=lm_name)

We first list the models available to the Ollama client using the _list_pulled_models method on line 10. Then, if the model isn't available, we pull it from the Ollama library using the _pull_lm method on line 14. This method is a simple wrapper around the pull function of the Ollama Python library:

_pull_lm method of OllamaClient
import ollama

# LM | Must be available in Ollama library (https://ollama.com/library)
lm_name: str = 'qwen3:0.6b'

## Pull LM
def _pull_lm(
    self,
    lm_name = lm_name
):
    response = ollama.pull(lm_name)
    return response
How does _list_pulled_models work?

The _list_pulled_models method (line 10 of _init_lm and lines 39-48 of ollama_utils.py) is given by:

_list_pulled_models method of OllamaClient
import ollama

## Check existing Ollama models
def _list_pulled_models(
    self
):
    # List all models available with Ollama
    ollama_models = ollama.list()
    # List all model names
    model_names = [
        model.model for model in ollama_models.models
    ]
    return model_names

This method uses the list function of the Ollama Python library to list all the available models on line 8. The response holds a list of model objects, and we can get each model's name by pointing to the proper attribute. The code on lines 10-12 does just this when it loops through all of the available model objects to build a list of the available model names.

Now that we know the LM will be available, we can invoke our Ollama client to get a response. Let's check this method out.



Method 1.3 | client.chat⚓︎

See lines 23-26 of get_response and line 26 of ollama_utils.py

The class attribute, client, is defined in the __init__ method of the class:

__init__ method of OllamaClient
# URL | Ollama server url from Docker setup
url: str = 'http://localhost:11434'

class OllamaClient:
    def __init__(
        self, 
        url = url
    ):
        self.url = url
        self.client = self._init_client()

We utilize the Ollama Python library by instantiating an Ollama Client instance on line 10 with the _init_client method:

_init_client method of OllamaClient
from ollama import Client

## Connect to the Ollama client
def _init_client(
    self
):
    # Create the client
    client = Client(host=self.url)            
    return client

Remember that we set up an Ollama server in Docker. This is the server URL that we point to when creating the client (see lines 2 and 9 of the __init__ method and line 8 of the _init_client method).

The client attribute of the OllamaClient class (line 10 of the __init__ method) is then just an instance of the Ollama Client object from the Ollama Python library. By telling the client to utilize our Ollama server, we can then use its chat method to get responses from LMs.
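
If you want to see what this boils down to without the class machinery, a bare-bones version of the same call with the Ollama Python library looks roughly like this (assuming the Docker server from earlier is running and the model has already been pulled):

from ollama import Client

# Point the client at the Ollama server running in Docker
client = Client(host='http://localhost:11434')

# Send a single user message and print the raw response content
# Unlike get_response, this assumes qwen3:0.6b has already been pulled
response = client.chat(
    model='qwen3:0.6b',
    messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
)
print(response.message.content)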

Now that we can get a response from our Ollama client, let's see how to clean it up.



Method 1.4 | _remove_think_tags⚓︎

See line 29 of get_response and lines 74-92 of ollama_utils.py

The sole purpose of this method is to remove the <think></think> tags, and all the content within them, that some LMs output (including our default models Qwen3 and DeepSeek-R1). These models have a thinking phase before they give a final response to a user message. All the content of this phase is wrapped in <think></think> tags to denote its purpose.

For now, we're going to remove these tags and all the content within. However, in later tutorials we'll see how to output the thinking content separately from the response content in a pretty slick way using a Gradio web UI.

_remove_think_tags method of ollama_utils.py
import re

## Clean LM response
def _remove_think_tags(
    self, 
    text
):
    # Find <think>...</think> within the text
    think_tag_pattern = re.compile(r'<think>.*?</think>\s*', re.DOTALL)
    # If the tag is not fully closed, we handle that separately
    if not think_tag_pattern.search(text):
        # Find any instances of <think> and </think> and 
        # substitute them with empty strings
        outside_tags = re.sub(r'</?think>', '', text).strip()

    # If matched and closed, clean up tags from outside
    else:
        # Change the <think>...</think> to an empty string
        cleaned_text = think_tag_pattern.sub('', text).strip()
        # Make sure no other tags remain
        outside_tags = re.sub(r'</?think>', '', cleaned_text).strip()

    return outside_tags

When I first tried cleaning the output from these models, I started with something like lines 9 and 19. Let's look at this little part more closely:

remove tags for complete match
import re

# Find <think>...</think> within the text
think_tag_pattern = re.compile(r'<think>.*?</think>\s*', re.DOTALL)

# Change the <think>...</think> to an empty string
cleaned_text = think_tag_pattern.sub('', text).strip()

Here, we're using Python's regular expression operations to find all instances of <think>...</think> in a given text (line 4), where ... can be any string content. Line 7 then replaces all those instances with an empty string, while the strip method gets rid of any leading and trailing whitespace characters that remain.

This works great if the LM always outputs the thinking phase fully enclosed in the proper tags. But, what if something messes up along the way and the LM accidentally drops one of the tags or adds an extra one? This is rare but I've seen it, especially if the user message includes one of the <think> or </think> tags. So, we need to take into account situations in which the full think_tag_pattern isn't found. This will also take care of responses without any thinking phases at all.

We've already seen that responses with a fully enclosed thinking phase are taken care of with lines 9 and 19; however, we also want to make sure no tags remain by replacing any instances of <think> or </think> with empty strings on line 21.

To take care of the rest, on line 11 we check whether the think_tag_pattern fails to find a match in the text (i.e. one of the tags was dropped or added accidentally, or the message has no thinking tags at all). On line 14, we then follow the same procedure to clean the text as we did on line 21.

This will sometimes result in some or all of the LM thinking phase being output with the final response, but I'd rather have too much than too little context. Alternatively, I bet there's an approach somewhere out there that ensures only the final response is output.
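
To make the two cases concrete, here's a quick standalone demonstration of the same regex logic on a couple of toy strings (not part of the repo, just an illustration):

import re

think_tag_pattern = re.compile(r'<think>.*?</think>\s*', re.DOTALL)

# Case 1: fully closed tags -> the whole thinking block is stripped
closed = '<think>Some hidden reasoning...</think>The sky is blue because of Rayleigh scattering.'
print(think_tag_pattern.sub('', closed).strip())
# -> The sky is blue because of Rayleigh scattering.

# Case 2: a stray, unclosed tag -> only the tag itself is stripped, the text stays
stray = '<think>Some reasoning that never got closed. The sky is blue.'
print(re.sub(r'</?think>', '', stray).strip())
# -> Some reasoning that never got closed. The sky is blue.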


And that's it! Those are all the methods that we need to dig through in order to understand how we're getting responses from LMs using the Ollama Python library.

Now, how exactly do we create the Ollama server that we'll be pointing to in order to get responses?


File 2 | docker-compose.yml⚓︎

Toggle file visibility

docker-compose-gpu.yml
#### docker-compose-gpu
### Sets up an Ollama server to handle LM calls with GPU acceleration through port `localhost:11434`

## All Ollama model data is saved in the local folder `./ollama_data` by default
## This is the exact same file as `docker-compose-cpu` but with Ollama configured for GPU support

### Creating volumes 
## Optional for Docker volume instead of local folder
## If you want to store Ollama data in Docker instead of a local folder, uncomment two lines below and change volume definition in Ollama service 
#volumes:
#  ollama_data:    

### Creating containers
services:
  ollama: ## Ollama container (GPU-enabled)
    image: ollama/ollama
    container_name: ollama
    ports:  ## Allow external processes to access Ollama server here
      - "127.0.0.1:11434:11434"       

    ## Configure Ollama to use GPU
    gpus: # Use all available GPUs
      - device: all                                  

    ## Can store Ollama data (LMs, etc.) in Docker volume or in folder on local system.    
    ## Store Ollama data in local folder `./ollama_data`
    volumes: # Store Ollama data in folder of parent directory                    
      - ./ollama_data:/root/.ollama

    ## To store Ollama data in a Docker volume instead, comment out the two volume creation lines above and use the two volume creation lines below
    #volumes:  
    #  - ollama_data:/root/.ollama
docker-compose-cpu.yml
#### docker-compose-cpu
### Sets up an Ollama server to handle LM calls using CPU through port `localhost:11434`

## All Ollama model data is saved in the local folder `./ollama_data` by default
## This is the exact same file as `docker-compose-gpu` but with Ollama configured for CPU support

### Creating volumes
## Optional for Docker volume instead of local folder
## If you want to store Ollama data in Docker instead of a local folder, uncomment two lines below and change volume definition in Ollama service 
#volumes:
#  ollama_data:    

### Creating containers
services:
  ollama: ## Ollama container (CPU-enabled)
    image: ollama/ollama
    container_name: ollama
    ports:  ## Allow external processes to access Ollama server here
      - "127.0.0.1:11434:11434"             

    ## Can store Ollama data (LMs, etc.) in Docker volume or in folder on local system.    
    ## Store Ollama data in local folder `./ollama_data`
    volumes: # Store Ollama data in folder of parent directory                    
      - ./ollama_data:/root/.ollama

    ## To store Ollama data in a Docker volume instead, comment out the two volume creation lines above and use the two volume creation lines below
    #volumes:  
    #  - ollama_data:/root/.ollama  

The Docker compose file is a special file that tells Docker how to create the containers that you want. In our case, we want one container, an Ollama server, and there are two different ways to create it, for GPU or CPU support 1.

Now, let's take a closer look at the files.

docker-compose-gpu.yml piece
### Creating containers
services:
  ollama: ## Ollama container (GPU-enabled)
    image: ollama/ollama
    container_name: ollama
    ports:  ## Allow external processes to access Ollama server here
      - "127.0.0.1:11434:11434"       

    ## Configure Ollama to use GPU
    gpus: # Use all available GPUs
      - device: all                                  

    ## Store Ollama data in local folder `./ollama_data`
    volumes: # Store Ollama data in folder of parent directory                    
      - ./ollama_data:/root/.ollama
docker-compose-cpu.yml piece
### Creating containers
services:
  ollama: ## Ollama container (CPU-enabled)
    image: ollama/ollama
    container_name: ollama
    ports:  ## Allow external processes to access Ollama server here
      - "127.0.0.1:11434:11434"             

    ## Store Ollama data in local folder `./ollama_data`
    volumes: # Store Ollama data in folder of parent directory                    
      - ./ollama_data:/root/.ollama

These are typical Docker compose files starting with a definition of all the services (containers) that we want to build. In our case, we only want an Ollama server, so we add that to the service definitions.

We want to use the latest Docker image found here, and we want to interact with the server by using our localhost network to send requests to port 11434 (the designated port where the Ollama API can be reached). This is where we point when we initialize the OllamaClient class of the ollama_utils.py file, and it's the URL that we pass to the Ollama Client through the Client object.
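
As a quick sanity check that the container is reachable on that port, you can hit the root of the server from Python with nothing but the standard library (the plain-text reply is the same "Ollama is running" message you see in the browser):

from urllib.request import urlopen

# Expect the plain-text reply 'Ollama is running'
print(urlopen('http://localhost:11434').read().decode())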

Finally, in this demo we're going to store all of the Ollama data (like the models that we pull) in a local folder called ./ollama_data/. However, in later tutorials we'll use Docker volumes for all our data 2.

By switching between the two code snippets, you can see that the GPU setup is exactly the same as the CPU setup, only with two extra lines that tell Ollama to use all of our available GPUs.


That's it! We've gone through all the code in this repo that's needed to understand how to set up an Ollama server in Docker and use it to chat with LMs in a local Python environment.


Next Steps & Learning Resources⚓︎

Now that you've finished this tutorial, the next tutorials in the servers series will be a breeze because the code to build all the servers is really similar. The Milvus and SearXNG servers have different nuances that can be worked through, but the main process of building the servers and interacting with them through Python will structurally be the same.

Continue learning how to build the rest of the servers by following along with another tutorial in the servers series, or learn how to pass this Ollama server to a LangChain chatbot and interact with it through a Gradio web UI in the chatbot tutorial. You can also check out more advanced agent builds in the rest of the agents tutorials.

Just like all the other tutorials, all the source code is available so you can plug and play any of the tutorial code right away.


Contributing⚓︎

This tutorial is a work in progress. If you'd like to suggest or add improvements, fix bugs or typos, ask questions to clarify, or discuss your understanding, feel free to contribute by participating in the site discussions! Check out the contributing guidelines to get started.



  1. Usually, we would only have one Docker compose file that can handle different user preferences by using something like environment variables (we'll see how to do this when we get into the SearXNG server and chatbot tutorials). However, in our case GPU support is built by adding two lines (see lines 10-11 of the GPU snippet) while CPU support is built by omitting these two lines. I couldn't figure out a good way to dynamically add or omit these lines when building, so I just resorted to using two different files.

  2. You can see how to set up a Docker volume for this project in the full Docker compose files