Adventures in Import-land, Part II

8 April 2024

KeyError: 'GOOGLE_APPLICATION_CREDENTIALS'

It was way too early in the morning for this error. See if you can spot the problem. I hadn’t had my coffee before trying to debug the code I’d written the night before, so it will probably take you less time than it did me.

app.py:

from dotenv import load_dotenv
from file_handling import initialize_constants

load_dotenv()
#...

file_handling.py:

import os
from google.cloud import storage

UPLOAD_FOLDER = None
DOWNLOAD_FOLDER = None

def initialize_cloud_storage():
    """
    Initializes the Google Cloud Storage client.
    """
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
    storage_client = storage.Client()
    bucket_name = "..."  # redacted
    return storage_client.bucket(bucket_name)

def set_upload_folder():
    """
    Determines the environment and sets the path to the upload folder accordingly.
    """
    if os.environ.get("FLASK_ENV") in ["production", "staging"]:
        UPLOAD_FOLDER = os.path.join("/tmp", "upload")
        os.makedirs(UPLOAD_FOLDER, exist_ok=True)
    else:
        UPLOAD_FOLDER = os.path.join("src", "upload_folder")
    return UPLOAD_FOLDER

def initialize_constants():
    """
    Initializes the global constants for the application.
    """
    UPLOAD_FOLDER = set_upload_folder()
    DOWNLOAD_FOLDER = initialize_cloud_storage()
    return UPLOAD_FOLDER, DOWNLOAD_FOLDER
  
DOWNLOAD_FOLDER = initialize_cloud_storage()

def write_to_gcs(content: str, file: str):
    "Writes a text file to a Google Cloud Storage file."
    blob = DOWNLOAD_FOLDER.blob(file)
    blob.upload_from_string(content, content_type="text/plain")

def upload_file_to_gcs(file_path:str, gcs_file: str):
    "Uploads a file to a Google Cloud Storage bucket"
    blob = DOWNLOAD_FOLDER.blob(gcs_file)
    with open(file_path, "rb") as f:
        blob.upload_from_file(f, content_type="application/octet-stream")

See the problem?

As it happens, this exact gotcha was the subject of a recent Pybites article.

When app.py imported initialize_constants from file_handling, the Python interpreter ran

DOWNLOAD_FOLDER = initialize_cloud_storage()

and looked for GOOGLE_APPLICATION_CREDENTIALS in the environment, but load_dotenv hadn’t yet loaded it from the .env file.

Typically, configuration values, secret keys, and passwords are stored in a file called .env and loaded as environment variables, rather than read as plain text, using a package such as python-dotenv, which is what is being used here.
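
For illustration, a .env file for this app might look something like this (both values are placeholders, not the real ones):

GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account.json
FLASK_ENV=development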

So, I had a few options.

I could call load_dotenv before importing from file_handling:

from dotenv import load_dotenv
load_dotenv()

from file_handling import initialize_constants

But that’s not very Pythonic: PEP 8 recommends keeping all imports together at the top of the file.

I could call initialize_cloud_storage inside both upload_file_to_gcs and write_to_gcs:

def write_to_gcs(content: str, file: str):
    "Writes a text file to a Google Cloud Storage file."
    DOWNLOAD_FOLDER = initialize_cloud_storage()
    blob = DOWNLOAD_FOLDER.blob(file)
    blob.upload_from_string(content, content_type="text/plain")

def upload_file_to_gcs(file_path:str, gcs_file: str):
    "Uploads a file to a Google Cloud Storage bucket"
    DOWNLOAD_FOLDER = initialize_cloud_storage()
    blob = DOWNLOAD_FOLDER.blob(gcs_file)
    with open(file_path, "rb") as f:
        blob.upload_from_file(f, content_type="application/octet-stream")

But this violates the DRY principle, and we really shouldn’t be initializing the storage client multiple times anyway. In fact, the code as originally written already initializes it twice: once at module level and once inside initialize_constants.

Going Global

So what about this?

DOWNLOAD_FOLDER = None
 
def initialize_constants():
    """
    Initializes the global constants for the application.
    """
    global DOWNLOAD_FOLDER
    UPLOAD_FOLDER = set_upload_folder()
    DOWNLOAD_FOLDER = initialize_cloud_storage()
    return UPLOAD_FOLDER, DOWNLOAD_FOLDER

Here, we declare DOWNLOAD_FOLDER as global inside the function, which tells Python to rebind the module-level name rather than create a new local variable.

This will work here, because upload_file_to_gcs and write_to_gcs are in the same module. But if they were in a different module, it would break.

Why does it matter?

Well, let’s go back to how Python handles imports. Remember that Python runs any code outside of a function or class at import, and that applies to variable (or constant) assignment as well. So if upload_file_to_gcs and write_to_gcs lived in another module and imported DOWNLOAD_FOLDER from file_handling, they would import it while it was still assigned None. It wouldn’t matter that DOWNLOAD_FOLDER was reassigned by the time it was needed; the other module bound the name at import, so inside that module it would still be None.
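
Here’s a stripped-down illustration of that trap (the second module is hypothetical):

# file_handling.py
DOWNLOAD_FOLDER = None

def initialize_constants():
    global DOWNLOAD_FOLDER
    DOWNLOAD_FOLDER = "a real bucket"  # stand-in for the storage bucket

# other_module.py (hypothetical)
from file_handling import DOWNLOAD_FOLDER  # binds this module's name to None right now

def show_folder():
    # Even if initialize_constants() has already run, this module's
    # DOWNLOAD_FOLDER still points at the original None.
    print(DOWNLOAD_FOLDER)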

What would be necessary in this situation would be another function called get_download_folder.

def get_download_folder():
    """
    Returns the current value of the Google Cloud Storage bucket
    """
    return DOWNLOAD_FOLDER

Then, in this other module containing the upload_file_to_gcs and write_to_gcs functions, I would import get_download_folder instead of DOWNLOAD_FOLDER. By importing get_download_folder, you get the value of DOWNLOAD_FOLDER after it has been assigned an actual value, because get_download_folder won’t run until you explicitly call it, which presumably won’t be until after initialize_cloud_storage has done its thing.
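
In that other module, the functions would then look something like this (a sketch; the module name is hypothetical):

# gcs_io.py (hypothetical)
from file_handling import get_download_folder

def write_to_gcs(content: str, file: str):
    "Writes a text file to a Google Cloud Storage file."
    bucket = get_download_folder()  # resolved at call time, not import time
    blob = bucket.blob(file)
    blob.upload_from_string(content, content_type="text/plain")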

I have another part of my codebase where I’ve done exactly this. On my site, I have a tool that helps authors create fine-tunes of GPT-3.5 from their books. This Finetuner is BYOK, or ‘bring your own key’, meaning that users supply their own OpenAI API key to use the tool. I chose this route because charging authors to fine-tune a model and then charging them to use it, forever, benefits neither of us. This way, they can take their fine-tuned model to any of the other BYOK AI writing tools out there, and I don’t have to maintain writing software on top of everything else. So the webapp’s form accepts the user’s API key and, after a valid form submit, starts my Finetuner application in a separate thread.

The application starts in the training_management.py module, which imports set_client and get_client from openai_client.py and passes the user’s API key to set_client right away. I can’t import client directly, because client is None until set_client receives the API key, which happens after import.

openai_client.py:

from openai import OpenAI

client = None

def set_client(api_key:str):
    """
    Initializes OpenAI API client with user API key
    """
    global client
    client = OpenAI(api_key = api_key)

def get_client():
    """
    Returns the initialized OpenAI client
    """
    return client

When the function that starts a fine-tuning job runs, it calls get_client to retrieve the initialized client. And by moving the API client initialization into its own module, the client also becomes available to an AI-powered chunking algorithm I’m working on. Nothing amazing: basically, it generates scene beats from each chapter to use as the prompt, with the actual chapter as the response. It still needs work, but it’s available for authors who want to try it.
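
In rough outline, the handoff looks like this. Only set_client and get_client come from the code above; the rest is a simplified sketch, compressed into a single function for brevity:

# training_management.py (simplified sketch)
from openai_client import set_client, get_client

def start_finetune(api_key: str, training_file_id: str):
    set_client(api_key)    # store the user's key first
    client = get_client()  # then retrieve the initialized client
    # fine_tuning.jobs.create is the openai v1 SDK call that starts a job
    return client.fine_tuning.jobs.create(
        training_file=training_file_id,
        model="gpt-3.5-turbo",
    )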

A Class Act

Now, we could go one step further. The code we’ve settled on so far relies on global names. Perhaps we can get away with that; after all, DOWNLOAD_FOLDER is a constant. Well, sort of. Remember, it’s defined by initializing a connection to a cloud storage container, so it’s really an instance of a class. By rights, we should be encapsulating all of this logic inside a class of our own.

So what could that look like? It should initialize the upload and download folders and expose them as properties, then carry write_to_gcs and upload_file_to_gcs over as methods, like this:

import os
from google.cloud import storage

class FileStorageHandler:
    def __init__(self):
        self._upload_folder = self._set_upload_folder()
        self._download_folder = self._initialize_cloud_storage()
    
    @property
    def upload_folder(self):
        return self._upload_folder
    
    @property
    def download_folder(self):
        return self._download_folder

    def _initialize_cloud_storage(self):
        """
        Initializes the Google Cloud Storage client.
        """
        os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
        storage_client = storage.Client()
        bucket_name = #redacted
        return storage_client.bucket(bucket_name)

    def _set_upload_folder(self):
        """
        Determines the environment and sets the path to the upload folder accordingly.
        """
        if os.environ.get("FLASK_ENV") in ["production", "staging"]:
            upload_folder = os.path.join("/tmp", "upload")
            os.makedirs(upload_folder, exist_ok=True)
        else:
            upload_folder = os.path.join("src", "upload_folder")
        return upload_folder

    def write_to_gcs(self, content: str, file_name: str):
        """
        Writes a text file to a Google Cloud Storage file.
        """
        blob = self._download_folder.blob(file_name)
        blob.upload_from_string(content, content_type="text/plain")

    def upload_file_to_gcs(self, file_path: str, gcs_file_name: str):
        """
        Uploads a file to a Google Cloud Storage bucket.
        """
        blob = self._download_folder.blob(gcs_file_name)
        with open(file_path, "rb") as file_obj:
            blob.upload_from_file(file_obj)

Now, we can initialize an instance of FileStorageHandler in app.py and assign UPLOAD_FOLDER and DOWNLOAD_FOLDER to the properties of the class.

from dotenv import load_dotenv
from file_handling import FileStorageHandler

load_dotenv()

folders = FileStorageHandler()

UPLOAD_FOLDER = folders.upload_folder
DOWNLOAD_FOLDER = folders.download_folder
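
And since the storage logic now lives on the instance, writing to the bucket becomes a method call (the file name here is just an example):

folders.write_to_gcs("Hello, bucket!", "example.txt")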

Key takeaway

In the example, the error arose because initialize_cloud_storage was called at the top level in file_handling.py. This resulted in Python attempting to access environment variables before load_dotenv had a chance to set them.

I had been thinking of module imports as “everything at the top runs at import.” That’s true, but imprecise. Python structures code by indentation, and function bodies are indented within the module, so every line that isn’t indented sits at the top level of the module. In fact, that’s exactly what it’s called: top-level code, defined as basically anything that is not part of a function, class, or other code block.

And top-level code runs when the module is imported. It’s not enough to bury an expression below some functions; it will still run the moment the module is imported, whether you are ready for it or not. Which is really what the argument against global variables and state is all about: managing when and how your code runs.
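
A two-line experiment makes this concrete (the module name is hypothetical):

# demo.py
print("runs at import time")        # top-level: executes on `import demo`

def delayed():
    print("runs only when called")  # indented: executes only on delayed()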

Understanding top-level code execution at import helped me solve the initial error and design a more robust pattern.

Next steps

The downside of using a class is that each time it’s instantiated, a new instance is created, with a new connection to the cloud storage. One way around this is the Singleton pattern, a full treatment of which is outside the scope of this article.
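
Just to gesture at the idea, one minimal sketch is to hide the instance behind a cached factory function (the function name is my own invention):

from functools import lru_cache

@lru_cache(maxsize=None)
def get_file_storage_handler() -> "FileStorageHandler":
    # The first call creates the handler; every later call returns the same one.
    return FileStorageHandler()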

Also, the code currently doesn’t handle exceptions that might arise during initialization (e.g., issues with credentials or network connectivity). Adding robust error handling mechanisms will make the code more resilient.

Speaking of robustness, I would be remiss not to point out that a properly abstracted initialization method should retrieve the bucket name from a configuration or .env file instead of leaving it hardcoded in the method itself.
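
Putting those last two points together, a drop-in sketch for FileStorageHandler’s _initialize_cloud_storage might look like this. GCS_BUCKET_NAME is a hypothetical variable name, and DefaultCredentialsError comes from the google-auth library that google-cloud-storage depends on:

import os
from google.auth.exceptions import DefaultCredentialsError
from google.cloud import storage

def _initialize_cloud_storage(self):
    """Initializes the storage client, failing fast with readable errors."""
    if "GOOGLE_APPLICATION_CREDENTIALS" not in os.environ:
        raise RuntimeError(
            "GOOGLE_APPLICATION_CREDENTIALS is not set; did load_dotenv() run?"
        )
    bucket_name = os.environ.get("GCS_BUCKET_NAME")  # hypothetical config key
    if bucket_name is None:
        raise RuntimeError("GCS_BUCKET_NAME is not set in the environment")
    try:
        storage_client = storage.Client()
    except DefaultCredentialsError as exc:
        raise RuntimeError("Could not authenticate to Google Cloud") from exc
    return storage_client.bucket(bucket_name)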
