Use Python to control storage in Google Cloud

Course Notes

OVERVIEW

Life is short.
You don't have time to do everything manually.
Automation helps.

In this lesson, we are going to automate parts of Google Cloud to save YOU time. In particular, we're going to build on our knowledge of logging from the previous lesson in order to set up buckets in Cloud Storage, to create subdirectories in buckets, and to programmatically delete buckets.

A bucket is a container used to store data.
In Google Cloud documentation, you might see a bucket referenced by a Uniform Resource Identifier (URI) such as gs://my-bucket.

LEARN THE BASICS OF USING CLOUD STORAGE

Complete the following exercises using the code that is supplied with the paid version of the course. These exercises focus on the following unit tests:

  • test_create_bucket: How to create a bucket on the cloud and then delete that storage container
  • test_create_path: How to add an empty subdirectory to a bucket
  • test_create: How to create a bucket with a Cloud Storage unique identifier (e.g., gs://my-bucket)

test_create_bucket

If you run programs that store data on the cloud, you want to make sure that your bucket exists and that your program can communicate with the cloud. These checks seem basic, but implementing them programmatically helps your programs avoid errors while they are running.

Let's debug a unit test to better understand how to automatically create buckets.

PREPARE TO WALK THROUGH THE CODE

  1. Set breakpoints in tests/test_gcp_storage.py in your code editor

    First, prepare to run the unit test by setting breakpoints at notable areas in the code.

    
    
      # Excerpt from tests/test_gcp_storage.py
                  
      class TestStorage(unittest.TestCase):
        def setUp(self):
    
          # Place breakpoint at the line below
          infer_credential_set()
          self.creds = confirm_credentials()
          self.project_name = self.creds.google.project_name
          self.bucket_name = self.creds.google.project_name + \
            "_whiteowl_test_bucket"
    
    
        def test_create_bucket(self):
    
          # Place breakpoint at the line below
          Storage().create_bucket(self.bucket_name, 
            self.creds.google.bucket_location)
          
          time.sleep(3)
          self.assertTrue(Storage().is_bucket(self.bucket_name))
    
          # Place breakpoint at the line below
          Storage().delete_bucket(self.bucket_name)
          time.sleep(3)
          self.assertFalse(Storage().is_bucket(self.bucket_name))
                    
  2. Set additional breakpoints

    Place breakpoints as indicated in the code comments below:

      # Code excerpts from feeds/util/gcp/storage.py
                    
      class Storage():
        class __OnlyOneStorage:
    
            def __init__(self):
    
                # Place breakpoint at the line below
                confirm_credentials()
                self.project = os.environ["GCLOUD_PROJECT"]
                self.storage_client = storage.Client()
                return
    
        instance = None
    
        def __init__(self):
            # Place breakpoint at the line below
            if not Storage.instance:
                Storage.instance = Storage.__OnlyOneStorage()
    
            self.storage_client = Storage.instance.storage_client
            self.project = Storage.instance.project
    
        def create_bucket(self,
                         bucket_name, 
                         bucket_location="us-central1",
                         storage_class="STANDARD"):
            try:
                # Place breakpoint at the line below
                client = self.storage_client
                ...
        
        def is_bucket(self, bucket_name):
          try:
              # Place breakpoint at the line below
              client = self.storage_client
              ...
                  

VISUALLY CONFIRM A "CLEAN SLATE"

  1. Go to https://console.cloud.google.com/storage. Confirm storage is empty.

    It is helpful to know what you currently have in Cloud Storage before you start adding items. For the new project that you just created, you should not see any buckets.

WALK THROUGH THE CODE

    Now that the preparation is complete, we're ready to walk through the code to solidify how storage automation works.

  1. Start the debugging process

    First, start the unit test. In PyCharm, right-click the green arrow next to test_create_bucket and select Debug.

  2. Confirm that you can advance from where you are in the debugging process to the next breakpoint.

    In PyCharm, this is done by pressing the F9 key. If you press F9 after reaching the first breakpoint in setUp, you will advance to the first line of test_create_bucket.

  3. Examine Storage initialization to understand code that uses only one instance of the google-cloud-storage Client.

    Press F9 again. If you have set up your breakpoints correctly, you will now be at the first line of Storage.__init__().

    
    # Code excerpt from feeds/util/gcp/storage.py
    
    class Storage():
      class __OnlyOneStorage:
    
      ...
    
      def __init__(self):
          if not Storage.instance:
              Storage.instance = Storage.__OnlyOneStorage()
          ... 
    
                    

    This Storage utility class is set up so that every caller shares a single connection to Cloud Storage rather than creating competing clients.

    • Storage is built on the Python Singleton design pattern, so there is only one in-memory instance of the client that connects the program to Cloud Storage.

    • The code confirms that there is “only one client” by setting up a connection only if this step has not already been performed.

  4. Create a bucket using the storage.bucket.Bucket class
    
      def create_bucket(self,
                        bucket_name,
                        bucket_location="us-central1",
                        storage_class="STANDARD"):
          try:
              client = self.storage_client

              bucket = storage.bucket.Bucket(client, bucket_name,
                                             self.project)

              bucket.location = bucket_location
              bucket.storage_class = storage_class
              bucket = client.create_bucket(bucket)

              assert isinstance(bucket, storage.bucket.Bucket)

          except Exception as ex:
              print("exception!\n{}".format(ex))
                    

    Pressing F9 again should take you to the start of the create_bucket function. Step through the code (repeatedly hitting F8 in PyCharm) in order to create the bucket.

    As you press F8, the code configures a bucket for the project, using the bucket_location argument or the default location of “us-central1.” After the bucket is configured, the Client is called to create the bucket on the cloud.

    Storage locations should be close to the consumers of the data. If you do not live near the us-central1 region, change this default to a region closer to you.
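Before calling the API, it can also help to sanity-check the bucket name locally. The helper below is our own illustration (is_valid_bucket_name is not part of the course code) and encodes the basic published Cloud Storage naming rules: 3-63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit:

```python
import re

# Hypothetical helper - checks a bucket name against the basic
# Cloud Storage naming rules before any API call is attempted.
_BUCKET_NAME_RE = re.compile(r"^[a-z0-9][a-z0-9_.\-]{1,61}[a-z0-9]$")

def is_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes the basic length/character rules."""
    return bool(_BUCKET_NAME_RE.match(name))

print(is_valid_bucket_name("my-project_whiteowl_test_bucket"))  # True
print(is_valid_bucket_name("Bad Name"))                         # False
print(is_valid_bucket_name("ab"))                               # False - too short
```

Failing fast on a bad name gives a clearer error than waiting for the service to reject the request.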

  5. View the created bucket using the Google Cloud Console

    After the bucket is created, the unit test will execute the sleep function shown below:

    
      # Code from tests/test_gcp_storage.py
    
      def test_create_bucket(self):
          """
          Test confirms that you can create a bucket
          on the cloud with the JSON key file.

          Results can be visually verified at
          https://console.cloud.google.com/storage.
          """

          # create bucket in default region (e.g. "us-central1")
          Storage().create_bucket(self.bucket_name,
                                  self.creds.google.bucket_location)

          time.sleep(3)
          self.assertTrue(Storage().is_bucket(self.bucket_name))
          Storage().delete_bucket(self.bucket_name)
          time.sleep(3)
          self.assertFalse(Storage().is_bucket(self.bucket_name))
                      

    Before examining is_bucket, go to https://console.cloud.google.com/storage again to confirm that your bucket exists.

    After the bucket is created, it will show up in the Google Cloud Console.
  6. Use Python to confirm that the bucket exists

    At this point, if you continue through the code (F9 in PyCharm), you will see the implementation of the is_bucket function that uses the google-cloud-storage client to determine if a bucket exists.

    
      # Code from feeds/util/gcp/storage.py
    
      def is_bucket(self, bucket_name):
        try:
            client = self.storage_client
            bucket = client.bucket(bucket_name)
            return bucket.exists()
        except Exception:
            return False
                    
  7. Run code to delete the bucket

    As we continue to step through the code, the last major piece of the unit test to examine is the function that deletes a bucket.

    It is important to have automation in place that removes resources that are no longer in use. This is critical for managing the fees that Google Cloud charges for its services.

    The code to delete the bucket looks like the following:

    
      # Code from feeds/util/gcp/storage.py
                    
      def delete_bucket(self, bucket_name):
        try:
            client = self.storage_client
            bucket = client.bucket(bucket_name)
            bucket.delete()
        except Exception as ex:
            print("exception!\n{}".format(ex))
        return
                    
  8. Visually confirm that the bucket no longer exists

    At this point, the code has completed. Go back into https://console.cloud.google.com/storage and confirm that the bucket has been deleted.

test_create_path

At some point, you will likely generate data that you want to store. When that happens, it helps to set up subdirectories so your data stays organized.

In this section, we are going to run a test that helps us to learn this concept of building a subdirectory programmatically.

PREPARE TO WALK THROUGH THE CODE

  1. Set breakpoints in a unit test so that you can move through code quickly

    Let's go ahead and set some breakpoints in test_create_path.

    
    
    # Code from tests/test_gcp_storage.py
    
    def test_create_path(self):
        """
        Test the creation of a path in a bucket on the cloud
        :return:
        """
        Storage().create_bucket(self.bucket_name)
        time.sleep(3)
    
        # Place breakpoint below to confirm bucket is created
        self.assertTrue(Storage().is_bucket(self.bucket_name))
    
      
        result = Storage().create_path(self.bucket_name,
                                   "sample/dir/structure/")
    
        # Place breakpoint below to confirm subdir is created
        self.assertTrue(result)
    
    
        # Bucket can only be deleted if it is empty
        result = Storage().delete_full_path_all_contents(
                    self.bucket_name, "sample/dir/structure/")
    
        # Place breakpoint below to confirm subdir is deleted
        self.assertTrue(result)
    
        Storage().delete_bucket(self.bucket_name)
        time.sleep(3)
        self.assertFalse(Storage().is_bucket(self.bucket_name))
                      
                    

    A couple of things are going on here:

    • The first breakpoint above confirms that there is a bucket that has been created on the cloud.
    • We will need to implement a create_path function that returns True if a subdirectory is created in a bucket and False if the creation fails. The associated breakpoint confirms that this function is working.
    • We need to create a delete_full_path_all_contents function because we can’t release the unused bucket unless it is empty. The third breakpoint confirms the successful deletion of the subdirectory.

    In this exercise, we really want to understand what is required to create and delete a subdirectory, so let's go ahead and place the breakpoints indicated in the comments below:

    
      # Code excerpt - feeds/util/gcp/storage.py
    
      def create_path(self, bucket_name, path):

          try:
              # Set breakpoint in the line below
              gcs_client = self.storage_client
              ...
              return True

          except Exception:
              return False
                    
    
      # Code from feeds/util/gcp/storage.py
    
    
      def delete_full_path_all_contents(self, 
                                        bucket_name,
                                        path):
          try:
            # Set breakpoint in the line below
            gcs_client = self.storage_client
            ...
            return True
    
          except Exception:
            return False
                    

WALK THROUGH THE CODE

In the previous exercise, we confirmed that we can create buckets and programmatically see that they exist. The first part of this test recreates that work.

  1. Create a subdirectory programmatically

    Google Cloud treats everything in a bucket as an object. This means that if we want to create a subdirectory, we really want to create an empty object.

    
      # Code from feeds/util/gcp/storage.py
    
      def create_path(self, bucket_name, path):
          """
          path: some/folder/name/ , MUST have the trailing slash
          Return: bool - True if successful and False otherwise.
          """
          try:
              gcs_client = self.storage_client
              bucket = gcs_client.get_bucket(bucket_name)
              blob = bucket.blob(path)

              blob.upload_from_string('', content_type=
                  'application/x-www-form-urlencoded;charset=UTF-8')

              return True

          except Exception:
              return False
                    

    In the above code, we use a bucket that we created at the beginning of the test to construct an object. The google-cloud-storage library refers to these objects as blobs, and as a result, we create a placeholder variable called “blob” using the following syntax:

    
      blob = bucket.blob(path)
                    

    Then, we upload an empty string (a “0 byte object”) to create a subdirectory in the cloud. If you pause the code at this point, go into https://console.cloud.google.com/storage, and “drill-down” into the bucket, you will see that the directory has been created.
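Because create_path's docstring requires the trailing slash, a small helper (our own illustration, not part of the course code) can normalize paths before they are passed in:

```python
def normalize_path(path: str) -> str:
    """Hypothetical helper: guarantee the trailing slash that
    create_path's docstring requires."""
    return path if path.endswith("/") else path + "/"

print(normalize_path("sample/dir/structure"))   # sample/dir/structure/
print(normalize_path("sample/dir/structure/"))  # sample/dir/structure/
```

Normalizing once at the call site avoids silently creating a 0-byte file object instead of a folder placeholder.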

  2. Remove a subdirectory programmatically

    Now, we are going to do the reverse, and remove the subdirectory. In short, you are going to use Python to remove ALL objects from a bucket before you delete the bucket.

    Because delete_full_path_all_contents REMOVES all objects from a bucket, you want to only use this function when you are doing final cleanup of resources.
    
      # Code from feeds/util/gcp/storage.py
    
      def delete_full_path_all_contents(self,
                                        bucket_name,
                                        path):
          """
          path: some/folder/name/  -  MUST have the trailing slash
          Return: bool - True if successful and False otherwise.
          """
          try:
              gcs_client = self.storage_client
              bucket = gcs_client.get_bucket(bucket_name)

              blob_name_iterator = gcs_client.list_blobs(
                  bucket, prefix=path)

              # The following code works with data at a small scale.
              # It pulls blob names into memory, so it is only suitable
              # when the full listing fits in memory.
              blob_name_list = list(blob_name_iterator)

              # delete_blobs requires a list; passing the iterator
              # directly would fail because it has no length
              bucket.delete_blobs(blob_name_list)

              return True

          except Exception:
              return False
                
                  

    In the code above, we get an iterator which can be used to find blobs in the bucket. For illustration purposes, all files in this path are then listed from the iterator. Finally, we delete any files that we found INCLUDING the 0 byte placeholder for the path.
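For larger listings, the names can be deleted in fixed-size batches instead of one big in-memory list. The chunked helper below is a plain-Python sketch (our own, not part of the course code); the commented lines show how it might be combined with the client, assuming list_blobs and delete_blobs behave as in the excerpt above:

```python
from itertools import islice

def chunked(iterable, size):
    """Yield lists of up to `size` items without materializing
    the whole iterable in memory."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# Hypothetical usage with the storage client from the excerpt above:
#   for batch in chunked(gcs_client.list_blobs(bucket, prefix=path), 100):
#       bucket.delete_blobs(batch)

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

Each batch is a real list (so it has a length), but only one batch is held in memory at a time.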

    If you continue now with the remainder of test_create_path, the test will complete successfully, and when you go into https://console.cloud.google.com/storage , you will see that the bucket has been deleted.

test_create

Finally, it is worth examining a test covering a helper function that can create the bucket and the subdirectory in one line of code.

PREPARE TO WALK THROUGH THE CODE

  1. Set breakpoints for easy code navigation

    Go ahead and set some breakpoints in the create and the delete helper functions.

    
      # Code Excerpt from feeds/util/gcp/storage.py
    
      def create(self, path):
          """
          This function creates the bucket and the full path specified.
          Path: A string such as gs://{bucket-name}/path/in/bucket/
          The path must have a / at the end.

          Return: bool - Return True if successful and False otherwise
          """
          try:
              # Set a breakpoint on the line below
              pattern = re.compile(r"(gs:\/\/)(?P<bucket>[a-zA-Z0-9_\- ]+?)(\/)(?P<folder>.*)")
              ...

              # Set a breakpoint on the line below
              return True

          except Exception:
              return False
                  
    
      # Code Excerpt from feeds/util/gcp/storage.py
    
      def delete(self, path):
          """
          This function deletes the bucket and the full path specified.
          Path: A URI (e.g. gs://{bucket-name}/path/in/bucket/ )
          Return: bool - Return True if successful and False otherwise
          """
          try:
              # Set the breakpoint on the line below
              pattern = re.compile(r"(gs:\/\/)(?P<bucket>[a-zA-Z0-9_\- ]+?)(\/)(?P<folder>.*)")
              ...
      
              # Set a breakpoint on the line below
              return True
      
          except Exception:
              return False
                  

    In both cases, we are simply setting breakpoints at the top of each function so that we can go through these functions “line by line.” We are also going to set a breakpoint at the end of the function so that we can pause the code and visually inspect results in the console.

WALK THROUGH THE CODE

Google Cloud typically uses Uniform Resource Identifiers (URIs) to uniquely identify resources. For Cloud Storage, these URIs start with a gs:// prefix.

The unit test that we're examining is going to create and delete buckets and subdirectories using this URI format.


  # Code from tests/test_gcp_storage.py

  def test_create(self):
  
      storage = Storage()

      # The bucket name that you use will be determined by the 
      # project name that you changed in the credentials file
  
      # STEP 1 – Use a URI to create a bucket
      result = storage.create(f"gs://{self.bucket_name}/")
      time.sleep(3)
      self.assertTrue(result)
  
      # STEP 2 – Use a URI to delete a bucket
      delete_result = storage.delete(f"gs://{self.bucket_name}/")
      time.sleep(3)
      self.assertTrue(delete_result)
  
      # STEP 3 – Use a URI to create a bucket and a subdirectory
      result = storage.create(f"gs://{self.bucket_name}/nested/file_structure/")
      time.sleep(3)
      self.assertTrue(result)
  
      # STEP 4 – Use a URI to delete a bucket that has an existing path
      
      # Use the following to clean up unused resources.
      # This will delete the bucket as well as the path
      delete_result = storage.delete(f"gs://{self.bucket_name}/nested/file_structure/")
      time.sleep(3)
      self.assertTrue(delete_result)
            
  1. Examine code to see how a Cloud Storage URI can be used to create a bucket with or without a subdirectory

    After starting to debug the unit test above, you will “step through” code that creates a bucket using just gs://{self.bucket_name}/ or that creates a bucket and a subdirectory using a string similar to gs://my-bucket/my/sub/directory/.

    
      def create(self, path):
          try:
              # Code block that uses regex
              pattern = re.compile(r"(gs:\/\/)(?P<bucket>[a-zA-Z0-9_\- ]+?)(\/)(?P<folder>.*)")
              m = pattern.search(path)
              bucket_name = m.group('bucket')
              folder = m.group('folder')

              # Now, use code that we have already constructed to determine
              # if the bucket already exists
              bucket_exists = self.is_bucket(bucket_name)

              # If the bucket does not exist, create it
              if not bucket_exists:
                  self.create_bucket(bucket_name)

              # If a folder was specified, create the corresponding path
              if len(folder) > 0:
                  self.create_path(bucket_name, folder)

              # Set a breakpoint on the line below
              # When you reach this breakpoint, visually inspect the console
              return True

          except Exception:
              return False
                
                  

    A couple of things to note about this create function:

    • The first code block uses regular expressions to identify a bucket and an optional subdirectory (which is called ‘folder’ in the above code).
    • At the time of this writing, Pythex is one website that can help you build a deeper understanding of regular expressions.
    • Once the regular expression “figures out” the name of the bucket and the optional name of the subdirectory, then we simply reuse code discussed earlier in this lesson to create resources within Google Cloud.
    • One of the breakpoints that we set is on the line that occurs right before you return out of the function.
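You can explore the same parsing outside the debugger with nothing but the re module. This sketch assumes the pattern uses the named groups bucket and folder, matching the code above:

```python
import re

pattern = re.compile(r"(gs:\/\/)(?P<bucket>[a-zA-Z0-9_\- ]+?)(\/)(?P<folder>.*)")

# URI with a subdirectory
m = pattern.search("gs://my-bucket/my/sub/directory/")
print(m.group("bucket"))        # my-bucket
print(m.group("folder"))        # my/sub/directory/

# URI with a bucket only
m = pattern.search("gs://my-bucket/")
print(m.group("bucket"))        # my-bucket
print(repr(m.group("folder")))  # '' - no subdirectory in the URI
```

The non-greedy bucket group stops at the first slash after the gs:// prefix, and the greedy folder group captures everything after it, including the empty string.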
  2. Examine code to see how to delete a bucket with or without a subdirectory
     
      def delete(self, path):
          """
          This function deletes the bucket and the full path specified.
          Path: A URI - gs://{bucket-name}/path/in/bucket/
          Return: bool - Return True if successful and False otherwise
          """
          try:
              pattern = re.compile(r"(gs:\/\/)(?P<bucket>[a-zA-Z0-9_\- ]+?)(\/)(?P<folder>.*)")
              m = pattern.search(path)
              bucket_name = m.group('bucket')
              folder = m.group('folder')

              if len(folder) > 0:
                  self.delete_full_path_all_contents(bucket_name, folder)

              bucket_exists = self.is_bucket(bucket_name)

              if bucket_exists:
                  self.delete_bucket(bucket_name)

              return True

          except Exception:
              return False
                              
                  

    A couple of things to note here:

    • This code is extremely similar to the create function in how the bucket and folder are identified from the URI.
    • If a subdirectory is detected as part of the URI, the folder is deleted from the bucket.
    • If the bucket identified in the URI exists on the cloud, then the bucket is also deleted.

CONGRATULATIONS

You should now have a better understanding of how to use Python to create buckets and subdirectories in Google Cloud.