Here at Internet2, we are fortunate to be working with a wonderful group of students from Notre Dame's Master of Science in Business Analytics program. The group is helping us gain insight from the detailed usage data we receive from the NET+ AWS and GCP programs. Our hope is that we can use that data to observe emerging patterns of cloud infrastructure use in higher education and research, and apply that knowledge to help the community support effective cloud use.

To provide analytic access to the data, which is kept in Google BigQuery tables, we wanted to give the students a Jupyter notebook environment where they would not need to download or store the data on their own laptops while working with us. This post documents how we are providing that environment using Managed Notebooks in GCP's Vertex AI Workbench.

We have set up a Google Group for the class project, containing the members of the class as well as the Internet2 staff working on the project with them. To allow the group to create notebooks, we added the Notebooks Admin role for the group within our GCP project (as described in https://cloud.google.com/vertex-ai/docs/workbench/user-managed/iam). Open question: would the Notebooks Runner role be adequate for our purposes?
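For reference, that grant comes down to a single IAM binding. Here is a minimal sketch using the gcloud CLI, with a made-up group address and project ID:

# Grant the Notebooks Admin role to the project group
# (group address and project ID are placeholders)
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="group:capstone-students@example.edu" \
  --role="roles/notebooks.admin"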

For our purposes, as we have only four students in our group, we used the GCP Console to create the notebooks manually. The process could be automated for larger or repeated use (see the gcloud sketch below, or one could use Google's RAD Lab Data Science repo).

The process for creating Managed Notebooks is documented here: https://cloud.google.com/vertex-ai/docs/workbench/managed/create-managed-notebooks-instance

At present, Managed Notebooks are only available for a single user, so we created an individual instance for each student, naming each notebook with the student's email identifier. Each notebook can be assigned a single owner at the bottom of the Advanced Settings screen; that is where we entered each student's email address.

To help manage costs, we reduced the size of the instances from the default n1-standard-4 to n1-standard-2, and shortened the idle timeout period from 180 minutes to 60 minutes.
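For anyone automating creation, as mentioned above, here is a rough sketch using the gcloud CLI's managed-runtime commands. The student IDs, email domain, and location are made-up placeholders, and since these commands are newer it is worth checking gcloud notebooks runtimes create --help for the exact flags:

# Create one single-user managed notebook runtime per student,
# named with the student's email identifier and sized down for cost
# (student IDs, domain, and location are hypothetical placeholders)
for STUDENT in jdoe jsmith; do
  gcloud notebooks runtimes create "nd-capstone-${STUDENT}" \
    --location="us-central1" \
    --runtime-access-type="SINGLE_USER" \
    --runtime-owner="${STUDENT}@nd.edu" \
    --machine-type="n1-standard-2"
done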

Creating notebooks manually in the console leaves each one as a running notebook process, viewable on the Vertex AI Workbench screen in the console. We then stop those processes, since we rely on the students to start them when they want to use a notebook.
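Stopping them can be scripted the same way; assuming the runtime names from the sketch above:

# Stop each runtime so it is not billed while idle;
# students start their own from the console when needed
for STUDENT in jdoe jsmith; do
  gcloud notebooks runtimes stop "nd-capstone-${STUDENT}" --location="us-central1"
done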

Giving the notebooks access to our BigQuery tables required assigning the BigQuery Read Session User role to our group. The group already had the BigQuery Data Viewer and BigQuery Job User roles assigned within our project.
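That is another one-line IAM binding, with the same placeholder group and project as above:

# Grant the BigQuery Read Session User role to the group
gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="group:capstone-students@example.edu" \
  --role="roles/bigquery.readSessionUser"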

The process for accessing BigQuery data from a Jupyter notebook is documented here: https://cloud.google.com/bigquery/docs/visualize-jupyter

Because we are using GCP Managed Notebooks, all the pieces needed to access BigQuery are pre-installed (as are the usual Python data science modules), so the notebooks are ready to go once started.

We anticipate very low costs for this service: Managed Notebooks are currently in Preview, and there is no management fee while in Preview. The instance cost for the n1-standard-2 machines is $0.10 per hour. Queries submitted to BigQuery can incur charges, but we anticipate that our usage will remain well within BigQuery's free tier.

Many thanks to Maddie Howe for helping to test and troubleshoot this process!

We sent out the following instructions to the students to let them know how to access their notebooks.

I’ve set you each up with a Jupyter environment in our GCP organization for work on the capstone project. 
To get to the environment, follow these instructions:
  1. Go to the Managed Notebooks page in the GCP console: https://console.cloud.google.com/vertex-ai/workbench/list/managed
  2. You should see a notebook named with your email id – e.g. nd-capstone-jdoe
  3. Check the box next to your notebook name, then click the Start icon on the Workbench menu line at the top of the page (if you don’t see the Start icon, click on the three dots there to find it). It takes 5-10 minutes to spin up the instance.
  4. Once your instance is running, click Open JupyterLab and you’ll get a new tab with JupyterLab – that can also take a few minutes.
  5. You can then start a new notebook.
  6. You should be able to access our BigQuery tables as documented here: https://cloud.google.com/bigquery/docs/visualize-jupyter
A sample query to test:

%%bigquery testdf
SELECT DISTINCT Product_Name FROM `projectname.datasetname.tablename`
ORDER BY Product_Name

That will put the result of the query in a pandas dataframe called testdf. To verify:
print(testdf)

A few notes:
- When you’re done using Jupyter, please go back into the console and stop your instance.
- The instances time out after 60 minutes of no use, so it’s not the end of the world if you don’t stop it, but it’s a good practice to get into.
- The instances are not huge – 2 CPU, 7.5 GB of RAM, no GPU, 100 GB of disk. If you need more power, please let me know.

Update: March 9, 2022

Aaron Gussman from Google sent along an example of using the notebooks API to create a managed notebook instance, which doesn't appear to be in Google's documentation anywhere yet.

Here is the API example to create a Managed Notebooks runtime with Idle Shutdown settings:

BASE_ADDRESS="notebooks.googleapis.com"
LOCATION="us-central1"
PROJECT_ID="YOUR_PROJECT_ID"
AUTH_TOKEN=$(gcloud auth application-default print-access-token)
RUNTIME_ID="my-runtime"
OWNER_EMAIL="YOUR_EMAIL"

# The request body must be valid JSON, so the keys and values
# use (escaped) double quotes; ${OWNER_EMAIL} expands in place
RUNTIME_BODY="{
  \"access_config\": {
    \"access_type\": \"SINGLE_USER\",
    \"runtime_owner\": \"${OWNER_EMAIL}\"
  },
  \"software_config\": {
    \"idle_shutdown\": true,
    \"idle_shutdown_timeout\": 180
  }
}"

curl -X POST "https://${BASE_ADDRESS}/v1/projects/${PROJECT_ID}/locations/${LOCATION}/runtimes?runtime_id=${RUNTIME_ID}" \
  -d "${RUNTIME_BODY}" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${AUTH_TOKEN}" -v