GCP Project Deployment

The purpose of the GCP Deployment Request is to give our client's Data Science team access to Boxalino datasets, so that they can run Jupyter notebook processes in the designated Anaconda environments.

Environment Details

On a deployed application, the following stack is available:

Python 3.7
git
Anaconda3
pip / pip3
setuptools
papermill
jupyter
google-api-python-client & google SDK libraries

Integration steps

Make a GCP Project Deployment Request with the Required Information. Boxalino will provide the project shortly afterwards.
Your user email (as the requestor) will be given the editor role.
Prepare the Required Files (project structure) and upload them to a GCS bucket in the project.
Provide Boxalino with the information for the Application Launch.
The application is launched in a VM in the project. The commands from commands.txt are executed. Additionally, you can SSH into the VM and update/check content.

Further tools will be provided which you can use to update the code running in the VM.

Integration Access

Because the application is launched in the scope of the project, the following Google Cloud tools can be used:

the Compute Engine - launch more applications
the project's BigQuery dataset (to store results, if required)
the Google Cloud Storage (GCS) to load files & store logs
the Cloud Scheduler to create events (for automatic runs)

Required Information

When contacting Boxalino with a GCP project deployment request, please provide the following information:

1	project name	as it will appear in your project list; restrictions: space, - and _ allowed.
2	email	the requestor is the one managing the applications running on the project; this email will receive messages (alerts and notifications) when the project is ready to be used; the email for the VM / application run alerts is set in the instance.txt file, specific to every application launch
3	client name	(also known as the Boxalino account name) this ensures access to the views, core & reports datasets
4	labels	optional; the labels are used as project meta-information. see Labels
5	permissions	optional; by default, the requestor will have full access and can further share with others. see Permissions

Once the project is created (2-3 min), the requestor will have access to it, in their Google Cloud Console.

As an editor on the project, the requestor will be able to:

access the project from Google Console https://console.cloud.google.com/
add new permissions from the IAM Admin panel
create GCS buckets

Labels

Labels are key-value pairs meant to better organize the projects.

Let's imagine a scenario in which the client has multiple Data Scientists working on different projects. Labeling the projects makes them easier to organize and find.

team:data-science
component:<main-application-name>
environment:production

More information on labels: https://cloud.google.com/resource-manager/docs/creating-managing-labels
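
The label list can be checked before it is sent with the request. The following is a minimal sketch, assuming a simplified, ASCII-only reading of the Google Cloud label rules (lowercase letters, digits, underscores and dashes; at most 63 characters; keys start with a lowercase letter). The function name `parse_labels` and the application name are hypothetical and not part of any Boxalino tooling.

```python
import re

# Simplified (ASCII-only) version of the Google Cloud label rules.
KEY_RE = re.compile(r"^[a-z][a-z0-9_-]{0,62}$")
VALUE_RE = re.compile(r"^[a-z0-9_-]{0,63}$")


def parse_labels(lines):
    """Turn 'key:value' lines into a dict, rejecting invalid labels."""
    labels = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if not sep or not KEY_RE.match(key) or not VALUE_RE.match(value):
            raise ValueError(f"invalid label: {line!r}")
        labels[key] = value
    return labels


labels = parse_labels([
    "team:data-science",
    "component:my-application",  # hypothetical application name
    "environment:production",
])
print(labels)
```

A label such as `Team:Data Science` is rejected by this check because of the uppercase letters and the space.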

Permissions

The permissions are added when the project is created.

By default, the requestor's email has the project editor role.
Once the project is released, the requestor can add more emails / users to the IAM policies of the project.

Here is a sample of provided permissions (optional):

user:dana@boxalino.com:roles/editor
user:dana@boxalino.com:roles/resourcemanager.projectIamAdmin
user:dana@boxalino.com:roles/compute.osLogin
user:dana@boxalino.com:roles/compute.osAdminLogin
user:dana@boxalino.com:roles/bigquery.dataOwner
serviceAccount:service-account-from-other-projects:roles/iam.serviceAccountUser
serviceAccount:service-account-from-other-projects:roles/bigquery.dataOwner
serviceAccount:service-account-from-other-projects:roles/bigquery.dataEditor

More information on permissions: https://cloud.google.com/iam/docs/understanding-roles
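
Each permission line has the shape `<member type>:<identity>:<role>`. As an illustration only, the sketch below groups such lines by role, mirroring the `bindings` list of an IAM policy. How Boxalino actually processes the lines is not documented here, so the function name `to_bindings` and the grouping are assumptions.

```python
from collections import defaultdict


def to_bindings(lines):
    """Group '<type>:<identity>:<role>' lines into IAM-policy-style bindings."""
    members_by_role = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The role is everything after the last colon; the member keeps its
        # own 'user:' / 'serviceAccount:' prefix.
        member, sep, role = line.rpartition(":")
        if not sep or not role.startswith("roles/") or ":" not in member:
            raise ValueError(f"unexpected permission line: {line!r}")
        members_by_role[role].append(member)
    return [{"role": r, "members": m} for r, m in sorted(members_by_role.items())]


bindings = to_bindings([
    "user:dana@boxalino.com:roles/editor",
    "user:dana@boxalino.com:roles/bigquery.dataOwner",
    "serviceAccount:service-account-from-other-projects:roles/bigquery.dataOwner",
])
for binding in bindings:
    print(binding)
```

Splitting on the last colon keeps the member prefix intact, which is why `rpartition` is used rather than a plain `split`.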

Required Files

1	instance.txt	properties for the VM (name, size, root path, etc.) (see instance.txt)
2	requirements.txt	environment requirements (for pip/anaconda install) (see requirements.txt); `pip freeze > requirements.txt` creates the file from a tested environment
3	commands.txt	a list of commands to be executed as part of your application run process (see commands.txt); can be left empty (if you choose to SSH into the VM and run your own processes from the project's scope)
4	env.yml	(optional) anaconda environment file; if no file is provided, the environment is not created
5	your jupyter/python/application files	the content of your application (in python, jupyter notebooks, etc)

The #1-#4 files must be named as instructed. They are used as a base for the application start-up.

instance.txt, requirements.txt and commands.txt must end with an empty line (they are parsed dynamically).
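
The trailing-line requirement is easy to miss, so it can be checked before upload. The sketch below assumes that "ends with an empty line" means the file's last character is a newline; if the parser expects a fully blank final line, the check would need adjusting. The helper names are hypothetical.

```python
from pathlib import Path

PARSED_FILES = ("instance.txt", "requirements.txt", "commands.txt")


def ensure_trailing_newline(path):
    """Append a newline to the file if its last character is not one.

    Returns True when the file was modified.
    """
    path = Path(path)
    text = path.read_text(encoding="utf-8")
    if text.endswith("\n"):
        return False
    path.write_text(text + "\n", encoding="utf-8")
    return True


def fix_project_files(project_dir):
    """Check every parsed file that exists in project_dir; return those fixed."""
    fixed = []
    for name in PARSED_FILES:
        candidate = Path(project_dir) / name
        if candidate.is_file() and ensure_trailing_newline(candidate):
            fixed.append(name)
    return fixed
```

Calling `fix_project_files("my-application")` on a local project directory (a hypothetical name) returns the names of the files that were changed.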

These files (and other required structures) must be uploaded to a GCS bucket.

The GCS bucket name is provided in the Application Launch request.
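
The upload can be done from the Google Cloud Console, or scripted. Below is a minimal sketch using the `google-cloud-storage` client library, assuming that the files should keep their relative paths inside the bucket and that local credentials with write access are configured. The directory and bucket names in the commented call are hypothetical.

```python
from pathlib import Path


def plan_upload(project_dir):
    """Return (local_path, object_name) pairs for every file under project_dir."""
    root = Path(project_dir)
    return [
        (path, path.relative_to(root).as_posix())
        for path in sorted(root.rglob("*"))
        if path.is_file()
    ]


def upload_project(project_dir, bucket_name):
    """Upload the project structure to the bucket, keeping relative paths."""
    # Imported here so plan_upload can be used without the library installed.
    from google.cloud import storage

    bucket = storage.Client().bucket(bucket_name)
    for local_path, object_name in plan_upload(project_dir):
        bucket.blob(object_name).upload_from_filename(str(local_path))


# Example (requires credentials with write access to the bucket):
# upload_project("my-application", "project-name-my-application")
```

`plan_upload` is kept separate so the list of objects can be reviewed before anything is written to the bucket.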

instance.txt

	Property	Default	Required	Description
1	instance-name	project name	yes	the instance name is the VM name as appears in the Compute Engine view
2	machine-type	e2-micro	yes	the value depends on what the application needs: more CPU or more RAM? for options, please check the Google Cloud documentation
3	email-to		yes	the email is used once to receive an alert for when the VM is ready.
4	home	\/home\/project-name	no	the path on the server where the content of the GCS bucket is uploaded; it is also used by the commands from the commands.txt file to launch/trigger your application. alternatives: \/home\/<your-gcs-bucket> , \/srv\/app. note: when you SSH into the machine (ex: your email is data-science-guru@boxalino-client.com), the VM creates a directory /home/data-science-guru (this is the default for any server), so this is your local path.
5	image-family	ubuntu-2004-lts	no
6	boot-disk-size	30	no
7	zone	europe-west1-b	no

instance-name:application-name
machine-type:e2-micro
email-to:data-science-guru@boxalino-client.com
home:\/home\/project-name
image-family:ubuntu-2004-lts
boot-disk-size:30
zone:europe-west1-b
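
To make the `property:value` format concrete, here is a minimal sketch of a local pre-check that reads such a file, applies the documented defaults, and reports missing required properties. This is not Boxalino's parser; the function name `parse_instance` and the choice of which properties to treat as mandatory (those marked "yes" in the table) are assumptions.

```python
DEFAULTS = {
    "image-family": "ubuntu-2004-lts",
    "boot-disk-size": "30",
    "zone": "europe-west1-b",
}
REQUIRED = ("instance-name", "machine-type", "email-to")


def parse_instance(text):
    """Parse 'property:value' lines, apply defaults, and check required keys."""
    props = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"expected 'property:value', got {line!r}")
        props[key.strip()] = value.strip()
    missing = [key for key in REQUIRED if not props.get(key)]
    if missing:
        raise ValueError(f"missing required properties: {missing}")
    return props


sample = (
    "instance-name:application-name\n"
    "machine-type:e2-micro\n"
    "email-to:data-science-guru@boxalino-client.com\n"
)
props = parse_instance(sample)
print(props["zone"])  # falls back to the documented default
```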

requirements.txt

This is a typical requirements.txt file for a jupyter/python application.

If your setup has been tested locally, you can save all the packages in the file with the command `pip freeze > requirements.txt`.
Keep in mind that, in this case, the requirements.txt file will list all packages that have been installed in the virtual environment, regardless of where they came from.

The listed packages are installed in the conda environment.

google-api-core==1.20.0
google-api-python-client==1.9.3
google-auth==1.17.2
google-auth-httplib2==0.0.3
[..]
google-cloud-core==1.3.0
google-cloud-storage==1.29.0
google-pasta==0.2.0

commands.txt

You can include the commands to run the jupyter process on application launch.

You can also include any other commands, in the same way the application was run when it was tested in the local environment.

chmod -R 777 <home value from instance.txt>/*
papermill <home value from instance.txt>/process.ipynb <home value from instance.txt>/process-output.ipynb
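
The same notebook run can also be triggered from Python through papermill's `execute_notebook` function, which is convenient when you SSH into the VM and want to pass parameters. This is a sketch under assumptions: the helper names are hypothetical, and the home path is written in its plain (unescaped) form.

```python
from pathlib import PurePosixPath


def notebook_paths(home, notebook="process.ipynb"):
    """Return (input, output) notebook paths under the home directory."""
    source = PurePosixPath(home) / notebook
    output = source.with_name(f"{source.stem}-output{source.suffix}")
    return str(source), str(output)


def run_notebook(home, notebook="process.ipynb", parameters=None):
    """Execute the notebook with papermill and write the executed copy."""
    # Imported lazily so the path helper works without papermill installed.
    import papermill as pm

    source, output = notebook_paths(home, notebook)
    pm.execute_notebook(source, output, parameters=parameters or {})
    return output


print(notebook_paths("/home/project-name"))
```

On the VM, `run_notebook("/home/project-name")` would execute process.ipynb and write process-output.ipynb next to it.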

env.yml

The env.yml file is used in the setup step in order to create the anaconda environment.

name: gcp-application-name
channels:
  - defaults
dependencies:
  - ca-certificates=2020.1.1=0
  - <a list of dependencies>
  - pip:
      - google-api-core==1.22.2
      - google-api-python-client==1.9.3
      - google-auth==1.17.2
      - <more-libraries required for the application>
prefix: /opt/conda/envs/gcp-application-env

Application Launch

Once the Required Files are uploaded to a GCS bucket in the project, contact Boxalino.
Provide the following information:

1	project ID	the project ID is unique; it is displayed on the dashboard of your project https://console.cloud.google.com/home/dashboard
2	GCS bucket name	the bucket name where the Required Files are located (ex: gs://project-name); the contents will be made available on the application as well. the bucket name must be unique; for this purpose, we recommend that every bucket name starts with your project name.
3	launch date	optional; projects can be scheduled for launch at a later date (tomorrow, etc)

BigQuery access

As a data scientist, chances are that you have been provided with a Service Account (SA) to access the client's private projects.
The application is run by the project's own Compute Engine Service Account (CE SA).
Because the project is in the scope of Boxalino, it will have direct read access to the client's datasets.

In order for the Application Launch to be able to access the client's own project dataset,
the CE SA will have to "impersonate" the SA in the client's project.

Go to the IAM & Admin menu of the client's private project.
Navigate to Service Accounts.
Locate the service account that was given access for the scope of the application deployment.
In the top-right corner, click on “Show Info Panel”.
Add the CE SA and grant it the “Service Account User” role.

The output of the application can be stored directly in the scope of the deployed GCP project (BigQuery, GCS, etc).