The purpose of this document is to allow our client's Data Science team to prepare and deploy their models in a Google Cloud environment.

Steps

  1. Project Deploy.

    1. Make a GCP Project Deployment Request with the Required Information.

    2. A GCP project will be provided to the requestor.

  2. Billing Information

    1. Set the billing account on the new project.

    2. This is required in order to be able to use the GCP resources.

  3. Application Content

    1. Prepare the Required Files (application structure)

    2. Load the content in a GCS bucket from the project.

  4. Application Launch

Note

Your user email (as the requestor) will be given the Editor, Owner and Project Billing Manager roles.

Share access with other people who need to work on the project.

Tip

The application is launched in a Virtual Machine (VM) in the project, and the commands from commands.txt are executed. Additionally, you can SSH into the VM to check or update the content.

...

You can follow the provided practices on:

  1. how to check out the application state

  2. how to update the application content

  3. how to start/stop/delete an application

We, at Boxalino, will extend the services available for the Data Science needs (schedulers, instance management, etc). Make sure to review the documentation.

Integration Access

Because the application is launched in the scope of the project, the following Google Cloud tools can be used:

  1. the Compute Engine - launch more applications https://console.cloud.google.com/compute/instances

  2. the project's BigQuery dataset (to store results, if required) https://console.cloud.google.com/bigquery

  3. the Google Cloud Storage (GCS) to load files & store logs https://console.cloud.google.com/storage/browser

  4. the Cloud Scheduler to create events (for automatic runs) https://console.cloud.google.com/cloudscheduler/start

BigQuery Datasets Access

Additionally, the project's Compute Engine Default Service Account will have the following permissions on the client's datasets in the Boxalino scope:

  1. BigQuery Data Editor : <client>_lab, <client>_views

  2. BigQuery Data Viewer : <client>_core, <client>_stage, <client>_reports, <client>_intelligence
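The role-to-dataset mapping above can be sketched as a small lookup, for example to check which datasets an application may write to. This is only an illustration: the roles and dataset suffixes come from the list above, while the function name `dataset_access` and the client name used in the test are hypothetical.

```python
from typing import Dict

# Roles and dataset suffixes as listed above; the names of this constant and
# of the function below are illustrative assumptions.
DATASET_ROLES = {
    "BigQuery Data Editor": ("lab", "views"),
    "BigQuery Data Viewer": ("core", "stage", "reports", "intelligence"),
}


def dataset_access(client: str) -> Dict[str, str]:
    """Map each <client>_<suffix> dataset to the role granted on it."""
    return {
        f"{client}_{suffix}": role
        for role, suffixes in DATASET_ROLES.items()
        for suffix in suffixes
    }
```

Only the datasets mapped to BigQuery Data Editor are writable by the application; the others are read-only.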

If your application is meant to edit the subscriber properties, get familiar with the structure of the subscriber properties https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/303825148/subscriber+properties

1. Project Deploy

Required Information

Create a project in the Boxalino ecosystem by completing the form https://gcp-deploy-du3do2ydza-ew.a.run.app/project

In order to create a GCP Project, in which the application will be run, the following information is required:

  1. project name: as it will appear in your project list. Naming requirements: space, - and _ are allowed.

  2. email: the requestor is the one managing the applications running on the project. This email will receive messages (alerts and notifications) when the project is ready to be used. Note: the email alerts for the VM / application run are set in the instance.txt file, specific to every application launch.

  3. client name: also known as the Boxalino account name; this ensures access to the views, core & reports datasets (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/303792129#BigQuery-Datasets-Access).

  4. labels: optional; the labels are used as project meta-information. See Labels.

  5. permissions: optional; by default, the requestor will have full access and can further share with others. See Permissions.

Once the project is created (2-3 min), the requestor will have access to it in their Google Cloud Console.

Tip

As an editor on the project, the requestor will be able to:

...

Labels (optional)

Labels are key-value pairs meant to better organize the projects.

Imagine a scenario in which the client has multiple Data Scientists working on different projects. Labeling the projects keeps them organized.

Code Block
team:data-science
component:<main-application-name>
environment:production

More information on labels: https://cloud.google.com/resource-manager/docs/creating-managing-labels
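Before submitting labels, it can help to check them locally. The sketch below parses "key:value" lines like the ones above and applies a simplified, ASCII-only reading of the Google Cloud label rules (keys start with a lowercase letter; keys and values use lowercase letters, digits, _ and -, up to 63 characters). The function name `parse_labels` is hypothetical and not part of the deployment form.

```python
import re
from typing import Dict, List

# Simplified, ASCII-only version of the Google Cloud label rules (assumption):
# keys start with a lowercase letter; keys and values use lowercase letters,
# digits, "_" and "-", and are at most 63 characters long.
_KEY_RE = re.compile(r"^[a-z][a-z0-9_-]{0,62}$")
_VALUE_RE = re.compile(r"^[a-z0-9_-]{0,63}$")


def parse_labels(lines: List[str]) -> Dict[str, str]:
    """Parse "key:value" lines and reject labels that break the rules above."""
    labels = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in label {line!r}")
        if not _KEY_RE.match(key) or not _VALUE_RE.match(value):
            raise ValueError(f"invalid label {line!r}")
        labels[key] = value
    return labels
```

A label such as Team:Data Science would be rejected here because of the uppercase letters and the space.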

Permissions (optional)

The permissions are added when the project is created.

  • By default, the requestor's email has the project editor role

  • Once the project is released, the requestor can add more emails / users to the IAM policies of the project.

Here is a sample of provided permissions (optional):

Code Block
user:dana@boxalino.com:roles/editor
user:dana@boxalino.com:roles/resourcemanager.projectIamAdmin
user:dana@boxalino.com:roles/compute.osLogin
user:dana@boxalino.com:roles/compute.osAdminLogin
user:dana@boxalino.com:roles/bigquery.dataOwner
serviceAccount:service-account-from-other-projects:roles/iam.serviceAccountUser
serviceAccount:service-account-from-other-projects:roles/bigquery.dataOwner
serviceAccount:service-account-from-other-projects:roles/bigquery.dataEditor

More information on permissions: https://cloud.google.com/iam/docs/understanding-roles
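To see how such permission lines relate to an IAM policy, the sketch below groups them into role/members bindings, the shape used by Google Cloud IAM policies. It assumes every line has the form <member-type>:<identifier>:<role>, as in the sample above; the function name `to_bindings` is hypothetical and says nothing about how the deployment form processes its input.

```python
from collections import defaultdict
from typing import Dict, List


def to_bindings(lines: List[str]) -> List[Dict[str, object]]:
    """Group "<member>:<role>" lines into IAM-policy-style bindings."""
    members_by_role = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # The role is everything after the last ":"; the member keeps its
        # own "user:" or "serviceAccount:" prefix.
        member, sep, role = line.rpartition(":")
        if not sep or ":" not in member or not role.startswith("roles/"):
            raise ValueError(f"unexpected permission line: {line!r}")
        if member not in members_by_role[role]:
            members_by_role[role].append(member)
    # Sort by role name so the output is stable.
    return [
        {"role": role, "members": members}
        for role, members in sorted(members_by_role.items())
    ]
```

Grouping by role makes it easy to review who ends up with broad roles such as roles/editor before sending the request.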


2. Billing Information

To use Google Cloud resources, a billing account must be set on the project.

To set it:

  1. Go to the Billing menu in the GCP console, or open the billing projects list https://console.cloud.google.com/billing/projects

  2. Identify the project and click on the 3 dots. Select “Change Billing”.

  3. In the window that appears, select the billing account to which the costs of the application will be billed.

...

If you do not have access to a billing account, grant the Project Billing Manager role to someone who does. Use the IAM menu for this https://console.cloud.google.com/iam-admin/iam

3. Application Content

Info

In order to launch the application, the source files must be loaded in a Google Cloud Storage (GCS) bucket https://console.cloud.google.com/storage/browser

Note

The GCS bucket must have a unique name. For this reason, we recommend that every bucket name starts with your project name.
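One way to follow this recommendation is to derive the bucket name from the project name and the application name. The sketch below does that and checks the result against a subset of the Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, - and _; starting and ending with a letter or digit; no "goog" prefix and no "google" in the name). The function name `bucket_name` and the project/application names in the test are hypothetical.

```python
import re

# Subset of the Cloud Storage bucket naming rules, without dots (assumption:
# dotted names have extra requirements and are out of scope here).
_BUCKET_RE = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def bucket_name(project_name: str, app_name: str) -> str:
    """Build "<project-name>-<app-name>" and check it against the rules above."""
    raw = f"{project_name}-{app_name}".lower()
    # Project names may contain spaces, which bucket names do not allow.
    name = re.sub(r"\s+", "-", raw.strip())
    if not _BUCKET_RE.match(name):
        raise ValueError(f"{name!r} is not a valid bucket name")
    if name.startswith("goog") or "google" in name:
        raise ValueError(f"{name!r} uses a reserved 'goog' prefix or 'google'")
    return name
```

Passing the local check does not guarantee that the name is still free; uniqueness is only confirmed when the bucket is created.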

Required Files

  1. instance.txt: properties for the Virtual Machine (VM) (name, size, home, etc.) (see instance.txt)

  2. requirements.txt: environment requirements (for pip/anaconda install) (see requirements.txt). The command pip freeze > requirements.txt creates the file from a tested environment.

  3. commands.txt: a list of commands to be executed as part of your application run process (see commands.txt). It can be left empty if you choose to SSH into the VM and run your own processes from the project's scope.

  4. env.yml: (optional) anaconda environment file; if no file is provided, the environment is not created.

  5. your jupyter/python/application files: the content of your application (in python, jupyter notebooks, etc.)

Note

The #1-#4 files must be named as instructed. They are used as a base for the application start-up.

...

Info

These files (and other required structures) must be uploaded in a GCS bucket.

The GCS bucket name is provided in the Application Launch request.
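Before uploading, a quick local check can confirm that the fixed-name files are present. This is a minimal sketch that assumes the application files sit in one local directory; the function name `missing_required_files` is hypothetical, and env.yml is left out of the check because it is listed as optional.

```python
from pathlib import Path
from typing import List

# File names come from the Required Files list; env.yml is optional there.
REQUIRED_FILES = ("instance.txt", "requirements.txt", "commands.txt")


def missing_required_files(app_dir: str) -> List[str]:
    """Return the required file names that are absent from app_dir."""
    root = Path(app_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory is ready to be uploaded to the bucket; commands.txt must exist even if it is empty.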

instance.txt

| # | Property | Default | Required | Description |
|---|----------|---------|----------|-------------|
| 1 | instance-name | project name | yes | the instance name is the VM name as it appears in the Compute Engine view |
| 2 | machine-type | e2-micro | yes | the value depends on what the application needs (more CPU or more RAM); for options, please check the Google Cloud documentation |
| 3 | email-to | | yes | the email is used once, to receive an alert when the VM is ready |
| 4 | home | \/home\/project-name | no | the path on the server where the content of the GCS bucket is uploaded; it is also used by the commands from the commands.txt file to launch/trigger your application execution. Alternatives: \/home\/<your-gcs-bucket> , \/srv\/app. When you SSH into the machine (ex: your email is data-science-guru@boxalino-client.com), the VM creates a directory /home/data-science-guru (this is the default on any server), so this is your local path |
| 5 | image-family | ubuntu-2004-lts | no | |
| 6 | boot-disk-size | 30 | no | |
| 7 | zone | europe-west1-b | no | this property can be left empty. Note: use a zone which is in Europe |
| 8 | accelerator-type | | no | define this property if your VM requires GPU; https://cloud.google.com/sdk/gcloud/reference/compute/accelerator-types |
| 9 | accelerator-count | | no | define this property if your VM requires GPU |

Code Block
instance-name:application-name
machine-type:e2-micro
email-to:data-science-guru@boxalino-client.com
home:\/home\/project-name
image-family:ubuntu-2004-lts
boot-disk-size:30
zone:europe-west1-b
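As an illustration of how the properties and defaults fit together, the sketch below parses an instance.txt in the "property:value" form shown above, fills in the documented defaults and reports missing required properties. It is not the launcher's actual parser: the function name `parse_instance` is hypothetical, and values (including the escaped slashes in home) are kept exactly as written.

```python
from typing import Dict

# Defaults and required properties as listed in the instance.txt table.
# instance-name and home default to project-specific values, so they are
# not filled in here.
DEFAULTS = {
    "machine-type": "e2-micro",
    "image-family": "ubuntu-2004-lts",
    "boot-disk-size": "30",
    "zone": "europe-west1-b",
}
REQUIRED = ("instance-name", "machine-type", "email-to")


def parse_instance(text: str) -> Dict[str, str]:
    """Parse "property:value" lines; values are kept exactly as written."""
    props = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the first ":" only, so values may contain ":" themselves.
        key, sep, value = line.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in line {line!r}")
        if value.strip():  # an empty value keeps the default, if there is one
            props[key.strip()] = value.strip()
    missing = [name for name in REQUIRED if not props.get(name)]
    if missing:
        raise ValueError(f"missing required properties: {', '.join(missing)}")
    return props
```

With the sample file above, the result contains the seven listed properties; leaving zone empty falls back to europe-west1-b.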

requirements.txt

This is a typical requirements.txt file for a jupyter/python application.

...

Code Block
google-api-core==1.20.0
google-api-python-client==1.9.3
google-auth==1.17.2
google-auth-httplib2==0.0.3
[..]
google-cloud-core==1.3.0
google-cloud-storage==1.29.0
google-pasta==0.2.0
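Since the file is meant to reproduce a tested environment, it can be useful to spot entries that are not pinned to an exact version. The sketch below is a simple line-based check, not a full requirements parser; the function name `unpinned_requirements` is hypothetical.

```python
from typing import List


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned with "==". """
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        name, sep, version = line.partition("==")
        if not sep or not name.strip() or not version.strip():
            unpinned.append(line)
    return unpinned
```

A file produced with pip freeze from a plain pip environment normally comes back with an empty list.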

commands.txt

You can include the commands to run the jupyter process on application launch.

...

Code Block
chmod -R 777 <home value from instance.txt>/*
python3 <home value from instance.txt>/my-python-application.py
papermill <home value from instance.txt>/process.ipynb <home value from instance.txt>/process-output.ipynb

env.yml

The env.yml file is used in the setup step in order to create the anaconda environment.

Code Block
name: gcp-application-name
channels:
  - defaults
dependencies:
  - ca-certificates=2020.1.1=0
  - <a list of dependencies>
  - pip:
      - google-api-core==1.22.2
      - google-api-python-client==1.9.3
      - google-auth==1.17.2
      - <more-libraries required for the application>
prefix: /opt/conda/envs/gcp-application-env

4. Application Launch

Note

Before launching the application, make sure that the Required Files are uploaded in a GCS bucket.

...


Provide the following information:

  1. project (Project ID): the project ID is unique; it is displayed on the dashboard of your project https://console.cloud.google.com/home/dashboard

  2. GCS bucket name: the bucket where the Required Files are located (ex: gs://<project-name>-<app-name>); the contents will be made available on the application as well.

     Note: the bucket must be located in EUROPE; use either EU (multi-region) or europe-west1 (single region).

     Note: the bucket name must be unique; for this purpose, we recommend that every bucket name starts with your project name.

  3. access code: as provided by Boxalino

Tip

Once the application has been launched, a script will initialize the environment and load all your content from the GCS bucket.

Environment Details

On a deployed application, the following stack is available:

  1. Python 3.7

  2. git

  3. Anaconda3

  4. pip / pip3

  5. setuptools

  6. papermill

  7. jupyter

  8. google-api-python-client & google SDK libraries

Checking out the application state

1. SSH into the application VM

Go to your project's Compute Engine page: https://console.cloud.google.com/compute/instances

...

Code Block
tail -f -n 2000 /var/log/syslog

2. Monitoring tools from GCP

Go to your project Compute Engine page (https://console.cloud.google.com/compute/instances) and click on the running application instance. Switch to the Monitoring tab.

In the MONITORING view of your Application, you are able to track the resources available & consumed:

  1. CPU utilization

  2. Memory Utilization

  3. Disk Space Utilization

...

3. GCP Logs Explorer

Google Cloud Platform provides a series of tools for monitoring the project resources.

...

Go to your Logs Explorer https://console.cloud.google.com/logs/query? and review the application logs.

...

Updating the application

Restart the application

Upon application restart (stop and start), the GCS bucket content is re-synchronized:

  1. application files are synchronized with the content of the GCS bucket (from instance.txt)

  2. requirements.txt is updated

  3. commands.txt is run

Note

Restarting the application will not update your anaconda environment.

Note

Restarting the application will not update the instance properties (the ones defined in instance.txt) - unless you manually edit them.

Re-synchronize the bucket

If you do not want to delete & re-run the application launch, you can re-synchronize the content of the GCS bucket with your application home directory:

Code Block
sudo gsutil rsync -r gs://<BUCKET>/ <APPLICATION-PATH>

Note

Replace <BUCKET> with your storage bucket name (where the application files have been loaded).

Replace <APPLICATION-PATH> with the path to your application (default: /home/project-name).
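If the re-synchronization is triggered from a Python script on the VM, the command above can be assembled as an argument list first. This sketch only builds the command and does not run it; the function name `rsync_command` and its use_sudo parameter are hypothetical.

```python
from typing import List


def rsync_command(bucket: str, app_path: str, use_sudo: bool = True) -> List[str]:
    """Build the argument list for the gsutil rsync command shown above."""
    # Accept the bucket with or without the "gs://" prefix.
    name = bucket[len("gs://"):] if bucket.startswith("gs://") else bucket
    name = name.strip("/")
    if not name or not app_path:
        raise ValueError("bucket and application path must not be empty")
    command = ["gsutil", "rsync", "-r", f"gs://{name}/", app_path]
    return ["sudo"] + command if use_sudo else command
```

The resulting list can be passed to subprocess.run(command, check=True) on the VM, where gsutil is expected to be available.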

Application Management

If you want to stop/start/delete the application, you can do so from your Compute Engine view, or use the view provided by Boxalino https://gcp-deploy-du3do2ydza-ew.a.run.app/instance

BigQuery access

  1. As a data scientist, chances are that you have been provided with a Service Account (SA) to access the client's private projects.

  2. The application is run by the project's own Compute Engine Service Account (CE SA).
    Because the project is in the scope of Boxalino, it will have direct read access to the client's datasets.

In order for the Application Launch to be able to access the client's own project dataset,
the CE SA will have to "impersonate" the SA in the client's project.

...

  1. Go to the client's private project IAM & Admin menu

  2. Navigate to Service Accounts

  3. Locate the service account with access provided for the scope of the application deployment

  4. On top-right, click on “Show Info Panel”.

  5. Add the CE SA and grant it the “Service Account User” role

The output of the application can be stored directly in the deployed GCP project scope (BigQuery, GCS, etc)

...