The purpose of this document is to enable our client's Data Science team to prepare and deploy their models in a Google Cloud environment.
Environment Details
On a deployed application, the following stack is available:
Python 3.7
git
Anaconda3
pip / pip3
setuptools
papermill
jupyter
google-api-python-client & google SDK libraries
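If you want to confirm the stack once a VM is deployed, a minimal check from an SSH session (the exact version numbers will vary with the image build):
python3 --version
git --version
conda --version
pip3 --version
papermill --version
jupyter --version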
Steps
Make a GCP Project Deployment Request with the Required Information.
A GCP project will be provided to the requestor.
Set the billing account on the new project.
This is required in order to use the GCP resources.
Prepare the Required Files (application structure)
Load the content into a GCS bucket in the project.
Your user email (as the requestor) will be given the Editor, Owner and Project Billing Manager roles.
Share access with other people who need access to the project.
The application is launched in a Virtual Machine in the project. The commands from commands.txt are executed. Additionally, you can SSH into the VM and update/check content.
As the manager of the application, you are responsible for the VM state.
You can follow the practices provided below.
We, at Boxalino, will extend the services available for the Data Science needs (schedulers, instance management, etc.). Make sure to review the documentation.
Integration Access
Because the application is launched in the scope of the project, the following Google Cloud tools can be used:
the Compute Engine, to launch more applications: https://console.cloud.google.com/compute/instances
the project's BigQuery dataset, to store results if required: https://console.cloud.google.com/bigquery
the Google Cloud Storage (GCS), to load files & store logs: https://console.cloud.google.com/storage/browser
the Cloud Scheduler, to create events for automatic runs: https://console.cloud.google.com/cloudscheduler/start
BigQuery Datasets Access
Additionally, the project's Compute Engine Default Service Account will have the following permissions on the client's datasets in the Boxalino scope:
BigQuery Data Editor : <client>_lab, <client>_views
BigQuery Data Viewer : <client>_core, <client>_stage, <client>_reports, <client>_intelligence
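To confirm this access from the application VM, you can list the tables of one of these datasets with the bq CLI; a minimal sketch, assuming the VM runs as the CE SA and <boxalino-project-id> is a hypothetical placeholder for the Boxalino host project ID:
bq ls <boxalino-project-id>:<client>_core
bq ls <boxalino-project-id>:<client>_lab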
1. Project Deploy
Required Information
In order to create a GCP Project in which the application will run, the following information is required:
# | Information | Description
---|---|---
1 | project name | as it will appear in your projects list
2 | requestor email | the requestor is the one managing the applications running on the project; this email will receive messages (alerts and notifications) when the project is ready to be used
3 | client name | (also known as the Boxalino account name) this is to ensure access to the views, core & reports datasets (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/303792129/GCP+Project+Deployment#BigQuery-Datasets-Access)
4 | labels | optional; the labels are used as project meta-information. See Labels
5 | permissions | optional; by default, the requestor will have full access and can further share with others. See Permissions
Once the project is created (2-3 min), the requestor will have access to it, in their Google Cloud Console.
As an editor on the project, the requestor will be able to:
access the project from Google Console https://console.cloud.google.com/
set a billing account for the project
share access to the project with other members in the IAM Admin panel
create GCS buckets and load different applications, which can be triggered
Labels (optional)
Labels are key-value pairs meant to better organize the projects.
Let's imagine a scenario where the client has multiple Data Scientists working on different projects. Labeling the projects allows a better structure.
team:data-science
component:<main-application-name>
environment:production
More information on labels: https://cloud.google.com/resource-manager/docs/creating-managing-labels
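Labels can also be updated later from the CLI; a minimal sketch, assuming <project-id> is a placeholder for your project ID (at the time of writing, project label updates are exposed under the gcloud alpha track):
gcloud alpha projects update <project-id> --update-labels=team=data-science,environment=production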
Permissions (optional)
The permissions are added when the project is created.
By default, the requestor's email has the project editor role.
Once the project is released, the requestor can add more emails / users to the IAM policies of the project.
Here is a sample of provided permissions (optional):
user:dana@boxalino.com:roles/editor
user:dana@boxalino.com:roles/resourcemanager.projectIamAdmin
user:dana@boxalino.com:roles/compute.osLogin
user:dana@boxalino.com:roles/compute.osAdminLogin
user:dana@boxalino.com:roles/bigquery.dataOwner
serviceAccount:service-account-from-other-projects:roles/iam.serviceAccountUser
serviceAccount:service-account-from-other-projects:roles/bigquery.dataOwner
serviceAccount:service-account-from-other-projects:roles/bigquery.dataEditor
More information on permissions: https://cloud.google.com/iam/docs/understanding-roles
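Once the project is released, the same bindings can also be granted from the CLI; a minimal sketch, assuming <project-id> is a placeholder and the email below is only an illustration:
gcloud projects add-iam-policy-binding <project-id> --member="user:dana@boxalino.com" --role="roles/editor"
gcloud projects add-iam-policy-binding <project-id> --member="user:dana@boxalino.com" --role="roles/resourcemanager.projectIamAdmin"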
2. Billing Information
In order to access the Google Cloud resources, a billing account must be set on the project.
In order to achieve this:
1. Go to the Billing menu in the GCP console, or check the billing projects https://console.cloud.google.com/billing/projects
2. Identify the project and click on the 3 dots. Select “Change Billing”.
3. In the window that appears, select the Billing Account on which the costs of the Application will be billed.
If you do not have access to a billing account, provide the Project Billing Manager role to someone who does. Use the IAM menu for this https://console.cloud.google.com/iam-admin/iam
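Alternatively, the billing account can be linked from the CLI (at the time of writing, under the beta track); a minimal sketch, assuming <project-id> and the billing account ID 0X0X0X-0X0X0X-0X0X0X are placeholders for your own values:
gcloud beta billing projects link <project-id> --billing-account=0X0X0X-0X0X0X-0X0X0X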
3. Application Content
In order to launch the application, the source files must be loaded in a Google Storage Bucket https://console.cloud.google.com/storage/browser
The Google Storage Bucket must have a unique name. Due to this, we recommend that every bucket name starts with your project name.
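A minimal sketch for creating such a bucket in Europe, assuming <project-id> and <project-name>-<app-name> are placeholders for your own values:
gsutil mb -p <project-id> -l EU gs://<project-name>-<app-name>/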
Required Files
# | File | Description
---|---|---
1 | instance.txt | properties for the Virtual Machine (VM) (name, size, home, etc.) (see instance.txt)
2 | requirements.txt | environment requirements (for pip/anaconda install) (see requirements.txt)
3 | commands.txt | a list of commands to be executed as part of your application run process (see commands.txt)
4 | env.yml | (optional) anaconda environment file
5 | your jupyter/python/application files | the content of your application (in python, jupyter notebooks, etc.)
The #1-#4 files must be named as instructed. They are used as a base for the application start-up.
instance.txt, requirements.txt and commands.txt must end with an empty line (they are parsed dynamically)
These files (and other required structures) must be uploaded to a GCS bucket.
The GCS bucket name is provided in the Application Launch Request.
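A minimal sketch for loading the files, assuming the application files sit in the current local directory and the bucket name is a placeholder:
gsutil -m cp -r ./* gs://<project-name>-<app-name>/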
instance.txt
# | Property | Default | Required | Description
---|---|---|---|---
1 | instance-name | project name | yes | the instance name is the VM name as it appears in the Compute Engine view
2 | machine-type | e2-micro | yes | the value depends on what the application needs: more CPU or more RAM? for options, please check the Google Cloud documentation
3 | email-to | | yes | the email is used once, to receive an alert when the VM is ready
4 | home | /home/project-name | no | the path on the server where the content of the GCS bucket is uploaded
5 | image-family | ubuntu-2004-lts | no | |
6 | boot-disk-size | 30 | no | |
7 | zone | europe-west1-b | no | this property can be left empty; use a zone which is in Europe
instance-name:application-name
machine-type:e2-micro
email-to:data-science-guru@boxalino-client.com
home:/home/project-name
image-family:ubuntu-2004-lts
boot-disk-size:30
zone:europe-west1-b
requirements.txt
This is a typical requirements.txt file for a jupyter/python application.
If your setup has been tested locally, you can save all the packages in the file with the command:
pip freeze > requirements.txt
Keep in mind that in this case, the requirements.txt file will list all the packages installed in your virtual environment, regardless of where they came from.
They will be installed in the conda environment.
google-api-core==1.20.0
google-api-python-client==1.9.3
google-auth==1.17.2
google-auth-httplib2==0.0.3
[..]
google-cloud-core==1.3.0
google-cloud-storage==1.29.0
google-pasta==0.2.0
commands.txt
You can include the commands that run the jupyter process on application launch.
You can also include any other commands, just as the application has been tested in the local environment.
chmod -R 777 <home value from instance.txt>/*
python3 <home value from instance.txt>/my-python-application.py
papermill <home value from instance.txt>/process.ipynb <home value from instance.txt>/process-output.ipynb
env.yml
The env.yml file is used in the setup step in order to create the anaconda environment.
name: gcp-application-name
channels:
  - defaults
dependencies:
  - ca-certificates=2020.1.1=0
  - <a list of dependencies>
  - pip:
    - google-api-core==1.22.2
    - google-api-python-client==1.9.3
    - google-auth==1.17.2
    - <more-libraries required for the application>
prefix: /opt/conda/envs/gcp-application-env
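If the anaconda environment has been tested locally, a minimal sketch for generating this file (the name and prefix will reflect your local setup and may need to be adjusted before upload):
conda env export > env.yml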
4. Application Launch
Before launching the application, make sure that the Required Files are uploaded in a GCS bucket.
To launch the application, complete the form in the Application Launch service https://gcp-deploy-du3do2ydza-ew.a.run.app/application.
Provide the following information:
# | Field | Description
---|---|---
1 | project ID | the project ID is unique; the project ID is displayed on the dashboard of your project https://console.cloud.google.com/home/dashboard
2 | GCS bucket name | the bucket name where the Required Files are located (ex: gs://<project-name>-<app-name>); the contents will be made available on the application as well. the bucket must be located in EUROPE: either use EU (multi-region) or europe-west1 (single region). the bucket name must be unique; for this purpose, we recommend that every bucket name starts with your project name.
3 | access code | as provided by Boxalino
Once the application has been launched, a script will initialize the environment and load all your content from the GCS bucket.
Checking out the application state
1. SSH into the application VM
Go to your project Compute Engine page: https://console.cloud.google.com/compute/instances
You can log in / SSH into the virtual machine and check the output or inspect the contents.
If you want to track the process live, as it happens, you can run the following command:
tail -f -n 2000 /var/log/syslog
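If you prefer connecting from your own terminal instead of the console SSH button, a minimal sketch, assuming the instance name and zone from your instance.txt and a placeholder project ID:
gcloud compute ssh <instance-name> --zone=europe-west1-b --project=<project-id>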
2. Monitoring tools from GCP
Go to your project Compute Engine page (https://console.cloud.google.com/compute/instances) and click on the running application instance. Switch to the Monitoring tab.
In the MONITORING view of your Application, you are able to track the resources available & consumed:
CPU utilization
Memory Utilization
Disk Space Utilization
3. GCP Logs Explorer
Google Cloud Platform provides a series of tools for monitoring the project resources.
One of these tools is the Logging Client.
Go to your Logs Explorer https://console.cloud.google.com/logs/query? and review the application logs.
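The same logs can be read from a terminal; a minimal sketch, assuming <project-id> is a placeholder:
gcloud logging read 'resource.type="gce_instance"' --project=<project-id> --limit=50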
Updating the application
Restart the application
Upon application restart (stop and start), the GCS bucket content is re-synchronized:
application files are synchronized with the content of the GCS bucket (from instance.txt)
requirements.txt is updated
commands.txt is run
Restarting the application will not update your anaconda environment.
Restarting the application will not update the instance properties (the ones defined in instance.txt) - unless you manually edit them.
Resynchronize the bucket
If you do not want to delete & re-run the application launch, you can re-synchronize the content of the GCS bucket with your application home directory:
sudo gsutil rsync -r gs://<BUCKET>/ <APPLICATION-PATH>
Replace <BUCKET> with your storage bucket name (where the application files have been loaded).
Replace <APPLICATION-PATH> with the path to your application (default: /home/project-name).
Application Management
If you want to stop/start/delete the application, you can freely do it from your Compute Engine view, or use the view provided by Boxalino https://gcp-deploy-du3do2ydza-ew.a.run.app/instance (see the CLI sketch below)
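From the CLI, the equivalent operations would be (a minimal sketch; instance name, zone and project ID are placeholders):
gcloud compute instances stop <instance-name> --zone=europe-west1-b --project=<project-id>
gcloud compute instances start <instance-name> --zone=europe-west1-b --project=<project-id>
gcloud compute instances delete <instance-name> --zone=europe-west1-b --project=<project-id>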
BigQuery access
As a data scientist, chances are that you have been provided with a Service Account (SA) to access the client's private projects.
The application is run by the project's own Compute Engine Service Account (CE SA).
Because the project is in the scope of Boxalino, it will have direct read access to the client's datasets.
In order for the application to be able to access the client's own project datasets, the CE SA will have to "impersonate" the SA in the client's project.
Go to the client's private project IAM & Admin menu
Navigate to Service Accounts
Locate the service account with access provided for the scope of the application deployment
At the top right, click on “Show Info Panel”
Add the CE SA and set the “Service Account User” permission (see the sketch after this list)
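A minimal CLI sketch of the same grant, assuming <client-sa>@<client-project-id>.iam.gserviceaccount.com is the client's SA and <project-number>-compute@developer.gserviceaccount.com is your project's CE SA (both placeholders):
gcloud iam service-accounts add-iam-policy-binding <client-sa>@<client-project-id>.iam.gserviceaccount.com --member="serviceAccount:<project-number>-compute@developer.gserviceaccount.com" --role="roles/iam.serviceAccountUser" --project=<client-project-id>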
The output of the application can be stored directly in the deployed GCP project scope (BigQuery, GCS, etc)