Integrate your Data in Boxalino BigQuery Data Science Eco-System

DATA INTEGRATION STEPS - OVERVIEW

Introduction

Boxalino provides a complete Data Science Eco-system built upon Google BigQuery and fully integrated with Boxalino Real-Time Platform.

In this section we present the 4 key steps of a Data Integration project, which are then explained in detail in the following sections of the document.

4 Key Steps

To integrate new data, the key steps are:

  1. Set-up your Google Cloud Environment
    You will need an Organization, Project, Billing Account, Service Account, Storage Bucket and BigQuery Datasets (Stage and Core Datasets)

  2. Export your data to Google Cloud Storage
    Set-up an Automatic Uploading Process for your Full historical data and Partial daily data as CSV or JSON files in Google Cloud Storage

  3. Set-up your Data Processing
    Create your Core Dataset Tables as an Operational Data Store and set-up an Automatic Data Processing of your files to your Stage Dataset and Core Datasets
    Watch the tutorial video: https://youtu.be/yYz_HmeX4T8

  4. Integrate your Core Data with Boxalino
    Boxalino integrates your data from your Core Dataset into Boxalino Data Science Eco-System

 

 

For an optimal integration, your data will go through 3 phases: Google Storage as files, BigQuery stage as direct representation of your files and BigQuery core as consolidated data. The process is typically executed once per day, but can be adjusted to be executed more frequently.

System overview

 

The Operational System with Data (the system which manages the data to be exported) typically exports the data as files to a Google Cloud Storage bucket using the Google Cloud SDK:

https://cloud.google.com/appengine/docs/standard/go/download

As a variant, another machine (e.g. a VM hosted in the same secured location as the Operational System) can be used to send the files to Google Cloud Storage with the Google Cloud SDK.

This variant is typically not required and is not recommended: the Google Cloud SDK can be installed safely on any machine, and sending the files directly from the Operational System keeps temporary storage of the data to a minimum, which ensures optimal security.

STEP 1: SET-UP YOUR GOOGLE ENVIRONMENT

Summary

In this step, you prepare everything required in your Google Cloud environment.

Typically, if you start without an existing Google Cloud account, this can be completed in about 2 hours.

You will need the required credentials (including payment details) to complete the set-up.

The key parts of this step are:

  1. Create a Google Cloud Organization

  2. Create a Google Cloud Project

  3. Create a Google Billing Account

  4. Create a Google Cloud Storage Bucket

  5. Create a Google Cloud Service Account

  6. Create STAGE and CORE BigQuery Datasets

  7. Create a Google Group for Users

Create A Google Cloud Organization

This is the resource representing your company in Google Cloud. An Organization resource is available for Google G Suite and Cloud Identity customers.

You can create a Google Cloud Organization in two ways: Cloud Identity or GSuite.

The information to decide which one is the right one for you and how to do it are described here:

https://cloud.google.com/resource-manager/docs/creating-managing-organization

Then you can view and manage billing accounts and projects under your organization here:

https://cloud.google.com/resource-manager/docs/organization-resource-management

Create A Google Cloud Project

Google Cloud Platform (GCP) projects form the basis for creating, enabling, and using all GCP services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for GCP resources. This is the main container for all of your project activities, both related to the uploading of files and their processing in BigQuery.

Please follow the instructions here:

https://cloud.google.com/resource-manager/docs/creating-managing-projects

Create A Google Billing Account

https://cloud.google.com/billing/docs/how-to/manage-billing-account

Create A Google Cloud Storage Bucket

Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities.
https://cloud.google.com/storage/

Google Cloud Storage Bucket is the “folder” which will contain all your exported files.

Please make sure to use the region europe-west1.

https://cloud.google.com/storage/docs/creating-buckets
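
For example, a bucket can be created from the command line with gsutil. This is a sketch only; the project id my-project and bucket name my-bucket are placeholders to replace with your own values:

gsutil mb -p my-project -l europe-west1 gs://my-bucket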

Create A Google Service Account

This is the account to be used for the automated processes in Steps 2 and 3. The service account is the account the machine will use to perform the data loading and transformation tasks. The service account must have access to your Storage Bucket and your BigQuery Project.

https://cloud.google.com/iam/docs/service-accounts

Make sure to grant this service account access rights to the Google Cloud Storage Bucket and provide its authentication information to the people responsible for the set-up of Step 2.
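
As a sketch, access to the bucket can be granted to the service account with gsutil iam (the service account e-mail and bucket name below are placeholders):

gsutil iam ch serviceAccount:data-loader@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://my-bucket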

Create STAGE and CORE BigQuery Datasets

Please use the location EU.

Then share your datasets with Boxalino by adding the accounts communicated by Boxalino as members with both the Viewer and BigQuery Data Viewer roles:

Google BigQuery
BigQuery is a RESTful web service that enables interactive analysis of massive datasets working in conjunction with Google Storage. It is a serverless Software-as-a-Service that may be used complementarily with MapReduce.
https://cloud.google.com/bigquery

Google BigQuery Dataset
A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.
In other words, a dataset plays the same role in BigQuery as a “database” does in MySQL.

Here is a general introduction into BigQuery Datasets.

https://cloud.google.com/bigquery/docs/datasets-intro

Create two datasets called [account]_stage and [account]_core, where [account] is your account name. It is typically derived from the name of the system providing the data, possibly extended for a specific account in case several different accounts are exported from the same system in parallel:

https://cloud.google.com/bigquery/docs/datasets
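
As a sketch, the two datasets can also be created from the command line with the bq utility (my-project and the myaccount prefix are placeholders):

bq --location=EU mk --dataset my-project:myaccount_stage
bq --location=EU mk --dataset my-project:myaccount_core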

Create a Google Group

Create a Google Group and include all users that should have access to BigQuery (you can then manage who has access directly from the group in the future).

https://support.google.com/cloudidentity/answer/33343?hl=en

Assign the relevant access roles to the new user group, for example as sketched below.
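
A minimal sketch of granting BigQuery roles to the group on the project with gcloud (the project id and group address are placeholders; the exact roles should be chosen according to your access policy):

gcloud projects add-iam-policy-binding my-project --member="group:bq-users@yourcompany.com" --role="roles/bigquery.user"
gcloud projects add-iam-policy-binding my-project --member="group:bq-users@yourcompany.com" --role="roles/bigquery.dataViewer"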

STEP 2: EXPORT YOUR DATA TO GOOGLE CLOUD STORAGE

Summary

Before proceeding further, make sure that your Google Environment has been prepared as per the instructions in Step 1.

As a result, you should now have ready the following information:

  • Google Service Account (an e-mail address) and a certificate with a private key

  • Address of the Cloud Storage Bucket (e.g.: gs://my-bucket)

The key activities in this step are:

  1. Install Google Cloud SDK

  2. Authenticate with the Service Account

  3. Set up an export process generating CSV or JSON files

  4. Upload the generated files into a Google Cloud Storage Bucket

  5. Automate the process

Install Google Cloud SDK

Install the Google Cloud SDK in the environment exporting your data as per the below instructions:

https://cloud.google.com/appengine/docs/standard/go/download

Authenticate with your Google Service Account

Authenticate with the Google Service Account created in Step 1 (including the certificate with the private key) using the SDK command gcloud auth as per the instructions here:

https://cloud.google.com/sdk/gcloud/reference/auth/activate-service-account
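
A minimal sketch of the authentication call (the service account e-mail and the path to the key file are placeholders):

gcloud auth activate-service-account data-loader@my-project.iam.gserviceaccount.com --key-file=/path/to/key.json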

Set up an export process generating CSV or JSON files

The process is not described in detail as it is client-specific.

Its output should be a set of files containing the exported data. There can be two kinds of exports: the full data (typically customer masterdata) and the partial data (typically transaction data for the chosen period, such as transactions from the day before).

Transaction data
By transaction data we mean data connecting customers to products, such as the purchase history. While the purchase history is the main (and typically most important) use case, other transaction data could be also considered (sales leads, support requests, etc.).

This data typically comes structured in one of the following two ways (examples given for the CSV format):

  • either with one line per order with the required internal order and customer identifiers as well as an external order identifier (which should be used to map the data to pre-existing order data in the Boxalino Data Science Eco-system and which might be the same as the internal order identifier) and descriptive columns about the order (date, status, total value, etc.)

  • or with one line per product per order with the required product identifier as well as descriptive columns about the ordered product (quantity, price before and after discounts, etc.).

For details look at the Annex: WHAT DATA DO WE NEED?

Customer data
By customer data we mean data describing a customer in the system exporting the data (depending on the system, a customer can be defined as a unique physical person, a company, an account or by another logic). This data is typically contained in one CSV file, with one line per customer containing the required customer id as referred to in the transaction data, one column for the external identifier (which should be used to map the data to pre-existing customer data in the Boxalino Data Science Eco-system and which might be the same as the internal customer identifier), as well as descriptive columns about the customer (e.g.: gender, zip code, date of birth, etc.). We recommend exporting anonymized customer data only, unless there is a specific compelling reason to do otherwise.

For details look at the Annex: WHAT DATA DO WE NEED?

Full data
The full data includes all information available at and consistent to a chosen point in time (typically the moment of the extraction of the data from its source system). Also known as “full snapshot”. One of the two basic kinds of data provided by the Automatic Upload Process.

Partial (aka DELTA) data
One of the two basic kinds of data provided by the Automatic Upload Process (full or partial). The partial daily data (aka DAILY DELTA) contains only the data that has changed during the previous day (typically all changes from the day before), but weekly or monthly deltas can also be generated. If the data is not easily exported as a delta (customer data, for example, is often more difficult to export as a delta than transactions), then it is possible to export the full data instead. The files can be in either CSV or JSON format.

CSV files should have column names in the first row.

JSON files should use the Google-compliant format with one JSON record per row (and not a single JSON object with all the data in it).
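
For illustration, a newline-delimited JSON export with hypothetical product fields would look like this (one complete JSON object per line, no surrounding array):

{"product_id": "P-1001", "title": "Leather wallet", "price": "49.90"}
{"product_id": "P-1002", "title": "Canvas bag", "price": "29.90"}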

If possible (and to be compatible with the standard SQL and Shell-scripts samples provided in Step 4) the files should follow the following naming pattern:

  • First part of file name should be date (+ time) to support wildcard (“*”) matching

  • Second part should be F (for full data) or D (for delta data)

  • Name of the file should indicate the contents of the export (customer, transactions, product… etc) and optionally a numerical post-fix before the file extension (e.g. when the export is spread over several files)

e.g.: 20190822T071553_D_products.json

Copy the files to Google Cloud Storage Bucket

Copy the files to Google Cloud Storage Bucket defined in step 1 using the command gsutil cp as described here:

https://cloud.google.com/storage/docs/gsutil/commands/cp

It is recommended to put the files into a folder structure with 3 levels (starting from the root), as in the example after this list:

  1. Your bucket name

  2. data (fixed name)

  3. Type of contents (e.g.: “customers” or “orders”)

Optionally, the files can be compressed using gzip.
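
A sketch of uploading the example file from the naming section into such a structure (my-bucket is a placeholder for your bucket name):

gsutil cp 20190822T071553_D_products.json gs://my-bucket/data/products/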

Automate the process

Automatically run the export of the Partial daily data (typically once per day at a set hour during the night), for example via cron as sketched below. Also set up an easy facility to run the export of the full data manually.
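
As a sketch, on a Linux system the daily export could be scheduled with cron, assuming a hypothetical wrapper script daily_export.sh that generates the files and copies them to the bucket with gsutil cp:

0 2 * * * /opt/export/daily_export.sh >>/var/log/daily_export.log 2>&1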

STEP 3: SET UP YOUR DATA PROCESSING

Watch the tutorial video: https://youtu.be/yYz_HmeX4T8

Summary

This step describes the creation of Core Dataset Tables that act as an Operational Data Store and the set-up of an Automatic Data Processing within Google Cloud to load the last files provided by the Automatic Uploading Process Run.

There are many ways to set up such a process. Below we give a very simple, standard copy-and-paste implementation using a Google VM Instance:

Google Cloud Virtual Machine Instance (VM Instance)
A VM Instance is a virtual machine hosted on Google Compute Engine. In this set-up it runs the Automatic Data Processing, which loads the files generated by the last Automatic Uploading Process Run into the STAGE Dataset (direct mapping: the files are loaded as they are into tables whose names are derived from the file names) and then updates the CORE Dataset based on the generated stage tables.

https://cloud.google.com/compute/docs/instances/

We suggest creating a process that is capable of handling both the full and the delta exports.

To do so, the following tasks should be completed:

  1. Create a VM Instance

  2. Set up your VM Instance with a Shell Script

  3. Run a Setup Script with chosen parameters

  4. Set up a Cron-job (for automated execution of the data processing)

 

Automatic Data Processing
An Automatic Data Processing loads the files generated by the last Automatic Uploading Process Run into the Stage Dataset (direct mapping: the files are loaded as they are into tables whose names are derived from the file names) and then updates the Core Dataset on the basis of these generated stage tables.

Automatic Data Processing Run
An Automatic Data Processing Run is a specific execution (with its timestamp) of an Automatic Data Processing. It should typically be scheduled to run shortly after the Automatic Uploading Process Run. Alternatively, it could be triggered by the end of the Automatic Uploading Process Run instead of being scheduled independently, but this requires an additional step in the set-up of the Automatic Uploading Process.

Create a VM Instance

In this part, we describe how to create a VM instance in Google Cloud. It can be done in the Google Cloud Console under Compute Engine -> VM Instances -> Create Instance.

https://cloud.google.com/compute/docs/instances/create-start-instance

Here are some of the important parameters for the VM:

  1. VM machine type: g1-small is usually sufficient

  2. Location must be europe-west1

  3. Make sure your Service Account has access to Cloud Storage and BigQuery APIs and can connect to the VM

You can then simply start the VM Instance (it normally starts automatically after creation) and open an SSH session in a browser window to perform the next steps.
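
A sketch of the equivalent VM creation from the command line (the instance name, the zone within europe-west1 and the service account e-mail are placeholders):

gcloud compute instances create data-processing-vm --zone=europe-west1-b --machine-type=g1-small --service-account=data-loader@my-project.iam.gserviceaccount.com --scopes=https://www.googleapis.com/auth/cloud-platform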

Setup of your VM with a Shell Script

  • Copy the Setup Script from the following section into your home/user directory

  • Make the script files executable using command chmod u+x FILENAME

  • Execute the script with the correct parameters from inside your home/user directory. The command should follow this pattern: bash configure.sh [STAGE and CORE dataset prefix] [Your GCS bucket name] [Table name],[Column1],[Column2],[Column3],[Column4],[Column5]...

  • Here is an example of what it could look like: bash configure.sh boxalino boxalino-data transactions,order_id,order_date,order_status,order_amount,order_currency

  • If you want to create multiple tables, append them to the end of the command and separate each table definition with a space (see the example after this list). Without the space separator, everything is parsed as a single table definition.

  • Be careful: you should only run the setup script once. If you have already loaded data into one of your tables and run the script again, each table will be overwritten and you will lose your current data. Make sure you add all the necessary tables from the start.
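
For illustration, a call defining two tables could look as follows (the column lists are hypothetical and must match your own exports):

bash configure.sh boxalino boxalino-data transactions,order_id,order_date,order_status,order_amount,order_currency customer,customer_id,gender,zip_code,birth_date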

Install the system package gettext-base in case it is missing on your system (using the yum, dnf or apt package manager, depending on your Linux distribution).

Setup Script

This script has to be run only once (see the warning in the previous section). It will generate several directories, two SQL files and the shell script for the regular automated data processing:

  • A file with DDL statements for the creation of the CORE tables that needs to be executed manually once (create-core.sql)

  • A file with SQL code for regular (daily) loads (stage2core.sql) that takes data from the STAGE, transforms and stores it in the CORE area.

  • A file with Bash shell script that activates and steers the data processing (process.sh, here referenced also as “Automation Script”). This file can be called from the linux cron daemon to guarantee a reliable schedule-based execution.

Full code follows.

 

#!/usr/bin/env bash

# good shell scripting practices
set -o errexit   # exit when a command fails
set -o nounset   # exit when an undeclared variable is used
set -o pipefail  # return exit status of the last command that threw a non-zero exit code even when piped together
# set -o xtrace  # enable debugging

# functions
usage () {
cat 1>&2 <<HELP_USAGE
Usage: $0 CLIENT GCS_BUCKET EXPORT_DEFINITION...

CLIENT is the name that prefixes the STAGE and CORE datasets in BigQuery

GCS_BUCKET is the name of the storage bucket in Google Cloud Storage where source systems deliver exported data.
The script expects the source files and their location to conform to the following pattern:
gs://GCS_BUCKET/<export_name>/YYYYMMDD*_<export_name>.(csv|json)

EXPORT_DEFINITION any number of definitions can be provided separated by spaces
a single export definition has the form of EXPORT_NAME,COLUMN1,COLUMN2,...
the export name determines the expected filenames, names of STAGE tables and the names of the tables in the CORE
and for this reason ONLY names containing the following character classes [a-zA-Z0-9_] are allowed
it is assumed that the first COLUMN is the primary key used for matching records between STAGE and CORE
NOTE that the order of the export definitions determines the order of the loading / processing the data

Artifacts created:
logs
sql/${DDLSQL}
sql/${DMLSQL}
tmp
${ETLSCRIPT}

Example:
$0 [STAGE and CORE dataset prefix] [Your GCS bucket name] [Table name],[Column1], / [Column2],[Column3],[Column4],[Column5]...
the above call will create a process to load files named "YYYYMMDD*[Table name].(csv|json)"
into BigQuery STAGE table "[STAGE dataset prefix]_stage.YYYYMMDD*[Table name]"
and during the CORE load into "[CORE dataset prefix]_core.[Table name]"
HELP_USAGE
}

# global configuration
ROOTDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TOPDIRS="logs sql tmp"
DDLSQL="create-core.sql"
DMLSQL="stage2core.sql"
ETLSCRIPT="process.sh"

if [[ ${#} -lt 3 ]]; then
  usage
  exit 22
fi

# what is our client
CLIENT=${1}
CLIENT_BUCKET=${2}
CLIENT_STAGE=${CLIENT}_stage
CLIENT_CORE=${CLIENT}_core
shift 2

# create directories
for d in ${TOPDIRS}; do
  newdir=${ROOTDIR}/${d}
  mkdir -p ${newdir}
  echo "created directory ${newdir}"
done; echo

# initiate artifacts
echo -e "-- autogenerated DDL for creation of core tables\n" >${ROOTDIR}/sql/${DDLSQL}
echo -e "-- autogenerated SQL for stage2core load\n" >${ROOTDIR}/sql/${DMLSQL}

cat >${ROOTDIR}/${ETLSCRIPT} <<EOF
#!/usr/bin/env bash
# this script was autogenerated on $(date +%Y-%m-%d\ %T)
set -o errexit   # exit when a command fails
set -o nounset   # exit when an undeclared variable is used
set -o pipefail  # return exit status of the last command that threw a non-zero exit code even when piped together
# set -o xtrace  # enable debugging
export ETLDATE=\$1
if [[ -z \${ETLDATE} || ! \$ETLDATE =~ ^20[1-2][0-9][0-1][0-9][0-3][0-9]$ ]]; then
  echo "Missing argument ETLDATE or format mismatch (expected 20YYMMDD)"
  exit 22
fi
DEFAULT_CSV_DELIMITER=,
EOF

for ed; do
  # start processing export definitions provided by the user
  exportname=${ed%%,*}
  echo "Started processing export ${exportname}"

  # parse input arguments
  columnstring=${ed#$exportname,}
  declare -a columns
  columns=($(cut -d ',' -f 1- - --output-delimiter=' ' <<< "${columnstring}"))

  # start generating DDL for the export
  echo -e "CREATE OR REPLACE TABLE \`${CLIENT_CORE}.${exportname}\` (" >>${ROOTDIR}/sql/${DDLSQL}
  for c in "${columns[@]}"; do
    if [[ ${c} = "${columns[0]}" ]]; then
      echo -e " ${c} STRING NOT NULL" >>${ROOTDIR}/sql/${DDLSQL}
    else
      echo -e " ,${c} STRING" >>${ROOTDIR}/sql/${DDLSQL}
    fi
  done;
  echo -e " ,create_tm DATETIME NOT NULL,update_tm DATETIME NOT NULL,src_cd STRING" >>${ROOTDIR}/sql/${DDLSQL}
  echo -e ") OPTIONS(description=\"autogenerated\");\n" >>${ROOTDIR}/sql/${DDLSQL}
  echo "DDL for ${exportname} written"

  # append commands to stage2core SQL for the export
  echo -e "-- ${exportname}" >>${ROOTDIR}/sql/${DMLSQL}
  echo -e "DELETE FROM \`${CLIENT_CORE}.${exportname}\`" >>${ROOTDIR}/sql/${DMLSQL}
  echo -e "WHERE ${columns[0]} IN (SELECT CAST(${columns[0]} AS STRING) FROM \`${CLIENT_STAGE}.\${${exportname}_stage_table}\`);" >>${ROOTDIR}/sql/${DMLSQL}
  echo -e "INSERT INTO \`${CLIENT_CORE}.${exportname}\` (${columnstring},create_tm,update_tm,src_cd)" >>${ROOTDIR}/sql/${DMLSQL}
  echo -e "SELECT " >>${ROOTDIR}/sql/${DMLSQL}
  for c in "${columns[@]}"; do
    if [[ ${c} = "${columns[0]}" ]]; then
      echo -e " CAST(${c} AS STRING) AS ${c}" >>${ROOTDIR}/sql/${DMLSQL}
    else
      echo -e " ,CAST(${c} AS STRING) AS ${c}" >>${ROOTDIR}/sql/${DMLSQL}
    fi
  done;
  echo -e " ,current_datetime AS create_tm, current_datetime AS update_tm, '${exportname}' AS src_cd" >>${ROOTDIR}/sql/${DMLSQL}
  echo -e "FROM \`${CLIENT_STAGE}.\${${exportname}_stage_table}\`;\n" >>${ROOTDIR}/sql/${DMLSQL}
  echo "SQL for ${exportname} written"

  # continue generating ETLSCRIPT contents
  echo -e "# set up variables for ${exportname} processing" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "${exportname}_file=\$(gsutil ls gs://${CLIENT_BUCKET}/data/${exportname}/\${ETLDATE}*${exportname}.* | grep -iE 'csv|json' | tail -n 1)" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "${exportname}_file_type=\${${exportname}_file##*.}" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "export ${exportname}_stage_table=\$(basename \${${exportname}_file%.*})" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "export ${exportname}_core_table=${exportname}" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "# load ${exportname} data from storage into BigQuery STAGE dataset" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "if [[ \$${exportname}_file_type = CSV || \$${exportname}_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=\${DEFAULT_CSV_DELIMITER} --autodetect --replace ${CLIENT_STAGE}.\${${exportname}_stage_table} \$${exportname}_file; fi" >>${ROOTDIR}/${ETLSCRIPT}
  echo -e "if [[ \$${exportname}_file_type = JSON || \$${exportname}_file_type = json ]]; then bq load --source_format NEWLINE_DELIMITED_JSON --autodetect --replace ${CLIENT_STAGE}.\${${exportname}_stage_table} \$${exportname}_file; fi\n\n" >>${ROOTDIR}/${ETLSCRIPT}

  echo -e "Finished processing export ${exportname}\n"
done

# write the remaining portion of ETLSCRIPT contents
echo -e "# STAGE to CORE queries from ${DMLSQL} - template and execute in synchronous fashion" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "live_sql_file=\$(date +%Y%m%dT%H%M%S)_${DMLSQL}" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "envsubst <sql/${DMLSQL} >tmp/\${live_sql_file}" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "bq query --nouse_legacy_sql <tmp/\${live_sql_file}" >>${ROOTDIR}/${ETLSCRIPT}

# make generated processing script executable
chmod ug+x ${ROOTDIR}/${ETLSCRIPT}

# print cron instructions
echo -e "To enable automatic daily execution type the following command"
echo -e "\tcrontab -e"
echo -e "and add the following line"
echo -e "\t* H * * * ${ROOTDIR}/${ETLSCRIPT} \$(date --date '1 day ago' +%Y%m%d) 1>&2 2>${ROOTDIR}/logs/\$(date +%Y%m%dT%H%M%S)_${ETLSCRIPT%%.*}.log"
echo -e "where 'H' stands for the HOUR of automatic execution (we recommend running very early in the morning)\n"
echo -e "ALL DONE"
exit 0

Automation Script

Here we include a sample of the shell script referred to in the prior section, which needs to be executed by cron. This shell script is generated during the configuration and its contents depend on the parameters (name of the client, the exports and their structure) entered by the user.

#!/usr/bin/env bash
# this script was autogenerated on 2019-11-11 11:06:57
set -o errexit   # exit when a command fails
set -o nounset   # exit when an undeclared variable is used
set -o pipefail  # return exit status of the last command that threw a non-zero exit code even when piped together
# set -o xtrace  # enable debugging
export ETLDATE=$1
if [[ -z ${ETLDATE} || ! $ETLDATE =~ ^20[1-2][0-9][0-1][0-9][0-3][0-9]$ ]]; then
  echo "Missing argument ETLDATE or format mismatch (expected 20YYMMDD)"
  exit 22
fi
DEFAULT_CSV_DELIMITER=,

# set up variables for customer processing
customer_file=$(gsutil ls gs://integration-bucket/data/customer/${ETLDATE}*customer.* | grep -iE 'csv|json' | tail -n 1)
customer_file_type=${customer_file##*.}
export customer_stage_table=$(basename ${customer_file%.*})
export customer_core_table=customer

# load customer data from storage into BigQuery STAGE dataset
if [[ $customer_file_type = CSV || $customer_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=${DEFAULT_CSV_DELIMITER} --autodetect --replace kevinllado_stage.${customer_stage_table} $customer_file; fi
if [[ $customer_file_type = JSON || $customer_file_type = json ]]; then bq load --source_format NEWLINE_DELIMITED_JSON --autodetect --replace kevinllado_stage.${customer_stage_table} $customer_file; fi

# STAGE to CORE queries from stage2core.sql - template and execute in synchronous fashion
live_sql_file=$(date +%Y%m%dT%H%M%S)_stage2core.sql
envsubst <sql/stage2core.sql >tmp/${live_sql_file}
bq query --nouse_legacy_sql <tmp/${live_sql_file}

Create the Core tables

After you have executed the Setup Shell-Script submit the SQL file with the code for the table creation (“create-core.sql“ under the directory “sql”) using the BigQuery command line utility called bq. Make sure the CORE dataset already exists in BigQuery!

Type the following command in your $USER_HOME directory to create all necessary CORE tables:

bq query --nouse_legacy_sql <sql/create-core.sql

Alternatively you can open the SQL file with CREATE statements using a linux text editor of your choice and copy / paste / execute the SQL statements one after another in your BigQuery Admin Console, here: https://console.cloud.google.com/bigquery

Create statements for the table creation are documented at

https://cloud.google.com/bigquery/docs/tables

Before executing the generated CORE table creation script, you are welcome to review the code and validate the definitions. You may decide to change the column names, options or data types of the CORE tables. We particularly recommend the use of partitioned and clustered tables in the CORE so as to reduce BigQuery costs when querying the data, for example as sketched below.
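
A sketch of a manually edited create-core.sql entry for a partitioned and clustered transactions table, assuming you change order_date from the autogenerated STRING to DATE (the corresponding CAST in stage2core.sql must then be adjusted as well; column names are taken from the earlier configure.sh example):

CREATE OR REPLACE TABLE `boxalino_core.transactions` (
  order_id STRING NOT NULL
  ,order_date DATE
  ,order_status STRING
  ,order_amount STRING
  ,order_currency STRING
  ,create_tm DATETIME NOT NULL
  ,update_tm DATETIME NOT NULL
  ,src_cd STRING
)
PARTITION BY order_date
CLUSTER BY order_id
OPTIONS(description="autogenerated, edited for partitioning and clustering");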

Review the STAGE to CORE load

The Google BigQuery CORE Dataset is a standard BigQuery Dataset (typically named [account]_core, with [account] being your account name) which contains a set of tables with the consolidated data from all Automatic Uploading Process runs that went through the Automatic Data Processing. Typically there are as many tables as there are files in your Google Storage for one complete run of your Automatic Uploading Process.

The Setup Script generates both the commands (in “process.sh“) and the SQL statements to load the raw files into STAGE and from there into CORE. The file with the SQL code for the STAGE to CORE load is generated as “stage2core.sql“ under the directory “sql”. This file works out of the box; however, it may need to be modified manually for the following reasons:

  • The automation script assumes that the persistent identifier is in the first column. This means that, before inserting new data into the core, it will delete previously existing data with the same identifier value. If the identifier is not in the first column (or requires a more complex logic to be fulfilled) the SQL code must be modified accordingly.

  • You can change the data type of the fields wherever necessary (by default, your CORE tables are created with all columns of type STRING). The following data types are available in BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
    Please set the desired type using the CAST function. The generated SQL commands already include it: CAST(<column_name> AS STRING). Both DELETE and INSERT statements need to be adjusted consistently in case you change the type of the first column of the table, which we consider that table’s Primary Key by convention.

  • You may also choose to transform the flat data coming from a CSV using complex or nested types such as ARRAYs and STRUCTs. Or you could combine information coming from several CSV files into one table (e.g. several different files with different types of customer properties - with each type being an ARRAY or a STRUCT in the final desired CORE table). For more information see
    https://cloud.google.com/bigquery/docs/nested-repeated

  • If you changed the data type of any column in any of the CORE tables in the prior section, you need to adjust the corresponding CAST(column_name AS STRING) to match the right type (see the sketch after this list). The same applies to advanced transformations that map flat data into complex / nested data types in the CORE.
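
A minimal sketch of such an adjustment in stage2core.sql, assuming the CORE column order_date was changed to DATE (PARSE_DATE is only needed if the exported value is not in the canonical YYYY-MM-DD format):

-- autogenerated: ,CAST(order_date AS STRING) AS order_date
-- adjusted:      ,CAST(order_date AS DATE) AS order_date
-- or, for a YYYYMMDD export format: ,PARSE_DATE('%Y%m%d', order_date) AS order_date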

Setup the Cron-job

The final step is to set up the cron-job which will automate the daily execution of your script file. The corresponding instructions will be printed on the terminal after the execution of the Setup Shell-Script has finished.

You can test the Automation Script by typing this in the command line: bash process.sh [Date in the format YYYYMMDD].

For details, you can watch the simple how-to video:

https://www.youtube.com/watch?v=llUw3RtD-Yw

The Automation Script (“process.sh“) requires one parameter to be passed in from the cron-job: the ETLDATE in the format YYYYMMDD.

Typically, today’s load processes the data from yesterday. The correct ETLDATE value can thus be generated using the shell subcommand $(date --date '1 day ago' +%Y%m%d), for example in the crontab entry sketched below.
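
A sketch of a corresponding crontab entry running at 05:30 every day (the home directory is a placeholder; note that percent signs must be escaped as \% inside crontab):

30 5 * * * /home/etl_user/process.sh $(date --date '1 day ago' +\%Y\%m\%d) >>/home/etl_user/logs/process.log 2>&1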

Possible errors and how to solve them

Here are the most important errors which can appear during the execution of the Automation Script, and how to solve them.

Wrong datatype

A common error reports that specific data could not be inserted because of a datatype mismatch: for example, a postal code containing letters cannot be loaded into an INTEGER column. This happens because the autodetect feature of BigQuery infers the datatype on its own; when most postal codes consist only of digits, it will choose INTEGER and then fail on the values that contain letters. The solution is to define the schema manually: create a JSON file containing the schema of the table which failed. The schema should follow this pattern:
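
A sketch of such a schema file, with hypothetical column names (the full list of fields must match your export):

[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "zip_code", "type": "STRING", "mode": "NULLABLE"},
  {"name": "birth_date", "type": "STRING", "mode": "NULLABLE"}
]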

After you have created the schema, save it and open the process.sh script. Search for the bq load line of the table you want to pass the schema to; it should look like this one:
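
(Taken from the sample Automation Script above; your dataset, table and file variable names will differ.)

if [[ $customer_file_type = CSV || $customer_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=${DEFAULT_CSV_DELIMITER} --autodetect --replace kevinllado_stage.${customer_stage_table} $customer_file; fi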

Remove the --autodetect flag and append the path to your JSON schema file at the end of this line. This will fix the problem; the adjusted line could look as follows:
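
(The path to the schema file below is a placeholder.)

if [[ $customer_file_type = CSV || $customer_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=${DEFAULT_CSV_DELIMITER} --replace kevinllado_stage.${customer_stage_table} $customer_file /home/etl_user/schemas/customer.json; fi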

Wrong separator

Another problem you might encounter is that your data is not recognised properly and is inserted into the BigQuery tables incorrectly. The reason is usually the separator of your CSV files: the Automation Script assumes a comma as the default separator, so if your CSV files use semicolons you have to change the script, otherwise your data will not be loaded into the BigQuery tables correctly. To change this, open the process.sh script. Near the top you will see the following line: DEFAULT_CSV_DELIMITER=,. The only thing you have to do is change it to: DEFAULT_CSV_DELIMITER=\;

STEP 4: INTEGRATE YOUR CORE DATA WITH BOXALINO

Summary

Boxalino will integrate your data in the Boxalino Data Science Eco-System based on its representation in your CORE dataset.

This process offers a degree of customization, which can be further discussed and explained during a workshop.

More detailed information is available on demand.

As preparation, the questionnaire in Annex #2 should be filled in.

 

EXECUTION PLAN

Example Execution Plan

It is recommended to structure the execution of the 4 steps according to the following plan:

 

Phase 1 (typically 1-2 weeks):

Workshop to define (typically 2 hours with the Client, the Client Provider and Boxalino):

  • Content and format of the exported files

  • Key aspects of the BigQuery Core Dataset tables

Implementation:

  • Step 1 – Typically by the IT of the Client

  • Step 2 – Typically by the Provider with the data to export

 

Phase 2 (typically 1-2 weeks):

Workshop to define (typically 2 hours with the Client and Boxalino):

  • Full aspects of the BigQuery Core Dataset tables

  • Key aspects of the integration

Implementation:

  • Step 3 – Typically by the IT of the Client (with the support of Boxalino)

  • Step 4 – Typically by Boxalino (typically 1 person-day of work)

GLOSSARY

Google Cloud Organization
This is the resource representing your company in Google Cloud. An Organization resource is available for Google G Suite and Cloud Identity customers.
https://cloud.google.com/resource-manager/docs/creating-managing-organization

Google Cloud Project
Google Cloud Platform (GCP) projects form the basis for creating, enabling, and using all GCP services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for GCP resources. This is the main container for all project activities, including the upload of files and their processing in BigQuery.
https://cloud.google.com/resource-manager/docs/creating-managing-projects

Google Cloud Service Account
This is the account to be used for process automation in Steps 2 and 3. The Service account is a technical account meant for a machine / process to perform the data loading and transformation tasks. The service account must have access to your Storage Bucket and your BigQuery Project.
https://cloud.google.com/iam/docs/service-accounts

Google Cloud Storage
Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities.
https://cloud.google.com/storage/

Google Cloud Storage Bucket
This is the “folder” which will contain all your exported files.
Please make sure to use the region europe-west1.
https://cloud.google.com/storage/docs/creating-buckets

Google BigQuery
BigQuery is a RESTful web service that enables interactive analysis of massive datasets working in conjunction with Google Storage. It is a serverless Software as a Service / Datawarehouse as a Service that may be used complementarily with MapReduce.
https://cloud.google.com/bigquery

Google BigQuery Dataset
A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, thus you need to create at least one dataset before loading data into BigQuery.
In other words, a Dataset is BigQuery’s equivalent of a MySQL “database”.
https://cloud.google.com/bigquery/docs/datasets-intro

Automatic Uploading Process
This is a process that has to be integrated into the systems from which your data should be exported. The process uploads the export as CSV or JSON files to Google Cloud Storage using the Google Cloud SDK.

Automatic Uploading Process Run
An Automatic Uploading Process Run is a specific execution (with its timestamp) of an Automatic Uploading Process.

Google BigQuery Stage Dataset
This is a standard BigQuery Dataset (typically named [account]_stage, with [account] being your account name) which contains the automatically generated tables from every Automatic Data Processing. Typically there are as many tables as there are files in your Google Storage for one run of your Automatic Uploading Process, and each table name contains the date or datetime of the run (compliant with the datetime indicated in your file names).

Full data
The full data includes all information available at and consistent to a chosen point in time (typically the moment of the extraction of the data from its source system). Also known as “full snapshot”. One of the two basic kinds of data provided by the Automatic Upload Process.

Transaction data
By Transaction data we mean data connecting customers to products, like the purchase history. While the purchase history is the main (and typically most important) use case, other Transaction data could be also considered (Sales leads, support requests, etc.).

These data are typically contained in two CSV files:

  • one with one line per order with the required internal order and customer identifiers as well as an external order identifier (which should be used to map the data to pre-existing order data in the Boxalino Data Science Eco-system and which might be the same as the internal order identifier) and descriptive columns about the order (date, status, total value, etc.)

  • one with one line per product per order with the required product identifier as well as descriptive columns about the ordered product (quantity, price before and after discounts, etc.).

For details look at the Annex: WHAT DATA DO WE NEED?

Customer data
By Customer data we mean data describing a customer in the system exporting the data (depending on the system, a customer can be defined as a unique physical person, an account or by another logic). These data are typically contained in one CSV file, with one line per customer containing the required customer id as referred to in the transaction data, one column for the external identifier (which should be used to map the data to pre-existing customer data in the Boxalino Data Science Eco-system and might be the same as the internal customer identifier), as well as descriptive columns about the customer (e.g.: gender, zip code, date of birth, etc.). Typically, it is recommended to export only anonymised customer data unless specifically decided otherwise.

For details look at the Annex: WHAT DATA DO WE NEED?

Partial daily data
This is one of the two possible contents of an Automatic Uploading Process (Full historical or Partial daily). The Partial daily data contains only the data which has (or might have) changed since the last Partial daily data. If the data is not easily convertible to a delta (for example, customer data is sometimes more difficult to export as a delta than transaction data), it is typically possible to export the full data for some of the files (e.g.: the customer data is exported in full, but the transaction data is only exported for the last month).

Operational Data Store
An operational data store (or "ODS") is used for operational reporting and as a source of data for the Enterprise Data Warehouse (EDW). It is a complementary element to an EDW in a decision support landscape, and is used for operational reporting, controls and decision making, as opposed to the EDW, which is used for tactical and strategic decision support. An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.

Google BigQuery Core Dataset

This is a standard BigQuery Dataset (typically named [account]_core, with [account] being your account name) which contains a set of manually defined tables holding the consolidated data from all the Automatic Uploading Process runs that went through the Automatic Data Processing.

Automatic Data Processing
An Automatic Data Processing loads the files generated by the last Automatic Uploading Process Run into the Stage Dataset (direct mapping: the files are loaded as they are into tables whose names are derived from the file names) and then updates the Core Dataset on the basis of these generated stage tables.

Automatic Data Processing Run
An Automatic Data Processing Run is a specific execution (with its timestamp) of an Automatic Data Processing. It should typically be scheduled to run shortly after the Automatic Uploading Process Run. Alternatively, it could be triggered by the end of the Automatic Uploading Process Run instead of being scheduled independently, but this requires an additional step in the set-up of the Automatic Uploading Process.

Google Cloud Virtual Machine Instance (VM Instance)
A VM Instance is a virtual machine hosted on Google Compute Engine. In this document it is the machine that runs the Automatic Data Processing, loading the exported files into the Stage Dataset and updating the Core Dataset.

https://cloud.google.com/compute/docs/instances/

Boxalino Data Science Eco-System

The Boxalino Data Science Eco-System creates a highly standardized “nucleus” of a Data Warehouse in Google BigQuery and connects the results of the data science analytics process (report and lab) so that they are automatically uploaded to the Boxalino Real-Time platform. The process integrates the base standard data from the e-shops (including product data, behavioral data and sometimes also customer and transaction data) in addition to any additional data source set up through the process described in this document.

Please note that the words “stage” and “core” within the Boxalino Data Science Eco-System refer to Boxalino-internal stage and core datasets, not to the ones described in this document, which are specific to the process described here.

ANNEX #1: WHAT DATA DO WE NEED?

Summary

The data to be exported are typically of two types: Transaction and Customer data.

Transaction data
By Transaction data we mean data connecting customers to products, like the purchase history. While the purchase history is the main (and typically most important) use case, other Transaction data could be also considered (Sales leads, support requests, etc.).

These data are typically contained in two CSV files (make sure the fields are separated by commas, not semicolons):

  • one with one line per order with the required internal order and customer identifiers as well as an external order identifier (which should be used to map the data to pre-existing order data in the Boxalino Data Science Eco-system and which might be the same as the internal order identifier) and descriptive columns about the order (date, status, total value, etc.)

  • one with one line per product per order with the required product identifier as well as descriptive columns about the ordered product (quantity, price before and after discounts, etc.).

The key information needed in Transaction Data are:

  1. Internal ID (unique per order)

  2. External ID (can be the same as Internal ID) matching the order ID of the E-shop system (to detect duplicate orders present in both systems)

  3. External Product ID matching the product ID or a product attribute value of the E-shop (please indicate what attribute)

  4. Internal Customer ID (matching the internal id of the customer data below)

  5. External Customer ID matching the customer ID or a customer attribute value of the E-shop (please indicate what attribute)

  6. A list of Order Attributes with their values, such as:

    1. Order date

    2. Order status

    3. Full order price

    4. Quantity and price before and after discounts for each purchased product

    5. System (Store-id, online store, etc.)

    6. … (key relevant order attributes should be added)

Customer data
By Customer data we mean data describing a customer in the system exporting the data (depending on the system, a customer can be defined as a unique physical person, an account or by another logic). These data are typically contained in one CSV file, with one line per customer containing the required customer id as referred to in the transaction data, one column for the external identifier (which should be used to map the data to pre-existing customer data in the Boxalino Data Science Eco-system and might be the same as the internal customer identifier), as well as descriptive columns about the customer (e.g.: gender, zip code, date of birth, etc.). Typically, it is recommended to export only anonymised customer data unless specifically decided otherwise.

The key information needed in Customer Data are:

  1. Internal ID (unique per customer)

  2. External ID (can be the same as Internal ID) matching the customer ID or a customer attribute value of the E-shop (please indicate what attribute)
    A hashed customer ID or hashed e-mail can be used

  3. A list of Customer Attributes with their values, such as:

    1. Gender

    2. Date of Birth

    3. Zip Code

    4. Loyalty points information

    5. Newsletter subscriptions

    6. … (key relevant customer attributes should be added)

 

ANNEX #2: DATA QUESTIONNAIRE

Summary

Transaction data
How many files are there (please provide their names and a short description)?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

If you have only 1 file, is there one row per order and per product? If not, what is the logic of the rows?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

What are the column names of each file (please provide their names, formats, and a short description)?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which file contain the internal and external order ids?

By Internal id, we mean the unique id of the order of the system providing the data.

By external id, we mean the id of the e-shop data (e.g.: the Magento order id).

If the external id does not match directly the id of the e-shop data, please indicate the name of the attribute of the e-shop data which contains this value.

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

Same question for the internal and external customer ids?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

Same question for the internal and external product ids?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the order date and what is the exact format?

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the order status and what are the possible values and their meaning?

___________________________________________________________________________________________

___________________________________________________________________________________________

What condition makes an order or a specific product of an order successfully bought?

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the billing and delivery address (including zip code)?

___________________________________________________________________________________________

___________________________________________________________________________________________

If any, which columns of which files contain customer data (except the Internal / External ids)?

___________________________________________________________________________________________

___________________________________________________________________________________________

If any, which columns of which files contain product data (except the Internal / External ids)?

___________________________________________________________________________________________

___________________________________________________________________________________________

Customer data
How many files are there (please provide their names and a short description)?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

What are the column names of each file (please provide their names, formats, and a short description)?

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which file contain the internal and external customer ids?

By internal id, we mean the unique id of the customer in the system providing the data.

By external id, we mean the id of the e-shop data (e.g.: the Magento customer id).

If the external id does not match directly the id of the e-shop data, please indicate the name of the attribute of the e-shop data which contains this value.

___________________________________________________________________________________________

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the birth date and what is the exact format?

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the gender and what are the possible values and their meaning?

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the e-mail, and can there be more than one per internal id?

___________________________________________________________________________________________

___________________________________________________________________________________________

Which columns of which files contain the billing and delivery address (including zip code)?

___________________________________________________________________________________________

___________________________________________________________________________________________