Integrate your Data in Boxalino BigQuery Data Science Eco-System
- 1 DATA INTEGRATION STEPS - OVERVIEW
- 1.1 Introduction
- 1.2 4 Key Steps
- 1.3 System overview
- 2 STEP 1: SET-UP YOUR GOOGLE ENVIRONMENT
- 3 STEP 2: EXPORT YOUR DATA TO GOOGLE CLOUD STORAGE
- 4 STEP 3: SET UP YOUR DATA PROCESSING
- 5 STEP 4: INTEGRATE YOUR CORE DATA WITH BOXALINO
- 5.1 Summary
- 6 EXECUTION PLAN
- 7 GLOSSARY
- 8 ANNEX #1: WHAT DATA DO WE NEED?
- 8.1 Summary
- 9 ANNEX #2: DATA QUESTIONNAIRE
- 9.1 Summary
DATA INTEGRATION STEPS - OVERVIEW
Introduction
Boxalino provides a complete Data Science Eco-system built upon Google BigQuery and fully integrated with Boxalino Real-Time Platform.
In this section, we present the 4 key steps of a Data Integration project, which are then explained in detail in the following sections of the document.
4 Key Steps
To integrate new data, the key steps are:
Step 1: Set up your Google Cloud Environment
You will need an Organization, a Project, a Billing Account, a Service Account, a Storage Bucket and BigQuery Datasets (Stage and Core Datasets).
Step 2: Export your data to Google Cloud Storage
Set up an Automatic Uploading Process for your Full historical data and your Partial daily data as CSV or JSON files in Google Cloud Storage.
Step 3: Set up your Data Processing
Create your Core Dataset Tables as an Operational Data Store and set up an Automatic Data Processing that loads your files into your Stage Dataset and updates your Core Dataset.
Watch the tutorial video: https://youtu.be/yYz_HmeX4T8
Step 4: Integrate your Core Data with Boxalino
Boxalino integrates the data from your Core Dataset into the Boxalino Data Science Eco-System.
For an optimal integration, your data will go through 3 phases: Google Storage as files, BigQuery stage as direct representation of your files and BigQuery core as consolidated data. The process is typically executed once per day, but can be adjusted to be executed more frequently.
System overview
The Operational System with Data (the system which manages the data to be exported) typically exports the data as files to a Google Cloud Storage bucket using the Google Cloud SDK:
https://cloud.google.com/appengine/docs/standard/go/download
As a variant, another machine (e.g. a VM hosted in the same secured location as the Operational System) can be used to send the files to Google Cloud Storage with the Google Cloud SDK.
This variant is typically not required and is not recommended: the Google Cloud SDK can be installed safely on any machine, and sending the files directly from the Operational System keeps temporary copies of the data to a minimum, which ensures optimal security.
STEP 1: SET-UP YOUR GOOGLE ENVIRONMENT
Summary
In this step, you complete all the required preparation of your Google Environment.
Even if you do not have a Google Cloud account to start with, this can typically be done in about 2 hours.
You will need the required credentials (including payment details) to complete the set-up.
The key parts of this step are:
Create a Google Cloud Organization
Create a Google Cloud Project
Create a Google Billing Account
Create a Google Cloud Storage Bucket
Create a Google Cloud Service Account
Create STAGE and CORE BigQuery Datasets
Create a Google Group for Users
Create A Google Cloud Organization
This is the resource representing your company in Google Cloud. An Organization resource is available for Google G Suite and Cloud Identity customers.
You can create a Google Cloud Organization in two ways: Cloud Identity or GSuite.
The information to decide which one is the right one for you and how to do it are described here:
https://cloud.google.com/resource-manager/docs/creating-managing-organization
Then you can view and manage billing accounts and projects under your organization here:
https://cloud.google.com/resource-manager/docs/organization-resource-management
Create A Google Cloud Project
Google Cloud Platform (GCP) projects form the basis for creating, enabling, and using all GCP services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for GCP resources. This is the main container for all of our project activities, both related to the uploading of files and their processing in BigQuery.
Please follow the instructions here:
https://cloud.google.com/resource-manager/docs/creating-managing-projects
Create A Google Billing Account
https://cloud.google.com/billing/docs/how-to/manage-billing-account
Create A Google Cloud Storage Bucket
Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities.
https://cloud.google.com/storage/
Google Cloud Storage Bucket is the “folder” which will contain all your exported files.
Please make sure to use the region europe-west1.
https://cloud.google.com/storage/docs/creating-buckets
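As an alternative to the Cloud Console, the bucket can also be created from the command line with gsutil (a minimal sketch; the bucket name my-bucket is a placeholder for your own):
# create a regional bucket in europe-west1
gsutil mb -l europe-west1 gs://my-bucket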
Create A Google Service Account
This is the account to be used for the automated processes in Steps 2 and 4. The Service account is the account the machine will use to perform the data loading and transformation tasks. The service account must have access to your Storage Bucket and your BigQuery Project.
https://cloud.google.com/iam/docs/service-accounts
Make sure to give this service account access rights to the Google Cloud Storage Bucket, and provide the authentication information (the service account key) to the people in charge of the set-up in Step 2.
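For reference, here is a sketch of how the service account could be created and granted access from the command line (the names etl-loader, my-project and my-bucket are placeholders; adjust the roles to your own security policy):
# create the service account
gcloud iam service-accounts create etl-loader --project=my-project
# allow it to create tables and run load/query jobs in BigQuery
gcloud projects add-iam-policy-binding my-project --member=serviceAccount:etl-loader@my-project.iam.gserviceaccount.com --role=roles/bigquery.dataEditor
gcloud projects add-iam-policy-binding my-project --member=serviceAccount:etl-loader@my-project.iam.gserviceaccount.com --role=roles/bigquery.jobUser
# allow it to read and write objects in the storage bucket
gsutil iam ch serviceAccount:etl-loader@my-project.iam.gserviceaccount.com:roles/storage.objectAdmin gs://my-bucket
# download a key file to be used for authentication in Step 2
gcloud iam service-accounts keys create key.json --iam-account=etl-loader@my-project.iam.gserviceaccount.com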
Create STAGE and CORE BigQuery Datasets
Please use the location EU as below.
Then please share your datasets with Boxalino by adding the following accounts as members, with both the Viewer and BigQuery Data Viewer roles, as per the screenshot below:
Google BigQuery
BigQuery is a RESTful web service that enables interactive analysis of massive datasets working in conjunction with Google Storage. It is a serverless Software-as-a-Service that may be used complementarily with MapReduce.
https://cloud.google.com/bigquery
Google BigQuery Dataset
A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, so you need to create at least one dataset before loading data into BigQuery.
In other words, a dataset is the BigQuery equivalent of a MySQL “database”.
Here is a general introduction into BigQuery Datasets.
https://cloud.google.com/bigquery/docs/datasets-intro
Create two datasets called [account]_stage and [account]_core, where [account] is your account name. The account name is typically derived from the name of the system providing the data, possibly extended with a specific account identifier in case several different accounts are exported from the same system in parallel:
https://cloud.google.com/bigquery/docs/datasets
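The two datasets can be created in the BigQuery web UI or with the bq command line tool (a sketch assuming the account name boxalino and the project my-project as placeholders):
bq mk --location=EU --dataset my-project:boxalino_stage
bq mk --location=EU --dataset my-project:boxalino_core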
Create a Google Group
Create a Google Group and add all users that should have access to BigQuery (you can then manage who has access directly in the group in the future).
https://support.google.com/cloudidentity/answer/33343?hl=en
Assign the relevant access roles to the new user group; see the links below:
https://cloud.google.com/iam/docs/understanding-roles#bigquery-roles
https://cloud.google.com/bigquery/docs/dataset-access-controls
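As an illustration, a project-level BigQuery role can be granted to the group with gcloud (a sketch; the group address and project name are placeholders, and you may prefer to grant access per dataset instead, as per the second link above):
gcloud projects add-iam-policy-binding my-project --member=group:bigquery-users@example.com --role=roles/bigquery.user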
STEP 2: EXPORT YOUR DATA TO GOOGLE CLOUD STORAGE
Summary
Before proceeding further, make sure that your Google Environment has been prepared as per earlier mentioned instructions.
As a result, you should now have ready the following information:
Google Service Account (an e-mail address) and a certificate with a private key
Address of the Cloud Storage Bucket (e.g.: gs://my-bucket)
The key activities in this step are:
Install Google Cloud SDK
Authenticate with the Service Account
Set up an export process generating CSV or JSON files
Upload the generated files into a Google Cloud Storage Bucket
Automate the process
Install Google Cloud SDK
Install the Google Cloud SDK in the environment exporting your data as per the below instructions:
https://cloud.google.com/appengine/docs/standard/go/download
Authenticate with your Google Service Account
Authenticate with the Google Service Account created in Step 1 (including the certificate with the private key) using the SDK command gcloud auth as per the instructions here:
https://cloud.google.com/sdk/gcloud/reference/auth/activate-service-account
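A minimal sketch of the authentication call (the service account address and key file are placeholders matching the artifacts created in Step 1):
gcloud auth activate-service-account etl-loader@my-project.iam.gserviceaccount.com --key-file=key.json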
Set up an export process generating CSV or JSON files
The process is not described in detail as it is client-specific.
Its output should be a set of files containing the exported data. There can be two kinds of exports: the full data (typically customer masterdata) and the partial data (typically transaction data for the chosen period, such as transactions from the day before).
Transaction data
By transaction data we mean data connecting customers to products, such as the purchase history. While the purchase history is the main (and typically most important) use case, other transaction data could also be considered (sales leads, support requests, etc.).
This data typically comes structured in the following two ways (examples given for the CSV format)
either with one line per order with the required internal order and customer identifiers as well as an external order identifier (which should be used to map the data to pre-existing order data in the Boxalino Data Science Eco-system and which might be the same as the internal order identifier) and descriptive columns about the order (date, status, total value, etc.)
or with one line per product per order with the required product identifier as well as descriptive columns about the ordered product (quantity, price before and after discounts, etc.).
For details look at the Annex: WHAT DATA DO WE NEED?
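For illustration, the header row of an order-level file and of an order-line file could look as follows (a hypothetical sketch: the column names are examples only and should be adapted to your system as per the Annex):
order_id,external_order_id,customer_id,external_customer_id,order_date,order_status,order_amount,order_currency
order_id,product_id,external_product_id,quantity,price_before_discount,price_after_discount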
Customer data
By customer data we mean data describing a customer in the system exporting the data (depending on the system, a customer can be defined as a unique physical person, a company, an account or using another logic). This data is typically contained in one CSV file, with one line per customer containing the required customer id as referred to in the transaction data, one column for the external identifier (which should be used to map the data to pre-existing customer data in the Boxalino Data Science Eco-system and might be the same as the internal customer identifier) as well as descriptive columns about the customer (e.g. gender, zip code, date of birth, etc.). We recommend exporting anonymized customer data only, unless there is a specific compelling reason to do otherwise.
For details look at the Annex: WHAT DATA DO WE NEED?
Full data
The full data includes all information available at, and consistent with, a chosen point in time (typically the moment of the extraction of the data from its source system). It is also known as a “full snapshot” and is one of the two basic kinds of data provided by the Automatic Uploading Process.
Partial (aka DELTA) data
The partial data is the other basic kind of data provided by the Automatic Uploading Process. The partial daily data (aka DAILY DELTA) only contains the data that has changed over a given period (typically all changes from the day before), but weekly or monthly deltas can also be generated. If the data is not easily exported as a delta (customer data is often more difficult to export as a delta than transactions), it is possible to export the full data instead. The files can be provided either in CSV or JSON.
CSV files should have column names in the first row.
JSON files should use the Google-compliant format with one JSON record per row (and not a single JSON object with all the data in it).
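As an example of the expected newline-delimited JSON format, with one record per line (hypothetical fields):
{"customer_id": "1001", "zip_code": "8004", "gender": "f"}
{"customer_id": "1002", "zip_code": "3011", "gender": "m"}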
If possible (and to be compatible with the standard SQL and Shell-scripts samples provided in Step 4) the files should follow the following naming pattern:
The first part of the file name should be the date (+ time) to support wildcard (“*”) matching
The second part should be F (for full data) or D (for delta data)
The third part should indicate the contents of the export (customer, transactions, product, etc.), optionally followed by a numerical post-fix before the file extension (e.g. when the export is spread over several files)
e.g.: 20190822T071553_D_products.json
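A small shell sketch that builds such a file name (the export name products is a placeholder):
# e.g. 20190822T071553_D_products.json
FILENAME="$(date +%Y%m%dT%H%M%S)_D_products.json"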
Copy the files to Google Cloud Storage Bucket
Copy the files to the Google Cloud Storage Bucket defined in Step 1 using the command gsutil cp as described here:
https://cloud.google.com/storage/docs/gsutil/commands/cp
It is recommended to put the files into a folder structure with 3 levels (starting from the root):
Your bucket name
data (fixed name)
Type of contents (e.g.: “customers” or “orders”)
Optionally, the files can be compressed using gzip. An example copy command is shown below.
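A minimal sketch of the copy step for a customer export, assuming the bucket name my-bucket:
gsutil cp 20190822T071553_F_customers.csv gs://my-bucket/data/customers/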
Automate the process
Automatically run the export of the Partial daily data (typically once per day at a set hour during the night). Set up an easy facility to run (manually) the export of the full data.
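On a Linux host this is typically done with a cron entry; a sketch assuming a hypothetical export script export_and_upload.sh that generates the files and copies them with gsutil cp:
# run the daily delta export and upload at 02:30
30 2 * * * /opt/export/export_and_upload.sh >>/var/log/export_and_upload.log 2>&1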
STEP 3: SET UP YOUR DATA PROCESSING
Watch the tutorial video: https://youtu.be/yYz_HmeX4T8
Summary
This step describes the creation of Core Dataset Tables that act as an Operational Data Store and the set-up of an Automatic Data Processing within Google Cloud to load the last files provided by the Automatic Uploading Process Run.
There are many ways to set up such a process. We give an example of a very simple, standard copy-and-paste implementation using a Google VM Instance:
Google Cloud Virtual Machine Instance (VM Instance)
A VM Instance is a virtual machine hosted on Google Compute Engine. In this set-up, it hosts and runs the Automatic Data Processing that loads the files generated by the last Automatic Uploading Process Run into the STAGE Dataset and updates the CORE Dataset.
https://cloud.google.com/compute/docs/instances/
We suggest creating a process capable of handling both the full and the delta exports.
To do so, the following tasks should be completed:
Create a VM Instance
Set up your VM Instance with a Shell Script
Run a Setup Script with chosen parameters
Set up a Cron-job (for automated execution of the data processing)
Automatic Data Processing
An Automatic Data Processing loads the files generated by the last Automatic Uploading Process Run into the Stage Dataset (direct mapping: the files are loaded as they are into tables whose names are derived from the names of the files) and then updates the Core Dataset based on these generated stage tables.
Automatic Data Processing Run
An Automatic Data Processing Run is a specific execution (with its timestamp) of an Automatic Data Processing. It should typically be scheduled to run shortly after the Automatic Uploading Process Run. Alternatively (though not required), it could be triggered by the end of the Automatic Uploading Process Run instead of being scheduled independently, but this requires an additional step in the set-up of the Automatic Uploading Process.
Create a VM Instance
In this part, we describe how to create a VM instance in Google Cloud. It can be done in the Google Cloud Console under: Compute Engine -> VM Instances -> Create Instance
https://cloud.google.com/compute/docs/instances/create-start-instance
Here are some of the important parameters for the VM:
VM machine type: g1-small is usually sufficient
Location must be europe-west1
Make sure your Service Account has access to the Cloud Storage and BigQuery APIs and can connect to the VM
You can then simply start the VM Instance (it normally starts automatically after creation) and open an SSH session in a browser window to perform the next steps.
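For reference, an equivalent VM can be created from the command line (a sketch; the instance name, zone, project and service account address are placeholders):
gcloud compute instances create etl-vm \
  --project=my-project \
  --zone=europe-west1-b \
  --machine-type=g1-small \
  --service-account=etl-loader@my-project.iam.gserviceaccount.com \
  --scopes=bigquery,storage-rw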
Setup of your VM with a Shell Script
Copy the Setup Script from the following section into your home/user directory
Make the script files executable using command chmod u+x FILENAME
Execute the script with the correct parameters from inside your home/user directory. The command should follow this pattern: bash configure.sh [STAGE and CORE dataset prefix] [Your GCS bucket name] [Table name],[Column1],[Column2],[Column3],[Column4],[Column5]...
Here is an example of how it could look like: bash configure.sh boxalino boxalino-data transactions,order_id,order_date,order_status,order_amount,order_currency
If you want to create multiple tables, append their definitions to the end of the command and separate each table definition with a space. Without the separating space, everything is treated as a single table definition; an example with two tables is shown below.
Be careful: run the setup script only once. If you have already loaded data into one of your tables and run the script again, every table will be overwritten and you will lose your current data, so make sure you add all the necessary tables from the start.
Install the system package gettext-base in case it is missing on your system (using the yum, dnf or apt package manager, depending on your Linux distribution).
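As referenced above, a sketch of a call that defines two tables (table and column names are examples only; the first column of each definition is treated as the primary key):
bash configure.sh boxalino boxalino-data transactions,order_id,order_date,order_status,order_amount,order_currency customer,customer_id,gender,zip_code,date_of_birth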
Setup Script
This script has to be run only once (see point 5.6 below). It will generate several directories, two SQL files and the shell script for the regular automated data processing:
A file with DDL statements for the creation of the CORE tables, which needs to be executed manually once (create-core.sql)
A file with SQL code for the regular (daily) loads (stage2core.sql) that takes data from the STAGE, transforms it and stores it in the CORE area
A Bash shell script that activates and steers the data processing (process.sh, here also referred to as the “Automation Script”). This file can be called from the Linux cron daemon to guarantee a reliable schedule-based execution.
The full code follows.
#!/usr/bin/env bash
# good shell scripting practices
set -o errexit # exit when a command fails
set -o nounset # exit when an undeclared variable is used
set -o pipefail # return exit status of the last command that threw a non-zero exit code even when piped together
# set -o xtrace # enable debugging
# functions
usage () {
cat 1>&2 <<HELP_USAGE
Usage:
$0 CLIENT GCS_BUCKET EXPORT_DEFINITION...
CLIENT is the name that prefixes the STAGE and CORE datasets in BigQuery
GCS_BUCKET is the name of the storage bucket in Google Cloud Storage where
source systems deliver exported data. The script expects the source files
and their location to conform to the following pattern:
gs://GCS_BUCKET/data/<export_name>/YYYYMMDD*_<export_name>.(csv|json)
EXPORT_DEFINITION any number of definitions can be provided separated
by spaces
a single export definition has the form of EXPORT_NAME,COLUMN1,COLUMN2,...
the export name determines the expected filenames, names of STAGE tables
and the names of the tables in the CORE and for this reason ONLY
names containing the following character classes [a-zA-Z0-9_]
are allowed
it is assumed that the first COLUMN is the primary key used for
matching records between STAGE and CORE
NOTE that the order of the export definitions determines the order
of the loading / processing the data
Artifacts created:
logs
sql/${DDLSQL}
sql/${DMLSQL}
tmp
${ETLSCRIPT}
Example:
$0 [STAGE and CORE dataset prefix] [Your GCS bucket name] [Table name],[Column1], /
[Column2],[Column3],[Column4],[Column5]...
the above call will create a process to load files named
"YYYYMMDD*[Table name].(csv|json)" into BigQuery STAGE table
"[STAGE dataset prefix]_stage.YYYYMMDD*[Table name]" and during the CORE load
into "[CORE dataset prefix]_core.[Table name]"
HELP_USAGE
}
# global configuration
ROOTDIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
TOPDIRS="logs sql tmp"
DDLSQL="create-core.sql"
DMLSQL="stage2core.sql"
ETLSCRIPT="process.sh"
if [[ ${#} -lt 3 ]]; then
usage
exit 22
fi
# what is our client
CLIENT=${1}
CLIENT_BUCKET=${2}
CLIENT_STAGE=${CLIENT}_stage
CLIENT_CORE=${CLIENT}_core
shift 2
# create directories
for d in ${TOPDIRS}; do
newdir=${ROOTDIR}/${d}
mkdir -p ${newdir}
echo "created directory ${newdir}"
done; echo
# initiate artifacts
echo -e "-- autogenerated DDL for creation of core tables\n" >${ROOTDIR}/sql/${DDLSQL}
echo -e "-- autogenerated SQL for stage2core load\n" >${ROOTDIR}/sql/${DMLSQL}
cat >${ROOTDIR}/${ETLSCRIPT} <<EOF
#!/usr/bin/env bash
# this script was autogenerated on $(date +%Y-%m-%d\ %T)
set -o errexit # exit when a command fails
set -o nounset # exit when an undeclared variable is used
set -o pipefail # return exit status of the last command that threw a non-zero exit code even when piped together
# set -o xtrace # enable debugging
export ETLDATE=\$1
if [[ -z \${ETLDATE} || ! \$ETLDATE =~ ^20[1-2][0-9][0-1][0-9][0-3][0-9]$ ]]; then
echo "Missing argument ETLDATE or format mismatch (expected 20YYMMDD)"
exit 22
fi
DEFAULT_CSV_DELIMITER=,
EOF
for ed; do
# start processing export definitions provided by the user
exportname=${ed%%,*}
echo "Started processing export ${exportname}"
# parse input arguments
columnstring=${ed#$exportname,}
declare -a columns
columns=($(cut -d ',' -f 1- - --output-delimiter=' ' <<< "${columnstring}"))
# start generating DDL for the export
echo -e "CREATE OR REPLACE TABLE \`${CLIENT_CORE}.${exportname}\` (" >>${ROOTDIR}/sql/${DDLSQL}
for c in "${columns[@]}"; do
if [[ ${c} = "${columns[0]}" ]]; then
echo -e " ${c} STRING NOT NULL" >>${ROOTDIR}/sql/${DDLSQL}
else
echo -e " ,${c} STRING" >>${ROOTDIR}/sql/${DDLSQL}
fi
done;
echo -e " ,create_tm DATETIME NOT NULL,update_tm DATETIME NOT NULL,src_cd STRING" >>${ROOTDIR}/sql/${DDLSQL}
echo -e ") OPTIONS(description=\"autogenerated\");\n" >>${ROOTDIR}/sql/${DDLSQL}
echo "DDL for ${exportname} written"
# append commands to stage2core SQL for the export
echo -e "-- ${exportname}" >>${ROOTDIR}/sql/${DMLSQL}
echo -e "DELETE FROM \`${CLIENT_CORE}.${exportname}\`" >>${ROOTDIR}/sql/${DMLSQL}
echo -e "WHERE ${columns[0]} IN (SELECT CAST(${columns[0]} AS STRING) FROM \`${CLIENT_STAGE}.\${${exportname}_stage_table}\`);" >>${ROOTDIR}/sql/${DMLSQL}
echo -e "INSERT INTO \`${CLIENT_CORE}.${exportname}\` (${columnstring},create_tm,update_tm,src_cd)" >>${ROOTDIR}/sql/${DMLSQL}
echo -e "SELECT " >>${ROOTDIR}/sql/${DMLSQL}
for c in "${columns[@]}"; do
if [[ ${c} = "${columns[0]}" ]]; then
echo -e " CAST(${c} AS STRING) AS ${c}" >>${ROOTDIR}/sql/${DMLSQL}
else
echo -e " ,CAST(${c} AS STRING) AS ${c}" >>${ROOTDIR}/sql/${DMLSQL}
fi
done;
echo -e " ,current_datetime AS create_tm, current_datetime AS update_tm, '${exportname}' AS src_cd" >>${ROOTDIR}/sql/${DMLSQL}
echo -e "FROM \`${CLIENT_STAGE}.\${${exportname}_stage_table}\`;\n" >>${ROOTDIR}/sql/${DMLSQL}
echo "SQL for ${exportname} written"
# continue generating ETLSCRIPT contents
echo -e "# set up variables for ${exportname} processing" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "${exportname}_file=\$(gsutil ls gs://${CLIENT_BUCKET}/data/${exportname}/\${ETLDATE}*${exportname}.* | grep -iE 'csv|json' | tail -n 1)" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "${exportname}_file_type=\${${exportname}_file##*.}" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "export ${exportname}_stage_table=\$(basename \${${exportname}_file%.*})" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "export ${exportname}_core_table=${exportname}" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "# load ${exportname} data from storage into BigQuery STAGE dataset" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "if [[ \$${exportname}_file_type = CSV || \$${exportname}_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=\${DEFAULT_CSV_DELIMITER} --autodetect --replace ${CLIENT_STAGE}.\${${exportname}_stage_table} \$${exportname}_file; fi" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "if [[ \$${exportname}_file_type = JSON || \$${exportname}_file_type = json ]]; then bq load --source_format NEWLINE_DELIMITED_JSON --autodetect --replace ${CLIENT_STAGE}.\${${exportname}_stage_table} \$${exportname}_file; fi\n\n" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "Finished processing export ${exportname}\n"
done
# write the remaining portion of ETLSCRIPT contents
echo -e "# STAGE to CORE queries from ${DMLSQL} - template and execute in synchronous fashion" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "live_sql_file=\$(date +%Y%m%dT%H%M%S)_${DMLSQL}" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "envsubst <sql/${DMLSQL} >tmp/\${live_sql_file}" >>${ROOTDIR}/${ETLSCRIPT}
echo -e "bq query --nouse_legacy_sql <tmp/\${live_sql_file}" >>${ROOTDIR}/${ETLSCRIPT}
# make generated processing script executable
chmod ug+x ${ROOTDIR}/${ETLSCRIPT}
# print cron instructions
echo -e "To enable automatic daily execution type the following command"
echo -e "\tcrontab -e"
echo -e "and add the following line"
echo -e "\t0 H * * * ${ROOTDIR}/${ETLSCRIPT} \$(date --date '1 day ago' +%Y%m%d) >${ROOTDIR}/logs/\$(date +%Y%m%dT%H%M%S)_${ETLSCRIPT%%.*}.log 2>&1"
echo -e "where 'H' stands for the HOUR of automatic execution (we recommend running very early in the morning)\n"
echo -e "ALL DONE"
exit 0
Automation Script
Here we include a sample of the shell script referred to in the prior section, which is executed by cron. This shell script is generated during the configuration and its contents depend on the parameters (name of the client, the exports and their structure) entered by the user.
#!/usr/bin/env bash
# this script was autogenerated on 2019-11-11 11:06:57
set -o errexit # exit when a command fails
set -o nounset # exit when an undeclared variable is used
set -o pipefail # return exit status of the last command that threw a non-zero exit code even when piped together
# set -o xtrace # enable debugging
export ETLDATE=$1
if [[ -z ${ETLDATE} || ! $ETLDATE =~ ^20[1-2][0-9][0-1][0-9][0-3][0-9]$ ]]; then
echo "Missing argument ETLDATE or format mismatch (expected 20YYMMDD)"
exit 22
fi
DEFAULT_CSV_DELIMITER=,
# set up variables for customer processing
customer_file=$(gsutil ls gs://integration-bucket/data/customer/${ETLDATE}*customer.* | grep -iE 'csv|json' | tail -n 1)
customer_file_type=${customer_file##*.}
export customer_stage_table=$(basename ${customer_file%.*})
export customer_core_table=customer
# load customer data from storage into BigQuery STAGE dataset
if [[ $customer_file_type = CSV || $customer_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=${DEFAULT_CSV_DELIMITER} --autodetect --replace kevinllado_stage.${customer_stage_table} $customer_file; fi
if [[ $customer_file_type = JSON || $customer_file_type = json ]]; then bq load --source_format NEWLINE_DELIMITED_JSON --autodetect --replace kevinllado_stage.${customer_stage_table} $customer_file; fi
# STAGE to CORE queries from stage2core.sql - template and execute in synchronous fashion
live_sql_file=$(date +%Y%m%dT%H%M%S)_stage2core.sql
envsubst <sql/stage2core.sql >tmp/${live_sql_file}
bq query --nouse_legacy_sql <tmp/${live_sql_file}
Create the Core tables
After you have executed the Setup Shell-Script, submit the SQL file with the code for the table creation (“create-core.sql“ under the directory “sql”) using the BigQuery command line utility bq. Make sure the CORE dataset already exists in BigQuery!
Type the following command in your $USER_HOME directory to create all necessary CORE tables:
bq query --nouse_legacy_sql <sql/create-core.sql
Alternatively you can open the SQL file with CREATE statements using a linux text editor of your choice and copy / paste / execute the SQL statements one after another in your BigQuery Admin Console, here: https://console.cloud.google.com/bigquery
CREATE statements for table creation are documented at:
https://cloud.google.com/bigquery/docs/tables
Before executing the generated CORE tables creation script you are welcome to review the code and validate the definitions. You may decide to change the column names, options or data types of the CORE tables. We particularly recommend the use of partitioned and clustered tables in the CORE so as to reduce BigQuery costs when querying the data.
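As an illustration, a manually adjusted CORE table using partitioning and clustering could look as follows (a sketch only, assuming the transactions export from the earlier example and that order_date is changed to a DATE column; adapt names and types to your own data):
CREATE OR REPLACE TABLE `boxalino_core.transactions` (
  order_id STRING NOT NULL
  ,order_date DATE
  ,order_status STRING
  ,order_amount NUMERIC
  ,order_currency STRING
  ,create_tm DATETIME NOT NULL
  ,update_tm DATETIME NOT NULL
  ,src_cd STRING
)
PARTITION BY order_date
CLUSTER BY order_id
OPTIONS(description="autogenerated, manually adjusted");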
Review the STAGE to CORE load
Google BigQuery CORE Dataset is a standard BigQuery Dataset (typically named [account]_core with [account] as in your account name) which will contain a set of tables with the consolidated data from all Automatic Uploading Process runs that went through the Automatic Data Processing. Typically there will be as many tables as there are files in your Google Storage for one complete run of your Automatic Uploading Process.
The Setup Script generates both commands (in “process.sh“) and SQL statements to perform the load of the raw files into STAGE and then from there into the CORE. The file with the SQL code for the STAGE to CORE load will be generated as “stage2core.sql“ under the directory “sql”. This file will work out-of-the-box, however, it may need to be modified manually for the following reasons:
The automation script assumes that the persistent identifier is in the first column. This means that, before inserting new data into the core, it will delete previously existing data with the same identifier value. If the identifier is not in the first column (or requires a more complex logic to be fulfilled) the SQL code must be modified accordingly.
You can change the format of the fields wherever necessary (by default, your core tables will be created with all columns being of String type). The following data types are available in BigQuery: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
Please set the desired type using the CAST function. The generated SQL commands already include it: CAST(<column_name> AS STRING). Both the DELETE and the INSERT statements need to be adjusted consistently if you change the type of the first column of the table, which we consider that table’s Primary Key by convention.
You may also choose to transform the flat data coming from a CSV using complex or nested types such as ARRAYs and STRUCTs, or combine information coming from several CSV files into one table (e.g. several different files with different types of customer properties, with each type becoming an ARRAY or a STRUCT in the final desired CORE table). For more information see:
https://cloud.google.com/bigquery/docs/nested-repeated
If you changed the data type of any column in any of the CORE tables in the prior section, you need to adjust the corresponding CAST(column_name AS STRING) to match the right format. The same applies to advanced transformations that map flat data into complex / nested data types in the CORE. A sketch of such an adjustment follows below.
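A minimal sketch of the corresponding manual adjustment in stage2core.sql, assuming the partitioned transactions table from the previous example (column names are placeholders; note that the DELETE and the INSERT are adjusted together):
DELETE FROM `boxalino_core.transactions`
WHERE order_id IN (SELECT CAST(order_id AS STRING) FROM `boxalino_stage.${transactions_stage_table}`);
INSERT INTO `boxalino_core.transactions` (order_id,order_date,order_status,order_amount,order_currency,create_tm,update_tm,src_cd)
SELECT
  CAST(order_id AS STRING) AS order_id
  ,CAST(order_date AS DATE) AS order_date
  ,CAST(order_status AS STRING) AS order_status
  ,CAST(order_amount AS NUMERIC) AS order_amount
  ,CAST(order_currency AS STRING) AS order_currency
  ,current_datetime AS create_tm, current_datetime AS update_tm, 'transactions' AS src_cd
FROM `boxalino_stage.${transactions_stage_table}`;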
Setup the Cron-job
The final step is to set up the cron-job which will automate the daily execution of your script file. The corresponding instructions will be printed on the terminal after the execution of the Setup Shell-Script has finished.
You can test the Automation Script by typing this in the command line: bash process.sh [Date in the format YYYYMMDD].
For details, you can watch the simple how-to video:
https://www.youtube.com/watch?v=llUw3RtD-Yw
The Automation Script (“process.sh“) requires one parameter to be passed in from the cron-job: the ETLDATE in the format YYYYMMDD.
Typically, today’s load processes the data from yesterday. The correct ETLDATE value can thus be generated using the shell subcommand $(date --date '1 day ago' +%Y%m%d).
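A sketch of the resulting crontab entry for a daily run at 04:00 (paths are placeholders; note that the % character has to be escaped as \% inside a crontab):
0 4 * * * /home/user/process.sh $(date --date '1 day ago' +\%Y\%m\%d) >/home/user/logs/process.log 2>&1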
Possible errors and how to solve them
Here are the most important errors which can appear during the execution of the Automation Script.
Wrong datatype
You may see an error telling you that specific data could not be inserted because it does not match the expected datatype; a typical example is a postal code containing letters that is loaded into an INTEGER column. This happens because the autodetect function of BigQuery infers the datatype on its own: when postal codes mostly consist of numbers, it picks INTEGER and then fails on the values which contain letters. A solution is to create your own schema manually. To do that, create a JSON file containing the schema for the table which failed. The schema should have this pattern:
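A minimal sketch of such a schema file (the column names are hypothetical and must match the columns of your own export):
[
  {"name": "customer_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "zip_code", "type": "STRING", "mode": "NULLABLE"},
  {"name": "gender", "type": "STRING", "mode": "NULLABLE"}
]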
After you have created the schema file, save it and open the process.sh script. Search for the bq load line of the table you want to pass the schema to; it should look similar to the line shown below:
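For reference, here is the corresponding line from the generated sample earlier in this document (using the customer export and the kevinllado_stage dataset of that sample):
if [[ $customer_file_type = CSV || $customer_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=${DEFAULT_CSV_DELIMITER} --autodetect --replace kevinllado_stage.${customer_stage_table} $customer_file; fi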
Remove the --autodetect flag and append the path to your JSON schema file at the end of this line (after the source file). This will fix the problem.
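The adjusted line would then look like this (the path to the schema file is a placeholder):
if [[ $customer_file_type = CSV || $customer_file_type = csv ]]; then bq load --source_format CSV --skip_leading_rows 1 --field_delimiter=${DEFAULT_CSV_DELIMITER} --replace kevinllado_stage.${customer_stage_table} $customer_file /home/user/customer_schema.json; fi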
Wrong separator
Another problem you might encounter is that your data is not recognised properly and is inserted incorrectly into the BigQuery tables. The cause is usually the separator of your CSV files. The Automation Script uses a comma as the standard separator; if your CSV files use semicolons, you have to change the script, otherwise your data will not be loaded correctly into the BigQuery tables. To change this, open the process.sh script. Near the top you should see the following line: DEFAULT_CSV_DELIMITER=,
The only thing you have to do is change it to: DEFAULT_CSV_DELIMITER=\;
STEP 4: INTEGRATE YOUR CORE DATA WITH BOXALINO
Summary
Boxalino will integrate your data in the Boxalino Data Science Eco-System based on its representation in your CORE dataset.
This process offers a degree of customization which can further be discussed and explained during a workshop.
More detailed information is available on demand.
As preparation, the questionnaire in Annex #2 should be filled in.
EXECUTION PLAN
Example Execution Plan
It is recommended to structure the execution of the 4 steps according to the following plan:
Phase 1 (typically 1-2 weeks):
Workshop to define (typically 2 hours with the Client, the Client Provider and Boxalino):
Content and format of the exported files
Key aspects of the BigQuery Core Dataset tables
Implementation:
Step 1 – Typically by the IT of the Client
Step 2 – Typically by the Provider with the data to export
Phase 2 (typically 1-2 weeks):
Workshop to define (typically 2 hours with the Client and Boxalino):
Full aspects of the BigQuery Core Dataset tables
Key aspects of the integration
Implementation:
Step 3 – Typically by the IT of the Client (with the support of Boxalino)
Step 4 – Typically by Boxalino (typically 1 person-day of work)
GLOSSARY
Google Cloud Organization
This is the resource representing your company in Google Cloud. An Organization resource is available for Google G Suite and Cloud Identity customers.
https://cloud.google.com/resource-manager/docs/creating-managing-organization
Google Cloud Project
Google Cloud Platform (GCP) projects form the basis for creating, enabling, and using all GCP services including managing APIs, enabling billing, adding and removing collaborators, and managing permissions for GCP resources. This is the main container for all project activities, including the upload of files and their processing in BigQuery.
https://cloud.google.com/resource-manager/docs/creating-managing-projects
Google Cloud Service Account
This is the account to be used for process automation in Steps 2 and 4. The Service account is a technical account meant for a machine / process to perform the data loading and transformation tasks. The service account must have access to your Storage Bucket and your BigQuery Project.
https://cloud.google.com/iam/docs/service-accounts
Google Cloud Storage
Google Cloud Storage is a RESTful online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google's cloud with advanced security and sharing capabilities.
https://cloud.google.com/storage/
Google Cloud Storage Bucket
This is the “folder” which will contain all your exported files.
Please make sure to use the region europe-west1.
https://cloud.google.com/storage/docs/creating-buckets
Google BigQuery
BigQuery is a RESTful web service that enables interactive analysis of massive datasets working in conjunction with Google Storage. It is a serverless Software as a Service / Datawarehouse as a Service that may be used complementarily with MapReduce.
https://cloud.google.com/bigquery
Google BigQuery Dataset
A dataset is contained within a specific project. Datasets are top-level containers that are used to organize and control access to your tables and views. A table or view must belong to a dataset, thus you need to create at least one dataset before loading data into BigQuery.
In other words, a Dataset is the BigQuery equivalent of a MySQL “database”.
https://cloud.google.com/bigquery/docs/datasets-intro
Automatic Uploading Process
This is a process that has to be integrated into the systems from which your data should be exported. The process uploads the export as CSV or JSON files to Google Cloud Storage using the Google Cloud SDK.
Automatic Uploading Process Run
An Automatic Uploading Process Run is a specific execution (with its timestamp) of an Automatic Uploading Process.
Google BigQuery Stage Dataset
This is a standard BigQuery Dataset (typically named [account]_stage with [account] as your account name) which will contain the automatically generated tables from every Automatic Data Processing run. Typically there are as many tables as there are files in your Google Storage for one run of your Automatic Uploading Process, and each table name contains the date or datetime of the run (compliant with the datetime indicated in your file names).
Full data
The full data includes all information available at, and consistent with, a chosen point in time (typically the moment of the extraction of the data from its source system). It is also known as a “full snapshot” and is one of the two basic kinds of data provided by the Automatic Uploading Process.
Transaction data
By Transaction data we mean data connecting customers to products, like the purchase history. While the purchase history is the main (and typically most important) use case, other Transaction data could be also considered (Sales leads, support requests, etc.).
This data is typically contained in two CSV files:
one with one line per order with the required internal order and customer identifiers as well as an external order identifier (which should be used to map the data to pre-existing order data in the Boxalino Data Science Eco-system and which might be the same as the internal order identifier) and descriptive columns about the order (date, status, total value, etc.)
one with one line per product per order with the required product identifier as well as descriptive columns about the ordered product (quantity, price before and after discounts, etc.).
For details look at the Annex: WHAT DATA DO WE NEED?
Customer data
By Customer data we mean data describing a customer in the system exporting the data (depending on the system, a customer can be defined as a unique physical person, an account or another logic). This data is typically contained in one CSV file, with one line per customer containing the required customer id as referred to in the transaction data, one column for the external identifier (which should be used to map the data to pre-existing customer data in the Boxalino Data Science Eco-system and might be the same as the internal customer identifier) as well as descriptive columns about the customer (e.g.: gender, zip code, date of birth, etc.). Typically, it is recommended to export only anonymised customer data unless specifically decided otherwise.
For details look at the Annex: WHAT DATA DO WE NEED?
Partial daily data
This is one of the two possible contents of an Automatic Uploading Process (Full historical or Partial daily). The Partial daily data contains only the data which has (or might have) changed since the last Partial daily data. If the data is not easily convertible to a delta (for example, customer data is sometimes more difficult to export as a delta than transaction data), it is typically possible to export the full data for some of the files (e.g.: the customer data is exported in full, but the transaction data is only exported for the last month).
Operational Data Store
An operational data store (or "ODS") is used for operational reporting and as a source of data for the Enterprise Data Warehouse (EDW). It is a complementary element to an EDW in a decision support landscape, and is used for operational reporting, controls and decision making, as opposed to the EDW, which is used for tactical and strategic decision support. An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.
Google BigQuery Core Dataset
This is a standard BigQuery Dataset (typically named [account]_core with [account] as your account name) which contains a set of manually defined tables holding the consolidated data from all Automatic Uploading Process runs that went through the Automatic Data Processing. Typically there are as many tables as there are files in your Google Storage for one complete run of your Automatic Uploading Process.
Automatic Data Processing
An Automatic Data Processing loads the files generated by the last Automatic Uploading Process Run into the Stage Dataset (direct mapping: the files are loaded as they are into tables whose names are derived from the names of the files) and then updates the Core Dataset based on these generated stage tables.
Automatic Data Processing Run
An Automatic Data Processing Run is a specific execution (with its timestamp) of an Automatic Data Processing. It should typically be scheduled to run shortly after the Automatic Uploading Process Run. Alternatively (though not required), it could be triggered by the end of the Automatic Uploading Process Run instead of being scheduled independently, but this requires an additional step in the set-up of the Automatic Uploading Process.
Google Cloud Virtual Machine Instance (VM Instance)
A VM Instance is a virtual machine hosted on Google Compute Engine. In this set-up, it hosts and runs the Automatic Data Processing described above.
https://cloud.google.com/compute/docs/instances/
Boxalino Data Science Eco-System
The Boxalino Data Science Eco-System creates a highly standardized “nucleus” of a Data Warehouse in Google BigQuery and connects the results of the data science analytics process (report and lab) so they are automatically uploaded into the Boxalino Real-Time Platform. The process integrates the base standard data from the e-shops (including product data, behavioral data and sometimes also customer and transaction data) in addition to any additional data source set up through the process described in this document.
Please note that the words “stage” and “core” in this diagram refer to Boxalino’s internal stage and core datasets, and not to the datasets specific to the process described in this document.
ANNEX #1: WHAT DATA DO WE NEED?
Summary
The data to be exported are typically of two types: Transaction and Customer data.
Transaction data
By Transaction data we mean data connecting customers to products, such as the purchase history. While the purchase history is the main (and typically most important) use case, other Transaction data could also be considered (sales leads, support requests, etc.).
This data is typically contained in two CSV files (make sure the fields are separated by commas, not semicolons):
one with one line per order with the required internal order and customer identifiers as well as an external order identifier (which should be used to map the data to pre-existing order data in the Boxalino Data Science Eco-system and which might be the same as the internal order identifier) and descriptive columns about the order (date, status, total value, etc.)
one with one line per product per order with the required product identifier as well as descriptive columns about the ordered product (quantity, price before and after discounts, etc.).
The key information needed in Transaction Data are:
Internal ID (unique per order)
External ID (can be the same as Internal ID) matching the order ID of the E-shop system (to detect duplicate orders present in both systems)
External Product ID matching the product ID or a product attribute value of the E-shop (please indicate what attribute)
Internal Customer ID (matching the internal id of the customer data below)
External Customer ID matching the customer ID or a customer attribute value of the E-shop (please indicate what attribute)
A list of Order Attributes with their values, such as:
Order date
Order status
Order full order price
Quantity and price before and after discount of each purchased product
System (Store-id, online store, etc.)
… (key relevant order attributes should be added)
Customer data
By Customer data we mean data describing a customer in the system exporting the data (depending on the system, a customer can be defined as a unique physical person, an account or another logic). This data is typically contained in one CSV file, with one line per customer containing the required customer id as referred to in the transaction data, one column for the external identifier (which should be used to map the data to pre-existing customer data in the Boxalino Data Science Eco-system and might be the same as the internal customer identifier) as well as descriptive columns about the customer (e.g.: gender, zip code, date of birth, etc.). Typically, it is recommended to export only anonymised customer data unless specifically decided otherwise.
The key information needed in Customer Data are:
Internal ID (unique per customer)
External ID (can be the same as Internal ID) matching the customer ID or a customer attribute value of the E-shop (please indicate what attribute)
A hashed customer ID or hashed e-mail can be used
A list of Customer Attributes with their values, such as:
Gender
Date of Birth
Zip Code
Loyalty points information
Newsletter subscriptions
… (key relevant customer attributes should be added)
ANNEX #2: DATA QUESTIONNAIRE
Summary
Transaction data
How many files are there (please provide their names and a short description)?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
If you have only 1 file, is there one row per order and per product? If not, what is the logic of the rows?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
What are the column names of each file (please provide their names, format, and a short description)?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which file contain the internal and external order ids?
By Internal id, we mean the unique id of the order in the system providing the data.
By external id, we mean the id of the e-shop data (e.g.: the Magento order id).
If the external id does not match directly the id of the e-shop data, please indicate the name of the attribute of the e-shop data which contains this value.
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
Same question for the internal and external customer ids?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
Same question for the internal and external product ids?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the order date and what is the exact format?
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the order status and what are the possible values and their meaning?
___________________________________________________________________________________________
___________________________________________________________________________________________
What condition makes an order or a specific product of an order successfully bought?
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the billing and delivery address (including zip code)?
___________________________________________________________________________________________
___________________________________________________________________________________________
If any, which columns of which files contain customer data (except the Internal / External ids)?
___________________________________________________________________________________________
___________________________________________________________________________________________
If any, which columns of which files contain product data (except the Internal / External ids)?
___________________________________________________________________________________________
___________________________________________________________________________________________
Customer data
How many files are there (please provide their names and a short description)?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
What are the column names of each file (please provide their names, format, and a short description)?
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which file contain the internal and external customer ids?
By Internal id, we mean the unique id of the customer in the system providing the data.
By external id, we mean the id of the e-shop data (e.g.: the Magento customer id).
If the external id does not match directly the id of the e-shop data, please indicate the name of the attribute of the e-shop data which contains this value.
___________________________________________________________________________________________
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the birth date and what is the exact format?
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the gender and what are the possible values and their meaning?
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the e-mail, and can there be more than 1 per internal id?
___________________________________________________________________________________________
___________________________________________________________________________________________
Which columns of which files contain the billing and delivery address (including zip code)?
___________________________________________________________________________________________
___________________________________________________________________________________________