DI-SAAS (ELT Flow)

At Boxalino we are strong advocates of ELT flows (vs ETL). The major benefits are speed, transparency and maintenance: data is loaded directly into the destination system (BQ) and transformed in parallel (Dataform).

The ELT flow is recommended when you want to load your data as-is (no transformation) into BQ and then perform the transformation logic (SQL) in Dataform. The integration effort is minimal: push all data to GCS (Boxalino-owned or private, client-owned), load the data to BQ, and write the SQL in Dataform.

 

More information about the Integration Strategies is available on Confluence.
https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/690585601

 

Data Integration (DI) Generic Flow

The Generic Data Integration (DI) Flow expects the client themselves to execute the steps described in https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252149803 .

For certain platforms (Shopware6, Magento2), step #1 can be done with a Boxalino Data Integration plugin.

For generic clients, the JSONL generation is managed by the client's team by following the Boxalino Data Structure requirements https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252280881 (as documented).

The Boxalino Data Structure and integration flows/traits are publicly available as a PHP repository https://github.com/boxalino/data-integration-doc-php

Data Integration (DI) SAAS Flow

The di-saas service is designed to:

  1. be used for custom integrations by ingesting data (.csv, .jsonl) from SFTP, GCS, BQ or public links.

  2. integrate data from targeted platforms (ex: PlentyMarkets)

For both cases, a mapping is done between the Boxalino Data Structure https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/252280881 and the input sources (either files or API endpoints) with the help of BQ & Dataform.

In the case of a custom integration, the di-saas flow is composed of the following scopes:

  1. Data connector & Data load

    1. downloads the files from the remote source to GCS

    2. loads the raw content to the transform dataset (<account>_T)*

      1. *if the options -> load_dataset and load_gcp_project values are set, the loaded raw content will be available in load_gcp_project.load_dataset instead of rtux-data-integration.<account>_T (see the sketch after this list)

  2. Data transformation (with logic/SQL provided by the client)

    1. Steps #1-#4 from the Generic Flow

  3. DI automation
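A minimal sketch of where this override presumably sits in the request body (connector -> load -> options); the key names follow the note above, the values are placeholders:

# Hedged sketch: optional BQ-destination override for the raw load.
load_options = {
    "format": "NEWLINE_DELIMITED_JSON",
    "autodetect": True,
    "write_disposition": "WRITE_TRUNCATE",
    # when both keys below are set, the raw content is loaded into
    # <load_gcp_project>.<load_dataset> instead of rtux-data-integration.<account>_T
    "load_gcp_project": "my-private-gcp-project",
    "load_dataset": "my_account_T",
}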

DI-SAAS Request (Overview)

The DI (data integration) request has all the required information in a JSON body.

The JSON body has two top-level elements: connector (the connection and file-load options) and di (the data integration request parameters / default configurations).

[ { "connector": { "type": "sftp|gcs|plentymarket|plytix|boxalino", "options": { // specific for each connector type }, "load": { "options": { // for loading the data in BQ (BQ parameters) "format": "CSV|NEWLINE_DELIMITED_JSON", "field_delimiter": ";", "autodetect": true, "schema": "", "skipRows": 1(CSV)|0(JSONL), "max_bad_records": 0, "quote": "", "write_disposition": "WRITE_TRUNCATE", "create_disposition": "", "encoding": "", "allow_quoted_newlines" : 1, "allow_jagged_rows" : 0 }, "doc_X": { "property_node_from_doc_data_structure": [ { "source": "file___NODASHDATE__.jsonl", "name": "<used to create suffix for BQ table>", "schema": "" } ] } } }, "di": { "configuration": { "languages": [ "de", "fr" ], "currencies": [ "CHF" ], "mapping": { "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"} }, "default": { "language": "de", "currency": "CHF" } } } } ]

1. Data Connector & Data Load

Access to the data is managed by the client. These are a few guidelines on data format, naming and access:

  1. The relevant data sources are available in .csv or JSONL (preferred) format

  2. The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl)

    1. this helps automate the integration

  3. The files required to update a certain data type (ex: product, order, etc) are available in the same path

  4. The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access

    1. for GCS/GCP sources: access must be given, see https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Access-sharing-(BQ-and-GCS)

    2. for AWS/SFTP: the client's own AWS/SFTP environment with a Boxalino user & credentials

"connector": { "type": "gcs", "options": { "source": { "project": "<private-gcp-project>", "bucket": "<GCS-bucket-name>", "pattern": "path/to/data/__NODASHDATE__/" } }, "load": { "options": { "format": "NEWLINE_DELIMITED_JSON", "autodetect": true, "schema": "", "skipRows": 0, "max_bad_records": 0, "write_disposition": "WRITE_TRUNCATE" }, "doc_attribute": { "properties": [ { "source": "file1___NODASHDATE__.jsonl" } ], .. }, "doc_attribute_value": { "properties": [ { "source": "file2___NODASHDATE__.jsonl" } ], .. }, "doc_product": { "entity": [ { "source": "file3___NODASHDATE__.jsonl", "autodetect": 0, "schema": "SCHEMA FOR LOAD TO BQ" "name": "" } ] } }
"connector": { "type": "sftp", "options": { "hostname": "<server>", "port": 22, "username": "<user for Boxalino access>", "password": "<password for Boxalino access>", "source": { "pattern": "___NODASHDATE__", "files": [ ], "path": "<path on server>" } }, "load": { "options": { "format": "CSV", "field_delimiter": ";", "autodetect": true, "schema": "", "skipRows": 1, "max_bad_records": 0, "quote": "", "write_disposition": "WRITE_TRUNCATE", "create_disposition": "", "encoding": "" }, "doc_attribute": { "properties": [ { "source": "file1___NODASHDATE__.csv" } ], .. }, "doc_attribute_value": { "properties": [ { "source": "file2___NODASHDATE__.csv" } ], .. }, "doc_product": { "entity": [ { "source": "file3___NODASHDATE__.csv", "autodetect": 0, "schema": "field1:STRING,field2:INT64,field3:FLOAT64,field4:BOOL" "name": "used to create suffix for BQ table" } ] } },

 

 

2. Data Transformation

Once the data is loaded into GCS and BQ, it is time to transform it into the necessary data structure.

The transformation happens by preparing a BQ SQL statement for every node of the doc_X data structure (ex: title, stock, price, string_attributes, etc). The output of each SQL statement must follow the expected format/structure for that property.

Dataform

The transformation happens with the help of Google Dataform https://cloud.google.com/dataform .

This implies the following:

  1. The client has access to a GCP project

  2. The client will create a Dataform repository https://cloud.google.com/dataform/docs/repositories

  3. The client has access to a GitHub or GitLab repository (to connect it to the Dataform repository) https://cloud.google.com/dataform/docs/connect-repository

  4. The client has granted the “Dataform Admin” permission to the Boxalino Service Account boxalino-dataform@rtux-data-integration.iam.gserviceaccount.com

 

When using Dataform to transform the exported data (your custom CSV/JSONL files) into the Boxalino Data Structure, the JSON body (for the SYNC request) has the following information (next to connector and di):

The DI-SAAS SYNC request

The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that provides the mapping between the loaded .jsonl files and their data meaning.

REQUEST DEFINITION

As an integrator, please create the below request to the provided endpoint.

  1. Endpoint (production): https://boxalino-di-saas-krceabfwya-ew.a.run.app
  2. Action: /sync
  3. Method: POST
  4. Body: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)
  5. Headers:
    1. Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
       note: use the API credentials from your Boxalino account that have the ADMIN role assigned
       note: it is advisable to use different API credentials when integrating with 3rd party environments. Please generate your own by following the documentation https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/713785345
    2. Content-Type: application/json
    3. client: account name
    4. mode: data sync mode: F for full, D for delta, E for enrichments
    5. type: product, user, content, user_content, order; if left empty, it will check for all tables with the given tm
    6. tm: (optional) time, in format YmdHis; technical: used to identify the document's version
    7. ts: (optional) timestamp, must be a millisecond-based UNIX timestamp
    8. dev: (optional) use this to mark the content for the dev data index
    9. loadOnly: (optional) if true, the data will only be loaded to BQ
    10. transformOnly: (optional) if true, the data will only be transformed (ex: dataform tag execution)

This is a sample of a triggered full product sync (minimal data):
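A minimal sketch of such a request, assuming a GCS connector and illustrative file names; the doc_product nodes mirror the resources listed below, everything else is a placeholder rather than the official sample:

import base64
import requests

# /sync endpoint and headers as defined in the request definition above
ENDPOINT = "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync"
credentials = base64.b64encode(b"<DATASYNC API key>:<DATASYNC API Secret>").decode()

headers = {
    "Authorization": "Basic " + credentials,
    "Content-Type": "application/json",
    "client": "<account>",
    "mode": "F",              # full data sync
    "type": "product",
    "tm": "202303112000",     # document version, reused in the resource names below
}

# Illustrative body: maps the exported JSONL files to doc_product nodes
# (entity, product_relations, link); connector options are placeholders.
body = [
    {
        "connector": {
            "type": "gcs",
            "options": {
                "source": {
                    "project": "<private-gcp-project>",
                    "bucket": "<GCS-bucket-name>",
                    "pattern": "product/__NODASHDATE__/"
                }
            },
            "load": {
                "options": {
                    "format": "NEWLINE_DELIMITED_JSON",
                    "autodetect": True,
                    "write_disposition": "WRITE_TRUNCATE"
                },
                "doc_product": {
                    "entity": [{"source": "product___NODASHDATE__.jsonl"}],
                    "product_relations": [{"source": "crossell___NODASHDATE__.jsonl", "name": "crossell"}],
                    "link": [{"source": "urlrecord___NODASHDATE__.jsonl"}]
                }
            }
        },
        "di": {
            "configuration": {
                "languages": ["de", "fr"],
                "currencies": ["CHF"],
                "default": {"language": "de", "currency": "CHF"}
            }
        }
    }
]

response = requests.post(ENDPOINT, headers=headers, json=body)
print(response.status_code, response.text)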

The request above created the following resources:

  1. GCS (raw data, as migrated from the connector)

    1. gs://prod_rtux-data-integration_<account>/product/202303112000/F_product.jsonl

    2. gs://prod_rtux-data-integration_<account>/product/202303112000/F_crossell.jsonl

    3. gs://prod_rtux-data-integration_<account>/product/202303112000/F_urlrecord.jsonl

  2. BQ (the T dataset - raw data as loaded into BQ)

    1. rtux-data-integration.<account>_T.202303112000-doc_product-entity

    2. rtux-data-integration.<account>_T.202303112000-doc_product-product_relations-crossell

    3. rtux-data-integration.<account>_T.202303112000-doc_product-link

  3. GCS (transformed doc_X JSONL)

    1. gs://prod_rtux-data-integration_<account>/doc_product_F_202303112000.jsonl

    2. gs://prod_rtux-data-integration_<account>/doc_language_F_202303112000.jsonl

  4. BQ (the F dataset - data transformed to the doc_X data structure)

    1. rtux-data-integration.<account>_F.doc_product_F_202303112000

    2. rtux-data-integration.<account>_F.doc_language_F_202303112000

 

 

 

The DI-SAAS CORE request

To load your data into the <client>_core.<destination> table, there are 2 requests to be made:

  1. load data to GCS https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino)

  2. load data from GCS to <client>_core.<destination> table

REQUEST DEFINITION

As an integrator, please create the below request to the provided endpoint.

  1. Endpoint (production): https://boxalino-di-saas-krceabfwya-ew.a.run.app
  2. Action: /core
  3. Method: POST
  4. Body: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)
  5. Headers:
    1. Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
       note: use the API credentials from your Boxalino account that have the ADMIN role assigned
       note: it is advisable to use different API credentials when integrating with 3rd party environments. Please generate your own by following the documentation https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/713785345
    2. Content-Type: application/json
    3. client: account name
    4. mode: data sync mode: F for full, D for delta, E for enrichments
    5. type: product, user, content, user_content, order; if left empty, it will check for all tables with the given tm
    6. tm: (optional) time, in format YmdHis; technical: used to identify the document's version
    7. dev: (optional) use this to mark the content for the dev data index

 

 

 

 

Loading content to GCS (connector: boxalino, gcs)

There are 2 available flows, based on the size of your data:

  1. The content is exported as the body of your POST request

  2. The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )

Option #1 is recommended for data volume less than 32MB.

Option #2 is allowed for any data size.

 

The LOAD request will create a GCS file in your project's GCS bucket: gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<doc>

  • there is no content validation at this step

A. Loading content less than 32 MB

Files under 32MB can be loaded directly as REQUEST BODY CONTENT in Boxalino's GCS.

The sample request below would create the languagefeed_20220317.csv in your GCS bucket gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<filename>
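A minimal sketch of such a load request, assuming the same header conventions as the /sync call; the load endpoint/action and the header carrying the target file name are not spelled out on this page, so they appear as placeholders:

import base64
import requests

# Placeholder: confirm the load endpoint/action against the Boxalino load documentation.
LOAD_URL = "<load endpoint and action provided by Boxalino>"

headers = {
    "Authorization": "Basic " + base64.b64encode(b"<API key>:<API secret>").decode(),
    "client": "<account>",
    "mode": "F",
    "type": "<type>",                                    # ex: product, order, ...
    "tm": "20220317000000",                              # YmdHis
    "<file-name header>": "languagefeed_20220317.csv",   # assumption: names the created GCS file
}

# The raw CSV content is sent directly as the request body (allowed under 32MB).
with open("languagefeed_20220317.csv", "rb") as fh:
    response = requests.post(LOAD_URL, headers=headers, data=fh)

print(response.status_code, response.text)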

B. Loading content of any size

For content over 32MB, we provide an endpoint that returns a Signed GCS URL to which you can stream all your content into a single file (currently there is no defined file size limit in GCS).

1. Make a request for a public upload link

This is the generic POST request:

The response will be an upload link that can only be used to create the file in the client's GCS bucket. The link is valid for 30 minutes.

2. Upload the content to the public link
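A minimal sketch of the upload itself, assuming the signed URL from step 1 was generated for an HTTP PUT (the standard GCS signed-URL upload); the file name is illustrative:

import requests

# upload link returned by the request in step 1 (valid for 30 minutes)
signed_url = "<signed GCS upload URL from step 1>"

# Stream the file as the PUT body; GCS creates the object once the upload completes.
# If the URL was signed for a specific Content-Type, send that same header here.
with open("doc_product_F_202303112000.jsonl", "rb") as fh:
    response = requests.put(signed_url, data=fh)

response.raise_for_status()   # a 200 response means the file now exists in the GCS bucket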

 

Technical notes

  1. Every request to our nodes returns a response (200 or 400). This can be used internally for testing purposes as well.

  2. It is possible to review the STATUS of the requests at any given time https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/747208705

 

Access sharing (BQ and GCS)

The DI-SAAS flow can be used to access privately owned files in GCS and to update/create privately owned BigQuery resources as well.

Google Cloud Storage (GCS)

If your connector is gcs and you want to push data to and read data from a privately owned resource, share access with boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group).

Required permissions:

  1. storage.objects.get

  2. storage.buckets.get

  3. storage.objects.list

This is achieved through the Storage Legacy Bucket Reader & Storage Legacy Object Reader roles.

BigQuery Access (BQ)

If the goal is to load the raw GCS file to a privately owned BQ resource (defined as load -> options -> load_dataset), you have to ensure that the dataset EXISTS and that it uses location EU.

The same Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group) must have read/write access to the dataset where the GCS data is loaded. For simplicity, BigQuery Admin is suggested.