Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.
Tip

At Boxalino we are strong advisors for ELT flows (vs ETL). The major benefit is speed, transparency and maintenence: data is loaded directly into a destination system (BQ), and transformed in-parallel (Dataform).

...

  1. The relevant data sources are available in .csv or JSONL (prefered) format

  2. The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl)

    1. this will help automating the integration

  3. The files required to update a certain data type (ex: product, order, etc) are available in the same path

  4. The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access

    1. for GCS/ GCP sources: access being shared to Boxalino`s Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (storage.objects.get, storage.buckets.getand storage.objects.list , which can be achieved by Storage Legacy Bucket Reader & Storage Legacy Object Reader)

      image-20241119-123626.pngImage Added

    2. for AWS / SFTP : the client`s own AWS/SFTP environment with a Boxalino user & credentials

...

The DI-SAAS SYNC request

The DI request will use the same headers (client, tm, mode, type, authorization)and a JSON request body that would provide mapping details between the loaded .jsonl files and data meaning.

...

Endpoint

production

https://boxalino-di-saas-krceabfwya-ew.a.run.app

1

Action

/sync

2

Method

POST

3

Body

https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

4

Headers

Authorization

Basic base64<<DATASYNC API key : DATASYNC API Secret>>

note: use the API credentials from your Boxalino account that have the ADMIN role assigned

5

 

Content-Type

application/json

6

 

client

account name

7

 

mode

data sync mode: F for full, D for delta, E for enrichments

8

 

type

product, user, content, user_content, order.

if left empty - it will check for all tables with the given tm

9

 

tm

(optional) time , in format: YmdHis;

Note

if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ) - must re-use the tm from the load step

technical: used to identify the documents version

10

 

ts

(optional) timestamp, must be millisecond based in UNIX timestamp

11

 

dev

(optional) use this to mark the content for the dev data index

12

loadOnly

(optional) if true, the data will only be loaded to BQ

13

transformOnly

(optional) if true, the data will only be transformed (ex: dataform tag execution)

Expand
titleSample

This is a sample of a triggered full product sync (minimal data):

Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"gcs\",\"options\":{\"source\":{\"bucket\":\"my-account-bucket\",\"pattern\":\"boxalino/product/__NODASHDATE__/\"}},\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"write_disposition\":\"WRITE_TRUNCATE\"},\"doc_product\":{\"entity\":[{\"source\":\"product.jsonl\",\"autodetect\":0,\"schema\":\"ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME\"}],\"product_relations\":[{\"source\":\"crosssell.jsonl\",\"name\":\"crosssell\"}],\"link\":[{\"source\":\"urlrecord.jsonl\"}]}}},\"di\":{\"configuration\":{\"languages\":[\"de\",\"fr\"],\"currencies\":[\"CHF\"],\"mapping\":{\"languagesCountryCode\":{\"de\":\"CH_de\",\"fr\":\"CH_fr\"}},\"default\":{\"language\":\"de\",\"currency\":\"CHF\"}}}}]" \
  -H "Content-Type: application/json" \
  -H "account: <boxalino-account-name>" \
  -H "mode: F" \
  -H "tm: 202303112000" \
  -H "type: product" \
  -H "Authorization: <base64_encode(api_key:api_secret)>" 

The request above created the following resources:

  1. GCS (raw data, as migrated from the connector)

    1. gs://prod_rtux-data-integration_<account>/product/202303112000/F_product.jsonl

    2. gs://prod_rtux-data-integration_<account>/product/202303112000/F_crossell.jsonl

    3. gs://prod_rtux-data-integration_<account>/product/202303112000/F_urlrecord.jsonl

  2. BQ (the T dataset - raw data as loaded from BQ)

    1. rtux-data-integration.<account>_T.202303112000-doc_product-entity

    2. rtux-data-integration.<account>_T.202303112000-doc_product-product_relations-crossell

    3. rtux-data-integration.<account>_T.202303112000-doc_product-link

  3. GCS (transformed doc_X JSONL)

    1. gs://prod_rtux-data-integration_<account>/doc_product_F_202303112000.jsonl

    2. gs://prod_rtux-data-integration_<account>/doc_language_F_202303112000.jsonl

  4. BQ (the F dataset - transformed data to doc_X data structure)

    1. rtux-data-integration.<account>_F.doc_product_F_202303112000

    2. rtux-data-integration.<account>_F.doc_language_F_202303112000

...

Endpoint

production

https://boxalino-di-saas-krceabfwya-ew.a.run.app

1

Action

/core

2

Method

POST

3

Body

https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

4

Headers

Authorization

Basic base64<<DATASYNC API key : DATASYNC API Secret>>

note: use the API credentials from your Boxalino account that have the ADMIN role assigned

5

 

Content-Type

application/json

6

 

client

account name

7

 

mode

data sync mode: F for full, D for delta, E for enrichments

8

 

type

product, user, content, user_content, order.

if left empty - it will check for all tables with the given tm

9

 

tm

(optional) time , in format: YmdHis;

Note

if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ) - must re-use the tm from the load step

technical: used to identify the documents version

10

 

dev

(optional) use this to mark the content for the dev data index

11

(optional) use this

Expand
titleLoading data to _core dataset

For the integration bellow, the rti JSON payload is loaded in the client`s dataset.

Note

The same tm value is used.

Code Block
curl  "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "mode: T" \
  -H "doc: <filename>.json" \
  -d "<JSONL data>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/core" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"boxalino\",\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"field_delimiter\":\"\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"quote\":\"\",\"write_disposition\":\"WRITE_TRUNCATE\",\"create_disposition\":\"\",\"encoding\":\"\"},\"doc_content\":{\"rti\":[{\"source\":\"<filename>*.json\",\"autodetect\":1,\"destination\":\"<table name from core dataset>\",\"primary_field\":\"<primary field in your table>\"}]}}}}]" \
  -H "Content-Type: application/json" \
  -H "mode: F" \
  -H "dev: 0" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "client: <client>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Note

Note the JSON payload for table definition
"doc_content": {
"rti": [
{
"source": "rti*.jsonl",
"autodetect": 1,
"destination": "doc_content“project”: “project-id”, //optional, only in the case the load is in a different GCP project
“dataset”:”dataset-name”, //optional, only in the case when the load is in a different GCP project
"destination": "doc_content_rti", // BQ table
"primary_field": "id" //Column used as primary key
}
]
}

The final output table is <client>_core.doc_content_rti.
Depending on the write_disposition property, the data is either rewritten in the table or appended.
If the optional project/ dataset details are set, the load is happening in a different GCP project (project-id.dataset-name.doc_content_rti).

Loading content to

...

GCS (connector: boxalino, gcs)

Note

By following these steps, you can push your data (JSONL or CSV) directly in your client`s GCS bucket in the Boxalino scope . After or in a GCS bucket in a private GCP project (access share required). After all data has been loaded in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino .or connector → type: gcs

There are 2 available flows, based on the size of your data:

...

Code Block
curl "https://boxalino-di-stageprocess-krceabfwya-ew.a.run.app/transformersaas/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: T" \
  -H "gcp_project: <private-client-GCP-project>" \
  -H "gcs_bucket: <private-client-GCS-bucket>" \
  -H "doc: <filename>" \
  -d "<JSONL>" \
  -H "Authorization: Basic <encode of the account>"

...

Expand
titleSample

For example, the request bellow would create a gs://prod_rtux-data-integration_<account>/product/20230301161554/T_doc_language.json in your clients` GCS bucket. This is a necessary data for the type:product integration.

Code Block
curl --connect-timeout 30 --max-time 300 "https://boxalino-di-stageprocess-krceabfwya-ew.a.run.app/transformersaas/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: T" \
  -H "doc: doc_language.json" \
  -d "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}" \
  -H "Authorization: Basic <encode of the account>" 

...

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stageprocess-krceabfwya-ew.a.run.app/transformersaas/load/url" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: T" \
  -H "chunkgcp_project: <id><private-client-GCP-project>" \
  -H "docgcs_bucket: <filename><private-client-GCS-bucket>" \
  -H "chunk: <id>" \
  -H "doc: <filename>" \
  -H "Authorization: Basic <encode of the account>"

...

Expand
titleSample

Lets say, we need a public link to upload a document called product-entity.jsonl. The following request can be used to generate the link:

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stageprocess-krceabfwya-ew.a.run.app/transformersaas/load/url" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: T" \
  -H "doc: product-entity.jsonl" \
  -H "Authorization: Basic <encode of the account>"

...

Panel
panelIconId1f44c
panelIcon:ok_hand:
panelIconText👌
bgColor#FFEBE6

The use of the header chunk is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunkproperty for each /transformer/load/url request.

Tip

Step #1 - #2 must be repeated for every file that is required to be added for the given process (same tm, mode & type)

Only after all the files are available in GCS, you can move on to step#3.

Tip

After all required documents (doc) for the given type data sync (ex: product, order, etc) have been made available in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino.

...

  1. Every request to our nodes returns a response (200 or 400). This can be used internally for testing purposes as well.

  2. It is possible to review the STATUS of the requests at any given time Status Review

Access sharing (BQ and GCS)

The DI-SAAS flow can be used to access privately owned files in GCS and to update/create privately owned BigQuery resources as well.

Google Cloud Storage GCS

If your connector is gcs and you want to push and read data to a privately owned resource, share access to boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com

Required permissions:

  1. storage.objects.get,

  2. storage.buckets.get

  3. storage.objects.list

This is achieved through Storage Legacy Bucket Reader & Storage Legacy Object Reader roles.

  1. image-20241119-123626.pngImage Added

The gcs bucket used in the connector definition, must be located in europe-west6 or eu zone.

If you use the Boxalino services to load data to your private GCS bucket (<endpoint>/saas/load or <endpoint>/saas/load/url ), you must also give WRITE permissions to the SA.

...

BigQuery Access (BQ)

If the goal is to load the raw GCS file to a privately owned BQ resource (defined as: loadoptionsdataset_stage), you have to ensure that the dataset EXISTS and is in location EU.

We recommend to create a dedicated dataset for this purpose. To avoid BQ storage costs, we recommend to set an expiration date for the tables (ex: 7 days)

image-20241119-133935.pngImage Added

The same Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com must have read/write access to the dataset where the GCS data is loaded. For simplicity, BigQuery Admin is suggested.

...