Tip
At Boxalino we are strong advisors for ELT flows (vs ETL). The major benefit is speed, transparency and maintenence: data is loaded directly into a destination system (BQ), and transformed in-parallel (Dataform).

...

Data connector & Data load
1. downloads the files from remote source to GCS
2. loads the raw content to the transform dataset (<account>_T)*
  1. *if options-> load_dataset & load_gcp_project values are set - the loaded raw content will be available in load_gcp_project.load_dataset instead of rtux-data-integration.<account>_T
Data transformation (with logic/SQL provided by the client)
1. Steps #1-#4 from the Generic Flow
DI automation

...

Code Block

[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|plytix|boxalino",
      "options": {
          // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1(CSV)|0(JSONL),
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": ""
      ,
 },         "docallow_quoted_Xnewlines" : {1,
          "allow_jagged_rows" : 0
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]

...

The relevant data sources are available in .csv or JSONL (prefered) format
The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl)
1. this will help automating the integration
The files required to update a certain data type (ex: product, order, etc) are available in the same path
The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access
1. for GCS/ GCP sources: access being shared to Boxalino`s Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (storage.objects.get, storage.buckets.getand storage.objects.list , which can be achieved by Storage Legacy Bucket Reader & Storage Legacy Object Reader)
  Image Removed
  for must be given https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Access-sharing-(BQ-and-GCS)
2. for AWS / SFTP : the client`s own AWS/SFTP environment with a Boxalino user & credentials

Expand

title	GCS connector properties (sample for JSONL files load definition)

Code Block

"connector": {
  "type": "gcs",
    "options": {
      "source": {
        "project": "<private-gcp-project>",
        "bucket": "<GCS-bucket-name>",
        "pattern": "path/to/data/__NODASHDATE__/"
      }
  },
  "load": {
    "options": {
      "format": "NEWLINE_DELIMITED_JSON",
      "autodetect": true,
      "schema": "",
      "skipRows": 0,
      "max_bad_records": 0,
      "write_disposition": "WRITE_TRUNCATE"
    },
    "doc_attribute": {
      "properties": [
        {
          "source": "file1___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_attribute_value": {
      "properties": [
        {
          "source": "file2___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_product": {
      "entity": [
        {
          "source": "file3___NODASHDATE__.jsonl",
          "autodetect": 0,
          "schema": "SCHEMA FOR LOAD TO BQ"
          "name": ""
        }
      ]
    }
}

...

Panel

panelIconId	atlassian-check_mark
panelIcon	:check_mark:
panelIconText	:check_mark:
bgColor	#DEEBFF

It is possible to configure dynamic source names (to identify the files loaded in GCS) (ex: if the content to be loaded in BQ is exported in batches or has a dynamic suffix).

The supported variations are:
__DATE__ (Y-m-d), __NODASHDATE__ (Ymd), __TM__ (Y-m-dTH:i:s), __NODASHTM__ (YmdHis), __DATE_PATH__ (Y/m/d), __JN_DATE_PATH__ (Y/n/j), __DATE_YESTERDAY__ (Y-m-d -1 day), __DATE_MONTH__ (Y-m-01), __NODASHDATE_MONTH__ (Ym01), __NODASHDATE_YESTERDAY__ (Ymd -1 day).), __RTM__ (the tm value from your SYNC/CORE request)

For example, if there are 3 chunks for order-<tm>-1.jsonl, the configuration for load can be:
"load":{"options":{<add the default BQ-load options}, "doc_order":{"entity":[{"source":"doc_order___NODASHTM__*.jsonl}]}}

...

The DI-SAAS SYNC request

The DI request will use the same headers (client, tm, mode, type, authorization)and a JSON request body that would provide mapping details between the loaded .jsonl files and data meaning.

...

Endpoint

production

https://boxalino-di-saas-krceabfwya-ew.a.run.app

1

Action

/core

2

Method

POST

3

Body

https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

4

Headers

Authorization

Basic base64<<DATASYNC API key : DATASYNC API Secret>>

note: use the API credentials from your Boxalino account that have the ADMIN role assigned

5

Content-Type

application/json

6

client

account name

7

mode

data sync mode: F for full, D for delta, E for enrichments

8

type

product, user, content, user_content, order.

if left empty - it will check for all tables with the given tm

9

tm

(optional) time , in format: YmdHis;

Note
if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ) - must re-use the tm from the load step

technical: used to identify the documents version

10

dev

(optional) use this to mark the content for the dev data index

11

(optional) use this

Expand

title	Loading data to _core dataset

For the integration bellow, the rti JSON payload is loaded in the client`s dataset.

Note
The same tm value is used.

Code Block

curl  "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "mode: T" \
  -H "doc: <filename>.json" \
  -d "<JSONL data>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"

Code Block

curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/core" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"boxalino\",\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"field_delimiter\":\"\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"quote\":\"\",\"write_disposition\":\"WRITE_TRUNCATE\",\"create_disposition\":\"\",\"encoding\":\"\"},\"doc_content\":{\"rti\":[{\"source\":\"<filename>*.json\",\"autodetect\":1,\"destination\":\"<table name from core dataset>\",\"primary_field\":\"<primary field in your table>\"}]}}}}]" \
  -H "Content-Type: application/json" \
  -H "mode: F" \
  -H "dev: 0" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "client: <client>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"

Note

Note the JSON payload for table definition
"doc_content": {
"rti": [
{
"source": "rti*.jsonl",
"autodetect": 1,
“project”: “project-id”, //optional, only in the case the load is in a different GCP project
“dataset”:”dataset-name”, //optional, only in the case when the load is in a different GCP project
"destination": "doc_content_rti", // BQ table
"primary_field": "id" //Column used as primary key
}
]
}

The final output table is <client>_core.doc_content_rti.
Depending on the write_disposition property, the data is either rewritten in the table or appended.
If the optional project/ dataset details are set, the load is happening in a different GCP project (project-id.dataset-name.doc_content_rti).

...

Panel

panelIconId	1f44c
panelIcon	:ok_hand:
panelIconText	👌
bgColor	#FFEBE6

The use of the header chunk is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunkproperty for each /transformer/load/url request.

Tip

Step #1 - #2 must be repeated for every file that is required to be added for the given process (same tm, mode & type)

Only after all the files are available in GCS, you can move on to step#3.

Tip
After all required documents (doc) for the given `type` data sync (ex: product, order, etc) have been made available in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type : boxalino.

...

The DI-SAAS flow can be used to access privately owned files in GCS and to update/create privately owned BigQuery resources as well.

Tip
If your project already shared access to your Boxalino users group, `bi-with-ai.<client>@boxalino.com` , t

Google Cloud Storage GCS

If your connector is gcs and you want to push and read data to a privately owned resource, share access to boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group)

Required permissions:

storage.objects.get,
storage.buckets.get
storage.objects.list

...

If the goal is to load the raw GCS file to a privately owned BQ resource (defined as: load → options → load_dataset_stage), you have to ensure that the dataset EXISTS and it is in using location EU.

We recommend to create a dedicated dataset for this purpose. To avoid BQ storage costs, we recommend to set an expiration date for the tables (ex: 7 days)

The same Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group) must have read/write access to the dataset where the GCS data is loaded. For simplicity, BigQuery Admin is suggested.

Note
Share access only on the resources that Boxalino is allowed to access. For example, if your integrations only work on a temporary dataset - `boxalino_saas_stage` or `boxalino_saas`- you can share access at resource level (instead of project level).

...

Version	Old Version 14	New Version 16
Changes made by	Dana Negrescu	Dana Negrescu
Saved on	Nov 19, 2024	Nov 19, 2024

Content Comparison

Versions Compared

Key

The DI-SAAS SYNC request

Google Cloud Storage GCS