Tip

At Boxalino we strongly recommend ELT flows (vs ETL). The major benefits are speed, transparency and maintenance: data is loaded directly into a destination system (BQ) and transformed in parallel (Dataform).

...

The DI-SAAS SYNC request

The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that provides the mapping details between the loaded .jsonl files and their data meaning.

...

Info

There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino.

Endpoint (production): https://boxalino-di-saas-krceabfwya-ew.a.run.app

Action: /sync

Method: POST

Body: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

Headers

  • Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
    note: use the API credentials from your Boxalino account that have the ADMIN role assigned

  • Content-Type: application/json

  • client: account name

  • mode: data sync mode: F for full, D for delta, E for enrichments

  • type: product, user, content, user_content, order.
    If left empty, all tables with the given tm are checked.

  • tm: (optional) time, in the format YmdHis
    Note: if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), the tm from the load step must be re-used.
    technical: used to identify the documents version

  • ts: (optional) UNIX timestamp in milliseconds

  • dev: (optional) use this to mark the content for the dev data index
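
The headers listed above can be assembled programmatically. The following is a minimal Python sketch, not an official Boxalino client. The function name `build_sync_headers` and its parameters are hypothetical. It assumes that the Authorization value is the standard HTTP Basic encoding of `key:secret`, that YmdHis denotes a 14-digit `YYYYMMDDHHMMSS` string, that UTC is an acceptable time zone for tm, and that `dev: 1` marks the dev data index (the page only shows `dev: 0`).

```python
import base64
from datetime import datetime, timezone


def build_sync_headers(client, api_key, api_secret, mode="F",
                       data_type="product", now=None, dev=False):
    """Assemble the DI-SAAS request headers as a plain dict (hypothetical helper)."""
    if mode not in ("F", "D", "E"):
        raise ValueError("mode must be F (full), D (delta) or E (enrichments)")
    now = now or datetime.now(timezone.utc)  # assumption: UTC is acceptable
    # Standard HTTP Basic scheme: base64 of "key:secret"
    token = base64.b64encode(f"{api_key}:{api_secret}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "client": client,
        "mode": mode,
        "type": data_type,
        "tm": now.strftime("%Y%m%d%H%M%S"),       # YmdHis
        "ts": str(int(now.timestamp() * 1000)),   # UNIX timestamp in milliseconds
    }
    if dev:
        headers["dev"] = "1"  # assumption: "1" selects the dev data index
    return headers
```

When the data was first loaded to Boxalino GCS, the tm from that load step would be passed through instead of being regenerated here.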

Sample

This is a sample of a triggered full product sync (minimal data):

Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"gcs\",\"options\":{\"source\":{\"bucket\":\"my-account-bucket\",\"pattern\":\"boxalino/product/__NODASHDATE__/\"}},\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"write_disposition\":\"WRITE_TRUNCATE\"},\"doc_product\":{\"entity\":[{\"source\":\"product.jsonl\",\"autodetect\":0,\"schema\":\"ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME\"}],\"product_relations\":[{\"source\":\"crosssell.jsonl\",\"name\":\"crosssell\"}],\"link\":[{\"source\":\"urlrecord.jsonl\"}]}}},\"di\":{\"configuration\":{\"languages\":[\"de\",\"fr\"],\"currencies\":[\"CHF\"],\"mapping\":{\"languagesCountryCode\":{\"de\":\"CH_de\",\"fr\":\"CH_fr\"}},\"default\":{\"language\":\"de\",\"currency\":\"CHF\"}}}}]" \
  -H "Content-Type: application/json" \
  -H "client: <boxalino-account-name>" \
  -H "mode: F" \
  -H "tm: 202303112000" \
  -H "type: product" \
  -H "Authorization: Basic <base64_encode(api_key:api_secret)>"

The request above created the following resources:

  1. GCS (raw data, as migrated from the connector)

    1. gs://prod_rtux-data-integration_<account>/product/202303112000/F_product.jsonl

    2. gs://prod_rtux-data-integration_<account>/product/202303112000/F_crosssell.jsonl

    3. gs://prod_rtux-data-integration_<account>/product/202303112000/F_urlrecord.jsonl

  2. BQ (the T dataset - raw data as loaded into BQ)

    1. rtux-data-integration.<account>_T.202303112000-doc_product-entity

    2. rtux-data-integration.<account>_T.202303112000-doc_product-product_relations-crosssell

    3. rtux-data-integration.<account>_T.202303112000-doc_product-link

  3. GCS (transformed doc_X JSONL)

    1. gs://prod_rtux-data-integration_<account>/doc_product_F_202303112000.jsonl

    2. gs://prod_rtux-data-integration_<account>/doc_language_F_202303112000.jsonl

  4. BQ (the F dataset - transformed data to doc_X data structure)

    1. rtux-data-integration.<account>_F.doc_product_F_202303112000

    2. rtux-data-integration.<account>_F.doc_language_F_202303112000
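
The raw-data names in the listing above appear to follow a regular pattern. The sketch below reproduces that pattern in Python so the expected GCS objects and T-dataset tables can be checked after a sync. The naming rules are inferred from this single example only and are an assumption, not a documented contract. The function name `expected_raw_resources` is hypothetical.

```python
def expected_raw_resources(account, data_type, mode, tm, doc_name, doc_definition):
    """Guess the raw GCS objects and T-dataset tables for one doc definition.

    doc_definition is the per-document part of the request body, e.g. the
    value of "doc_product". All naming rules here are inferred, not official.
    """
    bucket = f"gs://prod_rtux-data-integration_{account}"
    gcs_objects, bq_tables = [], []
    for section, files in doc_definition.items():
        for item in files:
            # Raw file: <bucket>/<type>/<tm>/<mode>_<source>
            gcs_objects.append(f"{bucket}/{data_type}/{tm}/{mode}_{item['source']}")
            # Raw table: <tm>-<doc>-<section>, plus "-<name>" when a name is set
            table = f"{tm}-{doc_name}-{section}"
            if item.get("name"):
                table += f"-{item['name']}"
            bq_tables.append(f"rtux-data-integration.{account}_T.{table}")
    return gcs_objects, bq_tables
```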

Sample request with Boxalino connector (e.g. data is already available in the Boxalino GCP project, in the client's GCS bucket)
Code Block
[
  {
    "connector": {
      "type": "boxalino",
      "load": {
        "options": {
          "format": "NEWLINE_DELIMITED_JSON",
          "autodetect": true,
          "schema": "",
          "skipRows": 0,
          "max_bad_records": 0,
          "write_disposition": "WRITE_TRUNCATE"
        },
        "doc_product": {
          "entity": [
            {
              "source": "product.jsonl",
              "autodetect": 0,
              "schema": "ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME"
            }
          ],
          "product_relations": [
            {
              "source": "crosssell.jsonl",
              "name": "crosssell"
            }
          ],
          "link": [
            {
              "source": "urlrecord.jsonl"
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]
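
Hand-escaping this body for a curl `-d` argument is error-prone, so it may be easier to build it as a data structure and serialize it. The following Python sketch mirrors the sample body above. The function name `build_boxalino_body` and its parameters are hypothetical, and the option values are simply copied from the sample rather than taken from a schema.

```python
import json


def build_boxalino_body(doc_name, doc_definition, languages, currencies,
                        language_country_codes, default_language, default_currency):
    """Build the request body for connector type "boxalino" (hypothetical helper)."""
    load = {
        "options": {  # values copied from the sample above
            "format": "NEWLINE_DELIMITED_JSON",
            "autodetect": True,
            "schema": "",
            "skipRows": 0,
            "max_bad_records": 0,
            "write_disposition": "WRITE_TRUNCATE",
        },
        doc_name: doc_definition,
    }
    configuration = {
        "languages": languages,
        "currencies": currencies,
        "mapping": {"languagesCountryCode": language_country_codes},
        "default": {"language": default_language, "currency": default_currency},
    }
    return [{"connector": {"type": "boxalino", "load": load},
             "di": {"configuration": configuration}}]


body = build_boxalino_body(
    "doc_product",
    {"entity": [{"source": "product.jsonl", "autodetect": 0,
                 "schema": "ProductId:INT64,Name:STRING"}],
     "link": [{"source": "urlrecord.jsonl"}]},
    languages=["de", "fr"], currencies=["CHF"],
    language_country_codes={"de": "CH_de", "fr": "CH_fr"},
    default_language="de", default_currency="CHF",
)
payload = json.dumps(body)  # string to send as the POST body
```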

The DI-SAAS CORE request

The CORE request is used to load data (as is) into the client's _core dataset from the Boxalino ecosystem.

In order to achieve this, there are 2 requests to be made:

  1. load data to GCS https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino)

  2. load data from GCS to <client>_core.<destination> table

REQUEST DEFINITION

As an integrator, please send the request below to the provided endpoint.

Info

There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino.

Endpoint (production): https://boxalino-di-saas-krceabfwya-ew.a.run.app

Action: /core

Method: POST

Body: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

Headers

  • Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
    note: use the API credentials from your Boxalino account that have the ADMIN role assigned

  • Content-Type: application/json

  • client: account name

  • mode: data sync mode: F for full, D for delta, E for enrichments

  • type: product, user, content, user_content, order.
    If left empty, all tables with the given tm are checked.

  • tm: (optional) time, in the format YmdHis
    Note: if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), the tm from the load step must be re-used.
    technical: used to identify the documents version

  • dev: (optional) use this to mark the content for the dev data index

Loading data to _core dataset

For the integration below, the rti JSON payload is loaded into the client's dataset.

Note

The same tm value is used.

The JSONL payload data should have a primary field (e.g. id).

Code Block
curl "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "mode: T" \
  -H "doc: <filename>.json" \
  -d "<JSONL data>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/core" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"boxalino\",\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"field_delimiter\":\"\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"quote\":\"\",\"write_disposition\":\"WRITE_TRUNCATE\",\"create_disposition\":\"\",\"encoding\":\"\"},\"doc_content\":{\"rti\":[{\"source\":\"<filename>*.json\",\"autodetect\":1,\"destination\":\"<table name from core dataset>\",\"primary_field\":\"<primary field in your table>\"}]}}}}]" \
  -H "Content-Type: application/json" \
  -H "mode: F" \
  -H "dev: 0" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "client: <client>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Note

Note the JSON payload for the table definition:
"doc_content": {
  "rti": [
    {
      "source": "rti*.jsonl",
      "autodetect": 1,
      "destination": "doc_content_rti",
      "primary_field": "id"
    }
  ]
}

The final output table is <client>_core.doc_content_rti.
Depending on the write_disposition property, the data in the table is either overwritten (WRITE_TRUNCATE, as in the request above) or appended.
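
The two-step flow above (load to GCS, then load from GCS to the _core table) can be prepared as a pair of requests that share one tm. The Python sketch below only builds the request objects with the standard library and does not send them. The function name `prepare_core_requests` and its parameters are hypothetical. The header values and body options are copied from the two curl samples, including `mode: T` for the load step and `mode: F` for the /core step. Whether the service accepts any other combination is not stated on this page.

```python
import json
from urllib.request import Request


def prepare_core_requests(load_url, core_url, client, authorization, tm,
                          filename, source_pattern, jsonl_data,
                          destination, primary_field):
    """Build (load_request, core_request) without sending them (hypothetical helper)."""
    common = {
        "Content-Type": "application/json",
        "client": client,
        "tm": tm,  # the same tm is used for both steps
        "type": "content",
        "Authorization": authorization,
    }
    # Step 1: push the JSONL data to GCS
    load_request = Request(load_url, data=jsonl_data.encode("utf-8"),
                           headers={**common, "mode": "T", "doc": filename},
                           method="POST")
    # Step 2: load from GCS into <client>_core.<destination>
    body = [{"connector": {"type": "boxalino", "load": {
        "options": {"format": "NEWLINE_DELIMITED_JSON", "field_delimiter": "",
                    "autodetect": True, "schema": "", "skipRows": 0,
                    "max_bad_records": 0, "quote": "",
                    "write_disposition": "WRITE_TRUNCATE",
                    "create_disposition": "", "encoding": ""},
        "doc_content": {"rti": [{"source": source_pattern, "autodetect": 1,
                                 "destination": destination,
                                 "primary_field": primary_field}]},
    }}}]
    core_request = Request(core_url, data=json.dumps(body).encode("utf-8"),
                           headers={**common, "mode": "F", "dev": "0"},
                           method="POST")
    return load_request, core_request
```

Each prepared request could then be sent with `urllib.request.urlopen`, the load request first and the /core request only after it succeeds. Note that `urllib` normalizes header-name capitalization, which is harmless because HTTP header names are case-insensitive.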

Loading content to Boxalino GCS (connector: boxalino)

...

The use of the header chunk is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /transformer/load/url request.
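
The incrementing chunk header can be generated per batch. This is a minimal Python sketch: the function name `chunked_load_headers` is hypothetical, and starting the counter at 1 is an assumption, since the page does not state the first chunk value.

```python
def chunked_load_headers(base_headers, batch_count, first_chunk=1):
    """Yield one header dict per batch, incrementing the chunk header each time.

    first_chunk=1 is an assumption; the required start value is not documented here.
    """
    for offset in range(batch_count):
        # Copy the shared headers (same tm, mode, type, doc) and add the chunk value
        yield {**base_headers, "chunk": str(first_chunk + offset)}
```

Each yielded dict would be used for one /transformer/load/url request, followed by the upload of that batch.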

Tip

Steps #1 and #2 must be repeated for every file required by the given process (same tm, mode & type).

Only after all the files are available in GCS can you move on to step #3.

Tip

After all required documents (doc) for the given data sync type (e.g. product, order) have been made available in GCS, the DI-SAAS request https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request can be called with connector → type : boxalino.

...