Tip

At Boxalino we are strong advocates of ELT flows (vs. ETL). The major benefits are speed, transparency and maintainability: data is loaded directly into the destination system (BQ) and transformed in parallel (Dataform).

...

  1. Data connector & Data load

    1. downloads the files from the remote source to GCS

    2. loads the raw content to the transform dataset (<account>_T)*

      1. *if the options -> load_dataset & load_gcp_project values are set, the loaded raw content will be available in load_gcp_project.load_dataset instead of rtux-data-integration.<account>_T (see the sketch after this list)

  2. Data transformation (with logic/SQL provided by the client)

    1. Steps #1-#4 from the Generic Flow

  3. DI automation
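
A minimal sketch of the load destination override mentioned in step 1.2.1 (the exact placement of the two keys inside the options node is an assumption; project and dataset values are placeholders):

Code Block
"load": {
  "options": {
    // assumption: load_gcp_project & load_dataset are set alongside the other BQ-load options
    // placeholders - replace with the client's own GCP project & dataset
    "load_gcp_project": "<client-gcp-project>",
    "load_dataset": "<client-dataset>"
  }
}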

...

Code Block
[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|plytix|boxalino",
      "options": {
          // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1(CSV)|0(JSONL),
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": "",
          "allow_quoted_newlines" : 1,
          "allow_jagged_rows" : 0
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]

...

  1. The relevant data sources are available in .csv or JSONL (preferred) format

  2. The files have a timestamp in the file name or in the access path (ex: product_20221206.jsonl)

    1. this helps automate the integration

  3. The files required to update a certain data type (ex: product, order, etc) are available in the same path

  4. The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access

    1. for GCS/GCP sources: access must be shared with Boxalino's service account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (permissions storage.objects.get, storage.buckets.get and storage.objects.list, which can be granted via the Storage Legacy Bucket Reader & Storage Legacy Object Reader roles) - see the sketch after this list


      Access must be granted as described in https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Access-sharing-(BQ-and-GCS)

    2. for AWS / SFTP: the client's own AWS/SFTP environment with a Boxalino user & credentials
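
For GCS sources, the role assignment above can be done, for example, with gsutil (a minimal sketch; <your-gcs-bucket> is a placeholder for the client's bucket):

Code Block
# grant the Boxalino service account the two legacy reader roles on the bucket
gsutil iam ch \
  serviceAccount:boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com:roles/storage.legacyBucketReader \
  gs://<your-gcs-bucket>

gsutil iam ch \
  serviceAccount:boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com:roles/storage.legacyObjectReader \
  gs://<your-gcs-bucket>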

GCS connector properties (sample for JSONL files load definition)
Code Block
"connector": {
  "type": "gcs",
    "options": {
      "source": {
        "project": "<private-gcp-project>",
        "bucket": "<GCS-bucket-name>",
        "pattern": "path/to/data/__NODASHDATE__/"
      }
  },
  "load": {
    "options": {
      "format": "NEWLINE_DELIMITED_JSON",
      "autodetect": true,
      "schema": "",
      "skipRows": 0,
      "max_bad_records": 0,
      "write_disposition": "WRITE_TRUNCATE"
    },
    "doc_attribute": {
      "properties": [
        {
          "source": "file1___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_attribute_value": {
      "properties": [
        {
          "source": "file2___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_product": {
      "entity": [
        {
          "source": "file3___NODASHDATE__.jsonl",
          "autodetect": 0,
          "schema": "SCHEMA FOR LOAD TO BQ"
          "name": ""
        }
      ]
    }
  }
}

...

Tip

It is possible to configure dynamic source names (to identify the files loaded in GCS) (ex: if the content to be loaded in BQ is exported in batches or has a dynamic suffix).

The supported variations are:
__DATE__ (Y-m-d), __NODASHDATE__ (Ymd), __TM__ (Y-m-dTH:i:s), __NODASHTM__ (YmdHis), __DATE_PATH__ (Y/m/d), __JN_DATE_PATH__ (Y/n/j), __DATE_YESTERDAY__ (Y-m-d -1 day), __DATE_MONTH__ (Y-m-01), __NODASHDATE_MONTH__ (Ym01), __NODASHDATE_YESTERDAY__ (Ymd -1 day), __RTM__ (the tm value from your SYNC/CORE request).

For example, if there are 3 chunks for order-<tm>-1.jsonl, the configuration for load can be:
"load":{"options":{<add the default BQ-load options>}, "doc_order":{"entity":[{"source":"doc_order___NODASHTM__*.jsonl"}]}}

...

The DI-SAAS SYNC request

The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that maps the loaded .jsonl files to their data meaning.

...

(optional) use this

Endpoint (production): https://boxalino-di-saas-krceabfwya-ew.a.run.app

  1. Action: /core

  2. Method: POST

  3. Body: the JSON request body described in https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

  4. Header Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>

    note: use the API credentials from your Boxalino account that have the ADMIN role assigned

  5. Header Content-Type: application/json

  6. Header client: account name

  7. Header mode: data sync mode - F for full, D for delta, E for enrichments

  8. Header type: product, user, content, user_content, order

    if left empty, it will check for all tables with the given tm

  9. Header tm: (optional) time, in format YmdHis

    Note: if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino)), the tm from the load step must be re-used

    technical: used to identify the documents version

  10. Header dev: (optional) use this to mark the content for the dev data index
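
Putting the rows above together, a SYNC request might look like the sketch below (account, tm and credential values are placeholders; di-saas-request.json stands for a body built as in the DI-SAAS Request overview):

Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/core" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "mode: F" \
  -H "type: product" \
  -H "tm: 20231106000000" \
  -H "dev: 0" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>" \
  -d @di-saas-request.json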

Loading data to _core dataset

For the integration below, the rti JSON payload is loaded into the client's dataset.

Note

The same tm value is used.

Code Block
curl  "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "mode: T" \
  -H "doc: <filename>.json" \
  -d "<JSONL data>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"

Once the payload is loaded, the DI-SAAS /core request is called with connector type boxalino:

Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/core" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"boxalino\",\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"field_delimiter\":\"\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"quote\":\"\",\"write_disposition\":\"WRITE_TRUNCATE\",\"create_disposition\":\"\",\"encoding\":\"\"},\"doc_content\":{\"rti\":[{\"source\":\"<filename>*.json\",\"autodetect\":1,\"destination\":\"<table name from core dataset>\",\"primary_field\":\"<primary field in your table>\"}]}}}}]" \
  -H "Content-Type: application/json" \
  -H "mode: F" \
  -H "dev: 0" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "client: <client>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Note

Note the JSON payload for the table definition:
"doc_content": {
  "rti": [
    {
      "source": "rti*.jsonl",
      "autodetect": 1,
      "project": "project-id",          // optional, only when the load targets a different GCP project
      "dataset": "dataset-name",        // optional, only when the load targets a different GCP project
      "destination": "doc_content_rti", // BQ table
      "primary_field": "id"             // column used as primary key
    }
  ]
}

The final output table is <client>_core.doc_content_rti.
Depending on the write_disposition property, the data in the table is either rewritten or appended.
If the optional project/dataset details are set, the load happens in that GCP project instead (project-id.dataset-name.doc_content_rti).
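
As a quick check (a sketch, assuming the bq CLI is configured with access to the dataset), the loaded rows can be inspected:

Code Block
bq query --use_legacy_sql=false \
  'SELECT * FROM `<client>_core.doc_content_rti` LIMIT 10'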

...

Note

The use of the chunk header is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /transformer/load/url request.
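
A sketch of two batch loads, re-using the /transformer/load call shown earlier (the starting chunk value and the payload names are illustrative):

Code Block
# batch 1 (same client, tm, type & mode for every batch of this document)
curl "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "mode: T" \
  -H "chunk: 1" \
  -H "doc: <filename>.json" \
  -d "<JSONL data - batch 1>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"

# batch 2: identical headers, only the chunk value is incremented
#   -H "chunk: 2" ... -d "<JSONL data - batch 2>"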

Tip

Steps #1 - #2 must be repeated for every file that has to be added for the given process (same tm, mode & type).

Only after all the files are available in GCS can you move on to step #3.

Tip

After all required documents (doc) for the given type of data sync (ex: product, order, etc.) have been made available in GCS, the DI-SAAS request can be called (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request), setting connector → type: boxalino.

...