Tip

At Boxalino we strongly advocate ELT flows (vs ETL). The major benefits are speed, transparency and maintainability: data is loaded directly into the destination system (BQ) and transformed in parallel (Dataform).

...

Code Block
[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|plytix|boxalino",
      "options": {
          // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1(CSV)|0(JSONL),
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": ""
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]
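For orientation, below is a minimal, filled-in variant of the body above for a JSONL load from GCS, written as a Python literal so it can be serialized and posted with the sync request described further down. The bucket and file names are hypothetical placeholders, not values expected by the service.

Code Block
import json

# Hypothetical filled-in example of the request body above: one GCS connector
# loading a single doc_product file, plus the di configuration block.
body = [
    {
        "connector": {
            "type": "gcs",
            "options": {
                "source": {
                    "bucket": "my-export-bucket",             # hypothetical bucket
                    "pattern": "path/to/data/__NODASHDATE__/"
                }
            },
            "load": {
                "options": {
                    "format": "NEWLINE_DELIMITED_JSON",
                    "autodetect": True,
                    "skipRows": 0,
                    "max_bad_records": 0,
                    "write_disposition": "WRITE_TRUNCATE"
                },
                "doc_product": {
                    "entity": [
                        {"source": "product___NODASHDATE__.jsonl"}   # hypothetical file name
                    ]
                }
            }
        },
        "di": {
            "configuration": {
                "languages": ["de", "fr"],
                "currencies": ["CHF"],
                "mapping": {"languagesCountryCode": {"de": "CH_de", "fr": "CH_fr"}},
                "default": {"language": "de", "currency": "CHF"}
            }
        }
    }
]

print(json.dumps(body, indent=2))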

...

  1. The relevant data sources are available in CSV or JSONL (preferred) format

  2. The files have a timestamp in the file name or in the access path (ex: product_20221206.jsonl)

    1. this helps automate the integration

  3. The files required to update a certain data type (ex: product, order, etc) are available in the same path

  4. The files are available on an endpoint (SFTP, GCS, public 3rd-party API, public URL) to which Boxalino has access

    1. for GCS/GCP sources: access must be shared with Boxalino's Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (a client-side export sketch follows this list)

    2. for AWS/SFTP: the client's own AWS/SFTP environment must be provided, with a dedicated Boxalino user & credentials
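As an illustration of requirements #2 and #4.1, here is a minimal sketch of a client-side export that writes a timestamped JSONL file to a GCS bucket shared with the Boxalino service account. The bucket name, path and product rows are hypothetical.

Code Block
from datetime import datetime
import json

from google.cloud import storage  # pip install google-cloud-storage

# Hypothetical bucket/path; the bucket must be readable by
# boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (requirement #4.1).
BUCKET = "my-export-bucket"
nodash_date = datetime.now().strftime("%Y%m%d")  # e.g. 20221206
blob_path = f"path/to/data/{nodash_date}/product_{nodash_date}.jsonl"

# Hypothetical product rows; one JSON object per line (JSONL).
rows = [
    {"id": "1", "title": "Sample product", "price": 9.90},
    {"id": "2", "title": "Another product", "price": 19.90},
]
payload = "\n".join(json.dumps(row) for row in rows)

client = storage.Client()
client.bucket(BUCKET).blob(blob_path).upload_from_string(
    payload, content_type="application/json"
)
print(f"Uploaded gs://{BUCKET}/{blob_path}")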

Expand
title: GCS connector properties (with sample for JSONL files load definition)
Code Block
"connector": {
  "type": "gcs",
    "options": {
      "source": {
        "bucket": "<GCS-bucket-name>",
        "pattern": "path/to/data/__NODASHDATE__/"
      }
  },
  "load": {
    "options": {
      "format": "NEWLINE_DELIMITED_JSON",
      "autodetect": true,
      "schema": "",
      "skipRows": 0,
      "max_bad_records": 0,
      "write_disposition": "WRITE_TRUNCATE"
    },
    "doc_attribute": {
      "properties": [
        {
          "source": "file1___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_attribute_value": {
      "properties": [
        {
          "source": "file2___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_product": {
      "entity": [
        {
          "source": "file3___NODASHDATE__.jsonl",
          "autodetect": 0,
          "schema": "SCHEMA FOR LOAD TO BQ"
          "name": ""
        }
      ]
    }
  }
}
Expand
title: (client-owned environment) SFTP connector properties (with sample for CSV files load definition)
Code Block
"connector": {
      "type": "sftp",
      "options": {
        "hostname": "<server>",
        "port": 22,
        "username": "<boxalino user><user for Boxalino access>",
        "password": "<boxalino password><password for Boxalino access>",
        "source": {
          "pattern": "___NODASHDATE__",
          "files": [
          ],
          "path": "<path on server>"
        }
      },
      "load": {
        "options": {
          "format": "CSV",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1,
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": ""
        },
        "doc_attribute": {
          "properties": [
            {
              "source": "file1___NODASHDATE__.csv"
            }
          ],
          ..
        },
        "doc_attribute_value": {
          "properties": [
            {
              "source": "file2___NODASHDATE__.csv"
            }
          ],
          ..
        },
        "doc_product": {
          "entity": [
            {
              "source": "file3___NODASHDATE__.csv",
              "autodetect": 0,
              "schema": "SCHEMA FOR LOAD TO BQfield1:STRING,field2:INT64,field3:FLOAT64,field4:BOOL"
              "name": "used to create suffix for BQ table"
            }
          ]
        }
      }
}

...

Panel

The load configuration defines the GCP properties for loading content to BQ.

The requirements specified above (#1-#4) apply when the data is accessed from a remote (outside Boxalino) location.

If your integration exports the data directly to Boxalino (as described at https://boxalino-internal.atlassian.net/wiki/spaces/DOC/pages/2606792705/Boxalino+Data+Integration+DI-SAAS+-+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), please continue with the Data Transformation step.

...

Panel

It is possible to configure dynamic source names (used to identify the files loaded in GCS), for example when the content to be loaded in BQ is exported in batches or has a dynamic suffix.

The supported variations are:
__DATE__ (Y-m-d), __NODASHDATE__ (Ymd), __TM__ (Y-m-dTH:i:s), __NODASHTM__ (YmdHis), __DATE_PATH__ (Y/m/d), __JN_DATE_PATH__ (Y/n/j), __DATE_YESTERDAY__ (Y-m-d -1 day), __DATE_MONTH__ (Y-m-01), __NODASHDATE_MONTH__ (Ym01), __NODASHDATE_YESTERDAY__ (Ymd -1 day).

For example, if the content is exported in 3 chunks (order-<tm>-1.jsonl, order-<tm>-2.jsonl, order-<tm>-3.jsonl), the load configuration can be:
"load": {"options": {<add the default BQ-load options>}, "doc_order": {"entity": [{"source": "doc_order___NODASHTM__*.jsonl"}]}}

2. Data Transformation

Once the data is loaded in GCS and BQ, it is time to transform it into the necessary data structure.

...

  1. The client has access to a GCP project

  2. The client will create a Dataform repository https://cloud.google.com/dataform/docs/repositories

  3. The client has access to a GitHub or GitLab repository (to connect it to the Dataform repository) https://cloud.google.com/dataform/docs/connect-repository

  4. The client has granted the “Dataform Admin” role to the Boxalino Service Account boxalino-dataform@rtux-data-integration.iam.gserviceaccount.com (see the sketch after this list)
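As one possible way to perform step 4, the role can be granted at project level with the Cloud Resource Manager API. This is only a sketch under the assumption that a project-level binding is acceptable (the same grant can be done in the Cloud Console); the project ID is a placeholder.

Code Block
from googleapiclient import discovery  # pip install google-api-python-client

PROJECT_ID = "my-gcp-project"  # placeholder: the client's GCP project
MEMBER = "serviceAccount:boxalino-dataform@rtux-data-integration.iam.gserviceaccount.com"
ROLE = "roles/dataform.admin"  # Dataform Admin

crm = discovery.build("cloudresourcemanager", "v1")
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()

# Extend an existing binding for the role, or create a new one.
binding = next((b for b in policy.get("bindings", []) if b["role"] == ROLE), None)
if binding is None:
    policy.setdefault("bindings", []).append({"role": ROLE, "members": [MEMBER]})
elif MEMBER not in binding["members"]:
    binding["members"].append(MEMBER)

crm.projects().setIamPolicy(resource=PROJECT_ID, body={"policy": policy}).execute()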

The DI-SAAS SYNC request

The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that provides the mapping between the loaded .jsonl files and their data meaning.

...

Endpoint (production): https://boxalino-di-saas-krceabfwya-ew.a.run.app

Action: /sync

Method: POST

Body: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

Headers:

  Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
    note: use the API credentials from your Boxalino account that have the ADMIN role assigned

  Content-Type: application/json

  client: account name

  mode: data sync mode: F for full, D for delta, E for enrichments

  type: product, user, content, user_content, order.
    If left empty, all tables with the given tm are checked.

  tm: (optional) time, in the format YmdHis; technically used to identify the document versions.
    Note: if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), the tm from the load step must be re-used.

  ts: (optional) timestamp, must be a millisecond-based UNIX timestamp

  dev: (optional) use this to mark the content for the dev data index
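A minimal sketch of the /sync call with the headers above; the credentials, account name and tm are placeholders. The body shown assumes the data was already loaded to Boxalino GCS (hence connector type boxalino, see the tip at the end of this section); for remote connectors, use the full connector/load configuration from section 1.

Code Block
import base64
import json

import requests

API_KEY = "<DATASYNC API key>"        # placeholder (ADMIN role required)
API_SECRET = "<DATASYNC API Secret>"  # placeholder
auth = base64.b64encode(f"{API_KEY}:{API_SECRET}".encode()).decode()

headers = {
    "Authorization": f"Basic {auth}",
    "Content-Type": "application/json",
    "client": "<account name>",   # placeholder
    "mode": "F",                  # F = full, D = delta, E = enrichments
    "type": "product",
    "tm": "<YmdHis>",             # placeholder; re-use the load-step tm for Boxalino GCS loads
}

# Minimal body for content already loaded to Boxalino GCS; other connectors
# require the full connector/load options described in section 1.
body = [
    {
        "connector": {"type": "boxalino"},
        "di": {
            "configuration": {
                "languages": ["de"],
                "currencies": ["CHF"],
                "default": {"language": "de", "currency": "CHF"},
            }
        },
    }
]

response = requests.post(
    "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync",
    headers=headers,
    data=json.dumps(body),
)
print(response.status_code, response.text)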

...

Panel

The use of the chunk header is required if the same file/document is exported in batches/sections.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /transformer/load/url request (see the sketch below).
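A rough sketch of that loop: the base URL, the request shape and the batch file names are assumptions for illustration; the part mandated above is the chunk header that is incremented for every /transformer/load/url request issued with the same tm, mode and type.

Code Block
import requests

BASE_URL = "https://<di-load-endpoint>"  # placeholder, see the load documentation
common_headers = {
    "Authorization": "Basic <base64 credentials>",  # placeholder
    "client": "<account name>",                     # placeholder
    "tm": "<same tm for every chunk>",
    "mode": "F",
    "type": "product",
}

# Hypothetical batch files exported for the same document.
batches = ["doc_product-<tm>-1.jsonl", "doc_product-<tm>-2.jsonl", "doc_product-<tm>-3.jsonl"]

for chunk, file_name in enumerate(batches, start=1):
    headers = {**common_headers, "chunk": str(chunk)}  # increment chunk per batch
    response = requests.post(f"{BASE_URL}/transformer/load/url", headers=headers)
    response.raise_for_status()
    # ...continue with step 2 (uploading the content of file_name) before the next chunk...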

Tip

Steps #1 - #2 must be repeated for every file that needs to be added to the given process (same tm, mode & type).

Only after all the files are available in GCS can you move on to step #3.

Tip

After all required documents (doc) for the given data sync type (ex: product, order, etc) have been made available in GCS, the DI-SAAS request can be called (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request ), assigning connector → type: boxalino.

...