Tip

At Boxalino we are strong advocates of ELT flows (vs ETL). The major benefits are speed, transparency, and maintainability: data is loaded directly into a destination system (BQ) and transformed in parallel (Dataform).

...

  1. Data connector & Data load

    1. downloads the files from the remote source to GCS

    2. loads the raw content to the transform dataset (<account>_T)

      1. if options → load_dataset & load_gcp_project values are set, the loaded raw content will be available in load_gcp_project.load_dataset instead of rtux-data-integration.<account>_T

  2. Data transformation (with logic/SQL provided by the client)

    1. Steps #1-#4 from the Generic Flow

  3. DI automation

...

Code Block
[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|plytix|boxalino",
      "options": {
          // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1(CSV)|0(JSONL),
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": "",
          "allow_quoted_newlines" : 1,
          "allow_jagged_rows" : 0
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de":"CH_de", "fr":"CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]

...

  1. The relevant data sources are available in .csv or JSONL (preferred) format

  2. The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl)

    1. this helps automate the integration (see the sketch after this list)

  3. The files required to update a certain data type (ex: product, order, etc) are available in the same path

  4. The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access

    1. for GCS/GCP sources: access must be shared to Boxalino's Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (see Access sharing (BQ and GCS): https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Access-sharing-(BQ-and-GCS))

    2. for AWS/SFTP: the client's own AWS/SFTP environment must be provided, with a Boxalino user & credentials for access
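For instance (a hypothetical daily export; paths and names are illustrative), a dated file name maps onto a load source via the date variables documented later in this page (ex: __NODASHDATE__ resolves to Ymd):

Code Block
// remote files for 2022-12-06:
//   /export/product_20221206.jsonl
//   /export/order_20221206.jsonl
// matching load source definition:
"source": "product___NODASHDATE__.jsonl"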

Expand
titleGCS connector properties (with sample for JSONL files load definition)
Code Block
"connector": {
  "type": "gcs",
    "options": {
      "source": {
        "project": "<private-gcp-project>",
        "bucket": "<GCS-bucket-name>",
        "pattern": "path/to/data/__NODASHDATE__/"
      }
  },
  "load": {
    "options": {
      "format": "NEWLINE_DELIMITED_JSON",
      "autodetect": true,
      "schema": "",
      "skipRows": 0,
      "max_bad_records": 0,
      "write_disposition": "WRITE_TRUNCATE"
    },
    "doc_attribute": {
      "properties": [
        {
          "source": "file1___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_attribute_value": {
      "properties": [
        {
          "source": "file2___NODASHDATE__.jsonl"
        }
      ],
      ..
    },
    "doc_product": {
      "entity": [
        {
          "source": "file3___NODASHDATE__.jsonl",
          "autodetect": 0,
          "schema": "SCHEMA FOR LOAD TO BQ"
          "name": ""
        }
      ]
    }
  }
}
Expand
title(client-owned environment) SFTP connector properties (with sample for CSV files load definition)
Code Block
"connector": {
      "type": "sftp",
      "options": {
        "hostname": "<server>",
        "port": 22,
        "username": "<boxalino user><user for Boxalino access>",
        "password": "<boxalino password><password for Boxalino access>",
        "source": {
          "pattern": "___NODASHDATE__",
          "files": [
          ],
          "path": "<path on server>"
        }
      },
      "load": {
        "options": {
          "format": "CSV",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1,
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": ""
        },
        "doc_attribute": {
          "properties": [
            {
              "source": "file1___NODASHDATE__.csv"
            }
          ],
          ..
        },
        "doc_attribute_value": {
          "properties": [
            {
              "source": "file2___NODASHDATE__.csv"
            }
          ],
          ..
        },
        "doc_product": {
          "entity": [
            {
              "source": "file3___NODASHDATE__.csv",
              "autodetect": 0,
              "schema": "SCHEMA FOR LOAD TO BQfield1:STRING,field2:INT64,field3:FLOAT64,field4:BOOL"
              "name": "used to create suffix for BQ table"
            }
          ]
        }
      }
    }

...

Panel

The load configuration defines the GCP properties for loading content to BQ.

The requirements specified above (#1-#4) are necessary if the data is accessed from a remote (outside Boxalino) scope.

If your integration exports the data directly to Boxalino (as described in https://boxalino-internal.atlassian.net/wiki/spaces/DOC/pages/2606792705/Boxalino+Data+Integration+DI-SAAS+-+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ), please continue with the Data Transformation step.

2. Data Transformation

Once the data is loaded in GCS and BQ, it is time to transform it into the necessary data structure.

The transformation happens by preparing a BQ SQL statement for every node that is part of the doc_X data structure (ex: title, stock, price, string_attributes, etc). The output of each SQL statement must match the expected format/structure of its property.

Info

For the POC of our ELT solution, this step was done in-house by Boxalino; further documentation will be provided.

Dataform

The transformation happens with the help of Google Dataform (https://cloud.google.com/dataform).

This implies the following:

  1. The client has access to a GCP project

  2. The client will create a Dataform repository https://cloud.google.com/dataform/docs/repositories

  3. The client has access to a GitHub or GitLab repository (to connect it to the Dataform repository) https://cloud.google.com/dataform/docs/connect-repository

  4. The client has given “Dataform Admin” permission to Boxalino Service Account boxalino-dataform@rtux-data-integration.iam.gserviceaccount.com

When using dataform for transforming the exported data (your custom CSV/JSONL files) into the Boxalino Data Structure, the JSON body (for the SYNC request) carries the following information (next to connector and di):

Code Block
"transform": {
  "vars": {
    "var1": "value"
  },
  "tags": [
    "order"
  ],
  "dataform": {
    "project": "rtux-data-integration",
    "location": "europe-west1",
    "repository": "boxalino-di-saas-elt",
    "workspace": "dataform"
  }
}
Note

The values for dataform are available directly in your dataform project link:
https://console.cloud.google.com/bigquery/dataform/locations/<location>/repositories/<repository>/workspaces/<workspace>?project=<project>
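For example (a hypothetical link; substitute your own values): for
https://console.cloud.google.com/bigquery/dataform/locations/europe-west1/repositories/my-di-repo/workspaces/dataform?project=my-gcp-project
the dataform node would be:

Code Block
"dataform": {
  "project": "my-gcp-project",
  "location": "europe-west1",
  "repository": "my-di-repo",
  "workspace": "dataform"
}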

The DI-SAAS SYNC request

The DI request uses the same headers (client, tm, mode, type, authorization) and a JSON request body that provides mapping details between the loaded .jsonl files and their data meaning.

  • it will load the files to BQ in a T (transform) dataset (ex: SELECT * FROM rtux-data-integration.<account>_T.<tm>_<doc_X>_<property>_<name>)

  • it will run the transform flow (steps #1-#3 from Data Integration)

    • generating each doc_X JSONL content

    • loading the doc_X JSONL in your GCS bucket

    • loading the doc_X JSONL to BQ

  • it will run the SYNC request (see Sync Request) for the process

    • loads the content in core tables in BQ

    • loads the content in the respective data index (for products)

REQUEST DEFINITION

As an integrator, please create the below request to the provided endpoint.

Info

There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino.

Endpoint (production)

https://boxalino-di-saas-krceabfwya-ew.a.run.app

Action

/sync

Method

POST

Body

https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

Headers

  • Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
    note: use the API credentials from your Boxalino account that have the ADMIN role assigned

  • Content-Type: application/json

  • client: account name

  • mode: data sync mode: F for full, D for delta, E for enrichments

  • type: product, user, content, user_content, order.
    if left empty - it will check for all tables with the given tm

  • tm: (optional) time, in format: YmdHis
    Note: if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ) - you must re-use the tm from the load step
    technical: used to identify the documents version

  • ts: (optional) timestamp, must be a millisecond-based UNIX timestamp

  • dev: (optional) use this to mark the content for the dev data index

  • loadOnly: (optional) if true, the data will only be loaded to BQ

  • transformOnly: (optional) if true, the data will only be transformed (ex: dataform tag execution)
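For instance (a hypothetical transform-only re-run; request-body.json stands for the same JSON payload described in the overview above):

Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <boxalino-account-name>" \
  -H "mode: F" \
  -H "type: product" \
  -H "tm: 202303112000" \
  -H "transformOnly: true" \
  -H "Authorization: Basic <base64_encode(api_key:api_secret)>" \
  -d @request-body.json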

Expand
titleSample

This is a sample of a triggered full product sync (minimal data):

Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"gcs\",\"options\":{\"source\":{\"bucket\":\"my-account-bucket\",\"pattern\":\"boxalino/product/__NODASHDATE__/\"}},\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"write_disposition\":\"WRITE_TRUNCATE\"},\"doc_product\":{\"entity\":[{\"source\":\"product.jsonl\",\"autodetect\":0,\"schema\":\"ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME\"}],\"product_relations\":[{\"source\":\"crosssell.jsonl\",\"name\":\"crosssell\"}],\"link\":[{\"source\":\"urlrecord.jsonl\"}]}}},\"di\":{\"configuration\":{\"languages\":[\"de\",\"fr\"],\"currencies\":[\"CHF\"],\"mapping\":{\"languagesCountryCode\":{\"de\":\"CH_de\",\"fr\":\"CH_fr\"}},\"default\":{\"language\":\"de\",\"currency\":\"CHF\"}}}}]" \
  -H "Content-Type: application/json" \
  -H "account: <boxalino-account-name>" \
  -H "mode: F" \
  -H "tm: 202303112000" \
  -H "type: product" \
  -H "Authorization: <base64_encode(api_key:api_secret)>" 

The request above created the following resources:

  • GCS (raw data, as migrated from the connector)

    1. gs://prod_rtux-data-integration_<account>/product/202303112000/F_product.jsonl

    2. gs://prod_rtux-data-integration_<account>/product/202303112000/F_crosssell.jsonl

    3. gs://prod_rtux-data-integration_<account>/product/202303112000/F_urlrecord.jsonl

  • BQ (the T dataset - raw data as loaded into BQ)

    1. rtux-data-integration.<account>_T.202303112000-doc_product-entity

    2. rtux-data-integration.<account>_T.202303112000-doc_product-product_relations-crosssell

    3. rtux-data-integration.<account>_T.202303112000-doc_product-link

  • GCS (transformed doc_X JSONL)

    1. gs://prod_rtux-data-integration_<account>/doc_product_F_202303112000.jsonl

    2. gs://prod_rtux-data-integration_<account>/doc_language_F_202303112000.jsonl

  • BQ (the F dataset - transformed data in the doc_X data structure)

    1. rtux-data-integration.<account>_F.doc_product_F_202303112000

    2. rtux-data-integration.<account>_F.doc_language_F_202303112000

Panel

It is possible to configure dynamic source names (to identify the files loaded in GCS), ex: if the content to be loaded in BQ is exported in batches or has a dynamic suffix.

The supported variations are:
__DATE__ (Y-m-d), __NODASHDATE__ (Ymd), __TM__ (Y-m-dTH:i:s), __NODASHTM__ (YmdHis), __DATE_PATH__ (Y/m/d), __JN_DATE_PATH__ (Y/n/j), __DATE_YESTERDAY__ (Y-m-d -1 day), __DATE_MONTH__ (Y-m-01), __NODASHDATE_MONTH__ (Ym01), __NODASHDATE_YESTERDAY__ (Ymd -1 day), __RTM__ (the tm value from your SYNC/CORE request)

For example, if there are 3 chunks for order-<tm>-1.jsonl, the load configuration can be:
"load": {"options": {<the default BQ-load options>}, "doc_order": {"entity": [{"source": "doc_order___NODASHTM__*.jsonl"}]}}


Expand
titleSample request with Boxalino connector (ex: data is already available in the Boxalino GCP project, in the client's GCS bucket)
Code Block
[
  {
    "connector": {
      "type": "boxalino",
      "load": {
        "options": {
          "format": "NEWLINE_DELIMITED_JSON",
          "autodetect": true,
          "schema": "",
          "skipRows": 0,
          "max_bad_records": 0,
          "write_disposition": "WRITE_TRUNCATE"
        },
        "doc_product": {
          "entity": [
            {
              "source": "product.jsonl",
              "autodetect": 0,
              "schema": "ProductId:INT64,Name:STRING,ProductTypeId:INT64,Sku:STRING,GroupId:STRING,Price:FLOAT64,SalePrice:FLOAT64,created:DATETIME"
            }
          ],
          "product_relations": [
            {
              "source": "crosssell.jsonl",
              "name": "crosssell"
            }
          ],
          "link": [
            {
              "source": "urlrecord.jsonl"
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": [
          "de",
          "fr"
        ],
        "currencies": [
          "CHF"
        ],
        "mapping": {
          "languagesCountryCode": {"de": "CH_de", "fr": "CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]


The DI-SAAS CORE request

Use this flow if the purpose is to expose data as-is from your system to Boxalino. The data will be available in the client's _core dataset in the Boxalino ecosystem.

In order to achieve this, there are 2 requests to be made:

  1. load data to GCS https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino)

  2. load data from GCS to the <client>_core.<destination> table

Note

The data payload (CSV or JSONL) must have a field to be used as the primary field (ex: id).
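For instance (a hypothetical two-line JSONL payload, with id as the primary field):

Code Block
{"id": "1", "title": "Getting started", "updated_at": "2023-11-06 00:00:00"}
{"id": "2", "title": "Size guide", "updated_at": "2023-11-06 00:00:00"}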

REQUEST DEFINITION

As an integrator, please create the below request to the provided endpoint.

Info

There should be a process within your own project that triggers the data sync between a 3rd party source (connector) and Boxalino.

Endpoint (production)

https://boxalino-di-saas-krceabfwya-ew.a.run.app

Action

/core

Method

POST

Body

https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#DI-SAAS-Request-(Overview)

Headers

  • Authorization: Basic base64<<DATASYNC API key : DATASYNC API Secret>>
    note: use the API credentials from your Boxalino account that have the ADMIN role assigned

  • Content-Type: application/json

  • client: account name

  • mode: data sync mode: F for full, D for delta, E for enrichments

  • type: product, user, content, user_content, order.
    if left empty - it will check for all tables with the given tm

  • tm: (optional) time, in format: YmdHis
    Note: if the data was loaded in Boxalino GCS (https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Loading-content-to-Boxalino-GCS-(connector%3A-boxalino) ) - you must re-use the tm from the load step
    technical: used to identify the documents version

  • dev: (optional) use this to mark the content for the dev data index

Expand
titleLoading data to _core dataset

For the integration below, the rti JSON payload is loaded in the client's dataset.

Note

The same tm value is used.

Code Block
curl "https://boxalino-di-stage-krceabfwya-ew.a.run.app/transformer/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "mode: T" \
  -H "doc: <filename>.json" \
  -d "<JSONL data>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Code Block
curl "https://boxalino-di-saas-krceabfwya-ew.a.run.app/core" \
  -X POST \
  -d "[{\"connector\":{\"type\":\"boxalino\",\"load\":{\"options\":{\"format\":\"NEWLINE_DELIMITED_JSON\",\"field_delimiter\":\"\",\"autodetect\":true,\"schema\":\"\",\"skipRows\":0,\"max_bad_records\":0,\"quote\":\"\",\"write_disposition\":\"WRITE_TRUNCATE\",\"create_disposition\":\"\",\"encoding\":\"\"},\"doc_content\":{\"rti\":[{\"source\":\"<filename>*.json\",\"autodetect\":1,\"destination\":\"<table name from core dataset>\",\"primary_field\":\"<primary field in your table>\"}]}}}}]" \
  -H "Content-Type: application/json" \
  -H "mode: F" \
  -H "dev: 0" \
  -H "tm: 20231106000000" \
  -H "type: content" \
  -H "client: <client>" \
  -H "Authorization: Basic <encode<apiKey:apiSecret>>"
Note

Note the JSON payload for the table definition:

"doc_content": {
  "rti": [
    {
      "source": "rti*.jsonl",
      "autodetect": 1,
      "project": "project-id",          // optional, only if the load is in a different GCP project
      "dataset": "dataset-name",        // optional, only if the load is in a different GCP project
      "destination": "doc_content_rti", // BQ table
      "primary_field": "id"             // column used as the primary key
    }
  ]
}

The final output table is <client>_core.doc_content_rti.
Depending on the write_disposition property, the data is either rewritten in the table or appended.
If the optional project/dataset details are set, the load happens in a different GCP project (project-id.dataset-name.doc_content_rti).

Loading content to GCS (connector: boxalino, gcs)

Note

By following these steps, you can push your data (JSONL or CSV) directly to your client's GCS bucket in the Boxalino scope, or to a GCS bucket in a private GCP project (access share required). After all data has been loaded in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type: boxalino or connector → type: gcs.

There are 2 available flows, based on the size of your data:

  1. The content is exported as the body of your POST request

  2. The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )

Option #1 is recommended for data volumes less than 32MB.

Option #2 is allowed for any data size.

The described steps can be followed if you want to export your data to the Boxalino GCP project, in your client's account.

The LOAD request will create a GCS file in your project's GCS bucket: gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<doc>

  • there is no content validation at this step

Panel

When loading data to be transformed, make sure to use the T (transform) mode. This ensures that the data is loaded in BQ in the transform dataset (<client>_T), which is available for only 24h.

A. Loading content less than 32 MB

Files under 32MB can be loaded directly as REQUEST BODY CONTENT in Boxalino's GCS.

The sample request below would create the languagefeed_20220317.csv in your GCS bucket gs://prod_rtux-data-integration_<account>/<type>/<tm>/<mode>_<filename>

Code Block
curl "https://boxalino-di-process-krceabfwya-ew.a.run.app/saas/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: T" \
  -H "gcp_project: <private-client-GCP-project>" \
  -H "gcs_bucket: <private-client-GCS-bucket>" \
  -H "doc: <filename>" \
  -d "<JSONL>" \
  -H "Authorization: Basic <encode of the account>"
Tip

Step #1 must be repeated for every file that is required to be added for the given process (same tm, mode & type)

Expand
titleSample

For example, the request below would create gs://prod_rtux-data-integration_<account>/product/20230301161554/T_doc_language.json in your client's GCS bucket. This is necessary data for the type: product integration.

Code Block
curl --connect-timeout 30 --max-time 300 "https://boxalino-di-process-krceabfwya-ew.a.run.app/saas/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: T" \
  -H "doc: doc_language.json" \
  -d "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}" \
  -H "Authorization: Basic <encode of the account>"
Note

The same tm value must be used across your other requests. This identifies the timestamp of your computation process.

Warning

If the service response is an error like: 413 Request Entity Too Large - please use the 2nd flow.

B. Loading undefined data size

For content over 32MB, we provide an endpoint to access a Signed GCS URL that will put all your streamed content into a file (currently there is no defined file size limit in GCS).

Read more about Google Cloud Signed URLs: https://cloud.google.com/storage/docs/access-control/signed-urls (response samples, uses, etc)

This is the generic POST request:

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-process-krceabfwya-ew.a.run.app/saas/load/url" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: T" \
  -H "gcp_project: <private-client-GCP-project>" \
  -H "gcs_bucket: <private-client-GCS-bucket>" \
  -H "chunk: <id>" \
  -H "doc: <filename>" \
  -H "Authorization: Basic <encode of the account>"

The response will be an upload link that can only be used to create the file in the client's GCS bucket. The link is valid for 30 minutes.

Expand
titleSample

Let's say we need a public link to upload a document called product-entity.jsonl. The following request can be used to generate the link:

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-process-krceabfwya-ew.a.run.app/saas/load/url" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: T" \
  -H "doc: product-entity.jsonl" \
  -H "Authorization: Basic <encode of the account>"
Code Block
curl --connect-timeout 60 --max-time 0 <GCS-signed-url> \
  -X PUT \
  -H "Content-Type: application/octet-stream" \
  -d "<YOUR DOCUMENT JSONL CONTENT (STREAM)>"


Panel

The use of the chunk header is required if the same file/document is exported in batches/sections.
Repeat steps #1 + #2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /saas/load/url request.
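A minimal sketch of that loop (hypothetical batch file names and account values; it assumes the response body of /saas/load/url is the plain upload link):

Code Block
# upload 3 batches of the same document, incrementing the chunk id each time
for chunk in 1 2 3; do
  # step 1: request a signed upload URL for this batch
  url=$(curl -s "https://boxalino-di-process-krceabfwya-ew.a.run.app/saas/load/url" \
    -X POST \
    -H "Content-Type: application/json" \
    -H "client: <account>" \
    -H "tm: 20230301161554" \
    -H "type: product" \
    -H "mode: T" \
    -H "chunk: ${chunk}" \
    -H "doc: product-entity.jsonl" \
    -H "Authorization: Basic <encode of the account>")

  # step 2: stream this batch's content to the signed URL
  curl --connect-timeout 60 --max-time 0 "${url}" \
    -X PUT \
    -H "Content-Type: application/octet-stream" \
    --data-binary "@product-entity-${chunk}.jsonl"
done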

Tip

Steps #1 - #2 must be repeated for every file that is required to be added for the given process (same tm, mode & type)

Only after all the files are available in GCS can you move on to step #3.

Tip

After all required documents (doc) for the given type data sync (ex: product, order, etc) have been made available in GCS, the DI-SAAS request can be called https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#The-DI-request , assigning connector → type: boxalino.

Technical notes

  1. Every request to our nodes returns a response (200 or 400). This can be used internally for testing purposes as well.

  2. It is possible to review the STATUS of the requests at any given time (Status Review).
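For example (a hypothetical smoke test; request-body.json stands for the JSON payload described earlier):

Code Block
# capture the HTTP status of a sync request: 200 = accepted, 400 = rejected
status=$(curl -s -o /tmp/di-response.txt -w "%{http_code}" \
  "https://boxalino-di-saas-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "mode: F" \
  -H "type: product" \
  -H "tm: YYYYmmddHHiiss" \
  -H "Authorization: Basic <base64_encode(api_key:api_secret)>" \
  -d @request-body.json)

if [ "${status}" != "200" ]; then
  echo "DI-SAAS sync rejected (HTTP ${status}):"
  cat /tmp/di-response.txt
fi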

Access sharing (BQ and GCS)

The DI-SAAS flow can be used to access privately owned files in GCS and to update/create privately owned BigQuery resources as well.

Tip

If your project already shared access with your Boxalino users group, bi-with-ai.<client>@boxalino.com, the steps below are not required.

Google Cloud Storage (GCS)

If your connector is gcs and you want to push and read data to a privately owned resource, share access with boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group).

Required permissions:

  1. storage.objects.get

  2. storage.buckets.get

  3. storage.objects.list

This is achieved through the Storage Legacy Bucket Reader & Storage Legacy Object Reader roles.


The GCS bucket used in the connector definition must be located in the europe-west6 or eu zone.

If you use the Boxalino services to load data to your private GCS bucket (<endpoint>/saas/load or <endpoint>/saas/load/url), you must also give WRITE permissions to the Service Account.
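A sketch of those grants with gsutil (my-export-bucket is a hypothetical bucket name; grant your users group instead of the Service Account if preferred):

Code Block
# read access: Storage Legacy Bucket Reader + Storage Legacy Object Reader
gsutil iam ch \
  serviceAccount:boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com:roles/storage.legacyBucketReader \
  serviceAccount:boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com:roles/storage.legacyObjectReader \
  gs://my-export-bucket

# WRITE access, only needed if Boxalino services load data into this bucket
gsutil iam ch \
  serviceAccount:boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com:roles/storage.objectCreator \
  gs://my-export-bucket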

    ...

BigQuery Access (BQ)

If the goal is to load the raw GCS file to a privately owned BQ resource (defined as: load → options → load_dataset), you have to ensure that the dataset EXISTS and uses the EU location.

We recommend creating a dedicated dataset for this purpose. To avoid BQ storage costs, set an expiration date for the tables (ex: 7 days).
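For instance (a hypothetical dataset name; bq ships with the Google Cloud SDK), an EU dataset whose tables expire after 7 days:

Code Block
# 604800 seconds = 7 days
bq mk --dataset \
  --location=EU \
  --default_table_expiration=604800 \
  <private-gcp-project>:boxalino_saas_stage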


The same Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group) must have read/write access to the dataset where the GCS data is loaded. For simplicity, BigQuery Admin is suggested.

Note

Share access only on the resources that Boxalino is allowed to access. For example, if your integrations only work on a temporary dataset - boxalino_saas_stage or boxalino_saas - you can share access at resource level (instead of project level).

    ...