Overview

As an integrator, you are tasked to expose the clients` data to Boxalino Data Structure. The following steps can be used as guidelines for a technical approach:

Transform your data source to Boxalino Data Structure Data Structure as a JSONL content
Load the transformed content to GCS Load Request
Load the content to BQ
Tell our system to compute your content Sync Request

In the upcoming sections, we will present sample cURL requests that, further on, can be translated into your own programming language of choice.

For clients who use their own Google Cloud Platform project for storage of the documents, the documented rules/naming patterns of the BigQuery resources must be used https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Using-private-GCP-resources-for-DI . In such scenarios, only the step #4 is of interest.

1. Transform your data source

The Boxalino Data Structure is publicly available in our git repository: https://github.com/boxalino/data-integration-doc-schema

You can use the repository to identify the data formats & data elements expected for each document.

You can test that your JSONL is valid by doing a test load in your own GCP project https://github.com/boxalino/data-integration-doc-schema#are-you-an-integrator
You can test that your JSONL is valid by doing a test with the generator https://github.com/boxalino/data-integration-doc-schema/blob/master/schema/generator.html

For certain headless CMS, Boxalino has designed a Transformer service Transformer

2. Loading content to GCS and BigQuery

There are 2 available flows, based on the size of your data:

The content is exported as the body of your POST request
The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )

Option #1 is allowed for data volume less than 32MB.

Option #2 is allowed for any data size.

1. Loading content less than 32 MB

For this use-case, there is a single request to be made: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Request-Definition

 curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -d "<JSONL>" \
  -H "Authorization: Basic <encode of the account>"

For example, the request bellow would create a doc_language_F_20230301161554.json in your clients` GCS bucket. This is a necessary data for the type:product integration.

curl --connect-timeout 30 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "doc: language" \
  -d "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}" \
  -H "Authorization: Basic <encode of the account>"

The same tm value must be used across your other requests. This identifies the timestamp of your computation process.

The use of /load endpoint is also loading the content to BigQuery.

If the service response is an error like: 413 Request Entity Too Large - please use the 2nd flow.

2. Loading undefined data size

This flow is also described in other pages https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Load-in-Batches-%2F-data-%3E-32-MB

1. Make a request for public upload link

This is the generic POST request:

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "chunk: <id>" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -H "Authorization: Basic <encode of the account>"

The response will be an upload link that can only be used in order to create the document doc_<doc>_<mode>_<tm>-<chunk>.json in the clients` GCS bucket. The link is valid for 30minutes.

Lets say, we need a public link to upload the product JSONL. The following request can be used to generate the link:

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "chunk: 1" \
  -H "doc: product" \
  -H "Authorization: Basic <encode of the account>"

2. Upload the content on the public link

curl --connect-timeout 60 --timeout 0 <GCS-signed-url> \
    -X PUT \
    -H "Content-Type: application/octet-stream" \
    -d "<JSONL>"

Read more about Google Cloud Signed URL https://cloud.google.com/storage/docs/access-control/signed-urls (response samples, uses, etc)

The use of the header chunk is required if the data is exported in batches.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /load/chunk request.

Only after the full content is available in GCS, you can move on to step#3.

3. Loading content to BigQuery

Once the content was created in GCS, it is time to import it in BigQuery.

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -H "Authorization: Basic <encode of the account>"

After the request #2, the file doc_product_F_20230301161554.json is available to load in the BQ table <client>_F.doc_product_F_20230301161554 :

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "Authorization: Basic <encode of the account>"

After all required documents (doc) for the given type data sync (ex: product, order, etc) have been made available in BigQuery, the computation request can be called for.

3. How to register a compute request

This step is required. With this step, the data exported is being computed and made available throughout the Boxalino Data Warehouse Ecosystem for future uses.

For this step, there is a single request to be made: Sync Request

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "Authorization: Basic <encode of the account>"

For the test scenario before - product data synchronization, the following SYNC request can be made once the documents have been loaded to BQ:

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "Authorization: Basic <encode of the account>"

If your project uses their own private GCP project & resources, please add as well the headers for project, dataset

After making the SYNC request, the data is being computed and updated in relevant feeds (data index, real time injections, reports, etc)

We encourage to have a stable fallback & retry policy.
We recommend to also review the status of your requests as they happen: Status Review

Integration Flow