1 Overview
2 1. Transform your data source
3 2. Loading content to GCS and BigQuery
- 3.1 A. Loading content less than 32 MB
- 3.2 B. Loading undefined data size
  - 3.2.1 1. Make a request for public upload link
  - 3.2.2 2. Upload the content on the public link
  - 3.2.3 3. Loading content to BigQuery
4 3. How to register a compute request

Overview

As an integrator, you are tasked to expose the clients` data to Boxalino Data Structure. The following steps can be used as guidelines for a technical approach:

Transform your data source to Boxalino Data Structure Data Structure as a JSONL content
Load the transformed content to GCS Load Request
Load the content to BQ
Tell our system to compute your content Sync Request

In the upcoming sections, we will present sample cURL requests that, further on, can be translated into your own programming language of choice.

For clients who use their own Google Cloud Platform project for storage of the documents, the documented rules/naming patterns of the BigQuery resources must be used Load Request | Using private GCP resources for DI . In such scenarios, only the step #4 is of interest.

1. Transform your data source

The Boxalino Data Structure is publicly available in our git repository: https://github.com/boxalino/data-integration-doc-schema

You can use the repository to identify the data formats & data elements expected for each document.

You can test that your JSONL is valid by doing a test load in your own GCP project https://github.com/boxalino/data-integration-doc-schema#are-you-an-integrator
You can test that your JSONL is valid by using the https://github.com/boxalino/data-integration-doc-schema/blob/master/schema/generator.html (guidelines in the repository README.md)

For certain headless CMS, Boxalino has designed a Transformer service Transformer

2. Loading content to GCS and BigQuery

There are 2 available flows, based on the size of your data:

The content is exported as the body of your POST request
The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )

Option #1 is recommended for data volume less than 32MB.

Option #2 is allowed for any data size.

A. Loading content less than 32 MB

For this use-case, there is a single request to be made: Load Request | Request Definition

 curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -d "<JSONL>" \
  -H "Authorization: Basic <encode of the account>"

For example, the request bellow would create a doc_language_F_20230301161554.json in your clients` GCS bucket. This is a necessary data for the type:product integration.

curl --connect-timeout 30 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "doc: language" \
  -d "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}" \
  -H "Authorization: Basic <encode of the account>"

The same tm value must be used across your other requests. This identifies the timestamp of your computation process.

B. Loading undefined data size

1. Make a request for public upload link

This is the generic POST request:

curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "chunk: <id>" \
  -H "doc: doc_product|doc_language|doc_attribute_value|doc_attribute|doc_order|doc_user|communication_history|communication_planning|doc_user_generated_content" \
  -H "Authorization: Basic <encode of the account>"

The response will be an upload link that can only be used in order to create the document doc_<doc>_<mode>_<tm>-<chunk>.json in the clients` GCS bucket. The link is valid for 30minutes.

Lets say, we need a public link to upload the product JSONL. The following request can be used to generate the link:

2. Upload the content on the public link

3. Loading content to BigQuery

Once the content was created in GCS, it is time to import it in BigQuery.

After the request #2, the file doc_product_F_20230301161554.json is available to load in the BQ table <client>_F.doc_product_F_20230301161554 :

3. How to register a compute request

For this step, there is a single request to be made: Sync Request

For the test scenario before - product data synchronization, the following SYNC request can be made once the documents have been loaded to BQ:

Boxalino Public Knowledge Base

Integration Flow

Analytics