
Overview

...

For clients that store the documents in their own Google Cloud Platform project, the BigQuery resources must follow the documented rules and naming patterns: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Using-private-GCP-resources-for-DI . In such scenarios, only step #4 is of interest.

1. Transform your data source

The Boxalino Data Structure is publicly available in our git repository: https://github.com/boxalino/data-integration-doc-schema

You can use the repository to identify the data formats & data elements expected for each document.

  1. You can test that your JSONL is valid by doing a test load in your own GCP project: https://github.com/boxalino/data-integration-doc-schema#are-you-an-integrator

  2. You can test that your JSONL is valid by using the generator at https://github.com/boxalino/data-integration-doc-schema/blob/master/schema/generator.html (see the guidelines in the repository README.md)

For certain headless CMS platforms, Boxalino has designed a Transformer service.

2. Loading content to GCS and BigQuery

...

There are 2 available flows, based on the size of your data:

  1. The content is exported as the body of your POST request

  2. The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls)

Option #1 is recommended for data volumes of less than 32 MB.

Option #2 is allowed for any data size.
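The choice between the two flows can be automated from the size of the export file. The following is a minimal sketch, not part of the Boxalino tooling: the function name choose_flow, the variable LIMIT_BYTES, and the reading of "32 MB" as 32 * 1024 * 1024 bytes are all assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the load flow from the size of a JSONL file.
# Assumption: "32 MB" is interpreted as 32 * 1024 * 1024 bytes.
LIMIT_BYTES=$((32 * 1024 * 1024))

# choose_flow FILE
# Prints "post-body" (option #1) or "signed-url" (option #2).
choose_flow() {
  local file="$1" size
  size=$(wc -c < "$file") || return 1
  size=$((size))  # normalise the padded output of some wc implementations
  if [ "$size" -lt "$LIMIT_BYTES" ]; then
    echo "post-body"
  else
    echo "signed-url"
  fi
}
```

For example, `choose_flow products.jsonl` prints the flow name for a file called products.jsonl. Since option #2 is allowed for any data size, always using the signed URL flow is a simpler alternative.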

...

A. Loading content less than 32 MB

For this use-case, there is a single request to be made: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Request-Definition

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "chunk: <batch nr>" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -H "Authorization: Basic <encode of the account>" \
  -d "<JSONL>"
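When the JSONL lives in a file, the request above can be wrapped in a small function. This is a sketch under stated assumptions: DI_CLIENT, DI_AUTH and DI_DEV are hypothetical environment variables for the account, the Basic token and the dev flag; generating tm in UTC and defaulting chunk to 1 are guesses, not documented behaviour. It uses curl's --data-binary with @file because -d @file strips newlines, which would merge the JSONL rows into one line.

```shell
#!/usr/bin/env bash
# Sketch of flow A. DI_CLIENT, DI_AUTH and DI_DEV are hypothetical
# environment variables; they are not defined by the Boxalino service.

# Build the tm header value (YYYYmmddHHiiss). Using UTC is an assumption.
make_tm() {
  date -u +%Y%m%d%H%M%S
}

# load_small FILE TYPE DOC MODE TM [CHUNK]
load_small() {
  local file="$1" type="$2" doc="$3" mode="$4" tm="$5" chunk="${6:-1}"
  # CURL can be overridden (for example CURL=echo) to inspect the arguments.
  "${CURL:-curl}" --connect-timeout 60 --max-time 300 --fail \
    "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
    -X POST \
    -H "Content-Type: application/json" \
    -H "client: ${DI_CLIENT}" \
    -H "dev: ${DI_DEV:-true}" \
    -H "tm: ${tm}" \
    -H "type: ${type}" \
    -H "mode: ${mode}" \
    -H "chunk: ${chunk}" \
    -H "doc: ${doc}" \
    -H "Authorization: Basic ${DI_AUTH}" \
    --data-binary "@${file}"
}
```

A possible call is `tm=$(make_tm); load_small product.jsonl product product F "$tm"`. Computing tm once and reusing the variable keeps the value identical across the requests of one export.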

...

Warning

If the service responds with an error such as 413 Request Entity Too Large, please use the second flow (upload through a signed URL).

...

B. Loading undefined data size

...

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "chunk: <id>" \
  -H "doc: doc_product|doc_language|doc_attribute_value|doc_attribute|doc_order|doc_user|communication_history|communication_planning|doc_user_generated_content" \
  -H "Authorization: Basic <encode of the account>"

The response is an upload link that can only be used to create the document doc_<doc>_<mode>_<tm>-<chunk>.json in the client's GCS bucket. The link is valid for 30 minutes.

...

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "chunk: 1" \
  -H "doc: product" \
  -H "Authorization: Basic <encode of the account>"

...

Read more about Google Cloud signed URLs (response samples, uses, etc.): https://cloud.google.com/storage/docs/access-control/signed-urls

Panel

The chunk header is required if the data is exported in batches.
Repeat steps 1 and 2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk header for each /load/chunk request.

Move on to step #3 only after the full content is available in GCS.
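The batch loop described above can be sketched as follows. Several points are assumptions rather than documented behaviour: that the /load/chunk response body is the plain signed URL, that the upload is an HTTP PUT without extra headers, and that each batch is a *.jsonl file in one directory. The names upload_batches, request_upload_url, upload_file and the DI_* environment variables are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch of flow B for several batches. DI_CLIENT, DI_AUTH, DI_TYPE,
# DI_MODE and DI_DOC are hypothetical environment variables.

# Step 1: ask the service for an upload link for one chunk.
# Assumption: the response body is the signed URL itself.
request_upload_url() {
  local tm="$1" chunk="$2"
  curl --connect-timeout 60 --max-time 300 --fail --silent --show-error \
    "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
    -X POST \
    -H "Content-Type: application/json" \
    -H "client: ${DI_CLIENT}" \
    -H "tm: ${tm}" \
    -H "type: ${DI_TYPE}" \
    -H "mode: ${DI_MODE}" \
    -H "chunk: ${chunk}" \
    -H "doc: ${DI_DOC}" \
    -H "Authorization: Basic ${DI_AUTH}"
}

# Step 2: upload the file to the signed URL.
# Assumption: a plain PUT is accepted; --upload-file makes curl use PUT.
upload_file() {
  local url="$1" file="$2"
  curl --fail --silent --show-error --upload-file "$file" "$url"
}

# upload_batches DIR TM
# Repeats steps 1 and 2 per file, incrementing the chunk value each time.
upload_batches() {
  local dir="$1" tm="$2" chunk=0 file url
  for file in "$dir"/*.jsonl; do
    [ -e "$file" ] || continue
    chunk=$((chunk + 1))
    url=$(request_upload_url "$tm" "$chunk") || return 1
    upload_file "$url" "$file" || return 1
    echo "chunk ${chunk}: ${file}"
  done
}
```

The loop stops at the first failed request, so a partially uploaded export is never followed by the BigQuery load by accident.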

3. Loading content to BigQuery

...

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "doc: doc_product|doc_language|doc_attribute_value|doc_attribute|doc_order|doc_user|communication_history|communication_planning|doc_user_generated_content" \
  -H "Authorization: Basic <encode of the account>"

...

Tip

After all required documents (doc) for the given data sync type (e.g. product, order) have been made available in BigQuery, the compute request can be made.

4. How to register a compute request

Panel

This step is required. It triggers the computation of the exported data and makes it available throughout the Boxalino Data Warehouse Ecosystem for future use.

For this step, there is a single request to be made: Sync Request

Code Block
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "project: <client GCP project>" \
  -H "dataset: <client GCP dataset>" \
  -H "Authorization: Basic <encode of the account>"

...

Info

If your account uses its own private GCP project and resources, also include the project and dataset headers. For more options, always review the request definition: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/394559761/Sync+Request#Request-Definition
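The optional headers can be added conditionally, so that the same function serves both setups. This is an illustrative sketch: sync_request and the DI_CLIENT, DI_AUTH and DI_DEV environment variables are hypothetical names, and only the headers shown in the sample request above are sent.

```shell
#!/usr/bin/env bash
# Sketch of the sync request with optional private-GCP headers.
# DI_CLIENT, DI_AUTH and DI_DEV are hypothetical environment variables.

# sync_request TM TYPE MODE [PROJECT DATASET]
sync_request() {
  local tm="$1" type="$2" mode="$3" project="${4:-}" dataset="${5:-}"
  # Collect the curl options in the positional parameters.
  set -- \
    -X POST \
    -H "Content-Type: application/json" \
    -H "client: ${DI_CLIENT}" \
    -H "dev: ${DI_DEV:-true}" \
    -H "tm: ${tm}" \
    -H "type: ${type}" \
    -H "mode: ${mode}" \
    -H "Authorization: Basic ${DI_AUTH}"
  # Add project and dataset only when both are provided.
  if [ -n "$project" ] && [ -n "$dataset" ]; then
    set -- "$@" -H "project: ${project}" -H "dataset: ${dataset}"
  fi
  # CURL can be overridden (for example CURL=echo) to inspect the arguments.
  "${CURL:-curl}" --connect-timeout 60 --max-time 300 --fail \
    "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" "$@"
}
```

A call such as `sync_request 20230301161554 product F` omits both headers, while passing two further arguments adds them.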

Tip

After the SYNC request is made, the data is computed and updated in the relevant feeds (data index, real-time injections, reports, etc.).

Panel

We encourage you to implement a stable fallback and retry policy.
We also recommend reviewing the status of your requests as they happen: Status Review
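One possible shape for a retry policy is a wrapper with exponential backoff. The sketch below is a generic pattern, not a Boxalino recommendation; the function name retry, the attempt count and the delays are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical retry helper with exponential backoff.

# retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay after every failed attempt.
retry() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after ${attempt} attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    attempt=$((attempt + 1))
  done
}
```

Usage could look like `retry 3 5 curl --fail ...` with the request options from the samples. The --fail option matters here: without it, curl exits with status 0 on HTTP error responses and the wrapper would never retry.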

Panel

In the technical samples, the stage endpoint https://boxalino-di-stage-krceabfwya-ew.a.run.app was used (for testing purposes). Once your integration flow is ready for production, make sure to use the appropriate endpoint for each mode:

  1. For mode: F (full) data pushes: https://boxalino-di-full-krceabfwya-ew.a.run.app

  2. For mode: D (delta) data pushes: https://boxalino-di-delta-krceabfwya-ew.a.run.app

  3. For mode: I (instant) data pushes: https://boxalino-di-instant-krceabfwya-ew.a.run.app
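The mapping from mode to production endpoint can be kept in one place, so the rest of a script only passes the mode value. A minimal sketch; the function name di_endpoint is hypothetical, the URLs are the ones listed above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: resolve the production endpoint for a mode value.

# di_endpoint MODE   (MODE is F, D or I)
di_endpoint() {
  case "$1" in
    F) echo "https://boxalino-di-full-krceabfwya-ew.a.run.app" ;;
    D) echo "https://boxalino-di-delta-krceabfwya-ew.a.run.app" ;;
    I) echo "https://boxalino-di-instant-krceabfwya-ew.a.run.app" ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

For example, `"$(di_endpoint D)/load"` yields the delta endpoint for the load request.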