Overview
As an integrator, you are tasked with exposing the client's data in the Boxalino Data Structure. The following steps can be used as guidelines for a technical approach:
1. Transform your data source to the Boxalino Data Structure as JSONL content (see Data Structure)
2. Load the transformed content to GCS (see Load Request)
3. Load the content to BQ
4. Tell our system to compute your content (see Sync Request)
In the upcoming sections, we present sample cURL requests that can then be translated into your programming language of choice.
For clients who use their own Google Cloud Platform project to store the documents, the documented rules and naming patterns for the BigQuery resources must be used: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Using-private-GCP-resources-for-DI . In such scenarios, only step #4 is of interest.
1. Transform your data source
The Boxalino Data Structure is publicly available in our git repository: https://github.com/boxalino/data-integration-doc-schema
You can use the repository to identify the data formats & data elements expected for each document.
You can test that your JSONL is valid by doing a test load in your own GCP project https://github.com/boxalino/data-integration-doc-schema#are-you-an-integrator
You can test that your JSONL is valid by doing a test with the generator https://github.com/boxalino/data-integration-doc-schema/blob/master/schema/generator.html
For certain headless CMS, Boxalino has designed a Transformer service (see Transformer).
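As a rough illustration of the transformation step, the sketch below serializes a list of records as JSONL (one JSON object per line). The helper name `to_jsonl` is hypothetical, and the two fields are borrowed from the `doc: language` sample in this guide; the actual fields for each document should be checked against the schema repository.

```python
import json
from typing import Iterable, Mapping


def to_jsonl(records: Iterable[Mapping]) -> str:
    """Serialize records as JSONL: one compact JSON object per line."""
    return "\n".join(
        json.dumps(record, ensure_ascii=False, separators=(",", ":"))
        for record in records
    )


# Illustrative records only; validate real documents against the schema repository.
languages = [
    {"language": "en", "country_code": "en-GB"},
    {"language": "de", "country_code": "de-CH"},
]
jsonl = to_jsonl(languages)
```

Keeping each object on a single line matters: a pretty-printed JSON document spanning several lines is not valid JSONL.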
2. Loading content to GCS and BigQuery
There are 2 available flows, based on the size of your data:
1. The content is exported as the body of your POST request
2. The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls )
Option #1 is allowed for data volumes under 32 MB.
Option #2 is allowed for any data size.
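The choice between the two flows can be made programmatically from the size of the serialized payload. This is a minimal sketch: the function name is hypothetical, and reading "32 MB" as 32 * 1024 * 1024 bytes is an assumption that should be confirmed before relying on it.

```python
# Assumption: "32 MB" is interpreted as 32 * 1024 * 1024 bytes.
MAX_DIRECT_LOAD_BYTES = 32 * 1024 * 1024


def choose_endpoint(payload: bytes) -> str:
    """Return '/load' for bodies under the limit, otherwise '/load/chunk'."""
    if len(payload) < MAX_DIRECT_LOAD_BYTES:
        return "/load"
    return "/load/chunk"
```

Since option #2 accepts any data size, an integration that cannot predict its payload size can reasonably use the chunked flow for everything.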
1. Loading content less than 32 MB
For this use-case, there is a single request to be made: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Request-Definition
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -d "<JSONL>" \
  -H "Authorization: Basic <encode of the account>"
For example, the request below would create doc_language_F_20230301161554.json in your client's GCS bucket. This document is required for the type: product integration.
curl --connect-timeout 30 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "doc: language" \
  -d "{\"language\":\"en\",\"country_code\":\"en-GB\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}\n{\"language\":\"de\",\"country_code\":\"de-CH\",\"creation_tm\":\"2023-03-01 16:15:54\",\"client_id\":0,\"src_sys_id\":0}" \
  -H "Authorization: Basic <encode of the account>"
The same tm value must be used across your other requests. It identifies the timestamp of your computation process.
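One way to keep tm consistent is to generate it once at the start of a run and pass the same string to every request. The sketch below formats a timestamp in the YYYYmmddHHiiss pattern; the helper name is hypothetical, and the use of UTC is an assumption, since this guide does not state which timezone is expected.

```python
from datetime import datetime, timezone


def make_tm(now: datetime) -> str:
    """Format a datetime as YYYYmmddHHiiss (14 digits)."""
    return now.strftime("%Y%m%d%H%M%S")


# Generate tm once per run and reuse it for every request of that run.
# Assumption: UTC is used here; the expected timezone is not documented in this guide.
tm = make_tm(datetime.now(timezone.utc))
```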
The /load endpoint also loads the content to BigQuery.
If the service responds with an error such as 413 Request Entity Too Large, please use the 2nd flow.
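Translated out of cURL, the /load request and the 413 fallback could look roughly like the sketch below, using only the Python standard library. The function names are hypothetical, and `auth` stands for the `<encode of the account>` value, which is treated as an opaque string because this guide does not describe how it is produced.

```python
import urllib.error
import urllib.request

BASE_URL = "https://boxalino-di-stage-krceabfwya-ew.a.run.app"


def build_load_request(account, auth, tm, data_type, mode, doc, jsonl):
    """Build (without sending) the POST /load request for one document."""
    headers = {
        "Content-Type": "application/json",
        "client": account,
        "tm": tm,
        "type": data_type,
        "mode": mode,
        "doc": doc,
        # auth is the "<encode of the account>" value, passed in as-is.
        "Authorization": f"Basic {auth}",
    }
    return urllib.request.Request(
        BASE_URL + "/load",
        data=jsonl.encode("utf-8"),
        headers=headers,
        method="POST",
    )


def send_or_flag_fallback(request, timeout=300):
    """Send the request; return False if the service answers 413."""
    try:
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        if err.code == 413:
            return False  # body too large: switch to the /load/chunk flow
        raise
```

Note that the `timeout` argument of `urlopen` applies to blocking socket operations, so it is not an exact equivalent of cURL's `--max-time`.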
2. Loading content of any size
This flow is also described in other pages https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Load-in-Batches-%2F-data-%3E-32-MB
1. Make a request for a public upload link
This is the generic POST request:
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "chunk: <id>" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -H "Authorization: Basic <encode of the account>"
The response is an upload link that can only be used to create the document doc_<doc>_<mode>_<tm>-<chunk>.json in the client's GCS bucket. The link is valid for 30 minutes.
Let's say we need a public link to upload the product JSONL. The following request can be used to generate the link:
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "chunk: 1" \
  -H "doc: product" \
  -H "Authorization: Basic <encode of the account>"
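For logging or verification, it can help to compute the name of the document that a given upload link will create. The sketch below applies the doc_<doc>_<mode>_<tm>-<chunk>.json pattern stated above; the function name is hypothetical.

```python
def chunk_filename(doc: str, mode: str, tm: str, chunk: int) -> str:
    """Name of the GCS document created through a /load/chunk upload link."""
    return f"doc_{doc}_{mode}_{tm}-{chunk}.json"


# Values taken from the sample request above.
expected_name = chunk_filename("product", "F", "20230301161554", 1)
```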
2. Upload the content to the public link
curl --connect-timeout 60 --max-time 0 "<GCS-signed-url>" \
  -X PUT \
  -H "Content-Type: application/octet-stream" \
  -d "<JSONL>"
Read more about Google Cloud Signed URLs (response samples, uses, etc.): https://cloud.google.com/storage/docs/access-control/signed-urls
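The upload itself is a plain PUT of the JSONL to the signed URL. A minimal standard-library sketch follows; the function name and the URL in the usage line are hypothetical placeholders, and the real URL is the one returned by the /load/chunk request.

```python
import urllib.request


def build_upload_request(signed_url: str, jsonl: str) -> urllib.request.Request:
    """Build (without sending) the PUT request that uploads one JSONL batch."""
    return urllib.request.Request(
        signed_url,
        data=jsonl.encode("utf-8"),
        headers={"Content-Type": "application/octet-stream"},
        method="PUT",
    )


# Hypothetical placeholder URL, for illustration only.
upload = build_upload_request(
    "https://storage.googleapis.com/example-bucket/example-object?signature=placeholder",
    '{"id":"1"}\n{"id":"2"}',
)
# Sending it would be: urllib.request.urlopen(upload)
```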
The chunk header is required if the data is exported in batches.
Repeat steps 1+2 for every data batch loaded in GCS.
Make sure to increment the value of the chunk property for each /load/chunk request.
Only after the full content is available in GCS can you move on to step #3.
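The batching loop described above can be sketched as a generator that groups JSONL lines into batches and assigns each one an incrementing chunk id. The function name and the size threshold are hypothetical, and starting the ids at 1 simply mirrors the sample request; each yielded pair would then go through one /load/chunk request followed by one upload.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_jsonl(lines: Iterable[str], max_bytes: int) -> Iterator[Tuple[int, str]]:
    """Yield (chunk_id, jsonl_batch) pairs, with chunk ids starting at 1.

    A batch is closed before adding a line that would push it past max_bytes;
    a single line larger than max_bytes still gets a batch of its own.
    """
    chunk_id = 1
    batch: List[str] = []
    size = 0
    for line in lines:
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline separator
        if batch and size + line_size > max_bytes:
            yield chunk_id, "\n".join(batch)
            chunk_id += 1
            batch, size = [], 0
        batch.append(line)
        size += line_size
    if batch:
        yield chunk_id, "\n".join(batch)
```

Splitting only on line boundaries keeps every uploaded batch a valid JSONL document on its own.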
3. Loading content to BigQuery
Once the content has been created in GCS, it is time to import it into BigQuery.
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
  -H "Authorization: Basic <encode of the account>"
After request #2, the file doc_product_F_20230301161554.json is available to load into the BQ table <client>_F.doc_product_F_20230301161554:
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/bq" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "Authorization: Basic <encode of the account>"
After all required documents (doc) for the given type of data sync (e.g. product, order) have been made available in BigQuery, the computation request can be made.
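Because the computation request should only follow once every required document is in BigQuery, an integration may want to track what has been loaded so far. The sketch below is one simple way to do that; the function name is hypothetical, and the `required` set is purely illustrative, since the documents required for each type are defined by the Boxalino documentation and not by this example.

```python
from typing import Iterable, Set


def missing_docs(required: Iterable[str], loaded: Iterable[str]) -> Set[str]:
    """Return the required documents that have not been loaded to BigQuery yet."""
    return set(required) - set(loaded)


# Illustrative only: check the documentation for the documents each type needs.
required = {"product", "language"}
loaded = ["language"]
ready_for_sync = not missing_docs(required, loaded)
```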
3. How to register a compute request
This step is required. It triggers the computation of the exported data and makes it available throughout the Boxalino Data Warehouse Ecosystem for future use.
For this step, there is a single request to be made: Sync Request
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "dev: true|false" \
  -H "tm: YYYYmmddHHiiss" \
  -H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
  -H "mode: F|D|I" \
  -H "Authorization: Basic <encode of the account>"
For the earlier test scenario (product data synchronization), the following SYNC request can be made once the documents have been loaded to BQ:
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
  -X POST \
  -H "Content-Type: application/json" \
  -H "client: <account>" \
  -H "tm: 20230301161554" \
  -H "type: product" \
  -H "mode: F" \
  -H "Authorization: Basic <encode of the account>"
If your integration uses its own private GCP project and resources, please also add the project and dataset headers.
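The header set for the SYNC request, including the optional private-GCP headers, can be assembled as in the sketch below. The function name is hypothetical, `auth` again stands for the opaque `<encode of the account>` value, and the values passed for project and dataset in the usage lines are placeholders.

```python
from typing import Dict, Optional


def sync_headers(
    account: str,
    auth: str,
    tm: str,
    data_type: str,
    mode: str,
    project: Optional[str] = None,
    dataset: Optional[str] = None,
) -> Dict[str, str]:
    """Headers for POST /sync; project and dataset are added only when given."""
    headers = {
        "Content-Type": "application/json",
        "client": account,
        "tm": tm,
        "type": data_type,
        "mode": mode,
        "Authorization": f"Basic {auth}",
    }
    if project is not None:
        headers["project"] = project
    if dataset is not None:
        headers["dataset"] = dataset
    return headers


default_headers = sync_headers("acc", "TOKEN", "20230301161554", "product", "F")
# Placeholder project and dataset names, for illustration only.
private_headers = sync_headers(
    "acc", "TOKEN", "20230301161554", "product", "F",
    project="example-project", dataset="example_dataset",
)
```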
After the SYNC request is made, the data is computed and updated in the relevant feeds (data index, real-time injections, reports, etc.).
We encourage you to implement a stable fallback and retry policy.
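A retry policy can be as simple as re-running a failed request with an increasing delay. The sketch below shows exponential backoff around any callable; the function name, the default number of attempts, and the delays are hypothetical choices and not values recommended by Boxalino.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(); after a failure wait base_delay * 2**n seconds and retry.

    The exception from the final attempt is re-raised to the caller.
    """
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

In a real integration, it is probably worth retrying only on transient failures (timeouts, 5xx responses) and not on errors such as 413, which call for switching to the chunked flow instead.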
We also recommend reviewing the status of your requests as they happen (see Status Review).