Overview
...
For clients who use their own Google Cloud Platform (GCP) project to store the documents, the documented rules/naming patterns for the BigQuery resources must be used: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Using-private-GCP-resources-for-DI . In such scenarios, only step #4 is of interest.
1. Transform your data source
The Boxalino Data Structure is publicly available in our git repository: https://github.com/boxalino/data-integration-doc-schema
...
You can test that your JSONL is valid by doing a test load in your own GCP project: https://github.com/boxalino/data-integration-doc-schema#are-you-an-integrator
You can also test that your JSONL is valid by using the generator: https://github.com/boxalino/data-integration-doc-schema/blob/master/schema/generator.html (guidelines in the repository README.md)
For certain headless CMSs, Boxalino has designed a Transformer service (see the Transformer page).
2. Loading content to GCS and BigQuery
Overview
There are 2 available flows, based on the size of your data:
1. The content is exported as the body of your POST request
2. The content is exported with the help of a public GCS Signed URL (https://cloud.google.com/storage/docs/access-control/signed-urls)
Option #1 is recommended for data volumes smaller than 32 MB.
Option #2 is allowed for any data size.
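The choice between the two options can be sketched as a small shell helper. This is only an illustration: `di_pick_endpoint` and `DI_DIRECT_LIMIT_BYTES` are hypothetical names, the paths `/load` and `/load/chunk` are taken from the requests in sections A and B of this page, and reading "32 MB" as 32 * 1024 * 1024 bytes is an assumption.

```shell
#!/bin/sh
# Assumption: "32 MB" is interpreted as 32 * 1024 * 1024 bytes.
# The page does not say whether the limit is binary or decimal; adjust if needed.
DI_DIRECT_LIMIT_BYTES=$((32 * 1024 * 1024))

# Hypothetical helper: prints the endpoint path to use for a given JSONL file.
di_pick_endpoint() {
    file=$1
    # wc -c counts bytes; tr strips the padding some wc implementations add.
    size=$(wc -c < "$file" | tr -d ' ')
    if [ "$size" -lt "$DI_DIRECT_LIMIT_BYTES" ]; then
        echo "/load"          # option #1: content sent as the POST body
    else
        echo "/load/chunk"    # option #2: upload through a GCS Signed URL
    fi
}
```

Since option #2 is allowed for any data size, always using `/load/chunk` is also a valid simplification.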
...
A. Loading content less than 32 MB
For this use-case, there is a single request to be made: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Request-Definition
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: F|D|I" \
-H "chunk: <batch nr>" \
-H "doc: product|language|attribute_value|attribute|order|user|communication_history|communication_planning|user_generated_content" \
-d "<JSONL>" \
-H "Authorization: Basic <encode of the account>" |
...
Warning |
---|
If the service response is an error like: |
...
B. Loading content of any size
This flow is also described on the Load Request page: https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/415432770/Load+Request#Load-in-Batches-%2F-data-%3E-32-MB
...
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: F|D|I" \
-H "chunk: <id>" \
-H "doc: doc_product|doc_language|doc_attribute_value|doc_attribute|doc_order|doc_user|communication_history|communication_planning|doc_user_generated_content" \
-H "Authorization: Basic <encode of the account>" |
The response is an upload link that can only be used to create the document doc_<doc>_<mode>_<tm>-<chunk>.json in the client's GCS bucket. The link is valid for 30 minutes.
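To make the naming pattern concrete, the sketch below derives the document name from the request headers. `di_chunk_document_name` is a hypothetical helper, not part of the service, and it assumes that `<doc>` is the bare document type (for example `product`), without a `doc_` prefix.

```shell
#!/bin/sh
# Hypothetical helper: builds the document name doc_<doc>_<mode>_<tm>-<chunk>.json
# from the values sent in the doc, mode, tm and chunk headers.
# Assumption: <doc> is the bare document type, e.g. "product".
di_chunk_document_name() {
    doc=$1
    mode=$2
    tm=$3
    chunk=$4
    # The mode header only accepts F, D or I.
    case "$mode" in
        F|D|I) ;;
        *) echo "mode must be F, D or I" >&2; return 1 ;;
    esac
    printf 'doc_%s_%s_%s-%s.json\n' "$doc" "$mode" "$tm" "$chunk"
}
```

For example, `di_chunk_document_name product F 20230301161554 1` prints `doc_product_F_20230301161554-1.json`.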
...
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/chunk" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "tm: 20230301161554" \
-H "type: product" \
-H "mode: F" \
-H "chunk: 1" \
-H "doc: product" \
-H "Authorization: Basic <encode of the account>" |
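The `tm` value in the sample above (`20230301161554`) reads as a 14-digit timestamp: year, month, day, hour, minute, second. A minimal way to generate such a value is sketched below. `di_tm` is a hypothetical helper, and the use of UTC is an assumption, since this page does not state which timezone `tm` expects.

```shell
#!/bin/sh
# Hypothetical helper: prints a 14-digit timestamp in the shape of the tm header
# (year, month, day, hour, minute, second), e.g. 20230301161554.
# Assumption: UTC; this page does not state which timezone tm expects.
di_tm() {
    date -u +%Y%m%d%H%M%S
}

# Generate the value once per export so every request can send the same tm.
DI_TM=$(di_tm)
```

The samples on this page send a `tm` header with the load, chunk, BigQuery, and sync requests, so generating the value once and reusing it across the requests of one export seems the safer choice.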
...
Read more about Google Cloud Signed URLs: https://cloud.google.com/storage/docs/access-control/signed-urls (response samples, uses, etc.)
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
The use of the header |
3. Loading content to BigQuery
...
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/load/bq" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: F|D|I" \
-H "doc: doc_product|doc_language|doc_attribute_value|doc_attribute|doc_order|doc_user|communication_history|communication_planning|doc_user_generated_content" \
-H "Authorization: Basic <encode of the account>" |
...
Tip |
---|
After all required documents (doc) for the given |
4. How to register a compute request
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
This step is required. It triggers the computation of the exported data and makes it available throughout the Boxalino Data Warehouse Ecosystem for future use. |
For this step, there is a single request to be made: Sync Request
Code Block |
---|
curl --connect-timeout 60 --max-time 300 "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
-X POST \
-H "Content-Type: application/json" \
-H "client: <account>" \
-H "dev: true|false" \
-H "tm: YYYYmmddHHiiss" \
-H "type: product|content|order|user|communication_history|communication_planning|user_generated_content" \
-H "mode: F|D|I" \
-H "project: <client GCP project>" \
-H "dataset: <client GCP dataset>" \
-H "Authorization: Basic <encode of the account>" |
...
Info |
---|
If your project uses its own private GCP project & resources, please also include the headers for |
Tip |
---|
After the SYNC request is made, the data is computed and updated in the relevant feeds (data index, real-time injections, reports, etc.) |
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
We encourage you to have a stable fallback & retry policy. |
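One possible shape for such a retry policy is a wrapper that re-runs a command with exponential backoff. The sketch below is an illustration only: `di_retry`, the attempt count, and the delays are hypothetical choices, not values prescribed by the service.

```shell
#!/bin/sh
# Hypothetical helper: retries a command with exponential backoff.
# Usage: di_retry <max_attempts> <initial_delay_seconds> <command> [args...]
# Note: POSIX sh has no local variables, so the wrapped command must not
# modify max_attempts, delay or attempt.
di_retry() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        # Run the command; stop as soon as it succeeds.
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        attempt=$((attempt + 1))
        delay=$((delay * 2))   # double the wait before the next attempt
    done
}

# Example (not executed here): retry the sync request up to 3 times.
# curl's --fail flag turns HTTP error responses into a non-zero exit code.
# di_retry 3 5 curl --fail --connect-timeout 60 --max-time 300 \
#     "https://boxalino-di-stage-krceabfwya-ew.a.run.app/sync" \
#     -X POST -H "client: <account>" -H "tm: <tm>" -H "type: product" -H "mode: F" \
#     -H "Authorization: Basic <encode of the account>"
```

What to do once the retries are exhausted (alerting, keeping the previous data in place, re-exporting later) is the fallback part of the policy and depends on your integration.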
Panel | ||||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
In the technical samples, the
|