Tip |
---|
At Boxalino we are strong advocates of ELT flows (vs ETL). The major benefits are speed, transparency and maintenance: data is loaded directly into a destination system (BQ) and transformed in parallel (Dataform). |
...
Data connector & Data load
downloads the files from the remote source to GCS
loads the raw content to the transform dataset (<account>_T)*
*if the options load_dataset & load_gcp_project values are set, the loaded raw content will be available in load_gcp_project.load_dataset instead of rtux-data-integration.<account>_T
Data transformation (with logic/SQL provided by the client)
Steps #1-#4 from the Generic Flow
DI automation
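The destination logic of the load step above can be sketched as follows; `resolve_load_destination` is a hypothetical helper (not part of the Boxalino API) that applies the `load_dataset` / `load_gcp_project` override rule:

```python
def resolve_load_destination(account: str, options: dict) -> str:
    """Return the fully qualified BQ dataset that receives the raw load.

    By default the raw content lands in rtux-data-integration.<account>_T;
    when both load_dataset and load_gcp_project are set in the connector
    options, that pair takes precedence.
    """
    project = options.get("load_gcp_project")
    dataset = options.get("load_dataset")
    if project and dataset:
        return f"{project}.{dataset}"
    return f"rtux-data-integration.{account}_T"
```

For example, an account `acme` with no overrides loads into `rtux-data-integration.acme_T`, while setting both options redirects the load into the client-owned project and dataset.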
...
Code Block |
---|
[
  {
    "connector": {
      "type": "sftp|gcs|plentymarket|plytix|boxalino",
      "options": {
        // specific for each connector type
      },
      "load": {
        "options": {
          // for loading the data in BQ (BQ parameters)
          "format": "CSV|NEWLINE_DELIMITED_JSON",
          "field_delimiter": ";",
          "autodetect": true,
          "schema": "",
          "skipRows": 1,                // 1 for CSV, 0 for JSONL
          "max_bad_records": 0,
          "quote": "",
          "write_disposition": "WRITE_TRUNCATE",
          "create_disposition": "",
          "encoding": "",
          "allow_quoted_newlines": 1,
          "allow_jagged_rows": 0
        },
        "doc_X": {
          "property_node_from_doc_data_structure": [
            {
              "source": "file___NODASHDATE__.jsonl",
              "name": "<used to create suffix for BQ table>",
              "schema": ""
            }
          ]
        }
      }
    },
    "di": {
      "configuration": {
        "languages": ["de", "fr"],
        "currencies": ["CHF"],
        "mapping": {
          "languagesCountryCode": {"de": "CH_de", "fr": "CH_fr"}
        },
        "default": {
          "language": "de",
          "currency": "CHF"
        }
      }
    }
  }
]
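To make the `doc_X` mapping concrete, here is a short Python sketch that reads such a configuration and collects the source files declared under the `doc_*` nodes. The `doc_product` node name and its field values are hypothetical examples following the structure above:

```python
import json

# Hypothetical minimal config following the request-body structure above;
# doc_product stands in for doc_X, "entity" for the property node name.
config_json = """
[{"connector": {"type": "gcs",
  "load": {"options": {"format": "NEWLINE_DELIMITED_JSON", "skipRows": 0},
           "doc_product": {"entity": [
              {"source": "product___NODASHDATE__.jsonl",
               "name": "product", "schema": ""}]}}},
  "di": {"configuration": {"languages": ["de", "fr"],
                           "default": {"language": "de", "currency": "CHF"}}}}]
"""

connector = json.loads(config_json)[0]["connector"]
load = connector["load"]

# Every doc_* node maps source files in GCS to BQ table suffixes.
doc_nodes = {k: v for k, v in load.items() if k.startswith("doc_")}
sources = [spec["source"]
           for node in doc_nodes.values()
           for specs in node.values()
           for spec in specs]
```

Note that, unlike the annotated reference above, an actual request body must be strict JSON (no comments).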
...
The relevant data sources are available in .csv or JSONL (preferred) format
The files have a timestamp in the naming or in the access path (ex: product_20221206.jsonl); this helps automate the integration
The files required to update a certain data type (ex: product, order, etc) are available in the same path
The files are available on an endpoint (SFTP, GCS, 3rd party API, public URL) to which Boxalino has access
for GCS/GCP sources: access must be shared with Boxalino's Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com (storage.objects.get, storage.buckets.get and storage.objects.list, which can be achieved with the Storage Legacy Bucket Reader & Storage Legacy Object Reader roles) https://boxalino.atlassian.net/wiki/spaces/BPKB/pages/928874497/DI-SAAS+ELT+Flow#Access-sharing-(BQ-and-GCS)
for AWS / SFTP: the client's own AWS/SFTP environment with a Boxalino user & credentials
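Since the timestamp embedded in the file name drives the automation, here is a sketch of extracting it. `file_timestamp` is a hypothetical helper assuming the `_YYYYMMDD.` pattern from the example above:

```python
import re
from datetime import datetime

def file_timestamp(filename: str):
    """Extract the YYYYMMDD timestamp embedded in a data file name.

    Returns a datetime for names like product_20221206.jsonl,
    or None when no timestamp is present.
    """
    m = re.search(r"_(\d{8})\.", filename)
    return datetime.strptime(m.group(1), "%Y%m%d") if m else None
```

A file named `product_20221206.jsonl` resolves to 2022-12-06; a name without a timestamp yields `None`, which an automation can treat as "not eligible for scheduled sync".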
...
Panel |
---|
It is possible to configure dynamic The supported variations are: For example, if there are 3 chunks for |
...
The DI-SAAS SYNC request
The DI request will use the same headers (client, tm, mode, type, authorization) and a JSON request body that provides mapping details between the loaded .jsonl files and their data meaning.
...
Endpoint | production | | |
---|---|---|---|
1 | Action | /core | |
2 | Method | POST | |
3 | Body | | |
4 | Headers | Authorization | Basic base64<<DATASYNC API key : DATASYNC API Secret>> (note: use the API credentials from your Boxalino account that have the ADMIN role assigned) |
5 | | Content-Type | application/json |
6 | | client | account name |
7 | | mode | data sync mode: F for full, D for delta, E for enrichments |
8 | | type | product, user, content, user_content, order; if left empty, it will check for all tables with the given tm |
9 | | tm | (optional) time, in format YmdHis; technical: used to identify the document version |
10 | | dev | (optional) use this to mark the content for the dev data index |
11 | | | (optional) use this |
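As a sketch of assembling these headers in Python (hypothetical `sync_headers` helper; the production endpoint URL itself is not shown here):

```python
import base64

def sync_headers(api_key: str, api_secret: str, client: str,
                 mode: str = "F", doc_type: str = "product",
                 tm: str = "") -> dict:
    """Assemble the header set for the DI-SAAS sync call (POST /core).

    The Authorization value is the base64-encoded "key:secret" pair,
    as described in the table above.
    """
    token = base64.b64encode(f"{api_key}:{api_secret}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "client": client,
        "mode": mode,      # F = full, D = delta, E = enrichments
        "type": doc_type,
    }
    if tm:                 # optional YmdHis version stamp
        headers["tm"] = tm
    return headers
```

The returned dict can then be passed to any HTTP client together with the JSON request body shown earlier.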
Expand |
---|
For the integration below, the rti JSON payload is loaded in the client's dataset. |
...
Panel |
---|
The use of the header |
Tip |
---|
Steps #1 - #2 must be repeated for every file that needs to be added for the given process (same tm, mode & type). Only after all the files are available in GCS can you move on to step #3. |
Tip |
---|
After all required documents (doc) for the given |
...
The DI-SAAS flow can be used to access privately owned files in GCS and to update/create privately owned BigQuery resources as well.
Tip |
---|
If your project already shared access to your Boxalino users group, |
Google Cloud Storage (GCS)
If your connector is gcs and you want to push and read data to a privately owned resource, share access with boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group).
Required permissions: storage.objects.get, storage.buckets.get and storage.objects.list
...
If the goal is to load the raw GCS file to a privately owned BQ resource (defined as: load → options → load_dataset_stage), you have to ensure that the dataset EXISTS and that it uses location EU.
We recommend creating a dedicated dataset for this purpose. To avoid BQ storage costs, we recommend setting an expiration date for the tables (ex: 7 days).
The same Service Account boxalino-di-api@rtux-data-integration.iam.gserviceaccount.com or bi-with-ai.<client>@boxalino.com (your Boxalino users group) must have read/write access to the dataset where the GCS data is loaded. For simplicity, BigQuery Admin is suggested.
Note |
---|
Share access only on the resources that Boxalino is allowed to access. For example, if your integrations only work on a temporary dataset - |
...