BigQuery ML Opportunities
We are listing here all the key opportunities we see we could seize with BigQuery ML for e-shops based on Google Analytics 4, Google Ads, the E-shop and Boxalino tracking / requests data.
Factor analysis of parameters affecting the performance of the PDP
Identify automatically what are the factors of a good performance or a bad performance of a PDP (product detail page) (especially for the case of visitors landing on the PDP from Google Shopping).
Example: it might be that products with 3 stars or more (in their ratings) are a major factor to increase the engagement and the conversion rate.
These parameters can be related to the product data, or the session data (device, …) or the combination of both (e.g.: if the users are on the mobile, a long textual description is not a good thing)
Data Available:
Product properties
the item brand and categories are indicated in GA4 view_item events, but Boxalino has a lot more product attribute from the e-shop, including (the properties would be provided as an associate array with property names and values: ARRAY<STRUCT<name, ARRAY<string> values>>):brand
category (several levels)
price (before or after discount)
number of stars
state of discount (in discount and how much)
pictures (number of pictures, urls, …)
description (to see if the length or other aspects could play a role)
delivery state (in stock, …)
…
in example after: product_properties : have string properties for the recommended products
GA4 data
view_item indicating that it is a PDP
traffic source
device
geo
…
in the data : all the ga4 view_item event are therepage_views (to see what page view happens before / after, because it is maybe very different if it is a landing page)
in the data: array_length of prior_pages
fyi- the sample data after are computed (except for the product_properties) fully on the basis of GA4 data.
Performance metrics:
Bounce rate
in the data: next_pages would be emptyNumber of page views
in the data: array_length of next_pagessession conversion rate
in the data: count(distinct session_id) / count(distinct next_purchases.ecommerce.transaction_Id)in scope conversion rate (not only there is a conversion, but there is a conversion with this product)
in the data: compare the main item with the items of the next purchase itemssession $ value (average value per session)
in the data: next_purchases.ecommerce.purchase_revenue and session_idAOV (averge order value)
in the data: next_purchases.ecommerce.purchase_revenue / next_purchases.ecommerce.transaction_id
First attempts
As the first attempt, we did a logistic regression based on the basic item properties of GA4 (brand, category, and price):
CREATE OR REPLACE TABLE `@reports.Data_PDPPrediction_logistic_reg_v1_11M`
as
SELECT
count(distinct vi.session_id) as sessions
, sum(case when p.session_id is null then 0 else 1 end) as purchases
, vi.items[safe_offset(0)].item_brand
, vi.items[safe_offset(0)].item_category
, vi.items[safe_offset(0)].item_category2
, vi.items[safe_offset(0)].item_category3
, vi.items[safe_offset(0)].item_category4
, vi.items[safe_offset(0)].item_category5
, vi.items[safe_offset(0)].price
FROM
`@core.ga_events_enriched` as vi
left join `@core.ga_events_enriched`as p on vi.session_id = p.session_id and p.date >= date_sub(current_date(), interval 6 month) and p.event_name = 'purchase'
WHERE vi.date >= date_sub(current_date(), interval 6 month) and vi.event_name = 'view_item'
group by item_brand, item_category, item_category2, item_category3, item_category4, item_category5, price
;
CREATE OR REPLACE MODEL `@reports.Model_PDPPrediction_logistic_reg_v1_11M`
OPTIONS(
model_type='logistic_reg'
, CATEGORY_ENCODING_METHOD = 'DUMMY_ENCODING'
, CALCULATE_P_VALUES = TRUE,
ENABLE_GLOBAL_EXPLAIN = TRUE
) AS
SELECT
(IF(purchases >= it,1,0)) AS label
, item_brand
, item_category
, item_category2
, item_category3
, item_category4
, item_category5
, price
FROM
`@reports.Data_PDPPrediction_logistic_reg_v1_11M`
join unnest(GENERATE_ARRAY(1,sessions)) as it
;
The results showed a quite limited predictive power (which was to be expected with such a basic set of parameters):
The Interpretability tab shows some information about the attribution to each parameter, showing that the brand and the category have much higher attribution than the price (but this could be partially because the price is a numeric field):
Radical Tracking
As we have released our Radical Tracking on the PDP of some of our customers like Raclette-Mix 125g we did a second attempt considering these visual markers:
CREATE OR REPLACE TABLE `live_nahrin_reports.Data_PDPRadicalPrediction_logistic_reg_v1_11M`
as
SELECT
count(distinct session_id) as sessions
, count(distinct order_id) as purchases
, ifnull((select string_agg(v order by v) from unnest(feature_1) as v where v != ''),'none') as feature_1
, ifnull((select string_agg(v order by v) from unnest(feature_2) as v where v != ''),'none') as feature_2
, ifnull((select string_agg(v order by v) from unnest(feature_3) as v where v != ''),'none') as feature_3
, ifnull((select string_agg(v order by v) from unnest(feature_4) as v where v != ''),'none') as feature_4
, ifnull((select string_agg(v order by v) from unnest(feature_5) as v where v != ''),'none') as feature_5
, ifnull((select string_agg(v order by v) from unnest(feature_6) as v where v != ''),'none') as feature_6
, ifnull((select string_agg(v order by v) from unnest(feature_7) as v where v != ''),'none') as feature_7
, ifnull((select string_agg(v order by v) from unnest(feature_8) as v where v != ''),'none') as feature_8
, ifnull((select string_agg(v order by v) from unnest(feature_9) as v where v != ''),'none') as feature_9
, ifnull((select string_agg(v order by v) from unnest(feature_10) as v where v != ''),'none') as feature_10
from
(
SELECT
s.session_id
, any_value(s.order_id) as order_id
, array_agg(distinct case when feature = 1 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_1
, array_agg(distinct case when feature = 2 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_2
, array_agg(distinct case when feature = 3 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_3
, array_agg(distinct case when feature = 4 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_4
, array_agg(distinct case when feature = 5 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_5
, array_agg(distinct case when feature = 6 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_6
, array_agg(distinct case when feature = 7 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_7
, array_agg(distinct case when feature = 8 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_8
, array_agg(distinct case when feature = 9 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_9
, array_agg(distinct case when feature = 10 then ifnull(concat(v.name, '-', v.value), '-') else '' end) as feature_10
FROM `live_nahrin_views.pdp_radical_tracking_markers`
join unnest(values) as v
join unnest(sessions) as sid
join `live_nahrin_reports.session_primary` as s on s.session_id = sid
group by session_id, feature
)
group by feature_1, feature_2, feature_3, feature_4, feature_5, feature_6, feature_7, feature_8, feature_9, feature_10
;
CREATE OR REPLACE MODEL `live_nahrin_reports.Model_PDPRadicalPrediction_logistic_reg_v1_11M`
OPTIONS(
model_type='logistic_reg'
, CATEGORY_ENCODING_METHOD = 'DUMMY_ENCODING'
, CALCULATE_P_VALUES = TRUE,
ENABLE_GLOBAL_EXPLAIN = TRUE
) AS
SELECT
(IF(purchases >= it,1,0)) AS label
, feature_1, feature_2, feature_3, feature_4, feature_5, feature_6, feature_7, feature_8, feature_9, feature_10
FROM
`live_nahrin_reports.Data_PDPRadicalPrediction_logistic_reg_v1_11M`
join unnest(GENERATE_ARRAY(1,sessions)) as it
;
However, it is still not very clear how this will work
and the features indicated as most important as simply the ones which are detected as so by the feature number logic, so no value so far there in the outcome…
Factor analysis of parameters affecting the performance of product recommendations on the PDP
Identify automatically what are the factors of a good performance or a bad performance of a product recommendation
Example: recommending products which have bad ratings might be a bad idea
Example: recommending products of the same brand than the one of the currently viewed product might be a good idea for some brands and a bad idea for other brands
These parameters can be related to the product data of either the PDP or of the recommended products, as well as the session data (device, …) or the combination of them (e.g.: on products of the brand X recommending products of the brand Y doesn’t work)
Data Available:
Product properties (for both the PDP and the recommended products)
the item brand and categories are indicated in GA4 view_item events, but Boxalino has a lot more product attribute from the e-shop, including (the properties would be provided as an associate array with property names and values: ARRAY<STRUCT<name, ARRAY<string> values>>):number of stars
state of discount (in discount and how much)
pictures (number of pictures, urls, …)
description (to see if the length or other aspects could play a role)
delivery state (in stock, …)
…
in example data :
string_properties : have string properties for the recommended products
numeric_properties : have numeric properties for the recommended products
source_string_properties : have string properties for the current PDP
source_numeric_properties : have numeric properties for the current PDP
Other data
test variant (it is often the case that there are several algorithms tested as different test variant, we could use the test variant as a factor to see if it makes a different (positive or negative) for specific products)
in example data : variant_id
Performance metrics:
Click-Through-Rate (the number of times the recommended product is clicked compared to the number of times it is shown)
in example data : KPI_defined.click / KPI_defined.mainImpressionAdd-To-Basket-RAte (the number of times the recommended product is clicked and then added to the basket compared to the number of times it is shown)
in example data : KPI_defined.click_n_atb/ KPI_defined.mainImpression
Buy-Through-Rate (the number of times the recommended product is clicked and then added to the basket and then bought compared to the number of times it is shown)
in example data : KPI_defined.click_n_buy / KPI_defined.mainImpression
Display $ Value (the average value bought by the Buy-Throughs compared to the number of times the product is shown, so the average value of showing the product as a recommendation)
in example data : KPI_defined.click_n_revenue / KPI_defined.mainImpression
First attempts
We did the first attempt for the case of accessories recommendations (cross-selling on the PDP) as follows:
As a result, the predictive power only based on these few similar parameters is quite low (as it was expected), but could grow with more parameters.
Also, the Explainable AI tab shows possible usage of the model as factor analysis, showing here that the attribution is highest on the same product line and brand and only after the category and price.