Prior to starting a historical data migration, ensure you do the following:
- Create a project on our US or EU Cloud.
- Sign up to a paid product analytics plan on the billing page (historical imports are free but this unlocks the necessary features).
- Raise an in-app support request with the Data pipelines topic detailing where you are sending events from, how, the total volume, and the speed. For example, "we are migrating 30M events from a self-hosted instance to EU Cloud using the migration scripts at 10k events per minute."
- Wait for the OK from our team before starting the migration process to ensure that it completes successfully and is not rate limited.
- Set the `historical_migration` option to `true` when capturing events in the migration (see the snippet below).
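For example, with the Python SDK, enabling this option is a one-line change when initializing the client. This is a minimal sketch; the API key is a placeholder and the host depends on whether your project is on US or EU Cloud:

```python
from posthog import Posthog

# historical_migration=True tells PostHog these events are part of a
# historical import rather than live traffic
posthog = Posthog(
    '<ph_project_api_key>',           # placeholder: your project API key
    host='https://us.i.posthog.com',  # use https://eu.i.posthog.com for EU Cloud
    historical_migration=True
)
```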
Migrating data from Google Analytics is a three-step process:
- Setting up the Google Analytics BigQuery streaming export
- Querying Google Analytics data from BigQuery
- Converting Google Analytics event data to the PostHog schema and capturing in PostHog
Want a higher-level overview of PostHog? Check out our introduction to PostHog for Google Analytics users.
1. Setting up the Google Analytics BigQuery export
Unfortunately, Google Analytics' historical data exports are limited. The best way to get data from Google Analytics is to set up the BigQuery export.
To do this, start by creating a Google Cloud account and project. Afterwards, do the basic setup for BigQuery, including enabling the BigQuery API. The BigQuery sandbox works for this.
Once done, go to the Admin section of your Google Analytics account. Under Product links, select BigQuery links and then click the Link button.
Choose your project, choose your export type (you can make either daily or streaming work), click Next, and then Submit. This sets up a link to your BigQuery account, but it can take more than 24 hours to see data in BigQuery.
2. Querying Google Analytics data from BigQuery
After the link is set up, data is automatically added to BigQuery under a dataset named `analytics_` followed by your Google Analytics property ID, like `analytics_123456789`. Within this is an events table based on your export type. If you used daily exports, events tables have names like `events_20240731`.
To query this data, use the BigQuery Python client. This requires setting up and authenticating with the Google Cloud CLI.
Once done, we can then query events like this:
```python
from google.cloud import bigquery

# Set the project ID explicitly
project_id = 'cool'
client = bigquery.Client(project=project_id)

QUERY = (
    'SELECT * FROM `analytics_123456789.events_20240731` '
    'WHERE event_name NOT IN ("session_start", "first_visit") '
    'LIMIT 100')
query_job = client.query(QUERY)
rows = query_job.result()
```
This returns a BigQuery `RowIterator` object.
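If you want to sanity-check what the exported rows contain before converting them, `Row` objects can be accessed like dictionaries (for example with `.get()` and `.items()`), which the conversion script below relies on. A minimal sketch, assuming the `query_job` from above:

```python
# Materialize the results: a RowIterator can only be consumed once,
# so convert it to a list if you want to loop over the rows again later
rows = list(query_job.result())

# Print the first row's keys and values to inspect the exported schema
for key, value in rows[0].items():
    print(f"{key}: {value}")
```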
3. Converting GA event data to the PostHog schema and capturing
The schema of Google Analytics' exported event data is similar to PostHog's schema, but it requires conversion to work with the rest of PostHog's data. You can see details on the Google Analytics schema in their docs, and the events and properties PostHog autocaptures in our docs.
For example, many PostHog events and properties are prepended with the `$` character.
To convert to PostHog's schema, we need to:
- Convert the `event_name` values like `user_engagement` to `$autocapture` and `page_view` to `$pageview`.
- Convert the `event_timestamp` from microseconds to ISO 8601 (a sketch of these first two conversions follows this list).
- Flatten the event data by pulling useful data out of records like `event_params` and `items` as well as dictionaries like `device` and `geo`.
- Create the event properties by looping through the Google Analytics event data, dropping irrelevant data, converting some of it to PostHog's schema, and including data unique to Google Analytics or custom properties.
- Do the same conversion with person properties. These are then added to the `$set` property.
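Here's a minimal sketch of the first two conversions. The `convert_event_name` and `convert_timestamp` helpers are just illustrative, not part of any SDK; GA4's `event_timestamp` is in microseconds since the Unix epoch:

```python
from datetime import datetime

# Map GA event names to their PostHog equivalents, falling back to the
# original name for custom events
event_name_mapping = {
    'page_view': '$pageview',
    'user_engagement': '$autocapture',
}

def convert_event_name(ga_event_name):
    return event_name_mapping.get(ga_event_name, ga_event_name)

def convert_timestamp(ga_event_timestamp):
    # GA4 timestamps are microseconds since the epoch; the PostHog SDK
    # accepts a datetime and serializes it to ISO 8601
    return datetime.fromtimestamp(ga_event_timestamp / 1_000_000)

print(convert_event_name('page_view'))      # $pageview
print(convert_timestamp(1722384000000000))  # 2024-07-31 00:00 UTC, rendered in your local timezone
```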
Once this is done, you can capture events into PostHog using the Python SDK with `historical_migration` set to `true`.
Here's an example version of a full Python script that gets data from BigQuery, converts it to PostHog's schema, and captures it in PostHog.
```python
from posthog import Posthog
from datetime import datetime
from google.cloud import bigquery

posthog = Posthog(
    '<ph_project_api_key>',
    host='https://us.i.posthog.com',
    debug=True,
    historical_migration=True
)

# Map Google Analytics property names to their PostHog equivalents
key_mapping = {
    'page_title': 'title',
    'page_location': '$current_url',
    'page_referrer': '$referrer',
    'category': '$device_type',
    'operating_system': '$os',
    'operating_system_version': '$os_version',
    'language': '$browser_language',
    'web_info.browser': '$browser',
    'web_info.browser_version': '$browser_version',
    'web_info.hostname': '$host',
    'city': '$geoip_city_name',
    'country': '$geoip_country_name',
    'continent': '$geoip_continent_name',
    'region': '$geoip_subdivision_1_name'
}

# Google Analytics keys we don't want as event properties
omitted_keys = [
    'event_date',
    'event_timestamp',
    'event_name',
    'event_bundle_sequence_id',
    'event_server_timestamp_offset',
    'user_id',
    'user_pseudo_id',
    'stream_id',
    'platform',
    'batch_event_index',
    'batch_page_id',
    'batch_ordering_id',
    'session_traffic_source_last_click'
]

def get_record_key_value(param):
    # GA records store values under typed keys; take the first non-null one
    record_key = param['key']
    value_dict = param['value']
    record_value = next((v for v in value_dict.values() if v is not None), None)
    return record_key, record_value

def flatten_row(row):
    flat_row = {}
    for key, value in row.items():
        print(f"{key}: {value}")
        if value is None:
            continue
        elif isinstance(value, list) and len(value) == 0:
            continue
        elif key in omitted_keys:
            continue
        elif key in ['user_properties', 'user_first_touch_timestamp', 'user_ltv']:
            # We'll handle user properties separately
            continue
        elif key in ['event_params', 'items']:
            # Flatten records
            for param in value:
                record_key, record_value = get_record_key_value(param)
                flat_row[record_key] = record_value
        elif key in ['privacy_info', 'device', 'geo', 'app_info', 'collected_traffic_source', 'traffic_source', 'event_dimensions', 'ecommerce']:
            # Flatten dicts and nested dicts
            for nested_key, val in value.items():
                if isinstance(val, dict):
                    for sub_key, sub_val in val.items():
                        # We might not want to do this
                        flat_row[f"{nested_key}.{sub_key}"] = sub_val
                else:
                    flat_row[nested_key] = val
        else:
            flat_row[key] = value
    return flat_row

def get_person_properties(row):
    person_properties = {}
    user_properties = row.get('user_properties', [])
    for prop in user_properties:
        key, value = get_record_key_value(prop)
        person_properties[key] = value

    user_first_touch_timestamp = row.get('user_first_touch_timestamp')
    user_ltv = row.get('user_ltv')
    if user_first_touch_timestamp is not None:
        person_properties['user_first_touch_timestamp'] = user_first_touch_timestamp
    if user_ltv is not None:
        person_properties['user_ltv'] = user_ltv

    return person_properties

def get_properties(flat_row):
    properties = {}
    for key, value in flat_row.items():
        if value is None:
            continue
        elif key in omitted_keys:
            continue
        elif value in ['(not provided)', '(not set)']:
            continue
        elif key in key_mapping:
            properties[key_mapping[key]] = value
        else:
            properties[key] = value
    return properties

# Get events from BigQuery
project_id = 'cool'
client = bigquery.Client(project=project_id)

QUERY = (
    'SELECT * FROM `analytics_123456789.events_20240731` '
    'WHERE event_name NOT IN ("session_start", "first_visit") '
    'LIMIT 100')
query_job = client.query(QUERY)
rows = query_job.result()

# Convert and capture events in PostHog
for row in rows:
    distinct_id = row.get('user_id') or row.get('user_pseudo_id')

    ph_event_name = row.get('event_name')
    if ph_event_name == 'page_view':
        ph_event_name = '$pageview'
    if ph_event_name == 'user_engagement':
        ph_event_name = '$autocapture'

    # event_timestamp is in microseconds since the epoch
    event_timestamp = row.get('event_timestamp')
    ph_timestamp = datetime.fromtimestamp(event_timestamp / 1000000)

    flat_row = flatten_row(row)
    properties = get_properties(flat_row)
    properties['$set'] = get_person_properties(row)

    posthog.capture(
        distinct_id=distinct_id,
        event=ph_event_name,
        properties=properties,
        timestamp=ph_timestamp
    )
```
This script may need modification depending on the structure of your Google Analytics data, but it gives you a start.