Axel Mendoza
03/12/2024, 3:13 PM

```python
@asset(group_name=REC_TEST_GROUP_NAME, deps=[reconciliation_labels])
def reconciliation_labels_latest(
    context: AssetExecutionContext,
    bigquery: BigQueryResource,
) -> pd.DataFrame:
    """
    Args:
        context (AssetExecutionContext): Dagster context.
        bigquery (BigQueryResource): The BigQuery resource used for querying.
    """
    table_id = get_bq_table_id(
        project_id=bigquery.project,
        dataset_name=env_vars.feature_store_bq_dataset,
        table_name="final_reconciliation",
    )
    query = f"""
        SELECT entity_id, date, reporting_provider_id, reporting_site_id
        FROM {table_id}
        WHERE date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    """  # nosec
    df_reconciliation = run_query(context, bigquery, query, to_dataframe=True)
    return df_reconciliation
```
What is the most efficient way to join the IDs I've retrieved from this dataframe with other existing tables in BQ?

Scout
03/12/2024, 3:13 PM

```python
@asset(group_name=REC_TEST_GROUP_NAME, deps=[reconciliation_labels])
def reconciliation_labels_latest(
    context: AssetExecutionContext,
    bigquery: BigQueryResource,
) -> pd.DataFrame:
    table_id = get_bq_table_id(
        project_id=bigquery.project,
        dataset_name=env_vars.feature_store_bq_dataset,
        table_name="final_reconciliation",
    )
    other_table_id = "your_project.your_dataset.your_other_table"  # Replace with your table's details
    # Write a SQL query that performs the join directly in BigQuery
    query = f"""
        SELECT r.entity_id, r.date, r.reporting_provider_id, r.reporting_site_id, o.other_column
        FROM {table_id} AS r
        INNER JOIN {other_table_id} AS o ON r.reporting_provider_id = o.provider_id
        WHERE r.date >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
    """  # nosec
    df_reconciliation = run_query(context, bigquery, query, to_dataframe=True)
    return df_reconciliation
```
In this example, `other_table_id` should be replaced with the identifier of the table you want to join with, and `o.other_column` should be replaced with whichever columns you want to select from that table.

Scout
03/12/2024, 3:14 PM

Adjust the `ON` clause to match the join condition relevant to your data. Remember to always consider the size of the datasets and the costs associated with the operations in BigQuery. If you're dealing with very large datasets, it's often more cost-effective to perform operations within BigQuery rather than transferring large amounts of data into a local environment for processing.
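
If the IDs to join on only exist in an in-memory DataFrame (rather than already living in a BigQuery table), a common pattern that keeps the join inside BigQuery is to load that DataFrame into a small staging table first. Below is a minimal sketch of that idea, assuming the `google-cloud-bigquery` client obtained from Dagster's `BigQueryResource`; the `staging_table_id`, `other_column`, and `provider_id` names are illustrative placeholders, not part of the original code.

```python
import pandas as pd
from dagster_gcp import BigQueryResource
from google.cloud import bigquery as bq


def join_ids_in_bigquery(
    bigquery: BigQueryResource,
    df_ids: pd.DataFrame,
    other_table_id: str,
) -> pd.DataFrame:
    """Stage a small DataFrame of IDs in BigQuery and perform the join there.

    Hypothetical sketch: staging table name and join/select columns are
    placeholders to adapt to your own dataset and schema.
    """
    staging_table_id = "your_project.your_dataset.tmp_reconciliation_ids"  # placeholder

    with bigquery.get_client() as client:  # yields a google.cloud.bigquery.Client
        # Overwrite the staging table with the IDs we want to join on.
        load_job = client.load_table_from_dataframe(
            df_ids,
            staging_table_id,
            job_config=bq.LoadJobConfig(write_disposition="WRITE_TRUNCATE"),
        )
        load_job.result()  # wait for the load to finish

        query = f"""
            SELECT s.entity_id, s.date, o.other_column
            FROM `{staging_table_id}` AS s
            INNER JOIN `{other_table_id}` AS o
              ON s.reporting_provider_id = o.provider_id
        """  # nosec
        # The join executes inside BigQuery; only the joined rows are downloaded.
        return client.query(query).to_dataframe()
```

For small result sets, pulling both sides down and using `pd.merge` locally is perfectly workable; the staging-table approach mainly pays off when the other table is large and you only want the matching rows back.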