add data/spider2_original

Browse files

Files changed (6) hide show

.gitattributes +2 -0
data/spider2_original/dev_samples_graph_no_evidence.pkl +3 -0
data/spider2_original/review.md +695 -0
data/spider2_original/review.xlsx +3 -0
data/spider2_original/spider2_original_lite_samples.json +3 -0
data/spider2_original/substitution_log.json +0 -0

.gitattributes CHANGED Viewed

@@ -37,3 +37,5 @@ embeddings/grast-0.6b-swift-merged/tokenizer.json filter=lfs diff=lfs merge=lfs
 embeddings/grast-4b-swift-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 embeddings/grast-0.6b-flagembed-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 data/spider2_lite_samples_hf/dev/dataset_info.json filter=lfs diff=lfs merge=lfs -text

 embeddings/grast-4b-swift-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 embeddings/grast-0.6b-flagembed-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
 data/spider2_lite_samples_hf/dev/dataset_info.json filter=lfs diff=lfs merge=lfs -text
+data/spider2_original/review.xlsx filter=lfs diff=lfs merge=lfs -text
+data/spider2_original/spider2_original_lite_samples.json filter=lfs diff=lfs merge=lfs -text

data/spider2_original/dev_samples_graph_no_evidence.pkl ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:0d771d51ee92eeb2978a55bbc918de1f82562d6240d1eeb58c3badd502c783b7
+size 67644624

data/spider2_original/review.md ADDED Viewed

	@@ -0,0 +1,695 @@

+# Spider 2.0 ORIGINAL (no-wildcards) — review
+**Total samples:** 233
+**Samples with substitutions:** 83
+**Total pattern substitutions:** 396
+**Leftover `*` in graph (should be 0):** 0
+## Substitution reasons
+| reason | count |
+|---|---|
+| `q_year_fallback` | 250 |
+| `hardcoded_default` | 101 |
+| `sql_specific` | 13 |
+| `table_suffix_range` | 10 |
+| `sql_specific_year_alpha` | 7 |
+| `sql_partial_wildcard` | 4 |
+| `sql_8digit_literal` | 3 |
+| `sql_specific_alphanumeric` | 3 |
+| `declare` | 2 |
+| `family_default` | 2 |
+| `usa_1910_current_singleton` | 1 |
+## Top 20 patterns substituted
+| pattern | count |
+|---|---|
+| `ga_sessions_*` | 11 |
+| `zip_codes_*` | 10 |
+| `censustract_*` | 10 |
+| `puma_*` | 10 |
+| `schooldistrictunified_*` | 10 |
+| `state_*` | 10 |
+| `congressionaldistrict_*` | 10 |
+| `place_*` | 10 |
+| `blockgroup_*` | 10 |
+| `cbsa_*` | 10 |
+| `schooldistrictsecondary_*` | 10 |
+| `county_*` | 10 |
+| `schooldistrictelementary_*` | 10 |
+| `zcta_*` | 10 |
+| `events_*` | 9 |
+| `zcta5_*` | 8 |
+| `icoads_core_*` | 6 |
+| `storms_*` | 6 |
+| `tlc_fhv_trips_*` | 6 |
+| `tlc_yellow_trips_*` | 6 |
+## Gold-affecting substitutions (47 suspect samples)
+These are the samples where the substitution changes `used_columns` (the gold).
+### sid=0  iid=bq011  db=ga4
+**Q:** How many distinct pseudo users had positive engagement time  in the 7-day period ending on January 7, 2021 at 23:59:59, but had no positive engagement time in the 2-day period ending on the same date (January 7, 2021 at 23:59:59) ?
+**Old gold (2):** `['events_*.user_pseudo_id', 'events_*.event_timestamp']`
+**New gold (2):** `['events_20210101.user_pseudo_id', 'events_20210101.event_timestamp']`
+**Substitutions:**
+  - `events_*` → `events_20210101` (table_suffix_range:20210101 -> pad8)
+---
+### sid=1  iid=bq010  db=ga360
+**Q:** Find the top-selling product among customers who bought 'Youtube Men’s Vintage Henley' in July 2017, excluding itself.
+**Old gold (2):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.hits']`
+**New gold (2):** `['ga_sessions_20170701.fullVisitorId', 'ga_sessions_20170701.hits']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170701` (sql_partial_wildcard:201707 -> pad8)
+---
+### sid=2  iid=bq009  db=ga360
+**Q:** Which traffic source has the highest total transaction revenue for the year 2017, and what is the difference in millions (rounded to two decimal places) between the highest and lowest monthly total transaction revenue for that traffic source?
+**Old gold (3):** `['ga_sessions_*.date', 'ga_sessions_*.trafficSource', 'ga_sessions_*.totals']`
+**New gold (3):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.trafficSource', 'ga_sessions_20170101.totals']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170101` (sql_partial_wildcard:2017 -> pad8)
+---
+### sid=3  iid=bq001  db=ga360
+**Q:** For each visitor who made at least one transaction in February 2017, how many days elapsed between the date of their first visit in February and the date of their first transaction in February, and on what type of device did they make that first transaction?
+**Old gold (2):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.date']`
+**New gold (2):** `['ga_sessions_20170201.fullVisitorId', 'ga_sessions_20170201.date']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170201` (declare:20170201 -> pad8)
+---
+### sid=4  iid=bq002  db=ga360
+**Q:** During the first half of 2017,  focusing on hits product revenue, which traffic source generated the highest total product revenue, and what were the maximum daily, weekly, and monthly product revenues (in millions) for that top-performing source over this period?
+**Old gold (2):** `['ga_sessions_*.date', 'ga_sessions_*.hits']`
+**New gold (2):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.hits']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170101` (declare:20170101 -> pad8)
+---
+### sid=5  iid=bq003  db=ga360
+**Q:** Between April 1 and July 31 of 2017, using the hits product revenue data along with the totals transactions to classify sessions as purchase (transactions ≥ 1 and productRevenue not null) or non-purchase (transactions null and productRevenue null), compare the average pageviews per visitor for each
+**Old gold (2):** `['ga_sessions_*.date', 'ga_sessions_*.fullVisitorId']`
+**New gold (2):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.fullVisitorId']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170101` (sql_partial_wildcard:2017 -> pad8)
+---
+### sid=7  iid=bq008  db=ga360
+**Q:** In January 2017, among visitors whose campaign name contains 'Data Share' and who accessed any page starting with '/home', which page did they most commonly visit next, and what is the maximum time (in seconds) they spent on the '/home' page before moving on?
+**Old gold (7):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.visitId', 'ga_sessions_*.trafficSource', 'ga_sessions_*.hits', 'ga_sessions_*.hits', 'ga_sessions_*.visitStartTime', 'ga_sessions_*.hits']`
+**New gold (7):** `['ga_sessions_20170101.fullVisitorId', 'ga_sessions_20170101.visitId', 'ga_sessions_20170101.trafficSource', 'ga_sessions_20170101.hits', 'ga_sessions_20170101.hits', 'ga_sessions_20170101.visitStartTime', 'ga_sessions_20170101.hits']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170101` (table_suffix_range:20170101 -> pad8)
+---
+### sid=8  iid=bq269  db=ga360
+**Q:** Between June 1, 2017, and July 31, 2017, consider only sessions that have non-null pageviews. Classify each session as ‘purchase’ if it has at least one transaction, or ‘non_purchase’ otherwise. For each month, sum each visitor’s total pageviews under each classification, then compute the average pa
+**Old gold (4):** `['ga_sessions_*.date', 'ga_sessions_*.fullVisitorId', 'ga_sessions_*.totals', 'ga_sessions_*.totals']`
+**New gold (4):** `['ga_sessions_20170601.date', 'ga_sessions_20170601.fullVisitorId', 'ga_sessions_20170601.totals', 'ga_sessions_20170601.totals']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170601` (table_suffix_range:20170601 -> pad8)
+---
+### sid=9  iid=bq268  db=ga360
+**Q:** Identify the longest number of days between the first visit and the last recorded event (either the last visit or the first transaction) for a user, where the last recorded event is associated with a mobile device. The last recorded event could either be the last visit or the first transaction, and
+**Old gold (4):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.date', 'ga_sessions_*.device', 'ga_sessions_*.hits']`
+**New gold (4):** `['ga_sessions_20170701.fullVisitorId', 'ga_sessions_20170701.date', 'ga_sessions_20170701.device', 'ga_sessions_20170701.hits']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170701` (family_default:ga_sessions=20170701)
+---
+### sid=10  iid=bq270  db=ga360
+**Q:** What were the monthly add-to-cart and purchase conversion rates, calculated as a percentage of pageviews on product details, from January to March 2017?
+**Old gold (3):** `['ga_sessions_*.date', 'ga_sessions_*.hits', 'ga_sessions_*.hits']`
+**New gold (3):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.hits', 'ga_sessions_20170101.hits']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170101` (sql_partial_wildcard:2017 -> pad8)
+---
+### sid=11  iid=bq275  db=ga360
+**Q:** Which visitor IDs belong to users whose first transaction occurred on a device explicitly labeled as 'mobile' on a later date than their first visit?
+**Old gold (4):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.date', 'ga_sessions_*.hits', 'ga_sessions_*.device']`
+**New gold (4):** `['ga_sessions_20170701.fullVisitorId', 'ga_sessions_20170701.date', 'ga_sessions_20170701.hits', 'ga_sessions_20170701.device']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20170701` (family_default:ga_sessions=20170701)
+---
+### sid=12  iid=bq374  db=ga360
+**Q:** Calculates the percentage of new users who, between August 1, 2016, and April 30, 2017, both stayed on the site for more than 5 minutes during their initial visit and made a purchase on a subsequent visit at any later time, relative to the total number of new users in the same period.
+**Old gold (6):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.visitStartTime', 'ga_sessions_*.totals', 'ga_sessions_*.date', 'ga_sessions_*.totals', 'ga_sessions_*.totals']`
+**New gold (6):** `['ga_sessions_20160801.fullVisitorId', 'ga_sessions_20160801.visitStartTime', 'ga_sessions_20160801.totals', 'ga_sessions_20160801.date', 'ga_sessions_20160801.totals', 'ga_sessions_20160801.totals']`
+**Substitutions:**
+  - `ga_sessions_*` → `ga_sessions_20160801` (sql_8digit_literal:20160801)
+---
+### sid=61  iid=bq235  db=cms_data
+**Q:** Can you tell me which healthcare provider incurs the highest combined average costs for both outpatient and inpatient services in 2014?
+**Old gold (12):** `['outpatient_charges_*.provider_state', 'outpatient_charges_*.provider_city', 'outpatient_charges_*.provider_id', 'outpatient_charges_*.provider_name', 'outpatient_charges_*.outpatient_services', 'outpatient_charges_*.average_total_payments', 'inpatient_charges_*.provider_state', 'inpatient_charges_*.provider_city', 'inpatient_charges_*.provider_id', 'inpatient_charges_*.provider_name', 'inpatient_charges_*.total_discharges', 'inpatient_charges_*.average_medicare_payments']`
+**New gold (12):** `['outpatient_charges_2014.provider_state', 'outpatient_charges_2014.provider_city', 'outpatient_charges_2014.provider_id', 'outpatient_charges_2014.provider_name', 'outpatient_charges_2014.outpatient_services', 'outpatient_charges_2014.average_total_payments', 'inpatient_charges_2014.provider_state', 'inpatient_charges_2014.provider_city', 'inpatient_charges_2014.provider_id', 'inpatient_charges_2014.provider_name', 'inpatient_charges_2014.total_discharges', 'inpatient_charges_2014.average_medicare_payments']`
+**Substitutions:**
+  - `outpatient_charges_*` → `outpatient_charges_2014` (sql_specific:2014 -> pad4)
+  - `inpatient_charges_*` → `inpatient_charges_2014` (sql_specific:2014 -> pad4)
+---
+### sid=69  iid=bq419  db=noaa_data
+**Q:** Which 5 states had the most storm events from 1980 to 1995, considering only the top 1000 states with the highest event counts each year? Please use state abbreviations.
+**Old gold (2):** `['storms_*.state', 'storms_*.event_id']`
+**New gold (2):** `['storms_1980.state', 'storms_1980.event_id']`
+**Substitutions:**
+  - `storms_*` → `storms_1980` (sql_specific:1980 -> pad4)
+---
+### sid=71  iid=sf_bq236  db=NOAA_DATA_PLUS
+**Q:** What are the top 5 zip codes of the areas in the United States that have experienced the most hail storm events in the past 10 years? Don't use data from hail reports table.
+**Old gold (7):** `['STORMS_*.event_point', 'STORMS_*.event_type', 'STORMS_*.event_id', 'ZIP_CODES.zip_code', 'ZIP_CODES.city', 'ZIP_CODES.state_name', 'ZIP_CODES.zip_code_geom']`
+**New gold (7):** `['STORMS_2014.event_point', 'STORMS_2014.event_type', 'STORMS_2014.event_id', 'ZIP_CODES.zip_code', 'ZIP_CODES.city', 'ZIP_CODES.state_name', 'ZIP_CODES.zip_code_geom']`
+**Substitutions:**
+  - `STORMS_*` → `STORMS_2014` (sql_specific:2014 -> pad4)
+---
+### sid=75  iid=bq357  db=noaa_data
+**Q:** What are the latitude and longitude coordinates and dates between 2005 and 2015 with the top 5 highest daily average wind speeds, excluding records with missing wind speed values? Using data from tables start with prefix "icoads_core".
+**Old gold (6):** `['icoads_core_*.year', 'icoads_core_*.month', 'icoads_core_*.day', 'icoads_core_*.latitude', 'icoads_core_*.longitude', 'icoads_core_*.wind_speed']`
+**New gold (6):** `['icoads_core_2005.year', 'icoads_core_2005.month', 'icoads_core_2005.day', 'icoads_core_2005.latitude', 'icoads_core_2005.longitude', 'icoads_core_2005.wind_speed']`
+**Substitutions:**
+  - `icoads_core_*` → `icoads_core_2005` (table_suffix_range:2005 -> pad4)
+---
+### sid=93  iid=sf_bq429  db=CENSUS_BUREAU_ACS_2
+**Q:** Which are the top five states with the greatest average difference in median income between 2015 and 2018 at the ZIP code level, and what is the corresponding average number of vulnerable employees across wholesale trade, natural resources and construction, arts and entertainment, information, and r
+**Old gold (26):** `['ZCTA5_2017_5YR.employed_information', 'PLACE_*.geo_id', 'PLACE_*.employed_arts_entertainment_recreation_accommodation_food', 'PLACE_*.employed_wholesale_trade', 'ZIP_CODES_*.employed_information', 'ZCTA5_2018_5YR.geo_id', 'PLACE_*.occupation_natural_resources_construction_maintenance', 'ZCTA5_2015_5YR.geo_id', 'ZCTA5_2017_5YR.employed_arts_entertainment_recreation_accommodation_food', 'ZCTA5_2017_5YR.occupation_natural_resources_construction_maintenance', 'ZCTA5_2017_5YR.employed_retail_trade', 'PLACE_*.employed_information', 'ZCTA5_2018_5YR.median_income', 'ZIP_CODES_*.employed_wholesale_trade', 'ZIP_CODES_*.employed_retail_trade', 'ZIP_CODES_*.median_income', 'ZIP_CODES_*.geo_id', 'ZIP_CODES.state_name', 'PLACE_*.employed_retail_trade', 'PLACE_*.median_income', 'ZIP_CODES_*.occupation_natural_resources_construction_maintenance', 'ZIP_CODES_*.employed_arts_entertainment_recreation_accommodation_food', 'ZIP_CODES.zip_code', 'ZCTA5_2017_5YR.employed_wholesale_trade', 'ZCTA5_2015_5YR.median_income', 'ZCTA5_2017_5YR.geo_id']`
+**New gold (26):** `['ZCTA5_2017_5YR.employed_information', 'PLACE_2015.geo_id', 'PLACE_2015.employed_arts_entertainment_recreation_accommodation_food', 'PLACE_2015.employed_wholesale_trade', 'ZIP_CODES_2018_5YR.employed_information', 'ZCTA5_2018_5YR.geo_id', 'PLACE_2015.occupation_natural_resources_construction_maintenance', 'ZCTA5_2015_5YR.geo_id', 'ZCTA5_2017_5YR.employed_arts_entertainment_recreation_accommodation_food', 'ZCTA5_2017_5YR.occupation_natural_resources_construction_maintenance', 'ZCTA5_2017_5YR.employed_retail_trade', 'PLACE_2015.employed_information', 'ZCTA5_2018_5YR.median_income', 'ZIP_CODES_2018_5YR.employed_wholesale_trade', 'ZIP_CODES_2018_5YR.employed_retail_trade', 'ZIP_CODES_2018_5YR.median_income', 'ZIP_CODES_2018_5YR.geo_id', 'ZIP_CODES.state_name', 'PLACE_2015.employed_retail_trade', 'PLACE_2015.median_income', 'ZIP_CODES_2018_5YR.occupation_natural_resources_construction_maintenance', 'ZIP_CODES_2018_5YR.employed_arts_entertainment_recreation_accommodation_food', 'ZIP_CODES.zip_code', 'ZCTA5_2017_5YR.employed_wholesale_trade', 'ZCTA5_2015_5YR.median_income', 'ZCTA5_2017_5YR.geo_id']`
+**Substitutions:**
+  - `ZIP_CODES_*` → `ZIP_CODES_2018_5YR` (sql_specific_year_alpha:2018_5YR)
+  - `PLACE_*` → `PLACE_2015` (q_year_fallback:2015 -> pad4)
+---
+### sid=100  iid=sf_bq289  db=GEO_OPENSTREETMAP_CENSUS_PLACES
+**Q:** Can you find the shortest distance between any two amenities (either a library, place of worship, or community center) located within Philadelphia, analyzed through pennsylvania table and planet features points?
+**Old gold (5):** `['PLACES_*.place_name', 'PLACES_*.place_geom', 'PLANET_FEATURES_POINTS.all_tags', 'PLANET_FEATURES_POINTS.geometry', 'PLANET_FEATURES_POINTS.osm_id']`
+**New gold (5):** `['PLACES_PENNSYLVANIA.place_name', 'PLACES_PENNSYLVANIA.place_geom', 'PLANET_FEATURES_POINTS.all_tags', 'PLANET_FEATURES_POINTS.geometry', 'PLANET_FEATURES_POINTS.osm_id']`
+**Substitutions:**
+  - `PLACES_*` → `PLACES_PENNSYLVANIA` (sql_specific_alphanumeric:PENNSYLVANIA)
+---
+### sid=132  iid=bq021  db=new_york
+**Q:** For the top 20 Citi Bike routes in 2016, which route is faster than yellow taxis and among those, which one has the longest average bike duration? Please provide the start station name of this route. The coordinates are rounded to three decimals.
+**Old gold (14):** `['citibike_trips.start_station_name', 'citibike_trips.end_station_name', 'citibike_trips.start_station_latitude', 'citibike_trips.start_station_longitude', 'citibike_trips.end_station_latitude', 'citibike_trips.end_station_longitude', 'citibike_trips.tripduration', 'citibike_trips.starttime', 'tlc_yellow_trips_*.pickup_latitude', 'tlc_yellow_trips_*.pickup_longitude', 'tlc_yellow_trips_*.dropoff_latitude', 'tlc_yellow_trips_*.dropoff_longitude', 'tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.pickup_datetime']`
+**New gold (14):** `['citibike_trips.start_station_name', 'citibike_trips.end_station_name', 'citibike_trips.start_station_latitude', 'citibike_trips.start_station_longitude', 'citibike_trips.end_station_latitude', 'citibike_trips.end_station_longitude', 'citibike_trips.tripduration', 'citibike_trips.starttime', 'tlc_yellow_trips_2016.pickup_latitude', 'tlc_yellow_trips_2016.pickup_longitude', 'tlc_yellow_trips_2016.dropoff_latitude', 'tlc_yellow_trips_2016.dropoff_longitude', 'tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.pickup_datetime']`
+**Substitutions:**
+  - `tlc_yellow_trips_*` → `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
+---
+### sid=134  iid=bq185  db=new_york_plus
+**Q:** What is the average trip duration in minutes for all valid Yellow taxi trips that took place between February 1, 2016, and February 7, 2016 (inclusive), with a positive trip duration, more than three passengers, and a trip distance of at least ten miles, where both the pickup and dropoff locations a
+**Old gold (8):** `['tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.pickup_datetime', 'tlc_yellow_trips_*.passenger_count', 'tlc_yellow_trips_*.trip_distance', 'tlc_yellow_trips_*.pickup_location_id', 'tlc_yellow_trips_*.dropoff_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
+**New gold (8):** `['tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.pickup_datetime', 'tlc_yellow_trips_2016.passenger_count', 'tlc_yellow_trips_2016.trip_distance', 'tlc_yellow_trips_2016.pickup_location_id', 'tlc_yellow_trips_2016.dropoff_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
+**Substitutions:**
+  - `tlc_yellow_trips_*` → `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
+---
+### sid=136  iid=bq098  db=new_york_plus
+**Q:** For NYC yellow taxi trips where both the pickup and dropoff occurred between January 1 and 7, 2016, inclusive, calculate the percentage of trips with no tip in each pickup borough, ensuring that only trips where the dropoff occurs after the pickup are included, the passenger count is greater than ze
+**Old gold (12):** `['tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.pickup_datetime', 'tlc_yellow_trips_*.total_amount', 'tlc_yellow_trips_*.tip_amount', 'tlc_yellow_trips_*.passenger_count', 'tlc_yellow_trips_*.trip_distance', 'tlc_yellow_trips_*.tolls_amount', 'tlc_yellow_trips_*.mta_tax', 'tlc_yellow_trips_*.fare_amount', 'tlc_yellow_trips_*.pickup_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
+**New gold (12):** `['tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.pickup_datetime', 'tlc_yellow_trips_2016.total_amount', 'tlc_yellow_trips_2016.tip_amount', 'tlc_yellow_trips_2016.passenger_count', 'tlc_yellow_trips_2016.trip_distance', 'tlc_yellow_trips_2016.tolls_amount', 'tlc_yellow_trips_2016.mta_tax', 'tlc_yellow_trips_2016.fare_amount', 'tlc_yellow_trips_2016.pickup_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
+**Substitutions:**
+  - `tlc_yellow_trips_*` → `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
+---
+### sid=137  iid=bq039  db=new_york_plus
+**Q:** Find the top 10 taxi trips in New York City between July 1 and July 7, 2016 (ensuring both pickup and dropoff times fall within these dates) where the passenger count is greater than five, the trip distance is at least ten miles, and there are no negative fare-related amounts (including tip, tolls,
+**Old gold (13):** `['tlc_yellow_trips_*.pickup_datetime', 'tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.trip_distance', 'tlc_yellow_trips_*.tip_amount', 'tlc_yellow_trips_*.total_amount', 'tlc_yellow_trips_*.pickup_location_id', 'tlc_yellow_trips_*.dropoff_location_id', 'tlc_yellow_trips_*.passenger_count', 'tlc_yellow_trips_*.tolls_amount', 'tlc_yellow_trips_*.mta_tax', 'tlc_yellow_trips_*.fare_amount', 'taxi_zone_geom.zone_name', 'taxi_zone_geom.zone_id']`
+**New gold (13):** `['tlc_yellow_trips_2016.pickup_datetime', 'tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.trip_distance', 'tlc_yellow_trips_2016.tip_amount', 'tlc_yellow_trips_2016.total_amount', 'tlc_yellow_trips_2016.pickup_location_id', 'tlc_yellow_trips_2016.dropoff_location_id', 'tlc_yellow_trips_2016.passenger_count', 'tlc_yellow_trips_2016.tolls_amount', 'tlc_yellow_trips_2016.mta_tax', 'tlc_yellow_trips_2016.fare_amount', 'taxi_zone_geom.zone_name', 'taxi_zone_geom.zone_id']`
+**Substitutions:**
+  - `tlc_yellow_trips_*` → `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
+---
+### sid=175  iid=bq087  db=covid19_symptom_search
+**Q:** Please calculate the overall percentage change in the average weekly search frequency for the symptom 'Anosmia' across the five New York City counties—Bronx County, Queens County, Kings County, New York County, and Richmond County—by comparing the combined data from January 1, 2019, through December
+**Old gold (4):** `['symptom_search_*.symptom_anosmia', 'symptom_search_*.sub_region_1', 'symptom_search_*.sub_region_2', 'symptom_search_*.date']`
+**New gold (4):** `['symptom_search_2019.symptom_anosmia', 'symptom_search_2019.sub_region_1', 'symptom_search_2019.sub_region_2', 'symptom_search_2019.date']`
+**Substitutions:**
+  - `symptom_search_*` → `symptom_search_2019` (q_year_fallback:2019 -> pad4)
+---
+### sid=176  iid=bq088  db=covid19_symptom_search
+**Q:** Please calculate the average levels of anxiety and depression symptoms from the weekly country data for the United States during the periods from January 1, 2019, to January 1, 2020, and from January 1, 2020, to January 1, 2021. Then, compute the percentage increase in these average symptom levels f
+**Old gold (4):** `['symptom_search_*.symptom_anxiety', 'symptom_search_*.symptom_depression', 'symptom_search_*.country_region_code', 'symptom_search_*.date']`
+**New gold (4):** `['symptom_search_2019.symptom_anxiety', 'symptom_search_2019.symptom_depression', 'symptom_search_2019.country_region_code', 'symptom_search_2019.date']`
+**Substitutions:**
+  - `symptom_search_*` → `symptom_search_2019` (q_year_fallback:2019 -> pad4)
+---
+### sid=177  iid=bq089  db=covid19_usa
+**Q:** Given the latest population estimates from the 2018 five-year American Community Survey, what is the number of vaccine sites per 1000 people for counties in California?
+**Old gold (12):** `['state_*.geo_id', 'facility_boundary_us_all.facility_place_id', 'place_*.total_pop', 'facility_boundary_us_all.facility_sub_region_2_code', 'place_*.geo_id', 'censustract_*.total_pop', 'facility_boundary_us_all.facility_sub_region_2', 'congressionaldistrict_*.total_pop', 'state_*.total_pop', 'congressionaldistrict_*.geo_id', 'facility_boundary_us_all.facility_sub_region_1', 'censustract_*.geo_id']`
+**New gold (12):** `['state_2018.geo_id', 'facility_boundary_us_all.facility_place_id', 'place_2018.total_pop', 'facility_boundary_us_all.facility_sub_region_2_code', 'place_2018.geo_id', 'censustract_2018.total_pop', 'facility_boundary_us_all.facility_sub_region_2', 'congressionaldistrict_2018.total_pop', 'state_2018.total_pop', 'congressionaldistrict_2018.geo_id', 'facility_boundary_us_all.facility_sub_region_1', 'censustract_2018.geo_id']`
+**Substitutions:**
+  - `censustract_*` → `censustract_2018` (q_year_fallback:2018 -> pad4)
+  - `state_*` → `state_2018` (q_year_fallback:2018 -> pad4)
+  - `congressionaldistrict_*` → `congressionaldistrict_2018` (q_year_fallback:2018 -> pad4)
+  - `place_*` → `place_2018` (q_year_fallback:2018 -> pad4)
+---
+### sid=178  iid=bq407  db=covid19_usa
+**Q:** Find the top three counties with populations over 50,000, using the 2020 5-year census data, that had the highest COVID-19 case fatality rates on August 27, 2020. For these counties, provide the name, state, median age, total population, number of confirmed COVID-19 cases per 100,000 people, number
+**Old gold (30):** `['state_*.geo_id', 'place_*.geo_id', 'county_*.geo_id', 'schooldistrictelementary_*.geo_id', 'congressionaldistrict_*.median_age', 'schooldistrictsecondary_*.median_age', 'state_*.total_pop', 'summary.state', 'place_*.total_pop', 'summary.county_name', 'summary.date', 'schooldistrictelementary_*.median_age', 'congressionaldistrict_*.total_pop', 'congressionaldistrict_*.geo_id', 'schooldistrictsecondary_*.total_pop', 'summary.deaths', 'puma_*.total_pop', 'puma_*.median_age', 'summary.confirmed_cases', 'state_*.median_age', 'schooldistrictsecondary_*.geo_id', 'summary.county_fips_code', 'county_*.geo_id', 'schooldistrictelementary_*.total_pop', 'county_*.total_pop', 'puma_*.geo_id', 'county_*.median_age', 'county_*.median_age', 'place_*.median_age', 'county_*.total_pop']`
+**New gold (30):** `['state_2020.geo_id', 'place_2020.geo_id', 'county_2020_5yr.geo_id', 'schooldistrictelementary_2020.geo_id', 'congressionaldistrict_2020.median_age', 'schooldistrictsecondary_2020.median_age', 'state_2020.total_pop', 'summary.state', 'place_2020.total_pop', 'summary.county_name', 'summary.date', 'schooldistrictelementary_2020.median_age', 'congressionaldistrict_2020.total_pop', 'congressionaldistrict_2020.geo_id', 'schooldistrictsecondary_2020.total_pop', 'summary.deaths', 'puma_2020.total_pop', 'puma_2020.median_age', 'summary.confirmed_cases', 'state_2020.median_age', 'schooldistrictsecondary_2020.geo_id', 'summary.county_fips_code', 'county_2020_5yr.geo_id', 'schooldistrictelementary_2020.total_pop', 'county_2020_5yr.total_pop', 'puma_2020.geo_id', 'county_2020_5yr.median_age', 'county_2020_5yr.median_age', 'place_2020.median_age', 'county_2020_5yr.total_pop']`
+**Substitutions:**
+  - `puma_*` → `puma_2020` (q_year_fallback:2020 -> pad4)
+  - `state_*` → `state_2020` (q_year_fallback:2020 -> pad4)
+  - `congressionaldistrict_*` → `congressionaldistrict_2020` (q_year_fallback:2020 -> pad4)
+  - `place_*` → `place_2020` (q_year_fallback:2020 -> pad4)
+  - `schooldistrictsecondary_*` → `schooldistrictsecondary_2020` (q_year_fallback:2020 -> pad4)
+  - `county_*` → `county_2020_5yr` (sql_specific_year_alpha:2020_5yr)
+  - `schooldistrictelementary_*` → `schooldistrictelementary_2020` (q_year_fallback:2020 -> pad4)
+---
+### sid=182  iid=bq061  db=census_bureau_acs_1
+**Q:** Which census tract has witnessed the largest increase in median income between 2015 and 2018 in California? Tell me the tract code.
+**Old gold (6):** `['state_*.geo_id', 'census_tracts_*.geo_id', 'censustract_*.median_income', 'censustract_*.geo_id', 'state_*.median_income', 'census_tracts_*.tract_ce']`
+**New gold (6):** `['state_2015.geo_id', 'census_tracts_2015.geo_id', 'censustract_2018_5yr.median_income', 'censustract_2018_5yr.geo_id', 'state_2015.median_income', 'census_tracts_2015.tract_ce']`
+**Substitutions:**
+  - `censustract_*` → `censustract_2018_5yr` (sql_specific_year_alpha:2018_5yr)
+  - `state_*` → `state_2015` (q_year_fallback:2015 -> pad4)
+  - `census_tracts_*` → `census_tracts_2015` (q_year_fallback:2015 -> pad4)
+---
+### sid=183  iid=bq064  db=census_bureau_acs_1
+**Q:** Using the 2017 U.S. Census Tract data from the BigQuery public datasets, you need to proportionally allocate each tract's population and income to the zip codes based on the overlapping area between their geographic boundaries. Then, filter the results to include only those zip codes located within
+**Old gold (18):** `['state_*.geo_id', 'census_tracts_*.geo_id', 'us_census_tracts_national.functional_status', 'us_census_tracts_national.tract_ce', 'state_*.income_per_capita', 'us_census_tracts_national.geo_id', 'census_tracts_*.functional_status', 'censustract_*.total_pop', 'censustract_*.income_per_capita', 'state_*.total_pop', 'zip_codes.zip_code', 'zip_codes.zip_code_geom', 'zip_codes.functional_status', 'census_tracts_*.tract_geom', 'zip_codes.state_code', 'censustract_*.geo_id', 'us_census_tracts_national.tract_geom', 'census_tracts_*.tract_ce']`
+**New gold (18):** `['state_2017.geo_id', 'census_tracts_2017.geo_id', 'us_census_tracts_national.functional_status', 'us_census_tracts_national.tract_ce', 'state_2017.income_per_capita', 'us_census_tracts_national.geo_id', 'census_tracts_2017.functional_status', 'censustract_2017_5yr.total_pop', 'censustract_2017_5yr.income_per_capita', 'state_2017.total_pop', 'zip_codes.zip_code', 'zip_codes.zip_code_geom', 'zip_codes.functional_status', 'census_tracts_2017.tract_geom', 'zip_codes.state_code', 'censustract_2017_5yr.geo_id', 'us_census_tracts_national.tract_geom', 'census_tracts_2017.tract_ce']`
+**Substitutions:**
+  - `censustract_*` → `censustract_2017_5yr` (sql_specific_year_alpha:2017_5yr)
+  - `state_*` → `state_2017` (q_year_fallback:2017 -> pad4)
+  - `census_tracts_*` → `census_tracts_2017` (q_year_fallback:2017 -> pad4)
+---
+### sid=194  iid=bq406  db=google_dei
+**Q:** Please calculate the growth rates for Asians, Black people, Latinx people, Native Americans, White people, US women, US men, global women, and global men from 2014 to 2024 concerning the overall workforce.
+**Old gold (11):** `['dar_non_intersectional_*.race_hispanic_latinx', 'dar_intersectional_*.race_white', 'dar_region_non_intersectional_representation.race_asian', 'dar_region_non_intersectional_representation.report_year', 'dar_non_intersectional_*.gender_us_men', 'dar_non_intersectional_*.gender_global_men', 'dar_non_intersectional_*.gender_global_women', 'dar_intersectional_*.race_black', 'dar_non_intersectional_*.gender_us_women', 'dar_intersectional_*.race_native_american', 'dar_region_non_intersectional_representation.workforce']`
+**New gold (11):** `['dar_non_intersectional_2014.race_hispanic_latinx', 'dar_intersectional_2014.race_white', 'dar_region_non_intersectional_representation.race_asian', 'dar_region_non_intersectional_representation.report_year', 'dar_non_intersectional_2014.gender_us_men', 'dar_non_intersectional_2014.gender_global_men', 'dar_non_intersectional_2014.gender_global_women', 'dar_intersectional_2014.race_black', 'dar_non_intersectional_2014.gender_us_women', 'dar_intersectional_2014.race_native_american', 'dar_region_non_intersectional_representation.workforce']`
+**Substitutions:**
+  - `dar_intersectional_*` → `dar_intersectional_2014` (q_year_fallback:2014 -> pad4)
+  - `dar_non_intersectional_*` → `dar_non_intersectional_2014` (q_year_fallback:2014 -> pad4)
+---
+### sid=228  iid=bq105  db=nhtsa_traffic_fatalities_plus
+**Q:** According to the 2015 and 2016 accident and driver distraction, and excluding cases where the driver’s distraction status is recorded as 'Not Distracted,' 'Unknown if Distracted,' or 'Not Reported,' how many traffic accidents per 100,000 people were caused by driver distraction in each U.S. state fo
+**Old gold (12):** `[' accident_2015.consecutive_number', ' accident_2015.state_name', ' distract_2015.consecutive_number', ' distract_2015.driver_distracted_by_name', 'population_by_zip_*.population', 'population_by_zip_*.zipcode', 'zipcode_area.zipcode', 'zipcode_area.state_name', ' accident_2016.consecutive_number', ' accident_2016.state_name', ' distract_2016.consecutive_number', ' distract_2016.driver_distracted_by_name']`
+**New gold (12):** `[' accident_2015.consecutive_number', ' accident_2015.state_name', ' distract_2015.consecutive_number', ' distract_2015.driver_distracted_by_name', 'population_by_zip_2010.population', 'population_by_zip_2010.zipcode', 'zipcode_area.zipcode', 'zipcode_area.state_name', ' accident_2016.consecutive_number', ' accident_2016.state_name', ' distract_2016.consecutive_number', ' distract_2016.driver_distracted_by_name']`
+**Substitutions:**
+  - `population_by_zip_*` → `population_by_zip_2010` (sql_specific:2010 -> pad4)
+---
+### sid=237  iid=bq352  db=sdoh
+**Q:** Please list the average number of prenatal weeks in 2018 for counties in Wisconsin where more than 5% of the employed population had commutes of 45-59 minutes in 2017.
+**Old gold (25):** `['state_*.geo_id', 'place_*.geo_id', 'cbsa_*.geo_id', 'state_*.employed_pop', 'county_*.geo_id', 'county_*.commute_45_59_mins', 'schooldistrictelementary_*.geo_id', 'place_*.employed_pop', 'place_*.commute_45_59_mins', 'county_*.County_of_Residence', 'cbsa_*.employed_pop', 'county_*.employed_pop', 'censustract_*.commute_45_59_mins', 'state_*.commute_45_59_mins', 'schooldistrictelementary_*.commute_45_59_mins', 'censustract_*.geo_id', 'cbsa_*.commute_45_59_mins', 'schooldistrictelementary_*.employed_pop', 'county_*.County_of_Residence_FIPS', 'censustract_*.employed_pop', 'county_*.Year', 'puma_*.commute_45_59_mins', 'county_*.Ave_Number_of_Prenatal_Wks', 'puma_*.geo_id', 'puma_*.employed_pop']`
+**New gold (25):** `['state_2017.geo_id', 'place_2017.geo_id', 'cbsa_2017.geo_id', 'state_2017.employed_pop', 'county_2017_5yr.geo_id', 'county_2017_5yr.commute_45_59_mins', 'schooldistrictelementary_2017.geo_id', 'place_2017.employed_pop', 'place_2017.commute_45_59_mins', 'county_2017_5yr.County_of_Residence', 'cbsa_2017.employed_pop', 'county_2017_5yr.employed_pop', 'censustract_2017.commute_45_59_mins', 'state_2017.commute_45_59_mins', 'schooldistrictelementary_2017.commute_45_59_mins', 'censustract_2017.geo_id', 'cbsa_2017.commute_45_59_mins', 'schooldistrictelementary_2017.employed_pop', 'county_2017_5yr.County_of_Residence_FIPS', 'censustract_2017.employed_pop', 'county_2017_5yr.Year', 'puma_2017.commute_45_59_mins', 'county_2017_5yr.Ave_Number_of_Prenatal_Wks', 'puma_2017.geo_id', 'puma_2017.employed_pop']`
+**Substitutions:**
+  - `puma_*` → `puma_2017` (q_year_fallback:2017 -> pad4)
+  - `censustract_*` → `censustract_2017` (q_year_fallback:2017 -> pad4)
+  - `state_*` → `state_2017` (q_year_fallback:2017 -> pad4)
+  - `place_*` → `place_2017` (q_year_fallback:2017 -> pad4)
+  - `cbsa_*` → `cbsa_2017` (q_year_fallback:2017 -> pad4)
+  - `county_*` → `county_2017_5yr` (sql_specific_year_alpha:2017_5yr)
+  - `schooldistrictelementary_*` → `schooldistrictelementary_2017` (q_year_fallback:2017 -> pad4)
+---
+### sid=238  iid=bq074  db=sdoh
+**Q:** Count the number of counties that experienced an increase in unemployment from 2015 to 2018, using 5-year ACS data, and a decrease in dual-eligible enrollee counts between December 1, 2015, and December 1, 2018.
+**Old gold (16):** `['dual_eligible_enrollment_by_county_and_program.County_Name', 'state_*.geo_id', 'county_*.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Public_Total', 'place_*.geo_id', 'congressionaldistrict_*.unemployed_pop', 'schooldistrictsecondary_*.unemployed_pop', 'puma_*.geo_id', 'dual_eligible_enrollment_by_county_and_program.FIPS', 'county_*.geo_id', 'state_*.unemployed_pop', 'place_*.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Date', 'congressionaldistrict_*.geo_id', 'puma_*.unemployed_pop', 'schooldistrictsecondary_*.geo_id']`
+**New gold (16):** `['dual_eligible_enrollment_by_county_and_program.County_Name', 'state_2015.geo_id', 'county_2018_5yr.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Public_Total', 'place_2015.geo_id', 'congressionaldistrict_2015.unemployed_pop', 'schooldistrictsecondary_2015.unemployed_pop', 'puma_2015.geo_id', 'dual_eligible_enrollment_by_county_and_program.FIPS', 'county_2018_5yr.geo_id', 'state_2015.unemployed_pop', 'place_2015.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Date', 'congressionaldistrict_2015.geo_id', 'puma_2015.unemployed_pop', 'schooldistrictsecondary_2015.geo_id']`
+**Substitutions:**
+  - `puma_*` → `puma_2015` (q_year_fallback:2015 -> pad4)
+  - `state_*` → `state_2015` (q_year_fallback:2015 -> pad4)
+  - `congressionaldistrict_*` → `congressionaldistrict_2015` (q_year_fallback:2015 -> pad4)
+  - `place_*` → `place_2015` (q_year_fallback:2015 -> pad4)
+  - `schooldistrictsecondary_*` → `schooldistrictsecondary_2015` (q_year_fallback:2015 -> pad4)
+  - `county_*` → `county_2018_5yr` (sql_specific_year_alpha:2018_5yr)
+---
+### sid=239  iid=bq066  db=sdoh
+**Q:** Could you assess the relationship between the poverty rates from the previous year's census data and the percentage of births without maternal morbidity for the years 2016 to 2018? Use only data for births where no maternal morbidity was reported and for each year, use the 5-year census data from th
+**Old gold (28):** `['state_*.geo_id', 'county_*.geo_id', 'county_*.Maternal_Morbidity_YN', 'cbsa_*.geo_id', 'place_*.geo_id', 'county_*.geo_id', 'place_*.pop_determined_poverty_status', 'cbsa_*.poverty', 'county_*.pop_determined_poverty_status', 'cbsa_*.pop_determined_poverty_status', 'county_*.geo_id', 'county_*.poverty', 'county_*.pop_determined_poverty_status', 'county_*.geo_id', 'puma_*.poverty', 'county_*.pop_determined_poverty_status', 'state_*.poverty', 'puma_*.pop_determined_poverty_status', 'county_*.Births', 'county_*.County_of_Residence_FIPS', 'county_*.poverty', 'county_*.Year', 'county_*.poverty', 'place_*.poverty', 'puma_*.geo_id', 'state_*.pop_determined_poverty_status', 'county_*.pop_determined_poverty_status', 'county_*.poverty']`
+**New gold (28):** `['state_2016.geo_id', 'county_2015_5yr.geo_id', 'county_2015_5yr.Maternal_Morbidity_YN', 'cbsa_2016.geo_id', 'place_2016.geo_id', 'county_2015_5yr.geo_id', 'place_2016.pop_determined_poverty_status', 'cbsa_2016.poverty', 'county_2015_5yr.pop_determined_poverty_status', 'cbsa_2016.pop_determined_poverty_status', 'county_2015_5yr.geo_id', 'county_2015_5yr.poverty', 'county_2015_5yr.pop_determined_poverty_status', 'county_2015_5yr.geo_id', 'puma_2016.poverty', 'county_2015_5yr.pop_determined_poverty_status', 'state_2016.poverty', 'puma_2016.pop_determined_poverty_status', 'county_2015_5yr.Births', 'county_2015_5yr.County_of_Residence_FIPS', 'county_2015_5yr.poverty', 'county_2015_5yr.Year', 'county_2015_5yr.poverty', 'place_2016.poverty', 'puma_2016.geo_id', 'state_2016.pop_determined_poverty_status', 'county_2015_5yr.pop_determined_poverty_status', 'county_2015_5yr.poverty']`
+**Substitutions:**
+  - `puma_*` → `puma_2016` (q_year_fallback:2016 -> pad4)
+  - `state_*` → `state_2016` (q_year_fallback:2016 -> pad4)
+  - `place_*` → `place_2016` (q_year_fallback:2016 -> pad4)
+  - `cbsa_*` → `cbsa_2016` (q_year_fallback:2016 -> pad4)
+  - `county_*` → `county_2015_5yr` (sql_specific_year_alpha:2015_5yr)
+---
+### sid=267  iid=bq204  db=eclipse_megamovie
+**Q:** Find the user with the highest total clicks across all records from all available photo collections.
+**Old gold (1):** `['photos_v_0_*.user']`
+**New gold (1):** `['photos_v_0_1.user']`
+**Substitutions:**
+  - `photos_v_0_*` → `photos_v_0_1` (sql_specific:1 -> pad4)
+---
+### sid=279  iid=bq049  db=iowa_liquor_sales_plus
+**Q:** Please show the monthly per capita Bourbon Whiskey sales during 2022 in Dubuque County for the zip code that ranks third in total Bourbon Whiskey sales, using only the population aged 21 and older.
+**Old gold (8):** `['sales.category_name', 'sales.date', 'sales.zip_code', 'sales.sale_dollars', 'sales.county', 'population_by_zip_*.zipcode', 'population_by_zip_*.population', 'population_by_zip_*.minimum_age']`
+**New gold (8):** `['sales.category_name', 'sales.date', 'sales.zip_code', 'sales.sale_dollars', 'sales.county', 'population_by_zip_2022.zipcode', 'population_by_zip_2022.population', 'population_by_zip_2022.minimum_age']`
+**Substitutions:**
+  - `population_by_zip_*` → `population_by_zip_2022` (q_year_fallback:2022 -> pad4)
+---
+### sid=281  iid=bq286  db=usa_names
+**Q:** Can you tell me the name of the most popular female baby in Wyoming for the year 2021, based on the proportion of female babies given that name compared to the total number of female babies given the same name across all states?
+**Old gold (5):** `['usa_1910_*.name', 'usa_1910_*.gender', 'usa_1910_*.year', 'usa_1910_*.number', 'usa_1910_*.state']`
+**New gold (5):** `['usa_1910_current.name', 'usa_1910_current.gender', 'usa_1910_current.year', 'usa_1910_current.number', 'usa_1910_current.state']`
+**Substitutions:**
+  - `usa_1910_*` → `usa_1910_current` (usa_1910_current_singleton)
+---
+### sid=284  iid=bq143  db=CPTAC_PDC
+**Q:** Use CPTAC proteomics and RNAseq data for Clear Cell Renal Cell Carcinoma to select 'Primary Tumor' and 'Solid Tissue Normal' samples. Join the datasets on sample submitter IDs and gene symbols. Calculate the correlation between protein abundance (log2 ratio) and gene expression levels (log-transform
+**Old gold (11):** `['quant_proteome_*.case_id', 'quant_proteome_*.aliquot_id', 'quant_proteome_*.gene_symbol', 'quant_proteome_*.protein_abundance_log2ratio', 'aliquot_to_case_mapping_current.sample_submitter_id', 'aliquot_to_case_mapping_current.sample_type', 'aliquot_to_case_mapping_current.case_id', 'aliquot_to_case_mapping_current.aliquot_id', 'RNAseq_hg38_gdc_current.gene_name', 'RNAseq_hg38_gdc_current.fpkm_unstranded', 'RNAseq_hg38_gdc_current.sample_barcode']`
+**New gold (11):** `['quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.case_id', 'quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.aliquot_id', 'quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.gene_symbol', 'quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.protein_abundance_log2ratio', 'aliquot_to_case_mapping_current.sample_submitter_id', 'aliquot_to_case_mapping_current.sample_type', 'aliquot_to_case_mapping_current.case_id', 'aliquot_to_case_mapping_current.aliquot_id', 'RNAseq_hg38_gdc_current.gene_name', 'RNAseq_hg38_gdc_current.fpkm_unstranded', 'RNAseq_hg38_gdc_current.sample_barcode']`
+**Substitutions:**
+  - `quant_proteome_*` → `quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` (sql_specific_alphanumeric:CPTAC_CCRCC_discovery_study_pdc_current)
+---
+### sid=318  iid=bq006  db=austin
+**Q:** What is the date with the second highest Z-score for daily counts of 'PUBLIC INTOXICATION' incidents in Austin for the year 2016? List the date in the format of '2016-xx-xx'.
+**Old gold (2):** `['incidents_*.date', 'incidents_*.descript']`
+**New gold (2):** `['incidents_2016.date', 'incidents_2016.descript']`
+**Substitutions:**
+  - `incidents_*` → `incidents_2016` (sql_specific:2016 -> pad4)
+---
+### sid=323  iid=bq430  db=ebi_chembl
+**Q:** Find pairs of different molecules tested in the same assay and standard type, where both have 10–15 heavy atoms, fewer than 5 activities in that assay, fewer than 2 duplicate activities, non-null standard values, and pChEMBL values over 10. For each pair, report the maximum heavy atom count, the lat
+**Old gold (13):** `['activities_*.assay_id', 'activities_*.standard_type', 'activities_*.activity_id', 'activities_*.standard_value', 'activities_*.standard_relation', 'activities_*.pchembl_value', 'activities_*.molregno', 'compound_structures_25.canonical_smiles', 'compound_properties_26.heavy_atoms', 'docs_*.doc_id', 'docs_*.year', 'docs_*.journal', 'docs_*.first_page']`
+**New gold (13):** `['activities_29.assay_id', 'activities_29.standard_type', 'activities_29.activity_id', 'activities_29.standard_value', 'activities_29.standard_relation', 'activities_29.pchembl_value', 'activities_29.molregno', 'compound_structures_25.canonical_smiles', 'compound_properties_26.heavy_atoms', 'docs_29.doc_id', 'docs_29.year', 'docs_29.journal', 'docs_29.first_page']`
+**Substitutions:**
+  - `activities_*` → `activities_29` (sql_specific:29 -> pad4)
+  - `docs_*` → `docs_29` (sql_specific:29 -> pad4)
+---
+### sid=369  iid=ga001  db=ga4
+**Q:** I want to know the preferences of customers who purchased the Google Navy Speckled Tee in December 2020. What other product was purchased with the highest total quantity alongside this item?
+**Old gold (5):** `['events_*.user_pseudo_id', 'events_*.event_name', 'events_*.items', 'events_*.items', 'events_*.items']`
+**New gold (5):** `['events_20201201.user_pseudo_id', 'events_20201201.event_name', 'events_20201201.items', 'events_20201201.items', 'events_20201201.items']`
+**Substitutions:**
+  - `events_*` → `events_20201201` (table_suffix_range:20201201 -> pad8)
+---
+### sid=370  iid=ga002  db=ga4
+**Q:** Tell me the most purchased other products and their quantities by customers who bought the Google Red Speckled Tee each month for the three months starting from November 2020.
+**Old gold (3):** `['events_*.user_pseudo_id', 'events_*.items', 'events_*.event_name']`
+**New gold (3):** `['events_20201101.user_pseudo_id', 'events_20201101.items', 'events_20201101.event_name']`
+**Substitutions:**
+  - `events_*` → `events_20201101` (sql_8digit_literal:20201101)
+---
+### sid=372  iid=ga004  db=ga4
+**Q:** Can you figure out the average difference in pageviews between users who bought something and those who didn’t in December 2020? Just label anyone who was involved in purchase events as a purchaser.
+**Old gold (2):** `['events_*.user_pseudo_id', 'events_*.event_name']`
+**New gold (2):** `['events_20201201.user_pseudo_id', 'events_20201201.event_name']`
+**Substitutions:**
+  - `events_*` → `events_20201201` (table_suffix_range:20201201 -> pad8)
+---
+### sid=373  iid=ga008  db=ga4
+**Q:** Could you provide the total number of page views for each day in November 2020 as well as the average number of page views per user on those days, restricted to users who made at least one purchase in November 2020?
+**Old gold (3):** `['events_*.user_pseudo_id', 'events_*.event_date', 'events_*.event_name']`
+**New gold (3):** `['events_20201101.user_pseudo_id', 'events_20201101.event_date', 'events_20201101.event_name']`
+**Substitutions:**
+  - `events_*` → `events_20201101` (table_suffix_range:20201101 -> pad8)
+---
+### sid=374  iid=ga017  db=ga4
+**Q:** How many distinct users viewed the most frequently visited page during January 2021?
+**Old gold (4):** `['events_*.event_params', 'events_*.event_name', 'events_*.event_timestamp', 'events_*.user_pseudo_id']`
+**New gold (4):** `['events_20210101.event_params', 'events_20210101.event_name', 'events_20210101.event_timestamp', 'events_20210101.user_pseudo_id']`
+**Substitutions:**
+  - `events_*` → `events_20210101` (table_suffix_range:20210101 -> pad8)
+---
+### sid=377  iid=ga018  db=ga4
+**Q:** On January 2nd, 2021, I want to determine the percentage of times users transition from a product list page (PLP) view to a product detail page (PDP) view within the same session, using only page_view events. Could you calculate how many PLP views eventually led to a PDP view in the same session on
+**Old gold (10):** `['events_*.event_name', 'events_*.event_date', 'events_*.event_timestamp', 'events_*.user_pseudo_id', 'events_*.user_id', 'events_*.device', 'events_*.geo', 'events_*.traffic_source', 'events_*.event_params', 'events_*.user_properties']`
+**New gold (10):** `['events_20210102.event_name', 'events_20210102.event_date', 'events_20210102.event_timestamp', 'events_20210102.user_pseudo_id', 'events_20210102.user_id', 'events_20210102.device', 'events_20210102.geo', 'events_20210102.traffic_source', 'events_20210102.event_params', 'events_20210102.user_properties']`
+**Substitutions:**
+  - `events_*` → `events_20210102` (sql_8digit_literal:20210102)
+---
+### sid=382  iid=ga010  db=ga4
+**Q:** Can you give me an overview of our website traffic for December 2020? I'm particularly interested in the channel with the fourth highest number of sessions.
+**Old gold (3):** `['events_*.user_pseudo_id', 'events_*.event_params', 'events_*.event_timestamp']`
+**New gold (3):** `['events_20201201.user_pseudo_id', 'events_20201201.event_params', 'events_20201201.event_timestamp']`
+**Substitutions:**
+  - `events_*` → `events_20201201` (table_suffix_range:20201201 -> pad8)
+---
+### sid=385  iid=ga012  db=ga4
+**Q:** On November 30, 2020, identify the item category with the highest tax rate by dividing tax value in usd by purchase revenue in usd for purchase events, and then retrieve the transaction IDs, total item quantities, and both purchase revenue in usd and purchase revenue for those purchase events in tha
+**Old gold (7):** `['events_*.items', 'events_*.ecommerce', 'events_*.ecommerce', 'events_*.event_name', 'events_*.ecommerce', 'events_*.ecommerce', 'events_*.ecommerce']`
+**New gold (7):** `['events_20200101.items', 'events_20200101.ecommerce', 'events_20200101.ecommerce', 'events_20200101.event_name', 'events_20200101.ecommerce', 'events_20200101.ecommerce', 'events_20200101.ecommerce']`
+**Substitutions:**
+  - `events_*` → `events_20200101` (q_year_fallback:2020 -> pad8)
+---

data/spider2_original/review.xlsx ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ea5c49b0a59d75cf15fef63d3060338fefce5caf1de19b4ff0a3c1f60fe85be2
+size 118096

data/spider2_original/spider2_original_lite_samples.json ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:56cd6497cf0e479d8f3d880a7c57395c670d80bedce97325054ffdc1dde95dc1
+size 198491180

data/spider2_original/substitution_log.json ADDED Viewed

The diff for this file is too large to render. See raw diff