thanhdath commited on
Commit
3358602
Β·
verified Β·
1 Parent(s): 67cba51

add data/spider2_original

Browse files
.gitattributes CHANGED
@@ -37,3 +37,5 @@ embeddings/grast-0.6b-swift-merged/tokenizer.json filter=lfs diff=lfs merge=lfs
37
  embeddings/grast-4b-swift-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
38
  embeddings/grast-0.6b-flagembed-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
39
  data/spider2_lite_samples_hf/dev/dataset_info.json filter=lfs diff=lfs merge=lfs -text
 
 
 
37
  embeddings/grast-4b-swift-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
38
  embeddings/grast-0.6b-flagembed-merged/tokenizer.json filter=lfs diff=lfs merge=lfs -text
39
  data/spider2_lite_samples_hf/dev/dataset_info.json filter=lfs diff=lfs merge=lfs -text
40
+ data/spider2_original/review.xlsx filter=lfs diff=lfs merge=lfs -text
41
+ data/spider2_original/spider2_original_lite_samples.json filter=lfs diff=lfs merge=lfs -text
data/spider2_original/dev_samples_graph_no_evidence.pkl ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:0d771d51ee92eeb2978a55bbc918de1f82562d6240d1eeb58c3badd502c783b7
3
+ size 67644624
data/spider2_original/review.md ADDED
@@ -0,0 +1,695 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ # Spider 2.0 ORIGINAL (no-wildcards) β€” review
2
+
3
+ **Total samples:** 233
4
+ **Samples with substitutions:** 83
5
+ **Total pattern substitutions:** 396
6
+ **Leftover `*` in graph (should be 0):** 0
7
+
8
+ ## Substitution reasons
9
+
10
+ | reason | count |
11
+ |---|---|
12
+ | `q_year_fallback` | 250 |
13
+ | `hardcoded_default` | 101 |
14
+ | `sql_specific` | 13 |
15
+ | `table_suffix_range` | 10 |
16
+ | `sql_specific_year_alpha` | 7 |
17
+ | `sql_partial_wildcard` | 4 |
18
+ | `sql_8digit_literal` | 3 |
19
+ | `sql_specific_alphanumeric` | 3 |
20
+ | `declare` | 2 |
21
+ | `family_default` | 2 |
22
+ | `usa_1910_current_singleton` | 1 |
23
+
24
+ ## Top 20 patterns substituted
25
+
26
+ | pattern | count |
27
+ |---|---|
28
+ | `ga_sessions_*` | 11 |
29
+ | `zip_codes_*` | 10 |
30
+ | `censustract_*` | 10 |
31
+ | `puma_*` | 10 |
32
+ | `schooldistrictunified_*` | 10 |
33
+ | `state_*` | 10 |
34
+ | `congressionaldistrict_*` | 10 |
35
+ | `place_*` | 10 |
36
+ | `blockgroup_*` | 10 |
37
+ | `cbsa_*` | 10 |
38
+ | `schooldistrictsecondary_*` | 10 |
39
+ | `county_*` | 10 |
40
+ | `schooldistrictelementary_*` | 10 |
41
+ | `zcta_*` | 10 |
42
+ | `events_*` | 9 |
43
+ | `zcta5_*` | 8 |
44
+ | `icoads_core_*` | 6 |
45
+ | `storms_*` | 6 |
46
+ | `tlc_fhv_trips_*` | 6 |
47
+ | `tlc_yellow_trips_*` | 6 |
48
+
49
+ ## Gold-affecting substitutions (47 suspect samples)
50
+
51
+ These are the samples where the substitution changes `used_columns` (the gold).
52
+
53
+ ### sid=0 iid=bq011 db=ga4
54
+
55
+ **Q:** How many distinct pseudo users had positive engagement time in the 7-day period ending on January 7, 2021 at 23:59:59, but had no positive engagement time in the 2-day period ending on the same date (January 7, 2021 at 23:59:59) ?
56
+
57
+ **Old gold (2):** `['events_*.user_pseudo_id', 'events_*.event_timestamp']`
58
+
59
+ **New gold (2):** `['events_20210101.user_pseudo_id', 'events_20210101.event_timestamp']`
60
+
61
+ **Substitutions:**
62
+ - `events_*` β†’ `events_20210101` (table_suffix_range:20210101 -> pad8)
63
+
64
+ ---
65
+
66
+ ### sid=1 iid=bq010 db=ga360
67
+
68
+ **Q:** Find the top-selling product among customers who bought 'Youtube Men’s Vintage Henley' in July 2017, excluding itself.
69
+
70
+ **Old gold (2):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.hits']`
71
+
72
+ **New gold (2):** `['ga_sessions_20170701.fullVisitorId', 'ga_sessions_20170701.hits']`
73
+
74
+ **Substitutions:**
75
+ - `ga_sessions_*` β†’ `ga_sessions_20170701` (sql_partial_wildcard:201707 -> pad8)
76
+
77
+ ---
78
+
79
+ ### sid=2 iid=bq009 db=ga360
80
+
81
+ **Q:** Which traffic source has the highest total transaction revenue for the year 2017, and what is the difference in millions (rounded to two decimal places) between the highest and lowest monthly total transaction revenue for that traffic source?
82
+
83
+ **Old gold (3):** `['ga_sessions_*.date', 'ga_sessions_*.trafficSource', 'ga_sessions_*.totals']`
84
+
85
+ **New gold (3):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.trafficSource', 'ga_sessions_20170101.totals']`
86
+
87
+ **Substitutions:**
88
+ - `ga_sessions_*` β†’ `ga_sessions_20170101` (sql_partial_wildcard:2017 -> pad8)
89
+
90
+ ---
91
+
92
+ ### sid=3 iid=bq001 db=ga360
93
+
94
+ **Q:** For each visitor who made at least one transaction in February 2017, how many days elapsed between the date of their first visit in February and the date of their first transaction in February, and on what type of device did they make that first transaction?
95
+
96
+ **Old gold (2):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.date']`
97
+
98
+ **New gold (2):** `['ga_sessions_20170201.fullVisitorId', 'ga_sessions_20170201.date']`
99
+
100
+ **Substitutions:**
101
+ - `ga_sessions_*` β†’ `ga_sessions_20170201` (declare:20170201 -> pad8)
102
+
103
+ ---
104
+
105
+ ### sid=4 iid=bq002 db=ga360
106
+
107
+ **Q:** During the first half of 2017, focusing on hits product revenue, which traffic source generated the highest total product revenue, and what were the maximum daily, weekly, and monthly product revenues (in millions) for that top-performing source over this period?
108
+
109
+ **Old gold (2):** `['ga_sessions_*.date', 'ga_sessions_*.hits']`
110
+
111
+ **New gold (2):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.hits']`
112
+
113
+ **Substitutions:**
114
+ - `ga_sessions_*` β†’ `ga_sessions_20170101` (declare:20170101 -> pad8)
115
+
116
+ ---
117
+
118
+ ### sid=5 iid=bq003 db=ga360
119
+
120
+ **Q:** Between April 1 and July 31 of 2017, using the hits product revenue data along with the totals transactions to classify sessions as purchase (transactions β‰₯ 1 and productRevenue not null) or non-purchase (transactions null and productRevenue null), compare the average pageviews per visitor for each
121
+
122
+ **Old gold (2):** `['ga_sessions_*.date', 'ga_sessions_*.fullVisitorId']`
123
+
124
+ **New gold (2):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.fullVisitorId']`
125
+
126
+ **Substitutions:**
127
+ - `ga_sessions_*` β†’ `ga_sessions_20170101` (sql_partial_wildcard:2017 -> pad8)
128
+
129
+ ---
130
+
131
+ ### sid=7 iid=bq008 db=ga360
132
+
133
+ **Q:** In January 2017, among visitors whose campaign name contains 'Data Share' and who accessed any page starting with '/home', which page did they most commonly visit next, and what is the maximum time (in seconds) they spent on the '/home' page before moving on?
134
+
135
+ **Old gold (7):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.visitId', 'ga_sessions_*.trafficSource', 'ga_sessions_*.hits', 'ga_sessions_*.hits', 'ga_sessions_*.visitStartTime', 'ga_sessions_*.hits']`
136
+
137
+ **New gold (7):** `['ga_sessions_20170101.fullVisitorId', 'ga_sessions_20170101.visitId', 'ga_sessions_20170101.trafficSource', 'ga_sessions_20170101.hits', 'ga_sessions_20170101.hits', 'ga_sessions_20170101.visitStartTime', 'ga_sessions_20170101.hits']`
138
+
139
+ **Substitutions:**
140
+ - `ga_sessions_*` β†’ `ga_sessions_20170101` (table_suffix_range:20170101 -> pad8)
141
+
142
+ ---
143
+
144
+ ### sid=8 iid=bq269 db=ga360
145
+
146
+ **Q:** Between June 1, 2017, and July 31, 2017, consider only sessions that have non-null pageviews. Classify each session as β€˜purchase’ if it has at least one transaction, or β€˜non_purchase’ otherwise. For each month, sum each visitor’s total pageviews under each classification, then compute the average pa
147
+
148
+ **Old gold (4):** `['ga_sessions_*.date', 'ga_sessions_*.fullVisitorId', 'ga_sessions_*.totals', 'ga_sessions_*.totals']`
149
+
150
+ **New gold (4):** `['ga_sessions_20170601.date', 'ga_sessions_20170601.fullVisitorId', 'ga_sessions_20170601.totals', 'ga_sessions_20170601.totals']`
151
+
152
+ **Substitutions:**
153
+ - `ga_sessions_*` β†’ `ga_sessions_20170601` (table_suffix_range:20170601 -> pad8)
154
+
155
+ ---
156
+
157
+ ### sid=9 iid=bq268 db=ga360
158
+
159
+ **Q:** Identify the longest number of days between the first visit and the last recorded event (either the last visit or the first transaction) for a user, where the last recorded event is associated with a mobile device. The last recorded event could either be the last visit or the first transaction, and
160
+
161
+ **Old gold (4):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.date', 'ga_sessions_*.device', 'ga_sessions_*.hits']`
162
+
163
+ **New gold (4):** `['ga_sessions_20170701.fullVisitorId', 'ga_sessions_20170701.date', 'ga_sessions_20170701.device', 'ga_sessions_20170701.hits']`
164
+
165
+ **Substitutions:**
166
+ - `ga_sessions_*` β†’ `ga_sessions_20170701` (family_default:ga_sessions=20170701)
167
+
168
+ ---
169
+
170
+ ### sid=10 iid=bq270 db=ga360
171
+
172
+ **Q:** What were the monthly add-to-cart and purchase conversion rates, calculated as a percentage of pageviews on product details, from January to March 2017?
173
+
174
+ **Old gold (3):** `['ga_sessions_*.date', 'ga_sessions_*.hits', 'ga_sessions_*.hits']`
175
+
176
+ **New gold (3):** `['ga_sessions_20170101.date', 'ga_sessions_20170101.hits', 'ga_sessions_20170101.hits']`
177
+
178
+ **Substitutions:**
179
+ - `ga_sessions_*` β†’ `ga_sessions_20170101` (sql_partial_wildcard:2017 -> pad8)
180
+
181
+ ---
182
+
183
+ ### sid=11 iid=bq275 db=ga360
184
+
185
+ **Q:** Which visitor IDs belong to users whose first transaction occurred on a device explicitly labeled as 'mobile' on a later date than their first visit?
186
+
187
+ **Old gold (4):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.date', 'ga_sessions_*.hits', 'ga_sessions_*.device']`
188
+
189
+ **New gold (4):** `['ga_sessions_20170701.fullVisitorId', 'ga_sessions_20170701.date', 'ga_sessions_20170701.hits', 'ga_sessions_20170701.device']`
190
+
191
+ **Substitutions:**
192
+ - `ga_sessions_*` β†’ `ga_sessions_20170701` (family_default:ga_sessions=20170701)
193
+
194
+ ---
195
+
196
+ ### sid=12 iid=bq374 db=ga360
197
+
198
+ **Q:** Calculates the percentage of new users who, between August 1, 2016, and April 30, 2017, both stayed on the site for more than 5 minutes during their initial visit and made a purchase on a subsequent visit at any later time, relative to the total number of new users in the same period.
199
+
200
+ **Old gold (6):** `['ga_sessions_*.fullVisitorId', 'ga_sessions_*.visitStartTime', 'ga_sessions_*.totals', 'ga_sessions_*.date', 'ga_sessions_*.totals', 'ga_sessions_*.totals']`
201
+
202
+ **New gold (6):** `['ga_sessions_20160801.fullVisitorId', 'ga_sessions_20160801.visitStartTime', 'ga_sessions_20160801.totals', 'ga_sessions_20160801.date', 'ga_sessions_20160801.totals', 'ga_sessions_20160801.totals']`
203
+
204
+ **Substitutions:**
205
+ - `ga_sessions_*` β†’ `ga_sessions_20160801` (sql_8digit_literal:20160801)
206
+
207
+ ---
208
+
209
+ ### sid=61 iid=bq235 db=cms_data
210
+
211
+ **Q:** Can you tell me which healthcare provider incurs the highest combined average costs for both outpatient and inpatient services in 2014?
212
+
213
+ **Old gold (12):** `['outpatient_charges_*.provider_state', 'outpatient_charges_*.provider_city', 'outpatient_charges_*.provider_id', 'outpatient_charges_*.provider_name', 'outpatient_charges_*.outpatient_services', 'outpatient_charges_*.average_total_payments', 'inpatient_charges_*.provider_state', 'inpatient_charges_*.provider_city', 'inpatient_charges_*.provider_id', 'inpatient_charges_*.provider_name', 'inpatient_charges_*.total_discharges', 'inpatient_charges_*.average_medicare_payments']`
214
+
215
+ **New gold (12):** `['outpatient_charges_2014.provider_state', 'outpatient_charges_2014.provider_city', 'outpatient_charges_2014.provider_id', 'outpatient_charges_2014.provider_name', 'outpatient_charges_2014.outpatient_services', 'outpatient_charges_2014.average_total_payments', 'inpatient_charges_2014.provider_state', 'inpatient_charges_2014.provider_city', 'inpatient_charges_2014.provider_id', 'inpatient_charges_2014.provider_name', 'inpatient_charges_2014.total_discharges', 'inpatient_charges_2014.average_medicare_payments']`
216
+
217
+ **Substitutions:**
218
+ - `outpatient_charges_*` β†’ `outpatient_charges_2014` (sql_specific:2014 -> pad4)
219
+ - `inpatient_charges_*` β†’ `inpatient_charges_2014` (sql_specific:2014 -> pad4)
220
+
221
+ ---
222
+
223
+ ### sid=69 iid=bq419 db=noaa_data
224
+
225
+ **Q:** Which 5 states had the most storm events from 1980 to 1995, considering only the top 1000 states with the highest event counts each year? Please use state abbreviations.
226
+
227
+ **Old gold (2):** `['storms_*.state', 'storms_*.event_id']`
228
+
229
+ **New gold (2):** `['storms_1980.state', 'storms_1980.event_id']`
230
+
231
+ **Substitutions:**
232
+ - `storms_*` β†’ `storms_1980` (sql_specific:1980 -> pad4)
233
+
234
+ ---
235
+
236
+ ### sid=71 iid=sf_bq236 db=NOAA_DATA_PLUS
237
+
238
+ **Q:** What are the top 5 zip codes of the areas in the United States that have experienced the most hail storm events in the past 10 years? Don't use data from hail reports table.
239
+
240
+ **Old gold (7):** `['STORMS_*.event_point', 'STORMS_*.event_type', 'STORMS_*.event_id', 'ZIP_CODES.zip_code', 'ZIP_CODES.city', 'ZIP_CODES.state_name', 'ZIP_CODES.zip_code_geom']`
241
+
242
+ **New gold (7):** `['STORMS_2014.event_point', 'STORMS_2014.event_type', 'STORMS_2014.event_id', 'ZIP_CODES.zip_code', 'ZIP_CODES.city', 'ZIP_CODES.state_name', 'ZIP_CODES.zip_code_geom']`
243
+
244
+ **Substitutions:**
245
+ - `STORMS_*` β†’ `STORMS_2014` (sql_specific:2014 -> pad4)
246
+
247
+ ---
248
+
249
+ ### sid=75 iid=bq357 db=noaa_data
250
+
251
+ **Q:** What are the latitude and longitude coordinates and dates between 2005 and 2015 with the top 5 highest daily average wind speeds, excluding records with missing wind speed values? Using data from tables start with prefix "icoads_core".
252
+
253
+ **Old gold (6):** `['icoads_core_*.year', 'icoads_core_*.month', 'icoads_core_*.day', 'icoads_core_*.latitude', 'icoads_core_*.longitude', 'icoads_core_*.wind_speed']`
254
+
255
+ **New gold (6):** `['icoads_core_2005.year', 'icoads_core_2005.month', 'icoads_core_2005.day', 'icoads_core_2005.latitude', 'icoads_core_2005.longitude', 'icoads_core_2005.wind_speed']`
256
+
257
+ **Substitutions:**
258
+ - `icoads_core_*` β†’ `icoads_core_2005` (table_suffix_range:2005 -> pad4)
259
+
260
+ ---
261
+
262
+ ### sid=93 iid=sf_bq429 db=CENSUS_BUREAU_ACS_2
263
+
264
+ **Q:** Which are the top five states with the greatest average difference in median income between 2015 and 2018 at the ZIP code level, and what is the corresponding average number of vulnerable employees across wholesale trade, natural resources and construction, arts and entertainment, information, and r
265
+
266
+ **Old gold (26):** `['ZCTA5_2017_5YR.employed_information', 'PLACE_*.geo_id', 'PLACE_*.employed_arts_entertainment_recreation_accommodation_food', 'PLACE_*.employed_wholesale_trade', 'ZIP_CODES_*.employed_information', 'ZCTA5_2018_5YR.geo_id', 'PLACE_*.occupation_natural_resources_construction_maintenance', 'ZCTA5_2015_5YR.geo_id', 'ZCTA5_2017_5YR.employed_arts_entertainment_recreation_accommodation_food', 'ZCTA5_2017_5YR.occupation_natural_resources_construction_maintenance', 'ZCTA5_2017_5YR.employed_retail_trade', 'PLACE_*.employed_information', 'ZCTA5_2018_5YR.median_income', 'ZIP_CODES_*.employed_wholesale_trade', 'ZIP_CODES_*.employed_retail_trade', 'ZIP_CODES_*.median_income', 'ZIP_CODES_*.geo_id', 'ZIP_CODES.state_name', 'PLACE_*.employed_retail_trade', 'PLACE_*.median_income', 'ZIP_CODES_*.occupation_natural_resources_construction_maintenance', 'ZIP_CODES_*.employed_arts_entertainment_recreation_accommodation_food', 'ZIP_CODES.zip_code', 'ZCTA5_2017_5YR.employed_wholesale_trade', 'ZCTA5_2015_5YR.median_income', 'ZCTA5_2017_5YR.geo_id']`
267
+
268
+ **New gold (26):** `['ZCTA5_2017_5YR.employed_information', 'PLACE_2015.geo_id', 'PLACE_2015.employed_arts_entertainment_recreation_accommodation_food', 'PLACE_2015.employed_wholesale_trade', 'ZIP_CODES_2018_5YR.employed_information', 'ZCTA5_2018_5YR.geo_id', 'PLACE_2015.occupation_natural_resources_construction_maintenance', 'ZCTA5_2015_5YR.geo_id', 'ZCTA5_2017_5YR.employed_arts_entertainment_recreation_accommodation_food', 'ZCTA5_2017_5YR.occupation_natural_resources_construction_maintenance', 'ZCTA5_2017_5YR.employed_retail_trade', 'PLACE_2015.employed_information', 'ZCTA5_2018_5YR.median_income', 'ZIP_CODES_2018_5YR.employed_wholesale_trade', 'ZIP_CODES_2018_5YR.employed_retail_trade', 'ZIP_CODES_2018_5YR.median_income', 'ZIP_CODES_2018_5YR.geo_id', 'ZIP_CODES.state_name', 'PLACE_2015.employed_retail_trade', 'PLACE_2015.median_income', 'ZIP_CODES_2018_5YR.occupation_natural_resources_construction_maintenance', 'ZIP_CODES_2018_5YR.employed_arts_entertainment_recreation_accommodation_food', 'ZIP_CODES.zip_code', 'ZCTA5_2017_5YR.employed_wholesale_trade', 'ZCTA5_2015_5YR.median_income', 'ZCTA5_2017_5YR.geo_id']`
269
+
270
+ **Substitutions:**
271
+ - `ZIP_CODES_*` β†’ `ZIP_CODES_2018_5YR` (sql_specific_year_alpha:2018_5YR)
272
+ - `PLACE_*` β†’ `PLACE_2015` (q_year_fallback:2015 -> pad4)
273
+
274
+ ---
275
+
276
+ ### sid=100 iid=sf_bq289 db=GEO_OPENSTREETMAP_CENSUS_PLACES
277
+
278
+ **Q:** Can you find the shortest distance between any two amenities (either a library, place of worship, or community center) located within Philadelphia, analyzed through pennsylvania table and planet features points?
279
+
280
+ **Old gold (5):** `['PLACES_*.place_name', 'PLACES_*.place_geom', 'PLANET_FEATURES_POINTS.all_tags', 'PLANET_FEATURES_POINTS.geometry', 'PLANET_FEATURES_POINTS.osm_id']`
281
+
282
+ **New gold (5):** `['PLACES_PENNSYLVANIA.place_name', 'PLACES_PENNSYLVANIA.place_geom', 'PLANET_FEATURES_POINTS.all_tags', 'PLANET_FEATURES_POINTS.geometry', 'PLANET_FEATURES_POINTS.osm_id']`
283
+
284
+ **Substitutions:**
285
+ - `PLACES_*` β†’ `PLACES_PENNSYLVANIA` (sql_specific_alphanumeric:PENNSYLVANIA)
286
+
287
+ ---
288
+
289
+ ### sid=132 iid=bq021 db=new_york
290
+
291
+ **Q:** For the top 20 Citi Bike routes in 2016, which route is faster than yellow taxis and among those, which one has the longest average bike duration? Please provide the start station name of this route. The coordinates are rounded to three decimals.
292
+
293
+ **Old gold (14):** `['citibike_trips.start_station_name', 'citibike_trips.end_station_name', 'citibike_trips.start_station_latitude', 'citibike_trips.start_station_longitude', 'citibike_trips.end_station_latitude', 'citibike_trips.end_station_longitude', 'citibike_trips.tripduration', 'citibike_trips.starttime', 'tlc_yellow_trips_*.pickup_latitude', 'tlc_yellow_trips_*.pickup_longitude', 'tlc_yellow_trips_*.dropoff_latitude', 'tlc_yellow_trips_*.dropoff_longitude', 'tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.pickup_datetime']`
294
+
295
+ **New gold (14):** `['citibike_trips.start_station_name', 'citibike_trips.end_station_name', 'citibike_trips.start_station_latitude', 'citibike_trips.start_station_longitude', 'citibike_trips.end_station_latitude', 'citibike_trips.end_station_longitude', 'citibike_trips.tripduration', 'citibike_trips.starttime', 'tlc_yellow_trips_2016.pickup_latitude', 'tlc_yellow_trips_2016.pickup_longitude', 'tlc_yellow_trips_2016.dropoff_latitude', 'tlc_yellow_trips_2016.dropoff_longitude', 'tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.pickup_datetime']`
296
+
297
+ **Substitutions:**
298
+ - `tlc_yellow_trips_*` β†’ `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
299
+
300
+ ---
301
+
302
+ ### sid=134 iid=bq185 db=new_york_plus
303
+
304
+ **Q:** What is the average trip duration in minutes for all valid Yellow taxi trips that took place between February 1, 2016, and February 7, 2016 (inclusive), with a positive trip duration, more than three passengers, and a trip distance of at least ten miles, where both the pickup and dropoff locations a
305
+
306
+ **Old gold (8):** `['tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.pickup_datetime', 'tlc_yellow_trips_*.passenger_count', 'tlc_yellow_trips_*.trip_distance', 'tlc_yellow_trips_*.pickup_location_id', 'tlc_yellow_trips_*.dropoff_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
307
+
308
+ **New gold (8):** `['tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.pickup_datetime', 'tlc_yellow_trips_2016.passenger_count', 'tlc_yellow_trips_2016.trip_distance', 'tlc_yellow_trips_2016.pickup_location_id', 'tlc_yellow_trips_2016.dropoff_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
309
+
310
+ **Substitutions:**
311
+ - `tlc_yellow_trips_*` β†’ `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
312
+
313
+ ---
314
+
315
+ ### sid=136 iid=bq098 db=new_york_plus
316
+
317
+ **Q:** For NYC yellow taxi trips where both the pickup and dropoff occurred between January 1 and 7, 2016, inclusive, calculate the percentage of trips with no tip in each pickup borough, ensuring that only trips where the dropoff occurs after the pickup are included, the passenger count is greater than ze
318
+
319
+ **Old gold (12):** `['tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.pickup_datetime', 'tlc_yellow_trips_*.total_amount', 'tlc_yellow_trips_*.tip_amount', 'tlc_yellow_trips_*.passenger_count', 'tlc_yellow_trips_*.trip_distance', 'tlc_yellow_trips_*.tolls_amount', 'tlc_yellow_trips_*.mta_tax', 'tlc_yellow_trips_*.fare_amount', 'tlc_yellow_trips_*.pickup_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
320
+
321
+ **New gold (12):** `['tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.pickup_datetime', 'tlc_yellow_trips_2016.total_amount', 'tlc_yellow_trips_2016.tip_amount', 'tlc_yellow_trips_2016.passenger_count', 'tlc_yellow_trips_2016.trip_distance', 'tlc_yellow_trips_2016.tolls_amount', 'tlc_yellow_trips_2016.mta_tax', 'tlc_yellow_trips_2016.fare_amount', 'tlc_yellow_trips_2016.pickup_location_id', 'taxi_zone_geom.zone_id', 'taxi_zone_geom.borough']`
322
+
323
+ **Substitutions:**
324
+ - `tlc_yellow_trips_*` β†’ `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
325
+
326
+ ---
327
+
328
+ ### sid=137 iid=bq039 db=new_york_plus
329
+
330
+ **Q:** Find the top 10 taxi trips in New York City between July 1 and July 7, 2016 (ensuring both pickup and dropoff times fall within these dates) where the passenger count is greater than five, the trip distance is at least ten miles, and there are no negative fare-related amounts (including tip, tolls,
331
+
332
+ **Old gold (13):** `['tlc_yellow_trips_*.pickup_datetime', 'tlc_yellow_trips_*.dropoff_datetime', 'tlc_yellow_trips_*.trip_distance', 'tlc_yellow_trips_*.tip_amount', 'tlc_yellow_trips_*.total_amount', 'tlc_yellow_trips_*.pickup_location_id', 'tlc_yellow_trips_*.dropoff_location_id', 'tlc_yellow_trips_*.passenger_count', 'tlc_yellow_trips_*.tolls_amount', 'tlc_yellow_trips_*.mta_tax', 'tlc_yellow_trips_*.fare_amount', 'taxi_zone_geom.zone_name', 'taxi_zone_geom.zone_id']`
333
+
334
+ **New gold (13):** `['tlc_yellow_trips_2016.pickup_datetime', 'tlc_yellow_trips_2016.dropoff_datetime', 'tlc_yellow_trips_2016.trip_distance', 'tlc_yellow_trips_2016.tip_amount', 'tlc_yellow_trips_2016.total_amount', 'tlc_yellow_trips_2016.pickup_location_id', 'tlc_yellow_trips_2016.dropoff_location_id', 'tlc_yellow_trips_2016.passenger_count', 'tlc_yellow_trips_2016.tolls_amount', 'tlc_yellow_trips_2016.mta_tax', 'tlc_yellow_trips_2016.fare_amount', 'taxi_zone_geom.zone_name', 'taxi_zone_geom.zone_id']`
335
+
336
+ **Substitutions:**
337
+ - `tlc_yellow_trips_*` β†’ `tlc_yellow_trips_2016` (sql_specific:2016 -> pad4)
338
+
339
+ ---
340
+
341
+ ### sid=175 iid=bq087 db=covid19_symptom_search
342
+
343
+ **Q:** Please calculate the overall percentage change in the average weekly search frequency for the symptom 'Anosmia' across the five New York City countiesβ€”Bronx County, Queens County, Kings County, New York County, and Richmond Countyβ€”by comparing the combined data from January 1, 2019, through December
344
+
345
+ **Old gold (4):** `['symptom_search_*.symptom_anosmia', 'symptom_search_*.sub_region_1', 'symptom_search_*.sub_region_2', 'symptom_search_*.date']`
346
+
347
+ **New gold (4):** `['symptom_search_2019.symptom_anosmia', 'symptom_search_2019.sub_region_1', 'symptom_search_2019.sub_region_2', 'symptom_search_2019.date']`
348
+
349
+ **Substitutions:**
350
+ - `symptom_search_*` β†’ `symptom_search_2019` (q_year_fallback:2019 -> pad4)
351
+
352
+ ---
353
+
354
+ ### sid=176 iid=bq088 db=covid19_symptom_search
355
+
356
+ **Q:** Please calculate the average levels of anxiety and depression symptoms from the weekly country data for the United States during the periods from January 1, 2019, to January 1, 2020, and from January 1, 2020, to January 1, 2021. Then, compute the percentage increase in these average symptom levels f
357
+
358
+ **Old gold (4):** `['symptom_search_*.symptom_anxiety', 'symptom_search_*.symptom_depression', 'symptom_search_*.country_region_code', 'symptom_search_*.date']`
359
+
360
+ **New gold (4):** `['symptom_search_2019.symptom_anxiety', 'symptom_search_2019.symptom_depression', 'symptom_search_2019.country_region_code', 'symptom_search_2019.date']`
361
+
362
+ **Substitutions:**
363
+ - `symptom_search_*` β†’ `symptom_search_2019` (q_year_fallback:2019 -> pad4)
364
+
365
+ ---
366
+
367
+ ### sid=177 iid=bq089 db=covid19_usa
368
+
369
+ **Q:** Given the latest population estimates from the 2018 five-year American Community Survey, what is the number of vaccine sites per 1000 people for counties in California?
370
+
371
+ **Old gold (12):** `['state_*.geo_id', 'facility_boundary_us_all.facility_place_id', 'place_*.total_pop', 'facility_boundary_us_all.facility_sub_region_2_code', 'place_*.geo_id', 'censustract_*.total_pop', 'facility_boundary_us_all.facility_sub_region_2', 'congressionaldistrict_*.total_pop', 'state_*.total_pop', 'congressionaldistrict_*.geo_id', 'facility_boundary_us_all.facility_sub_region_1', 'censustract_*.geo_id']`
372
+
373
+ **New gold (12):** `['state_2018.geo_id', 'facility_boundary_us_all.facility_place_id', 'place_2018.total_pop', 'facility_boundary_us_all.facility_sub_region_2_code', 'place_2018.geo_id', 'censustract_2018.total_pop', 'facility_boundary_us_all.facility_sub_region_2', 'congressionaldistrict_2018.total_pop', 'state_2018.total_pop', 'congressionaldistrict_2018.geo_id', 'facility_boundary_us_all.facility_sub_region_1', 'censustract_2018.geo_id']`
374
+
375
+ **Substitutions:**
376
+ - `censustract_*` β†’ `censustract_2018` (q_year_fallback:2018 -> pad4)
377
+ - `state_*` β†’ `state_2018` (q_year_fallback:2018 -> pad4)
378
+ - `congressionaldistrict_*` β†’ `congressionaldistrict_2018` (q_year_fallback:2018 -> pad4)
379
+ - `place_*` β†’ `place_2018` (q_year_fallback:2018 -> pad4)
380
+
381
+ ---
382
+
383
+ ### sid=178 iid=bq407 db=covid19_usa
384
+
385
+ **Q:** Find the top three counties with populations over 50,000, using the 2020 5-year census data, that had the highest COVID-19 case fatality rates on August 27, 2020. For these counties, provide the name, state, median age, total population, number of confirmed COVID-19 cases per 100,000 people, number
386
+
387
+ **Old gold (30):** `['state_*.geo_id', 'place_*.geo_id', 'county_*.geo_id', 'schooldistrictelementary_*.geo_id', 'congressionaldistrict_*.median_age', 'schooldistrictsecondary_*.median_age', 'state_*.total_pop', 'summary.state', 'place_*.total_pop', 'summary.county_name', 'summary.date', 'schooldistrictelementary_*.median_age', 'congressionaldistrict_*.total_pop', 'congressionaldistrict_*.geo_id', 'schooldistrictsecondary_*.total_pop', 'summary.deaths', 'puma_*.total_pop', 'puma_*.median_age', 'summary.confirmed_cases', 'state_*.median_age', 'schooldistrictsecondary_*.geo_id', 'summary.county_fips_code', 'county_*.geo_id', 'schooldistrictelementary_*.total_pop', 'county_*.total_pop', 'puma_*.geo_id', 'county_*.median_age', 'county_*.median_age', 'place_*.median_age', 'county_*.total_pop']`
388
+
389
+ **New gold (30):** `['state_2020.geo_id', 'place_2020.geo_id', 'county_2020_5yr.geo_id', 'schooldistrictelementary_2020.geo_id', 'congressionaldistrict_2020.median_age', 'schooldistrictsecondary_2020.median_age', 'state_2020.total_pop', 'summary.state', 'place_2020.total_pop', 'summary.county_name', 'summary.date', 'schooldistrictelementary_2020.median_age', 'congressionaldistrict_2020.total_pop', 'congressionaldistrict_2020.geo_id', 'schooldistrictsecondary_2020.total_pop', 'summary.deaths', 'puma_2020.total_pop', 'puma_2020.median_age', 'summary.confirmed_cases', 'state_2020.median_age', 'schooldistrictsecondary_2020.geo_id', 'summary.county_fips_code', 'county_2020_5yr.geo_id', 'schooldistrictelementary_2020.total_pop', 'county_2020_5yr.total_pop', 'puma_2020.geo_id', 'county_2020_5yr.median_age', 'county_2020_5yr.median_age', 'place_2020.median_age', 'county_2020_5yr.total_pop']`
390
+
391
+ **Substitutions:**
392
+ - `puma_*` β†’ `puma_2020` (q_year_fallback:2020 -> pad4)
393
+ - `state_*` β†’ `state_2020` (q_year_fallback:2020 -> pad4)
394
+ - `congressionaldistrict_*` β†’ `congressionaldistrict_2020` (q_year_fallback:2020 -> pad4)
395
+ - `place_*` β†’ `place_2020` (q_year_fallback:2020 -> pad4)
396
+ - `schooldistrictsecondary_*` β†’ `schooldistrictsecondary_2020` (q_year_fallback:2020 -> pad4)
397
+ - `county_*` β†’ `county_2020_5yr` (sql_specific_year_alpha:2020_5yr)
398
+ - `schooldistrictelementary_*` β†’ `schooldistrictelementary_2020` (q_year_fallback:2020 -> pad4)
399
+
400
+ ---
401
+
402
+ ### sid=182 iid=bq061 db=census_bureau_acs_1
403
+
404
+ **Q:** Which census tract has witnessed the largest increase in median income between 2015 and 2018 in California? Tell me the tract code.
405
+
406
+ **Old gold (6):** `['state_*.geo_id', 'census_tracts_*.geo_id', 'censustract_*.median_income', 'censustract_*.geo_id', 'state_*.median_income', 'census_tracts_*.tract_ce']`
407
+
408
+ **New gold (6):** `['state_2015.geo_id', 'census_tracts_2015.geo_id', 'censustract_2018_5yr.median_income', 'censustract_2018_5yr.geo_id', 'state_2015.median_income', 'census_tracts_2015.tract_ce']`
409
+
410
+ **Substitutions:**
411
+ - `censustract_*` β†’ `censustract_2018_5yr` (sql_specific_year_alpha:2018_5yr)
412
+ - `state_*` β†’ `state_2015` (q_year_fallback:2015 -> pad4)
413
+ - `census_tracts_*` β†’ `census_tracts_2015` (q_year_fallback:2015 -> pad4)
414
+
415
+ ---
416
+
417
+ ### sid=183 iid=bq064 db=census_bureau_acs_1
418
+
419
+ **Q:** Using the 2017 U.S. Census Tract data from the BigQuery public datasets, you need to proportionally allocate each tract's population and income to the zip codes based on the overlapping area between their geographic boundaries. Then, filter the results to include only those zip codes located within
420
+
421
+ **Old gold (18):** `['state_*.geo_id', 'census_tracts_*.geo_id', 'us_census_tracts_national.functional_status', 'us_census_tracts_national.tract_ce', 'state_*.income_per_capita', 'us_census_tracts_national.geo_id', 'census_tracts_*.functional_status', 'censustract_*.total_pop', 'censustract_*.income_per_capita', 'state_*.total_pop', 'zip_codes.zip_code', 'zip_codes.zip_code_geom', 'zip_codes.functional_status', 'census_tracts_*.tract_geom', 'zip_codes.state_code', 'censustract_*.geo_id', 'us_census_tracts_national.tract_geom', 'census_tracts_*.tract_ce']`
422
+
423
+ **New gold (18):** `['state_2017.geo_id', 'census_tracts_2017.geo_id', 'us_census_tracts_national.functional_status', 'us_census_tracts_national.tract_ce', 'state_2017.income_per_capita', 'us_census_tracts_national.geo_id', 'census_tracts_2017.functional_status', 'censustract_2017_5yr.total_pop', 'censustract_2017_5yr.income_per_capita', 'state_2017.total_pop', 'zip_codes.zip_code', 'zip_codes.zip_code_geom', 'zip_codes.functional_status', 'census_tracts_2017.tract_geom', 'zip_codes.state_code', 'censustract_2017_5yr.geo_id', 'us_census_tracts_national.tract_geom', 'census_tracts_2017.tract_ce']`
424
+
425
+ **Substitutions:**
426
+ - `censustract_*` β†’ `censustract_2017_5yr` (sql_specific_year_alpha:2017_5yr)
427
+ - `state_*` β†’ `state_2017` (q_year_fallback:2017 -> pad4)
428
+ - `census_tracts_*` β†’ `census_tracts_2017` (q_year_fallback:2017 -> pad4)
429
+
430
+ ---
431
+
432
+ ### sid=194 iid=bq406 db=google_dei
433
+
434
+ **Q:** Please calculate the growth rates for Asians, Black people, Latinx people, Native Americans, White people, US women, US men, global women, and global men from 2014 to 2024 concerning the overall workforce.
435
+
436
+ **Old gold (11):** `['dar_non_intersectional_*.race_hispanic_latinx', 'dar_intersectional_*.race_white', 'dar_region_non_intersectional_representation.race_asian', 'dar_region_non_intersectional_representation.report_year', 'dar_non_intersectional_*.gender_us_men', 'dar_non_intersectional_*.gender_global_men', 'dar_non_intersectional_*.gender_global_women', 'dar_intersectional_*.race_black', 'dar_non_intersectional_*.gender_us_women', 'dar_intersectional_*.race_native_american', 'dar_region_non_intersectional_representation.workforce']`
437
+
438
+ **New gold (11):** `['dar_non_intersectional_2014.race_hispanic_latinx', 'dar_intersectional_2014.race_white', 'dar_region_non_intersectional_representation.race_asian', 'dar_region_non_intersectional_representation.report_year', 'dar_non_intersectional_2014.gender_us_men', 'dar_non_intersectional_2014.gender_global_men', 'dar_non_intersectional_2014.gender_global_women', 'dar_intersectional_2014.race_black', 'dar_non_intersectional_2014.gender_us_women', 'dar_intersectional_2014.race_native_american', 'dar_region_non_intersectional_representation.workforce']`
439
+
440
+ **Substitutions:**
441
+ - `dar_intersectional_*` β†’ `dar_intersectional_2014` (q_year_fallback:2014 -> pad4)
442
+ - `dar_non_intersectional_*` β†’ `dar_non_intersectional_2014` (q_year_fallback:2014 -> pad4)
443
+
444
+ ---
445
+
446
+ ### sid=228 iid=bq105 db=nhtsa_traffic_fatalities_plus
447
+
448
+ **Q:** According to the 2015 and 2016 accident and driver distraction, and excluding cases where the driver’s distraction status is recorded as 'Not Distracted,' 'Unknown if Distracted,' or 'Not Reported,' how many traffic accidents per 100,000 people were caused by driver distraction in each U.S. state fo
449
+
450
+ **Old gold (12):** `[' accident_2015.consecutive_number', ' accident_2015.state_name', ' distract_2015.consecutive_number', ' distract_2015.driver_distracted_by_name', 'population_by_zip_*.population', 'population_by_zip_*.zipcode', 'zipcode_area.zipcode', 'zipcode_area.state_name', ' accident_2016.consecutive_number', ' accident_2016.state_name', ' distract_2016.consecutive_number', ' distract_2016.driver_distracted_by_name']`
451
+
452
+ **New gold (12):** `[' accident_2015.consecutive_number', ' accident_2015.state_name', ' distract_2015.consecutive_number', ' distract_2015.driver_distracted_by_name', 'population_by_zip_2010.population', 'population_by_zip_2010.zipcode', 'zipcode_area.zipcode', 'zipcode_area.state_name', ' accident_2016.consecutive_number', ' accident_2016.state_name', ' distract_2016.consecutive_number', ' distract_2016.driver_distracted_by_name']`
453
+
454
+ **Substitutions:**
455
+ - `population_by_zip_*` β†’ `population_by_zip_2010` (sql_specific:2010 -> pad4)
456
+
457
+ ---
458
+
459
+ ### sid=237 iid=bq352 db=sdoh
460
+
461
+ **Q:** Please list the average number of prenatal weeks in 2018 for counties in Wisconsin where more than 5% of the employed population had commutes of 45-59 minutes in 2017.
462
+
463
+ **Old gold (25):** `['state_*.geo_id', 'place_*.geo_id', 'cbsa_*.geo_id', 'state_*.employed_pop', 'county_*.geo_id', 'county_*.commute_45_59_mins', 'schooldistrictelementary_*.geo_id', 'place_*.employed_pop', 'place_*.commute_45_59_mins', 'county_*.County_of_Residence', 'cbsa_*.employed_pop', 'county_*.employed_pop', 'censustract_*.commute_45_59_mins', 'state_*.commute_45_59_mins', 'schooldistrictelementary_*.commute_45_59_mins', 'censustract_*.geo_id', 'cbsa_*.commute_45_59_mins', 'schooldistrictelementary_*.employed_pop', 'county_*.County_of_Residence_FIPS', 'censustract_*.employed_pop', 'county_*.Year', 'puma_*.commute_45_59_mins', 'county_*.Ave_Number_of_Prenatal_Wks', 'puma_*.geo_id', 'puma_*.employed_pop']`
464
+
465
+ **New gold (25):** `['state_2017.geo_id', 'place_2017.geo_id', 'cbsa_2017.geo_id', 'state_2017.employed_pop', 'county_2017_5yr.geo_id', 'county_2017_5yr.commute_45_59_mins', 'schooldistrictelementary_2017.geo_id', 'place_2017.employed_pop', 'place_2017.commute_45_59_mins', 'county_2017_5yr.County_of_Residence', 'cbsa_2017.employed_pop', 'county_2017_5yr.employed_pop', 'censustract_2017.commute_45_59_mins', 'state_2017.commute_45_59_mins', 'schooldistrictelementary_2017.commute_45_59_mins', 'censustract_2017.geo_id', 'cbsa_2017.commute_45_59_mins', 'schooldistrictelementary_2017.employed_pop', 'county_2017_5yr.County_of_Residence_FIPS', 'censustract_2017.employed_pop', 'county_2017_5yr.Year', 'puma_2017.commute_45_59_mins', 'county_2017_5yr.Ave_Number_of_Prenatal_Wks', 'puma_2017.geo_id', 'puma_2017.employed_pop']`
466
+
467
+ **Substitutions:**
468
+ - `puma_*` β†’ `puma_2017` (q_year_fallback:2017 -> pad4)
469
+ - `censustract_*` β†’ `censustract_2017` (q_year_fallback:2017 -> pad4)
470
+ - `state_*` β†’ `state_2017` (q_year_fallback:2017 -> pad4)
471
+ - `place_*` β†’ `place_2017` (q_year_fallback:2017 -> pad4)
472
+ - `cbsa_*` β†’ `cbsa_2017` (q_year_fallback:2017 -> pad4)
473
+ - `county_*` β†’ `county_2017_5yr` (sql_specific_year_alpha:2017_5yr)
474
+ - `schooldistrictelementary_*` β†’ `schooldistrictelementary_2017` (q_year_fallback:2017 -> pad4)
475
+
476
+ ---
477
+
478
+ ### sid=238 iid=bq074 db=sdoh
479
+
480
+ **Q:** Count the number of counties that experienced an increase in unemployment from 2015 to 2018, using 5-year ACS data, and a decrease in dual-eligible enrollee counts between December 1, 2015, and December 1, 2018.
481
+
482
+ **Old gold (16):** `['dual_eligible_enrollment_by_county_and_program.County_Name', 'state_*.geo_id', 'county_*.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Public_Total', 'place_*.geo_id', 'congressionaldistrict_*.unemployed_pop', 'schooldistrictsecondary_*.unemployed_pop', 'puma_*.geo_id', 'dual_eligible_enrollment_by_county_and_program.FIPS', 'county_*.geo_id', 'state_*.unemployed_pop', 'place_*.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Date', 'congressionaldistrict_*.geo_id', 'puma_*.unemployed_pop', 'schooldistrictsecondary_*.geo_id']`
483
+
484
+ **New gold (16):** `['dual_eligible_enrollment_by_county_and_program.County_Name', 'state_2015.geo_id', 'county_2018_5yr.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Public_Total', 'place_2015.geo_id', 'congressionaldistrict_2015.unemployed_pop', 'schooldistrictsecondary_2015.unemployed_pop', 'puma_2015.geo_id', 'dual_eligible_enrollment_by_county_and_program.FIPS', 'county_2018_5yr.geo_id', 'state_2015.unemployed_pop', 'place_2015.unemployed_pop', 'dual_eligible_enrollment_by_county_and_program.Date', 'congressionaldistrict_2015.geo_id', 'puma_2015.unemployed_pop', 'schooldistrictsecondary_2015.geo_id']`
485
+
486
+ **Substitutions:**
487
+ - `puma_*` β†’ `puma_2015` (q_year_fallback:2015 -> pad4)
488
+ - `state_*` β†’ `state_2015` (q_year_fallback:2015 -> pad4)
489
+ - `congressionaldistrict_*` β†’ `congressionaldistrict_2015` (q_year_fallback:2015 -> pad4)
490
+ - `place_*` β†’ `place_2015` (q_year_fallback:2015 -> pad4)
491
+ - `schooldistrictsecondary_*` β†’ `schooldistrictsecondary_2015` (q_year_fallback:2015 -> pad4)
492
+ - `county_*` β†’ `county_2018_5yr` (sql_specific_year_alpha:2018_5yr)
493
+
494
+ ---
495
+
496
+ ### sid=239 iid=bq066 db=sdoh
497
+
498
+ **Q:** Could you assess the relationship between the poverty rates from the previous year's census data and the percentage of births without maternal morbidity for the years 2016 to 2018? Use only data for births where no maternal morbidity was reported and for each year, use the 5-year census data from th
499
+
500
+ **Old gold (28):** `['state_*.geo_id', 'county_*.geo_id', 'county_*.Maternal_Morbidity_YN', 'cbsa_*.geo_id', 'place_*.geo_id', 'county_*.geo_id', 'place_*.pop_determined_poverty_status', 'cbsa_*.poverty', 'county_*.pop_determined_poverty_status', 'cbsa_*.pop_determined_poverty_status', 'county_*.geo_id', 'county_*.poverty', 'county_*.pop_determined_poverty_status', 'county_*.geo_id', 'puma_*.poverty', 'county_*.pop_determined_poverty_status', 'state_*.poverty', 'puma_*.pop_determined_poverty_status', 'county_*.Births', 'county_*.County_of_Residence_FIPS', 'county_*.poverty', 'county_*.Year', 'county_*.poverty', 'place_*.poverty', 'puma_*.geo_id', 'state_*.pop_determined_poverty_status', 'county_*.pop_determined_poverty_status', 'county_*.poverty']`
501
+
502
+ **New gold (28):** `['state_2016.geo_id', 'county_2015_5yr.geo_id', 'county_2015_5yr.Maternal_Morbidity_YN', 'cbsa_2016.geo_id', 'place_2016.geo_id', 'county_2015_5yr.geo_id', 'place_2016.pop_determined_poverty_status', 'cbsa_2016.poverty', 'county_2015_5yr.pop_determined_poverty_status', 'cbsa_2016.pop_determined_poverty_status', 'county_2015_5yr.geo_id', 'county_2015_5yr.poverty', 'county_2015_5yr.pop_determined_poverty_status', 'county_2015_5yr.geo_id', 'puma_2016.poverty', 'county_2015_5yr.pop_determined_poverty_status', 'state_2016.poverty', 'puma_2016.pop_determined_poverty_status', 'county_2015_5yr.Births', 'county_2015_5yr.County_of_Residence_FIPS', 'county_2015_5yr.poverty', 'county_2015_5yr.Year', 'county_2015_5yr.poverty', 'place_2016.poverty', 'puma_2016.geo_id', 'state_2016.pop_determined_poverty_status', 'county_2015_5yr.pop_determined_poverty_status', 'county_2015_5yr.poverty']`
503
+
504
+ **Substitutions:**
505
+ - `puma_*` β†’ `puma_2016` (q_year_fallback:2016 -> pad4)
506
+ - `state_*` β†’ `state_2016` (q_year_fallback:2016 -> pad4)
507
+ - `place_*` β†’ `place_2016` (q_year_fallback:2016 -> pad4)
508
+ - `cbsa_*` β†’ `cbsa_2016` (q_year_fallback:2016 -> pad4)
509
+ - `county_*` β†’ `county_2015_5yr` (sql_specific_year_alpha:2015_5yr)
510
+
511
+ ---
512
+
513
+ ### sid=267 iid=bq204 db=eclipse_megamovie
514
+
515
+ **Q:** Find the user with the highest total clicks across all records from all available photo collections.
516
+
517
+ **Old gold (1):** `['photos_v_0_*.user']`
518
+
519
+ **New gold (1):** `['photos_v_0_1.user']`
520
+
521
+ **Substitutions:**
522
+ - `photos_v_0_*` β†’ `photos_v_0_1` (sql_specific:1 -> pad4)
523
+
524
+ ---
525
+
526
+ ### sid=279 iid=bq049 db=iowa_liquor_sales_plus
527
+
528
+ **Q:** Please show the monthly per capita Bourbon Whiskey sales during 2022 in Dubuque County for the zip code that ranks third in total Bourbon Whiskey sales, using only the population aged 21 and older.
529
+
530
+ **Old gold (8):** `['sales.category_name', 'sales.date', 'sales.zip_code', 'sales.sale_dollars', 'sales.county', 'population_by_zip_*.zipcode', 'population_by_zip_*.population', 'population_by_zip_*.minimum_age']`
531
+
532
+ **New gold (8):** `['sales.category_name', 'sales.date', 'sales.zip_code', 'sales.sale_dollars', 'sales.county', 'population_by_zip_2022.zipcode', 'population_by_zip_2022.population', 'population_by_zip_2022.minimum_age']`
533
+
534
+ **Substitutions:**
535
+ - `population_by_zip_*` β†’ `population_by_zip_2022` (q_year_fallback:2022 -> pad4)
536
+
537
+ ---
538
+
539
+ ### sid=281 iid=bq286 db=usa_names
540
+
541
+ **Q:** Can you tell me the name of the most popular female baby in Wyoming for the year 2021, based on the proportion of female babies given that name compared to the total number of female babies given the same name across all states?
542
+
543
+ **Old gold (5):** `['usa_1910_*.name', 'usa_1910_*.gender', 'usa_1910_*.year', 'usa_1910_*.number', 'usa_1910_*.state']`
544
+
545
+ **New gold (5):** `['usa_1910_current.name', 'usa_1910_current.gender', 'usa_1910_current.year', 'usa_1910_current.number', 'usa_1910_current.state']`
546
+
547
+ **Substitutions:**
548
+ - `usa_1910_*` β†’ `usa_1910_current` (usa_1910_current_singleton)
549
+
550
+ ---
551
+
552
+ ### sid=284 iid=bq143 db=CPTAC_PDC
553
+
554
+ **Q:** Use CPTAC proteomics and RNAseq data for Clear Cell Renal Cell Carcinoma to select 'Primary Tumor' and 'Solid Tissue Normal' samples. Join the datasets on sample submitter IDs and gene symbols. Calculate the correlation between protein abundance (log2 ratio) and gene expression levels (log-transform
555
+
556
+ **Old gold (11):** `['quant_proteome_*.case_id', 'quant_proteome_*.aliquot_id', 'quant_proteome_*.gene_symbol', 'quant_proteome_*.protein_abundance_log2ratio', 'aliquot_to_case_mapping_current.sample_submitter_id', 'aliquot_to_case_mapping_current.sample_type', 'aliquot_to_case_mapping_current.case_id', 'aliquot_to_case_mapping_current.aliquot_id', 'RNAseq_hg38_gdc_current.gene_name', 'RNAseq_hg38_gdc_current.fpkm_unstranded', 'RNAseq_hg38_gdc_current.sample_barcode']`
557
+
558
+ **New gold (11):** `['quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.case_id', 'quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.aliquot_id', 'quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.gene_symbol', 'quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current.protein_abundance_log2ratio', 'aliquot_to_case_mapping_current.sample_submitter_id', 'aliquot_to_case_mapping_current.sample_type', 'aliquot_to_case_mapping_current.case_id', 'aliquot_to_case_mapping_current.aliquot_id', 'RNAseq_hg38_gdc_current.gene_name', 'RNAseq_hg38_gdc_current.fpkm_unstranded', 'RNAseq_hg38_gdc_current.sample_barcode']`
559
+
560
+ **Substitutions:**
561
+ - `quant_proteome_*` β†’ `quant_proteome_CPTAC_CCRCC_discovery_study_pdc_current` (sql_specific_alphanumeric:CPTAC_CCRCC_discovery_study_pdc_current)
562
+
563
+ ---
564
+
565
+ ### sid=318 iid=bq006 db=austin
566
+
567
+ **Q:** What is the date with the second highest Z-score for daily counts of 'PUBLIC INTOXICATION' incidents in Austin for the year 2016? List the date in the format of '2016-xx-xx'.
568
+
569
+ **Old gold (2):** `['incidents_*.date', 'incidents_*.descript']`
570
+
571
+ **New gold (2):** `['incidents_2016.date', 'incidents_2016.descript']`
572
+
573
+ **Substitutions:**
574
+ - `incidents_*` β†’ `incidents_2016` (sql_specific:2016 -> pad4)
575
+
576
+ ---
577
+
578
+ ### sid=323 iid=bq430 db=ebi_chembl
579
+
580
+ **Q:** Find pairs of different molecules tested in the same assay and standard type, where both have 10–15 heavy atoms, fewer than 5 activities in that assay, fewer than 2 duplicate activities, non-null standard values, and pChEMBL values over 10. For each pair, report the maximum heavy atom count, the lat
581
+
582
+ **Old gold (13):** `['activities_*.assay_id', 'activities_*.standard_type', 'activities_*.activity_id', 'activities_*.standard_value', 'activities_*.standard_relation', 'activities_*.pchembl_value', 'activities_*.molregno', 'compound_structures_25.canonical_smiles', 'compound_properties_26.heavy_atoms', 'docs_*.doc_id', 'docs_*.year', 'docs_*.journal', 'docs_*.first_page']`
583
+
584
+ **New gold (13):** `['activities_29.assay_id', 'activities_29.standard_type', 'activities_29.activity_id', 'activities_29.standard_value', 'activities_29.standard_relation', 'activities_29.pchembl_value', 'activities_29.molregno', 'compound_structures_25.canonical_smiles', 'compound_properties_26.heavy_atoms', 'docs_29.doc_id', 'docs_29.year', 'docs_29.journal', 'docs_29.first_page']`
585
+
586
+ **Substitutions:**
587
+ - `activities_*` β†’ `activities_29` (sql_specific:29 -> pad4)
588
+ - `docs_*` β†’ `docs_29` (sql_specific:29 -> pad4)
589
+
590
+ ---
591
+
592
+ ### sid=369 iid=ga001 db=ga4
593
+
594
+ **Q:** I want to know the preferences of customers who purchased the Google Navy Speckled Tee in December 2020. What other product was purchased with the highest total quantity alongside this item?
595
+
596
+ **Old gold (5):** `['events_*.user_pseudo_id', 'events_*.event_name', 'events_*.items', 'events_*.items', 'events_*.items']`
597
+
598
+ **New gold (5):** `['events_20201201.user_pseudo_id', 'events_20201201.event_name', 'events_20201201.items', 'events_20201201.items', 'events_20201201.items']`
599
+
600
+ **Substitutions:**
601
+ - `events_*` β†’ `events_20201201` (table_suffix_range:20201201 -> pad8)
602
+
603
+ ---
604
+
605
+ ### sid=370 iid=ga002 db=ga4
606
+
607
+ **Q:** Tell me the most purchased other products and their quantities by customers who bought the Google Red Speckled Tee each month for the three months starting from November 2020.
608
+
609
+ **Old gold (3):** `['events_*.user_pseudo_id', 'events_*.items', 'events_*.event_name']`
610
+
611
+ **New gold (3):** `['events_20201101.user_pseudo_id', 'events_20201101.items', 'events_20201101.event_name']`
612
+
613
+ **Substitutions:**
614
+ - `events_*` β†’ `events_20201101` (sql_8digit_literal:20201101)
615
+
616
+ ---
617
+
618
+ ### sid=372 iid=ga004 db=ga4
619
+
620
+ **Q:** Can you figure out the average difference in pageviews between users who bought something and those who didn’t in December 2020? Just label anyone who was involved in purchase events as a purchaser.
621
+
622
+ **Old gold (2):** `['events_*.user_pseudo_id', 'events_*.event_name']`
623
+
624
+ **New gold (2):** `['events_20201201.user_pseudo_id', 'events_20201201.event_name']`
625
+
626
+ **Substitutions:**
627
+ - `events_*` β†’ `events_20201201` (table_suffix_range:20201201 -> pad8)
628
+
629
+ ---
630
+
631
+ ### sid=373 iid=ga008 db=ga4
632
+
633
+ **Q:** Could you provide the total number of page views for each day in November 2020 as well as the average number of page views per user on those days, restricted to users who made at least one purchase in November 2020?
634
+
635
+ **Old gold (3):** `['events_*.user_pseudo_id', 'events_*.event_date', 'events_*.event_name']`
636
+
637
+ **New gold (3):** `['events_20201101.user_pseudo_id', 'events_20201101.event_date', 'events_20201101.event_name']`
638
+
639
+ **Substitutions:**
640
+ - `events_*` β†’ `events_20201101` (table_suffix_range:20201101 -> pad8)
641
+
642
+ ---
643
+
644
+ ### sid=374 iid=ga017 db=ga4
645
+
646
+ **Q:** How many distinct users viewed the most frequently visited page during January 2021?
647
+
648
+ **Old gold (4):** `['events_*.event_params', 'events_*.event_name', 'events_*.event_timestamp', 'events_*.user_pseudo_id']`
649
+
650
+ **New gold (4):** `['events_20210101.event_params', 'events_20210101.event_name', 'events_20210101.event_timestamp', 'events_20210101.user_pseudo_id']`
651
+
652
+ **Substitutions:**
653
+ - `events_*` β†’ `events_20210101` (table_suffix_range:20210101 -> pad8)
654
+
655
+ ---
656
+
657
+ ### sid=377 iid=ga018 db=ga4
658
+
659
+ **Q:** On January 2nd, 2021, I want to determine the percentage of times users transition from a product list page (PLP) view to a product detail page (PDP) view within the same session, using only page_view events. Could you calculate how many PLP views eventually led to a PDP view in the same session on
660
+
661
+ **Old gold (10):** `['events_*.event_name', 'events_*.event_date', 'events_*.event_timestamp', 'events_*.user_pseudo_id', 'events_*.user_id', 'events_*.device', 'events_*.geo', 'events_*.traffic_source', 'events_*.event_params', 'events_*.user_properties']`
662
+
663
+ **New gold (10):** `['events_20210102.event_name', 'events_20210102.event_date', 'events_20210102.event_timestamp', 'events_20210102.user_pseudo_id', 'events_20210102.user_id', 'events_20210102.device', 'events_20210102.geo', 'events_20210102.traffic_source', 'events_20210102.event_params', 'events_20210102.user_properties']`
664
+
665
+ **Substitutions:**
666
+ - `events_*` β†’ `events_20210102` (sql_8digit_literal:20210102)
667
+
668
+ ---
669
+
670
+ ### sid=382 iid=ga010 db=ga4
671
+
672
+ **Q:** Can you give me an overview of our website traffic for December 2020? I'm particularly interested in the channel with the fourth highest number of sessions.
673
+
674
+ **Old gold (3):** `['events_*.user_pseudo_id', 'events_*.event_params', 'events_*.event_timestamp']`
675
+
676
+ **New gold (3):** `['events_20201201.user_pseudo_id', 'events_20201201.event_params', 'events_20201201.event_timestamp']`
677
+
678
+ **Substitutions:**
679
+ - `events_*` β†’ `events_20201201` (table_suffix_range:20201201 -> pad8)
680
+
681
+ ---
682
+
683
+ ### sid=385 iid=ga012 db=ga4
684
+
685
+ **Q:** On November 30, 2020, identify the item category with the highest tax rate by dividing tax value in usd by purchase revenue in usd for purchase events, and then retrieve the transaction IDs, total item quantities, and both purchase revenue in usd and purchase revenue for those purchase events in tha
686
+
687
+ **Old gold (7):** `['events_*.items', 'events_*.ecommerce', 'events_*.ecommerce', 'events_*.event_name', 'events_*.ecommerce', 'events_*.ecommerce', 'events_*.ecommerce']`
688
+
689
+ **New gold (7):** `['events_20200101.items', 'events_20200101.ecommerce', 'events_20200101.ecommerce', 'events_20200101.event_name', 'events_20200101.ecommerce', 'events_20200101.ecommerce', 'events_20200101.ecommerce']`
690
+
691
+ **Substitutions:**
692
+ - `events_*` β†’ `events_20200101` (q_year_fallback:2020 -> pad8)
693
+
694
+ ---
695
+
data/spider2_original/review.xlsx ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ea5c49b0a59d75cf15fef63d3060338fefce5caf1de19b4ff0a3c1f60fe85be2
3
+ size 118096
data/spider2_original/spider2_original_lite_samples.json ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:56cd6497cf0e479d8f3d880a7c57395c670d80bedce97325054ffdc1dde95dc1
3
+ size 198491180
data/spider2_original/substitution_log.json ADDED
The diff for this file is too large to render. See raw diff