<!doctype html><meta charset="utf-8"><title>Opus performance deep dive</title><style>body{font:16px/1.45 system-ui,-apple-system,Segoe UI,sans-serif;margin:2rem;color:#17202a} table{border-collapse:collapse;width:100%;margin:1rem 0 2rem;font-size:14px} th,td{border:1px solid #d6dde5;padding:.45rem .55rem;text-align:right} th:first-child,td:first-child,td:nth-child(2){text-align:left} th{background:#f3f6fa;position:sticky;top:0} code{background:#f3f6fa;padding:.1rem .25rem;border-radius:.25rem}.note{color:#526070}.bad{color:#9b1c1c}.good{color:#176b35}</style>
<h1>Opus token and wall-time performance deep dive</h1>
<p class="note">Date: 2026-05-30. Standalone companion to the Birch HTML benchmark report.</p>
<h2>Headline</h2><ul>
<li>Fastest wall time: <code>opus?task_budget=50000</code> at 325.2s, but it generated only 4 of 5 artifacts successfully.</li>
<li>Lowest total tokens: <code>opus?task_budget=50000</code> at 516,007 tokens.</li>
<li>Best 100-quality Opus row by quality-efficiency: <code>opus47</code> (872.9s; 2,041,367 tokens; QE rank 2).</li>
<li><code>task_budget=50000</code> cut wall time by 66.8% and total tokens by 78.1% vs <code>opus48</code>, but quality fell from 100.0 to 87.0 and one artifact (module-explainer) failed generation.</li>
<li><code>task_budget=200000</code> was slower (+21.5% wall time) and used more tokens (+56.4%) than plain <code>opus48</code>, with lower quality (97.4 vs 100.0).</li>
</ul>
<h2>Overall Opus rows</h2>
<table><thead><tr><th>model</th><th>suite</th><th>quality</th><th>gen</th><th>wall time</th><th>total tokens</th><th>input</th><th>output</th><th>effective input</th><th>cache %</th><th>tok/s</th><th>out tok/s</th><th>turns</th><th>tools</th><th>det</th><th>VLM</th><th>QE rank</th></tr></thead><tbody>
<tr><td>opus47</td><td>publish</td><td>100.0</td><td>5/5</td><td>872.9s</td><td>2,041,367</td><td>1,980,822</td><td>60,545</td><td>228,388</td><td>88.5%</td><td>2338.7</td><td>69.4</td><td>67</td><td>83</td><td>0</td><td>0F/0W</td><td>2</td></tr>
<tr><td>opus46</td><td>new-model-day</td><td>98.8</td><td>5/5</td><td>997.5s</td><td>1,851,922</td><td>1,793,023</td><td>58,899</td><td>191,646</td><td>89.3%</td><td>1856.5</td><td>59.0</td><td>73</td><td>98</td><td>2</td><td>0F/0W</td><td>7</td></tr>
<tr><td>opus48</td><td>new-model-day</td><td>100.0</td><td>5/5</td><td>979.4s</td><td>2,354,120</td><td>2,286,911</td><td>67,209</td><td>267,975</td><td>88.3%</td><td>2403.7</td><td>68.6</td><td>71</td><td>91</td><td>0</td><td>0F/0W</td><td>6</td></tr>
<tr><td>opus?task_budget=50000</td><td>new-model-day</td><td>87.0</td><td>4/5</td><td>325.2s</td><td>516,007</td><td>488,908</td><td>27,099</td><td>184,283</td><td>62.3%</td><td>1586.7</td><td>83.3</td><td>28</td><td>29</td><td>4</td><td>1F/1W</td><td>8</td></tr>
<tr><td>opus?task_budget=200000</td><td>new-model-day</td><td>97.4</td><td>5/5</td><td>1,189.6s</td><td>3,680,965</td><td>3,584,777</td><td>96,188</td><td>311,493</td><td>91.3%</td><td>3094.2</td><td>80.9</td><td>88</td><td>105</td><td>4</td><td>0F/2W</td><td>10</td></tr>
</tbody></table>
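<p class="note">The rate and cache columns can be recomputed from the raw counts. The sketch below does this for the <code>opus47</code> row. The cache formula is an assumption inferred from the numbers (the share of input tokens not counted as effective input); the report does not define it. The function and key names are illustrative only.</p>
<pre>
```python
# Values copied from the opus47 row of the overall table.
row = {
    "wall_s": 872.9,
    "total_tokens": 2_041_367,
    "input_tokens": 1_980_822,
    "output_tokens": 60_545,
    "effective_input": 228_388,
}


def derived_metrics(r):
    """Recompute the rate and cache columns from the raw counts."""
    return {
        "tok_per_s": r["total_tokens"] / r["wall_s"],
        "out_tok_per_s": r["output_tokens"] / r["wall_s"],
        # Assumed definition: share of input tokens not billed as effective input.
        "cache_pct": 100 * (1 - r["effective_input"] / r["input_tokens"]),
    }


metrics = derived_metrics(row)
print({name: round(value, 1) for name, value in metrics.items()})
```
</pre>
<p class="note">The recomputed values agree with the table to within about 0.1, which is consistent with the published rates having been derived from unrounded wall times.</p>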
<h2>Relative to plain opus48</h2>
<table><thead><tr><th>model</th><th>wall-time ratio</th><th>token ratio</th><th>output-token ratio</th><th>quality delta</th><th>notes</th></tr></thead><tbody>
<tr><td>opus47</td><td>0.89×</td><td>0.87×</td><td>0.90×</td><td>+0.0</td><td>clean</td></tr>
<tr><td>opus46</td><td>1.02×</td><td>0.79×</td><td>0.88×</td><td>-1.2</td><td>2 det failures</td></tr>
<tr><td>opus48</td><td>1.00×</td><td>1.00×</td><td>1.00×</td><td>+0.0</td><td>clean</td></tr>
<tr><td>opus?task_budget=50000</td><td>0.33×</td><td>0.22×</td><td>0.40×</td><td>-13.0</td><td>4/5 gen, 4 det failures, 1F/1W VLM</td></tr>
<tr><td>opus?task_budget=200000</td><td>1.21×</td><td>1.56×</td><td>1.43×</td><td>-2.6</td><td>4 det failures, 0F/2W VLM</td></tr>
</tbody></table>
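<p class="note">The ratios in this table follow from dividing each row by the <code>opus48</code> baseline. A minimal sketch, using figures copied from the overall table; the helper name <code>relative_to</code> is hypothetical.</p>
<pre>
```python
# (wall seconds, total tokens, output tokens, quality) copied from the overall table.
rows = {
    "opus47": (872.9, 2_041_367, 60_545, 100.0),
    "opus46": (997.5, 1_851_922, 58_899, 98.8),
    "opus48": (979.4, 2_354_120, 67_209, 100.0),
    "opus?task_budget=50000": (325.2, 516_007, 27_099, 87.0),
    "opus?task_budget=200000": (1189.6, 3_680_965, 96_188, 97.4),
}


def relative_to(baseline, name):
    """Return (wall ratio, token ratio, output ratio, quality delta) vs the baseline."""
    base_wall, base_tokens, base_output, base_quality = rows[baseline]
    wall, tokens, output, quality = rows[name]
    return (
        round(wall / base_wall, 2),
        round(tokens / base_tokens, 2),
        round(output / base_output, 2),
        round(quality - base_quality, 1),
    )


for name in rows:
    print(name, relative_to("opus48", name))
```
</pre>
<p class="note">The headline percentages are the same ratios restated: a wall-time ratio near 0.33 corresponds to the 66.8% cut, and a token ratio near 1.56 to the +56.4% increase.</p>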
<h2>Per-artifact wall time and tokens</h2>
<h3>benchmark-comparison</h3>
<table><thead><tr><th>model</th><th>ok</th><th>quality</th><th>wall time</th><th>total tokens</th><th>input</th><th>output</th><th>effective input</th><th>tok/s</th><th>turns</th><th>tools</th><th>det</th><th>VLM</th></tr></thead><tbody>
<tr><td>opus47</td><td>True</td><td>100.0</td><td>150.0s</td><td>397,948</td><td>388,331</td><td>9,617</td><td>26,486</td><td>2652.2</td><td>19</td><td>22</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus46</td><td>True</td><td>100.0</td><td>272.0s</td><td>371,021</td><td>351,900</td><td>19,121</td><td>56,694</td><td>1364.3</td><td>14</td><td>18</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus48</td><td>True</td><td>100.0</td><td>258.3s</td><td>704,433</td><td>685,790</td><td>18,643</td><td>55,911</td><td>2727.1</td><td>21</td><td>26</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=50000</td><td>True</td><td>100.0</td><td>76.8s</td><td>111,775</td><td>105,163</td><td>6,612</td><td>20,498</td><td>1454.5</td><td>7</td><td>7</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=200000</td><td>True</td><td>100.0</td><td>281.1s</td><td>1,036,764</td><td>1,012,407</td><td>24,357</td><td>100,128</td><td>3688.1</td><td>22</td><td>28</td><td>0</td><td>0F/0W</td></tr>
</tbody></table>
<h3>code-review</h3>
<table><thead><tr><th>model</th><th>ok</th><th>quality</th><th>wall time</th><th>total tokens</th><th>input</th><th>output</th><th>effective input</th><th>tok/s</th><th>turns</th><th>tools</th><th>det</th><th>VLM</th></tr></thead><tbody>
<tr><td>opus47</td><td>True</td><td>100.0</td><td>268.4s</td><td>588,373</td><td>571,314</td><td>17,059</td><td>73,388</td><td>2192.5</td><td>14</td><td>18</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus46</td><td>True</td><td>100.0</td><td>237.0s</td><td>540,085</td><td>528,342</td><td>11,743</td><td>40,896</td><td>2278.4</td><td>17</td><td>29</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus48</td><td>True</td><td>100.0</td><td>197.0s</td><td>474,233</td><td>459,662</td><td>14,571</td><td>72,302</td><td>2406.7</td><td>12</td><td>15</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=50000</td><td>True</td><td>100.0</td><td>63.3s</td><td>109,587</td><td>104,544</td><td>5,043</td><td>56,128</td><td>1730.6</td><td>4</td><td>5</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=200000</td><td>True</td><td>87.0</td><td>176.7s</td><td>425,417</td><td>411,266</td><td>14,151</td><td>58,001</td><td>2407.0</td><td>11</td><td>13</td><td>4</td><td>0F/2W</td></tr>
</tbody></table>
<h3>implementation-plan</h3>
<table><thead><tr><th>model</th><th>ok</th><th>quality</th><th>wall time</th><th>total tokens</th><th>input</th><th>output</th><th>effective input</th><th>tok/s</th><th>turns</th><th>tools</th><th>det</th><th>VLM</th></tr></thead><tbody>
<tr><td>opus47</td><td>True</td><td>100.0</td><td>141.6s</td><td>215,600</td><td>206,186</td><td>9,414</td><td>22,107</td><td>1522.3</td><td>11</td><td>12</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus46</td><td>True</td><td>94.0</td><td>130.3s</td><td>167,161</td><td>159,833</td><td>7,328</td><td>22,835</td><td>1283.2</td><td>11</td><td>12</td><td>2</td><td>0F/0W</td></tr>
<tr><td>opus48</td><td>True</td><td>100.0</td><td>196.4s</td><td>264,333</td><td>252,260</td><td>12,073</td><td>39,929</td><td>1345.9</td><td>12</td><td>13</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=50000</td><td>True</td><td>100.0</td><td>62.2s</td><td>111,821</td><td>106,572</td><td>5,249</td><td>22,221</td><td>1797.7</td><td>7</td><td>7</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=200000</td><td>True</td><td>100.0</td><td>132.8s</td><td>343,763</td><td>332,156</td><td>11,607</td><td>42,016</td><td>2589.2</td><td>16</td><td>17</td><td>0</td><td>0F/0W</td></tr>
</tbody></table>
<h3>module-explainer</h3>
<table><thead><tr><th>model</th><th>ok</th><th>quality</th><th>wall time</th><th>total tokens</th><th>input</th><th>output</th><th>effective input</th><th>tok/s</th><th>turns</th><th>tools</th><th>det</th><th>VLM</th></tr></thead><tbody>
<tr><td>opus47</td><td>True</td><td>100.0</td><td>206.7s</td><td>669,243</td><td>653,611</td><td>15,632</td><td>85,438</td><td>3237.0</td><td>13</td><td>19</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus46</td><td>True</td><td>100.0</td><td>192.8s</td><td>417,791</td><td>406,724</td><td>11,067</td><td>44,687</td><td>2167.1</td><td>11</td><td>18</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus48</td><td>True</td><td>100.0</td><td>218.6s</td><td>633,137</td><td>618,129</td><td>15,008</td><td>72,109</td><td>2896.4</td><td>12</td><td>21</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=50000</td><td>False</td><td>35.0</td><td>56.1s</td><td>87,378</td><td>82,544</td><td>4,834</td><td>68,845</td><td>1558.1</td><td>3</td><td>3</td><td>4</td><td>1F/1W</td></tr>
<tr><td>opus?task_budget=200000</td><td>True</td><td>100.0</td><td>460.5s</td><td>1,534,617</td><td>1,500,017</td><td>34,600</td><td>84,706</td><td>3332.5</td><td>23</td><td>30</td><td>0</td><td>0F/0W</td></tr>
</tbody></table>
<h3>numeric-data</h3>
<table><thead><tr><th>model</th><th>ok</th><th>quality</th><th>wall time</th><th>total tokens</th><th>input</th><th>output</th><th>effective input</th><th>tok/s</th><th>turns</th><th>tools</th><th>det</th><th>VLM</th></tr></thead><tbody>
<tr><td>opus47</td><td>True</td><td>100.0</td><td>106.1s</td><td>170,203</td><td>161,380</td><td>8,823</td><td>20,969</td><td>1604.4</td><td>10</td><td>12</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus46</td><td>True</td><td>100.0</td><td>165.4s</td><td>355,864</td><td>346,224</td><td>9,640</td><td>26,534</td><td>2150.9</td><td>20</td><td>21</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus48</td><td>True</td><td>100.0</td><td>109.0s</td><td>277,984</td><td>271,070</td><td>6,914</td><td>27,724</td><td>2549.2</td><td>14</td><td>16</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=50000</td><td>True</td><td>100.0</td><td>66.8s</td><td>95,446</td><td>90,085</td><td>5,361</td><td>16,591</td><td>1429.6</td><td>7</td><td>7</td><td>0</td><td>0F/0W</td></tr>
<tr><td>opus?task_budget=200000</td><td>True</td><td>100.0</td><td>138.5s</td><td>340,404</td><td>328,931</td><td>11,473</td><td>26,642</td><td>2457.6</td><td>16</td><td>17</td><td>0</td><td>0F/0W</td></tr>
</tbody></table>
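<p class="note">The per-artifact rows can be rolled up to cross-check the suite-level figures. The sketch below does this for <code>opus?task_budget=50000</code>, assuming suite quality is an unweighted mean of the per-artifact scores and suite wall time is their sum. The numbers are consistent with that reading, but the report does not state the aggregation rule.</p>
<pre>
```python
from statistics import mean

# (quality, wall seconds) per artifact for opus?task_budget=50000, in table order:
# benchmark-comparison, code-review, implementation-plan, module-explainer, numeric-data.
artifacts = [
    (100.0, 76.8),
    (100.0, 63.3),
    (100.0, 62.2),
    (35.0, 56.1),  # the one failed generation
    (100.0, 66.8),
]

# Assumed aggregation: unweighted mean for quality, plain sum for wall time.
suite_quality = mean(quality for quality, _ in artifacts)
suite_wall = sum(wall for _, wall in artifacts)

print(round(suite_quality, 1), round(suite_wall, 1))
```
</pre>
<p class="note">Under these assumptions the single 35.0 score on module-explainer accounts for the entire 13-point quality drop, since the other four artifacts scored 100.0.</p>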
<p>Downloads: <a href="opus-performance-summary.csv">summary CSV</a>, <a href="opus-performance-by-artifact.csv">by-artifact CSV</a>, <a href="opus-performance-deep-dive.md">markdown</a>.</p>
