Evaluating phi-1.5 on GSM8K

#36

by hacky - opened Sep 28, 2023

Sep 28, 2023

According to the technical report, phi-1.5 reaches a 40.2% pass rate on GSM8K via coding. Has anyone managed to replicate this result? It would be great if you could share your evaluation script somewhere.

wanglamao

Oct 16, 2023

I ran the gsm8k_yaml task from EleutherAI/lm-evaluation-harness and got acc=0.3055.

The evaluation didn't work out of the box, so I made two modifications:

- enable the get_answer filter in the yaml file
- use gold_alias instead of doc_to_target

gugarosa

Microsoft org Oct 30, 2023

Hello @hacky !

I am not able to share the full GSM8K evaluation because it depends on some internal imports, but this snippet might help you evaluate via code execution:

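import contextlib
import io
import signal
from typing import Any


def _timeout_handler(signum: int, frame: Any) -> None:
    # Called on SIGALRM when the generated code exceeds the time limit.
    raise TimeoutError("generated code timed out")


def _validate_completion(completion: str, label: str) -> bool:
    # Keep the text after the "TA:" marker; drop its first line if it contains a colon.
    completion_lines = completion.split("TA:")[1].strip().split("\n")
    completion_code = "\n".join(completion_lines[1:] if ":" in completion_lines[0] else completion_lines)

    try:
        # SIGALRM is only available on Unix and only in the main thread.
        signal.signal(signal.SIGALRM, _timeout_handler)
        signal.alarm(2)

        try:
            # Capture anything the generated code prints.
            stdout = io.StringIO()
            with contextlib.redirect_stdout(stdout):
                # The generated code is expected to assign its answer to `result`.
                exec(
                    "import math\nfrom math import *\nimport numpy as np\nimport hashlib\n"
                    + completion_code
                    + "\n\n"
                    + "if type(result) == str:\n\tresult = result.replace(',', '')\n"
                    + f"assert(int(result) == {label})",
                    {},
                )
            prediction = True
        except Exception:
            # Any error, failed assertion, or timeout counts as a wrong answer.
            prediction = False
        finally:
            # Always cancel the pending alarm.
            signal.alarm(0)

    except Exception:
        prediction = False

    return prediction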

The overall idea is to execute the code generated by the model and assert that the value it assigns to result equals the ground-truth label. We also prepend a few common imports (math, numpy, hashlib) so that answers relying on them do not fail.
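
For anyone who wants to turn the idea described above into a pass rate, here is a minimal sketch. It is not the script used for the technical report: the names run_and_check, pass_rate and samples are hypothetical, the sample programs are made up for illustration, and the timeout handling is left out to keep it short. It assumes, as in the snippet, that the generated code stores its answer in a variable called result.

```python
from typing import Any, Dict, List, Tuple


def run_and_check(code: str, label: int) -> bool:
    """Execute generated code and compare its `result` variable to the label."""
    namespace: Dict[str, Any] = {}
    try:
        # Untrusted code: only run this inside a sandbox or container.
        exec(code, namespace)
        result = namespace["result"]
        # Strip thousands separators from string answers such as "1,000".
        if isinstance(result, str):
            result = result.replace(",", "")
        return int(result) == label
    except Exception:
        # Runtime errors and a missing `result` both count as failures.
        return False


def pass_rate(samples: List[Tuple[str, int]]) -> float:
    """Fraction of (code, label) pairs whose code produces the label."""
    if not samples:
        return 0.0
    return sum(run_and_check(code, label) for code, label in samples) / len(samples)


# Hypothetical completions paired with their ground-truth labels.
samples = [
    ("eggs = 16 - 3 - 4\nresult = eggs * 2", 18),  # correct
    ("result = '1,000'", 1000),                    # correct after removing the comma
    ("result = 1 / 0", 5),                         # raises ZeroDivisionError
    ("result = 7", 8),                             # wrong answer
]
rate = pass_rate(samples)
print(f"pass rate: {rate:.2%}")
```

Note that int(result) truncates floats, so a result of 18.7 would match a label of 18, which is the same behavior as the assert in the snippet.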

gugarosa changed discussion status to closed Nov 13, 2023
