Best way to fine-tune for structured final output instead of plain text final output?

#20
by vncntrsnlt - opened

What would be the best way to fine-tune FunctionGemma into generating a tool call in the final step instead of generating plain text, given that it already has a way to provide tool definitions? In the sense that I would like to give it a second set of tools to use in the last generation step instead of having it generate just text, and I don't want to "reuse" its tool definition parsing tokens and ability instead of providing a generic JSON schema... Or does this even make sense? Thanks!

Hey @vncntrsnlt,

FunctionGemma's tool calling isn't implemented as a modular parser; it's learned autoregressive behaviour conditioned on the exact tool-definition format used during instruction tuning. There's no separate 'tool parsing ability' you can reuse independently of that format. If you introduce a generic JSON schema or change the structure, you are moving out of distribution, which is especially brittle at this scale.

If you want the model to emit a tool call in the final step instead of plain text, you can try supervised fine-tuning with examples where the final assistant turn is a tool invocation. In practice, you are shifting the learned prior so that P(tool_call | final_turn) > P(text | final_turn).
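For illustration, here is a minimal sketch of what such a training record could look like, assuming a generic messages/tool_calls layout. The field names and the `submit_order` tool are hypothetical; real records must follow FunctionGemma's own chat and tool-call template exactly.

```python
import json

# Hypothetical training record: the conversation ends with a tool call,
# not with free-form text. Field names are illustrative only and must be
# adapted to FunctionGemma's actual chat/tool-call template.
record = {
    "tools": [
        {
            "name": "submit_order",  # hypothetical final-step tool
            "description": "Submit the assembled order for processing.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }
    ],
    "messages": [
        {"role": "user", "content": "Everything looks good, go ahead."},
        # The final assistant turn is a tool invocation, not plain text:
        {
            "role": "assistant",
            "tool_calls": [
                {"name": "submit_order", "arguments": {"order_id": "A-1042"}}
            ],
        },
    ],
}

# Append records like this to a JSONL file for supervised fine-tuning.
with open("final_step_tool_calls.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```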

If you introduce a second set of tools for that last step, provide them in the same canonical format and include training examples where the model selects from that set and terminates with a tool call. Without fine-tuning, you are relying purely on prompt conditioning, which is unlikely to be stable for smaller models.

Also, only provide the tool definitions relevant to the current step. Don't expose a 270M model to future tools it shouldn't use yet; keep the prompt sparse and stage-specific. Smaller models are very sensitive to context clutter, and this helps improve tool selection reliability.
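A per-stage registry like the following is one way to keep the prompt sparse. The tool names and stage labels here are made up; the definitions should be in whatever canonical format you are already using:

```python
# Assumed tool definitions, in the same canonical format FunctionGemma expects.
SEARCH_TOOL = {"name": "search_products", "description": "Search the catalog.", "parameters": {}}
SUBMIT_TOOL = {"name": "submit_order", "description": "Submit the final order.", "parameters": {}}

# Only the tools for the current stage are rendered into the prompt,
# keeping the context sparse and stage-specific for a small model.
STAGE_TOOLS = {
    "gather": [SEARCH_TOOL],    # first-round tools
    "finalize": [SUBMIT_TOOL],  # second set, used only in the last generation step
}

def tools_for_stage(stage: str) -> list[dict]:
    """Return only the tool definitions relevant to the current step."""
    return STAGE_TOOLS[stage]

# Build the last-step prompt with just the final-step tools:
final_step_tools = tools_for_stage("finalize")
```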

If retraining isn't feasible for you, 'constrained decoding' or 'logit biasing' toward tool call tokens can help, but that enforces structure rather than actually teaching the correct behaviour.
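As a rough sketch of the logit-biasing idea with Hugging Face transformers, you can write a custom LogitsProcessor that bumps the logits of the tokens that open a tool call. The `<tool_call>` token name and the bias value below are placeholders you'd need to adapt to the real tokenizer:

```python
import torch
from transformers import LogitsProcessor, LogitsProcessorList

class ToolCallBias(LogitsProcessor):
    """Adds a fixed positive bias to the logits of the tokens that open a
    tool call, nudging generation toward structured output at the final step.
    The token ids are placeholders; look them up from the actual tokenizer."""

    def __init__(self, tool_open_token_ids: list[int], bias: float = 4.0):
        self.token_ids = tool_open_token_ids
        self.bias = bias

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.token_ids] += self.bias
        return scores

# Hypothetical usage with a loaded model/tokenizer ("<tool_call>" is illustrative):
# tool_open_ids = tokenizer.convert_tokens_to_ids(["<tool_call>"])
# processors = LogitsProcessorList([ToolCallBias(tool_open_ids)])
# model.generate(**inputs, logits_processor=processors)
```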

Thank you!

Hi @srikanta-221!

Thanks for your answer. I understand how FunctionGemma's tool calling works, and that's what I meant by tool definition parsing. And I meant "I want to reuse" rather than "I don't want to reuse", in the sense that I want the fine-tuning for structured final output to "reuse" the tool-calling ability from the first round, as opposed to messing it up with catastrophic forgetting.

And of course, I'm expecting this "shifting of the learned prior" towards a tool-calling final turn instead of plain-text output to require further training on top of the base FunctionGemma model.

I will try to build a synthetic dataset out of existing function-calling datasets so the model can learn and generalize this last-turn tool-calling paradigm shift.
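Roughly along these lines, assuming the source conversations use a generic `tool_calls` field on assistant turns (which will need adapting to whichever datasets I end up using):

```python
def truncate_at_last_tool_call(messages: list[dict]) -> list[dict] | None:
    """Cut a conversation so that its last turn is an assistant tool call.

    Returns None if the conversation never issues a tool call, so callers
    can drop it from the synthetic dataset. The 'tool_calls' field name is
    an assumption about the source format, not any specific dataset.
    """
    for i in range(len(messages) - 1, -1, -1):
        msg = messages[i]
        if msg.get("role") == "assistant" and msg.get("tool_calls"):
            return messages[: i + 1]
    return None
```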

I might get back to you here for some more help or comments.

Thanks.
