Fine-tuning the model in the case of discontinuous tokens

#7 opened by pararthdave

I have been searching for ways to fine-tune the model for discontinuous tokens. The existing transformer-based pipeline only supports a single start and a single end position, which works for continuous sequences of tokens, but for entities like addresses (as shown in the model card) this logic does not suffice. Any lead on how to train and run inference in such a scenario would be much appreciated.
Also, I've been trying this model with the same example shown in the model card, but it still extracts only one line of the address. Is this related to the prompt I am supplying, or to something else?
Attaching a screenshot of the inference for reference.


Are you able to fine-tune it?


For continuous tokens with a single start and end, yes.

Can you send me any scripts or resources for fine-tuning?


You can have a look at the example given with the LayoutLMv1 question-answering model on Hugging Face to understand the input it's supposed to accept. Let me know if there's any issue in particular that you're facing.
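In case it helps, here is a minimal sketch of what a single encoded training example can look like for LayoutLMv1-style extractive QA. It is only an illustration, not an official script: it assumes word-level OCR output with one bounding box per word (already normalized to the 0-1000 range LayoutLM expects), and the checkpoint name, example words/boxes, and variable names are placeholders.

```python
import torch
from transformers import LayoutLMTokenizerFast, LayoutLMForQuestionAnswering

# Base checkpoint used for illustration; swap in the checkpoint you are fine-tuning.
tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForQuestionAnswering.from_pretrained("microsoft/layoutlm-base-uncased")

question = "What is the invoice number?"
words = ["Invoice", "Number:", "INV-1234", "Date:", "01/02/2021"]   # word-level OCR output
boxes = [[50, 40, 150, 60], [155, 40, 230, 60], [235, 40, 330, 60],
         [50, 70, 110, 90], [115, 70, 220, 90]]                     # one box per word, 0-1000 scale

# Encode question + document words as one sequence pair, keeping word alignment.
encoding = tokenizer(
    [question.split()], [words],
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
)

# LayoutLM also needs one bounding box per token: [0, 0, 0, 0] for special and
# question tokens, and the word's box for every sub-token of a document word.
word_ids = encoding.word_ids(0)
sequence_ids = encoding.sequence_ids(0)
token_boxes = [
    boxes[w] if s == 1 and w is not None else [0, 0, 0, 0]
    for w, s in zip(word_ids, sequence_ids)
]
encoding["bbox"] = torch.tensor([token_boxes])

# With token-level answer positions (see the word_ids mapping further down),
# the training forward pass would look like:
# outputs = model(**encoding, start_positions=..., end_positions=...)
# loss = outputs.loss
```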

I am just stuck on how to encode the data for training, i.e., the start and end positions for answers.

Hey guys, I am looking to fine-tune this model as well. Please share any resources on that.


@TusharGoel you can use the word_ids from the tokenizer output and map them to your existing start and end positions. Once that flow is streamlined, you can supply the data for training directly.
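To make that concrete, here is a minimal sketch of the word_ids mapping, building on the `encoding`, `words`, and `model` from the sketch above. It assumes the answer is given as word-level indices into the OCR word list; the function name and the fall-back-to-[CLS] convention are illustrative choices, not the only way to do it.

```python
def word_span_to_token_span(encoding, answer_word_start, answer_word_end, batch_index=0):
    """Map a word-level answer span to token-level start/end positions."""
    word_ids = encoding.word_ids(batch_index)
    sequence_ids = encoding.sequence_ids(batch_index)

    start_position, end_position = None, None
    for token_index, (word_id, seq_id) in enumerate(zip(word_ids, sequence_ids)):
        # Skip question tokens (sequence 0) and special tokens (word_id is None).
        if seq_id != 1 or word_id is None:
            continue
        if word_id == answer_word_start and start_position is None:
            start_position = token_index      # first sub-token of the first answer word
        if word_id == answer_word_end:
            end_position = token_index        # last sub-token of the last answer word

    # If the answer got truncated away, a common convention is to point both
    # positions at the [CLS] token (index 0) so the example still has a label.
    if start_position is None or end_position is None:
        start_position, end_position = 0, 0
    return start_position, end_position

# Example: the answer "INV-1234" is words[2], i.e. the word span (2, 2).
start, end = word_span_to_token_span(encoding, 2, 2)
outputs = model(**encoding,
                start_positions=torch.tensor([start]),
                end_positions=torch.tensor([end]))
loss = outputs.loss   # standard extractive-QA loss over start/end logits
```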

Hey @TusharGoel @pararthdave, can you suggest which annotation tool you used? I have tried butlerUI, but I think that's not scalable.

@HIMANSHUSHAKYAWAR We have a dedicated tagging team in the organization with an in-house tagging platform. I've previously used Label Studio to tag datasets in COCO format. Maybe check that out.
