Direction:
A list of studies that use prompt-based answer evaluation (chain-of-thought, self-reflection) vs. sampling + a reward model
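The "sampling + reward model" side of the comparison can be sketched as best-of-N selection. This is a minimal toy sketch: `generate` and `reward` are hypothetical stand-ins for an LLM sampler and a learned scalar reward model, not real APIs.

```python
import random

def generate(prompt, n, rng):
    # placeholder sampler: stands in for n temperature-sampled LLM completions
    return [f"{prompt} candidate {rng.randint(0, 9999)}" for _ in range(n)]

def reward(response):
    # placeholder reward model: stands in for a learned scalar scorer
    return sum(int(ch) for ch in response if ch.isdigit())

def best_of_n(prompt, n=8, seed=0):
    # sampling + reward model: draw n candidates, keep the highest-scoring one
    rng = random.Random(seed)
    candidates = generate(prompt, n, rng)
    return max(candidates, key=reward)
```

The prompt-based alternative replaces `reward` with the model critiquing its own answer (self-reflection), which is exactly the contrast the studies above should quantify.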
TODO:
Compile work on prompt-based refinement of responses, so we can compare it concretely with our method
Compile the datasets and results
Try to find another model from the data:
https://huggingface.co/datasets?other=human-feedback
https://huggingface.co/datasets/Anthropic/hh-rlhf
https://huggingface.co/datasets/Dahoas/full-hh-rlhf
https://huggingface.co/datasets/nvidia/HelpSteer2
TODO:
Find specific studies and datasets that compare performance on such tasks, so we can test our own models (we also need the datasets)
LLM to Prolog
https://arxiv.org/pdf/2405.17893
https://huggingface.co/datasets/Thomas-X-Yang/gsm8k-prolog?row=6
LLM to SQL https://github.com/defog-ai/sqlcoder/
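For the LLM-to-SQL direction, model-generated queries can be scored by simply executing them against a scratch database; the sketch below uses Python's stdlib `sqlite3` with an in-memory DB (the function name and setup-statement interface are my own, for illustration).

```python
import sqlite3

def execute_candidate_sql(sql, setup_statements):
    """Run model-generated SQL in a scratch in-memory DB.

    Returns the fetched rows, or None if the SQL fails to execute,
    which can serve as a cheap binary signal for ranking candidates.
    """
    conn = sqlite3.connect(":memory:")
    try:
        for stmt in setup_statements:
            conn.execute(stmt)  # build the schema/fixture the query expects
        return conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    finally:
        conn.close()
```

Comparing the returned rows against gold rows (execution accuracy) is the usual metric in text-to-SQL benchmarks.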
What could we do to improve the result? Named-entity recognition:
https://huggingface.co/dslim/bert-base-NER
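One way NER could feed into scoring: check whether entities mentioned in the prompt survive into the response. The sketch below is a crude stdlib stand-in (capitalized-token regex instead of `dslim/bert-base-NER`); function names are hypothetical.

```python
import re

def naive_entities(text):
    # crude stand-in for a real NER model: treat capitalized tokens as entities
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))

def entity_overlap_score(prompt, response):
    # fraction of prompt "entities" that reappear in the response
    ents = naive_entities(prompt)
    if not ents:
        return 1.0  # nothing to preserve
    return len(ents & naive_entities(response)) / len(ents)
```

In a real pipeline the regex would be replaced by the BERT NER tagger, and the overlap score could be combined with the reward model's scalar.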
Llama Guard is an LLM-based safeguard model that can classify the safety risks in LLM prompts and responses. It demonstrates strong performance on existing benchmarks like the OpenAI Moderation Evaluation dataset and ToxicChat, matching or exceeding current content moderation tools
https://dl.acm.org/doi/abs/10.1145/3640544.3645216
https://github.com/gabrielmittag/NISQA
https://www.youtube.com/@YannicKilcher
Anthropic's HH-RLHF
https://github.com/RLHFlow/RLHF-Reward-Modeling
RRHF (Rank Responses to Align Language Models with Human Feedback)
https://arxiv.org/html/2312.07592v1
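RRHF's core idea is a pairwise ranking loss over the model's (length-normalized) log-probabilities of candidate responses, ordered by their reward scores. A minimal sketch of that loss, assuming `logprobs` and `rewards` are parallel lists per candidate:

```python
def rrhf_rank_loss(logprobs, rewards):
    # RRHF-style sketch: hinge penalty whenever a higher-reward response
    # gets a lower model log-probability than a lower-reward one
    loss = 0.0
    for i in range(len(rewards)):
        for j in range(len(rewards)):
            if rewards[i] > rewards[j]:
                loss += max(0.0, logprobs[j] - logprobs[i])
    return loss
```

In the paper this ranking term is combined with a supervised fine-tuning loss on the best-ranked response; only the ranking term is sketched here.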
RAIN (Rewindable Auto-regressive INference) is another inference method that allows pre-trained LLMs to self-evaluate their own generation and use the evaluation to guide generation rewinding for improved AI safety
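The control flow behind RAIN can be caricatured as "generate, self-evaluate, rewind on failure". This toy sketch only captures that loop at the whole-response level; the actual method rewinds at the token level with a search procedure, and both callback names here are hypothetical.

```python
def rewind_generate(generate, self_evaluate, max_rewinds=3):
    """Sketch of rewindable inference: regenerate until self-evaluation passes."""
    response = None
    for attempt in range(max_rewinds + 1):
        response = generate(attempt)      # hypothetical LLM call
        if self_evaluate(response):       # hypothetical self-judgment
            return response
    return response  # fall back to the last attempt if all candidates fail
```

Usage: `rewind_generate(sample_fn, safety_check_fn)` returns the first response the model judges acceptable.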
ToRA (Gou et al., 2023) or RFT (Yuan et al., 2023)