SLR https://www.notion.so/evalds/SLR-55bb08969c3844bfa783269a0aaae301
Questions:
How is the quality of CommonSense answers evaluated?
Is there a description or code showing how the reward / fitness functions are built? Search the paper or the code?
Which datasets recur most often?
Which tasks recur most often in CommonSense and Math?
For every result, record its N-shot setting; 0-shot results are the most important.
How can accuracy be checked for common-sense tasks?
Add total citation counts, to assess each work's significance?
Why are text embeddings, word embeddings, or ranking models not used for scoring?
Research the criticism of BLEU and why it is not recommended: https://www.perplexity.ai/search/bleu-score-criticism-6KgU040bQPeSQADLHzod2w
https://www.perplexity.ai/search/explain-promptbreeder-to-deter-Ro00Oki3SHm77oRPMDkGwg
Some common metrics that could be used to compare LLM answers to the ground truth include:
Exact match accuracy: The percentage of LLM answers that match the ground truth answers word-for-word.
BLEU score: A metric that measures the similarity between the LLM answer and the ground truth based on n-gram overlap.
ROUGE score: A set of metrics that measure the quality of the LLM answer by comparing it to the ground truth based on n-gram recall and precision.
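The three metrics above can be sketched in minimal form. This is a simplified illustration, not a faithful implementation: real BLEU adds a brevity penalty and a geometric mean over n = 1..4, and in practice libraries such as sacrebleu or rouge-score would be used.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> float:
    """1.0 if the prediction matches the reference word-for-word, else 0.0."""
    return float(pred.strip().lower() == ref.strip().lower())

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(pred: str, ref: str, n: int = 2) -> float:
    """Clipped n-gram precision — the core quantity behind BLEU."""
    pc = ngrams(pred.lower().split(), n)
    rc = ngrams(ref.lower().split(), n)
    total = sum(pc.values())
    if total == 0:
        return 0.0
    overlap = sum(min(c, rc[g]) for g, c in pc.items())
    return overlap / total

def rouge_n(pred: str, ref: str, n: int = 1):
    """ROUGE-N recall and precision based on n-gram overlap."""
    pc = ngrams(pred.lower().split(), n)
    rc = ngrams(ref.lower().split(), n)
    overlap = sum(min(c, pc[g]) for g, c in rc.items())
    recall = overlap / max(sum(rc.values()), 1)
    precision = overlap / max(sum(pc.values()), 1)
    return recall, precision

print(exact_match("the cat sat", "the cat sat"))                        # 1.0
print(ngram_precision("the cat sat on the mat", "the cat is on the mat"))  # 0.6
print(rouge_n("the cat sat", "the cat sat on the mat"))                 # (0.5, 1.0)
```

Note how ROUGE reports both recall (how much of the reference is covered) and precision (how much of the answer is supported), which matters when LLM answers are much longer or shorter than the ground truth.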