Description of task
Implement a Python script that tests the current configuration of ControllerChatbot.py
against a ground-truth question-answer dataset. The script outputs a single floating-point score from 0 to 100 that indicates how good the configuration is, i.e. how closely the generated answers match the ground-truth answers.
Approximate desired output: http://share.yellowrobot.xyz/quick/2024-1-29-BB88CA4B-0176-4A21-9BFB-1E3226260B41.zip
Input
Path to the Excel document.
The name of the Excel sheet (table) to test is passed to the Python script; the results are then appended to the workbook as a new sheet.
The table/sheet must have the following columns:
test_id
input_is_followup_question (True or False; if True, the question is a follow-up to the previous question)
input_question
input_answer (Ground truth)
input_reference_doc_filename (Ground truth filename)
input_reference_doc_page (Ground truth page)
input_language (input language “lv”)
input_answer_length (int)
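The column requirements above could be checked before any test is run. A minimal sketch, assuming the header row of the sheet has already been read into a list of strings (for example with pandas or openpyxl); the names REQUIRED_INPUT_COLUMNS and missing_input_columns are hypothetical and not part of the base code:

```python
# Required input columns, taken from the list above.
REQUIRED_INPUT_COLUMNS = [
    "test_id",
    "input_is_followup_question",
    "input_question",
    "input_answer",
    "input_reference_doc_filename",
    "input_reference_doc_page",
    "input_language",
    "input_answer_length",
]


def missing_input_columns(header):
    """Return the required input columns that are absent from a header row.

    Assumption: `header` is an iterable of cell values from the first row
    of the sheet. Surrounding whitespace in a header cell is ignored.
    """
    present = {str(name).strip() for name in header}
    return [col for col in REQUIRED_INPUT_COLUMNS if col not in present]
```

The script can stop with a clear error message if the returned list is not empty.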
Output
Each run of the script must add a new Excel sheet using the same template.
Add content to the following table columns:
output_answer (Generated answer for output)
output_reference_doc_filename (Facts reference)
output_reference_doc_page (Facts reference)
metric_is_answered (False if the response is “nevaru atbildēt”, i.e. "cannot answer"; the code needs to be modified for this)
metric_is_match_reference_doc_filename
metric_is_match_reference_doc_filename_page
metric_answer_length
metric_answer_length_diff
metric_lavenshtein_distance (Levenshtein distance, computed on lowercased text)
metric_semantic_distance (using same model combination as in base code)
metric_is_answers_match (GPT-4 zero-shot classification via prompt engineering)
result_is_correct (could be derived from metric_is_answers_match and some other metrics)
result_score (calculated using a formula you should invent that takes all metrics into account and outputs a float in the range 0-100)
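Some of the simpler metrics in the list above can be computed without any external library. A minimal sketch under stated assumptions: answer length is measured in characters, the refusal phrase is matched as a case-insensitive substring, and the function names are hypothetical (they only mirror the column names):

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between lowercased strings."""
    a, b = a.lower(), b.lower()
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def is_answered(output_answer: str, refusal: str = "nevaru atbildēt") -> bool:
    """False if the generated answer contains the refusal phrase (assumption:
    substring match on lowercased text)."""
    return refusal not in output_answer.lower()


def answer_length_diff(input_answer: str, output_answer: str) -> int:
    """Absolute difference in character counts (assumption: length in characters)."""
    return abs(len(output_answer) - len(input_answer))
```

The raw distance grows with answer length, so it may need to be normalised (for example divided by the length of the longer answer) before it enters the final formula.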
Also add metrics for Term Frequency (TF) between the input and output answers. You are welcome to add more metrics and incorporate them into the final formula. Text-2-Text ranking models for comparing answers also need to be added.
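One possible TF metric is the cosine similarity between the term-frequency vectors of the two answers. This is only a sketch of one option, not a required design: the tokenisation (lowercased word characters) and the function names are assumptions.

```python
import math
import re
from collections import Counter


def term_frequencies(text: str) -> Counter:
    """Count lowercased word tokens; \\w also matches Latvian letters in Python 3."""
    return Counter(re.findall(r"\w+", text.lower()))


def tf_cosine_similarity(input_answer: str, output_answer: str) -> float:
    """Cosine similarity of raw term-frequency vectors, in the range 0.0-1.0.

    Returns 0.0 when either answer has no tokens.
    """
    tf_in = term_frequencies(input_answer)
    tf_out = term_frequencies(output_answer)
    # Only terms present in both answers contribute to the dot product.
    dot = sum(tf_in[term] * tf_out[term] for term in tf_in.keys() & tf_out.keys())
    norm_in = math.sqrt(sum(count * count for count in tf_in.values()))
    norm_out = math.sqrt(sum(count * count for count in tf_out.values()))
    if norm_in == 0 or norm_out == 0:
        return 0.0
    return dot / (norm_in * norm_out)
```

Because the result is already in the range 0-1 with 1 as the best value, it can be used directly as a component of the final formula.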
TODOS
Implement in a new Git branch “validation_script”
Manually convert the Elektrum question-answer pairs into an Excel sheet
Raw question-answer pairs: http://share.yellowrobot.xyz/quick/2024-1-29-DF723B02-7A32-4011-AC8C-10704A92ED47.zip
Prepare Excel template
Implement the validation script that automates testing according to the requirements above
Conduct experiments to fine-tune the final formula so that it captures all aspects of an answer matching the ground truth. (To test this you do not even need to interact with the chatbot; just manually change the values in Excel.)
Finally, organize the chatbot parameters so that they can be easily changed using RayTune etc. This is the last task.
Add more features, such as adversarial and out-of-domain testing
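For the formula experiments in the TODO list, a weighted average is one simple starting point. The sketch below is an assumption, not the required formula: the component names and weights are hypothetical placeholders, and every component is assumed to be scaled beforehand to the range 0-1 with 1 as the best value (distances would first have to be converted to similarities).

```python
# Hypothetical weights; these are placeholders to be tuned in the experiments.
HYPOTHETICAL_WEIGHTS = {
    "answers_match": 0.4,
    "semantic_similarity": 0.25,
    "reference_doc_filename_match": 0.15,
    "reference_doc_page_match": 0.1,
    "levenshtein_similarity": 0.1,
}


def result_score(components, weights=None) -> float:
    """Weighted average of metric components, scaled to 0-100.

    `components` maps a component name to a value in 0-1 (1 is best).
    Missing components count as 0; values outside 0-1 are clamped.
    """
    weights = weights or HYPOTHETICAL_WEIGHTS
    total_weight = sum(weights.values())
    weighted_sum = sum(
        weight * min(1.0, max(0.0, float(components.get(name, 0.0))))
        for name, weight in weights.items()
    )
    return 100.0 * weighted_sum / total_weight
```

Since the function only needs the metric values, the weights can be tried out on manually edited Excel rows without calling the chatbot.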