2024-01-29 Eldigen Validation

 

Description of task

Implement a Python script that tests the current configuration of ControllerChatbot.py against a ground-truth question-answer dataset. The script outputs a single floating-point score from 0 to 100 that indicates how good the configuration is, i.e. how closely the generated answers match the ground-truth answers.

Approximate desired output: http://share.yellowrobot.xyz/quick/2024-1-29-BB88CA4B-0176-4A21-9BFB-1E3226260B41.zip

 

Input

Path of the Excel document.

The name of the Excel sheet is passed to the Python script, and the resulting sheet is then appended to the workbook's sheets.

The table/sheet must have the following columns:

  1. test_id

  2. input_is_followup_question (True or False, if True, then followup to previous question)

  3. input_question

  4. input_answer (Ground truth)

  5. input_reference_doc_filename (Ground truth filename)

  6. input_reference_doc_page (Ground truth page)

  7. input_language (input language, e.g. “lv”)

  8. input_answer_length (int)
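The column requirements above can be checked before any question is sent to the chatbot. The following is a minimal sketch, assuming the header row and the data rows have already been read from the sheet into plain Python lists and dicts, and that `input_is_followup_question` has already been parsed into a real boolean. The helper names `missing_input_columns` and `group_into_conversations` are hypothetical and not part of the existing code base.

```python
# Column names taken from the list above.
REQUIRED_INPUT_COLUMNS = [
    "test_id",
    "input_is_followup_question",
    "input_question",
    "input_answer",
    "input_reference_doc_filename",
    "input_reference_doc_page",
    "input_language",
    "input_answer_length",
]


def missing_input_columns(header):
    """Return the required columns that are absent from a sheet header row."""
    present = {str(name).strip() for name in header if name is not None}
    return [col for col in REQUIRED_INPUT_COLUMNS if col not in present]


def group_into_conversations(rows):
    """Group rows so that each follow-up question joins the preceding question.

    Assumption: rows are in sheet order and the follow-up flag is a bool.
    """
    conversations = []
    for row in rows:
        if row["input_is_followup_question"] and conversations:
            conversations[-1].append(row)
        else:
            conversations.append([row])
    return conversations
```

Grouping by conversation matters because a follow-up question only makes sense if the chatbot still holds the context of the question before it.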

Output

Each run of the script must add a new Excel sheet using the same template.
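Because every run adds a sheet, each run needs a sheet name that does not clash with existing ones. One possible naming scheme is sketched below. The `run_` prefix and the timestamp format are assumptions, not requirements from this task; the 31-character limit and case-insensitive uniqueness are Excel's own rules for worksheet names. The helper `new_run_sheet_name` is hypothetical.

```python
from datetime import datetime

MAX_SHEET_NAME_LENGTH = 31  # Excel's limit on worksheet name length


def new_run_sheet_name(existing_names, prefix="run", now=None):
    """Build a sheet name for this run that is unique within the workbook."""
    now = now or datetime.now()
    base = f"{prefix}_{now:%Y%m%d_%H%M%S}"[:MAX_SHEET_NAME_LENGTH]
    # Excel compares sheet names case-insensitively.
    taken = {name.lower() for name in existing_names}
    name, counter = base, 1
    while name.lower() in taken:
        suffix = f"_{counter}"
        name = base[:MAX_SHEET_NAME_LENGTH - len(suffix)] + suffix
        counter += 1
    return name
```

The returned name can then be given to whichever Excel library copies the template sheet.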

Add content to table columns:

  1. output_answer (Generated answer)

  2. output_reference_doc_filename (Facts reference)

  3. output_reference_doc_page (Facts reference)

  4. metric_is_answered (False if the response is “nevaru atbildēt”, i.e. "cannot answer"; this requires modifying the code)

  5. metric_is_match_reference_doc_filename

  6. metric_is_match_reference_doc_filename_page

  7. metric_answer_length

  8. metric_answer_length_diff

  9. metric_levenshtein_distance (lowercase)

  10. metric_semantic_distance (using the same model combination as in the base code)

  11. metric_is_answers_match (GPT-4 prompt engineering for zero-shot classification)

  12. result_is_correct (could be derived from metric_is_answers_match and some other metrics)

  13. result_score (calculated using a formula you should invent that takes all metrics and outputs a float from 0 to 100)
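The simpler metric columns above need no model calls. A minimal sketch follows, under several assumptions that are not fixed by this task: answer length is counted in characters, the length difference is absolute, both answers are lowercased before the Levenshtein distance is computed, and a refusal is detected by searching for the phrase “nevaru atbildēt” in the generated answer. The function names are hypothetical.

```python
REFUSAL_PHRASE = "nevaru atbildēt"  # assumption: refusals contain this phrase


def levenshtein_distance(a, b):
    """Edit distance where insert, delete and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or keep
            ))
        previous = current
    return previous[-1]


def simple_metrics(input_answer, output_answer):
    """Compute the metric columns that need only the two answer strings."""
    truth = input_answer.lower()
    generated = output_answer.lower()
    return {
        "metric_is_answered": REFUSAL_PHRASE not in generated,
        "metric_answer_length": len(output_answer),
        "metric_answer_length_diff": abs(len(output_answer) - len(input_answer)),
        "metric_levenshtein_distance": levenshtein_distance(truth, generated),
    }
```

The raw edit distance grows with answer length, so it is usually divided by the length of the longer answer before it is combined with other metrics.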

Also add Term Frequency (TF) metrics comparing the input and output answers. You are welcome to add more metrics and incorporate them into the final formula. Also add text-to-text ranking models for comparing answers.
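One common way to turn term frequencies into a single number is the cosine similarity between the two TF vectors. The sketch below takes that approach as an assumption, since the task does not say which TF-based metric is wanted. Tokenization is a simple lowercase word split, with no stemming or stop-word removal, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase word tokens; \\w is Unicode-aware, so Latvian letters match."""
    return Counter(re.findall(r"\w+", text.lower()))


def tf_cosine_similarity(input_answer, output_answer):
    """Cosine similarity of the two term-frequency vectors, in the range 0..1."""
    tf_in = term_frequencies(input_answer)
    tf_out = term_frequencies(output_answer)
    dot = sum(tf_in[t] * tf_out[t] for t in tf_in.keys() & tf_out.keys())
    norm_in = math.sqrt(sum(v * v for v in tf_in.values()))
    norm_out = math.sqrt(sum(v * v for v in tf_out.values()))
    if norm_in == 0 or norm_out == 0:
        return 0.0  # at least one answer has no tokens
    return dot / (norm_in * norm_out)
```

A value of 1 means both answers use the same words in the same proportions, and 0 means they share no words. Word order is ignored, which is why this metric complements the Levenshtein distance rather than replacing it.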

 

TODOS

  1. Implement in new GIT branch “validation_script”

  2. Manually convert the Elektrum question-answer pairs into an Excel sheet

    Raw Question-Answers http://share.yellowrobot.xyz/quick/2024-1-29-DF723B02-7A32-4011-AC8C-10704A92ED47.zip

  3. Prepare Excel template

  4. Implement a validation script that automates testing according to the requirements above

  5. Conduct experiments to fine-tune the final formula so that it captures all aspects of how well an answer matches the ground truth. (To test this you do not even need to interact with the chatbot; just change the values in Excel manually.)
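As a starting point for these experiments, the final formula could be a weighted sum of normalized metrics. The sketch below is only one possible shape: the weights are placeholders to be tuned, `metric_semantic_distance` is assumed to lie between 0 and 1, lengths are assumed to be in characters, and an unanswered question is assumed to score 0. The names `WEIGHTS`, `row_score` and `configuration_score` are hypothetical.

```python
from statistics import mean

# Placeholder weights (sum to 1); tuning them is the point of the experiments.
WEIGHTS = {
    "answers_match": 0.5,
    "semantic": 0.2,
    "levenshtein": 0.1,
    "doc_filename": 0.1,
    "doc_page": 0.1,
}


def row_score(row):
    """Score one test row from its metric columns, as a float from 0 to 100."""
    if not row["metric_is_answered"]:
        return 0.0  # assumption: a refusal earns nothing
    longest = max(row["input_answer_length"], row["metric_answer_length"], 1)
    lev_similarity = 1.0 - min(row["metric_levenshtein_distance"] / longest, 1.0)
    # Assumption: semantic distance is already scaled to 0..1; clip to be safe.
    semantic_similarity = 1.0 - min(max(row["metric_semantic_distance"], 0.0), 1.0)
    parts = {
        "answers_match": float(row["metric_is_answers_match"]),
        "semantic": semantic_similarity,
        "levenshtein": lev_similarity,
        "doc_filename": float(row["metric_is_match_reference_doc_filename"]),
        "doc_page": float(row["metric_is_match_reference_doc_filename_page"]),
    }
    total = sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
    return round(100.0 * total, 2)


def configuration_score(rows):
    """Average the row scores into the single score for the configuration."""
    scores = [row_score(row) for row in rows]
    return mean(scores) if scores else 0.0
```

Because `row_score` reads only metric columns, the formula can be exercised on hand-edited Excel values without running the chatbot.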

  6. Finally, organize the chatbot parameters so that they can be changed easily using Ray Tune or a similar tool. This is the last task.

  7. Add more features, such as adversarial and out-of-domain testing