2023-10-16 PDF scraper Task

Have Ričards explain how the system has worked so far.

Use case:

Add a function parse_deal_file(deal_file_id) to ControllerPDF.

Parse all deals_files and collect the results in a separate worker, worker_deals_files.py, not in main.py.
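A minimal sketch of how the worker loop could look, assuming the parser is passed in as a callable. The names run_worker, deal_file_ids and the (results, failures) return shape are hypothetical; only parse_deal_file(deal_file_id) comes from the task itself.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_worker(
    deal_file_ids: Iterable[int],
    parse_deal_file: Callable[[int], List[str]],
) -> Tuple[Dict[int, List[str]], List[Tuple[int, str]]]:
    """Parse every deal file and keep going when a single file fails."""
    results: Dict[int, List[str]] = {}
    failures: List[Tuple[int, str]] = []
    for deal_file_id in deal_file_ids:
        try:
            results[deal_file_id] = parse_deal_file(deal_file_id)
        except Exception as exc:
            # One broken PDF should not stop the rest of the batch.
            failures.append((deal_file_id, str(exc)))
    return results, failures
```

Collecting failures separately makes it easy to see which PDFs need a different extraction approach.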

For PDF extraction, use this library or another one: https://pypi.org/project/PyPDF2/
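A rough sketch of page-by-page extraction with PyPDF2, assuming a recent release that provides PdfReader. The helper names clean_lines and extract_pages are hypothetical. PyPDF2 is imported lazily so the helper for cleaning lines can be used and tested without it.

```python
from typing import List


def clean_lines(raw_text: str) -> List[str]:
    """Strip whitespace and drop empty lines, keeping the original order."""
    return [line.strip() for line in raw_text.splitlines() if line.strip()]


def extract_pages(pdf_path: str) -> List[List[str]]:
    """Return one list of text lines per PDF page."""
    # Lazy import: PyPDF2 is assumed to be installed where this runs.
    from PyPDF2 import PdfReader

    reader = PdfReader(pdf_path)
    pages: List[List[str]] = []
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. image-only slides).
        text = page.extract_text() or ""
        pages.append(clean_lines(text))
    return pages
```

Pages that come back empty are likely image-only and are candidates for the OCR step.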

The output must be as structured as possible. Test it yourself with as many different PDFs as you can; the text must come out in logical order.

Create a new DB table and Model classes.
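One possible shape for the table and model, sketched with sqlite3 and a dataclass as stand-ins, since the notes do not say which database or ORM the project uses. The table name deal_file_segments, the DealFileSegment class and all column names are assumptions.

```python
import sqlite3
from dataclasses import dataclass
from typing import List


@dataclass
class DealFileSegment:
    deal_file_id: int
    page_number: int
    position: int  # order of the segment within its page
    text: str


# Hypothetical schema; adapt names and types to the project's real DB.
SCHEMA = """
CREATE TABLE IF NOT EXISTS deal_file_segments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    deal_file_id INTEGER NOT NULL,
    page_number INTEGER NOT NULL,
    position INTEGER NOT NULL,
    text TEXT NOT NULL
)
"""


def save_segments(conn: sqlite3.Connection, segments: List[DealFileSegment]) -> None:
    """Create the table if needed and insert all segments."""
    conn.execute(SCHEMA)
    conn.executemany(
        "INSERT INTO deal_file_segments (deal_file_id, page_number, position, text) "
        "VALUES (?, ?, ?, ?)",
        [(s.deal_file_id, s.page_number, s.position, s.text) for s in segments],
    )
    conn.commit()


def load_segments(conn: sqlite3.Connection, deal_file_id: int) -> List[DealFileSegment]:
    """Read segments back in reading order (page, then position)."""
    rows = conn.execute(
        "SELECT deal_file_id, page_number, position, text FROM deal_file_segments "
        "WHERE deal_file_id = ? ORDER BY page_number, position",
        (deal_file_id,),
    ).fetchall()
    return [DealFileSegment(*row) for row in rows]
```

Storing page_number and position explicitly keeps the logical order recoverable regardless of insert order.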

The following segments need to be extracted from the PDF, for example:

text:

PITCH DECK
We are changing the way people buy cars around the world.
Never buy a banger again.
CarExamer.com

text:

Go-to Market Plan
Sponsorships
Something
Affiliate
Something
...
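To make the segment examples above concrete, here is a sketch of one structured output format, assuming each page has already been reduced to an ordered list of text lines. The function name build_output and the keys deal_file_id, pages, page, segments and position are all hypothetical.

```python
import json
from typing import Dict, List


def build_output(deal_file_id: int, pages: List[List[str]]) -> Dict:
    """Number pages and segments from 1 so reading order is explicit."""
    return {
        "deal_file_id": deal_file_id,
        "pages": [
            {
                "page": page_number,
                "segments": [
                    {"position": position, "text": text}
                    for position, text in enumerate(lines, start=1)
                ],
            }
            for page_number, lines in enumerate(pages, start=1)
        ],
    }


if __name__ == "__main__":
    sample = [["PITCH DECK", "Never buy a banger again.", "CarExamer.com"]]
    print(json.dumps(build_output(1, sample), indent=2))
```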

OCR for images: where an image contains text, recognize the segments
Apply Segment Anything and ImageToText models to get text annotations and descriptions from images
Charts and graphs
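For the OCR item, a minimal sketch that uses Tesseract through pytesseract and Pillow. This is an assumed choice, not something the notes specify, and it does not cover Segment Anything or chart interpretation. The function names default_ocr and image_to_segments are hypothetical; the OCR backend is injectable so another model can be swapped in.

```python
from typing import Callable, List


def default_ocr(image_path: str) -> str:
    """Run Tesseract OCR on one image file and return the raw text."""
    # Lazy imports: Pillow, pytesseract and the Tesseract binary are
    # assumed to be installed where this runs.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image)


def image_to_segments(
    image_path: str,
    ocr: Callable[[str], str] = default_ocr,
) -> List[str]:
    """Turn OCR output into ordered, non-empty text segments."""
    raw_text = ocr(image_path)
    return [line.strip() for line in raw_text.splitlines() if line.strip()]
```

Passing a different callable as ocr lets the same segment logic be reused with an image-to-text model later.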