PDF / Document files as a Data Source in Modeller
Hi Avi / Pyramid Dev team,
Information useful to a Pyramid client is often stored in PDF or other document formats, for example where the client has many suppliers who do not interact via any type of direct data interchange.
I deal with many customers who would be good examples of this use case - smaller retailers who purchase goods from numerous wholesale suppliers and receive invoices in a similar format to those attached here - PDFs, Word docs etc.
As shown in the image, a way of tagging certain label names and extracting both the label name and its associated value (by name, relative position or similar) would allow Pyramid to pull the relevant fields from Word or PDF documents, to be used alongside additional customer data for greater overall insight.
I understand some other BI toolsets are now adding this functionality.
I look forward to your thoughts on this.
Thanks and regards,
You can do this already through the Python Source Block in Model.
Generally, we would be looking for tables of data in the PDF to load into a DB for analysis. This can be done with a single line of Python code. This article gives a good explanation and guide on how to do this, as well as more complex examples where one needs to deal with multiple tables.
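As a rough sketch of the table case: the library isn't named above, so this assumes tabula-py (which also needs a Java runtime) and its `read_pdf` function. The function name `load_pdf_tables` and the `reader` parameter are made up for illustration; `reader` is only there so the logic can be tried without the library installed.

```python
from typing import Any, Callable, List, Optional


def load_pdf_tables(
    path: str,
    reader: Optional[Callable[..., List[Any]]] = None,
) -> List[Any]:
    """Return every non-empty table found in the PDF at `path`."""
    if reader is None:
        # Assumption: tabula-py is installed; it is imported lazily so the
        # function can also be used with a different reader.
        import tabula

        reader = tabula.read_pdf
    # With tabula-py this call is the "single line": it returns a list with
    # one DataFrame per table detected across all pages.
    tables = reader(path, pages="all", multiple_tables=True)
    # Drop empty results so downstream steps only see real tables.
    return [table for table in tables if len(table) > 0]
```

How the resulting tables are handed back to the Python Source Block depends on your Pyramid setup, so treat this only as the extraction step.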
There may also be a need to retrieve individual elements of the document, such as the invoice number or the name and address that make up the "header" for the table(s), and/or elements from specific PDF 'fields', as used in online PDF forms for example.
You can also do this via Python, with a good explanation available on the same site as the previous one.
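For the header elements, one possible approach (a sketch, not necessarily what the linked guide does) is to extract the page text first and then look values up by label name. The example below assumes the text has already been pulled out of the PDF, for instance with a library such as pypdf; `extract_labelled_fields` and the sample invoice text are made up for illustration.

```python
import re
from typing import Dict, Iterable


def extract_labelled_fields(text: str, labels: Iterable[str]) -> Dict[str, str]:
    """Find 'Label: value' or 'Label  value' lines and return {label: value}."""
    fields: Dict[str, str] = {}
    for label in labels:
        # Label at the start of a line, an optional ':' or '#', then the
        # value running to the end of that line (case-insensitive).
        pattern = re.compile(
            rf"^\s*{re.escape(label)}\s*[:#]?\s*(.+?)\s*$",
            re.IGNORECASE | re.MULTILINE,
        )
        match = pattern.search(text)
        if match:
            fields[label] = match.group(1)
    return fields


# Made-up sample text standing in for what a PDF text extractor might return.
sample = (
    "ACME Wholesale\n"
    "Invoice Number: INV-1001\n"
    "Invoice Date: 2024-01-15\n"
    "Total Due  149.90\n"
)
header = extract_labelled_fields(
    sample, ["Invoice Number", "Invoice Date", "Total Due", "PO Number"]
)
print(header)
# {'Invoice Number': 'INV-1001', 'Invoice Date': '2024-01-15', 'Total Due': '149.90'}
```

Labels that are not present (here "PO Number") are simply left out of the result, so each supplier's layout can share one label list. Matching by relative position rather than by name would need the extractor's coordinate data and is not covered here.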
In the meantime, we'll monitor the votes for this idea.
Hope that helps.