PDF / Document files as a Data Source in Modeller
Hi Avi / Pyramid Dev team,
Often, information useful for a Pyramid client is stored in PDF or other document formats - for example where the Pyramid client itself has many suppliers who do not interact via any type of direct data interchange.
I deal with many customers who would be good examples of this use case - smaller retailers who purchase goods from numerous wholesale suppliers and receive invoices in a similar format to those attached here - PDFs, Word docs etc.
As shown in the image, perhaps a way of tagging certain label names and extracting both the label name and its associated value - by name, relative position or similar - would allow Pyramid to pull the relevant fields and values from Word or PDF documents, to be used alongside additional customer data for greater overall insight.
I understand this functionality is being included in some other BI toolsets as new functionality.
I look forward to your thoughts on this.
Thanks and regards,
Richard.
2 replies
Hi Richard,
You can do this already through the Python Source Block in Model.
Generally, we would be looking for tables of data in the PDF to load into a database for analysis. This can be done with a single line of Python code. This article gives a good explanation and guide on how to do this, as well as more complex examples where one needs to deal with multiple tables.
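As a rough illustration of the table case, here is a minimal sketch. It assumes the third-party pdfplumber library is installed, which may not be the library the article uses. The helper names rows_to_records and load_pdf_tables are hypothetical, and how the result is handed back to the Python Source Block is not shown.

```python
from typing import Dict, List, Optional

Row = List[Optional[str]]


def rows_to_records(rows: List[Row]) -> List[Dict[str, str]]:
    """Treat the first row as the header and turn each later row into a dict."""
    if not rows:
        return []
    header = [(cell or "").strip() for cell in rows[0]]
    records = []
    for row in rows[1:]:
        cells = [(cell or "").strip() for cell in row]
        if not any(cells):
            continue  # skip rows that are entirely empty
        records.append(dict(zip(header, cells)))
    return records


def load_pdf_tables(path: str) -> List[Dict[str, str]]:
    """Collect records from every table found on every page of the PDF."""
    import pdfplumber  # third-party; assumed to be installed

    records: List[Dict[str, str]] = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            # extract_tables() returns one list of rows per detected table
            for table in page.extract_tables():
                records.extend(rows_to_records(table))
    return records
```

Calling load_pdf_tables("invoice.pdf") (a hypothetical file name) would return one dictionary per line item, keyed by the column headings found in the PDF. Table detection depends on how the PDF was produced, so scanned invoices would need OCR first.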
There may also be a need to retrieve individual elements of the document, such as the invoice number or the name and address that make up the "header" for the table(s), and/or values from specific PDF 'fields', such as those used in online PDF forms.
You can also do this via Python, with a good explanation available on the same site as the previous one.
In the meantime, we'll monitor the votes for this idea.
Hope that helps.
Ian