
Batch Processing // Multi-Threading // Parallel Processing

Hi Team,


It's quite evident that we can have the ETL created and scheduled in the tool itself - and not only one but multiple ETLs in one place, rather than using SSIS packages.

Let's say 1,000 customers each request a report. My concern is whether this tool can schedule the processing batch-wise (10 requests at a time) or as a multi-threaded process that executes 10 requests in parallel and, once those reports are delivered, picks up the next 10 requests, and so on.
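Something like this minimal sketch is what I have in mind (run_report and deliver are hypothetical placeholders, just to illustrate the batch-wise pattern, not anything from the tool):

    from concurrent.futures import ThreadPoolExecutor

    BATCH_SIZE = 10

    def run_report(request):
        # placeholder: generate one report for one request
        return f"report for {request}"

    def deliver(report):
        # placeholder: send the finished report to the customer
        print("delivered:", report)

    def process_requests(requests):
        # Work through the queue in fixed-size batches: run 10 reports
        # in parallel, wait until all 10 are delivered, then move on
        # to the next 10.
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            for start in range(0, len(requests), BATCH_SIZE):
                batch = requests[start:start + BATCH_SIZE]
                for report in pool.map(run_report, batch):  # blocks per batch
                    deliver(report)

    process_requests([f"customer-{i}" for i in range(1, 1001)])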


Assuming we have 10K reports going out per day, I care about the frequency and efficiency of the overall process in the Pyramid tool, and the SLA for report delivery.


Best,

Burhan

1 reply

    AviPerez ("making the sophisticated simple"), 4 yrs ago

    Peerzada,

    A few thoughts and comments on your question:

    • The "ETL" or yellow tool builds packages for importing/cleaning data. These packages are executed on the "Task" server tier. Each task server can handle multiple jobs at the same time, and the capacity of that parallel processing is set in the admin tools (see Servers > Task Engine). The ETL engine is designed to process multiple rows of incoming data in parallel, and memory and CPU generally scale particularly well, even if the data footprint is large (exceptions can include 'pooled' steps, where the full data set is needed to resolve the activity, or complex processing through an external engine like Python or R).
    • The process of generating static "reports" (which is what I assume you are referring to) is handled by the Publish or blue app. It also runs off the Task server because it is a batch job. Like ETL packages, it is governed by the settings of the Task server and the number of threads you have elected to enable - so you can run multiple concurrent publications if you choose. But unlike ETL packages, the weight of the publication template (number of queries) and the performance of your underlying data sources are the biggest factors impacting performance. (On super fast in-memory databases a typical query may take 50-500 milliseconds even with concurrent requests, while the same load on a relational database may be 5-10 times slower.) For example, if you have a report template that is 20 pages long, with 3 different queries per page, each generated report will need to execute roughly 60 unique queries to render. If you ran 24 of these in parallel, your data source should be able to comfortably handle 24-72 concurrent queries at any point in time. If you sliced this by 10K items (say 'customers') to produce a unique report for each customer, you would be pushing 600K queries a day through the engine (60 x 10,000)! And that does not include any queries your end users might be generating through the interactive reporting tools like Discover (green) and Present (red). A back-of-envelope sketch of this arithmetic follows this list.
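    To make those numbers concrete, here is the back-of-envelope arithmetic as a small Python snippet (the figures are the illustrative ones from the example above, not benchmarks):

        pages_per_report = 20      # pages in the report template
        queries_per_page = 3       # queries rendered per page
        reports_per_day = 10_000   # one report per customer
        parallel_reports = 24      # concurrent publication jobs

        queries_per_report = pages_per_report * queries_per_page   # 60
        queries_per_day = queries_per_report * reports_per_day     # 600,000

        # Concurrent load on the data source: at minimum one query per
        # in-flight report, at most a full page's worth per report.
        concurrent_low = parallel_reports                           # 24
        concurrent_high = parallel_reports * queries_per_page       # 72

        print(queries_per_report, queries_per_day,
              concurrent_low, concurrent_high)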

    To scale the task server out beyond the thread settings, you can install multiple task servers into your Pyramid cluster (if you have the Enterprise license) - increasing your Pyramid processing real estate. Further, you can designate certain Task machines as dedicated to yellow or blue processing only, to give you more control over resource allocation. The scaling guide here explains how this is done: https://help.pyramidanalytics.com/Content/Root/Guides/Scaling/Scaling%20Pyramid%202018.htm

    So, to your central question: can you generate 10K reports a day?

    YES, there is no theoretical reason why you cannot. And using the Publication scheduler, with all its capabilities like report triggers, logical slicers and automated distribution, the overall process could not be simpler.

    However, you need to resource the system well. That means providing ample resources to the task server(s) and, more importantly, to your data source, which needs to be able to crunch all that data.

    There is no SLA offered by Pyramid, because every customer's data, queries, report design, volumes, hardware, network, etc. are different - and they can all impact results.

    My advice is to prototype the overall exercise to get a sense of the performance and scalability.
