Applying R to Text Analytics
by Michael Raam, Principal Data Analytics - Pyramid Analytics, and Bar Amit, Data Scientist - Pyramid Analytics
As machine learning becomes more popular, we are often asked how Pyramid Analytics can meet these new analytical needs.
With the introduction of the R engine in BI Office, we offer out-of-the-box machine learning functionality (forecasting, clustering, and prediction). However, there are cases where we suggest that customers customize the R script to meet their specific needs.
Keep in mind that the techniques we use are statistical and do not claim to be correct for every individual tweet or text analyzed.
Text analysis with Pyramid Analytics using Twitter data is one of many possible examples. This demo could be used to analyze any text source.
This demo includes the following:
- Retrieve the Twitter data feed (create view on TwitterData)
- Sentiment Analysis – assign each tweet a sentiment of: Positive, None, or Negative based on the tweet text.
- Location Analysis – assign each tweet a location based on the tweet text.
- BI Office PAXL file and data model attached.
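To make the sentiment step concrete, here is a minimal sketch of a word-list classifier in base R. It is not the attached script: the word lists, the function names, and the sample tweets are all hypothetical, and a real lexicon would be much larger. The same matching could be written with stringr functions such as str_detect.

```r
# Hypothetical word lists; a real lexicon would be far larger.
positive_words <- c("good", "great", "love", "happy", "excellent")
negative_words <- c("bad", "terrible", "hate", "sad", "awful")

# Count how many tokens of each text appear in the given word list.
count_matches <- function(text, lexicon) {
  tokens <- strsplit(tolower(text), "[^a-z']+")
  vapply(tokens, function(tok) sum(tok %in% lexicon), integer(1))
}

# Label each text Positive, Negative, or None from the net word count.
classify_sentiment <- function(text) {
  score <- count_matches(text, positive_words) -
    count_matches(text, negative_words)
  ifelse(score > 0, "Positive", ifelse(score < 0, "Negative", "None"))
}

# Made-up sample tweets, one per expected label.
tweets <- c("I love this great dashboard",
            "Terrible outage, I hate waiting",
            "Release notes are posted")
print(classify_sentiment(tweets))
```

Because the score is only a count of matched words, sarcasm and negation ("not good") are misread, which is why the results should be treated as statistical rather than correct for every tweet.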
The demo has two steps: a prep step, where the necessary setup is done, and a run step, where the actual analysis is performed.
Overview of Process
Step 1 – Prep
a. Create a connection to Twitter in the Admin and import data from Twitter via the Data Modeler (if no connection is available, a slice of Twitter data is available as Twitter data-09-Aug-16).
Figure 1. Adding a Twitter data source.
Figure 2. Connecting to Twitter.
b. Due to technical issues with the data provided by Twitter, we need to clean the Twitter data by creating a view on the database that the Data Modeler created during the import from Twitter (the create view on TwitterData file is attached).
Figure 3. Creating a view of the data.
c. Once the view is available, create another model via the Data Modeler, using the SQL Server database as the source. If needed, create a data source in the Admin.
Figure 4. Access the Twitter data (via a view).
Figure 5. Creating a data model from Twitter data.
d. After the additional SQL Server-based model is created, upload an additional R package, “stringr”, via the Admin.
Figure 6. Adding an R script.
e. Unzip the attached R package (TwitterPackage.rar) and place it in an accessible folder on the Pyramid Server.
Figure 7. Accessing the R-script text.
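Before moving on to the run step, it can help to confirm that the package is visible to R. The snippet below is a small sketch that assumes you can open an R session using the same library as the Pyramid Server; the helper name package_available is ours, not part of BI Office.

```r
# Returns TRUE when the named package can be loaded in this R installation.
package_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (package_available("stringr")) {
  message("stringr is available")
} else {
  message("stringr is missing; upload it via the Admin before running the wizard")
}
```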
Step 2 – Run
a. Open the Prediction wizard and run the wizard once for every classification (sentiment and location). There will be a total of two prediction templates.
Figure 8. Sample names of two models you will create.
b. Follow these steps in the Prediction wizard:
1. Algorithm – Output set to “Built in: Categorical Items”
Figure 9. First step of the wizard for Algorithm.
2. What to Predict – Choose “statusSource”
Figure 10. Second step of the wizard - What to Predict.
3. Items to Predict for – This is an important setting: make sure it is set to the column “id”. This column must contain unique values (a primary key).
Figure 11. Third step of the wizard - Items to Predict for.
4. Categories to Predict by – Here we will provide the column with the text to be analyzed. Choose “text”.
Figure 12. Fourth step of the wizard - Categories to Predict by.
5. Numbers to Predict by – Keep this empty.
6. R script – Here you customize the R code for the classification you are building. Examples are attached for both sentiment and location (Custom R for Pyramid Twitter.txt).
Figure 13. Fifth step of the wizard - Enter R-script.
7. Final – Give the model a simple caption (such as Location or Twitt Sentiment); this becomes the name of the attribute once the process has run and is saved for reuse.
Figure 14. Save model with Name (Location).
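Step 3 of the wizard depends on the “id” column being unique. If you want to verify that before running the wizard, a quick check in any R session could look like the sketch below. The sample data frame and the helper name is_unique_key are hypothetical; only the column names come from the steps above.

```r
# Hypothetical sample of the tweet table; column names follow the wizard steps.
tweets <- data.frame(
  id   = c("101", "102", "103"),
  text = c("first tweet", "second tweet", "third tweet"),
  stringsAsFactors = FALSE
)

# TRUE when the column has no missing values and no repeated values.
is_unique_key <- function(x) {
  !anyNA(x) && anyDuplicated(x) == 0
}

print(is_unique_key(tweets$id))        # TRUE for the sample above
print(is_unique_key(c("101", "101")))  # FALSE, the value repeats
```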
c. After the predictions run, once for sentiment and once for location, there will be two new attributes that can be used. Keep in mind that the R scripts can be slow to run; the location text analysis is the faster of the two. Note: You can import Sentiment Location.paxl and change the connection information for some prebuilt views.
Figure 15. Open up Data Discovery or imported briefing book.
Figure 16. Completed view using Twitter data.
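For readers curious about what a location classification might do with the tweet text, the sketch below looks up known place names in base R. It is an illustration under our own assumptions, not the attached script: the place list, the "Unknown" label, and the function name classify_location are all hypothetical.

```r
# Hypothetical gazetteer: lower-case search term -> label to report.
known_places <- c("new york" = "New York", "nyc" = "New York",
                  "london" = "London", "tel aviv" = "Tel Aviv")

# Return the label of the first known place found in each text, or "Unknown".
classify_location <- function(text) {
  lowered <- tolower(text)
  vapply(lowered, function(txt) {
    # Word-boundary match so a term is not found inside a longer word.
    hits <- vapply(names(known_places),
                   function(p) grepl(paste0("\\b", p, "\\b"), txt, perl = TRUE),
                   logical(1))
    if (any(hits)) unname(known_places[hits][1]) else "Unknown"
  }, character(1), USE.NAMES = FALSE)
}

# Made-up sample tweets.
tweets <- c("Landing in London tonight",
            "Best pizza in NYC",
            "No place mentioned here")
print(classify_location(tweets))
```

A lookup like this only finds places that are spelled out in the text and listed in the gazetteer, so many tweets will fall into the "Unknown" bucket.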
And there you have it: an analyst-driven way to support deep text analysis.