
How to use ML to categorize or classify comments?

I have a comments field that people want to classify into around 10 different values. Is there a built-in algorithm that would help with this type of text classification?

1 reply

    • Jason_Picker
    • 4 yrs ago

    The quick answer is yes, with a caveat. The ML marketplace has a sentiment analysis algorithm that scans a comment field and classifies each comment as "Positive", "Negative", or "None". You can take the script and modify it to produce more classifications as needed. It works by counting how many keywords from two lists (one positive, one negative) appear in each comment and assigning whichever label has the higher count; a tie gets "None". Both lists are fully customizable, so you can do what you want with them.

     

    Here is what the code looks like:

    # Classify each sentence as positive, negative, or none by keyword count.
    # Input
    # -----
    # textColumn: a string column with multiple words per row
    #
    # Tips
    # ----
    # Useful for classifying sentences such as reviews as positive or negative
    
    library(stringr)  # str_to_lower, str_replace_all, str_c, str_detect, fixed

    AllText <- data.frame(text = as.character(textColumn), stringsAsFactors = FALSE)
    AllText$classification <- "None"  # default when the counts tie
    # Load the keyword lists; the keywords are read from the first column (V1)
    negativeWords <- read.csv("https://marketplace.pyramidanalytics.com/r/negativewords.txt", header = FALSE, stringsAsFactors = FALSE)
    positiveWords <- read.csv("https://marketplace.pyramidanalytics.com/r/positivewords.txt", header = FALSE, stringsAsFactors = FALSE)
    ######################## Clean ###########################################
    # Lowercase, replace non-alphanumeric characters with spaces, and pad with
    # spaces so keywords match whole words only (" good " does not match "goodbye")
    cleanText <- function(x) str_c(" ", str_replace_all(str_to_lower(as.character(x)), "[^[:alnum:]]", " "), " ")
    AllText$text <- cleanText(AllText$text)
    positives <- cleanText(positiveWords$V1)
    negatives <- cleanText(negativeWords$V1)
    ######################## Run ##############################################
    # Count how many distinct keywords from a list appear in one comment
    countMatches <- function(x, words) sum(str_detect(x, fixed(words)), na.rm = TRUE)
    AllText$NegativeCount <- vapply(AllText$text, countMatches, integer(1), words = negatives, USE.NAMES = FALSE)
    AllText$PositiveCount <- vapply(AllText$text, countMatches, integer(1), words = positives, USE.NAMES = FALSE)
    AllText$classification[AllText$NegativeCount > AllText$PositiveCount] <- "Negative"
    AllText$classification[AllText$PositiveCount > AllText$NegativeCount] <- "Positive"
    classifiedColumn <- AllText$classification
    outputDF <- data.frame(classifiedColumn)

    Note the URLs for the two word lists in the script. Paste either URL into your browser to download a copy that you can modify as you wish.
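
    If you keep a modified list somewhere the script can read, loading it should be the same read.csv call pointed at a different location. A minimal sketch, assuming one keyword per line and assuming your script environment is allowed to read that location; the temporary file and its three words are only stand-ins for your own path or URL:

    ```r
    # Hypothetical stand-in for your edited word list; replace with your own path or URL
    listPath <- tempfile(fileext = ".txt")
    writeLines(c("Great", "Helpful", "Fast"), listPath)

    # Same call as in the script, pointed at the custom list
    positiveWords <- read.csv(listPath, header = FALSE, stringsAsFactors = FALSE)

    # Lowercase the keywords so they match the lowercased comments
    positives <- tolower(positiveWords$V1)
    ```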

    Once you add the script to your data flow, pick the column you want to analyze; the script then returns a classification for each row of that column (Positive, Negative, or None).
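
    For something like ten categories, the same keyword-counting idea can be generalized to one keyword list per category. What follows is a rough sketch rather than the marketplace script itself: the category names, the keywords, and the helper functions cleanText and classifyComments are all made up for illustration, and it uses base R instead of stringr so it runs without extra packages. Each comment gets the category with the most keyword hits; ties for the top score and comments with no hits come back as "None".

    ```r
    # Hypothetical categories and keywords; replace with your own lists
    categoryKeywords <- list(
      Billing  = c("invoice", "refund", "charge"),
      Shipping = c("delivery", "late", "package"),
      Support  = c("help", "agent", "ticket")
    )

    # Same cleaning idea as the sentiment script: lowercase, replace
    # non-alphanumeric characters with spaces, pad with spaces for whole-word matching
    cleanText <- function(x) {
      x <- tolower(as.character(x))
      x <- gsub("[^[:alnum:]]", " ", x)
      paste0(" ", x, " ")
    }

    classifyComments <- function(textColumn, keywords, fallback = "None") {
      cleaned <- cleanText(textColumn)
      # One row per comment, one column per category:
      # number of distinct keywords from that category found in the comment
      scores <- sapply(keywords, function(words) {
        patterns <- cleanText(words)
        vapply(cleaned, function(comment) {
          sum(vapply(patterns, function(p) grepl(p, comment, fixed = TRUE), logical(1)))
        }, numeric(1), USE.NAMES = FALSE)
      })
      scores <- matrix(scores, nrow = length(cleaned),
                       dimnames = list(NULL, names(keywords)))
      best <- max.col(scores, ties.method = "first")
      topScore <- scores[cbind(seq_along(best), best)]
      # More than one category sharing the top score counts as a tie
      isTie <- rowSums(scores == topScore) > 1
      ifelse(topScore == 0 | isTie, fallback, colnames(scores)[best])
    }

    # Sample input standing in for the column chosen in the data flow
    textColumn <- c("Package arrived late, and no refund.",
                    "I need a refund for this charge",
                    "Nice weather today",
                    "Agent fixed my invoice")
    classifiedColumn <- classifyComments(textColumn, categoryKeywords)
    outputDF <- data.frame(classifiedColumn)
    ```

    In a data flow you would drop the sample vector and let the selected column supply textColumn. Keyword counting like this is only as good as the lists behind it, so expect to tune the keywords against real comments.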

     

    There are other R and Python scripts on the internet that have additional classifications that you can use as the basis of your custom script.  If I find a good one or two, I will add links to them in another reply.
