How to count the number of sentences in a text in R
Counting the number of sentences in a text is a common task in text analysis and natural language processing. Sentence counts feed into many applications, including language modelling, sentiment analysis, and text summarization. In this article, we'll look at several techniques and R packages for counting the number of sentences in a given text, from simple regular expressions to dedicated sentence tokenizers.
Related Concepts:
- Regular Expressions: A regular expression specifies a pattern, here the punctuation that marks the end of a sentence.
- Functions in R: Several string-related functions are used to split the text and count the sentences.
Steps Required for Counting Sentences in R:
- First, write an R script (for example in RStudio) that will perform the counting.
- Store the text in a variable as a string.
- Use a regular expression that matches sentence-ending punctuation in the text.
- Count the matches, or the pieces produced by splitting on them, as shown in the examples in this article.
- Finally, display the sentence count on the console.
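The steps above can be sketched with base R alone. This is a minimal sketch, assuming that every run of `.`, `!` or `?` ends exactly one sentence; the sample text and variable names are illustrative only.

```r
# Step 1: store the text in a variable as a string.
text <- "R is fun. Is it hard? Not at all!"

# Step 2: describe sentence endings with a regular expression.
# Assumption: one or more of . ! ? marks the end of a sentence.
sentence_pattern <- "[.!?]+"

# Step 3: find every match of the pattern in the text.
matches <- gregexpr(sentence_pattern, text)[[1]]

# gregexpr() returns -1 when nothing matches, so count only real matches.
num_sentences <- sum(matches > 0)

# Step 4: display the count on the console.
cat("Number of sentences:", num_sentences, "\n")
```

For this sample text the count is 3. The check `matches > 0` matters because a text without any sentence-ending punctuation would otherwise be reported as having one match.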
Code for Counting Sentences in Text using strsplit() in Base R
R
    # Sample text containing three sentences
    text <- "This is R program for counting number of sentences in text. This program is for GFG article . And it is using base R for counting."
    # Split the text at every sentence-ending punctuation mark
    sentences <- unlist(strsplit(text, "[.!?]"))
    # The number of pieces is the number of sentences
    num_sentences <- length(sentences)
    cat("Number of sentences using unlist and strsplit :", num_sentences)
Output:
Number of sentences using unlist and strsplit : 3
- First, we store the text in the text variable.
- Then we use strsplit() to split the text wherever the regular expression matches.
- unlist(): flattens the list returned by strsplit() into a character vector, which we store in the sentences variable.
- length(): returns the number of elements in sentences, which is the sentence count.
Finally, we use cat() to display the sentence count. The text contains 3 sentences, each ending with a full stop (.), so the output is 3.
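Splitting on single punctuation marks can overcount when marks appear in runs, such as "..." or "?!", because strsplit() then returns empty pieces. The sketch below is one possible way to handle this, assuming that a run of marks ends a single sentence; the sample text is made up for illustration.

```r
text <- "Wait... What?! Fine."

# Naive split: every single mark is a separator, so runs such as
# "..." and "?!" leave empty strings behind and inflate the count.
naive <- unlist(strsplit(text, "[.!?]"))

# More forgiving split: treat a run of marks as one separator,
# then drop pieces that are empty or contain only whitespace.
pieces <- unlist(strsplit(text, "[.!?]+"))
pieces <- pieces[nzchar(trimws(pieces))]

cat("Naive count:", length(naive), "\n")
cat("Filtered count:", length(pieces), "\n")
```

The filtered version reports 3 sentences for this text, while the naive version reports more. Abbreviations such as "Dr." or decimal numbers would still be split, so this remains a heuristic.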
Code for Counting Sentences in Text using stringr Package and str_count()
R
    # Install stringr if it is missing, then load it
    if (!require(stringr)) {
      install.packages("stringr")
      library(stringr)
    }
    text <- "This is R program for counting number of sentences in text. This program is for GFG article . And it is using stringr package for counting. And is it working ?"
    # Character class matching sentence-ending punctuation
    sentence_pattern <- "[.!?]"
    # Count how many times the pattern occurs in the text
    num_sentences <- str_count(text, sentence_pattern)
    cat("Number of sentences using stringr :", num_sentences, "\n")
Output:
Number of sentences using stringr : 4
- First, we install the stringr package if it is not already installed, load it, and store the text in the text variable as before.
- Then we store the regular expression in the sentence_pattern variable.
- str_count(): counts how many times the regular expression matches in the text. Each match is one sentence-ending punctuation mark.
Finally, we display the sentence count using cat(). The text contains four sentences in total: three ending with a full stop (.) and one ending with a question mark (?). Hence the output is 4.
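Because str_count() counts punctuation marks rather than sentences, decimal points and doubled marks such as "?!" inflate the result. The sketch below shows one possible refinement, assuming that a sentence ends with a run of marks followed by whitespace or the end of the text; the sample text and pattern are illustrative, not part of the stringr API.

```r
library(stringr)

text <- "Pi is about 3.14. Really?! Yes, it is."

# Counting single marks: the decimal point and both marks in "?!" are counted.
mark_count <- str_count(text, "[.!?]")

# Assumption: a sentence ends with a run of marks that is followed by
# whitespace or by the end of the text (checked with a lookahead).
end_count <- str_count(text, "[.!?]+(?=\\s|$)")

cat("Punctuation marks:", mark_count, "\n")
cat("Sentence endings:", end_count, "\n")
```

Here the first count is 5 and the second is 3. The refined pattern still treats abbreviations such as "Dr. Smith" as sentence endings, which is where a dedicated sentence tokenizer helps.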
Code for Counting Sentences in Text using openNLP Package
R
    if (!require(openNLP)) {
      install.packages("openNLP")  # install the package if not present
      library(openNLP)
    }
    library(NLP)  # provides as.String() and annotate()
    text <- "This is gfg sentence. Another sentence from gfg ! And a third one?"
    # Create a sentence annotator backed by the default English model
    sent_token_annotator <- Maxent_Sent_Token_Annotator()
    # Annotate the text; each resulting annotation marks one sentence
    sentences <- annotate(as.String(text), sent_token_annotator)
    num_sentences <- length(sentences)
    cat("Number of sentences using openNLP:", num_sentences, "\n")
Output:
Number of sentences using openNLP: 3
- We store the text in the text variable.
- Maxent_Sent_Token_Annotator() creates a sentence annotator that uses a pre-trained maximum entropy model (English by default).
- Applying the annotator to the text produces one annotation per detected sentence, which we store in sentences.
- Finally, we use length() to count the sentence annotations and display the result using cat().
- Make sure Java is installed and available on your PATH, because openNLP relies on the rJava package.
Here there are 3 sentences, ending with a full stop (.), an exclamation mark (!) and a question mark (?) respectively. Hence the output is 3.
Code for Counting Sentences in Text using tokenizers Package
R
    # Install tokenizers if it is missing, then load it
    if (!require(tokenizers)) {
      install.packages("tokenizers")
      library(tokenizers)
    }
    text <- "This is an example gfg sentence. Another gfg sentence! this is last example."
    # tokenize_sentences() returns a list; flatten it to a character vector
    sentences <- unlist(tokenize_sentences(text))
    num_sentences <- length(sentences)
    cat("Number of sentences using tokenizers:", num_sentences, "\n")
Output:
Number of sentences using tokenizers: 3
- We store the text in the text variable.
- tokenize_sentences(): splits the text into sentences and returns them as a list.
- unlist(): flattens that list into a character vector, which we store in sentences.
- length(): counts the sentences, and cat() displays the result.
The text contains three sentences: two end with a full stop (.) and one ends with an exclamation mark (!). Hence the count is 3.
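When several texts need to be counted at once, calling unlist() would merge all sentences into one vector and lose the per-text counts. The sketch below shows one possible approach, assuming the tokenizers package is installed; the docs vector is made-up sample data.

```r
library(tokenizers)

# Two illustrative documents stored in one character vector.
docs <- c(
  "This is the first document. It has two sentences.",
  "Only one sentence here."
)

# tokenize_sentences() returns a list with one character vector per document.
sentence_list <- tokenize_sentences(docs)

# lengths() gives the number of sentences in each document.
counts <- lengths(sentence_list)
print(counts)
```

For these two documents the counts are 2 and 1. Keeping the list structure makes it easy to attach the counts back to the original texts, for example as a column in a data frame.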