Web Scraping using XPath

To scrape data directly from an HTML element, we can use XPath, a query language for addressing nodes in an HTML or XML document. The XPath of an element can be found with the browser's Inspect tool. In Chrome:

right-click the page > Inspect > right-click the element in the Elements panel > Copy > Copy full XPath.

First, we need to install the rvest package, a library to scrape web pages.

install.packages('rvest')

Suppose we want to scrape the timetable for train no. 14553 on trainman.in, which is at this URL:

https://www.trainman.in/train/14553

Open the page, then use Inspect to select the first row of the timetable.

Example website to be scraped for demonstration purposes

Copy the XPath as described above. It will look something like this, although it may differ if the site layout has changed:

/html/body/app-root/app-wrapper/div/main/train-schedule
/div[2]/div[1]/div/div[3]/table/tbody/tr[1]

Select the Copy full XPath option from the menu

The XPath we copied matches one row only. To match the rest of the rows as well, remove the index [1] from tr[1]:

/html/body/app-root/app-wrapper/div/main/train-schedule
/div[2]/div[1]/div/div[3]/table/tbody/tr

This path now matches every row in the table rather than just the first. To scrape it in R, store the URL in a variable and fetch the HTML with read_html(url). To pick out the specific elements, call html_nodes() with the page and the XPath. Finally, pipe the result into html_text() with %>% to keep only the text content, without the tags and attributes.

R

# include the installed library rvest 
library(rvest) 
  
# call the url 
url <- "https://www.trainman.in/train/14553"
  
# get the data 
page <- read_html(url) 
  
# filter the required data using xpath 
rows <- html_nodes(page, xpath = "/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr") %>% html_text() 
  
# print 
rows

Output:

Data Scraped from the website
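Because a copied full XPath depends on the exact page layout, it can silently match nothing if the layout changes. The sketch below is one possible way to guard against that and to tidy the whitespace in the extracted text. The helper name extract_text and the inline HTML are hypothetical and only stand in for the real page.

```r
library(rvest)

# Hypothetical helper (not part of rvest): returns the cleaned text of
# every node matching `xpath`, and warns when nothing matched.
extract_text <- function(page, xpath) {
  nodes <- html_nodes(page, xpath = xpath)
  if (length(nodes) == 0) {
    warning("No nodes matched the XPath; the page layout may have changed")
  }
  # collapse runs of whitespace into one space and trim both ends
  trimws(gsub("\\s+", " ", html_text(nodes)))
}

# Made-up stand-in for a downloaded page
page <- read_html("<html><body><table><tbody>
  <tr><td> 1 </td><td>Station A</td></tr>
  <tr><td> 2 </td><td>Station B</td></tr>
</tbody></table></body></html>")

rows <- extract_text(page, "/html/body/table/tbody/tr")
rows
```

On this inline table the helper returns one cleaned string per row. Called with an XPath that matches nothing, it returns an empty character vector and raises a warning instead of failing quietly.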

If we had simply copied the XPath of the <table> tag, we would have got only one entry containing all the stations, as opposed to 25 entries.

Raw data from the website
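The effect of selecting the table, a single indexed row, or all rows can be checked offline. This is a minimal sketch on a made-up three-row table; the station names are placeholders and are not taken from trainman.in.

```r
library(rvest)

# Made-up stand-in for the timetable
page <- read_html("<html><body><table><tbody>
  <tr><td>Station A</td></tr>
  <tr><td>Station B</td></tr>
  <tr><td>Station C</td></tr>
</tbody></table></body></html>")

# XPath of the <table> itself: one node, so one string holding everything
whole_table <- html_nodes(page, xpath = "/html/body/table") %>% html_text()

# tr[1] keeps the index, so only the first row is matched
first_row <- html_nodes(page, xpath = "/html/body/table/tbody/tr[1]") %>% html_text()

# tr without an index matches every row
each_row <- html_nodes(page, xpath = "/html/body/table/tbody/tr") %>% html_text()

length(whole_table)
length(first_row)
length(each_row)
```

Here whole_table and first_row each have length 1, while each_row has one element per row.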

This method is not limited to table tags. It works for any HTML element, though the XPath will differ according to the structure of the webpage.
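As an illustration of applying the same approach to other elements, the sketch below pulls a heading and some links from a small inline page. The HTML is invented for the example, and html_attr() is used here as an assumption about what one might want from a link, namely its href attribute in addition to its text.

```r
library(rvest)

# Made-up page with a heading and two links
page <- read_html('<html><body>
  <h1>Train 14553</h1>
  <p>See the <a href="/train/14553">timetable</a> or the <a href="/help">help page</a>.</p>
</body></html>')

# an absolute XPath to the heading
heading <- html_nodes(page, xpath = "/html/body/h1") %>% html_text()

# //a matches every <a> element anywhere in the document
links <- html_nodes(page, xpath = "//a")
link_text <- html_text(links)          # visible text of each link
link_href <- html_attr(links, "href")  # value of each href attribute

heading
link_text
link_href
```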

Web Scraping R Data From JSON

Many websites provide their data in JSON format, which we can use for analysis in R. JSON (JavaScript Object Notation) is a text-based format for representing structured data, based on JavaScript object syntax. In this article, we will see how to scrape data from a JSON web source and use it in the R programming language.
