Web Scraping using XPath
To scrape data directly from an HTML element, we can use XPath. The XPath of an element can be found with the browser's inspect tool. In the Chrome browser:
right-click the page > Inspect > right-click the element in the Elements panel > Copy > Copy full XPath.
First, we need to install the rvest package, a library to scrape web pages.
install.packages('rvest')
Suppose we are interested in scraping the timetable for train no. 14553 on trainman.in, which is at this URL:
https://www.trainman.in/train/14553
Open the page and select the first row of the timetable in the inspect tool.
Copy its full XPath as described above. It will look something like this, though it may differ if the page structure has changed:
/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr[1]
The XPath that we got matches one row only. What about the rest of the rows? To match them all, remove the index from tr[1]:
/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr
Now the XPath matches every row in the table rather than just one. To scrape this in R, store the URL in a variable, then fetch the HTML by calling read_html(url). To pick out the specific elements, call html_nodes(), passing the page and the XPath. Finally, pipe the result into html_text() with %>% to keep only the text, without the tags and attributes.
R
# include the installed rvest library
library(rvest)

# call the url
url <- "https://www.trainman.in/train/14553"

# get the data
page <- read_html(url)

# filter the required data using xpath
rows <- html_nodes(page, xpath = "/html/body/app-root/app-wrapper/div/main/train-schedule/div[2]/div[1]/div/div[3]/table/tbody/tr") %>%
  html_text()

# print
rows
Output:
If we had simply copied the XPath of the <table> tag, we would have got only one entry containing all the stations, as opposed to 25 entries, one per row.
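The difference between these XPaths can be tried offline on a small page. The following is a minimal sketch, assuming a hypothetical three-row timetable written inline (the station names and times are made up and are not taken from trainman.in). It compares the row-indexed XPath, the XPath without the index, and the XPath of the table itself.

```r
library(rvest)

# Hypothetical miniature timetable, standing in for the real page
html <- '<html><body><table><tbody>
  <tr><td>Station A</td><td>10:00</td></tr>
  <tr><td>Station B</td><td>10:45</td></tr>
  <tr><td>Station C</td><td>11:30</td></tr>
</tbody></table></body></html>'
page <- read_html(html)

# tr[1] matches only the first row
first_row <- html_nodes(page, xpath = "/html/body/table/tbody/tr[1]") %>%
  html_text()

# tr without an index matches every row
all_rows <- html_nodes(page, xpath = "/html/body/table/tbody/tr") %>%
  html_text()

# the table itself is a single element holding all the text
whole_table <- html_nodes(page, xpath = "/html/body/table") %>%
  html_text()

length(first_row)    # 1
length(all_rows)     # 3
length(whole_table)  # 1
```

Here all_rows has one entry per row, while whole_table collapses the entire timetable into a single string, which is harder to work with afterwards.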
This method is not limited to table tags. It works for any HTML element, though the XPath will differ according to the structure of the web page.
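As an illustration with elements other than a table, here is a short sketch that reads a heading and a list of links. The inline page, its link targets, and the variable names are hypothetical. html_attr() is the rvest function for reading an attribute such as href instead of the text.

```r
library(rvest)

# Hypothetical page with a heading and a list of links
html <- '<html><body>
  <h1>Timetable</h1>
  <ul>
    <li><a href="/station/a">Station A</a></li>
    <li><a href="/station/b">Station B</a></li>
  </ul>
</body></html>'
page <- read_html(html)

# text of the heading
title <- html_nodes(page, xpath = "/html/body/h1") %>% html_text()

# every link inside the list
links <- html_nodes(page, xpath = "/html/body/ul/li/a")
link_text <- html_text(links)          # visible text of each link
link_urls <- html_attr(links, "href")  # value of each href attribute

title
link_text
link_urls
```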
Web Scraping R Data From JSON
Many websites provide their data in JSON format, and we can use this data for analysis in R. JSON (JavaScript Object Notation) is a text-based format for representing structured data, based on JavaScript object syntax. In this article, we will see how to scrape data from a JSON web source and use it in the R programming language.
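To give a feel for the format, here is a minimal sketch that parses a small JSON string into an R data frame. It assumes the jsonlite package is installed (install.packages('jsonlite')). The JSON text below is a made-up example, not data from any real website.

```r
library(jsonlite)

# Hypothetical JSON payload: an array of objects with the same keys
json_text <- '[
  {"station": "Station A", "arrival": "10:00"},
  {"station": "Station B", "arrival": "10:45"}
]'

# fromJSON() turns an array of similar objects into a data frame
timetable <- fromJSON(json_text)

# inspect the structure of the result
str(timetable)
```

fromJSON() also accepts a URL in place of the string, so the same call can be pointed at a web source that returns JSON.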