Scraping a Table on an HTTPS Site Using R
This article covers the basics of web scraping in R using the rvest and tidyverse libraries. It shows how to extract tables from a website and how to work with the resulting data, and the examples should provide a good starting point for anyone looking to scrape tables from websites. The following are the key concepts related to scraping tables in R:
- Web scraping with R: R provides various libraries such as rvest and XML that can be used to extract data from websites.
- Reading HTML: R can read HTML pages, and these pages can be parsed to extract the data we are interested in.
- Selectors: To extract data from a website, we need to know the HTML structure of the page. Selectors in R allow us to select elements from the HTML page using CSS selectors or XPath.
- Parsing HTML: After selecting the elements of interest, the next step is to parse the HTML content and extract the data.
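To make the selector and parsing concepts concrete, here is a minimal sketch with rvest showing that the same elements can be selected with either a CSS selector or an XPath expression; the URL is illustrative, and any page containing a table element would work:

```r
library(rvest)

# Illustrative URL; any page containing a <table> element would work
page <- read_html("https://en.wikipedia.org/wiki/R_(programming_language)")

# CSS selector: select all <table> elements
tables_css <- html_nodes(page, "table")

# Equivalent XPath selector
tables_xpath <- html_nodes(page, xpath = "//table")

# Parse the first selected table into a data frame
df <- html_table(tables_css[[1]])
head(df)
```

Both selections return the same node set here; CSS selectors are usually shorter, while XPath is more expressive for conditions such as selecting by text content or position.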
Before we start scraping tables, the following prerequisites must be met: R must be installed on the system, and the rvest library must be installed in R (the second example below also uses tidyverse). If either package is missing, it can be installed by running the following commands in the R console:
install.packages("rvest")
install.packages("tidyverse")
Scraping a Table from a Static Website
In this example, we use the read_html function to read the HTML content of the website. Then we use the html_nodes function to select all tables on the page with a CSS selector. Finally, we convert the tables with the html_table function, keep the second one, and print its first six rows.
R
library(rvest)

# Read the HTML content of the website
webpage <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita")

# Select the tables using a CSS selector
table_node <- html_nodes(webpage, "table")

# Extract the content of the second table
table_content <- html_table(table_node)[[2]]

# Print the first six rows of the table
head(table_content)
Output:
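Once the table is in a data frame, it can be manipulated with tidyverse verbs. The column references below are hypothetical and rename columns by position for illustration; inspect names(table_content) first, since the actual headers on the Wikipedia page differ:

```r
library(dplyr)

# Hypothetical cleanup; the positional renames are illustrative,
# so check names(table_content) against the real page first
cleaned <- table_content %>%
  rename(country = 1, gdp_per_capita = 2) %>%  # rename columns by position
  filter(!is.na(gdp_per_capita))               # drop rows with missing values

head(cleaned)
```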
Scraping a Table from a Dynamic Website
This example scrapes a table from a dynamic website, i.e., one whose content is generated or updated using JavaScript. The rvest library reads the HTML code of the webpage as it is served, so this approach works only when the table is already present in that HTML; rvest does not execute JavaScript. The html_nodes function is used to select the first table on the page, and the html_table function is used to convert the HTML code into a data frame. Finally, the first few rows of the data frame are displayed using the head function.
R
library(rvest)
library(tidyverse)

# URL of the website
url <- "https://www.worldometers.info/world-population/population-by-country/"

# Read the HTML code of the page
html_code <- read_html(url)

# Use the html_nodes function to extract the first table
table_html <- html_code %>%
  html_nodes("table") %>%
  .[[1]]

# Use the html_table function to convert the table
# HTML code into a data frame
table_df <- table_html %>% html_table()

# Inspect the first few rows of the data frame
head(table_df)
Output:
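After scraping, it is often useful to save the table so the site does not have to be fetched again. A brief sketch using readr (part of the tidyverse), with an example file name:

```r
library(readr)

# Persist the scraped table for later analysis; the file name is an example
write_csv(table_df, "population_by_country.csv")

# Reload later without re-scraping the site
table_df2 <- read_csv("population_by_country.csv")
```

Caching results locally like this is also polite to the target site, since it avoids repeated requests while you iterate on your analysis.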