Web Scraping in R: Beginner's Guide (with Examples)

  • Jan 11, 2024
  • 7 Minutes Read

The ability to extract useful information from websites is a skill that can improve your data analysis game in the ever-expanding world of data-driven decision-making. Data enthusiasts, researchers, and analysts now rely heavily on web scraping, the process of extracting data from websites. This beginner's guide shows how to do web scraping in R using RStudio, with some examples.

What is Web Scraping?

Through the process of web scraping, data is taken from websites and transformed from unstructured to structured content that can be utilized for analysis and other purposes. R has several packages that make web scraping easier; the most widely used ones are rvest and httr.

Using R for web scraping produces plenty of opportunities for collecting insightful information from the vast internet. R offers data enthusiasts a robust toolkit for navigating and extracting data from the web

Let's start with a basic example of using the rvest package to scrape information from a website. In this case, we'll extract the titles of articles from our website blog (favtutor.com/blogs).

But first, let us install and load the package in our R library.

install.packages(c("rvest", "xml2"))
library(rvest)

 

Now let's code the script to scrape the website and print out the article titles.

url <- "https://favtutor.com/blogs"

# Read the HTML content of the webpage
webpage <- read_html(url)

# Extract the titles of articles
titles <- webpage %>%
  html_nodes(".blog-title") %>%
  html_text()

# Print the extracted titles
print(titles)

 

In this example, the read_html function loads the webpage's HTML content. Next, html_nodes selects the elements matching a CSS selector, in this case the class "blog-title". The html_text function then extracts the text content of those elements. Finally, we print all the article titles found on the page.
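Titles alone are rarely the end goal; you usually also want the URL behind each one. Here is a minimal sketch using html_attr() to pull the href attribute. The markup below is a hypothetical inline fragment, parsed with rvest's minimal_html(), so the example runs without any network access:

```r
library(rvest)

# Hypothetical inline fragment standing in for the real blog page
page <- minimal_html('
  <a class="blog-title" href="/blogs/web-scraping-in-r">Web Scraping in R</a>
  <a class="blog-title" href="/blogs/getting-started">Getting Started</a>
')

# Extract the visible text and the link target of each matching element
titles <- page %>% html_nodes(".blog-title") %>% html_text()
links  <- page %>% html_nodes(".blog-title") %>% html_attr("href")

print(data.frame(title = titles, link = links))
```

The same two pipelines work unchanged on a page loaded with read_html(url); only the source of the HTML differs.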

Advanced Web Scraping in R

Now let us look at some advanced functionality for web scraping in R.

Handling Dynamic Content

Since JavaScript is frequently used on modern web pages to load dynamic content, scraping data with traditional methods is more difficult. To navigate and extract information from dynamic web pages, use the RSelenium package.

First, let us install and load the necessary package into our R environment:

install.packages("RSelenium")
library(RSelenium)

 

Now you need to start a Selenium server. The prerequisite for that is Java, so make sure you have it installed (older macOS versions shipped with Java, but recent ones may require a separate install). You can download the Selenium Server from the official Selenium website.

# Start a remote driver
driver <- rsDriver(browser = "chrome")
remote_driver <- driver[["client"]]

# Enter the URL of the website
url <- "https://www.example-dynamic-website.com"

# Navigate to the webpage
remote_driver$navigate(url)

# Extract data from dynamic content (modify as needed)
dynamic_data <- remote_driver$findElement(using = "css selector", value = "your-identified-selector")$getElementText()

# Print the extracted dynamic data
print(dynamic_data)

# Close the remote driver
remote_driver$close()

 

Here, RSelenium opens a web browser, navigates to a dynamic website, and extracts data from the rendered page. This technique works well for websites that use JavaScript to load content dynamically.
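A common pitfall with dynamic pages is querying the DOM before JavaScript has finished rendering. One way to cope is a small polling helper. This is a sketch: wait_for_elements is a hypothetical name, and it assumes the remote_driver object from the session above:

```r
# Poll the page until at least one element matches the CSS selector,
# raising an error if nothing appears within the timeout (in seconds)
wait_for_elements <- function(driver, selector, timeout = 10, poll = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    elems <- driver$findElements(using = "css selector", value = selector)
    if (length(elems) > 0) return(elems)
    if (Sys.time() > deadline) stop("Timed out waiting for: ", selector)
    Sys.sleep(poll)
  }
}

# Usage with the RSelenium session above (hypothetical selector):
# elems <- wait_for_elements(remote_driver, ".your-identified-selector")
```

Polling with a short sleep is cruder than Selenium's built-in waits, but it keeps the logic visible and easy to adapt.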

Dealing with Authentication

Some websites require user authentication to access specific pages. The httr package, together with rvest's session helpers built on top of it, can handle authentication during web scraping.

Installing and loading the httr package into our R environment:

install.packages("httr")
library(httr)

 

An R script to log in to an authenticated website and extract information via web scraping:

username <- "your_username"
password <- "your_password"

# Start a session on the login page (rvest's session helpers, built on httr)
session <- session("https://www.example-authenticated-website.com")

# Fill in and submit the first form on the page
form <- html_form(session)[[1]]
filled_form <- html_form_set(form, username = username, password = password)
session <- session_submit(session, filled_form)

# Follow a link to the target page within the authenticated session
webpage <- session_jump_to(session, "https://www.example-authenticated-website.com/target-page")

data <- webpage %>%
  html_nodes("your-identified-selector") %>%
  html_text()

print(data)

 

Authentication handling is essential for scraping content that websites place behind a login. With rvest's session functions and httr underneath, you can log in and move around the authenticated pages.
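Form logins are not the only scheme you will meet: for sites protected by HTTP Basic authentication, httr on its own is enough. A minimal sketch against httpbin.org, a public testing service (the user/passwd credentials are part of that test endpoint, not real secrets):

```r
library(httr)

# httpbin's basic-auth endpoint accepts exactly these test credentials
resp <- GET("https://httpbin.org/basic-auth/user/passwd",
            authenticate("user", "passwd"))

status_code(resp)  # 200 on a successful login
```

authenticate() attaches the credentials to this one request; for repeated calls to the same host you would pass it to each GET().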

Respecting Website Policies

It's important to abide by a website's policies and terms of service when web scraping. IP blocking or legal consequences may result from excessive scraping or from breaking the terms of a website. To prevent overburdening a server, a delay between requests can be implemented with the aid of R's polite package.

We need to start by installing the polite package and loading it into our R environment.

install.packages("polite")
library(polite)

 

Now let us set up a polite session with a delay of 2 seconds between requests. We can adjust the delay to suit our preference and the website's policy.

# Introduce the scraper to the site; bow() reads robots.txt and sets the delay
session <- bow("https://favtutor.com/blogs", delay = 2)

# scrape() fetches the page while honoring the declared delay
webpage <- scrape(session)

 

By implementing a scraping delay, you show respect for the website's resources and reduce the chance that your IP address is flagged for suspicious activity.
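If you would rather not add a dependency, the same courtesy can be hand-rolled with Sys.sleep() between requests. A sketch, assuming a hypothetical list of page URLs (the real read_html() call is commented out so the sketch runs offline):

```r
urls <- paste0("https://favtutor.com/blogs?page=", 1:3)

# Hypothetical helper: wait before each request, then fetch the page
fetch_politely <- function(url, delay = 2) {
  Sys.sleep(delay)        # pause before each request
  # read_html(url)        # uncomment to perform the real request
  url                     # placeholder return value for the sketch
}

pages <- lapply(urls, fetch_politely, delay = 0.1)
length(pages)
```

The polite package remains the better choice when you also want robots.txt handling and response caching for free.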

Conclusion

Using the right tools and an organized approach, web scraping in R can help you make use of the vast amount of data available on the internet. But don't forget to follow ethical guidelines, honor website policies, and be aware of any applicable laws.


About The Author
Abhisek Ganguly
Passionate machine learning enthusiast with a deep love for computer science, dedicated to pushing the boundaries of AI through academic research and sharing knowledge through teaching.