C# Programming 16 - What Is Web Scraping and Why Do We Use It?

Hello All,

In this article we will learn about web scraping and how to implement it in C#.


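What is Web Scraping?

Web scraping is the process of extracting data from websites programmatically: a program downloads a page's HTML and pulls out the pieces it needs, such as headings, links, or prices.

Let's create a simple console application that retrieves the list of article titles from a given website.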

Step 1: Create a New Project: Open Visual Studio and create a new C# console application project.



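Step 2: Install Libraries: Install the required library. HtmlAgilityPack, a widely used HTML parser for C#, comes from NuGet (use the NuGet Package Manager in Visual Studio, or run dotnet add package HtmlAgilityPack from the command line). HttpClient does not need to be installed; it is part of .NET, in the System.Net.Http namespace.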



Step 3: Send a Request: Send a request to the website that you want to scrape. This can be done using the HttpClient class.

  
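   string url = "https://lastbenchcoder.blogspot.com/";
   HttpClient httpClient = new HttpClient();
   // .Result blocks until the request completes, which is acceptable in a small console app.
   HttpResponseMessage response = httpClient.GetAsync(url).Result;
   // Throws an HttpRequestException if the status code is not a success (2xx) code.
   response.EnsureSuccessStatusCode();
   string html = response.Content.ReadAsStringAsync().Result;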
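The snippet above blocks on .Result, which keeps the example short. In a larger application the same request is usually written with async/await so the thread is not blocked while waiting for the network. Here is a minimal sketch of that variant; the PageFetcher class, the FetchHtmlAsync method, and the 30-second timeout are assumptions made for this illustration, not part of the original program. It needs C# 7.1 or later for the async Main method.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

static class PageFetcher
{
    // One shared HttpClient for the whole application. Creating a new client
    // for every request can exhaust sockets when many requests are made.
    private static readonly HttpClient Client = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(30) // assumed value, tune as needed
    };

    // Hypothetical helper: downloads the page at the given URL and returns its HTML.
    public static async Task<string> FetchHtmlAsync(string url)
    {
        using (HttpResponseMessage response = await Client.GetAsync(url))
        {
            // Throws an HttpRequestException for non-success status codes.
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}

class Program
{
    static async Task Main()
    {
        string html = await PageFetcher.FetchHtmlAsync("https://lastbenchcoder.blogspot.com/");
        Console.WriteLine($"Downloaded {html.Length} characters.");
    }
}
```

The returned string can be passed to HtmlAgilityPack exactly as in the next step.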
  
  

Step 4: Parse the HTML: Once the request has been sent, the website's HTML is returned as a string. The next step is to parse that HTML using the HtmlAgilityPack library.

    
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);
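To see what the parsed document gives you, it can help to try the parser on a small hard-coded string before pointing it at a live page. The sketch below does that; the markup is made up for this example and is not the real blog's HTML.

```csharp
using System;
using HtmlAgilityPack;

class ParseDemo
{
    static void Main()
    {
        // Hypothetical markup, invented for this example; a real page will differ.
        string html = @"<html><body>
            <h2 class='entry-title'><a href='/post-1'>First post</a></h2>
            <h2 class='entry-title'><a href='/post-2'>Second post</a></h2>
        </body></html>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // DocumentNode is the root of the parsed tree; XPath queries start from it.
        HtmlNode firstLink = doc.DocumentNode.SelectSingleNode("//h2/a");

        Console.WriteLine(firstLink.InnerText);                     // text inside the first link
        Console.WriteLine(firstLink.GetAttributeValue("href", "")); // its href attribute
    }
}
```

SelectSingleNode returns only the first match, while SelectNodes (used in the next step) returns all of them.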
    
    
Step 5: Extract the Data: The final step is to find the data and extract it to the desired file format (CSV, Excel, PDF, etc.). In this example we will display the data in the console.

      
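   // SelectNodes returns null (not an empty collection) when nothing matches.
   var titleNodes = doc.DocumentNode.SelectNodes("//h2[@class='entry-title']");

   if (titleNodes != null)
   {
       foreach (var node in titleNodes)
       {
           Console.WriteLine(node.InnerText.Trim());
       }
   }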
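Step 5 mentions exporting to a file such as CSV, while the example only prints to the console. A minimal sketch of the CSV case, using only the standard library, could look like this. The TitleExporter class, its method names, the titles.csv file name, and the sample titles are all assumptions made for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class TitleExporter
{
    // Quote a value for CSV: wrap it in double quotes and double any embedded quotes.
    public static string EscapeCsv(string value)
    {
        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }

    // Hypothetical helper: writes one title per row under a "Title" header.
    public static void WriteCsv(string path, IEnumerable<string> titles)
    {
        var lines = new List<string> { "Title" };
        lines.AddRange(titles.Select(t => EscapeCsv(t)));
        File.WriteAllLines(path, lines);
    }
}

class Program
{
    static void Main()
    {
        // Sample data standing in for the scraped node.InnerText values.
        var titles = new[] { "C# Programming 16", "Say \"hello\", world" };

        TitleExporter.WriteCsv("titles.csv", titles);
        Console.WriteLine(File.ReadAllText("titles.csv"));
    }
}
```

Quoting every value keeps titles that contain commas or quotes from breaking the column layout when the file is opened in a spreadsheet.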
      
      

Here is the complete source code:

                
using System;
using System.Net.Http;
using HtmlAgilityPack;

class Program
{
    static void Main(string[] args)
    {
        string url = "https://lastbenchcoder.blogspot.com/";

        // Download the page. .Result blocks until the call completes.
        HttpClient httpClient = new HttpClient();
        HttpResponseMessage response = httpClient.GetAsync(url).Result;
        response.EnsureSuccessStatusCode();
        string html = response.Content.ReadAsStringAsync().Result;

        // Parse the HTML into a document tree.
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var titleNodes = doc.DocumentNode.SelectNodes("//h2[@class='entry-title']");
        if (titleNodes == null)
        {
            Console.WriteLine("No article titles found.");
            return;
        }

        foreach (var node in titleNodes)
        {
            Console.WriteLine(node.InnerText.Trim());
        }
    }
}
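One thing to watch for with the XPath used above: @class='entry-title' matches only when the class attribute is exactly that string. If a heading carries more than one class, the query finds nothing. Also, InnerText returns HTML entities such as &amp; undecoded. The sketch below shows one possible way to handle both cases; the TitleExtractor class and the sample markup are assumptions made for this illustration, and the real page may be structured differently.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class TitleExtractor
{
    // Matches h2 elements whose class list contains "entry-title" as a whole
    // word, even when other classes are present on the same element.
    private const string TitleXPath =
        "//h2[contains(concat(' ', normalize-space(@class), ' '), ' entry-title ')]";

    // Hypothetical helper: returns the cleaned-up title text of every match.
    public static List<string> ExtractTitles(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var titles = new List<string>();
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(TitleXPath);
        if (nodes == null)
        {
            return titles; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode node in nodes)
        {
            // Decode entities such as &amp; and trim surrounding whitespace.
            titles.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }
        return titles;
    }
}

class Program
{
    static void Main()
    {
        // Hypothetical markup, invented for this example.
        string html =
            "<h2 class='post-title entry-title'> Tips &amp; Tricks </h2>" +
            "<h2 class='sidebar-title'>Archive</h2>";

        foreach (string title in TitleExtractor.ExtractTitles(html))
        {
            Console.WriteLine(title);
        }
    }
}
```

With the sample markup, only the first heading is selected, because the second one does not have the entry-title class.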
        
      
Finally, running the program prints the title of each article found on the page to the console.


That's it for this article. See you in the next one.

Take care. Bye.
