Screen Scraping

September 21, 2017 9:56 am
by Aruthra

Screen scraping is a technique for reading values from an external web page and driving the page's functionality from code. Put simply, it means parsing the page's HTML and performing actions such as button clicks and link clicks programmatically.

Screen scraping is used in many scenarios. For example, most railway mobile applications work by scraping details from the railway websites, and a lot of websites show data taken from other websites using this technique. Some users misuse it as a kind of hack, so websites often use a CAPTCHA to prevent scraping. Using scraping, we can also set the user name, password, and other values on an external website and trigger the login button or other functionality.

There are several methods for screen scraping. Here I am going to explain simple scraping with the WebBrowser control and HTMLAgilityPack.

Scraping Using the WebBrowser Control.

Let us start with the WebBrowser control.

Create a Windows Forms or WPF application and add the WebBrowser control to it. The code in this article uses the Windows Forms WebBrowser control (System.Windows.Forms).

In the form's Load event, set the URL of the external website to be scraped.

I have named my WebBrowser control “webBrowserMain”.

We can set the URL using the code below.

webBrowserMain.Url = new Uri("https://www.yourwebpage.com");

This loads the external website in the WebBrowser control of our application.

We need to handle the WebBrowser control's DocumentCompleted event to know when the web page has loaded completely. The event handler looks like this.

private void webBrowserMain_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
{
    // Runs when a document has finished loading; the scraping code goes here.
}
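
The handler above still has to be attached to the control, and DocumentCompleted can fire more than once per page because each frame raises it as well. The following is a minimal sketch of one way to wire this up, not the only way. The class name ScraperForm and the helper IsTopLevelDocument are hypothetical names introduced for illustration; the URL is the placeholder used earlier.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical form used only for this sketch.
public class ScraperForm : Form
{
    private readonly WebBrowser webBrowserMain = new WebBrowser { Dock = DockStyle.Fill };

    public ScraperForm()
    {
        Controls.Add(webBrowserMain);

        // Subscribe before navigating so the first completion is not missed.
        webBrowserMain.DocumentCompleted += webBrowserMain_DocumentCompleted;

        // Start navigation once the form has loaded.
        Load += (s, args) => webBrowserMain.Url = new Uri("https://www.yourwebpage.com");
    }

    private void webBrowserMain_DocumentCompleted(object sender, WebBrowserDocumentCompletedEventArgs e)
    {
        // Frames raise this event too; react only to the top-level document.
        if (!IsTopLevelDocument(e.Url, webBrowserMain.Url))
        {
            return;
        }

        Text = "Loaded: " + e.Url;
    }

    // Hypothetical helper: true when the completed URL is the one the control is showing.
    public static bool IsTopLevelDocument(Uri completed, Uri current)
    {
        return completed != null && current != null && completed.Equals(current);
    }
}
```

To try it, start the form from a Main method marked [STAThread] with Application.Run(new ScraperForm()). The URL comparison is one common way to skip frame-level completions; pages that redirect or load content with scripts may need a different check.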

I have created a sample website that contains username and password text boxes and a login button for logging in to the website.

Let us scrape this website and log in to it using our WebBrowser control.

My login screen looks like this:

First, we can inspect the HTML of this web page using the browser's developer tools (F12 key).

We can see the username and password text boxes and the login button.

I am going to set my website URL on the webBrowserMain control.

webBrowserMain.Url = new Uri("http://localhost:51440/Login.aspx");

To log in to the application, we first need to access the username and password text boxes and set their values. Then we need to trigger the login button.

First, let us see how we can access the username and password text boxes.

HtmlElement username = webBrowserMain.Document.GetElementById("txtUserName");
HtmlElement password = webBrowserMain.Document.GetElementById("txtPassword");

This gives our application references to the username and password controls of that website.

To set values in the username and password boxes, we can use the code below.

username.SetAttribute("value", "myusername");
password.SetAttribute("value", "mypassword");

We can put all of this code in the DocumentCompleted event handler.

Now let us run the application.

We can see the assigned username and password set in the web page.

Now we need to access the login button. The code snippet below gets the HTML button from my web page.

HtmlElement loginButton = webBrowserMain.Document.GetElementById("btnLogin");

This gets the login button control, and using this HtmlElement we can invoke its click event with the code below.

loginButton.InvokeMember("click");

Yes! The application scraped the HTML and invoked the login button. With the code above, we can see the application being redirected to the Home page. From the Home page, we can access buttons, text boxes, labels, and other values in the same way.
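
The login steps above can be collected into one helper that is called from the DocumentCompleted handler. This is a minimal sketch under a few assumptions: the class name LoginScraper and the method TryLogin are hypothetical, and the element IDs are the ones from my sample page. GetElementById returns null when an element is not found, which is what happens when the handler fires again on the Home page after the redirect, so the sketch checks for null before touching the elements.

```csharp
using System.Windows.Forms;

// Hypothetical helper class used only for this sketch.
public static class LoginScraper
{
    // Fills in the login form and clicks the button.
    // Returns false when the document is not the login page.
    public static bool TryLogin(HtmlDocument document, string user, string pass)
    {
        if (document == null)
        {
            return false;
        }

        HtmlElement username = document.GetElementById("txtUserName");
        HtmlElement password = document.GetElementById("txtPassword");
        HtmlElement loginButton = document.GetElementById("btnLogin");

        // GetElementById returns null when the ID is not on the page.
        if (username == null || password == null || loginButton == null)
        {
            return false;
        }

        username.SetAttribute("value", user);
        password.SetAttribute("value", pass);
        loginButton.InvokeMember("click");
        return true;
    }
}
```

Inside the handler, the call would be LoginScraper.TryLogin(webBrowserMain.Document, "myusername", "mypassword"). Because the method returns false on pages without the login form, it does nothing on the Home page instead of throwing.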

Some websites are designed in such a way that the controls do not have specific names or IDs. In those cases we can access the HTML elements in other ways, for example by tag name and position:

HtmlElement anchr = webBrowserMain.Document.GetElementsByTagName("a")[1];

Some controls may also be placed inside frames. In those cases we can try the code below to access the control.

HtmlElement userName = webBrowserMain.Document.Window.Frames[1].Document.GetElementById("useridField");
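
Fixed indexes such as [1] break as soon as the page layout changes, so it can help to search by visible text and to check the frame index first. Below is a minimal sketch of that idea. ElementFinder, FindLinkByText, and GetFrameDocument are hypothetical names introduced here; the calls they make (GetElementsByTagName, InnerText, Window.Frames) belong to the Windows Forms WebBrowser API.

```csharp
using System;
using System.Windows.Forms;

// Hypothetical helper class used only for this sketch.
public static class ElementFinder
{
    // Returns the first <a> element whose visible text matches, or null.
    public static HtmlElement FindLinkByText(HtmlDocument document, string text)
    {
        if (document == null)
        {
            return null;
        }

        foreach (HtmlElement link in document.GetElementsByTagName("a"))
        {
            string inner = link.InnerText;
            if (inner != null && inner.Trim().Equals(text, StringComparison.OrdinalIgnoreCase))
            {
                return link;
            }
        }

        return null;
    }

    // Returns the document of the frame at the given index, or null if there is no such frame.
    public static HtmlDocument GetFrameDocument(HtmlDocument document, int index)
    {
        if (document == null || document.Window == null)
        {
            return null;
        }

        HtmlWindowCollection frames = document.Window.Frames;
        if (frames == null || index < 0 || index >= frames.Count)
        {
            return null;
        }

        return frames[index].Document;
    }
}
```

For example, ElementFinder.FindLinkByText(webBrowserMain.Document, "Logout") looks for a link labelled "Logout" (a made-up label). Frames loaded from a different domain may refuse access to their document, so a real application would probably wrap the frame lookup in a try/catch.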

Scraping Using HTMLAgilityPack.

Now we are on the Home page, and I am going to scrape the data on it using HTMLAgilityPack.

HTMLAgilityPack (HAP) is a library for parsing HTML and pulling data out of it. Let us see how to use it for scraping.

It allows convenient parsing of HTML pages, even those with malformed markup (for example, missing closing tags). HAP goes through the page content and builds a document object model that can later be processed with LINQ to Objects or XPath.
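
To make the two query styles concrete, here is a small sketch that parses a deliberately unclosed HTML fragment and reads it once with LINQ to Objects and once with XPath. The class HapDemo, its method names, and the sample markup are made up for illustration; LoadHtml, Descendants, SelectNodes, and GetAttributeValue are HtmlAgilityPack members.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical demo class used only for this sketch.
public static class HapDemo
{
    // Made-up markup: the <a>, <div>, <body> and <html> tags are never closed.
    public const string SampleHtml =
        "<html><body><div class='content-wrapper'>" +
        "<span id='name'>Alice</span><a href='/profile'>Profile";

    // LINQ to Objects: walk the node tree and project the text of every <span>.
    public static List<string> SpanTextsWithLinq(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode
                  .Descendants("span")
                  .Select(n => n.InnerText.Trim())
                  .ToList();
    }

    // XPath: select every <a> that has an href attribute.
    public static List<string> LinkTargetsWithXPath(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links == null)
        {
            return new List<string>();
        }

        return links.Select(a => a.GetAttributeValue("href", "")).ToList();
    }
}
```

Calling HapDemo.SpanTextsWithLinq(HapDemo.SampleHtml) and HapDemo.LinkTargetsWithXPath(HapDemo.SampleHtml) returns the span text and the link target even though the fragment is not well formed. Which style to use is mostly a matter of taste: XPath is compact for structural queries, while LINQ is convenient when filtering with ordinary C# conditions.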

Let us try HTMLAgilityPack.

First, add HTMLAgilityPack to your project, either as a NuGet package or by referencing the HTMLAgilityPack DLL directly.

Using the developer tools, we can again get the HTML of the Home page from the browser. It looks like this:

We can see that the details are shown in the “content-wrapper” div. We need to scrape this div to get the details of the user.

To do that, we first need to create an HtmlAgilityPack document and load the page into it using the code below.

HtmlAgilityPack.HtmlDocument document2 = new HtmlAgilityPack.HtmlDocument();
document2.Load(webBrowserMain.DocumentStream);

Now we can access the child elements, or nodes, of this document. The details are shown in the content-wrapper div, so we select that div with the following code.

HtmlAgilityPack.HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//div[@class='content-wrapper']").ToArray();

We now have all the matching nodes and can add them to a list for further processing.

List<string> scrappedContents = new List<string>();
foreach (HtmlAgilityPack.HtmlNode item in nodes)
{
    scrappedContents.Add(item.InnerHtml);
}

The complete code for scraping the contents using HTMLAgilityPack is as follows.

HtmlAgilityPack.HtmlDocument document2 = new HtmlAgilityPack.HtmlDocument();
document2.Load(webBrowserMain.DocumentStream);
HtmlAgilityPack.HtmlNode[] nodes = document2.DocumentNode.SelectNodes("//div[@class='content-wrapper']").ToArray();
List<string> scrappedContents = new List<string>();
foreach (HtmlAgilityPack.HtmlNode item in nodes)
{
    scrappedContents.Add(item.InnerHtml);
}
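
One thing to watch in the code above: SelectNodes returns null when no element matches, so calling ToArray() on the result fails on any page that lacks the div. The exact match @class='content-wrapper' also misses a div that carries additional classes. The sketch below is a slightly more defensive variant. ContentExtractor and ExtractByClass are hypothetical names, and it parses a string rather than the WebBrowser stream so that it can be tried on its own.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper class used only for this sketch.
public static class ContentExtractor
{
    // Returns the inner HTML of every <div> whose class list contains cssClass.
    public static List<string> ExtractByClass(string html, string cssClass)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Match the class as a whole word inside the space-separated class attribute.
        string xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' " + cssClass + " ')]";
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);

        var results = new List<string>();
        if (nodes == null)
        {
            // Nothing matched: return an empty list instead of throwing.
            return results;
        }

        foreach (HtmlNode node in nodes)
        {
            results.Add(node.InnerHtml);
        }

        return results;
    }
}
```

With the WebBrowser control, the same idea applies after document2.Load(webBrowserMain.DocumentStream): check the SelectNodes result for null before iterating over it.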

Run the project and check the output.