Web Scraping a Wikipedia Table into a DataFrame

How do you convert a Wikipedia table into a Python DataFrame?

We have the data we need to work with. Let's say I need the names of the Indian cities, their states, and their population. Now there are many ways you can extract this data, like copying and pasting the content into a new Excel sheet or using the Wikipedia API. But what if I tell you that this table can be converted directly into a Python DataFrame, making it easier for further analysis and processing? Interesting, isn't it?

The task of extracting data from websites is called web scraping. Alongside APIs, it is one of the most popular methods of collecting data from the internet. Some websites do not provide APIs for their data, so we fall back on scraping. Some of the best programming languages for scraping are Node.js, C, C++, PHP, and Python.

We use Python for this particular task. But why Python? Because its ecosystem of libraries, Requests for fetching pages, BeautifulSoup for parsing HTML, and pandas for data analysis, lets us go from a URL to a DataFrame in just a few lines of code.

Following are the steps to scrape a Wikipedia table and convert it into a Python DataFrame.

1. Request the HTML response using the URL: We send a GET request to the Wikipedia URL whose table needs to be scraped and store the HTML response in a variable. Not every website permits scraping, so we check the status code first; 200 means we can go ahead and download the page.
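Here is a minimal sketch of this step. The URL is an assumption, pointing at Wikipedia's list of Indian cities by population; swap in whichever page holds your table.

```python
import requests

# Assumed target page; replace with the Wikipedia page you want to scrape.
URL = "https://en.wikipedia.org/wiki/List_of_cities_in_India_by_population"

response = requests.get(URL)

# 200 means the request succeeded and we can work with the response body.
print(response.status_code)
```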

2. Inspect the page: To locate the data within the page, we place the cursor on the data, right-click, and choose Inspect. This opens the HTML content, where we can find the tags our data is stored in. A table is stored inside the <table> tag in HTML.

Using Inspect in Chrome

3. Parse the data from the HTML: Next we create a BeautifulSoup object and extract the relevant element with the find() method, which in our case is the <table> tag. There can be many tables on a single Wikipedia page, so to pin down the right one we also pass the "class" or "id" attribute of the <table> tag.
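Continuing the sketch above; the "wikitable" class is an assumption, so use whatever class or id the Inspect step revealed for your table.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

# Most Wikipedia data tables carry the "wikitable" class, but verify this
# against what Inspect shows for the table you actually want.
table = soup.find("table", {"class": "wikitable"})
```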

4. Convert the table into a DataFrame: With the table's HTML extracted, pandas can turn it into a DataFrame in a single call, as sketched below.
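A sketch of the conversion. pandas' read_html() parses every <table> in the HTML it receives and returns a list of DataFrames; since we pass a single table, ours is at index 0.

```python
from io import StringIO

import pandas as pd

# read_html() returns one DataFrame per <table> in the given HTML;
# we passed a single table, so the one we want is at index 0.
df = pd.read_html(StringIO(str(table)))[0]
print(df.head())
```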

5. Clean the data: We only need the city name, state, and population (2011) from this DataFrame, so we drop the other columns and rename the remaining ones for better readability.
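A sketch of the cleanup. The column names below are assumptions; check df.columns for the exact headers your table produced.

```python
# Keep only the columns we need (assumed names; adjust to match df.columns).
df = df[["City", "State or union territory", "Population (2011)"]]

# Rename the columns for readability.
df.columns = ["City", "State", "Population"]
print(df.head())
```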

Output: a clean DataFrame with just the City, State, and Population columns.

And that’s it!!

You have your Wikipedia table converted into a DataFrame, which can now be used for further data analysis and machine learning tasks. That's the beauty of using Python for web scraping: you can have your data in no time, using just a few lines of code.

Note: All the resources you will need to get started have been mentioned, with their links, in this article as well. I hope you make good use of them :)

I hope this article will get you interested in trying out new things like web scraping and help you add to your knowledge. Don’t forget to click on the “clap” icon below if you have enjoyed reading this article. Thank you for your time.
