How to do web scraping using Python?

In this tutorial, I am going to do web scraping using Python. Before jumping into web scraping, let's understand what it is used for. Web scraping is extracting data from a website using a script. Why should we do this? Consider a situation where you want the full product list from Flipkart or Amazon: manually typing every product name into an Excel sheet is impossible, so instead we write a script that downloads all the product names and stores them in CSV, Excel or JSON format.


Now let's look at the packages needed to write this script. In Python, we use BeautifulSoup, urllib (imported below under the name urllib2) and pandas. BeautifulSoup and pandas can be installed with pip (pip install beautifulsoup4 pandas); urllib is part of the standard library.
For this example I take the Moneycontrol website (www.moneycontrol.com). It is a well-known website for mutual funds, the stock market, shares and so on. From this website, I am going to scrape the RELIANCE TOP 200 FUND - RETAIL PLAN (G) data. Let's jump into the code!!
The link of the page to scrape is given below:
Link: http://www.moneycontrol.com/mutual-funds/reliance-top-200-fund-retail-plan/portfolio-holdings/MRC155

Python Code

# Import the basic packages: pandas, urllib and BeautifulSoup

import pandas as pd
import urllib.request as urllib2
from bs4 import BeautifulSoup



# Analysing the website, we are going to scrape the table called Portfolio Holdings.
# It consists of Equity, Sector, Qty, Value and Percentage columns, so make a list
# for each column and append the row values to it.
Equity=[]

Sector=[]

Qty=[]

Value=[]

Percentage=[]



website="http://www.moneycontrol.com/mutual-funds/reliance-top-200-fund-retail-plan/portfolio-holdings/MRC155"


# urllib is used to fetch the whole page from the website
page = urllib2.urlopen(website)
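# Note: some websites block the default urllib user agent. If urlopen raises an
# HTTP error, a Request with a browser-like User-Agent header (the header value
# below is just an example) can be passed instead:
# req = urllib2.Request(website, headers={'User-Agent': 'Mozilla/5.0'})
# page = urllib2.urlopen(req)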

soup = BeautifulSoup(page, 'html.parser')

all_tables=soup.find_all('table')


# The table we want has the class name tblporhd; you can find this by inspecting
# the element on the website. Fetch every row of that table and append the cell
# values to the corresponding lists.
right_table=soup.find('table', class_='tblporhd')

for row in right_table.findAll("tr"):

     cells = row.findAll('td')

     if len(cells)==5: # Only extract table body rows; the header row uses <th> cells and is skipped

             Equity.append(cells[0].find(text=True))

             Sector.append(cells[1].find(text=True))

             Qty.append(cells[2].find(text=True))

             Value.append(cells[3].find(text=True))

             Percentage.append(cells[4].find(text=True))


# Now the pandas part takes over: we merge all the lists into a DataFrame so it can be used for further analysis
df=pd.DataFrame(Equity,columns=['Equity'])

df['Sector']=Sector

df['Qty']=Qty

df['Value']=Value

df['Percentage']=Percentage

print(df)


# Now convert the DataFrame to a CSV file
df.to_csv('funds.csv') 
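
As mentioned at the start, the same data can also be stored in JSON or Excel format. A minimal sketch of the alternative exports is given below; the file names funds.json and funds.xlsx are just examples, and the Excel export assumes an engine such as openpyxl is installed.

# Convert the DataFrame to JSON (one record per table row)
df.to_json('funds.json', orient='records')

# Convert the DataFrame to an Excel sheet (requires openpyxl or a similar engine)
df.to_excel('funds.xlsx', index=False)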


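As an aside, pandas can also read HTML tables directly, which is a handy shortcut when a table is well formed. A minimal sketch is shown below; it assumes the same tblporhd table class and that a parser such as lxml or html5lib is installed for pandas to use.

import pandas as pd

# read_html returns a list of DataFrames, one per matching table on the page
url = 'http://www.moneycontrol.com/mutual-funds/reliance-top-200-fund-retail-plan/portfolio-holdings/MRC155'
tables = pd.read_html(url, attrs={'class': 'tblporhd'})
df = tables[0]
print(df)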

I hope you now understand this simple web scraping example. I have uploaded the original and updated source code to GitHub for further reference.
The link is given below: https://github.com/12345k/Web-scraping-Financial-Data-Set
I will keep uploading more web scraping examples to this account, so kindly keep in touch.

Happy coding!!!!
