Web Scraping with Python and Beautiful Soup

Hello and welcome to this tutorial. My name is Henry, and I will be talking you through the various aspects of web scraping using Python 3.7 and Beautiful Soup 4.

There are more than 150 million active websites today, and finding relevant, comprehensive and apt information plays a pivotal role in business as well as in search engine optimization. Every website uses HTML to display information to its visitors. Sometimes we find ourselves in a situation where we need to consume data from a given website, but the website does not offer an API; this is where web scraping can help.

We would describe web scraping as a technique of data mining: collecting data from web pages and storing that data in a database or spreadsheet for analysis. Web scraping is used in many fields.

Before you do any scraping you need an understanding of how web data is structured and how things are laid out, because scraping deals mostly with HTML and CSS. NB: just make sure that when you do scraping, it is not illegal. Most websites prohibit any form of data mining, and since scraping is a form of data mining, scraping them becomes illegal. Make sure you have gone through the terms of service and the privacy policy of the website in question.

This blog has a couple of posts on it. We will scrape the blog and, for each post, get the title, the link to the post, as well as the date, and put them in a CSV file.

I will be using Python 3.7, and PyCharm as the IDE in this tutorial. We are going to create a file named blog_scraping.py. Since we are going to make an HTTP request to our blog, we need to use a library: we are going to use Python Requests, which is an elegant and simple HTTP library. To install Requests, run the following command: pip install requests. We will also be using Beautiful Soup, which is a Python library for pulling data out of HTML. To install Beautiful Soup, use the following command: pip install beautifulsoup4. In my case, Beautiful Soup is already installed. I am going to open the blog_scraping.py file in my IDE.

Understanding the HTML Structure of Our Blog

Before we can write any scraping code we need to understand the HTML structure of the blog, and the quickest way to do that is to inspect elements on the blog. NB: you need knowledge of the HTML and CSS of the web page you are scraping. In order to inspect the site, I will right-click on a post and click Inspect. We can tell that each post is wrapped in a div, each title is in a class called blog-entry-title entry-title, and each date is in a class called blog-entry-date clr. Now that we have an understanding of our HTML structure, let's scrape the site.

Make sure that blog_scraping.py has the following code:

    import requests                  # enables us to make an HTTP request to our blog
    from bs4 import BeautifulSoup    # enables us to pull data out of our HTML
    import csv                       # enables us to create a comma-separated values file for spreadsheet and database import/export

    response = requests.get(...)     # we make an HTTP GET request to our blog
    soup = BeautifulSoup(response.text, 'html.parser')    # we initialize Beautiful Soup and pass in our response
    posts = soup.find_all(class_='blog-entry-content')    # all our blog posts are in a div called blog-entry-content

    with open('articles.csv', 'w') as csv_file:
        writer = csv.writer(csv_file)
        for post in posts:
            title = post.find(class_='blog-entry-title entry-title').get_text().replace('\n', '')
            date = post.select_one('.blog-entry-date.clr').get_text()
            writer.writerow([title, date])

Walking through the code: we import the requests module, which enables us to make an HTTP request to our blog; the BeautifulSoup module, which enables us to pull data out of our HTML; and the csv module, which enables us to create a comma-separated values file for spreadsheet and database import/export. We then create a variable called response, where we make an HTTP GET request to our blog, and we initialize Beautiful Soup by creating a variable called soup, passing in the response text.