Learn Web Scraping in 30 Minutes

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' scroll_down='' id='' color='main_color' custom_bg='' src='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' video_mobile_disabled='' overlay_enable='' overlay_opacity='0.5' overlay_color='' overlay_pattern='' overlay_custom_pattern='']
[av_heading heading='Learn Web Scraping in 30 Minutes' tag='h1' style='blockquote modern-quote modern-centered' size='' subheading_active='subheading_below' subheading_size='15' padding='10' color='' custom_font='']
with Python and Beautiful Soup
[/av_heading]
[/av_section]

[av_one_half first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h2' padding='10' heading='Overview' color='' style='blockquote modern-quote modern-centered' custom_font='' size='' subheading_active='' subheading_size='15' custom_class='' admin_preview_bg=''][/av_heading]

[av_textblock size='' font_color='' color='']
Web scraping is a powerful skill for any data professional. With web scraping, the entire internet becomes your database. In this tutorial, we show you how to parse a web page into a data file (CSV) using a Python package called Beautiful Soup.

Many services augment their business data, or even build their entire business, using web scraping. For example, there is a Steam sales website that tracks and ranks Steam sales, updated hourly. Companies can also scrape product reviews from sites like Amazon to stay up to date with what customers are saying about their products.
[/av_textblock]
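
[av_textblock size='' font_color='' color='']
To make the idea concrete before touching a live site, here is a minimal sketch that parses a small HTML string with Beautiful Soup and collects one record per product. The HTML snippet, the class names `item-title` and `price-ship`, and the product names are made up for illustration; a real page will have its own structure.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded product listing.
html = """
<div class="item-container">
  <a class="item-title" href="/p/1">Example Card 8GB</a>
  <li class="price-ship">$4.99 Shipping</li>
</div>
<div class="item-container">
  <a class="item-title" href="/p/2">Another Card 6GB</a>
  <li class="price-ship">Free Shipping</li>
</div>
"""

# Parse the string with Python's built-in HTML parser.
page_soup = BeautifulSoup(html, "html.parser")

# Collect one (name, shipping) tuple per product container.
rows = []
for container in page_soup.find_all("div", class_="item-container"):
    name = container.find("a", class_="item-title").get_text(strip=True)
    shipping = container.find("li", class_="price-ship").get_text(strip=True)
    rows.append((name, shipping))

print(rows)
```

Working against an inline string like this is a convenient way to experiment with selectors without sending repeated requests to a website.
[/av_textblock]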

[/av_one_half][av_one_half min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h2' padding='10' heading='30-Minute Video Tutorial' color='' style='blockquote modern-quote modern-centered' custom_font='' size='' subheading_active='' subheading_size='15' custom_class='' admin_preview_bg='']
with Python and Beautiful Soup
[/av_heading]

[av_codeblock wrapper_element='' wrapper_element_attributes='']

[/av_codeblock]


[/av_one_half][/av_section][av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' scroll_down='' id='' color='main_color' custom_bg='' src='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' video_mobile_disabled='' overlay_enable='' overlay_opacity='0.5' overlay_color='' overlay_pattern='' overlay_custom_pattern='']
[av_one_full first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_heading tag='h2' padding='10' heading='The Code' color='' style='blockquote modern-quote modern-centered' custom_font='' size='' subheading_active='' subheading_size='15' custom_class='']
with Python and Beautiful Soup
[/av_heading]

[av_codeblock wrapper_element='' wrapper_element_attributes='']

from bs4 import BeautifulSoup as soup  # HTML data structure
from urllib.request import urlopen as uReq  # Web client

# URL to scrape.
# In this example we scrape graphics cards from Newegg.com.
page_url = "http://www.newegg.com/Product/ProductList.aspx?Submit=ENE&N=-1&IsNodeId=1&Description=GTX&bop=And&Page=1&PageSize=36&order=BESTMATCH"

# Opens the connection and downloads the HTML page from the URL.
uClient = uReq(page_url)

# Parses the HTML into a soup data structure so it can be traversed
# much like a nested JSON object.
page_soup = soup(uClient.read(), "html.parser")
uClient.close()

# Finds each product on the store page.
containers = page_soup.find_all("div", {"class": "item-container"})

# Name of the output file to write to local disk.
out_filename = "graphics_cards.csv"

# Header row of the CSV file.
headers = "brand,product_name,shipping\n"

# Opens the file and writes the header row. The with block closes
# the file automatically, even if a row fails to parse.
with open(out_filename, "w") as f:
    f.write(headers)

    # Loops over each product and grabs its attributes.
    for container in containers:
        # Finds all link tags ("a") within the first div.
        make_rating_sp = container.div.select("a")

        # Grabs the brand from the image's title attribute,
        # then title-cases it with .title().
        brand = make_rating_sp[0].img["title"].title()

        # Grabs the text of the third "a" tag (index 2) in that list.
        product_name = make_rating_sp[2].text

        # Grabs the shipping information from the first list item
        # with the class "price-ship", strips surrounding white space,
        # then removes "$" and " Shipping" so only the amount remains.
        shipping = container.find_all("li", {"class": "price-ship"})[0].text.strip().replace("$", "").replace(" Shipping", "")

        # Prints the record to the console.
        print("brand: " + brand + "\n")
        print("product_name: " + product_name + "\n")
        print("shipping: " + shipping + "\n")

        # Writes the record to the file. Commas in the product name are
        # replaced with "|" so they do not split the column, and fields
        # are joined with "," (no space) to match the header row.
        f.write(brand + "," + product_name.replace(",", "|") + "," + shipping + "\n")

[/av_codeblock]
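
[av_textblock size='' font_color='' color='']
The script above protects the CSV layout by replacing commas in product names with "|". An alternative, sketched here as one option rather than the tutorial's own method, is to let Python's standard `csv` module quote such fields. The helper `clean_shipping` and the sample rows are hypothetical stand-ins for values scraped from the page, and the output goes to an in-memory buffer so the example runs without network access.

```python
import csv
import io


def clean_shipping(text):
    """Strip white space, the dollar sign, and the ' Shipping' suffix."""
    return text.strip().replace("$", "").replace(" Shipping", "")


# Hypothetical rows standing in for scraped (brand, product_name, shipping) values.
scraped = [
    ("Acme", "GTX Example Card, 8GB", "  $4.99 Shipping "),
    ("Acme", "GTX Sample Card 6GB", "Free Shipping"),
]

# In-memory buffer; swap for open("graphics_cards.csv", "w", newline="")
# to write to disk instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["brand", "product_name", "shipping"])
for brand, product_name, shipping in scraped:
    # csv.writer quotes any field that contains a comma.
    writer.writerow([brand, product_name, clean_shipping(shipping)])

csv_text = buffer.getvalue()
print(csv_text)
```

Because the writer quotes fields that contain commas, product names keep their original punctuation and the file can be read back with `csv.reader` or a spreadsheet without column drift.
[/av_textblock]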

[/av_one_full]
[/av_section]

[av_section min_height='' min_height_px='500px' padding='default' shadow='no-shadow' bottom_border='no-border-styling' scroll_down='' id='' color='main_color' custom_bg='' src='' attach='scroll' position='top left' repeat='no-repeat' video='' video_ratio='16:9' video_mobile_disabled='' overlay_enable='' overlay_opacity='0.5' overlay_color='' overlay_pattern='' overlay_custom_pattern='']
[av_one_full first min_height='' vertical_alignment='' space='' custom_margin='' margin='0px' padding='0px' border='' border_color='' radius='0px' background_color='' src='' background_position='top left' background_repeat='no-repeat' animation='']

[av_comments_list]

[/av_one_full]
[/av_section]

