The Product Manager’s Guide to Web Scraping
Automate the mundane — Enjoy the PM life
The Snippet is a Weekly Newsletter on Product Management for aspiring product leaders.
Let’s admit it. A typical Product Manager day can be super busy — customer meetings, competitor analysis, backlog grooming, management reporting, and much more. There is always lots to do.
Some of these things of course need a PM’s undivided personal attention: time spent with customers, feature planning, presentations, etc. But you’ll see that there are also a lot of repetitive tasks that a Product Manager has to deal with. Why not automate the boring, repetitive, and mundane (but important) stuff?
One of the things that can really help Product Managers automate data collection is learning how to scrape the web. A personal favorite use case is to automatically ‘track’ information as it's updated on a competitor’s website. (This is how I stay up to date on new product launches from competitors.)
At this point, some of you reading this might be thinking — do I need to learn how to code before I can scrape the web?
The answer is “Yes, but you don't need to be an expert.” In fact, even if you have never coded before, simply follow along with this post and make this your first coding project!
“Learn to Code” — by Naval
In fact, learning to scrape the web is perhaps the easiest way to get started with writing a little bit of code — and it delivers immediate value.
So, let's jump right into it and build a tracker to keep tabs on a competitor’s website.
Your Perfume business
Say you want to launch a global fragrance business, and so you want to know the range of products and brands that your competitor offers and document your findings.
Let's take a look at one of your top competitors — Christian Dior.
Here are all the Men’s Fragrances
…And Women’s Fragrances
Without web scraping, you would have to copy and paste all the information that you need, one item at a time. That is too much time and energy wasted on a repetitive task.
And by the way that's just Dior. There are several other competitors in this space that you’d want to learn about, wouldn’t you? Life’s too short to spend time on that stuff. So, let us automate the process instead.
Building the Website Scraper
We will use Python to build a webpage scraper that will:
“Open” the 2 Dior URLs — for Men’s & Women’s Fragrances
“Extract” the Brands, Perfume Names, & their Intensities
Log this data into a CSV file
Setting up your environment
Step #1 is to make sure your computer has Python. The easiest way to check is to type “python” in your command prompt or terminal. It will print the version number, whether it is a 32-bit or 64-bit build, and some other information.
In the unlikely event that you don’t have Python installed, here are the steps:
Go to the Python downloads page: Python downloads.
Click on the link/button to download Python 2.7.x. (The code in this post imports urllib2, which exists only in Python 2.)
Follow the install instructions (leave defaults as-is).
Open your terminal and type the command cd. Next, type the command python. The Python interpreter should respond with the version number. If you’re on a Windows machine, you will likely have to navigate to the folder where Python is installed (for example, Python27, which is the default) for the python command to function.
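If you would rather check from inside Python itself, here is a minimal sketch that uses only the standard library. It reports the interpreter version and whether the build is 32-bit or 64-bit. The helper name describe_python is made up for this example.

```python
import platform
import sys


def describe_python():
    """Return the interpreter version and build architecture as text."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    bits = platform.architecture()[0]  # typically "64bit" or "32bit"
    return "Python {} ({})".format(version, bits)


print(describe_python())
```

Save it to a file and run it with the python command, or paste it into the interpreter prompt.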
Step #2 is to open your favorite code editor and start writing code! I use Sublime Text.
Before you start coding…
We are going to write code that will look into the HTML of Dior’s website and extract the information that we need. But before we write the code itself, let's take a look at the HTML. This will help us get a better sense of what we need to do.
If you inspect the page's HTML in your browser, you will see a series of <li> tags.
Expand one of the <li> tags and you will see HTML div tags that contain the product information. If you expand the other <li> tags — you will see these tags repeat for each product listed on the website.
So, all we need to do is to write code that will extract information from these HTML tags.
Because our code needs to extract information located inside HTML tags, we will use a very cool Python library called Beautiful Soup to help us. Beautiful Soup provides a few simple methods for navigating, searching, and modifying HTML tags and for extracting the information inside them.
But first, install Beautiful Soup by typing the following command into your terminal:
pip install beautifulsoup4
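To confirm the install worked, one quick check is to parse a tiny HTML string. This is only a sketch: the HTML below is invented for the check and is not copied from Dior's site.

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment, used only to check that Beautiful Soup imports and parses.
sample_html = "<ul><li><div class='product-legend'>Sample product</div></li></ul>"

soup = BeautifulSoup(sample_html, "html.parser")

# find returns the first tag that matches the name and class.
first_div = soup.find("div", class_="product-legend")
print(first_div.text)
```

If the import fails with an error, the library is most likely installed for a different Python than the one you are running.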
Next, create a new folder to store your code. Let's call this folder “web-scraper-dior”. Inside this folder, create a new Python file called “scraper.py”.
At this point, if you want to simply run the code and see the result — here’s the link to the scraper.py on Github. Simply copy the code from this file and paste it into the new python file you created.
Once you’ve copied the code over, go to your command window or terminal and navigate to the “web-scraper-dior” folder that you created (using the command ‘cd’). Once you are in that folder, run the following command on your terminal:
python scraper.py
After a few seconds, you should see “CSV with Product Info Generated” printed on the terminal. Once this happens, go to your folder and the CSV file should be sitting there with all the information that you need. With a few small changes to the code, you can extract info from other perfume companies that you care about within minutes!
If you care about what the code is actually doing — read on 😊😊
Understanding the code
If you look at your scraper.py file — it should look something like this
The first thing we do is import the following packages — we will need them.
import urllib2
from bs4 import BeautifulSoup
import csv
urllib2 will help us open the Dior URLs we are interested in.
BeautifulSoup will help us navigate and extract data inside the page HTML tags.
csv will help us write this data into a CSV file.
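One thing to watch: urllib2 exists only in Python 2. In Python 3, its urlopen function lives in urllib.request. If you are not sure which version you will run, one possible approach is a fallback import like the sketch below. This is an assumption for illustration and is not taken from scraper.py. The bs4 and csv imports stay the same in both versions.

```python
import csv

try:
    # Python 2: urlopen lives in urllib2.
    from urllib2 import urlopen
except ImportError:
    # Python 3: urllib2 was split up, and urlopen moved to urllib.request.
    from urllib.request import urlopen

# Either way, we end up with a callable named urlopen.
print(callable(urlopen))
```

With this in place you would call urlopen(pg) instead of urllib2.urlopen(pg).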
Next, we define the Dior URLs we mentioned above for Men’s and Women’s fragrances.
dior_urls = [
    "https://www.dior.com/en_int/fragrance/mens_fragrance/all-products",
    "https://www.dior.com/en_int/fragrance/womens-fragrance/all-products",
]
In the code snippet below,
for pg in dior_urls:
    perfumes_array = []
    page = urllib2.urlopen(pg)
    soup = BeautifulSoup(page, "html.parser")
    products = soup.find_all("div", class_="product-legend")
we open each URL and do the following
Create an array called “perfumes_array” to store our product attributes from that url.
We use urllib2 to get the HTML code of each web page — we store that into a variable called page
We ask Beautiful Soup to use the variable page to create an object soup that can then help us find any HTML tag that we may be interested in.
Finally, we use this soup object to find all the HTML div tags with class “product-legend”. Why? Because, if you remember from the HTML above, this div tag contains all the product information we care about. Also, each such div tag corresponds to one product on the page.
soup finds all these tags and we store the result into another variable called “products”.
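To see the find_all step on its own, without hitting Dior's servers, here is a small sketch that runs against a hand-written HTML string. The HTML is invented to mimic the structure described above (one <li> per product, each holding a “product-legend” div). The real page is much larger and may have changed since this post was written.

```python
from bs4 import BeautifulSoup

# Invented HTML that mimics the described structure. Not copied from Dior's site.
SAMPLE_PAGE = """
<ul>
  <li><div class="product-legend"><span class="product-title">Brand A</span></div></li>
  <li><div class="product-legend"><span class="product-title">Brand B</span></div></li>
  <li><div class="other-block">Not a product</div></li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")

# find_all returns every matching tag, so we get one entry per product.
products = soup.find_all("div", class_="product-legend")
print(len(products))
```

The third <li> is skipped because its div does not carry the “product-legend” class.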
OK, so far so good. Now we simply need to iterate over this products object and get the information we need. Here’s how.
for tag in products:
    perfumeBrand = tag.find("span", class_="title-with-level product-title font-century-std size-s bold").text
    perfumeName = tag.find("p", class_="multiline-text product-subtitle").text
    # Default to an empty string so a product without an intensity tag still works
    perfumeIntensity = ""
    if tag.find("span", class_="sr-only"):
        perfumeIntensity = tag.find("span", class_="sr-only").text
In the code snippet above, we loop through the products object and extract the perfumeBrand, perfumeName, and perfumeIntensity by finding the appropriate tags.
We then store all these product attributes into an array named “perfumes_array”.
perfumes_array.append((perfumeBrand.strip(), perfumeName.strip(), perfumeIntensity.strip()))
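Not every product lists an intensity, so the extraction has to cope with a tag that is missing. Here is one way to sketch that, using a small helper that returns a default when nothing is found. The helper name text_or_default, the sample HTML, and the shortened class names are all made up for this example.

```python
from bs4 import BeautifulSoup

# Invented sample: the second product has no intensity tag.
SAMPLE_PAGE = """
<div class="product-legend">
  <span class="product-title"> Brand A </span>
  <p class="product-subtitle"> Perfume One </p>
  <span class="sr-only"> Intense </span>
</div>
<div class="product-legend">
  <span class="product-title"> Brand B </span>
  <p class="product-subtitle"> Perfume Two </p>
</div>
"""


def text_or_default(parent, name, css_class, default=""):
    """Return the stripped text of the first matching child tag, or a default."""
    found = parent.find(name, class_=css_class)
    return found.text.strip() if found is not None else default


soup = BeautifulSoup(SAMPLE_PAGE, "html.parser")
perfumes_array = []
for tag in soup.find_all("div", class_="product-legend"):
    perfumes_array.append((
        text_or_default(tag, "span", "product-title"),
        text_or_default(tag, "p", "product-subtitle"),
        text_or_default(tag, "span", "sr-only"),
    ))

print(perfumes_array)
```

A note on the class names: passing a single class such as “product-title” matches any tag that carries that class among others, while passing a full multi-class string only matches when the tag's class attribute is exactly that string. The single-class form tends to survive small styling changes on the page.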
Now we have all the info from the HTML page inside this perfumes_array.
All we need to do is to use this perfumes_array to write the contents into a CSV file and save it in our folder.
That's what the following code does.
with open("index.csv", mode="a") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["*********".encode("utf-8").strip(), "*********".encode("utf-8").strip(), "*********".encode("utf-8").strip()])
    writer.writerow(["BRAND".encode("utf-8").strip(), "NAME".encode("utf-8").strip(), "INTENSITY".encode("utf-8").strip()])
    for perfumeBrand, perfumeName, perfumeIntensity in perfumes_array:
        writer.writerow([perfumeBrand.encode("utf-8").strip(), perfumeName.encode("utf-8").strip(), perfumeIntensity.encode("utf-8").strip()])
print("CSV with Product Info Generated")
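If you are on Python 3, the CSV step can be simpler, because strings are already Unicode and the .encode('utf-8') calls are no longer needed (in Python 3 they would write values like b'BRAND' into the file). Below is a sketch under that assumption. The rows are invented stand-ins for perfumes_array, and the file goes to a temporary folder instead of your project folder.

```python
import csv
import os
import tempfile

# Invented rows standing in for perfumes_array.
perfumes_array = [
    ("Brand A", "Perfume One", "Intense"),
    ("Brand B", "Perfume Two", ""),
]

# Written to a temporary folder so the example does not touch your project folder.
csv_path = os.path.join(tempfile.mkdtemp(), "index.csv")

# newline="" stops the csv module from adding blank lines between rows on Windows.
with open(csv_path, mode="a", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["BRAND", "NAME", "INTENSITY"])
    writer.writerows(perfumes_array)

# Read the file back to confirm what was written.
with open(csv_path, newline="", encoding="utf-8") as csv_file:
    rows = list(csv.reader(csv_file))

print(rows)
```

Because the file is opened in append mode, running the scraper twice adds a second batch of rows to the same file. Delete index.csv first if you want a fresh start.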
Here’s all your competitor product info — programmatically collected and neatly organized into a CSV file — saving you tons of time!🤖🤖🤖🤖🤖🤖
If you have questions or comments on this post, just leave a comment or reach out to me on Twitter!
In case you missed previous posts, you can find them here.
This post has been published on www.productschool.com communities.