For my PFCH project, I started out using an API to extract data. When I read the API provider’s instructions, I realized there were many limits on what I could do. Many websites offer APIs to help developers extract data. While APIs are designed to be a structured approach to data extraction, they are subject to limitations imposed by the companies that created them, and understandably so, because data is the core of most web services. For instance, with Zillow’s API, I was only able to request specific information for a specific property. There is also a limit on the number of API calls non-paying users can make in a day. As a result, I was unable to use the API to extract the data I needed for manipulation and meaningful analysis. Because of these limitations, I decided to try a different approach: web scraping, a less structured way of getting data off the Internet. When I learned about web scraping with BeautifulSoup in this class, I was excited at the prospect of being able to extract a large amount of data from the Internet without going through the providers’ APIs. I decided to use web scraping to extract all song lyrics from Azlyrics.com for analysis.
I chose Azlyrics.com as my data source because the website has a clean interface that can be easily inspected for the data I needed. The artist names on Azlyrics.com are sorted alphabetically, and the artists whose names begin with each letter are listed under their own URL. Furthermore, each song lyric has its own webpage. The simple interface made Azlyrics.com an inviting place for me to test my web scraping skills, but I soon found out that things aren’t always as easy as they look.
Azlyrics.com structures its data as follows: the names of artists are listed alphabetically; clicking on an artist takes you to a page that lists all the songs by that artist; and clicking on a song name takes you to a webpage entirely dedicated to that song’s lyrics. I decided to tackle this project by dividing it into three parts. First, I would scrape the URL of each artist’s song page, one letter at a time. I wrote a test script that scrapes the URLs of the song pages of all artists whose names begin with the letter A. The script took about 15 minutes to scrape the URLs and print them to the screen. The test script worked for the letter A, so I created a list of letters from A through Z for my script to loop through, in the hope of scraping all artist names from A through Z. That is where I hit the first stumbling block. The minute I ran the loop, I got an error message saying “connection aborted.” Not only was I unable to extract the data, I was blocked entirely from Azlyrics.com for about 24 hours.
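The letter-by-letter loop can be sketched roughly as follows. This is not my original script: the URL pattern in BASE_URL, the function names, and the 5-second pause are assumptions made for illustration. Pausing between requests is a common courtesy that may lower the chance of being blocked, but it is no guarantee.

```python
import string
import time
import urllib.request

# Assumed URL pattern for the per-letter artist index pages.
BASE_URL = "https://www.azlyrics.com/{letter}.html"


def letter_index_urls(letters=string.ascii_lowercase):
    """Return one artist-index URL per letter, in order."""
    return [BASE_URL.format(letter=letter) for letter in letters]


def fetch_all(urls, delay_seconds=5.0, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns the page text.
    It defaults to a plain urllib download.
    """
    if fetch is None:
        def fetch(url):
            with urllib.request.urlopen(url, timeout=30) as response:
                return response.read().decode("utf-8", errors="replace")

    pages = {}
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait so requests are not sent back to back
        pages[url] = fetch(url)
    return pages
```

Calling fetch_all(letter_index_urls()) would download the 26 index pages with a pause before every request after the first. The fetch and sleep parameters exist only so the loop can be tried out without touching the network.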
After a bit of research, I learned that many websites have built-in anti-scraping measures to prevent an excessive number of requests. As a result, I was no longer able to extract data from Azlyrics.com directly. Fortunately, our instructor pointed me to a website that archives web pages regularly. With https://archive.org, I was able to access an older archived version of Azlyrics.com. Because the archived version is just a copy of the real website with less up-to-date data, I was free to extract data from it.
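Archived pages on the Wayback Machine are addressed by putting a capture timestamp and the original URL after https://web.archive.org/web/. A minimal sketch of building such an address is below. The helper name wayback_url and the timestamp in the example are made up for illustration and do not come from my script.

```python
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url, timestamp):
    """Build a Wayback Machine address for a capture of `original_url`.

    `timestamp` is a string of digits in YYYYMMDDhhmmss form. The Wayback
    Machine redirects to the capture closest to that moment.
    """
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


# Hypothetical timestamp, chosen only to show the shape of the result.
print(wayback_url("https://www.azlyrics.com/a.html", "20160101000000"))
```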
I decided to reduce the scope of my extraction to artists whose names begin with the letter A, and then work my way up. I was able to scrape the URLs of the artists’ song pages. For the second part, I wrote another script that loops through every song on an artist’s song page to extract the URL of each lyric page. The script worked well and led me to the lyric pages, but then I hit the second stumbling block. Although each lyric page looks simple and clean, the lyrics are deeply embedded in the page, without any unique HTML tag to differentiate them from the other text. As a result, I decided to just scrape all the text from every lyric page and clean it up later.
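The two steps described here, collecting every link on a song page and grabbing all the text on a lyric page, can be sketched as below. To keep the example runnable without extra installs, it uses Python’s built-in html.parser rather than BeautifulSoup, which is what I actually used. In BeautifulSoup the equivalent calls would be soup.find_all("a") and soup.get_text(). The sample HTML, the artist name, and the URLs are invented for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collect every link target and every piece of visible text on a page."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Turn relative links into absolute URLs.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.text_chunks.append(text)


def parse_page(html, base_url):
    """Return (links, text_chunks) for one page of HTML."""
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, parser.text_chunks


# Invented page standing in for a real song list or lyric page.
SAMPLE_HTML = """
<html><head><style>b { color: red; }</style></head>
<body>
<a href="../lyrics/someartist/songone.html">Song One</a>
<a href="../lyrics/someartist/songtwo.html">Song Two</a>
<div>First line of a lyric<br>Second line of a lyric</div>
</body></html>
"""

links, chunks = parse_page(SAMPLE_HTML, "https://www.azlyrics.com/s/someartist.html")
print(links)
print(chunks)
```

Because nothing marks the lyrics apart from the rest of the page, the text list mixes link labels with lyric lines, which is why the cleaning has to happen afterwards.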
The first and second parts of my script both worked: I successfully obtained the URLs of each artist’s song page and of each lyric page. However, when I started passing in the lyric page URLs in an attempt to scrape all the text from every lyric page, the script ran for two hours without a response, and it was only scraping the letter A. At that rate, it would probably take days to scrape letters A through Z. As a result, I decided to further reduce my scope to just 5 artists. With that in mind, I set the goal of my project: to find the most frequently used words in songs by the top 5 Billboard musicians (The Weeknd, Pentatonix, Bruno Mars, Drake, and The Rolling Stones).
After making sure the first and second parts of my script worked, I stitched them together into a single script that loops through each artist and their song list. The scrape extracted 743 songs from Azlyrics.com, which I wrote out to a JSON file. I converted the JSON file to CSV using http://www.convertcsv.com/json-to-csv.htm. By opening the CSV file in Excel, I was able to clean up and split the lyrics for word counts and analysis. In my findings, each musician has a total of approximately 6200 words. After I removed some commonly used pronouns and prepositions, swearing and drug references turned out, as I expected, to be frequent in hip-hop lyrics. Surprisingly, in the approximately 6200 words extracted from Bruno Mars’s lyrics, the “f word” appears only once; the word “witness” occurs 34 times in The Rolling Stones’ songs, and the word “untouchable” occurs 18 times in Pentatonix’s songs.
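I did the cleaning and counting in Excel, but the same word count could be sketched in Python straight from the JSON file. The field names artist, title, and lyrics, the short stop-word list, and the sample records below are all assumptions for illustration; they are not the actual layout or contents of my file.

```python
import json
import re
from collections import Counter

# A tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"i", "you", "he", "she", "it", "we", "they", "me", "my", "your",
              "the", "a", "an", "and", "in", "on", "of", "to", "for", "with"}


def count_words(songs, stop_words=STOP_WORDS):
    """Return {artist: Counter of words} for a list of song records."""
    counts = {}
    for song in songs:
        # Lowercase the lyrics and keep runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", song["lyrics"].lower())
        kept = [word for word in words if word not in stop_words]
        counts.setdefault(song["artist"], Counter()).update(kept)
    return counts


# Made-up records standing in for the scraped JSON file.
SAMPLE_JSON = """[
  {"artist": "Artist A", "title": "Song 1", "lyrics": "I walk the line and you walk the line"},
  {"artist": "Artist A", "title": "Song 2", "lyrics": "Walk with me"},
  {"artist": "Artist B", "title": "Song 3", "lyrics": "Rain on the roof, rain in my heart"}
]"""

counts = count_words(json.loads(SAMPLE_JSON))
print(counts["Artist A"].most_common(2))  # [('walk', 3), ('line', 2)]
```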