My first web scraping assignment
I had always wanted to work in tech, and three years ago I switched careers to make it happen. I must say (arguably) that only a few firms in Ghana have slots for, or an appreciation of, data scientists; worst of all, some do not even know that such a field exists. One thing I noticed when I started working is that you often end up having to bridge the skills gap yourself, and you never stop learning.
I want to share with you some lessons I learned from my first task working with the data engineering team in my first ever data science role. My team lead gave us a week to scrape data from YouTube and another week to scrape data from Twitter. We worked in silos, although collaboration was allowed. We had to learn a lot on our own; our team lead gave guidance, but much of the task was up to us.
Scraping data from YouTube
Scraping data from YouTube has a hard part and a not-so-hard part. The not-so-hard part is applying for the credentials; for YouTube, that is pretty straightforward. First of all, you’ll need a Google (Gmail) account to apply for the credentials. Use this link to get started.
Remember to copy the API keys and keep them safe, since they have to be kept secret. If you choose not to copy them to your local computer, you’ll need the internet to access them each time. I used Python to perform the web scraping. There are a number of materials online to aid in scraping data from YouTube, but here are the caveats I want you to pay attention to. The first has to do with the number of searches per request. Google allows for 25,000 rows of data per request, but that doesn’t come for free. The Custom Search JSON API provides 100 search queries per day for free. If you need more, you may sign up for billing in the API Console. Additional requests cost $5 per 1,000 queries, up to 10k queries per day.
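To give you an idea, here is a minimal sketch of what a search request looks like with the official google-api-python-client library. The environment variable name and the search term are just examples I made up, not part of any project mentioned here:

```python
# pip install google-api-python-client
import os

from googleapiclient.discovery import build

# Read the API key from an environment variable so it never lives in the code.
# "YOUTUBE_API_KEY" is a placeholder name; use whatever you set locally.
API_KEY = os.environ["YOUTUBE_API_KEY"]

# Build a client for version 3 of the YouTube Data API.
youtube = build("youtube", "v3", developerKey=API_KEY)

# Search for videos matching a query; each call consumes quota.
request = youtube.search().list(
    q="data science ghana",  # example search term
    part="snippet",
    type="video",
    maxResults=50,           # 50 is the maximum per page
)
response = request.execute()

# The response is a plain Python dict parsed from JSON.
for item in response["items"]:
    print(item["id"]["videoId"], "-", item["snippet"]["title"])
```

Every call like this eats into your quota, so keep an eye on your daily limit as you experiment.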
If you need more than 10k queries per day and your Programmable Search Engine searches 10 sites or fewer, you may be interested in the Custom Search Site Restricted JSON API, which does not have a daily query limit.
One other caveat … you have to be familiar with working with JSON files. It was my first time working with JSON, but there are a number of materials online you can practice with. If you’re familiar with Python dictionaries, then understanding JSON will be much simpler. This article helped me get around working with JSON in Python. If you’re familiar with Pandas, you can save your results as a CSV file and use Pandas to read it, as in the sketch below.
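Here is a small sketch of that workflow: parsing a JSON string into a dictionary, flattening it with Pandas, and saving it as a CSV. The sample response and file name are made up for illustration:

```python
import json

import pandas as pd

# A JSON API response parses into ordinary Python dicts and lists.
raw = '{"items": [{"id": "abc123", "snippet": {"title": "Intro to scraping"}}]}'
data = json.loads(raw)

# Flatten the nested records into columns like "snippet.title" ...
df = pd.json_normalize(data["items"])

# ... then save to CSV and read it back with Pandas later.
df.to_csv("youtube_results.csv", index=False)
df = pd.read_csv("youtube_results.csv")
print(df.head())
```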
You can check out my GitHub repo on scraping data from YouTube here if you get stuck.
Scraping data from Twitter and working with MongoDB
I enjoyed scraping data from Twitter because the code was pretty straightforward compared to scraping data from YouTube. The challenge with the Twitter API is applying for the credentials. I do not know why Twitter’s management decided to make getting the API keys somewhat difficult, but it could have something to do with misuse of credentials; Twitter has come under scrutiny and backlash in recent times, so this move is not surprising.
I read a number of articles before applying for the developer credentials, and I still fell victim to having to provide further details as to why I should be granted access. On the brighter side, it is possible to get the API key and token instantly if your reason is compelling. Your best chance of faster approval is to apply as a hobbyist and be as honest as possible. This saves you from having to provide further information when your reasons are not convincing enough. The process can take days if you’re not approved instantly, which can be frustrating, especially if you’re working on a client project and need those credentials; sometimes the application might not even get approved at all.. yikes. Keep things simple when responding to the application questions; it can improve your chances of getting approved. I didn’t get lucky and wasn’t approved instantly, unlike my colleagues and one person I guided, but I got an email asking for further reasons for applying, which I provided, and I got approval shortly afterward… I think in less than 24 hours. Remember.. keep the answers simple, except in situations where you have to provide an explanation, and even evidence, of a project you’re working on that will require you to work with Twitter data.
As with the YouTube API, Twitter endpoints have rate limits depending on the API type, which you can read about here. The procedure for scraping Twitter data is much like that for YouTube; the only difference, and the not-so-difficult aspect of working with Twitter data, is the set of arguments for the search query function, which is very straightforward compared to YouTube’s.
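For illustration, here is a minimal sketch of a Twitter search using the popular Tweepy library (version 4). The environment variable names for the four credentials are placeholders, and the rate limits you hit will depend on your access level:

```python
# pip install tweepy
import os

import tweepy

# The four credentials come from the Twitter developer portal;
# the environment variable names below are placeholders.
auth = tweepy.OAuth1UserHandler(
    os.environ["TW_API_KEY"],
    os.environ["TW_API_SECRET"],
    os.environ["TW_ACCESS_TOKEN"],
    os.environ["TW_ACCESS_SECRET"],
)
api = tweepy.API(auth, wait_on_rate_limit=True)  # back off when rate-limited

# Cursor pages through results; search_tweets hits the standard v1.1 endpoint.
tweets = tweepy.Cursor(
    api.search_tweets, q="data science", lang="en", tweet_mode="extended"
).items(100)

for tweet in tweets:
    print(tweet.full_text[:80])
```

Notice how the search arguments are just plain keyword parameters, which is what makes this feel so much simpler than the YouTube side.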
The second aspect of this section is working with MongoDB. My GitHub repository has steps you can use to create and manipulate data in Mongo on your local computer, and to view that data with MongoDB Compass.
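As a taste of what the repo covers, here is a rough sketch of inserting scraped records into a local MongoDB instance with pymongo; the database and collection names are arbitrary examples, and the sample documents are made up:

```python
# pip install pymongo
from pymongo import MongoClient

# Connect to a MongoDB server running locally on the default port.
client = MongoClient("mongodb://localhost:27017/")

# Database and collection names here are example placeholders;
# both are created lazily on first insert.
db = client["scraping_demo"]
tweets = db["tweets"]

# Scraped records are plain dicts, so they insert directly as documents.
docs = [
    {"tweet_id": 1, "text": "hello world", "lang": "en"},
    {"tweet_id": 2, "text": "web scraping 101", "lang": "en"},
]
tweets.insert_many(docs)

# Query them back; the same data is then browsable in MongoDB Compass.
for doc in tweets.find({"lang": "en"}):
    print(doc["tweet_id"], doc["text"])
```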
You can check out my GitHub repo on scraping data from Twitter and working with virtual environments here if you get stuck.
In my next articles, I will share extensively how to work with MongoDB, how to connect data on your local computer to MongoDB Compass (a MongoDB client), and then how to work with virtual environments.
I wish you the very best as you race to perform your first web scraping.