Lance Lu (ML Challenge)

Source: Notion | Last edited: 2022-09-06 | ID: 761f0d70-135...


Imagine that YOU are a Machine Learning Research Scientist in our data science team who is collaborating with our data engineering team.

The data engineering team has done some research and found that Google Trends data is potentially beneficial to the data science team. YOU, as a data scientist, want a consistent time series of Google Trends data from 2017 to the present at hourly intervals. YOU informed the engineering team of this requirement, but they said they could not fetch such hourly data directly; the reason is explained in the Deep Dive section. They can, however, fetch the following raw data from Google Trends:

  • hourly_data.csv: a time series of weekly-consistent Google Trends data starting in 2017 and continuing up to the present, with hourly intervals
  • weekly_data.csv: a time series of monthly-consistent Google Trends data starting in 2017 and continuing up to the present, with weekly intervals
  • monthly_data.csv: a time series of consistent Google Trends data starting in 2017 and continuing up to the present, with monthly intervals

To complete this challenge:

  • Carefully read the Deep Dive section.
  • Write a Python script to solve the **Problem** using the time series files downloadable from the Raw Data section.

The data engineering team fetched the raw data from Google Trends by scraping its website.

The engineering team found that, when they chose a time range of 2017 to the present, Google Trends would only return a consistent time series at monthly intervals (downloadable as monthly_data.csv in the **Raw Data** section).

To get a time series at hourly intervals, they were forced to work within a much narrower time range (i.e. a week). By retrieving week-range data one week at a time and concatenating the results, they obtained an hourly time series from 2017 up to the present (downloadable as hourly_data.csv in the **Raw Data** section).
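The week-by-week retrieval can be pictured as iterating over consecutive 7-day windows. The sketch below only generates the window boundaries; `week_windows` is a hypothetical helper, not the engineering team's actual code, and it assumes windows start on the first date given (2017-01-01 is a Sunday, which matches the Sunday-to-Saturday weeks used in the examples in this section).

```python
from datetime import date, timedelta

def week_windows(start, end):
    """Yield (first_day, last_day) pairs for consecutive 7-day windows covering start..end."""
    first = start
    while first <= end:
        # The final window is clipped so it never runs past `end`.
        yield first, min(first + timedelta(days=6), end)
        first += timedelta(days=7)

# Illustration only: the windows that cover January 2017.
windows = list(week_windows(date(2017, 1, 1), date(2017, 1, 31)))
for first_day, last_day in windows:
    print(first_day.isoformat(), "to", last_day.isoformat())
```

Each window would be fetched separately, which is why every week ends up with its own scale.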

However, these hourly data are not what YOU want, because they are not consistent!

Google scales the trends data within the window range you choose. For example, a value_hour of ‘50’ during the week from 2022-07-03 to 2022-07-09 does not represent the same level of interest as a value_hour of ‘50’ during the week from 2022-07-17 to 2022-07-23.

Only the value_hour numbers that sit within the same week are consistent.
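A toy example may make the inconsistency concrete. It assumes that each window is rescaled so that its peak becomes 100, which is how Google Trends is commonly described; the raw numbers and the `scale_window` helper are made up for illustration.

```python
def scale_window(raw):
    """Rescale one window so that its largest value becomes 100 (assumed Trends behaviour)."""
    peak = max(raw)
    if peak == 0:
        return [0 for _ in raw]
    return [round(100 * value / peak) for value in raw]

# Hypothetical underlying interest for two separate weeks (three samples each).
week_a_raw = [10, 20, 40]
week_b_raw = [100, 200, 400]

week_a = scale_window(week_a_raw)
week_b = scale_window(week_b_raw)

print(week_a)  # [25, 50, 100]
print(week_b)  # [25, 50, 100]
```

Both weeks contain a scaled value of 50, yet the underlying interest differs by a factor of ten. Comparisons are only meaningful inside a single window.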

Similarly, to get the time series at weekly intervals (downloadable as weekly_data.csv in the Raw Data section), the engineering team used a time range of a month. They fetched month-range data and concatenated it month by month from 2017 to the present.

By the same token, only the value_week numbers that sit within the same month are consistent.

Given the monthly_data.csv, weekly_data.csv and hourly_data.csv files from the engineering team, how would you use them to produce a consistent time series of Google Trends data from 2017 to the present at hourly intervals?
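One possible direction, offered as a sketch and not as the expected answer, is to use the coarser, consistent series as an anchor for the finer one. The code below assumes that a month's value is proportional to the mean of the true weekly interest within that month; this is an assumption and should be checked against the real files. The helper `rescale_to_parent`, the dictionary layout, and all numbers are hypothetical.

```python
from collections import defaultdict

def rescale_to_parent(children, parent_of, parent_value):
    """Rescale locally scaled child values so each group's mean equals its parent's value.

    children:     dict mapping child key -> locally scaled value
    parent_of:    function mapping a child key to its parent key
    parent_value: dict mapping parent key -> globally consistent value
    """
    groups = defaultdict(list)
    for key, value in children.items():
        groups[parent_of(key)].append(value)

    rescaled = {}
    for key, value in children.items():
        parent = parent_of(key)
        group_mean = sum(groups[parent]) / len(groups[parent])
        # An all-zero group carries no shape information, so it stays at zero.
        factor = parent_value[parent] / group_mean if group_mean else 0.0
        rescaled[key] = value * factor
    return rescaled

# Made-up numbers, not taken from the real CSV files.
monthly = {"2017-01": 40.0, "2017-02": 80.0}
weekly = {
    "2017-01-01": 50.0, "2017-01-08": 100.0,
    "2017-02-05": 50.0, "2017-02-12": 100.0,
}

# Keys are ISO dates, so the first seven characters identify the month.
weekly_consistent = rescale_to_parent(weekly, lambda day: day[:7], monthly)
for day, value in weekly_consistent.items():
    print(day, round(value, 2))
```

After rescaling, the two weeks that both read 50 in their own months differ by a factor of two, matching the monthly values. The same function could in principle be applied a second time, with hours as children and the rescaled weeks as parents. Weeks that straddle a month boundary are not handled here and would need a decision of their own.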

Write a Python script to solve this problem using the time series files downloadable from the Raw Data section below.

Upload program code (or pseudo code) file(s) along with the README file for this TTA to GitHub, and send the repository link to careers+data_representation_tta@eonlabs.com

You are always welcome to send any questions about this TTA to careers+data_representation_tta@eonlabs.com so that our engineering team can answer them.