IMDb (Internet Movie Database) is one of the most recognized names for its comprehensive online database collection of movies, films, TV series and so on. The exercise is going to use subset of IMDb dataset (available in TSV format). This workshop will utilize 6 IMDb dataset that are related to US based movies since year 2000. The dataset has around 106K+ movies, ratings, votes and cast/crew information.
The CloudFormation template has launched EC2 Amazon Linux 2 instance with MySQL installed and running. It has created imdb database, 6 new tables (one for each IMDb dataset), downloaded IMDb TSV files to MySQL server local directory and uploaded the files to 6 new tables. To explore dataset, follow below instructions to login EC2 server. It has also configured a remote MySQL user based on the CloudFormation input parameter.
sudo su
cd /var/lib/mysql-files/
ls -lrt
mysql -u DbMasterUsername -pDbMasterPassword
10. Congratulations! You are now connected to a self-managed MySQL source database on EC2. In next steps, we will explore database and tables hosting IMDb datasets
use imdb;
11. Show all the tables created;
show tables;
For illustration purpose, below is a logical diagram represents relationship between various source tables hosting IMDb dataset.
CREATE VIEW imdb.movies AS\
SELECT tp.tconst,\
tp.ordering,\
tp.nconst,\
tp.category,\
tp.job,\
tp.characters,\
tb.titleType,\
tb.primaryTitle,\
tb.originalTitle,\
tb.isAdult,\
tb.startYear,\
tb.endYear,\
tb.runtimeMinutes,\
tb.genres,\
nm.primaryName,\
nm.birthYear,\
nm.deathYear,\
nm.primaryProfession,\
tc.directors,\
tc.writers\
FROM imdb.title_principals tp\
LEFT JOIN imdb.title_basics tb ON tp.tconst = tb.tconst\
LEFT JOIN imdb.name_basics nm ON tp.nconst = nm.nconst\
LEFT JOIN imdb.title_crew tc ON tc.tconst = tp.tconst;
Use below command to review count of records from the denormalized view. At this point, your source database is ready to migrate to Amazon DynamoDB.
select count(*) from imdb.movies;