Explore Source Model

IMDb (Internet Movie Database) is one of the most recognized names for its comprehensive online database collection of movies, films, TV series and so on. The exercise is going to use subset of IMDb dataset (available in TSV format). This workshop will utilize 6 IMDb dataset that are related to US based movies since year 2000. The dataset has around 106K+ movies, ratings, votes and cast/crew information.

The CloudFormation template has launched EC2 Amazon Linux 2 instance with MySQL installed and running. It has created imdb database, 6 new tables (one for each IMDb dataset), downloaded IMDb TSV files to MySQL server local directory and uploaded the files to 6 new tables. To explore dataset, follow below instructions to login EC2 server. It has also configured a remote MySQL user based on the CloudFormation input parameter.

  1. Go to EC2 console
  2. Select the MySQL-Instance and click Connect Final Deployment Architecture
  3. Make sure ec2-user is filled under the User name field. Click Connect Final Deployment Architecture
  4. Elevate your privilege using sudo command
      sudo su
    
    Final Deployment Architecture
  5. Go to the file directory
      cd /var/lib/mysql-files/
      ls -lrt
    
  6. You can see all the 6 files copied from the IMDB dataset to the local EC2 directory Final Deployment Architecture
  7. Feel free to explore the files.
  8. Go to AWS CloudFormation Stacks and click on the stack you created earlier. Go to the Parameters tab and copy the user name and password mentioned next to DbMasterUsername & DbMasterPassword Final Deployment Architecture
  9. Go back to EC2 Instance console and login to mysql
mysql -u DbMasterUsername -pDbMasterPassword

Final Deployment Architecture 10. Congratulations! you are now connected to a self-managed MySQL source database on EC2. In next steps we will explore database and tables hosting IMDb datasets

use imdb;

Final Deployment Architecture 11. Show all the tables created;

show tables;

Final Deployment Architecture

For illustration purpose, below is a logical diagram represents relationship between various source tables hosting IMDb dataset.

  • title_basics table has movies published in US after year 2000. tconst is an alphanumeric key uniquely assigned to each movie.
  • title_akas has published regions, languages and respective movie titles. It’s 1:many relationship with title_basics table.
  • title_ratings has movies rating and vote count. For this exercise, we can assume the information has high frequency update post movie release. It’s 1:1 related with title_basics table
  • title_principals has cast and crew information. It’s 1:many relationship with title_basics table.
  • title_crew has writer and director information. The table is 1:1 related with title_basics table.
  • name_basics has cast and crew details. Every member has unique nconst value assigned. Final Deployment Architecture
  1. We will create denormalized view with 1:1 static information and get it ready for migration to Amazon DynamoDB table. For now, go ahead and copy below code and paste into mysql command line. The details around target data model will be discussed in the next chapter.
CREATE VIEW imdb.movies AS\
	SELECT tp.tconst,\
		   tp.ordering,\
		   tp.nconst,\
		   tp.category,\
		   tp.job,\
		   tp.characters,\
		   tb.titleType,\
		   tb.primaryTitle,\
		   tb.originalTitle,\
		   tb.isAdult,\
		   tb.startYear,\
		   tb.endYear,\
		   tb.runtimeMinutes,\
		   tb.genres,\
		   nm.primaryName,\
		   nm.birthYear,\
		   nm.deathYear,\
		   nm.primaryProfession,\
		   tc.directors,\
		   tc.writers\
	FROM imdb.title_principals tp\
	LEFT JOIN imdb.title_basics tb ON tp.tconst = tb.tconst\
	LEFT JOIN imdb.name_basics nm ON tp.nconst = nm.nconst\
	LEFT JOIN imdb.title_crew tc ON tc.tconst = tp.tconst;

Use below command to review count of records from the denormalized view. At this point you source database is ready for migration to Amazon DynamoDB.

select count(*) from imdb.movies;