DATA BASE AND BIG DATA ANALYTICS

Academic Year 2019/2020 - 1st Year
Credit Value: 12
Scientific field: ING-INF/05 - Sistemi di elaborazione delle informazioni
Taught classes: 80 hours
Term / Semester: 1st and 2nd

Learning Objectives

  • DATA BASE

    This module covers the fundamental concepts of database management systems at scale, as well as the analysis of existing benchmarks in different application scenarios. Topics include data models (relational); query languages (SQL); implementation techniques for database management systems, including at large scale; and NoSQL, temporal, spatial, multimedia, and deductive databases. The module also discusses available large-scale multimedia datasets and how to query them, as well as state-of-the-art techniques for creating benchmarks to test data analytics methods. Principles for detecting mistakes, biases, systematic errors, and other unexpected problems are also analyzed.

    The learning objectives are:

    • To understand and use the main technologies for database management;
    • To use SQL to write efficient queries over large datasets;
    • To understand how to index and query multimedia datasets;
    • To become aware of the existing benchmarks and their limitations for training and comparing data analysis techniques.

    Knowledge and understanding

    • To understand the main concepts of database management systems
    • To understand concepts and tools for generating and querying datasets at different scales
    • To understand techniques for indexing and searching multimedia datasets
    • To understand how potential biases in data collection may affect analytics methods

    Applying knowledge and understanding

    • To be able to understand and effectively use the main tools for creating and querying SQL and NoSQL datasets
    • To query and analyse multimedia data at large scale
    • To select appropriate benchmarks and analyse the results achieved, including in terms of potential biases

  • Big Data Analytics

    This module covers the fundamental concepts of the management and design of a business intelligence (BI) system. Topics include data models for building a data warehouse; ETL (extract, transform and load) functionalities; OLAP analysis; basic data mining; reporting and interactive dashboards; and the evolution of BI architectures for large datasets. The module also covers techniques and algorithms for data visualization and exploratory analysis based on principles and techniques from graphic design, perceptual psychology and cognitive science. It is aimed at students who will use visualization in their data analytics work.

    The learning objectives are:

    1. to understand and use the main methodologies and techniques for data analysis
    2. to understand the main methodologies to design a data warehouse
    3. to understand the main methodologies to transform data into sources of knowledge through visual representation

    Knowledge and understanding

    • To understand the most important methodologies and techniques used by industries to analyse data in order to support the decision process
    • To understand the main methodologies to design a data warehouse
    • To understand the main methodologies to transform data into sources of knowledge through visual representation

    Applying knowledge and understanding

    • To be able to apply methodologies and techniques to analyse data.
    • To be able to design a data warehouse.
    • To be able to build reports and data analyses and organize them into interactive dashboards.

Course Structure

  • DATA BASE

    The main teaching methods are as follows:

    • Lectures to provide the basic theoretical and methodological knowledge for understanding how to manage data at scale;
    • Hands-on exercises in which students apply the methods learned and thus improve their problem-solving skills;
    • Paper reading and student presentations to develop critical thinking skills;
    • Seminars by renowned research and industrial experts in the field.

  • Big Data Analytics

    The main teaching methods are as follows:

    • Lectures, to provide theoretical and methodological knowledge of the subject;
    • Hands-on exercises, to provide “problem solving” skills and to apply design methodology;
    • Laboratories, to learn and test the usage of related tools

Required Prerequisites

  • DATA BASE

    Basic programming skills.

  • Big Data Analytics
    • Basic knowledge of database systems
    • Basic knowledge of SQL

Attendance of Lessons

  • DATA BASE

    Strongly recommended. Attending and actively participating in the classroom activities will contribute to the overall assessment of the final exam (see the evaluation procedure section).

  • Big Data Analytics

    Strongly recommended. Attending and actively participating in the classroom activities will contribute positively towards the overall assessment of the oral exam.


Detailed Course Content

  • DATA BASE

    1) Models and Languages for Database Management (15 hours)

    • Fundamentals of Database Management Systems (DBMS)
    • Relational Model: basic concepts, integrity constraints and keys.
    • SQL language: data definition, data modification, queries, views, transactions.
    • NoSQL databases: MongoDB
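
    The SQL topics listed above (data definition, integrity constraints and keys, data modification, views, transactions) can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module only so that it runs without a server; the course does not prescribe this engine, and the tables, columns and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Data definition: primary keys and a foreign key act as integrity constraints.
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)")
conn.execute(
    """CREATE TABLE student (
           id INTEGER PRIMARY KEY,
           name TEXT NOT NULL,
           dept_id INTEGER NOT NULL REFERENCES department(id))"""
)

# Data modification inside a transaction: 'with conn' commits on success
# and rolls back if an exception is raised.
with conn:
    conn.execute("INSERT INTO department VALUES (1, 'Engineering')")
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?)",
        [(1, "Ada", 1), (2, "Alan", 1)],
    )

# A view that stores a query definition, then a query against it.
conn.execute(
    """CREATE VIEW dept_size AS
       SELECT d.name AS dept, COUNT(s.id) AS n_students
       FROM department d LEFT JOIN student s ON s.dept_id = d.id
       GROUP BY d.name"""
)
rows = conn.execute("SELECT dept, n_students FROM dept_size").fetchall()

# A row that references a missing department violates the foreign key,
# so the transaction is rolled back and the row is not stored.
try:
    with conn:
        conn.execute("INSERT INTO student VALUES (3, 'Grace', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

n_students = conn.execute("SELECT COUNT(*) FROM student").fetchone()[0]
```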

    2) Querying and processing big data (10 hours)

    • Apache Spark SQL with Python
    • Datasets and DataFrames
    • Window functions
    • Caching and logging functions
    • The Spark UI
    • Examples of data analysis with Spark SQL
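
    As an illustration of the window functions topic above, the sketch below computes a running total and a rank within each partition. It runs the SQL through Python's built-in sqlite3 module (SQLite 3.25 or later) as a stand-in so that no Spark cluster is needed; the same OVER (PARTITION BY ... ORDER BY ...) clause is accepted by Spark SQL. The sales table and its values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, month INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("north", 1, 100.0), ("north", 2, 150.0), ("north", 3, 120.0),
        ("south", 1, 80.0), ("south", 2, 60.0),
    ],
)

# Each window function is evaluated per region (the partition) without
# collapsing rows, unlike GROUP BY.
query = """
SELECT region, month, amount,
       SUM(amount) OVER (PARTITION BY region ORDER BY month) AS running_total,
       RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rank_in_region
FROM sales
ORDER BY region, month
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```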

    3) Analyzing existing benchmarks (15 hours)

    • Comparative analysis of benchmarks for testing data analysis and machine learning methods on tasks ranging from classification and regression to generation.
    • Categorization of common biases in benchmarks: selection bias, negative bias, cross-generalization bias
    • Identifying and correcting dataset-related biased results
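
    One simple form of the selection bias mentioned above is a benchmark whose class proportions differ from those of the population it is meant to represent. The sketch below, which is only one possible correction and not necessarily the method taught in the course, reweights per-class results so that each class counts according to a reference distribution. All labels, counts and the reference shares are hypothetical.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of samples belonging to each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: n / total for cls, n in counts.items()}


def importance_weights(labels, reference_shares):
    """Weight each class by reference share / observed share."""
    observed = class_shares(labels)
    return {cls: reference_shares[cls] / observed[cls] for cls in observed}


# Hypothetical benchmark: 80% cats, 20% dogs, while the assumed
# reference population is balanced.
benchmark = ["cat"] * 80 + ["dog"] * 20
reference = {"cat": 0.5, "dog": 0.5}

# Hypothetical number of correct predictions per class.
correct = {"cat": 76, "dog": 10}
counts = Counter(benchmark)

# Raw accuracy is dominated by the over-represented class.
raw_accuracy = sum(correct.values()) / len(benchmark)

# Reweighted accuracy estimates performance on the reference population.
weights = importance_weights(benchmark, reference)
weighted_accuracy = (
    sum(weights[c] * correct[c] for c in counts)
    / sum(weights[c] * counts[c] for c in counts)
)
print(raw_accuracy, weighted_accuracy)
```

    Here the raw figure hides the weak result on the under-represented class, which the reweighted figure exposes.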

  • Big Data Analytics

    1. Introduction to Business Intelligence and Big Data Analytics (6 hours)

    • Goal and rationale of BI systems
    • The value of knowledge: data-driven decision making
    • The structure and evolution of BI and Big Data analytics systems
    • OLAP vs OLTP
    • Data warehouse and Business intelligence
    • Advanced tools and platforms for BI and analytics

    2. Data models for data warehouse (12 hours)

    • Conceptual modeling
    • Dimensions and facts
    • Multi-dimensional data model
    • Conceptual, logical and physical design
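
    The dimensions-and-facts model above can be sketched as a small star schema: one fact table holding measures, joined to dimension tables that describe them. The example uses Python's built-in sqlite3 module only for convenience; the table names, columns and rows are hypothetical. The final query is an OLAP-style roll-up of the measures by year and product category.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Dimension tables: descriptive attributes used to slice the facts.
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER, quarter INTEGER);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

    -- Fact table: numeric measures plus a foreign key to each dimension.
    CREATE TABLE fact_sales (
        date_id INTEGER REFERENCES dim_date(date_id),
        product_id INTEGER REFERENCES dim_product(product_id),
        quantity INTEGER,
        revenue REAL
    );

    INSERT INTO dim_date VALUES (1, 2019, 4), (2, 2020, 1);
    INSERT INTO dim_product VALUES
        (10, 'Laptop', 'Hardware'),
        (11, 'Mouse', 'Hardware'),
        (12, 'IDE licence', 'Software');
    INSERT INTO fact_sales VALUES
        (1, 10, 2, 2000.0),
        (1, 11, 5, 100.0),
        (2, 10, 1, 1000.0),
        (2, 12, 3, 300.0);
    """
)

# Roll-up: aggregate the measures from individual sales up to (year, category).
rollup = conn.execute(
    """
    SELECT d.year, p.category, SUM(f.quantity), SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d ON d.date_id = f.date_id
    JOIN dim_product p ON p.product_id = f.product_id
    GROUP BY d.year, p.category
    ORDER BY d.year, p.category
    """
).fetchall()
for row in rollup:
    print(row)
```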

    3. BI Architecture (12 hours)

    • ETL (extract, transform and load) functionalities
    • OLAP analysis
    • OLAP queries
    • Reporting
    • Interactive Dashboard

    4. Data Visualization (10 hours)

    • Introduction to Visualization
    • Data transformation into sources of knowledge through visual representation
    • Charts and standard views: relevance, appropriateness and best practices
    • Advanced and innovative tools for data visualization
    • The evaluation of the quality of visualizations

Textbook Information

  • DATA BASE

    1. R. Elmasri and S. Navathe, Fundamentals of Database Systems, 7th Edition, Pearson, 2016.
    2. Denny Lee, Tomasz Drabas, Learning Spark SQL, Packt Publishing, 2017
    3. Instructor’s notes
    4. Research papers (a list will be published on the course page)

  • Big Data Analytics
    1. Ralph Kimball, Margy Ross. The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, 3rd Edition, Wiley, 2013
    2. Instructor’s notes (published on Studium)

Course Planning

DATA BASE
No.  Subjects                                                        Text References
1    Introduction to databases: Concepts and Architecture            Book 1 - Chapters 1 and 2
2    Relational Data Model                                           Book 1 - Chapter 5
3    Basic SQL: data definition, queries, update statements          Book 1 - Chapter 6 + Notes
4    Advanced SQL: Complex Queries, Triggers, Views                  Book 1 - Chapter 7 + Notes
5    Query processing and optimization                               Book 1 - Chapters 18 and 19
6    NOSQL Databases and Big Data Storage Systems                    Book 1 - Chapter 24 + Notes
7    Active, Temporal, Spatial, Multimedia, and Deductive Databases  Book 1 - Chapter 26
8    Getting started with Spark SQL for Data Processing              Book 2 - Chapters 1 and 2 + Notes
9    Spark SQL for Data Exploration                                  Book 2 - Chapter 3 + Notes
10   Spark SQL for Learning Applications                             Book 2 - Chapters 6 and 10 + Notes
11   Multimedia benchmarks for bias identification and analysis      Research paper list on the course page

Learning Assessment

Learning Assessment Procedures

  • DATA BASE

    The final exam consists of a) a lab test assessing the ability to write SQL and NoSQL queries, including with Spark SQL, and b) a final report that critically analyzes an existing dataset in terms of possible biases. The exam is evaluated on the ability to write SQL queries, to derive aggregated information from data, and to correctly identify biases in data and justify solutions that address them.

    The grade for the Data Base module accounts for 50% of the total grade for the entire course.

    The module also includes intermediate tests for students attending the course. These tests comprise a lab test on writing SQL, a presentation, and two written reports analyzing existing benchmarks (a list of which will be given at the beginning of the course). The datasets to analyze will be chosen during classes in order to avoid overlap.

    The grading policy for intermediate tests is:

    • SQL test: 30%

    • Paper presentation: 30%

    • Reports: 30%

    • Attendance and Discussion during classes: 10%

  • Big Data Analytics

    The final exam consists of a) a project assessing the ability to develop a BI system, including the analysis and visualization of relevant information, and b) an oral exam consisting of a discussion of the project.

    Assessment criteria include: depth of analysis; adequacy, quality and correctness of the solutions proposed in the project; ability to justify and critically evaluate the adopted solutions; and clarity.

    The grade for the Big Data Analytics module accounts for 50% of the total grade for the entire course.


Examples of frequently asked questions and / or exercises

  • DATA BASE

    Examples of questions and exercises will be available on the course webpage and on the Studium platform.

  • Big Data Analytics

    Examples of questions and exercises are available on the Studium platform.