- Kenneth Benoit, Department of Methodology.
*Office hours*: Mondays 16:00–17:00, Thursdays 11:00–12:00, COL.8.11

- Milan Vojnovic, Department of Statistics.
*Office hours*: By appointment, COL 2.05A

- Christine Yuen, Department of Statistics.
*Office hours*: Monday 15:00 - 16:00, COL 5.03 (from week 2, computer workshop related questions only)

- Lectures on Tuesdays 10:00–12:00 in CLM.2.02
- Classes on Thursdays 13:00–14:30 in TW2.4.02

No lectures or classes will take place during School Reading Week 6.

Week | Topic | Week | Topic |
---|---|---|---|
1 | Introduction to Data | 7 | Exploratory data analysis |
2 | The shape of data | 8 | Exploratory data analysis (cont’d) |
3 | Creating and managing databases | 9 | Model evaluation |
4 | Using data from the Internet | 10 | Dimensionality reduction |
5 | Working with APIs | 11 | Graph data visualization |
6 | Reading Week | | |

This course will cover the principles of digital methods for storing and structuring data, including data types, relational and non-relational database design, and query languages. Students will learn to build, populate, manipulate and query databases based on datasets relevant to their fields of interest. The course will also cover workflow management for typical data transformation and cleaning projects, frequently the starting point and most time-consuming part of any data science project. This course uses a project-based learning approach towards the study of online publishing and group-based collaboration, essential ingredients of modern data science projects. The coverage of data sharing will include key skills in online publishing, including the elements of web design, the technical elements of web technologies and web programming, as well as the use of revision-control and group collaboration tools such as GitHub. Each student will build one or more interactive websites based on content relevant to his/her domain-related interests, and will use GitHub for accessing and submitting course materials and assignments.

A core objective of this course is to provide students with a well-rounded sense of “data science literacy”, meaning you will become familiar with the structures, terms, protocols, and software that form the core material of data science and applied computing. This is a broad category: it covers abstract concepts such as database normal forms and complex data structures, but also a range of simple tools and formats such as markup languages, web publishing, and working with APIs (application programming interfaces). In the second half of the course, we will focus on communicating results visually by turning data into plots and other visualizations.

On the theory side, we will introduce principles and applications of the electronic storage, structuring, manipulation, transformation, extraction, and dissemination of data. This includes data types, database design, database implementation, and data analysis through structured queries. Through joining operations, we will also cover the challenges of data linkage and how to combine datasets from different sources. We begin by discussing concepts in fundamental data types, and how data is stored and recorded electronically. We will cover database design, especially relational databases, using substantive examples across a variety of fields. Students are introduced to SQL through MySQL, and programming assignments in this unit of the course will be designed to ensure that students learn to create, populate and query an SQL database. We will briefly compare relational databases to other types of database manager, the “NoSQL” types such as MongoDB, including the JSON data format. Students will be encouraged to work with data relevant to their own interests as they learn to create, populate and query data.

On the practical side, we will cover a variety of tools with which every data scientist should be familiar, including revision control tools, web publishing formats, tools and commands for reshaping and recasting data, how to work with different data formats, how to merge and link data, and how to publish a website.

In the data visualisation part of the course, we will cover a variety of principles, tools, and methods for visualizing data.

For the final project, we will provide you with a dataset, which you will be expected to transform in order to produce visualizations.

This course is an introduction to the fundamental concepts of data and data visualization for students and assumes no prior knowledge of these concepts.

The course will involve 20 hours of lectures and 15 hours of computer workshops in the MT.

No prior experience with programming is required.

We will use some tools, notably SQLite, R, and Python, but these will be used in coordination with MY470 (Computer Programming), where their use will be covered more formally. Lectures and assignments will be posted on GitHub. Students are also expected to use GitHub to submit problem sets and the final exam.

Where appropriate, we will use Jupyter notebooks for lab assignments, demonstrations, and the course notes themselves.

Project assignment (60%) and continuous assessment in weeks 3, 6, 8, 10 (10% each). Students will be expected to produce 10 problem sets in the MT.

In the first week, we will introduce the basic concepts of the course, including how data is recorded, stored, and shared. Because the course relies fundamentally on GitHub, a collaborative code and data sharing platform, we will introduce the use of git and GitHub, using the lab session to guide students through setting up an account and subscribing to the course organisation and assignments.

This week will also introduce basic data types, in a language-agnostic manner, from the perspective of machine implementations through to high-level programming languages. We will introduce the notion of databases and database managers, and the client-server model.
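As a small illustration of how a high-level value maps onto a machine representation, the following Python sketch packs one integer into four bytes in both byte orders using the standard `struct` module. The value `1` and the 32-bit width are arbitrary choices made for this example, not part of the course materials.

```python
import struct

value = 1  # arbitrary example integer

# Pack as a 4-byte unsigned integer: ">" is big-endian, "<" is little-endian.
big = struct.pack(">I", value)
little = struct.pack("<I", value)

print(big.hex())     # 00000001
print(little.hex())  # 01000000

# The same bytes decode to a different number if the byte order is misread.
misread = struct.unpack("<I", big)[0]
print(misread)       # 16777216
```

The last line shows why the byte order has to be agreed between the writer and the reader of binary data.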

*Lecture Notes*:

- Administrative overview of the course (see also pdf version)
- Lecture, Week 1 (see also pdf version)
- R example to fix

*Readings*:

- Lake, P. and Crowther, P. 2013. *Concise guide to databases: A Practical Introduction*. London: Springer-Verlag. Chapter 1, Data, an Organizational Asset.
- Goodrich, M.T., Tamassia, R. and Goldwasser, M.H. 2013. *Data structures and algorithms in Python*. John Wiley & Sons Ltd. Ch. 1, through section 1.3.
- Wickham, Hadley. Nd. *Advanced R*, 2nd ed. Ch 1, Introduction, and Chapter 2, Data Structures.
- GitHub Guides, especially: “Understanding the GitHub Flow”, “Hello World”, and “Getting Started with GitHub Pages”.

*Further Readings*:

- “Understanding Big and Little Endian Byte Order”. *Better Explained* website.
- Nelson, Meghan. 2015. “An Intro to Git and GitHub for Beginners (Tutorial).”
- GitHub. “Markdown Syntax” (a cheatsheet).
- Chacon, Scott and Ben Straub. *Pro Git*. 2nd ed. Apress. Chapters 1-2.
- Jim McGlone, “Creating and Hosting a Personal Site on GitHub: A step-by-step beginner’s guide to creating a personal website and blog using Jekyll and hosting it for free using GitHub Pages.”

*Lab*: **Working with git and GitHub**.

- Installing git and setting up an account on GitHub
- How to complete and submit assignments using GitHub Classroom
- Forking and correcting a broken Jupyter notebook
- Cloning a website repository, modifying it, and publishing a personal webpage

This week moves beyond the rectangular format common in statistical datasets, modeled on a spreadsheet, to cover relational structures and the concept of database normalization. We will also cover ways to restructure data from “wide” to “long” format, within strictly rectangular data structures. Additional topics concerning text encoding, date formats, and sparse matrix formats are also covered.
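The “wide” to “long” restructuring can be sketched in a few lines of plain Python. The data, the column names, and the helper `melt` below are hypothetical and chosen only for illustration; they are not taken from the lab materials.

```python
# Hypothetical wide-format data: one row per country, one column per year.
wide = [
    {"country": "A", "2019": 10, "2020": 12},
    {"country": "B", "2019": 7, "2020": 9},
]

def melt(rows, id_var, var_name, value_name):
    """Return one long-format row per (id, variable) pair."""
    long_rows = []
    for row in rows:
        for key, value in row.items():
            if key == id_var:
                continue  # the identifier is repeated on every output row
            long_rows.append({id_var: row[id_var], var_name: key, value_name: value})
    return long_rows

long_data = melt(wide, id_var="country", var_name="year", value_name="value")
for row in long_data:
    print(row)
```

In practice a library function does this work: both pandas and the reshape2 package for R provide a function named `melt` for the same operation.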

*Readings*:

- Wickham, Hadley and Garrett Grolemund. 2017. *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. Sebastopol, CA: O’Reilly. Part II Wrangle, Tibbles, Data Import, Tidy Data (Ch. 7-9 of the print edition).
- The **reshape2** package for R.

*Further Resources*:

- Reshaping data in Python: “Reshaping and Pivot Tables”.
- Robin Linderborg, “Reshaping Data in Python”, 20 Jan 2017.

*Lecture Notes*:

- Lecture, Week 2 (see also pdf version)

*Lab*: **Reshaping data in R**

We will return to database normalization, and how to implement this using good practice in a relational database manager, SQLite. We will cover how to structure data, verify data types, set conditions for data integrity, and perform complex queries to extract data from the database. We will also cover authentication and how to connect to local and remote databases. Finally, for a comparison, we will show a different (non-relational) database model through MongoDB, contrasting this to the relational paradigm.
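A minimal sketch of these ideas, using Python's built-in `sqlite3` module with an in-memory database, is given below. The table names, columns, and rows are invented for illustration and are not the schema used in the lab.

```python
import sqlite3

conn = sqlite3.connect(":memory:")        # in-memory database, nothing written to disk
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

# Two normalized tables: each department is stored once and referenced by id.
conn.executescript("""
CREATE TABLE department (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
CREATE TABLE student (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    year          INTEGER CHECK (year BETWEEN 1 AND 4),
    department_id INTEGER NOT NULL REFERENCES department(id)
);
""")

conn.executemany("INSERT INTO department (id, name) VALUES (?, ?)",
                 [(1, "Methodology"), (2, "Statistics")])
conn.executemany("INSERT INTO student (name, year, department_id) VALUES (?, ?, ?)",
                 [("Ana", 1, 1), ("Ben", 2, 2), ("Cy", 1, 2)])

# A join with aggregation: number of students per department.
counts = conn.execute("""
    SELECT d.name, COUNT(s.id)
    FROM department AS d
    LEFT JOIN student AS s ON s.department_id = d.id
    GROUP BY d.name
    ORDER BY d.name
""").fetchall()
print(counts)

# The foreign-key constraint rejects a student in a non-existent department.
try:
    conn.execute("INSERT INTO student (name, year, department_id) VALUES ('Di', 1, 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)
```

The `NOT NULL`, `UNIQUE`, `CHECK`, and `REFERENCES` clauses are the data-integrity conditions, while the join recombines the normalized tables at query time.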

*Readings*:

- Lake, Peter. *Concise Guide to Databases: A Practical Introduction*. Springer, 2013. Chapters 4-5, Relational Databases and NoSQL databases.
- Nield, Thomas. *Getting Started with SQL: A hands-on approach for beginners*. O’Reilly, 2016. Entire text.

*Further Resources*:

- SQLite documentation.
- Bassett, L. 2015.
*Introduction to JavaScript Object Notation: A to-the-point Guide to JSON*. O’Reilly Media, Inc.

*Lecture Notes*:

- Lecture, Week 3 (see also pdf version)

*Lab*: **Working with a relational database manager**

- To complete this assignment, you will edit the .ipynb file to add your answers, and submit that.
- Here are some additional notes that build on the lecture
- Here are some suggestions for additional resources that could help in answering Exercise 3.

This week covers markup languages, cascading style sheets (CSS), and web protocols for publishing and transmitting data. Continuing from the material covered in the first week's lab session, we will look at HTML, XML, and Markdown, as well as common data formats such as JSON (JavaScript Object Notation). We will cover basic web scraping, to turn web data into text or numbers. We will also cover the client-server model, and how machines and humans transmit data over networks and to and from databases.
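To give a flavour of turning marked-up web data into numbers, here is a minimal sketch that uses only Python's standard `html.parser` module as a stand-in for a scraping library such as Beautiful Soup. The HTML fragment, the place names, and the class `CellCollector` are made up for illustration; a real scraper would first download the page.

```python
from html.parser import HTMLParser

# Hypothetical page fragment with invented values.
html = """
<table>
  <tr><th>City</th><th>Population</th></tr>
  <tr><td>Alpha</td><td>1,200</td></tr>
  <tr><td>Beta</td><td>850</td></tr>
</table>
"""

class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1].append(data.strip())

parser = CellCollector()
parser.feed(html)

# Skip the header row (it has no <td> cells) and convert the counts to integers.
records = [(r[0], int(r[1].replace(",", ""))) for r in parser.rows if r]
print(records)
```

The same two steps, locating elements by tag and then cleaning their text into typed values, carry over to dedicated scraping libraries.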

*Readings*:

- Shay Howe. 2015. *Learn to Code HTML and CSS: Develop and Style Websites*. New Riders. Chs 1-8.
- Beautiful Soup Documentation

*Further Resources*:

- Duckett, Jon. *HTML and CSS: Design and Build Websites*. New York: Wiley, 2011.
- Severance, Charles Russell. *Introduction to Networking: How the Internet Works*. Charles Severance, 2015.
- Vik Paruchuri, “Python Web Scraping Tutorial using BeautifulSoup”, 17 November 2016.
- Justin Yek, “How to scrape websites with Python and BeautifulSoup”, 10 June 2017.

*Lecture Notes*:

- Lecture, Week 4 (see also pdf version)

*Lab*: **Scraping data from the web**

Publicly accessible *application programming interfaces* (APIs) are a common source of “big” data, such as social media data. This data consists of a variety of data types, but is usually transmitted in JSON format. In this session, we will cover the basics of APIs, including authentication and the use of protocols for interacting with APIs, and how to process the data obtained using these methods. We will also discuss common problems in using text, including character encodings, working with Unicode, transforming text into numeric data, and cleaning textual data for analysis.
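The sketch below illustrates two of these points with the standard `json` module: decoding a JSON payload into a Python object, and the difference between counting characters and counting bytes in Unicode text. The payload and its field names are hypothetical and do not follow any particular API.

```python
import json

# Hypothetical API response body; real APIs differ in their field names.
payload = '{"user": "example", "text": "Na\\u00efve caf\\u00e9 \\u2615", "retweets": 3}'

record = json.loads(payload)         # JSON text -> Python dict
text = record["text"]

n_chars = len(text)                  # number of Unicode code points
n_bytes = len(text.encode("utf-8"))  # number of bytes once encoded as UTF-8
tokens = text.lower().split()        # a crude first step towards numeric features

print(n_chars, n_bytes, tokens)
```

The byte count exceeds the character count because the accented letters and the symbol each need more than one byte in UTF-8, which is one reason encodings must be handled explicitly when cleaning text.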

*Readings*:

- Cooksey, Brian. *An Introduction to APIs*. Zapier, 2014.
- **python-twitter** documentation

*Further Resources*:

- Documentation on the Twitter REST API
- The **twitteR** package for R
- Richard Ishida. 2015. “Character encodings for beginners”. W3C.

*Lecture Notes*:

- Lecture, Week 5 (see also pdf version)

*Lab*: **Working with social media data: Twitter**

- Download Twitter data using Twitter’s REST APIs
- Clean and process the data
- Normalize the data and store it
- Perform basic analysis of the text and non-textual data.

We will introduce the basic statistical plots that are commonly used in exploratory data analysis. We will first consider standard plots for univariate data analysis, including histograms and empirical distribution functions, as well as plots of summary statistics such as boxplots and violin plots. We will then consider different variants of bar plots, which are commonly used for comparison of parallel batches of data, as well as scatter plots for exploration of correlation patterns in data.

*Readings*:

- M. Friendly, A Brief History of Data Visualization, Handbook of Computational Statistics: Data Visualization (Editors C. Chen, W. Hardle and A. Unwin), Vol III, Springer-Verlag, 2006
- E. R. Tufte, The Visual Display of Quantitative Information, Second Edition, Graphics Press, 2001
- J. W. Tukey, Exploratory Data Analysis, Pearson, 1977
- Matplotlib
- Seaborn: statistical data visualization

*Lab*: **Matplotlib primer and basic statistical plots**

- Basic plotting using Matplotlib and Seaborn libraries
- GitHub archive dataset exploratory data analysis
- Class 7 solution
- Notebook cleaning the github json data

We will consider how to visualize matrix data such as covariance and other similarity matrices, and adjacency matrices of graphs such as those representing social networks. The key here is to use a suitable ordering of matrix rows and columns to reveal any clustering structure that may exist. We will explain the underlying methods based on the spectral theory of matrices, using the concepts of matrix eigenvectors and clustering based on matrix eigenvectors. In particular, we will explain the method based on *seriation* using the so-called Fiedler eigenvector and *spectral co-clustering* based on using eigenvectors in combination with the k-means clustering method.
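A minimal sketch of seriation with the Fiedler eigenvector, assuming NumPy is available, is shown below. The 6-node similarity matrix is synthetic and was constructed for this example so that the two clusters are interleaved in the original ordering.

```python
import numpy as np

# Synthetic similarity matrix (an assumption for illustration): nodes 0, 2, 4
# form one cluster and nodes 1, 3, 5 another, so the natural order hides them.
cluster = np.array([0, 1, 0, 1, 0, 1])
W = np.where(cluster[:, None] == cluster[None, :], 1.0, 0.1)
np.fill_diagonal(W, 0.0)

# Graph Laplacian L = D - W, where D holds the row sums of W on its diagonal.
L = np.diag(W.sum(axis=1)) - W

# eigh returns eigenvalues in ascending order; the eigenvector belonging to
# the second-smallest eigenvalue is the Fiedler vector.
eigenvalues, eigenvectors = np.linalg.eigh(L)
fiedler = eigenvectors[:, 1]

# Seriation: reorder rows and columns by the entries of the Fiedler vector.
order = np.argsort(fiedler)
W_sorted = W[np.ix_(order, order)]
print(order)
print(W_sorted)
```

After reordering, the strong similarities sit in two diagonal blocks, which is the structure a heat map of `W_sorted` would make visible.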

*Readings*:

- L. Wilkinson and M. Friendly, History Corner: The History of the Cluster Heat Map, The American Statistician, Vol 63, No 2, May 2009
- I. S. Dhillon, Co-clustering documents and words using bipartite spectral graph partitioning, Proc. of ACM KDD, 2001
- Scikit-learn documentation, Section 2.4: Biclustering

*Lab*: **Statistical plots using Matplotlib and Seaborn**

- Synthetic matrix data visualization using seriation method
- Visualization of adjacency matrices derived from GitHub archive dataset
- Using sklearn.cluster.bicluster

In this week, we will introduce standard statistical plots for evaluating the performance of statistical models and machine learning algorithms for classification. For binary classifiers, these include *receiver operating characteristic* (ROC) and *precision-recall* (PR) curves. We will learn how to interpret these plots and discuss their advantages and limitations.

We will also discuss various standard metrics used for assessing the performance of binary classifiers, such as *accuracy*, *area under the curve* (AUC) and *Gini coefficient*, discuss their relation to the ROC curve, as well as their advantages and limitations.
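The relation between these metrics and the ROC curve can be sketched in plain Python. The labels and scores below are invented, the helper names are hypothetical, and the code assumes binary labels with no tied scores; the lab itself uses `sklearn.metrics`, which offers `roc_curve` and `roc_auc_score` for the same purpose.

```python
def roc_points(labels, scores):
    """ROC points obtained by sweeping the threshold over the sorted scores.

    Assumes binary labels (1 = positive) and no tied scores.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        # (false positive rate, true positive rate)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = auc(points)
gini = 2 * area - 1  # Gini coefficient as a rescaling of the AUC
accuracy = sum((s >= 0.5) == bool(y) for s, y in zip(scores, labels)) / len(labels)
print(area, gini, accuracy)
```

Accuracy depends on the single threshold chosen (0.5 here), whereas the AUC and the Gini coefficient summarise the whole curve across all thresholds.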

*Readings*:

- J. A. Swets, R. M. Dawes and J. Monahan, Better Decisions through Science, Scientific American, October 2000, pp 82-87
- T. Fawcett, An Introduction to ROC Analysis, Pattern Recognition Letters, Vol 27, pp 861-874, 2006
- N. Japkowicz and M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011
- API reference: sklearn.metrics

*Lab*: **Evaluating classifiers using sklearn.metrics**

- Comparing binary classifiers in ROC and PR space
- Comparison of ROC and PR curves
- Accuracy, AUC and other metrics

We will consider how to visualize hidden structures in high-dimensional data, such as hidden clusters or embedded low-dimensional manifolds, by using dimensionality reduction methods. We will explain the underlying principles of dimensionality reduction methods such as multidimensional scaling, locally linear embedding, isomap, spectral embedding, and stochastic neighbor embedding. We will see how geometry, linear algebra and optimisation methods give rise to different dimensionality reduction methods.

Our focus will be on the dimensionality reduction methods that are commonly used in practice and widely available through software libraries such as sklearn.manifold. We will also consider modern tools for visualizing different dimensionality reductions such as Google embedding projector.
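As one concrete instance of how linear algebra leads to a dimensionality reduction method, the following sketch implements classical (Torgerson) multidimensional scaling with NumPy. The four points are an arbitrary example, and this is a swapped-in textbook variant: library implementations such as those in sklearn.manifold may use different algorithms.

```python
import numpy as np

# Four points on a plane (arbitrary example); only their distances are used below.
X = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

# Double-centre the squared distances: B = -1/2 * J * D^2 * J, with J = I - 11^T / n.
n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
B = -0.5 * J @ (D ** 2) @ J

# Keep the k largest eigenpairs; coordinates are eigenvectors scaled by sqrt(eigenvalue).
k = 2
eigenvalues, eigenvectors = np.linalg.eigh(B)  # eigenvalues in ascending order
top = np.argsort(eigenvalues)[::-1][:k]
Y = eigenvectors[:, top] * np.sqrt(eigenvalues[top])

# The embedding reproduces the original distances up to rotation and reflection.
D_embedded = np.linalg.norm(Y[:, None, :] - Y[None, :, :], axis=-1)
print(np.allclose(D, D_embedded))
```

Because the input here is already two-dimensional, a two-dimensional embedding recovers the distances exactly; with genuinely high-dimensional data the discarded eigenvalues measure how much structure is lost.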

*Readings*:

- T. F. Cox and M. A. A. Cox, Multidimensional Scaling, Second Edition, Chapman & Hall / CRC, 2000
- I. Borg and P. J. F. Groenen, Modern Multidimensional Scaling: Theory and Applications, Second Edition, Springer, 2005
- A. Geron, Hands-on Machine Learning with Scikit-Learn & TensorFlow, O’Reilly, 2017, Chapter 8, Dimensionality Reduction
- Google’s embedding projector
- API reference, scikit-learn, Section 2.2: manifold learning

*Lab*: **Dimensionality reduction using sklearn.manifold**

- Dimensionality reduction plots using different methods
- Understanding the meaning of various input parameters
- Understanding the sensitivity to the input parameter values

In the last week, we will consider basic methods for visualization of graph data such as visualizing social network relationships. We will consider different graph layouts and the principles of how they are computed. This will involve methods based on simple principles for drawing graphs that have a tree structure as well as more sophisticated methods based on spectral theory of linear algebra and dynamical systems for general graphs.

*Readings*:

- A. Hagberg, D. Schult and P. Swart, NetworkX Reference
- NetworkX: Software for complex networks, https://networkx.github.io/
- Graphviz – Graph Visualisation Software, especially manual pages, layout commands

*Lab*: **Graph drawing using NetworkX**

- Loading and manipulating graphs using NetworkX
- Changing basic properties of graph visualization such as node or edge colors
- Drawing graphs using different layouts
- Using graphviz graph layouts