Overview

Companies and organizations collect massive amounts of data, and the task of a data scientist is to extract actionable knowledge from it: for scientific needs, to improve public health, to promote businesses, for social studies, and for various other purposes. This course focuses on the practical aspects of the field and aims to provide a comprehensive set of tools for extracting knowledge from data.

Location and Time

Bloomberg Center Room 131
Meets: MW 4:45pm - 6:00pm

Contact

Instructors
  • Prof. Giri Iyengar
  • gi43@cornell.edu
  • office hours: tba
Teaching Assistants
  • Andrew Drozdov
  • apd64@cornell.edu
  • office hours: tba
Discussion

Syllabus

# Date Topic Assignment
1 Wed Jan 24, 2018 Introduction to Data Science -
2 Mon Jan 29, 2018 Data Modeling: Supervised Learning Methods Assign 0 Due
3 Wed Jan 31, 2018 Data Modeling: Unsupervised Learning Methods -
4 Mon Feb 5, 2018 ETL, Feature Engineering, Bootstrapping, Sampling Assign 1 Due
5 Wed Feb 7, 2018 Case Study: Scaling Machine Learning in Ad Tech -
6 Mon Feb 12, 2018 Deep Learning -
7 Wed Feb 14, 2018 Deep Learning Assign 2 Due
8 Mon Feb 19, 2018 NO CLASS -
9 Wed Feb 21, 2018 NLP and Knowledge Bases -
10 Mon Feb 26, 2018 NLP and Knowledge Bases Assign 3 Due
11 Wed Feb 28, 2018 NLP and Knowledge Bases -
12 Mon Mar 5, 2018 Recommendation Systems pt 1 -
13 Mon Mar 5, 2018 Recommendation Systems pt 2 -
14 Wed Mar 7, 2018 Class cancelled due to weather Assign 4 Due
15 Mon Mar 12, 2018 Recommendation Systems pt 3 -
16 Wed Mar 14, 2018 Social Network Analysis -
17 Mon Mar 19, 2018 Social Network Analysis Assign 5 Due
18 Wed Mar 21, 2018 CLASS CANCELLED -
19 Mon Mar 26, 2018 Data Visualization Project Part 0 Due
20 Wed Mar 28, 2018 Computer Vision and Fashion Mining -
21 Mon Apr 2, 2018 NO CLASS -
22 Wed Apr 4, 2018 NO CLASS -
23 Mon Apr 9, 2018 Datalogue - Company Presentation Assign 6 Due
24 Wed Apr 11, 2018 Deeper Look at Bootstrap -
25 Mon Apr 16, 2018 Map Reduce and Streaming Calculations Project Part 1 Due
26 Wed Apr 18, 2018 Map Reduce and Streaming Calculations -
27 Mon Apr 23, 2018 Big Data Tools -
28 Wed Apr 25, 2018 Big Data Tools Project Part 2 Due
29 Mon Apr 30, 2018 Time Series and Practical Considerations -
30 Wed May 2, 2018 Privacy, Ethics of Data Science, Course Summary -
31 Mon May 7, 2018 Final Projects in Class -
32 Wed May 9, 2018 Final Projects in Class Final Project Due

Summary of Topics

These descriptions cover only a brief, non-exhaustive list of topics in the course.

Machine Learning in Health Care.
We will cover various topics in the health care space including ML for wellness (such as detecting depression through social media), analyzing doctors' notes, image classification in the context of disease, and the general collection and use of medical data.
Related links: https://mlhc17mit.github.io/ https://arxiv.org/abs/1705.09585 http://techtalks.tv/talks/how-can-nlp-help-cure-cancer/62223/
Machine Learning and Security.
Machine learning models are ubiquitous in the systems we interact with every day. We'll discuss techniques to update parameters and perform inference in a secure way. In addition, we'll talk about adversarial attacks that fool models.
Related links: https://arxiv.org/pdf/1709.02753.pdf https://www.kaggle.com/c/nips-2017-defense-against-adversarial-attack
Topics in NLP and Information Retrieval.
There are many interesting tasks at the intersection of text processing and machine learning. For instance, knowledge base construction, learning to query, natural language inference, question answering, and neural machine translation are a handful of tasks that have seen increased attention in the research community in recent years.
Related links: http://www.cs.cornell.edu/courses/cs6741/2017fa/
Topics in Computer Vision and Fashion Mining.
Fashion is fashionable in the ML community. MNIST has recently been refreshed with Fashion MNIST. The Deep Fashion dataset contains 800k annotated images with a handful of associated tasks. Work in this area also benefits from a wealth of existing research in computer vision.
Related links: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html https://canvas.instructure.com/courses/904706 https://www.kaggle.com/c/imagenet-object-detection-challenge

Summary of Assignments

Assign 0
Hello world. Set up a programming environment and complete simple programming exercises.
Assign 1
Linear models, nearest neighbors, clustering, dimensionality reduction.
Assign 2
Feature Engineering and ETL.
Assign 3
Fine-tuning Convolutional Neural Networks.
Assign 4
Sentiment Analysis using Deep Learning.
Assign 5
Collaborative filtering with unstructured data.
Assign 6
Data Visualization using tSNE.
Project
For the final project, students will work in teams to implement a novel system or analyze an existing model using a dataset they deem appropriate.
Part 0. Decide on team / topic / advisor.
Part 1. Implement a baseline (optionally collect a dataset if one does not exist). Put together a writeup with background on the topic of interest.
Part 2. Implement a more complex system (either to beat your baseline or for a separate but related task).
Part 3. Final writeup. Make any code / writing revisions that were previously recommended.
By the end of the project, students should have a GitHub repo with their code, a clear and presentable writeup, and strong background knowledge in their chosen topic.

Grading Scheme

Top Level Category Grade Percentage
Assignments 60%
Project 40%
Assignments Category Percentage
Assign 0 10%
Assign 1 15%
Assign 2 15%
Assign 3 15%
Assign 4 15%
Assign 5 15%
Assign 6 15%
Project Category Percentage
Part 0 10%
Part 1 30%
Part 2 30%
Final Part 30%
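
As a rough illustration of how the percentages above combine, here is a minimal Python sketch. The weights are copied from the tables; the function names, the dictionary keys, and the 0-100 score scale are assumptions made for the example. It does not model the late or dropped-homework policies described below, since the syllabus does not say how weights are redistributed when a score is dropped.

```python
# Weights copied from the grading tables above, as fractions of 1.
TOP_LEVEL = {"assignments": 0.60, "project": 0.40}
ASSIGNMENT_WEIGHTS = {
    "assign0": 0.10,
    "assign1": 0.15,
    "assign2": 0.15,
    "assign3": 0.15,
    "assign4": 0.15,
    "assign5": 0.15,
    "assign6": 0.15,
}
PROJECT_WEIGHTS = {"part0": 0.10, "part1": 0.30, "part2": 0.30, "final": 0.30}


def weighted_average(scores, weights):
    """Combine scores (assumed to be on a 0-100 scale) using the given weights."""
    return sum(weights[name] * scores[name] for name in weights)


def course_grade(assignment_scores, project_scores):
    """Hypothetical helper: overall course grade on a 0-100 scale."""
    assignments = weighted_average(assignment_scores, ASSIGNMENT_WEIGHTS)
    project = weighted_average(project_scores, PROJECT_WEIGHTS)
    return TOP_LEVEL["assignments"] * assignments + TOP_LEVEL["project"] * project
```

For example, a student who averages 80 on every assignment and 90 on every project part would land at roughly 0.6 * 80 + 0.4 * 90 = 84 under these assumptions.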

Course Requirements and Additional Grading Policy

Late homework. Each student will have 3 "slip" days per assignment with no penalty, then 20% will be deducted per day.
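
One possible reading of the late policy, sketched in Python. The syllabus does not specify several details, so the following are assumptions: scores are on a 0-100 scale, "20%" means 20 percentage points of the full score per day, lateness is counted in whole days, and the score cannot drop below zero. The function and constant names are hypothetical.

```python
FREE_SLIP_DAYS = 3       # per assignment, from the policy above
PENALTY_PER_DAY = 20.0   # assumption: percentage points of the full score


def late_adjusted_score(raw_score, days_late):
    """Apply the late penalty to a raw score on an assumed 0-100 scale."""
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    # Only days beyond the free slip days are penalized.
    penalized_days = max(0, days_late - FREE_SLIP_DAYS)
    # Assumption: the adjusted score is floored at zero.
    return max(0.0, raw_score - PENALTY_PER_DAY * penalized_days)
```

Under these assumptions, a submission scored at 90 keeps its full score when up to 3 days late and would be recorded as 50 when 5 days late.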

Dropped homework. There are 6 assignments and an assignment 0 to set up your programming environment. We will drop the lowest score among assignments 1-6.

Homework collaboration. You are encouraged (but not required) to work in groups of no more than 2 students on each assignment. Please indicate the name of your collaborator at the top of each assignment and cite any references you used (including articles, books, code, websites, and personal communications). If you’re not sure whether to cite a source, err on the side of caution and cite it. Each student should submit their own writeup. Remember not to plagiarize: you must write the solutions yourself.

Project collaboration. tba

Attendance. Some homework assignments may require information that was only shared in class (not in the online slides). In addition, there may be surprise quizzes to further encourage attendance.

Statement about students with disabilities. Your access in this course is important. Please give me (Giri Iyengar) or one of the TAs your Student Disability Services (SDS) accommodation letter early in the semester so that we have adequate time to arrange your approved academic accommodations. If you need an immediate accommodation for equal access, please speak with me after class or send an email message to me and/or SDS at sds_cu@cornell.edu. If the need arises for additional accommodations during the semester, please contact SDS. You may also feel free to speak with Student Services at Cornell Tech, who will connect you with the university SDS office.

Academic integrity. Each student in this course is expected to abide by the Cornell University Code of Academic Integrity. Any work submitted by a student in this course for academic credit will be the student's own work. You are encouraged to study together and to discuss information and concepts covered in lecture and the sections with other students. You can give "consulting" help to or receive "consulting" help from such students. However, this permissible cooperation should never involve one student having possession of a copy of all or part of work done by someone else, in the form of an e-mail, an e-mail attachment file, a diskette, or a hard copy. Should copying occur, the student who copied work from another student and the student who gave material to be copied will both automatically receive a zero for the assignment. Penalty for violation of this Code can also be extended to include failure of the course and University disciplinary action.