Course readings

Textbook for general introduction to social data science:

Textbooks for data science in Python

Session 1a & 1b: Introduction to SDS and Python

We introduce the field and provide an overview of course logistics. We follow up with a review of basic Python.

Required readings

Introduction to Social Data Science

Python programming

Inspirational reading

If you’re interested in delving deeper into coding and programming (you certainly don’t have to; it is not required for this course), we highly recommend the following posts:

A broad, early, and easy-to-read take on data-driven (social) science:

Session 2: Reproducible research

We introduce Git for handling and sharing your code as well as Markdown for writing. These two tools have the potential to greatly enhance your productivity.

Required reading

Session 3: Strings, queries and APIs

We start to leverage our Python knowledge to make queries on the web. This allows us to pull data directly from Statistics Denmark's API.
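To give a flavor of what this looks like: Statistics Denmark's StatBank API serves tables over plain HTTP, so a query is essentially just a URL. The sketch below builds such a URL with the standard library; treat the table id FOLK1A and the variable Tid as illustrative choices, and note that the actual download (e.g. with requests) is left as a comment.

```python
from urllib.parse import urlencode

# Base endpoint of the StatBank data API.
BASE = "https://api.statbank.dk/v1/data"

def build_query_url(table_id, fmt="CSV", **variables):
    """Build the GET URL that selects `variables` from a StatBank table."""
    query = urlencode(variables)
    return f"{BASE}/{table_id}/{fmt}" + (f"?{query}" if query else "")

# Illustrative: one quarter of the FOLK1A population table as CSV.
url = build_query_url("FOLK1A", "CSV", Tid="2020K1")

# The request itself would then be:
#   import requests
#   response = requests.get(url)   # response.text holds the CSV payload
```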

Required reading

Session 4: Data structuring 1

We learn basic data processing with the Python modules pandas and numpy. This includes file input/output, arithmetic, slicing data, etc.
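As a taste of what the session covers, here is a minimal sketch (with made-up numbers) of the kind of pandas operations involved:

```python
import pandas as pd

# A small DataFrame standing in for a file read with pd.read_csv(...).
df = pd.DataFrame({"municipality": ["Copenhagen", "Aarhus", "Odense"],
                   "population": [644_000, 285_000, 180_000]})

# Column arithmetic is vectorized via numpy under the hood.
df["pop_thousands"] = df["population"] / 1_000

# Label-based and position-based slicing.
largest = df.loc[df["population"] > 200_000, "municipality"]  # boolean mask
first_row = df.iloc[0]                                        # by position
```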

Required reading

  • PDA: chapter 5, sections 4.1, 6.1, 6.2.

Inspirational reading and additional material

There are many good resources for learning to master data structuring. Below are two options for self-study:

Session 5: Intro to visualization

We introduce visualizations in Python. We use pandas and seaborn. Both these modules are built on the fundamental and flexible plotting module matplotlib.
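A minimal example of the plotting stack in action (the numbers are made up; the Agg backend is selected only so the script runs without a display):

```python
import matplotlib
matplotlib.use("Agg")          # non-interactive backend: no window opens
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"year": [2015, 2016, 2017, 2018],
                   "users": [10, 25, 60, 130]})

# pandas' .plot is a thin wrapper around matplotlib's Axes API;
# seaborn builds on the same foundation.
ax = df.plot(x="year", y="users", kind="line", title="Toy growth curve")
ax.set_ylabel("users (thousands)")
ax.get_figure().savefig("growth.png")   # write the figure to disk
```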

Required reading

Inspirational reading

Session 6: Data structuring 2

We learn about missing data, data transformation, categorical data and temporal data.
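A quick sketch, on made-up data, of the three topics side by side:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"city": ["Aalborg", "Aalborg", "Esbjerg"],
                   "temp": [2.0, np.nan, 5.0],
                   "day": ["2020-01-01", "2020-01-02", "2020-01-01"]})

# Missing data: inspect and fill (or drop with .dropna()).
n_missing = df["temp"].isna().sum()
df["temp"] = df["temp"].fillna(df["temp"].mean())

# Categorical data: a memory-efficient dtype with fixed categories.
df["city"] = df["city"].astype("category")

# Temporal data: parse strings into datetime64 and extract components.
df["day"] = pd.to_datetime(df["day"])
df["weekday"] = df["day"].dt.day_name()
```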

Required reading

  • PDA: chapter 7 and sections 11.1, 11.2, 12.1.
  • PML: chapter 4, section 'Handling categorical data'.

Session 7: Data structuring 3

We learn two powerful tools in data structuring: combining different data sets, and the split-apply-combine framework, implemented as groupby in pandas.
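Both tools in one small sketch, on made-up data:

```python
import pandas as pd

sales = pd.DataFrame({"store": ["A", "A", "B", "B"],
                      "amount": [10, 20, 5, 15]})
stores = pd.DataFrame({"store": ["A", "B"],
                       "region": ["East", "West"]})

# Combine the two data sets on a shared key.
merged = pd.merge(sales, stores, on="store", how="left")

# Split-apply-combine: split by region, apply sum, combine the results.
totals = merged.groupby("region")["amount"].sum()
```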

Required reading

Session 8: Scraping 1 - Data Collection

We learn to create and collect datasets from the web. This means interacting with APIs and webpages, and extracting information from unstructured webpages.
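The core idea, sketched with only the standard library: in a real scrape you would first download the page (e.g. with requests.get(url).text, often parsing it with BeautifulSoup); here a hardcoded snippet stands in for the download.

```python
from html.parser import HTMLParser

PAGE = """
<html><body>
  <a href="/article/1">First article</a>
  <a href="/article/2">Second article</a>
</body></html>
"""

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href")

parser = LinkExtractor()
parser.feed(PAGE)   # parser.links now holds the extracted URLs
```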

Required readings

Inspirational reading

Below are some interesting academic papers using data scraped from online sources that might provide inspiration for your exam project.

Session 9: Scraping 2 - Parsing

Here we develop our skills in parsing and pattern extraction using regular expressions. This is a fundamental data science skill that goes beyond web scraping alone.
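A small taste of regular expressions in Python, using made-up text:

```python
import re

text = "Contact us at info@example.com or support@example.org (open 9-17)."

# Character classes and quantifiers describe the shape of an email address.
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)

# Named groups split a match into its parts.
m = re.search(r"(?P<user>[\w.+-]+)@(?P<domain>[\w.-]+)", text)
```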

Required readings

Session 10: Ethics and Big Data Intro

Required readings

Ethics

Big data

Inspirational readings

Session 11: Machine learning intro

We introduce basic machine learning concepts. We start with simple machine learning models for classification problems.
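PML implements these models with scikit-learn; as a complement, here is a from-scratch numpy sketch of one of them, logistic regression, trained with gradient descent on made-up, linearly separable data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, n_iter=10_000):
    """Logistic regression trained with batch gradient descent."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)            # predicted class probabilities
        w -= lr * X.T @ (p - y) / len(y)  # gradient of the average log-loss
        b -= lr * np.mean(p - y)
    return w, b

# Toy data: label 1 for the larger feature values.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
```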

Required readings

  • PML: chapters 1 and 2, and the following sections from chapter 3:
    • Modeling class probabilities via logistic regression
    • Maximum margin classification with support vector machines

Session 12: Supervised learning 1

We explain the overfitting problem in modelling and show that one possible solution is regularization of standard linear models.
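To make the idea concrete, here is a numpy sketch of ridge (L2-regularized) regression using its closed-form solution; the data are made up, and exact so that plain OLS recovers the true weights while the penalty shrinks them:

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)

# Exact linear data: y = 2*x1 + 3*x2.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = X @ np.array([2.0, 3.0])

w_ols = ridge_fit(X, y, alpha=0.0)     # alpha=0 is ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # penalty shrinks weights toward zero
```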

Required readings

  • PML: chapter 3, the following sections:
    • Tackling overfitting via regularization
    • Partitioning a dataset into separate training and test sets
  • PML: chapter 4, the following sections:
    • Bringing features onto the same scale
    • Selecting meaningful features
  • PML: chapter 10, the following sections:
    • Introducing linear regression
    • Implementing an ordinary least squares linear regression model
    • Evaluating the performance of linear regression models
    • Using regularized methods for regression
    • Turning a linear regression model into a curve – polynomial regression

Inspirational readings

  • Kleinberg, J., Ludwig, J., Mullainathan, S. and Obermeyer, Z., 2015. "Prediction policy problems." American Economic Review, 105(5), pp.491-95.

Session 13: Supervised learning 2

We introduce cross-validation to gauge overfitting. We introduce principal component analysis for dimensionality reduction.
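The mechanics of k-fold cross-validation are simple enough to sketch by hand (in practice you would use scikit-learn's KFold); each observation lands in the test fold exactly once:

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)     # shuffle before splitting
    folds = np.array_split(idx, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

splits = list(kfold_indices(10, k=5))
```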

Required readings

  • PML: chapter 6 and the following section from chapter 5:
    • Unsupervised dimensionality reduction via principal component analysis

Session 14: Supervised learning 3

We learn about two new classes of machine learning models: neighbor-based models and tree-based methods. These two non-parametric model classes are easy to apply and extremely useful.
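The neighbor-based idea fits in a few lines of numpy (scikit-learn's KNeighborsClassifier is the production version); the toy data below are made up:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest neighbors."""
    dists = np.linalg.norm(X_train - x_new, axis=1)  # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of k closest
    return np.bincount(y_train[nearest]).argmax()    # majority label

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
label = knn_predict(X_train, y_train, np.array([0.95, 1.0]), k=3)
```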

Required readings

Session 15: Text data 1

We introduce the concept of Text as Data, and apply our newly acquired knowledge of supervised learning to a text classification problem.
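The central representation is the bag-of-words model: each document becomes a vector of word counts, which can then feed a classifier such as the logistic regression from earlier sessions. A hand-rolled sketch on made-up documents (scikit-learn's CountVectorizer does this in practice):

```python
import numpy as np

docs = ["good great movie", "bad terrible movie", "great acting good fun"]

# Build the vocabulary: one column per unique word.
vocab = sorted({w for doc in docs for w in doc.split()})
word_to_col = {w: j for j, w in enumerate(vocab)}

# Bag-of-words matrix: count how often each word appears in each document.
X = np.zeros((len(docs), len(vocab)), dtype=int)
for i, doc in enumerate(docs):
    for w in doc.split():
        X[i, word_to_col[w]] += 1
```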

Required readings

  • PML: following sections from chapter 8:

    • Preparing the IMDb movie review data for text processing
    • Introducing the bag-of-words model
    • Training a logistic regression model for document classification
    • Working with bigger data – online algorithms and out-of-core learning
  • Gentzkow, M., Kelly, B.T. and Taddy, M., 2017. "Text as data" (No. w23276). National Bureau of Economic Research.

  • Grimmer, Justin, and Brandon M. Stewart. 2013. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political Analysis, 21.3: 267-297.

Inspirational readings

Session 16: Cluster analysis

We introduce basic concepts in unsupervised learning and cluster analysis. Cluster analysis and unsupervised methods in general are useful for data exploration, for feature engineering, and as a heuristic measurement tool when labeled data is expensive.
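The workhorse algorithm, k-means, is simple enough to sketch from scratch on made-up data (PML covers the scikit-learn version; real implementations add random restarts, e.g. k-means++):

```python
import numpy as np

def kmeans(X, k, n_iter=20):
    """A minimal k-means: alternate assignment and centroid update."""
    centroids = X[:k].copy()   # naive init: the first k points
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.9, 5.1]])
labels, centroids = kmeans(X, k=2)
```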

Required readings

  • PML: chapter 11

Session 17: Text data 2

We introduce simple techniques for sentiment analysis of text that can be used off-the-shelf, and apply our knowledge of cluster analysis on a text categorization problem.
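The simplest off-the-shelf approach is lexicon-based: sum precomputed word polarities. Real tools (e.g. VADER in nltk) ship full lexicons; the tiny hand-made lexicon below is only an illustration of the idea.

```python
import re

# Hypothetical mini-lexicon: word -> polarity score.
LEXICON = {"good": 1, "great": 2, "fun": 1, "bad": -1, "terrible": -2}

def sentiment_score(text):
    """Sum the polarity of every lexicon word; the sign gives the label."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)

pos = sentiment_score("A great movie, good fun")
neg = sentiment_score("a terrible, bad film")
```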

Required readings

Session 18: Tools for Big Data

We run through some examples of how to do parallel processing in Python and get a short introduction to MapReduce.
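The MapReduce pattern can be sketched with a word count: a map step that emits (key, value) pairs, and a shuffle-and-reduce step that aggregates them by key. For clarity the map step below runs sequentially, but since it is embarrassingly parallel it could equally be handed to multiprocessing.Pool.map.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Shuffle + reduce: group pairs by key and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big ideas", "small data"]
# Parallel alternative: multiprocessing.Pool().map(map_phase, docs)
mapped = map(map_phase, docs)
word_counts = reduce_phase(chain.from_iterable(mapped))
```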

Required readings

Miscellaneous

Other data types

In our follow-up course, Topics in Social Data Science, we teach advanced tools for text data as well as new data types, including spatial data and network data. If you are interested in working with these data types, consider taking that course.

Python vs R

Some students may have noticed that the 2015 and 2016 editions of this course used R for data science; we have now opted for Python. Two of the main reasons: Python has a simpler, more flexible syntax, which makes it easier to learn and well suited for structuring data; and Python has more extensive support for machine learning models and Big Data applications. See a thorough discussion of the advantages of each language here:

If you are already familiar with R and prefer to work in R for the exercises, you may look at Kosuke Imai's book on Quantitative Social Science and look up earlier years' references here. Note that all assignments and the exam must be completed in Python.