Social Data Science 2018

A summer school at University of Copenhagen. The course is offered by the Center for Social Data Science.

Course information

NB! Students are required to have completed two tutorials at DataCamp before the first day of class - more information below under “Preparation”.

The objective of this course is to learn how to analyze, gather and work with modern quantitative social science data. Increasingly, social data - data that capture how people behave and interact with each other - is available online in new, challenging forms and formats. This opens up the possibility of gathering large amounts of interesting data, to investigate existing theories and new phenomena, provided that the analyst has sufficient computer literacy, while at the same time being aware of the promises and pitfalls of working with various types of data.

The aim of this course is fourfold:

  1. We will introduce students to the state-of-the-art social science literature using computational methods and social data.
  2. We will present students with an overview of key benefits and challenges of working with different kinds of social data. We will show how various kinds of data (survey, web-based, experimental, administrative, etc.) can be used to answer different questions within the social sciences. Furthermore, we will discuss ethical challenges related to the use of different types of data.
  3. We will introduce students to statistical techniques for prediction and classification, known as machine learning, and we will discuss how these methods relate to existing empirical tools within economics such as causal inference and regression.
  4. We will present modern data science methods needed for working with computational social science and social data in practice. Being an effective economist and data scientist means spending large fractions of our time writing and debugging code. In this section you will learn how to write code that will clean, transform, scrape, merge, visualize and analyze social data.

The course will consist of two weeks of teaching and one week of work on the final exam project. Each day is divided into two teaching sessions: a morning session from 9 to 12 and an afternoon session from 13 to 16. Most teaching sessions contain an equal mix of lectures and exercises.

The lectures will focus on the broad topics covered in the course (parts 1-3 listed above). In the exercise classes we will get our hands dirty and present data science methods needed for collecting and analyzing real-world data. In addition to core computational concepts, these classes will focus on the following topics:

  1. Generating data: We will teach how to “scrape” websites, i.e. find and collect data from them, and how to work with APIs.
  2. Data manipulation tools: Participants will learn how to import, transform, munge and merge data from various sources.
  3. Visualization tools: We will learn best practices for visualizing data in different steps of a data analysis. Participants will learn how to visualize raw data as well as effective tools for communicating results from statistical models for broader audiences.
  4. Reproducibility tools: Participants will learn how to use version control and social coding using GitHub and how to effectively communicate the insights of an analysis using markdown.
  5. Prediction tools: We will cover key implementations of machine learning algorithms and participants will learn how to apply and interpret these models in practice.
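
To make the first two topics concrete, here is a minimal sketch of collecting data from HTML and merging it with a second source. It uses only the Python standard library and is not the course's own material. The HTML snippet, the town names, the numbers and the helper names (`CellParser`, `scrape_rows`, `merge_on_key`) are all made up for illustration; in practice the page would be downloaded from a website and the tools used in class may differ.

```python
from html.parser import HTMLParser

# Made-up page content standing in for a downloaded web page.
SAMPLE_HTML = """
<table>
  <tr><td>Town A</td><td>100</td></tr>
  <tr><td>Town B</td><td>250</td></tr>
</table>
"""


class CellParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])      # start a new row
        elif tag == "td":
            self._in_cell = True      # text that follows belongs to a cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Keep text only when it sits inside a cell of an open row.
        if self._in_cell and self.rows:
            self.rows[-1].append(data.strip())


def scrape_rows(html):
    """Return the table cells in `html` as a list of rows."""
    parser = CellParser()
    parser.feed(html)
    return parser.rows


def merge_on_key(left, right):
    """Inner join: keep only keys present in both dictionaries."""
    return {key: (left[key], right[key]) for key in left if key in right}


rows = scrape_rows(SAMPLE_HTML)
population = {name: int(count) for name, count in rows}

# A second, equally made-up data source keyed on the same town names.
region = {"Town A": "North", "Town C": "South"}

merged = merge_on_key(population, region)
print(merged)
```

Only towns that appear in both sources survive the merge, which is the usual behaviour of an inner join and a common source of silently dropped observations.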

Note that an average of three hours of exercises per day is not a large amount of time for learning how to code. We will use some of this time like development meetings: going over assignments, having detailed code reviews of various forms, and discussing blocking issues and potential solutions.

Academic interest in data handling skills is growing. This implies increased demand for skills needed to effectively gather, handle, and analyze data as well as present results to a range of audiences. Therefore, this course will provide you with important tools for future academic study. Furthermore, the skills taught in this course are also widely used in business. Python programming skills in particular are highly valued in fields such as data science, finance and information technology. As this course is focused on general skills for working with social science data such as gathering and visualization, it is equally relevant for students seeking careers outside academia where skills such as the ability to effectively communicate the results of an analysis are in high demand.

This course assumes no knowledge of any particular software or computer program, but while we will try to demystify the technological side of things so students feel comfortable getting started and thinking like a data scientist, this will be a technical course, and students should expect to spend a significant amount of time learning these tools. Because the course builds on a wide range of techniques, we do not have any hard requirements to sign up for the course, but students are expected to have an interest in some subset of: statistics, econometrics, linear algebra, and a scripting language (we will use Python in this course).

Teaching material and code will be distributed and collected via Git, hosted on our course GitHub repo.

Preparation

Before class begins on August 13 all students are required to prepare. We expect every one of you to have completed the following tutorials before the first day of class:

You should use the invitation that will be sent out through the course page or write Andreas an email to sign up for the class site on DataCamp. This will make it easier for us to spot where you have difficulties. Note that each tutorial is expected to take four hours.

We also expect you to bring a laptop with software installed, see this post.

The path forward

We are wrapping up this foundational course in social data science. We hope that you learned a lot. We want to present some paths to pursue going forward. The following could be of interest: Topics in Social Data Science (the latest version is here) is an obvious choice, where we go into more depth with text data and machine learning. We also learn about spatial and network data. We expect to offer a seminar in the spring about using machine learning in econometrics. [Read More]

Peergrade and A2

Some of you have had problems submitting to peergrade. Hopefully these problems are fixed now. If not, a workaround has been proposed here:

https://github.com/abjer/sds/issues/19

The assignment is still due at 23:59 today.

If you still encounter problems handing in, please submit a GitHub issue on the repository.

Assignments (Updated!)

Assignment 2 is now available on GitHub here. It is due 23:59, Friday, August 24, 2018. Assignment 1 is now available on the GitHub repo here. The hand-in will be on peergrade.io and the deadline for handing in is Sunday before midnight Danish time. More info on accessing peergrade will follow on the intranet by Saturday noon. The assignment must be handed in as a group. The format of the assignment must be a Jupyter Notebook file (i. [Read More]

Exam project

When the teaching in our course finishes you are to hand in a project. The formal requirements are found here. In this blog post we want to expand a little bit on what a good project looks like. The focus of the project is to pose an interesting question to data that can be collected publicly and attempt to answer it. Some good projects use machine learning for modelling a variable of interest, while some good projects use machine learning to parse text data. [Read More]

Useful Jupyter tips

This post provides a short intro on picking up useful Jupyter hacks. We begin with a short overview of the most essential Jupyter shortcuts and skills. If you have not read the introduction to Jupyter you should do that immediately, see here. Keyboard shortcuts. Editing and executing cells: enter edit mode by clicking inside the cell or pressing ENTER; exit edit mode by clicking outside the cell or pressing ESC. [Read More]

Installation

This class will involve a lot of coding, for which you will need some basic tools. Please make sure to set up the following tools before the first day of class: Python and Jupyter Notebook, a Git client, a GitHub account, and a text editor (optional). We will discuss these tools in much more detail in class, so don’t worry if this is all new and perhaps a bit frightening right now. [Read More]