Textbook for general introduction to social data science:
- Bit by Bit: Social Research in the Digital Age by Matthew J. Salganik [BBB], free online version here.
Textbooks for data science in Python
- Python for Data Analysis, 2nd ed. (2017) by Wes McKinney [PDA]
- Python Machine Learning, 2nd ed. (2017) by Sebastian Raschka & Vahid Mirjalili [PML]
Session 1a & 1b: Introduction to SDS and Python
We introduce the field and provide an overview of course logistics. We follow up with a review of basic Python.
Introduction to Social Data Science
BBB: chapter 1
Grimmer, Justin. "We are all social scientists now: how big data, machine learning, and causal inference work together." PS: Political Science & Politics 48.1 (2015): 80-83.
PDA: chapters 2, 3
DataCamp 2016, ‘Jupyter tutorial’, available here
If you are interested and want to delve deeper into coding and programming (this is entirely optional and not required for this course), we highly recommend the following posts:
A broad, early, and easy-to-read idea of data driven (social) science:
- Anderson, Chris. 2008. "The end of theory: The data deluge makes the scientific method obsolete." Wired, 16-07.
Session 2: Reproducible research
We introduce Git for handling and sharing your code as well as Markdown for writing. These two tools have the potential to greatly enhance your productivity.
- Nolan, John. 2015. "How to Write Faster, Better & Longer: The Ultimate Guide to Markdown."
- Jones, Zachery. 2015. "Git & Github tutorial".
- Note that an optional followup can be found in Rainey, Carlisle. 2015. "Git for Political Science".
Session 3: Strings, queries and APIs
We start to leverage our Python knowledge to make queries on the web. This allows us to pull data directly from Statistics Denmark's API.
- PDA: sections 2.3 pp. 39-43, 3.3, 6.1 pp. 178-180, 6.3 and 7.3 pp. 211-213
- Gazarov, Petr. 2016. "What is an API? In English, please."
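To give a feel for what a web query involves, here is a minimal sketch using only the standard library. The base URL, table name, parameter names and the JSON payload are all hypothetical placeholders chosen for illustration; they are not the actual Statistics Denmark API, whose endpoints and response format should be looked up in its own documentation.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only (not a real API).
BASE_URL = "https://api.example.org/v1/data"


def build_query(table, **params):
    """Return a full request URL for a table with the given query parameters."""
    return f"{BASE_URL}/{table}?{urlencode(params)}"


def parse_response(text):
    """Turn a JSON response body into a list of (year, value) tuples."""
    payload = json.loads(text)
    return [(row["year"], row["value"]) for row in payload["rows"]]


url = build_query("POPULATION", area="all", year=2017)

# Made-up response body standing in for what a server might return.
sample_body = '{"rows": [{"year": 2016, "value": 10}, {"year": 2017, "value": 12}]}'
rows = parse_response(sample_body)
```

In practice the URL would be passed to an HTTP client and the returned text handed to a parser like `parse_response`; the string handling is the same either way.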
Session 4: Data structuring 1
- PDA: chapter 5, sections 4.1, 6.1, 6.2.
Inspirational reading and additional material
- Lohr, Steve. 2014. "For Big-Data Scientists, ‘Janitor Work’ Is Key Hurdle to Insights"
There are many good resources for learning how to master data structuring. See below for two ways of self-learning:
- Rada, Greg. 2013. "Intro to pandas data structures"- read all three sections.
- DataCamp offers further smaller courses on Pandas and data structuring
Session 5: Intro to visualization
- PDA: chapter 9
- Moffitt, Chris. 2017. "Effectively Using Matplotlib"
- Read sections 1-3 in: Wickham, Hadley. 2010. "A Layered Grammar of Graphics". Journal of Computational and Graphical Statistics, Volume 19, Number 1, Pages 3–28.
- Schwabish, Jonathan A. 2014. "An Economist's Guide to Visualizing Data". Journal of Economic Perspectives, 28(1): 209-34.
- Healy, Kieran and James Moody. 2014. "Data Visualization in Sociology". Annual Review of Sociology, 40:105–128.
- Cherdarchuk, Joey. 2013. "Data Looks Better Naked", blog post.
Session 6: Data structuring 2
We learn about missing data, data transformation, categorical data and temporal data.
- PDA: chapter 7 and sections 11.1, 11.2, 12.1.
- PML: chapter 4, section 'Handling categorical data'.
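As a rough illustration of the three topics in this session, the sketch below fills in a missing value, encodes a categorical column and parses dates with pandas. The dataset and column names are made up, and mean imputation is just one of several possible choices.

```python
import pandas as pd

# Small made-up dataset; all column names and values are illustrative.
df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-02", "2018-01-03"],
    "income": [100.0, None, 300.0],
    "region": ["north", "south", "north"],
})

# Missing data: replace the missing income with the column mean.
df["income"] = df["income"].fillna(df["income"].mean())

# Categorical data: store region as a category and one-hot encode it.
df["region"] = df["region"].astype("category")
dummies = pd.get_dummies(df["region"], prefix="region")

# Temporal data: parse strings into timestamps and extract the weekday name.
df["date"] = pd.to_datetime(df["date"])
df["weekday"] = df["date"].dt.day_name()
```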
Session 7: Data structuring 3
We learn two powerful tools in data structuring: combining different data sets and the split-apply-combine framework, which is called groupby in pandas.
- PDA: chapters 8 and 10.
- Wickham, Hadley. 2011. “The Split-Apply-Combine Strategy for Data Analysis”. Journal of Statistical Software 40(1).
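The two tools from this session can be sketched in a few lines of pandas. The tables, names and values below are invented for illustration: we first combine two data sets on a shared key and then split by city, apply a mean and combine the results.

```python
import pandas as pd

# Two made-up tables that share the key column "city_id".
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat"],
    "city_id": [1, 1, 2],
    "income": [10, 30, 50],
})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Aarhus", "Odense"]})

# Combine: attach the city name to every person (left join on city_id).
merged = people.merge(cities, on="city_id", how="left")

# Split-apply-combine: split by city, apply the mean, combine into one Series.
mean_income = merged.groupby("city")["income"].mean()
```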
Session 8: Scraping 1 - Data Collection
We learn to create and collect datasets from the web. This means interacting with APIs and extracting information from unstructured webpages.
- Chapter 2: "Working with Web Data and APIs." in Big Data and Social Science: A Practical Guide to Methods and Tools edited by Ian Foster, Rayid Ghani, Ron S. Jarmin, Frauke Kreuter, and Julia Lane. (copies will be provided).
- Shiab, Nael. 2015. "On the Ethics of Web Scraping and Data Journalism". Global Investigative Journalism Network.
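As a small taste of extracting information from a webpage, the sketch below collects all link targets from an HTML string with the standard library's `html.parser`. The page content and the `LinkParser` class are made up for illustration, and the course material may well rely on other scraping libraries.

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


# Made-up page standing in for HTML downloaded from the web.
page = '<html><body><a href="/a.html">A</a><p>text</p><a href="/b.html">B</a></body></html>'

parser = LinkParser()
parser.feed(page)
links = parser.links
```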
Below are some interesting academic papers using data scraped from online sources that might provide inspiration for your exam project.
Stephens-Davidowitz, Seth. 2014. "The cost of racial animus on a black candidate: Evidence using Google search data." Journal of Public Economics, 118: 26-40.
Stephens-Davidowitz, Seth, Hal Varian, and Michael D. Smith. 2016. "Super Returns to Super Bowl Ads?". R & R, Journal of Political Economy.
Stephens-Davidowitz, Seth, and Hal Varian. 2015 "A Hands-on Guide to Google Data." Google working paper.
Barberá, Pablo. 2015. "Birds of the same feather tweet together: Bayesian ideal point estimation using Twitter data." Political Analysis, 23.1: 76-91.
Cavallo, A. (2018). "Scraped data and sticky prices". Review of Economics and Statistics, 100(1).
Bond, Robert M., et al. 2012. "A 61-million-person experiment in social influence and political mobilization." Nature, 489.7415: 295-298.
Session 9: Scraping 2 - Parsing
Here we develop our skills in parsing and pattern extraction using regular expressions. This is a fundamental data science skill that goes beyond web scraping alone.
- Chapter 2. Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft)
- Introduction to pattern matching using regex: "An introduction to regex in Python", blog post.
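A minimal sketch of pattern extraction with Python's `re` module follows. The example text is made up and both patterns are deliberately simplified; real e-mail addresses and dates need more careful expressions.

```python
import re

text = "Contact anna@example.org or bo.hansen@example.com before 2018-08-15."

# Simplified patterns, for illustration only.
email_pattern = re.compile(r"[\w.]+@[\w.]+\.\w+")
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

# findall returns every non-overlapping match as a string.
emails = email_pattern.findall(text)

# Groups in parentheses let us pull out the parts of the first date found.
year, month, day = date_pattern.search(text).groups()
```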
Session 10: Ethics and Big Data Intro
- BBB: chapter 6
- Vox.com "The Cambridge Analytica Facebook scandal."
- material on GDPR, TBA
- BBB: chapter 2 + sections 3.1, 3.2
- Lazer, David and Jason Radford. 2017. "Data ex Machina. Introduction to Big Data." Annual Review of Sociology vol 43, August.
- Einav and Levin: Economics in the Age of Big Data. Science. 2013. Link.
- Edelman, Benjamin. 2012. "Using internet data for economic research." The Journal of Economic Perspectives, 26.2: 189-206.
- Jesse Singal. 2015. "The Case of the Amazing Gay-Marriage Data: How a Graduate Student Reluctantly Uncovered a Huge Scientific Fraud." New York Magazine.
- Athey, Susan. 2017. "Beyond prediction: Using big data for policy problems". Science
- Christine L. Borgman. Provocations, What Are Data and Data Scholarship in the Social Science. Chapters 1,2 and 6 in Big Data, Little Data, No Data. MIT Press 2015. (copies will be provided).
Session 11: Machine learning intro
We introduce basic machine learning concepts. We start with the simple machine learning models for classification problems.
- PML: chapters 1, 2 and the following sections from chapter 3:
- Modeling class probabilities via logistic regression
- Maximum margin classification with support vector machines
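To make the idea of a simple classifier concrete, here is a sketch of the perceptron learning rule in plain Python. We assume labels coded as -1 and 1 and use a tiny made-up dataset (the logical AND of two inputs); the function names and the learning rate are our own illustrative choices.

```python
def net_input(w, b, x):
    """Weighted sum of the inputs plus the bias term."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def predict(w, b, x):
    """Classify as 1 if the net input is non-negative, else -1."""
    return 1 if net_input(w, b, x) >= 0 else -1


def train_perceptron(X, y, eta=0.5, epochs=20):
    """Perceptron rule: nudge the weights whenever a sample is misclassified."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            update = eta * (target - predict(w, b, xi))  # zero if correct
            w = [wj + update * xj for wj, xj in zip(w, xi)]
            b += update
    return w, b


# Made-up, linearly separable data: the class is 1 only when both inputs are 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [-1, -1, -1, 1]

w, b = train_perceptron(X, y)
predictions = [predict(w, b, x) for x in X]
```

Because the data are linearly separable, the rule settles on weights that classify all four points correctly after a handful of passes.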
Session 12: Supervised learning 1
We explain the problem of overfitting in modelling. We show that one possible solution is to regularize standard linear models.
- PML: chapter 3, the following sections:
- Tackling overfitting via regularization
- Partitioning a dataset into separate training and test sets
- PML: chapter 4, the following sections:
- Bringing features onto the same scale
- Selecting meaningful features
- PML: chapter 10, the following sections:
- Introducing linear regression
- Implementing an ordinary least squares linear regression model
- Evaluating the performance of linear regression models
- Using regularized methods for regression
- Turning a linear regression model into a curve – polynomial regression
- Kleinberg, J., Ludwig, J., Mullainathan, S. and Obermeyer, Z., 2015. "Prediction policy problems." American Economic Review, 105(5), pp.491-95.
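The effect of regularization can be sketched with the closed-form solution for ridge (L2-penalized) regression in NumPy. The data are simulated, the model has no intercept, and the penalty value is an arbitrary choice for illustration.

```python
import numpy as np


def ridge_fit(X, y, alpha):
    """Closed-form ridge estimate without intercept: solve (X'X + alpha*I) w = X'y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


# Simulated data with a few true zero coefficients.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
true_w = np.array([2.0, 0.0, -1.0, 0.0, 0.5])
y = X @ true_w + rng.normal(scale=0.5, size=50)

w_ols = ridge_fit(X, y, alpha=0.0)     # no penalty: ordinary least squares
w_ridge = ridge_fit(X, y, alpha=10.0)  # the penalty shrinks the coefficients
```

Comparing `np.linalg.norm(w_ridge)` with `np.linalg.norm(w_ols)` shows the shrinkage: the penalized coefficient vector is always shorter.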
Session 13: Supervised learning 2
We introduce cross validation to gauge overfitting. We introduce principal component analysis for dimensionality reduction.
- PML: chapter 6 and the following section from chapter 5:
- Unsupervised dimensionality reduction via principal component analysis
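Cross validation can be sketched by hand in NumPy: split the sample into k folds, fit on k-1 of them and evaluate on the one left out. The helper names and the simulated data are our own; the folds here are taken in order without shuffling, which is a simplification.

```python
import numpy as np


def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k (train, test) pairs of index arrays."""
    folds = np.array_split(np.arange(n_samples), k)
    return [(np.concatenate(folds[:i] + folds[i + 1:]), folds[i]) for i in range(k)]


def cross_val_mse(X, y, k=5):
    """Average out-of-fold mean squared error of a least squares fit."""
    scores = []
    for train, test in k_fold_indices(len(y), k):
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        scores.append(np.mean((X[test] @ coef - y[test]) ** 2))
    return float(np.mean(scores))


# Simulated linear data with a small amount of noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

mse = cross_val_mse(X, y, k=5)
```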
Session 14: Supervised learning 3
We learn about two new classes of machine learning models: neighbor-based models and tree-based methods. These two non-parametric classes of models are easy to apply and extremely useful.
PML: following sections in chapters 3 and 4:
- K-nearest neighbors – a lazy learning algorithm
- Decision tree learning
- Assessing feature importance with random forests
Mullainathan, Sendhil, and Jann Spiess. 2017. "Machine Learning: An Applied Econometric Approach." Journal of Economic Perspectives, 31 (2): 87-106.
Varian, Hal. 2012 Big Data: New Tricks for Econometrics
Athey, Susan. 2018. The Impact of Machine Learning on Economics NBER
BBB: sections 4.1, 4.2
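The neighbor-based idea from this session fits in a few lines of plain Python: to classify a new point, find the k closest training points and take a majority vote. The data and the function name are made up for illustration, and `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k training points nearest to x."""
    # Sort (distance, label) pairs so the nearest neighbours come first.
    distances = sorted((math.dist(xi, x), label) for xi, label in zip(train_X, train_y))
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up training data with two well-separated groups.
train_X = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]

near_origin = knn_predict(train_X, train_y, (0.5, 0.5))
far_away = knn_predict(train_X, train_y, (5.5, 5.5))
```

Nothing is estimated in advance; all the work happens at prediction time, which is why the method is called a lazy learner.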
Session 15: Text data 1
We introduce the concept of Text as Data, and apply our newly acquired knowledge of supervised learning to a text classification problem.
PML: following sections from chapter 8:
- Preparing the IMDb movie review data for text processing
- Introducing the bag-of-words model
- Training a logistic regression model for document classification
- Working with bigger data – online algorithms and out-of-core learning
Gentzkow, M., Kelly, B.T. and Taddy, M., 2017. "Text as data" (No. w23276). National Bureau of Economic Research.
Grimmer, Justin, and Brandon M. Stewart. 2013. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political Analysis, 21.3: 267-297.
King, G., Pan, J., & Roberts, M. E. 2013. How censorship in China allows government criticism but silences collective expression. American Political Science Review, 107(02), 326-343.
Andrea Ceron, Luigi Curini, Stefano M. Iacus. "Using Sentiment Analysis to Monitor Electoral Campaigns: Method Matters—Evidence From the United States and Italy".
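The bag-of-words representation used in this session can be sketched with the standard library alone. The three documents are made up and tokenization is done by naive whitespace splitting, which real applications would refine.

```python
from collections import Counter

# Made-up mini corpus.
docs = ["the movie was great", "the movie was bad", "great acting great plot"]

# The vocabulary is the sorted set of all words seen in the corpus.
vocabulary = sorted({word for doc in docs for word in doc.split()})


def bag_of_words(doc):
    """Return the word counts of doc as a vector aligned with the vocabulary."""
    counts = Counter(doc.split())
    return [counts[word] for word in vocabulary]


vectors = [bag_of_words(doc) for doc in docs]
```

Each document becomes a fixed-length count vector, which is the form a classifier such as logistic regression expects as input.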
Session 16: Cluster analysis
We introduce basic concepts in unsupervised learning and cluster analysis. Cluster analysis and unsupervised methods in general are useful for exploring your data, for feature engineering, and as a heuristic measurement tool when labeled data are expensive.
- PML: chapter 11
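As one concrete instance of cluster analysis, here is a sketch of the k-means algorithm in NumPy. The data are made up, and the centers are initialized at the first k rows purely to keep the example deterministic; practical implementations use random or smarter initialization.

```python
import numpy as np


def k_means(X, k, n_iter=10):
    """Lloyd's algorithm: alternate between assigning points and moving centers."""
    centers = X[:k].copy()  # simplification: start from the first k points
    for _ in range(n_iter):
        # Distance from every point to every center, shape (n_samples, k).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its points (keep it if it has none).
        centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers


# Made-up data with two well-separated groups.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = k_means(X, k=2)
```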
Session 17: Text data 2
We introduce simple techniques for sentiment analysis of text that can be used off-the-shelf, and apply our knowledge of cluster analysis on a text categorization problem.
- PML: following part in chapter 8:
- Topic modeling with Latent Dirichlet Allocation
- Chapter 18. Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft)
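The simplest form of sentiment analysis counts words from a positive and a negative word list. The sketch below uses two tiny made-up lexicons; off-the-shelf tools ship with much larger, validated word lists and handle negation and intensity, which this example ignores.

```python
# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}


def sentiment_score(text):
    """Number of positive words minus number of negative words in text."""
    words = text.lower().split()
    return sum((word in POSITIVE) - (word in NEGATIVE) for word in words)


positive_review = sentiment_score("I love this great movie")
negative_review = sentiment_score("awful plot and bad acting")
neutral_review = sentiment_score("it was a movie")
```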
Session 18: Tools for Big Data
We run through some examples of parallel processing in Python and give a short introduction to MapReduce.
- Sebastian Raschka. 2014. "An introduction to parallel programming using Python's multiprocessing module"
- Grus, Joel. Data science from scratch: first principles with python. Chapter 24. "O'Reilly Media, Inc.", 2015.
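The MapReduce idea can be sketched as a word count: a map step counts words in each chunk independently, and a reduce step merges the partial counts. The function names and the text chunks are made up; the map step is passed in as an argument so that the same code runs serially with the built-in `map` or in parallel with `multiprocessing.Pool.map`.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_words(chunk):
    """Map step: count the words in one chunk of text."""
    return Counter(chunk.split())


def merge_counts(left, right):
    """Reduce step: add two partial counts together."""
    return left + right


def word_count(chunks, map_fn=map):
    """Apply the map step with map_fn, then reduce the partial results."""
    return reduce(merge_counts, map_fn(count_words, chunks), Counter())


# Made-up chunks standing in for pieces of a large text collection.
chunks = ["a b a", "b c", "a c c"]

if __name__ == "__main__":
    # Run the map step in parallel across two worker processes.
    with Pool(processes=2) as pool:
        print(word_count(chunks, map_fn=pool.map))
```

Because the map step treats every chunk independently, the chunks can be spread over as many processes or machines as are available; only the reduce step needs to see all partial results.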
Other data types
In our follow-up course, Topics in Social Data Science, we teach advanced tools for text data and new data types. The new data types include spatial data and network data. If you are interested in working with
Python vs R
Some students may have noticed that the course used R for data science in 2015 and 2016; we have now opted for Python. Two of the main reasons are that Python has a simpler syntax and more flexible applications, making it easier to learn and better suited to structuring data. In addition, Python has more extensive support for machine learning models as well as Big Data applications. See a thorough discussion of the relative advantages of the two languages here:
If you are already familiar with R and prefer to work with R in the exercises, you may look at Kosuke Imai's book on Quantitative Social Science and look up earlier years' references here. Note that all assignments and the exam are required to be in Python.