Module 2 What’s data science?
2.1 Before class #1
Required external reading for this module: What’s data science? (4,660 words, approx. 20 minutes of reading time)
Watch the YouTube video Angry Hiring Manager Panel from 10:18 to 16:48 (6.5 minutes) and list the skills they mention as important to have in a data science position.
Fill out Survey 1 (10 min)
2.2 What’s data science?
Data science is among the fields in highest demand, and that demand is expected to keep growing over the next decade (Kross et al. 2020; Hadavand, Gooding, and Leek 2018). Interestingly, the title "data scientist" was only invented in 2008, and by 2019 the median base salary for a data scientist in the United States had surpassed $100,000 (Robinson and Nolis 2020).
CHALLENGE
Based on your own experience and on your reading for this module, in your groups discuss the following question:
- What is data science?
2.3 What does a data scientist do?
Data science is an interdisciplinary field, and as such data scientists hold jobs that require a broad range of skills, from statistics to communication. A quick search for data science jobs reveals a long list of skills. However, no single data scientist has all the skills listed across different data science jobs. Instead, each data scientist specializes in a subset of them (Robinson and Nolis 2020).
CHALLENGE
Make a list of the skills mentioned in data science job announcements and in the video you just watched. Based on this list, discuss the following questions in your group:
Which skills do you already have? At what level of proficiency?
Which skills are you interested in developing further?
Based on the skills you already have, and the skills you want to acquire, what type of job in data science would you be interested in?
2.4 Data Science Workflow
The basic data science workflow involves five main steps:
1. The Question: Form the question you want to answer. Many times you will be given a question, and you have to “translate” it so you can answer it with your data analysis.
2. Data Acquisition: data file, database, or web API
3. Data Wrangling: import + tidy data + transform (Grolemund and Wickham 2018)
4. Data Exploration: transform + visualize + model + repeat (Grolemund and Wickham 2018)
5. Results Communication: visualize + write + knit (Grolemund and Wickham 2018)
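To make the workflow steps concrete, here is a minimal end-to-end sketch. It is written in plain Python purely for illustration (the course itself works in R), and it is one possible reading of the steps rather than a prescribed method. The question, the inline CSV string standing in for a data source, and all variable names are assumptions made for this example; the tuition values are copied from the tuition table shown later in this module.

```python
import csv
import io
from statistics import mean

# 1. The Question (hypothetical): which state in this sample has the lowest mean tuition?

# 2. Data Acquisition: an inline CSV string stands in for a data file, database, or web API.
raw = """State,School Year,Average Tuition
Nevada,2004-05,3621.392
Nevada,2005-06,3687.290
Florida,2004-05,3848.201
Florida,2005-06,3924.234
"""

# 3. Data Wrangling: import the text and convert tuition from text to numbers.
rows = [
    {"state": r["State"], "year": r["School Year"], "tuition": float(r["Average Tuition"])}
    for r in csv.DictReader(io.StringIO(raw))
]

# 4. Data Exploration: group the tuition values by state and summarize them.
by_state = {}
for row in rows:
    by_state.setdefault(row["state"], []).append(row["tuition"])
averages = {state: mean(values) for state, values in by_state.items()}

# 5. Results Communication: report the answer as a sentence.
lowest = min(averages, key=averages.get)
print(f"{lowest} has the lowest mean tuition in this sample: {averages[lowest]:.2f}")
```

In a real project each step is usually far larger than a few lines, and steps 3 and 4 are typically repeated several times before anything is communicated.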
CHALLENGE
In your groups, based on your own intuition and experience, and on the Introduction of the R for Data Science book (Grolemund and Wickham 2018), summarize what each of the following steps means:
Tidy
Transform
Visualize
Model
Communicate
We will cover the entire data science workflow in this course, though not necessarily every step listed and not in this order. We start with step 3 (Data Wrangling) and step 4 (Data Exploration) before we address step 2 (Data Acquisition) and step 5 (Results Communication).
CHALLENGE
Go back to the list of skills and job positions we discussed (based on the reading and the video):
- Which steps in the data science workflow correspond to the job skills we talked about?
2.5 Before class #2
Please fill out Survey 1 (10 min).
Reading: Data Science examples (1,033 words, 8 min)
Reading: Data Intake (1,686 words, 12 min)
2.6 What’s data?
CHALLENGE
In your small group, discuss the examples provided in the excerpt from “Executive Data Science” (Caffo, Peng, and Leek 2016).
Is data science about “data”? Why or why not?
Why did Netflix end up not implementing the best solution from the Netflix prize challenge?
What data was used in each of the examples provided in the reading?
What is data? (come up with a definition).
Here are some examples of what data might look like:
- Structured data (rare):
State | School Year | Average Tuition |
---|---|---|
Nevada | 2004-05 | 3621.392 |
Nevada | 2005-06 | 3687.290 |
Florida | 2004-05 | 3848.201 |
Florida | 2007-08 | 3879.416 |
Florida | 2006-07 | 3887.656 |
Florida | 2005-06 | 3924.234 |
Wyoming | 2008-09 | 3928.671 |
Wyoming | 2007-08 | 4071.898 |
Wyoming | 2004-05 | 4086.351 |
Wyoming | 2006-07 | 4122.205 |
CHALLENGE
Which of the columns (or variables) in the data frame above are categorical, and which are quantitative?
- Structured, but messy data (more common):
State | 2004-05 | 2005-06 | 2006-07 | 2007-08 | 2008-09 | 2009-10 | 2010-11 | 2011-12 | 2012-13 | 2013-14 | 2014-15 | 2015-16 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Alabama | 5682.838 | 5840.550 | 5753.496 | 6008.169 | 6475.092 | 7188.954 | 8071.134 | 8451.902 | 9098.069 | 9358.929 | 9496.084 | 9751.101 |
Alaska | 4328.281 | 4632.623 | 4918.501 | 5069.822 | 5075.482 | 5454.607 | 5759.153 | 5762.421 | 6026.143 | 6012.445 | 6148.808 | 6571.340 |
Arizona | 5138.495 | 5415.516 | 5481.419 | 5681.638 | 6058.464 | 7263.204 | 8839.605 | 9966.716 | 10133.503 | 10296.200 | 10413.844 | 10646.278 |
Arkansas | 5772.302 | 6082.379 | 6231.977 | 6414.900 | 6416.503 | 6627.092 | 6900.912 | 7028.991 | 7286.580 | 7408.495 | 7606.410 | 7867.297 |
California | 5285.921 | 5527.881 | 5334.826 | 5672.472 | 5897.888 | 7258.771 | 8193.739 | 9436.426 | 9360.574 | 9274.193 | 9186.824 | 9269.844 |
Colorado | 4703.777 | 5406.967 | 5596.348 | 6227.002 | 6284.137 | 6948.473 | 7748.201 | 8315.632 | 8792.856 | 9292.954 | 9298.599 | 9748.188 |
Connecticut | 7983.695 | 8249.074 | 8367.549 | 8677.702 | 8720.976 | 9371.019 | 9827.013 | 9736.431 | 10036.627 | 10453.110 | 10663.995 | 11397.337 |
Delaware | 8352.890 | 8610.597 | 8681.846 | 8945.801 | 8995.473 | 9987.183 | 10534.181 | 11026.241 | 11362.690 | 11502.524 | 11514.660 | 11676.216 |
Florida | 3848.201 | 3924.234 | 3887.656 | 3879.416 | 4150.004 | 4783.032 | 5510.659 | 5940.945 | 6494.901 | 6451.664 | 6345.000 | 6360.159 |
Georgia | 4298.040 | 4492.167 | 4584.268 | 4790.266 | 4831.365 | 5549.913 | 6428.007 | 7709.284 | 7853.257 | 7992.390 | 8063.014 | 8446.961 |
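The tuition table above is "messy" because each school year sits in its own column, whereas the first example had one row per state and school year. The following minimal sketch shows how such a wide table could be reshaped into that long layout. It uses plain Python for illustration (the course itself works in R); the function name `wide_to_long` and the two-state, two-year sample are assumptions made for this example, with values copied from the table above.

```python
def wide_to_long(rows, id_column, name_column, value_column):
    """Turn one row per state into one row per (state, school year) pair."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every output row instead
            long_rows.append({
                id_column: row[id_column],
                name_column: column,
                value_column: value,
            })
    return long_rows


# A small subset of the wide table: one column per school year.
wide = [
    {"State": "Alabama", "2004-05": 5682.838, "2005-06": 5840.550},
    {"State": "Alaska", "2004-05": 4328.281, "2005-06": 4632.623},
]

tidy = wide_to_long(wide, "State", "School Year", "Average Tuition")
for record in tidy:
    print(record)
```

Each of the two input rows becomes two output rows, so the school year turns from a column name into a value that can be filtered, grouped, or plotted.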
user_id | screen_name | text | reply_to_screen_name |
---|---|---|---|
6.331283e+07 | blagogirl | (???) The illiterate calling Iran out? 80 million bounty on Trumps head? | realTuckFrumper |
6.331283e+07 | blagogirl | (???) Iran does NOT fear Trump. They realize what OUR country is dealing with. “The White House is inflicted with mental retardation” | JonHutson |
1.125104e+18 | dl_kirkwood | I’m afraid 11 soldiers had to be shipped out from the Iran hit after all with traumatic brain injuries. Seems the Military does not notify homeland unless a soldiers is shipped out for the injury. So, Trump did not know for a week. https://t.co/HdBNbKClBl | NA |
2.820552e+07 | djbarro | (???) Are you going to carry a sign supporting the women in Iran brave enough to remove their hijabs and go to prison? | GloriaAllred |
1.506314e+08 | kizu91 | US…Special…Representative…Hold…Press…Briefing…Situation in…Iran…Video…first…week…January…saw…drastic…spike…tensions…Washington…Tehran…President…Donald Trump…order…assassination…elite…Quds…Force…commander…Qasem…Soleimani…Iraq | NA |
1.506314e+08 | kizu91 | crash…land…collide…plane…aircraft…all…176…people…on board…Iran…missile…attack…US…base…Iraq…rocket…Western…Sahara…Suriname…Colombia…Dominica…Australia…Anguilla…Guadeloupe…Uruguay…Cyprus…Namibia…Brazil…Paraguay…Denmark…55 | NA |
1.506314e+08 | kizu91 | Iran…MP…Urge…Gov’t…Expel…UK…Envoy…Consider…Downgrading…Diplomatic…Ties…Alleged…Meddling…envoy…Robert Macaire…detained…days…ago…alleged…participation…unsanctioned…protest…Tehran…down…Ukraine…Boeing…737…release…15…minutes | NA |
1.506314e+08 | kizu91 | Government…Supporter…Gather…Tehran…13….Friday…Prayer…Video…Iran…gather…rally…commemorate…kill…fatal…crash…land…collide…Ukraine…Boeing…plane…aircraft…shot…down…missile…rocket…January…Imam…Khomeini…International…Airport…16 | NA |
1.506314e+08 | kizu91 | British…Treasury…Expand…Hezbollah…Asset…Freeze…UK…government…approved…measure…follow…heat…conflict…United States…Islamic…Republic…Iran…Trump…Administration…target…assassination…high-profile…military…general…early…January…film | NA |
7.297365e+17 | SwmpladySH | Hackers Are Coming for the 2020 Election — And We’re Not Ready https://t.co/q82kNu9gMd via (???) | NA |
- Textual Data (always messy):
## [1] "CHAPTER I"
## [2] ""
## [3] ""
## [4] "Emma Woodhouse, handsome, clever, and rich, with a comfortable home"
## [5] "and happy disposition, seemed to unite some of the best blessings of"
## [6] "existence; and had lived nearly twenty-one years in the world with very"
## [7] "little to distress or vex her."
## [8] ""
## [9] "She was the youngest of the two daughters of a most affectionate,"
## [10] "indulgent father; and had, in consequence of her sister's marriage, been"
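Raw text like the lines above has no columns at all, so a common first step is to split it into words and count them. The sketch below does this in plain Python for illustration (the course itself works in R). The helper name `count_words` and the simple regular expression used to find words are assumptions made for this example, and the sample is the first sentence shown above.

```python
import re
from collections import Counter


def count_words(lines):
    """Lowercase each line, extract its words, and count how often each occurs."""
    counts = Counter()
    for line in lines:
        # A word is a run of letters, optionally joined by apostrophes or hyphens.
        counts.update(re.findall(r"[a-z]+(?:['-][a-z]+)*", line.lower()))
    return counts


lines = [
    "Emma Woodhouse, handsome, clever, and rich, with a comfortable home",
    "and happy disposition, seemed to unite some of the best blessings of",
    "existence; and had lived nearly twenty-one years in the world with very",
    "little to distress or vex her.",
]

word_counts = count_words(lines)
print(word_counts.most_common(3))
```

The result is a table of words and counts, which is structured data that can be analyzed like the earlier examples.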
CHALLENGE
What data formats are out there in the world? Create a list based on your experience and the excerpt from “Modern Data Science with R” (Baumer, Kaplan, and Horton 2017).
2.7 What does data analysis look like?
The way you communicate your data analysis will depend on what question you’re trying to answer and who your audience is. Here are some of my favorite data analysis reports:
Whose (coffee) beans reign supreme? A #tidytuesday static image
Women in Space A #tidytuesday static image
Which city is faster? A City Cycle Race Shiny app
The Physical Traits that Define Men and Women in Literature An interactive website
References
Baumer, Benjamin S, Daniel T Kaplan, and Nicholas J Horton. 2017. Modern Data Science with R. CRC Press.
Caffo, Brian, Roger D Peng, and Jeffrey T Leek. 2016. Executive Data Science: A Guide to Training and Managing the Best Data Scientists. Leanpub.
Grolemund, Garrett, and Hadley Wickham. 2018. R for Data Science. O’Reilly.
Hadavand, Aboozar, Ira Gooding, and Jeffrey T Leek. 2018. “Can MOOC Programs Improve Student Employment Prospects?” Available at SSRN 3260695.
Kross, Sean, Roger D Peng, Brian S Caffo, Ira Gooding, and Jeffrey T Leek. 2020. “The Democratization of Data Science Education.” The American Statistician 74 (1): 1–7.
Robinson, Emily, and Jacqueline Nolis. 2020. Build a Career in Data Science. Manning.