Avoiding Pitfalls in Regression Analysis

(Updated with links and more Dec 1, 2020. Updated with SAS Global Forum announcement on Jan. 22, 2021.)

Professors reluctant to venture into these areas do their students no service in preparing them for the real world of work.

Today (November 30, 2020) I presented “Avoiding Pitfalls in Regression Analysis” during the Causal Inference Webinar at the Urban Analytics Institute in the Ted Rogers School of Management, Ryerson University. I was honored to do this at the kind invitation of Murtaza Haider, author of Getting Started with Data Science. The primary participants are his students in Advanced Business Data Analytics in Business. This is an impressive, well-crafted course (taught in R) that, at the syllabus level, covers many of the topics in this presentation. I met Murtaza some time ago online and have come to regard him as a first-rate Applied Econometrician.

Ethics and moral obligation to our students

Just as Peter Kennedy developed rules for the ethical use of Applied Econometrics, this presentation is a first step toward developing a set of rules for avoiding pain in one’s analysis. A warning against Hasty Regression (as defined in the presentation) is prominent.

(Update 1/22/2021: My paper, “Haste Makes Waste: Don’t Ruin Your Reputation with Hasty Regression,” has been accepted for a prerecorded 20 minute breakout session at SAS Global Forum 2021, May 18-20, 2021. More on this in a separate post later.)

Kennedy said in the original 2002 paper, Sinning in the Basement, “… my opinion is that regardless of teachability, we have a moral obligation to inform students of these rules, and, through suitable assignments, socialize them to incorporate them into the standard operating procedures they follow when doing empirical work.… (I) believe that these rules are far more important than instructors believe and that students at all levels do not accord them the respect they deserve.” (Kennedy, 2002, pp. 571-2) See my contribution to the cause, an essay on Peter Kennedy’s vision in Bill Frank’s book cited below.

While the key phrase in Peter’s quote seems to be the “moral obligation,” the stronger phrase is “regardless of teachability.” Professors reluctant to venture into these areas do no service to their students when they enter the real world of work. As with Kennedy’s rules, some of the rules for avoiding pitfalls are difficult to teach, which leads faculty away from in-depth coverage.

The Presentation

A previous presentation had the subtitle, “Don’t let common mistakes ruin your regression and your career.” I dropped that subtitle here only to save space, not to disavow the importance of these rules in a good career trajectory.

cover slide

This presentation highlights seven of ten pitfalls that can befall even the technically competent and fully experienced. Many regression users will have learned about regression in courses dedicating anywhere from a couple of weeks to much of a semester to it; others may be self-taught or have learned on the job. The focus of many curricula is to perfect estimation techniques and studiously learn about violations of the classical assumptions. Applied work is so much more, and one size does not fit all. The pitfalls remind all users to think fully through their data and their analysis. Used properly, regression is one of the most powerful tools in the analyst’s arsenal. Avoiding pitfalls will help the analyst avoid fatal results.

The Pitfalls in Regression Practice

  1. Failure to understand why you are running the regression.
  2. Failure to be a data skeptic; ignoring the data generating process.
  3. Failure to examine your data before you regress.
  4. Failure to examine your data after you regress.
  5. Failure to understand how to interpret regression results.
  6. Failure to model both theory and data anomalies, and to know the difference.
  7. Failure to be ethical.
  8. Failure to provide proper statistical testing.
  9. Failure to properly consider causal calculus.
  10. Failure to meet the assumptions of the classical linear model.

How to get this presentation

Faculty, if you would like this presentation delivered to your students or faculty via webinar, please contact me. Participants in the webinar can request a copy of the presentation by emailing me at myers@uakron.edu. Please specify the title of the presentation and give your name and contact information. Let me know what you thought of the presentation as well.

You can join me on LinkedIn at https://www.linkedin.com/in/stevencmyers/. Be sure to tell me why you are contacting me so I will be sure to add you.

I extend this offer to those who heard the presentation when it was first given in the Ohio SAS Users Group 2020 webinar series on August 26, 2020.

Readings, my papers:

Recommended Books:

Other Readings and references:

COVID-19 in the State of Ohio, updated daily

Updated 4/11/2020: Everyone is interested in how we are doing in Ohio during the COVID19 pandemic. Accordingly, I look at the data from the Ohio Department of Health and assemble it into a report for you. You can read my full report below, which includes multiple graphs and tables, and you can download the PDF. I intend to update the PDF report each day as new data become available. You should check back often, as the information displayed will change with new data. I will also offer new items as I think of them.

Full disclaimer: I am not an expert in epidemiology, nor have I attempted to model the behavior and predict the future. On LinkedIn, I have written about the importance of having a qualified subject matter expert paired with each data modeler. I am nonetheless interested in any suggestions you have. I have added a footnote to each table explaining that the definition of a case changed on April 10 from “confirmed (by a test) cases” to “confirmed cases plus probable cases,” which inflates the data by 47 cases on April 10. This change matches the CDC’s definitions, but the lack of consistency before and after the change date worries me.

First up are the weekly changes in the number of cases, hospitalizations, and deaths. A look at the number of cases shows a considerable decline. Every data point is an average of the last week of cases. When the changes are on the way down, it suggests that the curve of the total caseload is indeed being bent.
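A minimal sketch of that smoothing step, written in Python rather than the SAS used for the report. It assumes each plotted point is a trailing seven-day mean of daily new cases, which is my reading of the text and not a confirmed detail of the report. The function name and the daily counts are hypothetical, not Ohio Department of Health data.

```python
from collections import deque


def trailing_weekly_average(daily_counts, window=7):
    """Return the trailing mean of the last `window` values for each day.

    Days before a full window is available are skipped, so the result has
    len(daily_counts) - window + 1 entries.
    """
    recent = deque(maxlen=window)  # holds only the most recent `window` days
    averages = []
    for count in daily_counts:
        recent.append(count)
        if len(recent) == window:
            averages.append(sum(recent) / window)
    return averages


# Hypothetical daily new-case counts for illustration only.
daily_new_cases = [300, 320, 340, 360, 350, 330, 310, 290, 270, 260]
weekly_avg = trailing_weekly_average(daily_new_cases)

for day, value in enumerate(weekly_avg, start=7):
    print(f"day {day}: trailing 7-day average = {value:.1f}")
```

A falling sequence of averages is what the text describes as the curve being bent. Averaging over a week also smooths out day-of-week reporting swings.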

weekly changes in cases of covid-19

Rates of hospitalizations and deaths are shown in the next graph. This past week Amy Acton said Ohio has tested 50,000 people, and our cases are just under 6,000. In rough measure, and given that the large majority of those tested are showing symptoms or are clearly in harm’s way, about 12 percent of test results are positive. That suggests the actual death rate, which is 3.9% of all positive cases, may be as low as 12% of 3.9%, or about 0.5% of all those tested, and much lower still as a share of the population of 11 million. Of course, I do not have individual testing data and this is a bit of hopeful speculation.
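The back-of-the-envelope arithmetic can be laid out step by step. This Python sketch uses only the rounded figures quoted in the text (50,000 tested, roughly 6,000 cases, 3.9% of cases ending in death). The variable names are mine, and the result is the same rough speculation as in the text, not an epidemiological estimate.

```python
tested = 50_000          # people tested, as quoted in the post
cases = 6_000            # "just under 6000", rounded up here
deaths_per_case = 0.039  # 3.9% of positive cases

# Share of tests that came back positive.
positive_share = cases / tested

# Deaths as a share of everyone tested: (cases / tested) * (deaths / cases).
deaths_per_tested = positive_share * deaths_per_case

print(f"positive share of tests: {positive_share:.1%}")
print(f"deaths per person tested: {deaths_per_tested:.2%}")
```

With these rounded inputs the product is a little under half a percent of those tested.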

rates of cases

I also did a visualization of the hospitalization and death rates by age and sex and posted that to LinkedIn. You can access that here. Similar numbers and heatmaps are in the full report below.

I used SAS® to organize and analyze the data.

Because people are interested in how we are doing in Ohio during the COVID19 pandemic I hope this is of interest to you.

OH_report_COVID19

Download the report here.

Proper citation requested. Steven C. Myers. 2020. Ohio Covid19 report. Accessed at https://econdatascience.com/COVID19 on (your access date).

Request for Comments to myers@uakron.edu

Economic Freedom: Solve Problems, Tell Stories

Time and time again we hear that employers want two qualities from their data scientists: the ability to solve problems and the ability to tell stories. How important is economic freedom? Does it lead to greater standards of living? The answer can be shown in well-laid-out tables of results, but visualizing those results has an even greater impact and better tells the story.

If a “picture is worth a thousand words” then a SAS SGPLOT is worth many pages of tables or results. Can you see the story here?

Economic Freedom is shown to be associated with ever higher standards of living across countries.

The problem is whether countries with higher levels of economic freedom also have higher standards of living. It appears that is true. The association seems undeniable. Is it causal? That is another question that the visual raises. Chicken-and-egg reasoning doesn’t seem likely here; it does appear that the association runs one way. To establish that, we have to answer whether economic freedom is necessary for higher standards of living, and we have to determine whether the standard of living would not have been as high had the economic freedom not been achieved.

More on that in a future post on the importance of “why.” For now, enjoy the fact that there seems to be a key to making the world better off. Oh, not just from this graph, but from countless successes in countries in the past. My undergraduate analytics students are expanding on this finding to see if their choices from the 1600 World Development Indicators of the World Bank hold up in the same way as GDP per capita does here in this graph. We/they modify the question to “Do countries that have higher economic freedom also have greater human progress?” I am anxious to see what they find.

The Economic Freedom data comes to us from The Heritage Foundation. Let me know what you think about the visual.

This is a follow-up to my post on my blog at econdatascience.com, “Bubble Chart in SAS SGPLOT like Hans Rosling.”

The SAS PROC SGPLOT code to create the graph is on my GitHub repository. It uses the BLOCK statement for the banding and labels selected points based on large residuals from a quadratic regression. The quadratic parametric regression and the loess non-parametric regression are included to suggest the trend relationship.
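To illustrate the idea of labeling only the points with large residuals from a quadratic fit, here is a small Python sketch. It is not the SAS code from the repository. The cutoff of two residual standard deviations, the function names, and the synthetic data are all assumptions made for illustration, and the actual graph may use a different rule.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = a + b*x + c*x**2 via the normal equations."""
    # Sums of powers of x fill the 3x3 matrix X'X; sums of x**k * y fill X'y.
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    # Augmented matrix [X'X | X'y].
    m = [[s[i + j] for j in range(3)] + [t[i]] for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(3):
            if row != col:
                factor = m[row][col] / m[col][col]
                m[row] = [u - factor * v for u, v in zip(m[row], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def flag_large_residuals(names, xs, ys, threshold=2.0):
    """Return names whose residual exceeds `threshold` residual std. deviations."""
    a, b, c = fit_quadratic(xs, ys)
    residuals = [y - (a + b * x + c * x * x) for x, y in zip(xs, ys)]
    # Residual standard deviation with n - 3 degrees of freedom.
    scale = (sum(r * r for r in residuals) / (len(xs) - 3)) ** 0.5
    return [n for n, r in zip(names, residuals) if abs(r) > threshold * scale]


# Synthetic data: an exact quadratic with one point pushed well off the curve.
names = [f"C{i:02d}" for i in range(1, 16)]   # hypothetical country labels
xs = list(range(1, 16))                        # hypothetical freedom scores
ys = [2 + 0.5 * x + 0.3 * x * x for x in xs]   # hypothetical outcomes
ys[7] += 20                                    # make C08 an outlier

flagged = flag_large_residuals(names, xs, ys)
print("points to label:", flagged)
```

Labeling only the flagged points keeps the chart readable while still naming the countries that depart most from the trend.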

Sorry, data not included.

Bubble Chart in SAS SGPLOT like Hans Rosling

Robert Allison blogs as the SAS Graph Guy. Using SAS PROC SGPLOT, he recreates the famous bubble chart from Hans Rosling of Gapminder Institute. Hans shows that life expectancy and income per person have changed dramatically over the years. Because Hans Rosling is sort of the father of visualizations, Robert produces this graph (shown here) and this very cool animation.

I can’t wait to see Economic Freedom and income per person in one of these graphs soon. My students are trying to do this right now. At this point in the term they are acquiring two datasets from Heritage on 168 countries, which contain the index of economic freedom for 2013 and 2018. Then they are cleaning and joining them so they can reproduce the following figure and table in SAS PROC SGPLOT for each year.
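The clean-and-join step can be sketched as follows. This is Python rather than the SAS the students use. It assumes the two years are matched on country name after light cleaning, which is my assumption and not a detail given in the post. The function name, country names, and scores are hypothetical.

```python
def inner_join(scores_2013, scores_2018):
    """Join two {country: score} mappings on country name.

    Names are stripped and case-folded before matching. Countries present in
    only one year are returned separately rather than dropped silently.
    """
    def clean(mapping):
        # key: normalized name; value: (display name, score)
        return {name.strip().casefold(): (name.strip(), score)
                for name, score in mapping.items()}

    left, right = clean(scores_2013), clean(scores_2018)
    joined = {left[k][0]: (left[k][1], right[k][1])
              for k in left if k in right}
    unmatched = sorted(
        {v[0] for k, v in left.items() if k not in right}
        | {v[0] for k, v in right.items() if k not in left}
    )
    return joined, unmatched


# Hypothetical index values for fictional countries.
index_2013 = {"Freedonia ": 61.0, "Sylvania": 55.5, "Ruritania": 70.2}
index_2018 = {"freedonia": 63.5, "Sylvania": 54.0, "Genovia": 68.1}

joined, unmatched = inner_join(index_2013, index_2018)
print(joined)      # countries with a score in both years
print(unmatched)   # countries to investigate before plotting
```

Reporting the unmatched names is deliberate: a silent inner join can hide spelling differences between the two files, which is one way to fail to examine your data before you regress.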

I have written about this project in prior terms here. Once they have this data joined and the above figures reproduced, they will move on to the final project for this semester. They will be looking through the 1600 World Development Indicators of the World Bank. Each team of students will choose five indicators and join them to their data to answer the question:

Does Economic Freedom lead to greater Human Progress?

I may share their results; for now, these are some pretty cool graphics from the SAS Graph Guy.

Data Scientist Jobs Are Increasing For Economists: Evidence from the AEA

Economists, especially Econometricians, are in hot demand in the field of Data Science. Last March I posted Amazon’s Secret Weapon: Economic Data Sciences, which was one of many similar articles on the high demand. The entire premise of this blog, and of my work at the university, is to highlight this and point economists and our business data analytic students in that direction. Our curriculum is centered on SAS because having students learn to program at a base level and learn the power of SAS is a good basis for future employment (see Data Analytic Jobs in Ohio – May/June 2019).

Because we are looking for a couple of PhD economists for tenure track positions, I thought to wander around in JOE (Job Openings for Economists) and eventually wandered into wondering how many Data Science jobs were directly advertised in JOE, competing with academic positions (including ours).

So, to sharpen my SAS SGPLOT skills, I collected some data and found that Data Scientists are indeed in increasing demand over time in JOE, but not as much as in the general market on Indeed.com. Clearly, in JOE the August to December window is the best time to find a data science job, and the August 2019 count should grow as more jobs are added leading up to the ASSA meetings in San Diego in January. If you’re there, look me up, but I suspect I will be in an interviewing room from dawn to dusk.

Enjoy! Comments welcomed. 

Updated to final 2019-2020 numbers
Preliminary 2019-2020 numbers

For those wanting to see the SAS code

My apologies, Elementor does not handle text code so well, or I have not yet figured that out. (A small amount of research shows that the lack of a code widget is a problem with Elementor.)

Code with data and image are available at https://github.com/campnmug/SGPLOT_Jobs

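/* Counts of JOE listings that mention Data Scientist, by listing period */
data ds;
/* The colon modifier reads each date up to the next blank instead of a fixed 10 columns */
input date :mmddyy10. total DStitle NotDStitle;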
t=_n_;
Datalines;
2/1/2014 0 0 0
8/1/2014 2 2 0
2/1/2015 0 0 0
8/1/2015 5 2 3
2/1/2016 1 1 0
8/1/2016 11 5 6
2/1/2017 1 1 0
8/1/2017 12 6 6
2/1/2018 2 1 1
8/1/2018 14 11 3
2/1/2019 7 4 3
8/1/2019 12 6 6
;
run;
Title1 bold 'Data Scientist Jobs Are Increasing For Economists: Evidence from the AEA';
Title2 color=CX666666 'Advertisement for Data Scientists in Job Openings for Economists (JOE)';
title3 color=CX666666 "Counts shown are the result of a search of all listings for 'Data Scientist'";
proc sgplot data=ds;
vbar date / response = total discreteoffset=-.0 datalabel DATALABELATTRS=(Family=Arial Size=10 Weight=Bold)
legendlabel="Total Data Scientist Jobs" dataskin=gloss;
vbar date / response = DStitle transparency=.25 discreteoffset=+.0 datalabel DATALABELATTRS=(Family=Arial Size=10 Weight=Bold)
legendlabel="Job title is 'Data Scientist' " dataskin=gloss;
yaxis display = none ;
xaxis display = ( nolabel);
inset "To put this in perspective:" " "
"Most 'Data Scientist' and 'Economist' jobs"
"are not advertised in JOE"
"A search for 'Economist' and 'Data Scientist'"
"on Indeed.com yields 514 jobs on Oct 14, 2019"
/ position=topleft border
TEXTATTRS=(Color=maroon Family=Arial Size=8
Style=Italic Weight=Bold);
inset "Aug 2019" "preliminary"
/ position=topright noborder
TEXTATTRS=(Color=black Family=Arial Size=8
Style=Italic );

format date worddate12.;
footnote1 Justify=left 'JOE listings are at https://www.aeaweb.org/joe/listings';
footnote2 Justify=left 'Only active listings in either the Aug-Jan or Feb-Jul timeline were searched.';
footnote3 Justify=left 'Search conducted on Oct 14, 2019, so the last count will grow as new jobs are entered into the system.';
footnote4 ' ';
footnote5 Justify=left bold Italic color = CX666666 'Created by Steven C. Myers at EconDataScience.com' ;
run;