2021 TAMIDS Data Science Competition

Award

Team

Award Webpage

Objective

Data

The two datasets provided by the Texas A&M Institute of Data Science included detailed county-level presidential election vote information.

Visualization Dashboards

The first interactive dashboard was created in a Jupyter Notebook. This interactive map of the United States allowed users to hover over a state (as in Figure 2) and see the Democratic vote count, the Republican vote count, and their relative percentages.
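The per-state figures shown on hover can be derived from county-level rows by a simple aggregation. The following is a minimal sketch, not the notebook's actual code: the row layout, the sample values, and the function name `state_hover_summary` are hypothetical, and "relative percentage" is assumed here to mean each party's share of the two-party vote.

```python
from collections import defaultdict

# Hypothetical county-level rows: (state, county, party, votes).
# The values are made up for illustration, not taken from the competition data.
county_rows = [
    ("TX", "Brazos", "DEMOCRAT", 400),
    ("TX", "Brazos", "REPUBLICAN", 600),
    ("TX", "Travis", "DEMOCRAT", 700),
    ("TX", "Travis", "REPUBLICAN", 300),
    ("VT", "Addison", "DEMOCRAT", 150),
    ("VT", "Addison", "REPUBLICAN", 50),
]

def state_hover_summary(rows):
    """Aggregate county votes to state totals and two-party percentages."""
    totals = defaultdict(lambda: {"DEMOCRAT": 0, "REPUBLICAN": 0})
    for state, _county, party, votes in rows:
        # Parties other than the two major ones are ignored (an assumption).
        if party in totals[state]:
            totals[state][party] += votes
    summary = {}
    for state, counts in totals.items():
        two_party = counts["DEMOCRAT"] + counts["REPUBLICAN"]
        summary[state] = {
            "dem_votes": counts["DEMOCRAT"],
            "rep_votes": counts["REPUBLICAN"],
            "dem_pct": 100 * counts["DEMOCRAT"] / two_party if two_party else 0.0,
            "rep_pct": 100 * counts["REPUBLICAN"] / two_party if two_party else 0.0,
        }
    return summary

summary = state_hover_summary(county_rows)
```

A mapping library could then attach each state's entry in `summary` to its hover tooltip.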

The second interactive dashboard, built with R Shiny, presented Senate election results.

Model Selection

Two models were built: one for the 2020 Republican voting results and one for the 2020 Democratic voting results. The predicted values were the vote percentages in each state; the predictors were previous years’ voting results and spending amounts in different categories. After cleaning, the data frame for the 2020 Republican voting results had 51 observations and 50 variables, and the data frame for the 2020 Democratic voting results had 51 observations and 131 variables.

The main issue with fitting a regression directly was that the number of variables was comparable to or much larger than the number of observations (50 and 131 variables against 51 observations), which makes the design matrix singular or nearly singular. Therefore, to prevent singularity and to minimize the number of selected variables, forward stepwise regression was used to build the models. In addition, 10-fold cross-validation was used to assess the effectiveness of the models.
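The procedure described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the competition code: the data are synthetic (51 observations and 131 candidate predictors, of which only columns 3 and 7 carry signal), the 10-fold cross-validated mean squared error is assumed to serve as both the selection and the stopping criterion, and the function names are hypothetical. The least-squares solver uses the normal equations for brevity; a real analysis would rely on a statistics library.

```python
import random

def solve_least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("singular design matrix")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

def cv_mse(X, y, cols, k=10, seed=0):
    """k-fold cross-validated MSE of a linear model on the given columns."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sq_err = 0.0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        X_train = [[1.0] + [X[i][c] for c in cols] for i in train]  # intercept first
        beta = solve_least_squares(X_train, [y[i] for i in train])
        for i in fold:
            pred = beta[0] + sum(b * X[i][c] for b, c in zip(beta[1:], cols))
            sq_err += (y[i] - pred) ** 2
    return sq_err / len(y)

def forward_stepwise(X, y, max_vars, k=10):
    """Add one predictor at a time while the cross-validated MSE keeps improving."""
    selected, remaining = [], list(range(len(X[0])))
    best = cv_mse(X, y, [], k)  # intercept-only baseline
    while remaining and len(selected) < max_vars:
        score, col = min((cv_mse(X, y, selected + [c], k), c) for c in remaining)
        if score >= best:
            break
        best = score
        selected.append(col)
        remaining.remove(col)
    return selected, best

# Synthetic data with the same shape as the Democratic data frame (assumption).
rng = random.Random(42)
n_obs, n_vars = 51, 131
X = [[rng.gauss(0, 1) for _ in range(n_vars)] for _ in range(n_obs)]
y = [2 * row[3] - 3 * row[7] + rng.gauss(0, 0.1) for row in X]

selected, cv_error = forward_stepwise(X, y, max_vars=3)
```

Because each candidate model contains only a handful of columns, every fit stays well-posed even though the full set of variables outnumbers the observations. Capping the search with `max_vars` also limits how far the selection can chase noise in the cross-validation score.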

Because the goal of these models was to identify effective predictors, 11 predictors were used to fit the reduced polynomial models, even though the forward stepwise method suggested fewer.

Democrat Forward Select Model and Republican Forward Select Model, respectively:

Reduced polynomial models for Democrat and Republican, respectively:

Diagnostic plots for Democrat reduced polynomial model and for Republican reduced polynomial model, respectively:

Coefficient plots for Democrat model and Republican model, respectively:

Other Factors

The voter turnout rate and the voting-eligible population (VEP)

The VEP turnout rate versus the number of new Covid cases in each state in October 2020
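For reference, the VEP turnout rate is the number of ballots cast divided by the voting-eligible population. The sketch below computes it for a few hypothetical states and summarizes its relationship to new Covid cases with a Pearson correlation. All figures are made up for illustration, and the use of a correlation coefficient is an assumption rather than something the original analysis reports.

```python
from math import sqrt

# Hypothetical per-state figures, made up for illustration only:
# (state, ballots cast, voting-eligible population, new Covid cases in October 2020)
states = [
    ("A", 600_000, 1_000_000, 20_000),
    ("B", 1_400_000, 2_000_000, 30_000),
    ("C", 1_950_000, 3_000_000, 90_000),
    ("D", 400_000, 500_000, 5_000),
]

def vep_turnout_rate(ballots, vep):
    """Turnout as a fraction of the voting-eligible population."""
    return ballots / vep

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

rates = [vep_turnout_rate(ballots, vep) for _, ballots, vep, _ in states]
cases = [new_cases for _, _, _, new_cases in states]
r = pearson(cases, rates)
```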

Polynomial regression to predict the number of votes from the predictors “Covid cases”, “Men”, “Women”, “Black”, and “White”.

Conclusion