Research Question 3
What impacts the number of tests and procedures a patient gets done in the ER?
Objectives
This research question examines whether there are disparities in the number of tests and procedures a patient receives in the ER. The findings can help health care professionals become aware of barriers or social determinants that may keep patients from receiving holistic care in the ER.
- Dependent variable:
  - Total count of diagnostic tests and procedures (10 variables combined into 1)
- Independent variables:
  - Insurance Status
  - Poverty Category: Poor/Negative, Near Poor, Low Income, Middle Income, High Income
  - Sex
Data Source & Preprocessing
After the 15 original variables were reduced to just two, insurance status and total tests, the dataset offered limited predictive value. To support a more comprehensive and nuanced analysis, I merged it with the MEPS 2023 Full Year Consolidated Data File, linking records by patient ID to add sex and poverty category for each patient.
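The preprocessing described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used for the analysis. The field names (`patient_id`, `insured`, `sex`, `poverty_category`) and the `test_1` to `test_10` indicator names are placeholders, since the actual MEPS variable names are not listed here.

```python
# Placeholder names for the 10 test/procedure indicators (hypothetical, not MEPS names).
TEST_COLUMNS = [f"test_{i}" for i in range(1, 11)]


def make_visit(patient_id, insured, tests_done):
    """Build a toy ER record with a 0/1 flag for each of the 10 indicators."""
    flags = {col: int(col in tests_done) for col in TEST_COLUMNS}
    return {"patient_id": patient_id, "insured": insured, **flags}


def total_tests(visit):
    """Collapse the 10 indicators into a single count (the dependent variable)."""
    return sum(visit[col] for col in TEST_COLUMNS)


def merge_demographics(visits, demographics):
    """Left-join sex and poverty category onto the ER records by patient ID."""
    by_id = {row["patient_id"]: row for row in demographics}
    merged = []
    for visit in visits:
        demo = by_id.get(visit["patient_id"], {})  # empty dict if no match
        merged.append({
            "patient_id": visit["patient_id"],
            "insured": visit["insured"],
            "total_tests": total_tests(visit),
            "sex": demo.get("sex"),
            "poverty_category": demo.get("poverty_category"),
        })
    return merged


# Toy data for illustration only.
visits = [
    make_visit("A1", "Yes", {"test_1", "test_4"}),
    make_visit("B2", "No", {"test_2"}),
]
demographics = [
    {"patient_id": "A1", "sex": "Female", "poverty_category": "High Income"},
]
merged = merge_demographics(visits, demographics)
```

A left join keeps every ER record even when no demographic match exists, so unmatched patients show up with missing values rather than silently dropping out of the sample.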
Data Visualization
The data are imbalanced: 644 patients have no insurance versus 3,597 with insurance. The side-by-side bar chart in Figure 7 shows a left-skewed, roughly bell-shaped distribution, with most patients in both the insured and uninsured groups having around 2 tests. The distribution of the number of tests is similar in the two groups.
Figure 7: Bar Chart of Number of ER Tests by Insurance Status
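Table 1 shows the distribution of poverty categories within each insurance group (each row sums to approximately 100%). Among patients without insurance, the largest shares fall into the Middle Income (32.76%) and High Income (27.64%) groups. Interestingly, Near Poor and Low Income patients make up a larger share of the insured group (7.67% and 16.15%) than of the uninsured group (5.28% and 11.96%), while a notable portion of uninsured patients come from the middle- and high-income groups.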
Table 1: Proportion of Patients by Poverty Category and Insurance Status
| Insurance | Poor/Negative | Near Poor | Low Income | Middle Income | High Income |
|---|---|---|---|---|---|
| No | 22.36% | 5.28% | 11.96% | 32.76% | 27.64% |
| Yes | 23.16% | 7.67% | 16.15% | 25.38% | 27.63% |
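Row percentages like those in Table 1 can be computed with a short tally. The sketch below uses made-up records and hypothetical field names (`insured`, `poverty_category`); it illustrates the calculation only and does not reproduce the table.

```python
from collections import Counter


def row_percentages(records):
    """Share of each poverty category within each insurance group, in percent."""
    cell_counts = Counter((r["insured"], r["poverty_category"]) for r in records)
    group_totals = Counter(r["insured"] for r in records)
    return {
        (insured, category): round(100 * count / group_totals[insured], 2)
        for (insured, category), count in cell_counts.items()
    }


# Toy records for illustration only.
records = [
    {"insured": "No", "poverty_category": "Middle Income"},
    {"insured": "No", "poverty_category": "Middle Income"},
    {"insured": "No", "poverty_category": "High Income"},
    {"insured": "No", "poverty_category": "Poor/Negative"},
    {"insured": "Yes", "poverty_category": "Low Income"},
    {"insured": "Yes", "poverty_category": "High Income"},
]
shares = row_percentages(records)
```

Dividing by the insurance-group total (rather than the poverty-category total) is what makes each row of the table sum to 100%.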
Multiple Linear Regression
The multiple linear regression results (Table 2) showed that the strongest predictor was high income, which was associated with an average of 0.24 more tests than the reference poverty category (Poor/Negative), holding the other predictors constant. The next significant predictor was insurance status: insured patients surprisingly received 0.17 fewer tests on average than uninsured patients. The other factors (near-poor status, low income, middle income, and sex) were not significant.
Key Insight: High income is associated with an average increase of 0.24 tests per patient.
Key Insight: Insured patients receive 0.17 fewer tests on average.
Table 2: Multiple Linear Regression Coefficients Ranked from Strongest to Weakest
| Predictor | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 2.73161 | 0.08486 | 32.188 | <0.001 |
| High Income*** | 0.23923 | 0.07195 | 3.325 | 0.0009 |
| Insurance (Yes vs No)* | -0.17027 | 0.07134 | -2.387 | 0.0171 |
| Near Poor | 0.11644 | 0.11075 | 1.051 | 0.2932 |
| Low Income | 0.02835 | 0.08245 | 0.344 | 0.731 |
| Middle Income | 0.00577 | 0.07258 | 0.079 | 0.9366 |
| Sex (Female vs Male) | -0.02602 | 0.05236 | -0.497 | 0.6193 |
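The t values in Table 2 are each estimate divided by its standard error. The sketch below recomputes them and approximates the two-sided p-value with the standard normal distribution, which is an assumption that is reasonable only because the sample is large (644 + 3,597 patients); the exact values in the table come from the t distribution and may differ slightly.

```python
import math


def wald_test(estimate, std_error):
    """Return the test statistic and a two-sided p-value (normal approximation)."""
    statistic = estimate / std_error
    # For a standard normal Z, P(|Z| > z) = erfc(z / sqrt(2)).
    p_value = math.erfc(abs(statistic) / math.sqrt(2))
    return statistic, p_value


# Estimates and standard errors taken from Table 2.
t_high_income, p_high_income = wald_test(0.23923, 0.07195)
t_insurance, p_insurance = wald_test(-0.17027, 0.07134)

print(f"High Income: t = {t_high_income:.3f}, p = {p_high_income:.4f}")
print(f"Insurance:   t = {t_insurance:.3f}, p = {p_insurance:.4f}")
```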
Poisson Regression
Poisson regression is designed to model count data, which makes it a natural fit for the total number of tests. It produced results similar to the linear regression, though the p-values for high income and insurance status were slightly weaker. The performance metrics of the two models were also nearly the same. Because Poisson coefficients are on the log scale (assuming the standard log link), they are interpreted after exponentiation. The strongest predictor was high income: its coefficient of 0.09 corresponds to about 9% more expected tests (exp(0.0889) ≈ 1.09) than the reference poverty category. The next significant predictor was insurance status: its coefficient of -0.06 means insured patients surprisingly had about 6% fewer expected tests (exp(-0.0628) ≈ 0.94) than uninsured patients. The remaining factors were not significant.
Key Insight: High income is associated with about 9% more expected tests per patient (coefficient of 0.09 on the log scale).
Key Insight: Insured patients receive about 6% fewer expected tests on average (coefficient of -0.06 on the log scale).
Table 3: Poisson Regression Coefficients Ranked from Strongest to Weakest
| Predictor | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 1.00333 | 0.03723 | 26.953 | <0.001 |
| High Income** | 0.08890 | 0.03174 | 2.801 | 0.0051 |
| Insurance (Yes vs No)* | -0.06277 | 0.03093 | -2.029 | 0.0425 |
| Near Poor | 0.04439 | 0.04910 | 0.904 | 0.3659 |
| Low Income | 0.01094 | 0.03702 | 0.295 | 0.7677 |
| Middle Income | 0.00236 | 0.03262 | 0.072 | 0.9424 |
| Sex (Female vs Male) | -0.00980 | 0.02317 | -0.423 | 0.6723 |
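A Poisson model fitted with the usual log link reports coefficients on the log scale, so one way to read Table 3 is to exponentiate each estimate into a rate ratio. The sketch below does this for the intercept and the two significant predictors. It assumes a log link and assumes the intercept represents the reference group (uninsured, Poor/Negative, male), neither of which is stated explicitly in the results.

```python
import math

# Estimates taken from Table 3.
coefficients = {
    "(Intercept)": 1.00333,
    "High Income": 0.08890,
    "Insurance (Yes vs No)": -0.06277,
}


def rate_ratio(coefficient):
    """Multiplicative change in the expected count for a one-unit change."""
    return math.exp(coefficient)


def percent_change(coefficient):
    """The same effect expressed as a percent change in the expected count."""
    return 100 * (math.exp(coefficient) - 1)


# Expected number of tests for the assumed reference group.
baseline_tests = math.exp(coefficients["(Intercept)"])

for name in ("High Income", "Insurance (Yes vs No)"):
    coef = coefficients[name]
    # Implied change in expected tests relative to the reference group.
    implied_difference = baseline_tests * (rate_ratio(coef) - 1)
    print(f"{name}: rate ratio {rate_ratio(coef):.3f}, "
          f"{percent_change(coef):+.1f}%, "
          f"about {implied_difference:+.2f} tests at baseline")
```

On this reading, the Poisson and linear models tell a similar story: a rate ratio near 1.09 applied to a baseline of roughly 2.7 tests implies a difference of about a quarter of a test, close to the 0.24 estimate in Table 2.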
Random Forest Regression
The Random Forest regression results (Table 4) indicate that poverty category is the strongest predictor of total tests. Insurance status is the next most important predictor, followed by sex, which has the smallest impact on the model’s predictions. The percent increase in MSE “measures how much the model’s prediction error increases, if the variable is randomly permuted”. The increase in node purity “measures the total improvement in the model’s fit contributed by each variable”. Both metrics give the same ranking: poverty category first, then insurance, then sex.
Table 4: Random Forest Regression Results
| Predictor | % Increase in MSE | Increase in Node Purity |
|---|---|---|
| Poverty Category | 18.84 | 31.90 |
| Insurance | 8.26 | 12.07 |
| Sex | 4.31 | 7.16 |
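The permutation idea behind the first importance column can be illustrated with a short sketch. This is a simplified version and not the exact computation that produced Table 4: the `toy_model` function and the data are invented, and the sketch reports the raw increase in MSE on a single dataset rather than the scaled, tree-by-tree measure a random forest library would report.

```python
import random


def mean_squared_error(model, rows, targets):
    """Average squared difference between predictions and observed values."""
    return sum((model(row) - y) ** 2 for row, y in zip(rows, targets)) / len(targets)


def permutation_importance(model, rows, targets, column, seed=0):
    """Increase in MSE after shuffling one predictor column."""
    rng = random.Random(seed)
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    # Rebuild each row with only the chosen column replaced by a shuffled value.
    permuted_rows = [
        row[:column] + [value] + row[column + 1:]
        for row, value in zip(rows, shuffled)
    ]
    baseline = mean_squared_error(model, rows, targets)
    return mean_squared_error(model, permuted_rows, targets) - baseline


def toy_model(row):
    """Hypothetical fitted model that uses only the first predictor."""
    return 2.0 * row[0]


rows = [[float(i), float((i * 7) % 5)] for i in range(10)]
targets = [2.0 * row[0] for row in rows]

importance_used = permutation_importance(toy_model, rows, targets, column=0)
importance_ignored = permutation_importance(toy_model, rows, targets, column=1)
```

Shuffling a predictor the model relies on breaks its link to the outcome and raises the error, while shuffling a predictor the model ignores changes nothing, which is why larger values indicate more important variables.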
RQ3: Best Model - Poisson Regression
The performance of all three models was similar. The MAE was 1.12, meaning predictions were off by 1.12 tests on average in absolute terms. The RMSE was 1.37; it is slightly larger than the MAE because it penalizes large errors more heavily.
The Random Forest model had a slightly higher R² of 0.011 compared to 0.008 for both the Poisson and Multiple Linear Regression models, but overall the low R² indicates the models explain very little of the variation in total tests.
One limitation is the low variance in total tests (around 1.9), which means most observations are close to the mean. With so little spread in the outcome and only a few predictors, there is little variation for the models to explain, which is consistent with the low R². Variable selection could not be performed because few variables remained after the predictors were reduced from 17 to 4 key variables; additional variables were later merged from another dataset. Because Poisson regression is designed for non-negative count outcomes such as total tests, and it performed as well as the other models, it is the most appropriate model for this analysis.
Table 5: Performance Metrics Comparisons: Random Forest, Poisson Regression, & Multiple Linear Model
| Model | RMSE | MAE | R² |
|---|---|---|---|
| Random Forest | 1.37 | 1.12 | 0.011 |
| Poisson Regression | 1.37 | 1.12 | 0.008 |
| Multiple Linear Regression | 1.37 | 1.12 | 0.008 |
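For reference, the three metrics in Table 5 can be computed as in the sketch below. The numbers are toy values, not the study data, and R² is computed here as 1 minus the ratio of residual to total sum of squares, which is an assumption about the definition used in the analysis.

```python
import math


def regression_metrics(actual, predicted):
    """Compute RMSE, MAE, and R² for paired actual and predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    ss_res = sum(e * e for e in errors)                     # residual sum of squares
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)    # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1 - ss_res / ss_tot,
    }


# Toy values for illustration only.
toy_actual = [1, 2, 3, 4]
toy_predicted = [2, 2, 2, 4]
metrics = regression_metrics(toy_actual, toy_predicted)
```

An R² near zero, as in Table 5, means the residual sum of squares is almost as large as the total sum of squares, so the predictions are barely better than always predicting the mean number of tests.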