A case-control analysis in Python typically involves calculating the odds ratio and its confidence interval, then running a hypothesis test such as a chi-square test, or fitting a logistic regression to adjust for other variables. Here’s a step-by-step guide to conducting case-control analysis in Python.
We assume two groups, "cases" (with the condition) and "controls" (without the condition), and a binary exposure variable (e.g., "exposed" vs. "not exposed").
For example:
exposed_cases: Number of cases who were exposed.
not_exposed_cases: Number of cases who were not exposed.
exposed_controls: Number of controls who were exposed.
not_exposed_controls: Number of controls who were not exposed.
# Example data
exposed_cases = 50
not_exposed_cases = 30
exposed_controls = 20
not_exposed_controls = 100
A contingency table organizes the data for further analysis.
import pandas as pd
# Creating a 2x2 contingency table
data = pd.DataFrame(
{
"Cases": [exposed_cases, not_exposed_cases],
"Controls": [exposed_controls, not_exposed_controls]
},
index=["Exposed", "Not Exposed"]
)
print(data)
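Before computing a summary measure, it can help to check the table's margins and the share of exposed subjects in each group. The following is a minimal sketch in plain Python that reuses the example counts above; the names `totals` and `exposure_share` are illustrative and not part of the original analysis.

```python
# Example counts from the table above
exposed_cases, not_exposed_cases = 50, 30
exposed_controls, not_exposed_controls = 20, 100

# Column totals (group sizes) and row totals (exposure status)
totals = {
    "cases": exposed_cases + not_exposed_cases,
    "controls": exposed_controls + not_exposed_controls,
    "exposed": exposed_cases + exposed_controls,
    "not_exposed": not_exposed_cases + not_exposed_controls,
}

# Proportion of each group that was exposed
exposure_share = {
    "cases": exposed_cases / totals["cases"],
    "controls": exposed_controls / totals["controls"],
}

print(totals)
print(f"Exposed among cases: {exposure_share['cases']:.1%}")
print(f"Exposed among controls: {exposure_share['controls']:.1%}")
```

Because cases and controls are sampled by outcome, these shares describe exposure within each group rather than the risk of the condition, which is why the analysis relies on odds rather than risks.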
The odds ratio (OR) measures the association between exposure and outcome. Here’s how to calculate it manually.
$$ \text{OR} = \frac{(\text{exposed\_cases} \times \text{not\_exposed\_controls})}{(\text{not\_exposed\_cases} \times \text{exposed\_controls})} $$
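As a quick arithmetic check, substituting the example counts from above (50, 30, 20, and 100) into the formula gives:

$$ \text{OR} = \frac{50 \times 100}{30 \times 20} = \frac{5000}{600} \approx 8.33 $$

An odds ratio above 1 indicates that the odds of exposure are higher among cases than among controls, here roughly eight times higher.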
# Calculating the odds ratio
odds_ratio = (exposed_cases * not_exposed_controls) / (not_exposed_cases * exposed_controls)
print(f"Odds Ratio: {odds_ratio}")
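The manual formula fails with a division by zero when either `not_exposed_cases` or `exposed_controls` is 0. A minimal sketch of one common workaround, the Haldane-Anscombe correction (adding 0.5 to every cell), is shown below. The function name `odds_ratio_2x2` and its `correct_zeros` flag are hypothetical helpers introduced for illustration, and whether the correction is appropriate depends on the study.

```python
def odds_ratio_2x2(exposed_cases, not_exposed_cases,
                   exposed_controls, not_exposed_controls,
                   correct_zeros=True):
    """Odds ratio for a 2x2 table (hypothetical helper, not a library API)."""
    cells = [exposed_cases, not_exposed_cases,
             exposed_controls, not_exposed_controls]
    if any(c < 0 for c in cells):
        raise ValueError("Cell counts must be non-negative")
    # Assumption: apply the Haldane-Anscombe correction only when a cell is zero
    if correct_zeros and any(c == 0 for c in cells):
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    if b * c == 0:
        raise ZeroDivisionError("Odds ratio is undefined: zero in the denominator")
    return (a * d) / (b * c)


print(odds_ratio_2x2(50, 30, 20, 100))  # same counts as the example above
print(odds_ratio_2x2(50, 0, 20, 100))   # zero cell, corrected by adding 0.5
```

With no zero cells the function returns the same value as the manual calculation; the correction only changes the result for tables that would otherwise be undefined.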
Alternatively, you can use the statsmodels library to calculate the odds ratio and its confidence interval:
import statsmodels.api as sm
import numpy as np
# Using statsmodels to calculate the odds ratio and its confidence interval
# Rows: cases, controls; columns: exposed, not exposed
table = np.array([[exposed_cases, not_exposed_cases], [exposed_controls, not_exposed_controls]])
table_2x2 = sm.stats.Table2x2(table)
oddsratio = table_2x2.oddsratio
conf_int = table_2x2.oddsratio_confint()  # default alpha=0.05 gives a 95% interval
print(f"Odds Ratio: {oddsratio}")
print(f"95% Confidence Interval: {conf_int}")
For a 95% confidence interval, use the formula: