Research Guide
Jul 23, 2025

Public Health Datasets for Medical Trainees

Working with Publicly Available Datasets

This is a topic I wish I had known about when I was training. I didn't realize how many papers could be published using these datasets, or how quickly I could finish them because there was no data collection to do. I had to learn through trial and error. I spent months on a chart review project that yielded zero publications. Later in my career, I published over five first-author papers in that same span using public databases. My hope is to distill my years of experience, coursework, and mentorship to accelerate your research journey.

This article introduces publicly available datasets: what they are, their benefits, where to find them, and how to analyze them.

What are Publicly Available Datasets?

As the name suggests, these datasets are available to the public and can be downloaded for free and without any paperwork or approvals. There are dozens of nationally representative datasets that can be easily found, downloaded, and analyzed.

You may be thinking, "If these datasets are publicly available, wouldn't all the research questions have been answered already?" Not at all! These datasets have thousands of variables and are updated every couple of years. That means there are tens of thousands of research questions to explore. There aren't enough researchers to ask or answer all the important questions before the next cycle comes out.

Benefits of Public Databases

Public Health Advantages

From a public health perspective, these datasets are extremely powerful for a number of reasons:

  • Large sample sizes - They collect data from thousands of US residents each cycle. This means your sample size is enormous and can be used to detect associations that wouldn't be found in smaller studies.

  • Representative sampling - Because these surveys use representative sampling methods, the results can be extrapolated to the entire US population. This is one of the unique advantages of these databases. Almost every other data source will be less generalizable.

  • Current data - The data are updated every 1-2 years. Recent data matter because they let you study changes driven by external events. For example, a study of telehealth utilization should include data collected after the start of the COVID-19 pandemic, because telehealth use grew dramatically as a result of that event.

Academic Publishing Advantages

Beyond their overall public health value, these datasets offer practical advantages for academic publishing:

  • Well-validated methodology - These data are well-validated, and their methodology well-described. Thousands of papers have been published from these datasets, which means that you won't need to defend the data collection methodology. Because reviewers are familiar with these datasets, they will focus their attention on the rest of the paper rather than how the data was generated.

  • High-impact publication potential - Papers using these data have the potential to be published in the best journals in the world. Search the New England Journal of Medicine or the Journal of the American Medical Association for papers that use these public datasets and you will find dozens. Compare that with other types of studies: chart review projects and case reports rarely appear as original research in these journals.

Where to Find Public Databases

Here are some selected datasets, each with a brief description. Dozens more can easily be found.

National Health and Nutrition Examination Survey (NHANES)

A survey combining interviews, physical examinations, and laboratory tests on a nationally representative sample of ~5,000 people annually. Unique for combining self-reported data with objective health measurements like blood work, body measurements, and clinical assessments. Excellent for studying disease prevalence and nutritional status.

National Health Interview Survey (NHIS)

One of the longest-running health surveys in the U.S., conducting annual in-person interviews with ~35,000 households. Focuses on health status, healthcare access, and health behaviors through self-report. Strong for tracking health trends over time and studying healthcare utilization patterns.

Medical Expenditure Panel Survey (MEPS)

Tracks healthcare costs, utilization, and insurance coverage for the same families over 2+ years. Provides detailed information on what Americans pay for healthcare, what services they use, and how they're insured. Essential for health economics and policy research.

Behavioral Risk Factor Surveillance System (BRFSS)

The largest telephone-based health survey system, collecting data from all 50 states on health behaviors, chronic conditions, and preventive services. Provides state-level data that's crucial for public health planning and tracking Healthy People objectives.

Youth Risk Behavior Surveillance System (YRBSS)

Monitors health behaviors among high school students that contribute to leading causes of death and disability. Covers topics like substance use, sexual behaviors, violence, and mental health. Critical for understanding adolescent health trends and informing school-based interventions.

How to Analyze these Data

Accessing these data is easy. Analyzing these data requires a little more nuance. Most of the time, this requires a statistical background. If you don't have much statistical experience, this next section might be hard to contextualize. There are a few things that need to be true for the analysis:

Statistical Requirements

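  • Appropriate analyses to account for the complex survey design - Specifically, this means correctly using the survey weights, strata, and primary sampling units (PSUs) so that both the estimates and their standard errors reflect how participants were sampled.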

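  • Appropriate cohort extract - Because standard errors depend on the full sampling design, participants should not be dropped from the dataset before analysis. To study a subgroup, keep everyone in the dataset and flag the subgroup as a subpopulation (domain) within the analysis.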
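
To make the two requirements above concrete, here is a minimal Python sketch of a design-based weighted mean. It is an illustration under stated assumptions, not a substitute for an established survey package: the column names (`stratum`, `psu`, `weight`, `age`, `sbp`), the toy data, and the `survey_mean` helper are all hypothetical, and the standard error uses Taylor linearization under the common simplification that PSUs are sampled with replacement within each stratum.

```python
import numpy as np
import pandas as pd


def survey_mean(df, y, weight, stratum, psu, domain=None):
    """Weighted mean of `y` with a Taylor-linearized standard error.

    `domain` is an optional boolean mask for a subgroup. Rows outside the
    domain stay in the data and contribute zero, so every stratum and PSU
    is still counted in the variance calculation.
    """
    d = np.ones(len(df)) if domain is None else np.asarray(domain, dtype=float)
    w = df[weight].to_numpy(dtype=float) * d
    # Values outside the domain are never used; zero them so a missing
    # value there cannot turn the whole estimate into NaN.
    yv = np.where(d > 0, df[y].to_numpy(dtype=float), 0.0)

    total_w = w.sum()
    mean = (w * yv).sum() / total_w

    # Linearized score for a ratio estimator (the weighted mean).
    z = w * (yv - mean) / total_w
    scores = pd.DataFrame(
        {"stratum": df[stratum].to_numpy(), "psu": df[psu].to_numpy(), "z": z}
    )
    psu_totals = scores.groupby(["stratum", "psu"])["z"].sum().reset_index()

    # Between-PSU variance within each stratum, summed over strata.
    variance = 0.0
    for _, grp in psu_totals.groupby("stratum"):
        n_h = len(grp)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        dev = grp["z"] - grp["z"].mean()
        variance += n_h / (n_h - 1) * (dev ** 2).sum()

    return mean, float(np.sqrt(variance))


# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "stratum": [1, 1, 1, 1, 2, 2, 2, 2],
    "psu":     [1, 1, 2, 2, 1, 1, 2, 2],
    "weight":  [100, 150, 200, 50, 300, 100, 250, 150],
    "age":     [25, 67, 70, 34, 45, 72, 58, 29],
    "sbp":     [118, 142, 150, 121, 130, 155, 147, 115],
})

older = toy["age"] >= 65

# Everyone stays in the data; the subgroup is passed as a domain.
mean_all, se_all = survey_mean(toy, "sbp", "weight", "stratum", "psu")
mean_old, se_old = survey_mean(toy, "sbp", "weight", "stratum", "psu", domain=older)
print(f"all adults: mean={mean_all:.1f}, SE={se_all:.1f}")
print(f"aged 65+:   mean={mean_old:.1f}, SE={se_old:.1f}")
```

In this toy data, one PSU contains no one aged 65 or older. Filtering the rows first (`toy[older]`) would leave stratum 2 with a single PSU, and the function would refuse to compute a variance. Passing the subgroup as a domain keeps that PSU in the calculation, which is the point of the second requirement. For real analyses, a maintained tool such as the R `survey` package is likely the safer choice.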

Next Steps

While handling these data and managing these statistical requirements may seem daunting, Lumono guides users of every level through the entire research process using these datasets. We organize the data, help you ask a relevant research question, run the appropriate statistical analyses, and help you interpret the results. We accelerate your research journey from months to weeks. Sign up for our newsletter to receive exclusive research guides and product updates.
