📊 Panel Data in Stata: An Intro (Part 1)

In the world of applied statistics and econometrics, panel data (also known as longitudinal data) plays a central role — especially when we’re interested in analyzing changes over time across multiple entities (like individuals, countries, or firms). If you’re new to panel data and want to get started in Stata, you’re in the right place.

This first post in the “Panel Data in Stata” series walks you through the fundamentals, explains the structure of panel datasets, and shows how to set up and explore your data correctly in Stata.

🤔 What Is Panel Data?

Panel data combines features of both cross-sectional and time-series data. It involves repeated observations over time for the same units. For example:

Person_ID	Year	Income
1	2020	50000
1	2021	53000
2	2020	47000
2	2021	49500

This structure lets us track changes within each individual over time and compare differences between individuals.
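As a rough illustration of the within/between idea, the sketch below types in the four rows from the table above and computes one "within" quantity (each person's year-over-year change) and one "between" quantity (each person's average income). The names income_change and avg_income are made up for this example.

```stata
* Toy data copied from the table above
clear
input Person_ID Year Income
1 2020 50000
1 2021 53000
2 2020 47000
2 2021 49500
end

* Within: year-over-year change for each person
* (missing in each person's first year, since there is no earlier row)
bysort Person_ID (Year): gen income_change = Income - Income[_n-1]

* Between: each person's average income across years
egen avg_income = mean(Income), by(Person_ID)

list, sepby(Person_ID)
```

With this toy data, person 1's income rises by 3000 and person 2's by 2500, while their averages are 51500 and 48250.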

🛠 Setting Up Panel Data in Stata

Before you run any panel-specific models, you must declare your dataset as panel data using the xtset command. Here’s how it works:

Step 1: Load your dataset

use mypaneldata.dta, clear

Step 2: Check your identifier variables

You need two things:

An ID variable (e.g., personid, country, firm_id)
A time variable (e.g., year)

You can inspect them like this:

list personid year income if _n <= 10
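Beyond eyeballing the first rows, a few quick checks can catch problems before you declare the panel. This is a minimal sketch that assumes your variables are named personid and year, as in the examples here.

```stata
* Confirm both variables are numeric (xtset does not accept string variables)
describe personid year

* Count observations that share the same personid-year pair
duplicates report personid year

* isid exits with an error if the pair does not uniquely identify observations;
* capture noisily shows the message without stopping a do-file
capture noisily isid personid year
```

If duplicates report shows any surplus observations, resolve them before moving on to Step 3.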

Step 3: Declare the data as panel

xtset personid year

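xtset reports the panel and time variables it found, and it flags gaps when some periods are missing for a unit. The output looks roughly like this (illustrative values; exact wording and capitalization vary by Stata version):

xtset personid year
panel variable: personid (unbalanced)
time variable: year, 2015 to 2021, but with gaps
delta: 1 unit

Gaps are fine for most estimators, but be aware that time-series operators such as lags and differences return missing values wherever the preceding period is absent.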
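To see how gaps interact with lags, here is a small sketch. It assumes the data are already declared with xtset personid year and that income exists; income_lag is a name chosen for this example.

```stata
* Lag operator: missing whenever the previous year is absent for that person
gen income_lag = L.income

* Optional: add rows for the missing years so each panel has consecutive periods
* (the new rows contain missing values for all other variables)
tsfill

* Typing xtset with no arguments redisplays the current panel settings
xtset
```

Whether to run tsfill is a judgment call: it makes the gaps explicit as rows of missing values, but it does not create any information that was not in the data.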

🧾 Checking Panel Structure

Use the following command to get a summary of your panel setup:

xtdescribe

This gives you:

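Number of panels (i.e., cross-sectional units)
Number of time periods, plus the distribution of observations per panel
Overall time range
The most common participation patterns (which periods each unit is observed in)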

Also helpful:

xtsum income

This breaks down the summary statistics into:

Overall (mean and standard deviation across all observations)
Between (variation in unit averages across units)
Within (variation over time around each unit's own average)
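If the decomposition feels abstract, it can be approximated by hand. The sketch below assumes personid and income as above; unit_mean, within_income, and first_obs are names invented for this example. It follows the usual convention of adding the overall mean back to the within deviations so that the within variable is centered where the original is.

```stata
* Each unit's own average income
egen double unit_mean = mean(income), by(personid)

* Overall mean, stored in r(mean)
summarize income, meanonly

* Within component: deviation from the unit mean, re-centered on the overall mean
gen double within_income = income - unit_mean + r(mean)
summarize within_income

* Between component: spread of the unit means, counting each unit once
egen byte first_obs = tag(personid)
summarize unit_mean if first_obs
```

The standard deviations from the last two summarize calls should be close to the within and between rows that xtsum income reports.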

🛑 Common Mistakes to Avoid

Forgetting to xtset: Many panel-specific commands won’t work until you do this.
Duplicate ID-time pairs: Each combination of ID and time must identify exactly one observation; otherwise xtset stops with a "repeated time values within panel" error.
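Using date strings: xtset needs numeric variables. Convert a string date to a Stata date first (e.g., gen dailydate = date(datestr, "YMD"), where datestr is your string variable), then either extract the period you need (e.g., gen year = year(dailydate)) or give the daily date a %td display format.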
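The string-to-numeric fixes can be sketched as follows. The variable names datestr (a string such as "2021-03-15") and country_name (a string identifier) are hypothetical; substitute your own, and adjust the "YMD" mask to match how your dates are written.

```stata
* Convert a string date to a numeric Stata daily date
gen dailydate = date(datestr, "YMD")
format dailydate %td

* Extract the year to use as an annual time variable
gen year = year(dailydate)

* A string ID must also be numeric: encode maps each name to an integer code
encode country_name, gen(country_id)

* Declare the panel with the numeric versions
xtset country_id year
```

If several daily dates fall in the same year for one unit, the unit-year pair is no longer unique and xtset will refuse it, so collapse or pick one observation per year first.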

✅ What’s Next?

Now that you’ve set up your data, you’re ready to start exploring it using panel tools — like fixed effects, random effects, lag variables, and Granger causality tests. In Part 2, we’ll cover: