Working with Dates and Times in Stata

This article will introduce you to working with dates and times in Stata. We won’t cover all the details–this is a topic with a lot of details–but what we will cover has a good chance of being all you’ll need.

You’ll retain more of what you learn from this article if you run the example code yourself and try the exercises. To set up, open Stata, clear out anything you were doing before, and load the first example data set from the SSCC’s web server:

clear all
use https://sscc.wisc.edu/sscc/pubs/stata_dates/dates.dta

Note: this article was written as a Jupyter Notebook using the nbstat kernel and then converted to HTML by Quarto. You’ll notice a few bits of Stata code that are only needed to make that process work well, such as using the ab(30) option with list to prevent variable names from being abbreviated. You don’t need to copy them.

Converting Dates and Times

Often your first task is to convert the dates and times in your data set to Stata format, and most of the time everything after that is easy.

How Stata Stores Dates and Times

Stata stores a date as a number: most commonly the number of days that have passed since January 1, 1960. Thus 23,056 means February 15th, 2023. But the time unit could also be weeks, months, quarters, etc., depending on what’s convenient for your project. When you need to store a date and time (a datetime), the time unit becomes milliseconds. Thus 1,992,047,400,000 means 2:30PM on February 15th, 2023. Note that this number is too big to store accurately using the default float variable type. Always use double precision numbers for datetime variables.

Storing dates as numbers of time units since January 1, 1960 means you can do math with them. For example, if you know a subject’s birthday and the date of the beginning of the study, you can calculate how old they were when the study began by just subtracting. But dates as numbers of time units don’t mean much to most humans. Fortunately, there are special formats for dates that tell Stata to display those numbers as the corresponding human-readable date. These formats all start with %t followed by a letter corresponding to the time unit: %td for days, %tw for weeks, %tm for months, %tq for quarters, etc. The format for datetime variables is called %tc (think ‘clock’).

If a variable contains 23,056 and has been formatted with %td, it will be displayed as 15feb2023. But if it’s formatted with %tm it will be displayed as 3881m5, meaning month 5 (May) of the year 3881. It’s really the format that determines what the time unit is.

To create a Stata date variable, you’ll choose a time unit, convert the date to “number of that time unit that have passed since January 1, 1960”, and then apply the appropriate format. The second step in that process is the tricky one but Stata has a variety of functions that will do it for you, depending on the time unit you want and how the date is currently stored.

Note that Excel stores dates as the number of days since January 1, 1900 and some other systems use January 1, 1970. Be careful about converting dates from other programs. If you are planning to make data widely available or to store it for a long time, consider storing dates as separate numbers for year, month, day, etc. as that form is easy to convert and unambiguous.

Converting Numbers to Dates

Because date components stored as separate numbers are easy to use, they’re very common. The nd (numeric date) variables in the example date set are in this form:

list nd*, ab(30)

     +------------------------------------------------------------------+
     | nd_day   nd_month   nd_year   nd_hours   nd_minutes   nd_seconds |
     |------------------------------------------------------------------|
  1. |     15          2      2023          2           30            0 |
     +------------------------------------------------------------------+

The mdy() function converts numbers for month, day, and year to dates with day as the time unit. The arguments can be either numbers or variables containing numbers, and the name of the function tells you the order they go in: month, day, and then year. Thus:

gen nd = mdy(nd_month, nd_day, nd_year)
list nd

     +-------+
     |    nd |
     |-------|
  1. | 23056 |
     +-------+

To make this readable, apply the %td format to the nd variable:

format %td nd
list nd

     +-----------+
     |        nd |
     |-----------|
  1. | 15feb2023 |
     +-----------+
Exercise 1

The variables year_of_birth, month_of_birth, and day_of_birth contain the birthdate of a (made up) subject. Convert it to Stata form with a human-readable format.

Solution

To create a date where month is the time unit, use the ym() function. Again, the function name tells you the order of the arguments: year and then month. The other conversion functions follow the same pattern: yq() takes year and then quarter and converts them to a date with quarters as the time unit, etc.

To make a date variable where the time unit is month readable, apply the %tm format.

Thus to convert nd to a monthly date, run:

gen nd_monthly = ym(nd_year, nd_month)
format nd_monthly %tm
list nd_monthly

     +----------+
     | nd_mon~y |
     |----------|
  1. |   2023m2 |
     +----------+
Exercise 2

Often you’ll find data sets only contain birth month and birth year, to preserve anonymity. Convert year_of_birth and month_of_birth to a monthly variable.

Solution

To create datetime variables, you can use the mdyhms() function, where hms means hours, minutes, and seconds. Use gen double to create the new variable as a double precision variable so it can store time accurately:

gen double nd_time = ///
    mdyhms(nd_month, nd_day, nd_year, nd_hours, nd_minutes, nd_seconds)
format nd_time %tc
list nd_time

     +--------------------+
     |            nd_time |
     |--------------------|
  1. | 15feb2023 02:30:00 |
     +--------------------+
Exercise 3

Repeat the steps above, but don’t specify that the new variable be of type double (i.e. run gen nd_time2 = ... rather than gen double nd_time2 = ...). What goes wrong and why?

Solution

Sometimes all you care about is the time of day and not the date. The hms() function only takes hours, minutes and seconds:

gen double nd_time_of_day = hms(nd_hours, nd_minutes, nd_seconds)
format nd_time_of_day %tc
list nd_time_of_day

     +--------------------+
     |     nd_time_of_day |
     |--------------------|
  1. | 01jan1960 02:30:00 |
     +--------------------+

The result says it’s on January 1, 1960, but you can just ignore that.

Converting Strings to Dates

It’s also common to have dates stored as strings (text). Again, there are a variety of functions available for converting such strings to dates, depending on the time unit desired. The date() function converts strings to dates with days as the time unit; weekly(), monthly(), etc. convert to longer time units, and clock() converts to datetime.

For all of these functions the first argument is a a string or variable containing strings with the date to be converted. The second argument is a string called a mask that tells the function what date and time components the string contains and in what order. For example, the mask "MDY" means “month, day, year.” These functions are impressively intelligent about identifying components and their meaning. For example, the mask "MDY" will work for both February 15, 2023 and 2/15/2023. It will work on a variety of abbreviations for the month name as well.

The following components can be used in a mask (from the Stata help file for dates):

Component Meaning
D day
W week
M month
Q quarter
H half-year
Y year
19Y two-digit year in the 1900s
20Y two-digit year in the 2000s
h hour
m minute
s second
# placeholder for something to be ignored

Consider the dates sd1, sd2, and sd3:

list sd1 sd2 sd3

     +---------------------------------------+
     |          sd1         sd2          sd3 |
     |---------------------------------------|
  1. | Feb. 15 2023   2/15/2023   2023-02-15 |
     +---------------------------------------+

sd1 and sd2 are both in the form month, day, year, so they can both be converted using the mask "MDY":

gen sd1_date = date(sd1, "MDY")
gen sd2_date = date(sd2, "MDY")
format sd1_date sd2_date %td
list sd1_date sd2_date

     +-----------------------+
     |  sd1_date    sd2_date |
     |-----------------------|
  1. | 15feb2023   15feb2023 |
     +-----------------------+

sd3 is in ISO standard format: year, month, day. Thus it needs the mask "YMD":

gen sd3_date = date(sd3, "YMD")
format sd3_date %td
list sd3_date

     +-----------+
     |  sd3_date |
     |-----------|
  1. | 15feb2023 |
     +-----------+
Exercise 4

interview contains the date on which the fictional subject was interviewed. Convert it to a Stata date called interview_date. Subtract the birthdate variable you created in Exercise 1 from interview_date to see how old the subject was at the time of the interview. What unit is that result in?

Solution

The string sd4 contains just a month and year:

list sd4

     +----------+
     |      sd4 |
     |----------|
  1. | Feb 2023 |
     +----------+

Converting it calls for the monthly() function with the mask "MY" and the %tm format:

gen sd4_date = monthly(sd4, "MY")
format sd4_date %tm
list sd4_date

     +----------+
     | sd4_date |
     |----------|
  1. |   2023m2 |
     +----------+

Masks for the the conversion functions of larger time units (weekly(), monthly(), etc.) can only contain the time unit and a year. For them, all the mask really does is specify the order they appear in.

To convert datetime variables use the clock() function–and usually a longer mask. Consider sd5:

list sd5

     +---------------------+
     |                 sd5 |
     |---------------------|
  1. | 2023-02-15 14:30:00 |
     +---------------------+

The date components are year, month, day, hours, minutes, and seconds, so the mask is "YMDhms". Don’t forget to use a double precision variable!

gen double sd5_datetime = clock(sd5, "YMDhms")
format sd5_datetime %tc
list sd5_datetime

     +--------------------+
     |       sd5_datetime |
     |--------------------|
  1. | 15feb2023 14:30:00 |
     +--------------------+

Now consider sd6:

list sd6

     +----------------------------+
     |                        sd6 |
     |----------------------------|
  1. | 2:30PM on February 15, '23 |
     +----------------------------+

The date components here are hours, minutes, month, day, and year. But there are some complications:

You might be worried about the hour, but Stata will automatically detect the “PM” and take it into account.

The word “on” is not a date component. You need to tell Stata to ignore it by putting # in the mask between minute and month, meaning the element there should be ignored.

The year is only two digits. Use 20Y to tell Stata to assume it’s 2023 rather than 1923.

There’s no information about seconds. That’s okay: Stata will assume the seconds should be zero. If day or month is missing, Stata will assume it should be one.

gen double sd6_datetime = clock(sd6, "hm#MD20Y")
format sd6_datetime %tc
list sd6_datetime

     +--------------------+
     |       sd6_datetime |
     |--------------------|
  1. | 15feb2023 14:30:00 |
     +--------------------+

What if you only cared about the date of sd6 and not the time? Just use the date() function and tell it to ignore all the time elements:

gen sd6_date = date(sd6, "###MD20Y")
format sd6_date %td
list sd6_date

     +-----------+
     |  sd6_date |
     |-----------|
  1. | 15feb2023 |
     +-----------+

However, you cannot use the same approach to convert sd6 to a monthly date. Remember the mask for functions like monthly() is much simpler and does not include # to ignore elements. You’ll learn an alternative method shortly.

Exercise 5

Stata always makes the current date available as c(current_date) and the current time available as c(current_time). They are strings, and you can see their content with display c(current_date) and display c(current_time). Combine them into a single string and the convert them to a datetime variable called now.

Hint: you can combine strings by adding them.

Solution

Converting Between Date Formats

You can convert dates from one format to another using one of a set of conversion functions. For example, to convert sd1_date from a regular date with days as the time unit to a monthly date, use the mofd() function.

gen sd1_monthly = mofd(sd1_date)
format sd1_monthly %tm
list sd1_date sd1_monthly, ab(30)

     +-------------------------+
     |  sd1_date   sd1_monthly |
     |-------------------------|
  1. | 15feb2023        2023m2 |
     +-------------------------+

For mofd() think “months of days” as in “make months out of days.” Similar functions include dofm() (days of months), wofd() (weeks of days), dofc() (days of clock, as in datetime), etc.

Note how in converting sd1_date to sd1_monthly the day of the month was lost. If you convert back, Stata will fill in one for the day:

gen sd1_date2 = dofm(sd1_monthly)
format sd1_date2 %td
list sd1_date sd1_monthly sd1_date2, ab(30)

     +-------------------------------------+
     |  sd1_date   sd1_monthly   sd1_date2 |
     |-------------------------------------|
  1. | 15feb2023        2023m2   01feb2023 |
     +-------------------------------------+

This is just like filling in missing components when reading in a date: Stata will fill in a one or zero as appropriate.

We previously converted sd6 to a date even though it contained time components by telling the date() function to ignore them, but noted that we could not convert it to monthly the same way. One solution is to first create sd6_date as a daily date and then convert it to monthly. But note that you can combine those steps:

gen sd6_monthly = mofd(date(sd6, "###MD20Y"))
format sd6_monthly %tm
list sd6 sd6_monthly, ab(30)

     +------------------------------------------+
     |                        sd6   sd6_monthly |
     |------------------------------------------|
  1. | 2:30PM on February 15, '23        2023m2 |
     +------------------------------------------+
Exercise 6

Convert sd1_date to a quarterly date.

Solution

Using Dates

Once you have your dates in Stata format, they’re easy to use–but they do take a bit of getting used to. Keep in mind that they are really numbers, and you can do things with them that make sense for numbers.

The data set claims.dta contains initial claims (ICSA) for unemployment insurance on a weekly basis from January 2015 to February 2023. This is a key measure of the state of job market: high numbers of claims means many people have been laid off from their jobs. The data set was obtained from the Federal Reserve Economic Database (FRED) using the import fred command. You need to make an account with FRED and obtain a key to use that command, and if you’re interested in the US economy you probably want to do that.

The variable daten is the date in Stata format. Note that it’s a daily date (the format is %td) even though there’s one observation per week. The variable dates is the date in string format. Since daten is just a number, you can do things like use it in graphs:

clear
use https://sscc.wisc.edu/sscc/pubs/stata_dates/claims
line ICSA daten

Yes, that massive spike is the effect of the COVID-19 pandemic.

Date Conditions and Constants

Suppose we wanted to divide this dataset into the pre-pandemic period and the post-pandemic period, creating an indicator for “post-pandemic.” We’ll call March 15th, 2020, the start of the pandemic for our purposes, since that’s when it began to clearly impact the job market.

Conceptually, we want to set post to the condition “daten is after March 15th, 2020” (i.e. it will get a 1 if that condition is true and a 0 if that condition is false). But how to we code that condition in Stata?

Since dates are stored as number of days since January 1, 1960, “before” and “after” can be done with less than and greater than, respectively. But that means we need “March 15th, 2020” as a number of days since January 1 1960–just like daten. You can do that using the same conversion functions used to convert variables into dates, just passing in numbers instead of variables. I would suggest that mdy() is the easiest to use:

gen post = (daten > mdy(3, 15, 2020))
Exercise 7

Suppose we declare (somewhat arbitrarily) that the pandemic’s intensive period ended on June 1st, 2021. Create an indicator variable called pandemic that is 1 for weeks during the intensive period and 0 outside it. Find the mean value of ICSA for both periods.

Solution

Date Components

The year() function will extract the year from a date as a simple number:

gen year = year(daten)
list daten year in 1/5

     +------------------+
     |     daten   year |
     |------------------|
  1. | 03jan2015   2015 |
  2. | 10jan2015   2015 |
  3. | 17jan2015   2015 |
  4. | 24jan2015   2015 |
  5. | 31jan2015   2015 |
     +------------------+

Similar functions exist for other time components, such as month() and quarter(). However, these functions only work with daily dates–they assume that the date they’re given is number of days since January 1, 1960. That’s not a problem though: just convert whatever you have to a daily date with the appropriate conversion function, like dofm() or dofc().

Exercise 8

Create a variable containing the quarter of each observation. Which quarter has the highest average number of claims? How does that change if you exclude the pandemic period?

Solution

Date Arithmetic and Durations

Now load a small selection from the American Time Use Survey (ATUS):

clear
use https://sscc.wisc.edu/sscc/pubs/stata_dates/atus
set linesize 85
list in 1/5

     +------------------------------------------------------------------------------+
     |         caseid   actline                                 activity   duration |
     |------------------------------------------------------------------------------|
  1. | 20170101170002         1                                 Sleeping         90 |
  2. | 20170101170002         2   Washing, dressing and grooming oneself         30 |
  3. | 20170101170002         3                Travel related to working         15 |
  4. | 20170101170002         4          Waiting associated with working         15 |
  5. | 20170101170002         5                           Work, main job        150 |
     +------------------------------------------------------------------------------+

The ATUS asks subjects to list all their activities for a 24 hour period starting at 4:00AM the previous day. The ATUS gives the duration of each activity, but let’s use that to calculate the start and end time of each activity. While we’ll be working with times in this example, if we had durations measured in days and wanted to talk about start and end dates the process would be essentially identical.

The start and end times will need to be stored as datetime variables, but we don’t know the date the subjects were interviewed. That’s okay: we’ll just ignore the date part of the datetimes. They’ll all end up being times on January 1, 1960 or January 2nd, 1960 (since the activity list continues until 4:00AM the next day) which you can just think of as “Day One” and “Day Two.”

Since datetimes are just numbers, if you have a starting time and add the duration to it, the result will be the end time. But they do all need to be in the same units. Datetimes are in milliseconds, so the first thing to do is calculate the duration of each activity in milliseconds. To do so multiply by 60 (60 seconds per minute) and then 1000 (1000 milliseconds per second):

gen double dur_ms = duration * 60 * 1000

The start time of the first activity for each person is 4:00AM. Recall that _n is the observation number of the current observation, so the condition if _n==1 identifies the first observation. If the command starts with by caseid: that will be the first observation for each person, but since the data are not sorted by casedid (yet) make that bysort caseid:. Then use hms() to specify a time in terms of hours, minutes and seconds:

bysort caseid: gen double start = hms(4, 0, 0) if _n==1
(3,977 missing values generated)

For all the other activities, their start time is the start time of the previous activity plus the duration of the previous activity in milliseconds:

by caseid: replace start = start[_n-1] + dur_ms[_n-1] if _n>1
(3977 real changes made)

Recall that putting square brackets after a variable allows us to access the value of the variable for a given observation number. Since _n is the current observation, start[_n-1] means “the value of start for the observation before the current observation.”

The end time for every activity is its start time plus its duration:

gen double end = start + dur_ms

Now all you need to do is format start and end appropriately to see the results:

format start end %tc
list caseid actline activity duration start end in 1/5

     +------------------------------------------------------------------------------+
  1. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       1 |                               Sleeping |       90 |
     |------------------------------------------------------------------------------|
     |                       start          |                         end           |
     |          01jan1960 04:00:00          |          01jan1960 05:30:00           |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  2. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       2 | Washing, dressing and grooming oneself |       30 |
     |------------------------------------------------------------------------------|
     |                       start          |                         end           |
     |          01jan1960 05:30:00          |          01jan1960 06:00:00           |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  3. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       3 |              Travel related to working |       15 |
     |------------------------------------------------------------------------------|
     |                       start          |                         end           |
     |          01jan1960 06:00:00          |          01jan1960 06:15:00           |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  4. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       4 |        Waiting associated with working |       15 |
     |------------------------------------------------------------------------------|
     |                       start          |                         end           |
     |          01jan1960 06:15:00          |          01jan1960 06:30:00           |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  5. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       5 |                         Work, main job |      150 |
     |------------------------------------------------------------------------------|
     |                       start          |                         end           |
     |          01jan1960 06:30:00          |          01jan1960 09:00:00           |
     +------------------------------------------------------------------------------+

Note that if you were given start and end times you could easily calculate duration. Just remember it will start out as duration in milliseconds so if you want minutes you need to convert it:

gen duration2 = (end - start) / 60 / 1000
list caseid actline activity duration start end duration2 in 1/5, ab(30)

     +------------------------------------------------------------------------------+
  1. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       1 |                               Sleeping |       90 |
     |------------------------------------------------------------------------------|
     |                  start     |                    end     |     duration2      |
     |     01jan1960 04:00:00     |     01jan1960 05:30:00     |            90      |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  2. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       2 | Washing, dressing and grooming oneself |       30 |
     |------------------------------------------------------------------------------|
     |                  start     |                    end     |     duration2      |
     |     01jan1960 05:30:00     |     01jan1960 06:00:00     |            30      |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  3. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       3 |              Travel related to working |       15 |
     |------------------------------------------------------------------------------|
     |                  start     |                    end     |     duration2      |
     |     01jan1960 06:00:00     |     01jan1960 06:15:00     |            15      |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  4. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       4 |        Waiting associated with working |       15 |
     |------------------------------------------------------------------------------|
     |                  start     |                    end     |     duration2      |
     |     01jan1960 06:15:00     |     01jan1960 06:30:00     |            15      |
     +------------------------------------------------------------------------------+

     +------------------------------------------------------------------------------+
  5. |         caseid | actline |                               activity | duration |
     | 20170101170002 |       5 |                         Work, main job |      150 |
     |------------------------------------------------------------------------------|
     |                  start     |                    end     |     duration2      |
     |     01jan1960 06:30:00     |     01jan1960 09:00:00     |           150      |
     +------------------------------------------------------------------------------+

Our final example won’t actually introduce anything new about dates, so feel free to stop here. Continue if you want more practice with dates, more practice restructuring data, or need to do something similar to the final example.

Restructuring Date-Based Data

The ATUS has one observation per subject per activity, but for some purposes it’s better to break the day up into blocks of time and have one observation per subject per time block. That makes it much easier to say “What was the subject doing at noon?” It could also allow you to create a model for how the person chooses to spend each time block.

We’ll convert this data set to one where each subject has one row for each 15 minute time block. If a subject does two or more activities in a time block, we’ll assign the block to the activity they spent the most time on, or if that’s a tie to the one they did first. The overall plan is:

  1. Expand the data set so there’s one observation per minute (since the durations are in minutes)
  2. Identify which time block each minute belongs to
  3. Identify the activity to keep
  4. Drop all but one observation per time block

First, use the expand command to make a copy of each activity for each minute that it lasted:

expand duration
(283,823 observations created)

We use the duration variable to tell expand how many copies to make. The first observation has a duration of 90, so there are now 90 copies of that observation. (You can also use just a number with expand, for example expand 10 to make 10 copies of each observation.)

Now we need to associate each observation with a different minute. Since the first activity starts at 4:00AM, the first observation for that activity should be for 4:00AM, the second should be 4:01AM, the third 4:02AM, etc. This time is the start time plus one minute per observation number minus one (since the first observation is 4:00AM and not 4:01AM). Don’t forget to convert minutes to milliseconds. Also, this needs to be done separately for each activity (i.e. observation numbers need to start over from 1 for every activity):

bysort caseid actline: gen double time = start + (_n-1)*60*1000
format time %tc

list caseid actline activity start time in 1/5, ab(30)

     +----------------------------------------------------------+
  1. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                                  time                    |
     |                    01jan1960 04:00:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  2. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                                  time                    |
     |                    01jan1960 04:01:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  3. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                                  time                    |
     |                    01jan1960 04:02:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  4. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                                  time                    |
     |                    01jan1960 04:03:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  5. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                                  time                    |
     |                    01jan1960 04:04:00                    |
     +----------------------------------------------------------+

The next step is to identify the 15-minute block each observation falls into. Dividing time by 15 minutes and throwing away the remainder would give you blocks identified by number (i.e. block 0, block 1, etc.) For some purposes that might be what you want. But for this example we’ll multiply by 15 minutes again so the identifier is the time the block starts:

gen double time_block = int(time/(15*60*1000)) * 15*60*1000
format time_block %tc
list caseid actline activity start time time_block in 1/5

     +----------------------------------------------------------+
  1. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                   time     |             time_block      |
     |     01jan1960 04:00:00     |     01jan1960 04:00:00      |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  2. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                   time     |             time_block      |
     |     01jan1960 04:01:00     |     01jan1960 04:00:00      |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  3. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                   time     |             time_block      |
     |     01jan1960 04:02:00     |     01jan1960 04:00:00      |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  4. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                   time     |             time_block      |
     |     01jan1960 04:03:00     |     01jan1960 04:00:00      |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  5. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                   time     |             time_block      |
     |     01jan1960 04:04:00     |     01jan1960 04:00:00      |
     +----------------------------------------------------------+

Our chosen data structure can only have one activity per 15 minute block, and we’ve chosen to keep the one that they spent the most time on, or the activity they did first in case of a tie. So we need to identify how much time the subject spent on each activity during each block. (We don’t care that they spent two hours sleeping if only one minute of it falls in the block we’re looking at.)

Since each observation represents one minute, _N, the number of observations, will tell us how many minutes they spent on an activity if we look at each activity within a block separately:

bysort caseid time_block actline: gen minutes = _N
list caseid actline activity start time time_block minutes in 1/5

     +----------------------------------------------------------+
  1. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                time  |          time_block  |  minutes   |
     |  01jan1960 04:00:00  |  01jan1960 04:00:00  |       15   |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  2. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                time  |          time_block  |  minutes   |
     |  01jan1960 04:01:00  |  01jan1960 04:00:00  |       15   |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  3. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                time  |          time_block  |  minutes   |
     |  01jan1960 04:02:00  |  01jan1960 04:00:00  |       15   |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  4. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                time  |          time_block  |  minutes   |
     |  01jan1960 04:03:00  |  01jan1960 04:00:00  |       15   |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  5. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                time  |          time_block  |  minutes   |
     |  01jan1960 04:04:00  |  01jan1960 04:00:00  |       15   |
     +----------------------------------------------------------+

Now we want to keep the observation in each time block with the highest value of minutes and the lowest value of actline if there’s a tie. The easy way to do that is to sort the data such that that observation comes first and then keep the first observation:

gsort caseid time_block -minutes actline
by caseid time_block: keep if _n==1
list caseid actline activity start time_block in 1/5
(268,800 observations deleted)

     +----------------------------------------------------------+
  1. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                            time_block                    |
     |                    01jan1960 04:00:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  2. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                            time_block                    |
     |                    01jan1960 04:15:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  3. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                            time_block                    |
     |                    01jan1960 04:30:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  4. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                            time_block                    |
     |                    01jan1960 04:45:00                    |
     +----------------------------------------------------------+

     +----------------------------------------------------------+
  5. |         caseid | actline | activity |              start |
     | 20170101170002 |       1 | Sleeping | 01jan1960 04:00:00 |
     |----------------------------------------------------------|
     |                            time_block                    |
     |                    01jan1960 05:00:00                    |
     +----------------------------------------------------------+

Recall that with gsort (generalized sort) sorting by -minutes means sort in descending order rather than ascending.

Let’s clean up the data set a bit before proceeding. Drop unneeded variables. Rename time_block to just time, and it is now the row identifier rather than actline.

drop dur_ms duration2 time minutes actline 
rename time_block time
order caseid time
list in 1/5

     +-----------------------------------------------------------+
  1. |         caseid |               time | activity | duration |
     | 20170101170002 | 01jan1960 04:00:00 | Sleeping |       90 |
     |-----------------------------------------------------------|
     |                  start      |                    end      |
     |     01jan1960 04:00:00      |     01jan1960 05:30:00      |
     +-----------------------------------------------------------+

     +-----------------------------------------------------------+
  2. |         caseid |               time | activity | duration |
     | 20170101170002 | 01jan1960 04:15:00 | Sleeping |       90 |
     |-----------------------------------------------------------|
     |                  start      |                    end      |
     |     01jan1960 04:00:00      |     01jan1960 05:30:00      |
     +-----------------------------------------------------------+

     +-----------------------------------------------------------+
  3. |         caseid |               time | activity | duration |
     | 20170101170002 | 01jan1960 04:30:00 | Sleeping |       90 |
     |-----------------------------------------------------------|
     |                  start      |                    end      |
     |     01jan1960 04:00:00      |     01jan1960 05:30:00      |
     +-----------------------------------------------------------+

     +-----------------------------------------------------------+
  4. |         caseid |               time | activity | duration |
     | 20170101170002 | 01jan1960 04:45:00 | Sleeping |       90 |
     |-----------------------------------------------------------|
     |                  start      |                    end      |
     |     01jan1960 04:00:00      |     01jan1960 05:30:00      |
     +-----------------------------------------------------------+

     +-----------------------------------------------------------+
  5. |         caseid |               time | activity | duration |
     | 20170101170002 | 01jan1960 05:00:00 | Sleeping |       90 |
     |-----------------------------------------------------------|
     |                  start      |                    end      |
     |     01jan1960 04:00:00      |     01jan1960 05:30:00      |
     +-----------------------------------------------------------+
Exercise 9

What is the most common activity in the time block that starts at noon? At 10:00PM? At 2:00AM?

Solution