Lookahead and lookbehind

Consider a dataset containing a list of addresses:

input str200 address 
"1601 E NASA Pkwy, Houston, TX 77058"
"1000 George Bush Drive West College Station, TX 77845"
"150 West 65th Street, New York, NY 10023"
"1600 Pennsylvania Avenue NW, Washington, DC 20500"
"One Shields Avenue, Davis, CA 95616"
end

The following regular expression retrieves the ZIP code from each address by matching five digits at the end of the string; ustrregexs(0) returns the entire matched text.

gen zip = ustrregexs(0) if ustrregexm(address, "\d{5}$")

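If we only want the ZIP codes from addresses in Texas, we can add a positive lookbehind, (?<=TX\s). It requires the five digits to be immediately preceded by "TX" and a whitespace character, without including those characters in the match.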

gen zip_tx = ustrregexs(0) if ustrregexm(address, "(?<=TX\s)\d{5}$")

A negative lookbehind, (?<!TX\s), retrieves the ZIP codes from addresses that are NOT in Texas: the match succeeds only when the five digits are not preceded by "TX" and a whitespace character.

gen zip_not_tx = ustrregexs(0) if ustrregexm(address, "(?<!TX\s)\d{5}$")
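A lookbehind is not limited to a single literal. As a minimal sketch, the alternation (?:TX|CA) below accepts ZIP codes preceded by either state abbreviation. The test strings are shortened versions of the addresses above and are used only for illustration. Note that ICU requires the text matched by a lookbehind to have a bounded length, so unbounded quantifiers such as * or + should not appear inside it.

```stata
* Lookbehind with alternation: ZIP codes preceded by "TX " or "CA "
* (display only; no variables are created or changed)
display ustrregexm("Houston, TX 77058", "(?<=(?:TX|CA)\s)\d{5}$")   // 1
display ustrregexm("Davis, CA 95616", "(?<=(?:TX|CA)\s)\d{5}$")     // 1
display ustrregexm("New York, NY 10023", "(?<=(?:TX|CA)\s)\d{5}$")  // 0
```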

The dataset now looks like:

. list

     +-----------------------------------------------------------------------------------+
     |                                               address     zip   zip_tx   zip_no~x |
     |-----------------------------------------------------------------------------------|
  1. |                   1601 E NASA Pkwy, Houston, TX 77058   77058    77058            |
  2. | 1000 George Bush Drive West College Station, TX 77845   77845    77845            |
  3. |              150 West 65th Street, New York, NY 10023   10023               10023 |
  4. |     1600 Pennsylvania Avenue NW, Washington, DC 20500   20500               20500 |
  5. |                   One Shields Avenue, Davis, CA 95616   95616               95616 |
     +-----------------------------------------------------------------------------------+
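Lookaheads work the same way but test the text that follows the match instead of the text that precedes it. The sketch below is an illustration rather than part of the original example: the variable names state and first_word are hypothetical. A positive lookahead, (?=\s\d{5}$), pulls out the two-letter state abbreviation that sits right before the ZIP code, and a negative lookahead, (?!\d), extracts the first word only when the address does not begin with a digit.

```stata
clear
input str200 address
"1601 E NASA Pkwy, Houston, TX 77058"
"1000 George Bush Drive West College Station, TX 77845"
"150 West 65th Street, New York, NY 10023"
"1600 Pennsylvania Avenue NW, Washington, DC 20500"
"One Shields Avenue, Davis, CA 95616"
end

* Positive lookahead: two capital letters that are followed by
* whitespace and a five-digit ZIP code at the end of the string
gen state = ustrregexs(0) if ustrregexm(address, "[A-Z]{2}(?=\s\d{5}$)")

* Negative lookahead: the first word, but only when the string
* does not start with a digit
gen first_word = ustrregexs(0) if ustrregexm(address, "^(?!\d)\w+")

list state first_word
```

With these five addresses, state should hold TX, TX, NY, DC, and CA, and first_word should be "One" for the last address and empty for the rest. Abbreviations such as "NW" in the fourth address are skipped because they are not followed by the ZIP code.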

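Stata has four regular expression functions based on the ICU (International Components for Unicode) library: ustrregexm(), ustrregexs(), ustrregexrf(), and ustrregexra(). Asjad Naqvi has a wonderful tutorial about their usage.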
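As a quick, hedged illustration of how the Unicode regular expression functions fit together with lookarounds, the sketch below applies each of them to made-up strings. The input strings and the "#####" placeholder are assumptions chosen for this example.

```stata
* ustrregexm(): returns 1 if the pattern matches and 0 otherwise
display ustrregexm("Houston, TX 77058", "(?<=TX\s)\d{5}$")   // 1

* ustrregexs(n): returns the n-th subexpression of the most recent
* ustrregexm() match; n = 0 is the whole match
display ustrregexs(0)

* ustrregexrf(): replaces only the first match
display ustrregexrf("TX 77058, TX 77845", "(?<=TX\s)\d{5}", "#####")
* TX #####, TX 77845

* ustrregexra(): replaces every match
display ustrregexra("TX 77058, TX 77845", "(?<=TX\s)\d{5}", "#####")
* TX #####, TX #####
```

Because the lookbehind is zero-width, the "TX " prefix is left untouched by the replacements and only the digits are masked.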