Protected Health Information De-Identification Standards

On November 26, 2012, the HHS Office for Civil Rights (OCR) released guidance on the de-identification of Protected Health Information (PHI). The guidance draws on input from experts in various fields, workshops, and in-depth research into de-identification approaches. It is intended to help covered entities understand:

  • What de-identification is
  • The general process by which de-identified information is created
  • The approved de-identification options available to them

The HIPAA Privacy Rule defines de-identification as: “Health information that does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual is not individually identifiable health information.”

Once PHI has been correctly de-identified, it is no longer considered PHI and is therefore no longer subject to the Privacy Rule. The open question was never whether PHI must be protected; it was how de-identification should be carried out, which had been subject to interpretation.

The Privacy Rule provides only two methods for achieving de-identification: (1) Expert Determination and (2) Safe Harbor. Both are explained below.

Expert Determination

The Privacy Rule, at §164.514(b)(1), states the following about expert determination:

  “(1) A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable:

    (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and

    (ii) Documents the methods and results of the analysis that justify such determination”

Although no specific degree or certification is required to be an “expert,” experts are generally found in statistical, mathematical, or other scientific domains. Healthcare expertise is not required, but it is helpful.

The key is that the expert documents the specific principles and methods used and attests that the result meets the “very small” risk requirement. The general workflow suggested by the OCR is as follows:

  1. “The expert will evaluate the extent to which the health information can (or cannot) be identified by the anticipated recipients.
  2. The expert often will provide guidance to the covered entity or business associate on which statistical or scientific methods can be applied to the health information to mitigate the anticipated risk. The expert will then execute such methods as deemed acceptable by the covered entity or business associate data managers, i.e., the officials responsible for the design and operations of the covered entity’s information systems.
  3. The expert will evaluate the identifiability of the resulting health information to confirm that the risk is no more than very small when disclosed to the anticipated recipients. Stakeholder input suggests that a process may require several iterations until the expert and data managers agree upon an acceptable solution.”

Table 1 describes some principles used by experts to determine the identifiability of health information.

Table 1

Replicability

Prioritize health information features into levels of risk according to the chance it will consistently occur in relation to the individual.

Low: Results of a patient’s blood glucose level test will vary

High: Demographics of a patient (e.g., birth date) are relatively stable

Data Source Availability

Determine which external data sources contain the patients’ identifiers and the replicable features in the health information, as well as who is permitted access to the data source.

Low: The results of laboratory reports are not often disclosed with identity beyond healthcare environments.

High: Patient name and demographics are often in public data sources, such as vital records -- birth, death, and marriage registries.

Distinguishability

Determine the extent to which the subject’s data can be distinguished in the health information.

Low: It has been estimated that the combination of Year of Birth, Gender, and 3-Digit ZIP Code is unique for approximately 0.04% of residents in the United States [9]. This means that very few residents could be identified through this combination of data alone.

High: It has been estimated that the combination of a patient’s Date of Birth, Gender, and 5-Digit ZIP Code is unique for over 50% of residents in the United States [10, 11]. This means that over half of U.S. residents could be uniquely described just with these three data elements.

Assess Risk

The greater the replicability, availability, and distinguishability of the health information, the greater the risk for identification.

Low: Laboratory values may be very distinguishing, but they are rarely independently replicable and are rarely disclosed in multiple data sources to which many people have access.

High: Demographics are highly distinguishing, highly replicable, and are available in public data sources.


Safe Harbor

The Privacy Rule, at §164.514(b)(2), defines the Safe Harbor method for de-identification as follows:

    “(2)(i) The following identifiers of the individual or of relatives, employers, or household members of the individual, are removed:

      (A) Names

      (B) All geographic subdivisions smaller than a state, including street address, city, county, precinct, ZIP code, and their equivalent geocodes, except for the initial three digits of the ZIP code if, according to the current publicly available data from the Bureau of the Census:

        (1) The geographic unit formed by combining all ZIP codes with the same three initial digits contains more than 20,000 people; and

        (2) The initial three digits of a ZIP code for all such geographic units containing 20,000 or fewer people is changed to 000

      (C) All elements of dates (except year) for dates that are directly related to an individual, including birth date, admission date, discharge date, death date, and all ages over 89 and all elements of dates (including year) indicative of such age, except that such ages and elements may be aggregated into a single category of age 90 or older

      (D) Telephone numbers

      (E) Fax numbers

      (F) Email addresses

      (G) Social security numbers

      (H) Medical record numbers

      (I) Health plan beneficiary numbers

      (J) Account numbers

      (K) Certificate/license numbers

      (L) Vehicle identifiers and serial numbers, including license plate numbers

      (M) Device identifiers and serial numbers

      (N) Web Universal Resource Locators (URLs)

      (O) Internet Protocol (IP) addresses

      (P) Biometric identifiers, including finger and voice prints

      (Q) Full-face photographs and any comparable images

      (R) Any other unique identifying number, characteristic, or code, except as permitted by paragraph (c) of this section; and

        (ii) The covered entity does not have actual knowledge that the information could be used alone or in combination with other information to identify an individual who is a subject of the information.”
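Item (C) of the list above is one of the less obvious requirements, so a short sketch may help. It assumes integer ages computed as of a reference date and treats "ages over 89" as ages of 90 or more. The function name `generalize_birth_date` and the "90+" label are hypothetical choices for illustration, not terms from the rule.

```python
from datetime import date

def generalize_birth_date(birth, as_of):
    """Reduce a birth date to (birth_year, age_label) in the spirit of item (C).

    Only the year is kept. For individuals aged 90 or older, the year is
    also suppressed (it would indicate the age) and the age is reported
    as the single aggregated category "90+".
    """
    # Whole years elapsed, subtracting one if the birthday has not yet occurred.
    had_birthday = (as_of.month, as_of.day) >= (birth.month, birth.day)
    age = as_of.year - birth.year - (0 if had_birthday else 1)
    if age >= 90:
        return None, "90+"
    return birth.year, str(age)

# Hypothetical examples.
print(generalize_birth_date(date(1950, 6, 15), date(2020, 6, 14)))
print(generalize_birth_date(date(1925, 1, 1), date(2020, 1, 1)))
```

The same year-only reduction would apply to admission, discharge, and death dates; this sketch covers birth dates only.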

The guidance clearly states that applying only part of the criteria does not satisfy the Safe Harbor method. For example, partial Social Security numbers or patient initials are unacceptable. With information so readily accessible today, it is easy to see how most of these 18 types of identifiers could be researched and used to identify individuals. Some require additional clarification, such as item (2)(i)(B).

For ZIP codes, the first three digits must be examined. The OCR guidance, using 2000 Census data, lists 17 restricted three-digit ZIP code prefixes, each covering an area with a population of 20,000 or fewer. If your data contains any ZIP codes that begin with any of the following digits, they must be changed to 000: 036, 059, 063, 102, 203, 556, 692, 790, 821, 823, 830, 831, 878, 879, 884, 890, 893
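The ZIP code rule can be sketched as a small function. This is an illustrative sketch only: the names `RESTRICTED_ZIP3` and `generalize_zip` are hypothetical, and the prefix set is copied from the list above. Because the rule refers to current publicly available Census data, the set should be treated as an assumption to re-check rather than a fixed constant.

```python
# Three-digit prefixes listed above as covering 20,000 or fewer people.
# Assumption: copied from this article; verify against current Census data.
RESTRICTED_ZIP3 = frozenset({
    "036", "059", "063", "102", "203", "556", "692", "790", "821",
    "823", "830", "831", "878", "879", "884", "890", "893",
})

def generalize_zip(zip_code):
    """Return the 3-digit form of a ZIP code permitted under Safe Harbor.

    Accepts 5-digit or ZIP+4 strings. Restricted prefixes become "000".
    """
    digits = zip_code.strip()[:5]
    if len(digits) != 5 or not digits.isdigit():
        raise ValueError(f"not a valid ZIP code: {zip_code!r}")
    prefix = digits[:3]
    return "000" if prefix in RESTRICTED_ZIP3 else prefix

print(generalize_zip("02139"))       # prefix 021 is not restricted
print(generalize_zip("89301-1234"))  # prefix 893 is restricted
```

ZIP codes are handled as strings throughout so that leading zeros are preserved.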

“Actual knowledge,” as described in item (2)(ii), means clear and direct knowledge that the remaining information could be used, either alone or in combination with other information, to identify an individual who is a subject of the information. The OCR guidance gives several examples of this situation:

  1. Occupation: A high-profile occupation, such as university president or local judge.
  2. Clear Familial Relation: For example, a researcher for a covered entity has a family member in the data pool, and the data provides sufficient information (e.g., complicated procedures or an unusual diagnosis in a specific gender or age group) for the researcher to recognize that the data pertains to that relative.
  3. Publicized Clinical Event: For example, the birth of an unusually large number of children at one time that was publicized in the local media.
  4. Knowledge of a Recipient’s Ability: The anticipated recipient of the data has data or an algorithm that could reconstruct the original data, and the covered entity is aware that the recipient has that specific capability.

For more information, see the Privacy Rule De-Identifiers article.