
Auto Insurance Data

Business Case




The data asset is relational and consists of four data files: one with customer information, a second with address information, a third with demographic data, and a fourth with customer cancellation information. Every data set carries a linking ID, either ADDRESS_ID or CUSTOMER_ID. ADDRESS_ID identifies a specific postal service address, and CUSTOMER_ID identifies a particular individual. Note that multiple customers can be assigned to the same address, and that not all customers have a match in the demographic table.
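As a rough illustration of how the four files could be linked, here is a minimal sketch using Python's built-in sqlite3 module on tiny made-up rows. Only ADDRESS_ID, CUSTOMER_ID, and ACCT_SUSPD_DATE are named in this description; the table names, the CITY and AGE_BAND columns, and the assumption that the customer file carries ADDRESS_ID while the demographic and termination files carry CUSTOMER_ID are all hypothetical, so check them against the actual field list.

```python
import sqlite3

# Tiny in-memory stand-ins for the four files. Table names, CITY, AGE_BAND,
# and the choice of which key sits in which table are assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address     (ADDRESS_ID INTEGER PRIMARY KEY, CITY TEXT);
    CREATE TABLE customer    (CUSTOMER_ID INTEGER PRIMARY KEY, ADDRESS_ID INTEGER);
    CREATE TABLE demographic (CUSTOMER_ID INTEGER PRIMARY KEY, AGE_BAND TEXT);
    CREATE TABLE termination (CUSTOMER_ID INTEGER PRIMARY KEY, ACCT_SUSPD_DATE TEXT);

    -- Customers 1 and 2 share address 10 (several customers per address).
    INSERT INTO address VALUES (10, 'Dallas'), (20, 'Fort Worth');
    INSERT INTO customer VALUES (1, 10), (2, 10), (3, 20);
    -- Customer 2 has no demographic match.
    INSERT INTO demographic VALUES (1, '30-39'), (3, '50-59');
    -- Only customer 2 canceled.
    INSERT INTO termination VALUES (2, '2022-03-15');
""")

# LEFT JOINs keep customers that lack a demographic or termination row;
# the missing side simply comes back as NULL (None in Python).
rows = conn.execute("""
    SELECT c.CUSTOMER_ID, a.ADDRESS_ID, d.AGE_BAND, t.ACCT_SUSPD_DATE
    FROM customer c
    JOIN address a ON a.ADDRESS_ID = c.ADDRESS_ID
    LEFT JOIN demographic d ON d.CUSTOMER_ID = c.CUSTOMER_ID
    LEFT JOIN termination t ON t.CUSTOMER_ID = c.CUSTOMER_ID
    ORDER BY c.CUSTOMER_ID
""").fetchall()

for row in rows:
    print(row)
```

An inner join against the demographic table would silently drop the customers without a match, which is why the sketch uses a left join there.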

The latitude-longitude information generally refers to the Dallas-Fort Worth Metroplex in North Texas and is mappable at a high level. Just be aware that if you drill down too far, some people may live in the middle of Jerry World, DFW Airport, or Lake Grapevine. Any lat/long pointing to a specific residence, business, or other physical site is purely coincidental. The physical addresses are fake and are unrelated to the lat/long.


In the termination table, you can derive a binary outcome (churn/did not churn) from the ACCT_SUSPD_DATE field. The data set is modelable. That is, you can use the other fields in the data to predict who did and did not churn. The underlying logic behind the prediction should be consistent with predicting auto insurance churn in the real world.
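One way to derive the binary target is sketched below. It assumes that a customer counts as churned when a non-empty ACCT_SUSPD_DATE exists for them, and as not churned when the date is blank or the customer has no termination row at all. That rule, the CUSTOMER_ID keying, and the sample values are assumptions for illustration, not something this description confirms.

```python
def churn_flag(acct_suspd_date):
    """Return 1 if a suspension date is present, else 0.

    Assumption: a missing or blank ACCT_SUSPD_DATE means "did not churn".
    """
    if acct_suspd_date is None:
        return 0
    return 1 if str(acct_suspd_date).strip() else 0


# Hypothetical termination rows keyed by CUSTOMER_ID (made-up values).
termination = {2: "2022-03-15", 5: ""}
customer_ids = [1, 2, 5]

# Customers absent from the termination data get None from .get() and flag 0.
labels = {cid: churn_flag(termination.get(cid)) for cid in customer_ids}
print(labels)
```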

Rows and Records

There are 1,536,673 unique addresses.

There are 2,280,321 unique customers. Of these, 2,112,579 have demographic information, and 269,259 canceled during the previous year.
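For a quick sense of scale, the ratios implied by these counts can be worked out directly. The snippet below uses only the figures quoted above; the variable names are illustrative.

```python
addresses = 1_536_673
customers = 2_280_321
with_demographics = 2_112_579
canceled = 269_259

churn_rate = canceled / customers                 # share of customers who canceled
demo_coverage = with_demographics / customers     # share with demographic data
customers_per_address = customers / addresses     # average household size in the data
without_demographics = customers - with_demographics

print(f"Churn rate:             {churn_rate:.1%}")
print(f"Demographic coverage:   {demo_coverage:.1%}")
print(f"Customers per address:  {customers_per_address:.2f}")
print(f"No demographic match:   {without_demographics:,}")
```

The churn rate comes out to roughly 12%, so a model trained on this data faces a moderately imbalanced target, and about 7% of customers will have no demographic fields after a join.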

Real or Fake: 

This data is 100% fake. Any relationship to the real world is coincidental.

Fields and descriptions:

Available for download here.
