EstimationData

Class

gme.EstimationData(data_frame=None, name:str='unnamed', imp_var_name:str='importer', exp_var_name:str='exporter', year_var_name:str='year', trade_var_name:str=None, sector_var_name:str=None, notes:List[str]=[])

Description

An object used for storing data for gravity modeling and producing some summary statistics.

Arguments

data_frame: Pandas.DataFrame
  A DataFrame containing trade, gravity, etc. data.

name: (optional) str
  A name for the dataset.

imp_var_name: str
  The name of the column containing importer IDs.

exp_var_name: str
  The name of the column containing exporter IDs.

year_var_name: str
  The name of the column containing year data.

trade_var_name: (optional) str
  The name of the column containing trade data.

sector_var_name: (optional) str
  The name of the column containing sector/industry/product IDs.

notes: (optional) str
  A string to be included as a note in the object.

Attributes

data_frame: Pandas.DataFrame
  The supplied DataFrame.

name: str
  The supplied data name.

imp_var_name: str
  The name of the column containing importer IDs.

exp_var_name: str
  The name of the column containing exporter IDs.

year_var_name: str
  The name of the column containing year data.

trade_var_name: str
  The name of the column containing trade data.

sector_var_name: str
  The name of the column containing sector/industry/product IDs.

notes: List[str]
  A list of notes.

number_of_exporters: int
  The number of unique exporter IDs in the dataset.

number_of_importers: int
  The number of unique importer IDs in the dataset.

shape: List[int]
  The dimensions of the dataset.

countries: List[str]
  A list of the unique country IDs in the dataset.

number_of_countries: int
  The number of unique country IDs in the dataset.

number_of_years: int
  The number of years in the dataset.

columns: List[str]
  A list of column names in the dataset.

number_of_sectors: int
  If a sector is specified, the number of unique sector IDs in the dataset.

Methods

tabulate_by_group:
  Summarize columns by a user-specified grouping. Can be used to tabulate, aggregate,
  summarize, etc. data.

  Arguments:
    tab_variables: List[str]
      Column names of variables to be tabulated.

    by_group: List[str]
      Column names of variables by which to group observations for tabulation.

    how: List[str]
      The method by which to combine observations within a group. Can accept 'count',
      'mean', 'median', 'min', 'max', 'sum', 'prod', 'std', and 'var'. It may work with
      other numpy or pandas functions.

  Returns: Pandas.DataFrame
    A DataFrame of tabulated values for each group.
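
The grouping described above can be approximated in plain pandas. The following is a minimal sketch, not the gme implementation: the helper name tabulate_sketch and the toy data are assumptions made for illustration. It groups, aggregates, and joins the two resulting header levels with '_', which produces the same style of column names ('importer_', 'trade_value_sum') seen in the output of tabulate_by_group.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "importer": ["ARG", "ARG", "ARG", "AUS"],
    "exporter": ["AUS", "BRA", "AUS", "ARG"],
    "year": [1989, 1989, 1990, 1989],
    "trade_value": [10.0, 5.0, 7.0, 3.0],
})


def tabulate_sketch(frame, tab_variables, by_group, how):
    """Hypothetical helper: group, aggregate, and flatten the column names."""
    agg = frame.groupby(by_group)[tab_variables].agg(how).reset_index()
    # After reset_index the columns are (name, statistic) tuples; the grouping
    # columns carry an empty statistic, so joining with '_' gives 'importer_'.
    agg.columns = ["_".join(col) for col in agg.columns]
    return agg


result = tabulate_sketch(df, ["trade_value"], ["importer", "year"], ["sum"])
print(result)
```

Passing several entries in how (for example ['min', 'max']) adds one output column per statistic.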

year_list:
  Returns a list of years present in the data.

countries_each_year:
  Returns a dictionary keyed by year ID containing a list of country IDs present in each
  corresponding year.
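
To make the shape of that dictionary concrete, here is a small pandas sketch under stated assumptions: the helper countries_by_year_sketch is hypothetical, it treats a country as present in a year if it appears as either importer or exporter, and it sorts each list for reproducibility (the method itself may return IDs in a different order).

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "importer": ["ARG", "AUS", "ARG"],
    "exporter": ["AUS", "BRA", "BRA"],
    "year": [1989, 1989, 1990],
})


def countries_by_year_sketch(frame, imp="importer", exp="exporter", year="year"):
    """Hypothetical helper: map each year to the country IDs observed in it."""
    result = {}
    for yr, group in frame.groupby(year):
        # Assumption: a country counts as present if it appears on either side.
        ids = set(group[imp]) | set(group[exp])
        result[yr] = sorted(ids)
    return result


countries = countries_by_year_sketch(df)
print(countries)
```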

sector_list:
  Returns a list of unique sector IDs

dtypes:
  Returns the data types of the columns in the EstimationData.data_frame using
  Pandas.DataFrame.dtypes. See Pandas documentation for more information.

info:
  Prints summary information about EstimationData.data_frame using
  Pandas.DataFrame.info(). See Pandas documentation for more information.

describe:
  Generates some descriptive statistics for EstimationData.data_frame using
  Pandas.DataFrame.describe(). See Pandas documentation for more information.

add_note:
  Add a note to the list of notes in 'notes' attribute.

  Arguments:
    note: str
      A note to add to EstimationData.

  Returns: None

correlation:
  Return and, optionally, plot the correlation matrix of the data.

  Arguments:
    columns: (optional) List[str]
      A list of column names to include in the matrix. Default is to include all columns.

    plot: bool
      If True, it plots a heatmap of the correlation matrix.

  Returns: Pandas.DataFrame
    A correlation matrix.
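
As a rough illustration of the returned object, the sketch below computes a correlation matrix for a subset of columns with pandas' DataFrame.corr(). It is an assumption that this mirrors what the method returns; the toy data is made up, and the heatmap step is omitted.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "trade_value": [1.0, 2.0, 3.0, 4.0],
    "log_distance": [4.0, 3.0, 2.0, 1.0],
    "contiguity": [0.0, 0.0, 1.0, 1.0],
})

# Restrict the matrix to selected columns, analogous to the 'columns' argument.
columns = ["trade_value", "log_distance"]
corr_matrix = df[columns].corr()  # Pearson correlation by default
print(corr_matrix)
```

The result is a square DataFrame indexed and labeled by the selected column names, with ones on the diagonal.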

add_pair_var:
  Create a new variable that is a concatenation of categorical variables for use as a fixed effect or
  cluster category, for example. Original values are separated by '_' in the concatenated value.

  Arguments:
    var_list: List[str]
      List of names of columns to concatenate.

    var_name: (optional) str
      Column name to use for the new variable. By default, it uses the names in var_list
      separated by '_'.

    symmetric: bool
      If False, variables are concatenated in the order given in var_list (e.g. both
      ARG_USA and USA_ARG would be created). If True, the ordering of values across
      columns is ignored and values are concatenated alphabetically (e.g. all ARG and USA
      pairings would be concatenated as ARG_USA).

  Returns: None, adds new column to EstimationData.data_frame.
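
The difference between the two symmetric settings can be illustrated with plain pandas. This is a sketch of the described behavior, not the gme implementation; the toy data and the column name pair_symmetric are assumptions made for illustration.

```python
import pandas as pd

# Toy data; the values are made up for illustration only.
df = pd.DataFrame({
    "importer": ["USA", "ARG", "USA"],
    "exporter": ["ARG", "USA", "BRA"],
})
var_list = ["importer", "exporter"]

# Analogue of symmetric=False: keep the column order given in var_list.
df["importer_exporter"] = df[var_list].astype(str).apply(
    lambda row: "_".join(row), axis=1
)

# Analogue of symmetric=True: sort the values within each row first, so both
# directions of a pair receive the same label.
df["pair_symmetric"] = df[var_list].astype(str).apply(
    lambda row: "_".join(sorted(row)), axis=1
)

print(df)
# importer_exporter: USA_ARG, ARG_USA, USA_BRA
# pair_symmetric:    ARG_USA, ARG_USA, BRA_USA
```

The symmetric form is the one suited to pair-level fixed effects or clusters that should not depend on trade direction.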

Examples

# Load a DataFrame
>>> import pandas as pd
>>> gravity_data = pd.read_csv('https://www.usitc.gov/data/gravity/example_trade_and_grav_data_small.csv')
>>> gravity_data.head(5)
  importer exporter  year   trade_value  agree_pta  common_language  \
0      AUS      BRA  1989  3.035469e+08        0.0              1.0
1      AUS      CAN  1989  8.769946e+08        0.0              1.0
2      AUS      CHE  1989  4.005245e+08        0.0              1.0
3      AUS      DEU  1989  2.468977e+09        0.0              0.0
4      AUS      DNK  1989  1.763072e+08        0.0              1.0
   contiguity  log_distance
0         0.0      9.553332
1         0.0      9.637676
2         0.0      9.687557
3         0.0      9.675007
4         0.0      9.657311

# Create an EstimationData object from the DataFrame
>>> from gme import EstimationData
>>> example_estimation_data = EstimationData(gravity_data,
...                                  imp_var_name='importer',
...                                  exp_var_name='exporter',
...                                  trade_var_name='trade_value',
...                                  year_var_name='year',
...                                  notes='Downloaded from https://www.usitc.gov/data/gravity/example_trade_and_grav_data_small.csv')

# tabulate_by_group
# Sum trade value by importer and year
>>> aggregated_data = example_estimation_data.tabulate_by_group(tab_variables = ['trade_value'],
...                                                         by_group = ['importer', 'year'],
...                                                         how = ['sum'])
>>> aggregated_data.head(5)
  importer_  year_  trade_value_sum
0       ARG   1989     0.000000e+00
1       ARG   1990     0.000000e+00
2       ARG   1991     0.000000e+00
3       ARG   1992     0.000000e+00
4       ARG   1993     1.593530e+10

# Summarize minimum and maximum trade flows between each trading pair
>>> summarized_data = example_estimation_data.tabulate_by_group(tab_variables = ['trade_value'],
...                                                         by_group = ['importer', 'exporter'],
...                                                         how = ['min','max'])
>>> summarized_data.head(5)
  importer_ exporter_  trade_value_min  trade_value_max
0       ARG       AUS              0.0     4.095529e+08
1       ARG       AUT              0.0     2.986187e+08
2       ARG       BEL              0.0     7.669537e+08
3       ARG       BOL              0.0     2.743706e+09
4       ARG       BRA              0.0     2.218091e+10

# year_list
>>> example_estimation_data.year_list()
[1989,
 1990,
 1991,
 1992,
 ...

# countries_each_year
>>> countries = example_estimation_data.countries_each_year()
>>> countries.keys()
dict_keys([1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015])
>>> countries[1989]
['ESP',
 'SGP',
 'PHL',
 'NGA',
 'VEN',
 ...

# dtypes
>>> example_estimation_data.dtypes()
importer            object
exporter            object
year                 int64
trade_value        float64
agree_pta          float64
common_language    float64
contiguity         float64
log_distance       float64
dtype: object

# info
>>> example_estimation_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 98612 entries, 0 to 98611
Data columns (total 8 columns):
importer           98612 non-null object
exporter           98612 non-null object
year               98612 non-null int64
trade_value        98612 non-null float64
agree_pta          97676 non-null float64
common_language    97676 non-null float64
contiguity         97676 non-null float64
log_distance       97676 non-null float64
dtypes: float64(5), int64(1), object(2)
memory usage: 6.0+ MB

# describe
>>> example_estimation_data.describe()
               year   trade_value     agree_pta  common_language  \
count  98612.000000  9.861200e+04  97676.000000     97676.000000
mean    2002.210441  1.856316e+09      0.381547         0.380646
std        7.713050  1.004735e+10      0.485769         0.485548
min     1989.000000  0.000000e+00      0.000000         0.000000
25%     1996.000000  1.084703e+06      0.000000         0.000000
50%     2002.000000  6.597395e+07      0.000000         0.000000
75%     2009.000000  6.125036e+08      1.000000         1.000000
max     2015.000000  4.977686e+11      1.000000         1.000000

         contiguity  log_distance
count  97676.000000  97676.000000
mean       0.034051      8.722631
std        0.181362      0.818818
min        0.000000      5.061335
25%        0.000000      8.222970
50%        0.000000      9.012502
75%        0.000000      9.303026
max        1.000000      9.890765

# add_note
>>> example_estimation_data.add_note('year IDs are integers')
>>> example_estimation_data.notes
['Downloaded from https://www.usitc.gov/data/gravity/example_trade_and_grav_data_small.csv',
'year IDs are integers']