In a Twitter thread, I referred to the analysis in Axtell (2001) as an example of research where data binning might not be warranted and could have influenced the analysis. Axtell (2001) condenses 5,541,918 observations from U.S. firms into 13 data points to provide evidence in favor of Zipf’s law. His results are summarized in Figure 1. The left panel of Figure 1 displays the frequency distribution of U.S. firm employment on a log-log scale. The right panel shows the tail complementary cumulative distribution function (CCDF) of U.S. firm sales on a log-log scale. Both closely follow a straight line with a coefficient indicative of Zipf’s law.
Figure 1: Frequency distribution of U.S. firm employment on a log-log scale in 1997 (left panel) and tail CCDF of U.S. firm sales on a log-log scale (right panel).
Binning data before fitting a distribution is rather counterintuitive. It amounts to making a rough nonparametric estimate (a histogram) of the data and then fitting a distribution to that estimate. As such, binning is not warranted if the complete dataset is available. It discards information, reweights the observations, and makes the fit depend on the arbitrary choice of the number of bins. A lower number of bins increases the statistical error and bias from finite size effects on the shape parameter (Virkar and Clauset 2014). Moreover, binned data could favor the Pareto distribution as it “tends to smooth out some of the sampling fluctuations” (Virkar and Clauset 2014, 93).
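To make the contrast concrete, here is a minimal sketch in Python that compares the two routes on synthetic data. It is not Axtell's data or procedure: the sample is drawn from a Pareto distribution with tail exponent 1 (Zipf's law), and the function names (sample_pareto, fit_mle, fit_binned_ols), the logarithmic bin edges and the OLS fit on the binned density are all assumptions chosen for illustration.

```python
import numpy as np


def sample_pareto(n, alpha, x_min, rng):
    """Draw n Pareto(alpha, x_min) values by inverse-transform sampling."""
    u = 1.0 - rng.random(n)  # uniform on (0, 1], avoids division by zero
    return x_min * u ** (-1.0 / alpha)


def fit_mle(x, x_min):
    """Unbinned maximum-likelihood estimate of the tail exponent alpha."""
    x = np.asarray(x, dtype=float)
    return x.size / np.log(x / x_min).sum()


def fit_binned_ols(x, x_min, n_bins):
    """Histogram the data in logarithmic bins, then fit a line by OLS
    to log(density) against log(bin centre)."""
    x = np.asarray(x, dtype=float)
    edges = np.logspace(np.log10(x_min), np.log10(x.max()), n_bins + 1)
    edges[0], edges[-1] = x_min, x.max()  # guard against rounding at the ends
    counts, _ = np.histogram(x, bins=edges)
    density = counts / (np.diff(edges) * x.size)
    centres = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centres
    keep = counts > 0  # empty bins have no defined log density
    slope, _ = np.polyfit(np.log(centres[keep]), np.log(density[keep]), 1)
    return -slope - 1.0  # a Pareto density has log-log slope -(alpha + 1)


rng = np.random.default_rng(0)
sales = sample_pareto(100_000, alpha=1.0, x_min=1.0, rng=rng)  # synthetic, alpha = 1

print("unbinned MLE:", fit_mle(sales, x_min=1.0))
for n_bins in (5, 13, 50):
    print(n_bins, "bins, OLS:", fit_binned_ols(sales, x_min=1.0, n_bins=n_bins))
```

The unbinned estimate uses every observation and involves no tuning choice, whereas the binned estimate changes with the number of bins, which is the arbitrariness described above.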
The unbinned CCDFs of Portuguese domestic firm-level sales displayed in Figure 2, which originates from current research, suggest that binning can indeed have such an effect. Both the empirical sales distribution and the fitted Pareto distribution look quite different from the binned data displayed above. Earlier research displaying an unbinned firm-level sales distribution (see for instance Nigai (2017)) is equally suggestive.