1 Introduction

In a Twitter thread, I referred to the analysis in Axtell (2001) as an example of research where data binning might not be warranted and could have influenced the analysis. Axtell (2001) condenses 5,541,918 observations from U.S. firms into 13 data points to provide evidence in favor of Zipf’s law. His results are summarized in Figure 1. The left panel of Figure 1 displays the frequency distribution of U.S. firm employment on a log-log scale. The right panel displays the tail CCDF (Complementary Cumulative Distribution Function) of U.S. firm sales on a log-log scale. Both closely follow a straight line with a slope indicative of Zipf’s law.

Figure 1: Frequency distribution of U.S. firm employment on a log-log scale in 1997 (left panel) and tail CCDF of U.S. firm sales on a log-log scale (right panel).

Binning data in order to fit a distribution is rather counterintuitive. It corresponds to making a rough nonparametric estimate (a histogram) of the data before fitting a distribution to this estimate. As such, binning is not warranted if the complete dataset is available: it results in a loss of information, it reweights the observations, and the arbitrary choice of the number of bins influences the distribution fits. A lower number of bins increases the statistical error and the bias from finite size effects on the shape parameter (Virkar and Clauset 2014). Moreover, binned data could favor the Pareto distribution, as binning “tends to smooth out some of the sampling fluctuations” (Virkar and Clauset 2014, 93).

The unbinned CCDFs of Portuguese domestic firm-level sales displayed in Figure 2, which originates from current research, are suggestive of this. Both the empirical sales distribution and the fitted Pareto distribution look quite different from the binned data displayed above. Earlier research displaying an unbinned firm-level sales distribution (see for instance Nigai (2017)) is equally suggestive.

Figure 2: CCDF of Portuguese domestic firm-level sales in 2006.

Drawing conclusions from a comparison between Figures 1 and 2 is hard, however, as we are comparing different countries and different time periods. Ideally, we would apply the binning methodology of Axtell (2001) to a micro-level dataset and investigate the deficiencies related to data binning. Such a direct replication is difficult, however: Axtell (2001) works with different binning schemes for which I do not have the data at hand, relies on two parameter estimation techniques for the Pareto distribution, and seems to left-truncate the firm-level sales distribution at an undisclosed truncation point.

The point of this blog post is therefore to summarily evaluate the methodological choices made by Axtell (2001) regarding data binning. This will not only allow us to assess the evidence in favor of Zipf’s law presented by Axtell (2001), but can also be helpful when faced with future distributional analyses based on binned data. Note that all results in this blog post can be replicated using the code provided on my GitHub account.

2 Preparation: binning schemes

As stated before, data binning merely corresponds to summarizing the data by means of a histogram. The bin widths of a histogram are of critical importance for this summary to be representative of reality. Three binning schemes are often used:

Linear binning: sales are discretized using equidistant binpoints on a linear scale.
Logarithmic binning: sales are discretized using binpoints that are equidistant on a logarithmic scale.
Variable binning: sales are discretized using binpoints between which the distance can vary depending on the (frequency of the) underlying variable. This is often the case in binned public access data where bin widths are small and equidistant on a (log)linear scale for highly frequent small values of the underlying variable, but progressively increase for larger values with low frequency of the underlying variable.
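The three schemes can be sketched in a few lines of Python. This is only an illustration, not the code behind the figures: the function names, the `growth` factor and the way the variable scheme widens its bins beyond the cutoff are assumptions made for this example.

```python
import numpy as np

def linear_edges(x, n_bins):
    """Binpoints that are equidistant on a linear scale."""
    return np.linspace(x.min(), x.max(), n_bins + 1)

def log_edges(x, n_bins):
    """Binpoints that are equidistant on a logarithmic scale (requires x > 0)."""
    return np.geomspace(x.min(), x.max(), n_bins + 1)

def variable_edges(x, n_log, n_wide, cutoff_q=0.8, growth=2.0):
    """Log-spaced binpoints up to a quantile cutoff, then progressively wider bins.

    Assumption for this sketch: beyond the cutoff, each bin is `growth` times
    wider (on the log scale) than the previous one.
    """
    cutoff = np.quantile(x, cutoff_q)
    lower = np.geomspace(x.min(), cutoff, n_log + 1)
    # Geometrically growing log-widths, rescaled so the last edge hits max(x).
    steps = growth ** np.arange(n_wide)
    steps = steps / steps.sum() * (np.log(x.max()) - np.log(cutoff))
    upper = np.exp(np.log(cutoff) + np.cumsum(steps))
    upper[-1] = x.max()  # guard against floating point drift
    return np.concatenate([lower, upper])

# Example: 15 binpoints under each scheme for a positive-valued sample.
sales = np.random.default_rng(0).lognormal(mean=0.0, sigma=1.74, size=10_000)
print(linear_edges(sales, 14)[:3])
print(log_edges(sales, 14)[:3])
print(variable_edges(sales, 10, 4)[-3:])
```

The resulting arrays can be passed directly as the `bins` argument of `np.histogram`.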

We demonstrate the influence of the three binning schemes on the representation of distributions. In Figure 3, we plot both the unbinned (line) and binned (points) CCDFs on a log-log scale for three simulated distributions with 100,000 observations: a Lognormal with a standard deviation of 1.74, a Lognormal with a standard deviation of 1.74 truncated at the second quintile (leaving 60% of the data available), and a Pareto distribution with a shape parameter of 1.01. All distributions are binned with 15 binpoints. The variable binning scheme consists of logarithmic binning up to the fourth quintile, after which we progressively increase the bin widths. This imitates the kind of binning scheme that could be encountered in binned public access data.
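For readers who want to reproduce a comparable setup, the simulated samples can be generated roughly as follows. This is a sketch under assumptions: the location of the Lognormal (`mean=0.0`), the Pareto scale (`x_min = 1.0`), the seed and the helper name `ccdf_at` are not taken from the original analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Lognormal with standard deviation 1.74 (location 0 is an assumption).
lognormal = rng.lognormal(mean=0.0, sigma=1.74, size=n)

# Left-truncate at the second quintile, leaving the upper 60% of the data.
trunc_lognormal = lognormal[lognormal >= np.quantile(lognormal, 0.4)]

# Pareto with shape k via inverse-transform sampling: x = x_min * (1 - u)^(-1/k).
k, x_min = 1.01, 1.0
pareto = x_min * (1.0 - rng.random(n)) ** (-1.0 / k)

def ccdf_at(x, points):
    """Empirical CCDF P(X >= p), evaluated at each binpoint p."""
    xs = np.sort(x)
    return 1.0 - np.searchsorted(xs, points, side="left") / xs.size

# Binned CCDF of the Pareto sample at 15 logarithmically spaced binpoints.
binpoints = np.geomspace(pareto.min(), pareto.max(), 15)
print(ccdf_at(pareto, binpoints))
```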

Figure 3: Complete (line) and binned (points) CCDF on a log-log scale for three simulated distributions and three binning schemes. Cutoff for the variable binning scheme indicated by dashed vertical line.

We can observe from Figure 3 that only logarithmic binning allows us to correctly capture the distributional shape. The performance of linear binning is problematic, as is the performance of variable binning. In the case of variable binning, the CCDF appears to become concave as soon as the bin widths progressively increase. This is a purely mechanical problem that can be overcome by correcting the frequencies observed in each bin for their respective bin widths. The resulting corrected frequency plots are displayed in Figure 4 on a log-log scale. Along with the logarithmic binning scheme, the variable binning scheme can be deemed appropriate when correcting for bin width.
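The bin-width correction amounts to dividing each bin's relative frequency by the width of that bin. A minimal sketch, assuming the binpoints are available as an array of edges (the function name `binned_density` and the use of geometric midpoints for plotting are choices made for this example):

```python
import numpy as np

def binned_density(x, edges):
    """Bin counts and bin-width-corrected frequencies (a histogram density)."""
    counts, _ = np.histogram(x, bins=edges)
    widths = np.diff(edges)
    # Relative frequency per unit of x, so wide bins are no longer inflated.
    density = counts / (counts.sum() * widths)
    # Geometric midpoints are a natural x-position on a log scale (edges > 0).
    centers = np.sqrt(edges[:-1] * edges[1:])
    return centers, counts, density

rng = np.random.default_rng(2)
sample = (1.0 - rng.random(100_000)) ** (-1.0 / 1.01)  # Pareto, shape 1.01
edges = np.geomspace(sample.min(), sample.max(), 15)
centers, counts, density = binned_density(sample, edges)
print(np.column_stack([centers, counts, density]))
```

Plotting `density` against `centers` on a log-log scale gives the kind of corrected frequency plot described above.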

Figure 4: Frequency plot on a log-log scale for three simulated distributions and three binning schemes. Cutoff for the variable binning scheme indicated by dashed vertical line.

3 Estimation

The previous section introduced two methods to visualize the distributions. Along with these two visualization methods come two parameter estimators of the Pareto distribution. First, the CDF of a Pareto distribution is given by

\[ F(x) = 1 - \left(\frac{x_{min}}{x}\right)^k, \]

where \(x_{min}\) and \(k\) represent the scale and shape parameters of the Pareto distribution. As could already be deduced from Figure 3, the CCDF of a Pareto distribution, \(1 - F(x) = (x_{min}/x)^k\), is a straight line with slope \(-k\) on a log-log plot. A simple OLS estimation of the log CCDF on log sales therefore suffices to recover the Pareto shape parameter.
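As an illustration of this estimator on unbinned data, the sketch below regresses the log of the empirical CCDF on the log of the observations and reports minus the slope. The function name and the convention used for the empirical CCDF (share of observations at or above each value) are assumptions of this example.

```python
import numpy as np

def ols_shape_from_ccdf(x):
    """Estimate the Pareto shape as minus the OLS slope of log CCDF on log x."""
    xs = np.sort(x)
    n = xs.size
    # Share of observations >= xs[i]; strictly positive, so the log is defined.
    ccdf = 1.0 - np.arange(n) / n
    slope, _intercept = np.polyfit(np.log(xs), np.log(ccdf), 1)
    return -slope

rng = np.random.default_rng(0)
k_true = 1.01
pareto = (1.0 - rng.random(100_000)) ** (-1.0 / k_true)
print(ols_shape_from_ccdf(pareto))
```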

At this point, however, another disadvantage of data binning becomes apparent (see Clauset, Shalizi, and Newman (2009) for additional disadvantages of the OLS approach). Data binning condenses observations into data points. As the number of condensed observations differs per bin, this entails a reweighting of the observations. Whereas a bin in the left tail of the distribution can summarize up to a million observations, a bin in the far right tail can consist of just one observation. In a simple OLS, both bins are assigned an equal weight.

The influence of observation weights on parameter estimates can easily be observed from the distribution fits displayed in Figure 5. When the estimation methodology is applied to unbinned Pareto distributed data (lower panel), one can estimate the Pareto shape parameter of 1.01 without bias. When the estimator is applied to unweighted binned data, a bias of 0.03 occurs for the logarithmic binning scheme (lower middle panel) and of 0.21 for the variable binning scheme (lower right panel). When frequency weights are applied to the OLS estimator, this bias is reduced to zero for the logarithmic binning scheme and to 0.02 for the variable binning scheme.
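The difference between the unweighted and the frequency-weighted fit can be sketched as follows. This is an illustrative implementation, not the one used for Figure 5: the function name, the seed and the choice to evaluate the CCDF at the lower edge of each bin are assumptions, so the numbers it prints need not match the biases reported above.

```python
import numpy as np

def binned_ccdf_shape(x, edges, weighted):
    """Minus the OLS slope of log CCDF on log binpoint, optionally weighted."""
    counts, _ = np.histogram(x, bins=edges)
    # CCDF at each lower bin edge: share of observations in that bin or above.
    ccdf = counts[::-1].cumsum()[::-1] / counts.sum()
    keep = ccdf > 0
    lx, ly = np.log(edges[:-1][keep]), np.log(ccdf[keep])
    # np.polyfit applies w to the unsquared residuals, so sqrt(counts)
    # weights each squared residual by the number of observations in the bin.
    w = np.sqrt(counts[keep]) if weighted else None
    slope, _intercept = np.polyfit(lx, ly, 1, w=w)
    return -slope

rng = np.random.default_rng(1)
k_true = 1.01
sample = (1.0 - rng.random(100_000)) ** (-1.0 / k_true)
edges = np.geomspace(sample.min(), sample.max(), 15)

k_unweighted = binned_ccdf_shape(sample, edges, weighted=False)
k_weighted = binned_ccdf_shape(sample, edges, weighted=True)
print(k_unweighted, k_weighted)
```

With frequency weights, the sparsely populated bins in the far right tail no longer count as much as the bins that summarize most of the sample.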

Figure 5: Fitted Pareto distribution to unbinned, weighted and unweighted CCDF on a log-log scale for three simulated distributions and three binning schemes.

A second estimation methodology relies on the PDF of the Pareto distribution. From the definition of the CDF, it follows that the PDF of the Pareto distribution is given by

\[ f(x) = \frac{k x_{min}^k}{x^{k+1}}. \]

As with the CCDF-based estimator, the PDF of a Pareto distribution is represented by a straight line on a log-log plot. A simple OLS estimation therefore suffices to recover the Pareto shape parameter. Similar to the previous estimator, data binning results in a reweighting of the observations and in biased estimates. This can be observed in Figure 6. When the estimator is applied to unweighted binned data, a bias of -0.09 is observed for the logarithmic binning scheme (lower middle panel), while it amounts to 0.49 for the variable binning scheme (lower right panel). When frequency weights are used to correct the OLS estimator, this bias is reduced to zero for the logarithmic binning scheme and to 0.02 for the variable binning scheme.
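For reference, the log-log relation behind this estimator follows from taking logs of the PDF above. This is a short derivation sketch; the intercept symbol \(c\) is introduced here for illustration only.

\[ \log f(x) = \log k + k \log x_{min} - (k+1) \log x = c - (k+1) \log x \]

The OLS slope on a log-log frequency plot therefore estimates \(-(k+1)\) rather than \(-k\), so the shape parameter is recovered as minus the slope, minus one.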

Figure 6: Fitted Pareto distribution to weighted and unweighted frequency plot on a log-log scale for three simulated distributions and three binning schemes.

4 Truncation and evaluation

In some cases, it is argued that a dataset is Pareto distributed in the right tail only. The data therefore has to be left-truncated before fitting the Pareto distribution. In such cases, data binning has the disadvantage of resulting in information loss: it obscures important variation in the tail and can conceal the exact truncation point. First, obscuring variation in the tail of the distribution renders differentiating between real Pareto tails and other heavy-tailed distributions even more difficult. In our simulated data example, the \(R^2\)-values of the OLS estimators are very high both for the truncated Lognormal and for the real Pareto distribution (see Figures 5 and 6). Based on these values (see Clauset, Shalizi, and Newman (2009) and Virkar and Clauset (2014) for more powerful methods to differentiate between heavy-tailed distributions), the truncated Lognormal is almost indistinguishable from the Pareto distribution.
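The point about \(R^2\) can be checked with a small simulation. The sketch below fits an unweighted OLS line to a binned log-log CCDF and reports the implied shape and the \(R^2\). The helper name, the seeds, the Lognormal location and the use of 15 logarithmic binpoints are assumptions of this example, so the values will not coincide exactly with those in Figures 5 and 6.

```python
import numpy as np

def binned_ccdf_r2(x, n_points=15):
    """Unweighted OLS on a log-binned CCDF: returns (shape estimate, R^2)."""
    edges = np.geomspace(x.min(), x.max(), n_points)
    xs = np.sort(x)
    # Empirical CCDF at every binpoint except the last (where it would be ~0).
    ccdf = 1.0 - np.searchsorted(xs, edges[:-1], side="left") / xs.size
    lx, ly = np.log(edges[:-1]), np.log(ccdf)
    slope, intercept = np.polyfit(lx, ly, 1)
    resid = ly - (slope * lx + intercept)
    r2 = 1.0 - resid.var() / ly.var()
    return -slope, r2

rng = np.random.default_rng(0)
lognormal = rng.lognormal(mean=0.0, sigma=1.74, size=100_000)
trunc_lognormal = lognormal[lognormal >= np.quantile(lognormal, 0.4)]
pareto = (1.0 - rng.random(100_000)) ** (-1.0 / 1.01)

shape_ln, r2_ln = binned_ccdf_r2(trunc_lognormal)
shape_pa, r2_pa = binned_ccdf_r2(pareto)
print("truncated Lognormal:", shape_ln, r2_ln)
print("Pareto:", shape_pa, r2_pa)
```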

Second, data binning can conceal the exact truncation point, which is vital to the accuracy of the estimated shape parameter k. Choosing a minimum that is too low results in a biased shape parameter, as one will be fitting a power law to non-power-law data. Choosing a value that is too high, on the other hand, increases the statistical error and the bias from finite size effects on the shape parameter, as one discards legitimate data points. Moreover, it is widely documented that the scale and shape parameters of the Pareto distribution exhibit a positive correlation (see for instance Head, Mayer, and Thoenig (2014)), so an error in the chosen truncation point carries over to the estimated shape parameter.
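The sensitivity of the shape estimate to the truncation point can be illustrated on simulated Lognormal data, where no truncation point yields a true Pareto tail. The sketch below re-estimates the shape with the CCDF-based OLS on unbinned data for several candidate truncation quantiles; the quantiles, the seed and the function name are hypothetical choices for this example.

```python
import numpy as np

def ols_shape(x):
    """Minus the OLS slope of the log empirical CCDF on log x (unbinned)."""
    xs = np.sort(x)
    ccdf = 1.0 - np.arange(xs.size) / xs.size
    slope, _intercept = np.polyfit(np.log(xs), np.log(ccdf), 1)
    return -slope

rng = np.random.default_rng(3)
sample = rng.lognormal(mean=0.0, sigma=1.74, size=100_000)

shape_by_cutoff = {}
for q in (0.4, 0.6, 0.8, 0.9):
    x_min = np.quantile(sample, q)   # candidate truncation point
    tail = sample[sample >= x_min]   # left-truncated data
    shape_by_cutoff[q] = ols_shape(tail)
    print(f"truncation quantile {q:.1f}: x_min = {x_min:.2f}, shape = {shape_by_cutoff[q]:.2f}")
```

On Lognormal data the estimated shape rises with the truncation point, which is the positive relation between scale and shape mentioned above.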

5 Conclusion

In conclusion, DO NOT bin your data. If you DO use binned data, be careful. Binning data means losing information, reweighting data and introducing arbitrariness by means of the chosen number of bins. As the analysis presented in this blog post demonstrates, binned data introduces pitfalls for accurate data analysis.

With the knowledge obtained on how to circumvent these pitfalls, we can now assess the results presented by Axtell (2001) in Figure 1. In panel one of Figure 1, we observe data that was binned using a variable binning scheme. This variable binning scheme is accounted for by correcting the frequencies and displaying the distribution in a frequency plot. The fitted regression line, however, is obtained from unweighted binned data and is likely biased. Given the high frequency attached to the deviating binpoint at the lower end of the data, it is highly likely that this regression line is steeper than what would have been obtained from a weighted OLS estimator with smaller bias.

The binned U.S. sales distribution displayed in panel two of Figure 1 is most important to us, as it allows us to interpret its deviation from the complete Portuguese sales distribution displayed in Figure 2. Sadly, there is too little information to correctly assess the evidence in favor of Zipf’s law based on this panel. As the CCDF appears to be truncated, it is unclear how the complete distribution would relate to Pareto. Our results indicate that obtaining a high \(R^2\) in such a case is not sufficiently strong evidence in favor of the Pareto distribution.

To date, the results presented in Axtell (2001) have a strong influence on the economic literature. I hope this blog post conveys the message that the field would benefit from additional research on the characterization of the firm size distribution, ideally based on unbinned data.

References

Axtell, Robert L. 2001. “Zipf Distribution of U.S. Firm Sizes.” Science 293 (5536): 1818–20.

Clauset, Aaron, Cosma Rohilla Shalizi, and Mark E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” SIAM Review 51 (4): 661–703.

Head, Keith, Thierry Mayer, and Mathias Thoenig. 2014. “Welfare and Trade Without Pareto.” American Economic Review 104 (5): 310–16.

Nigai, Sergey. 2017. “A Tale of Two Tails: Productivity Distribution and the Gains from Trade.” Journal of International Economics 104: 44–62.

Virkar, Yogesh, and Aaron Clauset. 2014. “Power-Law Distributions in Binned Empirical Data.” The Annals of Applied Statistics 8 (1): 89–119.

Ghent University, contact: rubenl.dewitte@ugent.be ↩︎

The Do’s and Dont’s of data binning for distributional analysis.

Ruben Dewitte¹

July 6, 2020

