On the power laws of language: word frequency distributions
About eight decades ago, Zipf postulated that the word frequency distribution of languages follows a power law, i.e., it appears as a straight line on a log-log plot. Over the years, this phenomenon has been documented and studied extensively. For many corpora, however, the empirical distribution barely resembles a power law: when plotted on a log-log scale, the distribution is concave and appears to be composed of two differently sloped straight lines joined by a smooth curve. A simple generative model is proposed to capture this phenomenon. The word frequency distributions produced by this model are shown to match the observations both analytically and empirically. © 2017 Copyright held by the owner/author(s).
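As an illustration of the log-log reading of Zipf's law described above, the following minimal Python sketch builds a rank-frequency list from tokens and estimates the slope of log frequency against log rank over a chosen range of ranks. The function names `rank_frequency` and `loglog_slope` and the use of an ordinary least-squares fit are assumptions made here for illustration; this is not the generative model proposed in the paper.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return word counts sorted from most to least frequent (rank 1 first)."""
    return sorted(Counter(tokens).values(), reverse=True)


def loglog_slope(freqs, lo=1, hi=None):
    """Least-squares slope of log(frequency) vs. log(rank) for ranks lo..hi.

    Hypothetical helper: ranks are 1-based and inclusive, and the range
    must contain at least two ranks with positive frequencies.
    """
    hi = len(freqs) if hi is None else hi
    xs = [math.log(r) for r in range(lo, hi + 1)]
    ys = [math.log(freqs[r - 1]) for r in range(lo, hi + 1)]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Under these assumptions, a pure power law yields roughly the same slope over any range of ranks, whereas a distribution of the two-sloped kind described above would yield clearly different slopes for the high-frequency and low-frequency ranges, for example when comparing `loglog_slope(freqs, 1, k)` with `loglog_slope(freqs, k, len(freqs))` for some cutoff rank `k`.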