SciPy Tutorial - The Statistics Library scipy.stats
http://blog.csdn.net/pipisorry/article/details/49515215
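Statistical functions (scipy.stats)
Python has an excellent package for statistical inference: the stats module in SciPy.
The stats module provides random variables for many probability distributions. They fall into two kinds, continuous and discrete.
Every continuous random variable is an instance of a subclass of rv_continuous, and every discrete random variable is an instance of a subclass of rv_discrete.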
This module contains a large number of probability distributions as well as a growing library of statistical functions.
Each univariate distribution is an instance of a subclass of rv_continuous(rv_discrete for discrete distributions):
rv_continuous([momtype, a, b, xtol, ...]) | A generic continuous random variable class meant for subclassing. |
rv_discrete([a, b, name, badvalue, ...]) | A generic discrete random variable class meant for subclassing. |
Continuous distributions and related functions
Continuous distributions
alpha | An alpha continuous random variable. |
anglit | An anglit continuous random variable. |
arcsine | An arcsine continuous random variable. |
beta | A beta continuous random variable. |
betaprime | A beta prime continuous random variable. |
bradford | A Bradford continuous random variable. |
burr | A Burr (Type III) continuous random variable. |
burr12 | A Burr (Type XII) continuous random variable. |
cauchy | A Cauchy continuous random variable. |
chi | A chi continuous random variable. |
chi2 | A chi-squared continuous random variable. |
cosine | A cosine continuous random variable. |
dgamma | A double gamma continuous random variable. |
dweibull | A double Weibull continuous random variable. |
erlang | An Erlang continuous random variable. |
expon | An exponential continuous random variable. |
exponnorm | An exponentially modified Normal continuous random variable. |
exponweib | An exponentiated Weibull continuous random variable. |
exponpow | An exponential power continuous random variable. |
f | An F continuous random variable. |
fatiguelife | A fatigue-life (Birnbaum-Saunders) continuous random variable. |
fisk | A Fisk continuous random variable. |
foldcauchy | A folded Cauchy continuous random variable. |
foldnorm | A folded normal continuous random variable. |
frechet_r | A Frechet right (or Weibull minimum) continuous random variable. |
frechet_l | A Frechet left (or Weibull maximum) continuous random variable. |
genlogistic | A generalized logistic continuous random variable. |
gennorm | A generalized normal continuous random variable. |
genpareto | A generalized Pareto continuous random variable. |
genexpon | A generalized exponential continuous random variable. |
genextreme | A generalized extreme value continuous random variable. |
gausshyper | A Gauss hypergeometric continuous random variable. |
gamma | A gamma continuous random variable. |
gengamma | A generalized gamma continuous random variable. |
genhalflogistic | A generalized half-logistic continuous random variable. |
gilbrat | A Gilbrat continuous random variable. |
gompertz | A Gompertz (or truncated Gumbel) continuous random variable. |
gumbel_r | A right-skewed Gumbel continuous random variable. |
gumbel_l | A left-skewed Gumbel continuous random variable. |
halfcauchy | A Half-Cauchy continuous random variable. |
halflogistic | A half-logistic continuous random variable. |
halfnorm | A half-normal continuous random variable. |
halfgennorm | The upper half of a generalized normal continuous random variable. |
hypsecant | A hyperbolic secant continuous random variable. |
invgamma | An inverted gamma continuous random variable. |
invgauss | An inverse Gaussian continuous random variable. |
invweibull | An inverted Weibull continuous random variable. |
johnsonsb | A Johnson SB continuous random variable. |
johnsonsu | A Johnson SU continuous random variable. |
kappa4 | Kappa 4 parameter distribution. |
kappa3 | Kappa 3 parameter distribution. |
ksone | General Kolmogorov-Smirnov one-sided test. |
kstwobign | Kolmogorov-Smirnov two-sided test for large N. |
laplace | A Laplace continuous random variable. |
levy | A Levy continuous random variable. |
levy_l | A left-skewed Levy continuous random variable. |
levy_stable | A Levy-stable continuous random variable. |
logistic | A logistic (or Sech-squared) continuous random variable. |
loggamma | A log gamma continuous random variable. |
loglaplace | A log-Laplace continuous random variable. |
lognorm | A lognormal continuous random variable. |
lomax | A Lomax (Pareto of the second kind) continuous random variable. |
maxwell | A Maxwell continuous random variable. |
mielke | A Mielke’s Beta-Kappa continuous random variable. |
nakagami | A Nakagami continuous random variable. |
ncx2 | A non-central chi-squared continuous random variable. |
ncf | A non-central F distribution continuous random variable. |
nct | A non-central Student’s T continuous random variable. |
norm | A normal continuous random variable. |
pareto | A Pareto continuous random variable. |
pearson3 | A pearson type III continuous random variable. |
powerlaw | A power-function continuous random variable. |
powerlognorm | A power log-normal continuous random variable. |
powernorm | A power normal continuous random variable. |
rdist | An R-distributed continuous random variable. |
reciprocal | A reciprocal continuous random variable. |
rayleigh | A Rayleigh continuous random variable. |
rice | A Rice continuous random variable. |
recipinvgauss | A reciprocal inverse Gaussian continuous random variable. |
semicircular | A semicircular continuous random variable. |
skewnorm | A skew-normal random variable. |
t | A Student’s T continuous random variable. |
trapz | A trapezoidal continuous random variable. |
triang | A triangular continuous random variable. |
truncexpon | A truncated exponential continuous random variable. |
truncnorm | A truncated normal continuous random variable. |
tukeylambda | A Tukey-Lambda continuous random variable. |
uniform | A uniform continuous random variable. |
vonmises | A Von Mises continuous random variable. |
vonmises_line | A Von Mises continuous random variable. |
wald | A Wald continuous random variable. |
weibull_min | A Frechet right (or Weibull minimum) continuous random variable. |
weibull_max | A Frechet left (or Weibull maximum) continuous random variable. |
wrapcauchy | A wrapped Cauchy continuous random variable. |
Methods of continuous random variable objects
rvs(*args, **kwds) | Random variates of given type. Draws random samples from the distribution; the size parameter sets the shape of the output array. |
pdf(x, *args, **kwds) | Probability density function at x of the given RV. Returns the density of the distribution at each x. |
logpdf(x, *args, **kwds) | Log of the probability density function at x of the given RV. |
cdf(x, *args, **kwds) | Cumulative distribution function of the given RV. This is the integral of the probability density function, i.e. P(X ≤ x), evaluated at each x. |
logcdf(x, *args, **kwds) | Log of the cumulative distribution function at x of the given RV. |
sf(x, *args, **kwds) | Survival function (1 - cdf) at x of the given RV. Its value is 1 - cdf(x). |
logsf(x, *args, **kwds) | Log of the survival function of the given RV. |
ppf(q, *args, **kwds) | Percent point function (inverse of cdf) at q of the given RV. For q = 0.01, ppf returns the x at which P(X ≤ x) = 0.01. |
isf(q, *args, **kwds) | Inverse survival function (inverse of sf) at q of the given RV. |
moment(n, *args, **kwds) | n-th order non-central moment of distribution. |
stats(*args, **kwds) | Some statistics of the given RV. By default it returns the mean and variance of the random variable. |
entropy(*args, **kwds) | Differential entropy of the RV. |
expect([func, args, loc, scale, lb, ub, ...]) | Calculate expected value of a function with respect to the distribution. |
median(*args, **kwds) | Median of the distribution. |
mean(*args, **kwds) | Mean of the distribution. |
std(*args, **kwds) | Standard deviation of the distribution. |
var(*args, **kwds) | Variance of the distribution. |
interval(alpha, *args, **kwds) | Confidence interval with equal areas around the median. |
__call__(*args, **kwds) | Freeze the distribution for the given arguments. |
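fit(data, *args, **kwds) | Return MLEs for shape, location, and scale parameters from data. Fits the distribution to a set of samples by finding the parameters that best match the data. For example, stats.norm.fit(x) treats x as a sample from some normal distribution and returns its best-fit parameters (mean, std). |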
fit_loc_scale(data, *args) | Estimate loc and scale parameters from data using 1st and 2nd moments. |
nnlf(theta, x) | Return negative loglikelihood function. |
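The methods in the table above can be tried on a frozen distribution. The following is a minimal sketch, assuming a normal distribution with mean 1 and standard deviation 2; the parameter values and variable names are illustrative choices, not taken from the original article.

```python
import numpy as np
from scipy import stats

# Freeze a normal distribution (illustrative parameters: mean 1, std 2).
rv = stats.norm(loc=1.0, scale=2.0)

x = 2.5
density = rv.pdf(x)           # value of the density at x
lower_tail = rv.cdf(x)        # P(X <= x)
upper_tail = rv.sf(x)         # P(X > x), i.e. 1 - cdf(x)
x_back = rv.ppf(lower_tail)   # ppf inverts cdf, so this recovers x

mean, var = rv.stats(moments="mv")          # mean and variance
samples = rv.rvs(size=5, random_state=0)    # five random variates

print(density, lower_tail, upper_tail, x_back, mean, var)
```

Freezing the distribution once keeps loc and scale out of every later call, which is usually easier to read than repeating them.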
Multivariate distributions
multivariate_normal | A multivariate normal random variable. |
matrix_normal | A matrix normal random variable. |
dirichlet | A Dirichlet random variable. |
wishart | A Wishart random variable. |
invwishart | An inverse Wishart random variable. |
special_ortho_group | A matrix-valued SO(N) random variable. |
ortho_group | A matrix-valued O(N) random variable. |
random_correlation | A random correlation matrix. |
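multivariate_normal
>>> import numpy as np
>>> from scipy.stats import multivariate_normal
>>> x, y = np.mgrid[-1:1:.01, -1:1:.01]
>>> pos = np.dstack((x, y))  # stack the two 2-D coordinate grids into one 3-D array of (x, y) points
>>> rv = multivariate_normal([0.5, -0.2], [[2.0, 0.3], [0.3, 0.5]])
>>> rv.pdf(pos)  # takes the 3-D array: the first two axes index the grid position, the last axis holds each point's coordinates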
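As a sanity check on the grid evaluation, the sketch below compares the density at the mean with the closed-form peak value 1 / (2π sqrt(det Σ)) of a bivariate normal. It reuses the mean and covariance from the example; the check itself and the variable names are additions for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

mean = np.array([0.5, -0.2])
cov = np.array([[2.0, 0.3], [0.3, 0.5]])
rv = multivariate_normal(mean, cov)

# Evaluate the density on a grid: pos has shape (nx, ny, 2).
x, y = np.mgrid[-1:1:.01, -1:1:.01]
pos = np.dstack((x, y))
density = rv.pdf(pos)          # one density value per grid point

# The density of a bivariate normal peaks at its mean.
peak = rv.pdf(mean)
closed_form = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))

print(density.shape, peak, closed_form)
```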
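Discrete distributions and related functions
When a random variable can take only discrete values, its distribution is called a discrete probability distribution. For example, rolling a six-sided die can only produce the integers 1 to 6, so the resulting distribution is discrete.
A discrete distribution is usually described by its probability mass function (PMF). In the stats library, all discrete random variables inherit from the rv_discrete class.
Defining a custom discrete distribution directly with rv_discrete
In stats.rv_discrete(values=(x, p)), the arguments are the values x of the random variable and their corresponding probabilities p.
Suppose we have a loaded die whose faces come up with unequal probabilities. The array x below holds all possible face values and the array p holds the probability of each:
>>> x = range(1,7)
>>> p = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
The following statements define a random variable for this die and call its rvs() method to roll it 20 times, producing random numbers that follow the probabilities p:
>>> dice = stats.rv_discrete(values=(x,p))
>>> dice.rvs(size=20)
array([2, 5, 1, 2, 1, 1, 2, 4, 1, 3, 1, 1, 4, 3, 1, 1, 1, 2, 6, 4])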
from scipy import stats
import numpy as np
import matplotlib.pyplot as plt
fs_meetsig = ...  # input sample; its definition is missing from the source
fs_xk = np.sort(fs_meetsig)
fs_pk = np.ones_like(fs_xk) / len(fs_xk)
fs_rv_dist = stats.rv_discrete(name='fs_rv_dist', values=(fs_xk, fs_pk))
plt.plot(fs_xk, fs_rv_dist.cdf(fs_xk), mec='r', label='friend')  # the style arguments before mec are missing from the source
plt.show()
[rv_discrete Examples]
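Beyond sampling, a custom rv_discrete object supports the usual distribution methods. The sketch below rebuilds the loaded die with the same probabilities and checks its PMF, CDF and mean by hand; the variable names and the fixed seed are illustrative assumptions.

```python
import numpy as np
from scipy import stats

faces = np.arange(1, 7)
probs = (0.4, 0.2, 0.1, 0.1, 0.1, 0.1)
dice = stats.rv_discrete(name="dice", values=(faces, probs))

pmf = dice.pmf(faces)        # probability of each face
cdf_3 = dice.cdf(3)          # P(X <= 3) = 0.4 + 0.2 + 0.1
expected = dice.mean()       # sum of face * probability

# A fixed seed makes the rolls reproducible.
rolls = dice.rvs(size=20, random_state=0)

print(pmf, cdf_3, expected, rolls)
```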
Discrete distributions
bernoulli | A Bernoulli discrete random variable. |
binom | A binomial discrete random variable. |
boltzmann | A Boltzmann (Truncated Discrete Exponential) random variable. |
dlaplace | A Laplacian discrete random variable. |
geom | A geometric discrete random variable. |
hypergeom | A hypergeometric discrete random variable. |
logser | A Logarithmic (Log-Series, Series) discrete random variable. |
nbinom | A negative binomial discrete random variable. |
planck | A Planck discrete exponential random variable. |
poisson | A Poisson discrete random variable. |
randint | A uniform discrete random variable. |
skellam | A Skellam discrete random variable. |
zipf | A Zipf discrete random variable. |
Methods of discrete distributions
rvs(*args, **kwargs) | Random variates of given type. |
pmf(k, *args, **kwds) | Probability mass function at k of the given RV. |
logpmf(k, *args, **kwds) | Log of the probability mass function at k of the given RV. |
cdf(k, *args, **kwds) | Cumulative distribution function of the given RV. |
logcdf(k, *args, **kwds) | Log of the cumulative distribution function at k of the given RV. |
sf(k, *args, **kwds) | Survival function (1 - cdf) at k of the given RV. |
logsf(k, *args, **kwds) | Log of the survival function of the given RV. |
ppf(q, *args, **kwds) | Percent point function (inverse of cdf) at q of the given RV. |
isf(q, *args, **kwds) | Inverse survival function (inverse of sf) at q of the given RV. |
moment(n, *args, **kwds) | n-th order non-central moment of distribution. |
stats(*args, **kwds) | Some statistics of the given RV. |
entropy(*args, **kwds) | Differential entropy of the RV. |
expect([func, args, loc, lb, ub, ...]) | Calculate expected value of a function with respect to the distribution for discrete distribution. |
median(*args, **kwds) | Median of the distribution. |
mean(*args, **kwds) | Mean of the distribution. |
std(*args, **kwds) | Standard deviation of the distribution. |
var(*args, **kwds) | Variance of the distribution. |
interval(alpha, *args, **kwds) | Confidence interval with equal areas around the median. |
__call__(*args, **kwds) | Freeze the distribution for the given arguments. |
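The discrete methods work the same way as their continuous counterparts, with pmf in place of pdf. A short sketch, assuming a Poisson distribution with rate 3 chosen purely for illustration:

```python
import numpy as np
from scipy import stats

# Frozen Poisson distribution with an illustrative rate of 3 events.
rv = stats.poisson(3.0)

p_two = rv.pmf(2)            # P(X = 2) = 3**2 * exp(-3) / 2!
at_most_two = rv.cdf(2)      # P(X <= 2)
more_than_two = rv.sf(2)     # P(X > 2)
mean, var = rv.stats(moments="mv")   # both equal the rate for a Poisson

print(p_two, at_most_two, more_than_two, mean, var)
```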
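age = [...]  # 18 values; the list literal is missing from the source
fat_percent = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6, 42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]
age = np.array(age)
fat_percent = np.array(fat_percent)
data = ...  # array built from age and fat_percent; the expression is missing from the source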
print(stats.describe(data))
DescribeResult(nobs=18, minmax=(array([ 7.8, 17.8]), array([ 60., 61.])), mean=array([ 37.36111111, 37.86666667]), variance=array([ 236.58604575, 188.78588235]), skewness=array([-0.30733374, 0.40999364]), kurtosis=array([-0.65245849, -1.26315357]))
A friendlier output format:
for key, value in stats.describe(data)._asdict().items():
    print(key, ':', value)
nobs : 18
minmax : (array([ 7.8, 17.8]), array([ 60., 61.]))
mean : [ 37.36111111 37.86666667]
variance : [ 236.58604575 188.78588235]
skewness : [-0.30733374 0.40999364]
kurtosis : [-0.65245849 -1.26315357]
pandas offers equivalent functions whose output is easier to read [pandas, the Python data processing library].
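For a self-contained look at stats.describe, the sketch below uses a small made-up two-column array (the numbers are hypothetical, not the article's data). Statistics are computed per column, and the variance uses the unbiased estimator (ddof=1) by default.

```python
import numpy as np
from scipy import stats

# Hypothetical data: 4 observations of 2 variables.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0],
                 [4.0, 40.0]])

result = stats.describe(data)   # axis=0: one statistic per column

for key, value in result._asdict().items():
    print(key, ':', value)
```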
Entropy and KL divergence of probability distributions: scipy.stats.entropy
scipy.stats.entropy(pk, qk=None, base=None)
Calculate the entropy of a distribution for given probability values.
If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=0).
If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=0).
This routine will normalize pk and qk if they don’t sum to 1.
Computing the Shannon entropy with entropy
shannon_entropy = stats.entropy(ij/sum(ij), base=None)
print(shannon_entropy)
A direct Python implementation of entropy
shannon_entropy_func = lambda pij: -sum(pij*np.log(pij))
shannon_entropy = shannon_entropy_func(ij[np.nonzero(ij)])
print(shannon_entropy)
import numpy

def entropy(counts):
    '''Compute entropy.'''
    ps = counts/float(sum(counts))  # coerce to float and normalize
    ps = ps[numpy.nonzero(ps)]      # toss out zeros
    H = -sum(ps * numpy.log2(ps))   # compute entropy
    return H
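KL divergence between two distributions
kl = sp.stats.entropy(fs_rv_dist, nonfs_rv_dist)
Note: entropy expects array-like probability values, so both arguments must be probability vectors defined over the same support; an rv_discrete object cannot be passed directly.
Other implementations of the KL divergence: [Distance and similarity measures]
[scipy.stats.entropy]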
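The two formulas quoted from the documentation can be checked directly. A minimal sketch, assuming two small hand-picked probability vectors pk and qk over the same three outcomes (the values are illustrative):

```python
import numpy as np
from scipy import stats

pk = np.array([0.5, 0.25, 0.25])
qk = np.array([1.0, 1.0, 1.0]) / 3.0

# Shannon entropy in nats (natural log) and in bits (base 2).
h_nats = stats.entropy(pk)
h_bits = stats.entropy(pk, base=2)
h_manual = -np.sum(pk * np.log(pk))

# With a second argument, entropy returns the KL divergence D(pk || qk).
kl = stats.entropy(pk, qk)
kl_manual = np.sum(pk * np.log(pk / qk))

# Unnormalized counts are normalized internally.
h_from_counts = stats.entropy(np.array([2, 1, 1]))

print(h_nats, h_bits, kl)
```

The KL divergence is not symmetric: stats.entropy(qk, pk) generally gives a different value from stats.entropy(pk, qk).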
Hypothesis testing
ttest_1samp(a, popmean[, axis]) | Calculates the T-test for the mean of ONE group of scores. |
ttest_ind(a, b[, axis, equal_var]) | Calculates the T-test for the means of TWO INDEPENDENT samples of scores. |
ttest_rel(a, b[, axis]) | Calculates the T-test on TWO RELATED samples of scores, a and b. |
kstest(rvs, cdf[, args, N, alternative, mode]) | Perform the Kolmogorov-Smirnov test for goodness of fit. |
chisquare(f_obs[, f_exp, ddof, axis]) | Calculates a one-way chi square test. |
power_divergence(f_obs[, f_exp, ddof, axis, ...]) | Cressie-Read power divergence statistic and goodness of fit test. |
ks_2samp(data1, data2) | Computes the Kolmogorov-Smirnov statistic on 2 samples. |
mannwhitneyu(x, y[, use_continuity]) | Computes the Mann-Whitney rank test on samples x and y. |
tiecorrect(rankvals) | Tie correction factor for ties in the Mann-Whitney U and Kruskal-Wallis H tests. |
rankdata(a[, method]) | Assign ranks to data, dealing with ties appropriately. |
ranksums(x, y) | Compute the Wilcoxon rank-sum statistic for two samples. |
wilcoxon(x[, y, zero_method, correction]) | Calculate the Wilcoxon signed-rank test. |
kruskal(*args) | Compute the Kruskal-Wallis H-test for independent samples |
friedmanchisquare(*args) | Computes the Friedman test for repeated measurements |
ttest_1samp implements the one-sample t-test. To test the mean rice yield in the Abra column of the data (a DataFrame df) under the null hypothesis that the population mean yield is 15000:
from scipy import stats as ss
# Perform one sample t-test using 15000 as the true mean
print(ss.ttest_1samp(a=df.loc[:, 'Abra'], popmean=15000))
# OUTPUT
(-1.1281738488299586, 0.26270472069109496)
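It returns a tuple of the following values:
- t : float or array
  the t-statistic
- prob : float or array
  the two-tailed p-value
The output above gives a p-value of 0.263, well above α = 0.05, so there is not sufficient evidence that the mean rice yield differs from 15000. Applying the same test to every variable, again assuming a mean of 15000:
print(ss.ttest_1samp(a=df, popmean=15000))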
# OUTPUT
(array([ -1.12817385, 1.07053437, -65.81425599, -4.564575 , 6.17156198]),
array([ 2.62704721e-01, 2.87680340e-01, 4.15643528e-70,
1.83764399e-05, 2.82461897e-08]))
The first array holds the t-statistics and the second the corresponding p-values.
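Since the DataFrame used above is not included here, the sketch below runs the same test on synthetic data and reproduces the statistic by hand. The sample (50 draws around 15000) and the seed are assumptions made only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic "yield" sample; not the article's data.
rng = np.random.default_rng(0)
sample = rng.normal(loc=15000, scale=2000, size=50)

t_stat, p_value = stats.ttest_1samp(sample, popmean=15000)

# Manual computation: t = (mean - mu0) / (s / sqrt(n)), s with ddof=1.
n = sample.size
t_manual = (sample.mean() - 15000) / (sample.std(ddof=1) / np.sqrt(n))
# Two-tailed p-value from the Student t distribution with n - 1 dof.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n - 1)

print(t_stat, p_value)
```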
Contingency table functions
chi2_contingency(observed[, correction, lambda_]) | Chi-square test of independence of variables in a contingency table. |
contingency.expected_freq(observed) | Compute the expected frequencies from a contingency table. |
contingency.margins(a) | Return a list of the marginal sums of the array a. |
fisher_exact(table[, alternative]) | Performs a Fisher exact test on a 2x2 contingency table. |
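A small sketch of chi2_contingency on a made-up 2x2 table (the counts are hypothetical). The expected frequencies are the outer product of the row and column totals divided by the grand total, and with the continuity correction turned off the statistic is the plain Pearson chi-square.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 contingency table of observed counts.
observed = np.array([[10, 20],
                     [30, 40]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

# Manual expected frequencies and Pearson statistic.
row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / observed.sum()
chi2_manual = ((observed - expected_manual) ** 2 / expected_manual).sum()

print(chi2, p, dof)
print(expected)
```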
Plot-tests
ppcc_max(x[, brack, dist]) | Returns the shape parameter that maximizes the probability plot correlation coefficient for ... |
ppcc_plot(x, a, b[, dist, plot, N]) | Returns (shape, ppcc), and optionally plots shape vs. ... |
probplot(x[, sparams, dist, fit, plot]) | Calculate quantiles for a probability plot, and optionally show the plot. |
boxcox_normplot(x, la, lb[, plot, N]) | Compute parameters for a Box-Cox normality plot, optionally show it. |
Statistical functions for masked arrays (scipy.stats.mstats)
Masked statistics functions
argstoarray(*args) | Constructs a 2D array from a group of sequences. |
betai(a, b, x) | Returns the incomplete beta function. |
chisquare(f_obs[, f_exp, ddof, axis]) | Calculates a one-way chi square test. |
count_tied_groups(x[, use_missing]) | Counts the number of tied values. |
describe(a[, axis]) | Computes several descriptive statistics of the passed array. |
f_oneway(*args) | Performs a 1-way ANOVA, returning an F-value and probability given any ... |
f_value_wilks_lambda(ER, EF, dfnum, dfden, a, b) | Calculation of Wilks lambda F-statistic for multivariate data, per Maxwell ... |
find_repeats(arr) | Find repeats in arr and return a tuple (repeats, repeat_count). |
friedmanchisquare(*args) | Friedman Chi-Square is a non-parametric, one-way within-subjects ANOVA. |
kendalltau(x, y[, use_ties, use_missing]) | Computes Kendall’s rank correlation tau on two variables x and y. |
kendalltau_seasonal(x) | Computes a multivariate Kendall’s rank correlation tau, for seasonal data. |
kruskalwallis(*args) | Compute the Kruskal-Wallis H-test for independent samples |
ks_twosamp(data1, data2[, alternative]) | Computes the Kolmogorov-Smirnov test on two samples. |
kurtosis(a[, axis, fisher, bias]) | Computes the kurtosis (Fisher or Pearson) of a dataset. |
kurtosistest(a[, axis]) | Tests whether a dataset has normal kurtosis |
linregress(*args) | Calculate a regression line |
mannwhitneyu(x, y[, use_continuity]) | Computes the Mann-Whitney statistic |
plotting_positions(data[, alpha, beta]) | Returns plotting positions (or empirical percentile points) for the data. |
mode(a[, axis]) | Returns an array of the modal (most common) value in the passed array. |
moment(a[, moment, axis]) | Calculates the nth moment about the mean for a sample. |
mquantiles(a[, prob, alphap, betap, axis, limit]) | Computes empirical quantiles for a data array. |
msign(x) | Returns the sign of x, or 0 if x is masked. |
normaltest(a[, axis]) | Tests whether a sample differs from a normal distribution. |
obrientransform(*args) | Computes a transform on input data (any number of columns). |
pearsonr(x, y) | Calculates a Pearson correlation coefficient and the p-value for testing non-... |
pointbiserialr(x, y) | Calculates a point biserial correlation coefficient and the associated p-value. |
rankdata(data[, axis, use_missing]) | Returns the rank (also known as order statistics) of each data point along ... |
scoreatpercentile(data, per[, limit, ...]) | Calculate the score at the given ‘per’ percentile of the sequence a. |
sem(a[, axis, ddof]) | Calculates the standard error of the mean (or standard error of measurement) ... |
signaltonoise(data[, axis]) | Calculates the signal-to-noise ratio, as the ratio of the mean over standard ... |
skew(a[, axis, bias]) | Computes the skewness of a data set. |
skewtest(a[, axis]) | Tests whether the skew is different from the normal distribution. |
spearmanr(x, y[, use_ties]) | Calculates a Spearman rank-order correlation coefficient and the p-value ... |
theilslopes(y[, x, alpha]) | Computes the Theil slope as the median of all slopes between paired values. |
threshold(a[, threshmin, threshmax, newval]) | Clip array to a given value. |
tmax(a, upperlimit[, axis, inclusive]) | Compute the trimmed maximum |
tmean(a[, limits, inclusive]) | Compute the trimmed mean. |
tmin(a[, lowerlimit, axis, inclusive]) | Compute the trimmed minimum |
trim(a[, limits, inclusive, relative, axis]) | Trims an array by masking the data outside some given limits. |
trima(a[, limits, inclusive]) | Trims an array by masking the data outside some given limits. |
trimboth(data[, proportiontocut, inclusive, ...]) | Trims the smallest and largest data values. |
trimmed_stde(a[, limits, inclusive, axis]) | Returns the standard error of the trimmed mean along the given axis. |
trimr(a[, limits, inclusive, axis]) | Trims an array by masking some proportion of the data on each end. |
trimtail(data[, proportiontocut, tail, ...]) | Trims the data by masking values from one tail. |
tsem(a[, limits, inclusive]) | Compute the trimmed standard error of the mean. |
ttest_onesamp(a, popmean[, axis]) | Calculates the T-test for the mean of ONE group of scores. |
ttest_ind(a, b[, axis]) | Calculates the T-test for the means of TWO INDEPENDENT samples of ... |
ttest_rel(a, b[, axis]) | Calculates the T-test on TWO RELATED samples of scores, a and b. |
tvar(a[, limits, inclusive]) | Compute the trimmed variance |
variation(a[, axis]) | Computes the coefficient of variation, the ratio of the biased standard deviation ... |
winsorize(a[, limits, inclusive, inplace, axis]) | Returns a Winsorized version of the input array. |
zmap(scores, compare[, axis, ddof]) | Calculates the relative z-scores. |
zscore(a[, axis, ddof]) | Calculates the z score of each value in the sample, relative to the sample ... |
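The point of the mstats variants is that masked entries are ignored. A minimal sketch, assuming a made-up series in which -999.0 is a missing-value sentinel (an illustrative convention, not from the article):

```python
import numpy as np
from scipy.stats import mstats

# Hypothetical measurements; -999.0 marks a missing value.
raw = np.array([1.0, 2.0, 3.0, -999.0])
masked = np.ma.masked_values(raw, -999.0)

# mstats.describe skips the masked entry.
res = mstats.describe(masked)

print(res.nobs, res.mean, res.minmax)
```

Running the unmasked stats.describe on raw would instead let the sentinel dominate every statistic.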
Univariate and multivariate kernel density estimation (scipy.stats.kde)
gaussian_kde(dataset[, bw_method]) | Representation of a kernel-density estimate using Gaussian kernels. |
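A short sketch of gaussian_kde on a synthetic one-dimensional sample (the sample, seed and grid are illustrative assumptions). The estimator is callable and returns the estimated density at the requested points.

```python
import numpy as np
from scipy import stats

# Synthetic sample from a standard normal distribution.
rng = np.random.default_rng(0)
sample = rng.normal(size=500)

kde = stats.gaussian_kde(sample)     # default bandwidth (Scott's rule)

grid = np.linspace(-4, 4, 81)
density = kde(grid)                  # estimated density on the grid

# The estimate is a proper density: it integrates to about 1.
area = kde.integrate_box_1d(-10, 10)

print(density.max(), area)
```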
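Examples of using the statistical functions
Continuous distributions: the normal (Gaussian) distribution
{A normal (Gaussian) continuous random variable.}
Generating Gaussian random vectors (sampling from a normal distribution): stats.norm.rvs(loc, scale, size)
Parameters:
The location (loc) keyword specifies the mean.
The scale (scale) keyword specifies the standard deviation.
For norm, the loc and scale parameters set the shift and scaling of the random variable. For a normally distributed random variable they correspond to its expected value and standard deviation.
Random deviates y from the Gaussian distribution N(0, 0.01)
Output: array([ 0.05419826, 0.04151471, -0.10784729, 0.18283546, 0.02348312, -0.04611974, 0.0069336 , 0.03840133, -0.05015316, 0.23315205])
y.stats()
(array(0.0), array(0.1))
Note: numpy.random.normal can also be used to generate Gaussian random numbers [numpy library - the random number module numpy.random].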
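Best-fit parameters of a normal distribution: stats.norm.fit(x)
>>> x = stats.norm.rvs(loc=1.0, scale=2.0, size=100)
The fit() method fits the distribution to the random sample x and returns the parameters of the random variable that best match the sample.
>>> stats.norm.fit(x)  # estimated mean and standard deviation of the sample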
array([ 1.01810091, 2.00046946])
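For the normal distribution, the maximum-likelihood estimates returned by fit() have a closed form: the sample mean and the population-style standard deviation (ddof=0). The sketch below checks this on a seeded sample; the sample size and seed are illustrative choices.

```python
import numpy as np
from scipy import stats

# Draw a reproducible sample from N(mean=1, std=2).
x = stats.norm.rvs(loc=1.0, scale=2.0, size=1000, random_state=0)

loc_hat, scale_hat = stats.norm.fit(x)

# The MLE for a normal is the sample mean and the ddof=0 standard deviation.
print(loc_hat, x.mean())
print(scale_hat, x.std())
```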
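Value of the N(1, 1) probability density function at a given x
Note: as the normal density formula shows, norm.pdf(x, loc=1, scale=s) differs from norm.pdf(x - 1) in general; the two agree only when the standard deviation s is 1.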
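The relationship in the note can be verified numerically. A minimal sketch, assuming loc = 1 and an illustrative scale of 2: shifting alone does not reproduce the density, while standardizing and dividing by the scale does.

```python
import numpy as np
from scipy import stats

x = np.linspace(-3.0, 5.0, 9)
loc, scale = 1.0, 2.0

direct = stats.norm.pdf(x, loc=loc, scale=scale)

# Correct identity: standardize, then divide the density by scale.
standardized = stats.norm.pdf((x - loc) / scale) / scale

# Shifting only is wrong unless scale == 1.
shifted_only = stats.norm.pdf(x - loc)

# With scale = 1 the shift alone is enough, e.g. for N(1, 1).
unit_scale = stats.norm.pdf(x, loc=loc, scale=1.0)

print(np.allclose(direct, standardized), np.allclose(direct, shifted_only))
```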
Value of the N(1, 1) cumulative distribution function at a given x
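A short sketch of evaluating the CDF of N(1, 1); the evaluation points are illustrative. The CDF at the mean is 0.5, and differences of CDF values give interval probabilities such as the familiar one-sigma mass.

```python
import numpy as np
from scipy import stats

rv = stats.norm(loc=1.0, scale=1.0)

p_below_mean = rv.cdf(1.0)                   # half the mass lies below the mean
within_one_sigma = rv.cdf(2.0) - rv.cdf(0.0) # P(0 < X <= 2), about 0.6827
x_median = rv.ppf(0.5)                       # ppf inverts cdf

print(p_below_mean, within_one_sigma, x_median)
```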
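Plotting one- and two-dimensional normal probability densities
[Probability theory: the Gaussian distribution]
Uniform distribution
mu = uniform.rvs(size=N)  # sample from the uniform distribution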
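Gamma distribution
The gamma distribution takes an additional shape parameter. It can describe the waiting time until k independent random events have occurred, where k is the shape parameter.
The scale parameter theta of the gamma distribution is related to how often the random events occur (it is the reciprocal of the event rate) and is set through the scale argument.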
>>> stats.gamma.stats(2.0,scale=2)
(array(4.0), array(8.0))
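By the mathematical definition of the gamma distribution, its expected value is k*theta and its variance is k*theta^2; the output above confirms both formulas. When a distribution has additional shape parameters, its rvs(), pdf() and other methods take extra arguments to receive them.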
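The sketch below repeats the check with named variables and also shows the shape parameter being passed to pdf(), compared against the textbook density x^(k-1) e^(-x/theta) / (theta^k Γ(k)). The evaluation point x = 1 is an arbitrary illustrative choice.

```python
import numpy as np
from scipy import stats, special

k, theta = 2.0, 2.0

# Mean and variance: k*theta and k*theta**2.
mean, var = stats.gamma.stats(k, scale=theta, moments="mv")

# The shape parameter k is passed positionally, right after x.
x = 1.0
pdf_lib = stats.gamma.pdf(x, k, scale=theta)
pdf_manual = x ** (k - 1) * np.exp(-x / theta) / (theta ** k * special.gamma(k))

print(mean, var, pdf_lib, pdf_manual)
```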
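Discrete distributions: the binomial distribution
Consider an experiment with only two outcomes and success probability p. The binomial distribution gives the probability of exactly k successes in n such independent trials.
The probability mass function of the binomial distribution is:
$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$
With the binomial probability mass function pmf(), it is easy to compute the probability of rolling exactly k sixes.
pmf()
The first argument of pmf() is the value of the random variable; the remaining arguments are the parameters of the distribution. For the binomial distribution these are n and p, and the values range over the integers from 0 to n.
The following computes, from the binomial probability mass function, the probabilities of getting 0 to 5 sixes in 5 rolls of a die:
>>> stats.binom.pmf(range(6), 5, 1/6.0)
array([0.401878, 0.401878, 0.160751, 0.032150, 0.003215, 0.000129])
The result shows that the probability of no sixes and the probability of exactly one six are each 40.2%, while the probability of three sixes is 3.215%.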
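The dice probabilities can be reproduced from the binomial formula C(n, k) p^k (1 - p)^(n - k) without the library. A minimal sketch using math.comb (available from Python 3.8); the variable names are illustrative.

```python
from math import comb

import numpy as np
from scipy import stats

n, p = 5, 1 / 6   # five rolls, probability of a six on each roll

# Evaluate the binomial formula for k = 0..n by hand.
manual = np.array([comb(n, k) * p ** k * (1 - p) ** (n - k)
                   for k in range(n + 1)])

library = stats.binom.pmf(np.arange(n + 1), n, p)

print(np.round(manual, 6))
print(np.allclose(manual, library))
```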
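Poisson distribution
In the binomial distribution, if the number of trials n is large and the success probability p of each trial is small, with a moderate product np, then the distribution of the number of successes is well approximated by the Poisson distribution.
The Poisson distribution uses lambda to describe the average rate at which random events occur per unit of time (or unit of area). If the number of trials n in the binomial distribution is taken to be the number of trials per unit of time, then its product with the event probability p is the average event rate, i.e. lambda = np.
The probability mass function of the Poisson distribution is:
$$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$$
Approximating the binomial distribution
The program below computes the probability mass functions of the binomial and Poisson distributions; when n is large enough, the two are very close.
The average event rate lambda is fixed at 10, and the probability of the event in each trial is derived from the number of trials as p = lambda/n.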
>>> _lambda = 10.0
>>> k = np.arange(20)
>>> poisson = stats.poisson.pmf(k, _lambda)  # Poisson distribution
>>> binom100 = stats.binom.pmf(k, 100, _lambda/100)  # binomial distribution, n = 100
>>> binom1000 = stats.binom.pmf(k, 1000, _lambda/1000)  # binomial distribution, n = 1000
>>> np.max(np.abs(binom100 - poisson))  # maximum error
0.006755311103353312
>>> np.max(np.abs(binom1000 - poisson))  # the error is smaller for n = 1000
0.00063017540509099912
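Simulating the Poisson distribution
The Poisson distribution is well suited to describing the number of random events in a unit of time: for example, the number of times a facility is used in a given period, the number of machine failures, or the number of natural disasters.
Below, random numbers are used to simulate a Poisson distribution and the result is compared with its probability mass function, with an average of lambda = 10 events per second. The longer the observation time, the more closely the per-second event counts follow the Poisson distribution. The original comparison used observation times of 1000 and 50000 seconds; the listing here uses 10000 seconds.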
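>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> count, time_edges = np.histogram(t, bins=time, range=(0, time))
>>> count
array([10, 9, 8, …, 11, 10, 18])
>>> dist, count_edges = np.histogram(count, bins=20, range=(0, 20), density=True)
>>> x = count_edges[:-1]
>>> poisson = stats.poisson.pmf(x, _lambda)
>>> np.max(np.abs(dist - poisson))  # the maximum error is small, consistent with a Poisson distribution
0.0088356241037075706
Note: rand() generates the times of _lambda*time events, uniformly distributed between 0 and time.
histogram() counts the number of events in each one-second interval of t, giving count.
By the definition of the Poisson distribution, the values in count should be Poisson distributed. The second histogram() call estimates the probability of each event count in the range 0 to 20. When its density argument is True and every bin has width 1, the result is directly comparable to the probability mass function.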
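Time intervals in a Poisson process: the gamma distribution
The distribution of random events can also be viewed from another angle: the time interval between two adjacent events, or between two events that are k events apart. According to probability theory, these intervals follow a gamma distribution. Since a time interval can take any value, the gamma distribution is a continuous probability distribution. Its probability density function, given below, describes the distribution of the waiting time until the k-th event occurs:
$$f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\theta^k \, \Gamma(k)}$$
Here Γ is the gamma function; when k is a positive integer, Γ(k) = (k - 1)!.
The program below simulates the gamma distribution of the time intervals between events. The observation time is 10000 seconds, with an average of 10 events per second.
In the figure, "k=1" denotes the distribution of the time interval between two adjacent events, and "k=2" the distribution of the interval between two events separated by one event. Both follow a gamma distribution.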
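>>> _lambda = 10
>>> time = 10000
>>> t = np.random.rand(_lambda*time)*time
>>> t.sort()  # sort the event times before computing the intervals between them
>>> s1 = t[1:] - t[:-1]  # interval between two adjacent events
>>> s2 = t[2:] - t[:-2]  # interval between two events separated by one event
>>> dist1, x1 = np.histogram(s1, bins=100, density=True)
>>> dist2, x2 = np.histogram(s2, bins=100, density=True)
>>> gamma1 = stats.gamma.pdf((x1[:-1]+x1[1:])/2, 1, scale=1.0/_lambda)
>>> gamma2 = stats.gamma.pdf((x2[:-1]+x2[1:])/2, 2, scale=1.0/_lambda)
>>> np.max(np.abs(gamma1 - dist1))
0.13557317865888141
>>> np.max(np.abs(gamma2 - dist2))
0.087375030861794656
>>> np.max(gamma1), np.max(gamma2)  # the density values themselves are large, so the errors above are small
(9.3483221580498537, 3.6767953241013656)
Note: simulating the gamma distribution:
First, generate the times of 100000 random events within 10000 seconds, so that events occur at an average rate of 10 per second;
to compute the intervals between consecutive events, the random times must first be sorted;
the second value returned by histogram() holds the bin edges, and the gamma density is evaluated with gamma.pdf() at the midpoint of each bin. The second argument of pdf() is k, and the scale argument is 1/λ.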
from:http://blog.csdn.net/pipisorry/article/details/49515215
ref:Statistical functions (scipy.stats)