Elo Rating System和Probability Theory Basics

2024-11-05 15:43:20 1 小时读完 (大约 7387 个字)

Categories:

technology

Tags: Elo, Matchmaking, Probabiltiy

Elo Rating System

https://en.wikipedia.org/wiki/Elo_rating_system

The Elo rating system is a method for calculating the relative skill levels of players in zero-sum games such as chess or esports. It is named after its creator Arpad Elo, a Hungarian-American physics professor. (Educated in public primary and secondary schools, he earned his BS (22歲) and MS (25歲) degrees, in Physics, from the University of Chicago.)

西洋棋的elo rating在這裡，我的dota和chess在這裡。

Pairwise comparisons form the basis of the Elo rating methodology.

https://github.com/V2beach/books/blob/main/1978-elo-theratingofchessplayerspastandpresent.pdf

Elo’s central assumption was that the chess performance of each player in each game is a normally distributed random variable. Although a player might perform significantly better or worse from one game to the next, Elo assumed that the mean value of the performances of any given player changes only slowly over time. Elo thought of a player’s true skill as the mean of that player’s performance random variable.

A further assumption is necessary because chess performance in the above sense is still not measurable. One cannot look at a sequence of moves and derive a number to represent that player’s skill. Performance can only be inferred from wins, draws, and losses. Therefore, a player who wins a game is assumed to have performed at a higher level than the opponent for that game. Conversely, a losing player is assumed to have performed at a lower level. If the game ends in a draw, the two players are assumed to have performed at nearly the same level.

實踐中存在很多問題。

Mathematical details

如果玩家A的rating是RA，玩家B的當前分數是RB，玩家A預測分數的公式（using the logistic curve with base 10）是
$E_{\mathrm{A}}=\frac{1}{1+10^{\left(R_{\mathrm{B}}-R_{\mathrm{A}}\right) / 400}} .$
相似地，B的預期分數是
$E_{\mathrm{B}}=\frac{1}{1+10^{\left(R_{\mathrm{A}}-R_{\mathrm{B}}\right) / 400}} .$
也可以表示為
$E_A=\frac{Q_A}{Q_A+Q_B}$
和
$E_B=\frac{Q_B}{Q_A+Q_B},$
其中$Q_A=10^{R_A/400}$$Q_B=10^{R_B/400}$。這個logistic函數的曲線如下，

其實就是用分差把預期分數約束到0到1之間，其中勝利是1，失敗是0。

假設玩家A（當前分數$R_A$）的預期分數是$E_A$但是實際得分是$S_A$，玩家更新分數的公式是
$R_{\mathrm{A}}^{\prime}=R_{\mathrm{A}}+K \cdot\left(S_{\mathrm{A}}-E_{\mathrm{A}}\right).$

K-factor set at K=16 for masters and K=32 for weaker players.

Moreover, online judge sites are also using Elo rating system or its derivatives. For example, Topcoder is using a modified version based on normal distribution, while Codeforces is using another version based on logistic distribution.

Formal derivation for win/loss games
The above expressions can be now formally derived by exploiting the link between the Elo rating and the stochastic gradient update in the logistic regression.有的用隨機梯度更新分數

Logistic/Sigmoid

A logistic function or logistic curve is a common S-shaped curve (sigmoid curve) with the equation.

$f(x)=\frac{L}{1+e^{-k\left(x-x_0\right)}}$

where

L is the carrying capacity, the supremum of the values of the function;
k is the logistic growth rate, the steepness of the curve; and
x0 is the x value of the function’s midpoint.
The logistic function has domain the real numbers, the limit as x→−∞ is 0, and the limit as x→+∞ is L.

Standard logistic function where L=1,k=1,x0=0.

sup(X)是取上限函数，inf(X) 是取下限函数。 sup是supremum的简写，意思是：上确界，最小上界。 inf是infimum的简写，意思是：下确界，最大下界。

A sigmoid function refers specifically to a function whose graph follows the logistic function. It is defined by the formula:

$\sigma(x)=\frac{1}{1+e^{-x}}=\frac{e^x}{1+e^x}=1-\sigma(-x) .$

Some sigmoid functions compared. In the drawing all functions are normalized in such a way that their slope at the origin is 1.

The logistic function was introduced in a series of three papers by Pierre François Verhulst between 1838 and 1847, who devised it as a model of population growth by adjusting the exponential growth model。

Exponential Growth：

The graph illustrates how exponential growth (green) surpasses both linear (red) and cubic (blue) growth.

Linear growth

Cubic growth

Exponential growth

這logistic函數發明初衷是用於優化人口增長函數，在統計學和機器學習的應用只是它茫茫多應用的其中之一而已。
所有圖像長得像logsitic函數的都是sigmoid函數(S-shaped curve)。

Logistic functions are used in several roles in statistics. For example, they are the cumulative distribution function of the logistic family of distributions, and they are, a bit simplified, used to model the chance a chess player has to beat their opponent in the Elo rating system. More specific examples now follow.

Logistic regression

Logistic functions are used in logistic regression to model how the probability p of an event may be affected by one or more explanatory variables: an example would be to have the model

$p = f(a + bx),$

where x is the explanatory variable, a and b are model parameters to be fitted, and f is the standard logistic function.

Logistic regression and other log-linear models are also commonly used in machine learning. A generalisation of the logistic function to multiple inputs is the softmax activation function, used in multinomial logistic regression.

Log-linear

先複習一下，線性回歸是$y_i(x)=\beta_0+\beta_1x_i$，

把數據$(x_i, y_i)$代入之後，用最小平方法(Least Square)計算loss function目標/損失函數，
${Loss}\left({\beta}_0, {\beta}_1\right)=\sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2=\sum_{i=1}^n\left(y_i-\left({\beta}_0+{\beta}_1 x_i\right)\right)^2$
之後更新beta修正函數擬合數據。

而log-linear就是把a+bx結果代入logistic或者其他sigmoid函數，好處是比如對分類任務來說，这个Sigmoid Function可以将线性的值，映射到[0-1]范围中。如果映射结果小于0.5，则认为是负的样本，如果是大于0.5，则认为是正的样本。

採用logistic分布和gaussian/normal分布的區別？

A workable rating system, fully developed from basic theory, includes certain principal components:

Rating scale
Performance distribution function
Percentage expectancy function
Performance rating formula
Continuous rating formula
Appropriate numerical coefficients and ancillary formulae
—— 8.12 The Rating of Chessplayers, Past & Present, Second Edition, Arpad E. Elo

區別就是預期分數（player’s expected score）的計算公式，上面用的logistic function(base10)其實就是logistic distribution的CDF，可以發現如下圖所示跟normal distribution的CDF很像。

In probability theory and statistics, the logistic distribution is a continuous probability distribution.
Its cumulative distribution function is the logistic function, which appears in logistic regression and feedforward neural networks.
It resembles the normal distribution in shape but has heavier tails (higher kurtosis).

logistic distribution其實就是尾部更長的normal distribution了。

Logistic distribution		Normal distribution
Probability density function		Probability density function The red curve is the standard normal distribution.
Cumulative distribution function		Cumulative distribution function
Notation	LogisticDistribution(μ,s)	$N (μ, σ^{2})$
Parameters	$μ,$ location (real) $s > 0,$ scale (real)	$μ \in R$ = mean (location) $σ^{2} \in R_{> 0}$ = variance (squared scale)
Support	$x \in (- \infty, \infty)$	$x \in R$
PDF	$\frac{e^{- (x - μ) / s}}{s {(1 + e^{- (x - μ) / s})}^{2}}$	$\frac{1}{\sqrt{2 π σ^{2}}} e^{- \frac{(x - μ)^{2}}{2 σ^{2}}}$
CDF	$\frac{1}{1 + e^{- (x - μ) / s}} = \frac{1 + \tanh \frac{x - μ}{2 s}}{2}$	$Φ (\frac{x - μ}{σ}) = \frac{1}{2} [1 + \erf (\frac{x - μ}{σ \sqrt{2}})]$
Quantile	$μ + s \log (\frac{p}{1 - p})$	$μ + σ \sqrt{2} \erf^{- 1} (2 p - 1)$
Mean	$μ$	$μ$
Median	$μ$	$μ$
Mode	$μ$	$μ$
Variance	$\frac{s^{2} π^{2}}{3}$	$σ^{2}$

Probability theory basics機率基礎

https://www.youtube.com/watch?v=GwSEguqJj6U&list=PLtvno3VRDR_jMAJcNY1n4pnP5kXtPOmVk
台大電機 Prof. 葉丙成機率與統計 2013，這個講得超好。

為什麼我們要研究機率？
因為我們對這世界了解太少，這世界的運作有很多仍是未知的。——prof.benson

術語：

Experiment實驗(procedures實驗步驟、model機率模型、observations結果觀察，三個部分任何一個改變，都會變為新的實驗)，
Outcome結果，
Sample Space樣本空間，若是S={o1,…,on}n個outcome的集合
Event事件（實驗結果的敘述，因此是Outcome的集合，或者是Sample Space的子集），
Event Space事件空間，則為\{\{o1},…,{on},{o1,o2},{o1,o3},…,{on-1,on},{o1,o2,o3},…,{o1,o2,…,on},{}}一共2^n個事件元素的集合
Probability機率（是個函數，P(Event)，Probability is a function of Event(Set)，Domain定義域就是Event Space，Range值域是[0,1]）
Set Theory（因為Probability is a function of Set，所以要了解Set）術語：
- Element, Set, Subset, Universal Set全集, Empty Set, Intersection交集, Union并集/聯集, Complement補集, Difference差集(X - Y), Disjoint不相交(X, Y disjoint if X∩Y=∅), Mutually Exclusive互斥（兩個set是disjoint，一堆集合兩兩不相交就是互斥）
- De Morgan’s Law:

為什麼事件空間是2^n？一次實驗裡兩個結果，三個結果怎麼可能同時發生？
剛開始這個疑問就是沒搞清結果和事件的區別，一個事件可能包含多個結果。（37分鐘有例子）

Axioms /ˈaksɪəm/

從這部影片第一次了解一個理論的公理，中國大陸的教學真差。
近代數學常從數條公理推導出整套理論，線性代數有8(9?)條公理(9 axioms of linear algebra/real vector spaces):

Additive axioms. For every x,y,z in X, we have
- x+y = y+x.
- (x+y)+z = x+(y+z).
- 0+x = x+0 = x.
- (-x) + x = x + (-x) = 0.
Multiplicative axioms. For every x in X and real numbers c,d, we have
- 0x = 0
- 1x = x
- (cd)x = c(dx)
Distributive axioms. For every x,y in X and real numbers c,d, we have
- c(x+y) = cx + cy.
- (c+d)x = cx +dx.
  整套理論只基於幾條公理的好處是，對於一個新問題只需要確定滿足這些公理（其實沒有定義“+”要怎麼做，是general的，套別的case其實不一定滿足），就能適用其他所有定理。
  公理不用證明，axioms是理性直覺覺得是對的東西。只從這幾條廢話推導出所有定理確實牛逼。

Axioms of Probability(機率三公理):

Axiom 1: For any event A, $P(A) \geq 0$
Axiom 2: Probability of the sample space S is $P(S)=1$
Axiom 3: If A1,A2,A3,⋯ are disjoint events, then $P(A_1 \cup A_2 \cup A_3 \cdots)=P(A_1)+P(A_2)+P(A_3)+\cdots$

axiom3的disjoint即互斥。

性質(定理)

P(A) = P(A-B) + P(A∩B)
P(A∪B) = P(A) + P(B) - P(A∩B)
切麵包定理：若 $C_1, C_2, \cdots, C_n$互斥，且$C_1 \cup C_2 \cup \cdots \cup C_n = S$，則对任何事件 A：$P(A) = P(A \cap C_1) + P(A \cap C_2) + \cdots + (A \cap C_n)$
若A⊂B則P(A)≤P(B)
Boole’s inequality/Union Bound布爾不等式：對任何事件A1, A2, …, An，$\mathbb{P}\left(\bigcup_i A_i\right)\leq \sum_i\mathbb P(A_i)$
Bonferroni’s inequality

conditional probability條件機率(貝氏定理)

P ( X | Y ), Y是觀察到的已發生的事件，X是關心的未發生的事件。

原本實驗的樣本空間若是S即P(S)=1，新的樣本空間就變成了Y即P(Y|Y)=1，這個豎線｜讀作“given”比如P(X|Y)讀作P of X given Y

tex公式裡∪是\\cup，∩是\\cap，一個杯子一個帽子很形象很有趣。

$P(X|Y) = \frac{P(X\cap Y)}{P(Y)}$

P ( X | Y )可以理解成只是樣本空間S變成了Y而已。

若事件C1, C2 . . . Cn互斥且S = C1 ∪ C2 ∪ . . . . . ∪ Cn，則對任意事件A，

Law of Total Probability(切麵包)，$\begin{array}{l} P(A) =\sum\limits_{i=1}^{n} P(C_i)P(A| C_i)\end{array}$。Ex：阿宅 vs. 可愛店員：店員對阿宅笑否，受店的生意影響很大。已知滿座機率，可愛店員對阿宅笑的機率？
Bayes’ Rule, $ P(C_j|A)=\frac{P(A|C_j)P(C_j)}{\sum\limits_{i=1}^{n} P(A | C_i) P(C_i)}$，之前算的都是P of A given C，貝葉斯是解決計算P of C given A。Ex：一日，老闆見可愛店員笑，請問在此情況下，當日生意滿座之機率為何？

Independence(獨立性)

若滿足P(A∩B)=P(A)P(B)則A、B兩事件為獨立事件。
跟互斥對比：A∩B=∅不可能事件即P(A∩B)=0。

可以理解成兩個實驗的樣本空間沒有交集？

Prof. Benson立刻就講到了更好的定義，若滿足P(A|B) = P(A)則A、B兩事件為獨立事件。

根據上面的理解“P ( X | Y )可以理解成只是樣本空間S變成了Y而已”，即Y的樣本空間跟X完全沒有關係，比如X的樣本空間是{o1, o2, …, on}，Y的樣本空間是{p1, p2, …, pn}，事件Y可能是{p1, p2, p3}，完全不影響這個世界裡事件X發生與否，這裡的Y實際上是{p1, p2, p3, o1, o2, …, on}。

非獨立事件的conditional probability是這樣的：假如實驗是投一個6面骰子一次，樣本空間是{1,2,3,4,5,6}，X事件是扔出{1,5,6}，Y事件是扔出大數{4,5,6}，那X｜Y的樣本空間就變成了{4,5,6}即X只可能是扔出了大數中的某個數，o1∩Y=∅，X就變成{5,6}，P(X｜Y)則=P({5,6})/P({4,5,6})=2/3。

Graph Scheme

實驗分解。

counting method

為什麼機率需要算排列組合，需要counting？
因為古典機率常假設每個outcome發生機率相同，所以計算某個事件的機率，等同於計算這個事件包含多少個outcome，像上面計算事件空間那樣，機率（事件）=outcome數/事件空間元素數。所以機率問題等於counting問題。

counting前的判斷：

distinguishable? 可區分？
w/wo replacement? 有放回？
order matters or not? 關心順序嗎？if order matters then Permutation (排列問題) else Combination (組合問題)

Permutation(order matters)

w/o replacement
w/ replacement

Combination(order doesn’t matter)

除以k!就是因為order doesn’t matter，要把order的排列除掉，也就是k個items的所有排列方式k!種（比如3個items有6種排列方式=3!）。

數學裡叫n choose k，n取k。

$C(n, k) = \binom{n}{k} = C_n^k = \frac{n!}{k!(n-k)!}$

二項式定理裡的係數就是這個所以C(n, k)也叫binomial coefficient

Multinomial

Random Variable隨機變數

就是把outcome數字化的方式，讓推導更數學、更簡明。

Discrete R.V. (離散隨機變數)
Continuous R.V. (連續隨機變數)

隨機變數的函數也是隨機變數，因為outcome的函數的函數也是outcome的函數。

CDF (Cumulative Distribution Function)

累計分佈函數$F_X(x) = P(X \le x)$。

discrete r.v.函數曲線是階梯式的，所以X的區間有沒有等於包不包含邊界區別很大

PMF (Probability Mass Function)

對一個discrete random variable來說PMF和CDF可以互相推得

兩個函數一個是X=x(PMF)，一個是X<=x(CDF)就是這個區別。
分布是把很多相似實驗總結成規律。

Discrete Probability Distributions

Bernoulli Distribution

只care成功失敗，一次實驗，非1即0。

Binomial Distribution

二項分布。

n=1時退化成白努力分布。

Uniform Distribution

均勻分布。

Geometric Distribution

几何分布。

$P_X(x)$等比級數，即幾何級數

$S_n=\frac{a_1\left(1-r^n\right)}{1-r}, r \neq 1$

Pascal Distribution

k=1就退化成Geometric分布。

Poisson Distribution

Probability Density Function

PDF就是CDF的微分！

積分constant不用擔心，可以通過-無窮=0等邊界條件唯一確定。

why mass和density，這張圖一目了然，continuous不能有PMF，就是顧名思義“質量”和“密度”的關係

](https://wiki.v2beach.cn/assets/Screenshot 2024-11-02 at 15.31.32.png) ![

Uniform Distribution

Exponential Distribution

Gamma / Erlang Distribution

Normal Distribution

終於來到elo的重點。

比如gaussian，為什麼單獨對standard normal建表？因為：。。。

https://en.wikipedia.org/wiki/Normal_distribution

PDF/PMF vs. CDF

關於PDF vs. PMF上面（why mass和density，這張圖一目了然，continuous不能有PMF，就是顧名思義“質量”和“密度”的關係）寫過了。

PDF/PMF 就是 P(x = …)；
CDF 就是P(… < x < …)。

也是一目了然。

Central Limit Theorem(中央極限定理)

exponential分布的獨立事件convolution完變成了gaussian

應用

為什麼mean是probability，variance是probability(1- probability)，因為Bernoulli的X非1即0，所以expectation是1*probability + 0*(1- probability)，$Var(X)=E(X^2)-E(X)^2 = probability - probability^2$。

失憶性

似然likelihood

概率（Probability）是指给定某个已知的模型和参数时，观测数据出现的可能性。
- 已知模型和参数，计算数据出现的概率。
似然（Likelihood）则是指给定观测数据的情况下，某个特定模型参数出现的可能性。
- 已知数据，计算特定模型参数的可能性。

似然（Likelihood）通常指的是一个函数，用来衡量某个特定参数值在给定数据集下生成这些数据的可能性。它并不是概率，虽然其形式和概率类似，但有所不同。具体来说，似然是根据已知数据和模型参数来计算某个参数值“合理性”的度量。

在统计学中，我们有一个统计模型，其中包含一些我们想要估计的未知参数。我们通过观测到的数据来推断这些参数的值。
似然函数（Likelihood Function）是指给定某个参数值后，生成观测数据的概率（或可能性）。我们通过似然函数来评估不同参数值的合理性。

先驗？後驗？

貝葉斯定理：
$P(\theta \mid D) = \frac{P(D \mid \theta) P(\theta)}{P(D)}$，
其中：

$P(\theta \mid D)$ 是后验概率(Posterior Probability)，即在观察到数据 $D$ 后，参数 $\theta$ 的概率分布。

在贝叶斯框架下，后验（posterior）是我们通过已有的数据（如症状、检查结果等）对某个假设或未知参数（如是否患病）的认知的更新。

$P(D \mid \theta)$ 是似然函数，即在给定参数 $\theta$ 的条件下，观察到数据 $D$ 的概率。

$P(\theta)$ 是先验概率(Prior Probability)，即在没有观察到数据时，关于参数 $\theta$ 的先验知识。

先验（Prior）是指我们在没有观察到数据之前，对某个参数或假设的信念或假设。也就是说，先验是我们根据以前的知识、经验或理论对某个未知量（如参数、假设等）做出的初步估计。

$P(D)$ 是证据或标准化常数，确保后验概率的总和为1。

举个例子：

假设你在诊断一个病人是否患有某种疾病，$\theta$ 代表病人是否患病（例如 $\theta = 1$ 代表患病，$\theta = 0$ 代表不患病）。你有以下信息：

先验 $P(\theta)$：表示你在没有任何诊断数据时，认为病人患病的概率。

似然 $P(D \mid \theta)$：表示在给定病人确实患病或未患病的情况下，得到某种诊断结果 $D$ 的概率。

后验 $P(\theta \mid D)$：表示在获取到诊断结果 $D$ 后，病人患病的更新概率。

應用場景是我們經常需要已知 $P(\theta \mid D)$ 求 $P(D \mid \theta)$（或反之），那麼假如我們還知道 $P(\theta)$ 就可以用貝葉斯公式求解，其中 $P(D)$ 根據上面條件機率中的Law of Total Probability(切麵包)可以通過 $P(D) = =\sum\limits_{i=1}^{n} P(\theta_i)P(D \mid \theta_i)$ 求解。

同樣用上面的例子，“$P(D)$”是診斷結果的機率，可以通過無數個獨立的病人是否患病 $P(\theta_i)$ 乘基於此的診斷結果 $P(D \mid \theta_i)$ 之和求得。

用似然解釋Neural Nets

監督學習模型和無監督學習模型都可以看作是一個似然函數。

似然函數與預測的關係：
- 对于分类任务，假设我们有 $C$ 个类别，并且通过神经网络得到了每个类别的概率 $p(y_i = c | x_i; \theta)$，其中 $c$ 是类别标签，$y_i$ 是样本 $i$ 的真实标签。
- 这个概率分布表示了模型对每个类别的“信心”。假设真实标签是 $y_i = c$，那么我们关心的是模型预测该标签的概率 $p(y_i = c | x_i; \theta)$。
模型目標轉為最大/極大化似然函數（maximum likelihood estimation，简作MLE）：
- 对于所有训练数据点，似然函数 $L(\theta)$ 是所有预测的概率的乘积： $L(\theta) = \prod_{i=1}^{N} p(y_i | x_i; \theta)$
- 这个似然函数表示了在给定所有训练数据的条件下，模型参数 $\theta$ 下观测到实际标签的“概率”或“可能性”。
- 模型的目標就是找到 $\theta$ 讓模型預測到正確類別 $y_i$ 的概率 $p(y_i)$ 最大化。
原來的乘積轉為對數：
- 为了简化计算和优化，通常使用对数似然（log-likelihood），这将乘积转化为和（因為我們並不關心 $L$ 的絕對值而只希望越大越好）：
  - 对数似然的作用是将概率分布的乘法转化为对数的加法，便于优化（尤其是在梯度下降中）。
负对数似然（NLL）：
- 由于我们通常使用最大化对数似然来训练模型，但在优化过程中更常见的是最小化损失函数。因此，我们使用负对数似然（negative log-likelihood）作为损失函数： $\text{NLL}(\theta) = -\sum_{i=1}^{N} \log p(y_i | x_i; \theta)$
其實交叉熵softmax或者MSE(L2) Loss都可以看作是一種NLL Loss：
- 当我们假设数据点 $y_i$ 是从一个高斯（正态）分布中采样的，且预测值 $\hat{y}_i$ 也是一个高斯分布的均值时，NLL 会转化为 MSE。
  - 假设我们有一组样本 $(x_i, y_i)$，我们的目标是最小化 NLL，使得模型的预测概率最大化。如果我们假设数据点 $y_i$ 服从一个高斯分布，均值为 $\hat{y}_i$，方差为 $\sigma^2$，那么似然函数为： $p(y_i | \hat{y}_i, \sigma^2) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left( -\frac{(y_i - \hat{y}_i)^2}{2 \sigma^2} \right)$
  - NLL： $\text{NLL} = - \sum_{i=1}^{N} \log p(y_i | \hat{y}_i, \sigma^2) = \sum_{i=1}^{N} \frac{(y_i - \hat{y}_i)^2}{2 \sigma^2} + \frac{N}{2} \log(2 \pi \sigma^2)$
  - 可以看到，这里有两部分：一部分是 $(y_i - \hat{y}_i)^2$，另一部分是常数项。我们通常关心的是与参数优化相关的部分，因此常数项可以忽略不计，剩下的就是一个类似于均方误差的项： $\text{NLL} \propto \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$
- 在二分类问题中，我们假设每个样本的标签 $y_i$ 是从 Bernoulli 分布中采样的，模型预测的是一个表示为类别 1 的概率 $p(y_i = 1 | x_i; \theta)$。假设输出概率是 $\hat{y}_i$，那么负对数似然（NLL）为：
  - 这实际上就是 $\textbf{二分类交叉熵（Binary Cross-Entropy）}$ 损失函数。因此，二分类交叉熵与 NLL 是等价的，它表示了模型预测概率与真实标签之间的差异。

有标签数据（监督学习）：似然函数通常是

$L(\theta \mid X, Y) = P(Y \mid X, \theta)$

我們希望最大化给定输入 $X$ 时，输出 $Y$ 的条件概率。

无标签数据（无监督学习）：似然函数是

$L(\theta \mid X) = P(X \mid \theta)$

我們希望最大化给定数据 $X$ 时，原分布数据出现的概率。

用貝葉斯解釋Neural Nets

1. 贝叶斯公式

根据贝叶斯定理，后验概率分布可以表示为：

$P(\theta \mid X, Y) = \frac{P(Y \mid X, \theta) P(\theta)}{P(Y \mid X)}$

其中：

$ P(\theta \mid X, Y) $ 是后验分布，表示在观测数据 $ X $ 和标签 $ Y $ 后，参数 $ \theta $ 的分布。
$ P(Y \mid X, \theta) $ 是似然函数，表示在给定参数 $ \theta $ 和输入 $ X $ 时，标签 $ Y $ 出现的概率。
$ P(\theta) $ 是先验分布，表示我们对参数 $ \theta $ 的先验信念。
$ P(Y \mid X) $ 是边际似然或证据：

$P(Y \mid X) = \int P(Y \mid X, \theta) P(\theta) d\theta$

2. 最大化后验概率MAP

即最大後驗概率估計(Maximum A Posteriori estimation, MAP)。A posteriori is from Latin ā posteriōrī, which means literally, “from what is later.”

为了找到最可能的参数值 $ \theta $，我们希望最大化后验分布 $ P(\theta \mid X, Y) $：

$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid X, Y)$

由于分母 $ P(Y \mid X) $ 与参数 $ \theta $ 无关，是一个常数，因此可以忽略。这样，最大化后验概率等价于最大化分子：

$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \left[ P(Y \mid X, \theta) P(\theta) \right]$

3. 对数变换

为了简化计算，我们对目标函数取对数。由于对数函数是单调递增的，最大化对数与最大化原函数等价：

$\log P(\theta \mid X, Y) = \log \left[ P(Y \mid X, \theta) P(\theta) \right]$

利用对数的乘法法则，这可以拆分为两部分：

$\log P(\theta \mid X, Y) = \log P(Y \mid X, \theta) + \log P(\theta)$

4. 最小化负对数损失

在优化中，我们通常通过最小化损失函数来实现参数优化。因此，我们可以将上述公式转化为最小化负对数后验概率：

$\hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ -\log P(Y \mid X, \theta) - \log P(\theta) \right]$

这就得到了 MAP 损失函数：

$\text{MAP loss} = -\log P(Y \mid X, \theta) - \log P(\theta)$

第一项 $ -\log P(Y \mid X, \theta) $ 是负对数似然（NLL），衡量模型预测的准确性。
第二项 $ -\log P(\theta) $ 是负对数先验，对参数进行正则化，鼓励模型参数满足先验分布。

5. 特殊情况：最大似然估计（MLE）

当先验 $ P(\theta) $ 是均匀分布时，先验项 $ \log P(\theta) $ 是一个常数。此时，MAP 退化为最大似然估计（MLE）：

$\hat{\theta}_{\text{MLE}} = \arg\min_{\theta} \left[ -\log P(Y \mid X, \theta) \right]$

即，最大似然估计只关注似然，不考虑先验。

总结

MAP 最大化的是后验概率 $ P(\theta \mid X, Y) $，通过结合似然和先验来优化参数。
MLE 是 MAP 的一种特殊情况，当先验均匀时，MAP 退化为 MLE，只最大化似然。

頻率學派vs.貝葉斯

這篇文章講頻率學派和貝葉斯學派的區別講得還不錯，這裡面的共軛先驗跟我在LDA數學八卦裡學的學的Beta-Binomial共軛還不太是一回事。

推导步骤	最大似然估计（MLE）	最大后验估计（MAP）
目标	最大化数据的似然函数，即找到最可能的参数值 $ \theta $。	最大化后验分布 $ P(\theta \mid X, Y) $，结合了似然和先验。
后验分布 / 似然	无先验，仅依赖似然：	依赖似然和先验：
	$ P(\theta \mid X, Y) \propto P(Y \mid X, \theta) $	$ P(\theta \mid X, Y) \propto P(Y \mid X, \theta) P(\theta) $
贝叶斯公式	$ P(\theta \mid X, Y) = \frac{P(Y \mid X, \theta) P(\theta)}{P(Y)} $	$ P(\theta \mid X, Y) = \frac{P(Y \mid X, \theta) P(\theta)}{P(Y \mid X)} $
推导目标	最大化似然函数（$ P(Y \mid X, \theta) $）	最大化后验分布（结合似然和先验）
取对数后	对似然函数取对数得到对数似然：	对后验分布取对数得到对数后验：
	$ \log P(Y \mid X, \theta) $	$ \log P(Y \mid X, \theta) + \log P(\theta) $
损失函数	负对数似然（Negative Log-Likelihood, NLL）损失函数：	MAP损失函数：
	$ \text{NLL}_{\text{MLE}} = -\log P(Y \mid X, \theta) $	$ \text{MAP loss} = -\log P(Y \mid X, \theta) - \log P(\theta) $
优化目标	最小化负对数似然（NLL）：	最小化MAP损失函数：
	$ \hat{\theta}_{\text{MLE}} = \arg\min_{\theta} \left[ -\log P(Y \mid X, \theta) \right] $	$ \hat{\theta}_{\text{MAP}} = \arg\min_{\theta} \left[ -\log P(Y \mid X, \theta) - \log P(\theta) \right] $
是否包含先验	不包含先验。	包含先验 $ P(\theta) $，通过 $ -\log P(\theta) $ 正则化参数。
在数据稀缺时的行为	在数据稀缺时，MLE 可能会过拟合，因为它仅关注最大化似然，没有考虑任何先验。	在数据稀缺时，MAP 通过引入先验能够稳定估计，避免过拟合。
特殊情况	当先验 $ P(\theta) $ 是均匀分布时，MAP 退化为 MLE。	当先验是均匀分布时，MAP 退化为最大似然估计（MLE）。

dota mmr

https://dota2.fandom.com/wiki/Matchmaking
https://dota2.fandom.com/wiki/Matchmaking_Rating

Valve has stated that matchmaking tries to fulfil several criteria:

The teams are balanced. (Each team has a 50% chance to win. It’s based on Elo rating.)
The discrepancy in skill between the most and least skilled player in the match is minimized.
The highest skill Radiant player should be close to the same skill as the highest skill Dire player.
The discrepancy between experience (measured by the number of games played) between the least experienced player and the most experienced player is minimized.
Each team contains about the same number of parties.
Five-player parties can only be matched against other five-player parties.
Players’ language preferences contains a common language.
Wait times shouldn’t be too long.

elo hell

Elo hell (also known as MMR hell) is a video gaming term used in MOBAs and other multiplayer online games with competitive modes. It refers to portions of the matchmaking ranking spectrum where individual matches are of poor quality, and are often determined by factors such as poor team coordination which are perceived to be outside the individual player’s control. This ostensibly makes it difficult for skilled players to “climb” up the matchmaking ranking (and out of Elo hell), due to the difficulty of consistently winning games under these conditions. Its existence in various games has been debated, and some game developers have called it an illusion caused by cognitive bias.

Elo Rating System和Probability Theory Basics

Elo Rating System

Mathematical details

Logistic/Sigmoid

Logistic regression

Log-linear

採用logistic分布和gaussian/normal分布的區別？

Probability theory basics機率基礎

Axioms /ˈaksɪəm/

性質(定理)

conditional probability條件機率(貝氏定理)

Independence(獨立性)

Graph Scheme

counting method

Permutation(order matters)

Combination(order doesn’t matter)

Multinomial

Random Variable隨機變數

CDF (Cumulative Distribution Function)

PMF (Probability Mass Function)

Discrete Probability Distributions

Bernoulli Distribution

Binomial Distribution

Uniform Distribution

Geometric Distribution

Pascal Distribution

Poisson Distribution

Probability Density Function

Uniform Distribution

Exponential Distribution

Gamma / Erlang Distribution

Normal Distribution

PDF/PMF vs. CDF

Central Limit Theorem(中央極限定理)

應用

失憶性

似然likelihood

先驗？後驗？

用似然解釋Neural Nets

用貝葉斯解釋Neural Nets

1. 贝叶斯公式

2. 最大化后验概率MAP

3. 对数变换

4. 最小化负对数损失

5. 特殊情况：最大似然估计（MLE）

总结

頻率學派vs.貝葉斯

dota mmr

elo hell

评论

目录

Your browser is out-of-date!