%
% in CH1 give a real-life example at the beginning as a BACKGROUND
% describe basic concepts
% and go into more details? obvious
%
%
%
% in CH2 active learning or web optimization
%
%
%
\documentclass[12pt, a4paper, pdflatex, leqno, twoside]{report}
% notitlepage - abstract on the same page
\usepackage{indentfirst} % indent first paragraph of section
% \usepackage{fullpage} % full A4 page
% \usepackage[left=2.5cm, right=2.5cm, bottom=2.5cm, top=2.5cm]{geometry}
\usepackage[a4paper,inner=3cm,outer=2cm,top=2.5cm,bottom=2.5cm,pdftex]{geometry}
\usepackage{amsmath}
\usepackage{amsfonts} % fancy maths font
\usepackage{mathrsfs} % fancy maths font
\usepackage{dsfont} % indicator function
\usepackage{mathtools}
\usepackage[pdftex]{graphicx}
\usepackage{cite} % BibTeX
\usepackage{lipsum}
\newcommand{\ts}{\textsuperscript}
\usepackage[usenames,dvipsnames]{color}
% \geometry{bindingoffset=2cm}
% \setlength{\oddsidemargin}{5mm}
% \setlength{\evensidemargin}{5mm}
% for multi figures
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage[]{algorithm2e}
% \usepackage{polski}
% \usepackage[polish,english]{babel}
% \usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc} % Polish
\usepackage{hyperref}
% Harvard citation
\usepackage[square]{natbib}
% argmax with commands
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
% equality from definition | =^{\text{def}}
\newcommand{\myeq}{\stackrel{\mathclap{\normalfont\scriptsize\mbox{def}}}{=}}
% Code snippets
\usepackage{listings}
% \usepackage{color}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{frame=tb,
language=R,
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=none,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
breaklines=true,
breakatwhitespace=true,
tabsize=3
}
% END Code snippets
% $\backsim\ \sim\ \thicksim$
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}}
\newenvironment{dedication}
{\clearpage % we want a new page
\thispagestyle{empty}% no header and footer
\vspace*{\stretch{1}}% some space at the top
\itshape % the text is in italics
% \raggedleft % flush to the right margin
\raggedright % flush to the left margin
\par\setlength{\leftskip}{0.3\textwidth}\noindent\ignorespaces
}
{\par % end the paragraph
\vspace{\stretch{3}} % space at bottom is three times that at the top
\clearpage % finish off the page
}
\begin{document}
\begin{titlepage}
\begin{center}
% Upper part of the page. The '~' is needed because \\
% only works if a paragraph has started.
\includegraphics[width=0.5\textwidth]{graphics/UOB-logo.png}~\\[2.5cm] % was 1cm
% \textsc{\LARGE University of Bristol}\\[1.5cm]
%\textsc{\Large Final year project}\\[0.5cm]
% \colorbox{magenta}{problem}
% Title
\HRule \\[0.4cm]
{ \huge \bfseries %\\[0.5cm]
Comprehensive introduction to \emph{\textbf{Multi-armed bandits}}:\\[.5cm]
\emph{Thompson Sampling}\\
\&\\
Active Learning in the bandits scenario\\[0.4cm] }
\HRule \\[1.5cm]
% Author and supervisor
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
Kacper B. \textsc{\textbf{Sokol}}
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Supervisor:} \\
Dr.~David \textsc{\textbf{Leslie}}
\end{flushright}
\end{minipage}
\let\thefootnote\relax\footnote{Level H/6 $|$ MATH 32200---20cp Project}
\vfill
% Bottom of the page
{\large \today}
\end{center}
\end{titlepage}
\newpage
\thispagestyle{empty}
\mbox{}
\newpage
\thispagestyle{empty}
\mbox{}
%Acknowledgment
\begin{center}Acknowledgement of Sources\\[2cm]\end{center}
For all ideas taken from other sources (books, articles, Internet), the source
of the ideas is mentioned in the main text and fully referenced at the end of
the report.\\[0.5cm]
All material which is quoted essentially word-for-word from other sources is
given in quotation marks and referenced.\\[.5cm]
Pictures and diagrams copied from the Internet or other sources are labelled
with a reference to the web page, book, article etc.\\[2cm]
Signed:\\[1cm]
Dated:~~~~~~~~~~~~\today
% \thispagestyle{empty}% no header and footer
% \thispagestyle{empty}
% \cleardoublepage
% \pagestyle{plain}
% \vfill
\newpage
\thispagestyle{empty}
\mbox{}
% \title{\emph{Multi-armed bandits} problem.\\
% Practical introduction to the problem for everyone.\\
% Real life application.}
% \author{Kacper Sokol\\University of Bristol, UK}
% \date{\today}
% \maketitle
% \begin{flushright}
% Supervised by:\\
% \textbf{David Leslie}
% \end{flushright}
% \begin{center}
% \line(1,0){250}
% \end{center}
\begin{abstract}
\thispagestyle{empty}% no header and footer
This project covers two main topics: multi-armed bandits theory, with an extensive
treatment of the Thompson Sampling approach, and the application of multi-armed bandits in active
learning. A comprehensive introduction to the theory underlying multi-armed
bandits is gradually developed to cover the concepts necessary for understanding
the basic bandit strategies. The latter part of the paper presents the first of its
kind (to the best of our knowledge) application of a Thompson Sampling-inspired multi-armed bandit algorithm
to the computer science task of learning in an environment of insufficient
information.\\
The reader does not require any prior knowledge of this field, for only the basics of
statistics and probability theory are necessary to smoothly follow the text.\\
\begin{center}
Keywords: \textbf{multi-armed bandits, active learning, semi-supervised learning,
exploration, exploitation, Thompson Sampling}
\let\thefootnote\relax\footnote{\noindent This paper together with all
figures and experiment source code is available as a \texttt{GitHub} repository
at: \url{https://github.com/So-Cool/MAB}.}
\end{center}
\end{abstract}
\newpage
\thispagestyle{empty}
\mbox{}
\begin{dedication}
I would like to thank my parents, who provide me with every kind of support, and
whose guidance and advice help me make the right choices throughout
my life and realize my dreams.\newline
It would also have been a painful journey without my supervisor Dr.~David~Leslie, who
always served me with advice on how to ``read'' and ``write'' all the maths and
avoid the unnecessary and overwhelming clutter in the books.\newline
Finally, big thanks to Iza and Kuba, who often take care of my leisure time, even
though it is always lacking.\\[2cm]
% \foreignlanguage{polish}{}
\begin{flushright}
Dzi\k{e}kuj\k{e}, mamo.\\
Dzi\k{e}kuj\k{e}, Tomek.
\end{flushright}
% I was lost now I'm found
\textcolor{white}{found me!}
\end{dedication}
\newpage
\thispagestyle{empty}
\mbox{}
\newpage
{
\thispagestyle{empty}
\cleardoublepage
\pagestyle{plain}
\tableofcontents
\thispagestyle{empty}
% \cleardoublepage
% \pagestyle{plain}
% \newpage
}
\newpage
\thispagestyle{empty}
\mbox{}
\chapter{Introduction\label{chap:intro}}
\setcounter{page}{1}
The \emph{multi-armed bandits} (MAB) theory is a set of problems that has been rapidly developing as a field of
statistics and probability since the early 20\ts{th} century. With a vastly
growing number of tasks that can be framed as a bandit scenario, the field has
become a topic of research for many scientists and economists, not to mention
the companies looking for efficiency improvements and savings. At their core,
these problems amount to finding a balance between
\emph{exploration} and \emph{exploitation}.\\
Multi-armed bandits are a class of problems originating from a sequential
allocation dilemma. They were defined during the Second World War, quickly
gained a reputation for being too difficult to solve, and were consequently abandoned for
decades. The first general solution was constructed by John Gittins (section~\ref{sec:gitind})
in the late
1960s; nevertheless, his work was overlooked for almost 20 years until it was revived in the early 1980s~\citep{gittins+glazebrook+weber}.\\
\noindent The main reference for this chapter is~\citep{berry+firstedt}.\\
\section{Background}
It is often believed that statistics and probability are static sciences---they define a set of tools to analyse various aspects of a process or of data which have already been collected.
Less often are we interested in continuously developing events that we want to discover or that require interaction. Simple statistics or probability might not be advanced enough to
handle such cases as well as the bandits theory does.\\
To begin with, we shall discuss the \emph{fruit machine}, as it is the first thing that
comes to the reader's mind after hearing about multi-armed bandits. Imagine a row of
slot machines in front of you. Pulling the arm of each of these automata will result in a different
outcome determined by some unknown probability distribution. For simplicity
we will regard the result as one of various reel combinations and we will assume that each automaton results in a binary
outcome: \emph{win} with probability $p$ and \emph{lose} with probability $1-p$. Without loss of generality, a
row of such machines can be transformed into a single automaton with multiple arms or buttons, each corresponding
to a single machine in the aforementioned row.\\
A natural example that follows binary bandits is a row of coins, some
of which may be unfair. In the presented scenario each coin corresponds to an
\emph{arm} of a bandit, and tossing one of them several times can be
considered the realization of a Bernoulli process with unknown parameters.\\
If a player is rewarded when the outcome of a trial is \textbf{H}eads, then the
goal is to find the coin biased with the maximum probability of \textbf{H}
and play it forever.\\
If gamblers do not want to lose all their money really quickly, it would probably
be a good idea to employ some kind of strategy that maximises the chances of
winning. It is assumed that the gamblers are visiting a given casino for the first time,
so no prior information regarding the expected return from each arm is
available. Initially, a random arm is chosen, for all of them might look alike. On the
contrary, during the second turn selecting the \emph{optimal} arm to be played
becomes a serious dilemma that one might not yet have realized. The gamblers
face a choice between the arm that has already been pulled, with a known sample expected return, and any other arm which for now remains a mystery, since there is no
information about its potential reward.\\
If the gamblers decide to take advantage of the already known arm and pull it again, we
call this \emph{exploitation}---that is, taking advantage of already
explored possibilities. On the other hand, taking a risk and choosing one of the
unknown arms will result in gathering more information about the system,
which is usually called an \emph{exploration} step.\\
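To make the coin example concrete, below is a minimal \texttt{R} sketch of such a binary (Bernoulli) bandit; the success probabilities in \texttt{p.true} are purely illustrative and, of course, hidden from the player.
\begin{lstlisting}
# A row of biased coins: arm i pays 1 ("win") with probability p.true[i]
# and 0 ("lose") with probability 1 - p.true[i].
p.true <- c(0.45, 0.60, 0.55)   # illustrative values, unknown to the agent

pull.arm <- function(i) rbinom(1, size = 1, prob = p.true[i])

# Ten purely random pulls -- no strategy yet, just the environment.
set.seed(1)
arms    <- sample(seq_along(p.true), 10, replace = TRUE)
rewards <- sapply(arms, pull.arm)
cbind(arm = arms, reward = rewards)
\end{lstlisting}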
\section{Applications} % emphasized underlying
Multi-armed bandits are not just a theory that one reads from a book and tries to
memorise, for they extend to many real-life applications. This section is devoted
to a few simple case studies in which it seems natural to use the ``bandits approach''.
Applications are versatile, ranging from drug testing and maximising income from
web advertisement, through semi-supervised machine learning in modern computer
science, to time and budget management of research projects.\\
To begin with, we will imagine a hospital testing two new medicines for a certain
disease. Patients are queuing up to receive a treatment. Assuming that the doctor
cannot refuse to treat anyone, each person (each
\emph{play}) suffering from the disease can be given one of two possible drugs (two \emph{arms}). The key
assumption here is that the effect of a chosen action occurs immediately. In other
words, a treated person either remains ill or is cured (immediate \emph{payoff}).
The goal of the doctor is always to maximise the number of healed people. This model
defines the two-armed bandit.\\
The second mentioned approach is nowadays widely adopted by companies such
as Google~\citep{AYPSze12, ASMB:ASMB874},
LinkedIn~\citep{Tang:2013:AAF:2505515.2514700},
Microsoft~\citep{graepel2010web} and Yahoo~\citep{Li:2010:CAP:1772690.1772758} to facilitate
their services. Research groups at these companies use bandit algorithms
to choose the best website layout and advertisement locations to increase
the click-through rate (CTR)\footnote{A measure of success of an on-line advertising
campaign.}, improve recommendation systems, or enhance the performance of semi-supervised and active learning algorithms.\\
The bandits theory also plays a key role in experiment allocation with
restricted resources like time and budget~\citep{gittins+glazebrook+weber}.
Considering the bandit's \emph{arms} as research projects waiting to be conducted
with a limited amount of scientists, time, or funding, the goal is to maximise the number of
accomplished tasks (\emph{payoff}) while distributing the resources among them.\\
With the scenarios presented above, though they barely scratch the tip of the iceberg, we
now focus on the foundations of multi-armed bandits theory, for one needs to
precisely describe the processes happening ``behind the scenes'' to completely understand the optimal strategies introduced in chapter~\ref{ch:exploration}.\\
We begin by introducing the necessary notation and nomenclature.
Then, we move on to a description of the basic processes.
\section{Terminology}
Statistical decision theory defines the ``simple multi-armed bandit'' as a
sequential selection from $N \geq 2$ stochastic processes---generally called
\emph{arms}---where both time and the processes may be discrete or continuous. The
goal is typically to recover the unknown parameters that characterise the stochastic
processes---the \emph{arms}---in order to maximize the expected \emph{payoff}.\\
Knowing the most common terms present in the MAB literature will certainly help the reader in understanding
the content of this project; the basic concepts are:
\begin{description}
\item[Multi-armed bandit ($N$)]--- a ``device'' with $N \geq 2$ possible choices
of action (\emph{arms}).
\item[Agent]--- a person who decides which \emph{arm} to pull based on a chosen
\emph{strategy} $\tau$.
\item[Strategy ($\tau$)]--- tells the \emph{agent} which \emph{arm} to pull at a
given stage of the \emph{game}. A strategy is \emph{optimal} if it yields
maximal expected \emph{payoff}.
\item[Arm]--- one of $N$ actions that may be taken by the \emph{agent}. An
\emph{arm} is \emph{optimal} if it is the best selection when following an
\emph{optimal} strategy.
\item[Play]--- an \emph{arm} pulled at a stage $m$ of the \emph{game} (i.e.\ one
turn).
\item[Game]--- a sequence of \emph{arm} pulls, based on a chosen \emph{strategy}
$\tau$.
\item[Payoff]--- a return of a game such as \emph{win--lose} or the \emph{amount}
of money gained.
\item[Discount series]--- factors that define how valuable each of the
\emph{payoffs} is. For example, only the first $m$ outcomes may count and all the rest
are neglected; or the longer the \emph{game} is played, the less a particular
outcome counts with regard to the overall expected \emph{payoff}.
\end{description}
\section{Player's dilemma}
The goal of an agent is to maximise the overall reward from a game by following an optimal strategy. The player needs to memorise previous outcomes---feedback acquired after pulling an arm---to make the action selection process efficient. Optimally, the agent should choose a strategy that requires the least memory and computations.\\
\section{General assumptions}
With the basic terminology in mind, we present the limitations of MAB theory and the restrictions that need to hold in order to apply bandit strategies.
The multi-armed bandit setting can be used to solve a variety of problems; therefore,
some non-trivial environments are also discussed.\\
Two fundamental assumptions regarding the benefits of selecting an arm, holding in the vast majority of bandit problems, are:
\begin{itemize}
\item the immediate \emph{payoff}, i.e.\ the \emph{agent} knows the result of a taken
action immediately, and
\item the information exploitation, i.e.\ the information gathered after a \emph{play} can be used to modify the chosen \emph{strategy}.\\
\end{itemize}
We generally consider discrete bandits with processes described by random variables. Nevertheless, real-time random bandits usually arise when a decision needs to be made
for an event occurring at non-deterministic intervals of time and information about the event can only be acquired during its occurrence.\\
Moreover, in some cases we may restrict the memory of an \emph{agent} to the last
$s$ outcomes. In this way, a selected \emph{strategy} $\tau$ can rely on up to $s$
previous \emph{plays}; these approaches are called the \emph{finite
memory bandits}.\\
Furthermore, we assume that \emph{arms} are independent. This premise has recently been revisited in~\citep{Pandey:2007:MBP:1273496.1273587}.\\
MABs with dependent arms are growing in popularity owing to Internet advertising
applications. In such a scenario, the ads displayed to the user are usually
dependent on each other (one manufacturer---different products; or one class of products made by
different companies) and very often they can easily be grouped. Policies for
such scenarios are designed to consider these connections between actions and to
exploit the underlying information.\\
Sometimes, the agent's goal is to gather information about the available choices. If such learning needs to be done in an
efficient manner, MAB is often a good choice. The example presented in the \emph{non-monotone} section (\S~\ref{sec:nonmonotone}) also fits here; at the beginning, a number of rounds is used to learn
some information about the environment in order to make the best possible choice at a given time.
\\
\section{Discount Sequence}
To specify the rules governing the ``significance'' of the outcome revealed after a single
play at stage $m$, the \emph{discount sequence} is introduced. It is a vector
$\mathbf{A}$ of specified length, which may also be infinite.
$$
\mathbf{A} = \left( \alpha_1, \alpha_2, \alpha_3, ... \right) \text{ .}
$$
When an \emph{arm} is selected the discount sequence is modified by a unit left shift:
$$
\left( \alpha_1, \alpha_2, \alpha_3, ... \right)
\rightarrow
\left( \alpha_2, \alpha_3, \alpha_4, ... \right) \text{ .}
$$
There are many different discount sequences used with multi-armed bandits, each
with its own features and assumptions. In the literature only two of them are described in
great detail; both are presented below.
\subsubsection{Uniform sequence}
This discount sequence is most commonly used when a player wants to maximize
the payoff in the first $h$ rounds.
The $h$-horizon uniform discount sequence is defined as:
$$
\alpha_i =
\begin{cases}
1 & \text{for } i \leq h \text{ ,}\\
0 & \text{for } i > h \text{ ,}
\end{cases}
$$
leading to:
$$
\mathbf{A} = ( \underbrace{ 1, 1, 1, ..., 1}_{h\text{ elements}}, 0, 0, 0,
... ) \text{ .}
$$
After an \emph{arm} selection step, the horizon of the discount sequence is decreased by $1$.
\subsubsection{Geometric sequence}
The geometric discount sequence is expressed with components $\alpha_i = a^{i-
1}$ for some $a \in ( 0, 1 )$ resulting in:
$$
\mathbf{A} = \left( a^0, a^1, a^2, ... \right) \text{ ,}
$$
where $\alpha_1 = a^0$ is always equal to $1$. The characteristic feature of
this series is that after an \emph{arm} selection step, the discount sequence remains
proportional to the original sequence. It is often used when the agent's interest in outcomes decreases with time.\\
It can also be shown that when the discounting is geometric, a bandit problem
involving $k$ independent arms can be solved by transforming it into $k$ different two-armed
bandits, each involving one known and one unknown arm~\citep{gittins+glazebrook+weber}.\\[1.5cm]
In the case of both uniform and geometric sequences, the decision problem and the optimal strategy are
unchanged if the discount series is multiplied by a positive constant. Furthermore, the
geometric sequence remains effectively the same throughout the game.
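As a brief illustration, the \texttt{R} sketch below (with an arbitrary horizon, discount factor and reward stream) builds the $h$-horizon uniform and the geometric sequences, applies the unit left shift performed after every arm selection, and computes the resulting total discounted payoff.
\begin{lstlisting}
h <- 5; a <- 0.9; len <- 8      # horizon, discount factor, game length (arbitrary)

uniform.seq   <- c(rep(1, h), rep(0, len - h))  # (1,...,1,0,...,0)
geometric.seq <- a^(0:(len - 1))                # (1, a, a^2, ...)

# Unit left shift applied to the discount sequence after an arm is selected.
left.shift <- function(A) c(A[-1], 0)

# Total discounted payoff of a reward stream under a discount sequence.
discounted.payoff <- function(rewards, A) sum(rewards * A[seq_along(rewards)])

rewards <- c(1, 0, 1, 1, 0, 1, 1, 1)
discounted.payoff(rewards, uniform.seq)    # only the first h rewards count
discounted.payoff(rewards, geometric.seq)  # later rewards count less and less
\end{lstlisting}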
\subsubsection{Mixture of uniform sequences}
If we consider Bernoulli trials (with success worth $1$) of a clinical test, the
experiment can terminate with positive probability $\nu_m$ at any stage $m$, as
we do not know the exact number of incoming patients. In such cases, we may form
a discount sequence whose factor at a given stage is the probability that the
experiment has not been terminated before that stage, $\alpha_m = \sum_{i=m}^\infty
\nu_i$. Clearly, the discount series that arises in the presented scenario is a mixture
of uniforms. This case is equivalent to a deterministic sequence, namely the average of
uniforms weighted by the $\nu_m$'s.\\
To clarify, let $\gamma$ be the probability that the experiment continues at each stage; the probability of
terminating exactly at stage $m$ is then $\nu_m = (1-\gamma)\gamma^{m-1}$,
$m=1,2,3,\ldots$, which gives $\alpha_m = \gamma^{m-1}$, i.e.\ a geometric discount sequence.
\subsubsection{Random discounting}
Random discounting is one of the most complex schemes.
In the mixture of uniforms we gain no advantage from conditioning on the past, for two reasons:
\begin{itemize}
\item as long as we get $1$ the process continues, and there is no possibility that previous factors were $0$;
\item once we get $0$ the process is of no further interest.
\end{itemize}
Generally speaking, in a random discount sequence we
are only provided with some prior probability distribution over the space of all
possible discount sequences.\\
\subsubsection{Observable and non-observable sequences}
Sometimes the discount sequence cannot be observed, for it is either blended with the
reward into a single value, or it is simply not given to us together with the reward. In
such cases we need to estimate it. As in the random scenario, given the
distribution over all possible discount sequences, we can estimate it by:
$$
\hat{\alpha}_m = \mathbb{E}(\alpha_m | \text{``probability distribution over
all discount sequences''}) \text{ .}
$$~\\
On the other hand, we may observe the discount sequence when we obtain both the reward and the value of the discount at a given stage.
These cases may be harder to solve when the discounting is random, as we cannot replace it with a non-random sequence without significantly altering the problem.\\
%
% In the scenarios presented above, the strategies tend to depend on deterministic, observable
% discount factors.\\
\subsubsection{Real-time sequences}
In real-time sequences the intervals between events are random and therefore
directly influence the decision-making process. To visualise this scenario we may
consider a clinical trial with patients arriving at random. In
such cases we usually use a discount sequence described by $\mathbf{A} = ( \exp(-\beta
t_1), \exp(-\beta t_2), \exp(-\beta t_3),\ldots )$, where $\beta$ is a weight
coefficient and $t_i$ are the known arrival times.\\
A more complicated situation for real-time discounting can be described as:
$$
\alpha_t =
\begin{cases}
1 & \text{for } t \in [0,1) \text{ ,} \\
0 & \text{for } t \in [1,\infty) \text{ .}
\end{cases}
$$
Such a sequence expresses interest in maximising the response of patients
arriving in the first unit interval. It is worth noting that the action choice may
be significantly different if the first event occurs at time $0.01$ than if it
occurs at time $0.99$. In short, a risky arm could be appropriate in the first
case, whereas a mean-maximising action can be a good choice in the second case.\\
\subsubsection{Non-monotone sequences\label{sec:nonmonotone}}
The majority of discount sequences considered in the literature are monotone increasing
or decreasing. This fact is motivated by real-life applications where our
interest in the process either decreases, i.e.\ the player is interested in a quick reward, or
increases, i.e.\ the player is interested in a high reward in the future, obtained by first
learning about the environment.\\
The most popular of the few non-monotone sequences is defined by
$\alpha_n = 1$ and $\alpha_m = 0$ for $m \neq n$. This structure of the discount
sequence indicates that the first $n-1$ rounds are played for the sole purpose of
obtaining information to make the best possible choice at stage $n$, and all the
other rounds thus do not matter.\\
\chapter{Optimal Policy \texttt{\textbf{Exploration}}\label{ch:exploration}}
In this chapter we present a number of approaches to finding an optimal
solution to the MAB problem. First of all, we briefly describe methodologies that have been well known for several decades; then we give an extensive description of one of the most recent approaches, called \emph{Thompson Sampling}. Finally, we form conclusions and highlight the ongoing research.\\
Once we can formally define the MAB problem, we are in a position to start seeking an optimal strategy. There are numerous different approaches available, each with different pros and cons.\\
\section{Seeking an optimal solution}
The strategies presented below are used to balance exploration and exploitation in order to
find an optimal solution, i.e.\ to determine an equilibrium between acquiring information which can benefit future choices
and the immediate payoff.\\
Let us now recall the hospital example: we have to decide whether it is worth sacrificing
the wellbeing of the early-arriving patients to learn more about a particular condition by means of experimenting.
Such a choice would lead to treatment improvement over time and
yield better results on future patients. This phenomenon could be referred to as sacrificing early payoff
to gain more information about the system and maximise the future return.\\
We could also soften such a dramatic scenario by using a geometric sequence---in this way, the health of current patients
still carries weight relative to the future patients' health instead of being sacrificed outright.\\
According to the MAB literature, a strategy is considered optimal if it chooses an optimal arm infinitely many times, i.e.\ if it converges to the optimal selection in
the limit of time. We usually consider the \emph{preemptive} case, where arbitrary switching between actions
is allowed and takes negligible time.\\
\subsection{Examples}
\subsubsection{Index approach(Gittins Index)\label{sec:gitind}}
The pioneering index approach introduced by John C.\ Gittins is one of the oldest optimal solutions. It not only moved the multi-armed bandits concept significantly
forward but also accelerated the growth of
the whole wide class of sequential allocation problems.\\
The motivation behind this approach is to assign a \emph{priority index}
to each action, where the index of a particular arm should depend only on the history
and outcomes of this action and no other. The decision process then amounts to choosing the action with the highest current index.\\
The theory of index allocation is based on calibrating actions at a given
state against some standardised actions with simple properties. The main
advantage of such an approach is restricting the \emph{state function} of an action so that it depends only
on its own history, consequently reducing its complexity~\citep{gittins+glazebrook+weber}.\\
Indices are described as real-valued functions (dynamic allocation indices) on the
set of all available alternatives; the selection process then maximises this
function.\\
With such indices we can specify an optimal policy for a particular problem (with
any set of possible alternatives of a given type) with regard to the so-called
\emph{standard bandit problem}.\\
The index theorem says that a policy for a bandit process is optimal if it is an
index policy with respect to $\nu(B_1, \cdot), \nu(B_2, \cdot), \ldots, \nu(B_n,
\cdot)$, where $\nu$ is the index function and $B_i$ is a given bandit process.\\
The significant drawback of the index strategy is its requirement for a vast amount of
resources, such as computational power and memory storage. Moreover, the
following assumptions need to hold:
\begin{itemize}
\item rewards are accumulated up to an infinite time horizon,
\item there is constant and strict exponential discounting,
\item unless an arm is pulled, no rewards are collected and the state of the arm remains unchanged,
\item there is only one processor (server).\\
\end{itemize}
In the simplest case we consider a multi-armed bandit as a number of semi-Markov
decision processes. The \emph{index theorem} asserts the existence of an optimal \emph{index policy},
which in turn confirms the existence of a real-valued index, say $\nu ( B_i , \xi_i(t) )$, where $B_i$ is the i\ts{th} bandit
process and $\xi_i$ is the state of the i\ts{th} process (dependent on its history); in other words, each index depends
only on its own current state and on no other process's state.\\
\subsubsection{Upper Confidence Bound (UCB) and Lower Confidence Bound (LCB)}
Upper Confidence Bound and Lower Confidence Bound policies are based on a UCB and an LCB for the mean reward of each arm, where
the selection process chooses the largest and the smallest bound, respectively. In the simplest scenario, in order to determine the UCB and LCB we first compute the sample mean and the sample variance. Then we choose our confidence level (e.g.\ 90\%, 95\%) and calculate the corresponding $z$ value. We determine our bounds as follows: $UCB = \mu + \sigma z_{(\cdot)} $ and $LCB = \mu - \sigma z_{(\cdot)} $.\\
Sometimes, instead of calculating sample values, Bayesian inference is used to obtain the $\mu$ and $\sigma$ estimates.\\
Under certain assumptions the policy is proved to select suboptimal
arms only rarely (their play counts grow at most logarithmically), which means that the optimal arm will be played exponentially more
often than any other arm within the time limit. Certain algorithms of this family require
just the knowledge of the sample mean for each arm, hence being simple and
computationally inexpensive~\citep{Scott:2010:MBL:1944422.1944432}.\\
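The following \texttt{R} sketch illustrates the simple variant described above: each arm's index is its sample mean plus $z$ times its sample standard deviation, and the arm with the largest upper bound is played. The reward distributions, the confidence level and the initial plays are assumptions of this sketch; a standard-error-based bound $\sigma/\sqrt{n}$ would work analogously.
\begin{lstlisting}
set.seed(2)
mu.true <- c(0.0, 0.5, 0.3)                    # assumed true mean rewards
pull    <- function(i) rnorm(1, mu.true[i], 1) # noisy reward of arm i

n.arms  <- length(mu.true)
history <- lapply(1:n.arms, function(i) c(pull(i), pull(i)))  # two initial pulls each

z <- qnorm(0.95)                               # z value for a 90% two-sided interval
for (t in 1:200) {
  ucb  <- sapply(history, function(x) mean(x) + z * sd(x))
  best <- which.max(ucb)                       # play the largest upper bound
  history[[best]] <- c(history[[best]], pull(best))
}
sapply(history, length)                        # the best arm accumulates most plays
\end{lstlisting}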
\subsubsection{``Stay on a winner''}
``Stay on a winner''---a well-known strategy very often used by gamblers---is a myopic policy
where arm $a$ is selected in round $t+1$ if it was considered successful at
time $t$. On the contrary, if the selected arm failed, we either select another one at random or
according to some deterministic policy.\\
Even though staying on a successful arm is a good strategy, an undesirable ``failure switch'' might occur~\citep{berry+firstedt}. The policy performs near-optimally when the best arm is characterized by a high success rate. Otherwise the strategy tends toward equal allocation, which leads to over-exploration---the share of plays given to the optimal arm does not tend to $1$~\citep{Scott:2010:MBL:1944422.1944432}.\\
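A minimal \texttt{R} sketch of this win--stay, lose--shift behaviour for binary rewards follows; the success probabilities are assumed for illustration and the switch on failure is made uniformly at random.
\begin{lstlisting}
set.seed(3)
p.true  <- c(0.7, 0.4, 0.4)             # assumed Bernoulli success rates
n.arms  <- length(p.true)
current <- sample(n.arms, 1)            # start on a random arm
plays   <- integer(0)

for (t in 1:200) {
  reward <- rbinom(1, 1, p.true[current])
  plays  <- c(plays, current)
  if (reward == 0) {                    # "failure switch": move to another arm
    others  <- setdiff(1:n.arms, current)
    current <- others[sample(length(others), 1)]
  }                                     # on success, stay on the winner
}
table(plays)                            # allocation of plays across the arms
\end{lstlisting}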
% \subsubsection{Minimax approach}
% ~\citep{Scott:2010:MBL:1944422.1944432}.\\
\subsubsection{Thompson Sampling approach}
Thompson Sampling was first proposed in 1933 by William R.\ Thompson~\citep{thompson:biom33} and mainly concerned Bernoulli processes. Recently, the approach has been revived and has gained popularity thanks to its simplicity and low
computational cost.\\
Its basic idea is to sample from the posterior distribution of
each action and select the arm with the highest sample. The more uncertain we are about a
particular action, the higher the variance of our estimate, yielding more dispersed
samples. This natural mechanism guarantees infinite exploration of all actions, as sampled values may lie away from the estimated mean.\\
The method is extensively described in section~\ref{sec:thompsonsampling}; we
motivate our choice by the relatively few undergraduate-level texts available.
Furthermore, the revised version of Thompson Sampling is a cutting-edge approach with a lot of ongoing
research.\\
\subsection{Policy types}
\subsubsection{Finite memory strategy}
This is a class of strategies where the decision process has finite memory; thus, the policy can only rely on a specified number of recent outcomes. This restriction can be set to the last $w$ rounds;
as a consequence some policies can suffer a large loss while others will work well.\\
\subsubsection{Myopic strategies}
The myopic strategies are widely considered to be suboptimal, for they try to maximise the expected
reward during the next infinitesimal interval of time. They are called myopic
because they neglect eventualities or sacrifice a long-range
vision to maximise the current payoff. Sometimes uncertainty (here understood as the exploratory value) is taken
into account, although the lookahead remains limited, as it only focuses on the current choice.\\
For some classes of problems a short-term optimum can
simultaneously be a long-term one, so for these problems a myopic solution
seems best, as it does not involve complex computation or expensive
lookahead. This approach can be applied to many strategies but it is most frequently
used with index methods.\\
% is thompson samplkng one of myoptic methods? intead of taking longe ranfe
% view it ties to minimize current vaiance of hypothesis to decide on best one.\\
% local not global solution with respect to time intervals.
\subsubsection{Undirected policy}
In this class of strategies we make our choice by considering only the exploitative
value (we focus on the local payoff), e.g.\ $\epsilon$-greedy, $\epsilon$-decreasing, or Boltzmann action
selection. We make a choice based on the currently highest reward estimate and
do not include an explicit exploratory component; a sketch of $\epsilon$-greedy selection is given below.\\
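A minimal \texttt{R} sketch of $\epsilon$-greedy selection (true means, $\epsilon$ and horizon are assumptions of the example): with probability $\epsilon$ a random arm is explored, otherwise the arm with the currently highest estimate is exploited.
\begin{lstlisting}
set.seed(4)
mu.true <- c(0.2, 0.8, 0.5)        # assumed true mean rewards
epsilon <- 0.1
n.arms  <- length(mu.true)
counts  <- rep(0, n.arms)
means   <- rep(0, n.arms)          # running reward estimates

for (t in 1:500) {
  if (runif(1) < epsilon) {
    a <- sample(n.arms, 1)         # explore: uniformly random arm
  } else {
    a <- which.max(means)          # exploit: highest current estimate
  }
  r <- rnorm(1, mu.true[a], 1)
  counts[a] <- counts[a] + 1
  means[a]  <- means[a] + (r - means[a]) / counts[a]  # incremental mean update
}
rbind(counts = counts, estimate = round(means, 2))
\end{lstlisting}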
% \subsubsection{Belief-lookahead}
\subsubsection{Bayesian approach}
This set of policies uses Bayesian inference---combining the prior belief with the
likelihood---to produce a current approximation of the process.
Bayes' theorem allows a ``straightforward'' application to adaptive learning, so it can be a great tool in sequential decision-making processes
like MAB.\\
A good example of the Bayesian approach is Thompson Sampling.\\
% It can be fully Bayesian approach where---here the chosen action is supposed
% to maximise the expected cumulative reward for the rest of the process. Drawback
% of such approach is hard to guarantee infinite exploration.\\
\section{Thompson Sampling\label{sec:thompsonsampling}}
In order to fully understand the
\emph{Thompson Sampling} approach to the multi-armed bandits
theory, two concepts need to be introduced first: Bayesian statistics and
sampling theory.
It is possible to use any probability distribution with Thompson Sampling, but
for the sake of simplicity we will consider only the \emph{normal} distribution in this
paper. The aforementioned restriction does not mean that it is impossible to apply this
technique with any other distribution, but due to space constraints the theory
will be presented with a \emph{normal posterior}. Apart from that, discussing other
scenarios of Thompson Sampling would require the reader to be familiar with
the analytic approach to Bayesian \emph{posteriors}, \emph{priors} and
\emph{likelihoods}.
\subsection{Bayesian statistics\label{sec:bayesian}}
In this section we briefly introduce the normal distribution so that we can
discuss its properties with regard to Bayesian statistics. Once we have presented the concept, we show its application in Thompson Sampling.\\
\subsubsection{The normal distribution}
The most common distribution in statistics, with the well-known bell-shaped curve (see
figure~\ref{fig:normaldist}), is the normal distribution, also called the
\emph{Gaussian} distribution. If a random variable $\mathrm{X}$ follows such a
distribution, parametrized by a \emph{mean} $\mu$ and a \emph{standard deviation}
$\sigma$, we commonly write $\mathrm{X} \sim \mathcal{N}\left( \mu, \sigma^2
\right)$. The probability density function of $\mathrm{X}$ is then:
$$
f \left(x | \mu, \sigma \right) = \frac{1}{\sigma \sqrt{2 \pi }} e^{- \frac{
{\left ( x - \mu \right )}^2 }{2 \sigma^2} } \text{ ,}
$$
where the first part ${\left( \sigma \sqrt{2 \pi } \right)}^{-1}$ is a
normalizing factor and the latter part is a distribution ``kernel''.\\
The area under the curve integrates to $1$ on the $(-\infty, +\infty)$ range~\citep{rice1995mathematical}:
$$
\int_{-\infty}^{+\infty} \! f \left(x | \mu, \sigma \right) \, \mathrm{d}x = 1
\text{ .}
$$
\begin{figure}[htbp]
\centering
\includegraphics[width=0.5\textwidth]{graphics/normalpdf.pdf}
\begin{tiny}
\caption{Probability density function of the normal distribution $\mathcal{N}\left(
0, 5 \right)$ with its characteristic bell shape.\label{fig:normaldist}}
% created in \texttt{R} (Snippet in \emph{Appendix~\ref{snip:normaldist}})
\end{tiny}
\vspace{1cm}
\end{figure}
\subsubsection{\emph{prior}, \emph{posterior} and \emph{likelihood}
distributions}
The next concept that we discuss is the dependency between the prior, the posterior
and the likelihood of a particular distribution. One needs to understand these connections to realise that once the agents have acquired a new piece of information, they can use Bayesian statistics to improve their current estimates of the mean reward of each action.\\
To illustrate these dependencies the \emph{normal}
distribution is used, for the reasons mentioned above. The generalisation
to other distributions is straightforward~\citep{gelman2003bayesian}.\\
The basic theorem underlying the following discussion is called Bayes' theorem for point probabilities and states:
$$
p \left( \mathrm{B} | \mathrm{A} \right) = \frac{ p \left( \mathrm{A} |
\mathrm{B} \right) p \left( \mathrm{B} \right) }{ p \left( \mathrm{A} \right) } \text{ ,}
$$
\begin{center}
or
\end{center}
$$
p \left( \mathrm{B} | \mathrm{A} \right) \propto p \left( \mathrm{A} |
\mathrm{B} \right) p \left( \mathrm{B} \right) \text{ ,}
$$
where:
\begin{description}
\item[$p \left( \mathrm{B} | \mathrm{A} \right)$] is a \textbf{posterior}---being a
conditional probability of event \textrm{B} given event \textrm{A},
\item[$p \left( \mathrm{A} | \mathrm{B} \right)$] is a \textbf{sampling density
(``likelihood'')}---being a conditional probability of event \textrm{A} given
event \textrm{B},
\item[$p \left( \mathrm{B} \right)$] is a \textbf{prior}---being a marginal
probability of event \textrm{B}, and,
\item[$p \left( \mathrm{A} \right)$] is a \textbf{normalising factor}---being a \textbf{marginal} probability of event
\textrm{A} (the data).
\end{description}
Now we will focus on general results of Bayesian statistics when our likelihood
function is normally distributed. From this point onwards, we can develop several different scenarios \emph{vide infra}.\\[1.5cm]
\textbf{\textrm{Non-informative prior. }}In the face of a lack of information about
the prior distribution, the best that can be done is to minimise its influence on
the inference. According to the \emph{principle of insufficient reason} proposed by
Bayes and Laplace, we should assume that the prior is \emph{uniformly} distributed,
so all outcomes are equally likely. We also assume that it is distributed over
the real line for both $\mu$ and $\log \sigma^2$ (the transformation to the $\log$ scale
is performed because $\sigma^2$ is a non-negative quantity, and it results in a
stretch along the real line). These operations give the joint probability $p
\left( \mu, \sigma^2 \right) \propto \frac{1}{\sigma^2} $, leading to posterior
distributions given by $p \left( \mu | \mathrm{X}, \sigma^2 \right) \sim
\mathcal{N} \left( \bar{x}, \frac{\sigma^2}{n} \right) $ and $p \left( \sigma^2
| \mathrm{X}, \mu \right) \sim \mathrm{Inv}\text{-}\mathrm{Gamma} \left(
\frac{n}{2} , \sum_{i} \frac{\left( x_i - \mu \right)^2}{2} \right) $ (i.e.\ the inverse
gamma distribution). This approach might not be perfect and its criticism is widely
known; nonetheless, it is sufficient for the application in this project~\citep{Syversveen98noninformativebayesian}.\\
We could use the above approach if, prior to the game, we were not able to acquire any information regarding the available arms.\\
% give equation for normal likelihood --- uniform prior
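The two conditional posteriors above can be sampled from directly; the short \texttt{R} sketch below, with hypothetical data \texttt{x}, alternates the two draws in a Gibbs-sampling fashion.
\begin{lstlisting}
set.seed(5)
x <- rnorm(20, mean = 1.5, sd = 2)   # hypothetical observations from one arm
n <- length(x)

# Conditional posteriors under the non-informative prior p(mu, sigma^2) ~ 1/sigma^2:
#   mu | X, sigma^2  ~  Normal(mean(x), sigma^2 / n)
#   sigma^2 | X, mu  ~  Inv-Gamma(n / 2, sum((x - mu)^2) / 2)
draw.mu     <- function(sigma2) rnorm(1, mean(x), sqrt(sigma2 / n))
draw.sigma2 <- function(mu) 1 / rgamma(1, shape = n / 2,
                                       rate = sum((x - mu)^2) / 2)

# A few alternating (Gibbs-style) draws from the joint posterior.
mu <- mean(x); sigma2 <- var(x)
for (s in 1:5) {
  sigma2 <- draw.sigma2(mu)
  mu     <- draw.mu(sigma2)
  cat(sprintf("draw %d: mu = %.2f, sigma2 = %.2f\n", s, mu, sigma2))
}
\end{lstlisting}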
\textbf{\textrm{Informative prior. }}This is the opposite scenario to the one
described above. Knowing the prior distributions, the application of Bayesian
statistics is as simple as finding the corresponding posterior and calculating its
parameters.\\
From this point onwards, we will assume that our prior is normally distributed
and informative. \\
In our MAB context this means that all arms are normally distributed.\\[1.5cm]
\textbf{\textrm{Known variance. }}First we consider the \emph{normal prior--normal likelihood} case with $\sigma^2$ known and $\mu$ unknown (i.e.\ our variable).
$$
f \left( \mu | \mathrm{X} \right) \propto f \left( \mathrm{X} | \mu \right) f
\left( \mu \right) \text{ .}
$$
The $\sigma^2$ in the notation is omitted for the purposes of clarity. In this case our
prior is defined as follows:
$$
f \left( \mu \right) \sim \mathcal{N}\left( M, \tau^2 \right) \text{ ,}
$$
giving:
$$
f \left( \mu \right) = \frac{1}{\sqrt{2\pi} \tau} e^{- \frac{{\left( \mu
- M \right)}^2}{2 \tau^2} } \text{ ,}
$$
where $M$ is the prior mean and $\tau^2$ is the variance of $\mu$ around $M$; the likelihood
is given by:
$$
f \left( \mathrm{X} | \mu \right) \sim \mathcal{N}\left( \mu, \sigma^2
\right) \text{ ,}
$$
resulting in:
$$
f \left( \mathrm{X} | \mu \right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}
\sigma} e^{- \frac{{\left( \mu - x_i \right)}^2}{2 \sigma^2} } \text{ ,}
$$
where $x_i \in \mathrm{X}$ are the data points.\\
Combining the above with Bayes' rule results in:
$$
f \left( \mu | \mathrm{X} \right) \propto \frac{1}{\sigma \tau} e^{ -
\frac{ {\left( \mu - M \right)}^2 }{2 \tau^2} -\frac{ \sum_{i=1}^{n} {\left(
\mu - x_i \right)}^2 }{2 \sigma^2} } \text{ ,}
$$
which clearly contains the kernel of a normal distribution. After an algebraic
transformation, we can conclude
that the posterior is also \textbf{normally} distributed
with mean $\epsilon$ and variance $\delta^2$---$f \left( \mu | \mathrm{X}
\right) \sim \mathcal{N} \left( \epsilon, \delta^2 \right) $:
\begin{eqnarray*}
\epsilon &=& \frac{\sigma^2 M + n \tau^2 \bar{x}}{n \tau^2 + \sigma^2} = \frac{
\frac{1}{\tau^2} }{ \frac{1}{\tau^2} + \frac{n}{\sigma^2} }M + \frac{
\frac{n}{\sigma^2} }{ \frac{1}{\tau^2} + \frac{n}{\sigma^2} } \bar{x} \text{ ,}
\\
\delta^2 &=& \frac{\sigma^2 \tau^2}{n \tau^2 + \sigma^2} = \frac{
\frac{\sigma^2}{n} \tau^2 }{ \tau^2 + \frac{\sigma^2}{n} } \text{ .}
\end{eqnarray*}
\\
This scenario corresponds to a MAB with all actions normally distributed with known variance---in our application such a scheme is very unlikely.\\[1.5cm]
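These closed-form expressions translate directly into code; the \texttt{R} sketch below (with hypothetical prior parameters and data) returns the posterior mean $\epsilon$ and variance $\delta^2$ of $\mu$ in the known-variance case.
\begin{lstlisting}
# Normal likelihood with known variance sigma2 and a normal prior N(M, tau2) on mu.
# Returns the posterior mean (epsilon) and variance (delta2) of mu.
normal.posterior <- function(x, sigma2, M, tau2) {
  n    <- length(x)
  xbar <- mean(x)
  eps    <- (sigma2 * M + n * tau2 * xbar) / (n * tau2 + sigma2)
  delta2 <- (sigma2 * tau2) / (n * tau2 + sigma2)
  list(mean = eps, var = delta2)
}

# Hypothetical example: prior N(0, 100), known observation variance 4.
set.seed(6)
x <- rnorm(15, mean = 2, sd = 2)
normal.posterior(x, sigma2 = 4, M = 0, tau2 = 100)
\end{lstlisting}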
\textbf{\textrm{Unknown variance. }}This is a more realistic case, with the
posterior model:
$$
p \left( \mu, \sigma^2 | \mathrm{X} \right) \propto p \left( \mathrm{X} | \mu,
\sigma^2 \right) p \left( \mu, \sigma^2 \right) \text{ .}
$$
We now need to specify the details of the prior distribution. One way is to assume
independence of $\mu$ and $\sigma^2$ and establish separate priors for
each, with $p(\mu, \sigma^2) = p(\mu) p(\sigma^2)$, which is documented as a good
technique; nevertheless, it is also common to follow the \emph{non-informative} prior
scenario described above~\citep{gelman2003bayesian}.\\
One such strategy is to assume that $\mu \sim
\mathcal{N} \left( M, \tau^2 \right)$ and choose the parameters resulting in a
flat distribution, e.g.\ $M=0$, $\tau^2 = 10^4$. Furthermore, it is easy to notice
that $\sigma^2$ again follows an \textrm{Inverse}-\textrm{Gamma} distribution.\\
\noindent Basic results of Bayesian statistics for the normal distribution are
available in a wide variety of books. For further reading please refer
to~\citep{lynch2007introduction, gelman2003bayesian}.\\
In the scenario described above our MAB has normally distributed arms with unknown mean reward and unknown variance. This setting is of interest to us and will be developed throughout the following sections.\\
\subsection{Sampling}
In general, sampling is a technique used in statistics to select at random a
subset of individuals or data points from a population of interest. The aim of
such a procedure is to gather a representative group which holds the properties of
the original population. The main advantage of this technique is lowering the
amount of data that needs to be processed.\\
In this study
we use sampling to draw values from the posterior distribution of
each \emph{arm} in order to make an action choice and reduce the uncertainty of the parameter
estimates.\\
In MAB, once we decide which arm to pull we acquire feedback, which in our case is a sample from the distribution of the chosen arm. We use this piece of information to improve our estimate of the parameters characterising the arm's distribution before making a decision in the next round.\\
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{graphics/sampling.png}
\begin{tiny}
\caption{Sampling and updating posterior probability density function of a
normal distribution $\mathcal{N}\left( \mu , \sigma^2 \right)$. Figure taken
from~\citep{Jacobs2008normalnormal}.\label{fig:sampling}}
\end{tiny}
\vspace{1cm}
\end{figure}
\subsection{Introduction to Thompson Sampling\label{sec:thompson}}
In the following section we explain the Thompson Sampling approach and discuss related work that extends the
originally proposed methodology and proves that, under the assumptions presented below, the
approach behaves optimally~\citep{May:2012:OBS:2503308.2343711}.\\
The solution described here is widely used in on-line advertising by IT giants
like Microsoft, Google and Yahoo due to its simplicity, flexibility and
scalability~\citep{graepel2010web}. Moreover, simulations indicate the superiority
of Thompson Sampling over its competitors~\citep{May:simulation}.\\
% \subsubsection*{Introduction}
% Firstly the general idea is presented.
\subsubsection{Intuition}
To better understand how the concepts presented above are used to build a complete strategy, we present the following case study.\\
To clarify this example, we assume that at each stage of the MAB process we are able to
obtain a sample from a posterior distribution with mean $\mu_a$ and variance
$\sigma^2_a$ for each action $a$. If our goal is to maximise the overall reward, we simply
select the arm with the highest local reward (i.e.\ the highest current sample).\\
If all assumptions
about our process hold, then by selecting a given action $a$ we acquire a sample from the unknown distribution of $a$ and use Bayesian statistics to improve our estimates of its
parameters. It is desirable to incorporate the obtained sample and update both the mean and
variance estimates---this guarantees that our posterior becomes the prior for the next round, yielding a more accurate approximation.\\
In order to visualise this process we assume that we are given 2 arms; the deterministic reward of the
first one is always less than the value sampled from the posterior of the second one, and the latter behaves as shown
in figure~\ref{fig:sampling}. We can see that as we sample and update the
posterior, it converges to the true mean and simultaneously the variance of our
estimate decreases.\\
It should also be clear that we can only update---improve---the estimate of
the action selected at a given stage and no other.\\
Usually, a drawn sample will be quite close to the mean of its normal distribution (fig.\ \ref{fig:sim1}), yielding the choice of the action with the highest true mean. Occasionally, it will
lie in the distribution tail, leading to exploration of an action that is suboptimal at a given
time (fig.\ \ref{fig:sim2}). In other words, the higher the variance of an action, the
greater the probability of exploring that particular arm. This argument intuitively
guarantees infinite exploration of the environment, and thus convergence to the overall
optimal solution. A minimal code sketch of this procedure is given after figure~\ref{fig:sim}.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=0.99\linewidth]{graphics/sim1.png}
\caption{\label{fig:sim1}}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=0.99\linewidth]{graphics/sim2.png}
\caption{\label{fig:sim2}}
\end{subfigure}
\begin{tiny}
\caption{Sampling from two posteriors of normally distributed
arms.\label{fig:sim}}
\end{tiny}
\vspace{1cm}
\end{figure}
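Putting these pieces together, the following \texttt{R} sketch is a minimal implementation of Thompson Sampling for normally distributed arms with known observation variance, reusing the conjugate update from section~\ref{sec:bayesian}; the true arm means, the prior parameters and the number of rounds are assumptions of the example.
\begin{lstlisting}
set.seed(7)
mu.true <- c(0.0, 0.5, 1.0)   # assumed true mean rewards, unknown to the agent
sigma2  <- 1                  # known observation variance
n.arms  <- length(mu.true)

# Prior N(M, tau2) on each arm's mean; M and tau2 store the current posteriors.
M     <- rep(0, n.arms)
tau2  <- rep(100, n.arms)
pulls <- rep(0, n.arms)

for (t in 1:500) {
  theta <- rnorm(n.arms, M, sqrt(tau2))  # one sample from each arm's posterior
  a     <- which.max(theta)              # play the arm with the highest sample
  r     <- rnorm(1, mu.true[a], sqrt(sigma2))
  # Conjugate normal-normal update of the chosen arm (a single observation).
  M[a]    <- (sigma2 * M[a] + tau2[a] * r) / (tau2[a] + sigma2)
  tau2[a] <- (sigma2 * tau2[a]) / (tau2[a] + sigma2)
  pulls[a] <- pulls[a] + 1
}
rbind(pulls = pulls, posterior.mean = round(M, 2))
\end{lstlisting}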