%
% in CH1 give a real-life example at the beginning as BACKGROUND
% describe basic concepts
% and then go into more detail
%
%
%
% in CH2 active learning or web optimization
%
%
%
\documentclass[12pt, a4paper, pdflatex, leqno]{report}
% notitlepage - abstract on the same page
\usepackage{indentfirst} % indent first paragraph of section
% \usepackage{fullpage} % full A4 page
\usepackage[left=2.5cm, right=2.5cm, bottom=2.5cm, top=2.5cm]{geometry}
\usepackage{amsmath}
\usepackage{amsfonts} % fancy maths font
\usepackage{mathrsfs} % fancy maths font
\usepackage{dsfont} % indicator function
\usepackage{mathtools}
\usepackage[pdftex]{graphicx}
\usepackage{cite} % BibTeX
\usepackage{lipsum}
\newcommand{\ts}{\textsuperscript}
\usepackage[usenames,dvipsnames]{color}
% for multi figures
\usepackage{graphicx}
\usepackage{caption}
\usepackage{subcaption}
\usepackage[]{algorithm2e}
% \usepackage{polski}
% \usepackage[polish,english]{babel}
% \usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc} % Polish characters
\usepackage{hyperref}
% Harvard citation
\usepackage[square]{natbib}
% argmax with commands
\newcommand{\argmax}{\operatornamewithlimits{argmax}}
% equality from definition | =^{\text{def}}
\newcommand{\myeq}{\stackrel{\mathclap{\normalfont\scriptsize\mbox{def}}}{=}}
% Code snippets
\usepackage{listings}
% \usepackage{color}
\definecolor{dkgreen}{rgb}{0,0.6,0}
\definecolor{gray}{rgb}{0.5,0.5,0.5}
\definecolor{mauve}{rgb}{0.58,0,0.82}
\lstset{frame=tb,
language=R,
aboveskip=3mm,
belowskip=3mm,
showstringspaces=false,
columns=flexible,
basicstyle={\small\ttfamily},
numbers=none,
numberstyle=\tiny\color{gray},
keywordstyle=\color{blue},
commentstyle=\color{dkgreen},
stringstyle=\color{mauve},
breaklines=true,
breakatwhitespace=true,
tabsize=3
}
% END Code snippets
% $\backsim\ \sim\ \thicksim$
\newcommand{\HRule}{\rule{\linewidth}{0.5mm}}
\newenvironment{dedication}
{\clearpage % we want a new page
\thispagestyle{empty}% no header and footer
\vspace*{\stretch{1}}% some space at the top
\itshape % the text is in italics
% \raggedleft % flush to the right margin
\raggedright % flush to the right margin
\par\setlength{\leftskip}{0.3\textwidth}\noindent\ignorespaces
}
{\par % end the paragraph
\vspace{\stretch{3}} % space at bottom is three times that at the top
\clearpage % finish off the page
}
\begin{document}
\begin{titlepage}
\begin{center}
% Upper part of the page. The '~' is needed because \\
% only works if a paragraph has started.
\includegraphics[width=0.5\textwidth]{graphics/UOB-logo.png}~\\[2.5cm] % was 1cm
% \textsc{\LARGE University of Bristol}\\[1.5cm]
%\textsc{\Large Final year project}\\[0.5cm]
% \colorbox{magenta}{problem}
% Title
\HRule \\[0.4cm]
{ \huge \bfseries %\\[0.5cm]
Comprehensive introduction to \emph{\textbf{Multi-armed bandits}}:\\[.5cm]
\emph{Thompson Sampling}\\
\&\\
Active Learning in the bandits scenario\\[0.4cm] }
\HRule \\[1.5cm]
% Author and supervisor
\begin{minipage}{0.4\textwidth}
\begin{flushleft} \large
\emph{Author:}\\
Kacper B. \textsc{\textbf{Sokol}}
\end{flushleft}
\end{minipage}
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Supervisor:} \\
Dr.~David \textsc{\textbf{Leslie}}
\end{flushright}
\end{minipage}
\let\thefootnote\relax\footnote{Level H/6 $|$ MATH 32200---20cp Project}
\vfill
% Bottom of the page
{\large \today}
\end{center}
\end{titlepage}
%Acknowledgment
\begin{center}Acknowledgement of Sources\\[2cm]\end{center}
For all ideas taken from other sources (books, articles, internet), the source of the ideas is mentioned in the main text and fully referenced at the end of the report.\\[0.5cm]
All material which is quoted essentially word-for-word from other sources is given in quotation marks and referenced.\\[.5cm]
Pictures and diagrams copied from the internet or other sources are labelled with a reference to the web page, book, article, etc.\\[2cm]
Signed:\\[1cm]
Dated:~~~~~~~~~~~~\today
% \thispagestyle{empty}% no header and footer
\thispagestyle{empty}
\cleardoublepage
\pagestyle{plain}
\vfill
% \title{\emph{Multi-armed bandits} problem.\\
% Practical introduction to the problem for everyone.\\
% Real life application.}
% \author{Kacper Sokol\\University of Bristol, UK}
% \date{\today}
% \maketitle
% \begin{flushright}
% Supervised by:\\
% \textbf{David Leslie}
% \end{flushright}
% \begin{center}
% \line(1,0){250}
% \end{center}
\begin{abstract}
\thispagestyle{empty}% no header and footer
This dissertation covers two main topics: multi-armed bandit theory, with an extensive treatment of the Thompson Sampling approach, and the application of MAB to active learning. A comprehensive introduction to the theory underlying multi-armed bandits is gradually developed to cover the concepts necessary for understanding the basic bandit strategies. The later part of this dissertation presents a first-of-its-kind (as far as we know) application of a Thompson Sampling inspired MAB algorithm to the computer science task of learning in an environment of insufficient information.\\
The reader does not require any prior knowledge of this field; only the basics of statistics and probability theory are necessary to follow the text smoothly.\\
\begin{center}
Keywords: \textbf{multi-armed, bandit, active, semi-supervised, learning, exploration, exploitation, Thompson Sampling}
\let\thefootnote\relax\footnote{\noindent This dissertation, together with all figures and experiment source code, is available as a \texttt{GitHub} repository at: \url{https://github.com/So-Cool/MAB}.}
\end{center}
\end{abstract}
\begin{dedication}
I would like to thank my parents, who provide me with every kind of support, and whose guidance and advice help me to make the right choices throughout life and fulfill my dreams.\newline
It would also have been a painful journey without my supervisor Dr.~David~Leslie, who always offered advice on how to ``read'' and ``write'' all the maths and avoid the unnecessary and overwhelming clutter found in books.\newline
Finally, big thanks to Iza and Kuba, who often take care of my leisure time even though there is always too little of it.\\[2cm]
% \foreignlanguage{polish}{}
\begin{flushright}
Dzi\k{e}kuj\k{e} mamo,\\
dzi\k{e}kuj\k{e} Tomek.
\end{flushright}
% I was lost now I'm found
\textcolor{white}{found me!}
\end{dedication}
\newpage
\thispagestyle{empty}
\cleardoublepage
\pagestyle{plain}
\setcounter{page}{1}
\tableofcontents
% \newpage
\chapter{Introduction\label{chap:intro}}
The \emph{multi-armed bandits} problem has been a rapidly developing field of statistics and probability theory since the early 20\ts{th} century. With a vastly growing number of tasks that can be framed as a bandit scenario, the field has become a topic of research for many scientists and economists, not to mention companies looking for efficiency improvements and savings. At their core, these problems are solved by finding a balance between \emph{exploration} and \emph{exploitation}.\\
Multi-armed bandits are a class of problems originating from a sequential allocation dilemma. They were defined during the Second World War and quickly gained a reputation for being too difficult to solve, so they were abandoned for decades. The first general solution was constructed by John Gittins (Section~\ref{sec:gitind}) in the late 1960s; nevertheless, his work was overlooked for almost 20 years before becoming renowned in the early 1980s~\citep{gittins+glazebrook+weber}.
\noindent The main reference for this chapter is~\citep{berry+firstedt}.\\
\section{Background}
Many people consider statistics and probability to be about analyzing processes or data in various aspects---a rather \emph{static} science. But what if the process of interest is continuously developing while we try to discover it, or demands our interaction? Simple statistics or probability might not handle such cases as well as bandit theory.\\
To begin with, let us consider the \emph{fruit machine}, as it is the first thing that comes to the reader's mind after hearing about multi-armed bandits. Imagine a row of slot machines in front of you. Pulling the arm of each of these automata will result in a different outcome, each with a corresponding probability according to some unknown distribution. For simplicity we will regard the outcome as the various reel combinations and assume that each automaton gives a binary result: \emph{win} with probability $p$ and \emph{lose} with probability $1-p$. Without loss of any information, a row of such machines can be transformed into a single automaton with multiple arms or buttons, each corresponding to a single machine in the row.\\
A natural example that follows binary bandits is a row of coins, where some of them may be unfair. In this scenario each coin corresponds to an \emph{arm} of a bandit, and tossing one of them several times can be considered a realization of a Bernoulli process with unknown parameters.\\
If a player is rewarded when the outcome of a trial is \textbf{H}ead, then the goal is to find the coin whose bias gives the maximum probability of \textbf{H} and play it forever.\\
If a gambler does not want to lose all of their money really quickly, it would probably be a good idea to have some kind of strategy that maximizes the chances of winning. It is assumed that the gambler is in the given casino for the first time, so any prior information regarding the expected return from each arm is assumed to be unknown. Initially a random arm is chosen, as all of them ``look the same''. On the contrary, during the second turn selecting the \emph{optimal} arm to be played becomes a serious dilemma that you might not yet have realized. The gambler faces a choice between the already pulled arm, whose sample expected return is known, and any other arm, which for now remains a mystery as there is no information about its potential reward.\\
If the gambler decides to take advantage of the already known arm and pull it again, we call this action \emph{exploitation}---that is, taking advantage of already checked possibilities. On the other hand, taking a risk and choosing one of the unknown arms will result in gathering some more information about the system, which is usually called an \emph{exploration} step.\\
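To make this setting concrete, the following \texttt{R} sketch (our own illustration; the win probabilities are arbitrary values unknown to the gambler) simulates a purely exploitative player facing a row of binary machines---such a player can easily get stuck on a suboptimal arm, which is exactly the dilemma described above.
\begin{lstlisting}
# Three binary "one-armed bandits": win (1) with probability p[a], lose (0) otherwise.
p <- c(0.3, 0.5, 0.7)
pull <- function(a) rbinom(1, size = 1, prob = p[a])

# A purely exploitative gambler: after trying every arm once, always replay
# the arm with the best observed average so far.
set.seed(1)
wins <- rep(0, length(p)); pulls <- rep(0, length(p))
for (t in 1:1000) {
  means <- ifelse(pulls > 0, wins / pulls, Inf)   # untried arms are tried first
  best  <- which(means == max(means))
  a     <- best[sample.int(length(best), 1)]      # break ties at random
  r     <- pull(a)
  wins[a] <- wins[a] + r; pulls[a] <- pulls[a] + 1
}
pulls   # how the 1000 plays were allocated among the arms
\end{lstlisting}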
\section{Applications} % emphasized underlying
Multi-armed bandits are not just a theory that one reads from a book and tries to memorize; they extend to many real-life applications. This section is devoted to simple case studies in which it seems natural to use the ``bandit approach''. Applications are versatile, ranging from drug testing and maximizing income from web advertisement, through semi-supervised machine learning in modern computer science, to time and budget management of research projects.\\
To begin with, imagine a hospital testing two new medicines for a certain disease. Patients are queuing up to receive treatment. Assuming that the doctor cannot refuse to treat anyone, each person suffering from the disease (each \emph{play}) can be given one of two possible drugs (two \emph{arms}). The key assumption here is that the effect of the chosen action occurs immediately; in other words, the treated person either stays sick or is cured (immediate \emph{payoff}). The goal of the doctor is to maximize the number of healed people. This model defines a two-armed bandit.\\
The second mentioned approach is nowadays widely incorporated into their services by companies such as Google~\citep{AYPSze12, ASMB:ASMB874}, LinkedIn~\citep{Tang:2013:AAF:2505515.2514700}, Microsoft~\citep{graepel2010web} and Yahoo~\citep{Li:2010:CAP:1772690.1772758}. Research groups of these companies are using bandit algorithms to choose the best website layouts and advertisement locations to increase the click-through rate (CTR)\footnote{A measure of success of an on-line advertising campaign.}, improve recommendation systems, or enhance the performance of semi-supervised and active learning algorithms in the field of machine learning.\\
Bandit theory also plays a key role in allocating experiments with restricted resources such as time and budget~\citep{gittins+glazebrook+weber}. Considering the bandit's \emph{arms} as research projects waiting to be conducted with a limited number of scientists, limited time, or limited funding, the goal is to maximize the number of accomplished tasks (\emph{payoff}) when distributing the resources among them.\\
With all the scenarios presented above barely scratching the tip of the iceberg, we now focus on the foundations of multi-armed bandit theory that one needs to precisely describe the processes happening ``behind the scenes''.\\
We begin by introducing the necessary notation and nomenclature; then we move on to a basic description of the process.
\section{Terminology}
Statistical decision theory defines a ``simple'' multi-armed bandit as a sequential selection from $N \geq 2$ stochastic processes---generally called \emph{arms}---where both time and the processes may be discrete or continuous. The goal is typically to recover the unknown parameters that characterize the stochastic processes (the \emph{arms}) in order to maximize the expected \emph{payoff}.\\
It is handy to be familiar with the most common terms used to describe multi-armed bandits in papers and books. The basic concepts are:
\begin{description}
\item[Multi-armed bandit ($N$)]---a ``device'' with $N \geq 2$ possible choices of action (\emph{arms}).
\item[Agent]---a person who decides which \emph{arm} to pull based on the chosen \emph{strategy} $\tau$.
\item[Strategy ($\tau$)]---tells the \emph{agent} which \emph{arm} to pull at a given stage of the \emph{game}. A strategy is \emph{optimal} if it yields the maximal expected \emph{payoff}.
\item[Arm]---one of $N$ actions that may be taken by the \emph{agent}. An \emph{arm} is \emph{optimal} if it is the best selection when following some \emph{optimal} strategy.
\item[Play]---an \emph{arm} pulled at stage $m$ of the \emph{game} (i.e.\ one turn).
\item[Game]---a sequence of \emph{arms} pulled based on the chosen \emph{strategy} $\tau$.
\item[Payoff]---the return of a game, such as \emph{win--lose} or the \emph{amount} of money gained.
\item[Discount series]---factors that define how valuable each \emph{payoff} is. For example, only the first $m$ outcomes may count and all the rest are neglected, or the longer the \emph{game} is played the less a particular outcome counts toward the overall expected \emph{payoff}.
\end{description}
\section{General assumptions}
Two assumptions fundamental to the vast majority of bandit problems concern the benefits of selecting an arm, namely:
\begin{itemize}
\item immediate \emph{payoff}, i.e.\ the \emph{agent} knows the result of the taken action immediately,
\item information gathered after a \emph{play} can be used to modify the chosen \emph{strategy}.
\end{itemize}
Moreover, in some cases we may restrict the memory of an \emph{agent} to the last $s$ outcomes. The selected \emph{strategy} $\tau$ can then rely on up to $s$ previous \emph{plays}. Such bandit approaches are called \emph{finite memory}.\\
The majority of this dissertation assumes that the \emph{arms} are independent. Whenever this setting changes it will be clearly stated.
\section{Discount Sequence}
To specify the rules governing the ``significance'' of the outcome of a single play at stage $m$, the \emph{discount sequence} is introduced. It is a vector $\mathbf{A}$ of specified length, which may also be infinite:
$$
\mathbf{A} = \left( \alpha_1, \alpha_2, \alpha_3, ... \right)
$$
When an \emph{arm} is selected the discount sequence is modified by left shift:
$$
\left( \alpha_1, \alpha_2, \alpha_3, ... \right)
\rightarrow
\left( \alpha_2, \alpha_3, \alpha_4, ... \right) \text{ .}
$$
\subsection{Most common sequences}
There are many different discount sequences used with multi-armed bandits, each coming with its own assumptions. In the literature only two of them are described in great detail; both are presented below.
\subsubsection{Uniform sequence}
This discount sequence is most commonly used when a player wants to maximize the payoff of the first $h$ outcomes.
The $h$-horizon uniform discount sequence is defined as:
$$
\alpha_i =
\begin{cases}
1 & \text{for } i \leq h \\
0 & \text{for } i > h \text{ ,}
\end{cases}
$$
leading to:
$$
\mathbf{A} = ( \underbrace{ 1, 1, 1, ..., 1}_{h\text{ elements}}, 0, 0, 0, ... ) \text{ .}
$$
After \emph{arm} selection the horizon of the discount sequence is decreased by $1$.
\subsubsection{Geometric sequence}
The geometric discount sequence is expressed with components $\alpha_i = a^{i-1}$ for some $a \in ( 0, 1 )$ resulting in:
$$
\mathbf{A} = \left( a^0, a^1, a^2, ... \right) \text{ ,}
$$
where $\alpha_1 = a^0$ is always equal to $1$. The characteristic feature of this series is that after \emph{arm} selection the discount sequence is still proportional to the original sequence.\\
It can also be shown that when the discounting is geometric, a bandit problem involving $k$ independent arms can be solved by solving $k$ different two-armed bandits, each involving one known and one unknown arm.\\[2cm]
In the case of both sequences, the decision problem and the optimal strategy are unchanged if the discount series is multiplied by a positive constant. Furthermore, the geometric sequence is effectively the same throughout the \emph{game}.
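As a small illustration (our own, not part of the original text), truncated versions of both sequences and the left shift applied after an arm selection can be written in \texttt{R} as follows.
\begin{lstlisting}
# h-horizon uniform discount sequence, truncated to its first m terms
uniform_discount <- function(h, m) c(rep(1, min(h, m)), rep(0, max(m - h, 0)))

# geometric discount sequence with factor a in (0, 1), first m terms
geometric_discount <- function(a, m) a^(0:(m - 1))

# left shift applied after an arm is selected
shift <- function(A) A[-1]

uniform_discount(h = 3, m = 6)      # 1 1 1 0 0 0
shift(geometric_discount(0.9, 6))   # 0.9^1 ... 0.9^5, still proportional to the original
\end{lstlisting}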
\subsubsection{Mixtures of uniform sequences}
If we consider the Bernoulli trials (with each success worth $1$) of a clinical test, the experiment can terminate with positive probability $\nu_m$ at any stage $m$, as we do not know the exact number of incoming patients. In such cases we may form a discount sequence whose factor at a given stage is the probability that the experiment has not been terminated before that stage, $\alpha_m = \sum_{i=m}^\infty \nu_i$. Clearly the discount series that arises in this scenario is a mixture of uniforms. This case is identical to a deterministic sequence---an average of uniform sequences weighted by the $\nu_m$'s.\\
An example may be a probability $1-\gamma$ of terminating at each stage (i.e.\ the experiment continues with probability $\gamma$); the probability of terminating exactly at stage $m$ is then $\nu_m = (1-\gamma)\gamma^{m-1}$, $m=1,2,3,...$, leading to the geometric discount sequence $\alpha_m = \gamma^{m-1}$.
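A quick numerical check of this claim (our own sketch; the value of $\gamma$ and the truncation length are arbitrary) confirms that the tail sums of the $\nu_m$'s recover the geometric sequence.
\begin{lstlisting}
gamma <- 0.8
m <- 1:5
# nu_i for i = 1..1000; the long truncation approximates the infinite sum
nu <- (1 - gamma) * gamma^(0:999)
alpha <- sapply(m, function(k) sum(nu[k:length(nu)]))
round(rbind(alpha = alpha, geometric = gamma^(m - 1)), 6)   # the two rows agree
\end{lstlisting}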
\subsubsection{Random discounting}
Random discounting is more complex. In a mixture of uniforms we gain no advantage from conditioning on the past: as long as we receive $1$ the process continues and there is no possibility that previous factors were $0$; once we receive $0$ the process is no longer of interest. In general, with a random discount sequence we are only provided with some prior probability distribution over the space of all possible discount sequences.\\
\subsubsection{Observable and non-observable sequences}
Sometimes the discount sequence cannot be observed: it is either blended with the reward into a single value or it is simply not given to us together with the reward. In such cases we need to estimate it. As in the random scenario, given the distribution over all possible discount sequences we can estimate it by $\hat{\alpha}_m = \mathbb{E}(\alpha_m | \text{``probability distribution over all discount sequences''})$.\\
We say that a discount sequence is observable if, together with the reward, we receive the value of the discount factor at the given stage. Such cases may be harder to solve, since random sequences cannot be replaced with non-random ones without significantly altering the problem.\\
In the scenarios presented here, strategies tend to depend on deterministic, observable discount factors.\\
\subsubsection{Real-time sequences}
In real-time sequences the intervals between events are random and therefore directly influence the decision-making process. To visualize this scenario, consider a clinical trial with patients arriving at random time intervals. In such cases we usually use a discount sequence described by $A = ( \exp(-\beta t_1), \exp(-\beta t_2), \exp(-\beta t_3),... )$, where $\beta$ is a weight coefficient and the $t_i$ are the known arrival times.\\
A more complicated situation for real-time discounting can be described as:
$$
\alpha_t =
\begin{cases}
1 & \text{for } t \in [0,1] \text{ ,} \\
0 & \text{for } t \in [1,\infty) \text{ .}
\end{cases}
$$
Such a sequence expresses an interest in maximizing the response of patients arriving in the first unit interval. It is worth noting that the action choice may be significantly different if the first event occurs at time $0.01$ than if it occurs at time $0.99$; generally, a risky arm could be appropriate in the first case, while the mean-maximizing action may be a good choice in the second.\\
\subsubsection{Non-monotone sequences}
The majority of discount sequences considered in the literature are monotone increasing or decreasing. This is motivated by real-life applications where our interest in the process either decreases---the player is interested in a quick reward---or increases---the player is interested in a high reward in the future, obtained by first learning about the environment.\\
One of the most popular of the few non-monotone sequences is defined by $\alpha_n = 1$ and $\alpha_m = 0$ for $m \neq n$. This structure of the discount sequence indicates that the first $n-1$ stages are played for the sole purpose of obtaining information to make the best possible choice at stage $n$; all the other rounds do not matter.\\
\section{Bandits settings}
The multi-armed bandit setting can be used to solve a variety of problems. The most common settings have been discussed in chapter~\ref{chap:intro}; nevertheless, some non-trivial environments are also used and are presented below.\\
\subsubsection{Real time, random bandits}
Real-time, random bandits usually arise when a decision needs to be made for events of a given class, or assigned to a particular problem, that occur at non-deterministic intervals of time, and where information about the event can only be acquired during its occurrence. Such scenarios are similar to the ones described in the \emph{Random discounting} section.\\
\subsubsection{Information gathering}
One scenario in which multi-armed bandits are also used is gathering information about the available choices. If such learning needs to be done in an efficient manner, MAB is often a good choice. The example presented in the \emph{non-monotone} section, where a number of rounds at the beginning is used to learn some information about the environment, is one of them.\\
\subsubsection{Dependent arms}
MABs with dependent arms grow in popularity due to Internet advertising applications. In such a scenario the ads displayed to the user are usually dependent (different products of one manufacturer, or one class of products made by different companies) and can very often be easily grouped. Policies for such scenarios are designed to consider these connections between actions and to exploit the underlying information~\citep{Pandey:2007:MBP:1273496.1273587}.\\
\chapter{Solution \texttt{\textbf{Exploration}}}
In this chapter we present a number of approaches to finding an optimal solution. First, we briefly describe methodologies that have been well known for a couple of decades; then we give an extensive description of one of the most recent approaches, called \emph{Thompson Sampling}. Finally, we draw conclusions and highlight ongoing research.\\
\section{Seeking an optimal solution}
The strategies presented below are used to balance exploration and exploitation in order to find an optimal solution---to determine the equilibrium between acquiring information and immediate payoff.\\
Let us recall the hospital example; we have to decide whether it is worth sacrificing the wellbeing of early-arriving patients to learn more about a particular condition by means of experimenting. Such a choice would lead to the treatment improving over time, yielding better results for future patients---sacrificing early payoff to gain more information about the system and maximize future return.\\
We could also soften such a dramatic scenario by using a geometric sequence, so that the relative worth of current and future patients' health stays the same throughout the trial.\\
In general we can consider a strategy optimal if in the limit of time it converges to the optimal selection---it chooses the optimal arm infinitely many times. We usually consider the \emph{preemptive} case---arbitrary switching between actions is allowed and takes negligible time.\\
\subsection{Examples}
\subsubsection{Index approach (Gittins Index)\label{sec:gitind}}
This is one of the oldest approaches providing an optimal solution. The pioneering index method introduced by John C.\ Gittins significantly advanced not only the multi-armed bandit concept but the whole wide class of sequential allocation problems.\\
The basic intuition behind this approach is to assign a \emph{priority index} to each action, where the index of a particular arm depends only on the history and outcomes of that action and no other. The decision process then comes down to choosing the action with the highest current index.\\
The basic theory of index allocation is based on calibrating actions at a given state against some standardized actions with simple properties. The main advantage of such an approach is that it restricts the \emph{state function} of an action to depend only on its own history, thereby reducing the complexity~\citep{gittins+glazebrook+weber}.\\
Indices are described as real-valued functions (the dynamic allocation index) on the set of all available alternatives; the selection process maximizes this function.\\
With such indices we can specify an optimal policy for a particular problem (with any set of possible alternatives of a given type) with respect to the so-called \emph{standard bandit problem}.\\
The index theorem says that a policy for a bandit process is optimal if it is an index policy with respect to $\nu(B_1, \cdot), \nu(B_2, \cdot),..., \nu(B_n, \cdot)$, where $\nu$ is the index function and $B_i$ is a given bandit process.\\
The significant drawback of the index strategy is its requirement of vast amounts of resources, such as computational power and memory storage. Moreover, the following assumptions need to hold:\\
\begin{itemize}
\item rewards are accumulated up to an infinite time horizon,
\item there is constant and strict exponential discounting,
\item unless an arm is pulled, no rewards are collected from it and its state does not change,
\item there is only one processor (server).\\
\end{itemize}
In the simplest case we consider a multi-armed bandit as a number of semi-Markov decision processes. The \emph{index theorem} states that there exists an \emph{index policy} which is optimal. This means that there exists a real-valued index, say $\nu ( B_i , \xi_i(t) )$, where $B_i$ is the $i$-th bandit process and $\xi_i$ is the state of the $i$-th process (dependent on its history)---each index depends only on its own current state and on no other process's state. As previously mentioned, we play the process with the highest current index.\\
\subsubsection{Upper Confidence Bound (UCB) and Lower Confidence Bound (LCB)}
This method is based on upper and lower confidence bounds for the mean reward of each arm, where the selection is based on the largest (respectively lowest) bound. It can be shown that under certain assumptions the policy will select suboptimal arms only a finite number of times---in the time limit the optimal arm will be played exponentially more often than any other. Some algorithms of this family require only the sample mean of each arm and are therefore simple and computationally inexpensive~\citep{Scott:2010:MBL:1944422.1944432}.\\
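As an illustration (our own sketch, not taken from the text above), one well-known member of this family, usually called UCB1, adds a confidence-width bonus of $\sqrt{2 \ln t / n_a}$ to each sample mean and plays the arm with the largest bound.
\begin{lstlisting}
# Sketch of a UCB-style selection rule (UCB1): play the arm with the largest
# upper confidence bound on its sample mean.
ucb_select <- function(means, counts, t) {
  untried <- which(counts == 0)
  if (length(untried) > 0) return(untried[1])   # play every arm once first
  which.max(means + sqrt(2 * log(t) / counts))
}

# returns 2: the exploration bonus of the rarely played arm outweighs its lower mean
ucb_select(means = c(0.6, 0.4), counts = c(50, 5), t = 55)
\end{lstlisting}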
\subsubsection{``Stay on a winner''}
This is a well-known strategy, very often used by gamblers. It is a myopic policy where arm $a$ is selected in round $t+1$ if it was successful at time $t$. On the contrary, if the selected arm failed, we either select the next arm at random or according to some deterministic policy.\\
It was shown in~\citep{berry+firstedt} that continuing with a successful arm is a good strategy; nevertheless, the ``failure switch'' is not always desirable. This strategy tends toward equal allocation, which leads to over-exploration~\citep{Scott:2010:MBL:1944422.1944432}.\\
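The rule itself is a one-liner; a minimal \texttt{R} sketch (ours, for binary payoffs) reads:
\begin{lstlisting}
# "Stay on a winner": keep the current arm after a success, otherwise
# switch to one of the other arms at random.
stay_on_winner <- function(current_arm, last_reward, n_arms) {
  if (last_reward == 1) return(current_arm)
  others <- setdiff(seq_len(n_arms), current_arm)
  others[sample.int(length(others), 1)]
}

stay_on_winner(current_arm = 2, last_reward = 0, n_arms = 3)   # returns 1 or 3
\end{lstlisting}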
% \subsubsection{Minimax approach}
% ~\citep{Scott:2010:MBL:1944422.1944432}.\\
\subsubsection{Thompson Sampling approach}
This is the most recent method, gaining ground through its simplicity and low computational cost. The basic idea is to sample from the posterior distribution of each action and select the arm with the highest sample. The more uncertain we are about a particular action, the higher the variance of its estimate and the more dispersed its samples---this natural mechanism guarantees infinite exploration of all actions.\\
This method is extensively described in section~\ref{sec:thompsonsampling}; we motivate this choice by the relatively few undergraduate-level texts available. Furthermore, Thompson Sampling is a cutting-edge approach with a lot of ongoing research.\\
\subsection{Strategies---Policies}
\subsubsection{Finite memory strategy}
This is a class of strategies where the decision process has finite memory of the history: the restriction can be set to the last $w$ rounds. Some policies can suffer a large loss with restricted memory, while others still work well.\\
\subsubsection{Myopic strategies}
In general these strategies are not optimal. They try to maximize the expected reward during the next infinitesimal interval of time. They are called myopic because they neglect things that may happen later, sacrificing a long-range vision to maximize the current payoff. Uncertainty (exploratory value) is taken into account, but only for the current choice---a limited lookahead.\\
For some classes of problems it can be shown that the short-term optimum is simultaneously a long-term one, so for these kinds of problems this solution seems best, as it does not require complex computation or expensive lookahead. The approach can be applied to many strategies, but it is very often used with index methods.\\
% is thompson samplkng one of myoptic methods? intead of taking longe ranfe view it ties to minimize current vaiance of hypothesis to decide on best one.\\
% local not global solution with respect to time intervals.
\subsubsection{Undirected}
In this class of strategies we make our choice by considering only the exploitative value, e.g.\ $\epsilon$-greedy, $\epsilon$-decreasing, or Boltzmann action selection. We select an action based on the currently highest reward estimate and do not include an explicit exploratory component.\\
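As a concrete illustration (ours), the $\epsilon$-greedy rule is easily sketched in \texttt{R}: with a small probability it picks a random arm, otherwise it exploits the current best estimate.
\begin{lstlisting}
# epsilon-greedy selection: explore a random arm with probability epsilon,
# otherwise exploit the arm with the highest current reward estimate.
eps_greedy_select <- function(estimates, epsilon = 0.1) {
  if (runif(1) < epsilon) {
    sample.int(length(estimates), 1)   # explore
  } else {
    which.max(estimates)               # exploit
  }
}
\end{lstlisting}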
% \subsubsection{Belief-lookahead}
\subsubsection{Bayesian approach}
This set of policies uses Bayesian inference---combining a prior belief with a likelihood estimate---to produce the current approximation of the process. Bayes' theorem allows a ``straightforward'' application to adaptive learning, so it is a good tool in sequential decision-making processes like MAB.\\
Thompson Sampling is a good example of a Bayesian approach.\\
% It can be fully Bayesian approach where---here the chosen action is supposed to maximise the expected cumulative reward for the rest of the process. Drwback of such approach is hard to guarantee inifinite exploration.\\
\section{Thompson Sampling\label{sec:thompsonsampling}}
To fully understand \emph{Thompson Sampling} approach to multi-armed bandits theory, two concepts need to be introduced first:
\begin{itemize}
\item Bayesian statistics \textit{and}
\item sampling theory.
\end{itemize}
Thompson Sampling can be used with any probability distribution, but for the sake of simplicity we consider only the normal distribution in this dissertation; due to space constraints the theory will be presented with a \emph{normal likelihood}. Furthermore, discussing other scenarios of Thompson Sampling would require the reader to be familiar with the analytic approach to Bayesian \emph{posteriors}, \emph{priors} and \emph{likelihoods}.
\subsection{Bayesian statistics\label{sec:bayesian}}
In this section we briefly introduce the normal distribution. Then we discuss its aspects with regard to Bayesian statistics.
\subsubsection{The normal distribution}
The most common distribution in statistics, with its well-known bell-shaped curve (see figure~\ref{fig:normaldist}), is the normal distribution, also called the \emph{Gaussian} distribution. If a random variable $\mathrm{X}$ follows such a distribution, parametrized by \emph{mean} $\mu$ and \emph{standard deviation} $\sigma$, we commonly write $\mathrm{X} \sim \mathcal{N}\left( \mu, \sigma^2 \right)$. The probability density function of $\mathrm{X}$ is then:
$$
f \left(x | \mu, \sigma \right) = \frac{1}{\sigma \sqrt{2 \pi }} e^{- \frac{ {\left ( x - \mu \right )}^2 }{2 \sigma^2} } \text{ ,}
$$
where the first part ${\left( \sigma \sqrt{2 \pi } \right)}^{-1}$ is a normalizing factor and the latter part is the distribution ``kernel''.\\
The area under the curve from $-\infty$ to $+\infty$ integrates to $1$~\citep{rice1995mathematical}:
$$
\int_{-\infty}^{+\infty} \! f \left(x | \mu, \sigma \right) \, \mathrm{d}x = 1 \text{ .}
$$
\begin{figure}[htbp]
\centering
\includegraphics[width=0.5\textwidth]{graphics/normalpdf.pdf}
\begin{tiny}
\caption{Probability density function of the normal distribution $\mathcal{N}\left( 0, 5 \right)$ with its characteristic bell shape.\label{fig:normaldist}}
% created in \texttt{R} (Snippet in \emph{Appendix~\ref{snip:normaldist}})
\end{tiny}
\vspace{1cm}
\end{figure}
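According to the commented-out caption note, the figure was created in \texttt{R}; a minimal reconstruction (ours) of such a plot, treating the second parameter as the variance in line with the $\mathcal{N}(\mu, \sigma^2)$ convention used above, could look as follows.
\begin{lstlisting}
# Density of N(0, 5) (variance 5, so sd = sqrt(5)): the bell curve above.
x <- seq(-10, 10, length.out = 500)
plot(x, dnorm(x, mean = 0, sd = sqrt(5)), type = "l",
     xlab = "x", ylab = "density")
\end{lstlisting}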
\subsubsection{\emph{prior}, \emph{posterior} and \emph{likelihood} distributions}
The next concept we introduce is the relationship between the prior, the posterior and the likelihood of a particular distribution. To illustrate it, the \emph{normal} distribution will be used, for the reasons mentioned above. The generalization to other distributions is straightforward~\citep{gelman2003bayesian}.\\
The basic theorem underlying this whole section is Bayes' theorem for point probabilities, which states:
$$
p \left( \mathrm{B} | \mathrm{A} \right) = \frac{ p \left( \mathrm{A} | \mathrm{B} \right) p \left( \mathrm{B} \right) }{ p \left( \mathrm{A} \right) }
$$
\begin{center}
or
\end{center}
$$
p \left( \mathrm{B} | \mathrm{A} \right) \propto p \left( \mathrm{A} | \mathrm{B} \right) p \left( \mathrm{B} \right) \text{ ,}
$$
where:
\begin{description}
\item[$p \left( \mathrm{B} | \mathrm{A} \right)$] is the \textbf{posterior}---the conditional probability of event \textrm{B} given event \textrm{A},
\item[$p \left( \mathrm{A} | \mathrm{B} \right)$] is the \textbf{sampling density (``likelihood'')}---the conditional probability of event \textrm{A} given event \textrm{B},
\item[$p \left( \mathrm{B} \right)$] is the \textbf{prior}---the marginal probability of event \textrm{B}, and
\item[$p \left( \mathrm{A} \right)$] is the \textbf{marginal} probability of event \textrm{A} (the data), which acts as a normalizing factor.
\end{description}
We now focus on the general results of Bayesian statistics when the likelihood function is normally distributed. From this point onwards we can develop a couple of different scenarios, described below.\\[2cm]
\textbf{\textrm{Non-informative prior. }}If we lack information about the prior distribution, the best that can be done is to minimize its influence on the inference. According to the \emph{principle of insufficient reason} proposed by Bayes and Laplace, we should assume that the prior is \emph{uniformly} distributed, so all outcomes are equally likely. We also assume that it is distributed over the real line for both $\mu$ and $\log \sigma^2$ (the transformation to the $\log$ scale is performed because $\sigma^2$ is a non-negative quantity, and it stretches its range along the real line). These operations result in the joint prior $p \left( \mu, \sigma^2 \right) \propto \frac{1}{\sigma^2} $, leading to posterior distributions given by $p \left( \mu | \mathrm{X}, \sigma^2 \right) \sim \mathcal{N} \left( \bar{x}, \frac{\sigma^2}{n} \right) $ and $p \left( \sigma^2 | \mathrm{X}, \mu \right) \sim \mathrm{Inv}\text{-}\mathrm{Gamma} \left( \frac{n}{2} , \sum_{i} \frac{\left( x_i - \mu \right)^2}{2} \right) $ (an inverse gamma distribution). This approach is not perfect and its criticism is widely known, but for our application it is sufficient~\citep{Syversveen98noninformativebayesian}.\\
% give equation for normal likelihood --- uniform prior
\textbf{\textrm{Informative prior. }}This is the opposite of the scenario described above. With a known prior distribution, the application of Bayesian statistics is as simple as finding the corresponding posterior and calculating its parameters.\\
From this point onward we will assume that our prior is normally distributed and informative.\\[2cm]
\textbf{\textrm{Known variance. }}First we consider the \emph{normal prior--normal likelihood} case with $\sigma^2$ known and $\mu$ unknown (our variable).
$$
f \left( \mu | \mathrm{X} \right) \propto f \left( \mathrm{X} | \mu \right) f \left( \mu \right) \text{ .}
$$
The $\sigma^2$ in the notation is omitted for clarity. In this case our prior is defined as follows:
$$
f \left( \mu \right) \sim \mathcal{N}\left( M, \tau^2 \right) \text{ ,}
% ~
$$
giving:
$$
f \left( \mu \right) = \frac{1}{\sqrt{2\pi} \tau} e^{- \frac{{\left( \mu - M \right)}^2}{2 \tau^2} } \text{ ,}
$$
where $M$ is the prior mean and $\tau^2$ is the variance of $\mu$ around $M$; the likelihood is given by:
$$
f \left( \mathrm{X} | \mu \right) \sim \mathcal{N}\left( \mu, \sigma^2 \right) \text{ ,}
$$
resulting in:
$$
f \left( \mathrm{X} | \mu \right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \sigma} e^{- \frac{{\left( \mu - x_i \right)}^2}{2 \sigma^2} } \text{ ,}
$$
where $x_i \in \mathrm{X}$ are the data points.\\
After plugging both of them into Bayes' rule we get:
$$
f \left( \mu | \mathrm{X} \right) \propto \frac{1}{\sigma \tau} e^{ -\frac{ {\left( \mu - M \right)}^2 }{2 \tau^2} -\frac{ \sum_{i=1}^{n} {\left( \mu - x_i \right)}^2 }{2 \sigma^2} } \text{ ,}
$$
which clearly contains the kernel of a normal distribution. After some algebraic manipulation we can derive that the posterior is \textbf{normally} distributed with mean $\epsilon$ and variance $\delta^2$---$f \left( \mu | \mathrm{X} \right) \sim \mathcal{N} \left( \epsilon, \delta^2 \right) $:
\begin{eqnarray*}
\epsilon &=& \frac{\sigma^2 M + n \tau^2 \bar{x}}{n \tau^2 + \sigma^2} = \frac{ \frac{1}{\tau^2} }{ \frac{1}{\tau^2} + \frac{n}{\sigma^2} }M + \frac{ \frac{n}{\sigma^2} }{ \frac{1}{\tau^2} + \frac{n}{\sigma^2} } \bar{x} \text{ ,} \\
\delta^2 &=& \frac{\sigma^2 \tau^2}{n \tau^2 + \sigma^2} = \frac{ \frac{\sigma^2}{n} \tau^2 }{ \tau^2 + \frac{\sigma^2}{n} } \text{ .}
\end{eqnarray*}
\\
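These closed-form expressions translate directly into code; the following \texttt{R} helper (a sketch with illustrative parameter values, not part of the original experiments) computes the posterior mean and variance.
\begin{lstlisting}
# Normal prior N(M, tau2) on mu, normal likelihood with known variance sigma2;
# returns the parameters (epsilon, delta2) of the normal posterior for mu.
normal_posterior <- function(x, M, tau2, sigma2) {
  n    <- length(x)
  xbar <- mean(x)
  eps  <- (sigma2 * M + n * tau2 * xbar) / (n * tau2 + sigma2)   # posterior mean
  del2 <- (sigma2 * tau2) / (n * tau2 + sigma2)                  # posterior variance
  list(mean = eps, var = del2)
}

set.seed(1)
x <- rnorm(10, mean = 2, sd = 2)                   # data with true mean 2
normal_posterior(x, M = 0, tau2 = 1, sigma2 = 4)   # prior N(0, 1), known sigma2 = 4
\end{lstlisting}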
\textbf{\textrm{Unknown variance. }}This is a more realistic case, with the posterior model:
$$
p \left( \mu, \sigma^2 | \mathrm{X} \right) \propto p \left( \mathrm{X} | \mu, \sigma^2 \right) p \left( \mu, \sigma^2 \right) \text{ .}
$$
We now need to specify the details of the prior distribution. One way is to assume independence of $\mu$ and $\sigma^2$ and establish separate priors for each, with $p(\mu, \sigma^2) = p(\mu) p(\sigma^2)$, which is documented as a good technique~\citep{gelman2003bayesian}.\\
With unknown variance a common approach is to follow the \emph{non-informative} prior scenario described above. Another strategy is to assume that $\mu \sim \mathcal{N} \left( M, \tau^2 \right)$ with a choice of parameters resulting in a flat distribution, e.g.\ $M=0$, $\tau^2 = 10^4$. Furthermore, it is easy to spot that $\sigma^2$ then follows an \textrm{Inverse}-\textrm{Gamma} distribution.\\
Basic results of Bayesian statistics for the normal likelihood are available in a wide variety of books. For further reading please refer to~\citep{lynch2007introduction, gelman2003bayesian}.\\
\subsection{Sampling}
In general, sampling is a technique used in statistics to select at random a subset of individuals or data points from a population of interest. The aim of such a procedure is to gather a representative group which preserves the properties of the original population. The main advantage of this approach is that it lowers the amount of data that needs to be processed.\\
In our approach we use sampling to draw values from the posterior distribution of each \emph{arm} in order to make an action choice and reduce the uncertainty of the parameter estimates.\\
\begin{figure}[htbp]
\centering
\includegraphics[width=0.7\textwidth]{graphics/sampling.png}
\begin{tiny}
\caption{Sampling and updating posterior probability density function of a normal distribution $\mathcal{N}\left( \mu , \sigma^2 \right)$. Figure taken from~\citep{Jacobs2008normalnormal}.\label{fig:sampling}}
\end{tiny}
\vspace{1cm}
\end{figure}
\subsection{Introduction to Thompson Sampling\label{sec:thompson}}
Thompson Sampling was first proposed in 1933 by William R.\ Thompson~\citep{thompson:biom33} and was mainly concerned with Bernoulli processes. In this section we explain this approach and discuss related work extending the originally proposed methodology~\citep{May:2012:OBS:2503308.2343711}. Furthermore, it can be proved that under the assumptions presented below the approach behaves optimally.\\
The solution described here is widely used for on-line advertising by IT giants like Microsoft, Google and Yahoo due to its simplicity, flexibility and scalability~\citep{graepel2010web}. Moreover, simulations indicate the superiority of Thompson Sampling over its competitors~\citep{May:simulation}.\\
% \subsubsection*{Introduction}
% Firstly the general idea is presented.
\subsubsection{Intuition}
We begin with an intuitive description of the mechanisms governing the process.\\
For clarity, we assume that at each time step we are able to obtain a sample from a posterior distribution with mean $\mu_a$ and variance $\sigma_a^2$ for each action $a$. If our goal is to maximize the reward, we simply select the arm with the highest local reward (the highest sample). If all assumptions about our process hold, then by selecting a given action we improve the estimate of its parameters. It is desirable to incorporate the obtained sample and update the mean and variance---our posterior therefore becomes the prior for the next round.\\
To visualize this, assume that we have 2 arms; the posterior of the first one always lies below that of the second one, and the latter behaves as shown in figure~\ref{fig:sampling}. We can see that as we sample and update the posterior, it converges to the true mean while the variance of our estimate decreases. It is clear that we can only update the estimate of the selected action and no other.\\
Usually a drawn sample will be quite close to the mean (figure~\ref{fig:sim1}; normal distribution), yielding the choice of the action with the highest mean; occasionally it will lie in the tail of the distribution, leading to exploration of an action that is suboptimal at the given time (figure~\ref{fig:sim2}). In other words, the higher the variance of an action, the greater the probability of exploring that arm. This argument intuitively guarantees infinite exploration of the environment and thus convergence to the overall optimal solution.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=0.99\linewidth]{graphics/sim1.png}
\caption{\label{fig:sim1}}
\end{subfigure}
\begin{subfigure}[b]{0.49\textwidth}
\centering
\includegraphics[width=0.99\linewidth]{graphics/sim2.png}
\caption{\label{fig:sim2}}
\end{subfigure}
\begin{tiny}
\caption{Sampling from two posteriors of normally distributed arms.\label{fig:sim}}
\end{tiny}
\vspace{1cm}
\end{figure}
\subsubsection{Example}
Let us assume we are given a uniform prior, and we model each arm as normally distributed with unknown mean $\mu_a$ and known variance $\sigma^2_a$. Ties are broken randomly; we choose any action in the first round.\\
We generate the normal posterior according to the theory presented in section~\ref{sec:bayesian}.\\
Once we have obtained the posterior, we draw a sample for each arm, $\mathit{s}_a \leftarrow \mathcal{N}(\mu_a^{\texttt{ pos}}, \sigma^{2\texttt{ pos}}_a)$.\\
Our selection procedure is $\mathit{a}_{t+1} = \operatorname{arg\,max}_{a \in \mathscr{A}} \mathit{s}_a$, so we play the arm with the highest sample in round $t+1$ and repeat the previous steps. We record all outcomes and samples in the history vector.\\
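A minimal numerical sketch of this example in \texttt{R} (ours; it assumes a flat prior and a common known unit variance, and the true means are arbitrary illustrative values) shows the procedure homing in on the best arm.
\begin{lstlisting}
# Thompson Sampling with normal rewards and known variance sigma2; with a
# flat prior the posterior of each arm's mean is N(xbar_a, sigma2 / n_a).
set.seed(42)
true_mu <- c(1.0, 1.5, 2.0)   # unknown to the agent; illustrative values
sigma2  <- 1                  # known reward variance
A <- length(true_mu)
sums <- rep(0, A); n <- rep(0, A)

for (t in 1:500) {
  s <- numeric(A)
  for (a in 1:A) {
    if (n[a] == 0) {
      s[a] <- rnorm(1, mean = 0, sd = 10)   # vague draw for an arm not yet tried
    } else {
      s[a] <- rnorm(1, mean = sums[a] / n[a], sd = sqrt(sigma2 / n[a]))   # posterior draw
    }
  }
  a_t <- which.max(s)                                      # play the arm with the highest sample
  r   <- rnorm(1, mean = true_mu[a_t], sd = sqrt(sigma2))  # observe its reward
  sums[a_t] <- sums[a_t] + r; n[a_t] <- n[a_t] + 1         # update that arm's history
}
n   # the third (best) arm should dominate the pull counts
\end{lstlisting}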
\subsubsection{Assumptions}
In our considerations the process runs for an infinite sequence of \emph{time steps}, $t \in \mathscr{T} = \{ 1,2,3... \}$; at each step we observe a \emph{regressor}, $x_t \in \mathscr{X}$ (a compact set); the \emph{action} choice at time $t$ is $a_t \in \mathscr{A} = \{ 1,2...A \}$ with $A<\infty$.\\
Furthermore action selection is based on \emph{information sequence}(history vector) defined as:
$$
\tilde{\mathscr{L}}_t = \left( \text{ } (x_1,a_1,r_1);...(x_{t-1},a_{t-1},r_{t-1}) \text{ } \right) \text{ ,}
$$
where $x$ and $a$ are defined as above, $r$ is the reward which results from the action at the given time, and $r_t = f_{a_t}(x_t) + z_{t,a_t}$; the latter term in the sum is an i.i.d.\ \footnote{Independent and identically distributed.} random variable with unknown distribution, zero mean and finite variance. We also denote the information available prior to the experiment by $\mathscr{L}_0$ and the information at time $t$ by $\mathscr{L}_t = \left( \mathscr{L}_0 \cup \tilde{\mathscr{L}}_t \right)$.\\
Finally, $f_a : \mathscr{X} \rightarrow \mathbb{R}$ is an unknown continuous function of the regressor specific to the action $a$.\\
We can therefore define the optimal action---with highest expected reward---at time $t$ by:
\begin{equation}
a_t^\star = \argmax_{a \in A} f_a (x_t) \text{ ,}
\end{equation}
and its approximation by:
\begin{equation}
\label{eqn:optsol} \hat{a}_t^\star = \argmax_{a \in A} \hat{f}_a (x_t) \text{ .}
\end{equation}
We can prove that this policy converges with time to the best possible policy (infinite selection of the arm with the highest reward)---the \emph{convergence criterion}:
\begin{equation}
\label{eqn:convg}
\frac
{\sum_{s=1}^t f_{a_s} (x_s)}
{\sum_{s=1}^t f^\star (x_s)}
\xrightarrow{\text{ a.s.\ }} 1 \text{ as } t \rightarrow \infty \text{ .}
\end{equation}
The above formula states that in an infinite process we will eventually discover the optimal arm and play it forever after.\\
We need to consider two major steps in constructing such a policy: the action evaluation and selection schemes. We also require that the evaluation function is \emph{consistent}.\\
To achieve consistency we need to guarantee:
$$
\forall a \in A \text{, } \forall x \in X \text{, } \left[ \hat{f}_{t, a}(x) - f_a(x) \right] \xrightarrow{\mathbb{P}} 0 \text{ as } n_{t,a} \rightarrow \infty \text{ .}
$$
Here consistency means that by improving our estimate of the regressor function after each selection and information gain, we will eventually get arbitrarily close to the true function.\\
Even if we guarantee a consistent estimator, it is still possible that the optimal action will never be selected. This requires us to make sure that arms currently seen as suboptimal will also be explored. The solution is to ensure infinite exploration:
$$
\mathbb{P} \left( \bigcap_{a \in A} \left\{ n_{t,a} \rightarrow \infty \text{ as } t \rightarrow \infty \right\} \right) = 1 \text{ .}
$$
The above equation tells us that the number of times each action is selected tends to infinity as the process continues in time.\\
Finally, we want the action (arm) selection process to exploit the obtained information. We want it to be greedy\footnote{Making the locally optimal choice at each stage.} in the limit. This assumption leads to:
$$
\mathbb{P} \left( a_t = \hat{a}_t^\star | \mathscr{L}_t, x_t \right) \xrightarrow{\text{a.s.\ }} 1 \text{ as } t \rightarrow \infty \text{ ,}
$$
where $\hat{a}_t^\star$ is defined as in equation~\ref{eqn:optsol}. The above equation means that, in the time limit, the policy selects the action it currently estimates to be optimal with probability tending to one, i.e.\ it becomes greedy.\\
If all above conditions hold we call the policy \emph{GLIE}---greedy in the limit with infinite exploration.\\
\subsubsection{Exploration and exploitation}
The action selection mechanism uses the sum of the \emph{exploitative} and \emph{exploratory} values to make a greedy decision at each time step. The first quantity corresponds to the immediate reward---the expected value of the posterior distribution of each action at the given time---and we define it as:
\begin{equation}
\label{eqn:exploitative} \hat{f}_{t,a} (x_t) = \mathbb{E} ( f_a(x_t) | \mathscr{L}_t, x_t ) \text{ .}
\end{equation}
The exploratory parameter can be expressed in many different ways, reviewed by~\citep{meuleau:exploration}. Broadly speaking, this value describes how keen we should be to explore actions other than the currently optimal one; it can be defined as:
$$
\delta_{t,a} \myeq
\begin{cases}
\sigma_{t,a} \delta_0(\mathscr{L}_t) & \text{for known variance,}\\
s_{t,a} \delta_0(\mathscr{L}_t) & \text{for unknown variance,}
\end{cases}
$$
where $\delta_0(\cdot)$ is some positive decreasing function of the \emph{history} called the \emph{unit exploration bonus}.\\
For instance, $\delta_0$ can depend on the discount sequence: decreasing the importance of exploration later in the experiment, or giving exploration the same priority independently of time. Moreover, the \emph{exploratory bonus} can be calculated simply as the standard deviation of the expected reward (of the posterior distribution of each action). Such an approach favours actions whose predictions are uncertain and hence improves the estimation of their true rewards.\\
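As a purely illustrative sketch (ours; the $1/\sqrt{n_a + 1}$ decay is a hypothetical choice of unit bonus, not prescribed by the text), a greedy rule that adds such a bonus to the exploitative value could look as follows.
\begin{lstlisting}
# Hypothetical exploration bonus: posterior standard deviation scaled by a
# positive, decreasing function of how often the arm has been observed.
exploration_bonus <- function(post_sd, n_obs) post_sd / sqrt(n_obs + 1)

# Greedy choice on exploitative value + exploration bonus, one entry per arm.
select_with_bonus <- function(post_mean, post_sd, n_obs) {
  which.max(post_mean + exploration_bonus(post_sd, n_obs))
}
\end{lstlisting}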
\subsubsection{Approach development}
We now refine the approach described above and present \emph{Local Thompson Sampling} (LTS). To begin with, we design the exploratory value in a way that guarantees non-sticking~\footnote{Avoiding focusing on one action and hence under-exploring all the others (a sub-optimal solution).}. One solution is to randomize it. Such behavior can be achieved by sampling from the posterior distribution and using the acquired quantity as the error in the exploitative value estimate for the current regressor; we denote by $\tilde{f}_{t,a}^{\text{ Th}} (x_t)$ a sample from the posterior distribution of $f_a(x_t) - \hat{f}_{t,a}(x_t)$, given $\mathscr{L}_t$ and $x_t$:
\begin{equation}
\label{eqn:exploratory} \tilde{f}_{t,a}^{\text{ Th}} (x_t) \sim \{ f_a(x_t) - \hat{f}_{t,a}(x_t) \} | \mathscr{L}_t,x_t \text{ .}
\end{equation}
Intuitively, $\hat{f}_{t,a}(x_t)$ is our estimate of the mean of the true function based on all the information and regressors observed so far, and $f_a(x_t)$ is the true function of the regressor. The value sampled from the distribution of the difference of these two functions provides us with a random estimate of the uncertainty of the given action.\\
The sum of the exploratory and exploitative values given in equations~\ref{eqn:exploratory}~and~\ref{eqn:exploitative} is used in LTS to greedily decide on the next action; this process is equivalent to choosing an action with optimal posterior probability for the current regressor. It can be expressed as sampling from:
$$
\{ f_a(x_t) - \hat{f}_{t,a}(x_t) \} + \mathbb{E} ( f_a(x_t) | \mathscr{L}_t, x_t )
=
\{ f_a(x_t) - \hat{f}_{t,a}(x_t) \}+ \hat{f}_{t,a} (x_t)
=
f_a(x_t) \text{ ,}
$$
which is just the posterior distribution of action $a$.\\
As we do not have the true function $f_a$ for arm $a$, the closest approximation is its posterior distribution. Therefore, the policy says to sample from the posterior distribution of each arm and choose the arm with the highest sample.\\
It can be shown that in the LTS algorithm the probability of each action being selected is the posterior probability that the action yields the highest expected reward. Furthermore, we assume that the posteriors of all actions not taken at round $t$ do not change.\\
We call this approach \emph{local} as it uses greedy action selection---the choice is made based on probabilities calculated for the current regressor. The original work by Thompson discussed only Bernoulli bandits, but as shown above it can be generalized: sample from the posterior distribution of the expected reward of each action and then select the arm with the highest value.\\
Under mild assumptions~\footnote{Assumptions that are expected to hold if the true posterior distributions and expectations are available.} it can be shown that LTS guarantees the convergence given by equation~\ref{eqn:convg}. Such a solution was first proposed by Thompson~\citep{thompson:biom33}.\\
% \subsubsection{Correctness \& convergence} %LTS and OBS are myopic methods.
% Sampling from sum of exploratory and exploitative values given in equations~\ref{eqn:exploratory}~and~\ref{eqn:exploitative} yield:
% $$
% \left[ f_a(x_t) - \hat{f}_{t,a}(x_t) \right] + \hat{f}_{t,a}(x_t) = f_a(x_t) \text{ .}
% $$
\RestyleAlgo{boxed}
\vspace{2cm}
\begin{algorithm}[H]
\KwData{Posterior distributions for each action---$\left\{ p_a(\cdot|\mathscr{L}_t,x_t) : a \in \mathscr{A} \right\}$}
\For{$a \in \mathscr{A}$}{
\text{sample: }$\mathscr{Q}^\text{Th}_{t,a} \leftarrow p_a(\cdot|\mathscr{L}_t,x_t)$ \;
}
sample $a_t$ uniformly from $\operatorname{arg\,max}_{a \in \mathscr{A}} \mathscr{Q}^\text{Th}_{t,a}$ \;
\caption{Local Thompson Sampling(LTS).\label{al:LTS}}
\end{algorithm}
\vspace{2cm}
We denote by $p_a(\cdot | \mathscr{L}_t, x_t)$ the posterior distribution of $f_a(x_t)$ given $\mathscr{L}_t$ and $x_t$, and by $\mathscr{Q}^\text{Th}_{t,a}$ a random variable with distribution $p_a(\cdot | \mathscr{L}_t, x_t)$ from which we draw our samples.\\
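A direct transcription of Algorithm~\ref{al:LTS} into \texttt{R} (a sketch of ours; the per-action posterior samplers are assumed to be supplied by the caller) might read:
\begin{lstlisting}
# One LTS round: draw one posterior sample per action and play the action
# with the largest draw, breaking ties uniformly at random.
lts_select <- function(posterior_samplers) {
  q_th <- vapply(posterior_samplers, function(draw) draw(), numeric(1))
  best <- which(q_th == max(q_th))
  best[sample.int(length(best), 1)]
}

# e.g. two actions with normal posteriors N(1, 0.2^2) and N(0.8, 0.5^2)
lts_select(list(function() rnorm(1, 1, 0.2), function() rnorm(1, 0.8, 0.5)))
\end{lstlisting}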
\subsubsection{Drawbacks and possible improvements}
The main disadvantage of this approach is that the exploratory value has zero expectation and can therefore be negative. As a consequence, the probability of a particular action being selected may decrease as the posterior variance of its exploratory value grows.\\
This issue arises because we represent the samples as a sum of exploratory and exploitative values; if the action with the highest exploitative value has a lot of uncertainty, it is less likely to be played than if its estimate had little uncertainty.\\
The approach which avoids such situations is called \emph{Optimistic Bayesian Sampling} (OBS), where the exploratory value is given by:
$$
\tilde{f}_{t,a} (x_t) = \max \left( 0, \tilde{f}_{t,a}^{\text{ Th}} (x_t) - \hat{f}_{t,a}(x_t) \right)
= \left [ \tilde{f}_{t,a}^{\text{ Th}} (x_t) - \hat{f}_{t,a}(x_t) \right]_+
\text{ .}
$$
Expressed in this way, the exploratory value cannot be negative and hence has non-negative expectation. This yields increased selection of uncertain actions, which seems desirable. It can also be shown that OBS satisfies the convergence criterion (equation~\ref{eqn:convg}) and outperforms LTS~\citep{May:2012:OBS:2503308.2343711}.\\
\RestyleAlgo{boxed}
\vspace{2cm}
\begin{algorithm}[H]
\KwData{Posterior distributions for each action---$\left\{ p_a(\cdot|\mathscr{L}_t,x_t) : a \in \mathscr{A} \right\}$}
\For{$a \in \mathscr{A}$}{
\text{sample: }$\mathscr{Q}^\text{Th}_{t,a} \leftarrow p_a(\cdot|\mathscr{L}_t,x_t)$ \;
$\hat{f}_{t,a} (x_t) \leftarrow \mathbb{E}( \mathscr{Q}^\text{Th}_{t,a} | \mathscr{L}_t,x_t )$ \;
$\mathscr{Q}_{t,a} \leftarrow \operatorname{max}(\mathscr{Q}^\text{Th}_{t,a}, \hat{f}_{t,a} (x_t))$ \;
}
sample $a_t$ uniformly from $\operatorname{arg\,max}_{a \in \mathscr{A}} \mathscr{Q}_{t,a}$ \;
\caption{Optimistic Bayesian Sampling (OBS).\label{al:OBS}}
\end{algorithm}
\vspace{2cm}
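In the same Beta--Bernoulli setting, Algorithm~\ref{al:OBS} changes only the per-arm value: each posterior draw is replaced by the larger of the draw and the posterior mean. A minimal sketch, again with illustrative names of our own:
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def obs_step(alpha, beta):
    """One round of Optimistic Bayesian Sampling for Bernoulli arms
    with Beta(alpha, beta) posteriors."""
    samples = rng.beta(alpha, beta)      # Q^Th_{t,a} ~ p_a(. | L_t)
    means = alpha / (alpha + beta)       # hat f_{t,a} = E[Q^Th_{t,a}]
    values = np.maximum(samples, means)  # Q_{t,a} = max(Q^Th_{t,a}, hat f_{t,a})
    best = np.flatnonzero(values == values.max())
    return rng.choice(best)              # break ties uniformly
\end{verbatim}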
\subsubsection{Conclusions}
The class of algorithms presented here is robust: as long as the relevant assumptions hold, the convergence criterion is satisfied even if approximations of the posteriors and expectations are used. The above algorithms are also easy to implement, provided the posterior distribution of the expected rewards can be calculated. Furthermore, the method is computationally cheap in comparison with belief-lookahead approaches such as the Gittins index. Last but not least, LTS is widely used by Microsoft, Google and Yahoo to address the contextual bandit problem.\\
\chapter{\texttt{\textbf{Exploitation}}---Practical application\label{ch:MAB-AL}}
There are a number of historical applications of multi-armed bandits (MAB), mainly concerned with research time management, general job scheduling, economics and the military~\citep{gittins+glazebrook+weber}.
Nowadays, one of the main uses is found in contextual bandits, which are the foundation for optimizing online advertising income. The general idea is to present different variations of ad layout on a given webpage, suited to the user by inspecting their history; such systems are widely used by IT giants like Google, Yahoo, Facebook, LinkedIn or Microsoft. For the majority of these companies the theory is of high interest, as the lion's share of their revenue flows from click-through rates in online advertising~\citep{graepel2010web, Scott:2010:MBL:1944422.1944432}.\\
Another application, presented in~\citep{Antos09activelearning}, describes how to apply an approach well known in machine learning, \emph{active learning}, to estimate the means of a finite number of arms equally well across all arms. The authors argue that this problem is of high interest when considering quality control with multiple production lines; the proposed solution gives a guideline on how often a company should check products from a particular machine.\\
In this chapter we look at this problem the other way around, namely: how can we generalise the multi-armed bandit framework to create an efficient and robust active learning algorithm?\\
Another, more recent application of MAB theory in computer science is concerned with classification in an environment of insufficient information. This corner of the machine learning field seems to be under-explored and needs more attention from the research community, as to the best of our knowledge it has so far been addressed in only one scientific publication, namely~\citep{DBLP:journals/corr/GantiG13}.\\
We decided to pursue this topic mainly by presenting a deep analysis of the approach chosen in~\citep{DBLP:journals/corr/GantiG13}. After a broad explanation we seek potential improvements to and alternatives for the presented algorithm.
\section{Structure of this chapter}
We first introduce the foundations of machine learning needed to develop the concepts of semi-supervised (SSL) and active (AL) learning. Next, we introduce the whole family of active learning scenarios and give a detailed description of the one needed to understand~\citep{DBLP:journals/corr/GantiG13}. Then, we present a possible connection between the processes of MAB and AL. This interface will allow us to frame active learning as a bandit process.\\
The main section of this chapter then focuses on the ``bridge'' described in \citep{DBLP:journals/corr/GantiG13} that connects both worlds. Finally, we explore potential changes that can be introduced and improvements that can benefit the algorithm described earlier.
\section{Machine Learning 101}
\subsection{General Learning}
Machine Learning is a field of Computer Science mainly concerned with building a model for given data. A classification problem is of the form $\mathit{l} : \mathscr{X} \rightarrow \mathscr{Y}$, where $\mathscr{X}$ is called the \emph{instance space} and $\mathscr{Y}$ the \emph{output space} (in the supervised scenario often replaced with $\mathscr{L}$ and called the \emph{label space}).\\
Such a function $\mathit{l}$ is called the \emph{model}, and it is what we try to approximate. The instances supplied to the model are preprocessed to extract \emph{features}, which represent each instance as a vector describing chosen attributes. Based on the feature vector, the model assigns the element to a particular region of the output space, or, in the case where $\mathscr{Y} = \mathscr{L}$, gives the instance a label.\\
In the classical scenario called \emph{supervised learning}, the approximation of the model (~$\mathit{\hat{l}}$~) is generated by supplying pairs $(\mathit{\mathbf{x}}, \mathit{y})$ belonging to a dataset called the \emph{training set}, where $\mathit{\mathbf{x}}$ is a feature vector and $\mathit{y}$ is the target label.\\
Supervised learning is one of the most common scenarios and is concerned with finding a \emph{class} for a given exemplar. A widely used example in the literature is to build a model which predicts whether a particular e-mail is spam or non-spam based on extracted features, i.e.\ words from the content of the message.\\
Another important building block of machine learning theory is the concept of evaluating the accuracy or performance of a trained model. The most popular measure in the supervised scenario presented above is the so-called error rate. It is calculated using a \emph{test set} which is disjoint from the training set, i.e.\ unseen by the trained model. Typically the value is calculated as $\mathtt{Err} = \frac{\text{incorrectly classified instances}}{\text{all instances}}$.\\
In other words, the main goal of machine learning is to choose some \emph{hypothesis} from a \emph{hypothesis space} which best ``fits'' the given data sample. For our purposes we generalise the hypothesis space to the set of all possible assignments of classes to training data points.\\
For instance, if we are given (in a fixed order) three data points $(x_1, x_2, x_3)$ and our task is binary classification, i.e.\ we predict either $+$ or $-$, then our hypothesis space is $H = \{(+, +, +), (+, +, -), (+, -, +), (-, +, +), (+, -, -), (-, +, -), (-, -, +), (-, -, -)\}$.\\
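Under this view, the hypothesis space for a small pool can be enumerated mechanically; a short illustrative Python sketch:
\begin{verbatim}
from itertools import product

points = ["x1", "x2", "x3"]
# every assignment of {+, -} to the three points: 2^3 = 8 hypotheses
hypothesis_space = list(product("+-", repeat=len(points)))
print(len(hypothesis_space))   # 8
print(hypothesis_space[0])     # ('+', '+', '+')
\end{verbatim}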
When discussing the performance of a given hypothesis, a more sophisticated measure is the \emph{loss function} $L : \mathbb{R} \rightarrow [0, \infty)$, which maps the \emph{margin} $z(\mathbf{x})$ of a particular exemplar to an associated loss $L(z(\mathbf{x}))$. It is primarily used with scoring and ranking classifiers.\\
By combining the ranking value with the true class information we construct a margin, which can be read as follows: the larger its positive value, the better the model fits the exemplar; the lower its negative value, the worse the model fits the exemplar.\\
The loss function therefore assigns small loss to large positive margins and penalises large negative ones. Often we assume $L(0) = 1$, which gives the loss of an exemplar on the decision boundary, $L(z) \geq 1$ for $z < 0$ and $0 \leq L(z) < 1$ for $z > 0$. Sometimes the quantity of interest is the average loss over a test set \texttt{Te}, calculated as $\frac{1}{|\mathtt{Te}|} \sum_{x \in \mathtt{Te}} L(z(\mathbf{x}))$.\\
The simplest loss function called \emph{0--1 loss} is defined as:
\[
L_{\text{0--1}} (z) =
\begin{cases}
1 & \text{for } z \leq 0 \\
0 & \text{for } z > 0 \text{.}
\end{cases}
\]~\\
We have already seen average 0--1 loss as an error rate \texttt{Err}:
\[ \frac{1}{|\mathtt{Te}|} \sum_{x \in \mathtt{Te}} L_{\text{0--1}}(z(\mathbf{x})) =
\frac{1}{|\mathtt{Te}|} \sum_{x \in \mathtt{Te}} \mathds{1} \left[ z(\mathbf{x}) \leq 0 \right] =
\frac{1}{|\mathtt{Te}|} \sum_{x \in \mathtt{Te}} \mathds{1} \left[ \text{incorrectly classified instance?} \right] =
\mathtt{Err}\text{.}
\]~\\
Some other popular loss functions are:
\begin{description}
\item[hinge loss] $L_{\text{h}} (z) = 1 - z$ for $z \leq 1$ and $L_{\text{h}} (z) = 0$ for $z > 1$,
\item[logistic loss] $L_{\text{log}} (z) = \log_2 (1 + \exp(-z))$,
\item[exponential loss] $L_{\text{exp}} (z) = \exp(-z)$,
\item[squared loss] $L_{\text{sq}} (z) = (1 - z)^2$; which sometimes is equated to $0$ for $z>1$.\\
\end{description}
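Each of these loss functions is a one-line formula; the sketch below implements them and recomputes the error rate as the average 0--1 loss over a vector of margins (the margin values are purely illustrative):
\begin{verbatim}
import numpy as np

def zero_one_loss(z):
    return np.where(z <= 0, 1.0, 0.0)

def hinge_loss(z):
    return np.maximum(0.0, 1.0 - z)

def logistic_loss(z):
    return np.log2(1.0 + np.exp(-z))

def exponential_loss(z):
    return np.exp(-z)

def squared_loss(z):
    return (1.0 - z) ** 2

margins = np.array([1.5, 0.2, -0.3, 2.0, -1.1])  # z(x) on a toy test set
err = zero_one_loss(margins).mean()              # average 0--1 loss = Err
print(err)                                       # 0.4
\end{verbatim}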
One way of categorizing predictive scenarios is based on the task:
\begin{description}
\item[Classification] finding a label to describe an instance (e.g.\ spam / non-spam),
\item[Scoring and ranking] outputting, for each instance, a score vector over all classes,
\item[Probability estimation] outputting, for each instance, a probability vector over all classes,
\item[Regression] learning an approximation to the true labeling function.\\
\end{description}
As mentioned earlier, there are a few scenarios used in ML. The main settings of machine learning can be divided according to two criteria: the form of the supplied training data, i.e.\ whether it is labeled (\emph{supervised} learning) or unlabeled (\emph{unsupervised} learning), and the goal of the process, i.e.\ whether it aims at predicting a target variable (most often a label), yielding a \emph{predictive} model, or at discovering the structure of the data, yielding a \emph{descriptive} model.\\
\begin{table}[htbp]
\begin{tabular}{ r | c p{5cm} }
& Predictive models & Descriptive models \\
\hline
Supervised learning & classification, regression & subgroup discovery \\
Unsupervised learning & predictive clustering & descriptive clustering, association rule discovery \\
\end{tabular}~\\[0.1cm]
\caption{Main learning models in ML.\label{fig:learning_models}}
\end{table}~\\
The main reference for further reading is~\citep{flach2012machine}, which contains a comprehensive introduction to machine learning.
\subsection{Semi-supervised Learning}
The missing link in the chain connecting the supervised and unsupervised scenarios is called \emph{semi-supervised} learning (Table~\ref{fig:learning_models}). The goal of such classification is to train a classifier on a relatively small sample of labeled data in such a way that its performance on unseen data is close to what we could achieve with a supervised algorithm using a huge sample of labeled data. The general strategy is to use the most informative points, those which best describe the distribution underlying the classes in the data.\\
The issue with such an approach is the difficulty of acquiring such points. The provided data can be noisy and contain outliers due to measurement errors, therefore we need enough points to estimate these values from the sample.\\
There are many algorithms which build models based on a small sample of labeled data, but only a minority of them measure and utilize how informative each exemplar is in order to improve the model they output.\\
The main application areas of semi-supervised learning are natural language, video and audio processing, as well as other fields where data are relatively easy to acquire (for instance, by placing a camera in a public place, hours of material can be recorded) but the process of labeling the gathered data to find a well-fitting supervised model is expensive, time consuming or demands hours of manual labour.\\
The underlying idea is to label a small fraction of the data by hand and train a semi-supervised model which performs similarly to a supervised algorithm in terms of error rate on the data.
\subsection{Active Learning}
The AL approach is considered a special case of semi-supervised learning. What differentiates these two models is the way the initial training sample is organised. In SSL the algorithm is supplied with labeled data and has no influence on what these data are or how they were chosen from the whole pool. On the contrary, in AL the learner can choose whether it wants to know the label of a particular instance it has access to. If the algorithm is interested in the label of a chosen exemplar, it queries the so-called \emph{label oracle}, denoted by $\mathscr{O}$, to acquire it. We assume that if a point is queried more than once, $\mathscr{O}$ always returns the answer it gave for the first query. Most often, to produce a good oracle output, the average label assigned by some expert population is chosen.\\
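The consistency assumption on $\mathscr{O}$ is exactly what a cached oracle provides; a toy Python sketch, in which the expert-vote backend is a hypothetical placeholder:
\begin{verbatim}
class LabelOracle:
    """Label oracle O: answers are cached so repeated queries agree."""
    def __init__(self, labeller):
        self.labeller = labeller        # e.g. majority vote of an expert panel
        self.cache = {}

    def query(self, x):
        # points are assumed hashable (e.g. tuples of feature values)
        if x not in self.cache:
            self.cache[x] = self.labeller(x)
        return self.cache[x]

# toy usage with a made-up labelling rule standing in for the experts
oracle = LabelOracle(labeller=lambda x: "+" if sum(x) >= 0 else "-")
print(oracle.query((1.0, -0.2)))   # '+', and every repeat returns the same
\end{verbatim}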
The major advantage of such an approach is the selectiveness of the label information. The learner is not fed fixed data, which would leave no possibility of investigating potentially interesting hypotheses that lack supporting evidence. It has freedom of choice, which is most often managed via some kind of hypothesis risk minimisation technique.\\
Within active learning we can name different scenarios of querying $\mathscr{O}$. We can distinguish algorithms based on:
\begin{description}
\item[Membership Query] the learner can query $\mathscr{O}$ for any point $x$ in the input space $\mathscr{X}$, where $x$ does not necessarily belong to the support of the marginal distribution, i.e.\ it need not be in our training set,
\item[Stream Query] the learner samples points from the marginal distribution one after another and decides on-the-fly whether to acquire each label or not,
\item[Pool Query] the learner has access to an unlabeled \emph{pool} $\mathscr{P} = \left\{ \mathbf{x}_1, \mathbf{x}_2, \dots \mathbf{x}_{n-1}, \mathbf{x}_n \right\}$ of exemplars sampled from some marginal distribution. The learner is allowed to query $\mathscr{O}$ for the label of a chosen point, i.e.\ acquire $y \sim P_{Y|\mathscr{X} = \mathbf{x}}$.\\
\end{description}
We also need to decide on a strategy for finding the best hypothesis. There are two main approaches: exploiting (cluster) structure in the data, and efficient search through the hypothesis space. The first boils down to clustering exemplars based on some distance measure and sampling a number of points from each cluster to arrive at a label (see Figure~\ref{fig:cluster}). The latter is a systematically organised search algorithm that outputs a hypothesis fitted to the given data (see Figure~\ref{fig:hypsearch}).\\
From this point onward we narrow our focus to \textbf{hypothesis-space-search active learning algorithms based on pool queries}.\\
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/cluster1.png}
\caption{\label{fig:cluster_a}}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/cluster2.png}
\caption{\label{fig:cluster_b}}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/cluster3.png}
\caption{\label{fig:cluster_c}}
\end{subfigure}
\begin{tiny}
\caption{The process of exploiting cluster structure in the data within an active learning algorithm. In \ref{fig:cluster_a} we are supplied with unlabeled data with an obvious structure of two clusters. In \ref{fig:cluster_b} we query for the labels of some points from each cluster; the clusters are not pure, so the collected sample is noisy. In \ref{fig:cluster_c} we assign the majority class to each cluster and arrive at the presented model.\label{fig:cluster}}
\end{tiny}
\vspace{1cm}
\end{figure}
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/hypsearch1.png}
\caption{\label{fig:hypsearch_a}}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/hypsearch2.png}
\caption{\label{fig:hypsearch_b}}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/hypsearch3.png}
\caption{\label{fig:hypsearch_c}}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/hypsearch4.png}
\caption{\label{fig:hypsearch_d}}
\end{subfigure}
\begin{subfigure}[b]{0.3\textwidth}
\centering
\includegraphics[width=0.5\linewidth]{graphics/hypsearch5.png}
\caption{\label{fig:hypsearch_e}}
\end{subfigure}
\begin{tiny}
\caption{The process of searching through the hypothesis space within an active learning algorithm. In \ref{fig:hypsearch_a} we select some hypotheses (~$h_i$~) that are potentially the best. To eliminate the first one, we ask for the labels of points on both sides of $h_1$ and disregard it, as we observe the same class on both sides (Figure \ref{fig:hypsearch_b}). In \ref{fig:hypsearch_c} and \ref{fig:hypsearch_d} we do the same for $h_2$ and $h_3$ and reject both of them based on similar observations. Finally, we arrive at hypothesis $h_5$, which agrees with all queried points and becomes the output of our learning algorithm: we assign to the remaining points the majority class on the given side of $h_5$.\label{fig:hypsearch}}
\end{tiny}
\vspace{1cm}
\end{figure}
\section{Active Learning in a view of Multi-Armed Bandits}
To build a ``bridge'' between MAB and AL we first need to describe the setting of such a scenario. For the sake of simplicity we focus on pool-based, binary active learning with a search through the \emph{hypothesis space} $\mathscr{H}$, restricted by a query budget $B$. Moreover, we restrict our hypothesis space to linear classifiers (see Section~\ref{sec:linearclassifiers}). The key point is to view the learning problem as an exploration--exploitation trade-off. We first focus on the characteristic features of MAB and then find their correspondence in AL.\\
To begin with, we frame the learning problem as a $B$-round game (a finite horizon with discount sequence $\mathbf{A} = ( \underbrace{ 1, 1, 1, ..., 1}_{B}, 0, 0, 0, ... )$), where in each round the agent must pull one of the given arms and suffers some loss $L_t$ on the chosen action. For simplicity, in this scenario we exchange the reward for a loss, so the optimal strategy is the one that minimizes the loss instead of maximizing the reward over $B$ rounds.\\
It is worth noting that if $B \rightarrow \infty$ this becomes a supervised algorithm fed by a data stream.\\
To be more precise, we define an equivalence between the arms of a bandit and the hypotheses $\mathit{h} \in \mathscr{H}$. We also need to find a proxy for the MAB loss signal in our active learning scenario, but let us first clarify the goals of our strategy.\\
The main aim is to estimate the optimal hypothesis, i.e.\ the one with the lowest risk (loss), using as few labeled points as possible. If we knew the best $\mathit{h}$, i.e.\ the one with minimal cumulative loss, then the optimal strategy would be to pull this arm in each round of our experiment. This leads to the goal of finding the optimal strategy as quickly as possible. We are now left with defining a strategy telling us which arm to pull in each round, and with the dilemma of how to transform the feedback from a pull, in our case the ground truth (genuine label) of the queried point $x$, into a loss signal.\\
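Schematically, and only as a sketch of this framing rather than the concrete algorithm of~\citep{DBLP:journals/corr/GantiG13} developed below, the resulting game can be written as a simple loop (all names are placeholders):
\begin{verbatim}
def active_learning_as_bandit(hypotheses, pool, oracle, B, choose, to_loss):
    """Schematic B-round game: arms are hypotheses, a pull queries the oracle.

    choose  -- bandit policy mapping the history to a (hypothesis, point) pair
    to_loss -- turns the returned label into a loss signal for the policy
    """
    history = []
    for t in range(B):                  # finite horizon of B label queries
        h, x = choose(hypotheses, pool, history)
        y = oracle(x)                   # ground-truth label of the queried point
        history.append((h, x, y, to_loss(h, x, y)))
    return history
\end{verbatim}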
\subsection{Linear classifier\label{sec:linearclassifiers}}
To begin with, we define a linear classifier as:
$$
c(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} - t \text{~,~}
$$
where $\mathbf{w}$ is the weight vector and $\mathbf{x} \in \mathbb{R}^n$ is a data point. For binary classification we have:
$$
c(\mathbf{x}) \geq 0 \rightarrow \text{ $\mathbf{x}$ is assigned to $+$}
$$
$$
c(\mathbf{x}) < 0 \rightarrow \text{ $\mathbf{x}$ is assigned to $-$.}
$$\\
In general, the linear model (hypothesis class) is concerned with finding a weight vector $\mathbf{w}$ such that our classifier has the form given above. We can rearrange the equation above to express the decision boundary as $t = \mathbf{w} \cdot \mathbf{x}$. This tells us that the decision boundary is a plane in the space spanned by the variables $x_i$ of $\mathbf{x}$, where the vector $\mathbf{w}$ is perpendicular to the boundary and points in the direction of the $+$ class.\\
The notation of the classifier can be simplified by extending both vectors $\mathbf{x}$ and $\mathbf{w}$ with $x_0 = 1$ and $w_0 = -t$, leading to the decision boundary $c(\mathbf{x}) = \mathbf{w}^{\circ} \cdot \mathbf{x}^{\circ} = 0$. In this form, at the expense of one extra dimension, we get a decision boundary passing through the origin of our coordinate system.\\
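A minimal Python sketch of the extended-vector form $c(\mathbf{x}) = \mathbf{w}^{\circ} \cdot \mathbf{x}^{\circ}$; the weights and threshold are arbitrary illustrative values:
\begin{verbatim}
import numpy as np

def extend(x):
    """Prepend x0 = 1 so that the threshold becomes an ordinary weight."""
    return np.concatenate(([1.0], x))

def linear_classifier(w, t):
    """Return a binary classifier c(x) = w . x - t in homogeneous form."""
    w_ext = np.concatenate(([-t], w))   # w0 = -t
    def predict(x):
        score = w_ext @ extend(x)       # c(x) = w_ext . x_ext
        return "+" if score >= 0 else "-"
    return predict

c = linear_classifier(w=np.array([2.0, -1.0]), t=0.5)
print(c(np.array([1.0, 1.0])))   # score = 2 - 1 - 0.5 = 0.5   ->  '+'
print(c(np.array([0.0, 1.0])))   # score = -1 - 0.5   = -1.5   ->  '-'
\end{verbatim}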
For a visual example, see Figure~\ref{fig:binclas}.\\
\begin{figure}[htbp]
\centering
\includegraphics[width=0.5\linewidth]{graphics/binclas.png}
\begin{tiny}
\caption{Visualization of binary, linear classifier in two dimensions.\label{fig:binclas}}
\end{tiny}
\vspace{1cm}
\end{figure}
\subsection{Choosing best hypothesis\label{sec:risk}}
To address the first issue mentioned above we refer to~\citep{DBLP:journals/corr/GantiG13}, where an unbiased estimator of the risk of a hypothesis is defined as:\\
\begin{equation}\label{eq:mean}
\hat{L}_t(h) = \frac{1}{Nt} \sum_{n=1}^{N} \sum_{\tau = 1}^{t} \frac{Q^{\tau}_n}{p^{\tau}_n} L(y_n h(\mathbf{x}_n)) \text{~,}
\end{equation}
where $p^t_n$ is the probability of querying $\mathbf{x}_n$ in round $t$, extracted from the vector $p^t$ described in Section~\ref{sec:pointtoquery}.\\
Furthermore, $Q^t_n$ is a random variable taking values in $\{0,1\}$: $1$ if $\mathbf{x}_n$ was queried in round $t$ and $0$ otherwise.\\
We also need to decide on the loss function to employ in our model. Any loss function which is convex w.r.t.\ the margin and upper bounded by $L_{\text{max}} < \infty$ is suitable; for example, we could use logistic loss, squared loss or exponential loss.\\
To sum up, each hypothesis in the class of linear classifiers can be fully described by its weight vector $\mathbf{w}$. This observation allows us to notationally unify the output of hypothesis $h_j$ on point $\mathbf{x}_n$ as $h_j(\mathbf{x}_n) = \hat{c}(\mathbf{x}_n)$.\\
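A direct transcription of the estimator in equation~\ref{eq:mean} for a single hypothesis, assuming we record for every round which point was queried and with what probability (the array names are our own):
\begin{verbatim}
import numpy as np

def risk_estimate(margins, Q, P, loss):
    """Importance-weighted risk estimate of a single hypothesis h.

    margins -- length-N array with y_n * h(x_n) for every pool point
    Q       -- (t, N) 0/1 array: Q[tau, n] = 1 iff x_n was queried in round tau
    P       -- (t, N) array of query probabilities p^tau_n (all >= p_min > 0)
    loss    -- vectorised loss of the margin, e.g. squared or logistic loss
    """
    t, N = Q.shape
    weights = (Q / P).sum(axis=0)       # sum_tau Q^tau_n / p^tau_n
    return (weights * loss(margins)).sum() / (N * t)
\end{verbatim}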
\subsection{Choosing a point to query --- Constructing loss signal\label{sec:pointtoquery}}
To iteratively improve our guess of the best hypothesis, we query the most informative point we can in each round. Intuitively, such information allows us to tell apart good hypotheses, which agree with the acquired label, from bad ones, which misclassify the given point. To make the best possible choice we construct a sampling distribution which minimizes the variance of the risk estimate $\hat{L}_t(h)$ defined above. We construct it by utilizing the labels we have already acquired from the oracle and the labels that our current hypothesis assigns to points which have not been queried yet. To guarantee infinite exploration of all hypotheses and to avoid zero probabilities, we introduce a parameter $p_{min}$, set arbitrarily by the user, which denotes the minimum probability of each point being queried.\\
Let us define $\mathscr{Q}_t$ as the set of labels $y_n$ acquired by querying $\mathscr{O}$ for the labels of points $\mathbf{x}_n$ up to time $t$.\\
Recall that we handle binary classification, which for a data point $\mathbf{x}_n$ outputs a result $c(\mathbf{x}_n)$. This result can be any real number, which is translated into a class by the rule given above: $c(\mathbf{x}_n) \geq 0 \rightarrow +$ and $c(\mathbf{x}_n) < 0 \rightarrow -$. As a notational shorthand we often write $\textit{sign}(c(\mathbf{x}_n)) = y_n \in \{+, -\}$.\\
For each data point $\mathbf{x}_n \in \mathscr{P}$ in our pool we have:
$$
\hat{y}_n = \begin{cases}
y_n = c(\mathbf{x}_n) & \text{if } \mathbf{x}_n \in \mathscr{Q}_{t-1} \\
\textit{sign}(h_t(\mathbf{x}_n)) & \text{otherwise.}
\end{cases}
$$
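In code this surrogate labelling is a simple lookup, sketched below with hypothetical names: queried points keep their oracle labels, and the remaining ones receive the prediction of the current hypothesis.
\begin{verbatim}
import numpy as np

def surrogate_labels(pool, oracle_labels, h_t):
    """Return y_hat for every pool point: the acquired label if the point
    was queried (a key of oracle_labels), otherwise the sign of h_t."""
    y_hat = np.empty(len(pool))
    for n, x in enumerate(pool):
        if n in oracle_labels:                       # x_n in Q_{t-1}
            y_hat[n] = oracle_labels[n]              # label y_n from the oracle
        else:
            y_hat[n] = 1.0 if h_t(x) >= 0 else -1.0  # sign(h_t(x_n))
    return y_hat
\end{verbatim}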