-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathintroduction.tex
410 lines (395 loc) · 17.4 KB
/
introduction.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
% !TEX root = main.tex
%
% Introduction
%
Various extensions of language models have been proposed for handling infinite alphabets.
%words carrying data values in an infinite domain (e.g. integers) e.g. data processing
Some automata
%\reviews{1) and grammar with storage?}
with memory extensions
allow restricted storage and comparison of input symbols
%and correspond with logics,
(see~\cite{Segoufin06csl} for a survey).
\florent{register: skip refs and details, add Mikolaj recent}
They use pebbles for marking positions~\cite{NevenSchwentickVianu04FSMinfinite},
registers~\cite{KaminskiFrancez94},
or %computations in 2 steps
separate computation on subsequences
with the same attribute values~\cite{Bojanczyk11FO2}. % data words automata.
%
%for the and verification of infinite state systems
%(model checkers: alphabet = long bit-vectors)
%...For the representation of in model checking, verification and
%
Moreover, automata at the core of model checkers
compute on input symbols represented by large bitvectors~\cite{Vardi07ciaa} %\cite{BaierKatoen08MC}
%(sets of assignments of Boolean variables) %propositional variables)
and, in practice, %implementation
each transition accepts a set of such symbols (instead of an individual symbol),
represented by Boolean Formulas or Binary Decision Diagrams.
%
Following this idea, % and generalizing,
in symbolic automata (\SA)~\cite{Veanes12symbolic,dAntoniVeanes17CAV,dAntoni21CACM},
transitions are guarded by predicates over large or infinite domains.
With appropriate closure conditions on the sets of such predicates, % (alphabet theories),
all the good properties enjoyed by automata over finite alphabets are preserved.
The ability of \SA to compare input symbols is restricted compared to the
former automata with memory.
It can however be extended with the addition of a stack~\cite{dAntonyAlur14SVPDA}.
Other extensions of language models %(automata and grammars)
assign one weight value to every input~\cite{Droste09handbook}.
%help in dealing with non-determinism
This is useful for the quantitative modelling of
\eg probabilistic or stochastic recursive programs, %( for pushdown models),
quantitative database queries, or semi-structured data, % (in infinite domains),
as well as for the verification of quantitative properties of systems
related to %\florent{quantitative reasoning/verification on programs}
quality measures, resource-consumption, distance metrics, probabilistic guarantees, \etc
%
In the context of parsing, when grammars return weight values,
it is possible to rank derivations (and \emph{abstract syntax trees})
in order to select a best one (or $n$ bests),
%\florent{= "weighted parsing"}
\eg in case of ambiguity~\cite{Goodman99SemiringParsing,Nederhof03weightedParsing,MorbitzVogler19weighted-parsing}.
%This is roughly the principle of \emph{weighted parsing} approaches
%\reviews{1) jump from automata to grammars is uneasy and not really necessary}
% for which there may exist several derivations yielding one input word.
%
In \emph{weighted language models}
like \eg probabilistic context-free grammars % (CFG),
and weighted automata (\WA)~\cite{Droste09handbook},
a weight is associated to each transition rule, % production rule
and the weights of the rules involved in a computation are combined with an
associative product operator~$\otimes$. % to yield the weight of a whole computation.
%\reviews{1) AST for grammar, run for WA}
A second operator~$\oplus$
is moreover used to resolve the ambiguity raised by the existence
of several (in general exponentially many) computations on a given input.
Typically, $\oplus$ selects the best of two weight values.
%Intuitively,~$\oplus$ selects, or ranking, the syntax trees.
The weight domain, equipped with these two operators is typically
a \emph{semiring} %$\Semiring$
where $\oplus$ can be extended to infinite sums,
%\cite{Eilenberg74automata}
such as the Viterbi semiring and the tropical min-plus algebra% % see Figure~\ref{fig:semirings}.
%of domain $\mathbb{R}^+ \cup \{ +\infty\}$,
%where $\oplus$ is min and $\otimes$ is plus .
%by ranking
%making the weight domain a semiring.
%Some efficient specialized parsing algorithms~\cite{Huang05kbest} have been proposed in this context
%% models represented as hypergraphs \cite{Huang05kbest}
%in order to compute the $n$ best syntax trees of a given input word without having to enumerate them all.
%Generally based on dynamic programming, these algorithms rely on
%additional algebraic properties of~$\Semiring$.
%-- see \eg~\cite{Huang05kbest} for some NLP applications.
%The extraction of $n$ best list is useful
%the problem: quantitative parsing or symbolic parsing
%of parsing of words over infinite input alphabet.
In this paper, we present some results for
\emph{Symbolic Weighted} finite states language models %(\swM)
generalizing the Boolean guards of~\SA %in transitions
to functions into an arbitrary semiring,
and also the~\WA, by handling infinite alphabets (Figure~\ref{fig:hierarchy}).
%
In short, a transition rule $q \xrightarrow{\phi} q'$ of a Symbolic Weighted Automaton,
from state $q$ to $q'$,
%\florent{c'est un résumé de la Fig. \ref{fig:hierarchy}}
%input symbols are variables, %(parameters)
is labeled by a function~$\phi$ associating to every input symbol~$a$, a weight value~$\phi(a)$
in a semiring~$\Semiring$.
%
These models are also
%\SWA and \SWVPA are both
particular cases of the very general class of
Weighted Symbolic Automata with Data Storage~\cite{Herrmann16dlt,Herrmann20phd},
for appropriate storage types.
%with special data storage types...
%\florent{Both WA and SA generalized by \cite{Herrmann16dlt,Herrmann20phd}...}
%\reviews{cite \cite{Herrmann16dlt,Herrmann20phd}}
\begin{figure}
\centering
\begin{tikzpicture}
\node (SWADS) at (2.8,5.0) {%
\(
\begin{array}{c}
\mathsf{SWADS}\,\mbox{\cite{Herrmann16dlt,Herrmann20phd}}
\end{array}
\)
};
%
\node (SWA) at (0,3.4) {%
\(
\begin{array}{c}
\mathsf{SWA}: \Sigma_{\mathsf{inf}}^* \to \mathbb{S}\\
q \xrightarrow{\phi} q', \phi : \Sigma_\mathsf{inf} \to \mathbb{S}
\end{array}
\)
};
%
\node (WA) at (-1.8,1.7) {%
\(
\begin{array}{c}
\mathsf{WA}\,\mbox{\cite{Droste09handbook}}: \Sigma_{\mathsf{fin}}^* \to \mathbb{S}\\
q \xrightarrow{a, w} q', a \in \Sigma_\mathsf{fin}, w \in \mathbb{S}
\end{array}
\)
};
%
\node (SA) at (1.8,1.7) {%8
\(
\begin{array}{c}
\mathsf{SA}\,\mbox{\cite{Veanes12symbolic,dAntoni21CACM}}: \Sigma_{\mathsf{inf}}^* \to \mathbb{B}\\
q \xrightarrow{\phi} q', \phi : \Sigma_\mathsf{inf} \to \mathbb{B}
\end{array}
\)
};
%
\node (NFA) at (0,0) {%
\(
\begin{array}{c}
\mathsf{FA} : \Sigma_{\mathsf{fin}}^* \to \mathbb{B}\\
q \xrightarrow{a} q', a \in \Sigma_\mathsf{fin}
\end{array}
\)
};
%
\node (VPA) at (5.6,0) {%
\(
\mathsf{VPA}\,\mbox{\cite{AlurMadhusudan09nested}}
\)
};
%
\node (WVPA) at (4.6,1.7) {%
\(
\mathsf{WVPA}\,\mbox{\cite{Mathissen08weighted,Caralp12VPAmult}}
\)
};
%
\node (SVPA) at (6.6,1.7) {%
\(
\mathsf{SVPA}\,\mbox{\cite{dAntonyAlur14SVPDA}}
\)
};
%
\node (SWVPA) at (5.6,3.4) {%
\(
\mathsf{SWVPA}
\)
};
%
\draw[->] (NFA)--(WA);
\draw[->] (NFA)--(SA);
\draw[->] (WA)--(SWA);
\draw[->] (SA)--(SWA);
\draw[->] (SWA)--(SWADS);
\draw[->] (VPA)--(WVPA);
\draw[->] (VPA)--(SVPA);
\draw[->] (WVPA)--(SWVPA);
\draw[->] (SVPA)--(SWVPA);
\draw[->] (SWVPA)--(SWADS);
%\begin{array}{c} \mathsf{NFA} : \Sigma^* \to \mathbb{B} \end{array}
\end{tikzpicture}
\caption{Classes of Symbolic/Weighted Automata.
Here, $\Sigma_\mathsf{fin}$ and $\Sigma_\mathsf{inf}$ denote finite/countable alphabets,
$\mathbb{B}$ the Boolean algebra,
$\mathbb{S}$ a commutative semiring.
$q \xrightarrow{\dots} q'$ is a transition between states $q$ and $q'$.}
\label{fig:hierarchy}
\end{figure}
%
%This approach is close to the case of
%Symbolic Automata (SA)~\cite{dAntoniVeanes17CAV,dAntoni21CACM},
%except that the domain for weight values is not restricted to be Boolean,
%like for the guards in the rules of SA,
%but can be an arbitrary commutative semiring (assuming some restrictions).
\noindent
The models studied are:
symbolic-weighted automata (\SWA),
defining series over infinite alphabets,
transducers (\SWT),
defining distances between finite words over infinite alphabets, % following~\cite{Mohri03ijfcs}.
and pushdown automata with a visibility restriction~\cite{AlurMadhusudan09nested} (\SWVPA),
operating sequentially on %\emph{nested words}~\cite{AlurMadhusudan09nested},
words structured with markup symbols (parentheses) and describing linearizations of trees.
%A \SWVPA $A$ associates a weight value $A(t)$ % \in \Semiring$
%to a given nested word $t$, which is itsef the linearization of a weighted AST. %representing a parse tree.
%
The main contributions of this paper are:
%\reviews{1) not $i$! see response.txt}
%($i$)~the introduction of automata, \SWA, transducers, \SWT (Section~\ref{section:SWA}),
%and visibly pushdown automata \SWVPA (Section~\ref{sec:SWVPA}),
%generalizing the corresponding classes of symbolic and weighted models,
%for the weighted computation on (nested) words over infinite alphabets;
\begin{description}
\item[($i$)] a construction à la Bar-Hillel Perles Shamir of a \SWA
computing a \SWT-defined distance between a \SWA input language and a word (Proposition~\ref{prop:epsilon}),
\item[($ii$)] closure results for \SWVPA (Proposition~\ref{prop:SWVPA-product}),
\item[($iii$)] a polynomial best-search algorithm for \SWVPA (Proposition~{th:best-search}. %(Section~\ref{sec:best})
\end{description}
%
% a framework for parsing over infinite alphabets,
Moreover, we present in Section~\ref{sec:parsing} an
\begin{description}
\item[($iv$)] application to the problem of weighted parsing over infinite input alphabets,
called \emph{SW-parsing}.
\end{description}
%
The goal of the latter problem is, given an input word~$s$,
to find~$t$ minimizing the distance, in the sense of~\cite{Mohri03ijfcs},
$T(s, t) \otimes A(t)$, where $T$ is a \SWT and $A$ a \SWVPA.
The notion of transducer-based distances allows to consider
different infinite alphabets for the input $s$ and output $t$.
Moreover, the use of \SWVPA permits to search for an output $t$ in the form of a (nested) word,
instead of a tree.
%technical convenience
%the SW models are integrated in a uniform framework.
\emph{SW-parsing} is solved with a Bar-Hillel construction~\cite{NederhofSatta03ParsingIntersection}
%\cite{GruneJacobs08parsing}
of a \SWVPA $B$ such that, for all~$t$, $B(t) = T(s, t) \otimes A(t)$
and the application of a best-search procedure to this automaton $B$.
% reduction to the shortest distance in graphs~\cite{Mohri02semiring,Huang05kbest}.
%Like weighted-parsing methods~\cite{Goodman99SemiringParsing,Nederhof03weightedParsing,MorbitzVogler19weighted-parsing},
%\reviews{1) WARNING: \cite{MorbitzVogler19weighted-parsing} incomparable to our def. weighted parsing}
%Parsing is the problem of structuring a linear representation
%(a finite word) according to a language model. % (a formal grammar). % natural language, programming language,
Context-free parsing approaches~\cite{GruneJacobs08parsing}
generally assume a finite and reasonably small input alphabet. %models and algorithms
%Such a restriction makes perfect sense in the context of
%NLP tasks such as constituency parsing,
%compilers or interpreters for programming languages.
Considering large or infinite alphabets can however be of
practical interest when dealing with large characters encodings such as UTF-16~\cite{dAntoni21CACM}.
% processing strings in \eg for vulnerability detection in Web-applications~\cite{dAntoni21CACM},
%We believe that these results could also be useful
It is also true in the context of automata-based quantitative verification techniques~\cite{Kwiatkowska07quantitativeVerif}, %of systems processing %analysis of
in particular when dealing with data streams,
serialization of structured documents~\cite{Segoufin06csl,NevenSchwentickVianu04FSMinfinite},
or timed execution traces~\cite{Bouyer03algebraic}.
% algebraic definition of a class of data languages
% (notion of monoid recognizability, based on registers, comparable to Bojancszik et al. data words)
The latter case of timed traces
is related to a problem that motivated the present work:
automated music transcription,
that we shall use as a running example
thorough the paper, in order to illustrate our results.
%
Representations capturing music performances
are essentially linear~\cite{Selfridge-Field97beyondMIDI};
%audio files or the widely used MIDI format %\cite{MIDIfile}.
they ignore the hierarchical structures that frame the
conception of music, at least in the Western area.
Hierarchical structures, on the other hand, are present, either explicitly or implicitly,
in Common Western Music Notation~\cite{Gould11Notation}:
%in music notation~\cite{Gould11Notation}:
Music scores are partitioned in measures,
measures in beats, and beats can be further recursively divided.
It follows that written music events do not occur at arbitrary timestamps,
but respect a discrete partitioning of the timeline incurred by
these recursive divisions.
The \emph{transcription problem} takes
as input a linear performance (in audio or MIDI format)
and aims at re-constructing structured notation,
by mapping input events to this hierarchical rhythmic space.
It can therefore be stated as a parsing problem
over an infinite alphabet of timed events~\cite{foscarin:hal-01988990}.
%In expressiveness, they are equivalent to weighted CFG.
%and can be used in a general approach for parsing over infinite input alphabets.
%
%Let $A$ be a \SWVPA, associating $A(w) \in \Semiring$
%to a given a nested word $w$ (representing a parse tree),
%and let a \SWT compute a distance $d$, in $\Semiring$,
%between 2 strings over respectively an infinite input alphabet and the
%same (infinite) alphabet of $A$.
%Then, the problem of Symbolic Weighted Parsing is,
%given an input string $s$, to find a nested word $w$ minimizing
%(according to the ranking defined by $\oplus$)
%the distance $d(s, w) \otimes A(w)$ between $s$ and $A$,
%as defined in~\cite{Mohri03ijfcs}.
% First one general edit-distance is defined by a weighted word
% transducer~\cite{Mohri}
% %Symbolic automata transducers are extended models~\cite{VeanesdAntoniJACM}
% %dealing with infinite set of input symbols...
% value in a semiring...
\florent{expressiveness: VPA have restricted equality test.
comparable to pebble automata? $\to$ conclusion}
\begin{example}%[Running example]
\label{ex:running}
We consider a very simplified %toy
example of \emph{music transcription}:
a given input \emph{timeline} of musical events
from an infinite alphabet $\Sigma$,
is parsed into a structured music score.
Input events of $\Sigma$
are pairs $\mu \tstamp{\tau}$, where $\mu$ is a
MIDI key number~\cite{Selfridge-Field97beyondMIDI}, %\cite{MIDIfile}
and $\tau \in \mathbb{Q}$ is a timestamp in seconds.
Such inputs typically correspond
to the recording of a live performance, \eg
\[I = 69\tstamp{0.07},
71\tstamp{0.72},
73\tstamp{0.91},
74\tstamp{1.05},
76\tstamp{1.36},
77\tstamp{1.71}.\] % over interval $[0,2[$,
The output of parsing is a sequence of
timed symbols $\nu\tstamp{\tau'}$ in an alphabet $\Delta$,
%events (or \emph{notes})
%in Common Western Music Notation (CWMN)~\cite{Gould11Notation}
where $\nu$ represents
an \emph{event} (or \emph{note}),
specified by its \emph{pitch name}
(\eg, $\mbox{\footnotesize A4}$, $\mbox{\footnotesize G5}$, \etc.),
an event \emph{continuation} (symbol `$-$`, see Example~\ref{ex:SWT}),
or a \emph{markup} (opening or closing parenthesis). %(\emph{parenthesis}),
The temporal information $\tau'$
is either a time interval, for the opening parentheses
(representing the duration between the parenthesis
and the matching closing one),
or a timestamp, for the other symbols.
The time points in $\tau'$ belong to a rhythmic ``grid'' obtained from recursive divisions:
whole notes ($\musWhole$) split in halves ($\musHalf$), halves
in quarters ($\musQuarter$), eights ($\musEighth$), \etc.
%
For instance, the output score
\includegraphics[scale=0.35,trim=0 5mm 0 0]{pictures/ex1.pdf},
%\includegraphics[scale=0.20]{pictures/score5.png},
corresponds to a hierarchical structure
that can be linearized as the sequence
$O =$
$\ccall{\mathsf{m}}\tstamp{[0,1]}$,
$\ccall{2}\tstamp{[0,1]}$,
$\mbox{\footnotesize A4}\tstamp{0}$,
$\ccall{2}\tstamp{[\frac{1}{2},1]}$,
$-\tstamp{\frac{1}{2}}$,
$\ccall{2}\tstamp{[\frac{3}{4},1]}$,
$\mbox{\footnotesize B4}\tstamp{\frac{3}{4}}$,
$\mbox{\footnotesize C$\sharp$5}\tstamp{\frac{7}{8}}$,
$\creturn{2}\tstamp{1}$,
$\creturn{2}\tstamp{1}$,
$\creturn{2}\tstamp{1}$,
$\creturn{\mathsf{m}}\tstamp{1}$,
$\ccall{\mathsf{m}}\tstamp{[1,2]}$,
$\ccall{3}\tstamp{[1,2]}$,
$\mbox{\footnotesize D5}\tstamp{1}$,
$\mbox{\footnotesize E5}\tstamp{\frac{4}{3}}$,
$\mbox{\footnotesize F5}\tstamp{\frac{5}{3}}$,
$\creturn{3}\tstamp{2}$,
$\creturn{\mathsf{m}}\tstamp{2}$.
%\end{enumerate}
The opening markups $\ccall{\mathsf{m}}$ %and $\creturn{\mathsf{m}}$
delimit \emph{measures},
which are time intervals of duration~1 in this example,
while the subsequences of $O$ between markups~$\ccall{d}$ and~$\creturn{d}$,
for some natural number~$d$,
represent a division of the time interval attached to $\ccall{d}$,
of duration $\ell$,
into $d$ sub-intervals of equal duration $\frac{\ell}{d}$.
%
We will show that $O$ is a solution for the
parsing of $I$. Note that several other parsings are possible
like \eg \includegraphics[scale=0.35,trim=0 5mm 0 0]{pictures/ex2.pdf}.
SW-parsing associates a weight value
to each solution, and our approach
aims at selecting the best one with respect to this weight.
\endex
\end{example}