Regular Expression Matching using Bit Vector Automata (SPLASH 2023 - OOPSLA)

Sun 22 - Fri 27 October 2023 Cascais, Portugal

Who

Alexis Le Glaunec, Lingkun Kong, Konstantinos Mamouras

Track

SPLASH 2023 OOPSLA

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

Use conference time zone: (GMT+01:00) LisbonSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

By setting a time band, the program will dim events that are outside this time window. This is useful for (virtual) conferences with a continuous program (with repeated sessions).
The time band will also limit the events that are included in the personal iCalendar subscription service.

Display full programSpecify a time band

Save

When

Thu 26 Oct 2023 11:18 - 11:36 at Room II - language semantics Chair(s): Sebastian Erdweg

Abstract

Regular expressions (regexes) are ubiquitous in modern software. There is a variety of implementation techniques for regex matching, which can be roughly categorized as (1) relying on backtracking search, or (2) being based on finite-state automata. The implementations that use backtracking are often chosen due to their ability to support advanced pattern-matching constructs. Unfortunately, they are known to suffer from severe performance problems. For some regular expressions, the running time for matching can be exponential in the size of the input text. In order to provide stronger guarantees of matching efficiency, automata-based regex matching is the preferred choice. However, even these regex engines may exhibit severe performance degradation for some patterns. The main reason for this is that regexes used in practice are not exclusively built from the classical regular constructs, i.e., concatenation, nondeterministic choice and Kleene's star. They involve additional constructs that provide succinctness and convenience of expression. The most common such construct is bounded repetition (also called counting), which describes the repetition of the pattern a fixed number of times.

In this paper, we propose a new algorithm for the efficient matching of regular expressions that involve bounded repetition. Our algorithms are based on a new model of automata, which we call nondeterministic bit vector automata (NBVA). This model is chosen to be expressively equivalent to nondeterministic counter automata with bounded counters, a very natural model for expressing patterns with bounded repetition. We show that there is a class of regular expressions with bounded repetition that can be matched in time that is independent from the repetition bounds. Our algorithms are general enough to cover the vast majority of challenging bounded repetitions that arise in practice. We provide an implementation of our approach in a regex engine, which we call BVA-Scan. We compare BVA-Scan against state-of-the-art regex engines on several real datasets.

DOI

https://doi.org/10.1145/3586044

Alexis Le Glaunec

Rice University

United States

Lingkun Kong

Rice University

United States

Konstantinos Mamouras

Rice University

United States

Time Zone

The program is currently displayed in (GMT+01:00) Lisbon.

Use conference time zone: (GMT+01:00) LisbonSelect other time zone

The GMT offsets shown reflect the offsets at the moment of the conference.

Time Band

Display full programSpecify a time band

Save

Session Program

Thu 26 Oct
Displayed time zone: Lisbon change

11:00 - 12:30	language semanticsOOPSLA at Room II Chair(s): Sebastian Erdweg JGU Mainz

11:00 18m Talk		The Essence of Verilog: A Tractable and Tested Operational Semantics for Verilog OOPSLA Qinlin Chen Nanjing University, Nairen Zhang Nanjing University, Jinpeng Wang Nanjing University, Tian Tan Nanjing University, Chang Xu Nanjing University, Xiaoxing Ma Nanjing University, Yue Li Nanjing University DOI
11:18 18m Talk		Regular Expression Matching using Bit Vector Automata OOPSLA Alexis Le Glaunec Rice University, Lingkun Kong Rice University, Konstantinos Mamouras Rice University DOI
11:36 18m Talk		Bidirectional Object-Oriented Programming: Towards Programmatic and Direct Manipulation of Objects OOPSLA Xing Zhang Peking University, Guanchen Guo Peking University, Xiao He University of Science and Technology Beijing, Zhenjiang Hu Peking University DOI
11:54 18m Talk		Bring Your Own Data Structures to Datalog OOPSLA Arash Sahebolamri Syracuse University, Langston Barrett Galois, Scott Moore Galois, Kristopher Micinski Syracuse University DOI
12:12 18m Talk		Rhombus: A New Spin on Macros without All the Parentheses OOPSLA Matthew Flatt University of Utah, Taylor Allred University of Utah, Nia Angle independent, Stephen De Gabrielle independent, Robert Bruce Findler Northwestern University, Jack Firth independent, Kiran Gopinathan National University of Singapore, Ben Greenman University of Utah, Siddhartha Kasivajhula independent, Alex Knauth independent, Jay McCarthy Reach, Sam Phillips independent, Sorawee Porncharoenwase University of Washington, Jens Axel Søgaard independent, Sam Tobin-Hochstadt Indiana University DOI Pre-print