Commit adfe31a

Merge pull request #13 from virtual-labs/dev
Merge dev to testing
2 parents 92fce52 + e422d3e commit adfe31a

1 file changed

+147
-19
lines changed

experiment/theory.md

Lines changed: 147 additions & 19 deletions
Original file line numberDiff line numberDiff line change
@@ -11,31 +11,40 @@ Regular expressions are a formal way to describe patterns in strings. Think of t
1111

1212
#### Basic Building Blocks of Regular Expressions
1313

14-
1. **Basic Symbols (Literals)**: Individual characters from an alphabet
14+
1. **Empty String (ε)**: Represents the string with no characters
15+
- Example: The regular expression `ε` matches only the empty string ""
16+
- This is the identity element for concatenation
17+
18+
2. **Empty Set (∅)**: Represents a pattern that matches no strings at all
19+
- The empty set language contains no strings
20+
- This is different from the language of `ε`, which contains exactly one string (the empty string itself)
21+
22+
3. **Basic Symbols (Literals)**: Individual characters from an alphabet
1523
- Example: The regular expression `a` matches only the string "a"
1624
- Example: The regular expression `7` matches only the string "7"
1725

18-
2. **Concatenation**: Writing expressions next to each other to form sequences
26+
4. **Concatenation**: Writing expressions next to each other to form sequences
1927
- Example: `ab` matches only the string "ab"
2028
- Example: `hello` matches only the string "hello"
2129

22-
3. **Union (Alternation) - |**: Choice between alternatives
30+
5. **Union (Alternation) - |**: Choice between alternatives
2331
- Example: `a|b` matches either "a" or "b"
2432
- Example: `cat|dog` matches either "cat" or "dog"
2533

26-
4. **Kleene Star (**)**: Zero or more repetitions
34+
6. **Kleene Star (*)**: Zero or more repetitions
2735
- Example: `a*` matches "", "a", "aa", "aaa", etc.
2836
- Example: `(ab)*` matches "", "ab", "abab", "ababab", etc.
2937

30-
5. **Plus (+)**: One or more repetitions
38+
7. **Plus (+)**: One or more repetitions
3139
- Example: `a+` matches "a", "aa", "aaa", etc. (but not the empty string)
3240
- Example: `(0|1|2|3|4|5|6|7|8|9)+` matches one or more digits
3341

3442
#### Examples of Regular Expressions in Action
3543

36-
- **Email validation pattern**: `[a-zA-Z0-9]+@[a-zA-Z0-9]+\.[a-zA-Z]{2,4}`
37-
- **Phone number pattern**: `\d{3}-\d{3}-\d{4}`
38-
- **Variable names in programming**: `[a-zA-Z_][a-zA-Z0-9_]*`
44+
- **Binary strings ending with 01**: `(0|1)*01`
45+
- **Strings containing at least one 'a'**: `(a|b)*a(a|b)*`
46+
- **Even number of 0's**: `1*(01*01*)*`
47+
- **Strings starting and ending with the same symbol over {a,b}**: `a(a|b)*a|b(a|b)*b|a|b`
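
The patterns above can be checked against sample strings. The sketch below is only an illustration: it assumes Python's `re` module, whose `|`, `*`, and parentheses behave like the formal operators for these particular patterns. The dictionary keys and the `matches` helper are hypothetical names introduced here.

```python
import re

# Patterns copied from the list above; the keys are illustrative names only.
PATTERNS = {
    "ends_with_01": r"(0|1)*01",
    "contains_a": r"(a|b)*a(a|b)*",
    "even_zeros": r"1*(01*01*)*",
    "same_first_last": r"a(a|b)*a|b(a|b)*b|a|b",
}


def matches(name: str, text: str) -> bool:
    """Return True if the whole string belongs to the named language."""
    # fullmatch is used because a formal regular expression describes
    # entire strings, not substrings.
    return re.fullmatch(PATTERNS[name], text) is not None


print(matches("ends_with_01", "1101"))
print(matches("even_zeros", "10"))
```

`re.fullmatch` is used rather than `re.search` so that a string is accepted only when the entire string fits the pattern, which is how membership in a regular language is defined.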
3948

4049
### Non-deterministic Finite Automata (NFA)
4150

@@ -75,6 +84,17 @@ Consider an NFA that accepts strings ending with "01":
7584

7685
Thompson's construction, developed by Ken Thompson in 1968, is a systematic method for converting regular expressions into equivalent NFAs. The algorithm works recursively by building small NFAs for basic expressions and then combining them using specific patterns for each operator.
7786

87+
#### Implementation Approach
88+
89+
Thompson's construction is typically implemented using a **stack-based approach**. As the algorithm processes the regular expression (usually in postfix notation), it:
90+
91+
1. **Pushes NFAs onto a stack** as basic symbols are encountered
92+
2. **Pops NFAs from the stack** when operators are encountered
93+
3. **Combines the popped NFAs** according to the operator's construction rule
94+
4. **Pushes the resulting NFA** back onto the stack
95+
96+
This stack-based approach mirrors the compositional structure of regular expressions, where smaller expressions are combined into larger ones. Because postfix notation already encodes grouping and operator precedence, the stack only needs to hold the NFA fragments built so far; when the input is exhausted, the single fragment left on the stack is the NFA for the whole expression.
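
The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a reference implementation: the postfix input uses `.` as an explicit concatenation operator, ε is represented by `None`, and the names `build_nfa`, `link`, and `EPS` are hypothetical.

```python
from itertools import count

EPS = None  # label used for ε-transitions (a convention of this sketch)


def build_nfa(postfix: str):
    """Build an NFA from a postfix regex.

    Operators: '.' concatenation, '|' union, '*' star, '+' plus.
    Returns (start, accept, transitions), where transitions is a list of
    (source, label, target) triples.
    """
    ids = count()       # fresh state numbers 0, 1, 2, ...
    transitions = []
    stack = []          # one (start, accept) pair per NFA fragment

    def link(src, label, dst):
        transitions.append((src, label, dst))

    for token in postfix:
        if token == ".":
            s2, f2 = stack.pop()
            s1, f1 = stack.pop()
            link(f1, EPS, s2)              # final of r1 -> start of r2
            stack.append((s1, f2))
        elif token == "|":
            s2, f2 = stack.pop()
            s1, f1 = stack.pop()
            start, accept = next(ids), next(ids)
            link(start, EPS, s1)
            link(start, EPS, s2)
            link(f1, EPS, accept)
            link(f2, EPS, accept)
            stack.append((start, accept))
        elif token in "*+":
            s, f = stack.pop()
            start, accept = next(ids), next(ids)
            link(start, EPS, s)
            link(f, EPS, accept)
            link(f, EPS, s)                # loop back for repetition
            if token == "*":
                link(start, EPS, accept)   # skip edge: zero repetitions
            stack.append((start, accept))
        else:
            s, f = next(ids), next(ids)    # basic symbol: two-state NFA
            link(s, token, f)
            stack.append((s, f))

    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    start, accept = stack.pop()
    return start, accept, transitions


start, accept, transitions = build_nfa("ab|*a.b.b.")  # postfix of (a|b)*abb
print(start, accept, len(transitions))
```

Converting an infix expression such as `(a|b)*abb` to the postfix string `ab|*a.b.b.` is a separate step (for example with the shunting-yard algorithm) and is not shown here.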
97+
7898
#### Historical Fun Fact
7999

80100
Ken Thompson, who developed this algorithm, also co-created the Unix operating system and designed the B programming language, the direct predecessor of C. His work on regular expressions was motivated by the need for pattern matching in text editors, first in his version of the QED editor and later in the 'ed' editor in early Unix systems.
@@ -86,48 +106,156 @@ Thompson's algorithm follows these specific construction patterns:
86106
**1. Base Case - Single Symbol 'a':**
87107

88108
```text
89-
Start → [state1] --a--> [state2] (final)
109+
┌────────┐ a ┌────────┐
110+
→│ q₀ │ ───────────────> │ q₁ │ ◎
111+
└────────┘ └────────┘
112+
(start) (final)
90113
```
91114

115+
This creates a simple two-state NFA where:
116+
117+
- q₀ is the start state (indicated by →)
118+
- q₁ is the final/accepting state (indicated by ◎)
119+
- The transition labeled 'a' connects them
120+
92121
**2. Concatenation - Expressions r₁r₂:**
93122

123+
```text
124+
┌──────────┐ ┌──────────┐
125+
→ │ NFA₁ │ │ NFA₂ │ ◎
126+
│ (r₁) │ │ (r₂) │
127+
└──────────┘ └──────────┘
128+
│ ↑
129+
└────────── ε ──────────┘
130+
131+
Detailed view:
132+
┌────┐ ┌────┐ ε ┌────┐ ┌────┐
133+
→ │ q₀ │ ─(r₁)→ │ q₁ │ ───────> │ q₂ │ ─(r₂)→ │ q₃ │ ◎
134+
└────┘ └────┘ └────┘ └────┘
135+
```
136+
137+
Construction steps:
138+
94139
- Build NFAs for r₁ and r₂ separately
95140
- Connect the final state of r₁ to the start state of r₂ with an ε-transition
96141
- The start state of r₁ becomes the new start state
97142
- The final state of r₂ becomes the new final state
98143

99144
**3. Union - Expressions r₁|r₂:**
100145

146+
```text
147+
ε ┌──────────┐ ε
148+
┌─────────> │ NFA₁ │ ─────────┐
149+
│ │ (r₁) │ │
150+
┌────┐ │ └──────────┘ │ ┌────┐
151+
→ │ q₀ │ ───┤ ├──> │ qf │ ◎
152+
└────┘ │ ┌──────────┐ │ └────┘
153+
(new) │ │ NFA₂ │ │ (new)
154+
└────────────> │ (r₂) │ ─────────┘
155+
ε └──────────┘ ε
156+
```
157+
158+
Construction steps:
159+
101160
- Build NFAs for r₁ and r₂ separately
102-
- Create a new start state with ε-transitions to both start states
103-
- Create a new final state with ε-transitions from both final states
161+
- Create a new start state (q₀) with ε-transitions to both start states
162+
- Create a new final state (qf) with ε-transitions from both final states
163+
- This allows the automaton to "choose" either path
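
The concatenation and union rules can be tried out with a small sketch. It assumes a fragment is a tuple `(start, accept, edges)` and that ε is represented by `None`; the helper names (`symbol`, `concat`, `union`, `closure`, `accepts`) are hypothetical and chosen only for this illustration.

```python
from itertools import count

EPS = None        # ε label (assumed convention for this sketch)
_ids = count()    # source of fresh state numbers


def symbol(ch):
    """Two-state NFA for a single symbol."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, ch, f)]


def concat(n1, n2):
    """r1 r2: link the final state of r1 to the start state of r2 with ε."""
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    return s1, f2, e1 + e2 + [(f1, EPS, s2)]


def union(n1, n2):
    """r1 | r2: new start and final states wired to both fragments with ε."""
    s1, f1, e1 = n1
    s2, f2, e2 = n2
    q0, qf = next(_ids), next(_ids)
    new_edges = [(q0, EPS, s1), (q0, EPS, s2), (f1, EPS, qf), (f2, EPS, qf)]
    return q0, qf, e1 + e2 + new_edges


def closure(states, edges):
    """All states reachable from `states` using only ε-transitions."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen


def accepts(nfa, text):
    """Simulate the NFA by tracking the set of currently active states."""
    start, accept, edges = nfa
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current


ab = concat(symbol("a"), symbol("b"))
a_or_b = union(symbol("a"), symbol("b"))
print(accepts(ab, "ab"), accepts(a_or_b, "b"))
```

Tracking a set of active states is how the "choice" between the two branches of a union is resolved during simulation: both branches are followed at once.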
104164

105165
**4. Kleene Star - Expression r*:**
106166

167+
```text
168+
┌─────── ε (skip) ─────────┐
169+
│ │
170+
↓ ↓
171+
┌────┐ │ ┌──────────┐ ┌────┐
172+
→ │ q₀ │ ───┴──> │ NFA │ ──────> │ qf │ ◎
173+
└────┘ ε │ (r) │ ε └────┘
174+
(new) └──────────┘ (new)
175+
│ ↑
176+
└──┘
177+
ε (repeat)
178+
```
179+
180+
Construction steps:
181+
107182
- Build an NFA for r
108-
- Create new start and final states
183+
- Create new start (q₀) and final (qf) states
109184
- Add ε-transitions: new start → original start, original final → new final
110-
- Add ε-transitions: new start → new final (for zero repetitions)
111-
- Add ε-transitions: original final → original start (for repetitions)
185+
- Add ε-transition: new start → new final (for zero repetitions)
186+
- Add ε-transition: original final → original start (for multiple repetitions)
187+
- This creates a loop allowing 0 or more repetitions
112188

113189
**5. Plus Operation - Expression r+:**
114190

191+
```text
192+
┌────┐ ┌──────────┐ ┌────┐
193+
→ │ q₀ │ ──────> │ NFA │ ──────> │ qf │ ◎
194+
└────┘ ε │ (r) │ ε └────┘
195+
(new) └──────────┘ (new)
196+
│ ↑
197+
└──┘
198+
ε (repeat)
199+
200+
Note: Unlike r*, there is NO direct ε-transition from q₀ to qf
201+
```
202+
203+
Construction steps:
204+
115205
- Build an NFA for r
116206
- Similar to Kleene star but without the direct ε-transition from start to final
117-
- This ensures at least one occurrence of r
207+
- This ensures at least one occurrence of r before accepting
208+
- The loop back allows for multiple repetitions
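
The difference between `r*` and `r+` comes down to a single ε-transition, which the following sketch makes concrete. It is an illustration under assumptions: fragments are `(start, accept, edges)` tuples, ε is `None`, and all function names are hypothetical.

```python
from itertools import count

EPS = None        # ε label (assumed convention for this sketch)
_ids = count()    # source of fresh state numbers


def symbol(ch):
    """Two-state NFA for a single symbol."""
    s, f = next(_ids), next(_ids)
    return s, f, [(s, ch, f)]


def plus(nfa):
    """r+: new start/final states plus a loop back, but no skip edge."""
    s, f, edges = nfa
    q0, qf = next(_ids), next(_ids)
    return q0, qf, edges + [(q0, EPS, s), (f, EPS, qf), (f, EPS, s)]


def star(nfa):
    """r*: the r+ wiring plus a direct ε edge from new start to new final."""
    q0, qf, edges = plus(nfa)
    return q0, qf, edges + [(q0, EPS, qf)]


def closure(states, edges):
    """All states reachable from `states` using only ε-transitions."""
    seen, todo = set(states), list(states)
    while todo:
        q = todo.pop()
        for src, label, dst in edges:
            if src == q and label is EPS and dst not in seen:
                seen.add(dst)
                todo.append(dst)
    return seen


def accepts(nfa, text):
    """Simulate the NFA on `text` using sets of active states."""
    start, accept, edges = nfa
    current = closure({start}, edges)
    for ch in text:
        moved = {dst for src, label, dst in edges
                 if src in current and label == ch}
        current = closure(moved, edges)
    return accept in current


a_star = star(symbol("a"))
a_plus = plus(symbol("a"))
print(accepts(a_star, ""), accepts(a_plus, ""))
```

Defining `star` in terms of `plus` is a design choice of this sketch; it makes the extra skip edge the only difference between the two constructions.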
118209

119210
#### Step-by-Step Example
120211

121212
Let's convert the regular expression `(a|b)*abb` to an NFA:
122213

123214
1. **Build NFA for 'a'**: Simple two-state NFA
124-
2. **Build NFA for 'b'**: Simple two-state NFA
215+
216+
```text
217+
→ (q₀) ─a→ (q₁)
218+
```
219+
220+
2. **Build NFA for 'b'**: Simple two-state NFA
221+
222+
```text
223+
→ (q₂) ─b→ (q₃)
224+
```
225+
125226
3. **Build NFA for 'a|b'**: Use union construction
227+
228+
```text
229+
ε ─a→
230+
┌───→ ( ) ───┐ ε
231+
→ ( ) │ ├──→ ( )
232+
└───→ ( ) ───┘
233+
ε ─b→
234+
```
235+
126236
4. **Build NFA for '(a|b)*'**: Apply Kleene star construction
127-
5. **Build NFAs for second 'a', third 'b', fourth 'b'**: Simple constructions
128-
6. **Concatenate all parts**: Connect using concatenation construction
129237

130-
The resulting NFA will have multiple states connected with both symbol transitions and ε-transitions.
238+
```text
239+
┌──────ε──────┐
240+
│ ↓
241+
→ ( ) ┴──ε→ [a|b] ──┴─ε→ ( )
242+
↑ │
243+
└──┘ ε
244+
```
245+
246+
5. **Build NFAs for 'a', 'b', 'b'**: Simple constructions
247+
248+
```text
249+
→ ( ) ─a→ ( ) → ( ) ─b→ ( ) → ( ) ─b→ ( ) ◎
250+
```
251+
252+
6. **Concatenate all parts**: Final NFA for `(a|b)*abb`
253+
254+
```text
255+
→ [(a|b)*] ─ε→ [a] ─ε→ [b] ─ε→ [b] ◎
256+
```
257+
258+
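Under the construction rules above, where concatenation links fragments with an ε-transition, the resulting NFA has 14 states: 8 for `(a|b)*` plus 2 for each of the trailing `a`, `b`, and `b`. Variants of Thompson's construction that merge the final state of r₁ with the start state of r₂ yield 11 states for the same expression. Each component maintains Thompson's properties: a single entry point and a single exit point, which is what allows clean composition.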
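
The number of states can be derived mechanically from the construction rules. The sketch below assumes the postfix form `ab|*a.b.b.` of `(a|b)*abb`, with `.` as an explicit concatenation operator; `count_states` and its `merge_on_concat` flag are hypothetical names. With ε-linked concatenation, as described in this section, the count comes to 14; a variant that merges the two boundary states at each concatenation gives 11.

```python
def count_states(postfix: str, merge_on_concat: bool = False) -> int:
    """Count NFA states for a postfix regex ('.', '|', '*', '+' operators)."""
    stack = []  # number of states in each fragment built so far
    for token in postfix:
        if token == ".":
            right, left = stack.pop(), stack.pop()
            # ε-linking adds no states; merging removes one shared state.
            stack.append(left + right - (1 if merge_on_concat else 0))
        elif token == "|":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right + 2)   # new start and new final
        elif token in "*+":
            stack.append(stack.pop() + 2)    # new start and new final
        else:
            stack.append(2)                  # basic symbol: two states
    return stack.pop()


print(count_states("ab|*a.b.b."))
print(count_states("ab|*a.b.b.", merge_on_concat=True))
```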
131259

132260
### The Fundamental Equivalence
133261
