---
title: Database management
format:
html:
theme: cosmo
css: assets/styles.css
toc: true
code-copy: true
code-block-bg: true
code-block-border-left: "#31BAE9"
ipynb-shell-interactivity: all
code-overflow: wrap
execute:
freeze: auto
---
```{r setup, include=FALSE}
options(replace.assign=TRUE, width=65)
knitr::opts_chunk$set(eval = FALSE, message = FALSE) ## turned off message() output
library(DBI)
SLOWEVAL <- TRUE
```
We'll illustrate some basic database management using a different example dataset that contains data on web traffic to Wikipedia pages. Note that the input file used here involved some pre-processing relative to the data you get directly from the Wikistats dataset available through Amazon Web Services (AWS), because in the data posted on AWS the datetime information is part of the filename rather than fields in the table.
You can get the raw input files of Wikistats data [here](http://www.stat.berkeley.edu/share/paciorek/tutorial-databases-data.zip).
## 1 SQLite
### 1.1 Setting up a database and using the SQLite command line
With SQLite you don't need to deal with all the permissions and administrative overhead of a client-server style of DBMS because an SQLite database is simply a file that you can access without a password or connecting to a database server process.
To start the SQLite interpreter in Linux, operating on (or creating) a database named `wikistats.db`:
```bash
sqlite3 wikistats.db
```
Here's the syntax to create an (empty) table:
```sql
create table webtraffic (date char(8), hour char(6), site varchar, page varchar, count integer, size double precision);
.quit
```
### 1.2 Populating a table
Here's an example of reading from multiple files into SQLite using the command line.
We create a file `import.sql` that has the configuration for the import:
```sql
.separator " "
.import /dev/stdin webtraffic
```
Then we can iterate through our files from the UNIX shell, piping the output of gzip to the `sqlite3` interpreter:
```bash
for file in $(ls part*gz); do
echo "copying $file"
gzip -cd $file | sqlite3 wikistats.db '.read import.sql'
done
```
Let's check that the records are in the database. We could do this from R or Python, but here we'll do it in the SQLite interface:
```sql
select * from webtraffic limit 5;
```
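The same steps can also be done programmatically. Here's a minimal sketch using Python's standard-library `sqlite3` module, with a couple of made-up rows standing in for the real data (use `wikistats.db` in place of `:memory:` to work with the file-based database):

```python
import sqlite3

# Create the same table as above, here in an in-memory database.
con = sqlite3.connect(":memory:")
con.execute(
    """create table webtraffic
       (date char(8), hour char(6), site varchar, page varchar,
        count integer, size double precision)"""
)

# Insert a couple of illustrative (made-up) rows.
rows = [
    ("20081001", "000000", "en", "Main_Page", 1000, 1.5e6),
    ("20081001", "000000", "en", "Barack_Obama", 500, 2.0e5),
]
con.executemany("insert into webtraffic values (?, ?, ?, ?, ?, ?)", rows)

# Equivalent of the interactive sanity check above.
for row in con.execute("select * from webtraffic limit 5"):
    print(row)
con.close()
```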
### 1.3 Data cleaning
A problem in this example with importing the data files into SQLite as above is the presence of double-quote (") characters that are not meant to delineate strings but are actually part of a field. In this case, probably the easiest thing is simply to strip out those quotes using UNIX tools. Here we use `sed` to search and replace, creating versions of the input files that don't have the quotes.
```{bash}
for file in $(ls *gz); do
gzip -cd ${file} | sed "s/\"//g" | gzip -c > wikistats-cleaned/${file}
done
```
::: {.callout-warning}
If you want to read the data into SQLite yourself, you *will* need to do something about the quotes; I haven't stripped them out of the files.
:::
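A Python equivalent of the `sed` cleaning step might look like the following sketch (the file pattern and output directory mirror the shell loop above):

```python
import glob
import gzip

# Strip double-quote characters from each gzipped input file,
# writing cleaned copies to wikistats-cleaned/ (like the sed loop above).
for path in glob.glob("*gz"):
    with gzip.open(path, "rt") as fin, \
         gzip.open("wikistats-cleaned/" + path, "wt") as fout:
        for line in fin:
            fout.write(line.replace('"', ""))
```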
## 2 DuckDB
DuckDB also has an [interpreter (a command line interface)](https://duckdb.org/docs/api/cli) that you can run from the command line instead of using it via R or Python or other languages.
I won't demonstrate that here. Instead I'll demonstrate creating a database directly from R, without having to read the data into memory in R. We control the schema from R as well, using standard arguments of `read.csv()` (which is used behind the scenes in setting up the DuckDB database, but not for reading in all the data).
```{r, eval=FALSE}
library(duckdb)
dbname <- "wikistats.duckdb"
drv <- duckdb()
db <- dbConnect(drv, dbname)
duckdb_read_csv(db, 'webtraffic', list.files('.', pattern = "^part-"),
delim = ' ', header = FALSE, na.strings = 'NA',
colClasses = c('character','character','character','character','integer','numeric'),
                col.names = c('date','hour','site','page','count','size'))
dbDisconnect(db, shutdown = TRUE)
```
The DuckDB Python interface has `read_csv` and `read_parquet` functions for creating a database from one or more data files.
## 3 PostgreSQL
### 3.1 Setting up a database and using the Postgres command line
First make sure Postgres is installed on your machine.
On Ubuntu, you can install Postgres easily via `apt-get`:
```bash
sudo apt-get install postgresql postgresql-contrib
```
Next we'll see how to set up a database. You'll generally need to operate as the `postgres` user for these sorts of manipulations. Of course if you're just a user accessing an existing database and existing tables, you don't need to worry about this.
```bash
sudo -u postgres -i # become the postgres user
psql # start postgres interpreter
```
Now from within the Postgres interpreter, you can create a database, tables within the database, and authenticate users to do things with those tables.
```sql
create database wikistats;
create user paciorek with password 'test';
grant all privileges on database wikistats to paciorek;
```
PostgreSQL and other DBMSs (but not SQLite) allow various kinds of control over permissions to access and modify databases and tables.
It can get a bit involved, because the administrator has fine-grained control over what each user can do and access.
Now let's create a table in the database, after first connecting to the specific database so as to operate on it.
```sql
\connect wikistats
create table webtraffic (date char(8), hour char(6), site varchar, page varchar,
count integer, size double precision);
grant all privileges on table webtraffic to paciorek;
\quit
```
::: {.callout-note}
Notice the use of `\` to do administrative tasks (as opposed to executing SQL syntax), and the use of `;` to end each statement. Without the semicolon, Postgres will return without doing anything.
:::
If you want control over where the database is stored (you probably only need to worry about this if you are creating a large database), you can do things like this:
```sql
show data_directory;
create tablespace dbspace location '/var/tmp/pg';
create database wikistats tablespace dbspace;
create user paciorek with password 'test';
grant all privileges on database wikistats to paciorek;
```
### 3.2 Populating a table
Here's an example of importing a single file into Postgres from within the psql interpreter running as the special postgres user. In this case we have space-delimited text files. You can obtain the file `part-00000` as discussed in the introduction (you'll need to run `gunzip part-00000.gz` first).
```sql
\connect wikistats
copy webtraffic from 'part-00000' delimiter ' ';
```
If one had CSV files, one could do the following:
```sql
copy webtraffic from 'part-00000' csv;
```
To actually handle the Wikistats input files, we need to deal with backslash characters occurring at the end of the text for a given column in some rows. Ordinarily in standard Postgres 'text' format (different from Postgres 'csv' format), a backslash is used to 'quote' characters that would usually be treated as row or column delimiters (i.e., preceding such a character by a backslash means it is treated as part of the field). But we just want the backslash treated as a character itself, so we need to tell Postgres not to treat a backslash as the quoting character.

To do that, we specify the `quote` character explicitly. However, the `quote` keyword is only available when importing in 'csv' format. In 'csv' format the double-quote character by default delineates the beginning and end of text in a field, but the Wikistats files have double-quotes as part of the fields. So we need to set the quote character to something that is neither a double-quote nor a backslash. The following syntax does that by specifying a quote character (`\b`) that never actually appears in the file. The `e` prefix makes Postgres treat `\b` as a single character (i.e., 'escaping' the backslash), and `csv` is needed because the `quote` keyword only works with the csv format. Note that by setting the delimiter to a space, it's not really a CSV file!
```sql
copy webtraffic from 'part-00000' delimiter ' ' quote e'\b' csv;
```
Often you'll need to load data from a large number of possibly zipped text files. As an example of how you would load data in a case like that, here's some shell scripting that will iterate through multiple (gzipped) input files of Wikistats data, running as the regular user:
```bash
export PGPASSWORD=test # set password via UNIX environment variable
for file in $(ls part*gz); do # loop thru files whose names start with 'part' and end with 'gz'
echo "copying $file"
## unzip and then pass by UNIX pipe to psql run in non-interactive mode
gzip -cd $file |
psql -d wikistats -h localhost -U paciorek -p 5432 -c "\copy webtraffic from stdin delimiter ' ' quote e'\b' csv"
done
```
::: {.callout-note}
Using `\copy` as above invokes the psql client's `copy` command (plain `copy` would invoke the standard SQL `copy` command), which allows one to operate as a regular user and to use relative paths. Under the hood, `\copy` invokes the SQL `copy` command in a specific way.
:::
### 3.3 Data cleaning
One complication is that often the input files will have anomalies in them. Examples include missing columns for some rows, individual elements in a column that are not of the correct type (e.g., a string in a numeric column), and characters that can't be handled. In the Wikistats data case, one issue was lines without the full set of columns and another was the presence of a backslash character at the end of the text for a column.
With large amounts of data or many files, this can be a hassle to deal with. UNIX shell commands can sometimes be quite helpful, including `sed` and `awk`. Or one might preprocess files in chunks using Python.
For example, the following shell loop over Wikistats files ensures each row has six fields/columns by pulling out only rows with the full set of columns. I used this to process the input files before copying into Postgres as done above. Actually, there was even more preprocessing, because in the form of the data available from Amazon's storage service the date/time information was part of the filename and not part of the data files.
```{bash}
for file in $(ls *gz); do
gzip -cd $file | grep "^.* .* .* .* .* .*$" | gzip -c > ../wikistats-fulllines/$file
done
```
Note that this restriction to rows with a full set of fields has already been done in the data files I provide to you.
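The same filtering could be sketched in Python; the `grep` pattern above simply requires at least five spaces (i.e., at least six fields) per line:

```python
import glob
import gzip

# Keep only lines with at least six space-separated fields,
# mirroring the grep pattern in the shell loop above.
for path in glob.glob("*gz"):
    with gzip.open(path, "rt") as fin, \
         gzip.open("../wikistats-fulllines/" + path, "wt") as fout:
        for line in fin:
            if line.count(" ") >= 5:  # at least six fields
                fout.write(line)
```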
## 4 Database administration and configuration miscellanea
You can often get configuration information by making a query. For example, here's how one can get information on the cache size in SQLite or on various settings in Postgres.
```{r, eval=FALSE}
# SQLite
dbGetQuery(db, "pragma cache_size")
dbGetQuery(db, "pragma cache_size=90000")
# sets cache size to ~90 MB (90,000 pages at 1 KB/page), but not really
# relevant as the operating system should do disk caching automatically
# Postgres
dbGetQuery(db, "select * from pg_settings")
dbGetQuery(db, "select * from pg_settings where name='dynamic_shared_memory_type'")
```
## 5 Remote access to PostgreSQL databases
If you want to connect to a Postgres database running on a different machine, here's one approach that involves SSH port forwarding. For example, you could connect to a Postgres database running on some server while working as usual in R or Python on your laptop.
First, on your machine, set up the port forwarding where 63333 should be an unused port on your local machine and PostgresHostMachine is the machine on which the database is running.
For Linux/Mac, from the terminal:
```bash
ssh -L 63333:localhost:5432 yourUserName@PostgresHostMachine
```
Using Putty on Windows, go to 'Connection -> SSH -> Tunnels' and put '63333' as the 'Source port' and '127.0.0.1:5432' as the 'Destination'. Click 'Add' and then connect to the machine via Putty.
In either case, the result is that port 63333 on your local machine is being forwarded to port 5432 (the standard port used by Postgres) on the server. The use of 'localhost' is a bit confusing: it means that you are forwarding port 63333 to port 5432 on 'localhost' *from the perspective of the server*, i.e., the server itself.
Then, on your local machine, you can connect by specifying the port on your local machine; here's an example from R:
```{r, eval=FALSE}
## `drv` is a Postgres driver object, e.g., RPostgres::Postgres()
db <- dbConnect(drv, dbname = 'wikistats', user = 'yourUserName',
                password = 'yourPassword', host = 'localhost', port = 63333)
```
## 6 UNIX tools for examining disk access (I/O) and memory use
### 5.1 I/O
`iotop` shows disk input/output in real time on a per-process basis, while `iostat` shows overall disk use.
```{bash, io-monitor}
iotop # shows usage in real time
iostat 1 # shows usage every second
```
### 5.2 Memory
To see how much memory is available, one needs a clear understanding of disk caching. As discussed above, the operating system will generally cache files/data in memory when it reads from disk. If that information is still in memory the next time it is needed, it will be much faster to access the second time around. While the cached information is using memory, that same physical memory is immediately available to other processes if needed, so the memory counts as available even though it is in use.
We can see this via `free -h` (the `-h` is for 'human-readable', i.e., show in GB (G)).
```
total used free shared buff/cache available
Mem: 251G 998M 221G 2.6G 29G 247G
Swap: 7.6G 210M 7.4G
```
You'll generally be interested in the `Mem` row. (See below for some comments on `Swap`.) The `shared` column is complicated and probably won't be of use to you. The `buff/cache` column shows how much space is used for disk caching and related purposes but is actually available. Hence the `available` column is (more or less) the sum of the `free` and `buff/cache` columns. In this case only about 1 GB is in use (shown in the `used` column).
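As a quick check of that arithmetic, using the sample numbers above (values in GB, rounded):

```python
# Values (GB) from the sample `free -h` output above.
free_mem = 221
buff_cache = 29

# 'available' is roughly free memory plus reclaimable buffer/cache space.
print(free_mem + buff_cache)  # prints 250, close to the reported 247G
```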
`top` and `vmstat` both show overall memory use, but remember that the amount available is the amount free plus any buffer/cache usage.
Here is some example output from vmstat:
```
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 215140 231655120 677944 30660296 0 0 1 2 0 0 18 0 82 0 0
```
It shows 232 GB free and 31 GB used for cache and therefore available, for a total of 263 GB available.
Here are some example lines from top:
```
KiB Mem : 26413715+total, 23180236+free, 999704 used, 31335072 buff/cache
KiB Swap: 7999484 total, 7784336 free, 215148 used. 25953483+avail Mem
```
We see that this machine has 264 GB RAM (the total column in the Mem row), with 259.5 GB available (232 GB free plus 31 GB buff/cache as seen in the Mem row). (I realize the numbers don't quite add up for reasons I don't fully understand, but we probably don't need to worry about that degree of exactness.) Only 1 GB is in use.
Swap is essentially the reverse of disk caching: it is disk space used as memory when the machine runs out of physical memory. You never want your machine to be using swap, because your jobs will slow to a crawl. Here the `Swap` line in both `free` and `top` shows 8 GB of swap space, with very little in use, as desired.