
Commit 5251470

hlcianfagna authored and amotl committed
Debezium: Tutorial about replicating data from MSSQL
1 parent 8946a4b commit 5251470

File tree

3 files changed: +289 -4 lines changed

docs/conf.py

Lines changed: 3 additions & 1 deletion
```diff
@@ -75,7 +75,9 @@
 
 linkcheck_anchors_ignore_for_url += [
     # Anchor 'XXX' not found
-    r"https://pypi.org/.*"
+    r"https://pypi.org/.*",
+    # https://kafka.apache.org/documentation/#topicconfigs - Anchor 'topicconfigs' not found
+    r"https://kafka.apache.org/.*",
 ]
 
 # Configure intersphinx.
```

docs/integrate/debezium/index.md

Lines changed: 9 additions & 3 deletions
```diff
@@ -27,9 +27,9 @@ SQL Server, IBM DB2, Cassandra, Vitess, Spanner, JDBC, and Informix.
 ::::{grid}
 
 :::{grid-item-card} Tutorial: Replicate data from MSSQL
-:link: https://community.cratedb.com/t/replicating-data-to-cratedb-with-debezium-and-kafka/1388
-:link-type: url
-Replicating data from MSSQL to CrateDB with Debezium and Kafka.
+:link: debezium-tutorial
+:link-type: ref
+Replicate data from MSSQL to CrateDB with Debezium and Kafka.
 :::
 
 :::{grid-item-card} Webinar: Replicate data from other databases
@@ -41,5 +41,11 @@ How to replicate data from other databases to CrateDB with Debezium and Kafka.
 ::::
 
 
+:::{toctree}
+:maxdepth: 1
+:hidden:
+Tutorial <tutorial>
+:::
+
 
 [Debezium]: https://debezium.io/
```
docs/integrate/debezium/tutorial.md

Lines changed: 277 additions & 0 deletions
(debezium-tutorial)=
# Replicate data from MSSQL to CrateDB with Debezium and Kafka

## Introduction
You may have line-of-business applications, such as ERP software, that work with transactional database systems like MSSQL, Oracle, or MySQL.

The setup may work perfectly fine for day-to-day operations, but you may find that it is not ideal for data analytics.

If you attempt to run analytical workloads against the operational databases, you may run into concurrency issues deriving from locking, the analytical queries may impact the performance of business-critical operations, and the performance and feature set of the transactional database system may not be good enough for analyzing large amounts of data.

Considering this, many organisations conclude that they need to copy data to a separate environment to run reporting and dashboards. This is sometimes done with replication, sometimes with backups, and sometimes with complex ETL pipelines. It often comes with a set of challenges:
* ballooning license costs
* custom ad-hoc routines for getting the data to the analytics environment, requiring development, monitoring, and troubleshooting
* a need to design and maintain an indexing strategy for the analytics copy of the data
* high availability requirements for the analytics environment as the business starts relying on it
We can address several of these points by using a system like CrateDB. CrateDB is a feature-rich, open-source SQL database which, out of the box, automatically implements indexes, compression, and a columnar store, so that most analytical queries run much faster without any need to fiddle with settings. Because it is open source, there is no need to be concerned about licensing expenses. Additionally, it can scale horizontally, which means that the number of nodes can be adjusted as needed to handle changing data volumes and workloads, and it provides high availability without requiring administrative effort.

If only we could replicate data from our operational database to CrateDB without having to write custom code… it turns out we can.

Enter [Debezium](https://debezium.io/): an established open-source system, built on top of Kafka, which allows capturing changes on a source database system and replicating them to another system without having to write custom scripts.

In this post, I want to show an example of replicating changes on a table from MSSQL to CrateDB.
## Setup on the MSSQL side

We will need a SQL Server instance with the SQL Server Agent service up and running. If you are running MSSQL in a container, you can get the agent running by setting the environment variable `MSSQL_AGENT_ENABLED` to `True`.
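For instance, a throwaway SQL Server container with the agent enabled could be started along these lines (a sketch, not part of the original setup; adjust the image tag and the SA password to your environment):

```bash
sudo docker run -d --name mssql -p 1433:1433 \
  -e ACCEPT_EULA=Y \
  -e MSSQL_SA_PASSWORD='<enterStrongPasswordHere>' \
  -e MSSQL_AGENT_ENABLED=True \
  mcr.microsoft.com/mssql/server:2022-latest
```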
Connect to the instance with a client such as `sqlcmd`, SSMS, or [DBeaver](https://dbeaver.io/).

We are now going to go through a number of steps; if you already have a working system, feel free to skip the operations you do not need.
Let’s create a database with a test table in it:

```sql
CREATE DATABASE erp;
GO
USE erp;
CREATE TABLE dbo.tbltest (
    id INT IDENTITY PRIMARY KEY,
    createdon DATETIME DEFAULT getdate(),
    srcsystem NVARCHAR(max)
);
```
Let’s now create an account for Debezium to use to pull the changes:

```sql
CREATE LOGIN debeziumlogin WITH PASSWORD = '<enterStrongPasswordHere>';
CREATE USER debeziumuser FOR LOGIN debeziumlogin;
CREATE ROLE debeziumrole;
EXEC sp_addrolemember 'debeziumrole', 'debeziumuser';
EXEC sp_addrolemember 'db_datareader', 'debeziumuser';
```
And let’s enable change data capture on our example table:

```sql
EXEC sys.sp_cdc_enable_db;
ALTER DATABASE erp ADD FILEGROUP cdcfg;
ALTER DATABASE erp ADD FILE (
    NAME = erp_cdc_file1,
    FILENAME = '/var/opt/mssql/data/erp_cdc_file1.ndf'
) TO FILEGROUP cdcfg;
EXEC sys.sp_cdc_enable_table
    @source_schema = 'dbo',
    @source_name = 'tbltest',
    @role_name = 'debeziumrole',
    @filegroup_name = 'cdcfg',
    @supports_net_changes = 0;
```
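If you want a quick sanity check that CDC is now active, the standard SQL Server metadata views can confirm it (an optional snippet, not required for the rest of the tutorial):

```sql
-- Should return is_cdc_enabled = 1 for the erp database
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = 'erp';
-- Lists the capture instance that was created for dbo.tbltest
EXEC sys.sp_cdc_help_change_data_capture;
```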
## Setup on the CrateDB side

We will need a CrateDB instance; for the purpose of this example, we can spin one up with:

```bash
sudo apt install docker.io
sudo docker run --publish 4200:4200 --publish 5432:5432 --env CRATE_HEAP_SIZE=1g crate:latest -Cdiscovery.type=single-node
```
Now we need to run a couple of SQL commands on this instance. An easy way to do this is with the Admin UI, which can be accessed by pointing a web browser at port 4200 on the server where CrateDB is running, for instance `http://localhost:4200`, and then opening the console (second icon from the top in the left-hand navigation bar).
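Alternatively, if you prefer working from a terminal, the same statements can be run with CrateDB's `crash` shell (an assumption on my part that you have pip available to install it; not required for the tutorial):

```bash
pip install crash
crash --hosts localhost:4200
```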
We will create a user account for Debezium to use:

```sql
CREATE USER debezium WITH (password = 'debeziumpwdincratedb123');
```
The table on our MSSQL source is in the `dbo` schema. Let’s imagine we want to have a `dbo` schema on CrateDB as well; the `debezium` account will need permissions on it:

```sql
GRANT DQL, DML, DDL ON SCHEMA dbo TO debezium;
```
And let’s create the structure of the table that will receive the data:

```sql
CREATE TABLE dbo.tbltest (
    id INT PRIMARY KEY /* the PK definition needs to match the source table so that it can be used to look up records when they need to be updated */
    ,createdon TIMESTAMP /* CrateDB supports defaults -of course- but because the source table already has a default value we do not need one here */
    ,srcsystem TEXT
);
```
## Zookeeper and Kafka

To use Debezium, we will need working setups of Zookeeper and Kafka.

For the purpose of this example, I will spin them up with containers on the same machine:

```bash
sudo docker run -it --rm --name zookeeper -p 2181:2181 -p 2888:2888 -p 3888:3888 debezium/zookeeper
sudo docker run -it --rm --name kafka -p 9092:9092 --link zookeeper:zookeeper --add-host host.docker.internal:host-gateway debezium/kafka
```
We need to create some special topics in Kafka:

```bash
sudo docker exec -it kafka "bash"
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --topic my_connect_configs --bootstrap-server host.docker.internal:9092 --config cleanup.policy=compact
bin/kafka-topics.sh --create --replication-factor 1 --partitions 1 --topic my_connect_offsets --bootstrap-server host.docker.internal:9092 --config cleanup.policy=compact
exit
```
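To confirm the topics were created, you can list them (an optional check, using the same `kafka-topics.sh` tool as above):

```bash
sudo docker exec -it kafka bin/kafka-topics.sh --list --bootstrap-server host.docker.internal:9092
```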
Please note this is a very basic setup; for production purposes, you may want to adjust some of [these settings](https://kafka.apache.org/documentation/#topicconfigs).
## Prepare and start a Debezium container image

We need to customize the base `debezium/connect` Docker image by adding a JDBC sink connector and the PostgreSQL driver.

For this, we need to download the zip file from [kafka-connect-jdbc](https://www.confluent.io/hub/confluentinc/kafka-connect-jdbc) and then run the commands below, replacing `*************` with the appropriate URL:
```bash
mkdir customdockerimg
cd customdockerimg
wget *************/confluentinc-kafka-connect-jdbc-10.6.3.zip
sudo apt install unzip
mkdir confluentinc-kafka-connect-jdbc-10.6.3
cd confluentinc-kafka-connect-jdbc-10.6.3
unzip -j ../confluentinc-kafka-connect-jdbc-10.6.3.zip
cd ..
cat > Dockerfile <<EOF
FROM debezium/connect
USER root:root
COPY ./confluentinc-kafka-connect-jdbc-10.6.3/ /kafka/connect/
RUN cd /kafka/libs && curl -sO https://jdbc.postgresql.org/download/postgresql-42.5.4.jar
USER 1001
EOF
sudo docker build -t cratedb-connect-debezium .
```
Let’s now start this custom image:

```bash
sudo docker run -it --rm --name connect -p 8083:8083 \
  -e GROUP_ID=1 \
  -e CONFIG_STORAGE_TOPIC=my_connect_configs \
  -e OFFSET_STORAGE_TOPIC=my_connect_offsets \
  --add-host host.docker.internal:host-gateway \
  --add-host $(hostname):host-gateway \
  -e BOOTSTRAP_SERVERS=host.docker.internal:9092 \
  -e KEY_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
  -e VALUE_CONVERTER=org.apache.kafka.connect.json.JsonConverter \
  cratedb-connect-debezium
```

This assumes Kafka is running locally on the same server; you will need to adjust `BOOTSTRAP_SERVERS` if that is not the case.
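Before configuring any connectors, it is worth verifying that Kafka Connect is reachable and that the JDBC sink plugin was picked up; the standard Kafka Connect REST API on port 8083 can be used for this (an optional check):

```bash
curl http://localhost:8083/
curl http://localhost:8083/connector-plugins
```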
## Configure a source connector

Let’s create a `connector.json` file as follows:
```json
{
  "name": "mssql-source-tbltest",
  "config": {
    "connector.class": "io.debezium.connector.sqlserver.SqlServerConnector",
    "tasks.max": "1",

    "database.history.kafka.bootstrap.servers": "host.docker.internal:9092",
    "schema.history.internal.kafka.bootstrap.servers": "host.docker.internal:9092",
    "topic.prefix": "cratedbdemo",
    "database.encrypt": "false",

    "database.hostname": "host.docker.internal",
    "database.port": "1433",
    "database.user": "debeziumlogin",
    "database.password": "<enterStrongPasswordHere>",
    "database.server.name": "mssql-server",

    "database.names": "erp",
    "table.whitelist": "dbo.tbltest",
    "database.history.kafka.topic": "schema-changes.mssql-server.tbltest",
    "schema.history.internal.kafka.topic": "schema-changes.inventory.mssql-server.tbltest"
  }
}
```
The settings here cover the Kafka setup to use, the details to connect to MSSQL, the name of the table that we want to pull changes from, and the Kafka topics that will be used to track these changes.

Let’s deploy this:

```bash
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @connector.json
```
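If the deployment succeeded, the connector's status can be queried through the same REST API (the connector name matches the `name` field in `connector.json`):

```bash
curl http://localhost:8083/connectors/mssql-source-tbltest/status
```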
## Configure a target

Let’s create a `destination-connector.json` file as follows:
```json
{
  "name": "cratedb-sink-tbltest",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "tasks.max": "1",

    "connection.url": "jdbc:postgresql://host.docker.internal:5432/",
    "connection.user": "debezium",
    "connection.password": "debeziumpwdincratedb123",

    "topics": "cratedbdemo.erp.dbo.tbltest",
    "table.name.format": "dbo.tbltest",
    "auto.create": "false",
    "auto.evolve": "false",

    "insert.mode": "upsert",
    "pk.fields": "id",
    "pk.mode": "record_value",

    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState"
  }
}
```
Here we have the details to connect to CrateDB, the name of the table that will receive the changes (please note this is case-sensitive), and some transform instructions to flatten the JSON data stored in the Kafka topic.

Let’s deploy this:

```bash
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @destination-connector.json
```
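At this point both connectors should be registered; listing them is a quick way to confirm (an optional check):

```bash
curl http://localhost:8083/connectors
```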
## Testing

Let’s see this in action.

Let’s create a record from the MSSQL side:

```sql
INSERT INTO erp.dbo.tbltest (srcsystem) VALUES (@@version);
```
And now let’s go to CrateDB and check the table:

```sql
SELECT * FROM dbo.tbltest;
```

As if by magic, the record is there.
Let’s now try an update from the MSSQL side:

```sql
UPDATE erp.dbo.tbltest
SET srcsystem = 'Updated successfully'
WHERE id = 1;
```

![The updated record replicated to CrateDB](https://us1.discourse-cdn.com/flex020/uploads/crate/original/1X/4476976451e1f943112082a7d1e0cd36524740a1.png)
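To check the result yourself on the CrateDB side, query the table again (replication is asynchronous, so allow a moment for the change to arrive):

```sql
SELECT id, srcsystem FROM dbo.tbltest WHERE id = 1;
```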
## Conclusion

Using Debezium, we can replicate changes from different database systems to CrateDB without having to develop any custom logic, and we can then take advantage of CrateDB’s performance and features for our analytical workloads.
