Welcome to our restaurant! This site was designed as an experiment to help people understand how to use simple tools to process data locally.
The file sales.csv contains a list of purchase receipts with all the articles they contain. Each purchase is identified by the tx_id field, and each article bought has a price, item_price, and a two-letter code identifying the country.
In this recipe, you are required to produce a list with the receipts, but adding the full country name at the end. The translation between country codes and labels can be found in countries.csv.
Result should look like:
SBXWHCG156,1.99,MV,Maldives
BEIGKQD194,93.87,MV,Maldives
BYXCGLP161,4.44,MW,Malawi

Solve the exercise by sending the ante-penultimate row, sorted by item_price descending.
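One way to sketch the join is a two-pass awk script: the first pass loads the code-to-name translation, the second appends the name to each receipt. The sample rows below are hypothetical stand-ins for the real files, and the countries.csv layout (code,name) is an assumption — adjust to the real schema.

```shell
# Hypothetical sample data standing in for the real files
cat > countries.csv <<'EOF'
MV,Maldives
MW,Malawi
EOF
cat > sales.csv <<'EOF'
SBXWHCG156,1.99,MV
BEIGKQD194,93.87,MV
BYXCGLP161,4.44,MW
EOF

# NR==FNR is only true while reading the first file (countries.csv):
# load the code->name map, then append the name to each sales row
awk -F, 'NR==FNR { name[$1] = $2; next } { print $0 "," name[$3] }' countries.csv sales.csv \
  | sort -t, -k2,2 -gr    # numeric sort on item_price, descending
```

From the sorted output, tail -3 | head -1 would give you the ante-penultimate row.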
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e1 -d 'FPQINMY120,0.04,SD,Sudan'

Using awk, read the file sales.csv and produce a list of the top-ten countries with most sales. Sorting is not awk's strong suit; you may want to pipe the output into the sort command.
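A minimal sketch of the awk-plus-sort approach, using hypothetical sample rows in place of the real sales.csv; the pipe-separated output format matches the e2a example above:

```shell
# Hypothetical sample rows standing in for sales.csv
cat > sales.csv <<'EOF'
A1,1.99,MV
A2,93.87,MV
A3,4.44,MW
A4,2.00,MW
EOF

# Accumulate totals per country code in awk, then let sort do the ordering
awk -F, '{ total[$3] += $2 } END { for (c in total) printf "%s|%.2f\n", c, total[c] }' sales.csv \
  | sort -t'|' -k2 -gr | head -10
```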
Solve the exercise by sending the second row
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e2a -d 'KM|595.81'

How to prepare a Panda for cooking
Using Pandas, produce a list of the top-ten countries with most sales.
One easy way to get started is to use a Jupyter environment in Docker:
docker run -v $(pwd):/home/jovyan/ -p 8888:8888 -p 4040:4040 jupyter/scipy-notebook

Then navigate to the provided URL. Ctrl+Enter executes the current cell; use the a or b keys to add a cell above or below. Alternatively you may use the VS Code Jupyter extension, but it's a bit painful to install.
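The core of the Pandas solution is a groupby followed by a sum. A minimal sketch, assuming sales.csv is headerless with the columns described earlier (the inline sample rows are hypothetical):

```python
import io

import pandas as pd

# Hypothetical sample rows standing in for sales.csv,
# assumed headerless: tx_id, item_price, country code
raw = io.StringIO("A1,1.99,MV\nA2,93.87,MV\nA3,4.44,MW\n")
sales = pd.read_csv(raw, header=None, names=["tx_id", "item_price", "country"])

# Sum item_price per country and keep the ten largest totals
top10 = (
    sales.groupby("country")["item_price"]
    .sum()
    .sort_values(ascending=False)
    .head(10)
)
print(top10)
```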
Solve the exercise by sending the second row
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e2b -d 'KM|595.81'

The file pancake_orders.10M.csv.gz contains 10 million records, with the following columns:
┌──────────────┬────────┬────────────┬────────────────┬─────────────┐
│ ts │ price │ item_count │ source_country │ coupon_code │
│ timestamp │ double │ int64 │ varchar │ varchar │
├──────────────┼────────┼────────────┼────────────────┼─────────────┤
│ 16:36:19.794 │ 3.85 │ 1 │ BR │ 501826 │
│ 16:49:30.072 │ 6.36 │ 4 │ PY │ 2bd108 │
│ 16:51:26.371 │ 6.36 │ 1 │ MF │ cfaed6 │
└──────────────┴────────┴────────────┴────────────────┴─────────────┘

The data file can be found here: pancake_orders.10M.csv.gz. Do not decompress it; DuckDB is able to read compressed files on the fly.
You are required to get the top-ten countries with most sales, joining the data with the countries.jsonl file. You can only use DuckDB, no external tooling.
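A rough shape of the query, untested: DuckDB can query the gzipped CSV file directly and read JSON Lines with read_json_auto. The field names assumed for countries.jsonl (code, name) are a guess — adjust them to the real schema.

```sql
-- Sketch, untested: countries.jsonl field names are assumptions
SELECT c.name AS country, sum(o.price) AS total
FROM 'pancake_orders.10M.csv.gz' AS o
JOIN read_json_auto('countries.jsonl') AS c
  ON c.code = o.source_country
GROUP BY c.name
ORDER BY total DESC
LIMIT 10;
```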
Result should look like this:
country,total
Guinea-Bissau,297833.1000000053
"Cocos (Keeling) Islands",298836.7600000056
Namibia,299101.0800000052
...

Solve the exercise by sending the 4th row (Finland).
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e3 -d 'Namibia,299101.0800000052'

Optional: Try querying the CSV directly and also loading the data into a some.duckdb file. Does the latter make a difference in performance?
Optional: Save the results as Parquet and repeat the performance experiment.
Repeat the DuckDB exercise, but this time using ClickHouse:
- Run the query using clickhouse local
- Start a server and create two tables: orders and countries
- Load the data into them. You may benefit from using the Parquet format from the previous exercise. Also remember that ClickHouse is able to read compressed files on the fly.
- Run the query using clickhouse client
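For the clickhouse local step, a possible starting point is the file() table function, which reads the gzipped CSV directly. This sketch is untested and skips the countries join, so it only totals by country code:

```sql
-- Sketch, untested: run via clickhouse local --query "..."
SELECT source_country, sum(price) AS total
FROM file('pancake_orders.10M.csv.gz', 'CSVWithNames')
GROUP BY source_country
ORDER BY total DESC
LIMIT 10
```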
Solve the exercise by sending the 5th row (Peru), like in the previous exercise.
curl -sX POST https://kitchen.luisbelloch.es/api/:team/e4 -d 'Namibia,299101.0800000052'

Using jq only, read the file sales.csv and produce a list of the top-ten countries with most sales. Because reasons.
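jq does not read CSV natively, but -R slurps raw lines that you can split yourself. A sketch using hypothetical sample rows in place of the real sales.csv:

```shell
# Hypothetical sample rows standing in for sales.csv
cat > sales.csv <<'EOF'
A1,1.99,MV
A2,93.87,MV
A3,4.44,MW
EOF

# -R reads raw text, -s slurps it into one string; everything else is jq
jq -Rrs '
  split("\n") | map(select(length > 0) | split(","))
  | group_by(.[2])
  | map({ country: .[0][2], total: (map(.[1] | tonumber) | add) })
  | sort_by(-.total) | .[:10][]
  | "\(.country)|\(.total)"
' sales.csv
```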
Food Allergy Warning: Please be advised that our food may have come in contact or contain unix nuts, terminal traces, bash fish or console peanuts.
🌾 Gluten-Free