Task Details

The task is to evaluate batches of join queries on a set of pre-defined relations. Each join query specifies a set of relations, (equality) join predicates, and selections (aggregations). The challenge is to execute the queries as fast as possible without (much) prior indexing.

Input to your program will be provided on the standard input, and the output must appear on the standard output.

Testing Protocol

Our test harness will first feed the set of relations to your program's standard input. That means, your program will receive multiple lines (separated by the new line character '\n') where each one contains a string which represents the file name of the given relation. The relation files are already in a binary format and thus do not require parsing. Our quick-start package already contains sample code that mmaps() a relations into main memory. The binary format of a relation consists of a header and a data section. The header contains the number of tuples and the number of columns. The data section follows the header and stores all tuples using a column storage. Hence, all of the values of a column are stored sequentially, followed by the values of the next column, and so on. The overall binary format is as follows (T0C0 stands for tuple 0 of column 0; pipe symbol '|' is not part of the binary format):

uint64_t numTuples|uint64_t numColumns|uint64_t T0C0|uint64_t T1C0|..|uint64_t TnC0|uint64_t T0C1|..|uint64_t TnC1|..|uint64_t TnCm

After sending the set of relations, our test harness will send a line containing the string "Done".

Next, our test harness will wait for 1s until it starts sending queries. This gives you time to prepare for the workload, e.g., sampling of the relations. The test harness sends the workload in batches: A workload batch contains a set of join queries (each line represents a query). A join query consists of three consecutive parts (separated by the pipe symbol '|'):

  • Relations A list of relations that will be joined. We will pass the ids of the relation here separated by spaces (' '). The relation ids are implicitly mapped to the relations by the order the relations were passed in the first phase. For instance, id 0 refers to the first relation.
  • Predicates Each predicate is separated by a '&'. We have two types of predicates: filter predicates and join predicates. Filter predicates are of the form: filter column + comparison type (greater '>' less '<' equal '=') + integer constant. Join predicates specify on which columns the relations should be joined. A join predicate is composed out of two relation-column pairs connected with an equality ('=') operator. Here, a relation is identified by its offset in the list of relations to be joined (i.e., we implicitly bind the first relation of a join query to the identifier 0, the second one to 1, etc.).
  • Projections A list of columns that are needed to compute the final check sum that we use to verify that the join was done properly. Similar to the join predicates, columns are denoted as relation-column pairs. Each selection is delimited by a space character (' ').

Example: "0 2 4|0.1=1.2&1.0=2.1&0.1>3000|0.0 1.1"
Translated to SQL:
SELECT SUM("0".c0), SUM("1".c1)
FROM r0 "0", r2 "1", r4 "2"
WHERE 0.c1=1.c2 and 1.c0=2.c1 and 0.c1>3000

The end of a batch is indicated by a line containing the character 'F'. Our test harness will then wait for the results to be written to your program's standard output. For each join query, your program is required to output a line containing the check sums of the individual projections separated by spaces (e.g., "42 4711"). If there is no qualifying tuple, each check sum should return "NULL" like in SQL. Once the results have been received, we will start delivering the next workload batch.

For your check sums, you do not have to worry about numeric overflows as long as you are using 64 bit unsigned integers.

Your solution will be evaluated for correctness and execution time. Execution time measurement starts immediately after the 1s waiting period. You are free to fully utilize the waiting period for any kind of pre-processing.

NEW: Task Hints and Constraints

  • All join graphs are connected. No cross products!
  • Cyclic queries and self joins are possible
  • The maximum number of relations per query is 4
Update (03/14/18): We have created a new (larger) public dataset for local testing. You can download it here.

Quick Start Package

We provide a quick start package in C++. It is in the format required for submission. It includes a query parser and a relation loader. This project is only meant to give you a quick start into the project and to dig right into the fun (coding) part. It is not required to use the provided code. You can create a submittable submission.tar.gz file using the included package.sh script.
For testing, we provide CSV versions (.tbl files) of each relation + SQL files to load the relations into a DBMS. We also provide a file (small.work.sql) that contains SQL versions of all queries in small.work.

Update (03/04/18): We have added a C++ reference solution to the quickstart package. It implements a basic query execution model featuring full materialization. It does not implement any query optimization. It only uses standard STL containers (like unordered_map) for the join processing. Its query processing capabilities are limited to the demands of this contest. DISCLAIMER: Although we have tested the package intensively, we cannot guarantee that it is free of bugs. Thus, we test your submissions against the results computed by a real DBMS.

Evaluation Process

Our testing infrastructure will evaluate each submission by unpacking the submission file, compiling the submitted code (using your submitted compile.sh script), and then running a series of tests (using your submitted run.sh script). Extraction time is limited to 10 seconds, and compilation time is limited to 60 seconds. Submissions that exceed these limits will not be tested at all.

Each test uses the test harness to supply an initial dataset and a workload to your submitted program. The total time for each test is limited to 4-5 minutes (different tests may have slightly different limits). The total per-test time includes the time to ingest the dataset and the time to process the workload. These per-test execution time limit may be reduced later in the contest.

Submissions will be unpacked, compiled and tested in a network-isolated container with characteristics shown in the following table:

Processor 2x Intel Xeon E5-2660 v2 CPU (2.20 GHz, 3.00 GHz turbo)
Configuration 20 Cores / 40 Hyperthreads
Main Memory 256 GB DDR3 RAM
Operating System Ubuntu 17.10 (GNU/Linux 4.13.0-32-generic x86_64)
Software Autoconf 2.69, Automake 1.15, Boost 1.62, CMake 3.9.1, Go 1.8.3, Maven 3.5.0, Node.js 6.11.4, Oracle Java 9, Python 2.7.14, Python 3.6.3, Ruby 2.3.3, Rust 1.23.0, TBB 2017~U7, ant 1.9.9, clang 5.0, gcc/g++ 7.2.0, gccgo 7.2.0, jemalloc 3.6.0
Dockerfile Dockerfile is available for download here

A submission is considered to be rankable if it correctly processes all of the test workloads within their time limits. As discussed in the testing protocol description, initial import of relations is not included in a submission's score. The leaderboard will show the best rankable submission for each team that has at least one such submission.

Submitting

The SIGMOD 2018 Programming Contest management system accepts submissions consisting of a single compressed tar file called submission.tar.gz. Submission files must be no larger than 5 MB - larger files will be rejected. Each submission must include the following files/directories:
  • run.sh
    This script is responsible for running your executable, which should interact with our test harness according to the testing protocol, reading from standard input and writing results to standard output.
  • README
    This file must contain information about the submission, including:
    1. team name
    2. for each team member: full name, e-mail, institution, department, and degree program
    3. advisor/supervisor name (if any)
    4. a brief description of the solution
    5. a list of the third party code used and their licenses
  • Source Files
    The package must contain all of the source code.
  • compile.sh
    This shell script must be able to compile the source contained in the src directory. The produced executable file(s) must be located within the src directory. The benchmark environment is isolated and without internet access. You will have to provide all files required for successful compilation.

You can use the quick start package, which has the required format, as a starting point.

Rules

  • The 2018 SIGMOD Programming Contest is open to undergraduate and graduate students from degree-granting institutions all over the world. However, students associated with the organizers' institution (Technical University of Munich) are not eligible to participate.
  • Teams must consist of individuals currently registered as graduate or undergraduate students at an accredited academic institution. A team may be formed by one or more students, who need not be enrolled at the same institution. Several teams from the same institution can compete independently, but one person can be a member of only one team. There is no limit on team size. Teams can register on the contest site after 26 February, 2018.
  • All submissions must consist only of code written by the team or open source licensed software (i.e., using an OSI-approved license). For source code from books or public articles, clear reference and attribution must be made. Final submissions must be made by 8 April 2018, 23:59 CEST (UTC+02:00).
  • All teams must agree to license their code under an OSI-approved open source license. By participating in this contest, each team agrees to publish its source code. The finalists' implementations will be made public on the contest website.
  • A team is eligible for the prize if at least one of its members will come to present the work at the SIGMOD 2018 conference. The travel grants will only be granted to eligible teams.