Industrial Session – LOD 2019 Challenge

As companies today are becoming increasingly “data driven”, this year we introduced an Industrial Session to encourage researchers and companies to interact with each other. The basic idea is to create an environment where companies can present their vision, approach and issues on Big Data related topics, while researchers can better tailor their results to industrial needs.

Thus we encourage both researchers and companies to submit original papers on issues that mostly cover, but are not limited to, areas such as:

  • Real industrial Big Data/AI-based use cases
  • Big Data solutions
  • Industrial applications

Furthermore, participating companies may apply for a 10-minute oral talk to present their Big Data related issues (no paper submission is required for this activity). Following these presentations, ad-hoc meetings between companies and researchers will be encouraged and organized by LOD itself to address specific company issues. The idea is to create a unique environment where companies discuss their Big Data issues with researchers in order to identify solutions to their problems and, at the same time, to offer researchers the opportunity to develop new algorithms and solutions for real Big Data problems.

LOD 2019 Big-Data Challenge

Our sponsor, Neodata Lab, will offer a prize of €2000 to the applicant who develops the most accurate “approximate SQL-like query answering system” on a real dataset.

https://github.com/Neodata-Group/LOD-2019-challenge

In order to participate in this contest (or if you have any inquiries about the challenge), please send an email to the following address:

lod2019challenge@neodatagroup.com

specifying your full name and affiliation. We will contact you with directions on how to download sample data.

LOD 2019 Challenge Specifications


All applicants will be given a link to download a large, real, encrypted dataset. The dataset is composed of two tables: a USER table with about 60M unique user ids and an ACTION table with a total of about 600M rows. Both tables are given in tab-separated format. The USER table is about 2.2 GB, while the ACTION table is about 35 GB.
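Since the raw ACTION table is larger than the evaluation machine's RAM, contestants will most likely want to stream it rather than load it whole. A minimal sketch of such a streaming pass (the file path and the two-column demo layout are assumptions for illustration; the real tables have their own schemas):

```python
import csv

def stream_rows(path, delimiter="\t"):
    """Yield one row at a time so a multi-GB TSV never has to fit in memory."""
    with open(path, newline="") as f:
        for row in csv.reader(f, delimiter=delimiter):
            yield row

def count_rows(path):
    """Single streaming pass; useful as a sanity check on a downloaded table."""
    return sum(1 for _ in stream_rows(path))
```

A full pass over 35 GB is exactly what the time budget rules out at query time, which is why pre-processing (see the FAQ) matters.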

Applicants are required to develop an algorithm that provides an approximate answer to an “SQL-like GROUP BY” query within a given time frame (a few seconds). In short, a query should run no longer than the given time frame and should return the best approximation of the specified counters (more details below).
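One common way to trade accuracy for a hard time budget is to scan rows in random order until the deadline and scale the observed match fraction up to the full table. A hedged sketch of that idea (the in-memory row list and the boolean predicate are placeholders for the real table scan and the parsed condition):

```python
import random
import time

def approximate_count(rows, predicate, timeout_s):
    """Scan rows in random order until the time budget expires, then
    extrapolate the matching fraction to the whole table."""
    deadline = time.monotonic() + timeout_s
    seen = matched = 0
    for row in random.sample(rows, len(rows)):  # random scan order
        if time.monotonic() >= deadline:
            break
        seen += 1
        matched += bool(predicate(row))
    if seen == 0:
        return 0
    return round(len(rows) * matched / seen)
```

With a generous budget the scan completes and the answer is exact; with a tight one, the error of the estimate shrinks roughly with the square root of the number of rows sampled.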

Please find all challenge rules below:

  • The algorithm has to be written in one of the following programming languages: C, Java, Python, R, Scala, Go (if you want to use a different language please contact us);
  • The algorithm will receive as input (1) a “condition” C which applies to the given schema and (2) a “time-out in seconds” T. The algorithm has to terminate within T seconds and should return the most precise count of the number of tuples matching C. The condition C is similar to a standard SQL WHERE clause and can contain any standard SQL operators (including SUBSTRING). Each contestant can choose his/her own syntax for C as long as all standard SQL WHERE operators can be specified. The condition C can span the two input tables (USER and ACTION), which are joinable on the user_id field.
  • The final comparison among all algorithms will take place during the conference. All algorithms will be evaluated on the same hardware platform: a cloud-hosted Linux virtual machine with 8 cores, 56 GB RAM and a 1 TB SSD disk. The algorithms can run as a standalone process or as a distributed system (preferred); all libraries need to be included or publicly available.
  • By July 19th 2019, all applicants have to submit to lod2019challenge@neodatagroup.com all of the following:
    • An executable batch file that installs all libraries and tools necessary to support algorithm execution. Please make sure that no special operation, other than running the script, is required. In case of errors or failed execution, the organizers may decide to exclude the applicant from the contest.
    • An extended abstract (6 pages), following the conference formatting guidelines, that describes the algorithm and the approach in full detail. This abstract will be published in the conference proceedings.
    • The complete code.
    • Any other support file(s) necessary to execute the algorithm.
  • On July 5th 2019, all applicants will receive a set of queries Q, the same set for everyone. These are the queries that will be used at conference time to evaluate all algorithms. The algorithm that, on average, provides the most accurate answers will win the competition. Please note that during the competition the queries Q will be evaluated against arbitrary time-outs.
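The rules above fix only the contract: the program receives a condition C, in a syntax of the contestant's choosing, plus a time-out T, and must emit a count before T expires. A toy illustration of one such contract, assuming a made-up mini-grammar of AND-separated `field op value` clauses in which CONTAINS stands in for SQL's SUBSTRING:

```python
import time

def run_query(condition, timeout_s, rows, schema):
    """Count rows satisfying an AND-separated list of `field op value`
    clauses, stopping early if the time budget runs out."""
    deadline = time.monotonic() + timeout_s
    clauses = [c.split() for c in condition.split(" AND ")]
    count = 0
    for row in rows:
        if time.monotonic() >= deadline:
            break  # partial scan: a real entry would extrapolate here
        ok = True
        for field, op, value in clauses:
            cell = row[schema.index(field)]
            if op == "=" and cell != value:
                ok = False
            elif op == "CONTAINS" and value not in cell:
                ok = False
        count += ok
    return count
```

A real submission would of course need the full WHERE-operator set and the USER/ACTION join on user_id; this sketch only shows the input/output shape being evaluated.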

IMPORTANT DATES:

  • Evaluation queries available on: July 5, 2019
  • Submission of full system and abstract: July 19, 2019
  • Notification of Acceptance: July 30, 2019
  • Challenge presentation: September 10-13, 2019 (final date to be defined)
  • Contact: lod2019challenge@neodatagroup.com

FAQ LOD 2019 Challenge

1) Does the application return only the number of matches, or all the rows matching the query? In the latter case, which fields should the application return?
Only the number of unique users that match the given conditions is required.
2) Can the application use a relational or NoSQL database?
Relational or NoSQL databases can be used. You should then provide us with the runtime environment to be executed on the machine we will use to obtain the challenge results.
3) Can we preprocess the dataset and, more generally, prepare it for the application? Can we sort the rows, split or merge files, or store the data in another format?
You can do any kind of pre-processing you want to optimize the execution of the queries. This pre-processing should not be specific to particular queries or to this dataset, but general enough to be used proficiently to execute arbitrary queries in an optimized way.
4) Should the application work on a generic dataset, or is it fine to optimise it for the given dataset?
The dataset used to obtain the final result of the challenge will be similar to, but not the same as, the one we provided you. So you cannot optimize using particular patterns of the actual dataset, only general patterns.
5) Can the system maintain state during its execution (i.e. have memory of past queries and queried data)?
Yes, it can. Remember that the queries can vary widely.
6) Does the sentence “… the most precise counter of the number of tuples matching C …” in the challenge description mean that the number can be either an over- or an under-estimate?
Yes, an estimate is admitted, because the goal is not to obtain the exact number but something as precise as possible within a few seconds.
7) The example queries reported in the document contain the following statements: “Visiting at least 10 times the urls containing …” and “Viewing at least 10 times an advertisement …”. In these cases, do we count views occurring on different dates (rows), or do we use the fields ‘imps’ and/or ‘pageviews’ to count the number of views and/or visits? Or do we sum all the ‘pageviews’ occurring on different dates (rows)?
The sum of pageviews, imps or clicks over all rows is what is intended.
8) What is the primary key of the ACTION TABLE?
The key of the ACTION TABLE is not given. The field user_id is the foreign key that uniquely identifies users in the USER TABLE. Remember that the goal is always to count the number of unique users!
9) We noticed that some fields of the ACTION TABLE are missing. In particular, instead of 18 values/fields we found 16 values/fields for some rows; the missing fields seem to be ‘url’ and ‘domain’.
In some cases field values can be null (this is the case for url and domain).
10) We also noticed that, contrary to what is stated in the document, the urls are not anonymised in the ACTION TABLE.
In this case we simply took urls from a real set to make the challenge more realistic.
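Pulling FAQ answers 1, 7 and 8 together: the required output is the number of unique users, and “at least 10 times” means summing the relevant counter (pageviews, imps or clicks) across all of a user's rows. A small sketch of that aggregation, with a made-up two-field row layout standing in for the real 18-field ACTION schema:

```python
from collections import defaultdict

def users_with_min_pageviews(action_rows, threshold=10):
    """Sum pageviews per user across all their rows (dates), then count
    the unique users whose total meets the threshold."""
    totals = defaultdict(int)
    for user_id, pageviews in action_rows:
        totals[user_id] += pageviews
    return sum(1 for total in totals.values() if total >= threshold)
```

Note that the deduplication on user_id is essential: counting matching rows instead of distinct users would answer a different question from the one the challenge asks.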