====== Downloads ======

=== RapidMiner Extensions ===

{{:rapidminer_190.jpg?nolink|}}

Our contributions for RapidMiner can be found [[rapidminer|here]].
\\

=== aspects-DB-Dataset for Focus Aspect Value (FAV) model for Explainable Subjective Interpretation ===

^ Dataset ^ Description ^
| [[http://madm.dfki.de/files/captioning_data/|aspects-DB]] | The dataset contains Aspects and the corresponding image URLs. Please follow the README for further information. |

\\

=== Datasets for Image Captioning ===

^ Dataset ^ Description ^
| [[http://madm.dfki.de/files/captioning_data/captioning_in_the_wild.zip|captioning_in_the_wild.zip]] | Crowdsourced annotations of image captions from the Yahoo Flickr Creative Commons 100 Million Dataset (YFCC100M). The dataset contains responses with respect to subjectivity, visibility, appeal and intent for around 2.2k image titles. |

\\

=== Datasets for Document Analysis ===

Here you can find the datasets that have been generated at MADM for research purposes. Detailed information about each dataset is available on its specific page.

^ Topic ^ Name ^ Description ^
| Document security | [[downloads-ds-doctor-bills|Doctor bills]] | The dataset contains genuine and forged doctor bills. Forgeries are made by re-engineering genuine documents. |
| Document security | [[downloads-ds-mic|MIC dataset]] | The dataset contains print-outs from color laser printers and copiers that show Machine Identification Codes (MIC), also known as "yellow dots" or counterfeit protection system codes. |
| Document security | [[downloads-ds-staver|StaVer dataset]] | The dataset contains scanned invoices with color logos, color text and various kinds of stamps. |
| Document security | [[downloads-ds-scandist|Scan Distortion dataset]] | The dataset contains grayscale invoices from the same source as well as copies of genuine invoices to detect and measure the scanning distortions. |
| Document security | [[downloads-ds-distorted-textline|Distorted Text-Lines dataset]] | The dataset contains synthetic grayscale document images with single-column text where the last paragraph is either rotated or misaligned. Different fonts and font sizes are used. |
| Document security | [[downloads-ds-printing-technique|DFKI Printing Technique dataset]] | This dataset contains documents printed on 7 inkjet and 13 laser printers. |

\\

=== Datasets for Image and Video Analysis ===

^ Dataset ^ Description ^
| [[http://www.dfki.uni-kl.de/~ulges/youtube-22concepts/|YouTube-22concepts]] | A dataset of YouTube video clips tagged with 22 different concepts for experiments with automatic video annotation. |

\\

=== Datasets for Audio Analysis ===

^ Dataset ^ Description ^
| [[http://audiopairbank.dfki.de|AudioPairBank]] | A Large-Scale Tag-Pair-Based Audio Dataset (385.5 hours, 1116 classes) |

\\

=== Datasets for Machine Learning ===

^ Dataset Generator ^ Description ^
| {{:downloads:dfki-bayes-data-generator-1.05.zip|}} | Python code for generating synthetic datasets with known Bayes error rate and defined statistical properties. |
| [[http://madm.dfki.de/files/sentinel/EuroSAT.zip|EuroSAT (RGB color space images)]] | EuroSAT: A land use and land cover classification dataset based on Sentinel-2 satellite images. |
| [[http://madm.dfki.de/files/sentinel/EuroSATallBands.zip|EuroSAT (all 13 bands)]] | EuroSAT: A land use and land cover classification dataset based on Sentinel-2 satellite images. |

\\

=== Datasets for Unsupervised Anomaly Detection ===

Datasets for unsupervised anomaly detection can be found below. The outlier label must not be used for detection, only for evaluation. The first row contains the column names. For the UCI datasets, permission for republication has been granted. For more information please refer to [[http://archive.ics.uci.edu/ml/]]\\
\\
**More unsupervised anomaly detection datasets for evaluation can now be found on the Harvard Dataverse: [[http://dx.doi.org/10.7910/DVN/OPQMVF|http://dx.doi.org/10.7910/DVN/OPQMVF]]**

^ Dataset ^ Records ^ Dimensions ^ % outliers ^ Description ^
| {{:downloads:dfki-artificial-3000-unsupervised-ad.zip|}} | 3000 | 2 | 1.23 | Artificial test dataset with 4 normal distributions (one of them with low density), a micro cluster and local anomalies. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711924?format=original|breast-cancer-unsupervised.csv]] | 367 | 30 | 2.72 | Modified "Breast Cancer Wisconsin (Diagnostic)" dataset from the UCI machine learning repository. Original version available [[https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+%28Diagnostic%29|here]]. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711918?format=original|pen-local-unsupervised.csv]] | 6724 | 16 | 0.15 | Modified "Pen-Based Recognition of Handwritten Digits" dataset from the UCI machine learning repository. Original version available [[https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits|here]]. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711923?format=original|pen-global-unsupervised.csv]] | 809 | 16 | 11.1 | Modified "Pen-Based Recognition of Handwritten Digits" dataset from the UCI machine learning repository. Original version available [[https://archive.ics.uci.edu/ml/datasets/Pen-Based+Recognition+of+Handwritten+Digits|here]]. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711919?format=original|shuttle-unsupervised.csv]] | 46464 | 9 | 1.89 | Modified "Statlog (Shuttle)" dataset from the UCI machine learning repository. Original version available [[http://archive.ics.uci.edu/ml/datasets/Statlog+%28Shuttle%29|here]]. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711925?format=original|satellite-unsupervised.csv]] | 5100 | 36 | 1.49 | Modified "Statlog (Landsat Satellite)" dataset from the UCI machine learning repository. Original version available [[https://archive.ics.uci.edu/ml/datasets/Statlog+%28Landsat+Satellite%29|here]]. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711920?format=original|annthyroid-unsupervised.csv]] | 6916 | 21 | 3.61 | Modified "Thyroid Disease" dataset from the UCI machine learning repository. See version "ann-thyroid". Original version available [[http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease|here]]. |
| [[https://dataverse.harvard.edu/api/access/datafile/2711916?format=original|kdd99-unsupervised-ad.csv]] | 620089 | 38 | 0.17 | Modified "KDD Cup 1999" dataset from the UCI machine learning repository. Only HTTP connections selected. Original version available [[https://archive.ics.uci.edu/ml/datasets/KDD+Cup+1999+Data|here]]. |
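
The rule that the outlier label is used only for evaluation can be illustrated with a minimal Python sketch. It rests on assumptions that this page does not confirm: the label column name ''outlier'', the label values ''o'' and ''n'', and the tiny inline sample are all made up for illustration, so please check the header row of the file you actually download. The k-nearest-neighbour distance score is just one simple detector chosen for the example, not a recommended method.

```python
import csv
import io
import math


def load_dataset(csv_text, label_column):
    """Parse CSV text whose first row holds the column names.

    Returns (features, labels). The label column is removed from the
    features so that it cannot leak into the detection step.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    features, labels = [], []
    for row in reader:
        labels.append(row.pop(label_column))
        features.append([float(value) for value in row.values()])
    return features, labels


def knn_scores(features, k=2):
    """Score each record by its mean distance to its k nearest neighbours.

    Brute force, O(n^2): only suitable for small datasets.
    """
    scores = []
    for i, point in enumerate(features):
        distances = sorted(
            math.dist(point, other)
            for j, other in enumerate(features) if j != i
        )
        scores.append(sum(distances[:k]) / k)
    return scores


def roc_auc(scores, is_outlier):
    """Probability that a random outlier scores higher than a random inlier."""
    outliers = [s for s, flag in zip(scores, is_outlier) if flag]
    inliers = [s for s, flag in zip(scores, is_outlier) if not flag]
    wins = sum(
        1.0 if o > n else 0.5 if o == n else 0.0
        for o in outliers for n in inliers
    )
    return wins / (len(outliers) * len(inliers))


# Made-up sample; column names and label values are assumptions.
SAMPLE = """x,y,outlier
0.0,0.0,n
0.1,0.0,n
0.0,0.1,n
0.1,0.1,n
5.0,5.0,o
"""

features, labels = load_dataset(SAMPLE, label_column="outlier")
scores = knn_scores(features, k=2)  # detection: labels are not used here
auc = roc_auc(scores, [label == "o" for label in labels])  # evaluation only
print(f"ROC AUC: {auc:.2f}")
```

To try this on one of the files above, read the downloaded CSV into a string and pass the real label column name from its header row. For the larger files (for example the KDD Cup 1999 subset), the brute-force neighbour search in this sketch would be far too slow and should be replaced by an indexed or approximate search.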