Extreme multi-label learning aims to annotate each data point with the most relevant subset of labels from an extremely large label set. Slice is an efficient 1-vs-All based extreme classifier that is specially designed for low-dimensional dense features. Slice achieves close to state-of-the-art accuracies while being significantly faster to train and predict than most other extreme classifiers. Slice can efficiently scale to datasets with as many as 100 million labels and 240 million training points. Please refer to the research paper for more details.
This code is made available as is for non-commercial research purposes. Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use Slice unless you agree to the terms of the license.
Download the Slice source code in C++ and Matlab, as well as precompiled Windows/Linux binaries. The code for Slice is written in C++ and should compile on 64-bit Windows/Linux machines using a C++11 enabled compiler. The code also uses the publicly available implementation of the HNSW algorithm to find approximate nearest neighbors. Installation and usage instructions are provided below. The default parameters provided in the Usage section work reasonably well on the benchmark datasets in the Extreme Classification Repository.
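The exact build steps depend on the distribution you download; as a minimal sketch, assuming the sources compile into the two binaries used below (the source file names here are hypothetical), a Linux build might look like:

g++ -std=c++11 -O3 -fopenmp slice_train.cpp -o slice_train
g++ -std=c++11 -O3 -fopenmp slice_predict.cpp -o slice_predict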
Please contact Himanshu Jain and Manik Varma if you have any questions or feedback.
Please visit the Extreme Classification Repository to download the benchmark datasets and compare Slice's performance to baseline algorithms.
./slice_train [input feature file name] [input label file name] [output model folder name] -m 100 -c 300 -s 300 -k 300 -o 20 -t 1 -f 0.000001 -siter 20 -b 0 -stype 0 -C 1 -q 0

where:
-m = params.M : HNSW M parameter. default=100
-c = params.efC : HNSW efConstruction parameter. default=300
-s = params.efS : HNSW efSearch parameter. default=300
-k = params.num_nbrs : Number of labels to be shortlisted per training point according to the generative model. default=300
-o = params.num_io_threads : Number of threads used to write the retrieved ANN points to file. default=20
-t = params.num_threads : Number of threads used to train the ANNS data structure and the discriminative classifiers. default=1
-C = params.classifier_cost : Cost (SVM weight) coefficient for the linear classifiers. default=1.0
-f = params.classifier_threshold : Threshold value for sparsifying the linear classifiers' trained weights to reduce model size. default=1e-6
-siter = params.classifier_maxiter : Maximum number of iterations for training the linear classifiers. default=20
-b = params.b_gen : Bias parameter for the generative model. default=0
-stype = params.classifier_kind : Kind of linear classifier to use. 0=L2R_L2LOSS_SVC, 1=L2R_LR (refer to Liblinear). default=0
-q = params.quiet : Quiet option to restrict the output for reporting progress and debugging purposes. 0=no quiet, 1=quiet. default=0

The feature file should be in dense matrix text format and the label file should be in sparse matrix text format (refer to the Miscellaneous section).
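For instance, a training run with the default parameters might look as follows; the feature, label, and model folder names here are placeholders for your own files:

./slice_train trn_ft_mat.txt trn_lbl_mat.txt model -m 100 -c 300 -s 300 -k 300 -o 20 -t 1 -f 0.000001 -siter 20 -b 0 -stype 0 -C 1 -q 0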
./slice_predict [feature file name] [model dir name] -t 1 -b 0 -q 0

where:
-s = params.efS : HNSW efSearch parameter. default=[value saved in trained model]
-k = params.num_nbrs : Number of labels to be shortlisted per test point according to the generative model. default=[value saved in trained model]
-o = params.num_io_threads : Number of threads used to write the retrieved ANN points to file. default=[value saved in trained model]
-b = params.b_gen : Bias parameter for the generative model. default=0
-t = params.num_threads : Number of threads. default=[value saved in trained model]
-q = params.quiet : Quiet option to restrict the output for reporting progress and debugging purposes. 0=no quiet, 1=quiet. default=[value saved in trained model]

The feature file should be in dense matrix text format (refer to the Miscellaneous section).
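For instance, prediction with a previously trained model might look as follows; the file and folder names are placeholders:

./slice_predict tst_ft_mat.txt model -t 1 -b 0 -q 0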
Performance evaluation scripts are available in Matlab. To compute the evaluation metrics:

[metrics] = get_all_metrics([test score matrix], [test label matrix], [inverse label propensity vector])
To convert a data file from the repository format into the feature and label text formats used by Slice:

perl convert_format.pl [repository data file] [output feature file name] [output label file name]
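For example, to convert a hypothetical repository file train.txt into the two files expected by slice_train:

perl convert_format.pl train.txt trn_ft_mat.txt trn_lbl_mat.txt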
To read a text matrix into Matlab:

[matrix] = read_text_mat([text matrix name]);

To write a Matlab matrix into text format:
write_text_mat([Matlab sparse matrix], [text matrix name to be written to]);
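For reference, a minimal sketch of the text formats, assuming the same conventions as the other tools in the Extreme Classification Repository: both formats begin with a header line giving the number of rows and columns; the sparse format then lists one row per line as space-separated, 0-based index:value pairs, while the dense format lists the values themselves. A toy sparse matrix with 2 rows and 4 columns (values made up):

2 4
0:0.5 2:1.0
1:0.3 3:0.7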
[weights vector] = inv_propensity([training label matrix],A,B);

A and B are the parameters of the inverse propensity model. The following values should be used for the benchmark datasets:
Wikipedia-LSHTC: A=0.5, B=0.4
Amazon: A=0.6, B=2.6
Other: A=0.55, B=1.5
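Putting the pieces together, a minimal sketch of the evaluation workflow in Matlab, assuming placeholder file names and a non-benchmark dataset (so A=0.55, B=1.5):

score_mat = read_text_mat('score_mat.txt');     % predicted score matrix from slice_predict
tst_lbl_mat = read_text_mat('tst_lbl_mat.txt'); % ground-truth test label matrix
trn_lbl_mat = read_text_mat('trn_lbl_mat.txt'); % training label matrix for propensity estimation
wts = inv_propensity(trn_lbl_mat, 0.55, 1.5);   % A=0.55, B=1.5 is the 'Other' setting above
metrics = get_all_metrics(score_mat, tst_lbl_mat, wts);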