Slice: Scalable Linear Extreme Classifiers


Himanshu Jain, Venkatesh B., Bhanu Teja Chunduri, Manik Varma

Extreme multi-label learning aims to annotate each data point with the most relevant subset of labels from an extremely large label set. Slice is an efficient 1-vs-All based extreme classifier that is specially designed for low-dimensional dense features. Slice achieves close to state-of-the-art accuracies while being significantly faster to train and predict than most other extreme classifiers. Slice can efficiently scale to datasets with as many as 100 million labels and 240 million training points. Please refer to the research paper [1] for more details.

Download Slice

This code is made available as is for non-commercial research purposes. Please make sure that you have read the license agreement in LICENSE.doc/pdf. Please do not install or use Slice unless you agree to the terms of the license.

Download Slice source code in C++ and Matlab as well as precompiled Windows/Linux binaries.

The code for Slice is written in C++ and should compile on 64-bit Windows/Linux machines using a C++11 enabled compiler. The code also uses the publicly available implementation of the HNSW algorithm to find the Approximate Nearest Neighbors. Installation and usage instructions are provided below. The default parameters provided in the Usage section work reasonably well on the benchmark datasets in the Extreme Classification Repository.

Please contact Himanshu Jain and Manik Varma if you have any questions or feedback.

Experimental Results and Datasets

Please visit the Extreme Classification Repository to download the benchmark datasets and compare Slice's performance to baseline algorithms.

Usage

Linux/Windows makefiles for compiling Slice have been provided with the source code. To compile, run "make" (Linux) or "nmake -f Makefile.win" (Windows) in the Slice folder. Run the following commands from inside the Slice folder for training and testing.

Training

C++:
./slice_train [input feature file name] [input label file name] [output model folder name] -m 100 -c 300 -s 300 -k 300 -o 20 -t 1 -f 0.000001 -siter 20 -b 0 -stype 0 -C 1 -q 0
where:
	-m = params.M                       :        HNSW M parameter. default=100
	-c = params.efC                     :        HNSW efConstruction parameter. default=300
	-s = params.efS                     :        HNSW efSearch parameter. default=300
	-k = params.num_nbrs                :        Number of labels to be shortlisted per training point according to the generative model. default=300
	-o = params.num_io_threads          :        Number of threads used to write the retrieved ANN points to file. default=20
	-t = params.num_threads             :        Number of threads used to train the ANNS data structure and the discriminative classifiers. default=1
	-b = params.b_gen                   :        Bias parameter for the generative model. default=0
	-C = params.classifier_cost         :        Cost coefficient (SVM weight) for the linear classifiers. default=1.0
	-f = params.classifier_threshold    :        Threshold value for sparsifying linear classifiers' trained weights to reduce model size. default=1e-6
	-siter = params.classifier_maxiter  :        Maximum number of iterations for training the linear classifiers. default=20
	-stype = params.classifier_kind     :        Kind of linear classifier to use. 0=L2R_L2LOSS_SVC, 1=L2R_LR (refer to Liblinear). default=0
	-q = params.quiet                   :        Quiet option to suppress progress and debugging output. 0=verbose, 1=quiet. default=0
The feature file should be in dense matrix text format and the label file in sparse matrix text format (refer to the Miscellaneous section); an illustrative toy example is sketched below.
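For illustration only, here is a hedged sketch of what tiny input files might look like, assuming the usual Extreme Classification Repository conventions (a header line giving the matrix dimensions, one line per data point, zero-based label indices); please verify the exact layout against the EUR-Lex files shipped with the toy example.

	Dense feature file (3 points, 4 features):
	3 4
	0.12 -0.56 0.33 0.08
	0.94 0.27 -0.41 0.66
	-0.05 0.88 0.17 -0.23

	Sparse label file (3 points, 5 labels):
	3 5
	0:1 3:1
	2:1
	1:1 4:1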

Testing

C++:
./slice_predict [feature file name] [model dir name] -t 1 -b 0 -q 0
where:
	-s = params.efS                     :        HNSW efSearch parameter. default=[value saved in trained model]
	-k = params.num_nbrs                :        Number of labels to be shortlisted per test point according to the generative model. default=[value saved in trained model]
	-o = params.num_io_threads          :        Number of threads used to write the retrieved ANN points to file. default=[value saved in trained model]
	-b = params.b_gen                   :        Bias parameter for the generative model. default=0
	-t = params.num_threads             :        Number of threads. default=[value saved in trained model]
	-q = params.quiet                   :        Quiet option to suppress progress and debugging output. 0=verbose, 1=quiet. default=[value saved in trained model]
	
The feature file should be in dense matrix text format (refer to the Miscellaneous section).

Performance Evaluation

Scripts for performance evaluation are available only in Matlab. To compile these scripts, execute "make" in the Tools folder from the Matlab command window.
The following command, executed from the Tools/metrics folder, outputs a struct containing all the metrics:
	[metrics] = get_all_metrics([test score matrix], [test label matrix], [inverse label propensity vector])
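As a minimal illustrative sketch (the variable names score_mat, tst_lbl_mat and inv_prop are placeholders, assumed to have been loaded into the Matlab workspace beforehand from the prediction output and the dataset's label files):

	% Illustrative sketch only. score_mat and tst_lbl_mat are assumed to be sparse
	% matrices of matching dimensions over test points and labels, and inv_prop the
	% inverse label propensity vector; only get_all_metrics is provided by the
	% Slice Tools, and the call is made from the Tools/metrics folder.
	metrics = get_all_metrics(score_mat, tst_lbl_mat, inv_prop);
	disp(metrics);    % inspect the returned struct of evaluation metrics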

Miscellaneous

Toy Example

The zip file containing the source code also includes the EUR-Lex dataset as a toy example. To run Slice on the EUR-Lex dataset, execute "bash sample_run.sh" (Linux) or "sample_run" (Windows) in the Slice folder.

References

1   H. Jain, V. Balasubramanian, B. Chunduri and M. Varma, Slice: Scalable Linear Extreme Classifiers Trained on 100 Million Labels for Related Searches, in WSDM 2019.