cart-elc

Source code for CART-ELC
git clone git://git.laack.co/cart-elc.git
Log | Files | Refs | README | LICENSE

README.md (4123B)


      1 # CART-ELC
      2 
      3 CART-ELC is an oblique decision tree induction algorithm. Our approach has a comparatively high computational cost, but the resulting models often have higher classification accuracies and better interpretability. Given the computational cost of our algorithm, it is best applied to small datasets ($n < 5000$). 
      4 
      5 Our paper is currently under review at TMLR, but is accessible on [arXiv](https://doi.org/10.48550/arXiv.2505.05402). 
      6 
      7 This repository contains the following:
      8 
      9 ### Implementations
     10 
     11 | Model       | Language        | Notes                                                  |
     12 |-------------|-----------------|--------------------------------------------------------|
     13 | CART-ELC    | C++ / Python    | Includes C++ tests                                     |
     14 | HHCART(A)   | Python          | See [HHCART](https://doi.org/10.48550/arXiv.1504.03415)|
     15 | HHCART(D)   | Python          | See [HHCART](https://doi.org/10.48550/arXiv.1504.03415)|
     16 
     17 ### Results
     18 
     19 - Experiment scripts for HHCART(A), HHCART(D), and CART-ELC
     20 - Experimental results for CART-ELC
     21 
     22 ### Documentation
     23 
     24 - Usage instructions / implementation details
     25 - Source files for our paper
     26 - Source files for our presentation
     27 
     28 # Usage
     29 
     30 ## Dependencies
     31 
     32 - Python 3.8+
     33 - NumPy
     34 - scikit-learn (for loading datasets)
     35 - pandas
     36 - A C++ compiler (e.g., 'g++')
     37 - Make
     38 - Graphviz (for rendering dot output)
     39 - Setuptools
     40 - pybind11
     41 - python3-dev
     42 
     43 If you encounter issues on Linux, please feel free to open an issue.
     44 
     45 ## Run Experiments
     46 
     47 **Note:** Running the full experiment suite may take a significant amount of time. Additionally, you will need to create a 'datasets' directory in the root of the repository and populate it with the 'housing.csv', 'diabetes.csv', 'cancer.csv', 'dim.csv', and 'bright.csv' files corresponding to the datasets described in our paper.
     48 
     49 To run the CART-ELC experiments from our paper, from the root of the repository, run:
     50 
     51 ```
     52 make experiments
     53 ```
     54 
     55 This will compile the C++ code, move the compiled binary to the directory with the python code, and run all of our python experiments in series.
     56 
     57 ## Fit and Predict
     58 
     59 To use our implementation of CART-ELC you will first need to compile the C++ code for use with our python bindings. To do this, from the root of the repository, run 
     60 
     61 ```
     62 make so
     63 ```
     64 
     65 This compiles the shared object and places it in the 'examples/cart-elc-experiments' directory. You can then cd to this directory and you should see a new file named 'decision_tree.so'. You should also see other .py files. Each of these files has been configured to read in a dataset, call the fit method, and call the predict method. These .py file should be sufficient to understand how CART-ELC can be used. Additionally, these files are the files we used for the experiments in our paper and can thus be ran individually to validate our findings.
     66 
     67 ## Python Interfaces
     68 
     69 Shown below are the interfaces for the 'ELCClassifier' object:
     70 
     71 ```python
     72 class ELCClassifier:
     73     def __init__(self, depth: int, combinations: int, threadCount: int, objectiveFunction: Optional[str] = None) -> None: ...
     74     def fit(self, X: np.ndarray, samples: int, y: np.ndarray, features: int) -> None: ...
     75     def predict(self, X: np.ndarray, samples: int, features: int) -> np.ndarray: ...
     76     def getDot(self) -> str: ...
     77     def getSplits(self) -> int: ...
     78 ```
     79 
     80 The primary distinction between sklearn's implementations of decision trees and ours is that we require the consumer to pass in the sizes for the arrays being passed to the classifier. This was done for ease of development as under the hood we are primarily using raw c-style arrays that don't have a size attribute.
     81 
     82 We also note getDot differs slightly from sklearn's plot_tree implementation. When called, the object returns a dot language string. This string can then be rendered using Graphviz.
     83 
     84 
     85 # Acknowledgments
     86 
     87 This project uses the Eigen library, a third-party C++ library for linear algebra. We are not affiliated with or endorsed by the developers or maintainers of the Eigen library. The use of Eigen in this project is for research and educational purposes only.