9 changes: 9 additions & 0 deletions docs/index.md
@@ -23,6 +23,15 @@ dataset = openml.datasets.get_dataset("credit-g") # or by ID get_dataset(31)
X, y, categorical_indicator, attribute_names = dataset.get_data(target="class")
```

Get a missing-value summary for a dataset:

```python
import openml

dataset = openml.datasets.get_dataset(31)
summary = dataset.get_missing_summary()
```
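
The returned dictionary holds the total number of missing cells under `n_missing_total` and a per-column count under `missing_per_column`. As a minimal sketch of the same computation without downloading anything, the snippet below applies it to a small made-up DataFrame (the `toy` frame and its column names are hypothetical, standing in for the data that `dataset.get_data()` would return):

```python
import pandas as pd

# Hypothetical toy data; a stand-in for the DataFrame returned by dataset.get_data().
nan = float("nan")
toy = pd.DataFrame(
    {
        "age": [22.0, nan, 35.0, nan],
        "cabin": ["C85", None, None, None],
        "fare": [7.25, 71.28, 8.05, 53.1],
    }
)

# Count missing cells per column, then add the counts up for the total.
missing_per_column = toy.isna().sum().to_dict()
n_missing_total = sum(missing_per_column.values())

summary = {
    "n_missing_total": n_missing_total,
    "missing_per_column": missing_per_column,
}
print(summary)
```

Columns without missing values still appear in `missing_per_column`, with a count of zero.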

Get a [task](https://docs.openml.org/concepts/tasks/) for [supervised classification on credit-g](https://www.openml.org/search?type=task&id=31&source_data.data_id=31):

```python
20 changes: 20 additions & 0 deletions openml/datasets/dataset.py
@@ -794,6 +794,26 @@ def get_data( # noqa: C901
        assert isinstance(y, pd.Series)
        return x, y, categorical_mask, attribute_names

    def get_missing_summary(self) -> dict:
        """Returns a missing-value summary for the dataset.

        Returns
        -------
        dict
            {
                "n_missing_total": int,
                "missing_per_column": dict
            }
        """
        df, _, _, _ = self.get_data()
        missing_per_column = df.isna().sum().to_dict()
        n_missing_total = sum(missing_per_column.values())

        return {
            "n_missing_total": n_missing_total,
            "missing_per_column": missing_per_column,
        }

    def _load_features(self) -> None:
        """Load the features metadata from the server and store it in the dataset object."""
        # Delayed Import to avoid circular imports or having to import all of dataset.functions to
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,26 @@
<oml:data_set_description xmlns:oml="http://openml.org/openml">
<oml:id>40945</oml:id>
<oml:name>Titanic</oml:name>
<oml:version>1</oml:version>
<oml:description>**Author**: Frank E. Harrell Jr., Thomas Cason
**Source**: [Vanderbilt Biostatistics](http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.html)
**Please cite**:

The original Titanic dataset, describing the survival status of individual passengers on the Titanic. The titanic data does not contain information from the crew, but it does contain actual ages of half of the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. The datasets used here were begun by a variety of researchers. One of the original sources is Eaton &amp; Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

Thomas Cason of UVa has greatly updated and improved the titanic data frame using the Encyclopedia Titanica and created the dataset here. Some duplicate passengers have been dropped, many errors corrected, many missing ages filled in, and new variables created.

For more information about how this dataset was constructed:
http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt


### Attribute information

The variables on our extracted dataset are pclass, survived, name, age, embarked, home.dest, room, ticket, boat, and sex. pclass refers to passenger class (1st, 2nd, 3rd), and is a proxy for socio-economic class. Age is in years, and some infants had fractional values. The titanic2 data frame has no missing data and includes records for the crew, but age is dichotomized at adult vs. child. These data were obtained from Robert Dawson, Saint Mary's University, E-mail. The variables are pclass, age, sex, survived. These data frames are useful for demonstrating many of the functions in Hmisc as well as demonstrating binary logistic regression analysis using the Design library. For more details and references see Simonoff, Jeffrey S (1997): The &quot;unusual episode&quot; and a second statistics course. J Statistics Education, Vol. 5 No. 1.</oml:description>
<oml:description_version>10</oml:description_version>
<oml:format>ARFF</oml:format>
<oml:upload_date>2017-10-16T01:17:36</oml:upload_date>
<oml:licence>Public</oml:licence> <oml:url>https://api.openml.org/data/v1/download/16826755/Titanic.arff</oml:url>
<oml:parquet_url>https://data.openml.org/datasets/0004/40945/dataset_40945.pq</oml:parquet_url> <oml:file_id>16826755</oml:file_id> <oml:default_target_attribute>survived</oml:default_target_attribute> <oml:tag>Data Science</oml:tag><oml:tag>History</oml:tag><oml:tag>Statistics</oml:tag><oml:tag>text_data</oml:tag> <oml:visibility>public</oml:visibility> <oml:minio_url>https://data.openml.org/datasets/0004/40945/dataset_40945.pq</oml:minio_url> <oml:status>active</oml:status>
<oml:processing_date>2018-10-04 07:19:36</oml:processing_date> <oml:md5_checksum>60ac7205eee0ba5045c90b3bba95b1c4</oml:md5_checksum>
</oml:data_set_description>
@@ -0,0 +1,19 @@
<oml:data_set_description xmlns:oml="http://openml.org/openml">
<oml:id>123</oml:id>
<oml:name>quake</oml:name>
<oml:version>1</oml:version>
<oml:description>**Author**:
**Source**: Unknown -
**Please cite**:

Dataset from Smoothing Methods in Statistics
(ftp stat.cmu.edu/datasets)

Simonoff, J.S. (1996). Smoothing Methods in Statistics. New York: Springer-Verlag.</oml:description>
<oml:description_version>1</oml:description_version>
<oml:format>ARFF</oml:format>
<oml:upload_date>2014-04-23T13:17:24</oml:upload_date>
<oml:licence>Public</oml:licence> <oml:url>https://test.openml.org/data/v1/download/123/quake.arff</oml:url>
<oml:file_id>123</oml:file_id> <oml:default_target_attribute>richter</oml:default_target_attribute> <oml:version_label>1</oml:version_label> <oml:visibility>public</oml:visibility> <oml:status>active</oml:status>
<oml:processing_date>2025-06-16 08:08:53</oml:processing_date> <oml:md5_checksum>7ede4fd775db9eae5586b2f55c6d98c6</oml:md5_checksum>
</oml:data_set_description>