{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Working with Pyaro filters\n", "\n", "Filters in Pyaro exist to reduce or modify the amount of data delivered by a database.\n", "\n", "Pyaro has a set of build-in filters under `pyaro.filters`. In addition, engines can add\n", "additional filters for their specific engine.\n", "\n", "## Listing the default filters" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "mappingproxy({'variables': VariableNameFilter(**{'reader_to_new': {}, 'include': [], 'exclude': []}),\n", " 'stations': StationFilter(**{'include': [], 'exclude': []}),\n", " 'countries': CountryFilter(**{'include': [], 'exclude': []}),\n", " 'bounding_boxes': BoundingBoxFilter(**{'include': [], 'exclude': []}),\n", " 'flags': FlagFilter(**{'include': [], 'exclude': []}),\n", " 'time_bounds': TimeBoundsFilter(**{'start_include': [], 'start_exclude': [], 'startend_include': [], 'startend_exclude': [], 'end_include': [], 'end_exclude': []})})" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pyaro\n", "\n", "pyaro.timeseries.filters.list()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The keys of the return dictionary, i.e. variables, stations,... should be used to get a initialized filter, e.g. a\n", "country-filter selecting only Norway:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "countries\n", "CountryFilter(**{'include': ['NO'], 'exclude': []})\n" ] } ], "source": [ "norway_filter = pyaro.timeseries.filters.get('countries', **{'include': ['NO']})\n", "print(norway_filter.name())\n", "print(norway_filter)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Listing the filters of an engine" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[VariableNameFilter(**{'reader_to_new': {}, 'include': [], 'exclude': []}),\n", " StationFilter(**{'include': [], 'exclude': []}),\n", " CountryFilter(**{'include': [], 'exclude': []}),\n", " BoundingBoxFilter(**{'include': [], 'exclude': []}),\n", " TimeBoundsFilter(**{'start_include': [], 'start_exclude': [], 'startend_include': [], 'startend_exclude': [], 'end_include': [], 'end_exclude': []}),\n", " FlagFilter(**{'include': [], 'exclude': []})]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pr_csv = pyaro.list_timeseries_engines()['csv_timeseries']\n", "pr_csv.supported_filters()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Programmatic and Declarative usage of filters\n", "\n", "When opening the data-source/the database, these filters can be given as dictionary or list.\n", "The following two open-calls are identical. The first one is programmatical, while the second one\n", "is declarative. The declarative version is often preferred for use in larger programs like pyaerocom.\n", "\n", "If multiple filters are given, all filters must pass the filter tests, in other words, filters are\n", "implicitly connected by an AND operator." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[\"{'station': 'station1', 'latitude': 10.5, 'longitude': 172.5, 'altitude': 0.0, 'long_name': 'station1', 'country': 'NO', 'url': ''}\", \"{'station': 'station2', 'latitude': 45.5, 'longitude': -103.2, 'altitude': 0.0, 'long_name': 'station2', 'country': 'NO', 'url': ''}\"]\n", "[\"{'station': 'station1', 'latitude': 10.5, 'longitude': 172.5, 'altitude': 0.0, 'long_name': 'station1', 'country': 'NO', 'url': ''}\", \"{'station': 'station2', 'latitude': 45.5, 'longitude': -103.2, 'altitude': 0.0, 'long_name': 'station2', 'country': 'NO', 'url': ''}\"]\n" ] } ], "source": [ "ts = pyaro.open_timeseries('csv_timeseries', filename=\"../../tests/testdata/csvReader_testdata.csv\", filters=[norway_filter])\n", "print([str(stat) for stat in ts.stations().values()])\n", "ts = pyaro.open_timeseries('csv_timeseries', filename=\"../../tests/testdata/csvReader_testdata.csv\", filters={'countries': {'include': ['NO']}})\n", "print([str(stat) for stat in ts.stations().values()])\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filter-Usage outside of a Reader\n", "\n", "Sometimes users want to work with a existing reader with different sets of filters. Many Filters (all which inherit\n", "from DataIndexFilter) can work with an existing reader. The `FilterCollection` helps to bundle these filters." ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "all SOx data: 104\n", "filtered SOx data: 52\n" ] } ], "source": [ "ts = pyaro.open_timeseries('csv_timeseries', filename=\"../../tests/testdata/csvReader_testdata.csv\")\n", "fc = pyaro.timeseries.FilterCollection({\n", " \"countries\": {\"include\": [\"NO\"]},\n", " \"stations\": {\"include\": [\"station1\"]},\n", " })\n", "print(\"all SOx data:\", len(ts.data(\"SOx\"))) # 104\n", "print(\"filtered SOx data:\", len(fc.filter(ts, \"SOx\"))) # 52" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you want to apply several filtercollections on the same data without re-reading it from the reader\n", "you can use `FilterCollection.filter_data`, i.e. here for filtering data more explicit." ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Data-points for SOx and station1: 52\n", "Data-points for SOx and station2: 52\n" ] } ], "source": [ "# store some information the filters might need\n", "stations = ts.stations()\n", "variables = ts.variables()\n", "all_data = ts.data(\"SOx\")\n", "for station in stations.keys():\n", " fc = pyaro.timeseries.FilterCollection({\n", " \"countries\": {\"include\": [\"NO\"]},\n", " \"stations\": {\"include\": [station]},\n", " })\n", " data = fc.filter_data(all_data, stations, variables)\n", " print(f\"Data-points for {data.variable} and {data.stations[0]}: {len(data)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Filtering data without using a Filter\n", "\n", "The `Data` returned from a reader can also be sliced with a numpy-index array (boolean array with the same\n", "size as data). The follow example will only give data-points for low latitutudes <20° north (i.e. only station2,\n", "see above stations listing.)\n" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "low latitude data-points: 52\n" ] } ], "source": [ "low_lat_data = all_data.slice((all_data.latitudes < 20) & (all_data.latitudes > -20))\n", "print(\"low latitude data-points: \", len(low_lat_data))" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.12" } }, "nbformat": 4, "nbformat_minor": 2 }