Update documentation
amotl committed Oct 1, 2023
1 parent f4aefe4 commit 206f9da
Showing 4 changed files with 52 additions and 45 deletions.
4 changes: 2 additions & 2 deletions README.rst
@@ -208,11 +208,11 @@ Use a different backend (default: ``ddlgen``)::

skeem infer-ddl --dialect=postgresql --backend=frictionless data.ndjson

Reading data from stdin needs to obtain both the table name and content type separately::
Reading data from STDIN needs to obtain both the table name and content type separately::

skeem infer-ddl --dialect=crate --table-name=foo --content-type=ndjson - < data.ndjson

Reading data from stdin also works like this, if you prefer to use pipes::
Reading data from STDIN also works like this, if you prefer to use pipes::

cat data.ndjson | skeem infer-ddl --dialect=crate --table-name=foo --content-type=ndjson -

78 changes: 35 additions & 43 deletions doc/backlog.rst
@@ -54,10 +54,6 @@ Features
- Example: https://s3.amazonaws.com/crate.sampledata/nyc.yellowcab/yc.2019.07.gz
- https://github.com/leenr/gzip-stream

Bugs
====
- [x] Why is "frictionless" resource being read twice?

Documentation
=============
- [x] Inline code comments
@@ -70,10 +66,10 @@ Documentation
Infrastructure
==============
- [o] Add "examples" to test suite
- [o] CI/GHA
- [o] Docker build & publish
- [x] CI/GHA
- [x] Docker build & publish
- [o] Docs: RTD
- [o] Release 0.1.0
- [x] Release 0.1.0
- [o] Issue: Hello world

Quality
@@ -93,46 +89,38 @@ Formats
Iteration 3
***********

Bugs
====
- Source url: https://docs.google.com/spreadsheets/d/e/2PACX-1vTyMYzq-Gh8dbMhID8XzDqwwmY2e8ahw9VRM_yLMT2_hz3XzR-rCLoFAU2Qdo2v4_IgnjurwW1c85E_/pub?gid=0&single=true&output=csv
Destination table: my_import_data

Next steps
==========
- [o] Docs: Improve "library use" docs re. ``ContentType``.
- [o] Docs: Add list of supported databases. /cc @seut
- [o] Option to suppress ``NOT NULL`` constraint. /cc @seut
- [o] Different kinds of sampling methods? /cc @seut
- [o] Performance considerations / HTTP server

Formats
=======
- [o] Format: TSV
- [o] Format: Add Zarr (.zarr) input format
- [o] Format: Add JSON5, YAML, TOML input formats
- [o] Format: Partitioned Geoparquet
https://github.com/gadomski/chalkboard/blob/main/notebooks/isd-demo.ipynb
- [o] Format: dBase and friends
- [o] Format: Lance and ORC. -- https://github.com/eto-ai/lance
- [o] Format: CSV without headers: https://commonscreens.com/?page_id=1492


Bugs
====
- [o] Why is "ddlgen" resource being read twice? See ``_eval_lineprotocol``.
=> Workaround: Add ``@cachetools.func.lru_cache``
- [o] Can get hogged on resources like. Resolve: Automatically download before working on it.

- https://www.unidata.ucar.edu/software/netcdf/examples/sresa1b_ncar_ccsm3-example.nc
- s3://fmi-gridded-obs-daily-1km/Netcdf/Tday/tday_2022.nc
- [o] WMI_Lear.nc has "time" as "TIMESTAMP", but "sresa1b_ncar_ccsm3-example.nc" uses "TEXT"
- [o] Does not detect semicolon as field delimiter

- https://archive.sensor.community/2015-10-01/2015-10-01_ppd42ns_sensor_27.csv
- [o] FrictionlessException: [source-error] The data source has not supported or has inconsistent contents: The HTTP server doesn't appear to support range requests. Only reading this file from the beginning is supported. Open with block_size=0 for a streaming file interface.

- https://archive.sensor.community/parquet/2015-10/ppd42ns/part-00000-77c393f3-34ff-4e92-ad94-2c9839d70cd0-c000.snappy.parquet
- [o] RuntimeError: OrderedDict mutated during iteration

- s3://openaq-fetches/realtime/2023-02-25/1677351953_eea_2aa299a7-b688-4200-864a-8df7bac3af5b.ndjson

- [o] Compute Engine Metadata server unavailable on attempt 1 of 3. Reason: timed out
- [o] Failed to decode variable 'valid_time': unable to decode time units 'seconds since 1970-01-01T00:00:00' with "calendar 'proleptic_gregorian'". Try opening your dataset with decode_times=False or installing cftime if it is not installed.
- https://github.com/gadomski/chalkboard/blob/main/notebooks/isd-demo.ipynb
- [o] Format: dBase and friends
- [o] Format: Lance and ORC.

- https://dd.weather.gc.ca/analysis/precip/hrdpa/grib2/polar_stereographic/06/CMC_HRDPA_APCP-006-0100cutoff_SFC_0_ps2.5km_2023012606_000.grib2
- [o] ``HTTP/1.1 403 Forbidden`` gets masked badly
- [o] Fix ``cat foo | --backend=fl -``
- [o] ``logger.warning`` will emit to STDOUT when running per tests
- [o] RecursionError: maximum recursion depth exceeded
``skeem infer-ddl --dialect=crate --content-type=ndjson --backend=frictionless - < tests/testdata/basic.ndjson``
- https://github.com/eto-ai/lance
- https://eto-ai.github.io/lance/notebooks/quickstart.html
- [o] Format: CSV without headers: https://commonscreens.com/?page_id=1492
- [o] Format: Pickled embeddings like https://huggingface.co/flair/ner-german-large/resolve/main/pytorch_model.bin
- [o] Format: InfluxDB line protocol files also available in compressed format (gzip, more?)
``influxd inspect export-lp lalala --compress``
- [o] Format: CBOR, MessagePack: https://github.com/remarshal-project/remarshal
- [o] Format: EDN and Transit: https://github.com/borkdude/jet

Features
========
@@ -143,10 +131,8 @@ Features
- [o] Library: Derive schema directly from pandas DataFrame, or others
- [o] IO: Export to descriptor and/or schema
- [o] Resource caching with fsspec? -- https://github.com/blaylockbk/Herbie/pull/153/files

Documentation
=============
- [o] Improve "library use" docs re. ``ContentType``
- [o] Improve data type detection. e.g. heuristically infer ``ts`` columns. See
https://gist.github.com/seut/497ef886db8755f9c8f27959e197149f

General
=======
@@ -179,6 +165,9 @@ General
- [o] Provide options to control sample size
- [o] Startup time is currently one second. Can this be improved?
- [o] Add support for "InfluxDB annotated CSV" input format

- https://docs.influxdata.com/influxdb/v2.6/reference/syntax/annotated-csv/
- https://docs.influxdata.com/influxdb/v2.6/reference/syntax/annotated-csv/extended/
- [o] Load Parquet files efficiently from S3
- [o] Unlock more fsspec sources

@@ -225,10 +214,13 @@ Iteration 4

- Arrow / Datafusion
- Dask
- Fugue
- Ibis: https://github.com/ibis-project/ibis
- Lance
- Modin
- Pandas
- Polars
- Ray
- Spark
- Vaex: https://github.com/vaexio/vaex
https://vaex.io/blog/8-incredibly-powerful-Vaex-features-you-might-have-not-known-about
10 changes: 10 additions & 0 deletions doc/notes.rst
@@ -120,3 +120,13 @@ Substrait
- https://github.com/substrait-io/substrait-java
- https://github.com/apache/arrow-datafusion-python/pull/145
- https://github.com/duckdblabs/duckdb-substrait-demo


Misc
====
- https://github.com/toddwschneider/nyc-taxi-data
- https://github.com/taichi-dev/taichi
- Vaex' ``infer_schema``
https://github.com/vaexio/vaex/blob/652937db59ef099a42ad650cdb19567dcbe1905a/packages/vaex-core/vaex/csv.py#L231-L292
- https://vaex.io/docs/guides/io.html#Text-based-file-formats
- https://vaex.io/blog/8-incredibly-powerful-Vaex-features-you-might-have-not-known-about
5 changes: 5 additions & 0 deletions doc/test-data.rst
@@ -15,6 +15,10 @@ Development
- https://www.kaggle.com/datasets
- https://github.com/earthobservations/testdata
- https://dd.weather.gc.ca/climate/observations/daily/csv/YT/
- https://srv.demo.crate.io/datasets/power_consumption.json
- https://srv.demo.crate.io/datasets/home_data_aa.csv
- https://srv.demo.crate.io/datasets/home_data_ab.csv


Production
==========
@@ -26,3 +30,4 @@ Production
- csv-to-lineprotocol: https://dganais.medium.com/getting-started-writing-data-to-influxdb-54ce99fdeb3e
- https://github.com/pandas-dev/pandas/issues/36688
- https://github.com/earthobservations/testdata
- https://www.kaggle.com/datasets/cesaber/spam-email-data-spamassassin-2002
