sql parser and type checker in haskell
You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.

todo 15KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329330331332333334335336337338339340341342343344345346347348349350351352353354355356357358359360361362363364365366367368369370371372373374375376377378379380381382383384385386387388389390391392393394395396397398
  1. State of the code:
  2. There are some big issues
  3. 1. the dialect handling for parsing is very limited
  4. 2. schema support and general catalog operations are limited
  5. 3. typechecking of query exprs is limited, and other dml and ddl is
  6. non-existent
  7. 4. there are lots of postgres specific things baked into the system
  8. 5. the typeconversion code, which handles overloaded function
  9. resolution, implicit casting, most of the typing of literals, and
  10. most of determining the precision, scale and nuillability of
  11. expressions is a complete nightmare
  12. 6. typechecking for ansi, sql server and oracle dialects are very
  13. limited and sql server and oracle are very incorrect
  14. 7. the tests are very limited
  15. 8. a lot of the parsing code is very crufty and/or suspect
  16. 9. the syntax design is a bit poor
  17. 10. some of the code is too coupled (especially between modules)
  18. Immediate in progress/provisional soon to be started tasks
  19. new catalog/ new typeid
  20. includes a bunch of feature updates:
  21. proper schemas
  22. proper unquoted id handling
  23. character sets and collations
  24. more checking on catalog updates/ddl + nice error messages
  25. (will be linked with ddl syntax eventually)
  26. restrict and cascade behaviour
  27. roles and ownership (will be extended to permissions eventually)
  28. integrate new catalog
  29. better tests, complete catalogs for ansi and sql server
  30. better tests for matching, nullability, precision, literals
  31. copy syntax and parsing from simple-sql-parser
  32. Long term tasks
  33. 0. minor issues
  34. put the special operators like between and substring in a separate
  35. namespace in the catalog and typechecking so they can co-exist with
  36. functions of the same name
  37. implicit case uses type. The syntax should not depend on this, maybe
  38. replace with something just specific to implicit cast (meaning the
  39. syntax types should not depend on any of the typechecking types)
  40. full dialect type is used all the way through the internals. it is
  41. nice to have a single dialect type for the user, but to reduce
  42. coupling internally, there should be minimal dialect types for each of
  43. lexing,parsing,pretty printing and typechecking (meaning, when we pass
  44. a dialect type to the parsing code, it shouldn't contain e.g. type
  45. checking specific fields). Maybe the lexing,parsing and pretty
  46. printing will be the same type, but don't want the parser to depend on
  47. all the code that the typechecking part of the dialect depends on
  48. the annotation type pulls in a bunch of stuff into the syntax. can the
  49. tree be parameterized on the annotation type? can the annotation type
  50. be implemented in a separate module at least? it uses some ag stuff at
  51. the moment
  52. rename some of the internal modules, types, ctors and functions,
  53. etc. to be more clear and consistent.
  54. fix the catalogs and tests for oracle and sql server so they are not
  55. based on postresql, which makes them very wrong
  56. the dialects used in the tests are a bit confused because of legacy
  57. issues. The tests need to be more methodical as well.
  58. the odbc catalog does not attach onto the current dialect
  59. properly. the typechecker should access the odbc via the dialect and
  60. not directly. Odbc support both for parsing and typechecking should be
  61. a flag on the dialects or something (and disabled by default). The
  62. odbc typechecking should work in different dialects.
  63. literals don't directly type as unknown. This should be fixed at least
  64. for postgres, and probably is a good idea for at least some of the
  65. other dialects.
  66. try to do code coverage
  67. try to set up continuous integration somewhere - mainly want to catch
  68. build failures with older ghcs
  69. can also do packdeps
  70. also: much better testing. big weak points:
  71. we don't test nullability, precision and scale much, or the typing of literals
  72. want some more tests on unusual dialects (e.g use numeric and have no
  73. int types, dialect only has one text type and it isn't char or
  74. varchar, dialect doesn't have a decimal type (affects number literals
  75. for instance)
  76. the parser tests are much less comprehensive that in the
  77. simple-sql-parser project
  78. there aren't many anomaly tests
  79. for instance, bad typing
  80. using sql which doesn't work in the current dialect/catalog
  81. there are also a bunch of places where the error messages can be made much nicer
  82. for instance:
  83. when you don't match a function it can provide similarly named
  84. functions, and if you got the name right, it can explain exactly
  85. what the overloads are, what your types are, and why it didn't
  86. match any of them (e.g. non are valid for the types, or it is
  87. ambiguous)
  88. can also consider the similar names for identifiers too
  89. can the overload matching/implicit casting be tested by generating
  90. lots of test cases using another dbms like postgres and sql server?
  91. add more qualified imports and explicit import lists
  92. get rid of the remaining calls to error
  93. expose internals only via modules called ...Internal. There are a few
  94. internal functions expose for utils or testing. These should be kept
  95. more separate
  96. more work with strings
  97. 1. data types
  98. (ansi has clear restrictions/differences between char/varchar/clob
  99. and nchar/nvarchar/nclob), postgres more or less just has text only
  100. (I think char and varchar in postgres are now more or less sugar
  101. around text).
  102. 2. character sets (we will treat a character set as an encoding)
  103. 3. collations
  104. every string will carry a character set and optional collation with it
  105. (collations have the default/implicit/explicit tag)
  106. a dialect can specify that the character set of char,varchar,clob must
  107. be from a subset of the character sets available, and nchar,
  108. nvarchar is another subset, or it can allow any character set for
  109. any string data type.
  110. you set the default character set, plus you can set the character set
  111. for a table, plus you can set the character set per column, and you
  112. can convert text from one character set to another. We will
  113. consider character sets as encodings, character repertoires will
  114. not have explicit representation in hssqlppp (you can either
  115. convert between two character sets or you can't, and the
  116. availability of a conversion is always explicit and direct).
  117. use the ansi rules for collations: each character set has a default
  118. collation. Not sure if this can be changed, or changed per schema
  119. for instance. Use the default/implicit/explicit collation rules for
  120. expressions. table columns can have an explicit collation, and you
  121. can use the collate operator in expressions (including group by and
  122. order by).
  123. information about:
  124. character sets
  125. datatype/character set compatibility
  126. maybe each datatype can have All | Whitelist [characterset] |
  127. Blacklist [characterset]
  128. collations + which character set a collation is for
  129. default collation for a character set
  130. character sets and collations for table columns and for view
  131. expression types
  132. anything else (like domains, ...)
  133. will go in the catalog
  134. make example checking automated
  135. add record syntax to createtable, check for other likely victims for
  136. this treatment
  137. review replace field in createtable, createtableas, move to correct
  138. place (and check for other areas like this)
  139. change the catalog arg in typecheck functions to be maybe, then
  140. nothing means use the default catalog for the dialect
  141. 1. rewrite the catalog code
  142. support schemas and schema search path properly
  143. support some more object types
  144. help fix the confusion with the type Type
  145. support dependencies and cascade/restrict
  146. make sure the canonicalization of names, and case handling is done
  147. properly
  148. correctness handling of ddl operations
  149. support drop and alter directly
  150. good test coverage directly on the catalog api
  151. understand some of the dialect issues wrt catalog stuff better
  152. handle the odbc option better - this needs some interaction with the
  153. general catalog stuff
  154. later want to handle permissions to some extent. not sure exactly what
  155. to do here or if the code will be in hssqlppp or sqream. We need at
  156. least parsing and catalog support for registering permissions here,
  157. even if the authorization code itself might not be here
  158. 2. rewrite the typechecking code from scratch
  159. it is a real mess
  160. lots of things missing
  161. it needs to be much more literate because it is weird, so that
  162. programmers who don't know uuagc stand a chance of working with this
  163. code
  164. there should be much clearer way of dealing with nullability,
  165. precision and scale
  166. more complete typechecking for query exprs
  167. complete typechecking for other dml and for ddl added
  168. there are some tree rewrites which can be done during
  169. typechecking. These should be refactored as separate passes instead of
  170. it being tied together
  171. the typechecking should be better as a compiler front end. The key
  172. addition is tracking where identifiers are defined, so e.g. we can
  173. easily tell which schema a table has matched to, or if a subquery is
  174. correlated.
  175. the typechecker could also check that asts don't contain syntax which
  176. isn't supported in the current dialect
  177. 3. rewrite the typeconversion code and the general nullability,
  178. precision and scale stuff
  179. this is even worse mess
  180. get rid of the hacks and special cases for e.g. datepart
  181. I think the current code tries to be too rule based. For simple
  182. reoccuring patterns, we can use some simple rules. Other than that, I
  183. think passing in functions to e.g. determine the output precision is
  184. better. The default rules and the special case functions should be
  185. connected to the catalog functions in the dialect files right next to
  186. the catalog definitions, instead of being embedded in the type
  187. conversion code itself.
  188. 4. simple-sql-parser
  189. there is a bunch of improved syntax and parsing code (and pretty
  190. printing) in the simple-sql-parser project. there can be some copy
  191. pasting into hssqlppp for now, and eventually they should be
  192. synchronized and kept the same (somehow - we can't use
  193. simple-sql-parser directly or use the source code since we must use
  194. .ag in hssqlppp and I definitely don't want to use this in
  195. simple-sql-parser).
  196. 5. improve the project documentation and examples
  197. = later tasks
  198. think about demo code to convert between dialects (especially things
  199. like types and functions which need to be desugared)
  200. try to get the old chaos sql preprocessor working again
  201. improve the quasiquotation system: maybe also switch from text back to
  202. strings (apart from the input to parsing, and the output from pretty
  203. printing)
  204. typesafe access and composing sql expressions from haskell
  205. work on ansi procedural sql + get the typechecking for procedural sql
  206. working again. Probably the most important dialect here is TSQL, but
  207. maybe plsql is also important (the plpgsql work should be revived also
  208. since the syntax is already there).
  209. synchronization with simple-sql-parser
  210. the syntaxes should be the same. Maybe there can be a mechanical
  211. conversion from the simple-sql-parser haskell syntax to the uuagc in
  212. hssqlppp
  213. simple-sql-parser should have dialect support and annotations for this
  214. to happen.
  215. can't share the code or use simple-sql-parser from hssqlppp because
  216. the syntax in hssqlppp must be written in ag, and simple-sql-parser it
  217. should not use this so that it is much more accessible.
  218. maybe replace parsec with megaparsec
  219. maybe replace it with uu-parsinglib (has incremental parsing which is
  220. really nice, and also maybe better error messages, and also a better
  221. way to handle left factoring - I think it just does it automatically
  222. more or less)
  223. use wl-pprint-text or something - maybe this will be better for the
  224. complex and identation heavy syntax?
  225. do proper solution to operator precedence parsing. would like to do an
  226. ast pass approach if possible. This is also needed at least for from
  227. clauses and set operators too, both of which almost certainly get the
  228. fixity wrong at the moment.
  229. syntax extensibility?
  230. lint for sql with plugins
  231. parameterized annotation
  232. ------------------------
  233. old notes
  234. == typechecking
  235. param query type fn
  236. rough todo/checklist for exprs:
  237. combinequeryexpr
  238. values
  239. withqueryexpr, withquery
  240. jointref variations
  241. join onexprs
  242. funtrefs
  243. table aliases
  244. aggregates, windows
  245. + add checks for aggregates/group by
  246. liftapp
  247. inlist, inqueryexpr?
  248. go through old typecheck tests and reinsert most of them?
  249. -> want to get much better coverage of typechecking
  250. start looking at getting the type errors back up to the old level
  251. to serve effectively as a front end, the parser should produce nice
  252. error messages, and the typechecking of crud should be very
  253. comprehensive
  254. what will useable non ascii character set support take? Maybe this
  255. works fine already?
  256. == small fixes
  257. see if can fix error messages from lex/parse in qq: come out all with
  258. the string passed through show so can't read the error message
  259. better approach to parsing selects - the into stuff is a mess
  260. alter the postprocess for astinternal.hs: want to add haddock line for
  261. the data type and each ctor for all the data defs in the file, but
  262. not recordize them
  263. review examples: add new examples, clean up code, and example page on
  264. website. Add simple parsing and typechecking example to index page
  265. rename examples
  266. review docs, esp. haddock which contains all sorts of shockingly bad
  267. typos, spellings, inconsistencies, etc.
  268. junk tests to get working: extensions, roundtrip?
  269. want to be ready to do comprehensive review of pg syntax support for
  270. 0.7.0, so can work through and get a reasonably comprehensive list
  271. of what is missing
  272. documentation:
  273. easy way to put in some sql, see if it parses, view the resultant
  274. ast
  275. same with typechecking: show writing a cat by hand, and examples to
  276. generate from postgresql database and from ddl sql
  277. == misc small bits
  278. idea for :: cast
  279. -> parse typenames as scalarexpressions by embeding them in
  280. numberlits (hacky but should be unambiguous) - or antictors?
  281. then transform to cast after parsing. This can then use the proper
  282. precedence in buildExpressionParser to parse ::. Also produce parse
  283. errors if after parsing, you try to cast to something which isn't a
  284. typename
  285. could do something similar for other operators: '.', [], in?
  286. add support for enum types
  287. == website nav
  288. == examples
  289. add compilation of examples to automated tests, also add tests in the
  290. documentation
  291. == report generator
  292. the idea is to have the following for experimentation, evaluate how
  293. well hssqlppp supports some existing sql, support while developing
  294. sql (possibly with syntax extensions), and generating
  295. documentation:
  296. take source sql:
  297. standard postgresql sql in text files
  298. sql taken from postgresql dump from live db
  299. syntax extended sql in text files
  300. do some or all of the following:
  301. parse and type check - report problems
  302. parse, pretty print, reparse and check
  303. generate documentation, catalog
  304. load into postgresql and double check catalog from typechecker
  305. load and dump postgresql, reparse and typecheck for changes
  306. == documentation generator for sql codebases