Feature Type

  • [ ] Adding new functionality to pandas

  • [X] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

Observation

After upgrading to Pandas 2.2.0 calling read_sql_query with a SQLAlchemy query on an SQLAlchemy connection yielded "Query must be a string unless using sqlalchemy." without any indication why or what may have broken with the upgrade.

Analysis

Tracking the problem down it was found that pandasSQL_builder (code) calls sqlalchemy = import_optional_dependency("sqlalchemy", errors="ignore") to resolve the optional SQLAlchemy package (which was installed in my case)

Due to errors="ignore" the violation of the minimal required version for SQLAlchemy does not lead to a warn or raise (code) but just silently returns with None.

This in turn lets pandasSQL_builder silently default to SQLiteDatabase(con).

Feature Description

Proposed improvement

Do not ignore the minimal version violation in import_optional_dependency in this context but make it obvious. For example by introducing an additional "errors"-mode like import_optional_dependency("sqlalchemy", errors="raise-on-version-violation").

Alternative Solutions

./.

Additional Context

No response

Comment From: rhshadrach

Thanks for the report! It looks like this was changed in #45679 to allow for other dependencies. Perhaps the logic could be changed there so that when SQLAlchemy is used, we do check that the version is satisfied.

Further investigations and PRs to fix are welcome!

Comment From: luke396

I have attempted to recreate the issue in a fresh environment with pandas==2.2.0; sqlalchemy==2.0.25, and the output seems to be correct. I may require assistance in reproducing the issue.

from sqlalchemy import create_engine
import pandas as pd

engine = create_engine("sqlite:///:memory:")

df = pd.DataFrame({
    "id": [1, 2, 3],
    "name": ["Alice", "Bob", "Charlie"]
})
df.to_sql("Person", engine, index=False)

sql_query = "SELECT * FROM Person"
result = pd.read_sql_query(sql_query, engine)
print(result)
   id     name
0   1    Alice
1   2      Bob
2   3  Charlie

Comment From: bebach

The issue is not that it works with SQLAlchemy 2 but that it gives no clue why it does not work with SQLAlchemy < 2.x

To replicate install SQLAlchemy==1.4 and rerun the code-snippet from above. This results in..

$ python pandas_test.py
D:\temp\pandas-issue-57178\pandas_test.py:10: UserWarning: pandas only supports SQLAlchemy connectable (engine/connection) or database string URI or sqlite3 DBAPI2 connection. Other DBAPI2 objects are not tested. Please consider using SQLAlchemy.
  df.to_sql("Person", engine, index=False)
Traceback (most recent call last):
  File "D:\temp\pandas-issue-57178\pandas_test.py", line 10, in <module>
    df.to_sql("Person", engine, index=False)
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\util\_decorators.py", line 333, in wrapper
    return func(*args, **kwargs)
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\core\generic.py", line 3081, in to_sql
    return sql.to_sql(
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\io\sql.py", line 842, in to_sql
    return pandas_sql.to_sql(
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\io\sql.py", line 2851, in to_sql
    table.create()
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\io\sql.py", line 984, in create
    if self.exists():
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\io\sql.py", line 970, in exists
    return self.pd_sql.has_table(self.name, self.schema)
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\io\sql.py", line 2866, in has_table
    return len(self.execute(query, [name]).fetchall()) > 0
  File "D:\temp\pandas-issue-57178\venv\lib\site-packages\pandas\io\sql.py", line 2673, in execute
    cur = self.con.cursor()
AttributeError: 'Engine' object has no attribute 'cursor'

which is surprising given that we just created the engine with SQLAlchemy. And it gives the user no clue that this is due to violation of version requirement.

It would be obvious what is the issue if the version check in import_optional_dependency raises the

            msg = (
                f"Pandas requires version '{minimum_version}' or newer of '{parent}' "
                f"(version '{version}' currently installed)."
            )

https://github.com/pandas-dev/pandas/blob/f538741432edf55c6b9fb5d0d496d2dd1d7c2457/pandas/compat/_optional.py#L152

... instead of just silently returning None (which only later leads to the miss-leading error from above.)

Comment From: luke396

Seems related #57049, whether to support Sqlalchemy 1.4 is under discussion.

Comment From: turbotimon

I think this is related. It should state that sqlalchemy <2.0 is not supported instead of "not installed".

import sqlalchemy
print("sqlalchemy Version:", sqlalchemy.__version__)
pd.read_sql("select * from osm_point limit 1", connection_string)
sqlalchemy Version: 1.4.52
...
ImportError: Using URI string without sqlalchemy installed.

Comment From: SoulofAkuma

take

Comment From: SoulofAkuma

I have so far implemented something that will raise for import_optional_dependency if the min_version of what is defined for sqlalchemy in requirements-dev.txt is not met and `isinstance(con, str).

The part about what should be done for sqlalchemy.engine.Connectable objects is a little more tricky. In an ideal case: - There should be an import error for both str and sqlalchemy.engine.Connectable if sqlalchemy is not found or does not meet the version requirements - There should be no warning if sqlalchemy is not required, no matter if it is installed

The problem is that I would either have to check whether the passed connection is of instance sqlalchemy.engine.Connectable before actually calling import_optional_dependency for sqlalchemy or would have to do the version check manually after calling/call import_optional_dependency twice to do the version check for me and raise an error.

The latter options kind of go against what import_optional_dependency is for.

But this seems to be an issue in multiple places that there is no paved path for how an optional import should be handled that should throw in case a condition is met that can only be available after the import. Any suggestions from contributors welcome, if I am missing something or someone has an idea.