I love working with Python + Pandas, but sometimes working with lots of data or even loading that data into memory can be a problem.
It can be better to have a database to perform DB operations, like merges and filters, and then
do the final operations in Pandas, when the data is more manageable. For that I use Postgres + Python,
psycopg2 before loading the data into Pandas. Lets get to work!
I like to start by loading my database variables into memory using
python-dotenv. Here’s how:
- Create a
SQL_USER='your_user' SQL_PASS='your_pass' SQL_HOST='your_localhost' SQL_PORT='your_port' SQL_DB='your_db'
- Load that file into your
import os from dotenv import load_dotenv, find_dotenv load_dotenv(find_dotenv()) SQL_USER = os.environ.get('SQL_USER') SQL_PASS = os.environ.get('SQL_PASS') SQL_HOST = os.environ.get('SQL_HOST') SQL_PORT = os.environ.get('SQL_PORT') SQL_DB = os.environ.get('SQL_DB')
Then you can use these variables to connect to your database:
import psycopg2 conn = psycopg2.connect( dbname=SQL_DB, user=SQL_USER, password=SQL_PASS, port=SQL_PORT, host=SQL_HOST )
If everything went fine so far, you won’t see any error messages. The next step is to build the query we want to pass to
the database, and execute it. We can pass normal SQL queries and execute them with
pscycopg2 and then we can as a final
step load that data into
Pandas. Here is the code:
import pandas.io.sql as sqlio SQL_QUERY = "SELECT * FROM test_table WHERE id = ANY(%s)" test_ids = [1, 2, 3] result_df = sqlio.read_sql_query(SQL_QUERY, params=(test_ids,), conn)
And boom! The results of your query show up in a dataframe. You can get all the code here if you want to star my gist, or copy paste from below. Best, Jose :)