Qt SQL advice



enricong
1st November 2014, 02:42
My software does the following:
1. read in several large txt data files
2. perform statistics

I do this by parsing each large txt file and inserting its contents into its own SQLite db. On subsequent runs, I check whether the db file already exists; if it does, I read the db file instead of reparsing the txt file.

I have a main in-memory db.
After reading each txt file (and creating the db), I execute ATTACH and add each db to the main db. By doing all of this in a couple of transactions, I've been able to get this to go fairly quickly.
I set a very large cache size on each db to try to improve performance.
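In case it helps, the setup looks something like this (an untested sketch; the connection name and file names are made up):

#include <QSqlDatabase>
#include <QSqlQuery>

// Hypothetical sketch of the setup described above: a main in-memory db
// with each per-file database attached to it.
QSqlDatabase openMainDb()
{
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE", "main");
    db.setDatabaseName(":memory:");      // the main db lives in memory
    if (db.open()) {
        QSqlQuery q(db);
        q.exec("ATTACH DATABASE 'file1.db' AS file1");   // file names are made up
        q.exec("ATTACH DATABASE 'file2.db' AS file2");
        // tables are then reachable as file1.tablename, file2.tablename, ...
    }
    return db;
}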

The statistics I compute are mainly various SUM queries with different selections, followed by some averages. This part goes extremely slowly, and I'm trying to figure out how to speed it up.

Questions
1. Do I need to "close" each db at some point to ensure the cache is "flushed"?
2. If I just created the DB, I would think most if not all of it should be in cache. When I do an "ATTACH" and access the DB via the "main in-memory db", does it use the cache?
3. When I run my program and the DBs already exist, the entire db still has to be read from the HD, since my sums will touch basically every element at least once. Is there any way around that?
4. Is using a SQLite database even the best approach? Should I just store everything in data structures instead?

ChrisW67
1st November 2014, 03:11
If the actual SQL is the speed problem (you can check by executing the same query outside your program), then good places to look are any WHERE clauses you have and correct joins between tables. For large data sets, well-placed indexes are your friend. Without specifics it is hard to be more targeted.

If this is actually a Qt issue, then you need to show us what your code is doing that is so slow.

enricong
1st November 2014, 14:17
Actually, I am not currently doing any joins. I haven't created any indices yet; I forgot about those. Hopefully that will help. But I did have a few general questions:
1. I set up my DB with a large cache size. At what point does everything get written to disk? Immediately, once the transaction is completed?
2. If I use one db connection to create my db (with a large cache) and then attach it to a new db over a new connection, is the cache inherited or is the whole thing read back from disk?
3. I don't really understand what a transaction does. Does it just keep everything in cache and wait to write to disk until you commit? I can see how that would help with multiple INSERTs, but is it useful with multiple SELECT queries?

ChrisW67
3rd November 2014, 20:20
There are no transactions around SELECT queries, only around queries that modify data. Transaction logging is used by SQLite (or any similar RDBMS) to maintain a consistent set of data files in the face of multiple users and possible abnormal termination. When a transaction is committed, the pending changes are permanently written to the main data file on disk before control is returned. SQLite uses temporary files to track transactions in progress.
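From Qt the batching looks something like this (a sketch only; the table, column names and the Row type are invented):

#include <QList>
#include <QSqlDatabase>
#include <QSqlQuery>
#include <QVariant>

struct Row { int id; int range; int group; double value; };  // hypothetical

// Wrap all INSERTs for one txt file in a single transaction so the
// data is written to the main db file once, at commit time.
void bulkInsert(QSqlDatabase &db, const QList<Row> &rows)
{
    db.transaction();
    QSqlQuery q(db);
    q.prepare("INSERT INTO samples (id, range, \"group\", value) "
              "VALUES (?, ?, ?, ?)");
    for (const Row &r : rows) {
        q.addBindValue(r.id);
        q.addBindValue(r.range);
        q.addBindValue(r.group);
        q.addBindValue(r.value);
        q.exec();        // pending in the journal until commit
    }
    db.commit();         // one permanent write for the whole batch
}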

I do not know if there is one cache or many, but you should not need to know. The cache exists to minimise the need to re-read recently used data from disk by keeping the most recently used pages in memory. The SQLite cache content is managed internally. If you have huge tables and run operations that read them completely (e.g. SELECT MAX(blah) FROM foo; with no indexes), then no cache smaller than the table will help much.
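For what it is worth, the cache is per connection and its size can be set with a PRAGMA, e.g. (sketch; the size here is arbitrary):

// A negative cache_size is interpreted by SQLite as KiB instead of pages,
// so this asks for roughly a 200 MB page cache on this connection.
QSqlQuery q(QSqlDatabase::database("main"));
q.exec("PRAGMA cache_size = -200000");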

enricong
3rd November 2014, 23:31
Well, I have lots of large txt files that I need to parse.
I'm thinking that if I parse each file once and save it to a database file, it will be faster on future runs to read the database file instead of reparsing the txt file. I can then also delete the large txt files.

So I'm inserting data into database files (with a large cache).
Then attaching them to a memory database to do processing.

So on that first run, when the data should all still be in cache: when I do the ATTACH and read the database via the memory database, does that read come from memory or from disk?

my database structure is like:
id INTEGER
range INTEGER
group INTEGER
value REAL

I create an index on (id, range, group)

When I read, I do a:
SELECT DISTINCT id FROM database

then, for each id:
SELECT SUM(value) FROM database WHERE group=x AND id=:id AND (range > y AND range < z)

The table has about 400000 rows.
There are about 200 distinct ids and four groups, so 800 SELECT SUM queries that I then manually average.
Each query sums about 500 numbers, although this can vary between 0 and 2000, and every query reads different data.
Each SELECT query takes about 40ms, so in total about 4 seconds.
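In Qt the loop looks roughly like this (a sketch with made-up range bounds; the point is that it issues 800 separate queries):

QSqlDatabase db = QSqlDatabase::database("main");
QSqlQuery ids(db);
ids.exec("SELECT DISTINCT id FROM database");

QSqlQuery sum(db);
// "group" must be quoted because GROUP is an SQL keyword
sum.prepare("SELECT SUM(value) FROM database "
            "WHERE \"group\" = :g AND id = :id AND range > :y AND range < :z");

double total = 0;
while (ids.next()) {
    for (int g = 1; g <= 4; ++g) {           // the four groups
        sum.bindValue(":g", g);
        sum.bindValue(":id", ids.value(0));
        sum.bindValue(":y", 100);            // hypothetical range bounds
        sum.bindValue(":z", 200);
        if (sum.exec() && sum.next())
            total += sum.value(0).toDouble();  // averaged manually afterwards
    }
}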

Lesiok
4th November 2014, 07:53
For these two queries you should have two indexes: the first on (id) and the second on (group, id).

enricong
4th November 2014, 12:32
I just remembered the "GROUP BY" SQL clause.
So I think I can just do:
SELECT SUM(value) FROM database WHERE group=x AND (range > y AND range < z) GROUP BY id

Lesiok
4th November 2014, 13:24
I just remembered the "GROUP BY" SQL clause.
So I think I can just do:
SELECT SUM(value) FROM database WHERE group=x AND (range > y AND range < z) GROUP BY id
Of course, but I think it should look like:

SELECT id, SUM(value) FROM database WHERE group=x AND (range > y AND range < z) GROUP BY id

Without the id in the select list you will not know which id each sum belongs to.
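Reading that back in Qt then becomes a single query per group, something like this (sketch, same made-up bounds as before):

QSqlQuery q(QSqlDatabase::database("main"));
q.prepare("SELECT id, SUM(value) FROM database "
          "WHERE \"group\" = :g AND range > :y AND range < :z "
          "GROUP BY id");
q.bindValue(":g", 1);            // hypothetical group and bounds
q.bindValue(":y", 100);
q.bindValue(":z", 200);
q.exec();
while (q.next()) {
    int id = q.value(0).toInt();
    double sum = q.value(1).toDouble();   // one row per distinct id
    // ... feed into the average ...
}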

wysota
4th November 2014, 13:25
Could you please take your queries, launch an sqlite console against your database and execute each query prepending it with "EXPLAIN QUERY PLAN"? E.g. modify a "SELECT SUM(value) FROM database WHERE group=x AND (range > y AND range < z) GROUP BY id" to become "EXPLAIN QUERY PLAN SELECT SUM(value) FROM database WHERE group=x AND (range > y AND range < z) GROUP BY id". Paste the results here, please.

enricong
5th November 2014, 01:53
Yes, I forgot the id when I typed up that message.

I get the following with EXPLAIN QUERY PLAN:
SCAN TABLE database
USE TEMP B-TREE FOR GROUP BY


This now runs in about 80ms, so about 400x faster.
Typically I need to do this about 32 times, and overall the user is waiting about 10-15 seconds, so it is slowing down elsewhere too.
Ideally I'd like an instant response, but I think it's acceptable right now.

wysota
5th November 2014, 08:01
If you get a table scan then you are missing an index on the field in the WHERE clause (likely the 'group' column in your case).
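For example, a covering index along these lines (a sketch; the exact column order is a judgment call) would let SQLite satisfy the query from the index alone:

// Equality on "group" comes first, id keeps the GROUP BY in index order,
// and range/value make the index covering so the table is never touched.
QSqlQuery q(QSqlDatabase::database("main"));
q.exec("CREATE INDEX IF NOT EXISTS idx_group_id "
       "ON database (\"group\", id, range, value)");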

enricong
6th November 2014, 12:19
I had a bug in the code where I was creating the index. Now it's about 50% faster: down to about 30-40ms from 80ms.