
Filtering iterators



akiross
31st August 2011, 14:16
Hello,
In my program I have a data model, and to access the data I'm using java-style iterators.
Since the data has to be filtered somehow, I was thinking about using iterator filtering.

In other words, given a java-style iterator, I would build another, custom iterator, say FilteredIterator, which takes the original iterator and the filter parameter. The filtering is applied automatically when using the hasNext() and next() methods on the FilteredIterator.

Does something like this already exist? Is there some abstract type or guideline I should follow when implementing java-style iterators? Do you think this approach is OK? I couldn't find much information about this kind of iterator and, having experience with STL-like iterators, I'm not sure how to proceed.

I have some ideas on how to do it (just implement a class with the required methods), but I'd appreciate any advice on this topic.
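To make the idea concrete, here is a minimal sketch of the wrapper I have in mind (the class names, the std::vector standing in for my real model, and the std::function predicate are all placeholders invented for this post, not my real code):

```cpp
#include <functional>
#include <vector>

// A java-style iterator over a std::vector<int>, used as the "original" iterator.
class IntIterator {
public:
    explicit IntIterator(const std::vector<int> &data) : _data(data), _pos(0) {}
    bool hasNext() const { return _pos < _data.size(); }
    int next() { return _data[_pos++]; }
private:
    const std::vector<int> &_data;
    std::size_t _pos;
};

// Wraps the original iterator and skips elements rejected by the predicate.
// hasNext() advances the inner iterator until it finds a match, and caches it.
class FilteredIterator {
public:
    FilteredIterator(IntIterator iter, std::function<bool(int)> pred)
        : _iter(iter), _pred(pred), _hasCached(false), _cached(0) {}
    bool hasNext() {
        while (!_hasCached && _iter.hasNext()) {
            int candidate = _iter.next();
            if (_pred(candidate)) {
                _cached = candidate;   // remember the match for the next() call
                _hasCached = true;
            }
        }
        return _hasCached;
    }
    int next() {
        hasNext();           // make sure a matching element is cached
        _hasCached = false;  // consume it
        return _cached;
    }
private:
    IntIterator _iter;
    std::function<bool(int)> _pred;
    bool _hasCached;
    int _cached;
};
```

Usage would then be the usual hasNext()/next() loop, with the filter applied transparently.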

Thanks in advance for any help
~Aki

high_flyer
31st August 2011, 14:19
I am not sure what exactly you mean by "filter".
I think it will be less efficient to filter on each iterator call than to filter your set first and then use a regular iterator on the filtered set, which then contains only the wanted elements.

wysota
31st August 2011, 14:31
I think this might work, but only for one-way iterators. In the general case I agree with Daniel that it would be too slow. You could cache some filtering results in the iterator, which would make it faster, but nevertheless I see little use for such a concept (the concept itself seems fine, though). I might suggest another approach: create a filtering container and use foreach (like I did here: http://blog.wysota.eu.org/index.php/2007/04/27/domelementcontainer/). On the other hand, I don't know if that approach is more useful than yours. The practical difference is that the filtering can be done while the container is created (once!) and not while it is iterated. Of course you could do the filtering in the constructor of your iterator as well; I think the two approaches are more or less equivalent.
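In spirit (this is not the actual DomElementContainer code, just a generic sketch with made-up names), the filtering container looks roughly like this; the important part is that the filtering happens once, at construction time:

```cpp
#include <vector>

// The filtering pass runs once, in the constructor; iterating afterwards
// is a plain loop over the pre-filtered copy, with no per-step checks.
template <typename T, typename Pred>
class FilteredContainer {
public:
    FilteredContainer(const std::vector<T> &source, Pred pred) {
        for (const T &item : source)
            if (pred(item))
                _items.push_back(item);
    }
    typename std::vector<T>::const_iterator begin() const { return _items.begin(); }
    typename std::vector<T>::const_iterator end() const { return _items.end(); }
private:
    std::vector<T> _items;
};
```

foreach (or a range-for, as here) then only ever sees matching elements, and iterating the container twice does not repeat the filtering.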

akiross
31st August 2011, 15:37
@high_flyer, in this case I mean that an iterator will return true to a hasNext() call if the next() object is considered relevant from the user's perspective. For example, given a collection of natural numbers, an iterator that filters out odd numbers will return only even numbers from the next() call. In other words, when next() is called, the element is evaluated for relevance and, if it is relevant, it is returned; otherwise the following elements are considered iteratively.

About speed: of course a cache could be made, but it would be done on the user's side, instead of by the container providing the iterators. I won't go into details about why I'm trying to use such a technique; it's mostly about style :) But it's not really slower and, actually, it saves me a few loops, so I think that despite the overhead, it may be faster for larger collections.

@wysota: I think I'm missing something, because your container looks like what I want to do, but I don't see how you're filtering the content once: you return the iterator, and the intermediate result is not saved anywhere (it seems; I don't know how the DOM works in Qt). From your code, it seems that every time you call begin_const(), an iterator is built from the first element, and the filtering is done every time. In fact, you're retrieving the next element using

m_e = m_e.nextSiblingElement(m_t);
which is the actual filtering. If you didn't have such a method (say you had to check the element tag yourself), you would iterate over the sibling elements looking for the next one that matches the tag, for example like this (I'm inventing the methods):


const_iterator &operator++() {
    do {
        m_e = m_e.nextSibling().toElement(); // invented: advance to the next sibling
    } while (!m_e.isNull() && m_e.tagName() != m_t);
    return *this; // a null m_e makes this the end iterator
}

Caching the result (to do the filtering once) would require saving the items you find in the iterator and using the saved copy until the original data changes. So, unless the Qt DOM implementation is doing that for you, I can't see how you're filtering once.

Anyway thanks for your suggestions,
I'm implementing it now and am just in the testing phase :) I'll let you know how it works out.

high_flyer
31st August 2011, 15:46
@high_flyer, in this case I mean that an iterator will return true to a hasNext() call if the next() object is considered relevant from the user's perspective. For example, given a collection of natural numbers, an iterator that filters out odd numbers will return only even numbers from the next() call. In other words, when next() is called, the element is evaluated for relevance and, if it is relevant, it is returned; otherwise the following elements are considered iteratively.
I understood perfectly what you mean.
But you are probably not aware of what it actually means to do what you want.
For the iterator to "evaluate", it needs to access all the elements until it reaches one which fits the filter.
This is not efficient.
Again, it's more efficient to filter the set first, and then iterate normally over the resulting list.


About speed: of course a cache could be made, but it would be done on the user's side, instead of by the container providing the iterators. I won't go into details about why I'm trying to use such a technique; it's mostly about style. But it's not really slower and, actually, it saves me a few loops, so I think that despite the overhead, it may be faster for larger collections.

No, actually, the larger the collection is, the larger the inefficiency.
I don't know what loops it can save you; those loops can probably be gotten rid of, or optimized.
As far as efficiency goes, I see no real option better than filtering the set once.
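In standard C++ terms (a sketch, not tied to any particular Qt container; the function name is made up), filter-once-then-iterate is essentially a copy_if followed by normal iteration over the result:

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// One filtering pass up front; every later traversal of the result
// touches only matching elements, with no per-step filter checks.
std::vector<int> filterSet(const std::vector<int> &set, bool (*wanted)(int)) {
    std::vector<int> filtered;
    std::copy_if(set.begin(), set.end(), std::back_inserter(filtered), wanted);
    return filtered;
}
```

However many times you iterate afterwards, the predicate was evaluated exactly once per element.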

akiross
31st August 2011, 16:26
Thanks again for your considerations, which I was aware of even before posting.
As you said, I can get rid of these loops, and one method to do that is iterator filtering.

Anyway, I opened the thread to talk about Qt style when implementing such a thing as iterator filtering, not to debate efficiency.
But that's my bad: I didn't make it clear.

Efficiency is not an issue: I've already considered the efficiency problems in my current code and which of them could be solved with this solution.
~Aki

high_flyer
31st August 2011, 16:46
I know you don't want to debate this, but you can just read, or not - as you wish :)
You don't have to answer.

As you said, I can get rid of these loops, and one method to do that is iterator filtering.

This is not getting rid of the loops; it just moves them to another place.
The "matching" done by your "filtering iterator" IS the loop.
When you need to filter, you can't avoid going through your data set at least once.
But you can avoid going through it more than that.
Using the filtering-iterator approach, you will be looping through at least part of your set for every iterator operation.
And in an extreme case, where you have 1000 elements of type X, your iterator is looking for type Y, and your set has only one Y element at the end, every call to iterator->next(Y) will loop over the whole set.
That is not the case if you work on a filtered list.

akiross
31st August 2011, 17:22
It's not that I don't like debating, but it's useless in this case :)

You're pointing out a non-problem: you deduce your conclusion from *your* imaginary implementation, which is not *mine*.
To show you that iterator filtering has, in theory, no disadvantages, let's examine this example:
a collection of natural numbers, in random order, from 1 to n = 1'000'000. You have to extract the even items that are divisible by 3. You have 3 filtering functions (evenF, divF, andF) that do, respectively, the following: return true if the argument is even, return true if the argument is divisible by 3, and return true if both arguments are true. These are my tools: I have to use them in my solution.

A trivial but fast solution, without inferring anything about the data structure, is this:


QList<int> matching;
foreach (int n, collection) {
    if (andF(evenF(n), divF(n)))
        matching << n;
}


This takes O(n), which is pretty fast. Actually, I think it is the best possible result when the data is in random order.

The solution involving iterators is like this:


struct Iterator {
    Iterator(Collection c): _iter(c.iterator()) {}
    bool hasNext() {
        while (_iter.hasNext()) {
            if (andF(evenF(_iter.peek()), divF(_iter.peek())))
                return true; // The next element passes the filter
            _iter.next();    // Skip the filtered-out element and try the next one
        }
        return false;
    }
    int next() {
        return _iter.next(); // The next one is known to be valid
    }
private:
    CollectionIter _iter;
};

QList<int> matching;
Iterator i(collection);
while (i.hasNext()) matching << i.next();


(I hope you get the idea, the code is written on-the-fly and may be wrong)

Since hasNext() always advances the hidden _iter, the iterator does EXACTLY the same number of iterations over the sequence. The outer while is necessary because of the iterator's nature, but it doesn't make the performance worse on larger collections. This takes O(n), too.

Of course, it's OBVIOUS that the simpler implementation above requires fewer steps and fewer calls, and is thus more efficient, but both solutions have the same time complexity.

Clearly, the advantages of this solution are architectural. For example, in the first example, you're REQUIRED to execute the check in the foreach loop. If you have multiple filtering steps, you must use functors, or the advantage will vanish.

On the other hand, in the iterator example, you can use ANOTHER iterator type as CollectionIter; thus you can compose iterators, without using functors and without strict requirements on function application order. You can't compose while loops; that's why immediate filtering is necessary in the first example.
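To make the composition point concrete with the same kind of invented interface (everything here is made up for the example; in particular, the abstract IntIter base is just one way to let one filter wrap another):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// A minimal java-style iterator interface over int, so filters can be stacked.
struct IntIter {
    virtual bool hasNext() = 0;
    virtual int next() = 0;
    virtual ~IntIter() {}
};

// The "source" iterator over a plain vector.
struct VectorIter : IntIter {
    explicit VectorIter(const std::vector<int> &v) : _v(v), _i(0) {}
    bool hasNext() { return _i < _v.size(); }
    int next() { return _v[_i++]; }
    const std::vector<int> &_v;
    std::size_t _i;
};

// Wraps ANY IntIter, so a FilterIter can wrap another FilterIter:
// no combined functor is ever needed, the conditions just stack.
struct FilterIter : IntIter {
    FilterIter(IntIter &inner, std::function<bool(int)> pred)
        : _inner(inner), _pred(pred), _has(false), _val(0) {}
    bool hasNext() {
        while (!_has && _inner.hasNext()) {
            int v = _inner.next();
            if (_pred(v)) { _val = v; _has = true; } // cache the next match
        }
        return _has;
    }
    int next() { hasNext(); _has = false; return _val; }
    IntIter &_inner;
    std::function<bool(int)> _pred;
    bool _has;
    int _val;
};
```

Stacking an even-filter and a divisible-by-3 filter then needs no andF at all: the outer iterator simply pulls from the inner one.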

And if you say "yeah, but the first is calculated once, while the second is not, and the iteration is repeated every time you need to access the data", I reply with "yes, but":
1. the first is calculated once ONLY if you don't have to change the filtering function often (as I do);
2. it's easy to cache the second example using a QList when you access the data in the second while loop. But in my program, since the data and the filtering functions are mutable, a cache is a minor advantage. The data is asked for roughly once when it's needed, and the next call is almost certain to access new data, hence requiring a cache refresh.

I hope this answers your points.

To clarify again: the point is not speed, but style. I hope this example is clear enough to show that speed is not compromised by the extra loops and that the style (architecture?) benefits a lot from such a pattern.

There are many real-life examples and languages/libraries that use this approach; I'm sure they have discussed this topic (not only style, but also efficiency) enough.

high_flyer
31st August 2011, 17:49
you deduce your conclusion from *your* imaginary implementation, which is not *mine*.
True, I can only work with what I have, and that is the information I have (or rather don't have).
So I just took what I know from experience to be the "usual" case.


This takes O(n). Which is pretty fast. Actually, I think, is the best result when it comes to random order.
Which was my point all along.


Since hasNext always advance the hidden _iter, the iterator does EXACTLY the same number of iteration over the sequence.
I'll go even one step further toward you and say: probably fewer, as it only searches for the next item, not all of them.
As I said in my previous post:

Using the filtering-iterator approach, you will be looping through at least part of your set for every iterator operation.
Which is the core of my argument.
If you use your iterator in a "sterile" example like the one you posted, the difference is only semantic, as you point out.
But I take it that if you take the trouble to create an iterator, you will want to use it in real use cases.
This usually involves creating local iterators, and each call to next() on them is going to do a comparison, or many, depending on how far the desired item is from the current position of the iterator.

But when you use a filtered list, you perform your filtering only once and use the list in all the places you need it, instead of creating a new iterator (which is forced to filter again when you use it).


Of course, it's OBVIOUS that the simpler implementation above requires fewer steps and fewer calls, and is thus more efficient, but both solutions have the same time complexity.
I know what you meant by that, so I won't go to great lengths to split hairs.
But more function calls do lower the efficiency, as you stated, and on large sets that can be a major performance hit.

Hope I was clearer this time.

I guess there might be some use cases where filtering iterators make sense, maybe on rather small sets which are highly diverse (that is, with a large number of types); in such cases, instead of filtering the set many times (once per type), you can just use a filtering iterator.
But for large sets which are not very diverse, I think they are not practical, except for ease of use (from a developer's perspective).


1. the first is calculated once ONLY if you don't have to change the filtering function often (as I do);
2. it's easy to cache the second example using a QList when you access the data in the second while loop. But in my program, since the data and the filtering functions are mutable, a cache is a minor advantage. The data is asked for roughly once when it's needed, and the next call is almost certain to access new data, hence requiring a cache refresh.

I hope this answers your points.
Yes, it does, and I agree that in your case this will probably make sense.

akiross
31st August 2011, 18:08
If you use your iterator in a "sterile" example like the one you posted, the difference is only semantic, as you point out.
But I take it that if you take the trouble to create an iterator, you will want to use it in real use cases.
You may not believe it, but this is exactly how I intend to use the iterators :) That is my *real* use case.


But more function calls do lower the efficiency, as you stated, and on large sets that can be a major performance hit.
Very true, but this depends on the implementation: a raw-pointer implementation of iterators may yield a very efficient result, and the iterator pattern (or paradigm, or idiom, or whatever we want to call it) lets you separate concerns in the architecture while still allowing a very fast implementation. As you may have noticed, my choice of java-style iterators means that I'm not concerned about performance; the Qt documentation clearly states "they are a bit slower".


I guess there might be some use cases where filtering iterators make sense, maybe on rather small sets which are highly diverse (that is, with a large number of types); in such cases, instead of filtering the set many times (once per type), you can just use a filtering iterator.
But for large sets which are not very diverse, I think they are not practical, except for ease of use (from a developer's perspective).

You're absolutely right, but this is not my case :) Large collections may happen, but since the filtering tests are not small and quick (they involve dynamic features, RTTI, string comparisons, etc.), function calling is a minor issue.

wysota
31st August 2011, 18:44
About speed: of course a cache could be made, but it would be done on the user's side, instead of by the container providing the iterators.
If you ask me, I'd prefer to have the container cache things than forcing the user to do it.


@wysota: I think I'm missing something, because your container looks like what I want to do, but I don't see how you're filtering the content once: you return the iterator, and the intermediate result is not saved anywhere (it seems; I don't know how the DOM works in Qt). From your code, it seems that every time you call begin_const(), an iterator is built from the first element, and the filtering is done every time. In fact, you're retrieving the next element using

m_e = m_e.nextSiblingElement(m_t);
which is the actual filtering.
That's correct. My point was to have a container instead of an iterator. The container can do the filtering while it is created; that's my point. However, in my exact code no such thing is done; the filtering happens on the fly.


Caching the result (to do the filtering once) would require saving the items you find in the iterator and using the saved copy until the original data changes.
No, you cache filtering results, not the data. Say you have 100 elements in a container. When you iterate it for the first time, you can fill a bitmap with zeroes and ones marking whether a particular item meets the conditions or not. Then when you iterate again or when you go back, you can use the bitmap to avoid checking the data again.
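A rough sketch of that bitmap idea (std::vector<bool> plays the role of the bitmap here; the int container and the predicate are placeholders, not tied to any real model):

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Caches the *result* of the filter per element, not the data itself.
// The predicate is evaluated at most once per element, no matter how many
// times you iterate forward or backward over the container.
class CachedFilter {
public:
    CachedFilter(const std::vector<int> &data, std::function<bool(int)> pred)
        : _data(data), _pred(pred),
          _checked(data.size(), false), _matches(data.size(), false) {}

    // Returns whether element i passes the filter, checking the data at most once.
    bool accepts(std::size_t i) {
        if (!_checked[i]) {
            _matches[i] = _pred(_data[i]); // evaluated only on the first visit
            _checked[i] = true;
        }
        return _matches[i];
    }

private:
    const std::vector<int> &_data;
    std::function<bool(int)> _pred;
    std::vector<bool> _checked, _matches; // the "bitmaps"
};
```

The second and every later pass over the data reads only the bitmaps and never re-evaluates the condition.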

akiross
31st August 2011, 23:22
No, you cache filtering results, not the data. Say you have 100 elements in a container. When you iterate it for the first time, you can fill a bitmap with zeroes and ones marking whether a particular item meets the conditions or not. Then when you iterate again or when you go back, you can use the bitmap to avoid checking the data again.

Yes, but the filtering may depend on the data, and the data may change. What if you filter your XML tags using the value of an attribute (say class="foo") and change that value later on?

Anyway, you're right about filtering containers: I didn't think of that, but it's kind of obvious now that you've pointed it out. But I think that in the case of "infinite iterators", for example when data is generated by external devices, the network, or a pipelined algorithm, containers would be pointless. That's not my case; just pointing it out :)

Anyway, I've branched my source and I'm trying filtered containers now. I think they're more practical in some parts of my code, because I'm working with non-linear set ordering and iterators aren't clean to use... So, knowing about copy-on-write in Qt containers, it's worth a try.

Thanks for all your advice, guys.
~Aki

wysota
31st August 2011, 23:29
Yes, but the filtering may depend on the data, and the data may change.
In that case your iterator is pretty much useless anyway.


What if you filter your XML tags using the value of an attribute (say "class=foo") and you change that value later on?
Then you are violating the basic rule of not cutting off the branch you're currently sitting on. Regular iterators are not mutable by default; special considerations need to be taken to make an iterator mutable.


Anyway, you're right about filtering containers: I didn't think of that, but it's kind of obvious now that you've pointed it out. But I think that in the case of "infinite iterators", for example when data is generated by external devices, the network, or a pipelined algorithm, containers would be pointless. That's not my case; just pointing it out :)
Iterators would be just as pointless. The point of an iterator is that you can iterate over all elements. If the data is infinite, it is hard to iterate through it. Then you have a generator, not an iterator.

akiross
1st September 2011, 00:16
I'll be quick :)

1. Why?

2. I wasn't talking about mutable iterators. There is probably a misunderstanding here, because caching works the way I say: when you cache something, you must ensure its validity when you use it later. If the data has changed (and it may change even after you've finished iterating it), you must refresh. That's my point.

3. Generators generate. Iterators iterate. I think they are quite different things. You generate data with generators and process it; you CAN use iterators to process such data. If you want to call "iterators" the things that are associated with a finite container, fine, but then you have to find another word for the things that work on infinite data sequences, because "generator" is wrong in that case. By the way, the STL calls these things iterators, or stream iterators. For example, istream_iterator is not bounded by any number, and it's a perfectly legal iterator.

You may be interested in this: http://stackoverflow.com/questions/2622591/is-an-infinite-iterator-bad-design

~Aki

wysota
1st September 2011, 01:15
I'll be quick :)
Likewise.


1. Why?
Deleting the pointed-to item invalidates the iterator. So does insertion, in certain cases.


2. I wasn't talking about mutable iterators. Probably there is a misunderstanding here, because caching works the way I say: when you cache something, you must ensure it's validity when you're using it later. If data has changed (and it may change also when you finished to iterate it), you must refresh. That's my point.
My point is that the iterator/container/whatever can only work on const objects. And you can't modify those.


but then you have to find another word for the things that work on infinite data sequences
Yeah, I call them generators :) If something generates something, then it is a generator. If you want to iterate over a generator, that's fine, but the iterator iterates, it doesn't generate, if you ask me. I know it's just a matter of taxonomy, but still.


istream_iterator is not bounded by any number, and it's a perfectly legal iterator.
If it has end() defined, then it is finite. Otherwise, I'd hate to be in the skin of the person trying to process every element of such a collection.

akiross
1st September 2011, 19:00
Deleting the pointed item invalidates the iterator. So does addition in certain cases.

My point is that the iterator/container/whatever can only work on const objects. And you can't modify those.


Modification doesn't only mean creation and deletion; there is also modification which doesn't invalidate iterators. Anyway, *this was not the point* and you went off on another topic. The point was: if you use iterators for filtering, and the data is highly mutable (though NOT while iterating), then caching is useless and iterators are OK to use.


Yeah, I call them generators :) If something generates something then it is a generator. If you want to iterate over a generator, that's fine but the iterator iterates, not generates, if you ask me. I know it's just a matter of taxonomy but still.

Probably you didn't read carefully what I wrote, OR I expressed myself badly. I said:


I think that in the case of "infinite iterators", for example when data is generated by external devices, the network, or a pipelined algorithm, containers would be pointless.

I didn't say that iterators were used to generate data (and I thought that was obvious). A stream is not an iterator, but it is an infinite source of data. I was talking about using iterators OVER an infinite collection of data, and as you say: "if you want to iterate over a generator, that's fine". And that was the point of my statement in the first place.


If it has end() defined, then it is finite. Otherwise, I'd hate to be in the skin of the person trying to process every element of such a collection.

I don't agree. You're assuming that we are talking about STL-style iterators, which use begin() and end(). Java-style iterators, for example, check using hasNext() rather than testing whether they have reached a defined end. That method could always return true, giving an infinite iterator.
In addition, I think that the semantics of the "end" in STL iterators is NOT to define the length of a container, but to define a case in which it no longer makes sense to go on (which is very different from saying that the iterator will reach an end).
It's easy to build a circular list with >0 elements which will iterate forever, even though end() is defined.
~Aki

wysota
1st September 2011, 20:09
Modification doesn't only mean creation and deletion.
No, but if there exists a situation in which modifying a container is unsafe, then modifying a container is in general unsafe.


The point was: if you use iterators for filtering, and the data is highly mutable (though NOT while iterating), then caching is useless and iterators are OK to use.
I don't see your point. If you modify an element in a way that changes the result of the filtering, then from the "filtering container's" point of view you "add" or "remove" an element from the container. I don't see how you intend to cope with that in the general case. At some point you will wake up in "nowhereland" and you won't even notice it (e.g. you modify the current element in such a way that its filtering result changes, hence your iterator should immediately become invalid; however, you have no way of rechecking the filtering condition for the current element, and calling next() and prev() will not get you back to the original element either, since during prev() the condition will be rechecked and the element filtered out). Of course, in many cases it will work out just fine, but I'm not really fond of software that "usually works".

akiross
2nd September 2011, 00:57
I don't see your point. If you modify an element in a way that changes the result of the filtering, then from the "filtering container's" point of view you "add" or "remove" an element from the container.
Sure, I didn't say no :) The point was: if you have to change 90% of the container's data every time you access it, then caching is probably not worth the effort. The point was: containers with caching are surely useful, but there exists at least one case where caching is not, and iterators alone will do. Got it?

~Aki

wysota
2nd September 2011, 01:08
If it doesn't cost you anything to cache, then I don't see why not to cache. Let's end this discussion; it probably doesn't lead to any useful conclusions :)