Hands-on Machine Learning -- Clustering Algorithms - Bilingual Subtitles

Hi, everybody.
I'm Kalika Curry.
I am an experienced IT professional.
I've been working with data in some regards since about 2006.
I have a passion for artificial intelligence and machine learning and that's me in a nutshell.
Today we are discussing chapter nine in Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow.
By...
Is it Aurélien?
Is that how you pronounce the name?
Aurélien Géron?
Aurélien.
Aurélien, all right.
So let me share my screen here.
There we are.
Share.
Can everybody see the screen?
We can see it.
Yep.
Let me see if I can figure out how to get a slide show going.
See this.
Doesn't want to let me.
Do slide show windows.
It's just.
Oh, there we go.
It takes me a minute to find all the different buttons and things, but there we go.
All right.
So we are leading a book discussion on chapter nine, on unsupervised learning.
So what is unsupervised learning?
He pretty much describes it as modeling unlabeled data.
You know,
previously in other chapters,
we learned what supervised learning is,
where you go through, you have a labeled data set, and you run predictions off of this labeled data set.
Whereas unsupervised learning is very different.
You have unlabeled data and you are trying to find information about this unlabeled data set.
What he mentions about unsupervised learning is that it's got a high potential.
There's mention about like a manufacturing example where you can identify certain things about a series of photographs taken in a manufacturing lab to determine whether or not parts are defective.
Last week, we talked about dimensionality reduction, which is also a common form of unsupervised learning.
And then you get some types of unsupervised learning.
There's, you know, clustering, where similar objects are grouped together.
There's anomaly detection where you learn, you know, what's normal and to find out what's abnormal.
And then there's this sort of density estimation, which is,
I guess, estimating the probability density function of the random process that generated the data, and it's useful for anomaly detection.
Anybody have any other inputs that they would like to kind of jump in there about unsupervised learning?
I think one thing we can say at the beginning is that generally speaking,
all the different unsupervised learning, whatever techniques, they tend not to be your end goal.
They tend to be a tool that helps you towards some eventual end goal.
So it might be a tool that you use in feature engineering.
It might be a tool you use for whatever reasons,
compression, dealing with noise,
but getting reduced dimensions,
just getting clusters,
is, generally speaking, not what
people are interested in;
they just use that as one building block to ultimately, you know, classify things, ultimately build a predictive model or forecast or whatever.
Yeah, I think they also mentioned using it for something like
getting insight into customers.
So something of that nature where you just kind of throw everything around and you look at it and you say,
okay, this group looks a lot like all the other groups and then you kind of, you can kind of figure out what to do about it from there.
Right.
Yeah, so I mean, you know, super simple examples.
You have a store and you have frequent customers, repeat customers, and then you just have, you know, everybody else, right?
What do your frequent customers look like?
Is frequent for you three times a year, or is frequent for you, you know, 20 times a year?
and how much do they spend?
And then maybe you actually find that there's multiple clusters amongst your frequent customers, you know, different kinds, right?
You have big spenders, you have, you know, people who come in and do a small amount of whatever, right?
There's different ways, yeah, that you can just sort of use clustering to help with exploring and understanding.
All right.
Okay, so let's see if I can.
Oh, I want to jump in and just add one more thing.
So there's sort of a continuum; it's not just unsupervised and then supervised, there's kind of a range between them where things can be unsupervised, semi-supervised,
and supervised, and anywhere in between.
And a lot of the times what unsupervised really means is just that you don't have the actual target that you're interested in,
but you might do something where you still kind of do unsupervised learning on the data itself.
So like there's stuff like BERT right now where it's initially trained with no labels and so it's unsupervised in a sense,
but it's just learning from the data itself and then eventually that builds into a very strong supervised system.
But the unsupervised part is kind of the important.
So the unsupervised can build into that.
What is BERT again?
So BERT is right now kind of the state-of-the-art text model.
And the unsupervised part or self-supervised or semi-supervised
it is basically it does this task where it looks at a whole bunch of text and then it
just on its own will remove words and try to fill in the blanks.
And so it's learning from itself.
We don't have any data necessarily.
And in learning that task of figuring out what words typically go where.
By filling in those blanks, it builds a strong language model that can then be transferred to learning how to classify other things.
But that's the gist of BERT: it starts in that sort of unsupervised, self-supervised stage.
Yeah, I would also add that, you know, this whole unsupervised learning thing has got a lot of great usefulness, because the vast majority of the data out there is unlabeled.
And it's not going to be labeled in the future.
Nobody's going to nicely go, oh yeah, you know, this is a bad customer and that's a good customer,
and, you know, nicely give you a binary division. You're just not going to get that.
Unsupervised learning methods I think are probably more important to some degree than supervised learning methods, because the vast amount of data just out there is unlabeled,
and we don't really know what the, you know, if there's classification what the classes are going to be.
We just sort of bump into them and manipulate them as we go along.
So I think this is probably you know, a very, very important section of this book.
All right, are we ready to move on?
Great.
All right, so we're focused on clustering today.
Because as we mentioned earlier, this chapter is rather long.
We have two parts in chapter nine, we have a clustering division and we have this Gaussian mixture division.
And with this being such an important topic, we thought it was best to split it up and kind of let us have this discussion on
clustering.
So, what is clustering?
He mentions that there's no real definition of what a cluster really is.
If you want to talk about a cluster, you have to consider a context.
We know that, you know, in a clustering
model, you're assigning instances to groups, and it's similar to supervised classification models, but without the labels.
And then, you know, there are different ways to cluster.
You can use a random point as a centroid and work outward in a circle.
There's a bottom up approach.
There's a top down approach.
Hierarchical.
I never pronounce that word right.
And of course there's so many different ways.
Let's see here.
All right, so we'll see some of these a little later in the course.
Does anybody have anything they want to touch on before we move forward about the why of clustering?
No, I think I'll wait more towards the end and then we can kind of like add color and extend and fill in different things.
All right, so the why of clustering is, as we mentioned a little earlier, we have this, you know, dimensionality reduction and preprocessing.
And the really cool thing about this was that I hadn't seen this before is that you can actually include this clustering model in the pipeline.
And you can use like a GridSearchCV to find, like, the best value.
And here I mentioned this value for K and we'll get into that a little later as to what K is.
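The pipeline idea described here could be sketched roughly like this. This is a hedged illustration, not the book's exact code; the digits dataset, the choice of LogisticRegression as the downstream model, and the candidate k values are all assumptions:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = load_digits(return_X_y=True)

pipeline = Pipeline([
    # In a pipeline, KMeans.transform() replaces each instance
    # with its distances to the k centroids
    ("kmeans", KMeans(n_init=10, random_state=42)),
    ("log_reg", LogisticRegression(max_iter=1000)),
])

# The "best" k is whatever most helps the downstream classifier,
# not the k with the lowest inertia
param_grid = {"kmeans__n_clusters": [10, 30, 50]}
grid = GridSearchCV(pipeline, param_grid, cv=3)
grid.fit(X, y)
print(grid.best_params_)
```

The point is that because KMeans exposes `transform()`, it can sit inside a Pipeline like any other preprocessing step, so GridSearchCV can tune `n_clusters` the same way it tunes any hyperparameter.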
You can also use it for semi-supervised learning,
as we mentioned earlier, to kind of get this, you know, you can train a model
from labeled clusters.
So you cluster everything, you label the clusters, and then you train your model.
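A minimal sketch of that cluster-then-label idea, under the assumption that we can afford to hand-label only the one instance nearest each centroid. The dataset and model choices here are illustrative, not the book's exact code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)   # pretend y is mostly unavailable

k = 50
kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
X_dist = kmeans.fit_transform(X)                # distance of every instance to every centroid
representative_idx = np.argmin(X_dist, axis=0)  # the instance closest to each centroid

# These k instances are the only ones we'd label by hand
X_repr, y_repr = X[representative_idx], y[representative_idx]

log_reg = LogisticRegression(max_iter=1000).fit(X_repr, y_repr)
acc = log_reg.score(X, y)
print(acc)
```

Labeling the k cluster representatives tends to work much better than labeling k random instances, because each representative is typical of its whole cluster.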
And then there's this interesting idea that he brings forward about segmenting of an image.
And he specifically references color segmentation as an example.
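A rough sketch of what color segmentation with K-means might look like; a random array stands in for a real image here, and the cluster count of 8 is an arbitrary assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# A random array stands in for a real H x W x 3 image
rng = np.random.default_rng(42)
image = rng.random((40, 40, 3))

pixels = image.reshape(-1, 3)  # one row per pixel, columns = R, G, B
kmeans = KMeans(n_clusters=8, n_init=10, random_state=42).fit(pixels)

# Replace every pixel with the color of its cluster's centroid
segmented = kmeans.cluster_centers_[kmeans.labels_].reshape(image.shape)
print(segmented.shape)  # same shape as the input, but at most 8 distinct colors
```

Each pixel is treated as a point in 3-D color space, so clustering the pixels and snapping each one to its centroid collapses the image down to a handful of colors.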
And then we have like additional clustering information about like customer segmentation, as we mentioned earlier.
And there's you know data analysis that can be used for anomaly detection once again,
and then search engines, which I think might be, no, is that at all related to what you said earlier?
Is it?
With BERT?
In a sense, yeah.
Yeah.
So those are some of the reasons why we might cluster.
And then here's part that I really wanted to bring up,
which is like,
and this is kind of towards the end of the chapter, but I wanted to bring it forward because of what we're doing here.
If you look at this, this page here, there are
a lot of different types of clustering models.
Yeah, K-means, DBSCAN, agglomerative clustering, BIRCH, mean shift, affinity propagation, spectral clustering.
There's so many, and I have a link here on the slides that you can run down to this scikit-learn page and test.
The gist is, for this chapter, we seem to have focused on K-means and DBSCAN.
And then that's just something I wanted to bring,
like, to the front, because, you know, when we have this many to go through, sometimes you just have to pick a couple, and I think
those are the two he went with, K-means and DBSCAN.
Now we get the history of K-means.
He says, and I think I found, that it was best described with this, and once again we
have a word I can't pronounce: Voronoi.
I think it's Voronoi.
Voronoi.
Voronoi.
I don't know.
Voronoi.
And so this Voronoi diagram here
is used to sort of represent what a K-means clustering looks like.
And we learned that, you know, originally it was developed in 1957 by a guy named Lloyd, but it wasn't published widely at the time.
So it comes back again in 1965, introduced all on its own by another guy named Forgy.
So it's also known as the Lloyd-Forgy
algorithm.
And that's, I just thought that was really cool, that it's not only that old, but it's been introduced twice.
Around 2006, we got a faster way of initializing the centroids of the K-means algorithm,
and that was called K-means++.
We have a couple of other varieties, such as Mini-Batch K-means, which is faster but has
more inertia, and then it requires the number-of-clusters parameter, which is represented as a K value.
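A quick sketch comparing the two variants just mentioned; the blob dataset and the numbers involved are illustrative assumptions:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=5000, centers=5, random_state=42)

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X)
mbk = MiniBatchKMeans(n_clusters=5, n_init=10, random_state=42).fit(X)

# Mini-batch trains on small random batches, so it is faster on big data,
# but its final inertia is usually slightly higher (a slightly worse fit)
print(km.inertia_, mbk.inertia_)
```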
All right, so that's a lot of information all at once about the history of K-means.
And so, if I could just, like, this is a total non sequitur, but just a complete aside.
I think one of the worst things, and science and statistics I find to be
rife with this, is when you just name things after people, because then there's no intuition around it.
Okay, so if you say, like, there's the whole Lloyd-Forgy algorithm.
That means nothing to me whereas at least K means I can create some sort of mental picture to try to remember
What is this algorithm?
And you know and so I find that actually the name K means really helps me remember what that algorithm is and so I
You know,
I would love it if somebody named something after me,
you know,
that'd be awesome and everything, but like, scientists,
I think as,
you know,
as a way of doing service to future learners,
you should try and give it a more intuitive name and not just name things after people.
It comes from a good place. We've got to hold you to that when your algorithm is discovered, developed.
Yeah, we'll see if I change my tune.
We already have it.
In academia, you see, people don't get much money.
Their only currency is their fame.
And in this very closed society, it's the name of the theorem, of the lemma, something like this, of the method.
So they're very partial to it.
And I can sympathize with
that; you know, if you tell me, you know, Fubini's theorem, there's just no intuition around it, you know?
Yeah, I can see that.
And I can see how, like, the K means, I mean, what does K mean, actually?
right?
It's like, well, it's a number that I have to input that represents something.
I forgot what, but when it's Lloyd-Forgy, you're just like, okay, who are these people?
But it's cool because at the same time you get to get to learn who they are,
and then learn more about, you know, what K-means has been through, which is quite a lot.
It's had quite a lot of, I guess, transformations and additions to it since its original introduction.
Yeah, I'm just complaining because this is the pain I go through where I'm trying to learn about
something and they'll say, oh, yes, there's this statistical process you can use.
But before you use the process,
you should,
and I'm just picking random names here,
you should run the leaventhall test, and you should also run the Rosenberg test to make sure that this is an appropriate thing to do.
And if it's appropriate, then you can use the, you know, Smith-hyphen-Johnson, you know, test. Like, well, what?
It's all
just gobbledygook for me.
Yeah, I know exactly what you're talking about.
I just now forgot about every analytics test that I ever have to use.
They're all named after people.
Right, so then it just comes this horrible rote memorization.
There's no intuition.
And you know,
I wouldn't say that everything in math is intuitive,
but you know,
you can try to say, what is a group, you know, but it breaks down as soon as you name things after people.
So, you know, it's, it's a Hermitian.
Well, great.
You named it after a guy.
That means nothing to me.
Yeah.
Yeah.
Sorry.
I think it helps if you name the model and then maybe name which version of the algorithm inspired by who the scientist is.
So like whenever I think of like K-means like one technique I follow it's two different scientists named Wong and Khosimara.
So I'll say oh I'm running a K-means clustering but if I specify like in my notes you know I'm following algorithms
from Wong or Khosimara, I'm following their techniques. But I think it helps if you name
the model first because like if you say Pearson,
I mean, Pearson came out with a lot of statistical mathematical techniques, which one are you talking about?
So I think adding more context,
if you're gonna use the particular technique of the scientist, it helps to name a technique first in the scientist.
Yeah, yeah, like, you know,
if I said to you a Pearson number versus a correlation number, I think one of those two has like a better intuition.
All right, sorry, I'm done with my rant now.
That was cool.
We should call this the key naming criteria from now on.
Really, Ted, I don't know what you're complaining about.
You got that whole talk series named after you, and I don't hear you complaining.
You have a talk series name after you?
Okay.
Now he's making a joke about Ted talks.
Ted talks.
Oh.
Yes, I get no royalties from Ted talks.
All right.
So there are some things you learn about K-means.
I think the primary instruction we're given is to remember to scale the data.
K-means does not do very well with data that's, you know, all over the place.
I don't think I know very many algorithms that do very well with data that's scaled all over the place.
But moving on, cluster amount.
That's our K value.
There are ways of finding this number of clusters that we need.
You've got this thing called an inertia elbow chart.
There's a silhouette line graph, there's a silhouette diagram.
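Those two diagnostics could be sketched like this; the blob dataset and the range of k values are assumptions for illustration:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)

inertias, silhouettes = {}, {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
# Inertia always drops as k grows (look for the elbow in a plot);
# the silhouette score instead tends to peak near a good k
```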
We've heard that this is a limited solution.
I wrote the comment here.
Well, we're all a little limited.
There's, there's things that it can do.
There's things that it can't do, and that applies
to probably all of the many clustering algorithms out there, right?
We know that it doesn't perform well with non-spherical shapes.
And I put in there, "whatever that means," because, like, what do you mean by a non-spherical shape?
I mean, you actually have to be able to visualize the data in order to identify whether or not it's a spherical shape, and that
requires, I believe, a bit of clustering, doesn't it?
Actually, okay, it doesn't necessarily mean spherical shapes.
They should be compactly distributed around the centroid.
So it could be, for example, it could be, by the way, a square.
But in general, we measure the distance around the centroid, and it should be less than a particular distance.
Well, as long as there are no other centroids in that particular direction.
So they're actually limited with this radius, and when you take a radius around a center, you get a sphere.
So this is what they mean by spherical shape here; it is not an actual sphere.
It is just because they have a radius, which they measure for each instance, actually.
Yeah, if you go back to the scikit-learn page that compares the different algorithms, you'll see the one that has sort of a column of three ellipses, but they're very elongated ellipses, the fourth row.
This one right here.
Well, this one.
Yes, so K means will not do so well with those because they're more oblong.
Now, yes, you're right.
We can't necessarily visualize it in higher dimensions,
but in two, we can say it's the difference between circular versus more stretched out and then however many dimensions.
It's still the same concept concept that whatever your distance.
like Maya said, you want them to sort of be compact and similar in all directions.
And that's why the scaling that you mentioned is also important.
And that would be this,
yeah, like you said, the fourth row right here. And the sort of ideal shape, the happy path for K-means, would be that fifth row.
Right.
And in order to acquire that, though, like, how would you be able to tell that this is the shape of the data I have?
I know, in what I've learned, it's easier to just do dimensionality reduction and reduce it down to
two features and plot it.
Is there any other way to manage that other than just trial and error, which is also what I'm used to doing?
I don't know if anybody here knows.
I mean, I don't think you can perfectly know a priori;
that's one of the themes that we've seen at different points in this discussion.
You can't know the best algorithm, whether it's clustering or whether it's whatever.
So you can't necessarily know a priori whether or not your data is very,
I don't have a better word for it, very spherical in the different dimensions,
meaning similar in all the dimensions.
And in particular, not just what it means, but where your data points lie, you know?
So like if you're doing customers and,
you know,
one of them is how frequently they come into the store,
well, if you have the difference between people who come in five times or, you know, 10 times, that's a pretty reasonable spread.
But you have some people who,
you know,
go to your website and they buy a thousand times, they're just like on a completely different scale from the people who buy five times.
And that's, you know, that's gonna give you challenges.
So I think the best we can say, like very broadly, is if you have some intuition around your data.
Do you have a sense that the data is somewhat uniformly spaced,
and in the book they have an example where they have two clusters that are very dense,
or three, and then a couple that are more, you know, larger radius, more spread out, and you actually will then see that,
yeah, K-means maybe didn't do so well on the ones that are more spread out.
By our eyes,
maybe most humans would agree that we should include those,
you know, in the more spread out ones, but K-means doesn't have that sense of intuition; it's just measuring distances.
So it may get some of those wrong.
It may also have similar weaknesses if you have dense ones and spread out ones and you're trying to do them both.
So the best we could do is sort of talk about these algorithms and say, do they do well if you have weird shapes?
Do they do well if you have different densities?
Do they do well if your cluster has some variation in it, right?
Like, it's all kind of dense, but maybe in some places it's double the density of others.
Some algorithms will just split that cluster in that region of slightly lower density.
So at least we can talk about those strengths and weaknesses broadly.
Okay, I would like to add something.
Please go back to this picture.
Oh, sorry.
So the lower row.
The lower row.
Yes.
The very left
example is a square.
The thing is that, if you apply K-means, and I don't see K-means here for some reason,
if you apply K-means to such a kind of shape and you require four clusters, this is what you can get.
I just experimented a lot with k means,
and this is, I think, the most probable distribution of clusters you can get with K-means on it.
And they're not exactly
round, because they push each other.
You see, they are too close to each other.
But they have their own centroids in the centers of the squares,
and if you measure the distance from each centroid to all the points of that color,
for example, well, pick your favorite color,
then the colored points in the colored square will be closest to their own centroid, closer than to the other centroids.
So this is how it looks.
Well, it is not a good choice of cluster numbers, but this is how it looks.
Yes, it does, because that's another thing about K-means; you know, we discussed it already, but this cluster amount, that K value, actually has a huge impact on how your clusters are
going to come out and whether or not you do get the sort of,
I guess, clusters that you're expecting or anticipating from the data.
So he describes, I guess, a metric, inertia, being used to identify the best solution.
And that's the mean squared distance between each instance and its closest centroid.
Um, there's this idea
he kind of mentions of getting lucky as a thing that you have to be aware of with K-means.
And that's when you're looking at your centroid initialization methods, where you're picking out where these centroids are going to be.
And they can, you know, in one type of algorithm, they're just picked at random anywhere.
And then eventually they're going to converge, and that's going to be the answer you get.
And then there's this K-means++ algorithm that was introduced in 2006 that took a completely different approach, and so you're not just getting lucky anymore.
That'd be like a kind of good way of like pointing that one out.
Okay.
And then you have your,
your n_init value, which comes with a default setting and determines just how many times the centroids should be initialized to find the optimal solution.
And then if you know your initialization points for all the clusters, you can input this init parameter manually.
And I think you set the n_init parameter to one.
My, there it goes.
Yeah.
And that'll get you what you need.
So.
There's all of that about how to use k-means.
He actually gives a pretty good example in the book about the n_init parameter and how and when to set it to one and under what circumstance.
And it really is, if you do know exactly where those initial centroids should be, then you can set them yourself.
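A sketch of that manual-seeding idea in scikit-learn terms; the data and the guessed centroid locations here are made up for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Suppose we already have rough guesses for the centroid locations
# (these particular coordinates are arbitrary assumptions)
good_init = np.array([[-2.5, 9.0], [-7.0, -7.0], [4.5, 2.0]])

# Pass them via init, and set n_init=1 so K-means runs just once
kmeans = KMeans(n_clusters=3, init=good_init, n_init=1).fit(X)
print(kmeans.cluster_centers_)
```

With a random init you want several runs (n_init greater than one) to avoid an unlucky start; with a deliberate init, one run is enough.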
All right, all right, this is a fun one because I looked at, he mentions there are two different kinds of clustering, right?
There's a hard clustering and a soft clustering.
And in hard clustering, one instance is assigned to a cluster, that's it.
Soft clustering gives each instance a score per cluster.
One example he gave was just the distance from the centroid.
And he says it can be a distance,
but it can also be a similarity score or affinity score,
such as the Gaussian radial basis function, and refers us back to chapter five.
And then it said that the transform method can be used to gather the distance between each instance to each centroid and he says okay that's in fact the Euclidean distance.
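In scikit-learn terms, the hard/soft distinction just described might look like this (synthetic data, illustrative only):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

hard = kmeans.predict(X[:2])     # hard clustering: one label per instance
soft = kmeans.transform(X[:2])   # soft-ish view: Euclidean distance to every centroid
print(hard.shape, soft.shape)    # (2,) and (2, 3)

# The hard label is just the index of the smallest distance
assert (hard == np.argmin(soft, axis=1)).all()
```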
So, and then, well,
I wanted to point out that the similarity or affinity function that he mentioned is not provided in the solutions or the notebook that he provided.
And the only information I could get about the affinity function was that the number of features will increase drastically with an extremely large data set.
So I'm wondering,
you know,
what does this mean for dimensionality reduction when we're working with this for clustering,
can we expect that,
you know,
using this
similarity or affinity function,
would that be overkill or over-extensive? And then over here in the left-hand corner,
all I have is just what little I can find about the affinity function that he provided from chapter five.
Has anybody worked with this before? Do you guys know anything about it?
So, I haven't worked directly with that, but I will say that in the past I've done stuff
where, like, I was just looking at the difference between hard clustering and soft clustering.
Like sometimes you can do a method that gives you directly like this falls in cluster 12345.
But I've also done stuff with.
I don't know how to say it, but it's FAISS.
It's Facebook's similarity search library.
And basically you can do a couple of different things with that,
where you can say basically like return me the distance to the five closest points and stuff like that.
And so that's kind of a form of soft clustering in a sense, if you're saying, like, I want all of the things that are similar to
this one, and then it returns just a distance to all of the other points, because it's not necessarily giving you a single label for all of those points,
but it's just telling you here's how far away it is and here's some other similar points.
Right.
Yeah, I mean, it's the difference basically between having just
a single hard variable, which is the class, the cluster ID, versus having more like a feature vector.
And so if your feature vector looks sort of like,
you know, one zero zero zero zero, then that basically says I'm really, really sure it's in cluster one.
But it at least allows you then to have things where you sort of say like hey point five point four point one zero zero,
you know, so then you can tell that you're you think it's more likely in one than two but you're still not that sure.
And depending on the algorithm, like, say, if you're feeding into a neural network, you know, it may actually
Appreciate knowing the difference between when you're sure and when you're not so sure.
Okay, that makes a lot more sense
than trying to find out
what a radial basis function is.
Yeah, so, so you would say that
there's a time and a place for both,
and it's not that one of them is necessarily better than the other, but one's probably going to be more, what's the word?
larger than the other, right?
More dense or complex.
Yeah.
So you've shared in other weeks a little bit about AutoML.
Do you know, does it try to do any clustering in its feature engineering, or is that really a little
too esoteric for it in terms of what it manipulates?
That is one of the options, but it's not frequently one that wins.
So we have a whole set of transformers that it will try.
And one of them is basically to do clustering and
return those clusters as a feature, and that basically never wins.
It is one of the things that's tried, but it's never really found to be useful.
And we have a few different varieties of that, and it just doesn't tend to help,
because the distance itself is encoded in all of the features that you're gonna pass in anyways.
Okay.
Anybody else have any questions about hard clustering and soft clustering?
So this one's going to be a quick slide.
I didn't get a whole lot out of the DBSCAN section.
It mentions that if you want to use DBSCAN, you're looking at continuous regions of high density.
So you're not looking at spheres anymore.
You're looking at areas where there's this
amount of dense population.
DBSCAN has an epsilon neighborhood, which is a count of the instances that are within a small distance
of a given instance.
You set a minimum number of samples.
If an instance has at least that number of samples in its neighborhood, it's a core instance.
And I think the rule is, if you're in the neighborhood of a core instance, you belong to the same cluster, and a cluster can chain
many core instances, so a long run of neighboring core instances all form one cluster.
And if an instance is not a core instance and is not in a neighborhood, or what is it?
If it's not a core instance and it's not in any core instance's neighborhood,
it's an outlier, which is represented with a negative one.
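A minimal DBSCAN sketch along those lines; the eps and min_samples values are illustrative guesses, not tuned:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the count an instance
# needs in its neighborhood to be a core instance
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

print(np.unique(dbscan.labels_))          # cluster ids, plus -1 for any outliers
print(len(dbscan.core_sample_indices_))   # how many points qualified as core
```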
And there's no predict method in DBSCAN;
they found that a classifier is better at predicting the cluster new data may belong to,
and that would be the KNN example that's over on the right.
And there's actually a link to his actual notebook there as to how he got this image
up here.
But yeah, that was, that was DBSCAN.
And that's actually the end of the presentation that I've got for you.
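The predict-via-classifier workaround mentioned above might be sketched like this; the n_neighbors value follows the book's example, but the rest of the setup is illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.2, min_samples=5).fit(X)

# Train a KNN classifier on just the core instances and their cluster labels
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

# Now brand-new points can be assigned to DBSCAN's clusters
X_new = np.array([[-0.5, 0.0], [1.0, -0.1]])
preds = knn.predict(X_new)
print(preds)
```

Training on only the core instances means the noisy -1 points don't pollute the classifier.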
So Yeah,
so one of the things that's interesting if you compare dbscan with k-means,
right, is you don't tell it a priori the number of clusters you're expecting, it'll do it just
based on the hyperparameter based on contiguous regions that it finds.
You can see obviously here the two moons data set,
it has no problem with it; K-means would utterly fail because these are very, very far from round, circular, spherical
clusters.
But the other thing that I think is particularly interesting about DBSCAN, as you had in the notes there,
is it does not require that every point in your training set, your unlabeled training set, be part of a cluster,
whereas K-means requires that.
So that's the very key difference.
So if you had an extreme outlier,
You know,
if you just have a bunch of points here,
a bunch of points here,
and I just add one point way over there,
that is always going to get calculated into the centroid of some cluster, and it's going to significantly pull the centroid.
And so if you have data that you have,
you're concerned about that you actually have some kind of outliers, whether it's bad data measurement or just whatever.
Then you might prefer an algorithm that doesn't require all the points to be clustered,
because then it'll basically reject ones that it says are just too far away and those won't go into
sort of the computation of where the clusters are.
I've always felt like DBSCAN is a little sensitive in those parameters, I think it's the minimum number of samples and epsilon, where it's like, if
you don't get those just right, you are
quite in danger of the bulk of your data being interpreted as outliers.
Oh, and naturally it's dependent upon what your data looks like, but that's just always been kind of my little experience with DBSCAN.
I've seen some great things done with it.
It just.
Yeah, no, that's a great point.
I'm actually just about to do a clustering project at work.
And I guess I'll have to really look at my metrics
to make sure, to your point, that DBSCAN has good parameters and that I'm not just getting a bunch of garbage out.
So, when you use a technique like DBSCAN, do you usually use some sort of, like, grid search algorithm then, or some sort of, you know, a series of, well, let's
try this and let's try that, to knock things out
or declare things an outlier?
Or do you...
In my experience, you, you just have to look at the data because it's not something that you have an answer to necessarily.
If you're doing clustering, you can't directly say, well,
since you don't have the labels, you can't say, well, you got this cluster entirely wrong, or there's all of these outliers and they should have been here.
So typically you,
because it's unlabeled you have to go through and you have to just kind of manually evaluate does this look like reasonable clusters or not.
Well you can use some like cluster performance metrics like the silhouette score, for example, I think.
Do they cover that in this class?
They mentioned it in the chapter, and the silhouette score is a metric, but it isn't necessarily telling you that it found what you're interested in.
It's just telling you how well separated your clusters are, like how good it actually was at grouping things up.
So the silhouette score is useful, but I don't think it necessarily guarantees that it clustered things in the way that you're interested in.
And I think silhouette kind of corresponds with k-means because it's looking at distance.
So if you have, like here, this two-moons data set, if you have a cluster that's very non-convex, then some of those within-cluster points are going to be quite far away, but that doesn't mean it's not a really good, accurate clustering.
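That caveat can be seen on exactly this data set. A rough sketch with scikit-learn's synthetic moons (the `eps` value is hand-picked for this toy data, not a recommendation):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, y = make_moons(n_samples=500, noise=0.05, random_state=42)

# A convex partition: k-means will slice the moons roughly in half.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
print("KMeans silhouette:", round(silhouette_score(X, km_labels), 3))

# A density-based clustering: DBSCAN can recover the non-convex moons
# (eps=0.15 is a hand-picked value for this synthetic data).
db_labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X)
mask = db_labels != -1  # silhouette is undefined for DBSCAN's noise label (-1)
if len(set(db_labels[mask])) >= 2:
    print("DBSCAN silhouette:", round(silhouette_score(X[mask], db_labels[mask]), 3))
```

The convex k-means split can score a respectable silhouette here even though it cuts each moon in half, which is the point being raised: the metric rewards compactness, not necessarily the clustering you actually wanted.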
Yeah, I mean, I could imagine some metric that's based more on local distances within the cluster that could account for that.
I'm just saying there's usually some metric that would give you a better idea of the quality of the clustering.
I think like the author was saying, if you use KNN and it sort of agrees with your clusters, that's a good starting point.
If KNN doesn't agree with your clusters, then you're in deep trouble, probably.
But I was thinking about your earlier point, Jerry.
So, you know, for epsilon, I mean, you could just plot what number of points are going to fall within epsilon of a given point, for different values of epsilon.
So if epsilon is small enough, you will get to where only the point itself falls within epsilon, for 100% of your data set.
So that's clearly an epsilon that's too small.
And then, I don't even know what metric you'd use, but for some value of epsilon, half the data set falls within my circle around every point, and that's clearly an epsilon that's too big.
Again, depending on how the data is scaled, you may not know a priori what that is, but you can do some EDA to say, okay, it looks like this is the range, because obviously you want epsilon to contain a certain number of samples, but not a gazillion and not none.
And this would probably be one of those things where you'd have to do a visualization prior to, or as you're evaluating, your data, right?
Yeah, I'm just saying I think you can evaluate
the impact of epsilon without actually
running the clustering algorithm just by straight up seeing how many points are near each other point for different values of epsilon.
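That eyeballing procedure is close to the usual k-distance heuristic: sort every point's distance to its k-th nearest neighbor and look for the knee. A sketch, assuming scikit-learn and synthetic data (the 95th-percentile stand-in for the knee is an arbitrary choice for illustration):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

# Distance from each point to its k-th nearest neighbor
# (k = DBSCAN's min_samples is a common heuristic; +1 skips the point itself).
k = 5
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth = np.sort(dists[:, -1])

# Plotting `kth` and looking for the knee is the usual picture; as a crude
# numeric stand-in, take a high percentile so only a few points look isolated.
eps_candidate = float(np.quantile(kth, 0.95))
print(f"suggested eps around {eps_candidate:.3f}")
```

This evaluates epsilon candidates without running the clustering algorithm at all, which is exactly the idea described above.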
You know, I think I've seen that approach somewhere, but I forgot where it was, whether it was an algorithm or just a function or something, but yeah, I think I know what you're talking about.
Now, I don't have a lot of experience with these, but I'm just wondering what happens in higher-dimensional spaces.
I mean, we're looking, you know, the examples are obviously 2D or whatever, but 3D.
What do you do within a real world data when you start pushing into the, you know, tens of dimensions?
Yeah, so once you get beyond 20 dimensions, I think it's pretty much impossible to really know.
People do this thing where they create the clusters, and they'll do t-SNE or something, and they'll try to say, does it seem like, when I do this dimensionality reduction, the clusters are sort of separated from each other?
I don't know how reliable that is, though.
Yeah, you could have really good clusters that don't actually seem separated on a plot.
But what I will say, to the comment I made earlier: it matters whether clustering is your end goal or just an in-between step.
So if what I'm doing is clustering in order to identify certain customer types, and then I'm going to predict future sales to these customers, let's say, then you create these clusters, those clusters create features for the regression model that's predicting sales, and basically that's the ultimate metric: did those clusters actually improve the accuracy of your regression model?
So you don't necessarily need to know a priori whether these clusters make sense.
When they're not the end goal, it's really whatever improves the downstream model.
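A toy sketch of that downstream check (all data is synthetic and invented for illustration; the "customer groups" and sales levels are assumptions, not from the book):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
# Toy "customer" data: three behavioral groups with different sales levels.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.8, size=(100, 2)) for c in centers])
y = np.concatenate([rng.normal(loc=m, scale=1.0, size=100) for m in (10, 50, 30)])
perm = rng.permutation(len(y))  # shuffle so CV folds mix the groups
X, y = X[perm], y[perm]

# Cluster assignments become one-hot features for the downstream regressor.
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
X_aug = np.hstack([X, np.eye(3)[labels]])

base = cross_val_score(LinearRegression(), X, y, cv=5).mean()
augmented = cross_val_score(LinearRegression(), X_aug, y, cv=5).mean()
print(f"R^2 without clusters: {base:.3f}, with clusters: {augmented:.3f}")
```

The comparison of the two cross-validated scores is the "ultimate metric" being described: the clusters are judged by whether they help the regression, not by any intrinsic clustering score.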
So anomaly detection is one of the few times where you are directly using the output of the clustering: I use DBSCAN, and if DBSCAN says it's an outlier, then I'm going to predict that it's an anomaly.
You can then take your test set and say, did it actually do a good job?
So it still did the classification, but then you can use your classification metrics to say how many false positives did I have on predicting my outliers, and how many false negatives did I have, and then look at that.
So again, typically the clusters are not the end goal in and of themselves.
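A sketch of scoring DBSCAN's noise label against known, injected outliers (synthetic data; `eps` and `min_samples` are hand-tuned for this toy setup):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import precision_score, recall_score

# Known inliers (tight blobs) plus injected uniform outliers.
X_in, _ = make_blobs(n_samples=300, centers=[[0, 0], [8, 8], [-8, 8]],
                     cluster_std=0.6, random_state=42)
rng = np.random.default_rng(0)
X_out = rng.uniform(low=-14, high=14, size=(20, 2))
X = np.vstack([X_in, X_out])
y_true = np.r_[np.zeros(300), np.ones(20)].astype(int)  # 1 = injected outlier

# Treat DBSCAN's noise label (-1) as the "anomaly" prediction.
y_pred = (DBSCAN(eps=0.8, min_samples=5).fit_predict(X) == -1).astype(int)

print("precision:", precision_score(y_true, y_pred))
print("recall:", recall_score(y_true, y_pred))
```

Once the noise label is recast as a binary prediction, the usual false-positive/false-negative accounting described above applies directly.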
I think at one point,
we were actually just looking,
at least,
I can't remember exactly if we were looking at centroids or what,
but we were actually looking at what these clusters,
or what these clusters came up with with regards to the features,
and we used a,
like, a, a, like a heat map table to see, you know, okay, I think we were looking at basketball scores
at the time and,
you know,
seeing, you know, this person has been performing extremely well on defense and this person is doing very well on offense, so a defensive player versus an offensive player, I think, was the end goal, but it was just an exercise.
So by looking at the intensity and the similarities of where these values are, you can get a good feel as to whether or not you have a good set of parameters for your particular clustering algorithm.
I think there are a couple things to mention around clustering.
I think the first thing.
There's a lot of domain knowledge and heuristics and prior theories that need to occur before you do any type of clustering.
So typically, I would say, a lot of the time in industry it's the marketing team.
Oh my gosh, if you support any type of marketing team, they're always trying to create some kind of cross-sell offers; the revenue team wants to know about customer retention; the service team may want to know about client segmentation.
And typically, when they want these models deployed, they're going to ask the professionals that have some domain knowledge, or to work collaboratively with domain-knowledge professionals.
So like, whenever I do customer segmentation models, I've spent like two, three years already doing all kinds of analysis on customers.
So when it's like, oh, we want a customer retention model, I've already seen some of the trends within prior projects.
So a lot of times you kind of have an idea of what features are going into the model, and it makes it easier to interpret the results because you've worked with the data before.
It's harder to do clustering if you're unfamiliar with the domain because then you're more likely to pull features that may not give you meaningful clusters.
So there are prior heuristics and domain knowledge that need to happen before you're given a clustering project.
Yeah, okay, so that's an emphasis on feature selection.
Right.
And perhaps feature engineering as well: because you have this domain knowledge, you're able to make sure that you are selecting the correct features, and not just throwing every piece of information that you have ever known within this sector all into one mix and trying to get some information out of it, right?
The other part of domain knowledge is just interpreting the results, because once you cluster and you get a bunch of clusters, that's probably important.
With more domain knowledge, you have a better idea of what those clusters could represent, or just what to look for in the feature space that would define each of the clusters that you get at the end.
Okay, I want to add something to this.
There was a question about high dimensions.
DBSCAN, as I've seen mentioned in one place, does not work very well in very high dimensions.
And one of the problems is the radius: when you raise it to the tenth power, for example, 0.3 becomes millionths.
The volume of your sphere drops quite quickly in high dimensions.
And actually, it is not only true for numbers below one, which naturally tend to go to zero when you take them to higher powers: the formula for the volume of a multidimensional sphere includes a factorial of the number of dimensions, which means that even a unit radius will be dropping in volume.
And so picking a correct epsilon becomes sort of complicated in this situation.
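That factorial effect can be checked directly from the standard unit-ball volume formula, V_d = pi^(d/2) / Gamma(d/2 + 1):

```python
import math

def unit_ball_volume(d):
    """Volume of the d-dimensional unit ball: V_d = pi^(d/2) / Gamma(d/2 + 1).
    The Gamma term grows factorially, so the volume collapses as d grows."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

for d in (2, 5, 10, 20, 50):
    print(d, unit_ball_volume(d))
```

The volume peaks around five dimensions and then shrinks toward zero, even at radius one, which is why a fixed epsilon ball captures essentially nothing in high dimensions.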
Yeah, to follow up on the comments that Robert and Christy made earlier so.
So for the project that I'm working on, I'm expressly planning to exclude any demographic features from the clustering algorithm.
And what I'm hoping to be able to do is, if I end up with what I hope are good clusters, to then see whether there are any correlations with the demographics.
So for the people who behave this way, do we see a pattern where they're younger or older?
I can give you an example that's a no-brainer.
We've predominantly done in-person medical appointments in the past, and now we offer telehealth, which we started doing when COVID hit, right?
Well, I don't think anyone's going to be surprised to learn that the people who prefer telehealth are, on average, much younger than the people who prefer in-person.
Right, but so being able to find insights like that.
That's one of the things that we're hoping to be able to do.
So, definitely, I could build clusters on that, but that's not what I'm looking to cluster on; that's what I'm looking to learn, without looking at that information.
So, Ted, to paraphrase what you're saying: you're essentially working backwards from the problem.
You already come in with a predefined "I'm trying to find this out" before you even approach the data.
Yeah, so basically, I'm only going to give the clustering algorithm the features that
I care about for clustering, so in our case it's behavior around appointments.
You could theoretically give any clustering algorithm, whether it's dbscan or k-means, a million features, right?
you could give it the person's height,
their hair color,
you know,
whether the last digit of their social security number is odd or even, and eventually it will cluster on all of those different things.
But because this is unsupervised: in a supervised learning algorithm, if you give a gradient boosting machine an extra column that's just complete noise, whether their social security number is odd or even, it'll basically ignore that feature.
Clustering algorithms don't know how to ignore features; they will use all of them.
And so it's important that you give it just the features that are the behaviors that you want to cluster on.
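A quick sketch of that effect (synthetic data; the noise columns stand in for irrelevant features like the odd-or-even digit example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Three clean behavioral clusters in two meaningful features.
X, y = make_blobs(n_samples=300, centers=[[0, 0], [8, 0], [0, 8]],
                  cluster_std=1.0, random_state=42)

# Eight pure-noise columns: junk features the algorithm cannot "ignore".
rng = np.random.default_rng(42)
X_noisy = np.hstack([X, rng.normal(scale=10.0, size=(300, 8))])

def km(data):
    return KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(data)

print("ARI, clean features:", adjusted_rand_score(y, km(X)))
print("ARI, with noise columns:", adjusted_rand_score(y, km(X_noisy)))
```

With only the meaningful features, k-means recovers the groups almost perfectly; once the noise columns dominate the distances, agreement with the true groups collapses, which is the argument for curating the feature set before clustering.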
And so I'm a little bit worried about like, well, which features do I give it?
Am I going to overemphasize certain aspects of it versus others?
Like, it is important to me whether or not people cancel their appointments.
And if I put a lot of cancellation features in it,
then then I'm going to skew that model towards caring about that behavior,
whereas I could theoretically say we care about these other things we don't care if they cancel their appointments.
And so then I shouldn't put any of those features in when I'm building my clusters.
Does that answer your question?
Yeah, that was a good example.
Thanks.
I had a pretty major clustering project I worked on a while ago, on one of the old systems at my old job.
We had these building blueprints, and on those blueprints there's a whole ton of different
markings and those markings have all kinds of different meanings and typically there's a legend
somewhere but even that isn't necessarily up to date so you might have something that's marked as
like L1 and that's lighting fixture one but then you have to make some mapping to like what actually is lighting fixture one.
And so,
in our workflow,
people had to draw little bounding boxes around every single L1 and then define what it was,
and that could take a long time if you're doing a whole hotel's worth of lighting fixtures,
because there's all of these pages and then you have to find every instance of that lighting fixture.
So, we had a system that could basically try to find the rest of them after they drew a bounding box around the first one, but even finding the first one of all of the different
assets there can be all kinds of different lighting fixtures there can all kinds of different electrical assets,
there can be faucets and and all kinds of other stuff.
So just sifting through 200 pages of documents and finding every single symbol for all of the different assets can take a while.
So one of the things that I tried to build was an initial like
just basically asset finder model that would just look across all of the pages and just try to find
every single asset regardless of what it was and then the hope was then you can have the
user label that and say oh this is a one this is a certain kind of lighting fixture.
And so the first layer was finding all of the assets,
and then the second layer was presenting that to the user and saying you figure out what all of these are.
It says "L1"; what does that actually mean?
Part of the problem is, if you have 200 instances of L1, you need to in some way derive that this is the L1 symbol, and then have them label only that one, because you can't show them 200 instances of L1 and then expect them to map all of those together.
So we tried to do clustering on these little tiny image snippets for all of the different symbols.
And I tried a whole bunch of different methods trying to figure out is there some way I can basically take the pool of all of the assets and then cluster it.
And then I can just present the center of the cluster or something like that, and then they can label that one and apply it to the whole cluster.
So if there's L1, L2, L3, and a bunch of other stuff, just present them a single L1 and say, what is this? A single L2, what is this?
And so I tried a whole bunch of different methods,
and like we were talking about earlier, the different hyper parameters can get drastically different results.
So I tried a whole bunch of different algorithms.
I tried a whole bunch of different hyper parameters,
and I wasn't able to find anything that worked generally across a whole bunch of different documents.
On each document, I could kind of tune it up and get a result that looked normal for one set of documents, but not another, and stuff like that.
So that was my experience with clustering on that.
And that's a semi-supervised learning method that you're talking about there, right?
It was, it was unsupervised.
It was all of this stuff that we were,
we were looking at today like that,
that graphic that you had at the beginning with like affinity propagation and all of those kinds of things.
That was something that I was looking at back then was how do I figure out which one of these applies and I basically just did a brute force search and just said,
run all of these, and then I'm going to look at the folders and just evaluate: is this correctly mapping out clusters that look reasonable, so like here are the L1s, L2s, A1s, whatever else.
And I had some success in certain scenarios, but if I tried applying it to a bunch of different documents and different conditions, I couldn't get it to consistently work, in the way that I wanted at least.
If I may ask, what kind of methods did you use, and what was the size of your data set?
So it's all of those methods shown there, basically, is I tried the full range.
I did a brute force search.
I just said, run all of these, and then I'm going to evaluate the folders.
So what I had it do is output the little snippets of L1 or L2, whatever, into their own subfolders by method: affinity propagation, mean shift, spectral clustering, et cetera.
And then I tried to look within those and see if they were reasonable clusters or not.
But as for the data set itself, in a single set I could have 1,000 or so assets.
And they were images that would range from 14 by 14 to maybe 28 by 28, so they were fairly small images, but that's still pretty high dimensionality when we're talking about clustering.
That was something that was mentioned earlier when you go really high dimensionality it kind of muddies things.
One of the things that consistently showed up in my clustering was if you have some sort of occlusion,
so like on building plans, you have a whole bunch of straight lines and dashed lines and stuff that might intersect your symbol.
And so I would end up with clusters that were reasonable clusters, where it was like, oh, these clusters all have the same straight line in one spot, because it's like a row where all of the L1s got intersected by a straight line that went through all of them in the same spot.
Those all got placed in one cluster, but they clustered based on the vertical line that was inside, not based on the symbol itself.
When I start off performing clustering, I use different techniques.
I'm not sure if it's mentioned in this book, but I remember reading a long time ago that the way you approach clustering is not really a single algorithm, but a set of algorithms which help define the business problem.
So it is multiple types of algorithms you would use to get to meaningful clusters.
So once I have my features: one of the downfalls of clustering is that the algorithm will do whatever you tell it to do, regardless of whether you have any meaningful clusters in your data.
So when I have all of my features that I'm gonna use, I run a Hopkins statistic test first.
So this is a technique that came out in 1990.
And what it does is it assesses the probability that the data is generated from a uniform distribution or a non-uniform distribution; it tests the spatial randomness of your data.
How it works is that it returns a value ranging from zero to one.
If you move beyond 0.5, closer to one, it means that you have significant clusters in your data set.
If it's close to zero, then it means that you pretty much don't have any meaningful clusters to even do a clustering analysis on.
I use that first because it helps me, before I even get started with the modeling, to see: okay, do I have meaningful clusters or not?
After I do that, one of the techniques that I learned a long time ago: if you're doing k-means, when you're selecting K, always start with odd numbers and not with even numbers.
So, assuming the Hopkins statistic determined that you have meaningful clusters, I don't use two, four, or six; I start building clusters with three.
And then after that, when it comes to evaluating the model, I use four different techniques for evaluation.
So the first one is just visualizing the clusters,
but what I do is I create an elliptical star visualization that shows me how my cluster spits out the points and how they're connected.
And then for evaluation, I use silhouette and elbow, which the book mentions, but I also use a newer evaluation technique that Stanford University created called the gap statistic.
And what I'm looking at visually is, okay, is three or five the optimal number of clusters, because I don't want to see clusters on top of other clusters.
You should be able to see some separation in how they cluster.
If you just see a bunch of clusters on top of each other, then you have too many.
And then when it comes to silhouette and elbow, I like to see what their ideal numbers are: does silhouette say three, does elbow say three?
They should be around each other; if one says three and the other says four or so, okay, then you make a decision on which number of clusters you want to use based on looking at your graphs.
But they shouldn't be far apart, like silhouette saying three and the elbow saying a number that's very far away.
So I use silhouette, elbow, and gap, and typically I kind of do a weighting: most of the time, silhouette and gap might be aligned on the number of clusters, and maybe elbow won't.
So look at your features first, and then research the gap statistic method for evaluation by Stanford and compare that with silhouette and elbow when you're running your models.
Would the Hopkins statistic work on something like this first-row example, where it's not what you would typically think of as clustered data?
Right, yeah, so it will help assess that, and it'll give you an actual value.
When I build it in R, the threshold for me is: if it's over 0.5, ranging from 0.5 to 1, then I know that the data set has meaningful clusters.
It won't tell you how many it has, but at least it lets you know that you do have meaningful clusters in your data set.
And then that's where the other evaluation techniques come in.
Like, you usually want to look at the clusters, and that's when you leverage silhouette or elbow and the gap statistic, if you're doing k-means.
Now the gap statistic, when Stanford created that evaluation technique, was made to work on hierarchical, agglomerative models as well.
So it doesn't just work on k-means; they actually created it to work on all different types of clustering algorithms for evaluating the number of clusters.
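A minimal sketch of comparing the elbow (inertia) curve with the silhouette score for picking k (scikit-learn, synthetic blobs; the gap statistic is omitted here):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated synthetic blobs, so the "right" answer is k=4.
X, _ = make_blobs(n_samples=400, centers=[[0, 0], [8, 8], [0, 8], [8, 0]],
                  cluster_std=0.7, random_state=42)

inertias, sils = {}, {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_               # elbow: look for the bend here
    sils[k] = silhouette_score(X, km.labels_)

best_k = max(sils, key=sils.get)
print("silhouette picks k =", best_k)
```

Inertia always decreases as k grows, so the elbow has to be judged from the shape of the curve, whereas the silhouette score peaks at a specific k; checking that the two roughly agree is the cross-check described above.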
Okay, thank you very much.
Can you please put that at the end of your slides? It would be fantastic to have, you know, what kinds of methods could be used for determining the number of clusters.
And I would like to note: I was working on estimating the best cluster count using the elbow method mathematically, and I wanted to submit it to scikit-learn.
They replied to me that they have much better methods for this, which unify several different methods toward this kind of goal.
And they showed me: yes, at the developer stage, they have fantastic methods which do it, but it is still at the developer stage.
The last commit was two years ago, and I asked them to add me to the developers, and I got no answer at all.
But I can show you something about the elbow method.
It depends on the scaling of your variables. Can I please share my screen?
So, here's the elbow method for some clustering.
And right now, the most plausible elbow looks like it could be 2. But if I stretch this axis and cut it, for example, somewhere here, and get the plot...
Okay, this is the initial plot.
Then I start to stretch it, and you see 2 is not good anymore, but it looks like 3, 4, or 5 could be reasonable, or maybe 9. Then I stretch it a little bit more, and 3 is not good, 4 drops out, 2 and 5 don't look very good; actually 9 is the best here.
And yes, 9 is the best number here, because here are my clusters.
When you say stretch it, what do you mean by that?
I mean that visually judging this elbow plot is actually not a very reliable method.
Did you,
by any chance try the,
I think it's called the Silhouette charting method,
where it shows you the different Silhouette scores in like a chart graph with the representatives is, I guess, No, I didn't try.
Yes, I understand what you mean.
No, I didn't try select method.
I to work with elbow methods.
Okay, and I wanted to make a company a methods which shows which computes what is the best clusters for elbow method.
So the complication sometimes is not the method so much as how you visualize that elbow chart.
Yes.
You can't obviously have a computer measure the angle; you have to just eyeball it, right?
So we're at the bottom of the hour.
And one thing I just wanted to mention is if you guys
are interested in this idea of clustering and you want to do a little bit more reading, you can certainly read more about clustering.
There's tons of literature about that.
But one of the trends I've seen right now is that, if you think about DBSCAN and sort of drawing a circle around each point, you're pairwise comparing the points to a large degree.
That's the extent of what you're doing.
And I'm seeing more and more literature now about people using graph representations of their data.
And so now, instead of just reasoning in terms of your immediate neighbors in a graph, let's also take into account the features that are two hops away, three hops away, and try to factor in the global structure.
And so there are a number of, they don't call it clustering, I don't know why not, it's usually called community detection algorithms for graphs; you can look, for example, at Louvain modularity and compare that to DBSCAN.
And so if you have other features and you can sort of represent your data,
not just tabular IID, but sort of in this graphical nature.
Then the idea is, generally speaking, that you're getting information from not just your one-hop neighborhood, but you can try to pull in more global, the word's escaping me, architecture is not the right word, but the global structure, into how you're defining your communities, your clusters.
So I thought I would just share that in terms of people want to do more reading.
Yeah, in my work, for clustering-type analyses I'm mostly using graph-based clustering methods, like Louvain; I think a newer one would be Leiden.
It's kind of just been a standard in the field of analyzing sequencing data, like single-cell sequencing data. I'm not exactly sure why, but I think it's probably accounting for a complex feature space or something like that.
But one thing that I've sort of been interested in is that with graph-based clustering methods, there are ways to cluster multimodal data sets, where you have different sets of feature spaces.
For example, with patient data, you might have electronic health record type data, but you also might have something like sequencing data, say from a blood sample that you sequence, and that's in a totally different feature space from the normal health record data.
So with graph-based clustering methods, you can essentially construct two separate graphs, one in each of the feature spaces, and cluster on those simultaneously.
Which, I don't know, is something I've been looking into and been interested in recently.
Yeah, thanks for sharing.
Yeah, I think that gives good color to why people think the things they do on these graphs can potentially go beyond what you would do with just a straight tabular approach.
Even if we don't go into all the core details, you can start to see there are these options, different things you could try.
And again, it's looking more for global structure in a graph.
You know, if you look at the half-moons data set, if you imagine that there was some kind of graph on it, you might be able to very quickly see there's very high connectivity within each of the moons, and there may be some connectivity, but a very low degree of connectivity, between points from one moon to the other moon.
And so a graph algorithm, a very simple one like label propagation or something like that, might just work with something like the half-moons data set, if you had good graph information that gave you that connectivity structure within each of the moons.
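A bare-bones sketch of that idea: label propagation on a k-nearest-neighbor graph of the half-moons data (a hand-rolled loop rather than a graph library; k and the sweep count are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, y = make_moons(n_samples=300, noise=0.05, random_state=42)

# Symmetric k-nearest-neighbor graph: the "connectivity structure" of the moons.
A = kneighbors_graph(X, n_neighbors=5, mode="connectivity")
A = A.maximum(A.T).tocsr()

# Plain label propagation: start every node in its own community, then
# repeatedly let each node adopt the most common label among its neighbors.
rng = np.random.default_rng(42)
labels = np.arange(300)
for _ in range(30):
    for i in rng.permutation(300):
        neigh = A.indices[A.indptr[i]:A.indptr[i + 1]]
        if neigh.size:
            labels[i] = np.bincount(labels[neigh]).argmax()

print("communities found:", len(np.unique(labels)))
```

Because cross-moon edges are rare in the kNN graph, labels spread within each moon but not between them, so the communities that survive respect the two moons, which is the intuition described above.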
All right.
Anybody else?
Any last questions or comments before we close out the book club for today?
All right.
Thanks, Kalika.
Thanks for being our discussion leader today.
And I will say thank you also for volunteering to not just lead one discussion, but actually to do two weeks in a row.
So very different vibe when we start talking about Gaussian mixture models.
And so that's gonna be our topic next week.
And then after that, just, previewing, you know, we are going to take a break with the holidays and stuff.
And after the New Year's total sort of shift of gears, we're going to get into neural networks and deep learning.
So we've been really mostly talking about Scikit-Learn.
And after the new year, it's going to really be about Keras, and it's going to be about building neural networks in TensorFlow.
So I hope to see you guys next week,
and then again, we'll kick things off sort of on a whole new direction when we start talking about neural networks.