
A MAGICIAN is a person who can associate Materials Genome, Materials Informatics, Chemo-Informatics and Networks.



MAGICIAN(MAterials Genome/Informatics and Chemo-Informatics Associate Network)Training Course


MIRAI (Multiple Index Regression for AI): an analysis tool for cases with few data points, many identifiers, and nonlinearity.

Example of data to be analyzed


JPA 2021165831 (Japanese patent by Nagase ChemteX)

[Subject] To provide a photoresist stripping solution with excellent storage stability while maintaining sufficient resist stripping performance.


The data are summarized as follows.

In particular, let's consider predicting resist stripping properties after 30 minutes of treatment at 70°C.

There are only 17 experimental data points in total.

However, there are 19 different components of the stripping solution.

And what makes this table unique is that most of it is blank.

The ingredients marked in blue have been used only once.

From the standpoint of analysis, this is inconvenient. A component that is used only once absorbs the fitting error of its experiment, so the fit looks very good, but the coefficient itself is meaningless.

If you want to become a researcher in Materials Integration (MI), try pasting the following data into Excel or some other program and doing the calculations yourself.

Since three of the experiments are set aside as prediction data, we will be analyzing 14 experimental data points with 19 different identifiers.

To solve a set of simultaneous equations uniquely, the number of equations must be at least as large as the number of unknowns. Here there are only 14 equations for 19 coefficients. For example, let's use the regression analysis function of Excel to force a calculation. Coefficients may be obtained, but we will soon see that the answer is meaningless.
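The point about equations and unknowns can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up random table with the same shape as the patent data (14 experiments, 19 identifiers); the numbers are not the patent data. Even when the "results" are pure noise, least squares reproduces them exactly, which is why a perfect fit here says nothing about the chemistry.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experiments, n_identifiers = 14, 19   # sizes quoted in the text
# Hypothetical random composition table and pure-noise "results".
X = rng.random((n_experiments, n_identifiers))
y = rng.random(n_experiments)

# Design matrix with an intercept column: 14 equations, 20 unknowns.
A = np.column_stack([np.ones(n_experiments), X])

# lstsq returns the minimum-norm solution when the system is underdetermined.
coef, _, rank, _ = np.linalg.lstsq(A, y, rcond=None)
max_residual = np.max(np.abs(A @ coef - y))

print("unknowns:", A.shape[1], "rank:", rank)
print("largest training residual:", max_residual)
```

The rank can never exceed the number of experiments, so infinitely many coefficient sets fit the training data equally well.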

MIRAI method

Using MIRAI (Multiple Index Regression for AI), which we developed, the following analysis result is obtained.

Resist Stripping = -1.172 + 0.7715 * (G-AM^0.6430 * G-B^-0.6459 * G-C1^0.4303 * G-C2^0.2851 * G-C3^0.1963 * G-D^0.0741)

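If we rearrange both sides and take the logarithm, the exponent (index) of each group comes out in front of its log, so the expression becomes a multiple regression equation in the logs of the groups:

log((Resist Stripping + 1.172) / 0.7715) = 0.6430*log(G-AM) - 0.6459*log(G-B) + 0.4303*log(G-C1) + 0.2851*log(G-C2) + 0.1963*log(G-C3) + 0.0741*log(G-D)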
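The log step can be illustrated with a small numerical check. This is only a sketch under stated assumptions: the group values are synthetic random numbers, and the intercept and scale are treated as already known, whereas the actual MIRAI fit (not described on this page) also has to determine them and the coefficients inside each group. With those assumptions, ordinary least squares on the logs recovers the exponents.

```python
import numpy as np

# Intercept, scale and exponents copied from the MIRAI equation in the text.
a, b = -1.172, 0.7715
exponents = np.array([0.6430, -0.6459, 0.4303, 0.2851, 0.1963, 0.0741])

rng = np.random.default_rng(1)
# Synthetic group values (14 experiments x 6 groups), all greater than 0.
G = 1.0 + 2.0 * rng.random((14, 6))

# Product of power functions, as in the MIRAI equation.
y = a + b * np.prod(G ** exponents, axis=1)

# After rearranging and taking logs the model is linear in log(G).
lhs = np.log((y - a) / b)
recovered, *_ = np.linalg.lstsq(np.log(G), lhs, rcond=None)

print(np.round(recovered, 4))
```

Six unknown exponents against 14 data points is a well-posed regression, which is the benefit of grouping the identifiers.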

Identifiers with similar properties are treated as one group. In this case the number of groups is 6, which is much smaller than the 14 experimental data points. The nonlinearity is expressed by making each group a power function. Finally, the groups are multiplied together, which introduces the interaction between groups.

Within a group, the members are combined by a linear function. Because the base of a power function must be greater than 0, 1 is added, so a group with none of its components present takes the value 1.

For the quaternary ammonium hydroxide group:
G-AM: 0.8947*Am1+0.5771*Am2+0.5006*Am3+0.4772*Am4+1

Comparing these coefficients, the coefficient of Am4 (0.4772), for example, is only about half that of Am1 (0.8947), which means that roughly twice the amount (about 1.9 times) is needed to get the same performance.

Water:
G-B: 0.6173*B +1

G-C1: 0.6030*C1-1 +0.1743*C1-2 +0.8063*C1-3+1

G-C2: 0.6934*C2-1+0.9484*C2-2+0.0615*C2-3+0.0172*C2-4+0.4211*C2-5+1

G-C3: 0.9072*C3-1+ 1.5071*C3-2+ 0.8163*C3-3+ 0.000247*C3-4+1

G-D: 0.2028*D1+0.7369*D2+1

Because the groups are multiplied together, no group can take a value independently of the others, and the interaction between groups is expressed.
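The equation and group definitions above can be put together in a few lines. The following is a minimal sketch, not the MIRAI software itself: the coefficients and exponents are copied from this page, while the function names and the dictionary layout are my own, and the amounts must be given in the same units as the patent table, which are not stated here.

```python
INTERCEPT, SCALE = -1.172, 0.7715

# group name -> (exponent, {component: coefficient}), copied from the text
GROUPS = {
    "G-AM": (0.6430, {"Am1": 0.8947, "Am2": 0.5771, "Am3": 0.5006, "Am4": 0.4772}),
    "G-B": (-0.6459, {"B": 0.6173}),
    "G-C1": (0.4303, {"C1-1": 0.6030, "C1-2": 0.1743, "C1-3": 0.8063}),
    "G-C2": (0.2851, {"C2-1": 0.6934, "C2-2": 0.9484, "C2-3": 0.0615,
                      "C2-4": 0.0172, "C2-5": 0.4211}),
    "G-C3": (0.1963, {"C3-1": 0.9072, "C3-2": 1.5071, "C3-3": 0.8163,
                      "C3-4": 0.000247}),
    "G-D": (0.0741, {"D1": 0.2028, "D2": 0.7369}),
}


def group_value(coeffs, amounts):
    """Linear combination of the group's members, plus 1 to keep the base positive."""
    return 1.0 + sum(c * amounts.get(name, 0.0) for name, c in coeffs.items())


def resist_stripping(amounts):
    """Multiply the power functions of all groups, then scale and shift."""
    product = 1.0
    for exponent, coeffs in GROUPS.values():
        product *= group_value(coeffs, amounts) ** exponent
    return INTERCEPT + SCALE * product


# Hypothetical inputs, only to show the call; not rows of the patent table.
baseline = resist_stripping({})
with_am1 = resist_stripping({"Am1": 1.0})
with_water = resist_stripping({"B": 1.0})
print(baseline, with_am1, with_water)
```

With no components at all, every group equals 1 and the value is simply -1.172 + 0.7715 = -0.4005. Because water (G-B) has a negative exponent, adding it lowers the predicted value, while the other groups raise it.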

As a result, even with a very small number of experimental data points, we can obtain a MIRAI equation with very high predictive performance, in which nonlinearity and interactions between items are included.

This can be seen as a kind of feed-forward neural network. Compared with an ordinary neural network, the connections between input neurons and intermediate neurons are sparse, because each identifier feeds only its own group.

Multiple Regression method

Let's analyze the same data using the multiple regression method.

All the data used to create the MR equation lie on a nice straight line.

This happens when the number of identifiers is larger than the number of experimental data (excluding the three points for prediction).

And the predictions are way off for all three held-out points.

In other words, the usual multiple regression analysis is completely meaningless.

Principal Component Analysis (PCA)

Many textbooks teach that, for systems where the number of explanatory variables is larger than the number of experimental data points, principal component analysis (PCA) should be performed to compress the dimensions.

For example, when there are two-dimensional data points as shown in the figure below, if the XY axis is rotated and set to X'Y', Y' becomes almost zero, so each point can be represented only by reading the X' axis. In other words, two-dimensional data can be compressed into one-dimensional data.
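The rotation described above can be reproduced with a few lines of NumPy. This is a minimal sketch on synthetic points scattered tightly around a straight line (the data and the slope of 0.5 are made up for illustration); the singular value decomposition of the centered data gives the rotated axes X' and Y'.

```python
import numpy as np

rng = np.random.default_rng(2)
t = rng.normal(size=200)
# Synthetic points scattered tightly around the line y = 0.5 x.
points = np.column_stack([t, 0.5 * t + 0.01 * rng.normal(size=200)])

centered = points - points.mean(axis=0)
# Rows of vt are the directions of the rotated axes X' and Y'.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)

# Coordinates of each point on the rotated axes.
rotated = centered @ vt.T

print("share of variance on X', Y':", explained)
print("largest |Y'| coordinate:", np.max(np.abs(rotated[:, 1])))
```

Almost all of the variance lies on X', so reading only the X' coordinate loses very little information.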

PCA can be run on the pirika page, so please try it.

However, the actual calculation is as follows.

How many principal components must be combined to represent the results of this experiment? We can see that even if we combine 10 of them, we can only express 92.14% of the results.

In other words, this result shows that there is almost no dimensional compression in this case.

That is, in a way, natural. Experiments 1, 3, 4, 7, 8, and 10 use components that are only used in that experiment. The dimensions of those six cannot be compressed in any way.
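Why components used in a single experiment resist compression can be seen in a toy case. The sketch below is an assumption-laden illustration, not the patent table: six hypothetical experiments, each containing one unit of a component that appears nowhere else. Every principal component then carries the same share of the variance, so none can be dropped.

```python
import numpy as np

# Toy table: 6 experiments, each with one component used nowhere else.
X = np.eye(6)

centered = X - X.mean(axis=0)
s = np.linalg.svd(centered, compute_uv=False)
explained = s**2 / np.sum(s**2)
cumulative = np.cumsum(explained)

print(np.round(explained, 3))
print(np.round(cumulative, 3))
```

Centering removes one dimension, and the remaining five components each explain one fifth of the variance, so four components still cover only 80%.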

I did a principal component regression using 10 principal components: I created a multiple regression equation from the 14 experimental data points using 10 variables, and predicted the 3 held-out points.

In this case, too, the predicted values deviated significantly from the experimental results.

Presumably, dimensional compression is not possible using PLS either. If anyone has tried this, please let me know.

In the absence of big data, the neural network method would be out of the question.

To analyze compounding formulations like this, we need to consider what kind of analysis software should be developed to make the analysis possible.

Will we wait for someone to write it in Python and publish it as a library?

Experimental chemists may be fine with that, but if you're an MI expert, you need to be able to respond quickly to these requests from your clients. It's not enough to just keep saying "give me big data! more big data!".

Going further

In this patent, Am2, Am3, and Am4 are used only in experiment Nos. 7, 8, and 9. They may not have been used elsewhere because of their lower performance compared to Am1. However, if the amounts of Am2, Am3, and Am4 are increased, evaluation scores of 6 and 7 might be reached.

Once the MIRAI formula has been constructed, it will be easy to leave the rest of the work to the AI to search for a better formulation. (In other words, you should check it yourself before submitting a patent.)

This is how advanced AI countries are targeting Japanese patents. Well, Japan did it to the US a long time ago, so we can't complain.

Here, let's consider a more advanced method of breaking patents.

Breaking patents

In this patent, the C1, C2, and C3 solvents are specified by ranges of the Hansen solubility parameter polar term (dP) and hydrogen bonding term (dH), as shown in the figure below.

This is a very broad range, and there are plenty of counterexamples (examples that fall into the range but do not perform) that can be found.

However, it would not be very interesting to destroy the patent by doing so.

For example, let's consider the solvents in group C2. The coefficients of the five solvents are as follows.

G-C2: 0.6934*C2-1+0.9484*C2-2+0.0615*C2-3+0.0172*C2-4+0.4211*C2-5+1

If this coefficient can be predicted, and new solvents with even larger coefficients can be explored, the research will be greatly accelerated.

If we calculate the multiple regression with HSP values as explanatory variables as usual, we will be able to predict the coefficients with the following equation.


After that, you can use HSPiP to search for compounds that are in the range of the patent, calculate them using the formula above, sort them, and select the ones with the highest coefficient.
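The search, calculate, sort, select loop described above could be sketched as follows. Everything specific here is a placeholder: the regression weights, the dP and dH ranges, and the candidate solvents "A", "B", "C" with their HSP values are made up for illustration, since the fitted equation and the patent ranges are not reproduced on this page. The candidate list stands in for whatever you export from HSPiP; no HSPiP interface is assumed.

```python
from dataclasses import dataclass


@dataclass
class Solvent:
    name: str
    dD: float
    dP: float
    dH: float


def predicted_coefficient(s, weights, intercept):
    """Linear model in the HSP values; the weights stand in for the fitted equation."""
    wD, wP, wH = weights
    return intercept + wD * s.dD + wP * s.dP + wH * s.dH


def rank_candidates(candidates, weights, intercept, dP_range, dH_range):
    """Keep solvents inside the claimed dP/dH window, best predicted coefficient first."""
    in_range = [
        s for s in candidates
        if dP_range[0] <= s.dP <= dP_range[1] and dH_range[0] <= s.dH <= dH_range[1]
    ]
    return sorted(
        in_range,
        key=lambda s: predicted_coefficient(s, weights, intercept),
        reverse=True,
    )


# Placeholder values only: not real solvents, not the patent ranges.
candidates = [
    Solvent("A", 16.0, 10.0, 8.0),
    Solvent("B", 17.0, 4.0, 20.0),   # falls outside both windows below
    Solvent("C", 18.0, 12.0, 6.0),
]
ranked = rank_candidates(
    candidates,
    weights=(0.0, 0.05, 0.02),
    intercept=0.0,
    dP_range=(5.0, 15.0),
    dH_range=(5.0, 15.0),
)
print([s.name for s in ranked])
```

Once real weights and ranges are substituted, the same loop can be run over thousands of candidates.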

Of course, the stripping solution must not dissolve too aggressively, and other conditions such as stability also need to be taken into account.

However, if such a formula can be constructed, the subsequent process is very AI friendly.

This is the origin of the name MIRAI.

Once you get used to it, you will be able to cycle very fast.


Copyright pirika.com since 1999-
Mail: yamahiroXpirika.com (Replace X with @.) The subject of your email should start with [pirika].